1. Multiple pipelines
Enabling multiple pipelines takes 2 steps:
        (1) Define the pipeline class
        (2) Enable the pipeline in settings
In pipelines.py:
import urllib.request
# Enabling multiple pipelines:
# (1) define the pipeline class
# (2) enable it in settings:
#     "demo_nddw.pipelines.dangdangDownloadPipeline": 301
class dangdangDownloadPipeline:
    def process_item(self, item, spider):
        url = 'http:' + item.get('src')
        # the ./books/ directory must already exist
        filename = './books/' + item.get('name') + '.jpg'
        urllib.request.urlretrieve(url=url, filename=filename)
        return item
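For step (2), the ITEM_PIPELINES dict in settings.py gets one entry per pipeline, and lower numbers run first. A sketch, assuming the project is called demo_nddw as above; the DemoNddwPipeline name for the default data-saving pipeline is hypothetical:

# settings.py (sketch)
ITEM_PIPELINES = {
    # hypothetical default pipeline that writes the scraped data
    "demo_nddw.pipelines.DemoNddwPipeline": 300,
    # the image-download pipeline defined above; 301 > 300, so it runs second
    "demo_nddw.pipelines.dangdangDownloadPipeline": 301,
}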
2. Multi-page download
        The scraping logic is the same for every page, so just issue the request for the next page and call parse on it again.
# For multi-page downloads, allowed_domains must be widened; usually only the domain is listed
allowed_domains = ["category.dangdang.com"]
In ddw.py:
# Multi-page download
# The scraping logic for every page is the same, so just call parse again for the next page's request
# https://category.dangdang.com/cp01.27.01.06.00.00.html
# https://category.dangdang.com/pg2-cp01.27.01.06.00.00.html
# https://category.dangdang.com/pg3-cp01.27.01.06.00.00.html
if self.page < 100:
    self.page = self.page + 1
    url = self.basic_url + str(self.page) + '-cp01.27.01.06.00.00.html'
    # how to call parse again: scrapy.Request is Scrapy's GET request;
    # url is the address to request, callback is the function to run (no parentheses)
    yield scrapy.Request(url=url, callback=self.parse)
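The snippet above sits at the end of the parse method. A fuller sketch of how ddw.py might be laid out; the basic_url / page attributes and the item-extraction placeholder are reconstructed from the notes, not copied from the original file:

import scrapy

class DdwSpider(scrapy.Spider):
    name = "ddw"
    # only the domain, so pg2-, pg3-, ... pages stay inside allowed_domains
    allowed_domains = ["category.dangdang.com"]
    start_urls = ["https://category.dangdang.com/cp01.27.01.06.00.00.html"]

    basic_url = "https://category.dangdang.com/pg"
    page = 1

    def parse(self, response):
        # ... extract each book's fields here and yield the items ...

        # then request the next page and let parse handle it again
        if self.page < 100:
            self.page = self.page + 1
            url = self.basic_url + str(self.page) + "-cp01.27.01.06.00.00.html"
            yield scrapy.Request(url=url, callback=self.parse)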
3. 電影天堂 (Movie Heaven, dytt)
        Goal:
                the movie name from the first page
                the poster image from the second page
Two pages are involved, so use meta to pass data from one request to the next.
pipelines.py
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class DemoDyttPipeline:
    # called once when the spider opens
    def open_spider(self, spider):
        self.fp = open('dytt.json', 'w', encoding='utf-8')

    # called for every item in between
    def process_item(self, item, spider):
        self.fp.write(str(item))
        return item

    # called once when the spider closes
    def close_spider(self, spider):
        self.fp.close()
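Note that writing str(item) produces Python dict reprs, not valid JSON. If a well-formed dytt.json is wanted, one possible alternative pipeline (my own sketch using json.dump, not the approach from the notes) is:

import json
from itemadapter import ItemAdapter


class DemoDyttJsonPipeline:
    # collect the items while the spider runs
    def open_spider(self, spider):
        self.items = []

    def process_item(self, item, spider):
        self.items.append(ItemAdapter(item).asdict())
        return item

    # write them out as one JSON array at the end
    def close_spider(self, spider):
        with open('dytt.json', 'w', encoding='utf-8') as fp:
            json.dump(self.items, fp, ensure_ascii=False, indent=2)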
items.py
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class DemoDyttItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # movie name
    name = scrapy.Field()
    # poster image url
    src = scrapy.Field()
dytt.py
import scrapy
# import the item class from the project's items module
from demo_dytt.items import DemoDyttItem


class DyttSpider(scrapy.Spider):
    name = "dytt"
    # widen allowed_domains: keep only the domain
    allowed_domains = ["www.dydytt.net"]
    start_urls = ["https://www.dydytt.net/html/gndy/dyzz/20250306/65993.html"]

    def parse(self, response):
        # we want the name from the first page and the image from the second page
        a_list = response.xpath('//div[@class="co_content8"]//tr[2]//a[2]')
        for a in a_list:
            # get the name and the link to follow from the first page
            name = a.xpath('./text()').extract_first()
            href = a.xpath('./@href').extract_first()
            # address of the second page
            url = 'https://www.dydytt.net' + href
            # request the second page
            # 1) meta dict: pass the name along with the request
            yield scrapy.Request(url=url, callback=self.parse_second, meta={'name': name})

    def parse_second(self, response):
        # if no data comes back, check the xpath syntax
        src = response.xpath('//div[@id="Zoom"]/span/img/@src').extract_first()
        print(src)
        # 2) read the meta dict back
        name = response.meta['name']
        dytt = DemoDyttItem(src=src, name=name)
        # hand dytt to the pipeline; the pipeline must be enabled in settings
        # by uncommenting ITEM_PIPELINES:
        # ITEM_PIPELINES = {
        #     "demo_dytt.pipelines.DemoDyttPipeline": 300,
        # }
        yield dytt
Enable the pipeline:
        in settings.py, uncommenting the ITEM_PIPELINES block is what enables the pipeline
ITEM_PIPELINES = {
    "demo_dytt.pipelines.DemoDyttPipeline": 300,
}
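With the pipeline enabled, running the spider from the project directory is the standard scrapy crawl command; it should leave behind the dytt.json file written by DemoDyttPipeline:

scrapy crawl dytt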
4. CrawlSpider
        Inherits from scrapy.Spider
What does CrawlSpider do?
        1) define rules
        2) extract the links that match the rules
        3) parse the pages they point to
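In code, those three steps map onto the rules attribute of a CrawlSpider. A minimal sketch (parse_item is the callback name the crawl template generates; the allow pattern and URL are placeholders):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ExampleSpider(CrawlSpider):
    name = "example"
    start_urls = ["https://example.com/"]

    # 1) define the rule  2) links matching allow are extracted and followed
    rules = (
        Rule(LinkExtractor(allow=r"/page_\d+\.html"), callback="parse_item", follow=True),
    )

    # 3) parse every matched page
    def parse_item(self, response):
        pass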
Link extractor
        1) Import the link extractor
from scrapy.linkextractors import LinkExtractor
        2) Main parameters:
allow = ()        : regular expression
restrict_xpaths = ()        : XPath
restrict_css = ()        : not recommended
        Run scrapy shell <url>, then try the link extraction from 3) and 4) below.
Import the link extractor (inside the shell as well):
        from scrapy.linkextractors import LinkExtractor
        3) allow = () syntax
link = LinkExtractor(allow=r'/book/1188_\d+\.html')
        \d matches a digit
        + means one or more
Check the result:
link.extract_links(response)
        4) restrict_xpaths = () syntax
link1 = LinkExtractor(restrict_xpaths='//div[@class="pages"]/a')
Check the result:
link1.extract_links(response)
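Putting 3) and 4) together, a scrapy shell session might look roughly like this (the dushu.com URL is an assumption chosen to match the /book/1188_\d+ pattern above):

scrapy shell https://www.dushu.com/book/1188.html

>>> from scrapy.linkextractors import LinkExtractor
>>> link = LinkExtractor(allow=r'/book/1188_\d+\.html')
>>> link.extract_links(response)       # links whose href matches the regex
>>> link1 = LinkExtractor(restrict_xpaths='//div[@class="pages"]/a')
>>> link1.extract_links(response)      # links found inside the pagination div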
5. CrawlSpider example
1) Create the spider file:
        scrapy genspider -t crawl <spider name> <start url>
2) The start page's URL does not match the extraction rule, so it would never be crawled; adjust start_urls accordingly (full spider sketch below).
Original start_urls:
        start_urls = ["https://www.dushu.com/book/1157.html"]
After the change:
        start_urls = ["https://www.dushu.com/book/1157_1.html"]
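A sketch of how the generated spider for this example might end up after editing; the rule structure is what the crawl template produces, while the allow pattern follows the 1157 list pages and the extraction XPaths are placeholders to adapt to the real page structure:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class DushuSpider(CrawlSpider):
    name = "dushu"
    allowed_domains = ["www.dushu.com"]
    # _1 so that the start page itself also matches the rule
    start_urls = ["https://www.dushu.com/book/1157_1.html"]

    rules = (
        # follow every pagination link of the 1157 category
        Rule(LinkExtractor(allow=r"/book/1157_\d+\.html"), callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        # placeholder extraction: adjust the XPaths to the real page structure
        for img in response.xpath("//img"):
            yield {
                "name": img.xpath("./@alt").extract_first(),
                "src": img.xpath("./@src").extract_first(),
            }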