Python Crawler: Scrapy Item Pipelines for Dangdang Books

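Create the crawler project

scrapy startproject scrapy_dangdang

Change into the spiders folder and create the spider file (this example crawls the Youth Literature, Xianxia and Fantasy category)

scrapy genspider dang http://category.dangdang.com/cp01.01.07.00.00.00.html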

Extract the image, name, and price

# Every Selector object can call xpath() again
li_list = response.xpath('//div[@id="search_nature_rg"]//li')
for li in li_list:
    # Get the image URL. The first image's tag uses different attributes
    # from the others: its src is usable directly, while the other images
    # keep their real address in data-original.
    src = li.xpath('.//img/@data-original').extract_first()
    if not src:
        src = li.xpath('.//img/@src').extract_first()
    # Get the name
    name = li.xpath('.//img/@alt').extract_first()
    # Get the price
    price = li.xpath('.//p[@class="price"]/span[1]/text()').extract_first()
    print(src, name, price)
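
The fallback between data-original and src can be tried offline. The sketch below is only an illustration: it uses the standard library's xml.etree.ElementTree in place of Scrapy selectors, and the markup, the URLs, and the helper name image_url are made up for the example rather than taken from the Dangdang page.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: one image with a direct src and one lazy-loaded image
# whose real address sits in data-original.
SAMPLE = """
<ul>
  <li><img src="//img.example/first.jpg" alt="Book A"/></li>
  <li><img src="images/placeholder.png" data-original="//img.example/second.jpg" alt="Book B"/></li>
</ul>
"""


def image_url(img):
    """Prefer data-original and fall back to src when it is missing."""
    return img.get("data-original") or img.get("src")


root = ET.fromstring(SAMPLE)
books = [(image_url(img), img.get("alt")) for img in root.iter("img")]
for src, name in books:
    print(src, name)
```

The same "prefer one attribute, otherwise use the other" decision is what the if branch in the spider code performs with two XPath queries.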

Define the data to download in items.py

import scrapy


class ScrapyDangdang39Item(scrapy.Item):
    # Which data do we want to download?
    # Image
    src = scrapy.Field()
    # Name
    name = scrapy.Field()
    # Price
    price = scrapy.Field()

Import the item class in dang.py

from ..items import ScrapyDangdang39Item

In the parse method, build a book object from the extracted values and pass it on to the pipelines

book = ScrapyDangdang39Item(src=src, name=name, price=price)
# As soon as a book is built, hand it over to the pipelines
yield book
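
To see how yield hands books over one at a time, here is a rough stand-in that needs no Scrapy install. parse_rows plays the role of parse and CollectingPipeline plays the role of a pipeline; both names, the plain dicts used instead of ScrapyDangdang39Item, and the sample rows are assumptions for illustration. In a real project Scrapy's engine does the looping, and it does so asynchronously.

```python
def parse_rows(rows):
    """Yield one book at a time, as parse() does with `yield book`."""
    for src, name, price in rows:
        yield {"src": src, "name": name, "price": price}


class CollectingPipeline:
    """Hypothetical pipeline that records the names it receives."""

    def __init__(self):
        self.names = []

    def process_item(self, item, spider):
        self.names.append(item["name"])
        return item


rows = [
    ("//img.example/a.jpg", "Book A", "10.00"),
    ("//img.example/b.jpg", "Book B", "12.50"),
]
pipeline = CollectingPipeline()
for book in parse_rows(rows):
    # In this loop, each yielded book reaches the pipeline
    # before the next one is built.
    pipeline.process_item(book, spider=None)
print(pipeline.names)
```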

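Enable the pipeline
In settings.py, uncomment the ITEM_PIPELINES setting

A project can have many pipelines, and each one has a priority. By convention the priority values fall in the 0 to 1000 range; the smaller the value, the higher the priority, so the pipeline runs earlier.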
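
For reference, the uncommented setting usually looks like the dictionary below. The module path scrapy_dangdang.pipelines is an assumption based on the project name used above, so keep whatever path the generated settings.py already contains. The second entry, ScrapyDangdangDownloadPipeline, is a hypothetical extra pipeline added only to show ordering. The sort at the end is a sketch of the order in which items would pass through the pipelines, not Scrapy's own code.

```python
# settings.py fragment (paths are assumptions; keep the ones your project generated)
ITEM_PIPELINES = {
    "scrapy_dangdang.pipelines.ScrapyDangdang39Pipeline": 300,
    # Hypothetical second pipeline with a larger value, so it runs later
    "scrapy_dangdang.pipelines.ScrapyDangdangDownloadPipeline": 301,
}

# Items pass through pipelines from the lowest value to the highest.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
for path in run_order:
    print(ITEM_PIPELINES[path], path)
```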
Download the data
Open pipelines.py

class ScrapyDangdang39Pipeline:
    # Method 1
    def process_item(self, item, spider):
        # item is the book object that follows yield
        # 1. write() must be given a string, not any other object
        # 2. In w mode, the file is reopened for every item and the previous
        #    content is overwritten, so use a mode instead
        with open('book.json', 'a', encoding='utf-8') as fp:
            fp.write(str(item))
        return item

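This approach is not recommended, however: the file is opened once for every item that arrives, so it is touched far too often.
Here is another approach

class ScrapyDangdang39Pipeline:
    # Called once before the spider starts running
    def open_spider(self, spider):
        self.fp = open('book.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # item is the book object that follows yield
        self.fp.write(str(item))
        return item

    # Called once after the spider has finished running
    def close_spider(self, spider):
        self.fp.close()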
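
One thing to be aware of is that str(item) writes Python's text representation of the item, which is generally not valid JSON even though the file is called book.json. If machine-readable output is wanted, one possible variation is to write one JSON object per line with json.dumps. This is a sketch rather than the method used in this article: the class name JsonLinesPipeline and the file name book.jl are assumptions, and dict(item) is used on the assumption that the item behaves like a mapping, as scrapy.Item does.

```python
import json


class JsonLinesPipeline:
    """Hypothetical pipeline that writes each item as one JSON line."""

    def __init__(self, path="book.jl"):
        self.path = path

    def open_spider(self, spider):
        # Open the file once, before any item arrives
        self.fp = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # ensure_ascii=False keeps Chinese titles readable in the file
        self.fp.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        # Close the file once, after the last item
        self.fp.close()
```

Each line of the resulting file can then be read back with json.loads, which is not possible with the str(item) output.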

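Run the dang spider (for example with scrapy crawl dang from the project directory) and the data is saved locally in book.json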

Complete code
dang.py

import scrapy
from ..items import ScrapyDangdang39Item


class DangSpider(scrapy.Spider):
    name = "dang"
    allowed_domains = ["category.dangdang.com"]
    start_urls = ["http://category.dangdang.com/cp01.01.07.00.00.00.html"]

    def parse(self, response):
        # Every Selector object can call xpath() again
        li_list = response.xpath('//div[@id="search_nature_rg"]//li')
        for li in li_list:
            # Get the image URL. The first image's tag uses different
            # attributes from the others: its src is usable directly, while
            # the other images keep their real address in data-original.
            src = li.xpath('.//img/@data-original').extract_first()
            if not src:
                src = li.xpath('.//img/@src').extract_first()
            # Get the name
            name = li.xpath('.//img/@alt').extract_first()
            # Get the price
            price = li.xpath('.//p[@class="price"]/span[1]/text()').extract_first()
            book = ScrapyDangdang39Item(src=src, name=name, price=price)
            # As soon as a book is built, hand it over to the pipelines
            yield book

items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class ScrapyDangdang39Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()

    # Which data do we want to download?
    # Image
    src = scrapy.Field()
    # Name
    name = scrapy.Field()
    # Price
    price = scrapy.Field()

pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


# To use a pipeline, it must be enabled in settings
class ScrapyDangdang39Pipeline:
    # Called once before the spider starts running
    def open_spider(self, spider):
        self.fp = open('book.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # item is the book object that follows yield
        # This approach is not recommended:
        # with open('book.json', 'a', encoding='utf-8') as fp:
        #     fp.write(str(item))
        self.fp.write(str(item))
        return item

    # Called once after the spider has finished running
    def close_spider(self, spider):
        self.fp.close()
