Create a CrawlSpider spider file:
scrapy genspider -t crawl <spider file name> <domain to crawl>
For example: scrapy genspider -t crawl read https://www.dushu.com/book/1206.html
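The generated file looks roughly like the sketch below (exact contents vary by Scrapy version, and you may need to adjust allowed_domains and start_urls by hand); the placeholder rule and the empty parse_item are what we fill in next.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ReadSpider(CrawlSpider):
    name = "read"
    allowed_domains = ["www.dushu.com"]
    start_urls = ["https://www.dushu.com/book/1206.html"]

    # placeholder rule generated by the template; it gets replaced below
    rules = (Rule(LinkExtractor(allow=r"Items/"), callback="parse_item", follow=True),)

    def parse_item(self, response):
        item = {}
        return item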
LinkExtractor is the link extractor: through it, the Spider knows which links to extract from the crawled pages, and each extracted link is automatically turned into a Request object.
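As a quick illustration, a LinkExtractor can also be built and inspected on its own (the allow pattern below is the one used later in this project):

from scrapy.linkextractors import LinkExtractor

# only pick up pagination links of the form /book/1206_<number>.html
link_extractor = LinkExtractor(allow=r"/book/1206_\d+\.html")

# In `scrapy shell https://www.dushu.com/book/1206_1.html` you can run
#   link_extractor.extract_links(response)
# to see the Link objects that the CrawlSpider would turn into Requests.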
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

# adjust the package name to your own project's items module
from scarpy_readbook_41.items import ScarpyReadbook41Item


class ReadSpider(CrawlSpider):
    name = "read"
    allowed_domains = ["www.dushu.com"]
    start_urls = ["https://www.dushu.com/book/1206_1.html"]

    # The LinkExtractor tells the Spider which links to extract from the crawled
    # pages; each extracted link automatically becomes a Request object.
    rules = (
        Rule(LinkExtractor(allow=r"/book/1206_\d+\.html"), callback="parse_item", follow=False),
    )

    def parse_item(self, response):
        name_list = response.xpath('//div[@class="book-info"]//img/@alt')
        src_list = response.xpath('//div[@class="book-info"]//img/@data-original')
        for i in range(len(name_list)):
            name = name_list[i].extract()
            src = src_list[i].extract()
            book = ScarpyReadbook41Item(name=name, src=src)
            yield book
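The spider yields ScarpyReadbook41Item objects, so items.py needs matching fields. A minimal sketch, with field names taken from the spider above (the class is assumed to live in the project's items.py):

import scrapy


class ScarpyReadbook41Item(scrapy.Item):
    # book title, taken from the img alt attribute
    name = scrapy.Field()
    # cover image URL, taken from the img data-original attribute
    src = scrapy.Field()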
Enable the pipeline and write the items to a file. Enabling the pipeline is done in settings.py (a sketch follows); the pipeline class itself then writes each item to books.json:
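A minimal sketch of the settings.py entry, assuming the project package is named scarpy_readbook_41 (adjust the dotted path to your own project):

# settings.py
ITEM_PIPELINES = {
    # dotted path to the pipeline class; 300 is the execution priority
    "scarpy_readbook_41.pipelines.ScarpyReadbook41Pipeline": 300,
}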
class ScarpyReadbook41Pipeline:
    def open_spider(self, spider):
        self.fp = open('books.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        self.fp.write(str(item))
        return item

    def close_spider(self, spider):
        self.fp.close()
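With the spider and pipeline in place, the crawl is started from the project directory with:

scrapy crawl read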
After running it, we find that the data for the first page is missing.
You need to use the _1 form of the URL in start_urls, otherwise the first page's data is not read: the rule's allow pattern /book/1206_\d+\.html only matches URLs containing _<number>, so it never matches /book/1206.html.
start_urls = ["https://www.dushu.com/book/1206_1.html"]
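A quick standalone check of the pattern behaviour (not part of the project code):

import re

pattern = r"/book/1206_\d+\.html"
print(re.search(pattern, "https://www.dushu.com/book/1206.html"))    # None: this form is never matched
print(re.search(pattern, "https://www.dushu.com/book/1206_1.html"))  # a match: the first page is covered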