Introduction to the Scrapy Framework
Scrapy is a fast, high-level screen-scraping and web-crawling framework written in Python. It is used to crawl web sites and extract structured data from their pages; only a small amount of code is needed to get a crawler running quickly.
- Create a project (scrapy startproject xxx): create a new crawler project
- Define the target (edit items.py): declare the data you want to scrape
- Write the spider (spiders/xxspider.py): write the spider and start crawling pages
- Store the content (pipelines.py): design pipelines to store the scraped content
Note: the whole program stops only when no requests are left in the scheduler (in other words, URLs whose download failed will also be re-downloaded by Scrapy).
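As a rough illustration of that retry behaviour, the retry middleware can be tuned in settings.py; the values below are only examples, not this project's actual configuration:

# settings.py -- retry middleware options (example values)
RETRY_ENABLED = True    # retry failed downloads (the default)
RETRY_TIMES = 2         # extra attempts per request on top of the first one
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]  # responses that trigger a retry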
This post will not cover basic project creation; for that you can head over to that earlier article. The basics are not the point here; this post goes over some of the finer details instead.
Scrapy website: https://scrapy.org/
Scrapy documentation: https://docs.scrapy.org/en/latest/
GitHub:https://github.com/scrapy/scrapy/
Basic structure
Defining the data structure to scrape
First, define the data structure you want to scrape in items.py:
class ScrapySpiderItem(scrapy.Item):
    # a class that defines the data structure to be scraped
    name = scrapy.Field()
    title = scrapy.Field()
    url = scrapy.Field()
Why define it this way?
In the Scrapy framework, scrapy.Field() is the special class used to define Item fields; it essentially acts as a marker. Specifically:
- Data structure declaration: each Field instance represents one data field of the Item (name/title/url in the code above) and declares which fields the spider is going to collect.
- Metadata container: although it looks like an ordinary assignment, Field() can actually carry metadata parameters.
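For instance, a field can hold serialization metadata that exporters may read later; the serializer below is only an illustration, not something this project relies on:

import scrapy

class ScrapySpiderItem(scrapy.Item):
    # keyword arguments passed to Field() are stored as field metadata and can
    # be read back later, e.g. ScrapySpiderItem.fields['name']['serializer']
    name = scrapy.Field(serializer=str)
    title = scrapy.Field()
    url = scrapy.Field()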
Once the fields are defined here, they can be used later like this:
item = ScrapySpiderItem()
item['name'] = 'stock name'
item['title'] = 'stock price data'
item['url'] = 'http://example.com'
Then you can run
scrapy genspider itcast "itcast.cn"
to create a spider that crawls pages within the itcast.cn domain.
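The file genspider generates typically looks roughly like this (the exact template varies between Scrapy versions):

import scrapy

class ItcastSpider(scrapy.Spider):
    name = "itcast"
    allowed_domains = ["itcast.cn"]
    start_urls = ["https://itcast.cn"]

    def parse(self, response):
        pass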
Crawling the data
Note: if you are following the Runoob (菜鳥教程) tutorial, make sure to change itcast.py to the following:
import scrapy

class ItcastSpider(scrapy.Spider):
    name = "itcast"
    allowed_domains = ["itcast.cn"]
    start_urls = ["http://www.itcast.cn/channel/teacher.shtml"]

    def parse(self, response):
        filename = "teacher.html"
        open(filename, 'wb').write(response.body)
Use wb here, because response.body is bytes; opening the file in w mode would not write it correctly.
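If you do want text mode, one option (just a sketch) is to write the decoded response.text instead:

def parse(self, response):
    # response.text is the body decoded to str, so plain "w" mode works here
    with open("teacher.html", "w", encoding="utf-8") as f:
        f.write(response.text)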
The basic skeleton then looks like this:
from mySpider.items import ItcastItem

def parse(self, response):
    #open("teacher.html", "wb").write(response.body).close()

    # collection that holds the teacher items
    items = []

    for each in response.xpath("//div[@class='li_txt']"):
        # wrap the extracted data in an `ItcastItem` object
        item = ItcastItem()

        # extract() always returns unicode strings
        name = each.xpath("h3/text()").extract()
        title = each.xpath("h4/text()").extract()
        info = each.xpath("p/text()").extract()

        # xpath returns a list containing a single element
        item['name'] = name[0]
        item['title'] = title[0]
        item['info'] = info[0]

        items.append(item)

    # return all of the data at the end
    return items
After the page is fetched, XPath extracts the information; the extracted values come back as unicode strings, are stored in the fields declared earlier, and are then returned.
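In recent Scrapy versions, get() and getall() are the preferred spellings of extract_first() and extract(); get() also avoids the IndexError you would hit when a selector matches nothing. A small sketch of the same loop:

for each in response.xpath("//div[@class='li_txt']"):
    item = ItcastItem()
    # .get() returns the first match, or None instead of raising IndexError
    item['name'] = each.xpath("h3/text()").get()
    item['title'] = each.xpath("h4/text()").get()
    item['info'] = each.xpath("p/text()").get()
    yield item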
Saving the data
There are four main export formats; a settings-based alternative is sketched after the list:
- scrapy crawl itcast -o teachers.json
- scrapy crawl itcast -o teachers.jsonl  # JSON Lines format
- scrapy crawl itcast -o teachers.csv
- scrapy crawl itcast -o teachers.xml
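Instead of passing -o on every run, the same exports can be configured once in settings.py via the FEEDS setting; the values here are only examples:

# settings.py
FEEDS = {
    "teachers.json": {"format": "json", "encoding": "utf8", "overwrite": True},
    "teachers.csv": {"format": "csv"},
}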
The above only covers project setup and basic usage. As we move from plain crawling into the framework itself, you are probably also curious about where its strengths lie and what it does that is special.
Scrapy structure
pipelines (item pipeline)
This file is the pipeline we mentioned. After an Item has been collected in a Spider, it is passed to the Item Pipeline, and the pipeline components process it in the order they are defined. Each Item Pipeline is a Python class implementing a few simple methods, which decide, for example, whether the Item is dropped or stored. Typical uses of an item pipeline include:
- validating the scraped data (checking that an item contains certain fields, e.g. a name field)
- checking for and dropping duplicates (a sketch of this follows the template below)
- saving the scraped results to a file or a database
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter

class MyspiderPipeline:
    def process_item(self, item, spider):
        return item
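As an illustration of the duplicate-dropping use case, a minimal sketch of a de-duplication pipeline might look like this (the seen-URL logic is purely illustrative):

from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem

class DuplicatesPipeline:
    def __init__(self):
        self.seen_urls = set()

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        url = adapter.get('url')
        if url in self.seen_urls:
            # raising DropItem stops later pipelines from processing this item
            raise DropItem(f"Duplicate item found: {url!r}")
        self.seen_urls.add(url)
        return item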
settings (configuration)
The comments in the code explain the basic settings:
# Scrapy settings for mySpider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
# Project name
BOT_NAME = "mySpider"

SPIDER_MODULES = ["mySpider.spiders"]
NEWSPIDER_MODULE = "mySpider.spiders"

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = "mySpider (+http://www.yourdomain.com)"

# Whether to obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32  # maximum concurrency 32, default is 16

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# Download delay of 3 seconds
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
#    "Accept-Language": "en",
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    "mySpider.middlewares.MyspiderSpiderMiddleware": 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    "mySpider.middlewares.MyspiderDownloaderMiddleware": 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    "scrapy.extensions.telnet.TelnetConsole": None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    "mySpider.pipelines.MyspiderPipeline": 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = "httpcache"
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"

# Set settings whose default value is deprecated to a future-proof value
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"
spiders
The directory that holds the spider code and defines the crawling logic.
import scrapy
from mySpider.items import ItcastItem

class ItcastSpider(scrapy.Spider):
    name = "itcast"
    allowed_domains = ["itcast.cn"]
    start_urls = ["http://www.itcast.cn/"]

    def parse(self, response):
        # grab the list entries on the page
        list = response.xpath('//*[@id="mCSB_1_container"]/ul/li[@*]')
Hands-on example (university information)
Target site: scraping university information
base64:
aHR0cDovL3NoYW5naGFpcmFua2luZy5jbi9yYW5raW5ncy9iY3VyLzIwMjQ=
Variable naming
In the itcast spider we just created, change the domain and start URL, and in items.py adjust the structure that receives the data.
itcast.py
import scrapy
from mySpider.items import ItcastItem

class ItcastSpider(scrapy.Spider):
    name = "itcast"
    allowed_domains = ["shanghairanking.cn"]
    start_urls = ["https://www.shanghairanking.cn/rankings/bcur/2024"]

    def parse(self, response):
        # select the first 30 ranking rows
        list = response.xpath('(//*[@class="align-left"])[position() > 1 and position() <= 31]')
        for i in list:
            # create a fresh item for each row
            item = ItcastItem()
            name = i.xpath('./div/div[2]/div[1]/div/div/span/text()').extract()
            description = i.xpath('./div/div[2]/p/text()').extract()
            location = i.xpath('../td[3]/text()').extract()
            item['name'] = str(name).strip().replace('\\n', '').replace(' ', '')
            item['description'] = str(description).strip().replace('\\n', '').replace(' ', '')
            item['location'] = str(location).strip().replace('\\n', '').replace(' ', '')
            print(item)
            yield item
If the XPath expressions here look unfamiliar, see my earlier post: 一些爬蟲基礎知識備忘錄-xpath (crawler basics notes on XPath).
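A convenient way to try out XPath expressions before putting them into the spider is the scrapy shell (a sketch; whether a given expression matches anything depends on the page you fetch):

scrapy shell "https://www.shanghairanking.cn/rankings/bcur/2024"
# inside the interactive shell:
>>> response.xpath('(//*[@class="align-left"])[position() > 1 and position() <= 31]')
>>> exit()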
items.py
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy

class ItcastItem(scrapy.Item):
    name = scrapy.Field()
    description = scrapy.Field()
    location = scrapy.Field()
pipelines.py
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
import csv
from itemadapter import ItemAdapter

class MyspiderPipeline:
    def __init__(self):
        # create the csv file when the pipeline is initialised
        self.f = open('school.csv', 'w', encoding='utf-8', newline='')
        self.file_name = ['name', 'description', 'location']
        self.writer = csv.DictWriter(self.f, fieldnames=self.file_name)
        self.writer.writeheader()  # write the header row first

    def process_item(self, item, spider):
        self.writer.writerow(dict(item))  # convert the item to a dict before writing
        print(item)
        return item

    def close_spider(self, spider):
        self.f.close()  # close the file
settings.py
# Scrapy settings for mySpider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
# Project name
BOT_NAME = "mySpider"

SPIDER_MODULES = ["mySpider.spiders"]
NEWSPIDER_MODULE = "mySpider.spiders"

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = "mySpider (+http://www.yourdomain.com)"

# Whether to obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32  # maximum concurrency 32, default is 16

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# Download delay of 3 seconds
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en",
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    "mySpider.middlewares.MyspiderSpiderMiddleware": 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    "mySpider.middlewares.MyspiderDownloaderMiddleware": 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    "scrapy.extensions.telnet.TelnetConsole": None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    "mySpider.pipelines.MyspiderPipeline": 300,
}

# Only show warnings and above, to keep the output readable
LOG_LEVEL = 'WARNING'

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = "httpcache"
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"

# Set settings whose default value is deprecated to a future-proof value
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"
start.py
Then create a start.py under the mySpider folder so the crawl can be launched by running that file directly, without typing the command:
from scrapy import cmdline
cmdline.execute("scrapy crawl itcast".split())
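An alternative sketch, using Scrapy's documented scripting API instead of cmdline (the import path of the spider class is an assumption based on this project layout):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from mySpider.spiders.itcast import ItcastSpider

process = CrawlerProcess(get_project_settings())
process.crawl(ItcastSpider)
process.start()  # blocks until the crawl finishes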
And with that, the data is saved in csv format as expected!
A few notes
Continuing the crawl and pagination
Scrapy uses yield both to emit parsed items and to issue further requests. To paginate, or to keep crawling after a single request has finished, yield another request; a plain return here would simply end parse and stop the crawl. It is worth digging into how this works if you are interested.
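A minimal sketch of what that looks like (the next-page XPath is hypothetical and would need to match the real page):

def parse(self, response):
    for each in response.xpath("//div[@class='li_txt']"):
        item = ItcastItem()
        item['name'] = each.xpath("h3/text()").get()
        yield item  # hand the item to the pipelines

    # hypothetical "next page" link; response.follow resolves relative URLs
    next_page = response.xpath("//a[@class='next']/@href").get()
    if next_page:
        yield response.follow(next_page, callback=self.parse)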
A few settings options
LOG_LEVEL = 'WARNING'
This hides the less important log messages so we can focus on the data being crawled and analysed.
Also remember to enable the pipeline before running (just uncomment the ITEM_PIPELINES block that was commented out), and remember to turn off ROBOTSTXT_OBEY as well.
Field naming
The fieldnames used for the csv file must match the field names defined in items.py, otherwise the data will not be written.
That wraps things up. This post is an introduction to the Scrapy framework; look forward to more in-depth topics in follow-up articles, and let's keep improving together!