Introduction to the Scrapy Framework
Scrapy is a fast, high-level screen-scraping and web-crawling framework written in Python. It is used to crawl web sites and extract structured data from their pages; only a small amount of code is needed to get a crawler running quickly.
- Create a project (scrapy startproject xxx): create a new crawler project
- Define the target (edit items.py): declare the data you want to scrape
- Write the spider (spiders/xxspider.py): write the spider and start crawling pages
- Store the content (pipelines.py): design pipelines to store the scraped content
Note: the whole program stops only when no requests are left in the scheduler (in other words, URLs whose download failed will also be re-downloaded by Scrapy).
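As a rough illustration of that retry behaviour, the retry middleware can be tuned in settings.py; the values below are only examples, not this project's actual configuration:

# settings.py -- retry middleware options (example values)
RETRY_ENABLED = True    # retry failed downloads (the default)
RETRY_TIMES = 2         # extra attempts per request on top of the first one
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]  # responses that trigger a retry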
This post will not cover basic project creation; for that you can head over to that earlier article. The basics are not the point here; this post goes over some of the finer details instead.
Scrapy website: https://scrapy.org/
Scrapy documentation: https://docs.scrapy.org/en/latest/
GitHub:https://github.com/scrapy/scrapy/
Basic structure
Defining the data structure to scrape
First, define the data structure you want to scrape in items.py:
class ScrapySpiderItem(scrapy.Item):
    # a class that defines the data structure to be scraped
    name = scrapy.Field()
    title = scrapy.Field()
    url = scrapy.Field()
Why define it this way?
In the Scrapy framework, scrapy.Field() is the special class used to define Item fields; it essentially acts as a marker. Specifically:
- Data structure declaration: each Field instance represents one data field of the Item (name/title/url in the code above) and declares which fields the spider is going to collect.
- Metadata container: although it looks like an ordinary assignment, Field() can actually carry metadata parameters.
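For instance, a field can hold serialization metadata that exporters may read later; the serializer below is only an illustration, not something this project relies on:

import scrapy

class ScrapySpiderItem(scrapy.Item):
    # keyword arguments passed to Field() are stored as field metadata and can
    # be read back later, e.g. ScrapySpiderItem.fields['name']['serializer']
    name = scrapy.Field(serializer=str)
    title = scrapy.Field()
    url = scrapy.Field()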
Once the fields are defined here, they can be used later like this:
item = ScrapySpiderItem()
item['name'] = 'stock name'
item['title'] = 'stock price data'
item['url'] = 'http://example.com'
Then you can run
scrapy genspider itcast "itcast.cn"
to create a spider that crawls pages within the itcast.cn domain.
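The file genspider generates typically looks roughly like this (the exact template varies between Scrapy versions):

import scrapy

class ItcastSpider(scrapy.Spider):
    name = "itcast"
    allowed_domains = ["itcast.cn"]
    start_urls = ["https://itcast.cn"]

    def parse(self, response):
        pass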
Crawling the data
Note: if you are following the Runoob (菜鳥教程) tutorial, make sure to change itcast.py to the following:
import scrapy

class ItcastSpider(scrapy.Spider):
    name = "itcast"
    allowed_domains = ["itcast.cn"]
    start_urls = ["http://www.itcast.cn/channel/teacher.shtml"]

    def parse(self, response):
        filename = "teacher.html"
        open(filename, 'wb').write(response.body)
Use wb here, because response.body is bytes; opening the file in w mode would not write it correctly.
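If you do want text mode, one option (just a sketch) is to write the decoded response.text instead:

def parse(self, response):
    # response.text is the body decoded to str, so plain "w" mode works here
    with open("teacher.html", "w", encoding="utf-8") as f:
        f.write(response.text)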
The basic skeleton then looks like this:
from mySpider.items import ItcastItem

def parse(self, response):
    #open("teacher.html", "wb").write(response.body).close()

    # collection that holds the teacher items
    items = []

    for each in response.xpath("//div[@class='li_txt']"):
        # wrap the extracted data in an `ItcastItem` object
        item = ItcastItem()

        # extract() always returns unicode strings
        name = each.xpath("h3/text()").extract()
        title = each.xpath("h4/text()").extract()
        info = each.xpath("p/text()").extract()

        # xpath returns a list containing a single element
        item['name'] = name[0]
        item['title'] = title[0]
        item['info'] = info[0]

        items.append(item)

    # return all of the data at the end
    return items
After the page is fetched, XPath extracts the information; the extracted values come back as unicode strings, are stored in the fields declared earlier, and are then returned.
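In recent Scrapy versions, get() and getall() are the preferred spellings of extract_first() and extract(); get() also avoids the IndexError you would hit when a selector matches nothing. A small sketch of the same loop:

for each in response.xpath("//div[@class='li_txt']"):
    item = ItcastItem()
    # .get() returns the first match, or None instead of raising IndexError
    item['name'] = each.xpath("h3/text()").get()
    item['title'] = each.xpath("h4/text()").get()
    item['info'] = each.xpath("p/text()").get()
    yield item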
Saving the data
There are four main export formats; a settings-based alternative is sketched after the list:
- scrapy crawl itcast -o teachers.json
- scrapy crawl itcast -o teachers.jsonl  # JSON Lines format
- scrapy crawl itcast -o teachers.csv
- scrapy crawl itcast -o teachers.xml
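Instead of passing -o on every run, the same exports can be configured once in settings.py via the FEEDS setting; the values here are only examples:

# settings.py
FEEDS = {
    "teachers.json": {"format": "json", "encoding": "utf8", "overwrite": True},
    "teachers.csv": {"format": "csv"},
}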
The above only covers project setup and basic usage. As we move from plain crawling into the framework itself, you are probably also curious about where its strengths lie and what it does that is special.
Scrapy structure
pipelines (item pipeline)
This file is the pipeline we mentioned. After an Item has been collected in a Spider, it is passed to the Item Pipeline, and the pipeline components process it in the order they are defined. Each Item Pipeline is a Python class implementing a few simple methods, which decide, for example, whether the Item is dropped or stored. Typical uses of an item pipeline include:
- validating the scraped data (checking that an item contains certain fields, e.g. a name field)
- checking for and dropping duplicates (a sketch of this follows the template below)
- saving the scraped results to a file or a database
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter

class MyspiderPipeline:
    def process_item(self, item, spider):
        return item
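As an illustration of the duplicate-dropping use case, a minimal sketch of a de-duplication pipeline might look like this (the seen-URL logic is purely illustrative):

from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem

class DuplicatesPipeline:
    def __init__(self):
        self.seen_urls = set()

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        url = adapter.get('url')
        if url in self.seen_urls:
            # raising DropItem stops later pipelines from processing this item
            raise DropItem(f"Duplicate item found: {url!r}")
        self.seen_urls.add(url)
        return item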
settings (configuration)
The comments in the code explain the basic settings:
# Scrapy settings for mySpider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
# Project name
BOT_NAME = "mySpider"

SPIDER_MODULES = ["mySpider.spiders"]
NEWSPIDER_MODULE = "mySpider.spiders"

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = "mySpider (+http://www.yourdomain.com)"

# Whether to obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32  # maximum concurrency 32, default is 16

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# Download delay of 3 seconds
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
#    "Accept-Language": "en",
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    "mySpider.middlewares.MyspiderSpiderMiddleware": 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    "mySpider.middlewares.MyspiderDownloaderMiddleware": 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    "scrapy.extensions.telnet.TelnetConsole": None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    "mySpider.pipelines.MyspiderPipeline": 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = "httpcache"
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"

# Set settings whose default value is deprecated to a future-proof value
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"
spiders
The directory that holds the spider code and defines the crawling logic.
import scrapy
from mySpider.items import ItcastItem

class ItcastSpider(scrapy.Spider):
    name = "itcast"
    allowed_domains = ["itcast.cn"]
    start_urls = ["http://www.itcast.cn/"]

    def parse(self, response):
        # grab the list entries on the page
        list = response.xpath('//*[@id="mCSB_1_container"]/ul/li[@*]')
Hands-on example (university information)
Target site: scraping university information
base64:
aHR0cDovL3NoYW5naGFpcmFua2luZy5jbi9yYW5raW5ncy9iY3VyLzIwMjQ=
Variable naming
In the itcast spider we just created, change the domain and start URL, and in items.py adjust the structure that receives the data.
itcast.py
import scrapy
from mySpider.items import ItcastItem

class ItcastSpider(scrapy.Spider):
    name = "itcast"
    allowed_domains = ["shanghairanking.cn"]
    start_urls = ["https://www.shanghairanking.cn/rankings/bcur/2024"]

    def parse(self, response):
        # select the first 30 ranking rows
        list = response.xpath('(//*[@class="align-left"])[position() > 1 and position() <= 31]')
        for i in list:
            # create a fresh item for each row
            item = ItcastItem()
            name = i.xpath('./div/div[2]/div[1]/div/div/span/text()').extract()
            description = i.xpath('./div/div[2]/p/text()').extract()
            location = i.xpath('../td[3]/text()').extract()
            item['name'] = str(name).strip().replace('\\n', '').replace(' ', '')
            item['description'] = str(description).strip().replace('\\n', '').replace(' ', '')
            item['location'] = str(location).strip().replace('\\n', '').replace(' ', '')
            print(item)
            yield item
If the XPath expressions here look unfamiliar, see my earlier post: 一些爬蟲基礎知識備忘錄-xpath (crawler basics notes on XPath).
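A convenient way to try out XPath expressions before putting them into the spider is the scrapy shell (a sketch; whether a given expression matches anything depends on the page you fetch):

scrapy shell "https://www.shanghairanking.cn/rankings/bcur/2024"
# inside the interactive shell:
>>> response.xpath('(//*[@class="align-left"])[position() > 1 and position() <= 31]')
>>> exit()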
items.py
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy

class ItcastItem(scrapy.Item):
    name = scrapy.Field()
    description = scrapy.Field()
    location = scrapy.Field()
pipelines.py
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
import csv
from itemadapter import ItemAdapter

class MyspiderPipeline:
    def __init__(self):
        # create the csv file when the pipeline is initialised
        self.f = open('school.csv', 'w', encoding='utf-8', newline='')
        self.file_name = ['name', 'description', 'location']
        self.writer = csv.DictWriter(self.f, fieldnames=self.file_name)
        self.writer.writeheader()  # write the header row first

    def process_item(self, item, spider):
        self.writer.writerow(dict(item))  # convert the item to a dict before writing
        print(item)
        return item

    def close_spider(self, spider):
        self.f.close()  # close the file
settings.py
# Scrapy settings for mySpider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
# Project name
BOT_NAME = "mySpider"

SPIDER_MODULES = ["mySpider.spiders"]
NEWSPIDER_MODULE = "mySpider.spiders"

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = "mySpider (+http://www.yourdomain.com)"

# Whether to obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32  # maximum concurrency 32, default is 16

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# Download delay of 3 seconds
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en",
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    "mySpider.middlewares.MyspiderSpiderMiddleware": 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    "mySpider.middlewares.MyspiderDownloaderMiddleware": 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    "scrapy.extensions.telnet.TelnetConsole": None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    "mySpider.pipelines.MyspiderPipeline": 300,
}

# Only show warnings and above, to keep the output readable
LOG_LEVEL = 'WARNING'

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = "httpcache"
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"

# Set settings whose default value is deprecated to a future-proof value
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"
start.py
Then create a start.py under the mySpider folder so the crawl can be launched by running that file directly, without typing the command:
from scrapy import cmdline
cmdline.execute("scrapy crawl itcast".split())
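An alternative sketch, using Scrapy's documented scripting API instead of cmdline (the import path of the spider class is an assumption based on this project layout):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from mySpider.spiders.itcast import ItcastSpider

process = CrawlerProcess(get_project_settings())
process.crawl(ItcastSpider)
process.start()  # blocks until the crawl finishes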
And with that, the data is saved in csv format as expected!
A few notes
Continuing the crawl and pagination
Scrapy uses yield both to emit parsed items and to issue further requests. To paginate, or to keep crawling after a single request has finished, yield another request; a plain return here would simply end parse and stop the crawl. It is worth digging into how this works if you are interested.
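A minimal sketch of what that looks like (the next-page XPath is hypothetical and would need to match the real page):

def parse(self, response):
    for each in response.xpath("//div[@class='li_txt']"):
        item = ItcastItem()
        item['name'] = each.xpath("h3/text()").get()
        yield item  # hand the item to the pipelines

    # hypothetical "next page" link; response.follow resolves relative URLs
    next_page = response.xpath("//a[@class='next']/@href").get()
    if next_page:
        yield response.follow(next_page, callback=self.parse)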
A few settings options
LOG_LEVEL = 'WARNING'
This hides the less important log messages so we can focus on the data being crawled and analysed.
Also remember to enable the pipeline before running (just uncomment the ITEM_PIPELINES block that was commented out), and remember to turn off ROBOTSTXT_OBEY as well.
Field naming
The fieldnames used for the csv file must match the field names defined in items.py, otherwise the data will not be written.
That wraps things up. This post is an introduction to the Scrapy framework; look forward to more in-depth topics in follow-up articles, and let's keep improving together!