settings.py
The settings.py file centralizes the global configuration of a Scrapy project. By editing the options in this file you can tune the crawler's behavior, performance, and data handling without touching the spider code itself.
① The default settings.py (stock English comments)
# Scrapy settings for douban project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = "douban"

SPIDER_MODULES = ["douban.spiders"]
NEWSPIDER_MODULE = "douban.spiders"

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = "douban (+http://www.yourdomain.com)"

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
#    "Accept-Language": "en",
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    "douban.middlewares.doubanSpiderMiddleware": 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    "douban.middlewares.doubanDownloaderMiddleware": 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    "scrapy.extensions.telnet.TelnetConsole": None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    "douban.pipelines.doubanPipeline": 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = "httpcache"
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"

# Set settings whose default value is deprecated to a future-proof value
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"
② Annotated settings.py

# Scrapy settings file for the douban project
# For brevity, this file contains only important or commonly used settings.
# More settings can be found in the official documentation:
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

# Project name
BOT_NAME = "douban"

# Spider module paths
SPIDER_MODULES = ["douban.spiders"]
NEWSPIDER_MODULE = "douban.spiders"

# Crawler identity (user agent)
#USER_AGENT = "douban (+http://www.yourdomain.com)"

# Whether to obey robots.txt rules
ROBOTSTXT_OBEY = True

# Maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Delay between requests to the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also the autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Whether cookies are enabled (enabled by default)
#COOKIES_ENABLED = False

# Whether the Telnet console is enabled (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
#    "Accept-Language": "en",
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    "douban.middlewares.doubanSpiderMiddleware": 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    "douban.middlewares.doubanDownloaderMiddleware": 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    "scrapy.extensions.telnet.TelnetConsole": None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    "douban.pipelines.doubanPipeline": 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# Initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# Maximum download delay in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# Average number of requests Scrapy should send in parallel to each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Show throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = "httpcache"
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"

# Set settings whose default value is deprecated to a future-proof value
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"
1. Basic project settings

# Project name
BOT_NAME = "douban"

# Spider module paths
SPIDER_MODULES = ["douban.spiders"]
NEWSPIDER_MODULE = "douban.spiders"

- BOT_NAME: the name of this project's crawler; it appears in logs and statistics.
- SPIDER_MODULES: a list of the Python modules in which Scrapy looks for spiders. Here Scrapy searches for spider classes in the douban.spiders module.
- NEWSPIDER_MODULE: when you create a new spider with the scrapy genspider command, the new spider file is generated inside the douban.spiders module.
2. User agent and robots.txt rules

# Crawler identity (user agent)
#USER_AGENT = "douban (+http://www.yourdomain.com)"

# Whether to obey robots.txt rules
ROBOTSTXT_OBEY = True

- USER_AGENT: this commented-out line sets the crawler's user agent, which tells the server what kind of client issued the request. You can set it to a specific value so the server can identify your crawler and the website it belongs to.
- ROBOTSTXT_OBEY: when set to True, Scrapy follows the rules in the target site's robots.txt file, using them to decide whether certain pages may be visited.
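As a minimal sketch, a self-identifying configuration could look like the following; the agent string and URL are placeholders, not values from the project:

```python
# settings.py -- identify the crawler honestly (all values here are placeholders)
USER_AGENT = "douban-crawler/1.0 (+https://www.example.com/bot)"
ROBOTSTXT_OBEY = True  # keep obeying the target site's robots.txt
```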
3. Concurrency and download-delay settings

# Maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Delay between requests to the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also the autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

- CONCURRENT_REQUESTS: the maximum number of requests Scrapy performs concurrently; the default is 16.
- DOWNLOAD_DELAY: the delay, in seconds, between requests to the same website. This helps avoid putting too much pressure on the target site.
- CONCURRENT_REQUESTS_PER_DOMAIN and CONCURRENT_REQUESTS_PER_IP: only one of these two is honored at a time; they cap the number of concurrent requests to a single domain or a single IP address.
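For instance, these settings can be combined into a deliberately polite profile. The numbers are illustrative, and RANDOMIZE_DOWNLOAD_DELAY is a related built-in setting not shown in the file above:

```python
# settings.py -- a conservative crawling profile (illustrative values)
CONCURRENT_REQUESTS = 8             # global cap on in-flight requests
CONCURRENT_REQUESTS_PER_DOMAIN = 4  # tighter per-domain cap
DOWNLOAD_DELAY = 2                  # about 2 seconds between requests to one site
RANDOMIZE_DOWNLOAD_DELAY = True     # jitter each delay (0.5x-1.5x of DOWNLOAD_DELAY)
```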
4. Cookies and Telnet console settings

# Whether cookies are enabled (enabled by default)
#COOKIES_ENABLED = False

# Whether the Telnet console is enabled (enabled by default)
#TELNETCONSOLE_ENABLED = False

- COOKIES_ENABLED: when set to False, Scrapy does not handle cookies. Cookie handling is enabled by default.
- TELNETCONSOLE_ENABLED: when set to False, the Telnet console is disabled. It is enabled by default and lets you interact with the crawler while it is running.
- COOKIES_DEBUG: when set to True, cookie debugging is enabled and Scrapy logs detailed cookie information, including the cookies sent in requests and those received in responses. Cookie debugging is disabled by default.
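A small sketch of how these switches might be combined while debugging, say, a login flow (illustrative values):

```python
# settings.py -- verbose cookie logging for debugging (illustrative)
COOKIES_ENABLED = True  # the default; keep cookie handling on
COOKIES_DEBUG = True    # log Cookie / Set-Cookie headers for each request/response
```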
5. Default request headers

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
#    "Accept-Language": "en",
#}

- DEFAULT_REQUEST_HEADERS: overrides the default request headers. The example above sets the Accept and Accept-Language headers.
6. Middleware settings

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    "douban.middlewares.doubanSpiderMiddleware": 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    "douban.middlewares.doubanDownloaderMiddleware": 543,
#}

- SPIDER_MIDDLEWARES: enables or disables spider middlewares, components that run while the spider processes requests and responses. The number 543 is the middleware's priority; lower values mean higher priority.
- DOWNLOADER_MIDDLEWARES: enables or disables downloader middlewares, which act while requests are being downloaded and responses returned.
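As a concrete illustration of what a downloader middleware looks like, here is a minimal sketch. The class name and the delay_hint meta key are hypothetical, not part of Scrapy; a real project would register the class in DOWNLOADER_MIDDLEWARES like the commented examples above:

```python
# middlewares.py -- a minimal, illustrative downloader middleware.
# Scrapy only requires the process_request / process_response hooks.
import random


class RandomDelayHintMiddleware:
    """Attach a random per-request delay hint to request.meta."""

    def __init__(self, max_delay=3.0):
        self.max_delay = max_delay

    def process_request(self, request, spider):
        # Returning None tells Scrapy to keep processing this request normally.
        request.meta["delay_hint"] = random.uniform(0, self.max_delay)
        return None
```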
7. Extension settings

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    "scrapy.extensions.telnet.TelnetConsole": None,
#}

- EXTENSIONS: enables or disables Scrapy extensions, components that add functionality to Scrapy. The example above disables the Telnet console extension.
8. Item pipeline settings

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    "douban.pipelines.doubanPipeline": 300,
#}

- ITEM_PIPELINES: configures the item pipelines, components that process the data after the spider extracts it, for example cleaning and storing it. The number 300 is the pipeline's priority; lower values run first.
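To show what such a component looks like, here is a minimal pipeline sketch. The class and field names are hypothetical; a real project would register the class in ITEM_PIPELINES like the commented example above:

```python
# pipelines.py -- an illustrative item pipeline. Scrapy calls process_item()
# once per scraped item; whatever it returns is passed to the next pipeline.
class RatingCleanupPipeline:
    """Normalize a text 'rating' field such as ' 9.4 ' into a float."""

    def process_item(self, item, spider):
        if "rating" in item and item["rating"] is not None:
            item["rating"] = float(str(item["rating"]).strip())
        return item
```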
9. AutoThrottle extension settings

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# Initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# Maximum download delay in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# Average number of requests Scrapy should send in parallel to each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Show throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

- AUTOTHROTTLE_ENABLED: when set to True, the AutoThrottle extension is enabled; it automatically adjusts the request delay based on the target site's response times.
- AUTOTHROTTLE_START_DELAY: the initial download delay, in seconds.
- AUTOTHROTTLE_MAX_DELAY: the maximum download delay allowed under high latency, in seconds.
- AUTOTHROTTLE_TARGET_CONCURRENCY: the average number of requests Scrapy should send in parallel to each remote server.
- AUTOTHROTTLE_DEBUG: when set to True, throttling statistics are shown for every response received.
10. HTTP cache settings

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = "httpcache"
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"

- HTTPCACHE_ENABLED: when set to True, HTTP caching is enabled.
- HTTPCACHE_EXPIRATION_SECS: the cache expiry time, in seconds. 0 means cached pages never expire.
- HTTPCACHE_DIR: the directory where cache files are stored.
- HTTPCACHE_IGNORE_HTTP_CODES: a list of HTTP status codes whose responses should not be cached.
- HTTPCACHE_STORAGE: the cache storage backend; here, filesystem-based cache storage.
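A sketch of a development-time cache setup, so repeated runs do not re-download every page (values are illustrative):

```python
# settings.py -- local page cache while developing (illustrative values)
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 3600               # treat copies older than 1 h as stale
HTTPCACHE_DIR = "httpcache"                    # lives under the project's .scrapy/ dir
HTTPCACHE_IGNORE_HTTP_CODES = [500, 502, 503]  # never cache server errors
```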
11. Other settings

# Set settings whose default value is deprecated to a future-proof value
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"

- REQUEST_FINGERPRINTER_IMPLEMENTATION: selects the request-fingerprinting implementation; "2.7" opts in to the newer algorithm for forward compatibility.
- TWISTED_REACTOR: selects the Twisted event-loop implementation; AsyncioSelectorReactor is used here to support asyncio-based asynchronous I/O.
- FEED_EXPORT_ENCODING: sets the encoding used when exporting data to UTF-8.
③ Other commonly used settings.py options
1. Logging

# Log level; valid values are 'DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL'
LOG_LEVEL = 'DEBUG'
# Log file path; when set, logs go to this file instead of the console
LOG_FILE = 'scrapy.log'

- LOG_LEVEL: controls how verbose Scrapy's log output is. For example, DEBUG produces the most detailed output, while CRITICAL logs only critical errors.
- LOG_FILE: saves the log to the given file for later inspection and analysis.
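A sketch combining these for a long crawl; LOG_FILE_APPEND is an additional related setting, available in newer Scrapy releases, that controls whether the file is truncated between runs:

```python
# settings.py -- quieter logging written to a file (illustrative values)
LOG_LEVEL = "INFO"       # drop per-request DEBUG noise
LOG_FILE = "scrapy.log"  # write to a file instead of the console
LOG_FILE_APPEND = False  # start a fresh log on every run (Scrapy >= 2.6)
```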
2. Download timeout and retries

# Download timeout (seconds)
DOWNLOAD_TIMEOUT = 180
# Number of retries
RETRY_TIMES = 3
# HTTP status codes that trigger a retry
RETRY_HTTP_CODES = [500, 502, 503, 504, 408]

- DOWNLOAD_TIMEOUT: a request that receives no response within this time is treated as timed out.
- RETRY_TIMES: how many times a failed request is retried.
- RETRY_HTTP_CODES: Scrapy retries requests that come back with any of these HTTP status codes.
3. Proxies

Scrapy does not read a HTTP_PROXY option from settings.py. Its built-in HttpProxyMiddleware honors the standard proxy environment variables, and a proxy can also be assigned per request through request.meta['proxy'].

# Shell: point Scrapy's HttpProxyMiddleware at a proxy server
export http_proxy='http://proxy.example.com:8080'
export https_proxy='http://proxy.example.com:8080'
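Alternatively, a proxy can be applied per request via request.meta['proxy'], which Scrapy's built-in HttpProxyMiddleware honors. A minimal sketch; the class name and proxy URL are placeholders, and the class would be registered in DOWNLOADER_MIDDLEWARES:

```python
# middlewares.py -- route every request through one proxy (illustrative)
class FixedProxyMiddleware:
    """Set request.meta['proxy'] so HttpProxyMiddleware routes the request."""

    def __init__(self, proxy_url="http://proxy.example.com:8080"):
        self.proxy_url = proxy_url

    def process_request(self, request, spider):
        request.meta["proxy"] = self.proxy_url
        return None  # let Scrapy continue processing the request
```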
4. Data export

# Export format, e.g. 'json', 'csv', 'xml'
FEED_FORMAT = 'json'
# Path of the exported data file
FEED_URI = 'output.json'

- FEED_FORMAT and FEED_URI: the format and destination of the exported data.
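Newer Scrapy releases (2.1 and later) deprecate FEED_FORMAT/FEED_URI in favor of the FEEDS dictionary; an equivalent sketch, with illustrative options:

```python
# settings.py -- FEEDS replaces FEED_FORMAT / FEED_URI in Scrapy >= 2.1
FEEDS = {
    "output.json": {
        "format": "json",
        "encoding": "utf-8",
        "overwrite": True,  # replace the file instead of appending
    },
}
```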
5. Scheduler queues

# Queue classes used by the scheduler; the FIFO variants below push the crawl
# toward breadth-first order (the defaults are LIFO, i.e. depth-first)
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'

- SCHEDULER_DISK_QUEUE and SCHEDULER_MEMORY_QUEUE: set the disk-backed and in-memory queue types used by the scheduler, respectively.
6. Crawler concurrency

# Concurrent requests per domain
CONCURRENT_REQUESTS_PER_DOMAIN = 8
# Concurrent requests per IP
CONCURRENT_REQUESTS_PER_IP = 8

When CONCURRENT_REQUESTS_PER_IP is non-zero it takes precedence: the per-domain limit is ignored and limits (and download delays) are enforced per IP instead.
7. Downloader middleware: rotating the User-Agent

# Swap the built-in User-Agent middleware for one that rotates User-Agents
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
}

- scrapy.downloadermiddlewares.useragent.UserAgentMiddleware: the downloader middleware built into Scrapy for handling the User-Agent request header. Its functionality is fairly basic: it normally sets a single User-Agent, or picks one at random from a simple list. Mapping it to None in the configuration above disables this default middleware.
- scrapy_fake_useragent.middleware.RandomUserAgentMiddleware: comes from the third-party scrapy-fake-useragent library. During the crawl it selects a different random User-Agent to imitate different clients, lowering the chance that the target site identifies the traffic as a crawler. Install it with pip install scrapy-fake-useragent.
8. Depth limits

# Maximum crawl depth
DEPTH_LIMIT = 3
# Depth-based priority adjustment
DEPTH_PRIORITY = 1

- DEPTH_LIMIT: caps the crawl depth so the spider does not wander too deep and collect an unmanageable amount of data.
- DEPTH_PRIORITY: adjusts request priority according to depth. Positive values deprioritize deeper requests, which, combined with FIFO scheduler queues, pushes the crawl toward breadth-first order; negative values favor depth-first crawling.
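Combining DEPTH_PRIORITY with the FIFO scheduler queues from section 5 above gives the usual breadth-first crawl recipe described in the Scrapy FAQ:

```python
# settings.py -- breadth-first crawl order; the defaults (LIFO queues,
# DEPTH_PRIORITY = 0) give roughly depth-first order instead
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = "scrapy.squeues.PickleFifoDiskQueue"
SCHEDULER_MEMORY_QUEUE = "scrapy.squeues.FifoMemoryQueue"
```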