"Scrapy到底該怎么學?"今天,我將用這篇萬字長文,帶你從零開始掌握Scrapy框架的核心用法,并分享我在實際項目中的實戰經驗!建議收藏?!
1. A Quick Introduction to Scrapy: Why Choose It?
1.1 Scrapy vs Requests+BeautifulSoup
Many beginners ask: "I already know Requests + BeautifulSoup, so why should I bother learning Scrapy?"
| Aspect | Requests + BS4 | Scrapy |
|---|---|---|
| Performance | Synchronous requests, slow | Asynchronous I/O, high throughput |
| Extensibility | Everything built by hand | Built-in middleware and pipeline systems |
| Feature completeness | Basic fetching and parsing only | Deduplication, request queueing, and error handling out of the box |
| Typical use case | Small-scale data collection | Production-grade crawling projects |
👉 Bottom line: for a small project, Requests is plenty; for a commercial-grade crawler, Scrapy is the better choice!
1.2 Scrapy's Core Architecture
(Scrapy architecture diagram; easiest to follow alongside the flow chart. In short, the Engine coordinates the Scheduler, Downloader, Spiders, and Item Pipelines, with downloader and spider middlewares hooked in between.)
2. Hands-On: Building Your First Scrapy Spider
2.1 Environment Setup
```bash
# A virtual environment is recommended
python -m venv scrapy_env
source scrapy_env/bin/activate   # Linux/Mac
scrapy_env\Scripts\activate      # Windows

pip install scrapy
```
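A quick way to confirm the installation worked (both are standard Scrapy CLI commands):

```bash
scrapy version      # prints the installed Scrapy version
scrapy version -v   # also lists key dependencies (Twisted, lxml, Python, ...)
```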
2.2 Creating the Project
```bash
scrapy startproject book_crawler
cd book_crawler
scrapy genspider books books.toscrape.com
```
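For orientation, `startproject` generates a skeleton roughly like the one below (the layout reflects Scrapy's default project template), and `genspider` adds books.py under spiders/:

```text
book_crawler/
├── scrapy.cfg            # deploy/config entry point
└── book_crawler/
    ├── items.py          # item definitions
    ├── middlewares.py    # downloader/spider middlewares
    ├── pipelines.py      # item pipelines
    ├── settings.py       # project settings
    └── spiders/
        └── books.py      # created by genspider
```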
2.3 Writing the Spider Code
```python
# spiders/books.py
import scrapy


class BooksSpider(scrapy.Spider):
    name = "books"

    def start_requests(self):
        urls = ['http://books.toscrape.com/']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Extract the book details from each product card
        for book in response.css('article.product_pod'):
            yield {
                'title': book.css('h3 a::attr(title)').get(),
                'price': book.css('p.price_color::text').get(),
                'rating': book.css('p.star-rating::attr(class)').get().split()[-1],
            }
        # Pagination: keep following the "next" link until it disappears
        next_page = response.css('li.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```
2.4 Running the Spider
```bash
scrapy crawl books -o books.csv
```
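The -o flag appends to the output file and infers the format from the extension, so the same spider can just as easily produce JSON or JSON Lines (recent Scrapy versions also offer -O, which overwrites instead of appending):

```bash
scrapy crawl books -o books.json   # JSON array
scrapy crawl books -o books.jl     # JSON Lines, one item per line
scrapy crawl books -O books.csv    # overwrite the file instead of appending
```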
3. Advanced Scrapy Techniques (Production-Grade Usage)
3.1 Beating Anti-Scraping Measures: Random User-Agents + Proxy IPs
```python
# middlewares.py
import random

from fake_useragent import UserAgent


class RandomUserAgentMiddleware:
    def process_request(self, request, spider):
        # Give every outgoing request a freshly randomized User-Agent
        request.headers['User-Agent'] = UserAgent().random


class ProxyMiddleware:
    PROXY_LIST = [
        'http://proxy1.example.com:8080',
        'http://proxy2.example.com:8080',
    ]

    def process_request(self, request, spider):
        # Route each request through a randomly chosen proxy
        proxy = random.choice(self.PROXY_LIST)
        request.meta['proxy'] = proxy
```
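Note that fake_useragent is a third-party package and is not bundled with Scrapy:

```bash
pip install fake-useragent
```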
Enable them in settings.py:
```python
DOWNLOADER_MIDDLEWARES = {
    'book_crawler.middlewares.RandomUserAgentMiddleware': 543,
    'book_crawler.middlewares.ProxyMiddleware': 544,
}
```
3.2 Data Storage: MySQL + an Item Pipeline
```python
# pipelines.py
import pymysql


class MySQLPipeline:
    def __init__(self):
        # Open the MySQL connection once, when the pipeline is created
        self.conn = pymysql.connect(
            host='localhost',
            user='root',
            password='123456',
            db='scrapy_data',
            charset='utf8mb4'
        )
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # Insert each scraped book into the books table
        sql = """INSERT INTO books(title, price, rating) VALUES (%s, %s, %s)"""
        self.cursor.execute(sql, (item['title'], item['price'], item['rating']))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Release the connection when the spider shuts down
        self.conn.close()
```
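Like the middlewares, the pipeline only runs once it is registered in settings.py. Assuming the book_crawler project from section 2.2, the registration would look roughly like this:

```python
# settings.py
ITEM_PIPELINES = {
    'book_crawler.pipelines.MySQLPipeline': 300,
}
```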
4. Common Questions (Q&A)
Q1: How do I scrape JavaScript-rendered pages?
Option 1: Scrapy + Splash
```python
# Install Splash first: docker run -p 8050:8050 scrapinghub/splash
yield scrapy.Request(
    url,
    self.parse,
    meta={'splash': {'args': {'wait': 2.5}}},
)
```
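For the splash meta key to be picked up, the scrapy-splash package must also be installed (pip install scrapy-splash) and wired into settings.py. A minimal sketch, following the values recommended in the scrapy-splash README (see it for the full set, including the dupefilter and spider middleware):

```python
# settings.py
SPLASH_URL = 'http://localhost:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
```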
Option 2: Scrapy + Playwright (recommended)
```python
# settings.py
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
```
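The download handlers alone are not quite enough: scrapy-playwright also requires the asyncio-based Twisted reactor, and each request has to opt in to browser rendering through its meta. A minimal sketch (install with pip install scrapy-playwright, then playwright install chromium):

```python
# settings.py (additional requirement)
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# in the spider: only requests flagged with playwright=True go through the browser
yield scrapy.Request(url, callback=self.parse, meta={"playwright": True})
```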
Q2: How do I build a distributed crawler?
Use scrapy-redis:
```python
# settings.py
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
REDIS_URL = 'redis://:password@localhost:6379/0'
```
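With these settings, every worker shares one request queue and one dupefilter in Redis. A common companion pattern is to subclass RedisSpider so workers pull their seed URLs from a Redis list instead of hard-coding them; a minimal sketch (the spider name and key below are illustrative):

```python
# spiders/books_redis.py
from scrapy_redis.spiders import RedisSpider


class BooksRedisSpider(RedisSpider):
    name = "books_redis"
    redis_key = "books:start_urls"  # each worker blocks on this Redis list for seed URLs

    def parse(self, response):
        for book in response.css('article.product_pod'):
            yield {'title': book.css('h3 a::attr(title)').get()}
```

Seed the queue from any machine with `redis-cli lpush books:start_urls http://books.toscrape.com/`, then start as many identical workers as you need.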
5. Performance Tuning Tips
- Concurrency control:

```python
# settings.py
CONCURRENT_REQUESTS = 32   # default is 16
DOWNLOAD_DELAY = 0.25      # a small delay helps avoid getting banned
```

- Request caching:

```python
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 86400   # cache responses for one day
```

- Auto-throttling:

```python
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5.0
```