1. Technical Challenges of Scraping Zhihu Comments
Zhihu loads comment data dynamically via Ajax, so `requests` + `BeautifulSoup` alone cannot retrieve the complete data. On top of that, Zhihu deploys several anti-scraping mechanisms:
- Request header validation (e.g. `User-Agent`, `Referer`)
- Cookie/Session checks (anonymous users can only retrieve partial data)
- Rate limiting (frequent requests can get your IP banned)
We therefore need to:
- Simulate browser requests (carrying Headers and Cookies)
- Parse the dynamic API endpoints (rather than static HTML)
- Speed up the crawl (multithreading / async)
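As a minimal sketch of the first point, a browser-like header set might look like this (the function name and the Cookie value are illustrative placeholders; copy the real Cookie from a logged-in browser session):

```python
def browser_headers(cookie="your_cookie_here"):
    """Headers that make API requests look like they come from a real browser."""
    return {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Referer": "https://www.zhihu.com/",
        "Cookie": cookie,  # placeholder: copy from a logged-in session
    }
```

These are then passed as `headers=` to each `requests.get` call.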
2. Analyzing the Zhihu Comment API
(1) Finding the comment API
Open any Zhihu question (e.g. `https://www.zhihu.com/question/xxxxxx`), press `F12` to open the developer tools, switch to the Network tab, and filter for XHR requests.
(2) Parsing the comment data structure
Comments are nested under the `data` field, with a structure like:

```json
{
  "data": [
    {
      "content": "comment text",
      "author": { "name": "username" },
      "created_time": 1620000000
    }
  ],
  "paging": { "is_end": false, "next": "next page URL" }
}
```

We need to page through recursively (via `paging.next`) to fetch all comments.
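That pagination logic can be sketched as a loop that follows `paging.next` until `is_end` is true; `fetch_json` here is a stand-in for the actual HTTP call (e.g. `requests.get(url, headers=...).json()`):

```python
def crawl_all_comments(first_url, fetch_json):
    """Follow `paging.next` links until the API reports the last page."""
    comments = []
    url = first_url
    while url:
        page = fetch_json(url)
        comments.extend(item["content"] for item in page["data"])
        paging = page.get("paging", {})
        url = None if paging.get("is_end") else paging.get("next")
    return comments
```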
3. Three Ways to Scrape Zhihu Comments in Python
(1) Single-threaded crawler (baseline)
Use the `requests` library to call the API directly, page by page:

```python
import requests
import time

def fetch_comments(question_id, max_pages=5):
    base_url = f"https://www.zhihu.com/api/v4/questions/{question_id}/answers"
    headers = {
        "User-Agent": "Mozilla/5.0",
        "Cookie": "your Cookie"  # copy from a logged-in browser session
    }
    comments = []
    for page in range(max_pages):
        url = f"{base_url}?offset={page * 10}&limit=10"
        resp = requests.get(url, headers=headers).json()
        for answer in resp["data"]:
            comments.append(answer["content"])
        time.sleep(1)  # avoid requesting too fast
    return comments

start_time = time.time()
comments = fetch_comments("12345678")  # replace with a real Zhihu question ID
print(f"Single-threaded crawl finished in {time.time() - start_time:.2f}s")
```
Drawback: pages are requested sequentially, so the crawl is slow (at roughly 1 second per page, 10 pages take about 10 seconds).
(2) Multithreaded crawler (ThreadPoolExecutor)
Use `concurrent.futures` to issue requests concurrently:

```python
import time
import requests
from concurrent.futures import ThreadPoolExecutor

def fetch_page(page, question_id):
    url = f"https://www.zhihu.com/api/v4/questions/{question_id}/answers?offset={page * 10}&limit=10"
    headers = {"User-Agent": "Mozilla/5.0"}
    resp = requests.get(url, headers=headers).json()
    return [answer["content"] for answer in resp["data"]]

def fetch_comments_multi(question_id, max_pages=5, threads=4):
    with ThreadPoolExecutor(max_workers=threads) as executor:
        futures = [executor.submit(fetch_page, page, question_id) for page in range(max_pages)]
        comments = []
        for future in futures:
            comments.extend(future.result())
    return comments

start_time = time.time()
comments = fetch_comments_multi("12345678", threads=4)
print(f"Multithreaded crawl finished in {time.time() - start_time:.2f}s")
```
Key points:
- The thread pool caps concurrency (reducing the chance of a ban)
- Roughly 3-4x faster than single-threaded (4 threads crawl 10 pages in about 2-3 seconds)
(3) Async crawler (asyncio + aiohttp)
Use `aiohttp` for asynchronous HTTP requests to push throughput further:

```python
import aiohttp
import asyncio
import time

# Proxy configuration
proxyHost = "www.16yun.cn"
proxyPort = "5445"
proxyUser = "16QMSOML"
proxyPass = "280651"

proxy_url = f"http://{proxyHost}:{proxyPort}"
proxy_auth = aiohttp.BasicAuth(proxyUser, proxyPass)

async def fetch_page_async(session, page, question_id):
    url = f"https://www.zhihu.com/api/v4/questions/{question_id}/answers?offset={page * 10}&limit=10"
    headers = {"User-Agent": "Mozilla/5.0"}
    # Note: aiohttp takes the proxy per request, not on the connector
    async with session.get(url, headers=headers, proxy=proxy_url, proxy_auth=proxy_auth) as resp:
        data = await resp.json()
        return [answer["content"] for answer in data["data"]]

async def fetch_comments_async(question_id, max_pages=5):
    connector = aiohttp.TCPConnector(
        limit=20,  # cap on concurrent connections
        force_close=True,
        enable_cleanup_closed=True,
    )
    async with aiohttp.ClientSession(connector=connector) as session:
        tasks = [fetch_page_async(session, page, question_id) for page in range(max_pages)]
        pages = await asyncio.gather(*tasks)
    return [comment for page in pages for comment in page]

if __name__ == "__main__":
    start_time = time.time()
    comments = asyncio.run(fetch_comments_async("12345678"))  # replace with a real Zhihu question ID
    print(f"Async crawl finished in {time.time() - start_time:.2f}s")
    print(f"Fetched {len(comments)} comments")
```
Advantages:
- A single event loop multiplexes all IO, avoiding thread creation and context-switch overhead (for IO-bound work the GIL is not the bottleneck anyway)
- Well suited to high-concurrency IO-bound workloads such as crawlers
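Besides the connector's connection limit, concurrency can also be capped at the task level with a semaphore. A hedged sketch, where `fetch` stands in for any awaitable page fetcher such as `fetch_page_async` above:

```python
import asyncio

async def gather_limited(args_list, fetch, max_concurrent=10):
    """Run fetch(*args) for each args tuple, at most max_concurrent at a time."""
    sem = asyncio.Semaphore(max_concurrent)

    async def guarded(args):
        async with sem:  # blocks while max_concurrent tasks are in flight
            return await fetch(*args)

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(guarded(a) for a in args_list))
```

This keeps the burst size under control even when you schedule hundreds of pages at once.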
4. Performance Comparison and Optimization Tips

| Approach | Time for 10 pages (s) | Best suited for |
|---|---|---|
| Single-threaded | ~10 | Small jobs, simple scraping |
| Multithreaded (4 threads) | ~2.5 | Medium scale, controlled concurrency |
| Async (asyncio) | ~1.8 | Large-scale, high-concurrency crawling |
Optimization tips
- Limit concurrency: avoid triggering anti-scraping measures (10-20 concurrent requests is a reasonable range).
- Add random delays: `time.sleep(random.uniform(0.5, 2))` to mimic human pacing.
- Use a proxy IP pool: prevents IP bans (e.g. `requests` + `ProxyPool`).
- Optimize storage: write results to a database asynchronously (e.g. MongoDB or MySQL).
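The first two tips can be combined into one helper. A sketch under the assumption that `do_get` is something like `requests.Session().get`; the names here are illustrative, not a real library API:

```python
import random
import time

def polite_get(do_get, url, max_retries=3, delay_range=(0.5, 2.0)):
    """GET with a random human-like delay before each attempt and
    exponential backoff between failed attempts."""
    for attempt in range(max_retries):
        time.sleep(random.uniform(*delay_range))  # random pacing
        resp = do_get(url)
        if resp.status_code == 200:
            return resp
        time.sleep(2 ** attempt)  # back off before retrying
    raise RuntimeError(f"gave up on {url} after {max_retries} attempts")
```

Rotating a proxy inside `do_get` (one proxy per attempt) slots naturally into the same retry loop.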