1. Technical Challenges of Scraping Zhihu Comments
Zhihu loads comment data dynamically via Ajax, so `requests` + `BeautifulSoup` alone cannot retrieve the complete data. On top of that, Zhihu deploys several anti-scraping mechanisms:
- Request header validation (e.g. `User-Agent`, `Referer`)
- Cookie/Session checks (anonymous users can only retrieve partial data)
- Rate limiting (frequent requests can get your IP banned)
We therefore need to:
- Simulate browser requests (carrying Headers and Cookies)
- Parse the dynamic API endpoints (rather than static HTML)
- Speed up the crawl (multithreading / async)
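As a minimal sketch of the first point, a browser-like header set might look like this (the function name and the Cookie value are illustrative placeholders; copy the real Cookie from a logged-in browser session):

```python
def browser_headers(cookie="your_cookie_here"):
    """Headers that make API requests look like they come from a real browser."""
    return {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Referer": "https://www.zhihu.com/",
        "Cookie": cookie,  # placeholder: copy from a logged-in session
    }
```

These are then passed as `headers=` to each `requests.get` call.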
2. Analyzing the Zhihu Comment API
(1) Finding the comment API
Open any Zhihu question (e.g. `https://www.zhihu.com/question/xxxxxx`), press `F12` to open the developer tools, switch to the Network tab, and filter for XHR requests.
(2) Parsing the comment data structure
Comments are nested under the `data` field, with a structure like:

```json
{
  "data": [
    {
      "content": "comment text",
      "author": { "name": "username" },
      "created_time": 1620000000
    }
  ],
  "paging": { "is_end": false, "next": "next page URL" }
}
```

We need to page through recursively (via `paging.next`) to fetch all comments.
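That pagination logic can be sketched as a loop that follows `paging.next` until `is_end` is true; `fetch_json` here is a stand-in for the actual HTTP call (e.g. `requests.get(url, headers=...).json()`):

```python
def crawl_all_comments(first_url, fetch_json):
    """Follow `paging.next` links until the API reports the last page."""
    comments = []
    url = first_url
    while url:
        page = fetch_json(url)
        comments.extend(item["content"] for item in page["data"])
        paging = page.get("paging", {})
        url = None if paging.get("is_end") else paging.get("next")
    return comments
```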
3. Three Ways to Scrape Zhihu Comments in Python
(1) Single-threaded crawler (baseline)
Use the `requests` library to call the API directly, page by page:

```python
import requests
import time

def fetch_comments(question_id, max_pages=5):
    base_url = f"https://www.zhihu.com/api/v4/questions/{question_id}/answers"
    headers = {
        "User-Agent": "Mozilla/5.0",
        "Cookie": "your Cookie"  # copy from a logged-in browser session
    }
    comments = []
    for page in range(max_pages):
        url = f"{base_url}?offset={page * 10}&limit=10"
        resp = requests.get(url, headers=headers).json()
        for answer in resp["data"]:
            comments.append(answer["content"])
        time.sleep(1)  # avoid requesting too fast
    return comments

start_time = time.time()
comments = fetch_comments("12345678")  # replace with a real Zhihu question ID
print(f"Single-threaded crawl finished in {time.time() - start_time:.2f}s")
```
Drawback: pages are requested sequentially, so the crawl is slow (at roughly 1 second per page, 10 pages take about 10 seconds).
(2) Multithreaded crawler (ThreadPoolExecutor)
Use `concurrent.futures` to issue requests concurrently:

```python
import time
import requests
from concurrent.futures import ThreadPoolExecutor

def fetch_page(page, question_id):
    url = f"https://www.zhihu.com/api/v4/questions/{question_id}/answers?offset={page * 10}&limit=10"
    headers = {"User-Agent": "Mozilla/5.0"}
    resp = requests.get(url, headers=headers).json()
    return [answer["content"] for answer in resp["data"]]

def fetch_comments_multi(question_id, max_pages=5, threads=4):
    with ThreadPoolExecutor(max_workers=threads) as executor:
        futures = [executor.submit(fetch_page, page, question_id) for page in range(max_pages)]
        comments = []
        for future in futures:
            comments.extend(future.result())
    return comments

start_time = time.time()
comments = fetch_comments_multi("12345678", threads=4)
print(f"Multithreaded crawl finished in {time.time() - start_time:.2f}s")
```
Key points:
- The thread pool caps concurrency (reducing the chance of a ban)
- Roughly 3-4x faster than single-threaded (4 threads crawl 10 pages in about 2-3 seconds)
(3) Async crawler (asyncio + aiohttp)
Use `aiohttp` for asynchronous HTTP requests to push throughput further:

```python
import aiohttp
import asyncio
import time

# Proxy configuration
proxyHost = "www.16yun.cn"
proxyPort = "5445"
proxyUser = "16QMSOML"
proxyPass = "280651"

proxy_url = f"http://{proxyHost}:{proxyPort}"
proxy_auth = aiohttp.BasicAuth(proxyUser, proxyPass)

async def fetch_page_async(session, page, question_id):
    url = f"https://www.zhihu.com/api/v4/questions/{question_id}/answers?offset={page * 10}&limit=10"
    headers = {"User-Agent": "Mozilla/5.0"}
    # Note: aiohttp takes the proxy per request, not on the connector
    async with session.get(url, headers=headers, proxy=proxy_url, proxy_auth=proxy_auth) as resp:
        data = await resp.json()
        return [answer["content"] for answer in data["data"]]

async def fetch_comments_async(question_id, max_pages=5):
    connector = aiohttp.TCPConnector(
        limit=20,  # cap on concurrent connections
        force_close=True,
        enable_cleanup_closed=True,
    )
    async with aiohttp.ClientSession(connector=connector) as session:
        tasks = [fetch_page_async(session, page, question_id) for page in range(max_pages)]
        pages = await asyncio.gather(*tasks)
    return [comment for page in pages for comment in page]

if __name__ == "__main__":
    start_time = time.time()
    comments = asyncio.run(fetch_comments_async("12345678"))  # replace with a real Zhihu question ID
    print(f"Async crawl finished in {time.time() - start_time:.2f}s")
    print(f"Fetched {len(comments)} comments")
```
Advantages:
- A single event loop multiplexes all IO, avoiding thread creation and context-switch overhead (for IO-bound work the GIL is not the bottleneck anyway)
- Well suited to high-concurrency IO-bound workloads such as crawlers
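Besides the connector's connection limit, concurrency can also be capped at the task level with a semaphore. A hedged sketch, where `fetch` stands in for any awaitable page fetcher such as `fetch_page_async` above:

```python
import asyncio

async def gather_limited(args_list, fetch, max_concurrent=10):
    """Run fetch(*args) for each args tuple, at most max_concurrent at a time."""
    sem = asyncio.Semaphore(max_concurrent)

    async def guarded(args):
        async with sem:  # blocks while max_concurrent tasks are in flight
            return await fetch(*args)

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(guarded(a) for a in args_list))
```

This keeps the burst size under control even when you schedule hundreds of pages at once.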
4. Performance Comparison and Optimization Tips

| Approach | Time for 10 pages (s) | Best suited for |
|---|---|---|
| Single-threaded | ~10 | Small jobs, simple scraping |
| Multithreaded (4 threads) | ~2.5 | Medium scale, controlled concurrency |
| Async (asyncio) | ~1.8 | Large-scale, high-concurrency crawling |
Optimization tips
- Limit concurrency: avoid triggering anti-scraping measures (10-20 concurrent requests is a reasonable range).
- Add random delays: `time.sleep(random.uniform(0.5, 2))` to mimic human pacing.
- Use a proxy IP pool: prevents IP bans (e.g. `requests` + `ProxyPool`).
- Optimize storage: write results to a database asynchronously (e.g. MongoDB or MySQL).
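The first two tips can be combined into one helper. A sketch under the assumption that `do_get` is something like `requests.Session().get`; the names here are illustrative, not a real library API:

```python
import random
import time

def polite_get(do_get, url, max_retries=3, delay_range=(0.5, 2.0)):
    """GET with a random human-like delay before each attempt and
    exponential backoff between failed attempts."""
    for attempt in range(max_retries):
        time.sleep(random.uniform(*delay_range))  # random pacing
        resp = do_get(url)
        if resp.status_code == 200:
            return resp
        time.sleep(2 ** attempt)  # back off before retrying
    raise RuntimeError(f"gave up on {url} after {max_retries} attempts")
```

Rotating a proxy inside `do_get` (one proxy per attempt) slots naturally into the same retry loop.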