Python Web Scraping Tutorial | Douban TOP250 Data Scraping and Analysis in Practice

1. Project Background and Data Value

Douban TOP250 is an important ranking in the film and television industry, and its data is valuable in the following ways:

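Ratings and number of raters: gauge a film's market popularity;
Director and cast information: analyze talent value and industry trends;
Genre / region / year: reveal how film genres shift across eras;
Classic quotes: usable for NLP sentiment analysis or as training data for recommender systems.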

2. Tech Stack and Environment Setup

Install the core Python libraries:

pip install requests beautifulsoup4 pandas numpy matplotlib seaborn fake_useragent

Recommended versions:

Library	Purpose	Version requirement
requests	HTTP requests	≥ 2.25
BeautifulSoup	HTML parsing	≥ 4.9
pandas	Data processing	≥ 1.2
fake_useragent	Spoofing the User-Agent	≥ 1.1
time, random	Request delays	Python standard library

3. A Closer Look at the Page Structure

Target page: https://movie.douban.com/top250. Inspecting the page with the browser developer tools (F12) reveals the following key points:

Movie entries live inside <div class="item"> and contain the title, director, rating, and so on;
Pagination is controlled by the parameter start=0, 25, … 225.
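The pagination rule above can be sketched as a small URL generator. The constant and function names here (BASE_URL, PAGE_SIZE, TOTAL, page_urls) are illustrative choices, not part of Douban's interface; only the start parameter and the step of 25 come from the text.

```python
# Illustrative names; only the "start" parameter and the step of 25 come from the article.
BASE_URL = "https://movie.douban.com/top250?start={}"
PAGE_SIZE = 25
TOTAL = 250


def page_urls(total=TOTAL, page_size=PAGE_SIZE):
    """Return the list-page URLs in rank order: start=0, 25, ..., 225."""
    return [BASE_URL.format(start) for start in range(0, total, page_size)]


urls = page_urls()
print(len(urls))   # number of pages to request
print(urls[0])     # first page
print(urls[-1])    # last page
```

Building the full list up front makes it easy to log progress or to resume from a given page after a failure.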

4. Anti-Scraping Measures and Workarounds

Douban may employ the following anti-scraping measures:

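User-Agent detection: use fake_useragent to generate a random request header;
Rate limiting: combine time.sleep() with random.uniform() to introduce random delays;
IP blocking: configure a proxy IP pool to anonymize requests.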

Example code:

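from fake_useragent import UserAgent
import random
import time

# Pick a random User-Agent string for this request
headers = {'User-Agent': UserAgent().random}
# Pause for a random 1.5 to 3.5 seconds between requests
time.sleep(random.uniform(1.5, 3.5))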
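The proxy pool mentioned in the list is not shown in the example, so here is a minimal sketch of one way to rotate proxies. The addresses in PROXY_POOL are hypothetical placeholders and pick_proxies is an illustrative helper; the only library behavior assumed is that requests accepts a proxies mapping keyed by URL scheme.

```python
import random

# Hypothetical proxy endpoints; replace them with proxies you are permitted to use.
PROXY_POOL = [
    "http://127.0.0.1:8001",
    "http://127.0.0.1:8002",
]


def pick_proxies(pool=PROXY_POOL):
    """Choose one proxy at random and return it in the mapping format requests expects."""
    proxy = random.choice(pool)
    return {"http": proxy, "https": proxy}


proxies = pick_proxies()
# Would be passed along with the headers, for example:
# requests.get(url, headers=headers, proxies=proxies, timeout=15)
```

Choosing a proxy per request spreads traffic across addresses; a real pool would also need to drop proxies that stop responding.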

5. Complete Implementation

The complete scraping workflow follows, covering fetching, parsing, exception handling, and data storage:

import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import random
import re
from fake_useragent import UserAgent


def scrape_douban_top250():
    base_url = "https://movie.douban.com/top250?start={}"
    movies = []
    ua = UserAgent()
    for page in range(0, 250, 25):
        page_no = page // 25 + 1
        url = base_url.format(page)
        headers = {
            'User-Agent': ua.random,
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
            'Connection': 'keep-alive',
            'Referer': 'https://movie.douban.com/',
            'Cookie': 'bid=123456789;',  # placeholder cookie value
        }
        try:
            print(f"Fetching page {page_no}...")
            response = requests.get(url, headers=headers, timeout=15)
            response.encoding = 'utf-8'
            if response.status_code != 200:
                print(f"Request failed, status code: {response.status_code}")
                continue
            soup = BeautifulSoup(response.text, 'html.parser')
            grid_view = soup.find('ol', class_='grid_view')
            if not grid_view:
                print("Movie list not found; the page structure may have changed")
                continue
            for item in grid_view.find_all('li'):
                # Parse rank, title, director, year, region, genre, rating, vote count, and quote
                rank = item.find('em').text if item.find('em') else "N/A"
                title = item.find('span', class_='title').text.strip() if item.find('span', class_='title') else "Unknown title"
                other_title = item.find('span', class_='other').text.strip() if item.find('span', class_='other') else ""
                bd = item.find('div', class_='bd')
                info_text = bd.find('p').get_text(strip=True).replace('\xa0', ' ') if bd and bd.find('p') else ""
                # The page itself is in Simplified Chinese, so the markers below are
                # assumed to be "导演:" / "主演" / "人评价" as they appear on the site.
                director = info_text.split("导演:")[1].split("主演")[0].strip().split('/')[0].strip() if "导演:" in info_text else ""
                year_match = re.search(r'\d{4}', info_text)
                year = year_match.group(0) if year_match else ""
                parts = info_text.split("/")
                region = parts[-2].strip() if len(parts) >= 2 else ""
                genre = parts[-1].strip() if len(parts) >= 1 else ""
                rating = item.find('span', class_='rating_num').text if item.find('span', class_='rating_num') else "0.0"
                star_div = item.find('div', class_='star')
                spans = star_div.find_all('span') if star_div else []
                # Strip the suffix and any thousands separators before the digit check
                votes = spans[3].text.replace('人评价', '').replace(',', '').strip() if len(spans) >= 4 else "0"
                quote = item.find('span', class_='inq').text if item.find('span', class_='inq') else ""
                movies.append({
                    '排名': rank,
                    '標題': title,
                    '其他標題': other_title,
                    '導演': director,
                    '年份': year,
                    '地區': region,
                    '類型': genre,
                    '評分': float(rating) if rating.replace('.', '', 1).isdigit() else 0.0,
                    '評價人數': int(votes) if votes.isdigit() else 0,
                    '經典臺詞': quote,
                })
            print(f"Page {page_no} scraped, {len(movies)} records so far")
            # Random pause between pages to stay under rate limits
            time.sleep(random.uniform(3, 7))
        except Exception as e:
            print(f"Page {page_no} failed: {str(e)}")
    return pd.DataFrame(movies)


df = scrape_douban_top250()
if not df.empty:
    # Convert the year to an integer; unparseable values become 0
    df['年份'] = pd.to_numeric(df['年份'], errors='coerce').fillna(0).astype(int)
    df.to_csv('douban_top250.csv', index=False, encoding='utf-8-sig')
    print(f"Scraping finished, {len(df)} records saved to douban_top250.csv")
    print("\nPreview of the first 5 rows:")
    print(df.head())
else:
    print("No data was scraped; check the network connection and whether anti-scraping measures were triggered")

Code highlights:

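Random delays reduce the risk of being blocked;
Exception handling improves stability;
Data cleaning: ratings are converted to float and vote counts to int;
A DataFrame is built from structured dictionaries.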
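The type conversions listed above can also be pulled into small helpers so that a malformed value never aborts a page. This is a sketch only: parse_rating and parse_votes are hypothetical names, and the sample inputs are made up to show the expected formats.

```python
import re


def parse_rating(text, default=0.0):
    """Convert a rating string such as '9.7' to float, falling back to default."""
    try:
        return float(text.strip())
    except (ValueError, AttributeError):
        return default


def parse_votes(text, default=0):
    """Keep only the digits of a vote label and convert them to int."""
    digits = re.sub(r"[^\d]", "", text or "")
    return int(digits) if digits else default


# Made-up inputs illustrating the formats the helpers tolerate
print(parse_rating(" 9.7 "), parse_rating("n/a"))
print(parse_votes("3,170,000人评价"), parse_votes(""))
```

Stripping every non-digit character handles both the text suffix and thousands separators in one step.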

6. Data Cleaning and Transformation Suggestions

Further processing is recommended once the DataFrame has been built:

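# Column names match the DataFrame built in section 5
# Keep only a four-digit year; works whether the column holds strings or integers
df['年份'] = df['年份'].astype(str).str.extract(r'(\d{4})', expand=False)
# Keep the first director when several are listed
df['導演'] = df['導演'].str.split('/').str[0].str.strip()
# Fill in a placeholder for films without a quote (empty strings included)
df['經典臺詞'] = df['經典臺詞'].replace('', pd.NA).fillna('無')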

These steps make the data easier to use in the analysis that follows.

7. Data Visualization Examples

Matplotlib and Seaborn together support a more in-depth analysis workflow:

import matplotlib.pyplot as plt
import seaborn as sns

# Column names and director names are Chinese; configure a CJK-capable
# Matplotlib font if the labels render as empty boxes.
plt.figure(figsize=(15, 10))

# 1. Rating distribution histogram
plt.subplot(2, 2, 1)
sns.histplot(df['評分'], bins=20, kde=True)
plt.title('Douban TOP250 Rating Distribution')

# 2. Release years (top 10)
plt.subplot(2, 2, 2)
sns.countplot(x='年份', data=df, order=df['年份'].value_counts().index[:10])
plt.xticks(rotation=45)
plt.title('Top 10 Release Years')

# 3. Top 10 directors by number of listed films
plt.subplot(2, 2, 3)
top_directors = df['導演'].value_counts().head(10)
sns.barplot(x=top_directors.values, y=top_directors.index)
plt.title('Top 10 Directors by Number of Listed Films')

# 4. Rating versus vote count (log scale)
plt.subplot(2, 2, 4)
sns.scatterplot(x='評價人數', y='評分', data=df, hue='評分', palette='viridis')
plt.xscale('log')
plt.title('Rating vs. Vote Count (Log Scale)')

plt.tight_layout()
plt.savefig('douban_analysis.png', dpi=300)

Data insights:

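Ratings cluster toward the left of the distribution, with a mean of about 8.9 and a minimum of about 8.3;
The years 1994–2004 produced a large number of classic films;
Hayao Miyazaki has the most films on the list, making him one of its most consistently successful directors;
Films with more than 1.5 million votes all score above 9.0.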
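Figures like these are worth rechecking against your own scrape, since the list changes over time. The sketch below shows one way to compute them with the standard library; the three records are illustrative values only, not real scraped data, and summarize is a hypothetical helper that uses the same column names as the DataFrame.

```python
from statistics import mean

# Illustrative records only, NOT real scraped values.
sample = [
    {"評分": 9.7, "評價人數": 2_000_000},
    {"評分": 9.1, "評價人數": 1_600_000},
    {"評分": 8.5, "評價人數": 400_000},
]


def summarize(rows, vote_threshold=1_500_000):
    """Return the mean rating, the minimum rating, and the lowest rating above the vote threshold."""
    ratings = [r["評分"] for r in rows]
    popular = [r["評分"] for r in rows if r["評價人數"] > vote_threshold]
    return {
        "mean": round(mean(ratings), 2),
        "min": min(ratings),
        "popular_min": min(popular) if popular else None,
    }


summary = summarize(sample)
print(summary)
```

With the scraped DataFrame, the same rows can be obtained via df.to_dict('records'), so the claims can be tested directly instead of taken on trust.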

8. Common Problems and Solutions

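Symptom	Solution
Returns HTTP 403	Upgrade the User-Agent library and switch proxy IPs
Some data is missing	Add fallback CSS selector paths
Scraping is too slow	Shorten the delay to 1–2 seconds
Garbled Chinese text from pandas	Save with `encoding='utf-8-sig'`
Request timeouts	Add a retry mechanism or set a higher `timeout`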
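The retry mechanism suggested for timeouts can be sketched as a generic wrapper with exponential backoff. with_retries and its parameters are hypothetical names chosen for this example; the sleep argument exists only so the delay can be swapped out in tests.

```python
import random
import time


def with_retries(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); after a failure wait base_delay * 2**n plus jitter, then try again."""
    last_error = None
    for n in range(attempts):
        try:
            return func()
        except Exception as exc:  # in real code, narrow this to requests.RequestException
            last_error = exc
            if n < attempts - 1:
                sleep(base_delay * (2 ** n) + random.uniform(0, 0.5))
    # All attempts failed: surface the last error to the caller
    raise last_error


# Possible usage with the scraper's request (not executed here):
# response = with_retries(lambda: requests.get(url, headers=headers, timeout=15))
```

Doubling the wait after each failure gives a struggling server time to recover, and the small random jitter avoids retrying on a fixed rhythm.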

9. Ideas for Extending the Project

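Building a recommender system: analyze the quote text with TF-IDF to build a content-based recommendation model;
Scheduled automatic updates: deploy a scheduled job (for example with crontab or Airflow) to scrape automatically and update the data incrementally.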
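To make the TF-IDF idea concrete, here is a minimal standard-library sketch that weights terms as tf * log(N / df) and compares quotes by cosine similarity. It assumes character bigrams as a stand-in for a proper Chinese tokenizer, and the quotes are English placeholders rather than scraped data; all function names are illustrative.

```python
import math
from collections import Counter


def bigrams(text):
    """Split text into overlapping character bigrams (a stand-in for a real tokenizer)."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, with weight = tf * log(N / df)."""
    tokenized = [bigrams(d) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        vectors.append({t: (c / total) * math.log(n_docs / doc_freq[t])
                        for t, c in counts.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0


# Placeholder quotes, not scraped data
quotes = ["hope is a good thing", "hope sets you free", "life is like a box of chocolates"]
vectors = tfidf_vectors(quotes)
print(cosine(vectors[0], vectors[1]))
```

A recommender built this way would return the films whose quote vectors score highest against the one a user liked; with real Chinese text, a word segmenter would likely give better terms than bigrams.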

Conclusion

This article has walked through the complete Douban TOP250 workflow of scraping → cleaning → visualization → insights. The core techniques are:

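Dynamic User-Agents and proxy IPs to get around anti-scraping measures;
Robust exception handling;
Thorough data cleaning and type conversion;
Multi-dimensional analysis to surface trends.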
