Python Web Crawling in Practice: Scraping Douban Movies

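Introduction

1. Crawler Basics

1.1 What Is a Web Crawler?

1.2 Common Python Libraries for Crawling

2. Hands-On: Scraping the Douban Movie Top 250

2.1 Installing Dependencies

2.2 Sending the HTTP Request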

2.3 Parsing the HTML

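2.4 Storing the Data

2.5 Complete Code

3. Going Further: Pagination and Dynamic Content

3.1 Scraping Multiple Pages

3.2 Handling Dynamic Content

4. Anti-Crawler Measures and Countermeasures

4.1 Common Anti-Crawler Measures

4.2 Countermeasures

5. Summary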

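Introduction

In the era of big data, web crawlers have become an important tool for collecting data from the internet. Whether the goal is data analysis, machine learning, or market research, a crawler can gather the data you need quickly. This article starts from scratch: we write a simple web crawler in Python, then extend it step by step to handle more complex scenarios.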

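1. Crawler Basics

1.1 What Is a Web Crawler?

A web crawler is an automated program that collects data from the internet. It imitates the requests a browser would send, visits web pages, and extracts the information it needs. A crawler has three core tasks:

Send HTTP requests: request a page from the target site and receive its content.
Parse the HTML: extract the useful data from the page.
Store the data: save the extracted data locally or in a database.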
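
The three tasks above can be sketched as three small functions chained together. This is only an illustration that runs offline with the standard library: `SAMPLE_HTML`, `fetch`, `parse`, `store`, and `ListItemParser` are names made up for this sketch, and `fetch` returns a hard-coded page instead of sending a real request.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical stand-in for a downloaded page, so the sketch runs offline.
SAMPLE_HTML = "<ul><li>Alpha</li><li>Beta</li></ul>"


def fetch(url):
    """Task 1: return the page body. A real crawler would send an HTTP request here."""
    return SAMPLE_HTML


class ListItemParser(HTMLParser):
    """Helper for task 2: collect the text inside every <li> element."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._in_li = False

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._in_li = True

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_li = False

    def handle_data(self, data):
        if self._in_li and data.strip():
            self.items.append(data.strip())


def parse(html):
    """Task 2: extract the list items from the HTML."""
    parser = ListItemParser()
    parser.feed(html)
    return parser.items


def store(items, out):
    """Task 3: write the items as CSV to any file-like object."""
    writer = csv.writer(out)
    writer.writerow(["item"])
    writer.writerows([item] for item in items)


buffer = io.StringIO()
store(parse(fetch("https://example.com/list")), buffer)
print(buffer.getvalue())
```

Keeping the three tasks in separate functions makes it easy to swap one of them later, for example replacing the stub `fetch` with a real HTTP request.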

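1.2 Common Python Libraries for Crawling

Python has a rich set of libraries for building crawlers. The most commonly used are:

Requests: sends HTTP requests and retrieves page content.
BeautifulSoup: parses HTML and extracts data.
Scrapy: a full-featured crawling framework suited to large-scale scraping.
Selenium: drives a real browser, which makes it useful for dynamic pages.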

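2. Hands-On: Scraping the Douban Movie Top 250

We will use the Douban Movie Top 250 list as an example to show how to write a simple crawler in Python.

2.1 Installing Dependencies

First, make sure the requests and BeautifulSoup libraries are installed. If they are not, install them with the following command:

pip install requests beautifulsoup4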

2.2 Sending the HTTP Request

We use the requests library to request the Douban Movie Top 250 page and retrieve its content.

import requests

url = "https://movie.douban.com/top250"
# Send a browser-like User-Agent header with the request.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}

response = requests.get(url, headers=headers)
if response.status_code == 200:
    print("Request succeeded!")
    html_content = response.text
else:
    print("Request failed, status code:", response.status_code)
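
The request step can also be wrapped in a helper that sets a timeout and treats any network or HTTP error as a failure. The following is a minimal sketch rather than a definitive implementation: `fetch_html` and `DEFAULT_HEADERS` are names introduced here for illustration, and the 10-second timeout is an arbitrary choice.

```python
import requests

DEFAULT_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
    )
}


def fetch_html(url, session=None, timeout=10):
    """Return the page body, or None if the request fails for any reason."""
    session = session or requests.Session()
    try:
        response = session.get(url, headers=DEFAULT_HEADERS, timeout=timeout)
        response.raise_for_status()  # turn 4xx/5xx responses into an exception
    except requests.RequestException as exc:
        print(f"Request failed: {exc}")
        return None
    return response.text
```

Calling `fetch_html("https://movie.douban.com/top250")` would then return either the HTML text or `None`. Without a timeout, a stalled connection could block the script indefinitely.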

2.3 Parsing the HTML

We use BeautifulSoup to parse the HTML and extract each movie's title and rating.

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_content, "html.parser")
# Collect every <div class="info"> block on the page.
movies = soup.find_all("div", class_="info")
for movie in movies:
    # find() returns the first matching element inside the block.
    title = movie.find("span", class_="title").text
    rating = movie.find("span", class_="rating_num").text
    print(f"Title: {title}, rating: {rating}")
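
The parsing code above raises AttributeError if a block lacks one of the expected elements, because find() returns None in that case. A more defensive variant might look like the sketch below. `SAMPLE_HTML` is made-up markup that only mimics the class names used above (its second entry deliberately has no rating), and `text_or_none` and `parse_movies` are hypothetical helper names.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the class names the selectors rely on;
# the second entry deliberately lacks a rating.
SAMPLE_HTML = """
<div class="info"><span class="title">Movie A</span><span class="rating_num">9.7</span></div>
<div class="info"><span class="title">Movie B</span></div>
"""


def text_or_none(parent, css_selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    node = parent.select_one(css_selector)
    return node.get_text(strip=True) if node is not None else None


def parse_movies(html):
    """Return one dict per <div class="info"> block, with None for missing fields."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for info in soup.select("div.info"):
        rows.append({
            "title": text_or_none(info, "span.title"),
            "rating": text_or_none(info, "span.rating_num"),
        })
    return rows


for row in parse_movies(SAMPLE_HTML):
    print(row)
```

Returning None for a missing field lets the caller decide whether to skip the row or store an empty value, instead of aborting the whole run.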

2.4 Storing the Data

We save the extracted data to a CSV file.

import csv

# newline="" stops the csv module from writing blank lines between rows on Windows.
with open("douban_top250.csv", mode="w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["Title", "Rating"])  # header row
    for movie in movies:
        title = movie.find("span", class_="title").text
        rating = movie.find("span", class_="rating_num").text
        writer.writerow([title, rating])
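
To check that the file was written as intended, it can be read back with csv.DictReader. The sketch below uses two made-up rows and a temporary file rather than the real scrape results; the `rows`, `path`, and `loaded` names are illustrative only.

```python
import csv
import os
import tempfile

rows = [["Movie A", "9.7"], ["Movie B", "9.6"]]  # hypothetical sample rows

# Write to a throwaway directory so the sketch leaves the working directory untouched.
path = os.path.join(tempfile.mkdtemp(), "sample.csv")

with open(path, mode="w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["title", "rating"])
    writer.writerows(rows)

# Read the file back; DictReader uses the header row as the dict keys.
with open(path, newline="", encoding="utf-8") as file:
    loaded = list(csv.DictReader(file))

print(loaded)
```

Every value comes back as a string, so a rating has to be converted with float() before any numeric comparison or sorting.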

2.5 Complete Code

import csv
import sys

import requests
from bs4 import BeautifulSoup

url = "https://movie.douban.com/top250"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}

# Step 1: fetch the page; stop if the request fails.
response = requests.get(url, headers=headers)
if response.status_code == 200:
    html_content = response.text
else:
    print("Request failed, status code:", response.status_code)
    sys.exit(1)

# Step 2: parse the HTML and locate the movie blocks.
soup = BeautifulSoup(html_content, "html.parser")
movies = soup.find_all("div", class_="info")

# Step 3: write the title and rating of each movie to a CSV file.
with open("douban_top250.csv", mode="w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["Title", "Rating"])
    for movie in movies:
        title = movie.find("span", class_="title").text
        rating = movie.find("span", class_="rating_num").text
        writer.writerow([title, rating])

print("Data saved to douban_top250.csv")

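3. Going Further: Pagination and Dynamic Content

3.1 Scraping Multiple Pages

The Douban Movie Top 250 list spans 10 pages of 25 movies each, so we need to loop over every page to scrape the full list.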

# Reuses the imports and the headers dict from the complete code in section 2.5.
base_url = "https://movie.douban.com/top250"
all_movies = []

# The start parameter is the offset of the first movie on each page: 0, 25, ..., 225.
for page in range(0, 250, 25):
    url = f"{base_url}?start={page}"
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, "html.parser")
        movies = soup.find_all("div", class_="info")
        for movie in movies:
            title = movie.find("span", class_="title").text
            rating = movie.find("span", class_="rating_num").text
            all_movies.append([title, rating])
    else:
        print(f"Request for page {page // 25 + 1} failed, status code:", response.status_code)

# Write all collected rows in one go.
with open("douban_top250_all.csv", mode="w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["Title", "Rating"])
    writer.writerows(all_movies)

print("All data saved to douban_top250_all.csv")
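
The offset arithmetic in the loop above can be isolated into a small generator, which makes it easy to check before any request is sent. This is a sketch under the same assumption as the loop (250 entries, 25 per page, a `start` query parameter); `page_urls` and the constants are names introduced here.

```python
from urllib.parse import urlencode

BASE_URL = "https://movie.douban.com/top250"
PAGE_SIZE = 25
TOTAL = 250


def page_urls(base_url=BASE_URL, total=TOTAL, page_size=PAGE_SIZE):
    """Yield (page_number, url) pairs; page numbers start at 1."""
    for start in range(0, total, page_size):
        yield start // page_size + 1, f"{base_url}?{urlencode({'start': start})}"


urls = list(page_urls())
print(len(urls))  # 10
print(urls[0])    # (1, 'https://movie.douban.com/top250?start=0')
print(urls[-1])   # (10, 'https://movie.douban.com/top250?start=225')
```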

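3.2 Handling Dynamic Content

If a page loads its content dynamically with JavaScript, Selenium can drive a real browser so that the rendered page is available to scrape.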

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
try:
    driver.get("https://movie.douban.com/top250")
    # Look up the movie blocks in the rendered page.
    movies = driver.find_elements(By.CLASS_NAME, "info")
    for movie in movies:
        title = movie.find_element(By.CLASS_NAME, "title").text
        rating = movie.find_element(By.CLASS_NAME, "rating_num").text
        print(f"Title: {title}, rating: {rating}")
finally:
    # Always close the browser, even if a lookup fails.
    driver.quit()

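4. Anti-Crawler Measures and Countermeasures

4.1 Common Anti-Crawler Measures

User-Agent detection: the server inspects the User-Agent request header to decide whether a request comes from a crawler.
IP banning: sending requests too frequently can get your IP address banned.
CAPTCHAs: some sites require the visitor to solve a CAPTCHA.

4.2 Countermeasures

Set realistic request headers: mimic a browser request by setting a User-Agent.
Use proxy IPs: route requests through a pool of proxy IPs to avoid bans.
Lower the request rate: use time.sleep() to space out requests.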
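
Lowering the request rate can be combined with retrying failed requests after a growing delay. The sketch below is one possible arrangement, not a prescribed one: `fetch_with_retry` and `crawl` are hypothetical helpers, `fetch` stands for any callable that returns a page body or raises an exception, and the delay values are arbitrary.

```python
import random
import time


def fetch_with_retry(fetch, url, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) up to `retries` times, waiting longer after each failure.

    `fetch` is any callable that returns the page body or raises an exception.
    """
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception as exc:
            if attempt == retries - 1:
                raise  # out of attempts: let the caller see the error
            # Exponential backoff (1x, 2x, 4x, ...) plus up to 0.5 s of random jitter.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            print(f"Attempt {attempt + 1} failed ({exc}); sleeping {delay:.1f}s")
            sleep(delay)


def crawl(urls, fetch, pause=1.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing between requests to keep the load low."""
    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(pause)
        pages.append(fetch_with_retry(fetch, url, sleep=sleep))
    return pages
```

The `sleep` parameter defaults to time.sleep() and exists only so the waiting can be replaced in tests. A function that performs the actual HTTP request and raises on failure can be passed in as `fetch`.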

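5. Summary

Using a simple Douban Movie Top 250 crawler as the example, this article walked through the basic workflow of a Python crawler. We sent an HTTP request, parsed the HTML, and stored the data, building a complete crawler step by step. We also looked at how to handle pagination and dynamic content, and how to respond to common anti-crawler measures.

Crawling is a powerful technique, but always comply with the applicable laws and the target site's terms of use, and avoid placing unnecessary load on the site.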
