Python Web Scraping Tutorial for Beginners

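I. Web Scraping Basics

What is a web scraper?

A web scraper (crawler) is a program that mimics browser behavior to fetch web page data automatically. It is commonly used for data collection, information monitoring, and similar tasks.

The basic workflow of a scraper:

1. Send a request to fetch the page content

2. Parse the content and extract the data

3. Store the data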
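
The three steps above can be sketched offline. This is a minimal illustration, not the tutorial's own code: it uses only the standard library so that it runs without network access, and SAMPLE_HTML, ItemParser, parse_items, and store_items are hypothetical names invented for this sketch. The later examples in this tutorial use requests and BeautifulSoup for the same steps.

```python
import csv
import io
from html.parser import HTMLParser

# Stand-in for step 1: in a real scraper this HTML would come from an HTTP response.
SAMPLE_HTML = """
<html><body>
  <h2 class="item">First item</h2>
  <h2 class="item">Second item</h2>
  <h2 class="other">Not an item</h2>
</body></html>
"""


class ItemParser(HTMLParser):
    """Step 2: collect the text of every <h2 class="item"> element."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._inside_item = False

    def handle_starttag(self, tag, attrs):
        if tag == "h2" and dict(attrs).get("class") == "item":
            self._inside_item = True
            self.items.append("")

    def handle_data(self, data):
        if self._inside_item:
            self.items[-1] += data

    def handle_endtag(self, tag):
        if tag == "h2":
            self._inside_item = False


def parse_items(html):
    parser = ItemParser()
    parser.feed(html)
    return [text.strip() for text in parser.items]


def store_items(items, out_file):
    """Step 3: write one row per item to a CSV file-like object."""
    writer = csv.writer(out_file)
    writer.writerow(["title"])
    for item in items:
        writer.writerow([item])


items = parse_items(SAMPLE_HTML)
buffer = io.StringIO()  # stands in for a file opened with open(..., "w", newline="")
store_items(items, buffer)
print(items)
```

Keeping the parse and store steps in separate functions makes it easy to swap in a different parser or output format later.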

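II. Environment Setup

1. Install Python: Python 3.8+ is recommended. Download the installer from the official website and follow the prompts; in the Windows installer, remember to check the "Add to PATH" option.

2. Install the required libraries:

- requests: sends HTTP requests (pip install requests)

- BeautifulSoup: parses HTML/XML data (pip install beautifulsoup4)

- lxml: a fast parsing library (pip install lxml); BeautifulSoup can use it as its parser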
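
After installing, it can help to confirm that the three libraries are importable. The snippet below is a small sketch of one way to do that; check_modules and REQUIRED_MODULES are names made up for this example. Note that the beautifulsoup4 package is imported under the name bs4.

```python
import importlib.util

# Import names of the three libraries listed above
# (the beautifulsoup4 package is imported as "bs4").
REQUIRED_MODULES = ["requests", "bs4", "lxml"]


def check_modules(module_names):
    """Return a dict mapping each module name to True if it can be found."""
    return {
        name: importlib.util.find_spec(name) is not None
        for name in module_names
    }


status = check_modules(REQUIRED_MODULES)
for name, installed in status.items():
    print(f"{name}: {'installed' if installed else 'missing'}")
```

Any library reported as missing can be installed with the matching pip command from the list above.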

III. Your First Scraper: Fetching Page Titles

The example below fetches the titles shown on the Douban Movie home page:

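import requests
from bs4 import BeautifulSoup

# 1. Send the request with a browser-like User-Agent
#    (sites such as Douban often reject the default python-requests one)
url = "https://movie.douban.com/"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"}
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # stop early on an HTTP error status

# 2. Handle the encoding (avoids garbled Chinese text)
response.encoding = response.apparent_encoding

# 3. Parse the page
soup = BeautifulSoup(response.text, 'lxml')

# 4. Extract the data: collect all movie titles
movie_titles = soup.find_all('span', class_='title')

# 5. Print the results
print("Some titles from the Douban Movie home page:")
for title in movie_titles:
    # Filter out non-Chinese titles (avoids noise such as ads)
    if "·" not in title.text:
        print(title.text)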

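How the code works:

- requests.get sends a GET request and returns the page content

- BeautifulSoup parses the HTML using the lxml parser

- find_all('span', class_='title') selects elements by tag name and class name

- The filter keeps non-movie entries (such as ads) out of the output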
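
The encoding step in the example (response.encoding = response.apparent_encoding) deserves a short illustration. When a text response declares no charset, requests falls back to ISO-8859-1, which garbles Chinese text that was actually sent as UTF-8; apparent_encoding instead guesses the charset from the content. The sketch below reproduces the effect with plain bytes and no network access. The sample string and variable names are chosen only for this illustration.

```python
text = "豆瓣電影"  # sample text, chosen for this illustration

# Bytes as they would travel over the network when the page is UTF-8 encoded.
raw_bytes = text.encode("utf-8")

# Decoding with the wrong charset does not fail, it silently produces garbage.
wrong = raw_bytes.decode("iso-8859-1")

# Decoding with the charset the page really uses recovers the original text.
right = raw_bytes.decode("utf-8")

print("wrong decode matches original:", wrong == text)
print("right decode matches original:", right == text)
```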

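IV. Going Further: Handling Dynamic Pages (Douban Short Reviews)

Dynamic pages usually load their data from an API endpoint, so you need to inspect the network requests to find the real data URL: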

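import requests

# Short-review API for the movie "Oppenheimer" on Douban
# (the actual URL has to be found with the browser's developer tools)
api_url = "https://movie.douban.com/j/chart/top_list_comments"
params = {
    "movie_id": "35477223",  # movie ID
    "start": 0,              # index of the first comment
    "limit": 20,             # comments per page
}
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"}

# Send the request with query parameters
response = requests.get(api_url, params=params, headers=headers, timeout=10)
response.raise_for_status()  # stop early on an HTTP error status
comments_data = response.json()  # parse the JSON response body

# Extract and print the comments
print('Short reviews of "Oppenheimer":')
for comment in comments_data:
    print(f"User {comment['author']}: {comment['content'][:50]}...")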
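
The start and limit parameters in the request above suggest that the endpoint is paginated. A minimal sketch of walking through all pages follows, assuming the endpoint returns a shorter (or empty) list on the last page. iter_all_items, fake_fetch_page, and FAKE_COMMENTS are hypothetical names, and the data is made up so the example runs offline.

```python
def iter_all_items(fetch_page, limit=20, max_pages=50):
    """Yield items page by page until a page comes back short or empty.

    fetch_page(start, limit) is a caller-supplied function that returns a list;
    max_pages is a safety cap so a misbehaving endpoint cannot loop forever.
    """
    start = 0
    for _ in range(max_pages):
        page = fetch_page(start, limit)
        yield from page
        if len(page) < limit:  # last page reached
            break
        start += limit


# Fake data source standing in for the real API (45 made-up comments).
FAKE_COMMENTS = [{"author": f"user{i}", "content": f"comment {i}"} for i in range(45)]


def fake_fetch_page(start, limit):
    return FAKE_COMMENTS[start:start + limit]


all_comments = list(iter_all_items(fake_fetch_page, limit=20))
print(len(all_comments))
```

In a real scraper, fetch_page would wrap the requests.get call shown above, passing start and limit through params, with a pause between pages.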

V. Scraping Etiquette (Avoiding IP Bans)

1. Set request headers: imitate a browser (add a User-Agent and similar headers)

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml"
}
response = requests.get(url, headers=headers)

2. Throttle your requests: add a delay between them (avoid sending requests too frequently)

import time
time.sleep(1)  # wait 1 second between requests

3. Follow the site's rules: check the site's robots.txt (Douban, for example, permits reasonable crawling but prohibits high-frequency requests)
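
Points 2 and 3 can be combined in code. The sketch below uses the standard library's urllib.robotparser to check whether a URL may be fetched and to read a Crawl-delay value, then adds random jitter so requests are not perfectly regular. The robots.txt content, the "my-scraper" user agent, and the helper names are invented for illustration; a real scraper would load the site's own file with set_url(...) and read().

```python
import random
import urllib.robotparser

# Made-up robots.txt rules for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_robot_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_delay(parser, user_agent, default_delay=1.0, jitter=0.5):
    """Return a wait time: the site's Crawl-delay (or a default) plus random jitter."""
    base = parser.crawl_delay(user_agent) or default_delay
    return base + random.uniform(0, jitter)


robots = build_robot_parser(SAMPLE_ROBOTS_TXT)
allowed_public = robots.can_fetch("my-scraper", "https://example.com/movies/1")
allowed_private = robots.can_fetch("my-scraper", "https://example.com/private/x")
delay = polite_delay(robots, "my-scraper")

print(allowed_public, allowed_private)
```

The value returned by polite_delay can be passed to time.sleep before each request.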

VI. Hands-On Practice: Scraping Chapters from a Novel Site

Taking a novel site as the example, here is a complete code skeleton:

import requests
from bs4 import BeautifulSoup
import os
import time

# Novel home page
novel_url = "https://example.com/novel"

# 1. Get the chapter list as (title, link) pairs
def get_chapter_list(url):
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, 'lxml')
    chapters = soup.find_all('a', class_='chapter-link')
    return [(chapter.text, chapter['href']) for chapter in chapters]

# 2. Get the text of one chapter
def get_chapter_content(chapter_url):
    response = requests.get(chapter_url, timeout=10)
    soup = BeautifulSoup(response.text, 'lxml')
    content = soup.find('div', class_='content').text
    return content

# 3. Save the content to a file
def save_to_file(chapter_name, content, novel_name):
    os.makedirs(novel_name, exist_ok=True)  # create the folder if it is missing
    file_path = os.path.join(novel_name, f"{chapter_name}.txt")
    with open(file_path, 'w', encoding='utf-8') as f:
        f.write(content)
    print(f"Saved: {chapter_name}")

# Main flow
if __name__ == "__main__":
    novel_name = "Novel Title"
    chapters = get_chapter_list(novel_url)
    for i, (chapter_name, chapter_url) in enumerate(chapters):
        print(f"Scraping chapter {i+1}/{len(chapters)}: {chapter_name}")
        content = get_chapter_content(chapter_url)
        save_to_file(chapter_name, content, novel_name)
        time.sleep(2)  # wait 2 seconds to avoid sending requests too frequently
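
Two details in the skeleton above depend on the target site, so they are worth a defensive sketch. Chapter links may be relative (for example "/novel/chapter-1.html"), and chapter titles may contain characters that are not valid in file names. The helpers below are one possible way to handle both, under those assumptions; absolute_url and safe_filename are hypothetical names, not part of the original code.

```python
import re
from urllib.parse import urljoin


def absolute_url(page_url, href):
    """Resolve a possibly relative href against the page it was found on."""
    return urljoin(page_url, href)


def safe_filename(name, replacement="_"):
    """Replace characters that are not allowed in file names on common systems."""
    cleaned = re.sub(r'[\\/:*?"<>|]', replacement, name).strip()
    return cleaned or "untitled"


print(absolute_url("https://example.com/novel", "/novel/chapter-1.html"))
# prints: https://example.com/novel/chapter-1.html
print(absolute_url("https://example.com/novel/", "chapter-2.html"))
# prints: https://example.com/novel/chapter-2.html
print(safe_filename("Chapter 1: What/Why?"))
# prints: Chapter 1_ What_Why_
```

In the skeleton, absolute_url could be applied to chapter['href'] inside get_chapter_list, and safe_filename to chapter_name before the file path is built in save_to_file.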

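VII. Further Learning Resources

- Books: 《Python爬蟲開發與項目實戰》 (Python Crawler Development and Project Practice), 《精通Python網絡爬蟲》 (Mastering Python Web Crawlers)

- Online courses:

  - Liao Xuefeng's Python tutorial

  - Hands-on scraping: collecting Bilibili video information

- Recommended tools:

  - Browser developer tools (F12): inspect network requests

  - Postman: debug API requests

With the steps above, you can build a basic web scraper.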
