A Web Scraping Example for Beginners

This walkthrough assumes Python is already installed on your computer; if it is not, download it from the official website. I use PyCharm as my IDE.

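Environment setup

Installing the required libraries

The scraper needs two libraries: requests and beautifulsoup4.

Run the following command in a command prompt or terminal: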

pip install requests beautifulsoup4 -i https://mirrors.aliyun.com/pypi/simple

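This command uses the Alibaba Cloud (Aliyun) mirror, which can make the download faster.

Once the installation finishes, you can verify it with the following code: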

```
import requests
from bs4 import BeautifulSoup
print("All libraries installed successfully!")
```
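
The import check above only confirms that both modules load. As an optional extra, the sketch below reports the installed versions using the standard library's importlib.metadata (Python 3.8 or later). The helper name installed_versions is hypothetical and is not part of either library.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            versions[name] = None
    return versions


for name, version in installed_versions(["requests", "beautifulsoup4"]).items():
    print(f"{name}: {version or 'not installed'}")
```

Note that the distribution names passed in are the pip names (beautifulsoup4), not the import names (bs4).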

Scraping covers and other content

Below is a complete scraper. After the listing, we walk through it step by step.

```
import requests
from bs4 import BeautifulSoup

# 1. Define the target URL
url = "http://books.toscrape.com/"

try:
    # 2. Send the HTTP request
    response = requests.get(url)
    # Check whether the request succeeded (raises on 4xx/5xx status codes)
    response.raise_for_status()

    # 3. Parse the HTML content
    soup = BeautifulSoup(response.text, "html.parser")

    # 4. Locate the containers of all books
    books = soup.find_all("article", class_="product_pod")

    # 5. Loop over each container and extract the information
    for book in books:
        # Extract the title
        title = book.h3.a["title"]
        # Extract the price
        price = book.find("p", class_="price_color").text
        # Extract the rating word (for example, "Three" means 3 stars)
        rating = book.p["class"][1]

        # Print the results
        print(f"Title: {title}")
        print(f"Price: {price}")
        print(f"Rating (stars): {rating}")
        print("-" * 50)

except requests.exceptions.RequestException as e:
    print(f"Request failed: {e}")
except Exception as e:
    print(f"An error occurred: {e}")
```

Step-by-step explanation

Importing the libraries

```
import requests  # used to send HTTP requests
from bs4 import BeautifulSoup  # used to parse HTML
```

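Sending the HTTP request

```
url = "http://books.toscrape.com/"
response = requests.get(url)
response.raise_for_status()  # raises an exception for 4xx/5xx status codes
```

requests.get(url): sends a GET request to the target URL.
response.raise_for_status(): checks whether the request succeeded and raises an exception if the server returned a 4xx or 5xx error status.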
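
raise_for_status() is sometimes summarized as "anything other than 200 fails", but it only raises for 4xx client errors and 5xx server errors. The helper below is a hypothetical illustration of that rule written for this article; it is not code taken from requests.

```python
def would_raise(status_code):
    """Return True for status codes that raise_for_status() treats as errors."""
    return 400 <= status_code < 600


for code in (200, 204, 404, 503):
    print(code, "raises" if would_raise(code) else "passes")
```

So a 204 No Content response passes the check even though its status code is not 200.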
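
Parsing the HTML content

```
soup = BeautifulSoup(response.text, "html.parser")
```

response.text: the HTML of the page as a text string.
BeautifulSoup(): converts the HTML text into an object you can navigate and search (a parse tree, similar to a DOM tree).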
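
Locating the book containers

```
books = soup.find_all("article", class_="product_pod")
```

find_all(): finds all elements that match the given conditions.
article is the tag name and product_pod is the class name. Note the spelling class_ with a trailing underscore, which is needed because class is a reserved word in Python.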

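Extracting book information

```
for book in books:
    title = book.h3.a["title"]  # the title is stored in the title attribute of the a tag
    price = book.find("p", class_="price_color").text  # the price is the text of a p tag
    rating = book.p["class"][1]  # the rating is the second class name of the first p tag (e.g. "star-rating Three")
```

book.h3.a["title"]: reaches the title directly by following the tag hierarchy.
book.find("p", class_="price_color"): searches inside the book container for the price element.
book.p["class"][1]: reads the rating from the second class name of the first p tag (for example, Three means 3 stars).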
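
The scraper keeps the rating as a word ("Three") and the price as text with a currency symbol. If you want numbers instead, for sorting or filtering, a minimal sketch could look like the following. The names RATING_WORDS, rating_to_int, and price_to_float are hypothetical helpers introduced here, and the code assumes ratings are always one of the five words One to Five.

```python
import re

# Assumed mapping from the class-name word to a number of stars.
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def rating_to_int(word):
    """Convert a rating word such as "Three" to 3; return None if unknown."""
    return RATING_WORDS.get(word)


def price_to_float(text):
    """Pull the first decimal number out of a price string such as "£51.77"."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    return float(match.group()) if match else None


print(rating_to_int("Three"))    # 3
print(price_to_float("£51.77"))  # 51.77
```

Using a regular expression for the price means any characters in front of the digits are ignored, whatever currency symbol the page uses.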

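Error handling

```
# These clauses close the try block in the full script above
except requests.exceptions.RequestException as e:
    print(f"Request failed: {e}")
except Exception as e:
    print(f"An error occurred: {e}")
```

These handlers catch network request errors and any other exceptions, so the program prints a message instead of crashing with a traceback.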

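Sample output

```
Title: A Light in the Attic
Price: £51.77
Rating (stars): Three
--------------------------------------------------
Title: Tipping the Velvet
Price: £53.74
Rating (stars): One
--------------------------------------------------
...
```

Comparing this output with the information shown on the website confirms that the scrape succeeded.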

Next, we extend the scraper to fetch more than text: the cover image of each book.

Scraping the cover images

Replace the code above with the following:

```
import requests
from bs4 import BeautifulSoup
import os

# Target website
url = "http://books.toscrape.com/"

# Create a folder to store the images
if not os.path.exists("book_covers"):
    os.makedirs("book_covers")

try:
    # Send the request
    response = requests.get(url)
    response.raise_for_status()

    # Parse the HTML
    soup = BeautifulSoup(response.text, "html.parser")
    books = soup.find_all("article", class_="product_pod")

    for book in books:
        # Extract the title (used to name the image file)
        title = book.h3.a["title"].strip().replace("/", "-")  # replace slashes, which are illegal in file names

        # Extract the relative path of the cover image (e.g. ../../media/.../image.jpg)
        image_relative_url = book.img["src"]

        # Convert the relative path to an absolute URL
        image_absolute_url = url + image_relative_url.replace("../", "")

        # Download the image
        image_response = requests.get(image_absolute_url, stream=True)
        image_response.raise_for_status()

        # Save the image locally
        filename = f"book_covers/{title}.jpg"
        with open(filename, "wb") as f:
            for chunk in image_response.iter_content(1024):
                f.write(chunk)

        print(f"Cover downloaded: {title}")

except requests.exceptions.RequestException as e:
    print(f"Request failed: {e}")
except Exception as e:
    print(f"An error occurred: {e}")
```

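Step-by-step explanation

Creating the folder for the images

```
import os
if not os.path.exists("book_covers"):
    os.makedirs("book_covers")
```

The os module is used to check whether the folder exists and to create it if it does not, so that rerunning the script does not fail on an existing folder. Files with the same name inside the folder are still overwritten on a second run.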
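
As an alternative to the explicit existence check, the standard library's pathlib can create the folder in one call that is safe to repeat. This is a minimal sketch; ensure_dir is a hypothetical helper name, and the demo runs inside a temporary directory so nothing is left on disk.

```python
import tempfile
from pathlib import Path


def ensure_dir(path):
    """Create the folder (and any missing parents) if needed and return it."""
    folder = Path(path)
    folder.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    return folder


# Demonstrated inside a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    covers = ensure_dir(Path(tmp) / "book_covers")
    ensure_dir(covers)       # calling it a second time is a no-op
    print(covers.is_dir())   # True
```

In the scraper itself, ensure_dir("book_covers") would take the place of the if-check.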
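
Locating the cover image URL

Inspect the HTML structure of the image: right-click a cover image → Inspect. The structure looks like this:

```
<img src="../../media/cache/2c/da/2cdad67c.../a-light-in-the-attic_1000.jpg" alt="A Light in the Attic" class="thumbnail">
```

The src attribute contains the relative path of the image (e.g. ../../media/...).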

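Handling relative paths

A relative path has to be joined with the site's base URL:

```
image_relative_url = book.img["src"]  # e.g. ../../media/...
image_absolute_url = url + image_relative_url.replace("../", "")
```

url is the base address (http://books.toscrape.com/). Removing every ../ from the relative path and appending the result to the base address gives the full URL (e.g. http://books.toscrape.com/media/...).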
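
The string replacement works because every image on this site lives under the site root. A more general option is urljoin from the standard library, which resolves a relative path against the URL of the page it was found on. The sketch below assumes a hypothetical image path and a hypothetical helper name, absolute_image_url.

```python
from urllib.parse import urljoin


def absolute_image_url(page_url, src):
    """Resolve an image's src attribute against the page it appeared on."""
    return urljoin(page_url, src)


# Hypothetical src values, shortened for illustration.
print(absolute_image_url("http://books.toscrape.com/",
                         "media/cache/example/cover.jpg"))
# http://books.toscrape.com/media/cache/example/cover.jpg

print(absolute_image_url("http://books.toscrape.com/catalogue/page-2.html",
                         "../media/cache/example/cover.jpg"))
# http://books.toscrape.com/media/cache/example/cover.jpg
```

Because the result depends on the page URL, the same call keeps working when the scraper moves from the home page to pages under catalogue/.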
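
Downloading and saving the image

```
image_response = requests.get(image_absolute_url, stream=True)
with open(filename, "wb") as f:
    for chunk in image_response.iter_content(1024):
        f.write(chunk)
```

stream=True: downloads the response body as a stream instead of loading it into memory all at once, which matters for large files.
iter_content(1024): reads the body in chunks of up to 1024 bytes, which suits large files.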
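
To see the chunked-write pattern in isolation, without a network connection, the sketch below writes an iterable of byte chunks to a file. save_chunks is a hypothetical helper, and the list of fake chunks stands in for image_response.iter_content(1024).

```python
import tempfile
from pathlib import Path


def save_chunks(chunks, path):
    """Write an iterable of byte chunks to path; return the bytes written."""
    written = 0
    with open(path, "wb") as f:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                written += f.write(chunk)
    return written


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "cover.jpg"
    fake_chunks = [b"a" * 1024, b"", b"b" * 10]  # stand-in for iter_content(1024)
    print(save_chunks(fake_chunks, target))  # 1034
    print(target.stat().st_size)             # 1034
```

Only one chunk is held in memory at a time, which is the point of streaming the download.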

Handling file names

```
title = book.h3.a["title"].strip().replace("/", "-")
filename = f"book_covers/{title}.jpg"
```

replace("/", "-"): replaces the slash in a title, which is illegal in a file name, so that saving the file does not fail.
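
The slash is not the only problematic character: on Windows, characters such as : and ? are also rejected in file names. A slightly broader sketch, using a hypothetical helper safe_filename and an assumed list of reserved characters, could look like this:

```python
import re


def safe_filename(title, replacement="-"):
    """Replace characters that are commonly illegal in file names."""
    cleaned = re.sub(r'[\\/:*?"<>|]', replacement, title)
    return cleaned.strip()


print(safe_filename("Who/What: A Story?"))  # Who-What- A Story-
```

In the scraper, title = safe_filename(book.h3.a["title"]) would take the place of the strip-and-replace chain.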

Results

```
Cover downloaded: A Light in the Attic
Cover downloaded: Tipping the Velvet
Cover downloaded: Soumission
...
```

All cover images are saved in the book_covers folder, with file names of the form <title>.jpg.

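Extensions and optimizations

Handling pagination

To scrape the covers from every page, look at the pattern in the pagination URLs (e.g. page-2.html):

```
for page in range(1, 51):  # 50 pages in total
    # Kept separate from the base url that is used to build image links
    page_url = f"http://books.toscrape.com/catalogue/page-{page}.html"
    # Send the request and parse the page...
```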
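
One way to keep the pagination logic separate from the download logic is a small generator that yields the page URLs. This is a sketch that assumes the 50-page count and the URL pattern stated above; page_urls is a hypothetical helper name.

```python
def page_urls(first=1, last=50):
    """Yield the catalogue URL of each listing page from first to last."""
    pattern = "http://books.toscrape.com/catalogue/page-{}.html"
    for page in range(first, last + 1):
        yield pattern.format(page)


urls = list(page_urls())
print(len(urls))  # 50
print(urls[0])    # http://books.toscrape.com/catalogue/page-1.html
print(urls[-1])   # http://books.toscrape.com/catalogue/page-50.html
```

The main loop then becomes for page_url in page_urls(): followed by the same request, parse, and download steps as before.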

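Adding delays to avoid being blocked

Add a random delay between requests to mimic human browsing:

```
import time
import random
time.sleep(random.uniform(0.5, 2.0))  # random delay of 0.5 to 2 seconds
```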
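
Wrapping the delay in a function makes it easy to call after every request and to adjust in one place. The sketch below uses a hypothetical helper, polite_pause. Its sleep parameter is an assumption added so the demo can substitute a fake sleep and finish instantly.

```python
import random
import time


def polite_pause(min_seconds=0.5, max_seconds=2.0, sleep=time.sleep):
    """Sleep for a random duration in the given range and return it."""
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


# Demo with a fake sleep that only records the requested delay.
recorded = []
delay = polite_pause(sleep=recorded.append)
print(recorded == [delay])  # True
```

In the scraper, a plain polite_pause() call at the end of each loop iteration performs the real wait.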

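Error handling and retries

Use try-except to catch image downloads that fail. The snippet below reports the failure and skips to the next book instead of aborting the whole run; an actual retry would repeat the request a limited number of times.

```
for book in books:
    # ... extract title and image_absolute_url as before ...
    try:
        image_response = requests.get(image_absolute_url, stream=True)
        image_response.raise_for_status()
    except requests.exceptions.RequestException:
        print(f"Download failed: {title}")
        continue  # skip this book and move on to the next one
```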
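
For an actual retry, one possible approach is a small generic helper that calls a function several times before giving up. The sketch below is not part of requests: retry and flaky_download are hypothetical names, the network failure is simulated, and the sleep parameter is an assumption added so the demo does not wait.

```python
import time


def retry(func, attempts=3, delay=1.0, exceptions=(Exception,), sleep=time.sleep):
    """Call func() up to `attempts` times, pausing `delay` seconds between tries."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions as error:
            last_error = error
            if attempt < attempts:
                sleep(delay)  # wait before the next try
    raise last_error  # every attempt failed


calls = {"count": 0}


def flaky_download():
    """Simulated download that fails twice and then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network failure")
    return b"image-bytes"


print(retry(flaky_download, sleep=lambda seconds: None))  # b'image-bytes'
print(calls["count"])                                     # 3
```

With requests, func could be a lambda that performs requests.get(...) followed by raise_for_status(), with exceptions=(requests.exceptions.RequestException,) so that only network errors are retried.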
