Python Stage 2 - Introduction to Web Scraping
🎯 Today's Goals
- Understand how web pagination works and how to locate the "Next page" link
- Write loop logic that automatically pages through a site and scrapes its content
- Integrate multi-page scraping into a scraper program
📘 Lesson Details
- 🔁 How web pagination works
Using quotes.toscrape.com as an example:
- Home page URL: https://quotes.toscrape.com/
- Next-page link:
<li class="next"><a href="/page/2/">Next</a></li>
With BeautifulSoup we can look up li.next > a and read its href attribute to get the next page's address, then join it with the base URL.
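The lookup above can be sketched on a minimal HTML fragment. This standalone snippet uses the stdlib "html.parser" backend (the lesson's examples use 'lxml') so it runs without an extra parser dependency:

```python
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# A minimal fragment mirroring the pagination markup on quotes.toscrape.com
html = '<li class="next"><a href="/page/2/">Next</a></li>'
soup = BeautifulSoup(html, "html.parser")

next_link = soup.select_one("li.next > a")  # the "Next" anchor inside li.next
next_href = next_link["href"]               # relative path: /page/2/
full_url = urljoin("https://quotes.toscrape.com/", next_href)
print(full_url)  # https://quotes.toscrape.com/page/2/
```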
- 🧪 Core logic in pseudocode
while True:
    1. Request the current page URL
    2. Parse the HTML and extract the content you need
    3. Check whether a "Next page" link exists
       - If it exists, join the new URL and continue the loop
       - If not, break out of the loop
💻 Example Code (Multi-page Scraping)
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def scrape_all_quotes(start_url):
    quotes = []
    url = start_url
    while url:
        print(f"Scraping: {url}")
        res = requests.get(url)
        soup = BeautifulSoup(res.text, 'lxml')
        for quote_block in soup.find_all("div", class_="quote"):
            quote_text = quote_block.find("span", class_="text").text.strip()
            author = quote_block.find("small", class_="author").text.strip()
            tags = [tag.text for tag in quote_block.find_all("a", class_="tag")]
            quotes.append({
                "quote": quote_text,
                "author": author,
                "tags": tags
            })
        # Find the next-page link
        next_link = soup.select_one("li.next > a")
        if next_link:
            next_href = next_link['href']
            url = urljoin(url, next_href)  # join into a full URL
        else:
            url = None
    return quotes

if __name__ == "__main__":
    all_quotes = scrape_all_quotes("https://quotes.toscrape.com/")
    print(f"Scraped {len(all_quotes)} quotes in total")
    # Print the first 3 as a sample
    for quote in all_quotes[:3]:
        print(f"\n{quote['quote']}\n—— {quote['author']} | Tags: {', '.join(quote['tags'])}")
🧠 Today's Practice Tasks
- Modify the existing scraper so it collects the quotes from every page
- Use len() to check how many quotes were collected
- Extra challenge: save all the data to a JSON file (using json.dump)
Practice code:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import json

def scrape_all_quotes(start_url):
    quotes = []
    url = start_url
    while url:
        print(f"Scraping page: {url}")
        response = requests.get(url)
        soup = BeautifulSoup(response.text, "lxml")
        quote_blocks = soup.find_all("div", class_="quote")
        for block in quote_blocks:
            text = block.find("span", class_="text").text.strip()
            author = block.find("small", class_="author").text.strip()
            tags = [tag.text for tag in block.find_all("a", class_="tag")]
            quotes.append({
                "quote": text,
                "author": author,
                "tags": tags
            })
        # Find the next-page link
        next_link = soup.select_one("li.next > a")
        if next_link:
            next_href = next_link['href']
            url = urljoin(url, next_href)
        else:
            url = None
    return quotes

if __name__ == "__main__":
    start_url = "https://quotes.toscrape.com/"
    all_quotes = scrape_all_quotes(start_url)
    print(f"\nScraped {len(all_quotes)} quotes in total.\n")
    # Save to a JSON file
    output_file = "quotes.json"
    with open(output_file, "w", encoding="utf-8") as f:
        json.dump(all_quotes, f, ensure_ascii=False, indent=2)
    print(f"Data saved to file: {output_file}")
Run output:
Scraping page: https://quotes.toscrape.com/
Scraping page: https://quotes.toscrape.com/page/2/
Scraping page: https://quotes.toscrape.com/page/3/
Scraping page: https://quotes.toscrape.com/page/4/
Scraping page: https://quotes.toscrape.com/page/5/
Scraping page: https://quotes.toscrape.com/page/6/
Scraping page: https://quotes.toscrape.com/page/7/
Scraping page: https://quotes.toscrape.com/page/8/
Scraping page: https://quotes.toscrape.com/page/9/
Scraping page: https://quotes.toscrape.com/page/10/

Scraped 100 quotes in total.

Data saved to file: quotes.json
quotes.json file contents:
[
  {
    "quote": "“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”",
    "author": "Albert Einstein",
    "tags": [
      "change",
      "deep-thoughts",
      "thinking",
      "world"
    ]
  },
  {
    "quote": "“It is our choices, Harry, that show what we truly are, far more than our abilities.”",
    "author": "J.K. Rowling",
    "tags": [
      "abilities",
      "choices"
    ]
  },
  {
    "quote": "“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”",
    "author": "Albert Einstein",
    "tags": [
      "inspirational",
      "life",
      "live",
      "miracle",
      "miracles"
    ]
  },
  ... (95 entries omitted here)
  {
    "quote": "“A person's a person, no matter how small.”",
    "author": "Dr. Seuss",
    "tags": [
      "inspirational"
    ]
  },
  {
    "quote": "“... a mind needs books as a sword needs a whetstone, if it is to keep its edge.”",
    "author": "George R.R. Martin",
    "tags": [
      "books",
      "mind"
    ]
  }
]
📎 Tips
- urljoin(base_url, relative_path) automatically builds an absolute URL
- Some sites use JavaScript-driven dynamic pagination; scraping those requires Selenium/Playwright (covered later)
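As a quick illustration of the urljoin tip (the URLs are just this lesson's example site), note the difference between absolute and relative paths:

```python
from urllib.parse import urljoin

base = "https://quotes.toscrape.com/page/2/"
# An absolute path ("/page/3/") replaces the base URL's entire path
print(urljoin(base, "/page/3/"))   # https://quotes.toscrape.com/page/3/
# A relative path is resolved against the base URL's current directory
print(urljoin(base, "tag/love/"))  # https://quotes.toscrape.com/page/2/tag/love/
```

Because quotes.toscrape.com uses absolute hrefs like /page/2/, both forms work the same in the scraper above, but urljoin keeps the code correct either way.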
📝 Today's Summary
- Learned how to extract the "Next page" link from a web page
- Learned how to implement automatic page-turning scraping logic
- One step closer to building a complete data-collection tool