Python Stage 2 - Introduction to Web Scraping
🎯 Today's Goals
- Understand how web pagination works and how to locate the "Next page" link
- Write loop logic that automatically pages through a site and scrapes its content
- Integrate multi-page scraping into a scraper program
📘 Lesson Details
- 🔁 How web pagination works
Using quotes.toscrape.com as an example:
- Home page URL: https://quotes.toscrape.com/
- Next-page link:
<li class="next"><a href="/page/2/">Next</a></li>
With BeautifulSoup we can look up li.next > a and read its href attribute to get the next page's address, then join it with the base URL.
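The lookup above can be sketched on a minimal HTML fragment. This standalone snippet uses the stdlib "html.parser" backend (the lesson's examples use 'lxml') so it runs without an extra parser dependency:

```python
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# A minimal fragment mirroring the pagination markup on quotes.toscrape.com
html = '<li class="next"><a href="/page/2/">Next</a></li>'
soup = BeautifulSoup(html, "html.parser")

next_link = soup.select_one("li.next > a")  # the "Next" anchor inside li.next
next_href = next_link["href"]               # relative path: /page/2/
full_url = urljoin("https://quotes.toscrape.com/", next_href)
print(full_url)  # https://quotes.toscrape.com/page/2/
```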
- 🧪 Core logic in pseudocode
while True:
    1. Request the current page URL
    2. Parse the HTML and extract the content you need
    3. Check whether a "Next page" link exists
       - If it exists, join the new URL and continue the loop
       - If not, break out of the loop
💻 Example Code (Multi-page Scraping)
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def scrape_all_quotes(start_url):
    quotes = []
    url = start_url
    while url:
        print(f"Scraping: {url}")
        res = requests.get(url)
        soup = BeautifulSoup(res.text, 'lxml')
        for quote_block in soup.find_all("div", class_="quote"):
            quote_text = quote_block.find("span", class_="text").text.strip()
            author = quote_block.find("small", class_="author").text.strip()
            tags = [tag.text for tag in quote_block.find_all("a", class_="tag")]
            quotes.append({
                "quote": quote_text,
                "author": author,
                "tags": tags
            })
        # Find the next-page link
        next_link = soup.select_one("li.next > a")
        if next_link:
            next_href = next_link['href']
            url = urljoin(url, next_href)  # join into a full URL
        else:
            url = None
    return quotes

if __name__ == "__main__":
    all_quotes = scrape_all_quotes("https://quotes.toscrape.com/")
    print(f"Scraped {len(all_quotes)} quotes in total")
    # Print the first 3 as a sample
    for quote in all_quotes[:3]:
        print(f"\n{quote['quote']}\n—— {quote['author']} | Tags: {', '.join(quote['tags'])}")
🧠 Today's Practice Tasks
- Modify the existing scraper so it collects the quotes from every page
- Use len() to check how many quotes were collected
- Extra challenge: save all the data to a JSON file (using json.dump)
Practice code:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import json

def scrape_all_quotes(start_url):
    quotes = []
    url = start_url
    while url:
        print(f"Scraping page: {url}")
        response = requests.get(url)
        soup = BeautifulSoup(response.text, "lxml")
        quote_blocks = soup.find_all("div", class_="quote")
        for block in quote_blocks:
            text = block.find("span", class_="text").text.strip()
            author = block.find("small", class_="author").text.strip()
            tags = [tag.text for tag in block.find_all("a", class_="tag")]
            quotes.append({
                "quote": text,
                "author": author,
                "tags": tags
            })
        # Find the next-page link
        next_link = soup.select_one("li.next > a")
        if next_link:
            next_href = next_link['href']
            url = urljoin(url, next_href)
        else:
            url = None
    return quotes

if __name__ == "__main__":
    start_url = "https://quotes.toscrape.com/"
    all_quotes = scrape_all_quotes(start_url)
    print(f"\nScraped {len(all_quotes)} quotes in total.\n")
    # Save to a JSON file
    output_file = "quotes.json"
    with open(output_file, "w", encoding="utf-8") as f:
        json.dump(all_quotes, f, ensure_ascii=False, indent=2)
    print(f"Data saved to file: {output_file}")
Run output:
Scraping page: https://quotes.toscrape.com/
Scraping page: https://quotes.toscrape.com/page/2/
Scraping page: https://quotes.toscrape.com/page/3/
Scraping page: https://quotes.toscrape.com/page/4/
Scraping page: https://quotes.toscrape.com/page/5/
Scraping page: https://quotes.toscrape.com/page/6/
Scraping page: https://quotes.toscrape.com/page/7/
Scraping page: https://quotes.toscrape.com/page/8/
Scraping page: https://quotes.toscrape.com/page/9/
Scraping page: https://quotes.toscrape.com/page/10/

Scraped 100 quotes in total.

Data saved to file: quotes.json
quotes.json file contents:
[
  {
    "quote": "“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”",
    "author": "Albert Einstein",
    "tags": [
      "change",
      "deep-thoughts",
      "thinking",
      "world"
    ]
  },
  {
    "quote": "“It is our choices, Harry, that show what we truly are, far more than our abilities.”",
    "author": "J.K. Rowling",
    "tags": [
      "abilities",
      "choices"
    ]
  },
  {
    "quote": "“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”",
    "author": "Albert Einstein",
    "tags": [
      "inspirational",
      "life",
      "live",
      "miracle",
      "miracles"
    ]
  },
  ... (95 entries omitted here)
  {
    "quote": "“A person's a person, no matter how small.”",
    "author": "Dr. Seuss",
    "tags": [
      "inspirational"
    ]
  },
  {
    "quote": "“... a mind needs books as a sword needs a whetstone, if it is to keep its edge.”",
    "author": "George R.R. Martin",
    "tags": [
      "books",
      "mind"
    ]
  }
]
📎 Tips
- urljoin(base_url, relative_path) automatically builds an absolute URL
- Some sites use JavaScript-driven dynamic pagination; scraping those requires Selenium/Playwright (covered later)
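As a quick illustration of the urljoin tip (the URLs are just this lesson's example site), note the difference between absolute and relative paths:

```python
from urllib.parse import urljoin

base = "https://quotes.toscrape.com/page/2/"
# An absolute path ("/page/3/") replaces the base URL's entire path
print(urljoin(base, "/page/3/"))   # https://quotes.toscrape.com/page/3/
# A relative path is resolved against the base URL's current directory
print(urljoin(base, "tag/love/"))  # https://quotes.toscrape.com/page/2/tag/love/
```

Because quotes.toscrape.com uses absolute hrefs like /page/2/, both forms work the same in the scraper above, but urljoin keeps the code correct either way.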
📝 Today's Summary
- Learned how to extract the "Next page" link from a web page
- Learned how to implement automatic page-turning scraping logic
- One step closer to building a complete data-collection tool