Python Web Scraping Basics, Day 5: Parsing Web Pages with XPath (lxml + XPath)

Python Stage 2: Web Scraping Basics

🎯 Today's Goals

Learn the basic syntax of XPath
Parse HTML with lxml.etree and extract data
Compare with BeautifulSoup: which one is more capable?

📘 Lesson Details

Installing dependencies

pip install lxml requests

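🧩 Introduction to XPath

XPath is a query language for selecting nodes in XML/HTML documents. Its path expressions can filter on attributes, position, and nesting, which makes it well suited to pulling data out of deeply structured pages.

Common syntax: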

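XPath expression	Meaning
`//tag`	All elements with the given tag name, anywhere in the document
`//div[@class="quote"]`	All div elements whose class attribute is exactly "quote"
`.//span[@class="text"]/text()`	The text of span elements with class "text" inside the current element
`//a/@href`	The href attribute value of every a element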
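
To make the table concrete, here is a minimal sketch that runs each expression against a small inline HTML string, so no network access is needed. The markup in SAMPLE_HTML is a made-up example that only loosely resembles quotes.toscrape.com, and the variable names are chosen for illustration.

```python
from lxml import etree

# Hypothetical sample markup, loosely modelled on quotes.toscrape.com
SAMPLE_HTML = """
<html><body>
  <div class="quote">
    <span class="text">First quote</span>
    <a href="/author/a">about</a>
  </div>
  <div class="quote">
    <span class="text">Second quote</span>
    <a href="/author/b">about</a>
  </div>
  <div class="footer"><a href="/login">Login</a></div>
</body></html>
"""

tree = etree.HTML(SAMPLE_HTML)

all_links = tree.xpath("//a")                     # every <a> element
quote_divs = tree.xpath('//div[@class="quote"]')  # divs whose class is exactly "quote"
hrefs = tree.xpath("//a/@href")                   # attribute values, returned as strings

# A leading "." makes the path relative to the current element
first_text = quote_divs[0].xpath('.//span[@class="text"]/text()')

print(len(all_links))   # 3
print(len(quote_divs))  # 2
print(hrefs)            # ['/author/a', '/author/b', '/login']
print(first_text)       # ['First quote']
```

Note that xpath() always returns a list here, even when only one node matches, which is why the scraping code indexes the result with [0].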

📌 Example Code

from lxml import etree
import requests

url = "https://quotes.toscrape.com/"
res = requests.get(url, timeout=10)
tree = etree.HTML(res.text)

# Each quote on the page sits in its own <div class="quote">
quotes = tree.xpath('//div[@class="quote"]')
for q in quotes:
    # Paths starting with "." are evaluated relative to the current quote
    text = q.xpath('.//span[@class="text"]/text()')[0]
    author = q.xpath('.//small[@class="author"]/text()')[0]
    tags = q.xpath('.//div[@class="tags"]/a[@class="tag"]/text()')
    print(f"{text} - {author} [Tags: {', '.join(tags)}]")
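
The example above indexes each result with [0], which raises IndexError when an expression matches nothing (for instance, if the page layout changes). One possible way to guard against that is a small helper like the one sketched below. The name first_or_default and the sample markup are hypothetical and not part of lxml.

```python
from lxml import etree


def first_or_default(element, path, default=""):
    """Return the first text/attribute match for `path`, stripped, or `default`.

    Intended for paths ending in text() or @attr, which yield strings.
    """
    matches = element.xpath(path)
    return matches[0].strip() if matches else default


# Hypothetical fragment with no <small class="author"> element
SAMPLE_HTML = """
<div class="quote">
  <span class="text">Some text</span>
</div>
"""

quote = etree.HTML(SAMPLE_HTML).xpath('//div[@class="quote"]')[0]
text = first_or_default(quote, './/span[@class="text"]/text()')
author = first_or_default(quote, './/small[@class="author"]/text()', default="unknown")

print(text, "|", author)  # Some text | unknown
```

Whether to substitute a default or to skip the record entirely depends on how the data will be used, so treat this as one option rather than the required approach.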

📊 XPath vs BeautifulSoup

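Aspect	BeautifulSoup	XPath (lxml)
Learning curve	Gentle	Somewhat steeper
Expressive power	Medium	High
Performance	Moderate	Generally faster
Selection style	Tag name / class name / CSS selectors	Path expressions
Best suited for	Beginners	Developers comfortable with HTML structure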
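
As a rough illustration of the "selection style" row, the sketch below extracts the same fields twice from a made-up HTML fragment: once with BeautifulSoup's CSS selectors and once with lxml's XPath. It assumes the beautifulsoup4 package is installed in addition to lxml, and it uses Python's built-in "html.parser" backend for BeautifulSoup. It says nothing about speed.

```python
from bs4 import BeautifulSoup
from lxml import etree

# Hypothetical fragment in the style of quotes.toscrape.com
SAMPLE_HTML = """
<div class="quote">
  <span class="text">Quote A</span>
  <small class="author">Author A</small>
</div>
<div class="quote">
  <span class="text">Quote B</span>
  <small class="author">Author B</small>
</div>
"""

# BeautifulSoup: CSS selectors via select() and select_one()
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
bs_rows = [
    (q.select_one("span.text").get_text(), q.select_one("small.author").get_text())
    for q in soup.select("div.quote")
]

# lxml: XPath path expressions
tree = etree.HTML(SAMPLE_HTML)
xp_rows = [
    (
        q.xpath('.//span[@class="text"]/text()')[0],
        q.xpath('.//small[@class="author"]/text()')[0],
    )
    for q in tree.xpath('//div[@class="quote"]')
]

print(bs_rows)
print(bs_rows == xp_rows)  # True
```

Both approaches produce the same rows here; the difference is mainly in how the selection is written.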

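🧪 Today's Practice Tasks

Use XPath to extract quotes, authors, and tags
Fetch the data from every page (follow the pagination links)
Count the number of authors and the number of distinct tags
Save the data as a JSON file

Example code: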

import json
import time

import requests
from lxml import etree

BASE_URL = "https://quotes.toscrape.com"
HEADERS = {"User-Agent": "Mozilla/5.0"}


def fetch_html(url):
    """Return the page HTML, or None if the status code is not 200."""
    response = requests.get(url, headers=HEADERS, timeout=10)
    return response.text if response.status_code == 200 else None


def parse_quotes(html):
    """Extract text, author, and tags from every quote on one page."""
    tree = etree.HTML(html)
    quotes = tree.xpath('//div[@class="quote"]')
    data = []
    for q in quotes:
        text = q.xpath('.//span[@class="text"]/text()')[0]
        author = q.xpath('.//small[@class="author"]/text()')[0]
        tags = q.xpath('.//div[@class="tags"]/a[@class="tag"]/text()')
        data.append({"text": text, "author": author, "tags": tags})
    return data


def get_next_page(html):
    """Return the absolute URL of the next page, or None on the last page."""
    tree = etree.HTML(html)
    next_page = tree.xpath('//li[@class="next"]/a/@href')
    return BASE_URL + next_page[0] if next_page else None


def main():
    all_quotes = []
    url = BASE_URL
    while url:
        print(f"Fetching: {url}")
        html = fetch_html(url)
        if not html:
            print("Failed to load page")
            break
        all_quotes.extend(parse_quotes(html))
        url = get_next_page(html)
        time.sleep(0.5)  # short pause between requests to reduce the risk of being blocked

    # Report the result
    print(f"\nTotal quotes scraped: {len(all_quotes)}")

    # Save as JSON
    with open("quotes_xpath.json", "w", encoding="utf-8") as f:
        json.dump(all_quotes, f, ensure_ascii=False, indent=2)
    print("Saved to quotes_xpath.json")


if __name__ == "__main__":
    main()
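
The scraper above saves the data but does not cover the third practice task, counting authors and distinct tags. A minimal sketch of that step follows. It assumes records shaped like the ones parse_quotes returns (dicts with "text", "author", and "tags" keys); the sample data and the summarize function name are made up for illustration.

```python
from collections import Counter

# Hypothetical sample in the same shape as the scraper's records
sample_quotes = [
    {"text": "Q1", "author": "Author A", "tags": ["life", "love"]},
    {"text": "Q2", "author": "Author B", "tags": ["love"]},
    {"text": "Q3", "author": "Author A", "tags": ["humor", "life"]},
]


def summarize(quotes):
    """Count quotes per author and collect the set of distinct tags."""
    author_counts = Counter(q["author"] for q in quotes)
    unique_tags = {tag for q in quotes for tag in q["tags"]}
    return author_counts, unique_tags


author_counts, unique_tags = summarize(sample_quotes)

print(f"Authors: {len(author_counts)}")        # Authors: 2
print(f"Distinct tags: {len(unique_tags)}")    # Distinct tags: 3
print(author_counts.most_common(1))            # [('Author A', 2)]
```

With real data, the same function could be called on the all_quotes list inside main(), or on the list loaded back from quotes_xpath.json with json.load.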

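Today's Summary

Learned to locate HTML elements precisely with XPath
Practiced parsing HTML with lxml.etree.HTML
Compared two mainstream approaches to HTML parsing, laying the groundwork for more complex data extraction later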
