Python Web Scraping (Part 3): The BeautifulSoup Library

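1. What is BeautifulSoup?

BeautifulSoup is a Python library for parsing HTML and XML documents and extracting data from them. It turns page source into a tree structure that is easy to search and modify, and it handles encodings automatically: incoming documents are converted to Unicode, and output is written as UTF-8.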
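As a minimal sketch of the encoding handling described above, the snippet below feeds raw UTF-8 bytes to BeautifulSoup and reads the text back as an ordinary Python `str`. The sample markup and the `from_encoding` hint are illustrative assumptions, and `html.parser` is used only so the sketch runs without extra installs.

```python
from bs4 import BeautifulSoup

# Raw bytes, as they might arrive from a network response (sample data).
raw = "<p>café</p>".encode("utf-8")

# from_encoding is an optional hint; without it BeautifulSoup tries to
# detect the encoding on its own.
soup = BeautifulSoup(raw, "html.parser", from_encoding="utf-8")

text = soup.p.get_text()
print(text)        # the decoded paragraph text
print(type(text))  # a regular Unicode str
```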

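Supported parsers

BeautifulSoup supports several parsers, which differ in speed and in how well they tolerate malformed markup:

Parser	Usage	Advantages	Disadvantages
Python standard library	`BeautifulSoup(html, "html.parser")`	Built in, nothing to install	Slower than lxml
lxml (HTML)	`BeautifulSoup(html, "lxml")`	Fast, lenient	Requires a separate install
lxml (XML)	`BeautifulSoup(html, "xml")`	The only supported XML parser	Requires a separate install
html5lib	`BeautifulSoup(html, "html5lib")`	Most lenient, parses pages the way a browser does	Slowest

For most scraping work, lxml is the recommended choice because it is the fastest of these parsers.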
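Since lxml is a separate install, a script may want to prefer it but still run without it. The sketch below shows one possible way to do that; `make_soup` is a hypothetical helper name, not part of BeautifulSoup, and it relies on the `FeatureNotFound` exception that bs4 raises when a requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse markup with the preferred parser, falling back to html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # The preferred parser is not installed; use the built-in one.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>Hello</p>")
print(soup.p.get_text())
```

Keep in mind that different parsers can build different trees from malformed markup, so a fallback like this is safest when the input is reasonably well formed.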

2. Installing BeautifulSoup

Install BeautifulSoup

pip install beautifulsoup4

Install lxml (recommended)

pip install lxml

Install html5lib (optional)

pip install html5lib

3. Quick start

(1) Parsing an HTML string

from bs4 import BeautifulSoup

html = '''
<!DOCTYPE html>
<html>
<head><title>Learning BeautifulSoup</title>
</head>
<body><p>Hello BeautifulSoup</p>
</body>
</html>
'''

# Parse with the lxml parser
soup = BeautifulSoup(html, "lxml")
print(soup.p.text)  # Output: Hello BeautifulSoup

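(2) Parsing a local HTML file

# Open in binary mode so BeautifulSoup can detect the encoding; the with block closes the file
with open("index.html", "rb") as f:
    soup = BeautifulSoup(f, "lxml")
print(soup.title.text)  # Prints the page title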
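To try file parsing without an existing `index.html`, the sketch below writes a small sample page to a temporary file and parses it. The page content and the temporary file are stand-ins for illustration, and `html.parser` is used so the sketch runs without lxml.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Write a small sample page to a temporary file (a stand-in for index.html).
with tempfile.NamedTemporaryFile(
    "w", suffix=".html", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("<html><head><title>Sample page</title></head><body></body></html>")
    path = tmp.name

# Binary mode leaves encoding detection to BeautifulSoup.
with open(path, "rb") as f:
    soup = BeautifulSoup(f, "html.parser")

title = soup.title.get_text()
print(title)

os.remove(path)  # clean up the temporary file
```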

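4. The four BeautifulSoup object types

(1) Tag

Corresponds to a tag in the HTML, such as <title> or <p>.
The tag name is available as .name and its attributes as .attrs.

tag = soup.title
print(tag.name)  # Output: title
print(tag.attrs)  # Output: {} for the example page above, whose <title> has no attributes; a tag written as <title class="tl"> would give {'class': ['tl']}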
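A short sketch of reading attributes from a Tag, assuming a sample `<a>` element that is not part of the page used earlier. It shows dictionary-style access, the safer `.get()` lookup, and the fact that `class` comes back as a list because it can hold several values.

```python
from bs4 import BeautifulSoup

# Sample markup for illustration only.
html = '<a class="sister elsie" href="http://example.com/elsie" id="link1">Elsie</a>'
soup = BeautifulSoup(html, "html.parser")
tag = soup.a

name = tag.name              # the tag name
href = tag["href"]           # dictionary-style access; raises KeyError if missing
classes = tag.get("class")   # multi-valued attribute, returned as a list
missing = tag.get("title")   # .get() returns None for an absent attribute

print(name, href, classes, missing)
```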

(2) NavigableString (text content)

Corresponds to the text inside a tag, such as "Hello" in <p>Hello</p>.
Use .string to get it.

text = soup.p.string
print(text)  # Output: Hello BeautifulSoup
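One detail worth a small sketch: `.string` only works when a tag has a single child. When a tag mixes text and nested tags, `.string` is `None`, and `get_text()` is the usual way to collect all the text. The markup below is sample data for illustration.

```python
from bs4 import BeautifulSoup

# A paragraph containing both plain text and a nested <b> tag (sample data).
soup = BeautifulSoup("<p>Hello <b>BeautifulSoup</b></p>", "html.parser")

inner = soup.b.string        # <b> has exactly one child, so .string works
outer = soup.p.string        # <p> has two children, so .string is None
full = soup.p.get_text()     # concatenates all text inside <p>

print(inner)
print(outer)
print(full)
```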

(3) BeautifulSoup (the whole document)

Represents the entire HTML document and can be treated as the top-level Tag.
Its .name is [document].

print(soup.name)  # Output: [document]

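(4) Comment

Corresponds to an HTML comment (<!-- ... -->). A Comment is a special kind of NavigableString, and its value does not include the comment delimiters.

from bs4 import Comment

comment = soup.find(string=lambda text: isinstance(text, Comment))
print(comment)  # For a document containing <!--This is a comment-->, prints: This is a comment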
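The quick-start page contains no comments, so here is a self-contained sketch with sample markup that does. It collects every comment in the document with `find_all` and strips the surrounding whitespace that the comment text keeps.

```python
from bs4 import BeautifulSoup, Comment

# Sample markup with one comment (illustrative only).
html = "<p><!-- This is a comment -->Visible text</p>"
soup = BeautifulSoup(html, "html.parser")

# Match every string node that is a Comment instance.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))

# Comment text keeps its inner whitespace, so strip it for display.
cleaned = [c.strip() for c in comments]
print(cleaned)
```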

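5. Searching the document tree

(1) `find_all()`: find every matching tag

# Find all <a> tags
links = soup.find_all("a")
# Find all tags with class="elsie"
elsie_tags = soup.find_all(class_="elsie")
# Find tags with id="link1"
link1 = soup.find_all(id="link1")
# Find text nodes that are exactly "BeautifulSoup" (returns strings, not tags);
# pass string=re.compile("BeautifulSoup") to match text that merely contains it
text_match = soup.find_all(string="BeautifulSoup")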
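The calls above assume a page with `<a>` tags, classes, and ids, which the quick-start page does not have. The sketch below supplies sample markup (three links, chosen for illustration) and combines a few `find_all()` filters: tag name plus class, a `limit`, a regular expression on an attribute, and a regular expression on text.

```python
import re

from bs4 import BeautifulSoup

# Sample markup for illustration only.
html = """
<p class="story">
  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Tag name and class together, stopping after two results.
first_two = soup.find_all("a", class_="sister", limit=2)

# Attribute values can be matched with a compiled regular expression.
by_href = soup.find_all("a", href=re.compile(r"/lacie$"))

# string= matches text nodes; here, those starting with "L".
l_names = soup.find_all(string=re.compile(r"^L"))

print([a["id"] for a in first_two])
print([a.get_text() for a in by_href])
print(l_names)
```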

(2) `find()`: find the first matching tag

first_link = soup.find("a")  # Returns the first <a> tag, or None if there is none
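Because `find()` returns `None` when nothing matches, code that immediately indexes the result can fail on pages that lack the tag. A minimal sketch of a guarded lookup, using sample markup with no links:

```python
from bs4 import BeautifulSoup

# Sample markup that deliberately contains no <a> tag.
soup = BeautifulSoup("<p>No links here</p>", "html.parser")

link = soup.find("a")  # nothing matches, so this is None

# Guard before reading attributes to avoid a TypeError on None.
href = link["href"] if link is not None else None

# find_all() returns an empty result list in the same situation.
all_links = soup.find_all("a")

print(link, href, len(all_links))
```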

(3) CSS selectors (recommended!)

# Find all <a> tags
soup.select("a")
# Find all tags with class="elsie"
soup.select(".elsie")
# Find the tag with id="link1"
soup.select("#link1")
# Find all <a> tags under body
soup.select("body a")
# Find the first matching tag
soup.select_one(".elsie")
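A common next step after selecting tags is to pull out an attribute such as `href`. The sketch below does this with sample markup chosen for illustration; it assumes a Beautiful Soup installation recent enough to ship CSS selector support through the soupsieve package, which pip normally installs alongside it.

```python
from bs4 import BeautifulSoup

# Sample markup for illustration only.
html = """
<p class="story">
  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Direct <a> children of p.story that carry an href attribute.
hrefs = [a["href"] for a in soup.select("p.story > a[href]")]

# select_one() returns the first match, or None when nothing matches.
first = soup.select_one("#link1")
absent = soup.select_one("div")

print(hrefs)
print(first.get_text(), absent)
```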

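6. Summary

BeautifulSoup is a staple HTML/XML parsing library for Python web scraping.
The lxml parser is recommended because it is the fastest.
There are four object types: Tag, NavigableString, BeautifulSoup, and Comment.
Search methods:
- find_all() finds every matching tag
- find() finds the first matching tag
- select() uses CSS selectors (often the most convenient)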
