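Python Web Scraping – BeautifulSoup

Web scraping with Python means writing Python programs that automatically extract information from the internet.

A scraper typically sends an HTTP request to fetch a page, parses the page and extracts data from it, and then stores that data.

Python's rich ecosystem, and in particular its strong library support, has made it a popular language for writing scrapers.

In general, a scraper's workflow breaks down into the following steps:

Send an HTTP request: the scraper fetches the HTML page from the target site over HTTP. A commonly used library is [requests](https://www.runoob.com/python3/python-requests.html).
Parse the HTML: once the page has been fetched, the scraper parses its content. Commonly used tools include BeautifulSoup and lxml, as well as the selectors built into the Scrapy framework.
Extract the data: locate HTML elements (by tag, attribute, class name, and so on) and pull out the data you need.
Store the data: save the extracted data to a database, a CSV file, a JSON file, or a similar format for later use or analysis.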
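
The four steps can be sketched end to end. The sketch below is only an illustration under stated assumptions: the page content (`SAMPLE_HTML`), the helper names `extract_links` and `to_csv`, and the column names are all made up for this example, and the fetch step is replaced by a hard-coded string so that it runs without network access.

```python
import csv
import io

from bs4 import BeautifulSoup

# Step 1 (fetch) is replaced by a hard-coded page so the sketch runs offline;
# in a real scraper this string would come from requests.get(url).text.
SAMPLE_HTML = """
<html><body>
  <ul id="books">
    <li class="book"><a href="/books/1">Python Basics</a></li>
    <li class="book"><a href="/books/2">Web Scraping 101</a></li>
  </ul>
</body></html>
"""


def extract_links(html):
    """Steps 2 and 3: parse the HTML and pull out (text, href) pairs."""
    soup = BeautifulSoup(html, "html.parser")
    return [(a.get_text(strip=True), a.get("href")) for a in soup.find_all("a")]


def to_csv(rows):
    """Step 4: serialise the rows as CSV text (kept in memory here)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["title", "url"])
    writer.writerows(rows)
    return buffer.getvalue()


rows = extract_links(SAMPLE_HTML)
csv_text = to_csv(rows)
print(csv_text)
```

Writing to an in-memory buffer keeps the example free of side effects; swapping `io.StringIO()` for `open("books.csv", "w", newline="")` would write a real file instead.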

This chapter focuses on BeautifulSoup, a Python library for parsing HTML and XML documents and extracting data from them. It is widely used for web scraping and data mining.

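BeautifulSoup

BeautifulSoup is a Python library for extracting data from web pages, and it is particularly well suited to parsing HTML and XML files.

It offers a simple API for extracting and manipulating page content, which makes it a good fit for web scraping and data extraction tasks.

Installing BeautifulSoup

To use BeautifulSoup, install the beautifulsoup4 package. You also need a parser: either lxml, which is installed separately, or html.parser, which ships with Python.

Both packages can be installed with pip:

pip install beautifulsoup4
pip install lxml  # recommended parser (faster)

If lxml is not installed, you can use Python's built-in html.parser instead.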
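
One way to make the parser choice automatic is to try lxml first and fall back to html.parser. This is a minimal sketch, not a recommended pattern from the library itself: `make_soup` is a hypothetical helper name, and it relies on BeautifulSoup raising `FeatureNotFound` when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html):
    """Parse with lxml when it is installed, otherwise use html.parser."""
    try:
        return BeautifulSoup(html, "lxml")
    except FeatureNotFound:
        # Raised by BeautifulSoup when the requested parser is not installed.
        return BeautifulSoup(html, "html.parser")


soup = make_soup("<p>Hello, <b>parser</b>!</p>")
print(soup.p.get_text())
```

Note that the two parsers can build slightly different trees for malformed HTML, so a fallback like this is best kept for well-formed pages.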

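Basic Usage

BeautifulSoup parses HTML or XML data and provides methods for navigating, searching, and modifying the resulting parse tree.

Common operations include finding tags, reading tag attributes, and extracting text.

To use it, import BeautifulSoup and load the HTML page into a BeautifulSoup object.

Usually you first fetch the page content with an HTTP library such as requests:

Example

from bs4 import BeautifulSoup
import requests

# Fetch the page content with requests
url = 'https://cn.bing.com/'  # the Bing search engine home page
response = requests.get(url)

# Parse the page with BeautifulSoup
soup = BeautifulSoup(response.text, 'lxml')  # use the lxml parser
# Alternatively, parse with the built-in html.parser
# soup = BeautifulSoup(response.text, 'html.parser')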

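Getting the page title:

Example

from bs4 import BeautifulSoup
import requests

# The site whose title we want to read
url = 'https://cn.bing.com/'  # the Bing search engine home page

# Send the HTTP request and fetch the page content
response = requests.get(url)
# Set the encoding explicitly to avoid garbled Chinese text
response.encoding = 'utf-8'
# Make sure the request succeeded
if response.status_code == 200:
    # Parse the page content with BeautifulSoup
    soup = BeautifulSoup(response.text, 'lxml')

    # Find the <title> tag
    title_tag = soup.find('title')
    # Print the title text
    if title_tag:
        print(title_tag.get_text())
    else:
        print("No <title> tag found")
else:
    print("Request failed, status code:", response.status_code)

Running the code above prints the title: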

搜索 - Microsoft 必應

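Garbled Chinese Text

When you scrape Chinese pages with requests, encoding problems can leave the Chinese text garbled. To fetch and display such pages correctly, you usually need to handle the page's character encoding yourself.

Automatic detection: requests normally infers the encoding from the Content-Type response header, but the guess is sometimes wrong. In that case you can detect the encoding with chardet, a third-party package (pip install chardet).

Example

import chardet
import requests

url = 'https://cn.bing.com/'
response = requests.get(url)

# Detect the encoding from the raw bytes with chardet
encoding = chardet.detect(response.content)['encoding']
print(encoding)
response.encoding = encoding

Running the code above prints:

utf-8

If you already know the page's encoding (for example utf-8 or gbk), you can set response.encoding directly:

response.encoding = 'utf-8'  # or 'gbk', depending on the page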
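
To see why the encoding matters, the following sketch reproduces the problem offline. The GBK-encoded bytes are a made-up sample standing in for `response.content`; the example shows the same bytes decoded with the wrong and the right codec, and it also assumes you may prefer to hand the raw bytes to BeautifulSoup with its `from_encoding` argument.

```python
from bs4 import BeautifulSoup

# Bytes as a server might send them for a GBK-encoded page (made-up sample).
raw = "<html><head><title>中文标题</title></head></html>".encode("gbk")

# Decoding with the wrong codec garbles the Chinese characters.
wrong = raw.decode("utf-8", errors="replace")

# Decoding with the right codec recovers them.
right = raw.decode("gbk")

# BeautifulSoup can also take the bytes directly plus an explicit encoding.
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")

print("中文标题" in wrong)
print("中文标题" in right)
print(soup.title.string)
```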

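Finding Tags

BeautifulSoup provides several methods for finding tags in a page. The most commonly used are find() and find_all().

find() returns the first matching tag, or None if nothing matches
find_all() returns all matching tags

Example

from bs4 import BeautifulSoup
import requests

# The site to scrape
url = 'https://www.baidu.com/'  # the Baidu search engine home page

# Send the HTTP request and fetch the page content
response = requests.get(url)
# Set the encoding explicitly to avoid garbled Chinese text
response.encoding = 'utf-8'

soup = BeautifulSoup(response.text, 'lxml')

# Find the first <a> tag
first_link = soup.find('a')
print(first_link)
print("----------------------------")

# Get the href attribute of the first <a> tag
first_link_url = first_link.get('href')
print(first_link_url)
print("----------------------------")

# Find all <a> tags
all_links = soup.find_all('a')
print(all_links)

The output looks similar to the following: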

<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新聞</a>
----------------------------
http://news.baidu.com
----------------------------
[<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新聞</a>, <a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123</a>, <a class="mnav" href="http://map.baidu.com" name="tj_trmap">地圖</a>, <a class="mnav" href="http://v.baidu.com" name="tj_trvideo">視頻</a>,
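
Because a live page can change at any time, it may help to try the same calls on a fixed snippet. The HTML and the example.com URLs below are invented for illustration; the sketch mainly shows what find() and find_all() return when nothing matches.

```python
from bs4 import BeautifulSoup

html = """
<nav>
  <a class="mnav" href="http://news.example.com">News</a>
  <a class="mnav" href="http://map.example.com">Maps</a>
</nav>
"""
soup = BeautifulSoup(html, "html.parser")

first_link = soup.find("a")        # a Tag, or None when nothing matches
missing = soup.find("table")       # the sample has no <table>, so this is None
all_links = soup.find_all("a")     # list-like result, empty when nothing matches
no_tables = soup.find_all("table")

hrefs = [a.get("href") for a in all_links]

print(first_link.get("href"))
print(missing)
print(hrefs)
```

Checking the result of find() for None before using it avoids an AttributeError on pages that lack the expected tag.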

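Getting a Tag's Text

The get_text() method extracts the text content of a tag:

Example

from bs4 import BeautifulSoup
import requests

# The site to scrape
url = 'https://www.baidu.com/'  # the Baidu search engine home page

# Send the HTTP request and fetch the page content
response = requests.get(url)
# Set the encoding explicitly to avoid garbled Chinese text
response.encoding = 'utf-8'

soup = BeautifulSoup(response.text, 'lxml')

# Get the text content of the first <p> tag, if the page has one
first_paragraph = soup.find('p')
paragraph_text = first_paragraph.get_text() if first_paragraph else ''

# Get all the text content of the page
all_text = soup.get_text()
print(all_text)

The output looks similar to the following: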

 百度一下，你就知道           
...
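
The stray whitespace in output like this can be controlled with the `separator` and `strip` arguments of get_text(). The snippet below is a small sketch on an invented HTML fragment, not the Baidu page.

```python
from bs4 import BeautifulSoup

html = "<div><h1> Title </h1><p>First <b>bold</b> line.</p><p>Second line.</p></div>"
soup = BeautifulSoup(html, "html.parser")

# Default: every text node is concatenated exactly as it appears.
raw_text = soup.get_text()

# strip=True trims each text node; separator is placed between the nodes.
clean_text = soup.get_text(separator=" | ", strip=True)

# Guard against a missing tag before calling get_text() on it.
first_p = soup.find("p")
first_p_text = first_p.get_text() if first_p else ""

print(repr(raw_text))
print(clean_text)
print(first_p_text)
```

Note that the separator is inserted between every text node, including those split by inline tags such as `<b>`.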

Finding Child and Parent Tags

The parent and children attributes give access to a tag's parent tag and child nodes:

# Get the parent of the current tag
parent_tag = first_link.parent
# Get an iterator over the current tag's direct children
children = first_link.children

Example

from bs4 import BeautifulSoup
import requests

# The site to scrape
url = 'https://www.baidu.com/'  # the Baidu search engine home page

# Send the HTTP request and fetch the page content
response = requests.get(url)
# Set the encoding explicitly to avoid garbled Chinese text
response.encoding = 'utf-8'

soup = BeautifulSoup(response.text, 'lxml')

# Find the first <a> tag
first_link = soup.find('a')
print(first_link)
print("----------------------------")

# Get the parent of the current tag
parent_tag = first_link.parent
print(parent_tag.get_text())

The output looks similar to the following:

<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新聞</a>
----------------------------
新聞 hao123 地圖 視頻 貼吧  登錄   更多產品
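
The example above only uses parent. As a complement, here is a minimal offline sketch that walks in both directions; the menu markup is invented for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = '<ul id="menu"><li><a href="/a">A</a></li><li><a href="/b">B</a></li></ul>'
soup = BeautifulSoup(html, "html.parser")

menu = soup.find("ul")
first_link = soup.find("a")

# .children iterates over direct children only; keep just the Tag nodes.
child_names = [child.name for child in menu.children if isinstance(child, Tag)]

# .parent goes one level up; .parents walks all the way to the document root.
parent_name = first_link.parent.name
ancestor_names = [p.name for p in first_link.parents]

print(child_names)
print(parent_name)
print(ancestor_names)
```

The last ancestor is the BeautifulSoup object itself, whose name is the special string `[document]`.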

Finding Tags with Specific Attributes

You can find tags with specific attributes by passing those attributes as arguments.

For example, to find all div tags whose class is example-class:

# Find all <div> tags with class="example-class"
divs_with_class = soup.find_all('div', class_='example-class')
# Find the <p> tag with id="unique-id"
unique_paragraph = soup.find('p', id='unique-id')

Getting the search button, whose id is su:

Example

from bs4 import BeautifulSoup
import requests

# The site to scrape
url = 'https://www.baidu.com/'  # the Baidu search engine home page

# Send the HTTP request and fetch the page content
response = requests.get(url)
# Set the encoding explicitly to avoid garbled Chinese text
response.encoding = 'utf-8'

soup = BeautifulSoup(response.text, 'lxml')

# Find the <input> tag with id="su"
unique_input = soup.find('input', id='su')

input_value = unique_input['value']  # read the value attribute of the <input>

print(input_value)

The output is:

百度一下
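
Attribute access has two forms that behave differently when the attribute is absent. The sketch below uses an invented form snippet (the `value` and `class` contents are assumptions, not Baidu's real markup) to contrast them.

```python
from bs4 import BeautifulSoup

html = '<form><input id="su" type="submit" value="Search" class="btn primary"></form>'
soup = BeautifulSoup(html, "html.parser")
button = soup.find("input", id="su")

value = button["value"]                      # raises KeyError if the attribute is missing
placeholder = button.get("placeholder")      # returns None instead of raising
fallback = button.get("placeholder", "n/a")  # or a default of your choice
classes = button["class"]                    # multi-valued attribute: a list of strings

print(value, placeholder, fallback, classes)
```

Using `.get()` is usually the safer choice when scraping pages whose markup you do not control.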

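Advanced Usage

CSS Selectors

BeautifulSoup also supports finding tags with CSS selectors.

The select() method accepts a jQuery-like selector syntax and returns every matching tag:

# Use a CSS selector to find all <div> tags with the class 'example'
example_divs = soup.select('div.example')
# Find all <a> tags that have an href attribute
links = soup.select('a[href]')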
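
A self-contained version of these selectors, run against invented markup, might look like the sketch below. It also uses select_one(), which returns only the first match (or None).

```python
from bs4 import BeautifulSoup

html = """
<div class="example"><a href="/one">One</a></div>
<div class="example other"><a>No link</a></div>
<div id="main"><a href="/two">Two</a></div>
"""
soup = BeautifulSoup(html, "html.parser")

example_divs = soup.select("div.example")      # matches both "example" divs
links = soup.select("a[href]")                 # only <a> tags that have an href
first_main_link = soup.select_one("#main > a")  # first match, or None
no_match = soup.select_one("table")

print(len(example_divs))
print([a["href"] for a in links])
print(first_main_link.get_text())
```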

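Handling Nested Tags

BeautifulSoup copes with deeply nested HTML. By default find_all() searches all descendants recursively, so nested tags are found regardless of depth:

# Find <div> tags with class="nested", at any depth
nested_divs = soup.find_all('div', class_='nested')
for div in nested_divs:
    print(div.get_text())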
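
When only the direct children are wanted, find_all() accepts `recursive=False`. The following sketch, on an invented two-level structure, contrasts the two behaviours.

```python
from bs4 import BeautifulSoup

html = """
<div class="nested" id="outer">
  <p>outer text</p>
  <div class="nested" id="inner"><p>inner text</p></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
outer = soup.find("div", id="outer")

# Default search: every matching descendant, at any depth.
all_nested_ids = [d["id"] for d in soup.find_all("div", class_="nested")]
all_p = [p.get_text() for p in outer.find_all("p")]

# recursive=False: only direct children of the tag being searched.
direct_p = [p.get_text() for p in outer.find_all("p", recursive=False)]

print(all_nested_ids)
print(all_p)
print(direct_p)
```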

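Modifying Page Content

BeautifulSoup lets you modify the HTML content.

You can change a tag's attributes or text, or delete a tag:

Example

# Change the href attribute of the first <a> tag
first_link['href'] = 'http://new-url.com'

# Change the text content of the first <p> tag
first_paragraph = soup.find('p')
first_paragraph.string = 'Updated content'

# Delete a tag
first_paragraph.decompose()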
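
The fragment above depends on a soup and first_link created earlier. A complete, runnable variant on invented markup could look like this; here a separate `<span>` is deleted so that the updated paragraph stays visible in the result.

```python
from bs4 import BeautifulSoup

html = '<div><a href="http://old-url.example">link</a><p>Old content</p><span>ad</span></div>'
soup = BeautifulSoup(html, "html.parser")

soup.a["href"] = "http://new-url.example"  # change an attribute
soup.p.string = "Updated content"          # replace the tag's text
soup.span.decompose()                      # remove a tag and its contents

result = str(soup)
print(result)
```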

Converting to a String

You can convert a parsed BeautifulSoup object back into an HTML string:

# Convert to a string
html_str = str(soup)

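BeautifulSoup Attributes and Methods

The following attributes and methods are commonly used in BeautifulSoup:

Method/Attribute	Description	Example
`BeautifulSoup()`	Parses an HTML or XML document and returns a BeautifulSoup object.	`soup = BeautifulSoup(html_doc, 'html.parser')`
`.prettify()`	Formats the document as an indented, readable string.	`print(soup.prettify())`
`.find()`	Finds the first matching tag, or returns `None` if there is none.	`tag = soup.find('a')`
`.find_all()`	Finds all matching tags and returns them as a list.	`tags = soup.find_all('a')`
`.find_all_next()`	Finds all matching tags that come after the current tag.	`tags = soup.find('div').find_all_next('p')`
`.find_all_previous()`	Finds all matching tags that come before the current tag.	`tags = soup.find('div').find_all_previous('p')`
`.find_parent()`	Returns the closest matching ancestor of the current tag (its parent when called without arguments).	`parent = tag.find_parent()`
`.find_parents()`	Finds all matching ancestors of the current tag.	`parents = tag.find_parents()`
`.find_next_sibling()`	Finds the next sibling tag of the current tag.	`next_sibling = tag.find_next_sibling()`
`.find_previous_sibling()`	Finds the previous sibling tag of the current tag.	`prev_sibling = tag.find_previous_sibling()`
`.parent`	Gets the parent of the current tag.	`parent = tag.parent`
`.next_sibling`	Gets the next sibling node, which may be a text node such as whitespace.	`next_sibling = tag.next_sibling`
`.previous_sibling`	Gets the previous sibling node, which may be a text node such as whitespace.	`prev_sibling = tag.previous_sibling`
`.get_text()`	Extracts the text inside the tag, ignoring all HTML tags.	`text = tag.get_text()`
`.attrs`	Returns all of the tag's attributes as a dictionary.	`href = tag.attrs['href']`
`.string`	Gets the tag's string content when it contains exactly one string; otherwise `None`.	`string_content = tag.string`
`.name`	Returns the name of the tag.	`tag_name = tag.name`
`.contents`	Returns the tag's direct children as a list.	`children = tag.contents`
`.descendants`	Returns all descendants of the tag as a generator.	`for child in tag.descendants: print(child)`
`.previous_element`	Gets the node parsed immediately before the current tag (this may be a text node).	`prev_elem = tag.previous_element`
`.next_element`	Gets the node parsed immediately after the start of the current tag (this may be a text node).	`next_elem = tag.next_element`
`.decompose()`	Removes the current tag and its contents from the tree and destroys them.	`tag.decompose()`
`.unwrap()`	Removes the tag itself and keeps only its children.	`tag.unwrap()`
`.insert()`	Inserts a new tag or text at a given position inside the tag.	`tag.insert(0, new_tag)`
`.insert_before()`	Inserts a new tag before the current tag.	`tag.insert_before(new_tag)`
`.insert_after()`	Inserts a new tag after the current tag.	`tag.insert_after(new_tag)`
`.extract()`	Removes the tag from the tree and returns it.	`extracted_tag = tag.extract()`
`.replace_with()`	Replaces the current tag and its contents.	`tag.replace_with(new_tag)`
`.has_attr()`	Checks whether the tag has the given attribute.	`if tag.has_attr('href'):`
`.get()`	Gets the value of the given attribute, or `None` if it is missing.	`href = tag.get('href')`
`.clear()`	Removes all of the tag's contents.	`tag.clear()`
`.encode()`	Encodes the tag as a bytes object.	`encoded = tag.encode()`
`.is_empty_element`	Checks whether the tag is an empty element (such as `<br>` or `<img>`).	`if tag.is_empty_element:`
`.parents`	Iterates over all ancestors of the current tag; useful for testing ancestor/descendant relationships.	`if any(p is another_tag for p in tag.parents):`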
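
The difference between the plain navigation attributes and the find_* methods is easiest to see on markup that contains whitespace between tags. The following sketch uses an invented list; it is an illustration of the behaviour, not a complete tour of the table.

```python
from bs4 import BeautifulSoup

html = """<ul>
<li id="a">A</li>
<li id="b">B</li>
</ul>"""
soup = BeautifulSoup(html, "html.parser")
first = soup.find("li", id="a")

# .next_sibling returns the very next node, here the newline between the tags.
raw_sibling = first.next_sibling

# find_next_sibling() skips text nodes and returns the next matching tag.
tag_sibling = first.find_next_sibling("li")

# .next_element follows parse order, so it is the text inside the first <li>.
next_node = first.next_element

print(repr(raw_sibling))
print(tag_sibling)
print(repr(next_node))
```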

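Other Attributes

HTML attributes are read with dictionary-style access. Dotted access such as tag.style would look for a child tag named style instead.

Expression	Description	Example
`tag['style']`	Gets the tag's inline style.	`style = tag['style']`
`tag['id']`	Gets the tag's `id` attribute.	`tag_id = tag['id']`
`tag['class']`	Gets the tag's `class` attribute as a list of class names.	`class_names = tag['class']`
`.string`	Gets the string content inside the tag when it contains exactly one string.	`content = tag.string`
`.parent`	Gets the tag's parent element.	`parent = tag.parent`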

Other

Call	Description	Example
`find_all(name, class_=...)`	Finds tags by tag name and CSS class.	`tags = soup.find_all('div', class_='container')`
`find_all(id=...)`	Finds tags with the given `id`.	`tags = soup.find_all(id='main')`
`find_all(attrs=...)`	Finds tags with the given attributes.	`tags = soup.find_all(attrs={"href": "http://example.com"})`
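
These filters can be tried together on a small document. The markup below is invented for illustration; the last call additionally assumes you want pattern matching, which find_all() supports by accepting a compiled regular expression as an attribute value.

```python
import re

from bs4 import BeautifulSoup

html = """
<div class="container" id="main"><a href="http://example.com">Example</a></div>
<div class="container"><a href="http://example.org">Other</a></div>
"""
soup = BeautifulSoup(html, "html.parser")

containers = soup.find_all("div", class_="container")        # by tag name and class
main = soup.find_all(id="main")                              # by id
exact = soup.find_all(attrs={"href": "http://example.com"})  # by exact attribute value
pattern = soup.find_all("a", href=re.compile(r"\.org$"))     # by regular expression

print(len(containers), len(main))
print([a.get_text() for a in exact])
print([a.get_text() for a in pattern])
```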