Python Web Scraping Essentials: Common BeautifulSoup4 APIs

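Table of Contents

- Preface
- Introduction to BeautifulSoup4
  - Key features
  - Installation
- Common APIs
  - 1. Creating a BeautifulSoup object
  - 2. Finding tags
    - find(): returns the first matching element
    - find_all(): returns a list of all matching elements
    - select_one() & select(): CSS selectors
  - 3. Accessing tag content
    - text attribute: get the plain text inside a tag
    - get_text(): also returns the text
    - attrs attribute: get all attributes of a tag
    - [attribute]: access a single attribute value directly
  - 4. Modifying the document
    - Adding a new tag
    - Removing a tag
    - Replacing a tag
  - 5. Navigating the tree
    - parent: the parent node
    - children: iterator over child nodes
    - siblings: sibling nodes at the same level
- Practical tip (key points)
  - Open the developer tools with F12
  - Copy the CSS selector of the target image
  - Use it directly in code
- Closing remarks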

Preface

In the long river of time, every drop of water is one of yesterday's stars, reflecting a today that never repeats.

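Introduction to BeautifulSoup4

BeautifulSoup4 (usually abbreviated BS4) is a Python library for parsing HTML and XML documents. It is designed to simplify extracting data from complex web pages. It tolerates the imperfect markup found on real sites and provides a simple interface for navigating, searching, and modifying the parsed document.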

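Key features:

Cross-platform: Beautiful Soup runs on Windows, Linux, macOS, and other operating systems.
Multiple parsers: it works with Python's built-in parser (html.parser) as well as the third-party parsers lxml and html5lib, which are installed separately.
Easy to learn: the API is simple and intuitive, which makes it suitable for beginners.
Rich feature set: it offers many methods for searching, navigating, and modifying the parse tree.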
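Since the parser is chosen by name when the soup is created, a script can fall back to the built-in parser when a third-party one is missing. The sketch below is only one possible way to do that; `make_soup` is a hypothetical helper written for this article, not part of the library.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: parse with the first parser that is installed."""
    for name in preferred:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # This parser is not installed, so try the next one.
            continue
    raise RuntimeError("no usable parser found")


soup, parser_used = make_soup("<p>Hello <b>World</b></p>")
print(parser_used)        # which parser was actually picked on this machine
print(soup.b.get_text())  # World
```

Different parsers can build slightly different trees from broken markup, so it is usually safer to pin one parser once a scraper works.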

Installation:

You can install BeautifulSoup4 with pip:

pip install beautifulsoup4

Common APIs

The following are some commonly used BeautifulSoup4 methods and features:

1. Creating a BeautifulSoup object

First, create a BeautifulSoup object to parse the HTML or XML document.

from bs4 import BeautifulSoup

# Parse with the built-in html.parser
html_doc = "<html><head><title>Example Page</title></head><body id='id'><a href='123'></a><p class='my-class child-class'><i>444</i><h1>Hello World</h1></p></body></html>"
soup = BeautifulSoup(html_doc, 'html.parser')

# Print the parsed document with indentation
print(soup.prettify())

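2. Finding tags

You can look up specific elements by tag name or by other attributes.

find(): returns the first matching element

first_paragraph = soup.find('p')
print(first_paragraph)
# Output: <p class="my-class child-class"><i>444</i><h1>Hello World</h1></p>
# (html.parser keeps the <i> and <h1> nested inside the <p>)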

find_all(): returns a list of all matching elements

all_headings = soup.find_all(['h1', 'h2'])
for heading in all_headings:
    print(heading.text)  # Output: Hello World
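Besides a tag name, find_all() also accepts attribute filters. Here is a small sketch of a few common ones; the markup is made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only.
html = """
<ul>
  <li class="item new"><a href="/a">A</a></li>
  <li class="item"><a href="/b">B</a></li>
  <li class="item"><a>no link</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# href=True keeps only <a> tags that actually have an href attribute.
hrefs = [a["href"] for a in soup.find_all("a", href=True)]
print(hrefs)  # ['/a', '/b']

# class_ matches a single class name, even when the tag has several classes.
new_items = soup.find_all("li", class_="new")
print(len(new_items))  # 1

# limit stops the search after the given number of matches.
first_two = soup.find_all("li", limit=2)
print(len(first_two))  # 2
```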

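select_one() & select(): CSS selectors

# select_one() returns the first match, or None if nothing matches
css_selector_example = soup.select_one('.my-class')
print(css_selector_example)

# select() returns a list of every match
css_selectors_examples = soup.select('#id > .child-class')
for element in css_selectors_examples:
    print(element.text)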
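A few more selector forms that tend to be useful when scraping, again on made-up markup. This is a sketch, assuming a recent BeautifulSoup4 where CSS selection is provided by its Soup Sieve dependency.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only.
html = """
<div id="nav">
  <ul class="menu">
    <li><a href="https://example.com/docs">Docs</a></li>
    <li><a href="/local">Local</a></li>
    <li class="active"><a href="https://example.com/blog">Blog</a></li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Attribute prefix match: links whose href starts with "https".
external = [a.get_text() for a in soup.select("#nav a[href^='https']")]
print(external)  # ['Docs', 'Blog']

# Child combinator plus class: the link inside the active item.
active = soup.select_one("ul.menu > li.active > a")
print(active.get_text())  # Blog

# Positional match: the link in the second <li>.
second = soup.select_one("ul.menu > li:nth-of-type(2) a")
print(second.get_text())  # Local

# select_one() returns None when nothing matches, so check before using it.
missing = soup.select_one("table")
print(missing)  # None
```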

3. Accessing tag content

Read the text and the attributes of a tag.

text attribute: get the plain text inside a tag

text_content = first_paragraph.text
print(text_content)  # Output: 444Hello World

get_text(): also returns the text

get_text_content = first_paragraph.get_text()
print(get_text_content)  # Output: 444Hello World
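Unlike the text attribute, get_text() takes arguments that help when the text of several child tags runs together. A minimal sketch with made-up markup:

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only.
html = "<div><p> First </p><p>Second<br/>line</p></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.div

# With no arguments, all strings are concatenated as they are.
raw = div.get_text()
print(repr(raw))  # ' First Secondline'

# separator is placed between strings; strip=True trims each string first.
joined = div.get_text(separator=" | ", strip=True)
print(joined)  # First | Second | line

# stripped_strings yields the same trimmed pieces one at a time.
pieces = list(div.stripped_strings)
print(pieces)  # ['First', 'Second', 'line']
```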

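attrs attribute: get all attributes of a tag

attributes = first_paragraph.attrs
print(attributes)  # Output: {'class': ['my-class', 'child-class']}; an empty dict {} if the tag has no attributes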

[attribute]: access a single attribute value directly

link_tag = soup.a
href_value = link_tag['href']
print(href_value)  # Output: 123
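Square-bracket access raises KeyError when the attribute is missing, which happens a lot on real pages. A short sketch of the safer alternatives, on made-up markup:

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only.
html = '<a id="x" class="btn primary" href="/home">Home</a><a>Bare</a>'
soup = BeautifulSoup(html, "html.parser")
first, second = soup.find_all("a")

print(first["href"])   # /home
print(first["class"])  # ['btn', 'primary'] (class is returned as a list)

# get() returns None, or a default you supply, instead of raising.
print(second.get("href"))       # None
print(second.get("href", "#"))  # #

# has_attr() tests for the attribute without reading it.
print(first.has_attr("id"))     # True

try:
    second["href"]
    raised = False
except KeyError:
    raised = True
print(raised)  # True
```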

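4. Modifying the document

Besides querying, you can also add, remove, or replace nodes in the document.

Adding a new tag

new_tag = soup.new_tag("b")
new_tag.string = "Bold Text"
first_paragraph.append(new_tag)
print(first_paragraph)
# Output: <p class="my-class child-class"><i>444</i><h1>Hello World</h1><b>Bold Text</b></p>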

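Removing a tag

tag_to_remove = soup.b
tag_to_remove.decompose()  # removes the tag from the tree and destroys it
print(first_paragraph)
# Output: <p class="my-class child-class"><i>444</i><h1>Hello World</h1></p>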

Replacing a tag

replacement_tag = soup.new_tag("i")
replacement_tag.string = "Italic Text"
first_paragraph.i.replace_with(replacement_tag)
print(first_paragraph)
# Output: <p class="my-class child-class"><i>Italic Text</i><h1>Hello World</h1></p>
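Attributes can be edited like dictionary entries, and extract() removes a tag while handing it back so it can be reused. A minimal sketch on made-up markup:

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only.
soup = BeautifulSoup('<p id="intro">Hello <span>old</span></p>', "html.parser")
p = soup.p

# Attributes behave like dictionary entries.
p["class"] = "lead"
del p["id"]

# extract() detaches the tag from the tree and returns it.
removed = p.span.extract()
print(removed)  # <span>old</span>

# Build a new tag and append it in place of the removed one.
em = soup.new_tag("em")
em.string = "new"
p.append(em)

result = str(p)
print(result)  # <p class="lead">Hello <em>new</em></p>
```

Use decompose() when the removed tag is no longer needed, and extract() when you want to keep it.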

5. Navigating the tree

BeautifulSoup also provides several ways to walk the parse tree.

parent: the parent node

parent_node = first_paragraph.parent
print(parent_node.name)  # Output: body

children: iterator over child nodes

children_nodes = list(first_paragraph.children)  # direct children only
for child in children_nodes:
    print(child)

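siblings: sibling nodes at the same level

There is no attribute named siblings; use next_sibling and previous_sibling instead.

next_sibling = first_paragraph.next_sibling
previous_sibling = first_paragraph.previous_sibling
print(next_sibling)      # Output: None (the <p> is the last child of <body>)
print(previous_sibling)  # Output: <a href="123"></a>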
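On indented, real-world HTML the next sibling is often a whitespace string rather than a tag. The sketch below, on made-up markup, shows how find_next_sibling() and its relatives skip over those strings.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only. Note the line breaks between the tags.
html = """<ul>
  <li>one</li>
  <li>two</li>
  <li>three</li>
</ul>"""
soup = BeautifulSoup(html, "html.parser")
first = soup.li

# The raw next sibling is the whitespace between the two <li> tags.
print(repr(first.next_sibling))

# find_next_sibling() skips text nodes and returns the next matching tag.
second = first.find_next_sibling("li")
print(second.get_text())  # two

# find_next_siblings() returns every following match.
rest = [li.get_text() for li in first.find_next_siblings("li")]
print(rest)  # ['two', 'three']

# The same idea works backwards.
third = soup.find_all("li")[2]
before_third = third.find_previous_sibling("li").get_text()
print(before_third)  # two
```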

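Practical tip (key points)

On real pages the node you want is often hard to locate by hand. The browser's developer tools can copy a CSS selector for any element, and you can pass that selector straight to select() or select_one().

Open the developer tools with F12

Copy the CSS selector of the target image

Right-click the element in the element inspector and copy its CSS selector.

Use it directly in code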

from bs4 import BeautifulSoup

# Parse with the built-in html.parser
html_doc = "<html></html>"
soup = BeautifulSoup(html_doc, 'html.parser')
# Illustration only: the selector below was copied from a real page,
# so it matches nothing in this placeholder document
soup.select('#ice-container > div.tbpc-layout > div.screen-outer.clearfix > div.main > div.core.J_Core > div > div:nth-child(1) > div:nth-child(1) > div > div > div > div > div:nth-child(3) > div > div > a')
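Selectors copied from the browser tend to be long chains of div and nth-child steps, which break as soon as the page layout shifts. One option is to shorten them by hand to a stable id or class. The sketch below assumes hypothetical markup and class names (`container`, `product-link`) made up for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the id and class names are assumptions for this example.
html = """
<div id="container"><div class="main"><div><div>
  <a class="product-link" href="/item/1"><img src="/img/1.jpg" alt="Item 1"></a>
  <a class="product-link" href="/item/2"><img src="/img/2.jpg" alt="Item 2"></a>
</div></div></div></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Anchor on the id and the class, and skip the intermediate <div> levels.
images = soup.select("#container a.product-link > img")
srcs = [img.get("src") for img in images]
print(srcs)  # ['/img/1.jpg', '/img/2.jpg']

# Guard against a selector that no longer matches anything.
first_link = soup.select_one("#container a.product-link")
first_href = first_link.get("href") if first_link is not None else None
print(first_href)  # /item/1
```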

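Closing remarks

All the API calls in this article have been verified and can be run directly 👽👽👽
If something does not run, leave a comment to reach the author 🤭🤭🤭
The wind is free, and so are you 🤠🤠🤠
Everyone is welcome to discuss and learn together
If this helped, please leave a like, a bookmark, and a comment 🥰🥰🥰
Scraping experts, please go easy on me; corrections are welcome 😈😈😈
A series of web scraping articles is coming, so stay tuned 🤡🤡🤡.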

If you repost this article, please credit the source: http://www.pswp.cn/web/70083.shtml