Data Extraction with the bs4 (BeautifulSoup4) Module and CSS Selectors

BeautifulSoup4

from bs4 import BeautifulSoup

Creating an object: <class 'bs4.BeautifulSoup'>

soup = BeautifulSoup(html_source, 'parser')  # the parser is a name such as 'lxml' or 'html.parser'

bs4 object types

(1) Tag: a tag
print(soup.title, type(soup.title))
(2) NavigableString: the text inside a tag. Its type is <class 'bs4.element.NavigableString'>, and it supports the usual string methods.
title = soup.title
# .string returns the text inside the tag
print(title.string, type(title.string))
(3) Comment
# A comment has the type <class 'bs4.element.Comment'>
html = ''
soup2 = BeautifulSoup(html, 'lxml')
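The three object types above can be checked side by side. This is a minimal sketch, not the original example: the markup in `sample` is made up for illustration, and it uses the built-in 'html.parser' instead of 'lxml' so that it runs without an extra dependency.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical markup: <p> holds plain text, <b> holds only an HTML comment.
sample = "<p>Hello</p><b><!--a hidden note--></b>"
demo = BeautifulSoup(sample, "html.parser")

p_text = demo.p.string   # the text inside <p>
b_text = demo.b.string   # the comment inside <b>

print(type(demo.p))      # a Tag
print(type(p_text))      # a NavigableString
print(type(b_text))      # a Comment
# Comment is a subclass of NavigableString, so check for Comment first
# when you need to tell ordinary text and comments apart.
print(isinstance(b_text, NavigableString))
```

Because a comment comes back through .string just like ordinary text, an isinstance check against Comment is a simple way to skip comments while extracting data.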

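Traversing the document tree

# Parse the data
head_tag = soup.p  # soup.p returns the first p tag
# Get the tag's child nodes; .contents returns a list of all direct children
# print(head_tag.contents)

print(head_tag.children)  # returns an iterator rather than a list; loop over it to read the values
for head in head_tag.children:
    print(head)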
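To make the difference between .contents and .children concrete, here is a small sketch. The markup and the variable names are assumptions for illustration, and 'html.parser' stands in for 'lxml'.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical paragraph with text and two links as direct children.
sample = '<p class="story">Once <a id="link1">Elsie</a> and <a id="link2">Lacie</a></p>'
demo = BeautifulSoup(sample, "html.parser")
p_tag = demo.p

# .contents is a real list, so len() and indexing work.
child_list = p_tag.contents
print(len(child_list))

# .children is lazy; it has to be consumed in a loop or comprehension.
# Text nodes are children too, so filter on Tag to keep only the elements.
tag_children = [child for child in p_tag.children if isinstance(child, Tag)]
for child in tag_children:
    print(child["id"])
```

Both attributes cover direct children only; text between tags counts as a child just like a tag does.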

Source:

# 1. Import the module
from bs4 import BeautifulSoup

# Source markup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p><p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p><p class="story">...</p>
"""

# 2. Create the object: <class 'bs4.BeautifulSoup'>
soup = BeautifulSoup(html_doc, 'lxml')

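Getting a node's text

.string

# Get the text of a child tag through its parent tag
# head = soup.head
# print(head.string)

.text

# print(head.text)  # returns the text of all nested tags, concatenated into one string

# strings/stripped_strings
contents = soup.html
# print(contents.string)  # None: <html> has more than one child, so there is no single string to return
# print(contents.text)

.strings

# print(contents.strings)  # <generator object Tag._all_strings at 0x000001E214912820>, a generator object
# .strings yields every piece of text under this tag, including many whitespace-only lines
# for data in contents.strings:
#     print(data)

.stripped_strings

# .stripped_strings yields every piece of text under this tag, with the whitespace-only lines removed
# for data in contents.stripped_strings:
#     print(data)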

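Summary:

Getting a tag's text
string: works when the tag has a single child that carries text, and returns a navigable string; if the tag has several children it returns None. (After BeautifulSoup parses a document, the text between tags is converted to NavigableString objects. For example, in a tag whose content is Hello, World!, the text "Hello, World!" is a NavigableString object. It is called "navigable" because you can reach the surrounding elements from it, such as its parent tag and sibling tags, which together form a tree.)
text: concatenates the text of all nested tags into one string
strings: yields each piece of text in turn, including blank lines; returns a generator object
stripped_strings: yields each piece of text in turn, with the surplus blank lines removed; returns a generator object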
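The four accessors in the summary can be compared on one small document. This is a sketch under assumptions: the markup is invented for the comparison, and 'html.parser' is used in place of 'lxml'.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a <div> with two paragraphs separated by newlines.
sample = "<div>\n<p>first</p>\n<p>second</p>\n</div>"
demo = BeautifulSoup(sample, "html.parser")
div = demo.div

single = demo.p.string              # 'first': <p> has exactly one text child
multi = div.string                  # None: <div> has several children
joined = div.text                   # '\nfirst\nsecond\n': everything concatenated
raw = list(div.strings)             # ['\n', 'first', '\n', 'second', '\n']
clean = list(div.stripped_strings)  # ['first', 'second']

print(repr(single), repr(multi))
print(repr(joined))
print(raw)
print(clean)
```

In scraping code, stripped_strings is usually the most convenient of the four, because it drops the whitespace-only pieces that come from the page's indentation.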

Getting parent nodes

.parent

title_tag = soup.title
print(title_tag)  # <title>The Dormouse's story</title>
print(title_tag.parent)  # <head><title>The Dormouse's story</title></head>

.parents

# Whenever an attribute can return several tags, it returns a generator
a_tag = soup.a
print(a_tag.parents)  # <generator object PageElement.parents at 0x0000027F14C502E0>
for p in a_tag.parents:
    print(p.name)
'''
p
body
html
[document]
'''
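The walk up the tree can be collected into a list, which makes the order easy to see. A minimal sketch with made-up markup, using 'html.parser' rather than 'lxml':

```python
from bs4 import BeautifulSoup

# Hypothetical document with a link nested three levels deep.
sample = "<html><body><p><a href='#'>link</a></p></body></html>"
demo = BeautifulSoup(sample, "html.parser")
a_tag = demo.a

# .parent is the single enclosing tag.
direct_parent = a_tag.parent.name

# .parents walks outward until it reaches the BeautifulSoup object itself,
# whose name is the special string '[document]'.
ancestors = [parent.name for parent in a_tag.parents]

print(direct_parent)  # p
print(ancestors)      # ['p', 'body', 'html', '[document]']
```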

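Getting sibling nodes

Source code:

from bs4 import BeautifulSoup

html2 = """<a>
<b>bbbb</b><c>ccccc</c><d>dddd</d>
</a>"""

soup = BeautifulSoup(html2, 'lxml')
b_tag = soup.c  # despite the variable name, this selects the <c> tag
# print(b_tag.next_sibling)  # the next adjacent sibling node
# print(b_tag.next_siblings)  # a generator over all following sibling nodes
print(b_tag.previous_sibling)  # the previous adjacent sibling node
print(b_tag.previous_siblings)  # a generator over all preceding sibling nodes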

.next_sibling/.next_siblings

# print(b_tag.next_sibling)  # the next adjacent sibling node
# print(b_tag.next_siblings)  # a generator over all following sibling nodes

.previous_sibling/.previous_siblings

print(b_tag.previous_sibling)  # the previous adjacent sibling node
print(b_tag.previous_siblings)  # a generator over all preceding sibling nodes
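The singular and plural sibling attributes behave differently, which the sketch below tries to show. It reuses the a/b/c/d layout from this section but writes it on one line (so there are no whitespace text nodes between the tags) and parses it with 'html.parser'; both choices are assumptions made to keep the output predictable.

```python
from bs4 import BeautifulSoup

sample = "<a><b>bbbb</b><c>ccccc</c><d>dddd</d></a>"
demo = BeautifulSoup(sample, "html.parser")
c_tag = demo.c

# Singular forms return one node, or None when there is no such sibling.
next_name = c_tag.next_sibling.name          # 'd'
prev_name = c_tag.previous_sibling.name      # 'b'
before_first = demo.b.previous_sibling       # None: <b> is the first child

# Plural forms are generators over every sibling in that direction.
after_b = [sib.name for sib in demo.b.next_siblings]      # ['c', 'd']
before_d = [sib.name for sib in demo.d.previous_siblings] # ['c', 'b'], nearest first

print(next_name, prev_name, before_first)
print(after_b, before_d)
```

On real pages the nearest sibling is often a whitespace text node rather than a tag, so it is worth checking the type of what comes back.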

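Using the search methods (core)

find(): find one

# Get an a tag
a_tag = soup.find('a')  # returns the first matching tag
print(a_tag)

find_all(): find all

a_tag_all = soup.find_all('a')
print(a_tag_all)  # returns a list; loop over it to read each tag

# To search for two tag names at once, pass them together as a single tuple or list, not as separate arguments
a_p_tag = soup.find_all(('title', 'b'))  # can also be written as: a_p_tag = soup.find_all(['title', 'b'])
print(a_p_tag)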
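A short sketch of how find() and find_all() differ, including what each returns when nothing matches. The markup is hypothetical and 'html.parser' replaces 'lxml' here.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with one title, one bold tag, and two links.
sample = "<title>T</title><p><b>bold</b><a href='/1'>one</a><a href='/2'>two</a></p>"
demo = BeautifulSoup(sample, "html.parser")

first_a = demo.find("a")                 # only the first match
all_a = demo.find_all("a")               # every match, in document order
both = demo.find_all(["title", "b"])     # several tag names in one call

missing_one = demo.find("table")         # None when nothing matches
missing_all = demo.find_all("table")     # empty list when nothing matches

print(first_a["href"])
print([a["href"] for a in all_a])
print([tag.name for tag in both])
print(missing_one, len(missing_all))
```

The different "nothing found" results matter in practice: calling an attribute on the None from find() raises an error, while looping over the empty result of find_all() simply does nothing.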

Example:

# Import the module
from bs4 import BeautifulSoup

html = """
<table class="tablelist" cellpadding="0" cellspacing="0"><tbody><tr class="h"><td class="l" width="374">職位名稱</td><td>職位類別</td><td>人數</td><td>地點</td><td>發布時間</td></tr><tr class="even"><td class="l square"><a target="_blank" href="position_detail.php?id=33824&keywords=python&tid=87&lid=2218">22989-金融云區塊鏈高級研發工程師（深圳）</a></td><td>技術類</td><td>1</td><td>深圳</td><td>2017-11-25</td></tr><tr class="odd"><td class="l square"><a target="_blank" href="position_detail.php?id=29938&keywords=python&tid=87&lid=2218">22989-金融云高級后臺開發</a></td><td>技術類</td><td>2</td><td>深圳</td><td>2017-11-25</td></tr><tr class="even"><td class="l square"><a target="_blank" href="position_detail.php?id=31236&keywords=python&tid=87&lid=2218">SNG16-騰訊音樂運營開發工程師（深圳）</a></td><td>技術類</td><td>2</td><td>深圳</td><td>2017-11-25</td></tr><tr class="odd"><td class="l square"><a target="_blank" href="position_detail.php?id=31235&keywords=python&tid=87&lid=2218">SNG16-騰訊音樂業務運維工程師（深圳）</a></td><td>技術類</td><td>1</td><td>深圳</td><td>2017-11-25</td></tr><tr class="even"><td class="l square"><a target="_blank" href="position_detail.php?id=34531&keywords=python&tid=87&lid=2218">TEG03-高級研發工程師（深圳）</a></td><td>技術類</td><td>1</td><td>深圳</td><td>2017-11-24</td></tr><tr class="odd"><td class="l square"><a target="_blank" href="position_detail.php?id=34532&keywords=python&tid=87&lid=2218">TEG03-高級圖像算法研發工程師（深圳）</a></td><td>技術類</td><td>1</td><td>深圳</td><td>2017-11-24</td></tr><tr class="even"><td class="l square"><a target="_blank" href="position_detail.php?id=31648&keywords=python&tid=87&lid=2218">TEG11-高級AI開發工程師（深圳）</a></td><td>技術類</td><td>4</td><td>深圳</td><td>2017-11-24</td></tr><tr class="odd"><td class="l square"><a target="_blank" href="position_detail.php?id=32218&keywords=python&tid=87&lid=2218">15851-后臺開發工程師</a></td><td>技術類</td><td>1</td><td>深圳</td><td>2017-11-24</td></tr><tr class="even"><td class="l square"><a target="_blank" 
href="position_detail.php?id=32217&keywords=python&tid=87&lid=2218">15851-后臺開發工程師</a></td><td>技術類</td><td>1</td><td>深圳</td><td>2017-11-24</td></tr><tr class="odd"><td class="l square"><a id="test" class="test" target='_blank' href="position_detail.php?id=34511&keywords=python&tid=87&lid=2218">SNG11-高級業務運維工程師（深圳）</a></td><td>技術類</td><td>1</td><td>深圳</td><td>2017-11-24</td></tr></tbody>
</table>
<a href="https://www.baid.com">百度一下</a>
<a href="https://www.douban.com">豆瓣一下</a>
<a herf="https://www.python.com">python一下</a>
"""

# Instantiate an object
soup = BeautifulSoup(html, 'lxml')

# Get all tr tags
# trs = soup.find_all('tr')

# Get the tags with class="even"; bs4 locates by the class attribute with class_
# trs = soup.find_all('tr', class_="even")
# trs = soup.find_all('tr', attrs={"class": "even"})
# print(trs)

# <a id="test" class="test"
# a = soup.find_all('a', id="test", class_="test")
# When find_all finds nothing, it returns an empty list
# trs = soup.find_all('a', attrs={"class": "test", "id": "test"})
# print(trs)

# Get the href attribute value of every a tag
# a_lst = soup.find_all('a')
# for a in a_lst:
#     '''
#     get(): returns None if the attribute does not exist
#     tag['attribute']: raises an error if the attribute does not exist
#     '''
#     # href = a.get('href')
#     href = a['href']
#     print(href)

# Get the job titles
trs = soup.find_all('tr')[1:]  # skip the header row
for tr in trs:
    # print(tr)
    a = tr.find('a')
    print(a.string)

Getting the tags with class="even": locating by the class attribute in bs4

# trs = soup.find_all('tr', class_="even")
# trs = soup.find_all('tr', attrs={"class": "even"})
# print(trs)

Locating by id and other attributes

# <a id="test" class="test"
# a = soup.find_all('a', id="test", class_="test")
# When find_all finds nothing, it returns an empty list
# trs = soup.find_all('a', attrs={"class": "test", "id": "test"})
# print(trs)
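The two filtering styles above (keyword arguments and the attrs dictionary) should select the same tags. The sketch below checks that on a cut-down table; the markup is a made-up stand-in for the job listing, and 'html.parser' is used instead of 'lxml'.

```python
from bs4 import BeautifulSoup

# Hypothetical three-row table modelled loosely on the job listing.
sample = """<table>
<tr class="h"><td>header</td></tr>
<tr class="even"><td><a id="test" class="test" href="/a">A</a></td></tr>
<tr class="odd"><td><a href="/b">B</a></td></tr>
</table>"""
demo = BeautifulSoup(sample, "html.parser")

# class is a reserved word in Python, hence the trailing underscore in class_.
even_kw = demo.find_all("tr", class_="even")
even_attrs = demo.find_all("tr", attrs={"class": "even"})

# Several conditions can be combined; a tag must satisfy all of them.
by_both = demo.find_all("a", id="test", class_="test")
none_found = demo.find_all("a", class_="missing")

print(len(even_kw), len(even_attrs))
print(by_both[0]["href"])
print(len(none_found))
```

The attrs dictionary is the form to reach for when an attribute name cannot be written as a Python keyword argument, for example one that contains a hyphen.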

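Getting the href attribute value of every a tag

get(): returns None if the attribute does not exist

tag['attribute']: raises an error (KeyError) if the attribute does not exist

# a_lst = soup.find_all('a')
# for a in a_lst:
#     '''
#     get(): returns None if the attribute does not exist
#     tag['attribute']: raises an error if the attribute does not exist
#     '''
#     # href = a.get('href')
#     href = a['href']
#     print(href)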
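The difference between get() and square-bracket access shows up as soon as one link lacks the attribute. In this sketch the second link deliberately misspells href as herf, as one of the links in the example markup does; the URLs are placeholders and 'html.parser' replaces 'lxml'.

```python
from bs4 import BeautifulSoup

# Hypothetical links; the second one has a misspelled attribute name.
sample = '<a href="https://example.com/one">one</a><a herf="https://example.com/two">two</a>'
demo = BeautifulSoup(sample, "html.parser")
links = demo.find_all("a")

# get() never raises; a missing attribute gives None (or a default you pass in).
safe = [a.get("href") for a in links]
with_default = [a.get("href", "") for a in links]

# Square brackets behave like a dict lookup and raise KeyError when missing.
try:
    links[1]["href"]
    strict_failed = False
except KeyError:
    strict_failed = True

print(safe)
print(with_default)
print(strict_failed)
```

When scraping pages you do not control, get() is usually the safer choice, with the None values filtered out afterwards.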

# Get the job titles
trs = soup.find_all('tr')[1:]  # skip the header row
for tr in trs:
    # print(tr)
    a = tr.find('a')
    print(a.string)

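CSS selectors

select: finds tags; it returns all matches by default, and the return type is a list

print(soup.select('a'))  # find all a tags

To locate tags with class='sister', use .sister (a dot followed by the class value)

print(soup.select('.sister'))

To locate the tag with id="link1", use #link1

print(soup.select('#link1'))

To locate a p tag whose id is link1, use p#link1

print(soup.select('p#link1'))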
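The selectors above can be tried on a two-link document. This is a sketch with invented markup, parsed with 'html.parser' instead of 'lxml'. Note that p#link1 only matches a p tag that carries that id itself; in this sample the id sits on an a tag, so the result is empty.

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph with two links that share a class.
sample = (
    '<p class="story" id="intro">'
    '<a class="sister" id="link1" href="/elsie">Elsie</a>'
    '<a class="sister" id="link2" href="/lacie">Lacie</a>'
    '</p>'
)
demo = BeautifulSoup(sample, "html.parser")

by_tag = demo.select("a")            # tag name
by_class = demo.select(".sister")    # .class-value
by_id = demo.select("#link1")        # #id-value
p_with_id = demo.select("p#link1")   # tag name AND id on the same element
a_with_id = demo.select("a#link1")

# select_one() returns the first match directly instead of a list.
second = demo.select_one("#link2")

print(len(by_tag), len(by_class), len(by_id))
print(len(p_with_id), len(a_with_id))
print(second["href"])
```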

Getting the text under the title tag. Even when select matches only one element, the return type is still a list

print(soup.select('title')[0].string)
print(soup.select('title')[0].text)
print(list(soup.select('title')[0].strings))
print(list(soup.select('title')[0].stripped_strings))

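p a: selects all a elements located inside a p element, no matter how many levels deep they are nested

print(soup.select('p a'))

p>a: selects all a tags that are direct children of a p element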

print(soup.select('p>a'))
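The difference between the two combinators only becomes visible when a link is wrapped in another tag. The sketch below assumes a small made-up paragraph where one a tag is a direct child and the other sits inside a span; 'html.parser' is used instead of 'lxml'.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one direct child link, one link nested inside <span>.
sample = (
    '<p class="story">'
    '<a id="direct">Direct</a>'
    '<span><a id="nested">Nested</a></span>'
    '</p>'
)
demo = BeautifulSoup(sample, "html.parser")

# Descendant combinator (space): any depth below <p>.
descendants = [a["id"] for a in demo.select("p a")]

# Child combinator (>): only links whose parent is the <p> itself.
children = [a["id"] for a in demo.select("p > a")]

print(descendants)  # ['direct', 'nested']
print(children)     # ['direct']
```

If every link in a paragraph is wrapped in a span, p>a matches nothing while p a still finds them all.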

<p class="story">Once upon a time there were three little sisters; and their names were<span><a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>,<a href="sister">余承東</a>;and they lived at the bottom of a well.</span>
</p>