Key Points to Master in the Web Crawler Section

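- I. Prerequisites
  - 1. How the Web works
  - 2. The Robots protocol for web crawlers
- II. Fetching web pages
  - 1. Requesting the server and fetching a page
  - 2. Checking the status code of the server's response
  - 3. Printing the page content
- III. Locating page elements with BeautifulSoup
  - 1. Importing the BeautifulSoup library
  - 2. Finding tag elements with find/find_all
- IV. Getting an element's attribute values
- V. Getting the text an element contains
  - 1. Referencing get_text without calling it shows the element's HTML
  - 2. Using the text attribute to view the text of the element and its descendants (may include whitespace)
  - 3. Using the stripped_strings attribute to view the text of the element and its descendants with surrounding whitespace stripped
- VI. Traversing document elements
- VII. Exercises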

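I. Prerequisites

1. How the Web works

"Web service" is short for the World Wide Web service provided over the Internet. The simplest Web service uses a two-tier architecture in which a browser talks directly to a Web server.
This architecture, in which the browser interacts with a Web server, is also called the B/S (Browser/Server) architecture. Text, images, and other information are stored on the Web server as static HTML pages before any request arrives. When an HTTP request arrives, the Web server responds by sending the page to the client's browser. This is static web page technology.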

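2. The Robots protocol for web crawlers

Robots protocol: a robots.txt file in the root directory of a website tells web crawlers which pages they may fetch and which they may not, for example http://baidu.com/robots.txt. The Robots protocol is advisory rather than binding: a crawler can ignore it, but doing so carries legal risk.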
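
A crawler can check these rules before fetching a page. The following is a minimal sketch using Python's standard urllib.robotparser module; the robots.txt content, the crawler name "my-crawler", and the example.com URLs are all made up for illustration, and the rules are inlined so that no network access is needed.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example needs no network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

def build_parser(robots_txt: str) -> RobotFileParser:
    """Create a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

parser = build_parser(ROBOTS_TXT)
# "my-crawler" is an assumed user-agent name for this sketch.
allowed_public = parser.can_fetch("my-crawler", "http://example.com/index.html")
allowed_private = parser.can_fetch("my-crawler", "http://example.com/private/data.html")
print(allowed_public, allowed_private)
```

For a live site, the same parser can instead be pointed at the real file with set_url() followed by read().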

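II. Fetching web pages

1. Requesting the server and fetching a page

Suppose we want to use the Requests library to fetch the page at http://httpbin.org/. The main steps are:
(1) Import the requests library
(2) Call requests.get() to fetch the page

import requests

# Address of the page to fetch
url = 'http://httpbin.org/'
# Send an HTTP GET request; the server's reply is returned as a Response object
response = requests.get(url=url)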

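2. Checking the status code of the server's response

response.status_code

Output:

200

A status_code of 200 means that the client (here the requests library, playing the role of the browser) received the page from the server successfully.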
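
Codes other than 200 are worth interpreting too. The sketch below uses only the standard library's http.HTTPStatus to turn a numeric code into a readable description; the helper name describe_status and the category labels are illustrative choices, not part of Requests.

```python
from http import HTTPStatus

def describe_status(code: int) -> str:
    """Return a short human-readable description of an HTTP status code."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # The code is not a registered HTTP status.
        phrase = "Unknown status"
    if 200 <= code < 300:
        category = "success"
    elif 300 <= code < 400:
        category = "redirection"
    elif 400 <= code < 500:
        category = "client error"
    elif 500 <= code < 600:
        category = "server error"
    else:
        category = "other"
    return f"{code} {phrase} ({category})"

for code in (200, 404, 503):
    print(describe_status(code))
```

With Requests itself, calling response.raise_for_status() raises an exception for 4xx and 5xx responses, which is a common way to stop early on a failed fetch.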

3. Printing the page content

print(response.text)

Output:

<!DOCTYPE html>
<html lang="en"><head><meta charset="UTF-8"><title>httpbin.org</title><link href="https://fonts.googleapis.com/css?family=Open+Sans:400,700|Source+Code+Pro:300,600|Titillium+Web:400,600,700"rel="stylesheet"><link rel="stylesheet" type="text/css" href="/flasgger_static/swagger-ui.css"><link rel="icon" type="image/png" href="/static/favicon.ico" sizes="64x64 32x32 16x16" /><style>html {box-sizing: border-box;overflow: -moz-scrollbars-vertical;overflow-y: scroll;}*,*:before,*:after {box-sizing: inherit;}body {margin: 0;background: #fafafa;}</style>
</head><body><a href="https://github.com/requests/httpbin" class="github-corner" aria-label="View source on Github"><svg width="80" height="80" viewBox="0 0 250 250" style="fill:#151513; color:#fff; position: absolute; top: 0; border: 0; right: 0;"aria-hidden="true"><path d="M0,0 L115,115 L130,115 L142,142 L250,250 L250,0 Z"></path><path d="M128.3,109.0 C113.8,99.7 119.0,89.6 119.0,89.6 C122.0,82.7 120.5,78.6 120.5,78.6 C119.2,72.0 123.4,76.3 123.4,76.3 C127.3,80.9 125.5,87.3 125.5,87.3 C122.9,97.6 130.6,101.9 134.4,103.2"fill="currentColor" style="transform-origin: 130px 106px;" class="octo-arm"></path><path d="M115.0,115.0 C114.9,115.1 118.7,116.5 119.8,115.4 L133.7,101.6 C136.9,99.2 139.9,98.4 142.2,98.6 C133.8,88.0 127.5,74.4 143.8,58.0 C148.5,53.4 154.0,51.2 159.7,51.0 C160.3,49.4 163.2,43.6 171.4,40.1 C171.4,40.1 176.1,42.5 178.8,56.2 C183.1,58.6 187.2,61.8 190.9,65.4 C194.5,69.0 197.7,73.2 200.1,77.6 C213.8,80.2 216.3,84.9 216.3,84.9 C212.7,93.1 206.9,96.0 205.4,96.6 C205.1,102.4 203.0,107.8 198.3,112.5 C181.9,128.9 168.3,122.5 157.7,114.1 C157.9,116.9 156.7,120.9 152.7,124.9 L141.0,136.5 C139.8,137.7 141.6,141.9 141.8,141.8 Z"fill="currentColor" class="octo-body"></path></svg></a><svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" style="position:absolute;width:0;height:0"><defs><symbol viewBox="0 0 20 20" id="unlocked"><path d="M15.8 8H14V5.6C14 2.703 12.665 1 10 1 7.334 1 6 2.703 6 5.6V6h2v-.801C8 3.754 8.797 3 10 3c1.203 0 2 .754 2 2.199V8H4c-.553 0-1 .646-1 1.199V17c0 .549.428 1.139.951 1.307l1.197.387C5.672 18.861 6.55 19 7.1 19h5.8c.549 0 1.428-.139 1.951-.307l1.196-.387c.524-.167.953-.757.953-1.306V9.199C17 8.646 16.352 8 15.8 8z"></path></symbol><symbol viewBox="0 0 20 20" id="locked"><path d="M15.8 8H14V5.6C14 2.703 12.665 1 10 1 7.334 1 6 2.703 6 5.6V8H4c-.553 0-1 .646-1 1.199V17c0 .549.428 1.139.951 1.307l1.197.387C5.672 18.861 6.55 19 7.1 19h5.8c.549 0 1.428-.139 
1.951-.307l1.196-.387c.524-.167.953-.757.953-1.306V9.199C17 8.646 16.352 8 15.8 8zM12 8H8V5.199C8 3.754 8.797 3 10 3c1.203 0 2 .754 2 2.199V8z"/></symbol><symbol viewBox="0 0 20 20" id="close"><path d="M14.348 14.849c-.469.469-1.229.469-1.697 0L10 11.819l-2.651 3.029c-.469.469-1.229.469-1.697 0-.469-.469-.469-1.229 0-1.697l2.758-3.15-2.759-3.152c-.469-.469-.469-1.228 0-1.697.469-.469 1.228-.469 1.697 0L10 8.183l2.651-3.031c.469-.469 1.228-.469 1.697 0 .469.469.469 1.229 0 1.697l-2.758 3.152 2.758 3.15c.469.469.469 1.229 0 1.698z"/></symbol><symbol viewBox="0 0 20 20" id="large-arrow"><path d="M13.25 10L6.109 2.58c-.268-.27-.268-.707 0-.979.268-.27.701-.27.969 0l7.83 7.908c.268.271.268.709 0 .979l-7.83 7.908c-.268.271-.701.27-.969 0-.268-.269-.268-.707 0-.979L13.25 10z"/></symbol><symbol viewBox="0 0 20 20" id="large-arrow-down"><path d="M17.418 6.109c.272-.268.709-.268.979 0s.271.701 0 .969l-7.908 7.83c-.27.268-.707.268-.979 0l-7.908-7.83c-.27-.268-.27-.701 0-.969.271-.268.709-.268.979 0L10 13.25l7.418-7.141z"/></symbol><symbol viewBox="0 0 24 24" id="jump-to"><path d="M19 7v4H5.83l3.58-3.59L8 6l-6 6 6 6 1.41-1.41L5.83 13H21V7z" /></symbol><symbol viewBox="0 0 24 24" id="expand"><path d="M10 18h4v-2h-4v2zM3 6v2h18V6H3zm3 7h12v-2H6v2z" /></symbol></defs></svg><div id="swagger-ui"><div data-reactroot="" class="swagger-ui"><div><div class="information-container wrapper"><section class="block col-12"><div class="info"><hgroup class="main"><h2 class="title">httpbin.org<small><pre class="version">0.9.2</pre></small></h2><pre class="base-url">[ Base URL: httpbin.org/ ]</pre></hgroup><div class="description"><div class="markdown"><p>A simple HTTP Request &amp; Response Service.<br><br><b>Run locally: </b><code>$ docker run -p 80:80 kennethreitz/httpbin</code></p></div></div><div><div><a href="https://kennethreitz.org" target="_blank">the developer - Website</a></div><a href="mailto:me@kennethreitz.org">Send email to the developer</a></div></div><!-- ADDS THE LOADER SPINNER 
--><div class="loading-container"><div class="loading"></div></div></section></div></div></div></div><div class='swagger-ui'><div class="wrapper"><section class="clear"><span style="float: right;">[Powered by<a target="_blank" href="https://github.com/rochacbruno/flasgger">Flasgger</a>]<br></span></section></div></div><script src="/flasgger_static/swagger-ui-bundle.js"> </script><script src="/flasgger_static/swagger-ui-standalone-preset.js"> </script><script src='/flasgger_static/lib/jquery.min.js' type='text/javascript'></script><script>window.onload = function () {fetch("/spec.json").then(function (response) {response.json().then(function (json) {var current_protocol = window.location.protocol.slice(0, -1);if (json.schemes[0] != current_protocol) {// Switches scheme to the current in usevar other_protocol = json.schemes[0];json.schemes[0] = current_protocol;json.schemes[1] = other_protocol;}json.host = window.location.host;  // sets the current hostconst ui = SwaggerUIBundle({spec: json,validatorUrl: null,dom_id: '#swagger-ui',deepLinking: true,jsonEditor: true,docExpansion: "none",apisSorter: "alpha",//operationsSorter: "alpha",presets: [SwaggerUIBundle.presets.apis,// yay ES6 modules ↘Array.isArray(SwaggerUIStandalonePreset) ? SwaggerUIStandalonePreset : SwaggerUIStandalonePreset.default],plugins: [SwaggerUIBundle.plugins.DownloadUrl],// layout: "StandaloneLayout"  // uncomment to enable the green top header})window.ui = ui// uncomment to rename the top brand if layout is enabled// $(".topbar-wrapper .link span").replaceWith("<span>httpbin</span>");})})
}</script>  <div class='swagger-ui'><div class="wrapper"><section class="block col-12 block-desktop col-12-desktop"><div><h2>Other Utilities</h2><ul><li><a href="/forms/post">HTML form</a> that posts to /post /forms/post</li></ul><br /><br /></div></section></div>
</div>
</body></html>

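III. Locating page elements with BeautifulSoup

The following fragment of a web page is used to demonstrate how to find content on a page with BeautifulSoup. The exact whitespace in the outputs below depends on the line breaks and indentation of this string, so your output may differ slightly.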

html='''<html><head><title>The Dormouse's story</title></head><body><p class="title"><b>The Dormouse's story</b></p><p class="story">Once upon a time there were three little sisters; and their names were<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link2">Tillie</a>; and they lived at the bottom of a well.</p><p class="story">愛麗絲夢游仙境</p></body></html>'''

1. Importing the BeautifulSoup library

Parameters: html is the HTML document string defined above, and 'html.parser' selects Python's built-in HTML parser to parse it.

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')


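| Basic element | Description |
| --- | --- |
| Tag | A tag, the most basic unit of information; its start and end are marked with `<>` and `</>` |
| Name | The tag's name; the name of `<p>...</p>` is 'p'. Access: `<tag>.name` |
| Attributes | The tag's attributes, organized as a dictionary. Access: `<tag>.attrs` |
| NavigableString | The non-attribute string inside a tag, i.e. the string between `<>` and `</>`. Access: `<tag>.string` |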
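
The four basic elements can be seen side by side on a single tag. This is a small sketch on a made-up snippet (the variable names snippet and soup_demo and the markup itself are assumptions, kept separate from the tutorial's html string).

```python
from bs4 import BeautifulSoup

# A tiny hypothetical snippet, separate from the tutorial's html string.
snippet = '<p class="title" id="intro">Hello</p>'
soup_demo = BeautifulSoup(snippet, 'html.parser')

tag = soup_demo.p                 # Tag: the first <p> element
print(type(tag).__name__)         # class name of the object
print(tag.name)                   # Name: the tag's name
print(tag.attrs)                  # Attributes: a dictionary
print(type(tag.string).__name__)  # NavigableString: the text inside the tag
print(tag.string)
```

Note that class is stored as a list because an element may carry several classes, while id is stored as a plain string.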

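2. Finding tag elements with find/find_all

(1) Getting to know HTML tag elements
The original figure here showed a single line of HTML containing an img tag. An element generally consists of a start tag and an end tag (although in standard HTML img is a void element written without an end tag). Its tag name is img, and it carries two attributes, src and size.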
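
To make the tag name and attributes concrete, here is a rough sketch that uses the standard library's html.parser module (the parser that 'html.parser' refers to) to list each start tag with its attributes. The class name TagCollector and the markup, including the file name and size value, are invented for the example.

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Record every start tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []  # list of (tag_name, attribute_dict) pairs

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.tags.append((tag, dict(attrs)))

collector = TagCollector()
# Hypothetical markup; the file name and size value are made up.
collector.feed('<img src="photo.png" size="10"><p class="note">caption</p>')
print(collector.tags)
```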

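(2) The find function returns the first tag that matches the given conditions

View the help for find (the trailing ? is IPython/Jupyter syntax):

soup.find?

Output: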

Signature: soup.find(name=None, attrs={}, recursive=True, text=None, **kwargs)
Docstring:
Return only the first child of this Tag matching the given
criteria.
File:      d:\dell\appdata\anaconda3\lib\site-packages\bs4\element.py
Type:      method

Find the first <p> element (tag) in the document:

first_p = soup.find("p")
first_p

Output:

<p class="title">
<b>The Dormouse's story</b>
</p>

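(3) Inspecting the type and attributes of the element found

# Print the type of the element found; it is bs4.element.Tag
print(type(first_p))
# Show the attributes of the element found; they form a dictionary
first_p.attrs

Output: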

<class 'bs4.element.Tag'>
{'class': ['title']}

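(4) The find_all function returns all tags that match the given conditions, collected in a list

The prototype of find_all is:

find_all(self, name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)

self indicates that it is a method of a class;
name is the name of the tag to look for; it defaults to None, and if it is omitted all elements are searched;
attrs holds the attributes to match, given as a dictionary; it is empty by default, and if supplied only elements with the specified attributes are returned;
recursive specifies whether the search covers the entire subtree below the element; it defaults to True, and with False only direct children are examined;
text matches on the string an element contains, limit caps the number of results returned, and kwargs lets attribute filters be passed as keyword arguments;
find_all returns a list (a bs4 ResultSet) of all matching elements, each of which is a bs4.element.Tag object.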
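
The effect of recursive, limit, and text matching is easiest to see on a small tree. The sketch below uses made-up markup and variable names; it also uses the keyword string, which newer Beautiful Soup releases accept as the preferred spelling of the text argument shown in the prototype above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this sketch.
doc = ('<div><p>one</p><p>two</p>'
       '<section><p>three</p></section></div>')
soup_demo = BeautifulSoup(doc, 'html.parser')
div = soup_demo.div

all_p = div.find_all('p')                      # searches the whole subtree
direct_p = div.find_all('p', recursive=False)  # direct children of div only
first_two = div.find_all('p', limit=2)         # stop after two matches
by_text = div.find_all('p', string='three')    # <p> tags whose string is 'three'

print(len(all_p), len(direct_p), len(first_two))
print(by_text)
```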

Find all <a> elements in the document:

a_ls = soup.find_all('a')
for a in a_ls:
    print(a)

Output:

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link2">Tillie</a>

(5) Find the p elements with class='story' in the document

p_story = soup.find_all('p', attrs={"class": "story"})
p_story

Output:

[<p class="story">Once upon a time there were three little sisters; and their names were<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link2">Tillie</a>; and they lived at the bottom of a well.</p>, <p class="story">愛麗絲夢游仙境</p>]

(6) Exercise: find the elements with class='sister' in the document

all_sister = soup.find_all(attrs={"class": "sister"})
all_sister

Output:

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link2">Tillie</a>]
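
Beautiful Soup offers two other common ways to express the same class filter: the class_ keyword argument and a CSS selector passed to select(). The sketch below compares all three on made-up markup (doc and soup_demo are names chosen for the example).

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this sketch.
doc = ('<p class="story">text</p>'
       '<a class="sister" href="/a">A</a>'
       '<a class="sister big" href="/b">B</a>')
soup_demo = BeautifulSoup(doc, 'html.parser')

by_attrs = soup_demo.find_all(attrs={"class": "sister"})
# class is a reserved word in Python, so the keyword is spelled class_
by_keyword = soup_demo.find_all(class_="sister")
# CSS selector: <a> elements that carry the class sister
by_css = soup_demo.select("a.sister")

for result in (by_attrs, by_keyword, by_css):
    print([a["href"] for a in result])
```

An element with several classes, such as class="sister big", still matches a filter on any single one of them.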

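IV. Getting an element's attribute values

(1) Checking whether an element has a given attribute

# Check whether the first <p> element in the document has a class attribute
first_p.has_attr("class")

Output: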

True

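(2) Getting the value of an attribute

Attribute names and values form a dictionary, so attribute values are read with dictionary-style access.

# Print the href attribute value of every <a> element in the document:
a_ls = soup.find_all('a')
for a in a_ls:
    print(a["href"])

Output: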

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
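
Dictionary-style access raises KeyError when an element lacks the attribute, which happens easily on real pages. A minimal sketch of two safer patterns follows, on made-up markup in which the second link has no href.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second link has no href attribute.
doc = '<a href="http://example.com/elsie">Elsie</a><a name="anchor">no link</a>'
soup_demo = BeautifulSoup(doc, 'html.parser')

hrefs = []
for a in soup_demo.find_all('a'):
    # a["href"] would raise KeyError on the second link; get() returns None instead.
    hrefs.append(a.get('href'))
print(hrefs)

# href=True keeps only the elements that have the attribute at all.
with_href = [a['href'] for a in soup_demo.find_all('a', href=True)]
print(with_href)
```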

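V. Getting the text an element contains

First find the first p element with class='story'.

p_story_fst = soup.find('p', attrs={"class": "story"})

1. Referencing get_text without calling it shows the element's HTML

get_text is a method, not an attribute. Printing it without parentheses shows the bound-method representation, which embeds the element's HTML. Call get_text() to get the text itself, which gives the same result as the text attribute in the next step.

print(p_story_fst.get_text)

Output: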

<bound method Tag.get_text of <p class="story">Once upon a time there were three little sisters; and their names were<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link2">Tillie</a>; and they lived at the bottom of a well.</p>>

2. Using the text attribute to view the text of the element and its descendants (may include whitespace)

p_story_fst.text

Output:

'\n    Once upon a time there were three little sisters; and their names were\n    \n     Elsie\n    \n    ,\n    \n     Lacie\n    \n    and\n    \n     Tillie\n    \n    ; and they lived at the bottom of a well.\n   '

3. Using the stripped_strings attribute to view the text of the element and its descendants with surrounding whitespace stripped

list(p_story_fst.stripped_strings)

Output:

['Once upon a time there were three little sisters; and their names were','Elsie',',','Lacie','and','Tillie','; and they lived at the bottom of a well.']
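
The three ways of reading text can be compared directly. The following sketch runs them on a small made-up paragraph with deliberate extra whitespace; it also shows get_text() with a separator and strip=True, which joins the stripped pieces into one string.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with deliberate extra whitespace.
doc = '<p>\n  Hello,\n  <b> world </b>\n  !\n</p>'
soup_demo = BeautifulSoup(doc, 'html.parser')
p = soup_demo.p

raw = p.get_text()                    # same as p.text, whitespace preserved
pieces = list(p.stripped_strings)     # each string stripped, blank ones dropped
joined = p.get_text(" ", strip=True)  # stripped pieces joined with a space

print(repr(raw))
print(pieces)
print(joined)
```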

VI. Traversing document elements

(1) First find the first p element with class='story'

p_story_fst = soup.find('p', attrs={"class": "story"})
p_story_fst

Output:

<p class="story">Once upon a time there were three little sisters; and their names were<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link2">Tillie</a>; and they lived at the bottom of a well.</p>

(2) Traverse downward to the child nodes

for child in p_story_fst.children:
    print(child)

Output:

Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
and
<a class="sister" href="http://example.com/tillie" id="link2">Tillie</a>
; and they lived at the bottom of a well.

(3) Traverse upward to the parent element

parent = p_story_fst.parent
parent.name

Output:

'body'

(4) Traverse sideways to the preceding sibling nodes

list(p_story_fst.previous_siblings)

Output:

['\n', <p class="title"><b>The Dormouse's story</b></p>, '\n']

(5) Traverse sideways to the following sibling nodes

list(p_story_fst.next_siblings)

Output:

['\n', <p class="story">愛麗絲夢游仙境</p>, '\n']
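
As the sibling lists show, text nodes such as '\n' count as siblings too. The sketch below, on made-up markup, contrasts next_sibling with find_next_sibling(), which skips text nodes, and also walks the tree with descendants and parents.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup used only for this sketch.
doc = '<body>\n<p id="a">first <b>bold</b></p>\n<p id="b">second</p>\n</body>'
soup_demo = BeautifulSoup(doc, 'html.parser')
first = soup_demo.find('p', id='a')

# next_sibling is the newline text node between the two <p> elements ...
raw_sibling = first.next_sibling
# ... while find_next_sibling() skips text nodes and returns the next matching Tag.
tag_sibling = first.find_next_sibling('p')

# descendants walks the whole subtree; keep only the Tag nodes here.
descendant_tags = [node.name for node in first.descendants if isinstance(node, Tag)]
# parents walks upward until the BeautifulSoup object itself is reached.
ancestor_names = [parent.name for parent in first.parents]

print(repr(raw_sibling), tag_sibling['id'])
print(descendant_tags, ancestor_names)
```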

VII. Exercises

test='''<html><head></head><body><span>1234 
<a href="www.test.edu.cn">This is a test!<b>abc</b></a></span> 
</body></html>'''

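(1) Write the code that imports the BeautifulSoup library and creates a BeautifulSoup object:

from bs4 import BeautifulSoup
soup = BeautifulSoup(test, 'html.parser')

(2) Complete the code so that pos locates (points to) the span element node in the HTML above:

pos = soup.find('span')
pos

Output: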

<span>1234 
<a href="www.test.edu.cn">This is a test!<b>abc</b></a></span>

(3) Complete the code so that it prints all the text inside the span element (including the text of its descendants):

print(pos.get_text())

Output:

1234 
This is a test!abc

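(4) Complete the code so that it prints the text that comes directly after the span element (not the text of its descendant elements):

print(pos.next_sibling.string.strip())

Output (an empty line):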
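
The printed line is empty because next_sibling is the whitespace that follows the closing </span> tag. If the goal is instead the text held directly by span itself, without the text of nested tags, one possible sketch is to filter span's children for strings. This is an alternative reading of the exercise, not the original answer.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Same structure as the exercise's test string.
test = '''<html><head></head><body><span>1234
<a href="www.test.edu.cn">This is a test!<b>abc</b></a></span>
</body></html>'''
span = BeautifulSoup(test, 'html.parser').find('span')

# Keep only the non-blank strings that are direct children of span,
# skipping nested tags such as <a>.
direct_text = [child.strip() for child in span.children
               if isinstance(child, NavigableString) and child.strip()]
print(direct_text)
```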

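(5) Find the names of the child and parent nodes of the a element

# Locate the a element node
a_tag = soup.find('a')
# Print the names of the a element's child nodes (a text node has the name None)
for child in a_tag.children:
    print("Child node name:", child.name)
# Print the name of the a element's parent node
print("Parent node name:", a_tag.parent.name)

Output: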

Child node name: None
Child node name: b
Parent node name: span

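(6) Find the hyperlink information of the a element

# Locate the a element node
a_tag = soup.find('a')
# Get the URL of the hyperlink
link_url = a_tag.get('href')
print("Link URL:", link_url)
# Get the text of the hyperlink
link_text = a_tag.get_text()
print("Link Text:", link_text)

Output: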

Link URL: www.test.edu.cn
Link Text: This is a test!abc

(7) Find the sibling information of the a element

# Locate the a element node
a_tag = soup.find('a')
# Text of the next sibling node, or None if there is no next sibling
next_sibling_text = a_tag.next_sibling.string.strip() if a_tag.next_sibling else None
print("Next Sibling Text:", next_sibling_text)
# Text of the previous sibling node, or None if there is no previous sibling
prev_sibling_text = a_tag.previous_sibling.string.strip() if a_tag.previous_sibling else None
print("Previous Sibling Text:", prev_sibling_text)

Output:

Next Sibling Text: None
Previous Sibling Text: 1234