Python Web Scraping, Part 6: Basic Usage of the requests Library

Preface

1. Preparation

2. A First Example

3. GET Requests

3.1 Basic Example

3.2 Fetching a Web Page

3.3 Fetching Binary Data

3.4 Adding Headers

4. POST Requests

5. Responses

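Preface

Earlier we covered the basics of urllib. In practice, though, urllib has some inconveniences. Take web authentication and cookie handling: with urllib, you have to write an Opener and a Handler to get the job done.

The more powerful requests library makes these operations much easier. With requests, managing cookies, handling login authentication, and configuring proxies all take very little code. We start with the basic usage of the requests library.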

1. Preparation

Before using requests, make sure the library is installed correctly. If it is not installed yet, use one of the following commands:
- For Python 2:

pip install requests

- For Python 3 (recommended):

pip3 install requests
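
A quick way to confirm that the installation worked is to import the library and print its version. This is a minimal check rather than part of the original walkthrough, and the version string it prints depends on what pip installed.

```python
import requests

# __version__ holds the installed release as a string
print(requests.__version__)
```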

2. A First Example

urllib sends a GET request for a web page with the urlopen() method. The requests counterpart is get(), a name that states much more directly that it issues a GET request.

Here is an example:

import requests
r = requests.get('https://www.baidu.com/')
print(type(r))
print(r.status_code)
print(type(r.text))
print(r.text)
print(r.cookies)

The output is as follows:

<class 'requests.models.Response'>
200
<class 'str'>
<html>
<head>
<script>
location.replace(location.href.replace("https://","http://"));
</script>
</head>
<body>
<noscript><meta http-equiv="refresh" content="0;url=http://www.baidu.com/"></noscript>
</body>
</html>
<RequestsCookieJar[<Cookie BIDUPSID=992C3B26F4C4D09505C5E959D5FBC005 for .baidu.com/>,<Cookie PSTM=1472227535 for .baidu.com/>,<Cookie bs=153047544986095451480040NN20303CO2FNNNO for .www.baidu.com/>,<Cookie BD_NOT_HTTPS=1 for www.baidu.com/>]>

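Here, calling the get() method of requests does the same job as urlopen() in urllib and returns a Response object. We then print the type of that object, the status code, the type and content of the response body, and the cookies. The output shows that the Response object is of type requests.models.Response, the response body is a str, and the cookies are a RequestsCookieJar.

Sending a GET request with get() is nothing special. What makes requests more convenient is that other request types, such as POST and PUT, also take a single line: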

r = requests.post('http://httpbin.org/post')
r = requests.put('http://httpbin.org/put')
r = requests.delete('http://httpbin.org/delete')
r = requests.head('http://httpbin.org/get')
r = requests.options('http://httpbin.org/get')

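Here post(), put(), delete(), head(), and options() issue POST, PUT, DELETE, HEAD, and OPTIONS requests respectively. That is far simpler than urllib, and it is only a small part of what requests can do.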
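
As a rough sketch of what these helpers have in common, the snippet below builds one request per HTTP method without sending anything. It relies on the public Request class and its prepare() method, which assemble the outgoing message locally. The http://httpbin.org/anything URL is an illustrative assumption, not one used in the examples above.

```python
import requests

methods = ["GET", "POST", "PUT", "DELETE", "HEAD", "OPTIONS"]

# prepare() only assembles the request; no network traffic happens here
prepared = [requests.Request(m, "http://httpbin.org/anything").prepare()
            for m in methods]

for p in prepared:
    print(p.method, p.url)
```

Each helper such as requests.post(url) is shorthand for requests.request('POST', url), so the same keyword arguments apply to all of them.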

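3. GET Requests

GET is one of the most common request methods in HTTP. This section looks more closely at how to send GET requests with requests.

3.1 Basic Example

First, build the simplest possible GET request, targeting http://httpbin.org/get. This site detects the type of request the client sends; for a GET request, it returns the corresponding request information.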

import requests
r = requests.get('http://httpbin.org/get')
print(r.text)

The output is as follows:

{
  "args": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.10.0"
  },
  "origin": "122.4.215.33",
  "url": "http://httpbin.org/get"
}

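The result shows that the GET request succeeded and that the response contains the request headers, URL, IP address, and other information. How do we attach extra information to a GET request? Suppose we want to add two parameters: name with the value germey, and age with the value 22.

One way to build this URL is to write it out directly: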

r = requests.get('http://httpbin.org/get?name=germey&age=22')

This works, but it is not very convenient. Such data is usually stored in a dictionary. To build the URL from a dictionary, pass it through the params argument:

import requests
data = {
    'name': 'germey',
    'age': 22
}
r = requests.get("http://httpbin.org/get", params=data)
print(r.text)

The output is as follows:

{
  "args": {
    "age": "22",
    "name": "germey"
  },
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.10.0"
  },
  "origin": "122.4.215.33",
  "url": "http://httpbin.org/get?age=22&name=germey"
}
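
To see how params turns a dictionary into a query string without contacting the server, the request can be prepared locally. This is a sketch that relies on requests.Request and prepare(). On current Python versions the parameters appear in the dictionary's insertion order, so the order may differ from the output shown above.

```python
import requests
from urllib.parse import urlsplit, parse_qs

data = {'name': 'germey', 'age': 22}

# Build the request locally; nothing is sent
prepared = requests.Request('GET', 'http://httpbin.org/get', params=data).prepare()
print(prepared.url)

# Parse the query string back to confirm both parameters were encoded
query = parse_qs(urlsplit(prepared.url).query)
print(query)
```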

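In the response, the url field shows that the request URL was assembled automatically: http://httpbin.org/get?age=22&name=germey. Also, the returned content is of type str, but it follows the JSON format. To parse the result into a dictionary, simply call the json() method.

For example: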

import requests
r = requests.get("http://httpbin.org/get")
print(type(r.text))
print(r.json())
print(type(r.json()))

The output is as follows:

<class 'str'>
{
    "headers": {
        "Accept-Encoding": "gzip, deflate",
        "Accept": "*/*",
        "Host": "httpbin.org",
        "User-Agent": "python-requests/2.10.0"
    },
    "url": "http://httpbin.org/get",
    "args": {},
    "origin": "182.33.248.131"
}
<class 'dict'>

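As shown, json() converts a JSON-formatted response string into a dictionary. Note that if the response is not valid JSON, parsing fails and a json.decoder.JSONDecodeError is raised (recent requests releases raise requests.exceptions.JSONDecodeError instead; both are subclasses of ValueError).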
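
One way to guard against that parsing error is to catch ValueError around the json() call. The sketch below is a minimal illustration: json_or_none and FakeResponse are hypothetical names, and FakeResponse only imitates the text attribute and json() method of a real Response so the example runs offline.

```python
import json

def json_or_none(response):
    """Return the decoded JSON body, or None when the body is not valid JSON."""
    try:
        return response.json()
    except ValueError:  # JSONDecodeError is a subclass of ValueError
        return None

class FakeResponse:
    """Hypothetical stand-in for requests.Response, for offline use only."""
    def __init__(self, text):
        self.text = text

    def json(self):
        return json.loads(self.text)

print(json_or_none(FakeResponse('{"args": {}}')))   # {'args': {}}
print(json_or_none(FakeResponse('<html></html>')))  # None
```

With a real request, the same helper would be called as json_or_none(requests.get(url)).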

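3.2 Fetching a Web Page

The URL requested above returns a JSON string. Requesting an ordinary web page returns its content in the same way. Here we use Zhihu's "Explore" page as an example: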

import requests
import re
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
}
r = requests.get("https://www.zhihu.com/explore", headers=headers)
pattern = re.compile('explore-feed.*?question_link.*?>(.*?)</a>', re.S)
titles = re.findall(pattern, r.text)
print(titles)

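Here we add headers to the request, including a User-Agent field that identifies the browser. Without it, Zhihu refuses to serve the page. We then use a basic regular expression to match all the question titles on the page. Regular expressions are covered in depth later; this example serves only as a first look.

The output is as follows: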

['\n為什么很多人喜歡提及「拉丁語系」這個詞?\n', '\n在沒有水的情況下水系寶可夢如何戰斗?\n', '\n有哪些經驗可以送給 Kindle 新人?\n', '\n谷歌的廣告業務是如何賺錢的?\n', "\n程序員該學習什么，能在上學期間掙錢?\n", '\n有哪些原本只是一個小消息，但回看發現是個驚天大新聞的例子?\n', "\n如何評價今敏?\n", '\n源氏是怎么把那么長的刀從背后拔出來的?\n', "\n年輕時得了絕癥或大病是怎樣的感受?\n", "\n年輕時得了絕癥或大病是怎樣的感受?\n"]

All the question titles were extracted successfully.
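
The pattern can be tried offline on a small piece of markup. The HTML below is a hypothetical snippet written only to imitate the structure the pattern expects; it is not Zhihu's actual page source. The sketch also strips the surrounding newlines that appear in the output above.

```python
import re

# Hypothetical markup imitating the structure the pattern looks for
html = '''
<div class="explore-feed feed-item">
  <h2><a class="question_link" href="/question/1">
First question?
</a></h2>
</div>
<div class="explore-feed feed-item">
  <h2><a class="question_link" href="/question/2">
Second question?
</a></h2>
</div>
'''

# re.S lets "." match newlines, so one match can span several lines
pattern = re.compile('explore-feed.*?question_link.*?>(.*?)</a>', re.S)
titles = [t.strip() for t in pattern.findall(html)]
print(titles)  # ['First question?', 'Second question?']
```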

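3.3 Fetching Binary Data

In the example above we fetched a Zhihu page and got back an HTML document. What about files such as images, audio, and video? These files are binary data at heart; we can see rich multimedia content only because each has a specific storage format and a matching way to decode it. So to fetch such files, we need their raw bytes.

Take GitHub's site icon as an example: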

import requests
r = requests.get("https://github.com/favicon.ico")
print(r.text)
print(r.content)

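This time we fetch the site icon, the small image shown on each browser tab. We print two attributes of the Response object, text and content. In the output, the first two lines come from r.text and the last line from r.content. The r.text output is garbled, while the r.content output starts with b, which marks it as bytes. Because the image is binary data, r.text decodes those bytes into a str, in effect turning the image directly into a string, so garbled output is unavoidable.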
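
The difference between the two attributes can be reproduced offline. The sketch below uses the 8-byte PNG file signature as sample binary data (an assumption for illustration; the favicon itself is not downloaded) and decodes it as UTF-8 with undecodable bytes replaced, which is roughly what happens when binary content is viewed as text.

```python
# The 8-byte PNG signature, used here only as sample binary data
raw = b'\x89PNG\r\n\x1a\n'

# Decoding binary data as text replaces bytes that are not valid UTF-8
as_text = raw.decode('utf-8', errors='replace')

print(type(raw), raw)
print(type(as_text), repr(as_text))
```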

Next, save the image we just fetched:

import requests
r = requests.get("https://github.com/favicon.ico")
with open('favicon.ico', 'wb') as f:
    f.write(r.content)

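This uses the open() function. Its first argument is the file name, and the second opens the file in binary write mode so that binary data can be written to it. After the program finishes, a file named favicon.ico appears in the folder. Audio and video files can be fetched the same way.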
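
For larger audio or video files, holding the whole body in r.content may use a lot of memory. A possible alternative, sketched below, writes the body in chunks through the iter_content() method of the Response object. The names save_stream and FakeResponse are hypothetical; FakeResponse only imitates iter_content() so the example runs without a network connection.

```python
import os
import tempfile

def save_stream(response, path, chunk_size=8192):
    """Write a response body to disk chunk by chunk; return the bytes written."""
    written = 0
    with open(path, 'wb') as f:
        for chunk in response.iter_content(chunk_size=chunk_size):
            if chunk:  # skip empty chunks
                written += f.write(chunk)
    return written

class FakeResponse:
    """Hypothetical stand-in that yields its body in slices, like iter_content()."""
    def __init__(self, body):
        self.body = body

    def iter_content(self, chunk_size=1):
        for i in range(0, len(self.body), chunk_size):
            yield self.body[i:i + chunk_size]

target = os.path.join(tempfile.mkdtemp(), 'favicon.ico')
print(save_stream(FakeResponse(b'\x00\x00\x01\x00' * 5000), target))  # 20000
```

With a real download, the response would come from requests.get(url, stream=True), so that the body is read only as the loop consumes it.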

3.4 Adding Headers

As with urllib.request, requests passes request headers through the headers argument. In the Zhihu example above, the request does not succeed unless headers are supplied:

import requests
r = requests.get("https://www.zhihu.com/explore")
print(r.text)

The output is as follows:

<html><body><h1>500 Server Error</h1>An internal server error occured.</body></html>

With a headers argument that includes a User-Agent, the request works:

import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
}
r = requests.get("https://www.zhihu.com/explore", headers=headers)
print(r.text)

Any other header fields can be added to the headers argument as well.
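
The sketch below shows, without sending anything, what requests announces by default and how custom headers end up on the outgoing request. It assumes requests.utils.default_user_agent() and the Request/prepare() pair; the Accept-Language field is an extra header chosen purely for illustration.

```python
import requests

# Without custom headers, requests identifies itself with this User-Agent
print(requests.utils.default_user_agent())

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.8',  # extra field, chosen for illustration
}

# Build the request locally; nothing is sent
prepared = requests.Request('GET', 'https://www.zhihu.com/explore', headers=headers).prepare()

# Header lookup on the prepared request ignores case
print(prepared.headers['user-agent'])
print(prepared.headers['Accept-Language'])
```

The default value starts with python-requests/, which is a plausible reason for a site to turn away requests that do not set their own User-Agent.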

4. POST Requests

So far we have covered the basic GET request. Another common request method in HTTP is POST, and sending one with requests is just as simple.

For example:

import requests
data = {'name': 'germey', 'age': '22'}
r = requests.post("http://httpbin.org/post", data=data)
print(r.text)

Here we request http://httpbin.org/post. The site detects a POST request and returns the relevant request information.

The output is as follows:

{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "age": "22",
    "name": "germey"
  },
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Content-Length": "18",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.10.0"
  },
  "json": null,
  "origin": "182.33.248.131",
  "url": "http://httpbin.org/post"
}

The response came back successfully. Its form field contains exactly the data we submitted, which confirms that the POST request was sent correctly.
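
To see why the submitted data shows up under form, the request body can be inspected locally. The sketch below prepares the same payload twice without sending it: once through data, as in the example above, and once through the json argument, which is an addition not covered in the text and is shown only for comparison.

```python
import json
import requests

payload = {'name': 'germey', 'age': '22'}
url = 'http://httpbin.org/post'

# Build both requests locally; nothing is sent
form_req = requests.Request('POST', url, data=payload).prepare()
json_req = requests.Request('POST', url, json=payload).prepare()

# data= produces a form-encoded body
print(form_req.headers['Content-Type'], form_req.body)

# json= produces a JSON document as the body
print(json_req.headers['Content-Type'], json.loads(json_req.body))
```

The form-encoded body is 18 characters long, which matches the Content-Length header in the output above.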

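5. Responses

Sending a request always yields a response. In the examples above we used the text and content attributes to read the response body. Other attributes and methods expose more information, such as the status code, response headers, and cookies.

For example: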

import requests
r = requests.get('http://www.jianshu.com')
print(type(r.status_code), r.status_code)
print(type(r.headers), r.headers)
print(type(r.cookies), r.cookies)
print(type(r.url), r.url)
print(type(r.history), r.history)

Here status_code gives the status code, headers the response headers, cookies the cookies, url the request URL, and history the request history.

The output is as follows:

<class 'int'> 200
<class 'requests.structures.CaseInsensitiveDict'> {'X-Runtime': '0.006363', 'Connection': 'keep-alive', 'Content-Type': 'text/html; charset=utf-8', 'X-Content-Type-Options': 'nosniff', 'Date': 'Sat, 27 Aug 2016 17:18:51 GMT', 'Server': 'nginx', 'X-Frame-Options': 'DENY', 'Content-Encoding': 'gzip', 'Vary': 'Accept-Encoding', 'ETag': 'W/"3abda885e0e123bfde06dgb61e696159"', 'X-XSS-Protection': '1; mode=block', 'X-Request-Id': 'a8a3c4d5-f660-422f-8df9-49719ddgb5d4', 'Transfer-Encoding': 'chunked', 'Set-Cookie': 'read_mode=day; path=/, default_font=font2; path=/, session_id=xxx; path=/; HttpOnly', 'Cache-Control': 'max-age=0, private, must-revalidate'}
<class 'requests.cookies.RequestsCookieJar'> <RequestsCookieJar[<Cookie session_id=xxx for www.jianshu.com/>, <Cookie default_font=font2 for www.jianshu.com/>, <Cookie read_mode=day for www.jianshu.com/>]>
<class 'str'> http://www.jianshu.com/
<class 'list'> []

The session id is abbreviated here because it is too long. The output shows that headers is a CaseInsensitiveDict and cookies is a RequestsCookieJar.
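
As the name suggests, a CaseInsensitiveDict lets a header be looked up regardless of letter case. The sketch below builds one by hand, with two header values taken from the output above, so it runs without sending a request.

```python
from requests.structures import CaseInsensitiveDict

headers = CaseInsensitiveDict({
    'Content-Type': 'text/html; charset=utf-8',
    'Server': 'nginx',
})

# Lookups work with any capitalization of the header name
print(headers['content-type'])        # text/html; charset=utf-8
print(headers.get('SERVER'))          # nginx
print('content-encoding' in headers)  # False
```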

The status code is the usual basis for deciding whether a request succeeded. requests also provides a built-in status code lookup object, requests.codes.

For example:

import sys
import requests
r = requests.get('http://www.jianshu.com')
if r.status_code != requests.codes.ok:
    sys.exit()  # stop when the request did not succeed
print('Request Successfully')

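Here we compare the returned status code with the built-in success code. If they match, the request got a normal response and the program prints a success message; otherwise the program exits. requests.codes.ok evaluates to the success status code 200.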
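
Besides comparing status codes by hand, a Response object also offers raise_for_status(), which raises an HTTPError for 4xx and 5xx codes. The sketch below is an addition to the approach described above: check is a hypothetical helper, and the Response object is constructed by hand with only its status code set, so nothing is sent over the network.

```python
import requests

def check(status_code):
    """Return 'ok' or the error message that raise_for_status() produces."""
    r = requests.Response()      # constructed by hand; nothing is sent
    r.status_code = status_code
    try:
        r.raise_for_status()     # raises HTTPError for 4xx and 5xx codes
    except requests.exceptions.HTTPError as exc:
        return str(exc)
    return 'ok'

print(check(200))
print(check(404))
```

In a real script the call would simply be r.raise_for_status() right after requests.get(...).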

requests.codes contains far more than ok. The status codes and their lookup names are listed below.

# Informational status codes
100: ('continue',),
101: ('switching_protocols',),
102: ('processing',),
103: ('checkpoint',),
122: ('uri_too_long', 'request_uri_too_long'),
# Success status codes
200: ('ok', 'okay', 'all_ok', 'all_okay', 'all_good', '\\o/', '✓'),
201: ('created',),
202: ('accepted',),
203: ('non_authoritative_info', 'non_authoritative_information'),
204: ('no_content',),
205: ('reset_content', 'reset'),
206: ('partial_content', 'partial'),
207: ('multi_status','multiple_status','multi_stati','multiple_stati'),
208: ('already_reported',),
226: ('im_used',),
# Redirection status codes
300: ('multiple_choices',),
301: ('moved_permanently','moved', '\\o-'),
302: ('found',),
303: ('see_other', 'other'),
304: ('not_modified',),
305: ('use_proxy',),
306: ('switch_proxy',),
307: ('temporary_redirect', 'temporary_moved', 'temporary'),
308: ('permanent_redirect','resume_incomplete','resume',),
# Client error status codes
400: ('bad_request', 'bad'),
401: ('unauthorized',),
402: ('payment_required', 'payment'),
403: ('forbidden',),
404: ('not_found', '-o-'),
405: ('method_not_allowed', 'not_allowed'),
406: ('not_acceptable',),
407: ('proxy_authentication_required', 'proxy_auth', 'proxy_authentication'),
408: ('request_timeout', 'timeout'),
409: ('conflict',),
410: ('gone',),
411: ('length_required',),
412: ('precondition_failed', 'precondition'),
413: ('request_entity_too_large',),
414: ('request_uri_too_large',),
415: ('unsupported_media_type', 'unsupported_media','media_type'),
416: ('requested_range_not_satisfiable','requested_range', 'range_not_satisfiable'),
417: ('expectation_failed',),
418: ('im_a_teapot', 'teapot', 'i_am_a_teapot'),
421: ('misdirected_request',),
422: ('unprocessable_entity', 'unprocessable'),
423: ('locked',),
424: ('failed_dependency', 'dependency'),
425: ('unordered_collection', 'unordered'),
426: ('upgrade_required', 'upgrade'),
428: ('precondition_required', 'precondition'),
429: ('too_many_requests', 'too_many'),
431: ('header_fields_too_large', 'fields_too_large'),
444: ('no_response', 'none'),
449: ('retry_with','retry'),
450: ('blocked_by_windows_parental_controls', 'parental_controls'),
451: ('unavailable_for_legal_reasons', 'legal_reasons'),
499: ('client_closed_request',),
# Server error status codes
500: ('internal_server_error', 'server_error', '/o\\', '✗'),
501: ('not_implemented',),
502: ('bad_gateway',),
503: ('service_unavailable', 'unavailable'),
504: ('gateway_timeout',),
505: ('http_version_not_supported', 'http_version'),
506: ('variant_also_negotiates',),
507: ('insufficient_storage',),
509: ('bandwidth_limit_exceeded', 'bandwidth'),
510: ('not_extended',),
511: ('network_authentication_required', 'network_auth', 'network_authentication')

For example, to check whether a response has status 404, compare its status code with `requests.codes.not_found`.
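
A few lookups against requests.codes, as a small sketch. The names come from the list above and can be read either as attributes or with dictionary-style indexing; is_not_found is a hypothetical helper added for illustration.

```python
import requests

print(requests.codes.ok)                     # 200
print(requests.codes.not_found)              # 404
print(requests.codes['temporary_redirect'])  # 307
print(requests.codes.teapot)                 # 418

def is_not_found(status_code):
    """Return True when the status code means 'not found'."""
    return status_code == requests.codes.not_found

print(is_not_found(404))  # True
```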