Using the requests module

The requests module

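- 1. What is the requests module?
  - A third-party Python library for making network requests; it lets a script send requests the way a browser would.
- 2. Why use the requests module?
  - It handles URL encoding automatically.
  - It handles POST request parameters automatically.
  - It simplifies working with cookies and proxies.
- 3. How to use the requests module
  - Install: pip install requests
  - Workflow:
    1. Specify the URL.
    2. Send the request with the requests module.
    3. Get the response data.
    4. Persist the data to storage.
- 4. Work through five requests-based crawler projects to learn and consolidate the module:
  - GET request
  - POST request
  - AJAX GET
  - AJAX POST
  - A combined project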

Sending a GET request with the requests module

- Requirement: scrape the page data of the Sogou home page

import requests

# Specify the URL
url = 'https://www.sogou.com/'
# Send the GET request; get() returns a response object once the request completes
response = requests.get(url=url)
# Get the response data: text returns the page data as a string
page_data = response.text
# Persist the data
with open('sogo.html', 'w', encoding='utf-8') as f:
    f.write(page_data)
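
The four-step workflow can also be wrapped in a small function. The sketch below is one possible arrangement, not the original author's code: the function name `fetch_and_save`, the `timeout` value, and the optional `session` parameter are assumptions added for illustration. It also calls `raise_for_status()` before saving, so an error page is not written to disk by mistake.

```python
import requests


def fetch_and_save(url, path, session=None, timeout=10):
    """Fetch url, write the page text to path, and return the number of characters written."""
    # Use the given session if there is one, otherwise the module-level API.
    client = session if session is not None else requests
    # Steps 1-2: send the GET request (the timeout is an assumed value).
    response = client.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of saving them.
    response.raise_for_status()
    # Step 3: read the response body as text.
    page_data = response.text
    # Step 4: persist it.
    with open(path, "w", encoding="utf-8") as f:
        f.write(page_data)
    return len(page_data)
```

A call such as `fetch_and_save('https://www.sogou.com/', 'sogo.html')` would then reproduce the example above.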

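# Other important attributes of the response object
import requests

# Specify the URL
url = 'https://www.sogou.com/'
# Send the GET request; get() returns a response object once the request completes
response = requests.get(url=url)

response.content      # the page data as bytes
response.status_code  # the response status code, e.g. 200 or 404
response.headers      # the response headers
response.url          # the URL of the response (the final URL after any redirects)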
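
`response.text` is `response.content` decoded with `response.encoding`. When a server sends a `text/*` Content-Type without a charset, requests falls back to ISO-8859-1, which garbles Chinese pages. The sketch below reproduces the effect offline with plain bytes; the sample string is made up for illustration.

```python
# Bytes as they would appear in response.content for a UTF-8 page.
raw = "搜狗搜索".encode("utf-8")

# Decoding with the wrong charset does not fail; it silently yields garbled text.
garbled = raw.decode("iso-8859-1")
# Decoding with the right charset recovers the original four characters.
correct = raw.decode("utf-8")

print(len(raw), len(garbled), len(correct))  # 12 12 4
```

With a real response, a common remedy is to set `response.encoding = 'utf-8'` (or `response.encoding = response.apparent_encoding`) before reading `response.text`.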

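How the requests module handles GET requests with parameters (two approaches)

Requirement: given a search term, fetch the corresponding Sogou search results page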

# Approach 1: put the parameters directly in the URL
import requests

url = 'http://www.sogou.com/web?query=金角大王&ie=utf-8'
response = requests.get(url=url)
page_text = response.text
with open('jinjiao.html', 'w', encoding='utf-8') as f:
    f.write(page_text)

# Approach 2: pass the parameters separately
import requests

url = 'http://www.sogou.com/web'
# Put the parameters in a dictionary
params = {
    'query': '金角大王',
    'ie': 'utf-8',
}
response = requests.get(url=url, params=params)
page_text = response.text
with open('jinjiao.html', 'w', encoding='utf-8') as f:
    f.write(page_text)
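
To see what requests does with a `params` dictionary, the request can be prepared without being sent. This is a small offline sketch using `requests.Request(...).prepare()`; it only inspects the final URL and makes no network call.

```python
import requests

# Approach 2 from the text: parameters passed as a dictionary.
prepared = requests.Request(
    "GET",
    "http://www.sogou.com/web",
    params={"query": "金角大王", "ie": "utf-8"},
).prepare()

# The non-ASCII query value is percent-encoded as UTF-8 bytes.
print(prepared.url)
# http://www.sogou.com/web?query=%E9%87%91%E8%A7%92%E5%A4%A7%E7%8E%8B&ie=utf-8
```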

# Custom request headers
import requests

url = 'http://www.sogou.com/web'
# Put the parameters in a dictionary
params = {
    'query': '金角大王',
    'ie': 'utf-8',
}
# Custom request headers, e.g. User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
}
response = requests.get(url=url, params=params, headers=headers)
response.status_code
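
The reason for setting `User-Agent` becomes clearer when you look at what requests sends by default. The sketch below reads the default headers and then prepares (without sending) a request that carries the custom value; it is an offline illustration, and a given site may or may not treat the default value differently.

```python
import requests

# What requests sends as User-Agent when no headers are supplied.
default_ua = requests.utils.default_headers()["User-Agent"]
print(default_ua)  # a string beginning with "python-requests/"

# Browser-like value taken from the example above.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
}

# Prepare the request through a session, without sending it, to see the merged headers.
prepared = requests.Session().prepare_request(
    requests.Request("GET", "http://www.sogou.com/web", headers=headers)
)
# The custom value replaces the default one.
print(prepared.headers["User-Agent"] == headers["User-Agent"])  # True
```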

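Sending a POST request with the requests module

- Requirement: log in to Douban and get the page data returned after a successful login
- Tip: if you cannot find the POST request in the captured traffic, try entering a wrong password and look for it then
- Douban has since been redesigned, so this request now only returns login information such as cookie and session data. You can still log in through a session created with requests.session() and keep fetching pages with it.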

import requests

# 1. Specify the URL of the POST request
url = 'https://accounts.douban.com/j/mobile/login/basic'
# Put the POST parameters in a dictionary (replace the placeholders with your own account)
data = {
    'ck': '',
    'name': 'YOUR_ACCOUNT',
    'password': 'YOUR_PASSWORD',
    'remember': 'false',
    'ticket': '',
}
# Custom request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
}
# 2. Send the POST request
response = requests.post(url=url, data=data, headers=headers)
# Get the page data from the response object
page_text = response.text
print(page_text)
# Persist the data
with open('douban.html', 'w', encoding='utf-8') as f:
    f.write(page_text)
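
Passing a dictionary as `data=` makes requests form-encode it into the request body. The offline sketch below prepares a POST request without sending it; the endpoint and field values are hypothetical and exist only to show the encoding.

```python
import requests

# Hypothetical endpoint and field values, used only to inspect the encoding.
prepared = requests.Request(
    "POST",
    "https://example.com/login",
    data={"name": "user@example.com", "password": "p w&d", "remember": "false"},
).prepare()

print(prepared.headers["Content-Type"])  # application/x-www-form-urlencoded
print(prepared.body)  # name=user%40example.com&password=p+w%26d&remember=false
```

Special characters such as `@`, spaces, and `&` are escaped automatically, so the values never need to be encoded by hand.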

AJAX-based GET requests

- Requirement: scrape movie detail data from Douban Movies

import requests

# Approach 1: parameters in the URL
url = 'https://movie.douban.com/j/chart/top_list?type=20&interval_id=100%3A90&action=&start=80&limit=20'
# Custom request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
}
response = requests.get(url=url, headers=headers)
# print(response.text)  # the data returned is in JSON format

import requests

# Approach 2: parameters in a dictionary
url = 'https://movie.douban.com/j/chart/top_list?'
# Parameters carried by the AJAX GET request
params = {
    'type': '20',
    'interval_id': '100:90',
    'action': '',
    'start': '80',
    'limit': '20',
}
# Custom request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
}
response = requests.get(url=url, params=params, headers=headers)
# print(response.text)  # the data returned is in JSON format
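
Because the endpoint returns JSON, `response.json()` can turn the body into Python objects instead of leaving it as a string. The sketch below shows one way to pull fields out of such a list. The field names `title` and `score`, the helper name `summarize_movies`, and the sample records are assumptions for illustration and have not been checked against Douban's actual response.

```python
import json


def summarize_movies(movies):
    """Return (title, score) pairs from a list of movie dictionaries."""
    # .get() avoids a KeyError if a record lacks one of the assumed fields.
    return [(m.get("title"), m.get("score")) for m in movies]


# Made-up sample standing in for response.text; not real Douban data.
sample_text = '[{"title": "Movie A", "score": "9.1"}, {"title": "Movie B", "score": "8.7"}]'
movies = json.loads(sample_text)  # response.json() performs this parsing step
print(summarize_movies(movies))  # [('Movie A', '9.1'), ('Movie B', '8.7')]
```

With a live response the call would be `summarize_movies(response.json())`.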

AJAX-based POST requests

- Requirement: scrape KFC restaurant location data for a given city

import requests

url = 'http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword'
# Parameters of the POST request
data = {
    'cname': '',
    'pid': '',
    'keyword': '北京',
    'pageIndex': '1',
    'pageSize': '10',
}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
}
# Send the AJAX-based POST request
response = requests.post(url=url, data=data, headers=headers)
response.text
type(response.text)
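
The `pageIndex` and `pageSize` fields suggest that results come back one page at a time. The sketch below generates one form-data dictionary per page so that each can be posted in turn. It assumes `pageIndex` is a 1-based page number, which the original does not confirm, and `page_payloads` is a hypothetical helper name.

```python
def page_payloads(keyword, pages, page_size=10):
    """Yield one form-data dictionary per page, mirroring the fields used above."""
    for page_index in range(1, pages + 1):
        yield {
            "cname": "",
            "pid": "",
            "keyword": keyword,
            "pageIndex": str(page_index),  # assumed to be 1-based
            "pageSize": str(page_size),
        }


payloads = list(page_payloads("北京", pages=3))
print([p["pageIndex"] for p in payloads])  # ['1', '2', '3']
```

Each dictionary can be passed as `data=` to `requests.post` exactly as in the example above.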

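Summary

AJAX-based POST and GET requests are sent in the same way as ordinary requests. The only difference is how you obtain the URL: an AJAX request runs asynchronously and does not change the URL in the address bar, so you have to use a packet-capture tool (such as the browser's developer tools) to find the URL the page is really requesting.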

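Advanced requests usage (cookies):

- Cookie: data kept about a particular user
- Requirement: scrape the Douban personal home page of a user, Zhang San
- What a cookie does: the server uses cookies to record the client's state
- Workflow:
  1. Perform the login (to obtain the cookie).
  2. When requesting the personal home page, send the cookie along with that request.
- Note: a session object sends requests and automatically stores the cookies it receives.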

import requests

# Create a session object
session = requests.session()
# 1. Send the login request: the session receives the cookie and stores it automatically
login_url = 'https://accounts.douban.com/j/mobile/login/basic'
# Replace the placeholders with your own account
data = {
    'ck': '',
    'name': 'YOUR_ACCOUNT',
    'password': 'YOUR_PASSWORD',
    'remember': 'false',
    'ticket': '',
}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
}
# Send the POST request through the session object
login_response = session.post(url=login_url, data=data, headers=headers)
# 2. Request the personal home page through the same session (which carries the cookie) and get the page data
url = 'https://www.douban.com/people/191748483/'
response = session.get(url=url, headers=headers)
page_text = response.text
with open('douban110.html', 'w', encoding='utf-8') as f:
    f.write(page_text)
print(page_text)
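
The way a session carries cookies from one request to the next can be checked offline. In the sketch below a cookie is placed in the session by hand, standing in for one a server would set at login, and a later request is prepared without being sent. The cookie name, value, and `example.com` URLs are hypothetical.

```python
import requests

session = requests.Session()
# Stand-in for a cookie the server would set on login (hypothetical name and value).
session.cookies.set("sid", "abc123", domain="example.com", path="/")

# Prepare (but do not send) a later request made through the same session.
prepared = session.prepare_request(
    requests.Request("GET", "https://example.com/people/me/")
)
# The stored cookie is attached to the outgoing request automatically.
print(prepared.headers.get("Cookie"))  # sid=abc123
```

A plain `requests.get` call has no such store, which is why the personal-page request has to go through the same session as the login.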

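Advanced requests usage (proxies):

- 1. Proxy: a third party that carries out tasks on behalf of the principal
- 2. Why use a proxy?
  - Sites use anti-crawling measures
  - A proxy is a counter-measure against them (anti-anti-crawling)
- 3. Types:
  - Forward proxy: fetches data on behalf of the client
  - Reverse proxy: serves data on behalf of the server
- 4. Sites that provide free proxy IPs:
  - www.goubanjia.com
  - Kuaidaili (快代理)
  - Xici Daili (西祠代理)
- 5. Code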

import requests

url = 'https://www.baidu.com/s?wd=ip&ie=utf-8'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
}
# Put the proxy IP in a dictionary. The scheme of the URL above must match the proxy's
# protocol type (both https or both http); if they differ, change the URL's scheme.
proxy = {
    'https': '222.74.237.246:808',
}
# Send the request through the proxy to change the outgoing IP
response = requests.get(url=url, proxies=proxy, headers=headers)
with open('./daili.html', 'wb') as f:
    f.write(response.content)
print(response.content)
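
The requirement that the URL scheme match the key in the proxy dictionary can be verified without sending anything. The sketch below uses `select_proxy`, a helper in `requests.utils`, to show which proxy would be chosen for a given URL; the proxy address is the one from the example above and is not assumed to be reachable.

```python
import requests

# Proxy mapping from the example above: only the "https" key is defined.
proxy = {"https": "222.74.237.246:808"}

# select_proxy returns the proxy that would be used for a URL, or None.
https_choice = requests.utils.select_proxy("https://www.baidu.com/s?wd=ip", proxy)
http_choice = requests.utils.select_proxy("http://www.baidu.com/s?wd=ip", proxy)

print(https_choice)  # 222.74.237.246:808
print(http_choice)   # None
```

An `http://` URL therefore bypasses this proxy entirely unless an `'http'` entry is added to the dictionary.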

Combined project

- Requirement: scrape the Sogou Zhihu search result pages for a given term over a given range of page numbers

import os

import requests

# Create a folder for the output
if not os.path.exists('./pages'):
    os.mkdir('./pages')
word = input('enter a word')
# Page range chosen at run time
start_page = int(input('enter a start pageNum'))
end_page = int(input('enter an end pageNum'))
url = 'https://zhihu.sogou.com/zhihu?'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
}
for page in range(start_page, end_page + 1):
    params = {
        'query': word,
        'page': page,
        'ie': 'utf-8',
    }
    response = requests.get(url=url, params=params, headers=headers)
    page_text = response.text
    filename = word + str(page) + '.html'
    filePath = 'pages/' + filename
    with open(filePath, 'w', encoding='utf-8') as f:
        f.write(page_text)
    print('Page %s written successfully' % page)
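
The page loop separates cleanly into planning (which parameters and file name belong to each page) and fetching. The sketch below covers only the planning half so that it can be checked without network access; `page_jobs` and its `out_dir` parameter are hypothetical names introduced for illustration.

```python
import os


def page_jobs(word, start_page, end_page, out_dir="pages"):
    """Return (params, file_path) pairs for every page in the inclusive range."""
    jobs = []
    for page in range(start_page, end_page + 1):
        params = {"query": word, "page": page, "ie": "utf-8"}
        file_path = os.path.join(out_dir, f"{word}{page}.html")
        jobs.append((params, file_path))
    return jobs


jobs = page_jobs("python", 2, 4)
print([os.path.basename(path) for _, path in jobs])
# ['python2.html', 'python3.html', 'python4.html']
```

Each pair can then feed the same `requests.get(url=url, params=params, headers=headers)` call and `open(file_path, ...)` write used above. `os.makedirs(out_dir, exist_ok=True)` is a shorter alternative to the explicit `os.path.exists` check.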