Web Scraping in Python with the Requests Module

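I. Requesting data
II. Getting and parsing the data
IV. Saving the data
- 1. Save as a CSV file
- 2. Save as an Excel file
  - Download web images and insert them into the Excel file
V. Reverse-engineering the encrypted parameter
- 1. Locate where the encryption happens
- 2. Analyze with breakpoints
- 3. Copy the relevant JS encryption code and debug it locally (hard)
- 4. Obtain the encrypted sign parameter
VI. Other examples
- 1. Processing and saving a single page of data
- 2. Collecting paginated data: MD5 encryption
  - 1) Analyze how the request URL / parameters change
  - 2) Reverse-engineer the encrypted sign parameter
  - 3) Python implementation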

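Tutorial video: 【Python Web Scraping in Practice: Collecting Data from Popular Apps (Dewu, Xianyu, Xiaohongshu, WeChat Mini Programs, CAPTCHA Recognition)】
Reference documentation: 【Requests: HTTP for Humans】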

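I. Requesting data

The Requests module sends requests to a URL by imitating a browser.

Open the website whose data you want to collect and use the browser's developer tools to find where that data comes from.
- Right-click and choose "Inspect" → "Network" to open the developer tools;
- Refresh the page;
- Search by keyword to find the request that returns the data.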

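Request headers: the fields in the request headers are what make the request look like it comes from a browser. They must be passed as a dictionary of the form {'key': 'value', ...}.

Adding quotes and commas in bulk in PyCharm:

- Select the text to convert and press Ctrl + R to open the replace bar;
- Enable the .* option so that the search field is treated as a regular expression;
- Enter (.*?):(.*) in the first field and '$1':'$2', in the second, then select the text again and click "Replace All". If the copied lines have a space after the colon, writing the pattern as (.*?):\s*(.*) also drops that space, which otherwise ends up at the start of every value.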
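The same conversion can be done in Python instead of the editor. The following is a minimal sketch, assuming the headers were copied from the developer tools as one "name: value" pair per line; the helper name headers_from_text and the sample text are illustrative, not part of the original tutorial.

```python
def headers_from_text(raw: str) -> dict:
    """Turn 'name: value' lines into a dict of stripped header names and values."""
    headers = {}
    for line in raw.splitlines():
        line = line.strip()
        # Skip blank lines and HTTP/2 pseudo-headers such as ":authority"
        if not line or line.startswith(':'):
            continue
        # Split on the first colon only, so URLs in the value stay intact
        name, sep, value = line.partition(':')
        if not sep:
            continue  # not a "name: value" line
        headers[name.strip()] = value.strip()
    return headers


# Sample text for illustration; paste your own copied headers here
raw_headers = """
accept: */*
content-type: application/json
origin: https://www.dewu.com
referer: https://www.dewu.com/
"""

request_header = headers_from_text(raw_headers)
print(request_header)
```

Because every name and value is stripped, the resulting dictionary has no leading or trailing spaces in its values.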

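Request URL: copy the link found during the packet-capture analysis.
Request method:
- POST request → submits form data / a request payload to the server;
- GET request → retrieves data from the server.

Request parameters: these can be viewed in the "Payload" tab.
- POST request → implicit (the parameters travel in the request body);
- GET request → explicit (the query parameters are visible directly in the request URL).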
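To see the difference between the two methods without sending anything, one option is to build the requests and inspect them before they go out. This is a small sketch using requests.Request(...).prepare(); the URL and parameter names are placeholders chosen for illustration.

```python
import json
import requests

# Placeholder URL for illustration only; nothing is sent over the network
url = 'https://example.com/api/search'
parameters = {'keyword': 'python', 'page': 1}

# GET: the parameters are encoded into the URL's query string
get_req = requests.Request('GET', url, params=parameters).prepare()
# POST with json=...: the parameters are serialized into the request body
post_req = requests.Request('POST', url, json=parameters).prepare()

print(get_req.url)                   # parameters are visible in the URL
print(get_req.body)                  # a GET built this way has no body
print(post_req.url)                  # the URL carries no parameters
print(json.loads(post_req.body))     # parameters recovered from the JSON body
print(post_req.headers['Content-Type'])
```

Passing json= also sets the Content-Type header to application/json, which is why the later examples do not need to serialize the payload by hand.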

Reference article: 【HTTP Methods: GET vs. POST | Runoob (菜鳥教程)】

Sending the request: use the requests module.
- If the module is not installed, press Win + R, enter cmd, then run pip install requests.
- In PyCharm, write import requests to import the module.

Python code:

import requests

# Request headers
request_header = {
    'accept': '*/*',
    'accept-encoding': 'gzip, deflate, br, zstd',
    'accept-language': 'zh-CN,zh;q=0.9',
    'connection': 'keep-alive',
    'content-length': '124',
    'content-type': 'application/json',
    'cookie': '...',
    'host': 'app.dewu.com',
    'ltk': '...',
    'origin': 'https://www.dewu.com',
    'referer': 'https://www.dewu.com/',
    'sec-ch-ua': '"Google Chrome";v="137", "Chromium";v="137", "Not/A)Brand";v="24"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
    'sec-fetch-dest': 'empty',
    'sec-fetch-mode': 'cors',
    'sec-fetch-site': 'same-site',
    'sessionid': '...',
    'shumeiid': '...',
    'sk': '',
    'traceparent': '...',
    'user-agent': '...'
}
# Request URL
request_url = r'https://app.dewu.com/api/v1/h5/commodity-pick-interfaces/pc/pick-rule-result/feeds/info'
# Request payload
request_parameters = {
    'filterUnbid': True,
    'pageNum': 1,  # page number
    'pageSize': 24,
    'pickRuleId': 644443,
    'showCspu': True,
    'sign': "0e5d10fb111f2afef6ac0a1776187e23"  # signature (encrypted parameter)
}
# Send the request
response = requests.post(url=request_url, json=request_parameters, headers=request_header)

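II. Getting and parsing the data

Get the response data returned by the server.
- response.text → the response as text (a string)
- response.json() → the response parsed as JSON (a dict / list)
- response.content → the response as raw bytes
Then index into the key-value pairs to extract the information you need.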

Python code:

# from pprint import pprint

# Get the data
data_json = response.json()
# print(data_json)
# Parse the data
info_list = data_json['data']['list']
for index in info_list:
    # pprint(index)
    # print('-' * 50)
    info_dict = {
        'Title': index['title'],
        'Price': index['price'],
        'Image URL': index['logoUrl']
    }
    for key, value in info_dict.items():
        print(f'{key} : {value}')
    print('-' * 50)
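The extraction step can be tried offline on a hand-written sample. The sketch below assumes a response shaped like the data → list structure used above; the sample values are made up, and .get() is used so that a missing field yields a default instead of a KeyError.

```python
import json

# Hypothetical response text, shaped like the data['list'] structure used above
response_text = '''
{"data": {"list": [
    {"title": "Item A", "price": 129900, "logoUrl": "https://example.com/a.png"},
    {"title": "Item B", "logoUrl": "https://example.com/b.png"}
]}}
'''

# Parsing the text by hand gives the same kind of object response.json() returns
data_json = json.loads(response_text)

rows = []
for item in data_json.get('data', {}).get('list', []):
    rows.append({
        'title': item.get('title', ''),
        'price': item.get('price'),        # None when the field is missing
        'logoUrl': item.get('logoUrl', ''),
    })

for row in rows:
    print(row)
```

Direct indexing such as item['price'] is fine when every record is known to carry the field; .get() is the more forgiving choice when records vary.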

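Note: if you get an error of the form "requests.exceptions.InvalidHeader: xxx", a value in your request_header dictionary contains stray whitespace (typically a leading space or a line break). Check the values carefully and remove it.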
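Rather than checking each value by eye, the dictionary can be scanned for suspicious entries. This is a minimal sketch; the helper names find_suspect_headers and clean_headers are hypothetical, and the check covers only surrounding whitespace and embedded line breaks.

```python
def find_suspect_headers(headers: dict) -> list:
    """Return the names of headers whose name or value has surrounding whitespace or a line break."""
    suspects = []
    for name, value in headers.items():
        if name != name.strip() or value != value.strip() or '\n' in value or '\r' in value:
            suspects.append(name)
    return suspects


def clean_headers(headers: dict) -> dict:
    """Return a copy of the headers with whitespace stripped from names and values."""
    return {name.strip(): value.strip() for name, value in headers.items()}


# Example: the first value has a leading space left over from copying
request_header = {'accept': ' */*', 'referer': 'https://www.dewu.com/'}
print(find_suspect_headers(request_header))
print(clean_headers(request_header))
```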

IV. Saving the data

1. Save as a CSV file

Python code:

import requests
import csv

request_header = {...}  # fill in the headers as in section I
request_url = r'https://app.dewu.com/api/v1/h5/commodity-pick-interfaces/pc/pick-rule-result/feeds/info'
request_parameters = {
    'filterUnbid': True,
    'pageNum': 1,  # page number
    'pageSize': 24,
    'pickRuleId': 644443,
    'showCspu': True,
    'sign': "0e5d10fb111f2afef6ac0a1776187e23"  # signature (encrypted parameter)
}

# Request the data
response = requests.post(url=request_url, json=request_parameters, headers=request_header)
# Get the data
data_json = response.json()
# Parse the data
info_list = data_json['data']['list']

# Create the file object; the with block closes it even if an error occurs
with open('dewu.csv', mode='w', encoding='utf-8-sig', newline='') as f:
    # Dictionary-based writer
    cd = csv.DictWriter(f, fieldnames=['Title', 'Price', 'Image URL'])
    # Write the header row
    cd.writeheader()
    for index in info_list:
        info_dict = {
            'Title': index['title'],
            'Price': index['price'] / 100,  # API value divided by 100
            'Image URL': index['logoUrl']
        }
        # Write one row of data
        cd.writerow(info_dict)
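The csv.DictWriter pattern above can be tried without a network request or a file on disk. The sketch below writes made-up rows to an in-memory buffer and reads them back; the field names and values are illustrative only.

```python
import csv
import io

# Made-up rows standing in for the parsed API records
rows = [
    {'title': 'Item A', 'price': 1299.0, 'logoUrl': 'https://example.com/a.png'},
    {'title': 'Item B', 'price': 899.0, 'logoUrl': 'https://example.com/b.png'},
]

# newline='' leaves line endings to the csv module, as with open(..., newline='')
buffer = io.StringIO(newline='')
writer = csv.DictWriter(buffer, fieldnames=['title', 'price', 'logoUrl'])
writer.writeheader()     # header row taken from fieldnames
writer.writerows(rows)   # one CSV line per dict

csv_text = buffer.getvalue()
print(csv_text)

# Reading it back gives one dict per data row; every value comes back as a string
read_back = list(csv.DictReader(io.StringIO(csv_text)))
print(read_back[0])
```

When writing to a real file, encoding='utf-8-sig' adds a byte-order mark that lets Excel detect UTF-8, which matters once the titles contain non-ASCII text.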

2. Save as an Excel file

Python code:

import requests
import pandas as pd

request_header = {...}  # fill in the headers as in section I
request_url = r'https://app.dewu.com/api/v1/h5/commodity-pick-interfaces/pc/pick-rule-result/feeds/info'
request_parameters = {
    'filterUnbid': True,
    'pageNum': 1,  # page number
    'pageSize': 24,
    'pickRuleId': 644443,
    'showCspu': True,
    'sign': "0e5d10fb111f2afef6ac0a1776187e23"  # signature (encrypted parameter)
}
# Request the data
response = requests.post(url=request_url, json=request_parameters, headers=request_header)
# Get the data
data_json = response.json()
# Create an empty list
dewu_info = []
# Parse the data
info_list = data_json['data']['list']
for index in info_list:
    info_dict = {
        'Title': index['title'],
        'Price': index['price'] / 100,
        'Image URL': index['logoUrl']
    }
    # Collect the row
    dewu_info.append(info_dict)
# Convert the rows to a DataFrame
df = pd.DataFrame(dewu_info)
# Export as an Excel workbook
df.to_excel('dewu.xlsx', index=False)

Download web images and insert them into the Excel file

Python code:

import requests
import openpyxl
from openpyxl.drawing.image import Image as xlImage
from openpyxl.utils import get_column_letter
from PIL import Image
from io import BytesIO


def download_image(url):
    rg_url = requests.get(url)
    # Check the response status code
    if rg_url.status_code == 200:
        # Create the image object
        image = Image.open(BytesIO(rg_url.content))
        # Normalize the image mode
        if image.mode != 'RGB':
            image = image.convert('RGB')
        # Resize the image
        return image.resize((150, 96))
    else:
        # Raise OSError so that the except clause in the loop below catches it
        raise OSError(f"Could not download image, status code: {rg_url.status_code}")


# Load the Excel file
wb = openpyxl.load_workbook(r'dewu.xlsx')
# The active sheet (the first one by default)
sheet = wb.active
# Adjust the row heights and the column width
for row in range(2, sheet.max_row + 1):
    sheet.row_dimensions[row].height = 75
sheet.column_dimensions['C'].width = 20

# Read each link, download the image and insert it at the matching position
for row in range(2, sheet.max_row + 1):
    # The links start in row 2 of column C (column = 3); read the link cell's value
    link = sheet.cell(row=row, column=3).value
    # Clear the cell
    sheet.cell(row=row, column=3).value = None
    # If the link is not empty
    if link:
        try:
            # Try to download the image
            resized_image = download_image(link)
        except OSError:
            print(f"Failed to download image {link}")
            continue
        else:
            # Insert the resized image into the worksheet
            img_bytes = BytesIO()
            resized_image.save(img_bytes, format='PNG')  # keep the image in memory
            img = xlImage(img_bytes)
            sheet.add_image(img, f'{get_column_letter(3)}{row}')  # insert at the target cell

wb.save(r'dewu_result.xlsx')  # required
wb.close()  # required

Reference article: 【Python: reading URLs from Excel with openpyxl, then downloading and inserting the images】

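V. Reverse-engineering the encrypted parameter

1. Locate where the encryption happens

Use the developer tools to locate the code that produces the encrypted parameter.

2. Analyze with breakpoints

Use breakpoint debugging to work out the encryption rule.

The search returns four matching lines across three files. Work out which of them could be the encryption site, then set breakpoints there.

Interact with the page to trigger the breakpoints; the one where execution pauses is the location we are looking for.

Filter the requests by URL, find the corresponding request payload, and check whether the value after sign: matches the sign: c(e) value seen at the breakpoint just now.

Remove the other sign: c(e) breakpoint above, which is not needed.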

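Notes:

c(e) returns the sign value, where e is the argument and c is the function.

e is the POST request payload without sign.

The return value of c() is "0e5d10fb111f2afef6ac0a1776187e23" (32 characters drawn from 0-9 and a-f).
A 32-character string made of 0-9 and a-f suggests an MD5 digest.
To check whether a function is standard MD5, call it with the string '123456': if the result starts with 'e10adc' and ends with '883e', it is standard MD5.

The fact that the argument e is an object (a dictionary) also shows that c() is not simply MD5 applied to its argument, because MD5 normally takes a string as input.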
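The '123456' check has a Python counterpart, which is handy for comparing against whatever the JavaScript function returns. A minimal sketch using hashlib follows; the helper names md5_hex and looks_like_md5 are illustrative.

```python
import hashlib
import re


def md5_hex(text: str) -> str:
    """Standard MD5 of a UTF-8 string, as a lowercase hex digest."""
    return hashlib.md5(text.encode('utf-8')).hexdigest()


def looks_like_md5(value: str) -> bool:
    """True if value is 32 lowercase hex characters (necessary, not sufficient, for MD5)."""
    return re.fullmatch(r'[0-9a-f]{32}', value) is not None


reference = md5_hex('123456')
print(reference)
print(reference.startswith('e10adc'), reference.endswith('883e'))

# The sign captured at the breakpoint has the right shape for an MD5 digest
print(looks_like_md5('0e5d10fb111f2afef6ac0a1776187e23'))
```

The shape test only narrows things down: any 128-bit value printed in hex looks the same, so the '123456' comparison is the step that actually confirms standard MD5.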

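3. Copy the relevant JS encryption code and debug it locally (hard)

This part is harder to follow; for a detailed walkthrough see 【Python Web Scraping in Practice: Collecting Data from Popular Apps (Dewu)】 at 0:50:25.

Step into the c() function.

Create a new JavaScript file (I named mine js_file.js) and copy the code highlighted in the red box in the figure above into it.

function c(t) {...}

t = {
    filterUnbid: true,
    pageNum: 1,
    pageSize: 24,
    pickRuleId: 644443,
    showCspu: true
}  // request payload
console.log(c(t))

Running it fails with ReferenceError: u is not defined. An error like this is expected: it means some code is still missing. The fix is always the same: add whatever is missing, whether that is a function, a parameter, or part of the environment.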

Find the corresponding loader and add its code to the JavaScript file.

The JavaScript code is as follows:

var a_temp;  // added
!function a_method(e) {
    var n = {}
    function a(r) {...}
    a_temp = a  // added
    a.e = function (e) {...},
    a.m = e,
    a.c = n,
    a.d = function (e, r, t) {...},
    a.r = function (e) {...},
    a.t = function (e, r) {...},
    a.n = function (e) {...},
    a.o = function (e, r) {...},
    a.p = "",
    a.oe = function (e) {...}
}({});

a = (a_temp("cnSC"), a_temp("ODXe"), a_temp("aCH8"))  // r changed to a_temp
u = a_temp.n(a);  // r changed to a_temp

function c(t) {...}

t = {
    filterUnbid: true,
    pageNum: 1,
    pageSize: 24,
    pickRuleId: 644443,
    showCspu: true
}
console.log(c(t))

Running the code above fails with TypeError: Cannot read properties of undefined (reading 'call'), because the loader was called with an empty module object, so e[r] is undefined. Add a console.log(r) line inside the loader (as in the figure below) so that it prints the key of each module it tries to load:

Then copy the modules whose keys are printed into the object passed to the loader in the JavaScript file.

The JavaScript code is as follows:

var a_temp;  // added
!function a_method(e) {
    var n = {}
    function a(r) {
        ...
        try {
            console.log(r)  // added
            e[r].call(t.exports, t, t.exports, a),
            o = !1
        } finally {
            o && delete n[r]
        }
        ...
    }
    a_temp = a  // added
    a.e = function (e) {...},
    a.m = e,
    a.c = n,
    a.d = function (e, r, t) {...},
    a.r = function (e) {...},
    a.t = function (e, r) {...},
    a.n = function (e) {...},
    a.o = function (e, r) {...},
    a.p = "",
    a.oe = function (e) {...}
}({  // added
    cnSC: function (t, e) {...},
    ODXe: function (e, t, n) {...},
    BsWD: function (e, t, n) {...},
    a3WO: function (e, t, n) {...},
    aCH8: function (t, e, r) {...},
    ANhw: function (t, e) {...},
    mmNF: function (t, e) {...},
    BEtg: function (t, e) {...}
});

a = (a_temp("cnSC"), a_temp("ODXe"), a_temp("aCH8"))  // r changed to a_temp
u = a_temp.n(a);  // r changed to a_temp

function c(t) {...}

t = {
    filterUnbid: true,
    pageNum: 1,
    pageSize: 24,
    pickRuleId: 644443,
    showCspu: true
}
console.log(c(t))

For the complete js_file.js code, see 【The JS module used when scraping Dewu】

The result of running it is shown in the figure below:

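4. Obtain the encrypted sign parameter

Press Win + R, enter cmd to open the command prompt, and run pip install pyexecjs to install the execjs library (PyExecJS also needs a JavaScript runtime such as Node.js on the machine). Once it is installed, write import execjs in PyCharm to use the module.
Compile the JS code, obtain the encrypted sign parameter, and add the sign value to the request payload.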

Python code:

import requests
# Module for compiling and running JS code
import execjs

# Request headers
request_header = {...}
# Request URL
request_url = r'https://app.dewu.com/api/v1/h5/commodity-pick-interfaces/pc/pick-rule-result/feeds/info'
# Request payload
request_parameters = {
    'filterUnbid': True,
    'pageNum': 1,  # page number
    'pageSize': 24,
    'pickRuleId': 644443,  # category ID
    'showCspu': True
}
# Compile the JS code
with open('./js_file.js', encoding='utf-8') as js_file:
    js_code = execjs.compile(js_file.read())
# Obtain the encrypted sign parameter
sign_data = js_code.call('c', request_parameters)
# 0e5d10fb111f2afef6ac0a1776187e23
# Add sign to the request payload
request_parameters['sign'] = sign_data
# Request the data
response = requests.post(url=request_url, json=request_parameters, headers=request_header)
# Get the data
data_json = response.json()
# Parse the data
info_list = data_json['data']['list']
for index in info_list:
    info_dict = {
        'Title': index['title'],
        'Price': index['price'] / 100,
        'Image URL': index['logoUrl']
    }
    for key, value in info_dict.items():
        print(f'{key} : {value}')
    print('-' * 50)

VI. Other examples

1. Processing and saving a single page of data

Python code:

# Import the request module
import requests
import csv


def get_data_csv(file_path, head_name):
    # Imitate a browser (request headers)
    request_header = {
        'Referer': 'https://www.goofish.com/',
        # cookie carries user information and is often used to detect whether an account
        # is logged in (a cookie exists whether or not you are logged in)
        'Cookie': '...',
        # user-agent identifies the browser / device
        'User-Agent': '...'
    }
    # Request URL
    request_url = r'https://h5api.m.goofish.com/h5/mtop.taobao.idlemtopsearch.pc.search/1.0/'
    # Query parameters
    query_parameters = {
        'jsv': '2.7.2',
        'appKey': '34839810',
        't': '1750520204194',
        'sign': '0dba40964b402d00dc448081c8e04127',
        'v': '1.0',
        'type': 'originaljson',
        'accountSite': 'xianyu',
        'dataType': 'json',
        'timeout': '20000',
        'api': 'mtop.taobao.idlemtopsearch.pc.search',
        'sessionOption': 'AutoLoginOnly',
        'spm_cnt': 'a21ybx.search.0.0',
        'spm_pre': 'a21ybx.home.searchSuggest.1.4c053da6IXTxSx',
        'log_id': '4c053da6IXTxSx'
    }
    # Form data
    form_data = {
        "pageNumber": 1,
        "keyword": "python爬蟲書籍",
        "fromFilter": False,
        "rowsPerPage": 30,
        "sortValue": "",
        "sortField": "",
        "customDistance": "",
        "gps": "",
        "propValueStr": {},
        "customGps": "",
        "searchReqFromPage": "pcSearch",
        "extraFilterValue": "{}",
        "userPositionJson": "{}"
    }
    print('Data is being requested and processed…')
    # Send the request
    response = requests.post(url=request_url, params=query_parameters, data=form_data, headers=request_header)
    # Get the response as JSON → a dict
    data_json = response.json()
    # Index into the dict to get the list of items
    info_list = data_json['data']['resultList']
    # Create the file object; the with block closes it automatically
    with open(file_path, mode='a', encoding='utf-8-sig', newline='') as f:
        # Dictionary-based writer
        cd = csv.DictWriter(f, fieldnames=head_name)
        cd.writeheader()
        # Loop over the list and extract each element
        for index in info_list:
            content = index['data']['item']['main']['exContent']
            # Username
            nick_name = 'Unknown'
            if 'userNickName' in content:
                nick_name = content['userNickName']
            # Price
            price = ''
            for p in content['price']:
                price += p['text']
            # Detail page link
            item_id = content['itemId']
            link = f'https://www.goofish.com/item?id={item_id}'
            temporarily_dict = {
                'Title': content['title'],
                'Region': content['area'],
                'Price': price,
                'Username': nick_name,
                'Detail page link': link
            }
            # The writer object is named cd (the original called the undefined cd_file)
            cd.writerow(temporarily_dict)


if __name__ == '__main__':
    f_path = './fish.csv'
    h_name = ['Title', 'Region', 'Price', 'Username', 'Detail page link']
    get_data_csv(f_path, h_name)

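2. Collecting paginated data: MD5 encryption

1) Analyze how the request URL / parameters change

As shown in the figure below, t can be obtained with the time module, and pageNumber can be generated with a for loop.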
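The two changing values can be produced as follows. This is a minimal sketch, assuming t is a millisecond Unix timestamp (it has the 13-digit shape of the t value in the earlier example); the helper name current_t and the page range are illustrative.

```python
import time


def current_t() -> str:
    """Millisecond Unix timestamp as a string."""
    return str(int(time.time() * 1000))


# Build one (pageNumber, t) pair per page; pages 1 to 3 here
for page_number in range(1, 4):
    t = current_t()
    print(f'pageNumber={page_number}, t={t}')
```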

2) Reverse-engineer the encrypted sign parameter

Use the developer tools to locate the encryption site.

Analyze with breakpoints.

k = i(d.token + "&" + j + "&" + h + "&" + c.data), where:

d.token = "b92a905a245d2523e9ca49dd382dad12"  // fixed
j = 1750571387066  // timestamp (changes)
h = "34839810"  // fixed
// form data; only the page number pageNumber changes
c.data = '{"pageNumber": 1, "keyword": "python爬蟲書籍", "fromFilter": false, "rowsPerPage": 30, "sortValue": "", "sortField": "", "customDistance": "", "gps": "", "propValueStr": {}, "customGps": "", "searchReqFromPage": "pcSearch", "extraFilterValue": "{}", "userPositionJson": "{}"}'
k = "1c32f4de228112a3a59df6972d186b41"  // return value: 32 characters of 0-9 a-f

To check whether this is MD5: call the function i() with the string '123456'. If the result starts with 'e10adc' and ends with '883e', it is standard MD5.

# Import the hashing module
import hashlib

d_token = 'b92a905a245d2523e9ca49dd382dad12'
j = 1750571387066  # <class 'int'>
h = '34839810'
c_data = ('{"pageNumber": 1, '
          '"keyword": "python爬蟲書籍", '
          '"fromFilter": false, '
          '"rowsPerPage": 30, '
          '"sortValue": "", '
          '"sortField": "", '
          '"customDistance": "", '
          '"gps": "", '
          '"propValueStr": {}, '
          '"customGps": "", '
          '"searchReqFromPage": "pcSearch", '
          '"extraFilterValue": "{}", '
          '"userPositionJson": "{}"}')
result_str = d_token + "&" + str(j) + "&" + h + "&" + c_data
# Create the MD5 hash object
md_str = hashlib.md5()
# Feed in the string to be hashed
md_str.update(result_str.encode('utf-8'))
# Compute the hex digest
sign = md_str.hexdigest()  # <class 'str'>
print(sign)  # 1c32f4de228112a3a59df6972d186b41
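Because the digest is computed over the exact characters of the data string, the way the form fields are serialized matters: one extra space changes the sign. The sketch below wraps the token&t&appKey&data rule in a function and builds the data string compactly, the way JavaScript's JSON.stringify does (no spaces, non-ASCII left unescaped). That the site serializes its data this way is an assumption, as are the helper names, the shortened field list and the placeholder token.

```python
import hashlib
import json


def build_data(page_number: int, keyword: str) -> str:
    """Serialize a shortened set of form fields without whitespace."""
    fields = {'pageNumber': page_number, 'keyword': keyword, 'rowsPerPage': 30}
    # separators removes the spaces json.dumps adds by default;
    # ensure_ascii=False keeps non-ASCII characters unescaped
    return json.dumps(fields, separators=(',', ':'), ensure_ascii=False)


def make_sign(token: str, t: int, app_key: str, data: str) -> str:
    """MD5 hex digest of token&t&appKey&data, mirroring the k = i(...) expression."""
    raw = f'{token}&{t}&{app_key}&{data}'
    return hashlib.md5(raw.encode('utf-8')).hexdigest()


token = 'placeholder-token'  # hypothetical; the real token comes from the site
data_page_1 = build_data(1, 'python')
data_page_2 = build_data(2, 'python')

print(data_page_1)
print(make_sign(token, 1750571387066, '34839810', data_page_1))
print(make_sign(token, 1750571387066, '34839810', data_page_2))
```

If a computed sign does not match the one seen in the browser, comparing the data string character by character with the c.data value at the breakpoint is usually the quickest way to find the difference.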

3) Python implementation

# Import the request module
import requests
import csv
# Import the hashing module
import hashlib
import time


def get_sign(page):
    d_token = '...'  # the token expires; fill in your own
    j = int(time.time() * 1000)
    h = '...'
    c_data = ('{"pageNumber": %d, ...}') % page
    result_str = d_token + "&" + str(j) + "&" + h + "&" + c_data
    # Create the MD5 hash object
    md_str = hashlib.md5()
    # Feed in the string to be hashed
    md_str.update(result_str.encode('utf-8'))
    # Compute the hex digest
    sign = md_str.hexdigest()
    return sign, j, c_data


def get_data_csv(file_path, head_name):
    # Imitate a browser (request headers)
    request_header = {
        'Referer': 'https://www.goofish.com/',
        # cookie carries user information and is often used to detect whether an account
        # is logged in (a cookie exists whether or not you are logged in)
        # the cookie expires; fill in your own
        'Cookie': '...',
        # user-agent identifies the browser / device
        'User-Agent': '...'
    }
    # Request URL
    request_url = r'https://h5api.m.goofish.com/h5/mtop.taobao.idlemtopsearch.pc.search/1.0/'
    # Create the file object; the with block closes it automatically
    with open(file_path, mode='a', encoding='utf-8-sig', newline='') as f:
        # Dictionary-based writer
        cd = csv.DictWriter(f, fieldnames=head_name)
        cd.writeheader()
        # Loop over the pages
        num = 10
        for i in range(1, num + 1):
            print(f'Collecting page {i}…')
            # Get the encrypted sign parameter, the timestamp and the form data
            sign, j_time, c_data = get_sign(i)
            # Query parameters
            query_parameters = {
                'jsv': '2.7.2',
                'appKey': '34839810',
                't': str(j_time),
                'sign': sign,
                'v': '1.0',
                'type': 'originaljson',
                'accountSite': 'xianyu',
                'dataType': 'json',
                'timeout': '20000',
                'api': 'mtop.taobao.idlemtopsearch.pc.search',
                'sessionOption': 'AutoLoginOnly',
                'spm_cnt': 'a21ybx.search.0.0',
                'spm_pre': 'a21ybx.home.searchSuggest.1.4c053da6IXTxSx',
                'log_id': '4c053da6IXTxSx'
            }
            # Form data
            form_data = {"data": c_data}
            # Send the request
            response = requests.post(url=request_url, params=query_parameters, data=form_data, headers=request_header)
            # Get the response as JSON → a dict
            data_json = response.json()
            # Index into the dict to get the list of items
            info_list = data_json['data']['resultList']
            # Loop over the list and extract each element
            for index in info_list:
                content = index['data']['item']['main']['exContent']
                # Username
                nick_name = 'Unknown'
                if 'userNickName' in content:
                    nick_name = content['userNickName']
                # Price
                price = ''
                for p in content['price']:
                    price += p['text']
                # Detail page link
                item_id = content['itemId']
                link = f'https://www.goofish.com/item?id={item_id}'
                temporarily_dict = {
                    'Title': content['title'],
                    'Region': content['area'],
                    'Price': price,
                    'Username': nick_name,
                    'Detail page link': link
                }
                cd.writerow(temporarily_dict)


if __name__ == '__main__':
    f_path = './fish_python.csv'
    h_name = ['Title', 'Region', 'Price', 'Username', 'Detail page link']
    get_data_csv(f_path, h_name)

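Run results:

Note: at run time you may get the error {'api': 'mtop.taobao.idlemtopsearch.pc.search', 'data': {}, 'ret': ['FAIL_SYS_TOKEN_EXOIRED::令牌過期'], 'v': '1.0'} (token expired). This happens because both d_token and the cookie expire and change every so often, so replace them with the current d_token and cookie values.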
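Since an expired token comes back as an ordinary JSON body with an empty data field, indexing straight into data['resultList'] fails with a KeyError that hides the real cause. One way to surface it is to inspect ret first. The sketch below is an illustration, assuming that failures are marked by a FAIL prefix as in the message above; the function name, the exception class and the successful sample are hypothetical.

```python
class TokenExpiredError(RuntimeError):
    """Raised when the response reports an expired token."""


def extract_result_list(data_json: dict) -> list:
    """Return data['resultList'], or raise a descriptive error if ret reports a failure."""
    failures = [entry for entry in data_json.get('ret', []) if entry.startswith('FAIL')]
    if any('TOKEN' in entry for entry in failures):
        raise TokenExpiredError(f'update d_token and cookie: {failures}')
    if failures:
        raise RuntimeError(f'request failed: {failures}')
    return data_json.get('data', {}).get('resultList', [])


# The failure response quoted in the note above
expired = {'api': 'mtop.taobao.idlemtopsearch.pc.search', 'data': {},
           'ret': ['FAIL_SYS_TOKEN_EXOIRED::令牌過期'], 'v': '1.0'}
# A made-up successful response for comparison
ok = {'ret': ['SUCCESS::ok'], 'data': {'resultList': [{'id': 1}, {'id': 2}]}}

print(extract_result_list(ok))
try:
    extract_result_list(expired)
except TokenExpiredError as error:
    print('Token expired:', error)
```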