Scraping and Analyzing Tmall Product Data with Scrapy: A Hands-On Guide (Including API Signature Reverse Engineering and Visualization)

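Using the Huawei Mate 60 Pro as an example, this article walks through scraping Tmall product data with the Scrapy framework. It covers the full workflow: API signature reverse engineering, anti-scraping countermeasures, data storage, and visualization. It is aimed at readers who already know the basics of web scraping and want hands-on practice.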

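1. Packet Capture Analysis: Locating the Tmall Product API

1.1 Goal and Tools

Goal: obtain price, sales, and related data for the Huawei Mate 60 Pro
Tools: Chrome DevTools (F12), mitmproxy (optional)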

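1.2 Steps

Log in to Tmall and open the product page
Visit https://detail.tmall.com/item.htm?id=725643567890, right-click the page → Inspect, and switch to the Network panel.

Refresh the page to capture the traffic
Type the keyword detail/get.json into the filter box to narrow down the requests and find the target API: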

https://api.tmall.com/rest/item/1.0/item/detail/get.json?itemId=725643567890&t=1685275400000&sign=abc123...

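Extract the key parameters
- itemId: the product ID (725643567890)
- t: a 13-digit millisecond timestamp (e.g. 1685275400000)
- sign: an MD5 signature (its generation rule has to be reverse engineered)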
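The parameter extraction above can be sketched with the standard library. This is only an illustration: `extract_api_params` is a hypothetical helper name, and the URL is the example from the capture, whose sign value is truncated in this article and is kept that way here.

```python
from urllib.parse import urlsplit, parse_qs


def extract_api_params(url):
    """Return the query parameters of a captured API URL as a flat dict."""
    query = urlsplit(url).query
    # parse_qs maps each key to a list of values; keep the first one
    return {key: values[0] for key, values in parse_qs(query).items()}


# Example URL from the capture above (the sign value is truncated in the article)
captured = (
    "https://api.tmall.com/rest/item/1.0/item/detail/get.json"
    "?itemId=725643567890&t=1685275400000&sign=abc123..."
)
params = extract_api_params(captured)
print(params["itemId"], params["t"], len(params["t"]))
```

Pasting a freshly captured URL into `captured` gives a quick check that `t` really has 13 digits before any signing code is written.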

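2. Environment Setup: Initializing the Scrapy Project

2.1 Creating a Virtual Environment and Installing Dependencies

# Create the virtual environment
python -m venv venv
# Activate the environment (Windows)
venv\Scripts\activate.bat
# Install dependencies (pandas is needed by visualize.py in section 4.3)
pip install scrapy requests cryptography pandas matplotlib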

2.2 Initializing the Scrapy Project

scrapy startproject tmall_huawei
cd tmall_huawei
scrapy genspider huawei_spider tmall.com

2.3 Project Structure

tmall_huawei/
├── scrapy.cfg
└── tmall_huawei/
    ├── items.py         # data structure definitions
    ├── middlewares.py   # anti-scraping middleware
    ├── pipelines.py     # data storage
    ├── settings.py      # configuration
    ├── spiders/
    │   └── huawei_spider.py  # spider logic
    └── utils/
        └── crypto.py    # signature generation function

3. Core Development: Signature Generation and Spider Logic

3.1 Writing the Signature Function (`utils/crypto.py`)

import hashlib
import time


def generate_tmall_sign(item_id, app_key="12574478", salt="0c8a5244c7d2b6e1b"):
    """Generate the Tmall API signature."""
    t = str(int(time.time() * 1000))  # 13-digit millisecond timestamp
    sign_str = f"{t}{item_id}{app_key}{salt}"  # concatenation order must match the server's
    sign = hashlib.md5(sign_str.encode()).hexdigest().lower()  # lowercase hex digest
    return {"t": t, "sign": sign, "appKey": app_key}


# Quick test
if __name__ == "__main__":
    params = generate_tmall_sign("725643567890")
    print(f"Generated timestamp: {params['t']}, signature: {params['sign']}")
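Because the signature function reads the current time, its output cannot be compared directly with a captured request. One way to check the concatenation rule is to recompute the signature for the `t` value seen in the capture. The sketch below assumes the same rule, app key, and salt as the article's example; `sign_with_timestamp` and `matches_capture` are hypothetical helper names, and the real rule still has to be confirmed against the site's JavaScript.

```python
import hashlib


def sign_with_timestamp(t, item_id, app_key, salt):
    """Recompute the MD5 signature for a known timestamp (assumed rule)."""
    sign_str = f"{t}{item_id}{app_key}{salt}"
    return hashlib.md5(sign_str.encode()).hexdigest()


def matches_capture(captured_sign, t, item_id,
                    app_key="12574478", salt="0c8a5244c7d2b6e1b"):
    """Return True if the assumed rule reproduces the captured signature."""
    return sign_with_timestamp(t, item_id, app_key, salt) == captured_sign.lower()


# Self-check with a signature produced by the same rule
t = "1685275400000"
expected = sign_with_timestamp(t, "725643567890", "12574478", "0c8a5244c7d2b6e1b")
print(matches_capture(expected, t, "725643567890"))  # True
```

If `matches_capture` returns False for a real capture, the parameter order, the salt, or the hash input is different from what is assumed here.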

3.2 Defining the Data Structure (`items.py`)

import scrapy


class TmallHuaweiItem(scrapy.Item):
    item_id = scrapy.Field()     # product ID
    title = scrapy.Field()       # product title
    price = scrapy.Field()       # price
    sales = scrapy.Field()       # monthly sales
    shop_name = scrapy.Field()   # shop name
    timestamp = scrapy.Field()   # collection time

3.3 Writing the Spider Logic (`spiders/huawei_spider.py`)

import time  # needed for the collection timestamp in parse_item
from urllib.parse import urlencode

import scrapy

from ..utils.crypto import generate_tmall_sign
from ..items import TmallHuaweiItem


class HuaweiSpiderSpider(scrapy.Spider):
    name = "huawei_spider"
    allowed_domains = ["tmall.com"]
    start_urls = ["https://detail.tmall.com/item.htm?id=725643567890"]

    def start_requests(self):
        item_id = "725643567890"
        sign_params = generate_tmall_sign(item_id)
        # Build the API query parameters
        params = {
            "itemId": item_id,
            "type": "json",
            "version": "1.0",
            "isLowPrice": "false",
            **sign_params,
        }
        api_url = f"https://api.tmall.com/rest/item/1.0/item/detail/get.json?{urlencode(params)}"
        # Send the API request with browser-like headers
        yield scrapy.Request(
            api_url,
            callback=self.parse_item,
            headers=self.get_headers(),
            meta={"item_id": item_id},
        )

    def parse_item(self, response):
        item = TmallHuaweiItem()
        data = response.json()
        item_info = data.get("data", {}).get("item", {})
        shop_info = item_info.get("shop", {})
        item["item_id"] = response.meta["item_id"]
        item["title"] = item_info.get("title", "")
        item["price"] = item_info.get("sellPrice", {}).get("price", "0.0")
        item["sales"] = item_info.get("sales", "0")
        item["shop_name"] = shop_info.get("name", "")
        item["timestamp"] = int(time.time())
        yield item

    def get_headers(self):
        """Browser-like request headers (including Referer and User-Agent)."""
        return {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/114.0.0.0 Safari/537.36",
            "Referer": "https://detail.tmall.com/item.htm?id=725643567890",
            "Accept": "application/json, text/plain, */*",
        }

4. Data Storage and Visualization

4.1 Saving to CSV (`pipelines.py`)

import csv
import time


class CSVPipeline:
    def __init__(self):
        self.filename = f"huawei_mate60_{time.strftime('%Y%m%d_%H%M')}.csv"
        # newline="" lets the csv module control line endings;
        # utf-8-sig writes a BOM so Excel displays Chinese text correctly
        self.file = open(self.filename, "w", newline="", encoding="utf-8-sig")
        self.writer = csv.DictWriter(
            self.file,
            fieldnames=["item_id", "title", "price", "sales", "shop_name", "timestamp"],
        )
        self.writer.writeheader()

    def process_item(self, item, spider):
        self.writer.writerow(dict(item))
        return item

    def close_spider(self, spider):
        self.file.close()

4.2 Enabling the Pipeline (`settings.py`)

ITEM_PIPELINES = {
    "tmall_huawei.pipelines.CSVPipeline": 300,
}

4.3 Visualizing the Price Trend (`visualize.py`)

import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime


def plot_price_trend(csv_path):
    df = pd.read_csv(csv_path)
    df["price"] = df["price"].astype(float)
    # Convert Unix timestamps (seconds) to local datetimes
    df["time"] = df["timestamp"].apply(lambda x: datetime.fromtimestamp(x))

    plt.figure(figsize=(12, 6))
    plt.plot(df["time"], df["price"], marker="o", color="#FF6B6B", linestyle="-")
    plt.title("Huawei Mate 60 Pro Price Trend", fontsize=16)
    plt.xlabel("Time", fontsize=12)
    plt.ylabel("Price (CNY)", fontsize=12)
    plt.xticks(rotation=45)
    plt.grid(True, linestyle="--", alpha=0.7)
    plt.tight_layout()
    plt.show()


if __name__ == "__main__":
    plot_price_trend("huawei_mate60_20250527_1530.csv")

5. Debugging Tips and Anti-Scraping Countermeasures

5.1 Common Errors and Fixes

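Error	Cause	Fix
`403 Forbidden`	Wrong signature or missing request headers	Compare against the captured signature; add Cookie and Referer headers
`JSONDecodeError`	The API returned non-JSON data	Check that the URL is correct and that the session is logged in
`KeyError: 'data'`	The response structure changed	Capture the traffic again and re-check the JSON paths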
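The last two rows of the table can be guarded against in code rather than left to surface as exceptions. The following is a minimal sketch in plain Python, assuming the `data` → `item` response layout used by the spider in section 3.3; `extract_item_info` is a hypothetical helper name.

```python
import json


def extract_item_info(body_text):
    """Return the item dict from an API response body, or None if unusable."""
    try:
        data = json.loads(body_text)
    except json.JSONDecodeError:
        # e.g. an HTML login or verification page was returned instead of JSON
        return None
    if not isinstance(data, dict):
        return None
    payload = data.get("data")
    if not isinstance(payload, dict):
        # the response structure differs from the one captured earlier
        return None
    item_info = payload.get("item")
    return item_info if isinstance(item_info, dict) else None


print(extract_item_info("<html>login</html>"))
print(extract_item_info('{"data": {"item": {"title": "demo"}}}'))
```

In a spider, a `None` result is a natural place to log the raw response body and skip the item instead of yielding empty fields.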

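5.2 Anti-Scraping Strategies

Request interval: set DOWNLOAD_DELAY = 2 in settings.py
User-Agent rotation: use the fake_useragent library to switch the UA dynamically
Proxy IP pool: integrate the scrapy-proxies middleware (requires a proxy service)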
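As an illustration of the User-Agent rotation idea, here is a minimal downloader-middleware sketch for `middlewares.py`. It swaps in a small static list instead of the fake_useragent library to stay dependency-free; the class name `RandomUserAgentMiddleware` and the sample UA strings are assumptions for illustration, not part of the original project.

```python
import random

# Sample User-Agent strings for illustration; replace with a maintained list
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/114.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Chrome/114.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) Chrome/114.0.0.0 Safari/537.36",
]


class RandomUserAgentMiddleware:
    """Downloader middleware that picks a random User-Agent per request."""

    def process_request(self, request, spider):
        # Overwrite whatever User-Agent the spider set on the request
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        # Returning None tells Scrapy to continue processing the request
        return None
```

To enable it, settings.py would map "tmall_huawei.middlewares.RandomUserAgentMiddleware" to a priority such as 400 in DOWNLOADER_MIDDLEWARES, alongside DOWNLOAD_DELAY = 2. Note that this middleware overrides the fixed User-Agent returned by get_headers() in the spider.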

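6. Summary

This article walked through the complete workflow of scraping Tmall product data. The key technical points are:

Locating the API and reverse engineering its parameters with Chrome packet capture
Implementing the spider with the Scrapy framework
Generating the MD5 signature and handling anti-scraping measures
Storing the data and visualizing it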

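In practice, the strategy has to be adjusted to the site's anti-scraping mechanisms (for example, dynamic salts or CAPTCHA handling). The project can be extended further into a distributed cluster or integrated with a monitoring and alerting system.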
