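Scraping Tmall Product Data with a Python Crawler: A Detailed Tutorial [Classic Python Hands-On Project]

A Detailed Tutorial on Scraping Tmall Product Data with Python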

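I. Preparation

1. Environment Setup

Python environment: make sure Python 3.x is installed. Using Anaconda or installing directly from the official Python website is recommended.
Third-party libraries:
- requests: sends HTTP requests.
- BeautifulSoup: parses HTML content.
- lxml: serves as the parser behind BeautifulSoup to speed up parsing.
- selenium (optional): handles dynamically loaded content.
- pandas (optional): processes and stores data.

Install command: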

pip install requests beautifulsoup4 lxml selenium pandas
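
After installing, it can help to confirm which versions actually ended up in the environment. The following is a minimal sketch that relies only on the standard library's importlib.metadata; the helper name installed_versions is made up for this illustration.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Distribution names as they appear in the pip command above
wanted = ["requests", "beautifulsoup4", "lxml", "selenium", "pandas"]
for name, version in installed_versions(wanted).items():
    print(f"{name}: {version or 'not installed'}")
```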

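2. Understand Tmall's Anti-Scraping Mechanisms

E-commerce platforms such as Tmall usually have mature anti-scraping mechanisms, including but not limited to:

- User-Agent detection: the User-Agent field in the request headers is checked.
- IP limits: frequent requests may get your IP banned.
- CAPTCHAs: some actions may require solving a CAPTCHA.
- Dynamic loading: part of the content is loaded dynamically by JavaScript.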
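
Two of the mechanisms above, User-Agent detection and rate-based IP limits, are commonly softened by varying the User-Agent and pausing for a random interval between requests. The sketch below shows one way to do that with the standard library only. The names USER_AGENTS, build_headers and pick_delay are hypothetical, and the User-Agent strings are placeholders rather than values known to work against Tmall. CAPTCHAs and dynamic loading are not addressed here.

```python
import random

# Hypothetical pool of User-Agent strings; replace with values from real browsers
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/15.0 Safari/605.1.15",
]


def build_headers(rng=random):
    """Return request headers with a User-Agent picked at random from the pool."""
    return {"User-Agent": rng.choice(USER_AGENTS)}


def pick_delay(min_seconds=1.0, max_seconds=3.0, rng=random):
    """Return a random pause length so requests do not arrive at a fixed rhythm."""
    return rng.uniform(min_seconds, max_seconds)


headers = build_headers()
delay = pick_delay()
print(headers["User-Agent"])
print(f"pause {delay:.2f}s before the next request")
```

In a crawler, the headers dict would be passed to requests.get(url, headers=headers), and time.sleep(delay) would run between consecutive requests.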

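II. Basic Steps for Scraping Tmall Product Data

1. Analyze the Target Page

- Open a Tmall product page: open a Tmall product detail page in your browser, then right-click and choose "Inspect" or press F12 to open the developer tools.
- Inspect the network requests: in the "Network" tab of the developer tools, refresh the page and look at the request URLs and response bodies.
- Locate the data: find the HTML elements that contain the product information and note their tag names, class names, or IDs.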
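
When comparing request URLs in the Network tab, it helps to split a URL into its query parameters. The detail-page URL used in this tutorial carries the product ID in the id parameter, so a small sketch like the following can pull it out. The function name extract_item_id and the sample ID are invented for illustration.

```python
from urllib.parse import parse_qs, urlparse


def extract_item_id(product_url):
    """Return the value of the `id` query parameter, or None if it is missing."""
    query = parse_qs(urlparse(product_url).query)
    values = query.get("id")
    return values[0] if values else None


# The id value below is a made-up example
print(extract_item_id("https://detail.tmall.com/item.htm?id=1234567890&spm=abc"))
```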

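2. Send the HTTP Request

Use the requests library to send an HTTP request and fetch the page content.

import requests

url = 'https://detail.tmall.com/item.htm?id=PRODUCT_ID'  # replace PRODUCT_ID with the actual product ID
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}

# The timeout keeps the script from hanging forever on an unresponsive server
response = requests.get(url, headers=headers, timeout=10)
if response.status_code == 200:
    html_content = response.text
else:
    print(f"Request failed, status code: {response.status_code}")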

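3. Parse the HTML Content

Use BeautifulSoup to parse the HTML and extract the product information. The selectors below are the ones used in the complete example at the end of this tutorial. Tmall's markup changes over time, so check them against the live page in the developer tools.

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_content, 'lxml')

# Each lookup falls back to a placeholder string when the element is missing
title_tag = soup.find('span', class_='J_TSearch_Title')
title = title_tag.get_text().strip() if title_tag else 'Title not found'

price_tag = soup.find('span', class_='tm-price')
price = price_tag.get_text().strip() if price_tag else 'Price not found'

sales_tag = soup.find('div', class_='tm-detail-hd-sale')
sales_span = sales_tag.find('span') if sales_tag else None
sales = sales_span.get_text().strip().replace('月銷', '') if sales_span else 'Sales not found'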
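
Text pulled out of the HTML is still a raw string, and a price may carry a currency symbol or thousands separators. The sketch below normalizes such a string into a Decimal. It assumes the price text looks something like ¥1,299.00, which is an assumption about the page rather than something confirmed here, and parse_price is a hypothetical helper name.

```python
import re
from decimal import Decimal


def parse_price(text):
    """Extract the first number from a price string; return None if there is none."""
    match = re.search(r"\d+(?:\.\d+)?", text.replace(",", ""))
    return Decimal(match.group()) if match else None


print(parse_price("¥1,299.00"))
print(parse_price("no digits here"))
```

Decimal is used instead of float so that prices keep their exact decimal representation when they are summed or compared later.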

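4. Handle Dynamically Loaded Content (Optional)

If the product information is loaded dynamically by JavaScript, you can use selenium to drive a real browser.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time

# Configure the ChromeDriver path
driver_path = 'path/to/chromedriver'  # replace with the actual ChromeDriver path
# Selenium 4 takes the driver path through a Service object;
# the older executable_path argument of webdriver.Chrome has been removed
driver = webdriver.Chrome(service=Service(executable_path=driver_path))

driver.get(url)  # url is the product URL defined in the request step
time.sleep(5)  # wait for the page to finish loading

# Extract dynamically loaded content (example: the product title)
title_element = driver.find_element(By.CSS_SELECTOR, 'span.J_TSearch_Title')
title = title_element.text.strip()
print(f"Product title: {title}")

# Close the browser
driver.quit()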
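
A fixed time.sleep(5) either wastes time or is too short on a slow connection. The usual alternative is to poll for a condition until it holds or a timeout expires. Selenium ships its own helper for this, WebDriverWait. The sketch below shows only the underlying idea in plain Python so that it runs without a browser; wait_until and element_ready are hypothetical names.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value or time runs out.

    Returns the truthy value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# Stand-in for "the element has appeared": becomes truthy on the third check
attempts = {"count": 0}


def element_ready():
    attempts["count"] += 1
    return "title element" if attempts["count"] >= 3 else None


print(wait_until(element_ready, timeout=2.0, interval=0.01))
```

With a real driver, the condition could be a function that calls driver.find_elements(By.CSS_SELECTOR, ...) and returns the resulting list, which is empty (falsy) until the element exists.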

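5. Store the Data

Save the scraped data to a local file or a database.

Save to a CSV file

import pandas as pd

# title, price and sales come from the parsing step
data = {
    'Product title': [title],
    'Product price': [price],
    'Product sales': [sales],
}
df = pd.DataFrame(data)
# utf-8-sig writes a byte order mark so that Excel detects the encoding correctly
df.to_csv('tmall_products.csv', index=False, encoding='utf-8-sig')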
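
The pandas call above rewrites the file on every run. When products are scraped one at a time over several runs, appending may be more convenient. Below is a minimal sketch using the standard library's csv module instead of pandas; the column names, the helper append_rows and the sample rows are all made up for illustration.

```python
import csv
import os
import tempfile

FIELDNAMES = ["title", "price", "sales"]


def append_rows(path, rows):
    """Append dict rows to a CSV file, writing the header only for a new or empty file."""
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    # newline="" lets the csv module control line endings itself
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        if write_header:
            writer.writeheader()
        writer.writerows(rows)


# Demo in a temporary directory so nothing is left behind
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tmall_products.csv")
    append_rows(path, [{"title": "Sample A", "price": "19.90", "sales": "120"}])
    append_rows(path, [{"title": "Sample B", "price": "5.00", "sales": "8"}])
    with open(path, newline="", encoding="utf-8") as f:
        print(f.read())
```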

Save to a database (MySQL as an example)

import pymysql  # not in the install command above; install it with: pip install pymysql

# Connect to the database
conn = pymysql.connect(
    host='localhost',
    user='username',
    password='password',
    database='database_name',
    charset='utf8mb4',
)
cursor = conn.cursor()

# Create the table if it does not exist
cursor.execute('''
CREATE TABLE IF NOT EXISTS tmall_products (
    id INT AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(255),
    price VARCHAR(50),
    sales VARCHAR(50)
)
''')

# Insert the data; the %s placeholders let the driver escape the values safely
sql = '''
INSERT INTO tmall_products (title, price, sales)
VALUES (%s, %s, %s)
'''
cursor.execute(sql, (title, price, sales))

# Commit the transaction
conn.commit()

# Close the connection
cursor.close()
conn.close()
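
If no MySQL server is at hand, the same table layout can be tried out with SQLite, which ships with Python. This is a swapped-in alternative for experimenting, not the MySQL approach shown above. The function save_products and the sample rows are hypothetical.

```python
import sqlite3


def save_products(conn, products):
    """Create the table if needed and insert (title, price, sales) tuples."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS tmall_products (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            title TEXT,
            price TEXT,
            sales TEXT
        )
        """
    )
    # sqlite3 uses ? placeholders where pymysql uses %s
    conn.executemany(
        "INSERT INTO tmall_products (title, price, sales) VALUES (?, ?, ?)",
        products,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database, discarded on close
save_products(conn, [("Sample A", "19.90", "120"), ("Sample B", "5.00", "8")])
print(conn.execute("SELECT id, title, price, sales FROM tmall_products").fetchall())
conn.close()
```

Passing a file path instead of ":memory:" to sqlite3.connect keeps the data between runs.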

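III. Advanced Techniques and Caveats

1. Handle Pagination

To scrape several pages of products, work out the pattern in the pagination URL and loop over it.

import requests
from bs4 import BeautifulSoup

base_url = 'https://list.tmall.com/search_product.htm?q=KEYWORD&s='  # replace KEYWORD with the actual search keyword

for page in range(0, 100, 44):  # 44 products per page; offsets 0, 44 and 88 cover the first 3 pages
    url = f"{base_url}{page}"
    response = requests.get(url, headers=headers)  # headers is the dict defined in the request step
    if response.status_code == 200:
        html_content = response.text
        soup = BeautifulSoup(html_content, 'lxml')
        # Extract the product information on the current page (example: product titles)
        product_tags = soup.find_all('div', class_='product')
        for product in product_tags:
            title_tag = product.find('a', class_='product-title')
            if title_tag:
                title = title_tag.get_text().strip()
                print(f"Product title: {title}")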
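
Concatenating the keyword into the URL by hand breaks as soon as it contains spaces or non-ASCII characters. A sketch of building the page URLs with urllib.parse.urlencode follows. The parameter names q and s and the page size of 44 are taken from the example above; build_page_urls is a hypothetical helper. urlencode percent-encodes text as UTF-8 by default, and whether Tmall expects that encoding for the keyword is not confirmed here.

```python
from urllib.parse import urlencode

BASE_URL = "https://list.tmall.com/search_product.htm"
PAGE_SIZE = 44  # products per page, as in the example above


def build_page_urls(keyword, pages):
    """Return the search URLs for the first `pages` result pages."""
    urls = []
    for page_index in range(pages):
        # s is the offset of the first product on the page: 0, 44, 88, ...
        query = urlencode({"q": keyword, "s": page_index * PAGE_SIZE})
        urls.append(f"{BASE_URL}?{query}")
    return urls


for url in build_page_urls("phone case", 3):
    print(url)
```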

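2. Use Proxy IPs

To avoid having your IP banned, you can route requests through a proxy.

# The key selects which target URLs go through the proxy; the scheme in the value
# is how to connect to the proxy itself. Most HTTP proxies are reached over http://
# for both keys; use an https:// value only if the proxy accepts TLS connections.
proxies = {
    'http': 'http://your_proxy_ip:port',
    'https': 'http://your_proxy_ip:port',
}
response = requests.get(url, headers=headers, proxies=proxies)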
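
With more than one proxy available, requests can be spread across them in turn. The sketch below cycles through a pool and yields dicts in the shape that the proxies argument of requests.get expects. The addresses are placeholders, and proxy_rotation is a hypothetical helper.

```python
from itertools import cycle

# Hypothetical proxy addresses; replace with proxies you are allowed to use
PROXY_POOL = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
]


def proxy_rotation(pool):
    """Yield requests-style proxies dicts, cycling through the pool forever."""
    for address in cycle(pool):
        yield {"http": address, "https": address}


rotation = proxy_rotation(PROXY_POOL)
for _ in range(3):
    print(next(rotation))
```

Each request would then take proxies=next(rotation), so consecutive requests leave from different addresses.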

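3. Comply with Laws, Regulations and Site Rules

- Respect robots.txt: before scraping, check the target site's robots.txt file and make sure your crawling complies with the site's rules.
- Set a reasonable request interval: avoid sending requests in rapid succession, which puts excessive load on the server.
- Do not infringe on privacy: make sure the scraped data does not involve users' private information.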
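
The robots.txt check can be automated with the standard library's urllib.robotparser. The sketch below parses a made-up robots.txt so that it runs without network access. The rules, the crawler name my-crawler and the example.com URLs are all invented and say nothing about Tmall's actual file.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration; a real crawler reads the site's own file
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if `robots_txt` permits `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "my-crawler", "https://example.com/private/page"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-crawler", "https://example.com/products"))
```

Against a live site, the parser can load the file itself: call set_url() with the address of the site's robots.txt, then read(), before asking can_fetch().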

4. Exception Handling

In real-world use, add exception handling to cope with failed network requests, changes in the HTML structure, and similar problems.

try:
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # raises HTTPError for 4xx and 5xx status codes
    html_content = response.text
    soup = BeautifulSoup(html_content, 'lxml')
    # Extract the product information...
except requests.exceptions.RequestException as e:
    print(f"Request error: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")
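
Many network failures are temporary, so retrying a few times with a growing pause often works better than giving up after the first error. Here is a minimal sketch of retry with exponential backoff in plain Python. The helper retry and the stand-in function flaky_fetch are hypothetical; the sleep parameter exists only so the pause can be replaced in tests.

```python
import time


def retry(operation, attempts=3, base_delay=1.0, exceptions=(Exception,), sleep=time.sleep):
    """Run `operation` until it succeeds or `attempts` tries have failed.

    The pause doubles after each failure: base_delay, 2 * base_delay, and so on.
    The last exception is re-raised when every attempt fails.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except exceptions:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))


# Stand-in for a flaky request: fails twice, then succeeds
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "<html>ok</html>"


print(retry(flaky_fetch, base_delay=0.01))
```

With requests, the operation could be a lambda wrapping requests.get(url, headers=headers, timeout=10), with exceptions=(requests.exceptions.RequestException,) so that only network errors trigger a retry.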

IV. Complete Code Example

import requests
from bs4 import BeautifulSoup
import pandas as pd


def crawl_tmall_product(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
    }
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        html_content = response.text
        soup = BeautifulSoup(html_content, 'lxml')

        # Extract the product title
        title_tag = soup.find('span', class_='J_TSearch_Title')
        title = title_tag.get_text().strip() if title_tag else 'Title not found'

        # Extract the product price
        price_tag = soup.find('span', class_='tm-price')
        price = price_tag.get_text().strip() if price_tag else 'Price not found'

        # Extract the product sales (monthly sales as an example)
        sales_tag = soup.find('div', class_='tm-detail-hd-sale')
        # Guard against a missing inner span as well as a missing container
        sales_span = sales_tag.find('span') if sales_tag else None
        sales = sales_span.get_text().strip().replace('月銷', '') if sales_span else 'Sales not found'

        return {
            'Product title': title,
            'Product price': price,
            'Product sales': sales,
        }
    except requests.exceptions.RequestException as e:
        print(f"Request error: {e}")
        return None
    except Exception as e:
        print(f"Unexpected error: {e}")
        return None


def main():
    # Example: scrape a single product
    product_url = 'https://detail.tmall.com/item.htm?id=PRODUCT_ID'  # replace PRODUCT_ID with the actual product ID
    product_data = crawl_tmall_product(product_url)
    if product_data:
        print(f"Product title: {product_data['Product title']}")
        print(f"Product price: {product_data['Product price']}")
        print(f"Product sales: {product_data['Product sales']}")

        # Save to a CSV file
        data = [product_data]
        df = pd.DataFrame(data)
        df.to_csv('tmall_products.csv', index=False, encoding='utf-8-sig')
        print("Data saved to tmall_products.csv")


if __name__ == '__main__':
    main()

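V. Summary

By following the steps above, you can scrape Tmall product data with Python. In practice, adjust the code to the target site's actual structure, and make sure to comply with the relevant laws, regulations and site rules. Hopefully this tutorial is helpful!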
