Table of Contents
- 1. Installing common libraries
- 2. Basic crawler development
- 2.1. Fetching page content with requests
- 2.2. Parsing HTML with BeautifulSoup
- 2.3. Handling login and sessions
- 3. Advanced crawler development
- 3.1. Handling dynamically loaded content (Selenium)
- 3.2. Using the Scrapy framework
- 3.3. Distributed crawling (Scrapy-Redis)
- 4. Crawler optimization and anti-anti-scraping strategies
- 4.1. Common anti-scraping mechanisms and countermeasures
- 4.2. Proxy IP usage example
- 4.3. Random delays and request headers
References:
- BeautifulSoup official documentation: https://beautifulsoup.readthedocs.io/zh-cn/v4.4.0/
- https://cloud.tencent.com/developer/article/1193258
- https://blog.csdn.net/zcs2312852665/article/details/144804553
- https://blog.51cto.com/haiyongblog/13806452
1. Installing common libraries
pip install requests beautifulsoup4 scrapy selenium pandas
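The examples further below also import webdriver_manager, fake_useragent, and scrapy_redis, which the command above does not cover; assuming you want to run those snippets as-is, install them too:
pip install webdriver-manager fake-useragent scrapy-redis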
2. Basic crawler development
2.1. Fetching page content with requests
import requests

url = 'https://top.baidu.com/board?tab=realtime'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}

response = requests.get(url, headers=headers)
print(response.status_code)  # 200 means success
print(response.text[:500])   # print the first 500 characters
2.2. Parsing HTML with BeautifulSoup
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>Test Page</title></head>
<body>
<p class="title"><b>Example Site</b></p>
<p class="story">This is an example page
<a href="http://example.com/1" class="link" id="link1">Link 1</a>
<a href="http://example.com/2" class="link" id="link2">Link 2</a>
</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Get the page title
print(soup.title.string)

# Get all links
for link in soup.find_all('a'):
    print(link.get('href'), link.string)

# Find an element by its CSS class
print(soup.find('p', class_='title').text)
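BeautifulSoup also supports CSS selectors through select(); a small addition to the example above that pulls the links out of the "story" paragraph:

# CSS selectors: all <a class="link"> elements inside the "story" paragraph
for link in soup.select('p.story > a.link'):
    print(link['id'], link.get('href'))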
2.3. Handling login and sessions
import requests

login_url = 'https://example.com/login'
target_url = 'https://example.com/dashboard'

session = requests.Session()

# Login request
login_data = {
    'username': 'your_username',
    'password': 'your_password'
}
response = session.post(login_url, data=login_data)

if response.status_code == 200:
    # Access a page that requires login
    dashboard = session.get(target_url)
    print(dashboard.text)
else:
    print('Login failed')
3. Advanced crawler development
3.1. Handling dynamically loaded content (Selenium)
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Set up a headless browser
options = webdriver.ChromeOptions()
options.add_argument('--headless')     # run without a UI
options.add_argument('--disable-gpu')

# Automatically download a matching chromedriver
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)

url = 'https://dynamic-website.com'
driver.get(url)

# Wait for elements to load (implicit wait)
driver.implicitly_wait(10)

# Grab the dynamically rendered content
dynamic_content = driver.find_element(By.CLASS_NAME, 'dynamic-content')
print(dynamic_content.text)

driver.quit()
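When an element only appears after a script finishes, an explicit wait is usually more reliable than the implicitly_wait call above. A minimal sketch, reusing the same 'dynamic-content' class:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Block for up to 10 seconds until the element is present in the DOM
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'dynamic-content'))
)
print(element.text)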
3.2. Using the Scrapy framework
# Create a Scrapy project
# scrapy startproject example_project
# cd example_project
# scrapy genspider example example.com

# Example spider code
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def parse(self, response):
        # Extract data
        title = response.css('title::text').get()
        links = response.css('a::attr(href)').getall()
        yield {
            'title': title,
            'links': links
        }

# Run the spider
# scrapy crawl example -o output.json
3.3. Distributed crawling (Scrapy-Redis)
# settings.py configuration
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
REDIS_URL = 'redis://localhost:6379'

# Spider code
from scrapy_redis.spiders import RedisSpider

class MyDistributedSpider(RedisSpider):
    name = 'distributed_spider'
    redis_key = 'spider:start_urls'

    def parse(self, response):
        # Parsing logic
        pass
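A RedisSpider idles until start URLs are pushed into the Redis key it listens on. A minimal way to seed the queue, assuming a local Redis instance and the redis-py client:

import redis

# Push a start URL into the queue the distributed spider is listening on
# (equivalent to: redis-cli lpush spider:start_urls http://example.com/)
r = redis.Redis(host='localhost', port=6379)
r.lpush('spider:start_urls', 'http://example.com/')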
4. Crawler optimization and anti-anti-scraping strategies
4.1. Common anti-scraping mechanisms and countermeasures
User-Agent detection: rotate User-Agent strings at random (see 4.3)
IP rate limiting: use a proxy IP pool (see 4.2)
Captchas: OCR recognition or a captcha-solving service (a small OCR sketch follows this list)
Behavior analysis: mimic human-like intervals between requests
JavaScript rendering: use Selenium or Pyppeteer
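For the captcha case, a minimal OCR sketch using Pillow and pytesseract (both are assumptions beyond the original install list, the Tesseract binary must be installed separately, the captcha URL is hypothetical, and this only works for simple printed-text captchas):

from io import BytesIO

import pytesseract
import requests
from PIL import Image

def solve_captcha(session: requests.Session, captcha_url: str) -> str:
    # Download the captcha within the same session so cookies stay consistent
    resp = session.get(captcha_url, timeout=5)
    # Convert to grayscale, which usually helps Tesseract on simple captchas
    img = Image.open(BytesIO(resp.content)).convert('L')
    return pytesseract.image_to_string(img).strip()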
4.2. Proxy IP usage example
import requests

proxies = {
    'http': 'http://proxy_ip:port',
    'https': 'https://proxy_ip:port'
}

try:
    response = requests.get('https://example.com', proxies=proxies, timeout=5)
    print(response.text)
except Exception as e:
    print(f'Request failed: {e}')
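Section 4.1 recommends a proxy IP pool rather than a single proxy; a minimal rotation sketch on top of the snippet above (the pool entries are placeholders, in practice they would come from a provider or a pool service):

import random
import requests

# Placeholder pool; replace with real proxy addresses
PROXY_POOL = [
    'http://proxy1:port',
    'http://proxy2:port',
    'http://proxy3:port',
]

def get_via_random_proxy(url):
    proxy = random.choice(PROXY_POOL)
    return requests.get(url, proxies={'http': proxy, 'https': proxy}, timeout=5)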
4.3. Random delays and request headers
import random
import time
import requests
from fake_useragent import UserAgent

ua = UserAgent()

def random_delay():
    # Sleep for a random 0.5-2.5 seconds between requests
    time.sleep(random.uniform(0.5, 2.5))

def get_with_random_headers(url):
    # Randomize the User-Agent and add common headers to look less bot-like
    headers = {
        'User-Agent': ua.random,
        'Accept-Language': 'en-US,en;q=0.5',
        'Referer': 'https://www.google.com/'
    }
    random_delay()
    return requests.get(url, headers=headers)
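A short usage example for the helper above (the URL is a placeholder):

response = get_with_random_headers('https://example.com')
print(response.status_code)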