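Selenium Basics
Written from the perspective of an experienced crawler engineer, this tutorial walks through Selenium for browser automation testing and web scraping.
It targets Python 3.12, uses uv for dependency management, and builds a mock website with FastAPI for hands-on practice.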
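Chapter 1: Environment Setup
1.1 Install Python 3.12
Make sure Python 3.12 is installed; installers are available from the official Python website.
1.2 Install the uv package manager
uv is a fast Python package and project manager that can replace a pip-based workflow:
# Install uv (macOS | Linux)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Or install it with pip
pip install uv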
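1.3 Create the project and install dependencies
# Create the project directory
mkdir selenium-tutorial && cd selenium-tutorial
# Initialize the project and pin the Python version
uv init -p 3.12
# Create the virtual environment
uv venv .venv
# Activate the virtual environment
# Windows
.venv\Scripts\activate
# macOS | Linux
source .venv/bin/activate
# Install the required dependencies
uv add selenium fastapi uvicorn jinja2 python-multipart webdriver-manager
Alternatively, let uv init create the project directory for you:
# Create and initialize the project directory, pinning the Python version
uv init selenium-tutorial -p 3.12 && cd selenium-tutorial
# Create the virtual environment
uv venv .venv
# Activate the virtual environment
# Windows
.venv\Scripts\activate
# macOS | Linux
source .venv/bin/activate
# Install the required dependencies
uv add selenium fastapi uvicorn jinja2 python-multipart webdriver-manager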
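1.4 Browser driver configuration
Selenium needs a driver executable that matches each browser. This tutorial uses webdriver-manager to manage drivers automatically:
- Chrome: downloads the chromedriver that matches the installed browser version
- Firefox: downloads geckodriver
- Edge: downloads msedgedriver
There is no need to download drivers or configure paths by hand; webdriver-manager takes care of it.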
Chapter 2: Building the FastAPI Mock Website
To practice safely and legally, we build a mock website to serve as the scraping target.
app.py
"""
code: app.py
"""
from datetime import datetime
import random
import string

from fastapi import FastAPI, Form, Request
from fastapi.responses import HTMLResponse
from fastapi.templating import Jinja2Templates

# Create the FastAPI application
app = FastAPI(title="Selenium 練習網站")

# Template directory
templates = Jinja2Templates(directory="templates")


def generate_token():
    """Generate a random token used to demonstrate token handling."""
    return ''.join(random.choices(string.ascii_letters + string.digits, k=16))


# Home page
@app.get("/", response_class=HTMLResponse)
async def read_root(request: Request):
    # Generate the page token
    token = generate_token()
    return templates.TemplateResponse("index.html", {
        "request": request,
        "token": token,
        "products": [
            {"id": 1, "name": "筆記本電腦", "price": 5999, "category": "電子產品"},
            {"id": 2, "name": "機械鍵盤", "price": 399, "category": "電腦配件"},
            {"id": 3, "name": "無線鼠標", "price": 199, "category": "電腦配件"},
            {"id": 4, "name": "藍牙耳機", "price": 799, "category": "音頻設備"},
        ],
    })


# Login page
@app.get("/login", response_class=HTMLResponse)
async def login_page(request: Request):
    token = generate_token()
    return templates.TemplateResponse("login.html", {"request": request, "token": token})


# Handle the login request
@app.post("/login", response_class=HTMLResponse)
async def login(
    request: Request,
    username: str = Form(...),
    password: str = Form(...),
    token: str = Form(...),
):
    # Simple validation logic (the token must be present but is not checked in this demo)
    if username == "test" and password == "password123":
        return templates.TemplateResponse("dashboard.html", {
            "request": request,
            "message": "登錄成功",
            "username": username,
            # dashboard.html displays the login time through {{ now }}
            "now": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
        })
    return templates.TemplateResponse("login.html", {
        "request": request,
        "error": "用戶名或密碼錯誤",
        "token": generate_token(),
    })


# Dynamic content page (used to demonstrate waits)
@app.get("/dynamic", response_class=HTMLResponse)
async def dynamic_content(request: Request):
    return templates.TemplateResponse("dynamic.html", {"request": request})


# Form page
@app.get("/form", response_class=HTMLResponse)
async def form_page(request: Request):
    token = generate_token()
    return templates.TemplateResponse("form.html", {"request": request, "token": token})


# Handle the form submission so that posting form.html does not answer 405
@app.post("/form", response_class=HTMLResponse)
async def submit_form(
    request: Request,
    name: str = Form(...),
    email: str = Form(...),
    token: str = Form(...),
):
    # Re-render the form page with a success flag and a fresh token
    return templates.TemplateResponse("form.html", {
        "request": request,
        "token": generate_token(),
        "success": True,
    })


# File upload page
@app.get("/upload", response_class=HTMLResponse)
async def upload_page(request: Request):
    return templates.TemplateResponse("upload.html", {"request": request})


if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
Creating the template files
Create a templates directory and add the following HTML files:
templates/index.html
<!DOCTYPE html>
<html>
<head>
  <title>Selenium練習網站</title>
  <style>
    body { font-family: Arial, sans-serif; max-width: 1200px; margin: 0 auto; padding: 20px; }
    .product { border: 1px solid #ddd; padding: 10px; margin: 10px; display: inline-block; width: 200px; }
    .nav { margin-bottom: 20px; }
    .nav a { margin-right: 15px; text-decoration: none; color: #333; }
  </style>
</head>
<body>
  <input type="hidden" id="csrf_token" value="{{ token }}">
  <div class="nav">
    <a href="/" id="home-link">首頁</a>
    <a href="/login" class="nav-link">登錄</a>
    <a href="/dynamic" class="nav-link">動態內容</a>
    <a href="/form" class="nav-link">表單</a>
    <a href="/upload" class="nav-link">文件上傳</a>
  </div>
  <h1>產品列表</h1>
  <div id="products">
    {% for product in products %}
    <div class="product" data-id="{{ product.id }}">
      <h3 class="product-name">{{ product.name }}</h3>
      <p class="product-price">價格: ¥{{ product.price }}</p>
      <p class="category">{{ product.category }}</p>
      <button class="add-to-cart" data-product="{{ product.id }}">加入購物車</button>
    </div>
    {% endfor %}
  </div>
</body>
</html>
templates/login.html
<!DOCTYPE html>
<html>
<head>
  <title>登錄 - Selenium練習網站</title>
  <style>
    .container { max-width: 400px; margin: 50px auto; padding: 20px; border: 1px solid #ddd; }
    .error { color: red; }
  </style>
</head>
<body>
  <div class="container">
    <h1>用戶登錄</h1>
    {% if error %}
    <p class="error">{{ error }}</p>
    {% endif %}
    <form method="post">
      <input type="hidden" name="token" value="{{ token }}">
      <div>
        <label for="username">用戶名:</label>
        <input type="text" id="username" name="username" required>
      </div>
      <div>
        <label for="password">密碼:</label>
        <input type="password" id="password" name="password" required>
      </div>
      <button type="submit" id="submit-btn">登錄</button>
    </form>
  </div>
</body>
</html>
templates/dashboard.html
<!DOCTYPE html>
<html>
<head>
  <title>用戶中心 - Selenium練習網站</title>
  <style>
    .container { max-width: 800px; margin: 50px auto; padding: 20px; border: 1px solid #ddd; }
    .success { color: green; font-size: 1.2em; }
    .user-info { margin: 20px 0; padding: 15px; background-color: #f5f5f5; }
    .nav { margin-bottom: 20px; }
    .nav a { margin-right: 15px; text-decoration: none; color: #333; }
  </style>
</head>
<body>
  <div class="nav">
    <a href="/" id="home-link">首頁</a>
    <a href="/login" class="nav-link">登錄</a>
    <a href="/dynamic" class="nav-link">動態內容</a>
    <a href="/form" class="nav-link">表單</a>
    <a href="/upload" class="nav-link">文件上傳</a>
  </div>
  <div class="container">
    <h1>用戶中心</h1>
    <p class="success">{{ message }}</p>
    <div class="user-info">
      <p>用戶名: {{ username }}</p>
      <p>登錄時間: {{ now }}</p>
      <p>賬戶狀態: 正常</p>
    </div>
    <h3>最近活動</h3>
    <ul>
      <li>瀏覽了產品列表</li>
      <li>查看了動態內容</li>
      <li>提交了測試表單</li>
    </ul>
  </div>
</body>
</html>
templates/dynamic.html
<!DOCTYPE html>
<html>
<head>
  <title>動態內容 - Selenium練習網站</title>
  <style>
    .container { max-width: 800px; margin: 50px auto; padding: 20px; }
    .dynamic-content { margin: 20px 0; padding: 15px; border: 1px solid #ccc; display: none; }
    .visible-after-delay { margin: 20px 0; padding: 15px; background-color: #e3f2fd; display: none; }
    #delayed-button { padding: 10px 20px; background-color: #4CAF50; color: white; border: none; cursor: pointer; display: none; }
    #status-message { margin-top: 20px; padding: 10px; }
    .nav { margin-bottom: 20px; }
    .nav a { margin-right: 15px; text-decoration: none; color: #333; }
  </style>
</head>
<body>
  <div class="nav">
    <a href="/" id="home-link">首頁</a>
    <a href="/login" class="nav-link">登錄</a>
    <a href="/dynamic" class="nav-link">動態內容</a>
    <a href="/form" class="nav-link">表單</a>
    <a href="/upload" class="nav-link">文件上傳</a>
  </div>
  <div class="container">
    <h1>動態內容演示</h1>
    <p>本頁面展示各種動態加載的內容,用于測試Selenium的等待機制。</p>
    <div id="dynamic-content" class="dynamic-content">這是延遲加載的動態內容,通常通過JavaScript在頁面加載后一段時間顯示。</div>
    <div class="visible-after-delay">這是另一個延遲顯示的內容,使用了不同的延遲時間。</div>
    <button id="delayed-button">點擊我</button>
    <div id="status-message"></div>
  </div>
  <script>
    // Simulate dynamically loaded content: shown after 2 seconds
    setTimeout(() => {
      document.getElementById('dynamic-content').style.display = 'block';
    }, 2000);

    // A second delayed element: shown after 4 seconds
    setTimeout(() => {
      document.querySelector('.visible-after-delay').style.display = 'block';
    }, 4000);

    // Show the button after 6 seconds and attach its click handler
    setTimeout(() => {
      const button = document.getElementById('delayed-button');
      button.style.display = 'inline-block';
      button.addEventListener('click', () => {
        document.getElementById('status-message').textContent = '按鈕已點擊,操作成功!';
        document.getElementById('status-message').style.backgroundColor = '#dff0d8';
      });
    }, 6000);

    // Store a CSRF token in a JavaScript variable, for demonstration purposes
    window.csrfToken = 'dynamic_' + Math.random().toString(36).substring(2);
  </script>
</body>
</html>
templates/form.html
<!DOCTYPE html>
<html>
<head>
  <title>表單示例 - Selenium練習網站</title>
  <style>
    .container { max-width: 600px; margin: 50px auto; padding: 20px; border: 1px solid #ddd; }
    .form-group { margin-bottom: 15px; }
    label { display: block; margin-bottom: 5px; }
    input, select, textarea { width: 100%; padding: 8px; box-sizing: border-box; }
    button { padding: 10px 20px; background-color: #4CAF50; color: white; border: none; cursor: pointer; }
    .success { color: green; margin-top: 15px; padding: 10px; background-color: #dff0d8; display: none; }
    .error { color: red; margin-top: 15px; }
    .nav { margin-bottom: 20px; }
    .nav a { margin-right: 15px; text-decoration: none; color: #333; }
  </style>
</head>
<body>
  <div class="nav">
    <a href="/" id="home-link">首頁</a>
    <a href="/login" class="nav-link">登錄</a>
    <a href="/dynamic" class="nav-link">動態內容</a>
    <a href="/form" class="nav-link">表單</a>
    <a href="/upload" class="nav-link">文件上傳</a>
  </div>
  <div class="container">
    <h1>用戶信息表單</h1>
    <form id="user-form" method="post">
      <input type="hidden" name="token" value="{{ token }}">
      <div class="form-group">
        <label for="name">姓名:</label>
        <input type="text" id="name" name="name" required>
      </div>
      <div class="form-group">
        <label for="email">郵箱:</label>
        <input type="email" id="email" name="email" required>
      </div>
      <div class="form-group">
        <label for="age">年齡:</label>
        <input type="number" id="age" name="age" min="1" max="120">
      </div>
      <div class="form-group">
        <label for="gender">性別:</label>
        <select id="gender" name="gender">
          <option value="">請選擇</option>
          <option value="male">男</option>
          <option value="female">女</option>
          <option value="other">其他</option>
        </select>
      </div>
      <div class="form-group">
        <label>興趣愛好:</label>
        <div>
          <input type="checkbox" id="hobby1" name="hobbies" value="reading">
          <label for="hobby1">閱讀</label>
          <input type="checkbox" id="hobby2" name="hobbies" value="sports">
          <label for="hobby2">運動</label>
          <input type="checkbox" id="hobby3" name="hobbies" value="music">
          <label for="hobby3">音樂</label>
        </div>
      </div>
      <div class="form-group">
        <label for="message">留言:</label>
        <textarea id="message" name="message" rows="4"></textarea>
      </div>
      <button type="submit" id="submit-form">提交</button>
    </form>
    <div id="form-success" class="success"{% if success %} style="display: block;"{% endif %}>表單提交成功!</div>
    {% if error %}
    <div class="error">{{ error }}</div>
    {% endif %}
  </div>
  <script>
    // Simple form validation
    document.getElementById('user-form').addEventListener('submit', function(e) {
      const name = document.getElementById('name').value;
      const email = document.getElementById('email').value;
      if (!name || !email) {
        alert('請填寫姓名和郵箱');
        e.preventDefault();
        return false;
      }
      // Show the success message just before the form is actually submitted
      document.getElementById('form-success').style.display = 'block';
      return true;
    });
  </script>
</body>
</html>
templates/upload.html
<!DOCTYPE html>
<html>
<head>
  <title>文件上傳 - Selenium練習網站</title>
  <style>
    .container { max-width: 600px; margin: 50px auto; padding: 20px; border: 1px solid #ddd; }
    .form-group { margin-bottom: 15px; }
    label { display: block; margin-bottom: 5px; }
    input[type="file"] { margin: 10px 0; }
    button { padding: 10px 20px; background-color: #4CAF50; color: white; border: none; cursor: pointer; }
    .upload-area { border: 2px dashed #ccc; padding: 30px; text-align: center; margin-bottom: 20px; }
    .upload-area.dragover { border-color: #4CAF50; background-color: #f5f5f5; }
    .message { margin-top: 20px; padding: 10px; display: none; }
    .success { background-color: #dff0d8; color: #3c763d; }
    .error { background-color: #f2dede; color: #a94442; }
    .nav { margin-bottom: 20px; }
    .nav a { margin-right: 15px; text-decoration: none; color: #333; }
  </style>
</head>
<body>
  <div class="nav">
    <a href="/" id="home-link">首頁</a>
    <a href="/login" class="nav-link">登錄</a>
    <a href="/dynamic" class="nav-link">動態內容</a>
    <a href="/form" class="nav-link">表單</a>
    <a href="/upload" class="nav-link">文件上傳</a>
  </div>
  <div class="container">
    <h1>文件上傳演示</h1>
    <p>本頁面用于測試文件上傳功能,可以上傳圖片、文檔等文件。</p>
    <form id="upload-form" method="post" enctype="multipart/form-data">
      <div class="form-group">
        <label for="file-title">文件標題:</label>
        <input type="text" id="file-title" name="title" required>
      </div>
      <div class="form-group">
        <label>選擇文件:</label>
        <div class="upload-area" id="upload-area">
          點擊或拖拽文件到這里上傳
          <input type="file" id="file-upload" name="file" multiple style="display: none;">
        </div>
        <p id="file-name" style="margin-top: 10px;"></p>
      </div>
      <div class="form-group">
        <label for="file-description">文件描述:</label>
        <textarea id="file-description" name="description" rows="3"></textarea>
      </div>
      <button type="submit" id="upload-btn">上傳文件</button>
    </form>
    <div id="success-message" class="message success">文件上傳成功!</div>
    <div id="error-message" class="message error">文件上傳失敗,請重試。</div>
  </div>
  <script>
    // Drag-and-drop upload handling
    const uploadArea = document.getElementById('upload-area');
    const fileInput = document.getElementById('file-upload');
    const fileNameDisplay = document.getElementById('file-name');

    // Clicking the upload area opens the file chooser
    uploadArea.addEventListener('click', () => {
      fileInput.click();
    });

    // Show the names of the selected files
    fileInput.addEventListener('change', (e) => {
      if (e.target.files.length > 0) {
        const fileNames = Array.from(e.target.files).map(file => file.name).join(', ');
        fileNameDisplay.textContent = `已選擇: ${fileNames}`;
      }
    });

    // Drag events
    uploadArea.addEventListener('dragover', (e) => {
      e.preventDefault();
      uploadArea.classList.add('dragover');
    });
    uploadArea.addEventListener('dragleave', () => {
      uploadArea.classList.remove('dragover');
    });
    uploadArea.addEventListener('drop', (e) => {
      e.preventDefault();
      uploadArea.classList.remove('dragover');
      if (e.dataTransfer.files.length > 0) {
        // Simulation only; a real project needs additional handling here
        const fileNames = Array.from(e.dataTransfer.files).map(file => file.name).join(', ');
        fileNameDisplay.textContent = `已選擇: ${fileNames}`;
      }
    });

    // Form submission handling
    document.getElementById('upload-form').addEventListener('submit', function(e) {
      // app.py defines no POST /upload route, so the upload is only simulated in the browser
      e.preventDefault();
      // Simple validation
      if (!fileInput.files.length) {
        alert('請選擇要上傳的文件');
        return;
      }
      // Show the success message (a real application would let the server respond)
      setTimeout(() => {
        document.getElementById('success-message').style.display = 'block';
        document.getElementById('error-message').style.display = 'none';
      }, 1000);
    });
  </script>
</body>
</html>
Start the mock website:
uv run app.py
Open http://localhost:8000 in a browser to see the mock website.
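Before running any Selenium script, it can help to confirm that the mock site is actually answering. The following is a minimal sketch using only the standard library; the helper name site_is_up and the 2-second timeout are assumptions made for illustration, not part of the tutorial's code.

```python
import urllib.error
import urllib.request


def site_is_up(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with HTTP 200, False on any network or HTTP error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure or a non-2xx answer
        return False


if __name__ == "__main__":
    print("mock site reachable:", site_is_up("http://localhost:8000"))
```

If this prints False, check that uv run app.py is still running in another terminal before launching a browser session.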
Chapter 3: Selenium Basics
3.0 A Quick Warm-up
# -*- coding: utf-8 -*-
"""
02_start_browser.py
Purpose: start Chrome in three different ways
"""
from time import sleep

from selenium import webdriver  # main entry point
from selenium.webdriver.chrome.service import Service  # driver service
from selenium.webdriver.chrome.options import Options  # browser options
from webdriver_manager.chrome import ChromeDriverManager

# Method 1: the simplest form (driver already on PATH)
driver1 = webdriver.Chrome()
driver1.maximize_window()  # maximize the browser window
driver1.get("https://www.baidu.com")
sleep(10)
driver1.quit()

# Method 2: explicit driver path (replace it with the path on your own machine)
service = Service(executable_path=r"C:\Users\李昊哲\.wdm\drivers\chromedriver\win64\140.0.7339.82\chromedriver-win32\chromedriver.exe")
driver2 = webdriver.Chrome(service=service)
driver2.maximize_window()  # maximize the browser window
driver2.get("https://www.sogou.com")
sleep(10)
driver2.quit()

# Method 3: headless mode plus common options
options = Options()
options.add_argument("--headless")  # headless mode
options.add_argument("--window-size=1920,1080")  # window size; Chrome expects width,height
driver3 = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
driver3.get("https://cn.bing.com/")
print("Page title:", driver3.title)
driver3.quit()
3.1 The First Selenium Script
first_script.py
"""
code: first_script.py
"""
# Import the required libraries
import time

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager


def main():
    # Initialize the Chrome driver
    # webdriver_manager handles the driver, so no manual download is needed
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    try:
        # Open the home page of the mock website
        driver.get("http://localhost:8000")
        # Print the current page title
        print(f"Page title: {driver.title}")
        # Print the current page URL
        print(f"Current URL: {driver.current_url}")
        # Wait 3 seconds so the effect is visible
        time.sleep(3)

        # Refresh the page
        driver.refresh()
        print("Page refreshed")
        time.sleep(2)

        # Navigate to the login page
        driver.get("http://localhost:8000/login")
        print("Navigated to the login page")
        time.sleep(2)

        # Go back to the previous page
        driver.back()
        print("Went back to the home page")
        time.sleep(2)

        # Go forward to the next page
        driver.forward()
        print("Went forward to the login page")
        time.sleep(2)
    finally:
        # Close the browser
        driver.quit()
        print("Browser closed")


if __name__ == "__main__":
    main()
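3.2 Code Walkthrough
- Imports:
  - webdriver: Selenium's core module, providing the driver interface for each browser
  - Service: manages the browser driver process
  - ChromeDriverManager: installs the Chrome driver and matches its version automatically
- Browser initialization:
  - webdriver.Chrome(): creates a Chrome browser instance
  - Service and ChromeDriverManager together take care of the driver
- Basic operations:
  - get(url): open the given URL
  - title: the page title
  - current_url: the URL of the current page
  - refresh(): reload the page
  - back(): go back to the previous page
  - forward(): go forward to the next page
  - quit(): close the browser and release its resources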
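The try/finally around quit() in first_script.py can also be expressed once as a context manager. The sketch below is one possible way to do that; managed_driver and FakeDriver are hypothetical names introduced here, and FakeDriver only stands in for a real browser so the pattern can be run without Chrome.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(factory):
    """Create a driver via factory() and always call quit(), even if the body raises."""
    driver = factory()
    try:
        yield driver
    finally:
        driver.quit()


class FakeDriver:
    """Hypothetical stand-in that records calls instead of driving a browser."""

    def __init__(self):
        self.visited = []
        self.closed = False

    def get(self, url):
        self.visited.append(url)

    def quit(self):
        self.closed = True


if __name__ == "__main__":
    with managed_driver(FakeDriver) as driver:
        driver.get("http://localhost:8000")
    print("quit called:", driver.closed)
```

With a real browser, the factory would be something like lambda: webdriver.Chrome(service=Service(ChromeDriverManager().install())), reusing the imports from first_script.py.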
Chapter 4: The 8 Element Locator Strategies
Selenium provides 8 locator strategies; mastering them is the foundation of any automation work.
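As an overview before the individual sections, the eight strategies can be listed together with one example value each. This is only a cheat-sheet sketch: the strategy names are the attributes of selenium.webdriver.common.by.By, while the example values are taken from the mock site's templates and the list name LOCATOR_EXAMPLES is made up for illustration.

```python
# (By attribute, example value on the mock site, what it matches)
LOCATOR_EXAMPLES = [
    ("ID", "username", "the username field on /login"),
    ("NAME", "token", "the hidden token field on /login"),
    ("CLASS_NAME", "product", "each product card on the home page"),
    ("TAG_NAME", "a", "every link on the page"),
    ("LINK_TEXT", "登錄", "the link whose full text is 登錄"),
    ("PARTIAL_LINK_TEXT", "上傳", "the link whose text contains 上傳"),
    ("XPATH", "//div[@class='product']", "product cards, via XPath"),
    ("CSS_SELECTOR", ".product .product-price", "price paragraphs, via CSS"),
]


def as_call(strategy: str, value: str) -> str:
    """Render the find_element call that would use this strategy."""
    return f'driver.find_element(By.{strategy}, "{value}")'


if __name__ == "__main__":
    for strategy, value, meaning in LOCATOR_EXAMPLES:
        print(f"{as_call(strategy, value):<60} # {meaning}")
```

Each row corresponds to one of the sections 4.1 to 4.8 that follow.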
4.1 Locating by ID (By.ID)
locate_by_id.py
"""
code: locate_by_id.py
"""
import time

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager


def main():
    # Initialize the browser driver
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    try:
        # Open the login page
        driver.get("http://localhost:8000/login")
        time.sleep(1)

        # 1. Locate the username field by ID
        # Idea: find the element whose id is "username"; this is the most direct and reliable strategy
        username_input = driver.find_element(By.ID, "username")
        # Interact with the element: type the username
        username_input.send_keys("test_user")
        time.sleep(1)

        # 2. Locate the password field by ID
        password_input = driver.find_element(By.ID, "password")
        password_input.send_keys("test_password")
        time.sleep(1)

        # 3. Locate the submit button by ID
        submit_btn = driver.find_element(By.ID, "submit-btn")
        # Interact with the element: click the button
        submit_btn.click()
        time.sleep(2)
    finally:
        # Close the browser
        driver.quit()


if __name__ == "__main__":
    main()
4.2 Locating by Name (By.NAME)
locate_by_name.py
"""
code: locate_by_name.py
"""
import time

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager


def main():
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    try:
        driver.get("http://localhost:8000/login")
        time.sleep(1)

        # Locate the username field by name
        # Idea: usable whenever the element has a name attribute, which suits form fields
        username_input = driver.find_element(By.NAME, "username")
        username_input.send_keys("test")
        time.sleep(1)

        # Locate the password field by name
        password_input = driver.find_element(By.NAME, "password")
        password_input.send_keys("password123")
        time.sleep(1)

        # Locate the token field (a hidden field) by name
        # This is useful when dealing with CSRF validation
        token_input = driver.find_element(By.NAME, "token")
        print(f"Token value: {token_input.get_attribute('value')}")

        # Submit the form
        driver.find_element(By.ID, "submit-btn").click()
        time.sleep(2)
    finally:
        driver.quit()


if __name__ == "__main__":
    main()
4.3 Locating by Class Name (By.CLASS_NAME)
locate_by_class.py
"""
code: locate_by_class.py
"""
import time

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager


def main():
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    try:
        driver.get("http://localhost:8000")
        time.sleep(1)

        # Locate a single element by class name
        # Idea: find_element returns the first navigation link with class "nav-link"
        first_link = driver.find_element(By.CLASS_NAME, "nav-link")
        print(f"First navigation link text: {first_link.text}")
        first_link.click()
        time.sleep(2)

        # Return to the home page
        driver.back()
        time.sleep(1)

        # Locate multiple elements by class name
        # Idea: collect every product card with class "product"
        products = driver.find_elements(By.CLASS_NAME, "product")
        print(f"Found {len(products)} products")

        # Iterate over the products and print their names
        for product in products:
            # Search for the product name inside each product element
            name = product.find_element(By.CLASS_NAME, "product-name")
            print(f"Product name: {name.text}")
        time.sleep(2)
    finally:
        driver.quit()


if __name__ == "__main__":
    main()
4.4 Locating by Tag Name (By.TAG_NAME)
locate_by_tag.py
"""
code: locate_by_tag.py
"""
import time

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager


def main():
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    try:
        driver.get("http://localhost:8000")
        time.sleep(1)

        # Locate elements by tag name
        # Idea: collect every <a> link
        links = driver.find_elements(By.TAG_NAME, "a")
        print(f"The page has {len(links)} links")

        # Print the text and URL of every link
        for link in links:
            print(f"Link text: {link.text}, URL: {link.get_attribute('href')}")

        # Locate the heading by tag name
        # Idea: find the first <h1>
        heading = driver.find_element(By.TAG_NAME, "h1")
        print(f"Main heading: {heading.text}")

        # Locate the inputs of a form by tag name (more effective when combined with other strategies)
        driver.get("http://localhost:8000/login")
        inputs = driver.find_elements(By.TAG_NAME, "input")
        print(f"The login form has {len(inputs)} input elements")
        time.sleep(2)
    finally:
        driver.quit()


if __name__ == "__main__":
    main()
4.5 Locating by Link Text (By.LINK_TEXT)
locate_by_link_text.py
"""
code: locate_by_link_text.py
"""
import time

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager


def main():
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    try:
        driver.get("http://localhost:8000")
        time.sleep(1)

        # Locate by the complete link text
        # Idea: the whole text of the link must match exactly
        login_link = driver.find_element(By.LINK_TEXT, "登錄")
        print(f"Found the login link: {login_link.get_attribute('href')}")
        login_link.click()
        time.sleep(2)

        # Return to the home page
        driver.back()
        time.sleep(1)

        # Locate another link
        dynamic_link = driver.find_element(By.LINK_TEXT, "動態內容")
        dynamic_link.click()
        time.sleep(2)

        # Return to the home page
        driver.back()
        time.sleep(1)
    finally:
        driver.quit()


if __name__ == "__main__":
    main()
4.6 Locating by Partial Link Text (By.PARTIAL_LINK_TEXT)
locate_by_partial_link.py
"""
code: locate_by_partial_link.py
"""
import time

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager


def main():
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    try:
        driver.get("http://localhost:8000")
        time.sleep(1)

        # Locate by part of the link text
        # Idea: only a fragment has to match, which suits long or changing link texts
        form_link = driver.find_element(By.PARTIAL_LINK_TEXT, "表")
        print(f"Found the link containing '表': {form_link.text}")
        form_link.click()
        time.sleep(2)

        # Return to the home page
        driver.back()
        time.sleep(1)

        # Another example
        upload_link = driver.find_element(By.PARTIAL_LINK_TEXT, "上傳")
        upload_link.click()
        time.sleep(2)
    finally:
        driver.quit()


if __name__ == "__main__":
    main()
4.7 Locating by XPath (By.XPATH)
XPath is a language for addressing nodes in XML documents, and it works on HTML as well. It is one of the most flexible locator strategies.
locate_by_xpath.py
"""
code: locate_by_xpath.py
"""
import time

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager


def main():
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    try:
        driver.get("http://localhost:8000")
        time.sleep(1)

        # 1. Absolute path (not recommended: hard to maintain)
        # Idea: the full path from the root node; any change to the page structure breaks it
        home_link = driver.find_element(By.XPATH, "/html/body/div/a[1]")
        print(f"Link found by absolute path: {home_link.text}")

        # 2. Relative path
        # Idea: start from any node, which is more flexible
        products = driver.find_elements(By.XPATH, "//div[@class='product']")
        print(f"Found {len(products)} products by relative path")

        # 3. Attribute match
        # Idea: locate by the value of an attribute
        # The username field only exists on the login page, and find_element raises
        # NoSuchElementException when nothing matches, so navigate there first
        driver.get("http://localhost:8000/login")
        username_input = driver.find_element(By.XPATH, "//input[@id='username']")
        username_input.send_keys("test")

        # 4. Partial attribute match
        # Idea: match part of an attribute value with contains()
        password_input = driver.find_element(By.XPATH, "//input[contains(@name, 'pass')]")
        password_input.send_keys("password123")

        # 5. Text match
        # Idea: locate by the text content of the element
        submit_btn = driver.find_element(By.XPATH, "//button[text()='登錄']")
        submit_btn.click()
        time.sleep(2)

        # Return to the home page (back() would only lead to the login form again)
        driver.get("http://localhost:8000")
        time.sleep(1)

        # 6. Hierarchy
        # Idea: combine parent/child relationships
        first_product_price = driver.find_element(By.XPATH, "//div[@class='product'][1]//p[@class='product-price']")
        print(f"First product price: {first_product_price.text}")

        # 7. Logical operators
        # Idea: combine several conditions with and/or
        dynamic_link = driver.find_element(By.XPATH, "//a[@class='nav-link' and contains(text(), '動態')]")
        dynamic_link.click()
        time.sleep(2)
    finally:
        driver.quit()


if __name__ == "__main__":
    main()
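Simple XPath expressions like the attribute matches above can be tried without a browser. The sketch below uses the standard library's xml.etree.ElementTree on a hand-written fragment modelled on index.html; note that ElementTree implements only a subset of XPath (no contains(), no and/or), so it illustrates the path and attribute-predicate forms only. The SNIPPET constant is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

# A well-formed fragment shaped like the product list in index.html
SNIPPET = """
<div id="products">
  <div class="product" data-id="1">
    <h3 class="product-name">筆記本電腦</h3>
    <p class="product-price">價格: ¥5999</p>
  </div>
  <div class="product" data-id="2">
    <h3 class="product-name">機械鍵盤</h3>
    <p class="product-price">價格: ¥399</p>
  </div>
</div>
"""

root = ET.fromstring(SNIPPET)

# Relative path with an attribute predicate, like //div[@class='product']
products = root.findall(".//div[@class='product']")

# Search inside each match, like product.find_element(...) in Selenium
names = [p.find("./h3[@class='product-name']").text for p in products]

# Attribute predicate followed by a child step
second_price = root.find(".//div[@data-id='2']/p[@class='product-price']").text

if __name__ == "__main__":
    print(len(products), names, second_price)
```

Expressions that rely on contains(), text() or logical operators still need a full XPath engine such as the browser behind Selenium.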
4.8 Locating by CSS Selector (By.CSS_SELECTOR)
CSS selectors are another powerful locator strategy and are usually more concise than XPath.
locate_by_css.py
"""
code: locate_by_css.py
"""
import time

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager


def main():
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    try:
        driver.get("http://localhost:8000")
        time.sleep(1)

        # 1. ID selector
        # Idea: "#" followed by the id value
        home_link = driver.find_element(By.CSS_SELECTOR, "#home-link")
        print(f"Found by ID selector: {home_link.text}")

        # 2. Class selector
        # Idea: "." followed by the class value
        nav_links = driver.find_elements(By.CSS_SELECTOR, ".nav-link")
        print(f"Found {len(nav_links)} navigation links by class selector")

        # 3. Tag selector
        # Idea: use the tag names directly
        headings = driver.find_elements(By.CSS_SELECTOR, "h1, h3")
        print(f"Found {len(headings)} heading elements")

        # 4. Attribute selector
        # Idea: locate by element attributes
        token_input = driver.find_element(By.CSS_SELECTOR, "input[type='hidden'][id='csrf_token']")
        print(f"CSRF token value: {token_input.get_attribute('value')}")

        # 5. Descendant selector
        # Idea: locate through the element hierarchy
        product_prices = driver.find_elements(By.CSS_SELECTOR, ".product .product-price")
        print("All product prices:")
        for price in product_prices:
            print(price.text)

        # 6. Pseudo-class selector
        # Idea: use a CSS pseudo-class
        first_product = driver.find_element(By.CSS_SELECTOR, ".product:first-child")
        print(f"First product name: {first_product.find_element(By.CSS_SELECTOR, '.product-name').text}")

        # Navigate to the login page
        driver.get("http://localhost:8000/login")
        time.sleep(1)

        # 7. Combined selector
        # Idea: chain several conditions
        form_elements = driver.find_elements(By.CSS_SELECTOR, "form div input")
        print(f"The login form has {len(form_elements)} visible input elements")
        time.sleep(2)
    finally:
        driver.quit()


if __name__ == "__main__":
    main()
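Chapter 5: Waits
In browser automation, page elements often take time to load, so choosing the right wait mechanism is essential.
selenium_waits.py
"""
code: selenium_waits.py
"""
import time

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager


def main():
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    try:
        # 1. Implicit wait
        # Idea: a global timeout that applies to every element lookup
        # Note: an implicit wait stays in effect for the whole lifetime of the driver
        driver.implicitly_wait(10)  # wait up to 10 seconds
        driver.get("http://localhost:8000/dynamic")
        print("Opened the dynamic content page")

        # 2. Forced wait (not recommended)
        # Idea: sleep for a fixed time whether or not the element has loaded
        # Drawback: wastes time when the page is fast, and still fails when it is slow
        print("Using a forced wait...")
        time.sleep(3)  # sleep for 3 seconds

        # 3. Explicit wait
        # Idea: a condition and a timeout for one specific element
        # The Selenium documentation advises against mixing implicit and explicit waits,
        # so the implicit wait is switched off first
        driver.implicitly_wait(0)
        print("Using explicit waits...")
        try:
            # Wait for the element to be present, up to 10 seconds, polling every 500 ms
            dynamic_element = WebDriverWait(driver, 10, 0.5).until(
                EC.presence_of_element_located((By.ID, "dynamic-content"))
            )
            print(f"Found dynamic content: {dynamic_element.text}")

            # Wait for the element to become visible
            visible_element = WebDriverWait(driver, 10).until(
                EC.visibility_of_element_located((By.CLASS_NAME, "visible-after-delay"))
            )
            print(f"Visible element content: {visible_element.text}")

            # Wait for the element to become clickable
            clickable_button = WebDriverWait(driver, 10).until(
                EC.element_to_be_clickable((By.ID, "delayed-button"))
            )
            print("Clicking the delayed button")
            clickable_button.click()

            # Wait for the text to appear
            WebDriverWait(driver, 10).until(
                EC.text_to_be_present_in_element((By.ID, "status-message"), "已點擊")
            )
            status = driver.find_element(By.ID, "status-message")
            print(f"Status: {status.text}")
        except TimeoutException:
            print("Timeout: the element was not found within the allotted time")
        time.sleep(2)
    finally:
        driver.quit()


if __name__ == "__main__":
    main()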
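Common Expected Conditions
- presence_of_element_located: the element is present in the DOM
- visibility_of_element_located: the element is present and visible
- element_to_be_clickable: the element can be clicked
- text_to_be_present_in_element: the element contains the given text
- title_contains: the page title contains the given text
- invisibility_of_element_located: the element is not visible
- frame_to_be_available_and_switch_to_it: the frame is available, and the driver switches to it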
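Conceptually, an explicit wait is a polling loop: call a condition repeatedly until it returns something truthy or the timeout expires. The sketch below reproduces that idea with the standard library only; wait_until and WaitTimeout are hypothetical names, and the real WebDriverWait does more (for example, it ignores NoSuchElementException by default while polling).

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition never became truthy within the timeout."""


def wait_until(condition, timeout: float = 10.0, poll: float = 0.5):
    """Call condition() every `poll` seconds; return its first truthy result."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout(f"condition not met within {timeout} seconds")
        time.sleep(poll)


if __name__ == "__main__":
    attempts = []

    def ready_on_third_call():
        # Pretend the element only shows up on the third poll
        attempts.append(1)
        return "element" if len(attempts) >= 3 else None

    print(wait_until(ready_on_third_call, timeout=2.0, poll=0.1), len(attempts))
```

This is why an explicit wait returns as soon as the element is ready, whereas time.sleep always costs the full duration.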
Chapter 6: Getting Past Token Restrictions
Many websites embed a token (such as a CSRF token) in their forms and reject requests that lack a valid one, which trips up naive scripts. The following are common ways to handle such tokens:
bypass_token.py
"""
code: bypass_token.py
"""
import time

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager


def bypass_token_method1():
    """Method 1: extract the token from the page and use it."""
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    try:
        driver.get("http://localhost:8000/login")

        # 1. Extract the token from the page
        # Idea: read the token value first, then use it in later steps
        token_element = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.NAME, "token"))
        )
        token_value = token_element.get_attribute("value")
        print(f"Extracted token: {token_value}")

        # 2. Fill in the form
        driver.find_element(By.ID, "username").send_keys("test")
        driver.find_element(By.ID, "password").send_keys("password123")

        # 3. Submit the form (the hidden token is sent along automatically)
        driver.find_element(By.ID, "submit-btn").click()

        # Check whether the login succeeded
        try:
            WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.XPATH, "//*[contains(text(), '登錄成功')]"))
            )
            print("Login succeeded, token accepted")
        except TimeoutException:
            print("Login failed")
        time.sleep(3)
    finally:
        driver.quit()


def bypass_token_method2():
    """Method 2: let the browser session carry the token."""
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    try:
        # 1. First visit: the browser keeps whatever session state the site sets
        driver.get("http://localhost:8000")
        print("First visit, initial token issued")
        time.sleep(2)

        # 2. Navigate to another page in the same session
        # On this mock site every form carries its own fresh token in a hidden field;
        # sites that keep the token in a cookie have it resent by the browser
        driver.get("http://localhost:8000/form")
        print("Navigated to the form page")

        # 3. Fill in and submit the form; the browser sends the token automatically
        driver.find_element(By.ID, "name").send_keys("測試用戶")
        driver.find_element(By.ID, "email").send_keys("test@example.com")
        driver.find_element(By.ID, "submit-form").click()

        # Check whether the submission succeeded
        try:
            success_msg = WebDriverWait(driver, 10).until(
                EC.visibility_of_element_located((By.ID, "form-success"))
            )
            print(f"Form submitted: {success_msg.text}")
        except TimeoutException:
            print("Form submission failed")
        time.sleep(3)
    finally:
        driver.quit()


def bypass_token_method3():
    """Method 3: read or set the token directly with JavaScript."""
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    try:
        driver.get("http://localhost:8000/login")

        # 1. Run JavaScript to read or set the token
        # Idea: some sites keep the token in a JavaScript variable
        token = driver.execute_script("""
            // Read the token from a JavaScript variable if the page defines one
            if (window.csrfToken) {
                return window.csrfToken;
            }
            // Otherwise set the value of the token field directly
            var tokenInput = document.querySelector('input[name="token"]');
            if (tokenInput) {
                // Replace this with a valid token you obtained elsewhere
                tokenInput.value = 'override_token_value';
                return tokenInput.value;
            }
            return null;
        """)
        print(f"Token handled through JS: {token}")

        # 2. Fill in the login form
        # The mock server only requires the token field to be present, so the
        # overridden value is accepted; a site that validates tokens would reject it
        driver.find_element(By.ID, "username").send_keys("test")
        driver.find_element(By.ID, "password").send_keys("password123")
        driver.find_element(By.ID, "submit-btn").click()
        time.sleep(3)
    finally:
        driver.quit()


if __name__ == "__main__":
    print("=== Method 1: extract the token from the page ===")
    bypass_token_method1()
    print("\n=== Method 2: let the browser session carry the token ===")
    bypass_token_method2()
    print("\n=== Method 3: read or set the token with JavaScript ===")
    bypass_token_method3()
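The extraction step of Method 1 can also be illustrated without a browser: given the HTML of the login page, collect the hidden form fields. This is a minimal sketch using the standard library's html.parser; the class name HiddenInputParser, the helper extract_hidden_fields and the sample markup are assumptions made for illustration.

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.hidden = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        if attributes.get("type") == "hidden" and "name" in attributes:
            self.hidden[attributes["name"]] = attributes.get("value") or ""


def extract_hidden_fields(html: str) -> dict:
    parser = HiddenInputParser()
    parser.feed(html)
    return parser.hidden


if __name__ == "__main__":
    sample = """
    <form method="post">
      <input type="hidden" name="token" value="abc123XYZ">
      <input type="text" id="username" name="username" required>
    </form>
    """
    print(extract_hidden_fields(sample))
```

In a Selenium session the same helper could be fed driver.page_source, although find_element(By.NAME, "token") as used in Method 1 is usually the simpler route.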
Chapter 7: A Practical Case
The following combined example shows how to use Selenium for a complete crawl of, and interaction with, the website:
selenium_practical.py
"""
code: selenium_practical.py
"""
import json
import time
from dataclasses import asdict, dataclass
from typing import List

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager


# Data class that stores the product information
@dataclass
class Product:
    id: int
    name: str
    price: float
    category: str


def crawl_products(driver) -> List[Product]:
    """Crawl the product information."""
    products = []
    try:
        # Navigate to the product page
        driver.get("http://localhost:8000")
        # Wait until the product list has loaded
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.ID, "products"))
        )
        # Collect all product elements
        product_elements = driver.find_elements(By.CLASS_NAME, "product")
        for element in product_elements:
            try:
                # Extract the product information
                product_id = int(element.get_attribute("data-id"))
                name = element.find_element(By.CLASS_NAME, "product-name").text
                price_text = element.find_element(By.CLASS_NAME, "product-price").text
                price = float(price_text.replace("價格: ¥", ""))
                category = element.find_element(By.CLASS_NAME, "category").text

                # Create the product object
                product = Product(id=product_id, name=name, price=price, category=category)
                products.append(product)
                print(f"Crawled product: {name}")

                # Simulate a click on the "add to cart" button
                add_button = element.find_element(By.CLASS_NAME, "add-to-cart")
                add_button.click()
                time.sleep(0.5)
            except Exception as e:
                print(f"Error while crawling a single product: {e}")
        print(f"Crawled {len(products)} products in total")
        return products
    except Exception as e:
        print(f"Error while crawling the product list: {e}")
        return []


def login(driver, username: str, password: str) -> bool:
    """Log in to the website."""
    try:
        # Navigate to the login page
        driver.get("http://localhost:8000/login")
        # Wait for the page to load
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.ID, "submit-btn"))
        )
        # Read the token that will be submitted with the form
        token = driver.find_element(By.NAME, "token").get_attribute("value")
        print(f"Token used for login: {token}")

        # Fill in the login form
        driver.find_element(By.ID, "username").send_keys(username)
        driver.find_element(By.ID, "password").send_keys(password)
        # Submit the form
        driver.find_element(By.ID, "submit-btn").click()

        # Check whether the login succeeded
        try:
            WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.XPATH, "//*[contains(text(), '登錄成功')]"))
            )
            print("Login succeeded")
            return True
        except TimeoutException:
            print("Login failed")
            return False
    except Exception as e:
        print(f"Error during login: {e}")
        return False


def main():
    # Configure Chrome options
    chrome_options = webdriver.ChromeOptions()
    # Set a user agent to avoid being identified as a bot
    chrome_options.add_argument("--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36")
    # Disable the automation-control markers
    chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
    chrome_options.add_experimental_option("useAutomationExtension", False)

    # Initialize the driver
    driver = webdriver.Chrome(
        service=Service(ChromeDriverManager().install()),
        options=chrome_options,
    )
    # Hide the automation markers further
    driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
        "source": """
            Object.defineProperty(navigator, 'webdriver', {get: () => undefined})
        """
    })

    try:
        # 1. Crawl the product information
        products = crawl_products(driver)
        # Save the product information to a JSON file
        with open("products.json", "w", encoding="utf-8") as f:
            json.dump([asdict(p) for p in products], f, ensure_ascii=False, indent=2)
        print("Product information saved to products.json")

        # 2. Log in to the website
        login_success = login(driver, "test", "password123")
        if login_success:
            time.sleep(2)
            # 3. Visit another page
            driver.get("http://localhost:8000/dynamic")
            try:
                # Wait for the dynamic content to become visible
                dynamic_content = WebDriverWait(driver, 15).until(
                    EC.visibility_of_element_located((By.ID, "dynamic-content"))
                )
                print(f"Dynamic content: {dynamic_content.text}")
            except TimeoutException:
                print("The dynamic content did not load")
        time.sleep(3)
    finally:
        # Close the browser
        driver.quit()
        print("Crawl finished, browser closed")


if __name__ == "__main__":
    main()
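The price parsing in crawl_products depends on the exact prefix "價格: ¥". A slightly more tolerant variant, sketched below, pulls the first number out of the text with a regular expression; parse_price is a hypothetical helper, and stripping thousands separators is an assumption that does not matter for the mock site's prices.

```python
import re

# First integer or decimal number in the text
_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def parse_price(text: str) -> float:
    """Extract the first number from a price label such as '價格: ¥5999'."""
    match = _NUMBER.search(text.replace(",", ""))
    if match is None:
        raise ValueError(f"no price found in {text!r}")
    return float(match.group())


if __name__ == "__main__":
    for label in ["價格: ¥5999", "價格: ¥399", "¥1,299.50"]:
        print(label, "->", parse_price(label))
```

Inside crawl_products, price = parse_price(price_text) could replace the float(price_text.replace(...)) line.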
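Chapter 8: Best Practices and Anti-Anti-Scraping Strategies
8.1 Avoid being identified as a bot
- Set a sensible user agent: mimic the user agent of a real browser
- Add random delays: avoid an overly regular rhythm of actions
- Disable automation markers: hide the flags that reveal Selenium
- Use a real browser configuration: load a genuine browser profile
- Simulate human behavior: randomize click positions, add mouse movement, and so on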
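The random delays mentioned in the list can be sketched as a small helper. The name human_delay and the default bounds of 0.5 to 2.0 seconds are assumptions made for illustration; the sleep parameter exists only so the function can be exercised without actually waiting.

```python
import random
import time


def human_delay(min_seconds: float = 0.5, max_seconds: float = 2.0, sleep=time.sleep) -> float:
    """Sleep for a random duration in [min_seconds, max_seconds] and return it."""
    if min_seconds < 0 or max_seconds < min_seconds:
        raise ValueError("expected 0 <= min_seconds <= max_seconds")
    duration = random.uniform(min_seconds, max_seconds)
    sleep(duration)
    return duration


if __name__ == "__main__":
    # Between two browser actions one would call human_delay() instead of time.sleep(1)
    waited = human_delay(0.1, 0.3)
    print(f"waited {waited:.2f} s")
```

Calling it between clicks and page loads replaces the fixed time.sleep values used in the earlier scripts.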
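8.2 Code organization and maintenance
- Encapsulate common operations: wrap repeated steps in functions or classes
- Use the Page Object pattern: model each page's elements and actions as an object
- Exception handling: handle errors thoroughly to improve stability
- Logging: record key operations and error messages
- Separate configuration: keep configuration apart from code for easier maintenance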
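The Page Object pattern from the list can be sketched for the mock site's login page as follows. LoginPage is a hypothetical class, the locators are written as plain tuples ("id" is the string value of By.ID), and FakeDriver and FakeElement are stand-ins that only record calls so the example runs without a browser.

```python
class LoginPage:
    """Page object for the mock site's /login page."""

    URL = "http://localhost:8000/login"
    # Locators as (strategy, value) tuples, equivalent to (By.ID, "...")
    USERNAME = ("id", "username")
    PASSWORD = ("id", "password")
    SUBMIT = ("id", "submit-btn")

    def __init__(self, driver):
        self.driver = driver

    def open(self):
        self.driver.get(self.URL)
        return self

    def login(self, username, password):
        self.driver.find_element(*self.USERNAME).send_keys(username)
        self.driver.find_element(*self.PASSWORD).send_keys(password)
        self.driver.find_element(*self.SUBMIT).click()


class FakeElement:
    """Hypothetical element that logs interactions instead of performing them."""

    def __init__(self, log, locator):
        self.log, self.locator = log, locator

    def send_keys(self, text):
        self.log.append(("send_keys", self.locator, text))

    def click(self):
        self.log.append(("click", self.locator))


class FakeDriver:
    """Hypothetical driver exposing only the calls LoginPage needs."""

    def __init__(self):
        self.log = []

    def get(self, url):
        self.log.append(("get", url))

    def find_element(self, by, value):
        return FakeElement(self.log, (by, value))


if __name__ == "__main__":
    driver = FakeDriver()
    LoginPage(driver).open().login("test", "password123")
    for entry in driver.log:
        print(entry)
```

With a real webdriver.Chrome instance in place of FakeDriver, scripts call LoginPage(driver).open().login(...) and no longer repeat the locators; if the page's ids change, only the class needs updating.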
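8.3 Performance optimization
- Cut unnecessary waiting: set wait times sensibly
- Batch operations: keep the number of round trips to the browser as low as possible
- Headless mode: run headless when no visible window is needed
- Resource limits: restrict the loading of images, CSS and other non-essential resources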
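The headless and resource-limit items can be combined in one place that builds the Chrome settings. The sketch below only assembles plain Python values; build_chrome_settings is a hypothetical helper, and the preference key profile.managed_default_content_settings.images with value 2 is a commonly used Chrome setting for blocking images that should be verified against the Chrome version in use.

```python
def build_chrome_settings(headless: bool = True, block_images: bool = True,
                          width: int = 1920, height: int = 1080):
    """Return (arguments, prefs) to apply to a Chrome Options object."""
    arguments = [f"--window-size={width},{height}"]
    prefs = {}
    if headless:
        arguments.append("--headless")
    if block_images:
        # 2 means "block" for this content setting (assumption, see lead-in)
        prefs["profile.managed_default_content_settings.images"] = 2
    return arguments, prefs


if __name__ == "__main__":
    arguments, prefs = build_chrome_settings()
    print(arguments, prefs)
    # Applying them with Selenium would look like this:
    #   options = webdriver.ChromeOptions()
    #   for argument in arguments:
    #       options.add_argument(argument)
    #   if prefs:
    #       options.add_experimental_option("prefs", prefs)
```

Keeping the settings in one function also serves the "separate configuration" advice from section 8.2.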
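Summary
This tutorial covered Selenium from environment setup to more advanced techniques, including the 8 element locator strategies, wait mechanisms, and handling token restrictions.
The FastAPI mock website lets you practice safely and legally.
Selenium is a powerful tool that is used not only for web scraping but also, very widely, for automated testing. These skills carry over to most Web automation work.
As anti-scraping techniques keep evolving, crawler engineers need to keep learning and adapting. Always respect a site's robots.txt and the applicable laws and regulations when crawling.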