Extracting the Gushiwen (古詩文網) CAPTCHA with Selenium and pytesseract (Python crawler)

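Main features implemented by the code:

Automated browser control
CAPTCHA image capture and processing
OCR-based CAPTCHA recognition
Automatic form filling and submission
Login status verification
Exception handling and resource cleanup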

1. Browser initialization and page loading

driver = webdriver.Chrome()
driver.get("https://www.gushiwen.cn/user/login.aspx?from=http://www.gushiwen.cn/user/collect.aspx")
time.sleep(2)

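Purpose: launch Chrome and open the Gushiwen login page.
Key points:
- webdriver.Chrome() initializes the browser driver
- time.sleep(2) pauses for a fixed two seconds so the page has time to load; it does not guarantee that loading has finished, so WebDriverWait is the better choice in practice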

2. CAPTCHA capture and preprocessing

code_img = driver.find_element(By.ID, 'imgCode')
img_bytes = code_img.screenshot_as_png
image = Image.open(io.BytesIO(img_bytes))
image = image.convert('L')  # convert to grayscale

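Purpose: grab the CAPTCHA image and make it easier to recognize.
Key points:
- screenshot_as_png returns the element's screenshot directly as binary PNG data, so nothing has to be written to disk
- convert('L') turns the color image into grayscale, which generally improves OCR accuracy
- The binarization line that is commented out in the full code below can be enabled for high-contrast CAPTCHAs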
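
To make the optional binarization step easier to switch on and tune, both preprocessing stages can be wrapped in one function. This is only a sketch: the function name preprocess_captcha and its parameters are assumptions, and the threshold of 128 is simply the value from the commented-out line, not a tuned setting.

```python
from PIL import Image


def preprocess_captcha(image, threshold=128, binarize=False):
    """Grayscale the image; optionally map every pixel to pure black or white."""
    gray = image.convert("L")
    if not binarize:
        return gray
    # Same lambda as the commented-out line in the full code:
    # pixels darker than `threshold` become 0, all others 255.
    return gray.point(lambda x: 0 if x < threshold else 255, "1")
```

Calling preprocess_captcha(image, binarize=True) on the screenshot and comparing the OCR output with and without binarization is a quick way to see whether the extra step helps for a given CAPTCHA style.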

3. OCR CAPTCHA recognition

custom_config = r'--psm 7 --oem 3 -c tessedit_char_whitelist=0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ'
textcode = pytesseract.image_to_string(image, config=custom_config)
textcode = textcode.strip().replace(' ', '')[:4]

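Purpose: recognize the CAPTCHA text with the Tesseract engine.
Key parameters:
- --psm 7: treat the image as a single line of text
- --oem 3: default OCR engine mode
- tessedit_char_whitelist: restricts the recognized character set (here digits and uppercase letters)
Post-processing: strip whitespace and keep the first 4 characters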
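
The strip/replace/slice chain only removes spaces, so stray characters inside the OCR result would still be sent to the form. A slightly stricter cleanup might look like the sketch below; clean_ocr_text and CAPTCHA_LENGTH are hypothetical names, and the four-character length is the assumption already made in this post.

```python
import re

CAPTCHA_LENGTH = 4  # the post assumes four-character codes


def clean_ocr_text(raw, length=CAPTCHA_LENGTH):
    """Return the first `length` whitelisted characters, or None if too few."""
    # Drop whitespace, newlines, and anything outside 0-9 / A-Z.
    kept = re.sub(r"[^0-9A-Z]", "", raw)
    return kept[:length] if len(kept) >= length else None
```

Returning None for short results makes the "recognition failed" case explicit, so the caller can skip the login attempt instead of submitting a partial code.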

4. Login form operations

driver.find_element(By.ID, 'email').send_keys("your_email@example.com")  # placeholder account
driver.find_element(By.ID, 'pwd').send_keys("your_password")  # placeholder password
driver.find_element(By.ID, 'code').send_keys(textcode)
driver.find_element(By.ID, 'denglu').click()

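Purpose: fill in and submit the login form automatically.
Element location:
- Each input field is located by its HTML element ID
- denglu is the ID of the login button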
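
Account details are better kept out of the script itself. One possible approach, sketched here, reads them from environment variables; the variable names GUSHIWEN_EMAIL and GUSHIWEN_PASSWORD and the helper load_credentials are made up for this example.

```python
import os


def load_credentials(env=os.environ):
    """Read the login e-mail and password from environment variables.

    The variable names are hypothetical; pick whatever fits your setup.
    """
    try:
        return env["GUSHIWEN_EMAIL"], env["GUSHIWEN_PASSWORD"]
    except KeyError as missing:
        raise RuntimeError(
            f"Set the {missing.args[0]} environment variable first"
        ) from None
```

The two values can then be passed to send_keys in place of literal strings, for example email, password = load_credentials().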

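5. Login result verification

if "退出登錄" in driver.page_source:
    print("Login succeeded!")
    html = driver.page_source
else:
    print("Login failed; please check the account or the CAPTCHA!")

Verification logic: check whether the text "退出登錄" (log out) appears in the page
On success: fetch the page source after login
On failure: print an error message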

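6. Exception handling and resource release

except Exception as e:
    print("Program error:", str(e))
finally:
    driver.quit()

Exception capture: print any runtime error
Resource cleanup: make sure the browser is always closed at the end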
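
In the full code, driver.get(...) runs before the try block, so an error while loading the page would leave the browser open. One way to cover the whole session is a small context manager. This is a sketch under that assumption; managed_browser is a hypothetical helper, not something the original code defines.

```python
from contextlib import contextmanager


@contextmanager
def managed_browser(factory):
    """Create a driver via `factory` and always quit it on exit."""
    driver = factory()
    try:
        yield driver
    finally:
        # Runs on normal exit and on any exception raised inside the with-block.
        driver.quit()
```

Typical use would be with managed_browser(webdriver.Chrome) as driver: followed by the page load, OCR, and login steps inside the block.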

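Typical execution flow

Open the browser → navigate to the login page
Locate the CAPTCHA → preprocess the image → run OCR
Fill in account/password/CAPTCHA automatically → click login
Check the login result → print the page source or an error message
Close the browser whether or not the login succeeded

Full code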

from selenium import webdriver
from selenium.webdriver.common.by import By
from PIL import Image
import pytesseract
import io
import time

# Initialize the browser
driver = webdriver.Chrome()
driver.get("https://www.gushiwen.cn/user/login.aspx?from=http://www.gushiwen.cn/user/collect.aspx")

# Wait for the page to load
time.sleep(2)

try:
    # Locate the CAPTCHA element
    code_img = driver.find_element(By.ID, 'imgCode')

    # Keep the CAPTCHA screenshot in memory
    img_bytes = code_img.screenshot_as_png
    image = Image.open(io.BytesIO(img_bytes))

    # Image preprocessing (improves the recognition rate)
    image = image.convert('L')  # convert to grayscale
    # image = image.point(lambda x: 0 if x < 128 else 255, '1')  # binarize (enable if needed)

    # Recognize the CAPTCHA
    custom_config = r'--psm 7 --oem 3 -c tessedit_char_whitelist=0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ'
    textcode = pytesseract.image_to_string(image, config=custom_config)
    textcode = textcode.strip().replace(' ', '')[:4]  # clean the result and keep the first 4 characters
    print("Recognized CAPTCHA:", textcode)

    if len(textcode) == 4:
        # Fill in the login form (replace the placeholders with your own account)
        driver.find_element(By.ID, 'email').send_keys("your_email@example.com")
        driver.find_element(By.ID, 'pwd').send_keys("your_password")
        driver.find_element(By.ID, 'code').send_keys(textcode)
        driver.find_element(By.ID, 'denglu').click()

        # Wait for the login to complete
        time.sleep(3)

        # Check whether the login succeeded
        if "退出登錄" in driver.page_source:
            print("Login succeeded!")
            # Fetch the page content after login
            html = driver.page_source
            print(html)
        else:
            print("Login failed; please check the account or the CAPTCHA!")
    else:
        print("CAPTCHA recognition failed or the length is wrong")

except Exception as e:
    print("Program error:", str(e))

finally:
    # Close the browser
    driver.quit()

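However, the CAPTCHA text recognized by this code is often inaccurate, so it is best to run the image through the Chaojiying (超級鷹) recognition service as well.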
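
Independently of which recognizer is used, a simple mitigation is to retry with a fresh CAPTCHA when the result looks wrong. The sketch below is not the author's method: recognize_with_retry is a hypothetical helper, and both callables are supplied by the caller, so it works with pytesseract or any other recognition backend.

```python
def recognize_with_retry(fetch_image, recognize, max_attempts=3, length=4):
    """Run OCR on a fresh CAPTCHA up to `max_attempts` times.

    fetch_image: callable returning a new CAPTCHA image each time
                 (hypothetical; e.g. reload the login page and
                 screenshot imgCode again).
    recognize:   callable mapping an image to the recognized text.
    Returns the first result with exactly `length` alphanumeric
    characters, or None if every attempt fails.
    """
    for _ in range(max_attempts):
        text = recognize(fetch_image()).strip().replace(" ", "")
        if len(text) == length and text.isalnum():
            return text
    return None
```

This only filters out results that are obviously malformed; a four-character result can still be wrong, in which case the page reports a CAPTCHA error.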

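Run results:

The script fills in the account and password by itself,

and then closes the browser.

Successful recognition (this CAPTCHA format is simple, so it is fairly easy to recognize):

Failed recognition:

The page shows a CAPTCHA error!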
