Below is a detailed example of PDF parsing in Python that combines several techniques, covering text, table, and scanned-document scenarios:
1. Environment Setup and Dependency Installation
```shell
# Core dependencies (pdf2image is needed for the scanned-document path)
pip install pdfplumber tabula-py pytesseract opencv-python mysql-connector-python pdf2image
```
2. Complete Parsing Workflow Example
```python
import hashlib
import json
import re

import cv2
import mysql.connector
import numpy as np
import pdfplumber
import pytesseract
from mysql.connector import Error
from pdf2image import convert_from_path


# ========== 1. General PDF parsing ==========
def parse_pdf(pdf_path):
    result = {"text": [], "tables": [], "images": []}
    # Text extraction (with coordinate information)
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # Basic text extraction
            text = page.extract_text(x_tolerance=1, y_tolerance=1)
            if text:
                result["text"].append({
                    "page": page.page_number,
                    "content": text_clean(text),
                    "bbox": page.bbox
                })
            # Table detection and extraction
            tables = page.find_tables()
            if tables:
                result["tables"].extend([
                    {"table_data": table.extract(), "position": table.bbox}
                    for table in tables
                ])
    # Scanned-document handling (no extractable text layer)
    if not result["text"]:
        result.update(process_scanned_pdf(pdf_path))
    return result


# ========== 2. OCR for scanned documents ==========
def process_scanned_pdf(pdf_path, dpi=300):
    ocr_results = []
    images = convert_from_path(pdf_path, dpi=dpi)  # pdf2image returns PIL images
    for img in images:
        # Convert PIL (RGB) to a grayscale array for OCR
        processed_img = cv2.cvtColor(np.array(img), cv2.COLOR_RGB2GRAY)
        text = pytesseract.image_to_string(processed_img, lang='chi_sim+eng')
        ocr_results.append(text_clean(text))
    return {"ocr_text": ocr_results}


# ========== 3. Data normalization ==========
def text_clean(raw_text):
    # Multi-stage cleaning pipeline
    text = re.sub(r'\s{2,}', ' ', raw_text)           # collapse runs of whitespace
    text = re.sub(r'[\x00-\x1F\x7F-\x9F]', '', text)  # strip control characters
    return text.strip()


# ========== 4. Structured field extraction ==========
def extract_structured_data(texts):
    patterns = {
        "date": r'\d{4}年\d{1,2}月\d{1,2}日',
        "amount": r'金額:(\d+\.\d{2})元',
        "parties": r'甲方:(.*?)\n乙方:(.*?)\n'
    }
    structured_data = {}
    for pattern_name, regex in patterns.items():
        matches = []
        for entry in texts:
            matches.extend(re.findall(regex, entry["content"]))
        structured_data[pattern_name] = matches
    return structured_data


# ========== 5. Database persistence ==========
def save_to_mysql(data, pdf_hash):
    conn = None
    cursor = None
    try:
        conn = mysql.connector.connect(
            host='localhost', database='pdf_archive',
            user='root', password='', charset='utf8mb4'
        )
        cursor = conn.cursor()
        insert_query = """
            INSERT INTO documents (file_hash, raw_text, structured_data)
            VALUES (%s, %s, %s)
        """
        cursor.execute(insert_query, (
            pdf_hash,
            "\n".join(entry["content"] for entry in data['text']),
            json.dumps(data['structured'], ensure_ascii=False)
        ))
        conn.commit()
    except Error as e:
        print(f"Database error: {e}")
    finally:
        if conn is not None and conn.is_connected():
            cursor.close()
            conn.close()


# ========== 6. Main entry point ==========
if __name__ == "__main__":
    pdf_file = "sample_contract.pdf"
    # Generate a file fingerprint for deduplication
    with open(pdf_file, "rb") as f:
        file_hash = hashlib.sha256(f.read()).hexdigest()
    # Run the parsing pipeline
    parsed_data = parse_pdf(pdf_file)
    structured = extract_structured_data(parsed_data['text'])
    parsed_data['structured'] = structured
    # Persist the results
    save_to_mysql(parsed_data, file_hash)
```
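The extraction patterns above are plain regular expressions, so they can be exercised in isolation before wiring up the full pipeline. A minimal check against a made-up contract snippet (the sample text is illustrative only):

```python
import re

# Hypothetical contract fragment for testing the patterns
sample = "本合同簽署于2024年3月15日,金額:1200.50元。"
date_pattern = r'\d{4}年\d{1,2}月\d{1,2}日'
amount_pattern = r'金額:(\d+\.\d{2})元'

print(re.findall(date_pattern, sample))    # ['2024年3月15日']
print(re.findall(amount_pattern, sample))  # ['1200.50']
```

Note that `findall` returns the whole match when the pattern has no group, but only the captured group when one is present, which is why the amount comes back without the surrounding `金額:…元`.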
3. Key Processing Strategies
- **Hybrid parsing mechanism**
  - Try direct text extraction first (editable PDFs)
  - Fall back to OCR automatically for scanned documents
  - Keep the original coordinate information for later verification
- **Enhanced table recognition**
  Detect table boundaries from the page's physical layout (`page.find_tables()`), optionally combined with `tabula` for precise extraction.
- **Multi-language OCR support**
  The `lang='chi_sim+eng'` parameter enables mixed Chinese/English recognition; the Tesseract Simplified Chinese trained data must be installed in advance.
- **Data validation mechanism**
  - The `file_hash` field filters out duplicate files
  - Both the raw text and the structured data are stored, so results can be traced back to their source
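The `file_hash` deduplication idea can be sketched without a database: compute the SHA-256 digest of each file's bytes and skip any digest already seen. A minimal sketch (the byte strings stand in for real file contents):

```python
import hashlib

def dedup_by_hash(payloads):
    """Return only the payloads whose SHA-256 digest has not been seen before."""
    seen, unique = set(), []
    for payload in payloads:
        digest = hashlib.sha256(payload).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(payload)
    return unique

files = [b"contract-a", b"contract-b", b"contract-a"]  # third one is a duplicate
print(len(dedup_by_hash(files)))  # 2
```

In the database-backed version, the same effect falls out of the `UNIQUE` constraint on `file_hash`.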
4. Database Table Schema
```sql
CREATE TABLE `documents` (
  `id` INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  `file_hash` CHAR(64) NOT NULL UNIQUE,
  `raw_text` LONGTEXT,
  `structured_data` JSON,
  `ocr_flag` TINYINT(1) DEFAULT 0,
  `create_time` TIMESTAMP DEFAULT CURRENT_TIMESTAMP
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
```
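The `structured_data` JSON column stores whatever `json.dumps` produced on the Python side, so a round-trip check is a cheap sanity test before touching the database (field names mirror the earlier extraction step):

```python
import json

structured = {"date": ["2024年3月15日"], "amount": ["1200.50"]}
# ensure_ascii=False keeps CJK characters human-readable in the stored JSON
serialized = json.dumps(structured, ensure_ascii=False)
restored = json.loads(serialized)
print(restored == structured)  # True
```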
5. Exception Handling Recommendations
- **Encoding compatibility**
  Add `charset='utf8mb4'` to the database connection parameters so that rare and supplementary-plane characters can be stored.
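A quick way to see why `utf8mb4` matters: characters outside the Basic Multilingual Plane encode to four bytes in UTF-8, which MySQL's legacy `utf8` charset (capped at three bytes per character) cannot store. A minimal check:

```python
# U+2000B is a CJK Extension B character outside the BMP,
# so UTF-8 needs four bytes for it -- beyond legacy 'utf8',
# hence the utf8mb4 requirement.
rare_char = '\U0002000B'
encoded = rare_char.encode('utf-8')
print(len(encoded))  # 4
```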
- **Chunked processing of large files**
  Use a generator to process PDFs of more than about 50 pages in page batches:

```python
def batch_process(pdf_path, batch_size=10):
    with pdfplumber.open(pdf_path) as pdf:
        total_pages = len(pdf.pages)
        for i in range(0, total_pages, batch_size):
            yield pdf.pages[i:i + batch_size]
```
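The slicing pattern above is independent of pdfplumber; a self-contained sketch of the same chunking logic over a plain list (`pages` is just a stand-in for `pdf.pages`):

```python
def batches(items, batch_size=10):
    """Yield consecutive slices of at most batch_size items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

pages = list(range(25))            # stand-in for pdf.pages
chunks = list(batches(pages, 10))
print([len(c) for c in chunks])    # [10, 10, 5]
```

Because it is a generator, only one batch is materialized at a time, which is the point of using it on oversized PDFs.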
- **Fault tolerance**
  Catch per-page exceptions inside the parsing loop so that one malformed page does not abort the whole run:

```python
from pdfminer.pdfparser import PDFSyntaxError  # raised by pdfplumber's backend

try:
    text = page.extract_text()
except PDFSyntaxError as e:
    logging.warning(f"Page {page.page_number} parse failed: {e}")
    continue
```
This example implements the full pipeline from basic text extraction to complex table recognition, including a scanned-document path. In practice, the regex patterns and table-detection parameters should be tuned to the structure of the actual documents.