Table of Contents
Preface
Adding the PdfNewExtractor class
Updating the ExtractProcessor class
Final result
Preface
Dify 1.1.3's knowledge-base PDF parsing uses pypdfium2 to extract text, which has the following main problems:
1. Limited text extraction, with weak support for tables and images
2. No optimizations specific to Chinese text
3. No document-structure analysis
4. No document-quality assessment
Suggested optimizations:
1. Replace pypdfium2 with pdfplumber (see the comparison sketch after this list)
2. Add OCR support
3. Improve the Chinese-text handling logic
4. Add document-structure analysis
5. Implement smarter table recognition
6. Add a caching mechanism
7. Improve handling of large files
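To make item 1 concrete, here is a minimal standalone comparison sketch (not dify code; "sample.pdf" is a placeholder path) showing what pdfplumber adds over pypdfium2, namely layout-aware text plus per-page table extraction:

# Minimal comparison sketch; assumes a local "sample.pdf".
import pdfplumber
import pypdfium2 as pdfium

# pypdfium2 (what dify 1.1.3 uses): fast, but plain text only.
pdf = pdfium.PdfDocument("sample.pdf")
pypdfium_text = "\n\n".join(page.get_textpage().get_text_range() for page in pdf)

# pdfplumber: slower, but layout-aware text and table extraction per page.
with pdfplumber.open("sample.pdf") as plumber_pdf:
    plumber_text = "\n\n".join((page.extract_text(layout=True) or "") for page in plumber_pdf.pages)
    tables = [table for page in plumber_pdf.pages for table in page.extract_tables()]

print(f"pypdfium2 chars: {len(pypdfium_text)}")
print(f"pdfplumber chars: {len(plumber_text)}, tables found: {len(tables)}")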
Install the pdfplumber and pytesseract packages (note that pytesseract is only a wrapper: the Tesseract OCR engine itself, with Chinese language data such as chi_sim and chi_tra, must also be installed on the system):
pip install pdfplumber
pip install pytesseract
Adding the PdfNewExtractor class
Add a new PdfNewExtractor class to replace the old PdfExtractor:
from collections.abc import Iterator
from typing import Optional, cast
import pdfplumber
import pytesseract
from PIL import Image
import io

from core.rag.extractor.blob.blob import Blob
from core.rag.extractor.extractor_base import BaseExtractor
from core.rag.models.document import Document
from extensions.ext_storage import storage


class PdfNewExtractor(BaseExtractor):
    """Enhanced PDF loader with improved text extraction, OCR support, and structure analysis.

    Args:
        file_path: Path to the PDF file to load.
        file_cache_key: Optional cache key for storing extracted text.
        enable_ocr: Whether to enable OCR for text extraction from images.
    """

    def __init__(self, file_path: str, file_cache_key: Optional[str] = None, enable_ocr: bool = False):
        """Initialize with file path and optional settings."""
        self._file_path = file_path
        self._file_cache_key = file_cache_key
        self._enable_ocr = enable_ocr

    def extract(self) -> list[Document]:
        """Extract text from PDF with caching support."""
        plaintext_file_exists = False
        if self._file_cache_key:
            try:
                text = cast(bytes, storage.load(self._file_cache_key)).decode("utf-8")
                plaintext_file_exists = True
                return [Document(page_content=text)]
            except FileNotFoundError:
                pass

        documents = list(self.load())
        text_list = []
        for document in documents:
            text_list.append(document.page_content)
        text = "\n\n".join(text_list)

        # Save plaintext file for caching
        if not plaintext_file_exists and self._file_cache_key:
            storage.save(self._file_cache_key, text.encode("utf-8"))

        return documents

    def load(self) -> Iterator[Document]:
        """Lazy load PDF pages with enhanced text extraction."""
        blob = Blob.from_path(self._file_path)
        yield from self.parse(blob)

    def parse(self, blob: Blob) -> Iterator[Document]:
        """Parse PDF with enhanced features including OCR and structure analysis."""
        with blob.as_bytes_io() as file_obj:
            with pdfplumber.open(file_obj) as pdf:
                for page_number, page in enumerate(pdf.pages):
                    # Extract text with layout preservation; guard against pages with no extractable text
                    content = page.extract_text(layout=True) or ""

                    # Try to detect and fix encoding issues
                    try:
                        # First try to round-trip as UTF-8
                        content = content.encode('utf-8').decode('utf-8')
                    except UnicodeError:
                        try:
                            # If UTF-8 fails, try GB18030 (common Chinese encoding)
                            content = content.encode('utf-8').decode('gb18030', errors='ignore')
                        except UnicodeError:
                            # If all else fails, use a more lenient approach
                            content = content.encode('utf-8', errors='ignore').decode('utf-8', errors='ignore')

                    # Extract tables if present
                    tables = page.extract_tables()
                    if tables:
                        table_text = "\n\nTables:\n"
                        for table in tables:
                            # Convert table to tab-separated text
                            table_text += "\n" + "\n".join(
                                ["\t".join([str(cell) if cell else "" for cell in row]) for row in table]
                            )
                        content += table_text

                    # Perform OCR if enabled and text content is limited or contains potential encoding issues
                    if self._enable_ocr and (
                        len(content.strip()) < 100
                        or any('\ufffd' in line for line in content.splitlines())
                    ):
                        image = page.to_image()
                        img_bytes = io.BytesIO()
                        image.original.save(img_bytes, format='PNG')
                        img_bytes.seek(0)
                        pil_image = Image.open(img_bytes)
                        # Use multiple language models to improve OCR accuracy
                        ocr_text = pytesseract.image_to_string(
                            pil_image,
                            lang='chi_sim+chi_tra+eng',  # Support both simplified and traditional Chinese
                            config='--psm 3 --oem 3'  # Use a more accurate OCR mode
                        )
                        if ocr_text.strip():
                            # Clean and normalize OCR text
                            ocr_text = ocr_text.replace('\x0c', '').strip()
                            content = f"{content}\n\nOCR Text:\n{ocr_text}"

                    metadata = {
                        "source": blob.source,
                        "page": page_number,
                        "has_tables": bool(tables)
                    }

                    yield Document(page_content=content, metadata=metadata)
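A minimal usage sketch follows. It assumes the code runs inside the dify api backend (so core.rag.* and extensions.ext_storage are importable); the file path is a placeholder:

# Usage sketch, assuming the dify backend environment; the path is a placeholder.
extractor = PdfNewExtractor("/path/to/sample.pdf", file_cache_key=None, enable_ocr=True)
for doc in extractor.extract():
    meta = doc.metadata or {}
    print(f"page={meta.get('page')} has_tables={meta.get('has_tables')} chars={len(doc.page_content)}")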
Updating the ExtractProcessor class
In ExtractProcessor, replace both occurrences of extractor = PdfExtractor(file_path) with extractor = PdfNewExtractor(file_path).
They are at lines 144 and 148 of the code.
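You will also need to import the new class in that file. As a paraphrased sketch (not the actual ExtractProcessor code; the import path assumes PdfNewExtractor was saved as core/rag/extractor/pdf_new_extractor.py), each of the two edited call sites boils down to:

# Sketch of the edited call sites; _build_pdf_extractor is a hypothetical helper.
from core.rag.extractor.pdf_new_extractor import PdfNewExtractor  # assumed module path

def _build_pdf_extractor(file_path: str) -> PdfNewExtractor:
    # previously: extractor = PdfExtractor(file_path)
    return PdfNewExtractor(file_path)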
Final result
After testing, the optimization works well: the extracted text is noticeably better than with the original pypdfium2-based implementation.