Table of Contents
Preface
Adding the PdfNewExtractor class
Updating the ExtractProcessor class
Final result
Preface
Dify 1.1.3's knowledge-base PDF parsing uses pypdfium2 to extract text, which has the following main problems:
1. Limited text extraction, with weak support for tables and images
2. No optimizations specific to Chinese text
3. No document-structure analysis
4. No document-quality assessment
Suggested optimizations:
1. Replace pypdfium2 with pdfplumber (see the comparison sketch after this list)
2. Add OCR support
3. Improve the Chinese-text handling logic
4. Add document-structure analysis
5. Implement smarter table recognition
6. Add a caching mechanism
7. Improve handling of large files
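To make item 1 concrete, here is a minimal standalone comparison sketch (not dify code; "sample.pdf" is a placeholder path) showing what pdfplumber adds over pypdfium2, namely layout-aware text plus per-page table extraction:

# Minimal comparison sketch; assumes a local "sample.pdf".
import pdfplumber
import pypdfium2 as pdfium

# pypdfium2 (what dify 1.1.3 uses): fast, but plain text only.
pdf = pdfium.PdfDocument("sample.pdf")
pypdfium_text = "\n\n".join(page.get_textpage().get_text_range() for page in pdf)

# pdfplumber: slower, but layout-aware text and table extraction per page.
with pdfplumber.open("sample.pdf") as plumber_pdf:
    plumber_text = "\n\n".join((page.extract_text(layout=True) or "") for page in plumber_pdf.pages)
    tables = [table for page in plumber_pdf.pages for table in page.extract_tables()]

print(f"pypdfium2 chars: {len(pypdfium_text)}")
print(f"pdfplumber chars: {len(plumber_text)}, tables found: {len(tables)}")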
Install the pdfplumber and pytesseract packages (note that pytesseract is only a wrapper: the Tesseract OCR engine itself, with Chinese language data such as chi_sim and chi_tra, must also be installed on the system):
pip install pdfplumber
pip install pytesseract
Adding the PdfNewExtractor class
Add a new PdfNewExtractor class to replace the old PdfExtractor:
from collections.abc import Iterator
from typing import Optional, cast
import pdfplumber
import pytesseract
from PIL import Image
import io

from core.rag.extractor.blob.blob import Blob
from core.rag.extractor.extractor_base import BaseExtractor
from core.rag.models.document import Document
from extensions.ext_storage import storage


class PdfNewExtractor(BaseExtractor):
    """Enhanced PDF loader with improved text extraction, OCR support, and structure analysis.

    Args:
        file_path: Path to the PDF file to load.
        file_cache_key: Optional cache key for storing extracted text.
        enable_ocr: Whether to enable OCR for text extraction from images.
    """

    def __init__(self, file_path: str, file_cache_key: Optional[str] = None, enable_ocr: bool = False):
        """Initialize with file path and optional settings."""
        self._file_path = file_path
        self._file_cache_key = file_cache_key
        self._enable_ocr = enable_ocr

    def extract(self) -> list[Document]:
        """Extract text from PDF with caching support."""
        plaintext_file_exists = False
        if self._file_cache_key:
            try:
                text = cast(bytes, storage.load(self._file_cache_key)).decode("utf-8")
                plaintext_file_exists = True
                return [Document(page_content=text)]
            except FileNotFoundError:
                pass

        documents = list(self.load())
        text_list = []
        for document in documents:
            text_list.append(document.page_content)
        text = "\n\n".join(text_list)

        # Save plaintext file for caching
        if not plaintext_file_exists and self._file_cache_key:
            storage.save(self._file_cache_key, text.encode("utf-8"))

        return documents

    def load(self) -> Iterator[Document]:
        """Lazy load PDF pages with enhanced text extraction."""
        blob = Blob.from_path(self._file_path)
        yield from self.parse(blob)

    def parse(self, blob: Blob) -> Iterator[Document]:
        """Parse PDF with enhanced features including OCR and structure analysis."""
        with blob.as_bytes_io() as file_obj:
            with pdfplumber.open(file_obj) as pdf:
                for page_number, page in enumerate(pdf.pages):
                    # Extract text with layout preservation; guard against pages with no extractable text
                    content = page.extract_text(layout=True) or ""

                    # Try to detect and fix encoding issues
                    try:
                        # First try to round-trip as UTF-8
                        content = content.encode('utf-8').decode('utf-8')
                    except UnicodeError:
                        try:
                            # If UTF-8 fails, try GB18030 (common Chinese encoding)
                            content = content.encode('utf-8').decode('gb18030', errors='ignore')
                        except UnicodeError:
                            # If all else fails, use a more lenient approach
                            content = content.encode('utf-8', errors='ignore').decode('utf-8', errors='ignore')

                    # Extract tables if present
                    tables = page.extract_tables()
                    if tables:
                        table_text = "\n\nTables:\n"
                        for table in tables:
                            # Convert table to tab-separated text
                            table_text += "\n" + "\n".join(
                                ["\t".join([str(cell) if cell else "" for cell in row]) for row in table]
                            )
                        content += table_text

                    # Perform OCR if enabled and text content is limited or contains potential encoding issues
                    if self._enable_ocr and (
                        len(content.strip()) < 100
                        or any('\ufffd' in line for line in content.splitlines())
                    ):
                        image = page.to_image()
                        img_bytes = io.BytesIO()
                        image.original.save(img_bytes, format='PNG')
                        img_bytes.seek(0)
                        pil_image = Image.open(img_bytes)
                        # Use multiple language models to improve OCR accuracy
                        ocr_text = pytesseract.image_to_string(
                            pil_image,
                            lang='chi_sim+chi_tra+eng',  # Support both simplified and traditional Chinese
                            config='--psm 3 --oem 3'  # Use a more accurate OCR mode
                        )
                        if ocr_text.strip():
                            # Clean and normalize OCR text
                            ocr_text = ocr_text.replace('\x0c', '').strip()
                            content = f"{content}\n\nOCR Text:\n{ocr_text}"

                    metadata = {
                        "source": blob.source,
                        "page": page_number,
                        "has_tables": bool(tables)
                    }

                    yield Document(page_content=content, metadata=metadata)
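A minimal usage sketch follows. It assumes the code runs inside the dify api backend (so core.rag.* and extensions.ext_storage are importable); the file path is a placeholder:

# Usage sketch, assuming the dify backend environment; the path is a placeholder.
extractor = PdfNewExtractor("/path/to/sample.pdf", file_cache_key=None, enable_ocr=True)
for doc in extractor.extract():
    meta = doc.metadata or {}
    print(f"page={meta.get('page')} has_tables={meta.get('has_tables')} chars={len(doc.page_content)}")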
Updating the ExtractProcessor class
In ExtractProcessor, replace both occurrences of extractor = PdfExtractor(file_path) with extractor = PdfNewExtractor(file_path).
They are at lines 144 and 148 of the code.
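You will also need to import the new class in that file. As a paraphrased sketch (not the actual ExtractProcessor code; the import path assumes PdfNewExtractor was saved as core/rag/extractor/pdf_new_extractor.py), each of the two edited call sites boils down to:

# Sketch of the edited call sites; _build_pdf_extractor is a hypothetical helper.
from core.rag.extractor.pdf_new_extractor import PdfNewExtractor  # assumed module path

def _build_pdf_extractor(file_path: str) -> PdfNewExtractor:
    # previously: extractor = PdfExtractor(file_path)
    return PdfNewExtractor(file_path)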
Final result
After testing, the optimization works well: the extracted text is noticeably better than with the original pypdfium2-based implementation.