Elasticsearch: Parsing PDF text and table data with Azure AI Document Intelligence

Author: James Williams, Elastic

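Learn how to parse PDF documents that contain text and table data with Azure AI Document Intelligence.

Azure AI Document Intelligence is a powerful tool for extracting structured data from PDFs. It handles both text and table data effectively. The extracted data can be indexed into Elastic Cloud Serverless to support RAG (Retrieval Augmented Generation).

In this blog, we demonstrate Azure AI Document Intelligence by ingesting four recent Elastic N.V. quarterly reports. The PDFs range from 43 to 196 pages, and each contains both text and table data. We will test retrieval of the table data with the following prompt: Compare/contrast subscription revenue for Q2-2025, Q1-2025, Q4-2024 and Q3-2024?

This prompt is tricky because it needs context from four different PDFs, and the relevant information in those PDFs is presented in tables.

Let's walk through an end-to-end reference example made up of two main parts: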

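Python notebook

Downloads four quarters of Elastic N.V. 10-Q filings as PDFs
Parses the text and table data in each PDF with Azure AI Document Intelligence
Outputs the text and table data to JSON files
Ingests the JSON files into Elastic Cloud Serverless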

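Elastic Cloud Serverless

Creates vector embeddings for the PDF text + table data
Serves vector search database queries for RAG
Provides a pre-configured OpenAI connector for LLM integration
Offers an A/B testing interface for chatting with the 10-Q filings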

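Prerequisites

The code blocks in this notebook require API keys for Azure AI Document Intelligence and Elasticsearch. The best starting point for Azure AI Document Intelligence is to create a Document Intelligence resource. For Elastic Cloud Serverless, refer to the getting started guide. You need Python 3.9+ to run these code blocks.

Create a .env file

Put the keys for Azure AI Document Intelligence and Elastic Cloud Serverless in a .env file.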

AZURE_AI_DOCUMENT_INTELLIGENCE_ENDPOINT=YOUR_AZURE_RESOURCE_ENDPOINT
AZURE_AI_DOCUMENT_INTELLIGENCE_API_KEY=YOUR_AZURE_RESOURCE_API_KEY
ES_URL=YOUR_ES_URL
ES_API_KEY=YOUR_ES_API_KEY

Install Python packages

!pip install elasticsearch python-dotenv tqdm azure-core azure-ai-documentintelligence requests httpx

Create input and output folders

import os

input_folder_pdf = "./pdf"
output_folder_pdf = "./json"

folders = [input_folder_pdf, output_folder_pdf]


def create_folders_if_not_exist(folders):
    # exist_ok=True avoids an error when the folder is already there.
    for folder in folders:
        os.makedirs(folder, exist_ok=True)
        print(f"Folder '{folder}' created or already exists.")


create_folders_if_not_exist(folders)

Download PDF files

Download four recent Elastic 10-Q quarterly reports. If you already have the PDF files, you can place them in the './pdf' folder instead.

import os
import requests


def download_pdf(url, directory='./pdf', filename=None):
    # Make sure the target directory exists before writing to it.
    os.makedirs(directory, exist_ok=True)

    response = requests.get(url, timeout=60)
    if response.status_code == 200:
        # Fall back to the last URL segment when no filename is given.
        if filename is None:
            filename = url.split('/')[-1]
        filepath = os.path.join(directory, filename)
        with open(filepath, 'wb') as file:
            file.write(response.content)
        print(f"Downloaded {filepath}")
    else:
        print(f"Failed to download file from {url}")


print("Downloading 4 recent 10-Q reports for Elastic NV.")
base_url = 'https://s201.q4cdn.com/217177842/files/doc_financials'
download_pdf(f'{base_url}/2025/q2/e5aa7a0a-6f56-468d-a5bd-661792773d71.pdf', filename='elastic-10Q-Q2-2025.pdf')
download_pdf(f'{base_url}/2025/q1/18656e06-8107-4423-8e2b-6f2945438053.pdf', filename='elastic-10Q-Q1-2025.pdf')
download_pdf(f'{base_url}/2024/q4/9949f03b-09fb-4941-b105-62a304dc1411.pdf', filename='elastic-10Q-Q4-2024.pdf')
download_pdf(f'{base_url}/2024/q3/7e60e3bd-ff50-4ae8-ab12-5b3ae19420e6.pdf', filename='elastic-10Q-Q3-2024.pdf')
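Whether the files were downloaded or copied in by hand, it can help to confirm that everything in the folder really is a PDF before sending it for analysis, since a failed download sometimes saves an HTML error page under a .pdf name. The following is a minimal sketch that relies only on the fact that a well-formed PDF begins with the bytes `%PDF-`. The helper names `looks_like_pdf` and `find_invalid_pdfs` are hypothetical and not part of the original notebook.

```python
from pathlib import Path

PDF_MAGIC = b"%PDF-"  # a well-formed PDF file begins with this header


def looks_like_pdf(path):
    """Return True if the file at `path` starts with the PDF header bytes."""
    with open(path, "rb") as f:
        return f.read(len(PDF_MAGIC)) == PDF_MAGIC


def find_invalid_pdfs(directory="./pdf"):
    """List .pdf files in `directory` whose contents do not look like a PDF."""
    folder = Path(directory)
    if not folder.is_dir():
        return []
    return sorted(p.name for p in folder.glob("*.pdf") if not looks_like_pdf(p))


if __name__ == "__main__":
    bad = find_invalid_pdfs()
    print("Suspicious files:", bad if bad else "none")
```

An empty result does not guarantee that a file will parse, only that it passes this basic header check.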

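Parse PDFs using Azure AI Document Intelligence

There is a lot going on in the code blocks that parse the PDF files. Here is a quick summary:

1. Set up the Azure AI Document Intelligence imports and environment variables
2. Parse PDF paragraphs using AnalyzeResult
3. Parse PDF tables using AnalyzeResult
4. Combine the PDF paragraph and table data
5. Bring it all together by running steps 1-4 on each PDF file and storing the results as JSON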

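Set up the Azure AI Document Intelligence imports and environment variables

The most important import is AnalyzeResult. This class represents the result of a document analysis and contains details about the document. The details we care about are pages, paragraphs, and tables.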

import os
import json

from azure.core.credentials import AzureKeyCredential
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import AnalyzeResult
from azure.ai.documentintelligence.models import AnalyzeDocumentRequest
from dotenv import load_dotenv
from tqdm import tqdm

# Read the endpoint and key from the .env file.
load_dotenv()

AZURE_AI_DOCUMENT_INTELLIGENCE_ENDPOINT = os.getenv('AZURE_AI_DOCUMENT_INTELLIGENCE_ENDPOINT')
AZURE_AI_DOCUMENT_INTELLIGENCE_API_KEY = os.getenv('AZURE_AI_DOCUMENT_INTELLIGENCE_API_KEY')
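Because `os.getenv` returns `None` for a missing variable, a typo in the .env file tends to surface later as a less obvious authentication or connection error. A small check along these lines can fail early instead. It is only a sketch: the helper name `missing_env_vars` is hypothetical, and the variable names are the four used in this post.

```python
import os


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# The four settings this walkthrough reads from the .env file.
REQUIRED = [
    "AZURE_AI_DOCUMENT_INTELLIGENCE_ENDPOINT",
    "AZURE_AI_DOCUMENT_INTELLIGENCE_API_KEY",
    "ES_URL",
    "ES_API_KEY",
]

if __name__ == "__main__":
    missing = missing_env_vars(REQUIRED)
    if missing:
        print("Missing settings:", ", ".join(missing))
    else:
        print("All required settings are present.")
```

In the notebook this would run after `load_dotenv()`, so that values from the .env file are already in the environment.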

Parse PDF paragraphs using AnalyzeResult

Extract the paragraph text from each page. Do not extract table data.

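def parse_paragraphs(analyze_result):
    # Collect the offsets of all table cells up front. If this list were left
    # empty, the check below would never exclude any table text.
    table_offsets = [
        span.offset
        for table in (analyze_result.tables or [])
        for cell in table.cells
        for span in cell.spans
    ]
    known_table_offsets = set(table_offsets)
    page_content = {}

    for paragraph in (analyze_result.paragraphs or []):
        # Skip paragraphs whose text starts at a table cell offset.
        if any(span.offset in known_table_offsets for span in paragraph.spans):
            continue
        for region in paragraph.bounding_regions:
            page_number = region.page_number
            if page_number not in page_content:
                page_content[page_number] = []
            page_content[page_number].append({"content_text": paragraph.content})

    return page_content, table_offsets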

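Parse PDF tables using AnalyzeResult

Extract the table content from each page. Do not extract paragraph text. The most interesting side effect of this technique is that the table data needs no transformation. LLMs know how to read text that looks like "Cell [0, 1]: table data…".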

def parse_tables(analyze_result, table_offsets):
    page_content = {}

    for table in (analyze_result.tables or []):
        table_data = []

        # Walk the cells once so that a table with several bounding regions
        # does not have its cells repeated.
        for cell in table.cells:
            for span in cell.spans:
                table_offsets.append(span.offset)
            table_data.append(f"Cell [{cell.row_index}, {cell.column_index}]: {cell.content}")

        # A table that spans pages is filed under the page where it starts.
        page_number = table.bounding_regions[0].page_number
        if page_number not in page_content:
            page_content[page_number] = []

        page_content[page_number].append({"content_text": "\n".join(table_data)})

    return page_content
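To make the "Cell [row, column]: content" format concrete, here is a small self-contained sketch that renders a two-by-two table the same way. `FakeCell` is a hypothetical stand-in carrying only the three attributes that the formatting uses, and the values are made up for illustration rather than taken from any filing.

```python
from dataclasses import dataclass


@dataclass
class FakeCell:
    """Hypothetical stand-in for a table cell; only the fields used below."""
    row_index: int
    column_index: int
    content: str


def flatten_cells(cells):
    """Render cells as 'Cell [row, col]: content' lines, one per cell."""
    return "\n".join(
        f"Cell [{c.row_index}, {c.column_index}]: {c.content}" for c in cells
    )


# Made-up values, purely for illustration.
cells = [
    FakeCell(0, 0, "Item"),
    FakeCell(0, 1, "Amount"),
    FakeCell(1, 0, "Widgets"),
    FakeCell(1, 1, "42"),
]
print(flatten_cells(cells))
```

Each line keeps the row and column coordinates next to the value, which is what lets a model line up a header in row 0 with the figures beneath it.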

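Combine PDF paragraph and table data

Pre-chunk at the page level to preserve context, so that RAG retrieval is easy to verify by hand. As you will see later, this pre-chunking does not hurt the RAG output.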

def combine_paragraphs_tables(filepath, paragraph_content, table_content):
    page_content_concatenated = {}
    structured_data = []

    # Combine paragraph and table content, visiting pages in page order
    for p_number in sorted(set(paragraph_content.keys()).union(table_content.keys())):
        concatenated_text = ""

        if p_number in paragraph_content:
            for content in paragraph_content[p_number]:
                concatenated_text += content["content_text"] + "\n"

        if p_number in table_content:
            for content in table_content[p_number]:
                concatenated_text += content["content_text"] + "\n"

        page_content_concatenated[p_number] = concatenated_text.strip()

    # Append a single item per page to the structured_data list
    for p_number, concatenated_text in page_content_concatenated.items():
        structured_data.append({
            "page_number": p_number,
            "content_text": concatenated_text,
            "pdf_file": os.path.basename(filepath)
        })

    return structured_data

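Bring it all together

Open each PDF in the ./pdf folder, parse the text and table data, and save the result as a JSON file whose entries have page_number, content_text, and pdf_file fields. The content_text field holds the paragraph and table data for each page.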

pdf_files = [
    os.path.join(input_folder_pdf, file)
    for file in os.listdir(input_folder_pdf)
    if file.endswith(".pdf")
]

document_intelligence_client = DocumentIntelligenceClient(
    endpoint=AZURE_AI_DOCUMENT_INTELLIGENCE_ENDPOINT,
    credential=AzureKeyCredential(AZURE_AI_DOCUMENT_INTELLIGENCE_API_KEY),
    connection_timeout=600
)

for filepath in tqdm(pdf_files, desc="Parsing PDF files"):
    with open(filepath, "rb") as file:
        poller = document_intelligence_client.begin_analyze_document(
            "prebuilt-layout",
            AnalyzeDocumentRequest(bytes_source=file.read())
        )

        analyze_result: AnalyzeResult = poller.result()

        paragraph_content, table_offsets = parse_paragraphs(analyze_result)
        table_content = parse_tables(analyze_result, table_offsets)
        structured_data = combine_paragraphs_tables(filepath, paragraph_content, table_content)

        # Convert the structured data to JSON format
        json_output = json.dumps(structured_data, indent=4)

        # Get the filename without the ".pdf" extension
        filename_without_ext = os.path.splitext(os.path.basename(filepath))[0]

        # Write the JSON output to a file
        output_json_file = f"{output_folder_pdf}/{filename_without_ext}.json"
        with open(output_json_file, "w") as json_file:
            json_file.write(json_output)
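Before indexing, a quick sanity check of the generated files can catch an empty or malformed output early. The sketch below assumes the layout described above: each JSON file holds a list of records with exactly the page_number, content_text, and pdf_file fields. The function name `summarize_json_folder` is hypothetical.

```python
import json
from pathlib import Path

# Fields each page record is expected to carry, per the description above.
EXPECTED_FIELDS = {"page_number", "content_text", "pdf_file"}


def summarize_json_folder(folder="./json"):
    """Return {json filename: page count}; raise if a record has other fields."""
    summary = {}
    for path in sorted(Path(folder).glob("*.json")):
        with open(path, encoding="utf-8") as f:
            records = json.load(f)
        for record in records:
            if set(record) != EXPECTED_FIELDS:
                raise ValueError(f"{path.name}: unexpected fields {sorted(record)}")
        summary[path.name] = len(records)
    return summary


if __name__ == "__main__":
    for name, pages in summarize_json_folder().items():
        print(f"{name}: {pages} pages")
```

Comparing the page counts against the page counts of the source PDFs is a simple way to spot a file that was only partially parsed.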

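Load data into Elastic Cloud Serverless

The following code blocks handle:

Setting up the imports for the Elasticsearch client and environment variables
Creating the index in Elastic Cloud Serverless
Loading the JSON files from the ./json directory into the pdf-chat index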

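Set up the imports for the Elasticsearch client and environment variables

The most important import is Elasticsearch. This class is responsible for connecting to Elastic Cloud Serverless and for creating and populating the pdf-chat index.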

import json
import os

from dotenv import load_dotenv
from elasticsearch import Elasticsearch
from tqdm import tqdm

# Read the Elasticsearch URL and API key from the .env file.
load_dotenv()

ES_URL = os.getenv('ES_URL')
ES_API_KEY = os.getenv('ES_API_KEY')

es = Elasticsearch(hosts=ES_URL, api_key=ES_API_KEY, request_timeout=300)

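Create the index in Elastic Cloud Serverless

This code block creates an index named "pdf-chat" with the following mappings:

page_content - for testing RAG with full-text search
page_content_sparse - for testing RAG with sparse vectors
page_content_dense - for testing RAG with dense vectors
page_number - useful for building citations
pdf_file - useful for building citations

Note the use of copy_to and semantic_text. The copy_to parameter copies page_content into two semantic_text fields. Each semantic_text field is mapped to an ML inference endpoint, one for sparse vectors and one for dense vectors. The Elastic-provided ML inference automatically splits each page into 250-token chunks with a 100-token overlap.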

index_name = "pdf-chat"
index_body = {
    "mappings": {
        "properties": {
            "page_content": {
                "type": "text",
                "copy_to": ["page_content_sparse", "page_content_dense"]
            },
            "page_content_sparse": {
                "type": "semantic_text",
                "inference_id": ".elser-2-elasticsearch"
            },
            "page_content_dense": {
                "type": "semantic_text",
                "inference_id": ".multilingual-e5-small-elasticsearch"
            },
            "page_number": {"type": "text"},
            "pdf_file": {
                "type": "text",
                "fields": {"keyword": {"type": "keyword"}}
            }
        }
    }
}

# Start from a clean slate: drop the index if it already exists.
if es.indices.exists(index=index_name):
    es.indices.delete(index=index_name)
    print(f"Index '{index_name}' deleted successfully.")

response = es.indices.create(index=index_name, body=index_body)
if 'acknowledged' in response and response['acknowledged']:
    print(f"Index '{index_name}' created successfully.")
elif 'error' in response:
    print(f"Failed to create: '{index_name}'")
    print(f"Error: {response['error']['reason']}")
else:
    print(f"Index '{index_name}' already exists.")
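The overlapping chunking mentioned above happens inside the inference service, so no notebook code is needed for it. To picture what a 250-unit window with a 100-unit overlap does to a page, here is a rough sketch that slides a window over a list of whitespace-separated words. This is an illustration only: the function name `chunk_with_overlap` is hypothetical, and splitting on whitespace is a simplifying assumption that does not reproduce how Elastic actually tokenizes text or chooses chunk boundaries.

```python
def chunk_with_overlap(tokens, chunk_size=250, overlap=100):
    """Split `tokens` into windows of `chunk_size` that share `overlap` items."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks


if __name__ == "__main__":
    # Assumption: whitespace-separated words stand in for real tokens.
    words = ("word " * 400).split()
    for i, chunk in enumerate(chunk_with_overlap(words)):
        print(f"chunk {i}: {len(chunk)} items")
```

With these settings, a 400-word page yields two chunks, and the last 100 words of the first chunk reappear at the start of the second, which is why a sentence that straddles a boundary is still retrievable in one piece.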

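Load the JSON files from the `./json` directory into the `pdf-chat` index

This process takes several minutes because we need to:

Load 402 pages of PDF data
Create sparse text embeddings for each page_content chunk
Create dense text embeddings for each page_content chunk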

files = os.listdir(output_folder_pdf)
with tqdm(total=len(files), desc="Indexing PDF docs") as pbar_files:
    for file in files:
        with open(output_folder_pdf + "/" + file) as f:
            data = json.loads(f.read())

        with tqdm(total=len(data), desc=f"Processing {file}") as pbar_pages:
            for page in data:
                doc = {
                    "page_content": page['content_text'],
                    "page_number": page['page_number'],
                    "pdf_file": page['pdf_file']
                }
                # Document ID follows the FILENAME_PAGENUMBER convention.
                # Named doc_id to avoid shadowing the built-in id().
                doc_id = f"{page['pdf_file']}_{page['page_number']}"
                es.index(index=index_name, id=doc_id, document=doc)
                pbar_pages.update(1)

        pbar_files.update(1)

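One last code trick is worth mentioning. We set the Elastic document ID using the naming convention FILENAME_PAGENUMBER. This makes it easy to see which PDF file and page number a citation refers to when verifying it in Playground.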
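The ID convention works in both directions: given an ID shown next to a citation, the file name and page number can be recovered. A minimal sketch follows, with hypothetical helper names `make_doc_id` and `split_doc_id`. It splits on the last underscore so that file names that themselves contain underscores still parse.

```python
def make_doc_id(pdf_file, page_number):
    """Build an ID following the FILENAME_PAGENUMBER convention."""
    return f"{pdf_file}_{page_number}"


def split_doc_id(doc_id):
    """Recover (pdf_file, page_number) by splitting on the last underscore."""
    pdf_file, _, page = doc_id.rpartition("_")
    return pdf_file, int(page)


if __name__ == "__main__":
    doc_id = make_doc_id("elastic-10Q-Q2-2025.pdf", 12)
    print(doc_id)
    print(split_doc_id(doc_id))
```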

Elastic Cloud Serverless

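Elastic Cloud Serverless is an excellent choice for prototyping a new Retrieval-Augmented Generation (RAG) system because it provides fully managed, scalable infrastructure without the complexity of manual cluster management. It supports sparse and dense vector search out of the box, so you can experiment efficiently with different retrieval strategies. With built-in semantic text embeddings, relevance ranking, and hybrid search, Elastic Cloud Serverless speeds up iteration cycles for search-powered applications.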
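As a rough picture of what "different retrieval strategies" means for this index, the sketch below builds one query body per strategy against the three fields defined in the mapping: a `match` query on the text field, and a `semantic` query on each semantic_text field. The helper name `build_queries` is hypothetical, and the sketch assumes that the `semantic` query type is available in your deployment. Playground builds its own queries, so this is for orientation rather than a copy of what it sends.

```python
def build_queries(question):
    """Return one Elasticsearch query body per retrieval strategy."""
    return {
        # Classic full-text search on the raw page text.
        "full_text": {"match": {"page_content": question}},
        # Sparse-vector retrieval through the semantic_text field.
        "sparse": {
            "semantic": {"field": "page_content_sparse", "query": question}
        },
        # Dense-vector retrieval through the semantic_text field.
        "dense": {
            "semantic": {"field": "page_content_dense", "query": question}
        },
    }


if __name__ == "__main__":
    prompt = "Compare/contrast subscription revenue for Q2-2025 and Q1-2025?"
    for name, query in build_queries(prompt).items():
        print(name, query)
```

Each body could be passed as the `query` argument of the Elasticsearch client's `search` method against the pdf-chat index, which makes it straightforward to compare the top hits per strategy for the same prompt.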

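With Azure AI Document Intelligence and a little Python code, we are ready to see whether we can get the LLM to answer questions grounded in real data. Let's open Playground and run some manual A/B tests with different query strategies.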