Doc2X: A High-Accuracy, Cost-Effective Document Parsing API for Building an Agent That Interprets Arxiv Papers

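Preface

In the era of large AI models, Retrieval-Augmented Generation (RAG) has become a core architecture for building intelligent knowledge bases and question-answering systems. In real projects, however, developers tend to hit one key pain point: how to convert documents in many different formats into high-quality structured data that can then be vectorized and retrieved.

Traditional document-parsing options have clear limitations: open-source tools lack accuracy, commercial products are expensive, and complex documents (especially academic papers containing formulas and figures) are parsed poorly. Doc2X emerged against this background, offering developers a high-accuracy, cost-effective document-parsing solution.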

Official website: https://doc2x.noedgeai.com/
Doc2X API documentation: https://noedgeai.feishu.cn/wiki/Q8QIw3PT7i4QghkhPoecsmSCnG1

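Doc2X Product Overview

Doc2X is a document-parsing API service designed for developers. It converts PDFs, images, and other document formats into structured formats such as Markdown, LaTeX, HTML, and Word. Its core strengths are summarized below.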

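🎯 Excellent parsing accuracy

Compared with traditional open-source solutions and other commercial tools, Doc2X stands out on complex documents:

Complex layout handling: for documents with multi-column layouts or mixed text and images, it accurately recognizes and preserves the structure
Cross-page table merging: it detects tables that span page boundaries and merges them, keeping the data complete
Image content extraction: it not only extracts images but also recognizes the text inside them and the corresponding caption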

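🧮 Leading formula recognition

This is one of Doc2X's core competitive advantages:

Multiple formula styles: both printed formulas and some handwritten formulas are recognized with high accuracy
Standard LaTeX output: the converted result conforms to standard LaTeX and can be rendered with MathJax
Word compatibility: converted formulas display correctly in Word, avoiding garbled output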
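As a small illustration of why delimiter-level LaTeX compatibility matters: Markdown renderers differ in which math delimiters they accept, so a pipeline sometimes has to normalize `\(...\)` and `\[...\]` into `$...$` and `$$...$$`. The sketch below shows one naive way to do that. The helper name `to_dollar_math` is hypothetical and is not part of Doc2X or pdfdeal.

```python
import re

# Hypothetical helper for illustration only (not part of Doc2X or pdfdeal).
_DISPLAY = re.compile(r"\\\[(.+?)\\\]", re.DOTALL)  # \[ ... \]
_INLINE = re.compile(r"\\\((.+?)\\\)", re.DOTALL)   # \( ... \)


def to_dollar_math(markdown: str) -> str:
    r"""Rewrite \[...\] as $$...$$ and \(...\) as $...$."""
    # Lambdas are used so backslashes inside the formula are never
    # interpreted as regex replacement escapes.
    markdown = _DISPLAY.sub(lambda m: "$$" + m.group(1).strip() + "$$", markdown)
    markdown = _INLINE.sub(lambda m: "$" + m.group(1).strip() + "$", markdown)
    return markdown


if __name__ == "__main__":
    print(to_dollar_math(r"Energy \(E = mc^2\) and \[a^2 + b^2 = c^2\]"))
```

This is deliberately simple: it does not special-case LaTeX constructs such as `\\[2pt]` inside a formula, so treat it as a starting point rather than a complete converter.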

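💰 Excellent value for money

Compared with similar products, Doc2X offers more competitive pricing, so small and medium-sized businesses and individual developers can also afford high-quality document parsing. The price is 0.02 yuan per page.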
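To make the per-page price concrete, here is a minimal sketch of the arithmetic. It assumes the 0.02 yuan per page figure quoted above still applies; the function names are hypothetical, and current pricing should be checked on the official site.

```python
from decimal import Decimal

# Price quoted in the text; an assumption that may not match current pricing.
PRICE_PER_PAGE_YUAN = Decimal("0.02")


def pages_for_budget(budget_yuan: str) -> int:
    """Number of whole pages a given budget buys."""
    return int(Decimal(budget_yuan) // PRICE_PER_PAGE_YUAN)


def cost_for_pages(pages: int) -> Decimal:
    """Cost in yuan of parsing the given number of pages."""
    return PRICE_PER_PAGE_YUAN * pages


if __name__ == "__main__":
    print(pages_for_budget("10"))  # budget in yuan
    print(cost_for_pages(30))      # e.g. one 30-page paper
```

Decimal is used instead of float so that small currency amounts do not pick up binary rounding noise.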

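The official trial platform is also running a promotion for new users at the moment, so you can try the results yourself; daily check-ins earn extra parsing-page quota.

Before using Doc2X, let's first review: what are the key steps in building a RAG system?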

Multiple ways to use it

API calls

Doc2X API v2 PDF interface documentation: https://noedgeai.feishu.cn/wiki/Q8QIw3PT7i4QghkhPoecsmSCnG1


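pdfdeal, the official SDK wrapper

Source code: https://github.com/NoEdgeAI/pdfdeal-docs
Documentation: https://noedgeai.github.io/pdfdeal-docs/zh/guide/

The documentation is very beginner-friendly and includes some tutorials, so it is worth trying out.

Desktop application: can be installed and used on multiple platforms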

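Core Value in Building RAG Systems

Key role in the data preprocessing stage

In the RAG build pipeline, Doc2X mainly does the following:

Document standardization: converts documents in various formats into one machine-friendly format
Information completeness: ensures that key information such as formulas, tables, and figures is not lost
Structured output: provides a high-quality data source for the later text chunking and vectorization steps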
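The chunking step that consumes this structured output can be sketched as follows. This is a minimal example, assuming the parsed result is Markdown with `#`-style headings; `split_markdown_by_heading` is a hypothetical helper, not something provided by Doc2X or pdfdeal.

```python
import re
from typing import Dict, List

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_markdown_by_heading(markdown: str) -> List[Dict[str, str]]:
    """Split Markdown into one chunk per heading, keeping the heading as metadata."""
    chunks: List[Dict[str, str]] = []
    title, body = "", []
    in_fence = False

    def flush() -> None:
        # Skip empty leading content that has neither a title nor text.
        if title or any(line.strip() for line in body):
            chunks.append({"title": title, "text": "\n".join(body).strip()})

    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # do not treat '#' inside code fences as headings
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            title, body = match.group(2).strip(), []
        else:
            body.append(line)
    flush()
    return chunks


if __name__ == "__main__":
    doc = "# Intro\nRAG needs clean text.\n\n## Method\n| a | b |\n|---|---|\n| 1 | 2 |\n"
    for chunk in split_markdown_by_heading(doc):
        print(chunk["title"], "->", len(chunk["text"]), "chars")
```

Splitting on headings keeps each table and formula inside a single chunk, which is one reason structure-preserving parsing helps retrieval. Long sections would still need a secondary size-based split.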

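Improving overall RAG system quality

High-quality document parsing directly affects the final performance of a RAG system.

Better retrieval accuracy:

Accurate text ensures that key information is indexed correctly
Preserved document structure helps with understanding context
Complete formulas and tables improve recall for domain-specific queries

Better generation quality:

Structured input lets the large model understand the document content more easily
Accurate formula representation avoids misinterpretation during generation
Richer context improves the accuracy and completeness of answers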

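Parsing Results on Academic Paper PDFs

I have been reading a lot of papers lately and happened to come across this rather good tool. Compared with open-source tools, it is easier to call and to build applications on. I topped up 10 yuan, which buys a 500-page quota, to test how well it works for paper interpretation.

I feed the Markdown that Doc2X produces from an Arxiv paper into an LLM service, which then outputs an interpretation of the whole paper. The goal is to automate as much of this as possible:

Search Arxiv for a list of papers matching a query
Pick a paper and download its PDF
Send the PDF to the Doc2X API service for parsing
Call an LLM on the parsed result to produce a templated paper interpretation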
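The four steps above can be sketched as a small pipeline skeleton. This is only a structural sketch: `PaperPipeline` and its four callables are hypothetical names, and each step is injected as a function so the skeleton does not depend on any particular library.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PaperPipeline:
    """Hypothetical orchestration of search -> download -> parse -> interpret."""
    search: Callable[[str], List[str]]  # query -> list of paper IDs
    download: Callable[[str], str]      # paper ID -> local PDF path
    parse: Callable[[str], str]         # PDF path -> Markdown text
    interpret: Callable[[str], str]     # Markdown text -> interpretation

    def run(self, query: str, pick: int = 0) -> str:
        paper_ids = self.search(query)
        if not paper_ids:
            raise ValueError(f"no papers found for query: {query!r}")
        pdf_path = self.download(paper_ids[pick])
        markdown = self.parse(pdf_path)
        return self.interpret(markdown)


if __name__ == "__main__":
    # Stub steps stand in for the real search, download, parse, and LLM calls.
    demo = PaperPipeline(
        search=lambda q: ["paper-1"],
        download=lambda pid: f"./downloads/{pid}.pdf",
        parse=lambda path: f"# Parsed from {path}",
        interpret=lambda md: f"Summary of: {md}",
    )
    print(demo.run("Retrieval Augmented Generation"))
```

Injecting the steps keeps each one replaceable and makes the flow easy to test with stubs before any API key is involved.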

Let's see how to implement this.

Arxiv Paper Search

First, install the arxiv package:

pip install arxiv

PyPI documentation: https://pypi.org/project/arxiv/
Next, we implement Arxiv paper search and PDF download:

import arxiv
from typing import List, Optional
from pathlib import Path


class ArxivSearcher:
    """Utility class for searching and downloading Arxiv papers."""

    def __init__(self):
        """Initialize the Arxiv client."""
        self.client = arxiv.Client()

    def search_papers(self, query: str, max_results: int = 10,
                      sort_by: arxiv.SortCriterion = arxiv.SortCriterion.Relevance) -> List[arxiv.Result]:
        """Search for papers.

        Args:
            query: search query
            max_results: maximum number of results
            sort_by: sort order

        Returns:
            list of paper results
        """
        search = arxiv.Search(
            query=query,
            max_results=max_results,
            sort_by=sort_by
        )
        results = list(self.client.results(search))
        return results

    def search_by_id(self, paper_ids: List[str]) -> List[arxiv.Result]:
        """Look up papers by ID.

        Args:
            paper_ids: list of paper IDs

        Returns:
            list of paper results
        """
        search = arxiv.Search(id_list=paper_ids)
        results = list(self.client.results(search))
        return results

    def download_paper(self, paper_id: str, download_dir: str = "./downloads",
                       filename: Optional[str] = None) -> str:
        """Download the PDF of the given paper.

        Args:
            paper_id: paper ID
            download_dir: download directory
            filename: custom file name

        Returns:
            full path of the downloaded file
        """
        # Make sure the download directory exists
        Path(download_dir).mkdir(parents=True, exist_ok=True)

        # Look up the paper
        papers = self.search_by_id([paper_id])
        if not papers:
            raise ValueError(f"No paper found with ID {paper_id}")
        paper = papers[0]

        # Download the PDF
        if filename:
            filepath = paper.download_pdf(dirpath=download_dir, filename=filename)
        else:
            filepath = paper.download_pdf(dirpath=download_dir)
        return filepath

    def print_paper_info(self, papers: List[arxiv.Result]) -> None:
        """Print paper information.

        Args:
            papers: list of paper results
        """
        for i, paper in enumerate(papers, 1):
            print(f"\n{i}. Title: {paper.title}")
            print(f"   Authors: {', '.join([author.name for author in paper.authors])}")
            print(f"   Published: {paper.published.strftime('%Y-%m-%d')}")
            print(f"   Abstract: {paper.summary[:200]}...")
            print(f"   PDF link: {paper.pdf_url}")
            print(f"   Paper ID: {paper.entry_id.split('/')[-1]}")

    def search_and_display(self, query: str, max_results: int = 10) -> List[arxiv.Result]:
        """Search for papers and print their information.

        Args:
            query: search query
            max_results: maximum number of results

        Returns:
            list of paper results
        """
        print(f"Searching for: {query}")
        print(f"Max results: {max_results}")
        print("-" * 80)
        papers = self.search_papers(query, max_results)
        self.print_paper_info(papers)
        return papers


# Usage example
if __name__ == "__main__":
    # Create a searcher instance
    searcher = ArxivSearcher()

    # Example 1: search for papers related to "Retrieval Augmented Generation"
    print("=" * 80)
    print("Example 1: search for papers related to 'Retrieval Augmented Generation'")
    print("=" * 80)
    rag_papers = searcher.search_and_display(
        query="Retrieval Augmented Generation RAG",
        max_results=5
    )

    if rag_papers:
        print("\n" + "=" * 80)
        print("Example 2: download the first paper")
        print("=" * 80)
        first_paper = rag_papers[0]
        paper_id = first_paper.entry_id.split('/')[-1]
        try:
            downloaded_path = searcher.download_paper(
                paper_id=paper_id,
                download_dir="./downloads",
                filename=f"rag_paper_{paper_id}.pdf"
            )
            print(f"Paper downloaded to: {downloaded_path}")
        except Exception as e:
            print(f"Download failed: {e}")
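The search code derives the paper ID with `entry_id.split('/')[-1]`, which works for current-style IDs such as `2505.22571v3`. Old-style Arxiv IDs contain a slash (for example `hep-th/9901001`), and that split would drop the archive prefix. The sketch below is one way to handle both and to separate the version suffix; `split_arxiv_id` is a hypothetical helper and is not part of the arxiv package.

```python
import re
from typing import Optional, Tuple

# Base ID followed by an optional "v<digits>" version suffix.
_ID_VERSION = re.compile(r"^(?P<base>.+?)(?:v(?P<version>\d+))?$")


def split_arxiv_id(entry_id: str) -> Tuple[str, Optional[int]]:
    """Return (base_id, version) from an entry URL or a bare ID."""
    # Everything after "/abs/" is the ID; a bare ID passes through unchanged.
    short_id = entry_id.rstrip("/").split("/abs/")[-1]
    match = _ID_VERSION.match(short_id)
    if match is None:
        raise ValueError(f"cannot parse an Arxiv ID from {entry_id!r}")
    version = match.group("version")
    return match.group("base"), int(version) if version else None


if __name__ == "__main__":
    print(split_arxiv_id("http://arxiv.org/abs/2505.22571v3"))
    print(split_arxiv_id("http://arxiv.org/abs/hep-th/9901001v1"))
```

Keeping the version separate also makes it easy to build stable file names that do not change when a paper is revised.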

Doc2X Paper Parsing

from pdfdeal import Doc2X
from pathlib import Path
from typing import Union, List, Tuple, Optional
import os
import zipfile


class PDFParser:
    """PDF parser that converts PDF files into Markdown content."""

    def __init__(self, api_key: str, debug: bool = True, thread: int = 5, full_speed: bool = True):
        """Initialize the PDF parser.

        Args:
            api_key: Doc2X API key
            debug: whether to enable debug mode
            thread: number of threads
            full_speed: whether to enable full-speed mode
        """
        self.client = Doc2X(
            apikey=api_key,
            debug=debug,
            thread=thread,
            full_speed=full_speed
        )

    def _extract_zip_file(self, zip_path: str, extract_to: Optional[str] = None) -> str:
        """Extract a ZIP file.

        Args:
            zip_path: path of the ZIP file
            extract_to: target directory; if None, extract next to the ZIP file

        Returns:
            path of the directory the archive was extracted to
        """
        if not os.path.exists(zip_path):
            raise FileNotFoundError(f"ZIP file does not exist: {zip_path}")

        # Default to the directory that contains the ZIP file
        if extract_to is None:
            extract_to = os.path.dirname(zip_path)

        # Create the target directory
        Path(extract_to).mkdir(parents=True, exist_ok=True)

        # Extract the archive
        with zipfile.ZipFile(zip_path, 'r') as zip_ref:
            zip_ref.extractall(extract_to)
        print(f"ZIP file extracted to: {extract_to}")
        return extract_to

    def parse_pdf_to_markdown_with_auto_extract(
        self,
        pdf_path: str,
        output_path: str = "./Output",
        output_format: str = "md",
        ocr: bool = True,
        convert: bool = False,
        auto_extract: bool = True,
        keep_zip: bool = False
    ) -> Tuple[Union[str, List[str]], List[dict], bool, Optional[str]]:
        """Parse a PDF into Markdown and automatically extract any ZIP file produced.

        Args:
            pdf_path: path of the PDF file
            output_path: output directory
            output_format: output format; supports 'md', 'md_dollar', 'text', 'texts', 'detailed'
            ocr: whether to use OCR
            convert: whether to convert [ and [[ to $ and $$
            auto_extract: whether to extract ZIP files automatically
            keep_zip: whether to keep the original ZIP file

        Returns:
            tuple of (converted content or file paths, failure info, error flag, extraction directory)
        """
        # Same existence check, directory creation, and conversion call as parse_pdf_to_markdown
        success, failed, flag = self.parse_pdf_to_markdown(
            pdf_path=pdf_path,
            output_path=output_path,
            output_format=output_format,
            ocr=ocr,
            convert=convert,
        )

        extract_dir = None
        # If conversion succeeded and auto-extraction is requested
        if not flag and auto_extract:
            # The result may be a single path or a list of paths
            if isinstance(success, str):
                outputs = [success]
            elif isinstance(success, list):
                outputs = success
            else:
                outputs = []
            for file_path in outputs:
                if isinstance(file_path, str) and file_path.endswith('.zip'):
                    try:
                        extract_dir = self._extract_zip_file(file_path)
                        # Delete the ZIP file unless asked to keep it
                        if not keep_zip:
                            os.remove(file_path)
                            print(f"Deleted ZIP file: {file_path}")
                        print(f"Extraction finished, files are in: {extract_dir}")
                    except Exception as e:
                        print(f"Error while extracting ZIP file {file_path}: {e}")
        return success, failed, flag, extract_dir

    def parse_existing_zip(self, zip_path: str, extract_to: Optional[str] = None,
                           keep_zip: bool = False) -> str:
        """Extract an existing ZIP file.

        Args:
            zip_path: path of the ZIP file
            extract_to: target directory
            keep_zip: whether to keep the original ZIP file

        Returns:
            path of the directory the archive was extracted to
        """
        extract_dir = self._extract_zip_file(zip_path, extract_to)
        if not keep_zip:
            os.remove(zip_path)
            print(f"Deleted ZIP file: {zip_path}")
        return extract_dir

    def parse_pdf_to_markdown(
        self,
        pdf_path: str,
        output_path: str = "./Output",
        output_format: str = "md",
        ocr: bool = True,
        convert: bool = False,
    ) -> Tuple[Union[str, List[str]], List[dict], bool]:
        """Parse a PDF file into Markdown content.

        Args:
            pdf_path: path of the PDF file
            output_path: output directory
            output_format: output format; supports 'md', 'md_dollar', 'text', 'texts', 'detailed'
            ocr: whether to use OCR
            convert: whether to convert [ and [[ to $ and $$

        Returns:
            tuple of (converted content or file paths, failure info, error flag)
        """
        # Check that the PDF file exists
        if not os.path.exists(pdf_path):
            raise FileNotFoundError(f"PDF file does not exist: {pdf_path}")

        # Make sure the output directory exists
        Path(output_path).mkdir(parents=True, exist_ok=True)

        # Call Doc2X to do the conversion
        success, failed, flag = self.client.pdf2file(
            pdf_file=pdf_path,
            output_path=output_path,
            output_format=output_format,
            ocr=ocr,
            convert=convert,
        )
        return success, failed, flag

    def parse_pdf_to_text(self, pdf_path: str) -> str:
        """Parse a PDF file into a plain-text string.

        Args:
            pdf_path: path of the PDF file

        Returns:
            parsed text content
        """
        success, failed, flag = self.parse_pdf_to_markdown(
            pdf_path=pdf_path,
            output_format="text"
        )
        if flag:  # an error occurred
            raise Exception(f"PDF parsing failed: {failed}")
        return success

    def parse_pdf_to_pages(self, pdf_path: str) -> List[str]:
        """Parse a PDF file into a list of per-page texts.

        Args:
            pdf_path: path of the PDF file

        Returns:
            list of texts, one per page
        """
        success, failed, flag = self.parse_pdf_to_markdown(
            pdf_path=pdf_path,
            output_format="texts"
        )
        if flag:  # an error occurred
            raise Exception(f"PDF parsing failed: {failed}")
        return success

    def parse_pdf_to_markdown_file(self, pdf_path: str, output_path: str = "./Output",
                                   custom_filename: Optional[str] = None) -> str:
        """Convert a PDF file to a Markdown file and save it.

        Args:
            pdf_path: path of the PDF file
            output_path: output directory
            custom_filename: custom output file name

        Returns:
            path of the generated Markdown file
        """
        output_names = None
        if custom_filename:
            output_names = [custom_filename]
        success, failed, flag = self.client.pdf2file(
            pdf_file=pdf_path,
            output_names=output_names,
            output_path=output_path,
            output_format="md",
            ocr=True
        )
        if flag:  # an error occurred
            raise Exception(f"PDF conversion failed: {failed}")
        return success[0] if isinstance(success, list) else success

    def batch_parse_pdfs(self, pdf_paths: List[str], output_path: str = "./Output",
                         output_format: str = "md") -> Tuple[List[str], List[dict], bool]:
        """Parse multiple PDF files in one batch.

        Args:
            pdf_paths: list of PDF file paths
            output_path: output directory
            output_format: output format

        Returns:
            list of converted file paths, list of failure info, error flag
        """
        # Check that every PDF file exists
        for pdf_path in pdf_paths:
            if not os.path.exists(pdf_path):
                raise FileNotFoundError(f"PDF file does not exist: {pdf_path}")

        # Make sure the output directory exists
        Path(output_path).mkdir(parents=True, exist_ok=True)

        # Batch conversion
        success, failed, flag = self.client.pdf2file(
            pdf_file=pdf_paths,
            output_path=output_path,
            output_format=output_format,
            ocr=True
        )
        return success, failed, flag

    def get_markdown_content(self, pdf_path: str) -> str:
        """Get the Markdown content of a PDF directly (without saving a file).

        Args:
            pdf_path: path of the PDF file

        Returns:
            text content in Markdown format
        """
        success, failed, flag = self.parse_pdf_to_markdown(
            pdf_path=pdf_path,
            output_format="text",
            convert=True  # convert the math formula delimiters
        )
        if flag:  # an error occurred
            raise Exception(f"PDF parsing failed: {failed}")
        return success


# Usage example
if __name__ == "__main__":
    # Initialize the parser (replace the placeholder with your own API key)
    parser = PDFParser(api_key="YOUR_DOC2X_API_KEY")

    # Example: parse a PDF and extract the result automatically
    pdf_path = "downloads/recent_rag_paper_2505.22571v3.pdf"
    if os.path.exists(pdf_path):
        try:
            print("\nParsing the PDF and extracting automatically...")
            success, failed, flag, extract_dir = parser.parse_pdf_to_markdown_with_auto_extract(
                pdf_path=pdf_path,
                output_path="./auto_extract_output",
                output_format="md",
                auto_extract=True,
                keep_zip=False  # do not keep the ZIP file
            )
            if not flag:
                print("PDF parsed successfully!")
                if extract_dir:
                    print(f"Content extracted to: {extract_dir}")
                else:
                    print(f"Generated file: {success}")
            else:
                print(f"PDF parsing failed: {failed}")
        except Exception as e:
            print(f"Error while parsing the PDF: {e}")
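After extraction, the next step needs the Markdown text itself. Here is a minimal sketch for loading it. It assumes the extracted directory contains one or more `.md` files, possibly alongside an image folder; that layout is an assumption and not a documented guarantee of Doc2X. `load_extracted_markdown` is a hypothetical helper.

```python
from pathlib import Path
from typing import Optional


def load_extracted_markdown(extract_dir: str) -> Optional[str]:
    """Return the text of the largest .md file under extract_dir, or None if there is none."""
    candidates = list(Path(extract_dir).rglob("*.md"))
    if not candidates:
        return None
    # Heuristic (assumption): the largest Markdown file is the parsed paper.
    largest = max(candidates, key=lambda p: p.stat().st_size)
    return largest.read_text(encoding="utf-8")


if __name__ == "__main__":
    text = load_extracted_markdown("./auto_extract_output")
    print("no Markdown found" if text is None else f"loaded {len(text)} characters")
```

Returning None instead of raising lets the caller decide whether a missing file is an error or simply means the parse has not run yet.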

Images in the paper are parsed correctly

Tables in the paper are parsed completely correctly

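Templated Paper Interpretation

We call an LLM service, pass in the paper's Markdown content, and generate the following sections:

Research motivation: analyze the core problem and background the paper addresses
Research status: summarize the state of the field and prior work
Innovations: analyze where the paper's new ideas come from
Solution: analyze the proposed solution in detail
Experimental design: analyze the experimental design and validation methods
Conclusions: summarize the paper's main findings and conclusions
Future directions: analyze the future research directions the paper proposes
Pseudocode: generate pseudocode for the core algorithm based on the paper's content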
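The eight sections above map naturally to one prompt per section. The sketch below is an assumption about how such prompts could be assembled, not the prompts actually used in the app; the wording, the `max_chars` budget, and the name `build_section_prompts` are all hypothetical.

```python
from typing import Dict

# Section name -> instruction, following the eight sections listed above.
SECTIONS: Dict[str, str] = {
    "Research motivation": "Analyze the core problem and background the paper addresses.",
    "Research status": "Summarize the state of the field and prior work.",
    "Innovations": "Analyze where the paper's new ideas come from.",
    "Solution": "Analyze the proposed solution in detail.",
    "Experimental design": "Analyze the evaluation setup and validation methods.",
    "Conclusions": "Summarize the main findings and conclusions.",
    "Future directions": "Analyze the future research directions the paper proposes.",
    "Pseudocode": "Write pseudocode for the core algorithm described in the paper.",
}


def build_section_prompts(paper_markdown: str, max_chars: int = 60000) -> Dict[str, str]:
    """Build one prompt per section, truncating the paper to a character budget."""
    paper = paper_markdown[:max_chars]  # crude guard against overly long inputs
    return {
        name: (
            "You are reading a research paper.\n"
            f"Task ({name}): {task}\n\n"
            f"<paper>\n{paper}\n</paper>"
        )
        for name, task in SECTIONS.items()
    }


if __name__ == "__main__":
    prompts = build_section_prompts("# Title\n\nBody of the paper")
    print(list(prompts))
```

One prompt per section keeps each answer focused, at the cost of sending the paper text eight times; a single combined prompt is the cheaper alternative.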

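I built a Streamlit app for this; let's walk through how to use it.

First, search for some papers about RAG

Then pick a paper of interest and download it

Then parse it with Doc2X

Call DeepSeek to interpret the paper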
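The model call itself can be kept behind a single function so that the rest of the app does not depend on a particular provider. The following is a minimal sketch under that assumption: `chat` stands for any callable that sends a prompt to the model (DeepSeek in this walkthrough) and returns its text reply. Both `chat` and `interpret_paper` are hypothetical names.

```python
from typing import Callable, Dict


def interpret_paper(prompts: Dict[str, str], chat: Callable[[str], str]) -> str:
    """Call the model once per section and join the answers under plain headings."""
    parts = []
    for name, prompt in prompts.items():
        try:
            answer = chat(prompt).strip()
        except Exception as exc:
            # Keep going so one failed section does not lose the others.
            answer = f"(generation failed: {exc})"
        parts.append(f"{name}\n\n{answer}")
    return "\n\n".join(parts)


if __name__ == "__main__":
    # A stub model that echoes the prompt length, standing in for a real client.
    report = interpret_paper(
        {"Research motivation": "prompt text", "Conclusions": "another prompt"},
        chat=lambda prompt: f"reply to {len(prompt)} characters",
    )
    print(report)
```

In the real app, `chat` would wrap whichever client library is used to reach the model.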

Here is the interpretation result:

Finally, the generated pseudocode:

import numpy as np
from typing import List, Dict, Tuple
from transformers import AutoModelForCausalLM, AutoTokenizer


class RAGInstruct:
    def __init__(self, corpus: List[str],
                 retriever_model: str = "contriever-msmarco",
                 llm_model: str = "gpt-4"):
        """Initialize the RAG-Instruct generator.

        Parameters:
            corpus: external knowledge corpus
            retriever_model: name of the retrieval model
            llm_model: large language model used to generate instructions
        """
        self.corpus = corpus
        self.retriever = self._load_retriever(retriever_model)
        self.llm = self._load_llm(llm_model)
        self.instruction_datasets = self._load_exemplar_datasets()

    def generate_rag_instructions(self, num_instructions: int = 40000,
                                  max_docs_per_instruction: int = 5) -> List[Dict]:
        """Generate the RAG instruction dataset.

        Parameters:
            num_instructions: number of instructions to generate
            max_docs_per_instruction: maximum number of documents per instruction

        Returns:
            the generated RAG instruction dataset
        """
        dataset = []
        for _ in range(num_instructions):
            # 1. Randomly pick a RAG paradigm
            rag_paradigm = self._sample_rag_paradigm()
            # 2. Randomly pick an exemplar instruction as the template
            exemplar_instruction = self._sample_exemplar_instruction()
            # 3. Retrieve relevant documents for the exemplar instruction
            relevant_docs = self._retrieve_docs(exemplar_instruction, top_k=max_docs_per_instruction)
            # 4. Filter documents according to the RAG paradigm
            selected_docs = self._select_docs_by_paradigm(relevant_docs, rag_paradigm)
            # 5. Randomly sample unrelated documents as noise
            unrelated_docs = self._sample_unrelated_docs(selected_docs)
            # 6. Use the LLM to generate the RAG instruction and answer
            instruction, answer = self._generate_with_llm(
                selected_docs, unrelated_docs, exemplar_instruction, rag_paradigm
            )
            # 7. Add to the dataset
            dataset.append({
                "instruction": instruction,
                "answer": answer,
                "relevant_docs": selected_docs,
                "unrelated_docs": unrelated_docs,
                "paradigm": rag_paradigm
            })
        return dataset

    def _sample_rag_paradigm(self) -> str:
        """Randomly sample one of the 5 RAG paradigms."""
        paradigms = ["r0", "r1", "r2", "r3", "r4"]  # the 5 paradigms in the paper
        weights = [0.1, 0.2, 0.2, 0.3, 0.2]  # sampling weight of each paradigm
        return np.random.choice(paradigms, p=weights)

    def _generate_with_llm(self, relevant_docs: List[str],
                           unrelated_docs: List[str],
                           exemplar_instruction: str,
                           paradigm: str) -> Tuple[str, str]:
        """Use the LLM to generate the RAG instruction and answer.

        Parameters:
            relevant_docs: list of relevant documents
            unrelated_docs: list of unrelated documents
            exemplar_instruction: exemplar instruction template
            paradigm: RAG paradigm

        Returns:
            (generated instruction, generated answer)
        """
        # Build the LLM prompt (simplified; the real implementation is more complex)
        prompt = f"""
        <Documents>
        {self._format_docs(relevant_docs)}
        </Documents>
        Your task is to generate a question q* and response a* based on:
        - RAG Paradigm: {self._get_paradigm_description(paradigm)}
        - Simulated Instruction: {exemplar_instruction}
        """
        # Call the LLM
        response = self.llm.generate(prompt)
        return self._parse_llm_response(response)
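The generated pseudocode relies on helper methods that are not defined, so it cannot be run as-is. Its weighted paradigm sampling step can be tried in isolation, though. The sketch below reimplements only that step with the standard library. The weights are copied from the generated pseudocode and have not been checked against the paper, and `sample_paradigm` is a hypothetical name.

```python
import random
from collections import Counter
from typing import Dict

# Weights copied from the generated pseudocode; not verified against the paper.
PARADIGM_WEIGHTS: Dict[str, float] = {"r0": 0.1, "r1": 0.2, "r2": 0.2, "r3": 0.3, "r4": 0.2}


def sample_paradigm(rng: random.Random) -> str:
    """Draw one paradigm name with probability proportional to its weight."""
    names = list(PARADIGM_WEIGHTS)
    weights = list(PARADIGM_WEIGHTS.values())
    return rng.choices(names, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so repeated runs draw the same sequence
    counts = Counter(sample_paradigm(rng) for _ in range(10_000))
    for name in PARADIGM_WEIGHTS:
        print(name, counts[name] / 10_000)
```

With enough draws, the printed frequencies should sit close to the configured weights, which is a quick sanity check on any sampling scheme of this kind.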