1. Nanonets-OCR
Introduction
Nanonets-OCR goes beyond plain text extraction: it intelligently parses complex structures in document images, such as formulas, tables, watermarks, signatures, charts, and checkboxes, and outputs them as clean, well-formatted Markdown.
Core Features
● LaTeX formula recognition: automatically converts mathematical formulas in the document into standard LaTeX
● Intelligent image description: recognizes charts, QR codes, and similar content and generates structured descriptions
● Signature detection and isolation: precisely locates signatures within a document
● Watermark extraction: reliably detects and extracts watermark text from documents
● Checkbox recognition: normalizes checkbox states into unified symbols for downstream processing
● Complex table extraction: recognizes tables with nested structure and outputs them as Markdown/HTML
Training Process
The model was trained on 250,000 pages of image-text data spanning research, finance, healthcare, legal, invoice, and receipt domains, combining synthetic data with human annotation, and was fine-tuned on top of Qwen2.5-VL-3B.
Caveats:
● The fine-tuning data contains no handwritten samples, so handwriting recognition is not yet supported
● Some hallucination risk remains (a limitation of the model size)
2. Hands-On Testing
Online demo: https://huggingface.co/spaces/Souvik3333/Nanonets-ocr-s
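If you would rather run the model locally than use the Space, the sketch below shows one way to do it. It assumes the checkpoint published on Hugging Face as nanonets/Nanonets-OCR-s, loaded through the standard Qwen2.5-VL-style transformers interface (processor + chat template); the prompt wording and file name are illustrative, not the official ones.

```python
# Minimal local-inference sketch. Assumes the nanonets/Nanonets-OCR-s checkpoint
# exposes the standard Qwen2.5-VL-style transformers interface; the prompt text
# and input file name are illustrative placeholders.
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "nanonets/Nanonets-OCR-s"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

image = Image.open("page.png")
prompt = (
    "Extract the text from this document page as Markdown. "
    "Use LaTeX for equations, HTML for tables, and wrap watermarks in <watermark> tags."
)

messages = [{"role": "user", "content": [
    {"type": "image", "image": image},
    {"type": "text", "text": prompt},
]}]

chat = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[chat], images=[image], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=4096)
# Drop the prompt tokens and decode only the newly generated Markdown.
generated = output_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```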
● Paper cover page (with watermark)
Markdown output:
MonkeyOCR: Document Parsing with a Structure-Recognition-Relation Triplet Paradigm
Zhang Li1, Yuliang Liu1,?, Qiang Liu2, Zhiyin Ma1, Ziyang Zhang1, Shuo Zhang1, Zidun Guo1, Jiarui Zhang2, Xinyu Wang1, Xiang Bai11Huazhong University of Science and Technology, 2Kingsoft Office<img>
A bar chart comparing performance metrics across different datasets and models. The x-axis shows different document types (e.g., Formula (EN), Formula (ZH), Table (EN), Table (ZH), Exam paper, Academic Papers, Newspaper, Overall, Infer Speed) and their corresponding values. The y-axis shows the performance metric (e.g., accuracy, speed). The bars represent different models and their corresponding values.
</img>Figure 1: Performance comparison of MonkeyOCR and other SOTA models on OmniDocBench [33]. “Overall” represents the comprehensive evaluation across nine document types in OmniDocBench.Abstract
We introduce MonkeyOCR, a vision-language model for document parsing that advances the state of the art by leveraging a Structure-Recognition-Relation (SRR) triplet paradigm. This design simplifies what would otherwise be a complex multi-tool pipeline (as in MinerU’s modular approach) and avoids the inefficiencies of processing full pages with giant end-to-end models (e.g., large multimodal LLMs like Qwen-VL). In SRR, document parsing is abstracted into three fundamental questions – “Where is it?” (structure), “What is it?” (recognition), and “How is it organized?” (relation) – corresponding to layout analysis, content identification, and logical ordering. This focused decomposition balances accuracy and speed: it enables efficient, scalable processing without sacrificing precision. To train and evaluate this approach, we introduce the MonkeyDoc (the most comprehensive document parsing dataset to date), with 3.9 million instances spanning over ten document types in both Chinese and English. Experiments show that MonkeyOCR outperforms MinerU by an average of 5.1%, with particularly notable improvements on challenging content such as formulas (+15.0%) and tables (+8.6%). Remarkably, our 3B-parameter model surpasses much larger and top-performing models, including Qwen2.5-VL (72B) and Gemini 2.5 Pro, achieving state-of-the-art average performance on English document parsing tasks. In addition, MonkeyOCR processes multi-page documents significantly faster (0.84 pages per second compared to 0.65 for MinerU and 0.12 for Qwen2.5-VL-7B). The 3B model can be efficiently deployed for inference on a single NVIDIA 3090 GPU. Code and models will be released at https://github.com/Yuliang-Liu/MonkeyOCR.<watermark>arXiv:2506.05218v1 [cs.CV] 5 Jun 2025</watermark>Technical Report. ?Project lead.
● Page with a table and a figure
Markdown output:
<table>
<thead>
<tr>
<th><strong>Model Type</strong></th>
<th><strong>Models</strong></th>
<th><strong>Book</strong></th>
<th><strong>Slides</strong></th>
<th><strong>Financial Report</strong></th>
<th><strong>Textbook</strong></th>
<th><strong>Exam Paper</strong></th>
<th><strong>Magazine</strong></th>
<th><strong>Academic Papers</strong></th>
<th><strong>Notes</strong></th>
<th><strong>Newspaper</strong></th>
<th><strong>Overall</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="3"><strong>Pipeline Tools</strong></td>
<td>MinerU <sup>[43]</sup></td>
<td>0.055</td>
<td>0.124</td>
<td>0.033</td>
<td>0.102</td>
<td>0.159</td>
<td><strong>0.072</strong></td>
<td>0.025</td>
<td>0.984</td>
<td>0.171</td>
<td>0.206</td>
</tr>
<tr>
<td>Marker <sup>[35]</sup></td>
<td>0.074</td>
<td>0.340</td>
<td>0.089</td>
<td>0.319</td>
<td>0.452</td>
<td>0.153</td>
<td>0.059</td>
<td>0.651</td>
<td>0.192</td>
<td>0.274</td>
</tr>
<tr>
<td>Mathpix <sup>[26]</sup></td>
<td>0.131</td>
<td>0.220</td>
<td>0.202</td>
<td>0.216</td>
<td>0.278</td>
<td>0.147</td>
<td>0.091</td>
<td>0.634</td>
<td>0.690</td>
<td>0.300</td>
</tr>
<tr>
<td rowspan="3"><strong>Expert VLMs</strong></td>
<td>GOT-OCR <sup>[45]</sup></td>
<td>0.111</td>
<td>0.222</td>
<td>0.067</td>
<td>0.132</td>
<td>0.204</td>
<td>0.198</td>
<td>0.179</td>
<td>0.388</td>
<td>0.771</td>
<td>0.267</td>
</tr>
<tr>
<td>Nougat <sup>[3]</sup></td>
<td>0.734</td>
<td>0.958</td>
<td>1.000</td>
<td>0.820</td>
<td>0.930</td>
<td>0.830</td>
<td>0.214</td>
<td>0.991</td>
<td>0.871</td>
<td>0.806</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td rowspan="3"><strong>General VLMs</strong></td>
<td>GPT4o <sup>[32]</sup></td>
<td>0.157</td>
<td>0.163</td>
<td>0.348</td>
<td>0.187</td>
<td>0.281</td>
<td>0.173</td>
<td>0.146</td>
<td>0.607</td>
<td>0.751</td>
<td>0.316</td>
</tr>
<tr>
<td>Qwen2.5-VL-7B <sup>[1]</sup></td>
<td>0.148</td>
<td><strong>0.053</strong></td>
<td>0.111</td>
<td>0.137</td>
<td>0.189</td>
<td>0.117</td>
<td>0.134</td>
<td>0.204</td>
<td>0.706</td>
<td>0.205</td>
</tr>
<tr>
<td>InternVL3-8B <sup>[5]</sup></td>
<td>0.163</td>
<td><strong>0.056</strong></td>
<td>0.107</td>
<td>0.109</td>
<td><strong>0.129</strong></td>
<td>0.100</td>
<td>0.159</td>
<td><strong>0.150</strong></td>
<td>0.681</td>
<td>0.188</td>
</tr>
<tr>
<td rowspan="2"><strong>Mix</strong></td>
<td><strong>MonkeyOCR-3B</strong></td>
<td><strong>0.046</strong></td>
<td>0.120</td>
<td><strong>0.024</strong></td>
<td><strong>0.100</strong></td>
<td><strong>0.129</strong></td>
<td><strong>0.086</strong></td>
<td><strong>0.024</strong></td>
<td>0.643</td>
<td><strong>0.131</strong></td>
<td><strong>0.155</strong></td>
</tr>
<tr>
<td><strong>MonkeyOCR-3B*</strong></td>
<td>0.054</td>
<td>0.203</td>
<td>0.038</td>
<td>0.112</td>
<td>0.138</td>
<td>0.032</td>
<td><strong>0.194</strong></td>
<td>0.136</td>
<td><strong>0.120</strong></td>
<td></td>
</tr>
</tbody>
</table>Table 3: The end-to-end text recognition performance on OmniDocBench across 9 PDF page types.
* represents the use of the layout model trained by us with improved capability for Chinese layout detection.Specifically, it attained the highest end-to-end recognition accuracy in six categories. The 3B model outperformed InternVL3-8B <sup>[5]</sup> by 5% and surpassed MinerU <sup>[43]</sup> by 3.3% in overall accuracy. Notably, on the newspaper category, MonkeyOCR outperformed the previous state-of-the-art MinerU by 4%, demonstrating its strong capability in parsing dense and complex layouts. These results highlight MonkeyOCR’s superior generalization ability and robustness across various document types. Moreover, benefiting from enhanced Chinese language capabilities, MonkeyOCR* outperforms the original version by 44.9% on the notes category, achieving state-of-the-art overall performance.5.3 Implement DetailsDuring the training process, we utilize the AdamW optimizer with a learning rate of 2e-5 and a cosine learning rate schedule. We employ a batch size of 64. Our 3B model was trained for 53 hours on 32 A800 GPUs. By integrating with LMDeploy <sup>[7]</sup>, our model can successfully run on RTX 3090 GPUs.<img>Bar chart comparing performance of MonkeyOCR-3B, Gemini2.0-flash, Gemini2.5-Pro, Qwen2-VL-72B, Qwen2.5-VL-72B, and InternVL2-76B across different document parsing tasks.</img>
**Figure 6:** **End-to-end evaluation on OmniDocBench.** Performance comparison of MonkeyOCR with closed-source and extra-large open-source VLMs across different document parsing tasks.6 DiscussionAs is well-established, increasing model scale generally leads to improved performance. To further explore the potential of MonkeyOCR, we conducted comparative evaluations against both larger open-source models and leading closed-source commercial solutions on OmniDocBench. As illustrated in Figure 6, MonkeyOCR achieves the highest overall performance on English documents, outperforming Qwen2.5-VL-72B by 7.4% and surpassing the current state-of-the-art closed-source model, Gemini 2.5-Pro, by 0.8%. However, Gemini 2.5-Pro demonstrates slightly better performance on Chinese documents, indicating there is still some margin for improvement in MonkeyOCR’s Chinese document parsing capabilities.
It even gets some of the bold formatting right.
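Because the table comes back as plain HTML, it is easy to load into structured form downstream. A minimal sketch, assuming pandas with an HTML parser (lxml or html5lib) installed; nanonets_output is a placeholder for the Markdown string returned above:

```python
# Sketch: turn the <table>...</table> blocks from the model output into DataFrames.
# Assumes pandas plus an HTML parser (lxml or html5lib); `nanonets_output` stands in
# for the Markdown/HTML string produced by the model.
import io
import re

import pandas as pd

def extract_tables(nanonets_output: str) -> list[pd.DataFrame]:
    """Pull every <table>...</table> block out of the output and parse it."""
    tables = re.findall(r"<table>.*?</table>", nanonets_output, flags=re.DOTALL)
    frames = []
    for html in tables:
        # read_html returns a list of DataFrames, one per <table> in the fragment.
        frames.extend(pd.read_html(io.StringIO(html)))
    return frames

# Example usage: frames = extract_tables(markdown_text); frames[0].head()
```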
● LaTeX formulas
Markdown output:
selecting high-scoring pages for inclusion. This results in a curated set of 951,000 high-quality samples.**Manual Annotation for Chinese Documents.** To address the limited availability of Chinese region-level reading order annotations, we manually annotate a diverse set of Chinese documents, including research reports, academic papers, user manuals, books, test papers, slides, official documents, newspapers, journals, and contracts. This effort produces an additional 154,000 high-quality samples, substantially enhancing the representation of Chinese document scenarios.**Expert Model-Based Auto-Annotation.** For datasets that provide only region-level bounding boxes without reading order information, we leverage expert models to generate region-level reading order annotations automatically. Specifically, we utilize PPOCR [17] for line-wise text recognition within each region, obtain text line positions, and then apply LayoutReader [44] to predict the reading order of these lines. The region-level order is determined by aggregating the predicted order of all text lines within each region. Through this approach, we generate 78,000 additional region-level annotations, further enriching the diversity and coverage of our dataset.## 4 MonkeyOCR<img>A diagram illustrating the overall architecture of MonkeyOCR. It shows a pipeline with four main stages: Diverse PDF Types, Structure, Recognition, and Relation. Each stage is represented by a box with an arrow pointing to the next stage. The Diverse PDF Types stage shows various types of documents like textbooks, exam papers, academic papers, books, newspapers, slides, notes, financial reports, and magazines. The Structure stage involves cropping the input image, followed by recognition and relation prediction. The Recognition stage extracts structured information from each region in parallel. The Relation stage determines the logical reading order of the detected elements. Finally, the output is serialized as HTML, JSON, or Markdown.</img>**Figure 5: The overall architecture of MonkeyOCR.** The system adopts a Structure-Recognition-Relation framework, consisting of structure detection, which locates and classifies semantic regions; block-level content recognition, which extracts structured information from each region in parallel; and relation prediction, which determines the logical reading order of the detected elements.The proposed method, **MonkeyOCR**, addresses the fundamental limitations of both pipeline-based and end-to-end document parsing approaches by introducing a modular yet globally optimized Structure-Recognition-Relation (SRR) framework. As illustrated in Figure 5, we decompose the document parsing process into three relatively independent but tightly integrated stages: *structure detection*, *block-level content recognition*, and *relation prediction*. This design aims to mitigate the cumulative error typically observed in pipeline toolchains, while also improving inference efficiency by reducing the context length compared to monolithic end-to-end models.In the first stage, a YOLO-based [49] document layout detector processes the input image $I \in \mathbb{R}^{H \times W \times 3}$, producing a set of bounding boxes $B = \{b_1, b_2, \ldots, b_n\}$ and their corresponding element types $T = \{t_1, t_2, \ldots, t_n\}$. 
Each bounding box $b_i = (x_{1i}, y_{1i}, x_{2i}, y_{2i})$ represents the spatial coordinates of the $i$-th element, and the element type $t_i \in \{\text{text}, \text{table}, \text{formula}, \text{figure}, \ldots\}$ specifies the category of the detected element.For the second stage, we perform block-level content recognition in parallel. Each detected region $b_i$ is cropped and, together with a type-specific prompt $p_{t_i}$, is fed into our LMM for type-aware content extraction:$$C = \text{LMM}( \{I^1_{\text{crop}}, I^2_{\text{crop}}, \ldots, I^n_{\text{crop}}\}, \{p_{t_1}, p_{t_2}, \ldots, p_{t_n}\}),$$
Pages containing LaTeX are handled just as well: the formulas are preserved, fully correct, in LaTeX syntax.
● With a watermark
Markdown output:
selecting high-scoring pages for inclusion. This results in a curated set of 951,000 high-quality samples.**Manual Annotation for Chinese Documents.** To address the limited availability of Chinese region-level reading order annotations, we manually annotate a diverse set of Chinese documents, including research reports, academic papers, user manuals, books, test papers, slides, official documents, newspapers, journals, and contracts. This effort produces an additional 154,000 high-quality samples, substantially enhancing the representation of Chinese document scenarios.**Expert Model-Based Auto-Annotation.** For datasets that provide only region-level bounding boxes without reading order information, we leverage expert models to generate region-level reading order annotations automatically. Specifically, we utilize PPOCR [17] for line-wise text recognition within each region, obtain text line positions, and then apply LayoutReader [44] to predict the reading order of these lines. The region-level order is determined by aggregating the predicted order of all text lines within each region. Through this approach, we generate 78,000 additional region-level annotations, further enriching the diversity and coverage of our dataset.## 4 MonkeyOCR<img>A diagram illustrating the overall architecture of MonkeyOCR. It shows the process of converting diverse PDF types into structured data. The steps include structure detection, block-level content recognition, and relation prediction.</img>**Figure 5: The overall architecture of MonkeyOCR.** The system adopts a Structure-Recognition-Relation framework, consisting of structure detection, which locates and classifies semantic regions; block-level content recognition, which extracts structured information from each region in parallel; and relation prediction, which determines the logical reading order of the detected elements.The proposed method, **MonkeyOCR**, addresses the fundamental limitations of both pipeline-based and end-to-end document parsing approaches by introducing a modular yet globally optimized Structure-Recognition-Relation (SRR) framework. As illustrated in Figure 5, we decompose the document parsing process into three relatively independent but tightly integrated stages: *structure detection*, *block-level content recognition*, and *relation prediction*. This design aims to mitigate the cumulative error typically observed in pipeline toolchains, while also improving inference efficiency by reducing the context length compared to monolithic end-to-end models.In the first stage, a YOLO-based [49] document layout detector processes the input image $I \in \mathbb{R}^{H \times W \times 3}$, producing a set of bounding boxes $B = \{b_1, b_2, \ldots, b_n\}$ and their corresponding element types $T = \{t_1, t_2, \ldots, t_n\}$. Each bounding box $b_i = (x_{1i}, y_{1i}, x_{2i}, y_{2i})$ represents the spatial coordinates of the $i$-th element, and the element type $t_i \in \{\text{text}, \text{table}, \text{formula}, \text{figure}, \ldots\}$ specifies the category of the detected element.For the second stage, we perform block-level content recognition in parallel. Each detected region $b_i$ is cropped and, together with a type-specific prompt $p_{ti}$, is fed into our LMM for type-aware content extraction:$$C = \text{LMM}(\{I^1_\text{crop}, I^2_\text{crop}, \ldots, I^n_\text{crop}\}, \{p_{t_1}, p_{t_2}, \ldots, p_{t_n}\}),$$
The watermark is removed entirely, while the body text, images, and formulas are all rendered correctly.
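Since watermarks and image descriptions are emitted inside explicit tags (the first example wraps the arXiv watermark in <watermark>…</watermark>, and figure descriptions sit inside <img>…</img>), a couple of regular expressions are enough to separate them from the body text. A hedged sketch; the helper name is mine:

```python
# Sketch: separate tagged elements (<watermark>, <img>) from the body Markdown.
# The tag names follow the example outputs above; adjust if the model emits others.
import re

def split_tagged_output(markdown: str) -> dict:
    watermarks = re.findall(r"<watermark>(.*?)</watermark>", markdown, flags=re.DOTALL)
    image_descriptions = re.findall(r"<img>(.*?)</img>", markdown, flags=re.DOTALL)
    # Remove the tagged spans so the remaining text is clean body Markdown.
    body = re.sub(r"<(watermark|img)>.*?</\1>", "", markdown, flags=re.DOTALL)
    return {
        "body": body.strip(),
        "watermarks": watermarks,
        "image_descriptions": image_descriptions,
    }
```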
3. Summary
Traditional pipeline approaches can only detect that an image is present; they cannot process its content. Nanonets-OCR, by contrast, does not just see the text in an image, it also extracts concrete semantic information from it, enriching the document content.
In advanced RAG scenarios, the VLM's multimodal capability can be used to summarize images; running vector retrieval over these image-level semantic descriptions at recall time then surfaces the relevant images, improving the trustworthiness of the RAG system.
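As a concrete sketch of that idea, the <img> descriptions could be embedded and indexed next to the page they came from; the embedding model and the split_tagged_output helper (from the sketch above) are illustrative choices, not part of Nanonets-OCR:

```python
# Sketch: index the model's image descriptions for RAG recall.
# Assumes sentence-transformers is installed; the embedding model name and the
# split_tagged_output helper (from the earlier sketch) are illustrative choices.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def build_image_index(pages: dict[str, str]):
    """pages maps a page id to the Markdown returned by Nanonets-OCR for that page."""
    entries = []
    for page_id, markdown in pages.items():
        for desc in split_tagged_output(markdown)["image_descriptions"]:
            entries.append({"page": page_id, "description": desc.strip()})
    vectors = encoder.encode(
        [e["description"] for e in entries], normalize_embeddings=True
    )
    return entries, np.asarray(vectors)

def search_images(query: str, entries, vectors, top_k: int = 3):
    q = encoder.encode([query], normalize_embeddings=True)[0]
    scores = vectors @ q  # cosine similarity, since all vectors are normalized
    best = np.argsort(-scores)[:top_k]
    return [(entries[i]["page"], float(scores[i]), entries[i]["description"]) for i in best]
```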