Nanonets-OCR: A Fine-Tuned Qwen2.5-VL-3B Model with Stronger Document Parsing | With Hands-On Test Results

1. Nanonets-OCR

Overview

Nanonets-OCR goes beyond plain text extraction. It intelligently parses the complex structures inside a document image, such as formulas, tables, watermarks, signatures, charts, and checkboxes, and outputs cleanly formatted Markdown.

Core Capabilities

LaTeX formula recognition: mathematical formulas in the document are automatically converted to standard LaTeX

Intelligent image description: charts, QR codes, and other embedded images are recognized and given structured descriptions

Signature detection and isolation: signatures in a document can be precisely located and separated from the body text

Watermark extraction: watermarks in a document are reliably detected and extracted

Checkbox recognition: checkbox states are normalized into standard symbols for downstream processing

Complex table extraction: tables with nested structure are recognized and output as Markdown or HTML
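All of these behaviors are driven by prompting the same fine-tuned checkpoint. As a rough local-inference sketch (the repo id `nanonets/Nanonets-OCR-s`, the instruction text, and the generation settings below are my assumptions, not taken from this article; check the model card for the official usage), calling it looks like any other Qwen2.5-VL model:

```python
# Minimal sketch: run Nanonets-OCR-s on one page image with Hugging Face transformers.
# The model id and the instruction string are assumptions; consult the model card.
from transformers import AutoProcessor, AutoModelForImageTextToText
from PIL import Image

MODEL_ID = "nanonets/Nanonets-OCR-s"  # assumed Hugging Face repo id

model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

image = Image.open("page.png")  # any scanned document page
prompt = (
    "Extract the text of this document as Markdown. Convert equations to LaTeX, "
    "tables to HTML, describe images inside <img></img> tags, and wrap watermarks "
    "in <watermark></watermark> tags."  # hypothetical instruction, for illustration only
)

messages = [{
    "role": "user",
    "content": [{"type": "image", "image": image},
                {"type": "text", "text": prompt}],
}]
chat = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[chat], images=[image], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=4096, do_sample=False)
markdown = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(markdown)
```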

Training

The model was fine-tuned from Qwen2.5-VL-3B on roughly 250,000 pages of document images spanning research, finance, healthcare, legal, invoice, and receipt content, combining synthetic data with human annotation.
Caveats:

● The fine-tuning data contains no handwriting, so handwritten text is not yet supported

● Some hallucination risk remains, a limitation of the small model size

2. Hands-On Results

Try it online: https://huggingface.co/spaces/Souvik3333/Nanonets-ocr-s

● Paper title page (with watermark)
Paper title page with a watermark

Markdown output:

MonkeyOCR: Document Parsing with a Structure-Recognition-Relation Triplet Paradigm
Zhang Li1, Yuliang Liu1,?, Qiang Liu2, Zhiyin Ma1, Ziyang Zhang1, Shuo Zhang1, Zidun Guo1, Jiarui Zhang2, Xinyu Wang1, Xiang Bai11Huazhong University of Science and Technology, 2Kingsoft Office<img>
A bar chart comparing performance metrics across different datasets and models. The x-axis shows different document types (e.g., Formula (EN), Formula (ZH), Table (EN), Table (ZH), Exam paper, Academic Papers, Newspaper, Overall, Infer Speed) and their corresponding values. The y-axis shows the performance metric (e.g., accuracy, speed). The bars represent different models and their corresponding values.
</img>Figure 1: Performance comparison of MonkeyOCR and other SOTA models on OmniDocBench [33]. “Overall” represents the comprehensive evaluation across nine document types in OmniDocBench.Abstract
We introduce MonkeyOCR, a vision-language model for document parsing that advances the state of the art by leveraging a Structure-Recognition-Relation (SRR) triplet paradigm. This design simplifies what would otherwise be a complex multi-tool pipeline (as in MinerU’s modular approach) and avoids the inefficiencies of processing full pages with giant end-to-end models (e.g., large multimodal LLMs like Qwen-VL). In SRR, document parsing is abstracted into three fundamental questions – “Where is it?” (structure), “What is it?” (recognition), and “How is it organized?” (relation) – corresponding to layout analysis, content identification, and logical ordering. This focused decomposition balances accuracy and speed: it enables efficient, scalable processing without sacrificing precision. To train and evaluate this approach, we introduce the MonkeyDoc (the most comprehensive document parsing dataset to date), with 3.9 million instances spanning over ten document types in both Chinese and English. Experiments show that MonkeyOCR outperforms MinerU by an average of 5.1%, with particularly notable improvements on challenging content such as formulas (+15.0%) and tables (+8.6%). Remarkably, our 3B-parameter model surpasses much larger and top-performing models, including Qwen2.5-VL (72B) and Gemini 2.5 Pro, achieving state-of-the-art average performance on English document parsing tasks. In addition, MonkeyOCR processes multi-page documents significantly faster (0.84 pages per second compared to 0.65 for MinerU and 0.12 for Qwen2.5-VL-7B). The 3B model can be efficiently deployed for inference on a single NVIDIA 3090 GPU. Code and models will be released at https://github.com/Yuliang-Liu/MonkeyOCR.<watermark>arXiv:2506.05218v1 [cs.CV] 5 Jun 2025</watermark>Technical Report. ?Project lead.
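Note how the output interleaves semantic tags such as `<img>…</img>` and `<watermark>…</watermark>` with the Markdown body. Assuming that tag convention holds, a small post-processing helper (the file name and tag list here are mine, purely for illustration) can separate those blocks from the text:

```python
# Sketch: pull <img> descriptions and <watermark> strings out of Nanonets-OCR output.
# Assumes the output uses literal <tag>...</tag> pairs, as in the sample above.
import re

def extract_tagged_blocks(markdown: str, tag: str) -> list[str]:
    """Return the contents of every <tag>...</tag> block in the model output."""
    pattern = re.compile(rf"<{tag}>(.*?)</{tag}>", re.DOTALL)
    return [match.strip() for match in pattern.findall(markdown)]

with open("nanonets_output.md", encoding="utf-8") as f:  # hypothetical saved output
    output = f.read()

figure_descriptions = extract_tagged_blocks(output, "img")
watermarks = extract_tagged_blocks(output, "watermark")
print(figure_descriptions)
print(watermarks)  # e.g. ['arXiv:2506.05218v1 [cs.CV] 5 Jun 2025']
```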

● Page with a table and figures
Paper page with a table and chart

Markdown output:

<table>
<thead>
<tr>
<th><strong>Model Type</strong></th>
<th><strong>Models</strong></th>
<th><strong>Book</strong></th>
<th><strong>Slides</strong></th>
<th><strong>Financial Report</strong></th>
<th><strong>Textbook</strong></th>
<th><strong>Exam Paper</strong></th>
<th><strong>Magazine</strong></th>
<th><strong>Academic Papers</strong></th>
<th><strong>Notes</strong></th>
<th><strong>Newspaper</strong></th>
<th><strong>Overall</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="3"><strong>Pipeline Tools</strong></td>
<td>MinerU <sup>[43]</sup></td>
<td>0.055</td>
<td>0.124</td>
<td>0.033</td>
<td>0.102</td>
<td>0.159</td>
<td><strong>0.072</strong></td>
<td>0.025</td>
<td>0.984</td>
<td>0.171</td>
<td>0.206</td>
</tr>
<tr>
<td>Marker <sup>[35]</sup></td>
<td>0.074</td>
<td>0.340</td>
<td>0.089</td>
<td>0.319</td>
<td>0.452</td>
<td>0.153</td>
<td>0.059</td>
<td>0.651</td>
<td>0.192</td>
<td>0.274</td>
</tr>
<tr>
<td>Mathpix <sup>[26]</sup></td>
<td>0.131</td>
<td>0.220</td>
<td>0.202</td>
<td>0.216</td>
<td>0.278</td>
<td>0.147</td>
<td>0.091</td>
<td>0.634</td>
<td>0.690</td>
<td>0.300</td>
</tr>
<tr>
<td rowspan="3"><strong>Expert VLMs</strong></td>
<td>GOT-OCR <sup>[45]</sup></td>
<td>0.111</td>
<td>0.222</td>
<td>0.067</td>
<td>0.132</td>
<td>0.204</td>
<td>0.198</td>
<td>0.179</td>
<td>0.388</td>
<td>0.771</td>
<td>0.267</td>
</tr>
<tr>
<td>Nougat <sup>[3]</sup></td>
<td>0.734</td>
<td>0.958</td>
<td>1.000</td>
<td>0.820</td>
<td>0.930</td>
<td>0.830</td>
<td>0.214</td>
<td>0.991</td>
<td>0.871</td>
<td>0.806</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td rowspan="3"><strong>General VLMs</strong></td>
<td>GPT4o <sup>[32]</sup></td>
<td>0.157</td>
<td>0.163</td>
<td>0.348</td>
<td>0.187</td>
<td>0.281</td>
<td>0.173</td>
<td>0.146</td>
<td>0.607</td>
<td>0.751</td>
<td>0.316</td>
</tr>
<tr>
<td>Qwen2.5-VL-7B <sup>[1]</sup></td>
<td>0.148</td>
<td><strong>0.053</strong></td>
<td>0.111</td>
<td>0.137</td>
<td>0.189</td>
<td>0.117</td>
<td>0.134</td>
<td>0.204</td>
<td>0.706</td>
<td>0.205</td>
</tr>
<tr>
<td>InternVL3-8B <sup>[5]</sup></td>
<td>0.163</td>
<td><strong>0.056</strong></td>
<td>0.107</td>
<td>0.109</td>
<td><strong>0.129</strong></td>
<td>0.100</td>
<td>0.159</td>
<td><strong>0.150</strong></td>
<td>0.681</td>
<td>0.188</td>
</tr>
<tr>
<td rowspan="2"><strong>Mix</strong></td>
<td><strong>MonkeyOCR-3B</strong></td>
<td><strong>0.046</strong></td>
<td>0.120</td>
<td><strong>0.024</strong></td>
<td><strong>0.100</strong></td>
<td><strong>0.129</strong></td>
<td><strong>0.086</strong></td>
<td><strong>0.024</strong></td>
<td>0.643</td>
<td><strong>0.131</strong></td>
<td><strong>0.155</strong></td>
</tr>
<tr>
<td><strong>MonkeyOCR-3B*</strong></td>
<td>0.054</td>
<td>0.203</td>
<td>0.038</td>
<td>0.112</td>
<td>0.138</td>
<td>0.032</td>
<td><strong>0.194</strong></td>
<td>0.136</td>
<td><strong>0.120</strong></td>
<td></td>
</tr>
</tbody>
</table>Table 3: The end-to-end text recognition performance on OmniDocBench across 9 PDF page types.
* represents the use of the layout model trained by us with improved capability for Chinese layout detection.Specifically, it attained the highest end-to-end recognition accuracy in six categories. The 3B model outperformed InternVL3-8B <sup>[5]</sup> by 5% and surpassed MinerU <sup>[43]</sup> by 3.3% in overall accuracy. Notably, on the newspaper category, MonkeyOCR outperformed the previous state-of-the-art MinerU by 4%, demonstrating its strong capability in parsing dense and complex layouts. These results highlight MonkeyOCR’s superior generalization ability and robustness across various document types. Moreover, benefiting from enhanced Chinese language capabilities, MonkeyOCR* outperforms the original version by 44.9% on the notes category, achieving state-of-the-art overall performance.5.3 Implement DetailsDuring the training process, we utilize the AdamW optimizer with a learning rate of 2e-5 and a cosine learning rate schedule. We employ a batch size of 64. Our 3B model was trained for 53 hours on 32 A800 GPUs. By integrating with LMDeploy <sup>[7]</sup>, our model can successfully run on RTX 3090 GPUs.&lt;img&gt;Bar chart comparing performance of MonkeyOCR-3B, Gemini2.0-flash, Gemini2.5-Pro, Qwen2-VL-72B, Qwen2.5-VL-72B, and InternVL2-76B across different document parsing tasks.&lt;/img&gt;
**Figure 6:** **End-to-end evaluation on OmniDocBench.** Performance comparison of MonkeyOCR with closed-source and extra-large open-source VLMs across different document parsing tasks.6 DiscussionAs is well-established, increasing model scale generally leads to improved performance. To further explore the potential of MonkeyOCR, we conducted comparative evaluations against both larger open-source models and leading closed-source commercial solutions on OmniDocBench. As illustrated in Figure 6, MonkeyOCR achieves the highest overall performance on English documents, outperforming Qwen2.5-VL-72B by 7.4% and surpassing the current state-of-the-art closed-source model, Gemini 2.5-Pro, by 0.8%. However, Gemini 2.5-Pro demonstrates slightly better performance on Chinese documents, indicating there is still some margin for improvement in MonkeyOCR’s Chinese document parsing capabilities.

Even some of the bold formatting is recognized correctly.
Recognition result

● LaTeX formulas
Paper page with formulas

Markdown output:

selecting high-scoring pages for inclusion. This results in a curated set of 951,000 high-quality samples.**Manual Annotation for Chinese Documents.** To address the limited availability of Chinese region-level reading order annotations, we manually annotate a diverse set of Chinese documents, including research reports, academic papers, user manuals, books, test papers, slides, official documents, newspapers, journals, and contracts. This effort produces an additional 154,000 high-quality samples, substantially enhancing the representation of Chinese document scenarios.**Expert Model-Based Auto-Annotation.** For datasets that provide only region-level bounding boxes without reading order information, we leverage expert models to generate region-level reading order annotations automatically. Specifically, we utilize PPOCR [17] for line-wise text recognition within each region, obtain text line positions, and then apply LayoutReader [44] to predict the reading order of these lines. The region-level order is determined by aggregating the predicted order of all text lines within each region. Through this approach, we generate 78,000 additional region-level annotations, further enriching the diversity and coverage of our dataset.## 4 MonkeyOCR&lt;img&gt;A diagram illustrating the overall architecture of MonkeyOCR. It shows a pipeline with four main stages: Diverse PDF Types, Structure, Recognition, and Relation. Each stage is represented by a box with an arrow pointing to the next stage. The Diverse PDF Types stage shows various types of documents like textbooks, exam papers, academic papers, books, newspapers, slides, notes, financial reports, and magazines. The Structure stage involves cropping the input image, followed by recognition and relation prediction. The Recognition stage extracts structured information from each region in parallel. The Relation stage determines the logical reading order of the detected elements. Finally, the output is serialized as HTML, JSON, or Markdown.&lt;/img&gt;**Figure 5: The overall architecture of MonkeyOCR.** The system adopts a Structure-Recognition-Relation framework, consisting of structure detection, which locates and classifies semantic regions; block-level content recognition, which extracts structured information from each region in parallel; and relation prediction, which determines the logical reading order of the detected elements.The proposed method, **MonkeyOCR**, addresses the fundamental limitations of both pipeline-based and end-to-end document parsing approaches by introducing a modular yet globally optimized Structure-Recognition-Relation (SRR) framework. As illustrated in Figure 5, we decompose the document parsing process into three relatively independent but tightly integrated stages: *structure detection*, *block-level content recognition*, and *relation prediction*. This design aims to mitigate the cumulative error typically observed in pipeline toolchains, while also improving inference efficiency by reducing the context length compared to monolithic end-to-end models.In the first stage, a YOLO-based [49] document layout detector processes the input image $I \in \mathbb{R}^{H \times W \times 3}$, producing a set of bounding boxes $B = \{b_1, b_2, \ldots, b_n\}$ and their corresponding element types $T = \{t_1, t_2, \ldots, t_n\}$. 
Each bounding box $b_i = (x_{1i}, y_{1i}, x_{2i}, y_{2i})$ represents the spatial coordinates of the $i$-th element, and the element type $t_i \in \{\text{text}, \text{table}, \text{formula}, \text{figure}, \ldots\}$ specifies the category of the detected element.For the second stage, we perform block-level content recognition in parallel. Each detected region $b_i$ is cropped and, together with a type-specific prompt $p_{t_i}$, is fed into our LMM for type-aware content extraction:$$C = \text{LMM}( \{I^1_{\text{crop}}, I^2_{\text{crop}}, \ldots, I^n_{\text{crop}}\}, \{p_{t_1}, p_{t_2}, \ldots, p_{t_n}\}),$$

Pages containing LaTeX are handled just as well: the formulas are preserved in fully correct LaTeX syntax.

● Watermarked page
Paper page with multiple watermarks

Markdown output:

selecting high-scoring pages for inclusion. This results in a curated set of 951,000 high-quality samples.**Manual Annotation for Chinese Documents.** To address the limited availability of Chinese region-level reading order annotations, we manually annotate a diverse set of Chinese documents, including research reports, academic papers, user manuals, books, test papers, slides, official documents, newspapers, journals, and contracts. This effort produces an additional 154,000 high-quality samples, substantially enhancing the representation of Chinese document scenarios.**Expert Model-Based Auto-Annotation.** For datasets that provide only region-level bounding boxes without reading order information, we leverage expert models to generate region-level reading order annotations automatically. Specifically, we utilize PPOCR [17] for line-wise text recognition within each region, obtain text line positions, and then apply LayoutReader [44] to predict the reading order of these lines. The region-level order is determined by aggregating the predicted order of all text lines within each region. Through this approach, we generate 78,000 additional region-level annotations, further enriching the diversity and coverage of our dataset.## 4 MonkeyOCR&lt;img&gt;A diagram illustrating the overall architecture of MonkeyOCR. It shows the process of converting diverse PDF types into structured data. The steps include structure detection, block-level content recognition, and relation prediction.&lt;/img&gt;**Figure 5: The overall architecture of MonkeyOCR.** The system adopts a Structure-Recognition-Relation framework, consisting of structure detection, which locates and classifies semantic regions; block-level content recognition, which extracts structured information from each region in parallel; and relation prediction, which determines the logical reading order of the detected elements.The proposed method, **MonkeyOCR**, addresses the fundamental limitations of both pipeline-based and end-to-end document parsing approaches by introducing a modular yet globally optimized Structure-Recognition-Relation (SRR) framework. As illustrated in Figure 5, we decompose the document parsing process into three relatively independent but tightly integrated stages: *structure detection*, *block-level content recognition*, and *relation prediction*. This design aims to mitigate the cumulative error typically observed in pipeline toolchains, while also improving inference efficiency by reducing the context length compared to monolithic end-to-end models.In the first stage, a YOLO-based [49] document layout detector processes the input image $I \in \mathbb{R}^{H \times W \times 3}$, producing a set of bounding boxes $B = \{b_1, b_2, \ldots, b_n\}$ and their corresponding element types $T = \{t_1, t_2, \ldots, t_n\}$. Each bounding box $b_i = (x_{1i}, y_{1i}, x_{2i}, y_{2i})$ represents the spatial coordinates of the $i$-th element, and the element type $t_i \in \{\text{text}, \text{table}, \text{formula}, \text{figure}, \ldots\}$ specifies the category of the detected element.For the second stage, we perform block-level content recognition in parallel. Each detected region $b_i$ is cropped and, together with a type-specific prompt $p_{ti}$, is fed into our LMM for type-aware content extraction:$$C = \text{LMM}(\{I^1_\text{crop}, I^2_\text{crop}, \ldots, I^n_\text{crop}\}, \{p_{t_1}, p_{t_2}, \ldots, p_{t_n}\}),$$

The watermarks are removed entirely, while the body text, figures, and formulas all come through intact.

3. Summary

Traditional pipeline approaches can only detect that a figure exists; they cannot interpret its content. Nanonets-OCR, by contrast, does not merely see the text: it extracts concrete semantic information from embedded images, enriching the parsed document.

In advanced RAG scenarios, the VLM's multimodal ability can be used to summarize figures; at retrieval time, vector search over those figure summaries recalls the relevant images, improving the trustworthiness of the RAG system.
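As a concrete, purely illustrative sketch of that retrieval step, assuming the `<img>` descriptions have already been extracted and using sentence-transformers as a stand-in for whatever embedding model the pipeline actually uses:

```python
# Sketch: index VLM-generated figure descriptions for vector retrieval in a RAG pipeline.
# The embedding model and the sample summaries are placeholders, not from this article.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

# Figure descriptions emitted by Nanonets-OCR, kept alongside a pointer to the source page.
image_summaries = [
    {"doc": "monkeyocr.pdf", "page": 1,
     "text": "Bar chart comparing MonkeyOCR and other models on OmniDocBench."},
    {"doc": "monkeyocr.pdf", "page": 6,
     "text": "Diagram of the Structure-Recognition-Relation architecture of MonkeyOCR."},
]
vectors = encoder.encode([s["text"] for s in image_summaries], normalize_embeddings=True)

def retrieve(query: str, k: int = 3):
    """Return the k figure summaries most similar to the query (cosine similarity)."""
    q = encoder.encode([query], normalize_embeddings=True)[0]
    scores = vectors @ q  # dot product equals cosine here, since vectors are L2-normalized
    top = np.argsort(-scores)[:k]
    return [(image_summaries[i], float(scores[i])) for i in top]

for hit, score in retrieve("Which figure shows the overall MonkeyOCR architecture?"):
    print(f"{score:.3f}  {hit['doc']} p.{hit['page']}: {hit['text']}")
```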

