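When processing large volumes of documents, especially when building a knowledge base, analyzing documents, or training large language models (LLMs), converting files in various formats (PDF, Word, Excel, PPT, HTML, and so on) into a single Markdown format can significantly improve processing efficiency and compatibility. Microsoft's open-source project MarkItDown was built for exactly this purpose.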
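What is MarkItDown?
MarkItDown is a lightweight Python tool that converts many file formats to Markdown and is particularly suited to LLMs and related text-analysis pipelines. Compared with tools such as Pandoc, MarkItDown focuses more on preserving a document's structure and content, such as headings, lists, tables, and links, and its Markdown output is suitable both for people to read and for machines to process.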
Why choose MarkItDown?
1. Multi-format support
MarkItDown can convert many formats to Markdown, including:
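- PDF
- PowerPoint (PPTX)
- Word (DOCX)
- Excel (XLSX)
- Images (including EXIF metadata and OCR)
- Audio (including EXIF metadata and speech transcription)
- HTML
- Text-based formats (such as CSV, JSON, and XML)
- ZIP files (iterating over their contents)
- YouTube URLs
- EPUB
- And more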
This breadth of format support makes MarkItDown a handy tool for all kinds of documents.
For HTML to Markdown, the conversion is handled internally by markdownify.
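2. Optimized for LLMs
MarkItDown's Markdown output is tuned for use as LLM input and makes efficient use of the model's context window. MarkItDown also provides an MCP (Model Context Protocol) server for real-time integration with LLM applications, such as Claude Desktop.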
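3. A powerful plugin system
MarkItDown provides a plugin system so that users can extend it as needed. For example, one user has developed a plugin that processes DOCX files and extracts their images into a specified folder.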
How do you use MarkItDown?
Installing MarkItDown
Install with pip:
pip install markitdown
To install all optional dependencies, which enables the full range of supported formats:
pip install 'markitdown[all]'
Basic usage
Suppose you have a Word file, example.docx, that you want to convert to Markdown:
markitdown example.docx
The output is printed to the terminal. To save it to a file instead:
markitdown example.docx > example.md
Python code
from markitdown import MarkItDown

md = MarkItDown(enable_plugins=False) # Set to True to enable plugins
result = md.convert("example.docx")
print(result.text_content)
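To write the result to a file from Python rather than through shell redirection, a small helper along these lines could work. It is a minimal sketch: `save_markdown` and `convert_docx_example` are hypothetical names, and the converter is passed in as a callable so the helper relies only on the `text_content` attribute used above.

```python
from pathlib import Path
from typing import Any, Callable


def save_markdown(source: str, convert: Callable[[str], Any], out_dir: str = ".") -> Path:
    """Convert `source` and write the Markdown into `out_dir` as <stem>.md.

    `convert` is any callable returning an object with a `text_content`
    attribute, such as the `convert` method of a MarkItDown instance.
    """
    result = convert(source)
    out_path = Path(out_dir) / (Path(source).stem + ".md")
    out_path.write_text(result.text_content, encoding="utf-8")
    return out_path


def convert_docx_example() -> Path:
    """Hypothetical entry point; needs `pip install markitdown` and example.docx."""
    from markitdown import MarkItDown  # imported lazily so save_markdown has no hard dependency

    md = MarkItDown(enable_plugins=False)
    return save_markdown("example.docx", md.convert)
```

Passing the converter in keeps the file-writing logic independent of MarkItDown, so it can be exercised with a stub before the real library is installed.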
The example.md produced from example.docx:
# Sample Document

This document was created using accessibility techniques for headings, lists, image alternate text, tables, and columns. It should be completely accessible using assistive technologies such as screen readers.

## Headings

There are eight section headings in this document. At the beginning, "Sample Document" is a level 1 heading. The main section headings, such as "Headings" and "Lists" are level 2 headings. The Tables section contains two sub-headings, "Simple Table" and "Complex Table," which are both level 3 headings.

## Lists

The following outline of the sections of this document is an ordered (numbered) list with six items. The fifth item, "Tables," contains a nested unordered (bulleted) list with two items.

1. Headings
2. Lists
3. Links
4. Images
5. Tables

* Simple Tables
* Complex Tables

1. Columns

## Links

In web documents, links can point different locations on the page, different pages, or even downloadable documents, such as Word documents or PDFs:

[Top of this Page](#_top)
[Sample Document](http://www.dhs.state.il.us/page.aspx?item=67072)
[Sample Document (docx)](http://www.dhs.state.il.us/OneNetLibrary/27897/documents/Initiatives/IITAA/Sample-Document.docx)

## Images

Documents may contain images. For example, there is an image of the web accessibility symbol to the left of this paragraph. Its alternate text is "Web Access Symbol".

Alt text should communicate what an image means, not how it looks.

Some images, such as charts or graphs, require long descriptions, but not all document types allow that. In web pages, long descriptions may be provided in several ways: on the page below the image, via a link below the image, or via a link on the image.

## Tables

### Simple Tables

Simple tables have a uniform number of columns and rows, without any merged cells:

| **Screen Reader** | **Responses** | **Share** |
| --- | --- | --- |
| JAWS | 853 | 49% |
| NVDA | 238 | 14% |
| Window-Eyes | 214 | 12% |
| System Access | 181 | 10% |
| VoiceOver | 159 | 9% |

### Complex Tables

The following is a complex table, using merged cells as headers for sections within the table. This can't be made accessible in all types of documents:

| | | | | |
| --- | --- | --- | --- | --- |
| | **May 2012** | | **September 2010** | |
| **Screen Reader** | **Responses** | **Share** | **Responses** | **Share** |
| JAWS | 853 | 49% | 727 | 59% |
| NVDA | 238 | 14% | 105 | 9% |
| Window-Eyes | 214 | 12% | 138 | 11% |
| System Access | 181 | 10% | 58 | 5% |
| VoiceOver | 159 | 9% | 120 | 10% |

## Columns

This is an example of columns. With columns, the page is split into two or more horizontal sections. Unlike tables, in which you usually read across a row and then down to the next, in columns, you read down a column and then across to the next.

When columns are not created correctly, screen readers may run lines together, reading the first line of the first column, then the first line of the second column, then the second line of the first column, and so on. Obviously, that is not accessible.
Examples of converting various formats
1. PDF to Markdown
markitdown document.pdf > document.md
MarkItDown tries to preserve headings, lists, tables, and links.
2. PPTX to Markdown
markitdown slides.pptx > slides.md
The content of each slide is converted into Markdown headings and list structures.
3. Excel to Markdown
markitdown data.xlsx > data.md
Tables are preserved as Markdown tables, which makes them easy to read in notes or on GitHub.
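To make the target format concrete, the sketch below renders a list of rows as a pipe-delimited Markdown table of the kind shown in the sample output earlier. It illustrates the table format only and is not MarkItDown's implementation; `rows_to_markdown_table` is a hypothetical helper.

```python
from typing import Sequence


def rows_to_markdown_table(rows: Sequence[Sequence[object]]) -> str:
    """Render rows as a Markdown table, treating the first row as the header."""
    if not rows:
        return ""
    header, *body = rows
    width = len(header)

    def fmt(row: Sequence[object]) -> str:
        # Escape pipes so a cell cannot break the table, then pad short rows.
        cells = [str(c).replace("|", "\\|") for c in row] + [""] * (width - len(row))
        return "| " + " | ".join(cells[:width]) + " |"

    lines = [fmt(header), "| " + " | ".join(["---"] * width) + " |"]
    lines.extend(fmt(r) for r in body)
    return "\n".join(lines)


print(rows_to_markdown_table([["Screen Reader", "Responses"], ["JAWS", 853], ["NVDA", 238]]))
```

The separator row of `---` cells is what marks the first row as a header for renderers that support pipe tables, such as GitHub.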
4. Batch processing of ZIP files
If you have a ZIP archive containing multiple files:
markitdown archive.zip
MarkItDown automatically walks through all the files in the ZIP and generates Markdown output.
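The idea of walking an archive and converting each member can be sketched with the standard library alone. This is an illustration of the iteration pattern, not MarkItDown's actual code: `zip_to_markdown`, the per-file `convert` callable, and the `## File:` heading format are all assumptions made for the example.

```python
import io
import zipfile
from typing import Callable


def zip_to_markdown(zip_bytes: bytes, convert: Callable[[str, bytes], str]) -> str:
    """Walk every file in a ZIP archive and join the converted pieces.

    `convert(name, data)` is a hypothetical per-file converter that returns
    Markdown text for one archive member.
    """
    sections = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries carry no content of their own
            body = convert(info.filename, archive.read(info))
            sections.append(f"## File: {info.filename}\n\n{body}")
    return "\n\n".join(sections)
```

Giving each member its own heading keeps the combined output navigable when the archive mixes several document types.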
Advanced example: images and audio
MarkItDown can process image and audio files, attempting to extract EXIF metadata or perform speech transcription:
# Images
markitdown photo.jpg > photo.md
# Audio
markitdown audio.mp3 > audio.md
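Using MarkItDown in LLM workflows
MarkItDown is especially well suited to use with large language models (LLMs). You can:
- First convert all kinds of documents to Markdown;
- Then feed the Markdown to a model as input for question answering or summarization;
- Preserve structured content (headings, lists, tables) to help the LLM understand the material.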
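One way to take advantage of the preserved structure is to split the converted Markdown at its headings, so that each section can be sent to a model on its own and fit comfortably in the context window. The sketch below assumes ATX-style (`#`) headings like those in the sample output; `split_by_headings` and `build_summary_prompt` are hypothetical helpers, and the prompt wording is purely illustrative.

```python
import re
from typing import List, Tuple

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headings(markdown: str) -> List[Tuple[str, str]]:
    """Split Markdown into (heading, body) sections at ATX headings.

    Text before the first heading is returned under an empty heading.
    Headings inside fenced code blocks are not special-cased in this sketch.
    """
    sections: List[Tuple[str, List[str]]] = [("", [])]
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            sections.append((match.group(2).strip(), []))
        else:
            sections[-1][1].append(line)
    # Drop the leading untitled section when it holds only blank lines.
    return [(title, "\n".join(body).strip()) for title, body in sections if title or any(body)]


def build_summary_prompt(title: str, body: str) -> str:
    """Assemble a plain-text prompt for one section; the wording is illustrative."""
    return f"Summarize the section titled '{title}':\n\n{body}"
```

Each `(heading, body)` pair can then be passed to whichever model client the workflow uses, with the heading giving the model context that a raw text dump would lose.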
GitHub - microsoft/markitdown: Python tool for converting files and office documents to Markdown.