MaxKB + MinerU: Parsing PDF Documents via API and Storing Them in a Knowledge Base

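MinerU is an open-source, high-quality data extraction tool that converts PDF documents into Markdown and JSON. On June 13, 2025, MinerU released v2.0, a complete overhaul of the architecture and features of v1.0. Besides streamlining the code structure and the way users interact with it, v2.0 integrates a small-parameter, high-performance multimodal document parsing model that enables fast, accurate, end-to-end document understanding. In our tests, the new version parsed complex charts and figures noticeably better than its predecessor and met more than 90% of complex document parsing needs.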

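Notably, MinerU's PDF parsing capability pairs well with the open-source MaxKB project. With the "MinerU + MaxKB" combination, users get high-quality document parsing and can also significantly improve the performance of a knowledge-base Q&A system. To ease integration, the MinerU project now offers an API service (https://mineru.net/apiManage). The rest of this article describes how to connect MaxKB to the MinerU online API.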

I. Implementation Approach

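When the user provides a file address, the system assigns it to the file_url variable and passes it as a parameter to the MinerU file parsing service. Once the parsing task has been created, MinerU returns a task ID (task_id). The system passes this ID to MinerU's query interface and, as soon as the task is reported complete, retrieves the download link for the result file (full_zip_url). The system then downloads the result file and saves it to the /opt/maxkb/download directory inside the MaxKB container. Finally, it uploads the file, applies intelligent segmentation, and stores the content in the knowledge base.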
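The data flow described above can be sketched as a single function that chains the four steps. This is only an illustration: in MaxKB each step runs as its own workflow node, and run_pipeline together with its stubbed arguments is a hypothetical name introduced here, not part of MaxKB or MinerU.

```python
from typing import Callable, Optional


def run_pipeline(
    file_url: str,
    create_task: Callable[[str], str],
    query_by_id: Callable[[str], str],
    download_file: Callable[[str], Optional[str]],
    upload: Callable[[str], object],
):
    """Chain the four steps: create task -> poll result -> download -> upload."""
    task_id = create_task(file_url)          # step 1: MinerU returns a task_id
    full_zip_url = query_by_id(task_id)      # step 2: poll until the ZIP link exists
    zip_path = download_file(full_zip_url)   # step 3: save the ZIP locally
    if zip_path is None:
        raise RuntimeError("download failed; nothing to upload")
    return upload(zip_path)                  # step 4: split and store the content


if __name__ == "__main__":
    # Stub implementations, used only to show how values pass between steps.
    result = run_pipeline(
        "https://example.com/demo.pdf",
        create_task=lambda url: "task-123",
        query_by_id=lambda task_id: f"https://example.com/{task_id}.zip",
        download_file=lambda url: "/opt/maxkb/download/" + url.rsplit("/", 1)[-1],
        upload=lambda path: f"uploaded {path}",
    )
    print(result)
```

Passing the steps in as callables keeps the sketch runnable without network access; the real functions can be substituted one by one.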

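II. Creating the MaxKB Functions

We need to create four core functions in the MaxKB function library. Their purposes are:

1. Call MinerU to parse a single file;

2. Fetch the task result from MinerU;

3. Download a file to the server from a URL;

4. Upload the parsed ZIP file to the knowledge base.

The code for each is explained below:

■ MinerU single-file parsing function: calls MinerU's single-file parsing service, creates a parsing task from the online address of a PDF document, and returns the corresponding task_id;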

import requests


def create_task(file_url):
    url = 'https://mineru.net/api/v4/extract/task'
    token = 'your MinerU API token'
    header = {
        'Content-Type': 'application/json',
        'Authorization': f'Bearer {token}'
    }
    data = {
        'url': file_url,
        'is_ocr': True,           # enable OCR; default is false
        'enable_formula': True,   # enable formula recognition; default is true
        'enable_table': True,     # enable table recognition; default is true
        'language': "ch",         # document language; default is ch, may be set to auto
        'model_version': "v2",    # MinerU model version: v1 or v2; default is v1
    }
    res = requests.post(url, headers=header, json=data, timeout=5)
    res_data = res.json()
    task_id_data = res_data["data"]["task_id"]
    return task_id_data
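The function above indexes straight into res_data["data"]["task_id"], which raises an opaque KeyError or TypeError if the request is rejected (for example, because of an invalid token). A minimal sketch of a more defensive lookup follows; extract_task_id is a hypothetical helper, and the only thing it assumes about the response is the data.task_id nesting that the original code already relies on.

```python
def extract_task_id(res_data):
    """Return data.task_id from a parsed JSON response, or raise a clear error."""
    data = res_data.get("data") if isinstance(res_data, dict) else None
    task_id = data.get("task_id") if isinstance(data, dict) else None
    if not task_id:
        # Include the whole payload so the server's own message stays visible.
        raise ValueError(f"no task_id in response: {res_data!r}")
    return task_id


if __name__ == "__main__":
    print(extract_task_id({"data": {"task_id": "abc"}}))
```

With this helper, the last two lines of create_task could be replaced by a single return extract_task_id(res.json()).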

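■ MinerU task result function: queries the task status by task_id and, once parsing succeeds, returns the download address of the ZIP file containing the parsed output;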

import time
import requests


def querybyid(task_id, max_retries=100, retry_interval=5):
    url = f'https://mineru.net/api/v4/extract/task/{task_id}'
    token = 'your MinerU API token'
    header = {
        'Content-Type': 'application/json',
        'Authorization': f'Bearer {token}'
    }
    retries = 0
    while retries < max_retries:
        try:
            res = requests.get(url, headers=header, timeout=5)
            res.raise_for_status()  # raise on an HTTP error status
            data = res.json()
            if "data" in data and "full_zip_url" in data["data"] and data["data"]["full_zip_url"]:
                return data["data"]["full_zip_url"]
            else:
                print(f"full_zip_url is empty; waiting for the task to finish. Retried {retries + 1} of {max_retries} times.")
                time.sleep(retry_interval)
                retries += 1
        except requests.exceptions.RequestException as e:
            print(f"Request failed: {e}. Retrying...")
            time.sleep(retry_interval)
            retries += 1
    raise Exception(f"No valid full_zip_url was returned after {max_retries} retries.")
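The loop above keeps polling until the retries run out, even if the task has already failed on the MinerU side. The sketch below shows one way to tell the cases apart. It assumes the query response carries a state field (with a value such as "failed") and an err_msg field next to full_zip_url; those two field names are assumptions that should be checked against the MinerU API documentation, and classify_task is a hypothetical helper.

```python
def classify_task(payload):
    """Return 'done', 'failed', or 'waiting' for a parsed query response."""
    data = payload.get("data") or {}
    if data.get("full_zip_url"):
        return "done"
    # ASSUMPTION: a failed task reports state == "failed" (verify in the API docs).
    if data.get("state") == "failed":
        return "failed"
    return "waiting"


def failure_reason(payload):
    """Return the error message if present (ASSUMED field name: err_msg)."""
    data = payload.get("data") or {}
    return data.get("err_msg") or "unknown error"


if __name__ == "__main__":
    print(classify_task({"data": {"state": "running", "full_zip_url": ""}}))
```

Inside the polling loop, a "failed" result could raise immediately with failure_reason(data) instead of sleeping through the remaining retries.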

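■ File download function: saves the file at the given ZIP download link to the /opt/maxkb/download directory inside the container. Note that MaxKB runs as the sandbox user by default, so make sure that user has read and write permission on /opt/maxkb/download;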

import os
import requests
from urllib.parse import urlparse


def download_file(download_url, save_dir='/opt/maxkb/download'):
    os.makedirs(save_dir, exist_ok=True)
    # Derive the file name from the URL path
    parsed_url = urlparse(download_url)
    filename = os.path.basename(parsed_url.path)
    # Destination path; the default user needs read/write permission on save_dir
    save_path = os.path.join(save_dir, filename)
    # Download the file
    try:
        response = requests.get(download_url, stream=True)
        response.raise_for_status()  # raise on an HTTP error status
        total_size = int(response.headers.get('content-length', 0))
        block_size = 1024  # 1 KB
        progress = 0
        print(f"Starting download of {filename} to {save_dir}")
        with open(save_path, 'wb') as f:
            for data in response.iter_content(block_size):
                f.write(data)
                progress += len(data)
                # Print progress only when the server reported a size,
                # which avoids a division by zero when content-length is missing
                if total_size:
                    print(f"Download progress: {progress / total_size * 100:.2f}%", end='\r')
        print(f"\nDownload complete: {save_path}")
        return save_path
    except requests.exceptions.RequestException as e:
        print(f"Download failed: {e}")
        return None
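Because the download only works when the runtime user can write to the target directory, it can help to verify that up front and fail with a clear message. The sketch below does this by creating a throwaway file in the directory; ensure_writable_dir is a hypothetical helper and is not part of MaxKB.

```python
import os
import tempfile


def ensure_writable_dir(path):
    """Create path if needed and confirm the current user can write to it."""
    os.makedirs(path, exist_ok=True)
    try:
        # Creating (and automatically deleting) a temp file is a direct write test.
        with tempfile.NamedTemporaryFile(dir=path):
            pass
    except OSError as exc:
        raise PermissionError(f"{path} is not writable by the current user") from exc
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as base:
        print(ensure_writable_dir(os.path.join(base, "download")))
```

Calling it at the top of the download function would surface a permission problem before any network traffic happens.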

■ ZIP upload to the knowledge base: uploads the parsed ZIP file on the server to the knowledge base through the MaxKB API.

import json
import logging
import requests

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')


def initialize(file_path):
    config = {
        # MaxKB API key (replace with your own)
        'authorization_apikey': 'user-ac86ec515de17969f2f8a9c8ab21e52f',
        # API endpoint for document splitting
        'split_url': 'http://10.1.11.58:8080/api/dataset/document/split',
        # API endpoint of the target knowledge base
        'upload_url': 'http://10.1.11.58:8080/api/dataset/3d1d5d4e-5576-11f0-bc5c-0242ac120003/document/_bach',
        'file_path': rf'{file_path}',
        'file_name': 'Document segments uploaded from the function library'
    }
    return config


def upload_file(config):
    headers = {
        'accept': 'application/json',
        'AUTHORIZATION': f'{config["authorization_apikey"]}'
    }
    try:
        # Use a context manager so the file handle is always closed
        with open(config["file_path"], 'rb') as f:
            files = {'file': f}
            response = requests.post(config["split_url"], headers=headers, files=files)
        response.raise_for_status()
        response_data = response.json()
        # Map each segment title to its content
        map_content = {}
        for item in response_data.get("data", []):
            for content_item in item.get("content", []):
                map_content[content_item.get("title", "")] = content_item.get("content", "")
        return map_content
    except requests.exceptions.RequestException as e:
        logging.error(f"Failed to upload the file for splitting: {e}")
        return {}
    except Exception as e:
        logging.error(f"Error while processing the file content: {e}")
        return {}


def send_post_request(config, map_content):
    headers = {
        "Content-Type": "application/json",
        "Authorization": f'{config["authorization_apikey"]}'
    }
    paragraphs = [{"title": key, "content": value} for key, value in map_content.items()]
    document_wrapper = {
        "name": config["file_name"],
        "paragraphs": paragraphs
    }
    json_body = json.dumps([document_wrapper])
    try:
        response = requests.post(config["upload_url"], headers=headers, data=json_body)
        response.raise_for_status()
        logging.info(f"Upload response: {response.text}")
        return True
    except requests.exceptions.RequestException as e:
        logging.error(f"Failed to upload the file: {e}")
        return False


def main(file_path):
    config = initialize(file_path)
    map_content = upload_file(config)
    if not map_content:
        logging.error("Splitting failed or returned no content; aborting")
        return False
    if not send_post_request(config, map_content):
        logging.error("File upload failed; aborting")
        return False
    return "File uploaded successfully and saved to the knowledge base"
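One detail in the code above is worth noting: segments are collected in a dictionary keyed by title, so two segments that share a title (including several untitled ones) overwrite each other. A minimal sketch of an alternative that keeps every segment in order is shown below. build_batch_body is a hypothetical helper, and the response shape it reads (data → content → title/content) is simply the one the original code already assumes.

```python
import json


def build_batch_body(split_response, doc_name):
    """Build the JSON body for the batch upload, keeping every segment."""
    paragraphs = []
    for item in split_response.get("data", []):
        for seg in item.get("content", []):
            # Append instead of keying by title, so duplicate titles survive.
            paragraphs.append({
                "title": seg.get("title", ""),
                "content": seg.get("content", ""),
            })
    return json.dumps([{"name": doc_name, "paragraphs": paragraphs}],
                      ensure_ascii=False)


if __name__ == "__main__":
    sample = {"data": [{"content": [
        {"title": "", "content": "first segment"},
        {"title": "", "content": "second segment"},
    ]}]}
    print(build_batch_body(sample, "demo"))
```

The returned string can be sent as the request body in the same way json_body is sent in the upload step.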

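III. Creating an Application in MaxKB

Once the four functions are created, we can build an advanced application in MaxKB. After the link of the uploaded file is entered or extracted, add the nodes in the order given above: MinerU single-file parsing function node → MinerU task result function node → file download function node → file upload function node.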

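When the assistant reports that the file was uploaded successfully, go back to the knowledge base page, where the newly uploaded document appears in the target knowledge base.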

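In summary, MinerU v2.0 is an open-source, high-performance PDF parsing tool with strong multimodal processing capabilities. By linking MaxKB with MinerU through function calls, you can build a clear and efficient automated "file address → parse → download → upload" pipeline that connects raw documents to a structured knowledge base.

The "MinerU + MaxKB" combination not only significantly improves the accuracy and efficiency of document parsing but also substantially strengthens the knowledge-base Q&A system.

If you are interested in our project, you are welcome to download and try MaxKB!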
