[RAG] Converting PPT to PPTX on Linux and Parsing Text Data from PPTX Files

Background

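Retrieval-augmented generation (RAG) is a leading-edge technique in AI that supplies reliable, up-to-date external knowledge to a model, which helps across a wide range of tasks. In the current wave of AI content generation, RAG's retrieval capability gives generative models additional knowledge and helps them produce higher-quality output. Although large language models (LLMs) have shown breakthrough language abilities, they remain limited by hallucination and by outdated internal knowledge. Retrieval-augmented LLMs address this by drawing on authoritative external knowledge bases rather than relying on internal knowledge alone, which improves generation quality.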

The Problem

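Parsing techniques for .pptx documents have existed for a long time, but files in the legacy .ppt format cannot be parsed the same way. I could not find any material on converting .ppt to .pptx on a Linux server, although the conversion is easy to do on Windows.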
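Before converting anything, it can help to confirm which container format a file really uses, since file extensions are sometimes wrong. The sketch below is only an illustration: the function names `classify_header` and `sniff_container` are hypothetical, and the check relies on the well-known facts that legacy Office files (.ppt, .doc, .xls) are OLE2 compound documents and that OOXML files (.pptx, .docx, .xlsx) are ZIP archives. It therefore tells the two containers apart, but cannot prove that a file is specifically a presentation.

```python
# Leading bytes of the two container formats.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # legacy .ppt/.doc/.xls
ZIP_MAGIC = b"PK\x03\x04"                        # .pptx/.docx/.xlsx (and any ZIP)


def classify_header(header: bytes) -> str:
    """Classify a file by its first bytes: 'ole2', 'zip', or 'unknown'."""
    if header.startswith(OLE2_MAGIC):
        return "ole2"
    if header.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"


def sniff_container(path: str) -> str:
    """Read the first 8 bytes of a file and classify its container format."""
    with open(path, "rb") as fh:
        return classify_header(fh.read(8))
```

A file that comes back as "ole2" is a candidate for conversion, while one that comes back as "zip" can probably be passed to the .pptx parser directly.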

Solution

Install system dependencies

apt-get install unoconv
apt-get install libreoffice
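After installing, a quick check that the LibreOffice launcher is actually on PATH can save some confusion later. This is a minimal sketch; the helper name `find_office_binary` is hypothetical, and the candidate names `libreoffice` and `soffice` are the launchers commonly shipped by distributions, which may differ on your system.

```python
import shutil


def find_office_binary(candidates=("libreoffice", "soffice")):
    """Return the full path of the first launcher found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None
```

If `find_office_binary()` returns None, the conversion step will fail because the launcher cannot be started, so it is worth checking before running it.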

Install Python package dependencies

pip install unoconv
pip install pyuno
pip install weaviate-client
pip install "unstructured[all-docs]==0.13.3"
pip install python-dotenv
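To verify that the packages landed in the interpreter you intend to use, one option is to look up their import names without importing them. The helper `missing_modules` below is a hypothetical name used for illustration. The import names in the usage note (`unstructured`, `weaviate`, `dotenv`) are those of unstructured, weaviate-client, and python-dotenv respectively.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found by the import system."""
    return [name for name in names if importlib.util.find_spec(name) is None]
```

For example, `missing_modules(["unstructured", "weaviate", "dotenv"])` should return an empty list once the installs above have succeeded.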

Code demo

import glob
import os
import subprocess

import weaviate
import weaviate.classes as wvc
from dotenv import load_dotenv
from unstructured.chunking.title import chunk_by_title
from unstructured.documents.elements import CompositeElement, Table
from unstructured.partition.pptx import partition_pptx
from weaviate.config import AdditionalConfig

load_dotenv()

# Point the environment at the LibreOffice installation
# (the install path may differ on your distribution).
os.environ['UNO_PATH'] = '/usr/lib/libreoffice'
os.environ['PATH'] += ':/usr/lib/libreoffice/program'

file_path = "/your/ppt_path/case_1.ppt"


def extract_text(file_name: str):
    # Split the .pptx file into document elements.
    elements = partition_pptx(
        filename=file_name,
        multipage_sections=True,
        infer_table_structure=True,
        include_page_breaks=False,
    )
    # Group the elements into chunks, starting a new chunk at each title.
    chunks = chunk_by_title(
        elements=elements,
        multipage_sections=True,
        combine_text_under_n_chars=0,
        new_after_n_chars=None,
        max_characters=4096,
    )

    text_list = []
    for chunk in chunks:
        if isinstance(chunk, CompositeElement):
            text = chunk.text
            text_list.append(text)
        elif isinstance(chunk, Table):
            # Attach a table (as HTML) to the preceding text chunk if there is one.
            if text_list:
                text_list[-1] = text_list[-1] + "\n" + chunk.metadata.text_as_html
            else:
                text_list.append(chunk.metadata.text_as_html)

    # Map each title to its list of body texts; chunks without a title go under "Untitled".
    result_dict = {"Untitled": []}
    for text in text_list:
        split_text = text.split("\n\n", 1)
        if len(split_text) == 2:
            title, text = split_text
            if title not in result_dict:
                result_dict[title] = []
            result_dict[title].append(text)
        else:
            result_dict["Untitled"].append(text)
    return result_dict


def split_chunks(text_list: list, source: str):
    # Flatten a list of {title: texts} dicts into question/answer records.
    chunks = []
    for text in text_list:
        for key, value in text.items():
            chunks.append({"question": key, "answer": value, "source": source})
    return chunks


def convert_ppt_to_pptx(ppt_file_path):
    # Define the command to run LibreOffice in headless mode.
    command = [
        'libreoffice',
        '--headless',
        '--convert-to', 'pptx',
        '--outdir', os.path.dirname(ppt_file_path) or '.',
        ppt_file_path,
    ]
    # Run the command.
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"Failed to convert '{ppt_file_path}' to PPTX.\nError: {result.stderr}")
    # Swap only the final extension; str.replace would also alter any
    # earlier '.ppt' that happens to appear in the path.
    return os.path.splitext(ppt_file_path)[0] + '.pptx'


pptx_file_path = convert_ppt_to_pptx(file_path)
print("convert ppt to pptx done")
contents = extract_text(pptx_file_path)
for k, v in contents.items():
    print(k, v)
    print("__" * 30)
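The demo converts a single file. A folder of .ppt files could be handled along the same lines, as in the sketch below. All four function names are hypothetical helpers introduced for illustration; the LibreOffice flags are the same ones used in the demo. It assumes the launcher is called `libreoffice` and that LibreOffice names the output after the input file's base name, so the result is checked on disk rather than trusted.

```python
import glob
import os
import subprocess


def pptx_output_path(ppt_file_path: str, out_dir: str) -> str:
    """Expected output path: same base name, .pptx extension, inside out_dir."""
    stem = os.path.splitext(os.path.basename(ppt_file_path))[0]
    return os.path.join(out_dir, stem + ".pptx")


def build_convert_command(ppt_file_path: str, out_dir: str, binary: str = "libreoffice"):
    """Build the headless LibreOffice command line for one file."""
    return [binary, "--headless", "--convert-to", "pptx",
            "--outdir", out_dir, ppt_file_path]


def find_ppt_files(folder: str):
    """List files ending in .ppt; the pattern does not match .pptx files."""
    return sorted(glob.glob(os.path.join(folder, "*.ppt")))


def convert_folder(folder: str, out_dir: str = None):
    """Convert every .ppt file in folder and return the paths of the new .pptx files."""
    out_dir = out_dir or folder
    converted = []
    for ppt in find_ppt_files(folder):
        result = subprocess.run(build_convert_command(ppt, out_dir),
                                capture_output=True, text=True)
        target = pptx_output_path(ppt, out_dir)
        # Check both the exit code and that the output file actually exists.
        if result.returncode != 0 or not os.path.exists(target):
            raise RuntimeError(f"Failed to convert '{ppt}': {result.stderr}")
        converted.append(target)
    return converted
```

Calling `convert_folder("/your/ppt_path")` would then return a list of .pptx paths that can each be passed to the text extraction step.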
