Microsoft introduced the concept of GraphRAG in April of this year, then open-sourced it last week at https://github.com/microsoft/graphrag. As of this writing, the repository has 6,900+ stars.
Installation
The official docs recommend Python 3.10 to 3.12. When I installed with Python 3.10, project initialization failed with an error; switching to Python 3.11 made everything run normally, presumably because Python 3.10 is incompatible with some of Microsoft's newer SDKs. I therefore recommend a Python 3.11 environment. Installing GraphRAG itself is simple; a single command is all it takes:
pip install graphrag
Usage
For this tutorial, we use Ma Boyong's short novel 《太白金星有點煩》 as the test corpus to see how well Microsoft's open-source GraphRAG handles it.
Note that GraphRAG uses an LLM to extract entity relations from text chunks, so it consumes a large number of tokens. For personal experimentation, I do not recommend a GPT-4-class model (it is too expensive; if money is no object, feel free to ignore this advice). Balancing cost against quality, I used the DeepSeek-Chat model here.
Initializing the project
I first created a temporary test directory, myTest, then, following the official tutorial, created an input directory under myTest and placed the txt version of the book, renamed to book.txt, inside it. Then I ran python -m graphrag.index --init
to initialize the project and generate the configuration files.
mkdir ./myTest/input
curl https://www.xxx.com/太白金星有點煩.txt > ./myTest/input/book.txt  # example only; substitute whatever txt file you want to test with
cd ./myTest
python -m graphrag.index --init
After it finishes, several new items appear in the current directory (myTest):
- output: intermediate results from subsequent runs are saved here
- prompts: the prompts used during processing
- .env: the LLM API configuration file, which by default contains a single entry, GRAPHRAG_API_KEY, used to set the model's API key
- settings.yaml: the overall configuration file; if you are not using OpenAI's official models and API, you need to edit this file so that GraphRAG runs with your settings
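Putting the pieces together, the project layout after initialization looks roughly like this (a sketch based on the description above; exact contents may vary by GraphRAG version):

```
myTest/
├── input/
│   └── book.txt       # the text to index
├── output/            # intermediate results from later runs
├── prompts/           # prompts used during processing
├── .env               # GRAPHRAG_API_KEY for the LLM API
└── settings.yaml      # main configuration
```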
Configuring the files
First, configure the LLM API key in the .env file. This setting is global, so once it is set there you do not need to repeat it in settings.yaml. The default model in settings.yaml is gpt-4-turbo-preview; if you do not need to change the model or the API endpoint, configuration is now complete and you can skip the rest of this section and go straight to the execution step.
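For reference, the generated .env file holds just the one key; filling it in looks like this (the value below is a placeholder, not a real key):

```
GRAPHRAG_API_KEY=your_api_key_here
```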
I used an API key from agicto (mainly because new users get a free ¥10 credit, which is a nice freebie). All I changed were the API base URL and the model name. The complete settings.yaml after my edits:
encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: openai_chat # or azure_openai_chat
  model: deepseek-chat
  model_supports_json: false # recommended if this is available for your model.
  api_base: https://api.agicto.cn/v1
  # max_tokens: 4000
  # request_timeout: 180.0
  # api_version: 2024-02-15-preview
  # organization: <organization_id>
  # deployment_name: <azure_model_deployment_name>
  # tokens_per_minute: 150_000 # set a leaky bucket throttle
  # requests_per_minute: 10_000 # set a leaky bucket throttle
  # max_retries: 10
  # max_retry_wait: 10.0
  # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
  # concurrent_requests: 25 # the number of parallel inflight requests that may be made

parallelization:
  stagger: 0.3
  # num_threads: 50 # the number of threads to use for parallel processing

async_mode: threaded # or asyncio

embeddings:
  ## parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    model: text-embedding-3-small
    api_base: https://api.agicto.cn/v1
    # api_base: https://<instance>.openai.azure.com
    # api_version: 2024-02-15-preview
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>
    # tokens_per_minute: 150_000 # set a leaky bucket throttle
    # requests_per_minute: 10_000 # set a leaky bucket throttle
    # max_retries: 10
    # max_retry_wait: 10.0
    # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    # concurrent_requests: 25 # the number of parallel inflight requests that may be made
    # batch_size: 16 # the number of documents to send in a single request
    # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request
    # target: required # or optional

chunks:
  size: 300
  overlap: 100
  group_by_columns: [id] # by default, we don't allow chunks to cross documents

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"

cache:
  type: file # or blob
  base_dir: "cache"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

storage:
  type: file # or blob
  base_dir: "output/${timestamp}/artifacts"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

reporting:
  type: file # or console, blob
  base_dir: "output/${timestamp}/reports"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

entity_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization, person, geo, event]
  max_gleanings: 0

summarize_descriptions:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  # enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 0

community_report:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes
  # num_walks: 10
  # walk_length: 40
  # window_size: 2
  # iterations: 3
  # random_seed: 597832

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: false
  raw_entities: false
  top_level_nodes: false

local_search:
  # text_unit_prop: 0.5
  # community_prop: 0.1
  # conversation_history_max_turns: 5
  # top_k_mapped_entities: 10
  # top_k_relationships: 10
  # max_tokens: 12000

global_search:
  # max_tokens: 12000
  # data_max_tokens: 12000
  # map_max_tokens: 1000
  # reduce_max_tokens: 2000
  # concurrency: 32
Running the indexing pipeline
This is the core GraphRAG workflow: it builds the graph-based knowledge base that the subsequent question-answering stage relies on. Trigger it with the following command:
python -m graphrag.index
Following the approach described in Microsoft's paper, the indexing run performs the following steps:
- Source Documents → Text Chunks: split the source documents into text chunks.
- Text Chunks → Element Instances: extract instances of graph nodes and edges from each text chunk.
- Element Instances → Element Summaries: generate a summary for each graph element.
- Element Summaries → Graph Communities: partition the graph into communities using a community-detection algorithm.
- Graph Communities → Community Summaries: generate a summary for each community.
- Community Summaries → Community Answers → Global Answer: use community summaries to produce partial answers, then aggregate those partial answers into a global answer.
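To make the first step concrete, here is a minimal sketch of overlapping chunking using the size and overlap values from the chunks section of settings.yaml (size 300, overlap 100). This is only an illustration of the idea, not GraphRAG's actual implementation; tokens are stood in for by list items:

```python
def chunk_text(tokens, size=300, overlap=100):
    """Split a token sequence into chunks of `size` tokens,
    with consecutive chunks sharing `overlap` tokens."""
    chunks = []
    step = size - overlap  # how far the window advances each time
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # last window already covers the end of the document
    return chunks

# A stand-in "document" of 1000 token ids.
chunks = chunk_text(list(range(1000)))
print(len(chunks))  # → 5 chunks, each overlapping its neighbor by 100 tokens
```

With overlap, an entity mention that straddles a chunk boundary still appears whole in at least one chunk, at the cost of some duplicated LLM work.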
Total runtime depends on the size of the input text. This example took about 20 minutes and cost roughly ¥4. The output during execution looked like this:
🚀 Reading settings from settings.yaml
/home/xinfeng/miniconda3/envs/graphrag-new/lib/python3.11/site-packages/numpy/core/fromnumeric.py:59: FutureWarning: 'DataFrame.swapaxes' is deprecated and will be removed in a future version. Please use 'DataFrame.transpose' instead.
  return bound(*args, **kwds)
🚀 create_base_text_un