Microsoft introduced the concept of GraphRAG in April of this year, then open-sourced it last week at https://github.com/microsoft/graphrag. As of this writing, the repository has 6,900+ stars.
Installation
The official docs recommend Python 3.10 to 3.12. When I installed with Python 3.10, project initialization failed with an error; switching to Python 3.11 made everything run normally, presumably because Python 3.10 is incompatible with some of Microsoft's newer SDKs. I therefore recommend a Python 3.11 environment. Installing GraphRAG itself is simple; a single command is all it takes:
pip install graphrag
Usage
For this tutorial, we use Ma Boyong's short novel 《太白金星有點煩》 as the test corpus to see how well Microsoft's open-source GraphRAG handles it.
Note that GraphRAG uses an LLM to extract entity relations from text chunks, so it consumes a large number of tokens. For personal experimentation, I do not recommend a GPT-4-class model (it is too expensive; if money is no object, feel free to ignore this advice). Balancing cost against quality, I used the DeepSeek-Chat model here.
Initializing the project
I first created a temporary test directory, myTest, then, following the official tutorial, created an input directory under myTest and placed the txt version of the book, renamed to book.txt, inside it. Then I ran python -m graphrag.index --init
to initialize the project and generate the configuration files.
mkdir ./myTest/input
curl https://www.xxx.com/太白金星有點煩.txt > ./myTest/input/book.txt  # example only; substitute whatever txt file you want to test with
cd ./myTest
python -m graphrag.index --init
After it finishes, several new items appear in the current directory (myTest):
- output: intermediate results from subsequent runs are saved here
- prompts: the prompts used during processing
- .env: the LLM API configuration file, which by default contains a single entry, GRAPHRAG_API_KEY, used to set the model's API key
- settings.yaml: the overall configuration file; if you are not using OpenAI's official models and API, you need to edit this file so that GraphRAG runs with your settings
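Putting the pieces together, the project layout after initialization looks roughly like this (a sketch based on the description above; exact contents may vary by GraphRAG version):

```
myTest/
├── input/
│   └── book.txt       # the text to index
├── output/            # intermediate results from later runs
├── prompts/           # prompts used during processing
├── .env               # GRAPHRAG_API_KEY for the LLM API
└── settings.yaml      # main configuration
```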
Configuring the files
First, configure the LLM API key in the .env file. This setting is global, so once it is set there you do not need to repeat it in settings.yaml. The default model in settings.yaml is gpt-4-turbo-preview; if you do not need to change the model or the API endpoint, configuration is now complete and you can skip the rest of this section and go straight to the execution step.
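For reference, the generated .env file holds just the one key; filling it in looks like this (the value below is a placeholder, not a real key):

```
GRAPHRAG_API_KEY=your_api_key_here
```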
I used an API key from agicto (mainly because new users get a free ¥10 credit, which is a nice freebie). All I changed were the API base URL and the model name. The complete settings.yaml after my edits:
encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: openai_chat # or azure_openai_chat
  model: deepseek-chat
  model_supports_json: false # recommended if this is available for your model.
  api_base: https://api.agicto.cn/v1
  # max_tokens: 4000
  # request_timeout: 180.0
  # api_version: 2024-02-15-preview
  # organization: <organization_id>
  # deployment_name: <azure_model_deployment_name>
  # tokens_per_minute: 150_000 # set a leaky bucket throttle
  # requests_per_minute: 10_000 # set a leaky bucket throttle
  # max_retries: 10
  # max_retry_wait: 10.0
  # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
  # concurrent_requests: 25 # the number of parallel inflight requests that may be made

parallelization:
  stagger: 0.3
  # num_threads: 50 # the number of threads to use for parallel processing

async_mode: threaded # or asyncio

embeddings:
  ## parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    model: text-embedding-3-small
    api_base: https://api.agicto.cn/v1
    # api_base: https://<instance>.openai.azure.com
    # api_version: 2024-02-15-preview
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>
    # tokens_per_minute: 150_000 # set a leaky bucket throttle
    # requests_per_minute: 10_000 # set a leaky bucket throttle
    # max_retries: 10
    # max_retry_wait: 10.0
    # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    # concurrent_requests: 25 # the number of parallel inflight requests that may be made
    # batch_size: 16 # the number of documents to send in a single request
    # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request
    # target: required # or optional

chunks:
  size: 300
  overlap: 100
  group_by_columns: [id] # by default, we don't allow chunks to cross documents

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"

cache:
  type: file # or blob
  base_dir: "cache"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

storage:
  type: file # or blob
  base_dir: "output/${timestamp}/artifacts"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

reporting:
  type: file # or console, blob
  base_dir: "output/${timestamp}/reports"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

entity_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization, person, geo, event]
  max_gleanings: 0

summarize_descriptions:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  # enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 0

community_report:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes
  # num_walks: 10
  # walk_length: 40
  # window_size: 2
  # iterations: 3
  # random_seed: 597832

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: false
  raw_entities: false
  top_level_nodes: false

local_search:
  # text_unit_prop: 0.5
  # community_prop: 0.1
  # conversation_history_max_turns: 5
  # top_k_mapped_entities: 10
  # top_k_relationships: 10
  # max_tokens: 12000

global_search:
  # max_tokens: 12000
  # data_max_tokens: 12000
  # map_max_tokens: 1000
  # reduce_max_tokens: 2000
  # concurrency: 32
Running the indexing pipeline
This is the core GraphRAG workflow: it builds the graph-based knowledge base that the subsequent question-answering stage relies on. Trigger it with the following command:
python -m graphrag.index
Following the approach described in Microsoft's paper, the indexing run performs the following steps:
- Source Documents → Text Chunks: split the source documents into text chunks.
- Text Chunks → Element Instances: extract instances of graph nodes and edges from each text chunk.
- Element Instances → Element Summaries: generate a summary for each graph element.
- Element Summaries → Graph Communities: partition the graph into communities using a community-detection algorithm.
- Graph Communities → Community Summaries: generate a summary for each community.
- Community Summaries → Community Answers → Global Answer: use community summaries to produce partial answers, then aggregate those partial answers into a global answer.
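To make the first step concrete, here is a minimal sketch of overlapping chunking using the size and overlap values from the chunks section of settings.yaml (size 300, overlap 100). This is only an illustration of the idea, not GraphRAG's actual implementation; tokens are stood in for by list items:

```python
def chunk_text(tokens, size=300, overlap=100):
    """Split a token sequence into chunks of `size` tokens,
    with consecutive chunks sharing `overlap` tokens."""
    chunks = []
    step = size - overlap  # how far the window advances each time
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # last window already covers the end of the document
    return chunks

# A stand-in "document" of 1000 token ids.
chunks = chunk_text(list(range(1000)))
print(len(chunks))  # → 5 chunks, each overlapping its neighbor by 100 tokens
```

With overlap, an entity mention that straddles a chunk boundary still appears whole in at least one chunk, at the cost of some duplicated LLM work.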
Total runtime depends on the size of the input text. This example took about 20 minutes and cost roughly ¥4. The output during execution looked like this:
🚀 Reading settings from settings.yaml
/home/xinfeng/miniconda3/envs/graphrag-new/lib/python3.11/site-packages/numpy/core/fromnumeric.py:59: FutureWarning: 'DataFrame.swapaxes' is deprecated and will be removed in a future version. Please use 'DataFrame.transpose' instead.
  return bound(*args, **kwds)
🚀 create_base_text_un