基于GraphRAG+Ollama驗證知識圖譜和檢索增強融合

之前介紹了知識圖譜與檢索增強的融合探索GraphRAG。

https://blog.csdn.net/liliang199/article/details/151189579

這里嘗試在CPU環境，基于GraphRAG+Ollama，驗證GraphRAG構建知識圖譜和檢索增強查詢過程。

1 環境安裝

1.1 GraphRAG安裝

在本地cpu環境，基于linux conda安裝python，pip安裝graphrag，過程如下。

conda create -n graphrag python=3.10

conda activate graphrag

pip install graphrag==0.5.0 -i https://pypi.tuna.tsinghua.edu.cn/simple

安裝graphrag 0.5.0（之后版本可能和ollama有兼容問題）

1.2 Ollama LLM安裝

假設ollama已安裝，具體安裝過程參考

https://blog.csdn.net/liliang199/article/details/149267372

這里ollama下載llm模型mistral和embedding模型nomic-embed-text

ollama pull nomic-embed-text

ollama pull mistral

默認ollama模型上下文長度為2048，不能有效支持GraphRAG，需要對上下文長度進行修改。

導出現有llm模型配置，配置文件為Modelfile

ollama show --modelfile mistral:latest > Modelfile?

修改Modelfile，在PARAMETER區域添加如下配置，支持10k上下文，可依據具體情況設定。

PARAMETER num_ctx 10000

修改后示例如下

基于修改后的Modelfile，創建新的ollama模型，指令如下。

ollama create mistral:10k -f Modelfile

查看新創建模型

ollama list

2 GraphRAG圖譜構建驗證

2.1 測試數據準備

首先創建工作目錄

mkdir ragtest/input -p

輸入為如下小1000行的文本，獲取命令如下。

https://www.gutenberg.org/cache/epub/7785/pg7785.txt

wget https://www.gutenberg.org/cache/epub/7785/pg7785.txt -O ragtest/input/Transformers_intro.txt

此時，./ragtest包含測試數據，初始化指令如下。

graphrag init --root ./ragtest

生成參數配置文件ragtest/settings.yaml

2.2 環境變量設置

設置GRAPHRAG_API_KEY和GRAPHRAG_CLAIM_EXTRACTION_ENABLED

export GRAPHRAG_API_KEY=ollama
export GRAPHRAG_CLAIM_EXTRACTION_ENABLED=True

設置參數GRAPHRAG_CLAIM_EXTRACTION_ENABLED=True，否則無法生成協變量，Local Search出錯。

2.3 模型參數配置

模型參數配置文件ragtest/settings.yaml

修改llm model為mistral:10k，embedding model為nomic-embed-text

調用本地ollama llm服務，所以設置api_base: http://localhost:11434/v1

本地cpu部署，計算很慢，所以設置一個很長的request_timeout: 18000

沒有GPU，過大concurrent_requests沒效果，反而導致超時，設置concurrent_requests: 1

### This config file contains required core defaults that must be set, along with a handful of common optional settings.
### For a full list of available settings, see https://microsoft.github.io/graphrag/config/yaml/

### LLM settings ###
## There are a number of settings to tune the threading and token limits for LLM calls - check the docs.

encoding_model: cl100k_base # this needs to be matched to your model!

llm:
? api_key: ${GRAPHRAG_API_KEY} # set this in the generated .env file
? type: openai_chat # or azure_openai_chat
? model: mistral:10k

? api_base: http://localhost:11434/v1

? request_timeout: 18000

??concurrent_requests: 1
? model_supports_json: true # recommended if this is available for your model.
? # audience: "https://cognitiveservices.azure.com/.default"
? # api_base: https://<instance>.openai.azure.com
? # api_version: 2024-02-15-preview
? # organization: <organization_id>
? # deployment_name: <azure_model_deployment_name>

parallelization:
? stagger: 0.3
? # num_threads: 50

async_mode: threaded # or asyncio

embeddings:
? async_mode: threaded # or asyncio
? vector_store:
? ? type: lancedb
? ? db_uri: 'output/lancedb'
? ? container_name: default
? ? overwrite: true
? llm:
? ? api_key: ${GRAPHRAG_API_KEY}
? ? type: openai_embedding # or azure_openai_embedding
? ? model: nomic-embed-text
? ? api_base: http://localhost:11434/v1
? ? request_timeout: 18000

? ??concurrent_requests: 1
? ? # api_base: https://<instance>.openai.azure.com
? ? # api_version: 2024-02-15-preview
? ? # audience: "https://cognitiveservices.azure.com/.default"
? ? # organization: <organization_id>
? ? # deployment_name: <azure_model_deployment_name>

### Input settings ###

input:
? type: file # or blob
? file_type: text # or csv
? base_dir: "input"
? file_encoding: utf-8
? file_pattern: ".*\\.txt$"

chunks:
? size: 1200
? overlap: 100
? group_by_columns: [id]

### Storage settings ###
## If blob storage is specified in the following four sections,
## connection_string and container_name must be provided

cache:
? type: file # or blob
? base_dir: "cache"

reporting:
? type: file # or console, blob
? base_dir: "logs"

storage:
? type: file # or blob
? base_dir: "output"

## only turn this on if running `graphrag index` with custom settings
## we normally use `graphrag update` with the defaults
update_index_storage:
? # type: file # or blob
? # base_dir: "update_output"

### Workflow settings ###

skip_workflows: []

entity_extraction:
? prompt: "prompts/entity_extraction.txt"
? entity_types: [organization,person,geo,event]
? max_gleanings: 1

summarize_descriptions:
? prompt: "prompts/summarize_descriptions.txt"
? max_length: 500

claim_extraction:
? enabled: false
? prompt: "prompts/claim_extraction.txt"
? description: "Any claims or facts that could be relevant to information discovery."
? max_gleanings: 1

community_reports:
? prompt: "prompts/community_report.txt"
? max_length: 2000
? max_input_length: 8000

cluster_graph:
? max_cluster_size: 10

embed_graph:
? enabled: false # if true, will generate node2vec embeddings for nodes

umap:
? enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
? graphml: false
? raw_entities: false
? top_level_nodes: false
? embeddings: false
? transient: false

### Query settings ###
## The prompt locations are required here, but each search method has a number of optional knobs that can be tuned.
## See the config docs: https://microsoft.github.io/graphrag/config/yaml/#query

local_search:
? prompt: "prompts/local_search_system_prompt.txt"

global_search:
? map_prompt: "prompts/global_search_map_system_prompt.txt"
? reduce_prompt: "prompts/global_search_reduce_system_prompt.txt"
? knowledge_prompt: "prompts/global_search_knowledge_system_prompt.txt"

drift_search:
? prompt: "prompts/drift_search_system_prompt.txt"

2.4?數據索引構建

然后就是構建索引，這里需要設置--reporter "rich"，不設置會報錯。

nohup graphrag index --root ./ragtest --reporter "rich" > run.log &

本地CPU運行ollama其實不太有效，耗時太長，導致各種奇怪超時和報錯，所以最好有GPU。

另外，雖然可以調用外部LLM服務，但GraphRAG索引會消耗大量tokens，這需要不差錢。

附錄

問題1:?Invalid value for '--reporter'

Invalid value for '--reporter' (env var: 'None'): <ReporterType.RICH: 'rich'> is not one of 'rich', 'print', 'none'. ? ? ? ? ? ? ? ? ? ? ? ? ? │

補全輸出參數"none"/"print"/"rich"，比如?--reporter "rich"

reference

---

GraphRAG

https://github.com/msolhab/graphrag

Project Gutenberg

https://www.gutenberg.org/

Global Search Notebook

https://microsoft.github.io/graphrag/examples_notebooks/global_search/

GraphRAG-知識圖譜與檢索增強的融合探索

https://blog.csdn.net/liliang199/article/details/151189579

GraphTest - 直接使用阿里API，總體費用相對可控。

https://github.com/NanGePlus/GraphragTest

GraphRAG（最新版）+Ollama本地部署，以及中英文示例

https://juejin.cn/post/7439046849883226146

傻瓜操作：GraphRAG、Ollama 本地部署及踩坑記錄

https://blog.csdn.net/weixin_42107217/article/details/141649920

本文來自互聯網用戶投稿，該文觀點僅代表作者本人，不代表本站立場。本站僅提供信息存儲空間服務，不擁有所有權，不承擔相關法律責任。
如若轉載，請注明出處：http://www.pswp.cn/web/96148.shtml
繁體地址，請注明出處：http://hk.pswp.cn/web/96148.shtml
英文地址，請注明出處：http://en.pswp.cn/web/96148.shtml

如若內容造成侵權/違法違規/事實不符，請聯系多彩編程網進行投訴反饋email:809451989@qq.com，一經查實，立即刪除！