[Orange LLM] A First Look at Building a RAG Knowledge Base

I. Introduction

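Having implemented a series of features in earlier posts, we finally arrive at RAG. Below we build RAG retrieval on top of LangChain.
For background on RAG, see these two articles:
大模型應用之RAG詳解 (RAG for LLM Applications Explained)
什么是 RAG（檢索增強生成） (What Is RAG, Retrieval-Augmented Generation)
Alternatively, look up the IBM Technology channel on YouTube; their introductions to the topic are impressive.
With that said, here is a brief outline of the RAG text-processing pipeline.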

Data loading
Text chunking
Text embedding
Index creation

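Put simply, you take an external document (in any of various formats), split it into chunks, embed (vectorize) each chunk, and store the vectors in a vector store to serve as the retrieval corpus.
That is today's main task: we implement it with LangChain and keep refining it in later posts.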
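The four stages above can be sketched in plain Python before bringing in any library. This is only a toy illustration: the function names are made up for this example, the "embedding" is a bag-of-words count rather than a real model, and the "index" is an in-memory list rather than a vector store.

```python
import math
from collections import Counter


def load(raw_docs):
    """Data loading: here the documents are already in-memory strings."""
    return [{"source": name, "text": text} for name, text in raw_docs.items()]


def chunk(doc, size=40):
    """Text chunking: cut the text into fixed-size pieces (no overlap)."""
    text = doc["text"]
    return [
        {"source": doc["source"], "text": text[i:i + size]}
        for i in range(0, len(text), size)
    ]


def embed(text):
    """Text embedding (toy): a bag-of-words count vector, NOT a real model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(raw_docs):
    """Index creation: store every chunk next to its vector."""
    index = []
    for doc in load(raw_docs):
        for piece in chunk(doc):
            index.append({**piece, "vector": embed(piece["text"])})
    return index


def search(index, query, k=1):
    """Retrieval: rank the stored chunks by similarity to the query."""
    query_vec = embed(query)
    ranked = sorted(index, key=lambda entry: cosine(query_vec, entry["vector"]), reverse=True)
    return ranked[:k]


# Hypothetical sample corpus, short enough that each text fits in one chunk.
sample_docs = {
    "food.txt": "Weifang is known for its local food",
    "policy.txt": "Shandong published policy measures",
}
sample_index = build_index(sample_docs)
print(search(sample_index, "local food")[0]["source"])
```

The rest of this post replaces each toy stage with a real component: a PDF loader, a text splitter, an embedding model, and Elasticsearch as the index.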

II. Code Implementation

1. Start Elasticsearch

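We need Elasticsearch (ES) to store the vectors, so we start a simple instance. We use version 8.17.2.
We keep the configuration file minimal and turn off most of the security checks, which is acceptable only for local experiments. For a production-grade single-node setup, see 使用docker搭建ELK環境 (Setting Up an ELK Environment with Docker).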

cluster.name: my-application
node.name: node-1
http.port: 9200
xpack.security.http.ssl.enabled: false
xpack.security.enabled: false
xpack.security.transport.ssl.enabled: false
xpack.ml.enabled: false

We skip Kibana. Once ES is up, we move on to the Python code.
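Before writing the indexing code, it can help to confirm that ES is reachable. The sketch below uses only the standard library and assumes the configuration above (plain HTTP on port 9200, security disabled); `es_info` and `describe` are names invented for this example.

```python
import json
import urllib.request


def es_info(base_url="http://localhost:9200", opener=urllib.request.urlopen):
    """Fetch the JSON document that Elasticsearch serves at its root endpoint."""
    with opener(base_url, timeout=5) as resp:
        return json.loads(resp.read().decode("utf-8"))


def describe(info):
    """Summarize the cluster name and version from the root-endpoint JSON."""
    return f'{info["cluster_name"]} (Elasticsearch {info["version"]["number"]})'


# Usage, with ES running locally:
#   print(describe(es_info()))
```

If the call raises a connection error, ES is not listening on that address yet, and the LangChain code below would fail in the same way.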

2. Code implementation

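We prepare two PDFs: Shandong Province's published policies for Taiwan compatriots, and an introduction to Weifang cuisine. They are lutai.pdf and weifangfood.pdf respectively.
Let's look at LangChain's support for document loading, which is covered in the document loaders section of the LangChain docs.
You can see that it supports loading many file formats:
Webpages: web pages
PDFs: PDF files
…
Here we use a PDF loader, specifically the one described in the PyPDF documentation.
Install the package by following the steps there: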

pip install -qU pypdf

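Install it into the PyCharm virtual environment we set up earlier; see the previous posts for details. The script below also imports from langchain-community, langchain-text-splitters, langchain-ollama and langchain-elasticsearch, so those packages must be installed in the same environment.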

from uuid import uuid4
from langchain_community.document_loaders import PyPDFLoader
from langchain_core.documents import Document
from langchain_elasticsearch import ElasticsearchStore
from langchain_ollama import OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Build a list of PDF files to parse later
pdf1 = './weifangfood.pdf'
pdf2 = './lutai.pdf'
pdfs = [pdf1, pdf2]
docs = []
# Parse each PDF into documents
for pdf in pdfs:
    # Build a PDF loader
    loader = PyPDFLoader(pdf)
    # Add the loaded pages to the docs list, which we process below
    docs.extend(loader.load())
# Split the documents into chunks of about 1000 characters with about 200 characters of overlap;
# add_start_index=True records each chunk's starting character offset in the Document it came from
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200, add_start_index=True)
split_text_list = text_splitter.split_documents(docs)
# Build the embedding component
embed = OllamaEmbeddings(model="llama3.2:latest")
# Build the ES vector store
elastic_vector_search = ElasticsearchStore(es_url="http://localhost:9200", index_name="langchain_index", embedding=embed)
documents = []
for text in split_text_list:
    # To see the raw vector for a chunk, you could embed it directly:
    # vector = embed.embed_query(text.page_content)
    # Wrap each chunk in a Document
    document = Document(page_content=text.page_content, metadata=text.metadata)
    documents.append(document)
uuids = [str(uuid4()) for _ in range(len(documents))]
# Add the documents to the ES vector store (they are embedded on insert)
elastic_vector_search.add_documents(documents=documents, ids=uuids)

All of the code above follows these docs:
Text embedding
Choosing a vector database
ES vector storage

After the script runs, an index named langchain_index is created automatically in ES, and the embeddings of your document chunks are stored in it.
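To make the chunk_size=1000 / chunk_overlap=200 / add_start_index=True settings more concrete, here is a simplified sketch of overlapping windows and their start offsets. It is not how RecursiveCharacterTextSplitter works internally: the real splitter prefers to break at separators such as paragraph breaks, newlines and spaces, so its chunks are usually shorter than chunk_size and its offsets differ. The function names are invented for this example.

```python
def window_starts(text_length, chunk_size=1000, chunk_overlap=200):
    """Start offsets of fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    stride = chunk_size - chunk_overlap
    # Stop once the remaining text would be covered entirely by the previous window.
    last_needed = max(text_length - chunk_overlap, 1)
    return list(range(0, last_needed, stride))


def split_with_offsets(text, chunk_size=1000, chunk_overlap=200):
    """Return (start_index, chunk) pairs, similar in spirit to add_start_index=True."""
    return [
        (start, text[start:start + chunk_size])
        for start in window_starts(len(text), chunk_size, chunk_overlap)
    ]
```

With the defaults the stride is 800 characters, so a 2600-character page yields windows starting at 0, 800 and 1600, and the last 200 characters of each window repeat as the first 200 of the next. The overlap exists so that a sentence cut at a chunk boundary still appears intact in at least one chunk.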

3. Embedding results

After the run, let's inspect the index.

{
  "langchain_index": {
    "aliases": {},
    "mappings": {
      "properties": {
        "metadata": {
          "properties": {
            "author": {"type": "text", "fields": {"keyword": {"type": "keyword", "ignore_above": 256}}},
            "comments": {"type": "text", "fields": {"keyword": {"type": "keyword", "ignore_above": 256}}},
            "company": {"type": "text", "fields": {"keyword": {"type": "keyword", "ignore_above": 256}}},
            "creationdate": {"type": "date"},
            "creator": {"type": "text", "fields": {"keyword": {"type": "keyword", "ignore_above": 256}}},
            "keywords": {"type": "text", "fields": {"keyword": {"type": "keyword", "ignore_above": 256}}},
            "moddate": {"type": "date"},
            "page": {"type": "long"},
            "page_label": {"type": "text", "fields": {"keyword": {"type": "keyword", "ignore_above": 256}}},
            "producer": {"type": "text", "fields": {"keyword": {"type": "keyword", "ignore_above": 256}}},
            "source": {"type": "text", "fields": {"keyword": {"type": "keyword", "ignore_above": 256}}},
            "sourcemodified": {"type": "text", "fields": {"keyword": {"type": "keyword", "ignore_above": 256}}},
            "start_index": {"type": "long"},
            "subject": {"type": "text", "fields": {"keyword": {"type": "keyword", "ignore_above": 256}}},
            "title": {"type": "text", "fields": {"keyword": {"type": "keyword", "ignore_above": 256}}},
            "total_pages": {"type": "long"},
            "trapped": {"type": "text", "fields": {"keyword": {"type": "keyword", "ignore_above": 256}}}
          }
        },
        "text": {"type": "text", "fields": {"keyword": {"type": "keyword", "ignore_above": 256}}},
        "vector": {
          "type": "dense_vector",
          "dims": 3072,
          "index": true,
          "similarity": "cosine",
          "index_options": {"type": "int8_hnsw", "m": 16, "ef_construction": 100}
        }
      }
    },
    "settings": {
      "index": {
        "routing": {"allocation": {"include": {"_tier_preference": "data_content"}}},
        "number_of_shards": "1",
        "provided_name": "langchain_index",
        "creation_date": "1744981253591",
        "number_of_replicas": "1",
        "uuid": "JdwR_WlTQzWtzFRlVCv-rw",
        "version": {"created": "8521000"}
      }
    }
  }
}

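Let's go through the main fields. You could also print the results from code, but here we simply inspect them in ES.

text: after the text is split, each chunk is embedded and stored as one ES document. The text field holds the chunk's original text.
metadata: a nested object holding information about the chunk: which file it came from (source), which page it is on (page) out of how many (total_pages), the chunk's starting character offset (start_index), and so on.
vector: an array of floating-point numbers (a dense_vector field with 3072 dimensions in this mapping), which is the embedding of this chunk.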

Here is the stored document for one of the chunks:

{"_index": "langchain_index","_id": "e5bb54ff-8269-4e0b-8cb3-3952dc4d57b9","_score": 1,"_ignored": ["text.keyword"],"_source": {"text": """中國臺灣網 8 月 15 日濟南訊 7 月 31 日，山東省正式發布實施《關
于促進魯臺經濟文化交流合作的若干措施》，該措施共 56 條，其中
促進魯臺經濟合作 23 條措施、促進魯臺文教交流 14 條措施，支持
臺灣同胞在魯學習創業就業生活 19 條。措施涵蓋產業合作、就業、
人才引進、知識產權保護、文教合作及職業資格考試、證件辦理等
方面，綜合運用了財稅、金融、用地等政策手段，并明確規定今后
山東省各級政府在制定發展規劃、設立扶持資金和出臺支持政策時，
都將保障符合條件的臺資企業和臺灣同胞享有同等待遇。
山東是大陸國有經濟大省、基礎設施建設不斷發展，該措施支
持臺灣民間資本與山東國有資本共同設立股權投資基金、產業投資
基金，支持臺資參與山東高速公路、軌道交通等多方面基礎設施建
設，讓臺商臺企率先享受山東發展紅利，實現兩岸資本合作共贏。
山東也是儒家文化發源地和教育人力資源大省，措施扶持臺灣同胞、
臺灣高校研究機構和民間社團參與承接山東文化產業工程項目，聯
合開展教學改革和共建研究生培養基地，大力拓展兩岸文化融合的
深度廣度，特別是將大力吸引臺灣儒學研究高端人才到山東開展儒
學研究和傳播工作，符合條件的給予優厚待遇。同時，該措施也針
對臺商反映的“退城進園”、農業土地流轉合同到期優先續租、子
女就學、臺胞行醫等做出了統籌管理和制度化安排。
據悉，山東省人大正加緊制定保護促進臺胞在山東投資權益的
地方立法條例，條例將吸收本次出臺的 56 條措施，在法律層面上對
臺胞應享有權益予以明確和保障。""","metadata": {"producer": "","creator": "WPS 文字","creationdate": "2024-07-30T17:08:05+09:08","author": "","comments": "","company": "","keywords": "","moddate": "2024-07-30T17:08:05+09:08","sourcemodified": "D:20240730170805+09'08'","subject": "","title": "","trapped": "/False","source": "./lutai.pdf","total_pages": 14,"page": 0,"page_label": "1","start_index": 0},"vector": [0.017683318,0.010620082,0.02021882,-0.0032659627,......0.006196561]}}
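If you fetch such a document from code, a small helper can pull out the fields discussed above. This is a sketch that assumes the `_source` layout shown in this post (text, metadata, vector); the helper name and the shortened sample values are made up for illustration.

```python
def summarize_source(source):
    """Pull the fields discussed above out of one document's _source."""
    meta = source["metadata"]
    return {
        "source": meta["source"],            # file the chunk came from
        "page": meta["page"],                # zero-based page number
        "total_pages": meta["total_pages"],  # pages in the source PDF
        "start_index": meta["start_index"],  # character offset of the chunk
        "text_preview": source["text"][:20],
        "vector_dims": len(source["vector"]),
    }


# Shortened, hypothetical sample: the real vector has 3072 entries.
sample_source = {
    "text": "Shandong Province officially released ...",
    "metadata": {"source": "./lutai.pdf", "total_pages": 14, "page": 0, "start_index": 0},
    "vector": [0.017683318, 0.010620082, 0.02021882],
}
print(summarize_source(sample_source))
```

Note that page is zero-based while page_label is the human-readable label, which is why the first page appears as page 0 with page_label "1" in the document above.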

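4. Similarity search

Once the text is embedded, we can run vector searches. This is an important concept in ES; it is documented under AI search.
LangChain's wrappers are enough for our purposes. See the similarity search section of the LangChain docs.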

from langchain_elasticsearch import ElasticsearchStore
from langchain_ollama import OllamaEmbeddings

# Embedding model (the same one used when indexing)
embed = OllamaEmbeddings(model="llama3.2:latest")
# Connect to the ES vector store
elastic_vector_search = ElasticsearchStore(es_url="http://localhost:9200", index_name="langchain_index", embedding=embed)
# Run a similarity search and return only the best match (k=1)
results = elastic_vector_search.similarity_search_with_score(query="進魯臺經濟合作方面", k=1)
# Print the search results
for doc, score in results:
    print(f"* [SIM={score:3f}] {doc.page_content} [{doc.metadata}]")

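The retrieval quality leaves a lot of room for improvement; we will come back to it when we integrate the retriever with an LLM.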
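The index mapping shown earlier uses "similarity": "cosine", so it may help to see what that measure computes. Below is a minimal sketch of cosine similarity, plus the rescaling that, as far as I understand the Elasticsearch dense_vector documentation, turns it into a non-negative `_score`: (1 + cosine) / 2. The function names are invented for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def es_cosine_score(a, b):
    """Rescale cosine from [-1, 1] to [0, 1], as ES is documented to do for kNN."""
    return (1 + cosine_similarity(a, b)) / 2
```

Vectors pointing the same way score close to 1, orthogonal vectors score 0.5, and opposite vectors score 0, so a score near 0.5 means the query and the chunk are essentially unrelated.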

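5. Retrievers

LangChain offers many vector-search APIs (for ES, for PostgreSQL, and many more), each with its own usage. LangChain therefore provides a higher-level abstraction on top of them: the Retriever.
Many backends implement this interface, so we can program against it and hide the underlying implementation. Let's build a simple one.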

from langchain_elasticsearch import ElasticsearchStore
from langchain_ollama import OllamaEmbeddings

# Embedding model (the same one used when indexing)
embed = OllamaEmbeddings(model="llama3.2:latest")
# Connect to the ES vector store
elastic_vector_search = ElasticsearchStore(es_url="http://localhost:9200", index_name="langchain_index", embedding=embed)
# Build a retriever that uses similarity search; k=1 returns only one document per query
retriever = elastic_vector_search.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 1},
)
# Run two queries as a batch
resp = retriever.batch(["臺灣同胞在山東就業期間有什么政策", "濰坊有啥好吃的"])
# Print the output (one list of documents per query)
for result in resp:
    print(result)
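The value of the Retriever abstraction is that calling code depends only on a small interface, not on the backend. The sketch below illustrates that idea with a toy keyword-based backend. These are not LangChain's classes: `Retriever`, `KeywordRetriever` and `batch` are hypothetical names that only mirror the invoke/batch style used above.

```python
from typing import Protocol


class Retriever(Protocol):
    """Anything with an invoke(query) method counts as a retriever here."""

    def invoke(self, query: str) -> list[str]: ...


class KeywordRetriever:
    """Toy backend: ranks texts by how many query words they contain."""

    def __init__(self, texts, k=1):
        self.texts = texts
        self.k = k

    def invoke(self, query):
        words = query.lower().split()
        ranked = sorted(
            self.texts,
            key=lambda text: sum(word in text.lower() for word in words),
            reverse=True,
        )
        return ranked[: self.k]


def batch(retriever: Retriever, queries):
    """Run several queries through any retriever; one result list per query."""
    return [retriever.invoke(query) for query in queries]


toy = KeywordRetriever(["Weifang food guide", "Shandong employment policy"])
print(batch(toy, ["employment policy", "food"]))
```

Swapping the toy backend for a vector store changes only how the retriever is constructed; the code that calls `batch` stays the same, which is the point of the abstraction.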