基于LangChain的RAG開發教程（二）

v1.0官方文檔：https://python.langchain.com/v0.1/docs/get_started/introduction/

最新文檔：https://python.langchain.com/v0.2/docs/introduction/

LangChain是一個能夠利用大語言模型（LLM，Large Language Model）能力進行快速應用開發的框架：

高度抽象的組件，可以像搭積木一樣，使用LangChain的組件來實現我們的應用
集成外部數據到LLM中，比如API接口數據、文件、外部應用等；
提供了許多可自定義的LLM高級能力，比如Agent、RAG等等；

LangChain框架主要由以下六個部分組成：

Model IO：格式化和管理LLM的輸入和輸出
Retrieval：檢索，與特定應用數據交互，比如RAG，與向量數據庫密切相關，能夠實現從向量數據庫中搜索與問題相關的文檔來作為增強LLM的上下文
Agents：決定使用哪個工具（高層指令）的結構體，而tools則是允許LLM與外部系統交互的接口
Chains：構建運行程序的block-style組合，即能將多個模塊連接起來，實現復雜的功能應用
Memory：在運行一個鏈路（chain）時能夠存儲程序狀態的信息，比如存儲歷史對話記錄，隨時能夠對這些歷史對話記錄重新加載，保證長對話的準確性
Callbacks：回調機制，可以追蹤任何鏈路的步驟，記錄日志

可以到上一篇文章LangChain入門開發教程（一）：Model I/O回顧了Model I/O涉及的Prompts、LLMs、Chat model、Output Parser等概念及其使用，這篇文章則繼續介紹LangChain的第二個核心組件：Retrieval。

Retrieval

https://python.langchain.com/v0.1/docs/modules/data_connection/

許多LLM應用需要使用到用戶特定的數據，比如某些私域/垂直領域的數據，但它并不包含在LLM的訓練樣本集合，因此LLM無法在這些領域很好地發揮它的能力
對于近期的資訊信息，由于訓練樣本的滯后性，無法實時更新最新資訊語料，因此LLM是無法獲知臨近的資料信息的，有時甚至會一本正經的胡說八道

而針對這種數據情況，目前最主要的解決方式則是RAG（Retrieval Augmented Generation，檢索增強生成）：在生成過程中，外部的數據會通過檢索然后傳遞給LLM，讓LLM能夠利用這些新知識作為上下文。

LangChain提供了全部RAG應用的構建模板，覆蓋了與檢索流程相關的所有步驟，例如獲取外部數據等。它主要包含以下幾個核心模塊：

Document loaders：LangChain內置100多種document loaders，可以從許多不同的數據源加載數據，支持所有類型包括HTML、PDF和code等，也支持各種地址包括S3 buckets、公開的網站等等。
Text Splitting：檢索的一個關鍵部分就是只獲取文檔中相關的那一部分，這涉及到多個轉換步驟來為檢索準備文檔。其中主要的一個步驟就是將一個大型文檔拆分為更小的chunks（塊）。
Text Embedding Models：檢索的另外一個關鍵部分就是為文檔創建embeddings。embeddings可以捕獲文本的語義信息，讓你可以快速高效的尋找到相似的文本。
Vector Stores：隨著embeddings數量增加，便需要一個向量數據庫來高效地存儲和檢索這些embeddings，像之前介紹過的支持本地內存： Annoy & Faiss、chroma，或者C/S架構：milvus、weaviate、qdrant等等
Retrievers：一旦數據存入數據庫，接下來要做的便是檢索。LangChain提供了許多不同的檢索算法，能夠很輕松的實現語義相似檢索。
Indexing：LangChain提供了Indexing API來同步數據源到向量數據庫。

文檔加載器

示例代碼：document_loaders.ipynb

文檔加載器可以從任意源加載數據，以Document類的形式，Document包括一塊文本（text）和相關的元數據（metadata）。LangChain提供了許多文檔加載器，既可以加載簡單的.txt文件，也有從任意網頁加載文本內容的，甚至可以加載Youtube視頻的文字內容。

文檔加載器提供了load方法可以從指定源加載數據為文檔，也實現lazy_load的懶加載方式，僅在需要訪問的時候才加載到內存。

Text

TextLoader是最簡單的一個加載器，它會讀取一個文件作為text，然后將所有內容放置在一個document里。

from langchain_community.document_loaders import TextLoaderloader = TextLoader("./examples/sql.md")
loader.load()
"""
[Document(page_content="## 創建表\n\n```sql\n# 分區表\ncreate table test_t2(words string,frequency string) partitioned by (partdate string) row format delimited fields terminated by ','.....", metadata={'source': './examples/sql.md'})]
"""

page_content存儲文字內容，metadata存儲了數據源信息。

CSV

CSV（Comma-separated values）是一種逗號分隔文本文件，每一行代表一條數據記錄，一條數據記錄會有一個或者多個fields，以逗號隔開。

比如下面的csv文件樣例：

from langchain_community.document_loaders.csv_loader import CSVLoaderloader = CSVLoader(file_path='./examples/test.csv')
loader.load()
"""
[Document(page_content='id: 1\nname: 張三\ndegree: 本科', metadata={'source': './examples/test.csv', 'row': 0}),Document(page_content='id: 2\nname: 李四\ndegree: 碩士', metadata={'source': './examples/test.csv', 'row': 1})]
"""

CSVLoader會把一行即一條數據記錄作為一個Document，不同fields之間以換行符隔開，metadata則存儲了對應的行號和數據源信息。

CSV是分隔符分隔文本文件中的一種，CSVLoader可以指定分隔符，通過參數 csv_args：

loader = CSVLoader(file_path='./examples/no_fields_name.csv', csv_args={'delimiter': ',','quotechar': '"','fieldnames': ['id', 'name', 'degree']}, source_column='id'
)loader.load()
"""
[Document(page_content='id: 1\nname: 張三\ndegree: 本科', metadata={'source': '1', 'row': 0}),Document(page_content='id: 2\nname: 李四\ndegree: 碩士', metadata={'source': '2', 'row': 1})]
"""

delimiter：分隔符
quotechar：csv這類文件是以換行符來分隔每一條數據記錄的，如果某個field中的值存在換行符，則需要轉義字符quotechar
fieldnames：當csv文件沒有列名時，可以指定列名
source_column：選擇一個fields name作為metadata中的source

PDF

下面我們再列舉PDF文件的加載，以上述TextLoader中的sql.md導出pdf為例：

from langchain_community.document_loaders import PyPDFLoaderloader = PyPDFLoader("examples/sql.pdf")
pages = loader.load()
pages
"""
[Document(page_content="創建表 \n插?數據 # 分區表create table test_t2(words string,frequency string) partitioned by (partdate string) row format delimited fields terminated by ','......", metadata={'source': 'examples/sql.pdf', 'page': 0})]
"""

LangChain會把pdf文件的每一頁內容存儲到一個Document實例中。

其他加載器

File Directory、HTML、JSON、Markdown 等

自定義加載器

文檔加載主要包括以下幾個抽象組件：

Component	Description
Document	Contains `text` and `metadata`
BaseLoader	Use to convert raw data into `Documents`（將原始數據轉換為`Documents`）
Blob	A representation of binary data that’s located either in a file or in memory（表示一個文件或內存里的二進制數據）
BaseBlobParser	Logic to parse a `Blob` to yield `Document` objects（將 `Blob`解析為`Document`）

自定義文檔加載器.

一個文檔加載器需要繼承BaseLoader，并提供了以下幾個需要實現的接口來加載文檔：

Method Name	Explanation
lazy_load	Used to load documents one by one lazily. Use for production code.（懶加載，使用一個文檔加載一個，適合生產環境）
alazy_load	Async variant of `lazy_load`（異步實現）
load	Used to load all the documents into memory eagerly. Use for prototyping or interactive work.（餓漢式加載，一次性加載全部的文檔到內存，可以不重寫，默認調用`list(self.lazy_load())`）
aload	Used to load all the documents into memory eagerly. Use for prototyping or interactive work. （異步實現）

下面實現了一個加載文本文件的例子，每一行加載為一個Document

from typing import AsyncIterator, Iteratorfrom langchain_core.document_loaders import BaseLoader
from langchain_core.documents import Documentclass CustomDocumentLoader(BaseLoader):"""An example document loader that reads a file line by line."""def __init__(self, file_path: str) -> None:"""Initialize the loader with a file path.Args:file_path: The path to the file to load."""self.file_path = file_pathdef lazy_load(self) -> Iterator[Document]:  # <-- Does not take any arguments"""A lazy loader that reads a file line by line.When you're implementing lazy load methods, you should use a generatorto yield documents one by one."""with open(self.file_path, encoding="utf-8") as f:line_number = 0for line in f:yield Document(page_content=line,metadata={"line_number": line_number, "source": self.file_path},)line_number += 1# alazy_load is OPTIONAL.# If you leave out the implementation, a default implementation which delegates to lazy_load will be used!async def alazy_load(self,) -> AsyncIterator[Document]:  # <-- Does not take any arguments"""An async lazy loader that reads a file line by line."""# Requires aiofiles# Install with `pip install aiofiles`# https://github.com/Tinche/aiofilesimport aiofilesasync with aiofiles.open(self.file_path, encoding="utf-8") as f:line_number = 0async for line in f:yield Document(page_content=line,metadata={"line_number": line_number, "source": self.file_path},)line_number += 1

仍然以上述的sql.md文件為例子進行加載：

loader = CustomDocumentLoader('./examples/sql.md')
for doc in loader.lazy_load():print(doc)
"""
<class 'langchain_core.documents.base.Document'>  |  page_content='## 創建表\n' metadata={'line_number': 0, 'source': './examples/sql.md'}
<class 'langchain_core.documents.base.Document'>  |  page_content='```sql\n' metadata={'line_number': 1, 'source': './examples/sql.md'}
......
"""async for doc in loader.alazy_load():print()print(type(doc))print(doc)
"""
<class 'langchain_core.documents.base.Document'>  |  page_content='## 創建表\n' metadata={'line_number': 0, 'source': './examples/sql.md'}
<class 'langchain_core.documents.base.Document'>  |  page_content='```sql\n' metadata={'line_number': 1, 'source': './examples/sql.md'}
......
"""

文本分割器

示例代碼：text_splitter.ipynb

在這里，要先理解一個概念，RAG中基于向量數據庫的語義相似檢索或者其他數據庫的全文檢索，其對象都是文檔，在LangChain中則對應上述的Document。因此，傳給LLM的檢索文檔：

不能過于冗長，一是會增加成本（包括tokens消耗和性能），二是依賴LLM的context length能力
盡量只包含與查詢query語義相關的內容，不要有過多不相關的內容，這反而會影響LLM的推斷
盡量保證文檔的上下文完整性，而不是隨意切分同個主題/段落中的句子

那么，此時就需要對原始文本進行切割，更好地將我們的文本數據轉化為一個個Document，而不是像上述Text和PDF加載器例子那樣，直接將整個文本直接加載為一個Document。

一般的處理思路如下：

將長文本切分為小的、有語義價值的塊（chunks），通常是句子；
將這些小的塊合并為稍微大一點的塊，直到一定程度的大小（size），當然，最理想的情況是語義相關的句子能夠合并在一起；
可以額外設置一個overlap，當超出指定size時，可以繼續取內容，盡量每一個塊是一個完整的句子。

分割級別

整個文本分割過程可以非常簡單，也可以很復雜，復雜程度從低到高的處理方法如下：

字符分割：指定某個字符作為分隔符
遞歸字符分割：使用一組字符，遞歸地切分
文檔格式分割：針對不同格式的文檔使用不同的方法
語義分割：基于embeddings進行分隔

下面，我們會對每種級別的分割都列舉一些LangChain內置的實現例子。

在開始實踐之前，我們需要認識兩個概念：

Chunk Size：切割之后的一個塊（也就是一個Document）的大小
Chunk Overlap：連續的塊之間重復的字符數量，這是盡量避免將一個完整的上下文片段被切分了。這也將會連續的兩個塊存在一部分重復的字符，這部分重復的內容便是overlap。

Chunk Overlap

字符分割

首先，我們可以直接按照上圖[Chunk Overlap]的示例，僅指定chunk size和chunk overlap進行切割，僅以固定的大小去切分為每一個Document：

from langchain_text_splitters import CharacterTextSplittertext_splitter = CharacterTextSplitter(separator="",  # 默認為"\n\n"，因此不使用分隔符的話，需要指定separator=""chunk_size=35,chunk_overlap=4,
)text = "This is the text I would like to chunk up. It is the example text for this exercise"
text_splitter.create_documents([text])
"""
[Document(page_content='This is the text I would like to ch'),Document(page_content='o chunk up. It is the example text'),Document(page_content='ext for this exercise')]
"""

如果指定了使用分隔字符的話，那么會按照以下流程進行切分：

先以separator作為分割符對長文本進行切分
然后對切分之后比較短的chunk進行合并，直到與下一個chunk合并會超過chunk_size
但是，存在單個chunk的文本超過chunk_size的情況

text_splitter = CharacterTextSplitter(separator="\n",chunk_size=35,chunk_overlap=4,
)text = "This is\n the text I would\n like to chunk up.It is the example text for this exercise"
text_splitter.create_documents([text])
"""
[Document(page_content='This is\n the text I would'),Document(page_content='like to chunk up.It is the example text for this exercise')]
"""

遞歸字符分割

有了前面的字符分割原理基礎之后，那么遞歸字符分割就比較好理解了，它在字符切割的基礎上，加入了一組分割字符列表和遞歸，并且這組分割字符等級應該逐漸降低，比如 ["\n\n", "\n", " ", ""] 是段落->句子->單詞這樣的逐級遞減：

首先，逐級遞減的順序去找到長文本存在的最高等級的分隔字符，比如\n\n；
然后執行上面指定字符分割的步驟；
對當前分割字符切割之后，超過chunk_size的chunk，繼續遞歸執行下一個等級的分割字符(比如\n)的塊；
直到最后一個分隔字符即""，跳出遞歸

首先，我們還是先上代碼例子，依然以上述的英文句子為例，看看使用遞歸字符分割之后是什么結果：

from langchain_text_splitters import RecursiveCharacterTextSplittertext_splitter = RecursiveCharacterTextSplitter(separators=["\n\n", "\n", " ", ""],chunk_size=35,chunk_overlap=4,
)text = "This is\n\n the\n\n text\n I would like\n to chunk up. It is the example text\n for\n\n this exercise"
text_splitter.create_documents([text])
"""
[Document(page_content='This is\n\n the'),Document(page_content='text\n I would like'),Document(page_content='to chunk up. It is the example'),Document(page_content='text'),Document(page_content='for'),Document(page_content='this exercise')]
"""

遞歸字符分割是實用性非常高的一種長文本切分方法，也是LangChain推薦使用的方法。

代碼分割

這一類文本分割是針對代碼文件，比如cpp、python等，但其依賴的仍然是上面的遞歸字符分割，比如下面的python代碼，其實就是使用了一組特殊的分割字符列表而已：['\nclass ', '\ndef ', '\n\tdef ', '\n\n', '\n', ' ', '']，先根據類和函數去切分，再根據正常的行切分：

from langchain_text_splitters import (Language,RecursiveCharacterTextSplitter,
)PYTHON_CODE = """
def hello_world():print("Hello, World!")# Call the function
hello_world()
"""
python_splitter = RecursiveCharacterTextSplitter.from_language(language=Language.PYTHON, chunk_size=50, chunk_overlap=0
)
python_docs = python_splitter.create_documents([PYTHON_CODE])
python_docs
"""
[Document(page_content='def hello_world():\n    print("Hello, World!")'),Document(page_content='# Call the function\nhello_world()')]
"""RecursiveCharacterTextSplitter.get_separators_for_language(Language.PYTHON)
"""
['\nclass ', '\ndef ', '\n\tdef ', '\n\n', '\n', ' ', '']
"""

但其實除了普通的代碼，還支持markdown、html、latex等格式的文本文件的切分。

特定格式文檔

如上所述，一直在強調每一個chunk應當保證文本的下上文，那么對于某些格式的文本文件，我們應該需要遵循它本身的結構去切分。比如markdown文本，它是以不同等級的headers來組織目錄的，比如#表示一級目錄，##表示二級目錄，那么直觀的想法便是按照headers來切分：

from langchain_text_splitters import MarkdownHeaderTextSplittermarkdown_document = "# Intro \n\n    ## History \n\n Markdown[9] is a lightweight markup language for creating formatted text using a plain-text editor. John Gruber created Markdown in 2004 as a markup language that is appealing to human readers in its source code form.[9] \n\n Markdown is widely used in blogging, instant messaging, online forums, collaborative software, documentation pages, and readme files. \n\n ## Rise and divergence \n\n As Markdown popularity grew rapidly, many Markdown implementations appeared, driven mostly by the need for \n\n additional features such as tables, footnotes, definition lists,[note 1] and Markdown inside HTML blocks. \n\n #### Standardization \n\n From 2012, a group of people, including Jeff Atwood and John MacFarlane, launched what Atwood characterised as a standardisation effort. \n\n ## Implementations \n\n Implementations of Markdown are available for over a dozen programming languages."headers_to_split_on = [("#", "Header 1"),("##", "Header 2"),("###", "Header 3"),
]markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
md_header_splits = markdown_splitter.split_text(markdown_document)
md_header_splits
"""
[Document(page_content='Markdown[9] is a lightweight markup language for creating formatted text using a plain-text editor. John Gruber created Markdown in 2004 as a markup language that is appealing to human readers in its source code form.[9]  \nMarkdown is widely used in blogging, instant messaging, online forums, collaborative software, documentation pages, and readme files.', metadata={'Header 1': 'Intro', 'Header 2': 'History'}),Document(page_content='As Markdown popularity grew rapidly, many Markdown implementations appeared, driven mostly by the need for  \nadditional features such as tables, footnotes, definition lists,[note 1] and Markdown inside HTML blocks.  \n#### Standardization  \nFrom 2012, a group of people, including Jeff Atwood and John MacFarlane, launched what Atwood characterised as a standardisation effort.', metadata={'Header 1': 'Intro', 'Header 2': 'Rise and divergence'}),Document(page_content='Implementations of Markdown are available for over a dozen programming languages.', metadata={'Header 1': 'Intro', 'Header 2': 'Implementations'})]
"""

可以看到，結果已經按照headers分組切分，比如第一個Document則對應一級目錄Intro和二級目錄History，也即{'Header 1': 'Intro', 'Header 2': 'History'}。

雖然，這已經是按照原本的預期進行切分，但是切分后的每一個Document可能還是長文本。此時，便可以使用上述的遞歸字符分割，對每一段文本進一步切成較小的chunk。

# MD splits
markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on, strip_headers=False
)
md_header_splits = markdown_splitter.split_text(markdown_document)# Char-level splits
from langchain_text_splitters import RecursiveCharacterTextSplitterchunk_size = 250
chunk_overlap = 30
text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap
)# Split
splits = text_splitter.split_documents(md_header_splits)
splits
"""
[Document(page_content='# Intro  \n## History  \nMarkdown[9] is a lightweight markup language for creating formatted text using a plain-text editor. John Gruber created Markdown in 2004 as a markup language that is appealing to human readers in its source code form.[9]', metadata={'Header 1': 'Intro', 'Header 2': 'History'}),Document(page_content='Markdown is widely used in blogging, instant messaging, online forums, collaborative software, documentation pages, and readme files.', metadata={'Header 1': 'Intro', 'Header 2': 'History'}),Document(page_content='## Rise and divergence  \nAs Markdown popularity grew rapidly, many Markdown implementations appeared, driven mostly by the need for  \nadditional features such as tables, footnotes, definition lists,[note 1] and Markdown inside HTML blocks.', metadata={'Header 1': 'Intro', 'Header 2': 'Rise and divergence'}),Document(page_content='#### Standardization  \nFrom 2012, a group of people, including Jeff Atwood and John MacFarlane, launched what Atwood characterised as a standardisation effort.', metadata={'Header 1': 'Intro', 'Header 2': 'Rise and divergence'}),Document(page_content='## Implementations  \nImplementations of Markdown are available for over a dozen programming languages.', metadata={'Header 1': 'Intro', 'Header 2': 'Implementations'})]
"""

strip_headers=False表示保留headers原始文本
可以看到，從原本較長的3個Documents切分為5個較小的Documents

語義分割

LangChain接口文檔：semantic-chunker，使用上非常簡單：

# pip install --quiet langchain_experimentalfrom langchain_experimental.text_splitter import SemanticChunker
from langchain_community.embeddings import OpenAIEmbeddings# This is a long document we can split up.
with open("../../state_of_the_union.txt") as f:state_of_the_union = f.read()text_splitter = SemanticChunker(OpenAIEmbeddings())docs = text_splitter.create_documents([state_of_the_union])

但它背后的原理比較有意思，這里講一下：（預備知識：句子的Embedding可以用來衡量兩個句子之間的語義相關性）

選擇一種合適的方法對整塊長文本切分為句子
以三個句子組的窗口形式組合句子，對應下面代碼的buffer_size=1，即前面buffer_size個句子+當前句子+后面buffer_size個句子，并且對所有組合句子進行Embedding
遍歷所有句子，比較當前組合句子與下一個組合句子的語義相關性，即embeddings的距離
如果距離不超過threshold，則合并當前句子和下一個句子到當前的chunk中
如果距離超過了threshold，則當前chunk合并完成，下一個句子作為下一個chunk的開頭

這里解釋下這么做的有效性：

比較Embedding距離的是窗口形式的組合句子，因為單個句子比較存在較大的噪聲
連續的兩個組合句子Embedding距離超過閾值時，則前后句子語義相關性較低，可以認為開始了一個的語義部分
threshold有多種計算方法，其中一種是使用分位法，計算所有距離的95分位值作為閾值。

import rewith open('./examples/mit.txt') as file:essay = file.read()# Splitting the essay on '.', '?', and '!'
single_sentences_list = re.split(r'(?<=[.?!])\s+', essay)
print (f"{len(single_sentences_list)} senteneces were found")sentences = [{'sentence': x, 'index' : i} for i, x in enumerate(single_sentences_list)]
sentences[:3]
"""
317 senteneces were found
[{'sentence': '\n\nWant to start a startup?', 'index': 0},{'sentence': 'Get funded by\nY Combinator.', 'index': 1},{'sentence': 'October 2006(This essay is derived from a talk at MIT.)\nTill recently graduating seniors had two choices: get a job or go\nto grad school.','index': 2}]
"""sentences = combine_sentences(sentences, buffer_size=1)
sentences[:3]
"""
[{'sentence': '\n\nWant to start a startup?','index': 0,'combined_sentence': '\n\nWant to start a startup? Get funded by\nY Combinator.'},{'sentence': 'Get funded by\nY Combinator.','index': 1,'combined_sentence': '\n\nWant to start a startup? Get funded by\nY Combinator. October 2006(This essay is derived from a talk at MIT.)\nTill recently graduating seniors had two choices: get a job or go\nto grad school.'},{'sentence': 'October 2006(This essay is derived from a talk at MIT.)\nTill recently graduating seniors had two choices: get a job or go\nto grad school.','index': 2,'combined_sentence': 'Get funded by\nY Combinator. October 2006(This essay is derived from a talk at MIT.)\nTill recently graduating seniors had two choices: get a job or go\nto grad school. I think there will increasingly be a third option:\nto start your own startup.'}]
"""

Embedding

示例代碼：embeddings.ipynb

開頭已經提到embeddings是檢索的關鍵部分，embeddings可以為一塊文本創建一個向量表征，這能夠用于實現文本相似檢索，在同一個向量空間里，我們通過計算embeddings之間的距離來衡量兩塊文本的語義相關性。

許多主流的大模型供應商都會有其對應的Embedding模型。下面仍然以通義千問的免費tokens額度來作為演示案例，它支持最大2048的字符長度，生成的向量維度為1536。

不過LangChain沒有內置通義千問的Embedding實現，因此需要自己實現，比較簡單，可以參考內置的BaichuanTextEmbeddings實現：

embed_documents：批量調用Embedding模型，生成一批文本的向量
embed_query：生成單個文本的向量

from tongyi.embeddings import TongyiEmbeddingsembeddings_model = TongyiEmbeddings()embeddings = embeddings_model.embed_documents(["你好嗎","你的名字是什么","我的肚子好痛啊","腸胃不舒服","我在吃東西"]
)len(embeddings), len(embeddings[0])
"""
(5, 1536)
"""

那么，我們來驗證下embeddings是否能來衡量句子之間的語義相關性，下面以這個文本"腸胃不舒服"與上面的5個文本進行逐一比較：

import numpy as npdef normalize(x):x = np.asarray(x)norms = np.sum(np.multiply(x, x))norms = np.sqrt(norms)return x / normsfor i in range(5):similarity = np.dot(normalize(embeddings[2]), normalize(embeddings[i]))print(f'"{texts[2]}"與"{texts[i]}"的語義相似度為：{similarity}')
"""
"我的肚子好痛啊"與"你好嗎"的語義相似度為：0.3540708666322656
"我的肚子好痛啊"與"你的名字是什么"的語義相似度為：0.3079039808785484
"我的肚子好痛啊"與"我的肚子好痛啊"的語義相似度為：1.0
"我的肚子好痛啊"與"腸胃不舒服"的語義相似度為：0.418081827009795
"我的肚子好痛啊"與"我在吃東西"的語義相似度為：0.3523671162523911
"""

可以看到，結果是合理的，肚子與腸胃、好痛與不舒服存在一定的相關性，因此"我的肚子好痛啊" 與"腸胃不舒服"的相似度最高。

緩存

這一章節講一下embeddings的緩存，防止重復計算，以文本的hash值作為緩存的key，即需要完全相同的文本才能利用緩存快速二次訪問。

其實現過程如下：

初始化一個CacheBackedEmbeddings
需要用到Embeddings生成模型，如上述的TongyiEmbeddings
還有一個存儲對象，如LocalFileStore，會在指定目錄以文件的形式進行embeddings的存儲

from langchain.embeddings import CacheBackedEmbeddings
from langchain.storage import LocalFileStorestore = LocalFileStore("./cache/")cached_embedder = CacheBackedEmbeddings.from_bytes_store(embeddings_model, store, namespace=embeddings_model.model_name
)%%time
cached_embedder.embed_documents(texts)
"""
CPU times: user 75.1 ms, sys: 6.25 ms, total: 81.4 ms
Wall time: 357 ms
"""%%time
cached_embedder.embed_documents(texts)
"""
CPU times: user 5.27 ms, sys: 2.35 ms, total: 7.62 ms
Wall time: 6.12 ms
"""list(store.yield_keys())
"""
['text_embedding_v1f33a10ff-859a-5463-b3ff-f49f9fa5f6fa','text_embedding_v1046ba0f1-f46d-50cb-a4a2-d42b4b0a372b','text_embedding_v1fdcb1804-6409-5e76-89ff-9684747fff9d','text_embedding_v17cd8ea1f-6312-57bd-b0c4-46b1e007af6a','text_embedding_v1c8a6a73e-11f0-59e7-84f4-cb126b59694f']
"""

可以看到，因為有緩存，第二次生成向量的速度比第一次快出許多。
namespace，支持多個命名空間，可以避免不同Embedding模型去生成相同文本時的key碰撞
batch_size，可以設置CacheBackedEmbeddings批量寫入換粗，而不是一次性全量寫入

CacheBackedEmbeddings其實跟TongyiEmbeddings一樣都是Embeddings的子類，因此也是調用相同的方法embed_documents來生成向量，它其實就是調用傳入的真正的Embeddings來生成向量，并且緩存起來。

不過內置的CacheBackedEmbeddings的embed_query方法并沒有實現緩存，這明明只是批量和單個文本的區別，LangChain內置卻不實現，不懂。

內存存儲、基于redis的緩存：

使用方式與上述的LocalFileStore一樣

from langchain.storage import InMemoryByteStorefrom langchain.storage import RedisStore

向量數據庫

示例代碼：vector_store.ipynb

上述介紹了幾種向量緩存的方式，包括本地文件、內存、redis等，但其實這些都不是主流的向量存儲方式，因為它們其實并不支持檢索。

更普遍的方式是：

選擇Embedding模型，對源數據進行批量文檔生成的表征向量，存儲到向量數據庫
在查詢的時候，對非結構化的查詢文本使用同樣的Embedding模型生成表征向量，然后去向量數據庫召回與查詢文本的表征向量相似的向量對應的文檔。

下面我們以faiss作為向量數據庫進行實例演示。（LangChain支持的向量數據庫十分豐富，可以根據自己的需求進行選擇）

加載源數據

from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter# Load the document, split it into chunks, embed each chunk and load it into the vector store.
raw_documents = TextLoader('./examples/rag.txt').load()
text_splitter = CharacterTextSplitter(separator='\n\n\n', chunk_size=50, chunk_overlap=4)
documents = text_splitter.split_documents(raw_documents)
documents[:2]
"""
[Document(page_content='2024年普通高等學校招生全國統一考試（簡稱：2024年全國高考），是中華人民共和國合格的高中畢業生或具有同等學力的考生參加的選拔性考試 [1-2]。2024年報名人數1342萬人，比2023年增加51萬人 [21]。', metadata={'source': './examples/rag.txt'}),Document(page_content='2024年高考是黑龍江、甘肅、吉林、安徽、江西、貴州、廣西7個省份（中國第四批高考綜合改革省份）的第一屆落地實施的新高考。 [3]', metadata={'source': './examples/rag.txt'})]
"""

向量存儲

from langchain_community.vectorstores import FAISS
from tongyi.embeddings import TongyiEmbeddingsdb = FAISS.from_documents(documents, TongyiEmbeddings())

這里的from_documents會進行以下步驟：

初始化faiss索引，這里LangChain的FAISS實現看起來并不完善，僅支持IndexFlatIP和IndexFlatL2兩種索引方式，并且默認是使用效率較低的精確檢索IndexFlatL2，更多的索引方式可前往以前的文章 Faiss
對documents里的所有文本塊通過Embedding模型生成向量
向量寫入faiss，并更新/構建索引（后續才能進行向量檢索）

檢索

query = "哪里可以了解高考成績"
docs = db.similarity_search(query)
print(docs[0].page_content)
"""
一、在哪里可以了解高考成績、志愿填報時間和方式、各高校招生計劃、往年錄取參考等志愿填報權威信息？
各省級教育行政部門或招生考試機構官方網站、微信公眾號等權威渠道都會公布今年高考各階段工作時間安排，包括高考成績公布時間和查詢方式、志愿填報時間，以及今年各高校招生計劃、往年錄取情況參考等權威信息。考生和家長要及時關注本地官方權威渠道發布的消息內容。
考生高考志愿是高校錄取的重要依據，請廣大考生務必按照省級招生考試機構相關要求按時完成志愿填報。前期，教育部已會同有關部門協調互聯網平臺對省級招生考試機構和高校的官方網站、微信公眾號等進行了權威標識，請廣大考生在信息查詢時認準官方權威渠道，切勿輕信網絡不實信息。
"""
docs
"""
[Document(page_content='一、在哪里可以了解高考成績、志愿填報時間和方式、各高校招生計劃、往年錄取參考等志愿填報權威信息？\n各省級教育行政部門或招生考試機構官方網站、微信公眾號等......', metadata={'source': './examples/rag.txt'}),Document(page_content='2024年高考是黑龍江、甘肅、吉林、安徽、江西、貴州、廣西7個省份（中國第四批高考綜合改革省份）的第一屆落地實施的新高考。 [3]', metadata={'source': './examples/rag.txt'}),Document(page_content='三、高校招生章程有什么作用，如何查詢？\n高校招生章程由學校依據相關法律規定和國家招生政策制定，是學校開展招生工作的依據。考生在填報志愿前，應仔細查閱擬報考高校的招生章程，全面了解高校招生辦法和相關招生要求。\n主要查詢途徑有：中國高等教育學生信息網的“陽光高考”信息平臺（https://gaokao.chsi.com.cn）；各高校官方招生網站等。', metadata={'source': './examples/rag.txt'}),Document(page_content='二、高考志愿填報咨詢有哪些公共服務？\n教育部高度重視高考志愿填報咨詢服務工作，指導各地建立了招生考試機構、高校、中學多方面志愿填報咨詢公共服務體系......', metadata={'source': './examples/rag.txt'})]
"""

其中，有幾個關鍵的參數：

def similarity_search(self,query: str,k: int = 4,filter: Optional[Union[Callable, Dict[str, Any]]] = None,fetch_k: int = 20,**kwargs: Any,)

k：返回的檢索數量
filter：由于Document有附帶metadata的，因此可以使用filter函數，metadata作為入參，在檢索之后能夠進一步篩選
score_threshold：相似度閾值，可以用來過濾相似度較低的文檔
fetch_k：向量數據庫的召回數量，即在filter和相似度閾值過濾之前的文檔數量

MMR

LangChain支持對檢索結果進行基于maximum marginal relevance（MMR，最大邊界相關法）的重新排序，MMR具體公式如下：

$\ S [ λ S i m ( D i , Q ) ? ( 1 ? λ ) m a x D j ∈ S S i m ( D i , D j ) ] MMR=Argmax_{D_i\in R \backslash S}[\lambda Sim(D_i,Q)-(1-\lambda)max_{D_j \in S}Sim(D_i,D_j)]$

Q為查詢文檔，
D為所有檢索得到的相關文檔
S是已經被選擇的文檔
R\S是未被選擇的文檔
$\lambda$ 是一個0-1的權重

S初始化為與查詢相似度最高的一個文檔，然后一直循環MMR的計算，每一次S都會增加一個文檔，直到所有文檔都被選擇到。

query = "哪里可以了解高考成績"
docs = db.max_marginal_relevance_search(query)
docs
"""
[Document(page_content='一、在哪里可以了解高考成績、志愿填報時間和方式、各高校招生計劃、往年錄取參考等志愿填報權威信息？\n各省級教育行政部門或招生考試機構官方網站、微信公眾號等......', metadata={'source': './examples/rag.txt'}),Document(page_content='2024年高考是黑龍江、甘肅、吉林、安徽、江西、貴州、廣西7個省份（中國第四批高考綜合改革省份）的第一屆落地實施的新高考。 [3]', metadata={'source': './examples/rag.txt'}),Document(page_content='十、錄取通知書何時能收到？\n高校一般會在錄取結束后一周左右向錄取新生寄發錄取通知書。若考生在省級招生考試機構或高校官方網站上查詢到了錄取結果，一直沒有收到錄取通知書，可及時聯系錄取高校公布的招生咨詢電話查詢本人錄取通知書郵寄情況。', metadata={'source': './examples/rag.txt'}),Document(page_content='地區,報名時間\n北京,2023年10月25日9時至28日17時......', metadata={'source': './examples/rag.txt'})]
"""

可以看到，加了MMR重排序之后，top2與前面不加的是一樣，不過MMR重排序之后，第3和第4個文檔就改變了。

檢索

示例代碼：retrievers.ipynb

LCEL調用

在上一篇文章LangChain入門開發教程（一）：Model I/O中，我們提到了LangChain的一種表達語言/協議：LCEL，基本上所有核心組件都是基于LCEL的方式去實現和調用，那么檢索也可以通過LCEL來使用，這樣對后面跟其他組件如LLM聯動，提供很大的便捷性。

retriever = db.as_retriever(search_type="similarity", search_kwargs={"k": 2})retriever.invoke("哪里可以了解高考成績")
"""
[Document(page_content='一、在哪里可以了解高考成績、志愿填報時間和方式、各高校招生計劃、往年錄取參考等志愿填報權威信息？\n各省級教育行政部門或招生考試機構官方網站、微信公眾號等......', metadata={'source': './examples/rag.txt'}),Document(page_content='2024年高考是黑龍江、甘肅、吉林、安徽、江西、貴州、廣西7個省份（中國第四批高考綜合改革省份）的第一屆落地實施的新高考。 [3]', metadata={'source': './examples/rag.txt'})]
"""

search_type：支持similarity、similarity_score_threshold、mmr

search_kwargs：檢索參數，上述章節[向量數據庫-檢索]已經詳細闡述過

多查詢檢索

以距離為度量的向量數據庫檢索，是通過將query進行embedding（表征）到高維的向量空間，然后基于距離檢索相似文檔（embedding到相同向量空間）。

但有時query中詞語的輕微改變，或者embedding無法很好地捕獲query的語義信息，那么將導致無法有效檢索到相似文檔。

而多query檢索便是應對這個問題，會通過提示詞工程，將query輸入到LLM從不同角度生成多個類似的查詢，再分別用多個query去進行檢索，然后匯聚這些檢索結果，并進行去重，這樣能夠獲取更多潛在的相似文檔。

from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_community.chat_models import ChatTongyichat = ChatTongyi()retriever_from_llm = MultiQueryRetriever.from_llm(retriever=retriever, llm=chat, include_original=True
)unique_docs = retriever_from_llm.invoke("哪里可以了解高考成績")
unique_docs
"""
[Document(page_content='一、在哪里可以了解高考成績、志愿填報時間和方式、各高校招生計劃、往年錄取參考等志愿填報權威信息？\n各省級教育行政部門或招生考試機構官方網站、微信公眾號等......', metadata={'source': './examples/rag.txt'}),Document(page_content='三、高校招生章程有什么作用，如何查詢？\n高校招生章程由學校依據相關法律規定和國家招生政策制定，是學校開展招生工作的依據。考生在填報志愿前，應仔細查閱擬報考高校的招生章程，全面了解高校招生辦法和相關招生要求。\n主要查詢途徑有：中國高等教育學生信息網的“陽光高考”信息平臺（https://gaokao.chsi.com.cn）；各高校官方招生網站等。', metadata={'source': './examples/rag.txt'}),Document(page_content='2024年高考是黑龍江、甘肅、吉林、安徽、江西、貴州、廣西7個省份（中國第四批高考綜合改革省份）的第一屆落地實施的新高考。 [3]', metadata={'source': './examples/rag.txt'})]
"""len(unique_docs)
"""3"""

生成多個查詢的LLM提示詞也可以從LangChain源碼獲取：

You are an AI language model assistant. Your task is 
to generate 3 different versions of the given user 
question to retrieve relevant documents from a vector  database. 
By generating multiple perspectives on the user question, 
your goal is to help the user overcome some of the limitations 
of distance-based similarity search. Provide these alternative 
questions separated by newlines. Original question: {question}

上下文壓縮

檢索面臨的一個挑戰是在你構建文檔知識庫時，是無法提前預知query的內容，它可能是任何的內容，這意味著檢索到的文檔中與query最相關的信息可能會被湮沒在許多不相關的文本中，這即會導致更加昂貴的LLM tokens調用費用，又會影響LLM的回復效果，因為引入了噪聲。

針對這個問題，LangChain提供了一種思路方法：對檢索文檔，即上下文進行壓縮。

1. LLMChainExtractor

LLMChainExtractor會引入額外的一個LLM來遍歷原始的檢索文檔中提取與query相關的信息片段，而不是整一個文檔，并且會過濾不存在與query相關的信息的文檔。

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractorcompressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(base_compressor=compressor, base_retriever=retriever
)compressed_docs = compression_retriever.invoke("哪里可以了解高考成績"
)
compressed_docs
"""
[Document(page_content='各省級教育行政部門或招生考試機構官方網站、微信公眾號等權威渠道都會公布今年高考各階段工作時間安排，包括高考成績公布時間和查詢方式、志愿填報時間，以及今年各高校招生計劃、往年錄取情況參考等權威信息。', metadata={'source': './examples/rag.txt'})]
"""

我們仍然可以從LangChain源碼中得到這個提取相關信息的LLM提示詞：

Given the following question and context, extract any part of the context *AS IS* that is relevant to answer the question. If none of the context is relevant return {no_output_str}. Remember, *DO NOT* edit the extracted parts of the context.> Question: {{question}}
> Context:
>>>
{{context}}
>>>
Extracted relevant parts:

2. LLMChainFilter

LLMChainFilter則是更為簡單但更魯棒的實現，它仍然需要引入額外的LLM，讓LLM來決定和篩選原始的檢索文檔中哪些文檔是與query相關的需要返回的，也即過濾LLM認為與query不相關的文檔，但不改寫文檔的內容

from langchain.retrievers.document_compressors import LLMChainFilter_filter = LLMChainFilter.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(base_compressor=_filter, base_retriever=retriever
)compressed_docs = compression_retriever.invoke("哪里可以了解高考成績"
)
compressed_docs
"""
[Document(page_content='一、在哪里可以了解高考成績、志愿填報時間和方式、各高校招生計劃、往年錄取參考等志愿填報權威信息？\n各省級教育行政部門或招生考試機構官方網站、微信公眾號等權威渠道都會公布今年高考各階段工作時間安排，包括高考成績公布時間和查詢方式、志愿填報時間，以及今年各高校招生計劃、往年錄取情況參考等權威信息。考生和家長要及時關注本地官方權威渠道發布的消息內容。\n考生高考志愿是高校錄取的重要依據，請廣大考生務必按照省級招生考試機構相關要求按時完成志愿填報。前期，教育部已會同有關部門協調互聯網平臺對省級招生考試機構和高校的官方網站、微信公眾號等進行了權威標識，請廣大考生在信息查詢時認準官方權威渠道，切勿輕信網絡不實信息。', metadata={'source': './examples/rag.txt'})]
"""

同樣的，過濾不相關文檔的LLM提示詞：

Given the following question and context, return YES if the context is relevant to the question and NO if it isn't.> Question: {question}
> Context:
>>>
{context}
>>>
> Relevant (YES / NO):

3. DocumentCompressorPipeline

DocumentCompressorPipeline可以實現多種上下文壓縮器（compressors）和轉換器（transformers）序列的鏈式調用。

下面代碼樣例中的compressors-EmbeddingsFilter和transformers-EmbeddingsRedundantFilter都是基于Embedding模型，這樣可以不用引入額外的LLM，因為額外的LLM調用其實是昂貴且增加大量耗時的。

但是EmbeddingsFilter 貌似比較雞肋，因為它是基于embeddings相似度閾值進行過濾，這其實向量數據庫檢索便支持閾值過濾了，這對后面會提及的全文檢索才有著重要的作用。
EmbeddingsRedundantFilter則是通過過濾原始的檢索文檔中高度雷同（相似）的文檔，從而實現上下文的壓縮

from langchain.retrievers.document_compressors import DocumentCompressorPipeline
from langchain_community.document_transformers import EmbeddingsRedundantFilterredundant_filter = EmbeddingsRedundantFilter(embeddings=embeddings)
relevant_filter = EmbeddingsFilter(embeddings=embeddings, similarity_threshold=0.76)
pipeline_compressor = DocumentCompressorPipeline(transformers=[redundant_filter, relevant_filter]
)compression_retriever = ContextualCompressionRetriever(base_compressor=pipeline_compressor, base_retriever=retriever
)compressed_docs = compression_retriever.invoke("哪里可以了解高考成績"
)

集成檢索

類似于機器學習中的集成學習，我們也可以在檢索的時候，使用多個檢索器，集成它們的結果。而一般是搭配spare retriever（即全文檢索）和dense retriever（向量相似度檢索），全文檢索是基于關鍵詞去搜索相關文檔，而向量檢索是基于語義相似度，可以互為補充。

下面，我們以BM25作為全文檢索器，可以先了解下它的原理：

BM25

IDF

$q_i$ 是query分詞之后的第i個詞（term）
$f(q_i,D)$ 是 $q_i$ 在文檔中的出現頻率
$∣ D ∣$ 是文檔的長度，avgdl則是所有文檔的平均長度，這里的長度是基于分詞之后的詞的數量，即term的數量，而不是字符串長度
$IDF(q_i,D)$ 是TF-IDF中的一項：逆文檔頻率，不過BM25的IDF稍有調整。N是文檔的總數， $n(q_i)$ 是包含 $q_i$ 的文檔數量

from langchain_community.retrievers import BM25Retrieverdoc_list = [doc.page_content for doc in documents]
bm25_retriever = BM25Retriever.from_texts(doc_list, metadatas=[{"source": f"BM25"}] * len(doc_list)
)
bm25_retriever.k = 2from langchain.retrievers import EnsembleRetriever# initialize the ensemble retriever
ensemble_retriever = EnsembleRetriever(retrievers=[bm25_retriever, retriever], weights=[0.5, 0.5]
)docs = ensemble_retriever.invoke("哪里可以了解高考成績")
docs
"""
[Document(page_content='十、錄取通知書何時能收到？\n高校一般會在錄取結束后一周左右向錄取新生寄發錄取通知書。若考生在省級招生考試機構或高校官方網站上查詢到了錄取結果，一直沒有收到錄取通知書，可及時聯系錄取高校公布的招生咨詢電話查詢本人錄取通知書郵寄情況。', metadata={'source': 'BM25'}),Document(page_content='一、在哪里可以了解高考成績、志愿填報時間和方式、各高校招生計劃、往年錄取參考等志愿填報權威信息？\n各省級教育行政部門或招生考試機構官方網站、微信公眾號等權威渠道都會公布今年高考各階段工作時間安排，包括高考成績公布時間和查詢方式、志愿填報時間，以及今年各高校招生計劃、往年錄取情況參考等權威信息。考生和家長要及時關注本地官方權威渠道發布的消息內容。\n考生高考志愿是高校錄取的重要依據，請廣大考生務必按照省級招生考試機構相關要求按時完成志愿填報。前期，教育部已會同有關部門協調互聯網平臺對省級招生考試機構和高校的官方網站、微信公眾號等進行了權威標識，請廣大考生在信息查詢時認準官方權威渠道，切勿輕信網絡不實信息。', metadata={'source': './examples/rag.txt'}),Document(page_content='九、錄取工作采用什么方式，一般什么時間開始？\n高校招生實行計算機遠程網上錄取，各省（區、市）錄取工作一般于7月上旬開始，8月底之前結束。', metadata={'source': 'BM25'}),Document(page_content='2024年高考是黑龍江、甘肅、吉林、安徽、江西、貴州、廣西7個省份（中國第四批高考綜合改革省份）的第一屆落地實施的新高考。 [3]', metadata={'source': './examples/rag.txt'})]
"""

##多向量檢索-分塊

在檢索之前，切分文檔的過程中經常存在著這樣沖突的需求：

希望文檔是更短的文本，這樣embedding可以更精準的映射它們的語義。過于長的文本，可能會導致embedding丟失重要的語義；
檢索得到的文檔又希望是更長的文本，這樣能才能保持完整的上下文。

LangChain提供了這樣的思路和實現：

將原始文檔切分為更小的chunk然后進行embedding
記錄原始文檔與切分后的chunk的映射關系
通過向量數據庫檢索相似的chunk，然后映射回原始文檔

from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryByteStoreimport uuid
from langchain_text_splitters import RecursiveCharacterTextSplitterdoc_ids = [str(uuid.uuid4()) for _ in documents]
id_key = "doc_id"# The splitter to use to create smaller chunks
child_text_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20)
sub_docs = []
for i, doc in enumerate(documents):_id = doc_ids[i]_sub_docs = child_text_splitter.split_documents([doc])for _doc in _sub_docs:_doc.metadata[id_key] = _idsub_docs.extend(_sub_docs)len(documents), len(sub_docs)
"""
(15, 70)
"""vectorstore = FAISS.from_documents(sub_docs, TongyiEmbeddings())
# The storage layer for the parent documents
store = InMemoryByteStore()
retriever = MultiVectorRetriever(vectorstore=vectorstore,byte_store=store,id_key=id_key,
)
retriever.docstore.mset(list(zip(doc_ids, documents)))# Vectorstore alone retrieves the small chunks
retriever.vectorstore.similarity_search("哪里可以了解高考成績")[0]
"""
Document(page_content='一、在哪里可以了解高考成績、志愿填報時間和方式、各高校招生計劃、往年錄取參考等志愿填報權威信息？', metadata={'source': './examples/rag.txt', 'doc_id': 'd855063c-ad52-4a09-a304-aa9d2b2ebd17'})
"""# Retriever returns larger chunks
retriever.invoke("哪里可以了解高考成績")[0]
"""
Document(page_content='一、在哪里可以了解高考成績、志愿填報時間和方式、各高校招生計劃、往年錄取參考等志愿填報權威信息？\n各省級教育行政部門或招生考試機構官方網站、微信公眾號等權威渠道都會公布今年高考各階段工作時間安排，包括高考成績公布時間和查詢方式、志愿填報時間，以及今年各高校招生計劃、往年錄取情況參考等權威信息。考生和家長要及時關注本地官方權威渠道發布的消息內容。\n考生高考志愿是高校錄取的重要依據，請廣大考生務必按照省級招生考試機構相關要求按時完成志愿填報。前期，教育部已會同有關部門協調互聯網平臺對省級招生考試機構和高校的官方網站、微信公眾號等進行了權威標識，請廣大考生在信息查詢時認準官方權威渠道，切勿輕信網絡不實信息。', metadata={'source': './examples/rag.txt'})
"""

不過對于這個過程，LangChain也提供了封裝實現：Parent Document Retriever，其實也就是將拆分子文檔和建立映射關系封裝起來，效果等同于上面的代碼。

from langchain.retrievers import ParentDocumentRetriever
from langchain_chroma import Chroma
from langchain.storage import InMemoryStore# This text splitter is used to create the child documents
child_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20)
# The vectorstore to use to index the child chunks (empty to start)
# FAISS not supports empty initialzation
vectorstore = Chroma(collection_name="full_documents", embedding_function=TongyiEmbeddings()
)
# The storage layer for the parent documents
store = InMemoryStore()
retriever = ParentDocumentRetriever(vectorstore=vectorstore,docstore=store,child_splitter=child_splitter,
)retriever.add_documents(documents, ids=None)sub_docs = vectorstore.similarity_search("哪里可以了解高考成績")retrieved_docs = retriever.invoke("哪里可以了解高考成績")

多向量檢索-總結

概括總結有時可以對文本塊進行精準的壓縮，因為剔除了無用信息噪聲，這可以帶來更好的檢索效果。

那么，我們可以使用上一個小節的思路，使用總結的文本塊進行embedding和檢索，然后再映射還原到本來的文本塊。

import uuidfrom langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_community.chat_models import ChatTongyichain = ({"doc": lambda x: x.page_content}| ChatPromptTemplate.from_template("概括以下內容:\n\n{doc}")| ChatTongyi(max_retries=0)| StrOutputParser()
)summaries = chain.batch(documents, {"max_concurrency": 5})
summaries[:2]
"""
['2024年全國高考是中國的一項重要考試，用于選拔高中畢業生和具備同等學歷的考生，2024年的報名人數達到了1342萬人，相比上一年增長了51萬人。','2024年，中國有7個省份（黑龍江、甘肅、吉林、安徽、江西、貴州和廣西）將首次實施新的高考制度，作為第四批改革省份。']
"""doc_ids = [str(uuid.uuid4()) for _ in documents]
id_key = "doc_id"summary_docs = [Document(page_content=s, metadata={id_key: doc_ids[i]})for i, s in enumerate(summaries)
]# The vectorstore to use to index the child chunks
vectorstore = FAISS.from_documents(summary_docs, TongyiEmbeddings())
# The storage layer for the parent documents
store = InMemoryByteStore()
retriever = MultiVectorRetriever(vectorstore=vectorstore,byte_store=store,id_key=id_key,
)
retriever.docstore.mset(list(zip(doc_ids, documents)))sub_docs = vectorstore.similarity_search("哪里可以了解高考成績")
sub_docs[0]
"""
Document(page_content='高考相關信息可以在各省級教育行政部門或招生考試機構的官方網站、微信公眾號等權威渠道獲取，包括成績公布時間、查詢方式、志愿填報時間、高校招生計劃和歷年錄取參考等。考生和家長需密切關注官方發布的信息。志愿填報至關重要，考生必須遵循省級招生考試機構的規定。教育部已與相關部門合作確保官方渠道的權威性，提醒大家在查詢時要識別官方標識，避免相信非官方的不實信息。', metadata={'doc_id': 'd5f6fbc3-3425-4c3e-914a-f10669c9ae53'})
"""retrieved_docs = retriever.invoke("哪里可以了解高考成績")
retrieved_docs[0]
"""
Document(page_content='一、在哪里可以了解高考成績、志愿填報時間和方式、各高校招生計劃、往年錄取參考等志愿填報權威信息？\n各省級教育行政部門或招生考試機構官方網站、微信公眾號等權威渠道都會公布今年高考各階段工作時間安排，包括高考成績公布時間和查詢方式、志愿填報時間，以及今年各高校招生計劃、往年錄取情況參考等權威信息。考生和家長要及時關注本地官方權威渠道發布的消息內容。\n考生高考志愿是高校錄取的重要依據，請廣大考生務必按照省級招生考試機構相關要求按時完成志愿填報。前期，教育部已會同有關部門協調互聯網平臺對省級招生考試機構和高校的官方網站、微信公眾號等進行了權威標識，請廣大考生在信息查詢時認準官方權威渠道，切勿輕信網絡不實信息。', metadata={'source': './examples/rag.txt'})
"""

多向量檢索-假設問題

有這么一種場景，例如QA系統，用戶往往是輸入一個問題形式的query，然后去檢索相關的回答/答案形式的document。那么，使用常規的處理方法，會導致query和數據庫中的documents形式差異較大，embedding可以無法捕獲這種問題（query）和回答（document）的語義相關。

那么，既然使用問題形式的query進行檢索，那么我們也可以為documents構造合適的假設性問題，為何叫假設性問題，因為這些問題是通過LLM生成的，而不是真實存在的。

from langchain_core.messages import AIMessage
from langchain_core.exceptions import OutputParserExceptiondef custom_parse(ai_message: AIMessage) -> str:"""Parse the AI message."""if '\n\n' in ai_message.content:return ai_message.content.split('\n\n')elif '\n' in ai_message.content:return ai_message.content.split('\n')else:raise OutputParserException("Badly formed question!")chain = ({"doc": lambda x: x.page_content}| ChatPromptTemplate.from_template("為下面內容生成3個合適的提問問題：\n\n{doc}\n\n#限制\n生成的3個問題使用兩個換行符，即```\n\n```符號進行隔開")| ChatTongyi(max_retries=0)| custom_parse
)hypothetical_questions = chain.batch(documents, {"max_concurrency": 5})
hypothetical_questions[0]
"""
['1. 2024年全國高考的全稱是什么？','2. 與2023年相比，2024年全國高考的報名人數有何變化？','3. 能否提供2023年全國高考的報名人數數據作為對比？']
"""doc_ids = [str(uuid.uuid4()) for _ in documents]
id_key = "doc_id"question_docs = []
for i, question_list in enumerate(hypothetical_questions):question_docs.extend([Document(page_content=s, metadata={id_key: doc_ids[i]}) for s in question_list])# The vectorstore to use to index the child chunks
vectorstore = FAISS.from_documents(question_docs, TongyiEmbeddings())
# The storage layer for the parent documents
store = InMemoryByteStore()
retriever = MultiVectorRetriever(vectorstore=vectorstore,byte_store=store,id_key=id_key,
)
retriever.docstore.mset(list(zip(doc_ids, documents)))sub_docs = vectorstore.similarity_search("哪里可以了解高考成績")
sub_docs[0]
"""
Document(page_content='1. 從哪里可以獲得高考成績查詢的具體時間和方式，以及志愿填報的詳細指導？', metadata={'doc_id': '120780d1-f9f2-4ee4-ac03-d806343878b8'})
"""retrieved_docs = retriever.invoke("哪里可以了解高考成績")
retrieved_docs[0]
"""
Document(page_content='一、在哪里可以了解高考成績、志愿填報時間和方式、各高校招生計劃、往年錄取參考等志愿填報權威信息？\n各省級教育行政部門或招生考試機構官方網站、微信公眾號等權威渠道都會公布今年高考各階段工作時間安排，包括高考成績公布時間和查詢方式、志愿填報時間，以及今年各高校招生計劃、往年錄取情況參考等權威信息。考生和家長要及時關注本地官方權威渠道發布的消息內容。\n考生高考志愿是高校錄取的重要依據，請廣大考生務必按照省級招生考試機構相關要求按時完成志愿填報。前期，教育部已會同有關部門協調互聯網平臺對省級招生考試機構和高校的官方網站、微信公眾號等進行了權威標識，請廣大考生在信息查詢時認準官方權威渠道，切勿輕信網絡不實信息。', metadata={'source': './examples/rag.txt'})
"""

Self-querying檢索

self-querying檢索，正如它的名字，可以自我查詢。具體來講，給定任何自然語言的query，檢索器使用一個查詢構造的LLM來生成和解析成結構化的查詢語法，然后應用到數據庫。如下圖所示：

這不僅能夠讓檢索器使用用戶輸入的query與存儲的文檔進行語義相似比較，還能夠從用戶輸入提取一個過濾器，來對文檔的metadata進行篩選。

self-querying

1. 構造電影類型的文檔數據，包含上映年份（year）、評分（rating）、類型（genre）

from langchain_chroma import Chroma
from langchain_core.documents import Documentdocs = [Document(page_content="A bunch of scientists bring back dinosaurs and mayhem breaks loose",metadata={"year": 1993, "rating": 7.7, "genre": "science fiction"},),Document(page_content="Leo DiCaprio gets lost in a dream within a dream within a dream within a ...",metadata={"year": 2010, "director": "Christopher Nolan", "rating": 8.2},),Document(page_content="A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea",metadata={"year": 2006, "director": "Satoshi Kon", "rating": 8.6},),Document(page_content="A bunch of normal-sized women are supremely wholesome and some men pine after them",metadata={"year": 2019, "director": "Greta Gerwig", "rating": 8.3},),Document(page_content="Toys come alive and have a blast doing so",metadata={"year": 1995, "genre": "animated"},),Document(page_content="Three men walk into the Zone, three men walk out of the Zone",metadata={"year": 1979,"director": "Andrei Tarkovsky","genre": "thriller","rating": 9.9,},),
]
vectorstore = Chroma.from_documents(docs, TongyiEmbeddings())

2. 創建self-querying檢索器，需要增加文檔數據的描述document_content_description和每個metadata字段的描述metadata_field_info

from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetrievermetadata_field_info = [AttributeInfo(name="genre",description="The genre of the movie. One of ['science fiction', 'comedy', 'drama', 'thriller', 'romance', 'action', 'animated']",type="string",),AttributeInfo(name="year",description="The year the movie was released",type="integer",),AttributeInfo(name="director",description="The name of the movie director",type="string",),AttributeInfo(name="rating", description="A 1-10 rating for the movie", type="float"),
]
document_content_description = "Brief summary of a movie"
llm = ChatTongyi()
retriever = SelfQueryRetriever.from_llm(llm,vectorstore,document_content_description,metadata_field_info,
)

3. 檢索測試

# This example only specifies a filter
retriever.invoke("I want to watch a movie rated higher than 8.5")
"""
[Document(page_content='Three men walk into the Zone, three men walk out of the Zone', metadata={'director': 'Andrei Tarkovsky', 'genre': 'thriller', 'rating': 9.9, 'year': 1979}),Document(page_content='A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea', metadata={'director': 'Satoshi Kon', 'rating': 8.6, 'year': 2006})]
"""

4. 實現原理

下面，我們來揭開self-querying檢索器的神秘面紗，簡單的闡述下它的原理機制。

其中，最重要的是查詢構造器query_constructor的LLM chain，即將自然語言的query轉化為filter。下一步其實就是直接執行向量數據庫檢索，只是帶上了filter。

from langchain.chains.query_constructor.base import (StructuredQueryOutputParser,get_query_constructor_prompt,
)
from langchain.retrievers.self_query.chroma import ChromaTranslator
from langchain_community.chat_models import ChatTongyillm = ChatTongyi()prompt = get_query_constructor_prompt(document_content_description,metadata_field_info,allowed_comparators=ChromaTranslator.allowed_comparators,allowed_operators=ChromaTranslator.allowed_operators
)
output_parser = StructuredQueryOutputParser.from_components()
query_constructor = prompt | llm | output_parser

如上面代碼，LLM Chain包含了一個提示詞模板，一個llm和一個輸出解析器。我們一個一個來。

prompt.format(query="dummy question")
"""
Your goal is to structure the user\'s query to match the request schema provided below.<< Structured Request Schema >>
When responding use a markdown code snippet with a JSON object formatted in the following schema:```json
{"query": string \\ text string to compare to document contents"filter": string \\ logical condition statement for filtering documents
}
```The query string should contain only text that is expected to match the contents of documents. Any conditions in the filter should not be mentioned in the query as well.A logical condition statement is composed of one or more comparison and logical operation statements.A comparison statement takes the form: `comp(attr, val)`:
- `comp` (eq | ne | gt | gte | lt | lte): comparator
- `attr` (string):  name of attribute to apply the comparison to
- `val` (string): is the comparison valueA logical operation statement takes the form `op(statement1, statement2, ...)`:
- `op` (and | or): logical operator
- `statement1`, `statement2`, ... (comparison statements or logical operation statements): one or more statements to apply the operation toMake sure that you only use the comparators and logical operators listed above and no others.
Make sure that filters only refer to attributes that exist in the data source.
Make sure that filters only use the attributed names with its function names if there are functions applied on them.
Make sure that filters only use format `YYYY-MM-DD` when handling date data typed values.
Make sure that filters take into account the descriptions of attributes and only make comparisons that are feasible given the type of data being stored.
Make sure that filters are only used as needed. If there are no filters that should be applied return "NO_FILTER" for the filter value.<< Example 1. >>
Data Source:
```json
{"content": "Lyrics of a song","attributes": {"artist": {"type": "string","description": "Name of the song artist"},"length": {"type": "integer","description": "Length of the song in seconds"},"genre": {"type": "string","description": "The song genre, one of "pop", "rock" or "rap""}}
}
```User Query:
What are songs by Taylor Swift or Katy Perry about teenage romance under 3 minutes long in the dance pop genreStructured Request:
```json
{"query": "teenager love","filter": "and(or(eq(\\"artist\\", \\"Taylor Swift\\"), eq(\\"artist\\", \\"Katy Perry\\")), lt(\\"length\\", 180), eq(\\"genre\\", \\"pop\\"))"
}
```<< Example 2. >>
Data Source:
```json
{"content": "Lyrics of a song","attributes": {"artist": {"type": "string","description": "Name of the song artist"},"length": {"type": "integer","description": "Length of the song in seconds"},"genre": {"type": "string","description": "The song genre, one of "pop", "rock" or "rap""}}
}
```User Query:
What are songs that were not published on SpotifyStructured Request:
```json
{"query": "","filter": "NO_FILTER"
}
```<< Example 3. >>
Data Source:
```json
{"content": "Brief summary of a movie","attributes": {"genre": {"description": "The genre of the movie. One of [\'science fiction\', \'comedy\', \'drama\', \'thriller\', \'romance\', \'action\', \'animated\']","type": "string"},"year": {"description": "The year the movie was released","type": "integer"},"director": {"description": "The name of the movie director","type": "string"},"rating": {"description": "A 1-10 rating for the movie","type": "float"}
}
}
```User Query:
dummy questionStructured Request:
"""ChromaTranslator.allowed_comparators
"""
[<Comparator.EQ: 'eq'>,<Comparator.NE: 'ne'>,<Comparator.GT: 'gt'>,<Comparator.GTE: 'gte'>,<Comparator.LT: 'lt'>,<Comparator.LTE: 'lte'>]
"""ChromaTranslator.allowed_operators
"""
[<Operator.AND: 'and'>, <Operator.OR: 'or'>]
"""

上述便是LangChain設計的查詢構造器的提示詞模板，其中我們需要輸入對應向量數據庫支持的比較操作和邏輯操作。

根據提示詞中的examples可以看出，llm的輸出應該是類似這樣的形式：

and(or(eq("artist", "Taylor Swift"), eq("artist", "Katy Perry")), lt("length", 180), eq("genre", "pop"))

然后再通過輸出解析器將它轉化為結構化的過濾器：

query_constructor.invoke("I want to watch a movie about dinosaurs rated higher than 8.5")
"""
StructuredQuery(query='dinosaurs', filter=Comparison(comparator=<Comparator.GT: 'gt'>, attribute='rating', value=8.5), limit=None)
"""

應用

講到這里，鋪墊了這么久，最后來一個實際的應用案例。

大模型局限

首先，眾所周知，許多大模型包括ChatGPT、文心一言、通義千問等等，由于其訓練語料是存在滯后性的，因此，這些大模型對于最新的資訊是無法獲知。我們就以最近的2024年高考，來對大模型進行詢問：

from langchain_community.chat_models import ChatTongyichat = ChatTongyi()chat.invoke('2024高考報名人數是多少')
"""
AIMessage(content='對不起，我無法提供具體的2024年高考報名人數信息，因為這些數據通常由各省份的教育考試機構或政府部門發布，而且會在高考報名開始前公布。對于這類實時數據，建議你關注當地教育部門或考試院的官方通知，或者在高考報名開始時查詢相關公告。', response_metadata={'model_name': 'qwen-turbo', 'finish_reason': 'stop', 'request_id': '2cffd503-04a3-96e6-bed6-efd912311105', 'token_usage': {'input_tokens': 16, 'output_tokens': 67, 'total_tokens': 83}}, id='run-293addeb-4aa2-4ea9-b53a-861483c0114c-0')
"""chat.invoke('2024高考，廣東的報名時間是什么時候')
"""
AIMessage(content='高考報名時間每年可能會有所變動，具體以官方發布的通知為準。一般來說，廣東省的高考報名時間通常在每年的11月份進行，持續一周左右。建議你關注廣東省教育考試院或當地教育局的官方網站，他們會發布最準確的高考報名通知和時間安排。同時，也要注意報名截止日期，不要錯過報名時間。', response_metadata={'model_name': 'qwen-turbo', 'finish_reason': 'stop', 'request_id': 'a2e34ac6-4188-9626-a93d-cfea2347401d', 'token_usage': {'input_tokens': 20, 'output_tokens': 74, 'total_tokens': 94}}, id='run-5a200266-3e07-4271-9f7e-a74f4b7156d9-0')
"""

可以看到，對于第一個問題，大模型根本無法問答，只能含糊其辭。第二個問題，也是無法直接回答，而是借鑒了2023年高考。

引入RAG

正如開頭便提到的，RAG（Retrieval Augmented Generation，檢索增強生成）是在生成過程中，外部的數據會通過檢索然后傳遞給LLM，讓LLM能夠利用這些新知識作為上下文。

首先，來看看我們關于2024年高考的文檔語料：

接著，我們以上述講解的一個檢索器：基于假設問題的多向量檢索。對文檔構造多個假設性提問，將query去檢索相似的假設性提問，并關聯到原始的文檔，作為LLM提示詞的上下文：

（retriever便完全對應上述章節中的多向量檢索-假設問題）

from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParsertemplate = """回答用戶的問題，下面的內容可以作為你的知識依據：
```
{context}
```用戶的問題：{query}
"""
prompt = ChatPromptTemplate.from_template(template)def format_docs(docs):return "\n\n".join([d.page_content for d in docs])rag_chain = ({"context": retriever | format_docs, "query": RunnablePassthrough()}| prompt| chat| StrOutputParser()
)rag_chain.invoke("2024年高考報名人數是多少")
"""
'2024年全國高考的報名人數達到1342萬人。'
"""rag_chain.invoke("2024年高考，廣東的報名時間是什么時候")
"""
'2024年廣東的高考報名時間是2023年11月1日至10日。'
"""