LangChain 0.2 - Vector Stores and Retrievers

This article is translated and adapted from: Vector stores and retrievers
https://python.langchain.com/v0.2/docs/tutorials/retrievers/

Contents

- 1. Overview
  - Concepts
- 2. Documents
- 3. Vector stores
  - Examples
- 4. Retrievers
- 5. Learn more

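1. Overview

This tutorial will familiarize you with LangChain's vector store and retriever abstractions.
These abstractions are designed to support retrieval of data from (vector) databases and other sources for integration with LLM workflows.
They are important for applications that fetch data to be reasoned over as part of model inference, as in retrieval-augmented generation, or RAG (see the RAG tutorial).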

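Concepts

This guide focuses on retrieval of text data. We will cover the following concepts:

- Documents;
- Vector stores;
- Retrievers.

For project setup, see Part 2 of the previous article.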

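2. Documents

LangChain implements a Document abstraction, which is intended to represent a unit of text and its associated metadata. It has two attributes:

- page_content: a string representing the content;
- metadata: a dict containing arbitrary metadata.

The metadata attribute can capture information about the source of the document, its relationship to other documents, and other details.
Note that an individual Document object often represents a chunk of a larger document.

Let's generate some sample documents: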

from langchain_core.documents import Document

documents = [
    Document(
        page_content="Dogs are great companions, known for their loyalty and friendliness.",
        metadata={"source": "mammal-pets-doc"},
    ),
    Document(
        page_content="Cats are independent pets that often enjoy their own space.",
        metadata={"source": "mammal-pets-doc"},
    ),
    Document(
        page_content="Goldfish are popular pets for beginners, requiring relatively simple care.",
        metadata={"source": "fish-pets-doc"},
    ),
    Document(
        page_content="Parrots are intelligent birds capable of mimicking human speech.",
        metadata={"source": "bird-pets-doc"},
    ),
    Document(
        page_content="Rabbits are social animals that need plenty of space to hop around.",
        metadata={"source": "mammal-pets-doc"},
    ),
]

API reference:

Document

Here we've generated five documents, containing metadata indicating three distinct "sources".
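Since a single Document often represents a chunk of a larger document, it can help to see how such chunks might be produced. The following is a minimal sketch in plain Python, not LangChain's own splitter: `Doc` is a hypothetical stand-in that mirrors the two attributes described above, and `split_into_docs`, its `chunk_size` parameter, and the `chunk` metadata key are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in mirroring Document's two attributes."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def split_into_docs(text: str, source: str, chunk_size: int = 40) -> list[Doc]:
    """Greedily pack whole words into chunks of up to chunk_size characters.

    A single word longer than chunk_size becomes its own chunk.
    """
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) > chunk_size and current:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    # Every chunk keeps the source and records its position (assumed metadata keys).
    return [
        Doc(page_content=chunk, metadata={"source": source, "chunk": i})
        for i, chunk in enumerate(chunks)
    ]


original = "Dogs are great companions, known for their loyalty and friendliness."
docs = split_into_docs(original, source="mammal-pets-doc")
for d in docs:
    print(d.metadata, repr(d.page_content))
```

Each chunk carries the same source in its metadata, so a retrieved chunk can be traced back to the document it came from.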

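3. Vector stores

Vector search is a common way to store and search over unstructured data (such as unstructured text).
The idea is to store numeric vectors that are associated with the text.
Given a query, we can embed it as a vector of the same dimension and use vector similarity metrics to identify related data in the store.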
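To make the idea concrete, here is a minimal sketch of vector search in plain Python. It is an illustration under stated assumptions, not how LangChain or any vector store is implemented: the bag-of-words `embed` function is a toy substitute for a real embedding model, and `tokenize`, `cosine_similarity`, and `search` are hypothetical helper names.

```python
import math
from collections import Counter

texts = [
    "Dogs are great companions, known for their loyalty and friendliness.",
    "Cats are independent pets that often enjoy their own space.",
    "Goldfish are popular pets for beginners, requiring relatively simple care.",
]


def tokenize(text):
    """Lowercase the text and strip trailing commas and periods from each word."""
    return [w.strip(".,").lower() for w in text.split()]


# The vocabulary fixes the vector dimension shared by documents and queries.
vocab = sorted({w for t in texts for w in tokenize(t)})


def embed(text):
    """Toy embedding (assumption): one word-count entry per vocabulary word."""
    counts = Counter(tokenize(text))
    return [float(counts[w]) for w in vocab]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# The "store": each text kept alongside its vector.
store = [(t, embed(t)) for t in texts]


def search(query, k=2):
    """Embed the query, then return the k texts with the most similar vectors."""
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine_similarity(q, item[1]), reverse=True)
    return [t for t, _ in ranked[:k]]


print(search("independent cats", k=1))
```

A word-count embedding only matches exact words, so the query uses "cats" rather than "cat". Learned embedding models are meant to capture meaning beyond exact word overlap.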

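LangChain VectorStore objects contain methods for adding text and Document objects to the store, and for querying them using various similarity metrics.
They are often initialized with embedding models, which determine how text data is translated to numeric vectors.

LangChain includes a suite of integrations with different vector store technologies.
Some vector stores are hosted by a provider (e.g., various cloud providers) and require specific credentials to use; some (such as Postgres) run in separate infrastructure that can be run locally or via a third party; others can run in-memory for lightweight workloads.
Here we will demonstrate usage of LangChain VectorStores using Chroma, which includes an in-memory implementation.

To instantiate a vector store, we often need to provide an embedding model to specify how text should be converted into a numeric vector.

Here we will use OpenAI embeddings.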

from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

vectorstore = Chroma.from_documents(
    documents,
    embedding=OpenAIEmbeddings(),
)

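API reference:

OpenAIEmbeddings

Calling .from_documents here will add the documents to the vector store.
VectorStore implements methods for adding documents that can also be called after the object is instantiated.
Most implementations will allow you to connect to an existing vector store, e.g., by providing a client, index name, or other information.
See the documentation for a specific integration for more detail.

Once we've instantiated a VectorStore that contains documents, we can query it. VectorStore includes methods for querying:

- Synchronously and asynchronously;
- By string query and by vector;
- With and without returning similarity scores;
- By similarity and by maximum marginal relevance (to balance similarity to the query against diversity in the retrieved results).

These methods will generally include a list of Document objects in their outputs.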
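Of the query modes listed above, maximum marginal relevance (MMR) is the least self-explanatory. The sketch below shows the general greedy selection idea in plain Python: each step picks the candidate that is most relevant to the query while being least similar to what was already selected. The function names, the `lambda_mult` weighting parameter, and the two-dimensional vectors are assumptions for illustration, and this is not presented as the code any particular vector store runs.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Return indices of up to k vectors chosen greedily by marginal relevance.

    lambda_mult=1.0 ranks purely by relevance; lower values favor diversity.
    """
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:

        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy: similarity to the closest already-selected vector.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 0.0]
vectors = [
    [0.9, 0.1],   # most relevant
    [0.9, 0.12],  # near-duplicate of the first
    [0.7, -0.5],  # less relevant but different
]

balanced = mmr(query, vectors, k=2, lambda_mult=0.5)
relevance_only = mmr(query, vectors, k=2, lambda_mult=1.0)
print(balanced, relevance_only)
```

With `lambda_mult=0.5` the near-duplicate is skipped in favor of the more distinct third vector, whereas `lambda_mult=1.0` reduces to plain similarity ranking.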

Examples

Return documents based on similarity to a string query:

vectorstore.similarity_search("cat")

[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'source': 'mammal-pets-doc'}),
 Document(page_content='Dogs are great companions, known for their loyalty and friendliness.', metadata={'source': 'mammal-pets-doc'}),
 Document(page_content='Rabbits are social animals that need plenty of space to hop around.', metadata={'source': 'mammal-pets-doc'}),
 Document(page_content='Parrots are intelligent birds capable of mimicking human speech.', metadata={'source': 'bird-pets-doc'})]

Async query:

await vectorstore.asimilarity_search("cat")

[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'source': 'mammal-pets-doc'}),
 Document(page_content='Dogs are great companions, known for their loyalty and friendliness.', metadata={'source': 'mammal-pets-doc'}),
 Document(page_content='Rabbits are social animals that need plenty of space to hop around.', metadata={'source': 'mammal-pets-doc'}),
 Document(page_content='Parrots are intelligent birds capable of mimicking human speech.', metadata={'source': 'bird-pets-doc'})]

Return scores:

# Note that providers implement different scores; Chroma here
# returns a distance metric that should vary inversely with
# similarity.

vectorstore.similarity_search_with_score("cat")

[(Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'source': 'mammal-pets-doc'}),
  0.3751849830150604),
 (Document(page_content='Dogs are great companions, known for their loyalty and friendliness.', metadata={'source': 'mammal-pets-doc'}),
  0.48316916823387146),
 (Document(page_content='Rabbits are social animals that need plenty of space to hop around.', metadata={'source': 'mammal-pets-doc'}),
  0.49601367115974426),
 (Document(page_content='Parrots are intelligent birds capable of mimicking human speech.', metadata={'source': 'bird-pets-doc'}),
  0.4972994923591614)]

Return documents based on similarity to an embedded query:

embedding = OpenAIEmbeddings().embed_query("cat")

vectorstore.similarity_search_by_vector(embedding)

[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'source': 'mammal-pets-doc'}),
 Document(page_content='Dogs are great companions, known for their loyalty and friendliness.', metadata={'source': 'mammal-pets-doc'}),
 Document(page_content='Rabbits are social animals that need plenty of space to hop around.', metadata={'source': 'mammal-pets-doc'}),
 Document(page_content='Parrots are intelligent birds capable of mimicking human speech.', metadata={'source': 'bird-pets-doc'})]

Learn more:

API reference
How-to guide
Integration-specific docs

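4. Retrievers

LangChain VectorStore objects do not subclass Runnable, and so cannot immediately be integrated into LangChain Expression Language (LCEL) chains.

LangChain Retrievers are Runnables, so they implement a standard set of methods (e.g., synchronous and asynchronous invoke and batch operations) and are designed to be incorporated into LCEL chains.

We can create a simple version of this ourselves, without subclassing Retriever.
Once we choose which method we want to use to retrieve documents, we can easily create a runnable.
Below we will build one around the similarity_search method: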

from typing import List

from langchain_core.documents import Document
from langchain_core.runnables import RunnableLambda

retriever = RunnableLambda(vectorstore.similarity_search).bind(k=1)  # select top result

retriever.batch(["cat", "shark"])

API reference:

Document
RunnableLambda

[[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'source': 'mammal-pets-doc'})],
 [Document(page_content='Goldfish are popular pets for beginners, requiring relatively simple care.', metadata={'source': 'fish-pets-doc'})]]
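To illustrate why wrapping a search method in a runnable is useful, the sketch below builds a much-simplified stand-in in plain Python. `TinyRunnable` and `toy_search` are hypothetical names invented for this example. The real Runnable interface offers far more (async variants, configuration, streaming), so this only shows the basic shape of bind, invoke, batch, and `|` composition.

```python
class TinyRunnable:
    """Hypothetical, heavily simplified stand-in for a runnable wrapper."""

    def __init__(self, func, **bound_kwargs):
        self.func = func
        self.bound_kwargs = bound_kwargs

    def bind(self, **kwargs):
        # Return a new wrapper that always passes these keyword arguments.
        return TinyRunnable(self.func, **{**self.bound_kwargs, **kwargs})

    def invoke(self, value):
        return self.func(value, **self.bound_kwargs)

    def batch(self, values):
        # Sequential for simplicity; one result per input, in order.
        return [self.invoke(v) for v in values]

    def __or__(self, other):
        # Compose: feed this step's output into the next step.
        return TinyRunnable(lambda value: other.invoke(self.invoke(value)))


def toy_search(query, k=4):
    """Toy search (assumption): substring match over a tiny corpus."""
    corpus = ["cats like space", "dogs are loyal", "goldfish are simple", "sharks are fish"]
    return [text for text in corpus if query in text][:k]


toy_retriever = TinyRunnable(toy_search).bind(k=1)
print(toy_retriever.batch(["cat", "shark"]))

count_hits = toy_retriever | TinyRunnable(len)
print(count_hits.invoke("cat"))
```

Because every step exposes the same small interface, steps can be chained without knowing what each one does internally, which is the property that lets retrievers slot into larger chains.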

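VectorStore implements an as_retriever method that generates a Retriever, specifically a VectorStoreRetriever.
These retrievers include specific search_type and search_kwargs attributes that identify which methods of the underlying vector store to call and how to parameterize them.
For instance, we can replicate the above with the following: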

retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 1},
)

retriever.batch(["cat", "shark"])

[[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'source': 'mammal-pets-doc'})],
 [Document(page_content='Goldfish are popular pets for beginners, requiring relatively simple care.', metadata={'source': 'fish-pets-doc'})]]

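VectorStoreRetriever supports search types of "similarity" (the default), "mmr" (maximum marginal relevance, described above), and "similarity_score_threshold".
We can use the latter to threshold the documents output by the retriever by similarity score.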
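The thresholding idea can be sketched in a few lines of plain Python. This is an illustration only: `filter_by_threshold` is a hypothetical helper, and the scores below are invented values that assume a relevance score where higher means more similar (unlike the distance values returned by Chroma in the earlier example).

```python
def filter_by_threshold(scored_texts, score_threshold):
    """Keep (text, relevance) pairs scoring at least the threshold, best first."""
    kept = [(text, score) for text, score in scored_texts if score >= score_threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Invented relevance scores for illustration; not produced by any real model.
scored = [
    ("Dogs are great companions.", 0.55),
    ("Cats are independent pets.", 0.82),
    ("Parrots are intelligent birds.", 0.31),
]

print(filter_by_threshold(scored, score_threshold=0.5))
```

Unlike a fixed `k`, a threshold can return any number of results, including none when nothing in the store is close enough to the query.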

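Retrievers can easily be incorporated into more complex applications, such as retrieval-augmented generation (RAG) applications that combine a given question with retrieved context into a prompt for an LLM.
Below we show a minimal example.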

pip install -qU langchain-openai

import getpass
import os

os.environ["OPENAI_API_KEY"] = getpass.getpass()

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-3.5-turbo-0125")

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

message = """
Answer this question using the provided context only.

{question}

Context:
{context}
"""

prompt = ChatPromptTemplate.from_messages([("human", message)])

rag_chain = {"context": retriever, "question": RunnablePassthrough()} | prompt | llm

API reference:

ChatPromptTemplate
RunnablePassthrough

response = rag_chain.invoke("tell me about cats")

print(response.content)

Cats are independent pets that often enjoy their own space.
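To see what reaches the model in a chain like this, the sketch below assembles the prompt by hand in plain Python, with no LLM call. It is an approximation under stated assumptions: `toy_retriever` is a hypothetical keyword-overlap retriever, and joining the retrieved texts with blank lines is a formatting choice made here, not something the chain above does for you.

```python
import string

CORPUS = [
    "Dogs are great companions, known for their loyalty and friendliness.",
    "Cats are independent pets that often enjoy their own space.",
]

TEMPLATE = """
Answer this question using the provided context only.

{question}

Context:
{context}
"""


def words(text):
    """Lowercase the text, drop punctuation, and return its set of words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return set(cleaned.split())


def toy_retriever(question, k=1):
    """Hypothetical retriever: rank corpus entries by words shared with the question."""
    ranked = sorted(CORPUS, key=lambda t: len(words(t) & words(question)), reverse=True)
    return ranked[:k]


def build_prompt(question):
    # Mirror the chain's first step: one input fans out to context and question.
    inputs = {"context": toy_retriever(question), "question": question}
    return TEMPLATE.format(
        question=inputs["question"],
        context="\n\n".join(inputs["context"]),
    )


prompt_text = build_prompt("tell me about cats")
print(prompt_text)
```

The printed text contains both the original question and the single retrieved passage, which is why the model's answer above stays within the retrieved context.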

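5. Learn more

Retrieval strategies can be rich and complex. For example:

- We can infer hard rules and filters from a query (e.g., "using documents published after 2020");
- We can return documents that are linked to the retrieved context in some way (e.g., via some document taxonomy);
- We can generate multiple embeddings for each unit of context;
- We can ensemble results from multiple retrievers;
- We can assign weights to documents, e.g., to weigh recent documents higher.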
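As one illustration of combining results from multiple retrievers, the sketch below uses reciprocal rank fusion, a generic rank-merging technique chosen here as an example and not described in the source. The function name, the constant `k=60`, and the two hypothetical result lists are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of document ids; items ranked high in many lists win."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) to the document's score.
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical outputs of two different retrievers for the same query.
keyword_hits = ["cats", "rabbits", "dogs"]
vector_hits = ["dogs", "cats", "parrots"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

Fusion by rank avoids comparing raw scores across retrievers, which matters because, as noted earlier, different providers implement different score scales.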

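The retrievers section of the how-to guides covers these and other built-in retrieval strategies.
It is also straightforward to extend the BaseRetriever class in order to implement custom retrievers.
See the how-to guide: https://python.langchain.com/v0.2/docs/how_to/custom_retriever/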

2024-05-22 (Wed)