LangChain from Beginner to Mastery (32): RAG Optimization Strategies (8): Dynamic Data Filtering with the Self-Query Retriever

1. Query Construction and the Self-Query Retriever

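In RAG application development, the earlier optimization cases (generating sub-queries, decomposing the question, generating hypothetical documents) all ended up running retrieval against external data with fixed conditions, that is, a plain similarity search with no additional filtering.
In some cases, however, the user's original question implicitly carries a filter condition. For example:
Please put together a summary of AI news from the whole of 2023.
A vector database similarity search for this question should really carry a filter condition, namely year=2023. A plain similarity search does not take the year 2023 into account: no metadata filter is applied, and data from 2022 and 2023 sit very close together in the high-dimensional space, so there is a good chance that data from other years is retrieved as well.
So is there a strategy that builds the corresponding metadata filter from the user's original question? Running the search with that filter not only narrows the retrieval scope but also improves accuracy. This idea is called query construction, or self-querying.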
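
The effect can be reproduced with a toy example. The sketch below is only an illustration: the three-dimensional vectors, the `DOCS` list and the `search` helper are made up here, whereas a real vector database works with learned embeddings and an index. It shows the closest match coming from the wrong year until a metadata filter is applied.

```python
import math

# Toy corpus: (text, embedding, metadata). The vectors are hand-made for illustration.
DOCS = [
    ("AI news roundup 2022", [0.90, 0.10, 0.00], {"year": 2022}),
    ("AI news roundup 2023", [0.88, 0.12, 0.00], {"year": 2023}),
    ("Football results 2023", [0.00, 0.20, 0.95], {"year": 2023}),
]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def search(query_vec, k=1, metadata_filter=None):
    """Return the k most similar texts, optionally restricted by exact-match metadata."""
    candidates = [
        (text, vec)
        for text, vec, meta in DOCS
        if metadata_filter is None
        or all(meta.get(key) == value for key, value in metadata_filter.items())
    ]
    ranked = sorted(candidates, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


query = [0.90, 0.10, 0.00]  # pretend embedding of "AI news in 2023"
unfiltered = search(query)
filtered = search(query, metadata_filter={"year": 2023})
print(unfiltered)  # the 2022 document wins on similarity alone
print(filtered)    # the filter restricts the candidates to 2023
```
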
The same technique carries over from vector databases to relational and graph databases, namely:

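Relational database self-query: use an LLM to convert natural language into SQL filter statements.
Graph database self-query: use an LLM to convert natural language into graph query statements.
Vector database self-query: use an LLM to convert natural language into a metadata filter / vector retriever.
This is where the concept of query construction comes from. Not all data supports query construction, though: it depends on whether the stored Document objects carry metadata, whether the database type supports filtering, and whether LangChain provides a dedicated wrapper (without one, implementing it yourself is fairly difficult).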
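
As a rough illustration of how one extracted condition maps to different backends, the sketch below renders the same condition as a SQL WHERE fragment and as a MongoDB-style metadata filter, the dialect that several vector stores accept. The `Comparison` type and both helper functions are hypothetical names introduced for this example, and the graph database case is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Comparison:
    """One structured condition, e.g. year == 2023 (hypothetical helper type)."""
    attribute: str
    comparator: str  # one of: eq, gt, gte, lt, lte
    value: object


SQL_OPS = {"eq": "=", "gt": ">", "gte": ">=", "lt": "<", "lte": "<="}


def to_sql_where(cond):
    """Render as a parameterized SQL WHERE fragment plus its parameter list.

    The attribute name is interpolated directly, so real code should first
    check it against a whitelist of known columns.
    """
    return f"{cond.attribute} {SQL_OPS[cond.comparator]} ?", [cond.value]


def to_metadata_filter(cond):
    """Render as a MongoDB-style metadata filter dictionary."""
    return {cond.attribute: {f"${cond.comparator}": cond.value}}


cond = Comparison("year", "eq", 2023)
sql_where = to_sql_where(cond)
vector_filter = to_metadata_filter(cond)
print(sql_where)
print(vector_filter)
```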

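Taken on its own, the query construction step has a simple flow, although the underlying operations are tedious: the LLM turns the question into a structured query, a translator converts that query into a filter the database understands, and the search runs with that filter attached.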

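LangChain wraps self-query support for some frequently used vector databases in SelfQueryRetriever. There is no need to build the conversion statements and the parsing yourself; wrapping the vector store with this class is enough.
Usage examples for every vector database that supports the self-query retriever can be found at: https://imooc-langchain.shortvar.com/docs/integrations/retrievers/self_query/ (vector databases are updated very frequently, so some of the vector databases wrapped by LangChain have already moved on while the logic inside SelfQueryRetriever has not yet caught up).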

SelfQueryRetriever is also simple to use. Taking the Pinecone vector database as an example, first install the required dependency:

pip install --upgrade --quiet lark

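Define the documents with their metadata, the metadata fields that support filtering, the vector store to wrap, and a description of the document contents, and the retriever can be wrapped quickly. Example code: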

import dotenv
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers import SelfQueryRetriever
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_pinecone import PineconeVectorStore

dotenv.load_dotenv()

# 1. Build the document list and upload it to the vector database
documents = [
    Document(page_content="The Shawshank Redemption", metadata={"year": 1994, "rating": 9.7, "director": "Frank Darabont"}),
    Document(page_content="Farewell My Concubine", metadata={"year": 1993, "rating": 9.6, "director": "Chen Kaige"}),
    Document(page_content="Forrest Gump", metadata={"year": 1994, "rating": 9.5, "director": "Robert Zemeckis"}),
    Document(page_content="Titanic", metadata={"year": 1997, "rating": 9.5, "director": "James Cameron"}),
    Document(page_content="Spirited Away", metadata={"year": 2001, "rating": 9.4, "director": "Hayao Miyazaki"}),
    Document(page_content="Interstellar", metadata={"year": 2014, "rating": 9.4, "director": "Christopher Nolan"}),
    Document(page_content="Hachi: A Dog's Tale", metadata={"year": 2009, "rating": 9.4, "director": "Lasse Hallström"}),
    Document(page_content="3 Idiots", metadata={"year": 2009, "rating": 9.2, "director": "Rajkumar Hirani"}),
    Document(page_content="Zootopia", metadata={"year": 2016, "rating": 9.2, "director": "Byron Howard"}),
    Document(page_content="Infernal Affairs", metadata={"year": 2002, "rating": 9.3, "director": "Andrew Lau"}),
]
db = PineconeVectorStore(
    index_name="llmops",
    embedding=OpenAIEmbeddings(model="text-embedding-3-small"),
    namespace="dataset",
    text_key="text",
)
db.add_documents(documents)

# 2. Describe the metadata fields available for self-querying
metadata_field_info = [
    AttributeInfo(name="year", description="Release year of the movie", type="integer"),
    AttributeInfo(name="rating", description="Rating of the movie", type="float"),
    AttributeInfo(name="director", description="Director of the movie", type="string"),
]

# 3. Create the self-query retriever
self_query_retriever = SelfQueryRetriever.from_llm(
    llm=ChatOpenAI(model="gpt-3.5-turbo-16k", temperature=0),
    vectorstore=db,
    document_contents="Title of the movie",
    metadata_field_info=metadata_field_info,
    enable_limit=True,
)

# 4. Retrieval example
search_docs = self_query_retriever.invoke("Find movies rated higher than 9.5")
print(search_docs)
print(len(search_docs))

Output:
[Document(metadata={'director': 'Chen Kaige', 'rating': 9.6, 'year': 1993.0}, page_content='Farewell My Concubine'), Document(metadata={'director': 'Frank Darabont', 'rating': 9.7, 'year': 1994.0}, page_content='The Shawshank Redemption')]

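The self-query retriever works relatively well for specialized agents that target a specific domain (and relatively poorly for general-purpose agents), because documents in those domains tend to be fairly standardized, for example financial reports, news, self-media articles, and education and training material. Data in these industries can be broken down into common metadata fields that support filtering, and the self-query retriever can abstract the corresponding retrieval fields from them.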

2. How the Self-Query Retriever Works and How the Idea Extends

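In LangChain, anything that calls a third-party service or a local custom tool, such as the self-query retriever or retriever routing logic, works the same way underneath: a preset prompt generates content that follows a given rule (a string or JSON), a parser parses the generated content, and the parsed structured content is then used to call a specific interface, service, or local function.
For example, inside the self-query retriever, a FewShotPromptTemplate combined with function calling / structured output first generates a query statement that follows specific rules. The prompt code is as follows: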

DEFAULT_SCHEMA = """\
<< Structured Request Schema >>
When responding use a markdown code snippet with a JSON object formatted in the following schema:

\```json
{{{{
    "query": string \\ text string to compare to document contents
    "filter": string \\ logical condition statement for filtering documents
}}}}
\```

The query string should contain only text that is expected to match the contents of documents. Any conditions in the filter should not be mentioned in the query as well.

A logical condition statement is composed of one or more comparison and logical operation statements.

A comparison statement takes the form: `comp(attr, val)`:
- `comp` ({allowed_comparators}): comparator
- `attr` (string):  name of attribute to apply the comparison to
- `val` (string): is the comparison value

A logical operation statement takes the form `op(statement1, statement2, ...)`:
- `op` ({allowed_operators}): logical operator
- `statement1`, `statement2`, ... (comparison statements or logical operation statements): one or more statements to apply the operation to

Make sure that you only use the comparators and logical operators listed above and no others.
Make sure that filters only refer to attributes that exist in the data source.
Make sure that filters only use the attributed names with its function names if there are functions applied on them.
Make sure that filters only use format `YYYY-MM-DD` when handling date data typed values.
Make sure that filters take into account the descriptions of attributes and only make comparisons that are feasible given the type of data being stored.
Make sure that filters are only used as needed. If there are no filters that should be applied return "NO_FILTER" for the filter value.\
"""

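The quadruple braces in the schema look odd at first. A likely explanation, assuming the template goes through two rounds of Python-style formatting (once to fill in the allowed comparators and operators, and once more when the surrounding few-shot prompt is rendered), is that each round halves the escaped braces. The shortened `template` string below is a made-up stand-in for the real schema, used only to show that behaviour of `str.format`.

```python
# Hypothetical, shortened stand-in for the schema template.
template = '{{{{"query": string, "filter": string}}}} uses {allowed_comparators}'

# First pass fills in the comparators; every "{{{{" collapses to "{{".
first_pass = template.format(allowed_comparators="eq | gt | lt")

# Second pass (the outer prompt being rendered) collapses "{{" to a literal "{".
second_pass = first_pass.format()

print(first_pass)
print(second_pass)
```
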
The original question is:

Find movies rated higher than 9.5

The generated query statement reads:

{
    "query": "",
    "filter": "gt(\"rating\", 9.5)"
}
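
The filter value is a small expression language of the form `comp(attr, val)` and `op(statement1, ...)`, so it has to be parsed before it can be used. The sketch below is a hand-written recursive parser written for illustration only; it is not LangChain's own parser (which is why the lark dependency is installed), it assumes the model returns bare JSON without a surrounding markdown fence, and the function names and the nested `(name, args)` tuple representation are choices made for this example.

```python
import json
import re

TOKEN = re.compile(
    r'\s*(?:(?P<name>[A-Za-z_]\w*)'   # operator or comparator name
    r'|(?P<num>-?\d+(?:\.\d+)?)'       # integer or decimal literal
    r'|"(?P<str>[^"]*)"'               # double-quoted string literal
    r'|(?P<punct>[(),]))'              # parentheses and commas
)


def tokenize(text):
    """Split a filter expression into (kind, value) tokens."""
    text, pos, tokens = text.strip(), 0, []
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            raise ValueError(f"unexpected input at position {pos}")
        kind = match.lastgroup
        tokens.append((kind, match.group(kind)))
        pos = match.end()
    return tokens


def parse(tokens):
    """Parse `name(arg, ...)` into a nested (name, [args]) tuple; literals become Python values."""
    if not tokens:
        raise ValueError("unexpected end of input")
    kind, value = tokens.pop(0)
    if kind == "num":
        return float(value) if "." in value else int(value)
    if kind == "str":
        return value
    if kind != "name":
        raise ValueError(f"unexpected token {value!r}")
    if not tokens or tokens.pop(0) != ("punct", "("):
        raise ValueError(f"expected '(' after {value!r}")
    args = []
    while True:
        args.append(parse(tokens))
        separator = tokens.pop(0) if tokens else None
        if separator == ("punct", ")"):
            return (value, args)
        if separator != ("punct", ","):
            raise ValueError("expected ',' or ')'")


def parse_structured_query(llm_output):
    """Return (query text, parsed filter or None) from the model's JSON output."""
    payload = json.loads(llm_output)
    filter_text = payload["filter"]
    parsed = None if filter_text == "NO_FILTER" else parse(tokenize(filter_text))
    return payload["query"], parsed


llm_output = '{"query": "", "filter": "gt(\\"rating\\", 9.5)"}'
query, parsed_filter = parse_structured_query(llm_output)
print(repr(query), parsed_filter)
```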

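Next, a dedicated translator converts the generated query statement into a filter suited to the vector database, and that filter is passed as a parameter at retrieval time, which completes the whole self-query construction process. The translators for different vector databases differ considerably.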
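
A minimal sketch of such a translator follows, assuming the filter has already been parsed into nested `(name, args)` tuples (a representation chosen for this example) and that the target store accepts MongoDB-style operators such as `$gt` and `$and`, as Pinecone's metadata filtering does. The function name and the supported operator sets are illustrative, not LangChain's actual translator classes.

```python
COMPARATORS = {"eq", "ne", "gt", "gte", "lt", "lte"}
OPERATORS = {"and", "or"}


def to_mongo_style(node):
    """Translate a parsed (name, args) tuple into a MongoDB-style metadata filter."""
    name, args = node
    if name in COMPARATORS:
        attribute, value = args
        return {attribute: {f"${name}": value}}
    if name in OPERATORS:
        # Logical operators wrap the translated sub-statements in a list.
        return {f"${name}": [to_mongo_style(arg) for arg in args]}
    raise ValueError(f"unsupported operation: {name}")


simple = to_mongo_style(("gt", ["rating", 9.5]))
combined = to_mongo_style(("and", [("gt", ["rating", 9.0]), ("eq", ["year", 1994])]))
print(simple)
print(combined)
```

A store with a different filter dialect would need its own version of this function, which is why the translators vary so much between databases.
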
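This idea already touches on quickly connecting an LLM to a company's own applications and APIs to make them intelligent, that is, on how to feed the text generated by an LLM into the existing business system.
For example, suppose a company has its own PPT generation API that produces a PPT from a set of parameters, with a parameter format like this: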

[
    {
        "page": 1,  # page number in the PPT
        "background": {
            "size": [400, 600],  # background image size
            "position": [0, 0],  # background image position
            "image_url": "xxx",  # image URL
            ...
        },
        "objects": [
            {
                "type": "title",  # object type
                "attribute": {
                    "content": "Stay hungry, stay foolish",  # title text
                    "size": 20,  # title font size
                    "color": "#000000",  # title color
                    "position": [240, 128],  # title position
                    "font": "Microsoft YaHei",  # title font
                    ...
                }
            },
            ...
        ],
        ...
    }
]

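To build an automatic PPT generator with an LLM, you only need to design a prompt that makes the LLM produce, according to specific rules, a set of parameters describing the PPT. You then parse those parameters and pass them to the existing, non-intelligent PPT generation tool, which quickly gives you a natural language -> PPT pipeline.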
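
The parse-and-hand-off step can be sketched as follows. Everything here is hypothetical: `render_ppt` stands in for the company's existing API, the required keys are taken from the parameter format shown earlier, and the `llm_output` string is a hand-written sample rather than real model output.

```python
import json

REQUIRED_PAGE_KEYS = {"page", "background", "objects"}


def parse_slides(llm_output):
    """Parse and minimally validate the slide parameters produced by the LLM."""
    slides = json.loads(llm_output)
    if not isinstance(slides, list):
        raise ValueError("expected a list of pages")
    for slide in slides:
        missing = REQUIRED_PAGE_KEYS - set(slide)
        if missing:
            raise ValueError(f"page is missing keys: {sorted(missing)}")
    return slides


def render_ppt(slides):
    """Hypothetical stand-in for the existing PPT generation API."""
    return [f"page {s['page']}: {len(s['objects'])} object(s)" for s in slides]


# Hand-written sample of what the LLM might return for a one-page deck.
llm_output = """
[{"page": 1,
  "background": {"size": [400, 600], "position": [0, 0], "image_url": "xxx"},
  "objects": [{"type": "title",
               "attribute": {"content": "Stay hungry, stay foolish", "size": 20}}]}]
"""
summary = render_ppt(parse_slides(llm_output))
print(summary)
```

Validating before calling the downstream tool matters because the model's output is only text: a missing key is cheaper to catch here than inside the rendering service.
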
Of course, developing this kind of application involves many other considerations, for example: