Topic: searching for documents with similar text
Question:
I have a document with three attributes: tags, location, and text.
Currently, I am indexing all of them using LangChain/pgvector/embeddings.
I am getting satisfactory results, but I want to know whether there is a better way: I want to find one or more documents with a specific tag and location, but the text can vary drastically while still meaning the same thing. That is why I considered using embeddings/vector databases.
Would it also be a case of using RAG (Retrieval-Augmented Generation) to "teach" the LLM about some common abbreviations that it doesn't know?
import pandas as pd
from langchain_core.documents import Document
from langchain_postgres import PGVector
from langchain_openai.embeddings import OpenAIEmbeddings

connection = "postgresql+psycopg://langchain:langchain@localhost:5432/langchain"
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
collection_name = "notas_v0"

vectorstore = PGVector(
    embeddings=embeddings,
    collection_name=collection_name,
    connection=connection,
    use_jsonb=True,
)

### BEGIN INDEX
# df = pd.read_csv("notes.csv")
# df = df.dropna()  # .head(10000)
# df["tags"] = df["tags"].apply(
#     lambda x: [tag.strip() for tag in x.split(",") if tag.strip()]
# )

# long_texts = df["Texto Longo"].tolist()
# wc = df["Centro Trabalho Responsável"].tolist()
# notes = df["Nota"].tolist()
# tags = df["tags"].tolist()

# documents = list(
#     map(
#         lambda x: Document(
#             page_content=x[0], metadata={"wc": x[1], "note": x[2], "tags": x[3]}
#         ),
#         zip(long_texts, wc, notes, tags),
#     )
# )

# print(
#     [
#         vectorstore.add_documents(documents=documents[i : i + 100])
#         for i in range(0, len(documents), 100)
#     ]
# )
# print("Done.")
### END INDEX

### BEGIN QUERY
result = vectorstore.similarity_search_with_relevance_scores(
    "EVTD202301222707",
    filter={"note": {"$in": ["15310116"]}, "tags": {"$in": ["abcd", "xyz"]}},
    k=10,  # limit of results
)
### END QUERY
Answer:
There is one primary unknown here: what is the approximate or average number of tokens in the "text" part of your input?
Scenario 1: You do not have a very long input (say, somewhere around 512 tokens)
In this case, to get better results, you can train your own embedding model; please look at my answer here, which has some information about that.
Once you have the right embedding model, you index the corresponding text vectors in your RAG pipeline. There are a couple of other steps that apply to all the scenarios, so I will add them at the end.
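To make the "embed, then index" flow concrete, here is a minimal in-memory sketch. Everything in it is illustrative: the `embed` function is a toy hashed bag-of-words stand-in for a real (or fine-tuned) embedding model, and the cosine-similarity lookup plays the role of the vector store.

```python
import hashlib
import math

DIM = 64  # toy embedding dimensionality

def embed(text: str) -> list[float]:
    # Toy stand-in for a real embedding model: hashed bag-of-words,
    # L2-normalized so that a dot product equals cosine similarity.
    vec = [0.0] * DIM
    for token in text.lower().split():
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

index: list[tuple[list[float], dict]] = []  # (vector, document) pairs

def add_document(text: str, metadata: dict) -> None:
    index.append((embed(text), {"text": text, **metadata}))

def search(query: str, k: int = 3) -> list[dict]:
    scored = [(cosine(embed(query), vec), doc) for vec, doc in index]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [doc for _, doc in scored[:k]]

add_document("pump failure in sector 7", {"tags": ["maintenance"], "location": "plant-a"})
add_document("annual picnic schedule", {"tags": ["hr"], "location": "plant-b"})

# A paraphrased query still ranks the maintenance note first,
# because it shares several tokens with it and none with the other note.
print(search("sector 7 pump broke down", k=1)[0]["text"])
```

A real pipeline would replace `embed` with your trained model and `index`/`search` with the vector store, but the shape of the data flow is the same.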
Scenario 2: You have a very long input per document, say, every "text" input is huge (~8,000 tokens, though this number can be anything). In this case you can leverage symbolic search instead of vector search. Symbolic search, because in any language, text that really means the same thing or shares the same context will surely have a lot of word overlap between the source and target. It would be very rare to find ten pages of text on the same topic that do not share a lot of words.
So, you can leverage symbolic search here, ensemble it with vector-based validators, and use an LLM service that allows long-context prompts: you find some good candidates via symbolic search, then pass them on to the long-context LLM for the remaining parts.
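As a sketch of the symbolic-search idea, the snippet below scores documents by word overlap (Jaccard similarity over token sets). This is a deliberately simplified stand-in for a real symbolic engine such as BM25; the corpus, query, and tokenizer are all illustrative assumptions.

```python
import re

def tokens(text: str) -> set[str]:
    # Crude tokenizer; a real pipeline would add stemming and stopword removal
    # (note below that "bearings" does not match "bearing" without it).
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def overlap_score(query: str, doc: str) -> float:
    # Jaccard similarity: shared tokens / total distinct tokens.
    q, d = tokens(query), tokens(doc)
    return len(q & d) / len(q | d) if q | d else 0.0

corpus = [
    "The centrifugal pump in unit 3 failed due to bearing wear.",
    "Quarterly budget review for the finance department.",
    "Bearing failure caused the unit 3 pump to stop working.",
]

query = "pump stopped because bearings failed in unit 3"
candidates = sorted(corpus, key=lambda d: overlap_score(query, d), reverse=True)

# The top-ranked candidates would then be handed to a long-context LLM
# for the final relevance judgment.
print(candidates[0])
```

Even this naive scorer surfaces the two on-topic documents above the off-topic one; a production setup would use BM25 (e.g. via a search engine) and combine it with the vector-based validators mentioned above.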
Steps applicable to all the scenarios:
1. Your JSON object should also contain "tags" and "location", along with "text" and "vector":
{
  "text": "some text",
  "text_embedding": [...],  # not applicable in symbolic search
  "location": "loc",
  "tags": []
}
2. This way, when you get matches from either vector search or symbolic search, you will further be able to filter or sort based on the other properties, like tags and location.
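A post-filtering step on such metadata can be as simple as the sketch below. The `hits` list is hypothetical data standing in for whatever your vector or symbolic search returns as (score, document) pairs.

```python
# Hypothetical search hits as they might come back from either
# a vector search or a symbolic search, best score first.
hits = [
    (0.91, {"text": "pump failure", "tags": ["maintenance"], "location": "plant-a"}),
    (0.88, {"text": "valve leak", "tags": ["maintenance"], "location": "plant-b"}),
    (0.75, {"text": "pump inspection", "tags": ["inspection"], "location": "plant-a"}),
]

def filter_hits(hits, tag=None, location=None):
    # Keep only hits matching the requested tag and/or location,
    # preserving the original relevance ordering.
    out = []
    for score, doc in hits:
        if tag is not None and tag not in doc["tags"]:
            continue
        if location is not None and doc["location"] != location:
            continue
        out.append((score, doc))
    return out

# Only the 0.91 "pump failure" hit matches both constraints.
print(filter_hits(hits, tag="maintenance", location="plant-a"))
```

With pgvector specifically, the same effect is achieved server-side via the `filter=` argument shown in the question's query code, which is usually preferable to filtering in application code.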
Please comment if you have more doubts!