在信息檢索和生成式人工智能領域,重排序器是優化初始搜索結果順序的重要工具。重排序器與傳統的嵌入模型不同,它將查詢和文檔作為輸入,并直接返回相似度得分,而不是嵌入。該得分表示輸入查詢和文檔之間的相關性。
重排序器通常在第一階段檢索之后使用,通常通過向量近似最近鄰 (ANN) 技術完成。雖然 ANN 搜索能夠高效地獲取大量潛在相關的結果,但它們并不總是根據結果與查詢的實際語義接近程度來確定優先級。此時,重排序器會通過更深入的上下文分析來優化結果順序,通常會利用 BERT 或其他基于 Transformer 的高級機器學習模型。通過這種方式,重排序器可以顯著提高呈現給用戶的最終結果的準確性和相關性。
PyMilvus 模型庫集成了重排序功能,用于優化初始搜索返回結果的排序。從 Milvus 檢索到最近的嵌入后,您可以利用這些重排序工具來優化搜索結果,從而提高搜索結果的準確率。
Rerank Function | API or Open-sourced |
---|---|
BGE | Open-sourced |
Cross Encoder | Open-sourced |
Voyage | API |
Cohere | API |
Jina AI | API |
示例 1:使用 BGE rerank 函數根據查詢對文檔進行重新排序
在此示例中,我們演示了如何使用基于特定查詢的BGE 重新排序器對搜索結果進行重新排序。
要將重新排序器與PyMilvus 模型庫一起使用,首先安裝 PyMilvus 模型庫以及包含所有必要的重新排序實用程序的模型子包:
pip install pymilvus[model]
# or pip install "pymilvus[model]" for zsh.
要使用 BGE 重新排序器,首先導入BGERerankFunction
類:
from pymilvus.model.reranker import BGERerankFunction
然后,創建一個BGERerankFunction
重新排名的實例:
bge_rf = BGERerankFunction(model_name="BAAI/bge-reranker-v2-m3", # Specify the model name. Defaults to `BAAI/bge-reranker-v2-m3`.device="cpu" # Specify the device to use, e.g., 'cpu' or 'cuda:0'
)
要根據查詢對文檔重新排序,請使用以下代碼:
query = "What event in 1956 marked the official birth of artificial intelligence as a discipline?"documents = ["In 1950, Alan Turing published his seminal paper, 'Computing Machinery and Intelligence,' proposing the Turing Test as a criterion of intelligence, a foundational concept in the philosophy and development of artificial intelligence.","The Dartmouth Conference in 1956 is considered the birthplace of artificial intelligence as a field; here, John McCarthy and others coined the term 'artificial intelligence' and laid out its basic goals.","In 1951, British mathematician and computer scientist Alan Turing also developed the first program designed to play chess, demonstrating an early example of AI in game strategy.","The invention of the Logic Theorist by Allen Newell, Herbert A. Simon, and Cliff Shaw in 1955 marked the creation of the first true AI program, which was capable of solving logic problems, akin to proving mathematical theorems."
]bge_rf(query, documents)
預期輸出類似于以下內容:
[RerankResult(text="The Dartmouth Conference in 1956 is considered the birthplace of artificial intelligence as a field; here, John McCarthy and others coined the term 'artificial intelligence' and laid out its basic goals.", score=0.9911615761470803, index=1),RerankResult(text="In 1950, Alan Turing published his seminal paper, 'Computing Machinery and Intelligence,' proposing the Turing Test as a criterion of intelligence, a foundational concept in the philosophy and development of artificial intelligence.", score=0.0326971950177779, index=0),RerankResult(text='The invention of the Logic Theorist by Allen Newell, Herbert A. Simon, and Cliff Shaw in 1955 marked the creation of the first true AI program, which was capable of solving logic problems, akin to proving mathematical theorems.', score=0.006514905766152258, index=3),RerankResult(text='In 1951, British mathematician and computer scientist Alan Turing also developed the first program designed to play chess, demonstrating an early example of AI in game strategy.', score=0.0042116724917325935, index=2)]
示例 2:使用重新排序器增強搜索結果的相關性
在本指南中,我們將探索如何利用search()
PyMilvus 中的方法進行相似性搜索,以及如何使用重排序器增強搜索結果的相關性。我們的演示將使用以下數據集:
entities = [{'doc_id': 0, 'doc_vector': [-0.0372721,0.0101959,...,-0.114994], 'doc_text': "In 1950, Alan Turing published his seminal paper, 'Computing Machinery and Intelligence,' proposing the Turing Test as a criterion of intelligence, a foundational concept in the philosophy and development of artificial intelligence."}, {'doc_id': 1, 'doc_vector': [-0.00308882,-0.0219905,...,-0.00795811], 'doc_text': "The Dartmouth Conference in 1956 is considered the birthplace of artificial intelligence as a field; here, John McCarthy and others coined the term 'artificial intelligence' and laid out its basic goals."}, {'doc_id': 2, 'doc_vector': [0.00945078,0.00397605,...,-0.0286199], 'doc_text': 'In 1951, British mathematician and computer scientist Alan Turing also developed the first program designed to play chess, demonstrating an early example of AI in game strategy.'}, {'doc_id': 3, 'doc_vector': [-0.0391119,-0.00880096,...,-0.0109257], 'doc_text': 'The invention of the Logic Theorist by Allen Newell, Herbert A. Simon, and Cliff Shaw in 1955 marked the creation of the first true AI program, which was capable of solving logic problems, akin to proving mathematical theorems.'}
]
數據集組件:
doc_id
:每個文檔的唯一標識符。doc_vector
:表示文檔的向量嵌入。有關生成嵌入的指導,請參閱嵌入。doc_text
:文檔的文本內容。
準備工作
在啟動相似性搜索之前,您需要與 Milvus 建立連接,創建一個集合,并準備數據并將其插入到該集合中。以下代碼片段演示了這些準備步驟。
from pymilvus import MilvusClient, DataTypeclient = MilvusClient(uri="http://10.102.6.214:19530" # replace with your own Milvus server address
)client.drop_collection('test_collection')# define schemaschema = client.create_schema(auto_id=False, enabel_dynamic_field=True)schema.add_field(field_name="doc_id", datatype=DataType.INT64, is_primary=True, description="document id")
schema.add_field(field_name="doc_vector", datatype=DataType.FLOAT_VECTOR, dim=384, description="document vector")
schema.add_field(field_name="doc_text", datatype=DataType.VARCHAR, max_length=65535, description="document text")# define index paramsindex_params = client.prepare_index_params()index_params.add_index(field_name="doc_vector", index_type="IVF_FLAT", metric_type="IP", params={"nlist": 128})# create collectionclient.create_collection(collection_name="test_collection", schema=schema, index_params=index_params)# insert data into collectionclient.insert(collection_name="test_collection", data=entities)# Output:
# {'insert_count': 4, 'ids': [0, 1, 2, 3]}
進行相似性搜索
數據插入后,使用該方法進行相似性搜索search
。
# search results based on our queryres = client.search(collection_name="test_collection",data=[[-0.045217834, 0.035171617, ..., -0.025117004]], # replace with your query vectorlimit=3,output_fields=["doc_id", "doc_text"]
)for i in res[0]:print(f'distance: {i["distance"]}')print(f'doc_text: {i["entity"]["doc_text"]}')
預期輸出類似于以下內容:
distance: 0.7235960960388184
doc_text: The Dartmouth Conference in 1956 is considered the birthplace of artificial intelligence as a field; here, John McCarthy and others coined the term 'artificial intelligence' and laid out its basic goals.
distance: 0.6269873976707458
doc_text: In 1950, Alan Turing published his seminal paper, 'Computing Machinery and Intelligence,' proposing the Turing Test as a criterion of intelligence, a foundational concept in the philosophy and development of artificial intelligence.
distance: 0.5340118408203125
doc_text: The invention of the Logic Theorist by Allen Newell, Herbert A. Simon, and Cliff Shaw in 1955 marked the creation of the first true AI program, which was capable of solving logic problems, akin to proving mathematical theorems.
使用重新排序器來增強搜索結果
然后,通過重新排序步驟來提高搜索結果的相關性。在本例中,我們使用CrossEncoderRerankFunction
內置的 PyMilvus 對結果進行重新排序,以提高準確率。
# use reranker to rerank search resultsfrom pymilvus.model.reranker import CrossEncoderRerankFunctionce_rf = CrossEncoderRerankFunction(model_name="cross-encoder/ms-marco-MiniLM-L-6-v2", # Specify the model name.device="cpu" # Specify the device to use, e.g., 'cpu' or 'cuda:0'
)reranked_results = ce_rf(query='What event in 1956 marked the official birth of artificial intelligence as a discipline?',documents=["In 1950, Alan Turing published his seminal paper, 'Computing Machinery and Intelligence,' proposing the Turing Test as a criterion of intelligence, a foundational concept in the philosophy and development of artificial intelligence.","The Dartmouth Conference in 1956 is considered the birthplace of artificial intelligence as a field; here, John McCarthy and others coined the term 'artificial intelligence' and laid out its basic goals.","In 1951, British mathematician and computer scientist Alan Turing also developed the first program designed to play chess, demonstrating an early example of AI in game strategy.","The invention of the Logic Theorist by Allen Newell, Herbert A. Simon, and Cliff Shaw in 1955 marked the creation of the first true AI program, which was capable of solving logic problems, akin to proving mathematical theorems."],top_k=3
)# print the reranked results
for result in reranked_results:print(f'score: {result.score}')print(f'doc_text: {result.text}')
預期輸出類似于以下內容:
score: 6.250532627105713
doc_text: The Dartmouth Conference in 1956 is considered the birthplace of artificial intelligence as a field; here, John McCarthy and others coined the term 'artificial intelligence' and laid out its basic goals.
score: -2.9546022415161133
doc_text: In 1950, Alan Turing published his seminal paper, 'Computing Machinery and Intelligence,' proposing the Turing Test as a criterion of intelligence, a foundational concept in the philosophy and development of artificial intelligence.
score: -4.771512031555176
doc_text: The invention of the Logic Theorist by Allen Newell, Herbert A. Simon, and Cliff Shaw in 1955 marked the creation of the first true AI program, which was capable of solving logic problems, akin to proving mathematical theorems.