Improving Recall Without a Rerank Model: Hybrid Search in PostgreSQL with pgvector and tsearch2 Full-Text Search

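Introduction

In vector search, recall is a key metric: it measures what fraction of the truly relevant results a search actually returns. Raising recall, however, usually comes at the expense of other metrics such as index size or query latency. Hybrid search is one technique for balancing these trade-offs. This article shows how to implement hybrid search in PostgreSQL by combining pgvector with full-text search, and looks at how it affects search results.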

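Prerequisites

You need the pgvector extension and PostgreSQL's tsearch2 full-text search functions (part of core PostgreSQL since version 8.3). Check first that your PostgreSQL installation supports both.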

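What is hybrid search?

Hybrid search combines vector similarity search with other search methods, such as full-text search. It runs several search methods over the same data, ranks the results of each method, and then merges them to determine a final ranking. The goal is to improve the quality of the search results, that is, to raise recall.

Reciprocal Rank Fusion (RRF) is a commonly used scoring method for hybrid search. RRF scores each result by its rank in every result list and sums those scores. The formula is: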

1.0 / (result_search_1_rank + rrf_k) +
1.0 / (result_search_2_rank + rrf_k)

Here rrf_k is a constant that controls the weighting. A smaller rrf_k gives higher-ranked items more weight.
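
To make the effect of rrf_k concrete, here is a minimal Python sketch of the formula above. The function name rrf_score mirrors the SQL helper defined later in this article; top_vs_tenth_ratio and the chosen k values are illustrative assumptions, not part of the original setup.

```python
def rrf_score(rank: int, rrf_k: int = 50) -> float:
    """Reciprocal-rank contribution of one result in one ranked list."""
    return 1.0 / (rank + rrf_k)


def top_vs_tenth_ratio(rrf_k: int) -> float:
    """How many times more a rank-1 hit contributes than a rank-10 hit."""
    return rrf_score(1, rrf_k) / rrf_score(10, rrf_k)


if __name__ == "__main__":
    # Illustrative k values only; pick your own when tuning.
    for k in (1, 10, 50):
        ratio = top_vs_tenth_ratio(k)
        print(f"rrf_k={k}: rank 1 counts {ratio:.2f}x as much as rank 10")
```

With rrf_k=1 a first-place hit is worth 5.5 times a tenth-place hit; with rrf_k=50 the ratio falls to roughly 1.18, so the two result lists are blended much more evenly.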

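Full-text search in PostgreSQL

PostgreSQL offers several text search options: the built-in full-text search (tsearch2), the pg_trgm contrib module, and third-party extensions such as pg_bigm and PGroonga. In this article we use the tsearch2 functions together with a GIN index, and rank results with ts_rank_cd.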

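Example: building hybrid search in PostgreSQL

Data preparation

We use Python's faker library to generate random text and the multi-qa-MiniLM-L6-cos-v1 sentence transformer model to compute the vector embeddings. The Python code is: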

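from faker import Faker
import psycopg
from pgvector.psycopg import register_vector
from sentence_transformers import SentenceTransformer

# Generate 50,000 random sentences to act as product descriptions.
fake = Faker()
sentences = [fake.sentence(nb_words=50) for i in range(0, 50_000)]

# Compute 384-dimensional embeddings for every sentence.
model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')
embeddings = model.encode(sentences)

# The products table (and the vector extension) must already exist.
conn = psycopg.connect(dbname="<YOUR DATABASE>", autocommit=True)
# Register the vector type so that set_types(["text", "vector"]) can resolve it.
register_vector(conn)
cur = conn.cursor()

# Bulk-load descriptions and embeddings with binary COPY.
with cur.copy("COPY products (description, embedding) FROM STDIN WITH (FORMAT BINARY)") as copy:
    copy.set_types(["text", "vector"])
    for content, embedding in zip(sentences, embeddings):
        copy.write_row((content, embedding))

cur.close()
conn.close()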

Database setup

In PostgreSQL, we need to create the extension, the table, the scoring function, and the indexes:

-- Create the extension
CREATE EXTENSION vector;

-- Create the table
CREATE TABLE products (
    id int GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
    description text NOT NULL,
    embedding vector(384) NOT NULL
);

-- Create the RRF scoring function
-- (rank is bigint because the rank() window function returns bigint)
CREATE OR REPLACE FUNCTION rrf_score(rank bigint, rrf_k int DEFAULT 50)
RETURNS numeric
LANGUAGE SQL
IMMUTABLE PARALLEL SAFE
AS $$
    SELECT COALESCE(1.0 / ($1 + $2), 0.0);
$$;

-- Create the full-text search index
CREATE INDEX ON products
    USING GIN (to_tsvector('english', description));

-- Create the vector search index
CREATE INDEX ON products
    USING hnsw(embedding vector_cosine_ops) WITH (ef_construction=256);

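Search implementation

Vector similarity search on its own

-- $1 is the query embedding, passed in as a vector parameter
SELECT id, description, rank() OVER (ORDER BY $1 <=> embedding) AS rank
FROM products
ORDER BY $1 <=> embedding
LIMIT 10;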

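Full-text search on its own

-- The WHERE clause repeats the indexed expression to_tsvector('english', description)
-- so that the GIN index can be used. The one-argument to_tsvector/plainto_tsquery
-- calls inside ts_rank_cd fall back to default_text_search_config.
SELECT
    id,
    description,
    rank() OVER (ORDER BY ts_rank_cd(to_tsvector(description), plainto_tsquery('travel computer')) DESC) AS rank
FROM products
WHERE
    plainto_tsquery('english', 'travel computer') @@ to_tsvector('english', description)
ORDER BY rank
LIMIT 10;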

Hybrid search

SELECT
    searches.id,
    searches.description,
    sum(rrf_score(searches.rank)) AS score
FROM (
    (
        SELECT
            id,
            description,
            rank() OVER (ORDER BY $1 <=> embedding) AS rank
        FROM products
        ORDER BY $1 <=> embedding
        LIMIT 40
    )
    UNION ALL
    (
        SELECT
            id,
            description,
            rank() OVER (ORDER BY ts_rank_cd(to_tsvector(description), plainto_tsquery('travel computer')) DESC) AS rank
        FROM products
        WHERE
            plainto_tsquery('english', 'travel computer') @@ to_tsvector('english', description)
        ORDER BY rank
        LIMIT 40
    )
) searches
GROUP BY searches.id, searches.description
ORDER BY score DESC
LIMIT 10;
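
The fusion step of this query can also be sketched outside the database, which may be handy for unit-testing ranking behaviour before touching SQL. The Python sketch below mirrors the UNION ALL, GROUP BY and sum(rrf_score(rank)) steps of the query above. The function name rrf_fuse and the sample ids are hypothetical, and ties within a list are not modelled the way rank() would handle them.

```python
from collections import defaultdict


def rrf_fuse(ranked_lists, rrf_k=50, limit=10):
    """Fuse several ranked id lists with reciprocal rank fusion.

    Each inner list is ordered best-first; ranks start at 1, like rank() in SQL.
    Returns (id, score) pairs, highest fused score first.
    """
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            # Same formula as the SQL rrf_score helper.
            scores[doc_id] += 1.0 / (rank + rrf_k)
    # Break score ties by id so the output order is deterministic.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return fused[:limit]


if __name__ == "__main__":
    vector_hits = [101, 205, 333]  # hypothetical ids from the vector branch
    text_hits = [205, 404, 101]    # hypothetical ids from the full-text branch
    for doc_id, score in rrf_fuse([vector_hits, text_hits]):
        print(doc_id, round(score, 5))
```

Ids that appear in both lists (205 and 101 here) accumulate two contributions and rise to the top, which is the behaviour the SQL query relies on.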

Performance analysis

EXPLAIN ANALYZE shows that PostgreSQL uses both the vector index and the full-text index for the hybrid query. The execution plan output is:

Limit  (cost=789.66..789.69 rows=10 width=365) (actual time=8.516..8.519 rows=10 loops=1)
  ->  Sort  (cost=789.66..789.86 rows=80 width=365) (actual time=8.515..8.518 rows=10 loops=1)
        Sort Key: (sum(COALESCE((1.0 / (("*SELECT* 1".rank + 50))::numeric), 0.0))) DESC
        Sort Method: top-N heapsort  Memory: 32kB
        ->  GroupAggregate  (cost=785.53..787.93 rows=80 width=365) (actual time=8.435..8.495 rows=79 loops=1)
              Group Key: "*SELECT* 1".id, "*SELECT* 1".description
              ->  Sort  (cost=785.53..785.73 rows=80 width=341) (actual time=8.430..8.436 rows=80 loops=1)
                    Sort Key: "*SELECT* 1".id, "*SELECT* 1".description
                    Sort Method: quicksort  Memory: 53kB
                    ->  Append  (cost=84.60..783.00 rows=80 width=341) (actual time=0.877..8.414 rows=80 loops=1)
                          ->  Subquery Scan on "*SELECT* 1"  (cost=84.60..125.52 rows=40 width=341) (actual time=0.877..0.949 rows=40 loops=1)
                                ->  Limit  (cost=84.60..125.12 rows=40 width=349) (actual time=0.876..0.945 rows=40 loops=1)
                                      ->  WindowAgg  (cost=84.60..50736.60 rows=50000 width=349) (actual time=0.876..0.942 rows=40 loops=1)
                                            ->  Index Scan using products_embeddings_hnsw_idx on products  (cost=84.60..49861.60 rows=50000 width=341) (actual time=0.872..0.919 rows=40 loops=1)
                                                  Order By: (embedding <=> '<redacted>'::vector)
                          ->  Subquery Scan on "*SELECT* 2"  (cost=656.58..657.08 rows=40 width=341) (actual time=7.448..7.458 rows=40 loops=1)
                                ->  Limit  (cost=656.58..656.68 rows=40 width=345) (actual time=7.447..7.453 rows=40 loops=1)
                                      ->  Sort  (cost=656.58..656.89 rows=124 width=345) (actual time=7.447..7.449 rows=40 loops=1)
                                            Sort Key: (rank() OVER (?))
                                            Sort Method: top-N heapsort  Memory: 44kB
                                            ->  WindowAgg  (cost=588.18..652.66 rows=124 width=345) (actual time=7.357..7.419 rows=139 loops=1)
                                                  ->  Sort  (cost=588.18..588.49 rows=124 width=337) (actual time=7.355..7.363 rows=139 loops=1)
                                                        Sort Key: (ts_rank_cd(to_tsvector(products_1.description), plainto_tsquery('travel computer'::text))) DESC
                                                        Sort Method: quicksort  Memory: 79kB
                                                        ->  Bitmap Heap Scan on products products_1  (cost=30.38..583.87 rows=124 width=337) (actual time=0.271..7.323 rows=139 loops=1)
                                                              Recheck Cond: ('''travel'' & ''comput'''::tsquery @@ to_tsvector('english'::regconfig, description))
                                                              Heap Blocks: exact=138
                                                              ->  Bitmap Index Scan on products_description_gin_idx  (cost=0.00..30.35 rows=124 width=0) (actual time=0.186..0.186 rows=139 loops=1)
                                                                    Index Cond: (to_tsvector('english'::regconfig, description) @@ '''travel'' & ''comput'''::tsquery)
Planning Time: 0.193 ms
Execution Time: 8.553 ms

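The plan confirms that both branches are index-driven. The vector branch is an Index Scan on the HNSW index and finishes in under 1 ms. The full-text branch is a Bitmap Index Scan on the GIN index and takes about 7.5 ms, most of it in the Bitmap Heap Scan over the 139 matching rows. The whole query executes in about 8.6 ms.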

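Why this approach instead of a rerank model?

Our large-model application platform is built on the Spring AI framework, and its underlying inference engine is Ollama, so we rely heavily on the inference capabilities Ollama provides. Ollama, however, does not support serving rerank models. To handle re-ranking of recalled results without making deployment any harder, we adopted hybrid retrieval as a simpler alternative.

In addition, our architecture discussions concluded that in environments with limited compute, squeezing as much as possible out of PostgreSQL to improve recall also helps keep costs down.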

Spring AI is also set to offer a similar approach officially

A contributor has already implemented a similar approach in a pull request targeting Spring AI 1.1.x:
https://github.com/spring-projects/spring-ai/pull/1097

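Next steps on our own platform

This article showed how to implement hybrid search in PostgreSQL with pgvector and demonstrated with an example that it works. This is only a starting point, though. Future work includes:

Evaluating hybrid search performance: measure recall, query latency, and queries per second (QPS) on larger datasets, and compare against vector search alone.
Tuning parameters: adjust rrf_k and other parameters to find the best hybrid search strategy.
Exploring other text search methods: analyze how other PostgreSQL text search options (such as pg_trgm and pg_bigm) affect hybrid search results.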
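
For the evaluation item above, one possible starting point is a small recall@k helper that compares the ids a search returns against a known set of relevant ids per query. This is a sketch under assumptions: the function names are hypothetical, the ids are toy values rather than measurements, and how the relevant set is obtained (for example from an exact, non-indexed search or from labelled data) is left open.

```python
def recall_at_k(retrieved_ids, relevant_ids, k=10):
    """Fraction of the relevant ids that appear in the top-k retrieved ids."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(retrieved_ids[:k]) & relevant) / len(relevant)


def mean_recall(runs, k=10):
    """Average recall@k over (retrieved_ids, relevant_ids) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(recall_at_k(got, want, k) for got, want in runs) / len(runs)


if __name__ == "__main__":
    # Toy ids for two queries; replace with real query results and ground truth.
    vector_only = [([1, 2, 3, 4], [1, 9]), ([5, 6, 7, 8], [5, 6])]
    hybrid = [([1, 9, 3, 4], [1, 9]), ([5, 6, 7, 8], [5, 6])]
    print("vector only:", mean_recall(vector_only, k=4))
    print("hybrid:", mean_recall(hybrid, k=4))
```

Running the same helper over vector-only and hybrid result lists for a fixed query set gives a like-for-like recall comparison; latency and QPS need separate timing.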

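Hybrid search offers an effective way to improve the relevance of search results. By combining the strengths of vector similarity search and full-text search, it can raise recall without a significant increase in query latency.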
