[Tencent Cloud Hands-on Lab: Vector Database] Testing Tencent Cloud VectorDB as a replacement for Milvus in a real project

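Why try replacing the Milvus vector store with Tencent Cloud VectorDB?

Highlight: Tencent Cloud VectorDB has built-in Embedding support, which removes the burden of hosting your own model (standing up a production-grade model takes a great deal of time and effort).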

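What is Tencent Cloud VectorDB?

Tencent Cloud VectorDB is a fully managed, self-developed, enterprise-grade distributed database service dedicated to storing, retrieving, and analyzing multi-dimensional vector data. It supports multiple index types and similarity metrics, handles up to a billion vectors in a single index, and can sustain millions of QPS with millisecond-level query latency. Besides serving as an external knowledge base that improves the accuracy of large-model answers, it can be applied broadly in AI fields such as recommendation systems, NLP services, computer vision, and intelligent customer service.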

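What is Milvus?

Milvus was created in 2019 with a single goal: to store, index, and manage the massive embedding vectors generated by deep neural networks and other machine learning (ML) models. As a database designed specifically for queries over input vectors, it can index vectors at the trillion scale. Unlike existing relational databases, which mainly handle structured data that follows a predefined schema, Milvus is designed from the ground up to handle embedding vectors converted from unstructured data.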

Project showcase

Asking a question in the game
Admin system for the Q&A cache

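Project introduction: the game has an intelligent NPC backed by ChatGPT that players can talk to by voice. She answers game-related questions (the technical question shown here was added to the Q&A cache specifically for this article; in the game itself such questions are refused). To speed up replies and reduce ChatGPT costs, a Q&A caching mechanism was added. It relies on the fact that similar texts produce highly similar vectors: each question goes through a vector search, and if a match scores above a similarity threshold, for example 0.95, the cached answer for that similar question is returned directly and ChatGPT is not called again.

The cache has a second benefit: similar questions can be given a specific, curated answer. In the example above, when the player asks "介紹一下騰訊向量數據庫" ("Tell me about Tencent's vector database"), the reply is returned directly: "Tencent Cloud VectorDB is a fully managed, self-developed, enterprise-grade distributed database service dedicated to storing, retrieving, and analyzing multi-dimensional vector data. It supports multiple index types and similarity metrics, handles up to a billion vectors in a single index, and can sustain millions of QPS with millisecond-level query latency. Besides serving as an external knowledge base that improves the accuracy of large-model answers, it can be applied broadly in AI fields such as recommendation systems, NLP services, computer vision, and intelligent customer service."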
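The cache-first flow described above can be sketched roughly as follows. This is only an illustration, not the project's actual code: `answer_with_cache`, `search_similar`, `ask_llm`, and the `score`/`answer` keys are hypothetical names, and the 0.95 threshold is the example value from the text.

```python
from typing import Callable, Dict, List

SIMILARITY_THRESHOLD = 0.95  # example value mentioned in the text


def answer_with_cache(
    question: str,
    search_similar: Callable[[str], List[Dict]],  # hypothetical: returns hits with "score" and "answer"
    ask_llm: Callable[[str], str],                # hypothetical: calls ChatGPT or another model
    threshold: float = SIMILARITY_THRESHOLD,
) -> Dict[str, str]:
    """Return a cached answer if a similar enough question exists, otherwise ask the model."""
    hits = search_similar(question)
    # Pick the hit with the highest similarity score, if there is one.
    best = max(hits, key=lambda h: h["score"], default=None)
    if best is not None and best["score"] > threshold:
        # Cache hit: the model is not called at all.
        return {"source": "cache", "answer": best["answer"]}
    # Cache miss: fall back to the model.
    return {"source": "llm", "answer": ask_llm(question)}
```

Passing the search and model calls in as functions keeps the decision logic independent of whichever vector database is behind `search_similar`, which is convenient when swapping Milvus for another store.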

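Why use a vector database?

Key point: speed
Similarity matching works on very long arrays. For example, converting text with the bge-base-zh model produces a 768-dimensional float array. Comparing the 768-dimensional vector of a question against the vectors of every cached question, and then picking the few most similar entries, is computationally heavy and very slow.
In a test, comparing one query against 300 768-dimensional vectors and returning the single most similar entry took several seconds. At that rate, running the same computation over thousands or tens of thousands of entries would be unbearable.
This is where a vector database becomes necessary: it can search millions of rows in milliseconds. I previously tested Milvus by inserting 1,000 rows and then 100,000 rows and comparing search times; both returned results within tens of milliseconds, so the larger data volume had almost no effect on retrieval speed.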
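For reference, a brute-force comparison of the kind described above might look like the sketch below. It is not the author's original test code (which is not reproduced in the text); `cosine_similarity` and `brute_force_top_k` are illustrative names. Every query touches every stored vector, so the cost grows linearly with both the number of vectors and the dimension.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def brute_force_top_k(
    query: Sequence[float],
    vectors: Sequence[Sequence[float]],
    k: int = 1,
) -> List[Tuple[int, float]]:
    """Score the query against every vector and return the k best (index, score) pairs."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

An approximate index such as HNSW, which the collection later in this article uses, avoids scanning every vector, which is consistent with the near-constant search times observed above.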

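Where does this project need a vector database?

Player questions: a player's question is first converted to a vector by an embedding model, then similar questions are retrieved from the vector store; if a match meets the criteria, the corresponding answer is returned directly.
Similar-question lookup in the admin system: the admin backend uses vector search to find similar questions so that specific entries can be created, read, updated, or deleted.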

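What are the advantages of Tencent Cloud VectorDB?

Embedding support: Tencent Cloud VectorDB can convert unstructured data into vectors. It currently supports text Embedding models covering several mainstream languages, including but not limited to Chinese and English. For a small project this is a major advantage, because it lowers the cost of hosting your own embedding model or paying for a third-party one.
Simple data types for a FilterIndex's field_type: only String and Uint64 are available, which makes it very easy to use. Milvus, by contrast, supports more than ten data types, which is less friendly to beginners who must first work out how each one is used.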

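From the official documentation, field_type specifies the data type of the Filter field. Possible values:
String: a string. If name is id, this parameter is fixed to FieldType.String.
Uint64: an unsigned integer. The parameter can be set to FieldType.Uint64.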

Exploring Tencent Cloud VectorDB: testing it and wrapping it in a small library, my_tc_vector_db.py

    if __name__ == '__main__':
        # Initialize the client wrapper
        myTcVectorDB = MyTcVectorDB("http://****************.tencentclb.com:30000", "root",
                                    "2epSOV3HK6tiyALo6UqE3mGV**************")
        # Drop the collection, then the database
        myTcVectorDB.drop_collection("db-qa", "question_768")
        myTcVectorDB.drop_database("db-qa")
        # Create the database
        myTcVectorDB.create_database("db-qa")
        # Define the indexes and the embedding, then create the collection
        index = Index(
            FilterIndex(name='id', field_type=FieldType.String, index_type=IndexType.PRIMARY_KEY),
            FilterIndex(name='question', field_type=FieldType.String, index_type=IndexType.FILTER),
            VectorIndex(name='vector', dimension=768, index_type=IndexType.HNSW,
                        metric_type=MetricType.COSINE, params=HNSWParams(m=16, efconstruction=200)))
        embedding = Embedding(vector_field='vector', field='text', model=EmbeddingModel.BGE_BASE_ZH)
        collection = myTcVectorDB.create_collection("db-qa", "question_768", index, embedding)
        # Batch insert
        myTcVectorDB.upsert("db-qa", "question_768", [
            Document(id='0001', text='羅貫中', question='羅貫中'),
            Document(id='0002', text='吳承恩', question='吳承恩'),
            Document(id='0003', text='曹雪芹', question='曹雪芹'),
            Document(id='0004', text='郭富城', question='郭富城')])
        # Single-row inserts
        myTcVectorDB.upsert_one("db-qa", "question_768", id='0005', text='周杰倫', question='周杰倫')
        myTcVectorDB.upsert_one("db-qa", "question_768", id='0006', text='林俊杰', question='林俊杰')
        # Delete document 0003
        myTcVectorDB.delete_by_id("db-qa", "question_768", "0003")
        # Text search (no manual text-to-vector conversion needed)
        text = myTcVectorDB.search_by_text("db-qa", "question_768", "郭富城")
        # Print the results
        print_object(text)
        # Print only the ids
        if len(text[0]) > 0:
            for i in text[0]:
                print(i['id'])

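What the code does:

Initialization: pass in the Tencent Cloud VectorDB url, username, and key to create myTcVectorDB.
Drop the collection question_768 under database db-qa, then drop the database db-qa.
Recreate the database db-qa.
Define the indexes and the embedding, then create the collection question_768: id is the primary key, question is a FilterIndex scalar index, and vector is the VectorIndex vector index (note that the official documentation states the vector index field name is fixed to vector). Because searches are in Chinese, the Embedding model is BGE_BASE_ZH.
Batch-insert test data.
Insert single rows of test data.
Test deleting a single row.
Test text search and print the results.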

MyTcVectorDB library code

    import json

    import tcvectordb
    from tcvectordb.model.collection import Embedding
    from tcvectordb.model.document import Document, SearchParams
    from tcvectordb.model.enum import ReadConsistency, MetricType, FieldType, IndexType, EmbeddingModel
    from tcvectordb.model.index import Index, FilterIndex, VectorIndex, HNSWParams


    class MyTcVectorDB:

        def __init__(self, url: str, username: str, key: str, timeout: int = 30):
            self._client = tcvectordb.VectorDBClient(
                url=url, username=username, key=key,
                read_consistency=ReadConsistency.EVENTUAL_CONSISTENCY, timeout=timeout)

        def create_database(self, database_name: str):
            """
            Create a database
            :param database_name: database name
            :return: database
            """
            return self._client.create_database(database_name=database_name)

        def drop_database(self, database_name: str):
            """
            Drop a database
            :param database_name: database name
            :return: result
            """
            return self._client.drop_database(database_name=database_name)

        def create_collection(self, db_name: str, collection_name: str, index: Index, ebd: Embedding):
            db = self._client.database(db_name)
            # Step 2: create the Collection
            coll = db.create_collection(
                name=collection_name,
                shard=1,
                replicas=0,
                description='this is a collection of question embedding',
                index=index,
                embedding=ebd)
            return coll

        def drop_collection(self, db_name: str, collection_name: str):
            """
            Drop a collection
            :param db_name: db name
            :param collection_name: collection name
            :return: result
            """
            db = self._client.database(db_name)
            return db.drop_collection(collection_name)

        def upsert_one(self, db_name: str, collection_name: str, **kwargs):
            """
            Upsert one document to collection
            :param db_name: db name
            :param collection_name: collection name
            :param kwargs: fields of the Document (e.g. id, text, question)
            :return: result
            """
            db = self._client.database(db_name)
            coll = db.collection(collection_name)
            res = coll.upsert(documents=[Document(**kwargs)])
            return res

        def upsert(self, db_name: str, collection_name: str, documents):
            """
            Upsert documents to collection
            :param db_name: db name
            :param collection_name: collection name
            :param documents: list of Document
            :return: result
            """
            db = self._client.database(db_name)
            coll = db.collection(collection_name)
            res = coll.upsert(documents=documents)
            return res

        def search_by_text(self, db_name: str, collection_name: str, text: str, limit: int = 10):
            """
            Search documents by text
            :param db_name: db name
            :param collection_name: collection name
            :param text: text
            :param limit: maximum number of results
            :return: result
            """
            db = self._client.database(db_name)
            coll = db.collection(collection_name)
            # searchByText returns a Dict. The embedding may be truncated during the query;
            # if so, the response carries a warning, available under the "warning" key.
            # The query results are available under the "documents" key.
            res = coll.searchByText(
                embeddingItems=[text],
                params=SearchParams(ef=200),
                limit=limit)
            return res.get('documents')

        def delete_by_id(self, db_name: str, collection_name: str, document_id):
            """
            Delete document by id
            :param db_name: db name
            :param collection_name: collection name
            :param document_id: document id
            :return: result
            """
            db = self._client.database(db_name)
            coll = db.collection(collection_name)
            res = coll.delete(document_ids=[document_id])
            return res


    def print_object(obj):
        """
        Print object
        """
        for elem in obj:
            # ensure_ascii=False keeps Chinese text readable instead of escaping it
            if hasattr(elem, '__dict__'):
                print(json.dumps(vars(elem), indent=4, ensure_ascii=False))
            else:
                print(json.dumps(elem, indent=4, ensure_ascii=False))

Replacing Milvus with Tencent Cloud VectorDB in the project

1. Create the question database db-qa and the collection question_768

This is essentially the same as the test code.

    # Initialize the client wrapper
    myTcVectorDB = MyTcVectorDB("http://****tencentclb.com:30000", "root",
                                "2epSOV3HK6tiyALo6UqE3mGVMbpP*******")
    # Create the database
    myTcVectorDB.create_database("db-qa")
    # Define the indexes and the embedding, then create the collection
    index = Index(
        FilterIndex(name='id', field_type=FieldType.String, index_type=IndexType.PRIMARY_KEY),
        FilterIndex(name='question', field_type=FieldType.String, index_type=IndexType.FILTER),
        VectorIndex(name='vector', dimension=768, index_type=IndexType.HNSW,
                    metric_type=MetricType.COSINE, params=HNSWParams(m=16, efconstruction=200)))
    embedding = Embedding(vector_field='vector', field='text', model=EmbeddingModel.BGE_BASE_ZH)
    collection = myTcVectorDB.create_collection("db-qa", "question_768", index, embedding)

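2. Text vector search in the game client and the admin backend: replace Milvus with MyTcVectorDB

The code in the two places is essentially the same. The text-to-vector step is removed because Tencent Cloud VectorDB supports Embedding itself.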

    # Previously: convert the question to a vector, then search Milvus
    # success, vector = get_vector_from_text(question)
    # if not success:
    #     return {"code": -1, "id": 0, "answer": "向量計算失敗"}
    # results = questionCollection.search(vector, limit)
    results = myVectorDB.search_by_text("db-qa", "question_768", question, limit)
    ...

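One thing to note about the code above: the search results returned by Tencent Cloud VectorDB are structured differently from Milvus search results, so a small adaptation layer is needed.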
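As a rough illustration of such an adaptation, the sketch below flattens the Tencent Cloud VectorDB result into a plain list of hits. It assumes the shape seen in the test code earlier (a list with one inner list of dicts per query text, each dict carrying `id`, `score`, and the stored `question` field); `adapt_tc_results` and `min_score` are hypothetical names, and the project's real adapter may differ.

```python
from typing import Any, Dict, List, Optional


def adapt_tc_results(
    documents: Optional[List[List[Dict[str, Any]]]],
    min_score: float = 0.0,
) -> List[Dict[str, Any]]:
    """Flatten the hits for the first query text into [{'id', 'score', 'question'}, ...]."""
    # search_by_text may return None or an empty list when nothing is found.
    if not documents or not documents[0]:
        return []
    hits = []
    for doc in documents[0]:
        score = float(doc.get("score", 0.0))
        # Keep only hits at or above the requested similarity.
        if score >= min_score:
            hits.append({
                "id": doc.get("id"),
                "score": score,
                "question": doc.get("question"),
            })
    return hits
```

Calling it as `adapt_tc_results(results, min_score=0.95)` would leave only the hits that qualify as cache matches, so the rest of the game and admin code can stay unaware of which vector store produced them.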

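3. Rebuild the vector database

The Q&A cache data itself is stored in MySQL; the vector database is used mainly for vector search. When switching vector stores, only the vector store needs to be rebuilt. The code below:

Fetches all questions from MySQL.
Iterates over every Q&A entry.
Inserts each entry into the vector store, with the question text as the vector index source and the Q&A id as the scalar index.
The MySQL database currently holds several thousand rows, and rebuilding the vectors takes about 10 minutes.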

    import time


    def rebuild_vector():
        # Fetch all rows
        select_all = qaTable.select_all_qa()
        # Iterate over all rows
        for qa in select_all:
            insertId = qa[0]
            question = qa[1]
            timestamp = int(time.time())
            print(question)
            # Previous Milvus flow: compute the vector, then update it
            # success, vector = get_vector_from_text(question)
            # if not success:
            #     # vector computation failed
            #     logging.error("向量計算失敗,insertId:%s, question:%s", insertId, question)
            #     continue
            # # delete the existing vector
            # questionCollection.delete_question(insertId)
            # # insert the new vector
            # questionCollection.insert_question(insertId, vector, question, timestamp)
            myVectorDB.delete_by_id("db-qa", "question_768", str(insertId))
            myVectorDB.upsert_one("db-qa", "question_768", id=str(insertId), text=question, question=question)
        return "重建向量庫成功"
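The loop above issues one delete and one upsert per row. One possible variation, sketched below under assumptions, groups rows into batches and hands each batch to a single upsert call. `batched`, `rebuild_in_batches`, `upsert_batch`, and the batch size of 100 are illustrative names and values, not part of the project; whether batching shortens the 10-minute rebuild was not measured here and depends on the service's per-request limits.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Sequence


def batched(rows: Iterable[Sequence], size: int) -> Iterator[List[Sequence]]:
    """Yield consecutive lists of at most `size` rows."""
    if size <= 0:
        raise ValueError("size must be positive")
    batch: List[Sequence] = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly shorter, batch.
        yield batch


def rebuild_in_batches(
    rows: Iterable[Sequence],
    upsert_batch: Callable[[List[Dict[str, str]]], None],  # hypothetical sink for one batch
    batch_size: int = 100,
) -> int:
    """Send (id, question) rows to `upsert_batch` in groups; return the number of rows sent."""
    total = 0
    for batch in batched(rows, batch_size):
        # Same field mapping as rebuild_vector: the question is both the text and the filter field.
        docs = [{"id": str(row[0]), "text": row[1], "question": row[1]} for row in batch]
        upsert_batch(docs)
        total += len(docs)
    return total
```

In the project, `upsert_batch` could plausibly be wired to the wrapper shown earlier, for example `lambda docs: myVectorDB.upsert("db-qa", "question_768", [Document(**d) for d in docs])`.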

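4. Update the admin display and check the result

The text-to-vector model in use is BGE_BASE_ZH.
The vector index is: VectorIndex(name='vector', dimension=768, index_type=IndexType.HNSW, metric_type=MetricType.COSINE, params=HNSWParams(m=16, efconstruction=200))
The value returned for each text search hit represents its similarity and is stored in score.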

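Summary:

Tencent Cloud VectorDB is simpler and easier to use than Milvus, and there is no server to deploy yourself.
Tencent Cloud VectorDB supports mainstream Embedding models and text-based vector search out of the box, so there is no need to deploy an Embedding model or to call a separate text-to-vector step. This is very convenient for developers.
For individuals or small projects, Tencent Cloud VectorDB is well worth using. For large projects with budget to spare, it is also highly recommended: it is stable, efficient, and secure.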