Building a Vector Embedding Database
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter
from langchain.docstore.document import Document
from langchain_chroma.vectorstores import Chroma
import vertexai
from vertexai.language_models import TextEmbeddingModel
These imports pull in the modules needed to build an embedding-based retrieval system, combining LangChain, the Chroma vector database, and Google Vertex AI's text embedding model.
TextLoader is imported from the LangChain community package to load plain text from .txt files.
Purpose: load a local text file into LangChain Document objects.
CharacterTextSplitter is one of LangChain's text splitters.
Purpose: split long text into smaller chunks (for example, by character count) for downstream embedding or question-answering retrieval.
Document: represents a document object holding text content plus optional metadata, so it can be passed along and processed throughout the pipeline.
Chroma is one of the vector stores LangChain supports, built on ChromaDB.
Purpose: store embedded text in a database for similarity search and vector retrieval.
TextEmbeddingModel is imported from Vertex AI's language models module.
Purpose: call Google's pretrained embedding model to turn text into embedding vectors, for storage in the vector database or for similarity computations.
# Set GCP project parameters
project_id = ""
location = "us-central1"  # make sure this region supports Gemini
import pandas as pd

books = pd.read_csv("books_cleaned_new.csv")
books["tagged_description"]
0 9780002005883 A NOVEL THAT READERS and critics...
1 9780002261982 A new 'Christie for Christmas' -...
2 9780006178736 A memorable, mesmerizing heroine...
3 9780006280897 Lewis' work on the nature of lov...
4 9780006280934 "In The Problem of Pain, C.S. Le......
5192 9788172235222 On A Train Journey Home To North...
5193 9788173031014 This book tells the tale of a ma...
5194 9788179921623 Wisdom to Create a Life of Passi...
5195 9788185300535 This collection of the timeless ...
5196 9789027712059 Since the three volume edition o...
Name: tagged_description, Length: 5197, dtype: object
books["tagged_description"].to_csv("new_tagged_description.txt", sep="\n", index=False, header=False)
This saves the books["tagged_description"] column to a new text file, new_tagged_description.txt, with one value per line and no index or column name.
sep="\n": elements are separated by newlines, so each value occupies its own line.
index=False: do not write the DataFrame's row index.
header=False: do not write the column name (only the raw text content is saved).
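The effect of these three parameters can be checked on a tiny stand-in frame (the two rows below are hypothetical values, not the real dataset):

```python
import pandas as pd

# Tiny illustrative frame standing in for the real books DataFrame.
books = pd.DataFrame({"tagged_description": [
    "9780000000001 First description",
    "9780000000002 Second description",
]})

# One value per line, no index column, no header row.
books["tagged_description"].to_csv(
    "demo_tagged_description.txt", sep="\n", index=False, header=False
)

with open("demo_tagged_description.txt", encoding="utf-8") as f:
    lines = f.read().splitlines()
# lines == ["9780000000001 First description", "9780000000002 Second description"]
```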
raw_documents = TextLoader("new_tagged_description.txt", encoding="utf-8").load()
text_splitter = CharacterTextSplitter(chunk_size=0, chunk_overlap=0, separator="\n")
documents = text_splitter.split_documents(raw_documents)
encoding="utf-8": read the file as UTF-8.
This creates a text splitter that uses the newline character \n as the separator, breaking the long text into chunks.
chunk_size=0: a special usage that, combined with separator="\n", splits on whole lines rather than at a fixed character count.
chunk_overlap=0: no overlap between chunks.
separator="\n": split on newline characters.
This cuts the loaded document into smaller Document instances, one per line of text.
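The net effect of this configuration can be sketched with the stdlib alone, no LangChain required: with chunk_size=0 and separator="\n", each line effectively becomes its own chunk.

```python
# Stdlib-only stand-in for the line-based split: one chunk per non-empty line.
raw_text = (
    "9780000000001 First description\n"
    "9780000000002 Second description"
)
chunks = [line for line in raw_text.split("\n") if line.strip()]
# chunks[0] == "9780000000001 First description"
```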
Created a chunk of size 1168, which is longer than the specified 0
...
documents[0]
Document(metadata={'source': 'new_tagged_description.txt'}, page_content='9780002005883 A NOVEL THAT READERS and critics have been eagerly anticipating for over a decade, Gilead is an astonishingly imagined story of remarkable lives. John Ames is a preacher, the son of a preacher and the grandson (both maternal and paternal) of preachers. It’s 1956 in Gilead, Iowa, towards the end of the Reverend Ames’s life, and he is absorbed in recording his family’s story, a legacy for the young son he will never see grow up. Haunted by his grandfather’s presence, John tells of the rift between his grandfather and his father: the elder, an angry visionary who fought for the abolitionist cause, and his son, an ardent pacifist. He is troubled, too, by his prodigal namesake, Jack (John Ames) Boughton, his best friend’s lost son who returns to Gilead searching for forgiveness and redemption. Told in John Ames’s joyous, rambling voice that finds beauty, humour and truth in the smallest of life’s details, Gilead is a song of celebration and acceptance of the best and the worst the world has to offer. At its heart is a tale of the sacred bonds between fathers and sons, pitch-perfect in style and story, set to dazzle critics and readers alike.')
vertexai.init(project=project_id, location=location)
embedding_model = TextEmbeddingModel.from_pretrained("text-embedding-005")
Initialize the Vertex AI client so that later calls can reach AI models on Google Cloud.
Vertex AI is Google Cloud's machine learning platform; this line connects to your cloud project and region so you can use models deployed on Vertex AI (Embedding, LLM, AutoML, and so on).
Load a pretrained text embedding model from Vertex AI.
# Embedding function (batched)
def get_gemini_embeddings(texts: list[str]) -> list[list[float]]:
    embeddings = embedding_model.get_embeddings(texts)
    return [e.values for e in embeddings]
This defines a function that generates embedding vectors for a batch of texts, converting several texts into numerical representations (vectors) at once.
The return value is a two-dimensional list: one vector per input text, where each vector is a list of floats (e.g. 768 dimensions).
The list comprehension extracts each embedding object's .values to build the final two-dimensional list.
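The shape contract can be illustrated offline with a fake embedder standing in for embedding_model.get_embeddings (the 4-dimensional vectors below are just for brevity; the real model returns higher-dimensional ones):

```python
# Hypothetical stand-in for embedding_model.get_embeddings: returns objects
# with a .values attribute, one per input text.
def fake_get_embeddings(texts):
    class E:
        def __init__(self, values):
            self.values = values
    return [E([0.0] * 4) for _ in texts]

def get_embeddings_batch(texts):
    # Same list comprehension as in the real helper above.
    return [e.values for e in fake_get_embeddings(texts)]

vectors = get_embeddings_batch(["a", "b", "c"])
# len(vectors) == 3, len(vectors[0]) == 4
```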
# Generate embeddings in batches
BATCH_SIZE = 50
all_texts = [doc.page_content for doc in documents]
all_metadatas = [doc.metadata for doc in documents]

batched_docs = []
batched_vectors = []

for i in range(0, len(all_texts), BATCH_SIZE):
    batch_texts = all_texts[i:i+BATCH_SIZE]
    batch_metadatas = all_metadatas[i:i+BATCH_SIZE]
    batch_vectors = get_gemini_embeddings(batch_texts)
    for text, metadata, vector in zip(batch_texts, batch_metadatas, batch_vectors):
        batched_docs.append(Document(page_content=text, metadata=metadata))
        batched_vectors.append(vector)
When using an embedding model such as Vertex AI's text-embedding-005, performance considerations and API limits mean you cannot process too many texts at once, so embeddings are usually generated in batches.
From documents we extract: page_content, the text content, and metadata, the metadata attached to each text (such as file name or page number).
batched_docs: Document objects holding text and metadata.
batched_vectors: the vector (a list of floats) for each text.
A loop with step BATCH_SIZE processes one batch of texts at a time.
The get_gemini_embeddings function defined earlier generates the embedding vectors.
Each text, its metadata, and its vector are packed into a Document object plus a vector.
Both are stored in the two lists for later use (such as building a vector index).
Example
Suppose you have 120 texts, each of which needs a vector; the code runs like this:
Batch 1: items 0-49 → embed → append to results
Batch 2: items 50-99 → embed → append to results
Batch 3: items 100-119 → embed → append to results
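The batch boundaries for this 120-text example fall directly out of the range() call used in the loop:

```python
# Compute (start, end) index pairs for each batch of 120 texts.
BATCH_SIZE = 50
n_texts = 120
batches = [(i, min(i + BATCH_SIZE, n_texts)) for i in range(0, n_texts, BATCH_SIZE)]
# batches == [(0, 50), (50, 100), (100, 120)]
```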
Use Chroma to create a persistent vector database, store the texts and their embedding vectors in it, and wrap it with LangChain for later question answering or retrieval.
from chromadb import PersistentClient
from langchain_chroma.vectorstores import Chroma

# Create the Chroma client first
client = PersistentClient(path="./new_chroma_books")

# Create the collection for the vector store
collection = client.get_or_create_collection(name="books")

# Insert the data (make sure the number of vectors matches the number of texts)
collection.add(
    documents=[doc.page_content for doc in batched_docs],
    embeddings=batched_vectors,
    metadatas=[doc.metadata for doc in batched_docs],
    ids=[f"doc_{i}" for i in range(len(batched_docs))],
)

# Wrap the vector store with LangChain
db_books = Chroma(
    client=client,
    collection_name="books",
    embedding_function=lambda x: batched_vectors,  # note: this should really be a dynamic embedding function
)
chromadb: the Python client for the Chroma vector database.
langchain_chroma.vectorstores.Chroma: LangChain's Chroma adapter, for integrating the vector store into LangChain.
PersistentClient: creates a persistent local vector database client.
Chroma: LangChain's wrapper class for the vector database.
This creates or opens a local vector database under the ./new_chroma_books path.
All data is saved in that folder and can be loaded again on the next run.
A collection named books is created, similar to a table in a database.
The following is inserted into the books collection:
documents: the raw text content (a list of strings)
embeddings: the vector for each text (a two-dimensional list of floats)
metadatas: the metadata for each text (a list of dicts)
ids: a unique ID for each record, such as doc_0, doc_1, ...
The store is wrapped as a LangChain vector database object, db_books.
The current client and the collection name are passed in.
embedding_function: the embedding function; here a lambda that always returns the fixed batched_vectors is used.
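The fixed lambda works here only because the queries below embed their text manually and call similarity_search_by_vector. A dynamic replacement could look like the sketch below; the adapter class is hypothetical, and embed_fn stands in for get_gemini_embeddings from earlier (LangChain's embedding interface expects embed_documents and embed_query methods):

```python
# Hypothetical adapter turning a batch embedding callable into the
# embed_documents / embed_query pair that LangChain's Chroma wrapper expects.
class VertexEmbeddingsAdapter:
    def __init__(self, embed_fn):
        self.embed_fn = embed_fn  # batch of texts -> list of vectors

    def embed_documents(self, texts):
        return self.embed_fn(texts)

    def embed_query(self, text):
        return self.embed_fn([text])[0]

# Usage with a toy stand-in embedder (2-dimensional vectors for brevity):
adapter = VertexEmbeddingsAdapter(lambda texts: [[0.0, 1.0] for _ in texts])
vec = adapter.embed_query("nature")
# vec == [0.0, 1.0]
```

With this in place, db_books could be built with embedding_function=VertexEmbeddingsAdapter(get_gemini_embeddings), and methods like similarity_search(query) would embed the query automatically.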
Generate an embedding vector from the query string and find the 10 most similar documents in the vector database.
# Example query
query = "A book to teach children about nature"
query_embedding = get_gemini_embeddings([query])[0]
docs = db_books.similarity_search_by_vector(query_embedding, k=10)
The previously defined get_gemini_embeddings function converts the query into a vector (its embedding representation).
get_gemini_embeddings returns a list (it is batched), so [0] takes the first element to get the query's vector.
db_books (the wrapped vector database) performs a similarity search with the query vector.
similarity_search_by_vector(query_embedding, k=10) returns the 10 documents closest to that vector.
The variable docs holds the description text and metadata of the 10 books most relevant to the query.
[Document(id='doc_3751', metadata={'source': 'new_tagged_description.txt'}, page_content='9780786808717 A very special puddle sets Violet the mouse off on her latest nature discovery. It is through this puddle that Violet observes the effect rain has on the world around her. A Mylar puddle on the last page offers children a chance to see their reflection in a puddle, just like Violet!'),Document(id='doc_3747', metadata={'source': 'new_tagged_description.txt'}, page_content='9780786808069 Children will discover the exciting world of their own backyard in this introduction to familiar animals from cats and dogs to bugs and frogs. The combination of photographs, illustrations, and fun facts make this an accessible and delightful learning experience.'),Document(id='doc_442', metadata={'source': 'new_tagged_description.txt'}, page_content='"9780067575208 First published more than three decades ago, this reissue of Rachel Carson\'s award-winning classic brings her unique vision to a new generation of readers. Stunning new photographs by Nick Kelsh beautifully complement Carson\'s intimate account of adventures with her young nephew, Roger, as they enjoy walks along the rocky coast of Maine and through dense forests and open fields, observing wildlife, strange plants, moonlight and storm clouds, and listening to the ""living music"" of insects in the underbrush. ""If a child is to keep alive his inborn sense of wonder."" Writes Carson, ""he needs the companionship of at least one adult who can share it, rediscovering with him the joy, excitement and mystery of the world we live in."" The Sense of Wonder is a refreshing antidote to indifference and a guide to capturing the simple power of discovery that Carson views as essential to life. In her insightful new introduction, Linda Lear remembers Rachel Carson\'s groundbreaking achievements in the context of the legendary environmentalist\'s personal commitment to introducing young and old to the miracles of nature. 
Kelsh\'s lush photographs inspire sensual, tactile reactions: masses of leaves floating in a puddle are just waiting to be scooped up and examined more closely. An image of a narrow path through the trees evokes the earthy scent of the woods after a summer rain. Close-ups of mosses and miniature lichen fantasy-lands will spark innocent\'as well as more jaded\'imaginations. Like a curious child studying things underfoot and within reach, Kelsh\'s camera is drawn to patterns in nature that too often elude hurried adults\'a stand of beech trees in the springtime, patches of melting snow and the ripples from a pebble tossed into a slow-moving stream. The Sense of Wonder is a timeless volume that will be passed on from children to grandchildren, as treasured as the memory of an early-morning walk when the song of a whippoorwill was heard as if for the first time."'),Document(id='doc_3442', metadata={'source': 'new_tagged_description.txt'}, page_content='9780744578263 Washed up on the beach during a storm, the sea-thing child clings fearfully to the shore until he discovers his true destiny. Suggested level: primary.'),Document(id='doc_3797', metadata={'source': 'new_tagged_description.txt'}, page_content='9780789458209 Photographs and text explore the anatomy and life cycle of trees, examining the different kinds of bark, seeds, and leaves, the commercial processing of trees to make lumber, the creatures that live in trees, and other aspects.'),Document(id='doc_1639', metadata={'source': 'new_tagged_description.txt'}, page_content='9780374422080 This Newbery Honor Book tells the story of 11 -year-old Primrose, who lives in a small fishing village in British Columbia. 
She recounts her experiences and all she learns about human nature and the unpredictability of life after her parents are lost at sea.'),Document(id='doc_3750', metadata={'source': 'new_tagged_description.txt'}, page_content="9780786808397 Introduce your baby to birds, cats, dogs, and babies through fine art, illustration, and photographs. These books are a rare opportunity to exopse little ones to a range of images on a single subject, from simple child's drawings and abstract art to playful photos. A brief text accompanies each image, introducing baby to some basic -- and sometimes playful -- information about the subjects."),Document(id='doc_3748', metadata={'source': 'new_tagged_description.txt'}, page_content="9780786808373 Introducing your baby to birds, cats, dogs, and babies through fine art, illsutration and photographs. These books are a rare opportunity to expose little ones to a range of images on a single subject, from simple child's drawings and abstract art to playful photos. A brief text accompanies each image, introducing baby to some basic -- and sometimes playful -- information on the subjects."),Document(id='doc_3522', metadata={'source': 'new_tagged_description.txt'}, page_content='9780753459645 What is a leap year? Why are bees busy in summer? Who eats the moon? Why does it get dark at night? In I Wonder Why the Sun Rises by Brenda Walpole children will find out the answers to these and many more questions about time and seasons.'),Document(id='doc_3749', metadata={'source': 'new_tagged_description.txt'}, page_content="9780786808380 Introduce your babies to birds, cats, dogs, and babies through fine art, illustration, and photographs. These books are a rare opportunity to expose little ones to a range of images on a single subject, from simple child's drawings and abstract art to playful photos. A brief text accompanies each image, introducing the baby to some basic -- and sometimes playful -- information about the subjects.")]
books[books["isbn13"] == int(docs[0].page_content.split()[0].strip())]
Take the text content of the first similar document, split it on whitespace, and take the first token, which is that document's ISBN-13; strip() removes any surrounding whitespace (usually unnecessary for an ISBN, but safe).
Convert the string to an integer (because books["isbn13"] holds integers).
Using the ISBN from docs[0], look up the complete record in the original books table (title, author, rating, category, and so on).
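The ISBN-extraction step can be run on a sample page_content string (the description text is shortened here):

```python
# First whitespace-separated token of page_content is the ISBN-13.
page_content = "9780786808717 A very special puddle sets Violet the mouse off..."
isbn = int(page_content.split()[0].strip())
# isbn == 9780786808717
```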
Given a natural-language query, recommend semantically related books and return a DataFrame containing their details.
def retrieve_semantic_recommendations(
    query: str,
    top_k: int = 10,
) -> pd.DataFrame:
    query_embedding = get_gemini_embeddings([query])[0]
    recs = db_books.similarity_search_by_vector(query_embedding, k=50)
    books_list = []
    for i in range(0, len(recs)):
        books_list += [int(recs[i].page_content.strip('"').split()[0])]
    return books[books["isbn13"].isin(books_list)].head(top_k)
query: the user's natural-language query, e.g. "Books about space exploration for kids".
top_k: the number of books to return.
The Gemini model generates the query's embedding vector for semantic similarity comparison.
The 50 books most similar to that embedding are retrieved from the Chroma vector database built earlier.
recs is a list of Document objects; each Document's page_content is a string like "9780316015844 This book is about…".
The ISBN of each recommended book (the first token of page_content) is extracted, converted to int, and stored in books_list.
Rows whose ISBN appears in books_list are selected from the books DataFrame, returning a table with title, rating, page count, description, and other metadata.
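The final filtering step can be sketched on a toy books frame (the rows below are hypothetical):

```python
import pandas as pd

# Toy stand-in for the real books DataFrame.
books = pd.DataFrame({"isbn13": [101, 102, 103], "title": ["A", "B", "C"]})
books_list = [103, 101]  # ISBNs extracted from the search results
result = books[books["isbn13"].isin(books_list)]
# result keeps the rows with isbn13 101 and 103, in their original order
```

Note that isin() preserves the DataFrame's row order, not the similarity ranking of the search results.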
retrieve_semantic_recommendations("A book to teach children about nature")