Summary:
- The document-processing components work well, but test them thoroughly on your own data before relying on them in a real application.
- The vector-database integration is essentially an interface wrapper; choosing the actual vector database is up to you.
- Like LlamaIndex, LangChain provides a rich set of Document Loaders and Text Splitters.
1. Document Loaders
#!pip install pymupdf
from langchain_community.document_loaders import PyMuPDFLoader

loader = PyMuPDFLoader("llama2.pdf")
pages = loader.load_and_split()

print(pages[0].page_content)

Output:
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron∗
Louis Martin†
Kevin Stone†
Peter Albert Amjad Almahairi Yasmine Babaei Nikolay Bashlykov Soumya Batra
Prajjwal Bhargava Shruti Bhosale Dan Bikel Lukas Blecher Cristian Canton Ferrer Moya Chen
Guillem Cucurull David Esiobu Jude Fernandes Jeremy Fu Wenyin Fu Brian Fuller
Cynthia Gao Vedanuj Goswami Naman Goyal Anthony Hartshorn Saghar Hosseini Rui Hou
Hakan Inan Marcin Kardas Viktor Kerkez Madian Khabsa Isabel Kloumann Artem Korenev
Punit Singh Koura Marie-Anne Lachaux Thibaut Lavril Jenya Lee Diana Liskovich
Yinghai Lu Yuning Mao Xavier Martinet Todor Mihaylov Pushkar Mishra
Igor Molybog Yixin Nie Andrew Poulton Jeremy Reizenstein Rashi Rungta Kalyan Saladi
Alan Schelten Ruan Silva Eric Michael Smith Ranjan Subramanian Xiaoqing Ellen Tan Binh Tang
Ross Taylor Adina Williams Jian Xiang Kuan Puxin Xu Zheng Yan Iliyan Zarov Yuchen Zhang
Angela Fan Melanie Kambadur Sharan Narang Aurelien Rodriguez Robert Stojnic
Sergey Edunov
Thomas Scialom∗
GenAI, Meta
Abstract
In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned
large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters.
Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our
models outperform open-source chat models on most benchmarks we tested, and based on
our human evaluations for helpfulness and safety, may be a suitable substitute for closed-
source models. We provide a detailed description of our approach to fine-tuning and safety
improvements of Llama 2-Chat in order to enable the community to build on our work and
contribute to the responsible development of LLMs.
∗Equal contribution, corresponding authors: {tscialom, htouvron}@meta.com
†Second author
Contributions for all the authors can be found in Section A.1.
arXiv:2307.09288v2 [cs.CL] 19 Jul 2023
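Each element of pages is a LangChain Document object that holds the page text in page_content together with metadata about its origin. A minimal sketch of inspecting one page, assuming the loader code above has already run (the exact metadata keys, such as source and page, depend on your installed PyMuPDFLoader version):

# Inspect the structure of a loaded page
first_page = pages[0]
print(type(first_page))                 # langchain_core.documents.Document
print(first_page.metadata)              # e.g. {'source': 'llama2.pdf', 'page': 0, ...}
print(len(first_page.page_content), "characters on the first page")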
2. Text Splitters
#!pip install --upgrade langchain-text-splitters
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,          # maximum number of characters per chunk
    chunk_overlap=100,       # number of overlapping characters between adjacent chunks
    length_function=len,     # measure length with Python's built-in len
    add_start_index=True,    # record each chunk's start offset in the source text
)
paragraphs = text_splitter.create_documents([pages[0].page_content])

for para in paragraphs:
    print(para.page_content)
    print('-------')

Output:
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron∗
Louis Martin†
Kevin Stone†
Peter Albert Amjad Almahairi Yasmine Babaei Nikolay Bashlykov Soumya Batra
-------
Kevin Stone?
Peter Albert Amjad Almahairi Yasmine Babaei Nikolay Bashlykov Soumya Batra
Prajjwal Bhargava Shruti Bhosale Dan Bikel Lukas Blecher Cristian Canton Ferrer Moya Chen
-------
Prajjwal Bhargava Shruti Bhosale Dan Bikel Lukas Blecher Cristian Canton Ferrer Moya Chen
Guillem Cucurull David Esiobu Jude Fernandes Jeremy Fu Wenyin Fu Brian Fuller
-------
Guillem Cucurull David Esiobu Jude Fernandes Jeremy Fu Wenyin Fu Brian Fuller
Cynthia Gao Vedanuj Goswami Naman Goyal Anthony Hartshorn Saghar Hosseini Rui Hou
-------
Cynthia Gao Vedanuj Goswami Naman Goyal Anthony Hartshorn Saghar Hosseini Rui Hou
Hakan Inan Marcin Kardas Viktor Kerkez Madian Khabsa Isabel Kloumann Artem Korenev
-------
Hakan Inan Marcin Kardas Viktor Kerkez Madian Khabsa Isabel Kloumann Artem Korenev
Punit Singh Koura Marie-Anne Lachaux Thibaut Lavril Jenya Lee Diana Liskovich
-------
Punit Singh Koura Marie-Anne Lachaux Thibaut Lavril Jenya Lee Diana Liskovich
Yinghai Lu Yuning Mao Xavier Martinet Todor Mihaylov Pushkar Mishra
-------
Yinghai Lu Yuning Mao Xavier Martinet Todor Mihaylov Pushkar Mishra
Igor Molybog Yixin Nie Andrew Poulton Jeremy Reizenstein Rashi Rungta Kalyan Saladi
-------
Igor Molybog Yixin Nie Andrew Poulton Jeremy Reizenstein Rashi Rungta Kalyan Saladi
Alan Schelten Ruan Silva Eric Michael Smith Ranjan Subramanian Xiaoqing Ellen Tan Binh Tang
-------
Alan Schelten Ruan Silva Eric Michael Smith Ranjan Subramanian Xiaoqing Ellen Tan Binh Tang
Ross Taylor Adina Williams Jian Xiang Kuan Puxin Xu Zheng Yan Iliyan Zarov Yuchen Zhang
-------
Ross Taylor Adina Williams Jian Xiang Kuan Puxin Xu Zheng Yan Iliyan Zarov Yuchen Zhang
Angela Fan Melanie Kambadur Sharan Narang Aurelien Rodriguez Robert Stojnic
Sergey Edunov
Thomas Scialom∗
-------
Sergey Edunov
Thomas Scialom∗
GenAI, Meta
Abstract
In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned
-------
Abstract
In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned
large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters.
-------
large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters.
Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our
-------
Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our
models outperform open-source chat models on most benchmarks we tested, and based on
-------
models outperform open-source chat models on most benchmarks we tested, and based on
our human evaluations for helpfulness and safety, may be a suitable substitute for closed-
-------
our human evaluations for helpfulness and safety, may be a suitable substitute for closed-
source models. We provide a detailed description of our approach to fine-tuning and safety
-------
source models. We provide a detailed description of our approach to fine-tuning and safety
improvements of Llama 2-Chat in order to enable the community to build on our work and
-------
improvements of Llama 2-Chat in order to enable the community to build on our work and
contribute to the responsible development of LLMs.
-------
contribute to the responsible development of LLMs.
∗Equal contribution, corresponding authors: {tscialom, htouvron}@meta.com
†Second author
-------
∗Equal contribution, corresponding authors: {tscialom, htouvron}@meta.com
†Second author
Contributions for all the authors can be found in Section A.1.
arXiv:2307.09288v2 [cs.CL] 19 Jul 2023
-------
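Because add_start_index=True was set, every chunk also records where it begins in the source text. And if you would rather have the chunks inherit the loader's per-page metadata instead of splitting raw strings, the splitter's split_documents method can be applied to the loaded pages directly. A small sketch under those assumptions:

# Each chunk carries its offset in the original text
for para in paragraphs[:3]:
    print(para.metadata)                 # e.g. {'start_index': 0}

# Splitting the Document objects directly preserves the loader's metadata
chunks = text_splitter.split_documents(pages)
print(len(chunks), "chunks;", chunks[0].metadata)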
3. Vector Stores and Vector Retrieval
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_openai import ChatOpenAI
from langchain_community.document_loaders import PyMuPDFLoader

# Load the document
loader = PyMuPDFLoader("llama2.pdf")
pages = loader.load_and_split()

# Split the document
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,          # maximum number of characters per chunk
    chunk_overlap=100,       # number of overlapping characters between adjacent chunks
    length_function=len,     # measure length with Python's built-in len
    add_start_index=True,    # record each chunk's start offset in the source document
)
texts = text_splitter.create_documents(
    [page.page_content for page in pages[:4]]
)

# Build the index:
# each chunk is embedded into a 1536-dimensional vector (the ada model's dimensionality),
# FAISS builds an index structure to speed up similarity search, and the resulting
# db object holds every chunk together with its vector representation.
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
db = FAISS.from_documents(texts, embeddings)

# Retrieve the top-3 results:
# the query text is embedded automatically, the 3 nearest chunks are found by
# similarity search in the vector space, and Document objects carrying the
# original text and metadata are returned.
retriever = db.as_retriever(search_kwargs={"k": 3})

docs = retriever.invoke("llama2有多少參數")  # "How many parameters does Llama 2 have?"
for doc in docs:
    print(doc.page_content)
    print("----")

Output:
but are not releasing.§
2. Llama 2-Chat, a fine-tuned version of Llama 2 that is optimized for dialogue use cases. We release
variants of this model with 7B, 13B, and 70B parameters as well.
We believe that the open release of LLMs, when done safely, will be a net benefit to society. Like all LLMs,
----
Llama 2-Chat, at scales up to 70B parameters. On the series of helpfulness and safety benchmarks we tested,
Llama 2-Chat models generally perform better than existing open-source models. They also appear to
----
large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters.
Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our
models outperform open-source chat models on most benchmarks we tested, and based on
----
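As the summary at the top notes, the vector-store part is essentially an interface wrapper, so persisting the index or swapping in a different store are small changes. A hedged sketch: saving and reloading the FAISS index to avoid re-embedding on every run, plus the equivalent one-liner for Chroma (the folder names are arbitrary; the allow_dangerous_deserialization flag and the chromadb package requirement depend on your installed versions):

# Persist and reload the FAISS index
db.save_local("llama2_faiss_index")
db = FAISS.load_local("llama2_faiss_index", embeddings,
                      allow_dangerous_deserialization=True)

# Swapping the backend: same documents, same embeddings, different store
# from langchain_community.vectorstores import Chroma
# db = Chroma.from_documents(texts, embeddings, persist_directory="llama2_chroma")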
Key parameters:
- chunk_size (suggested 300-1000): controls chunk size, trading off information completeness against retrieval precision
- chunk_overlap (suggested 10%-30% of chunk_size): preserves context across adjacent chunks
- k (suggested 3-10): balances result coverage against noise
How these values play out on a given document is easiest to see empirically, as in the sketch below.
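A quick, purely illustrative comparison of chunk counts for a few candidate chunk_size values (it reuses the pages loaded above, and the resulting numbers depend entirely on your document):

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Compare how many chunks each candidate chunk_size produces on the first pages
for size in (300, 500, 1000):
    splitter = RecursiveCharacterTextSplitter(chunk_size=size, chunk_overlap=int(size * 0.2))
    print(f"chunk_size={size}: {len(splitter.split_documents(pages[:4]))} chunks")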
Potential optimizations:
- Add a preprocessing step to clean up the special characters left over from PDF parsing
- Try a different embedding model (e.g. text-embedding-3-small; see the sketch after this list)
- Combine retrieval with metadata filtering (e.g. restricting the search scope)
- Add a rerank layer to improve result ordering
- Manage conversation history (for multi-turn dialogue scenarios)
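Of these, swapping the embedding model is the cheapest to try: only the OpenAIEmbeddings constructor changes, but the index must be rebuilt because the vector dimensionality differs from ada-002. A minimal sketch, reusing the texts list from section 3 (the model name follows OpenAI's naming at the time of writing):

# Rebuild the index with a newer embedding model
embeddings_small = OpenAIEmbeddings(model="text-embedding-3-small")
db_small = FAISS.from_documents(texts, embeddings_small)
retriever_small = db_small.as_retriever(search_kwargs={"k": 3})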
Notes:
- PDF parsing quality depends on how complex the document's structure is
- If chunk_size is too small, context may be lost
- A real application also needs to handle pagination errors and other exceptions
This pipeline illustrates a typical RAG (Retrieval-Augmented Generation) architecture; connecting an LLM to generate the final answer, as sketched below, completes the question-answering system.
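A minimal sketch of that last step in the LCEL style, assuming the retriever built in section 3 and an OPENAI_API_KEY in the environment (the prompt wording and model name are illustrative choices, not from the original text):

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

prompt = ChatPromptTemplate.from_template(
    "Answer the question based only on the following context:\n"
    "{context}\n\nQuestion: {question}"
)
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

def format_docs(docs):
    # Concatenate the retrieved chunks into a single context string
    return "\n\n".join(doc.page_content for doc in docs)

# Retrieve context, fill the prompt, call the model, and parse the reply to a string
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
print(rag_chain.invoke("llama2有多少參數"))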
For more third-party retrieval integrations, see: third-party retrieval components