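Compiled and translated from the official DeepLearning.AI × LangChain course: Evaluation (source code available)
Evaluating LLM-based applications is a hard problem. This lesson introduces some approaches and tools.
“Moving from traditional development to prompt-based development, and building applications on top of LLMs, means rethinking how the whole workflow is evaluated. This lesson introduces many exciting concepts.”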
## Evaluation
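Build the QA chain introduced in the previous lesson. The only difference is one extra parameter, `chain_type_kwargs`, which specifies a separator placed between documents.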
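The construction can be sketched as follows, assuming the legacy `langchain` 0.0.x API from the period of the course (newer releases have moved or deprecated these classes). The helper name `build_qa_chain` is made up for this sketch, `llm` and `retriever` stand for whatever was built in the previous lesson, the `"stuff"` chain type is assumed because `document_separator` is an option of that chain type, and the separator string is only an illustrative value.

```python
def build_qa_chain(llm, retriever, separator="<<<<>>>>>"):
    """Return a RetrievalQA chain that joins retrieved documents with `separator`.

    Sketch only: assumes the legacy langchain 0.0.x API; `separator` is an
    illustrative value, not one prescribed by the text above.
    """
    # Imported inside the function so the sketch loads without LangChain installed.
    from langchain.chains import RetrievalQA

    return RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",  # "stuff" places all retrieved documents into one prompt
        retriever=retriever,
        verbose=True,
        # Extra keyword arguments forwarded to the underlying "stuff" documents chain.
        chain_type_kwargs={"document_separator": separator},
    )
```

A distinctive separator makes it easier to see where one retrieved document ends and the next begins when reading the prompt in debug output.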
First, take a look at some sample data:
### Hard-coded examples
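The most obvious evaluation method is to build evaluation data by hand and then check whether the LLM's output agrees with the answers given in that data. The drawback is cost: hand-built evaluation data is always expensive to produce.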
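A hand-written set can be as simple as a list of question/answer pairs. The two entries below reuse questions and reference answers that appear in the evaluation output later in this lesson. The key names `query` and `answer` are an assumption of this sketch, chosen to match what a QA chain would typically consume.

```python
# Hand-written evaluation examples: each pairs a query with its expected answer.
# The key names "query" and "answer" are an assumption of this sketch.
examples = [
    {
        "query": "Do the Cozy Comfort Pullover Set have side pockets?",
        "answer": "Yes",
    },
    {
        "query": "What collection is the Ultra-Lofty 850 Stretch Down Hooded Jacket from?",
        "answer": "The DownTek collection",
    },
]

for example in examples:
    print(f"{example['query']} -> {example['answer']}")
```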
### LLM-generated examples
Instead of writing examples by hand, an LLM can generate them. LangChain's `QAGenerationChain` produces QA examples from documents.
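A minimal sketch of how the generation step might look, assuming the legacy `langchain` 0.0.x API. The helper name `generate_qa_examples` is hypothetical, and `llm` and `docs` stand for a chat model and a few loaded documents from the earlier setup.

```python
def generate_qa_examples(llm, docs):
    """Ask an LLM to write one question/answer pair per document.

    Sketch only: assumes the legacy langchain 0.0.x API; the shape of each
    returned item has differed between releases, so inspect it before use.
    """
    # Imported inside the function so the sketch loads without LangChain installed.
    from langchain.evaluation.qa import QAGenerationChain

    gen_chain = QAGenerationChain.from_llm(llm)
    # The chain's prompt expects each input under the "doc" key.
    return gen_chain.apply_and_parse([{"doc": doc} for doc in docs])
```

Generating from only a handful of documents keeps the cost low while still adding variety to the hand-written set.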
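The hand-written and generated examples can be combined into one evaluation set, and the first query can then be run through the chain as a quick check.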
### Manual evaluation
LangChain provides a debug mode, which is turned on by setting the `langchain.debug` flag to `True`.
Run the first query again, and LangChain prints information about every step of the process.
Setting the debug flag back to `False` turns debug mode off.
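In a notebook this is usually done with two plain assignments, `langchain.debug = True` before the run and `langchain.debug = False` after it. The sketch below wraps the same toggle in a context manager so the flag is always restored. It assumes the module-level `langchain.debug` flag of the legacy 0.0.x releases, and the name `langchain_debug` is made up for this example.

```python
from contextlib import contextmanager


@contextmanager
def langchain_debug():
    """Turn LangChain's global debug flag on inside a `with` block.

    Sketch only: assumes the module-level `langchain.debug` flag of the
    legacy 0.0.x releases.
    """
    # Imported lazily so the sketch loads without LangChain installed.
    import langchain

    previous = langchain.debug
    langchain.debug = True
    try:
        yield
    finally:
        # Restore the earlier setting even if the wrapped code raises.
        langchain.debug = previous
```

With this in place, a single query can be traced by running it inside `with langchain_debug():`, and later calls go back to normal output.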
### LLM-assisted evaluation
Because today's LLMs are already quite capable, an LLM can be used to assist with evaluation.
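First, run the chain on every example built earlier to collect its predictions.
There are 7 examples in total, so the chain runs 7 times.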
LangChain provides `QAEvalChain` for evaluating QA scenarios.
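The two steps, collecting predictions and grading them, might be combined as in the sketch below. It assumes the legacy `langchain` 0.0.x API and examples shaped like `{"query": ..., "answer": ...}`. The helper name `grade_with_llm` is hypothetical, and `qa_chain` and `llm` stand for the QA chain and chat model built earlier.

```python
def grade_with_llm(qa_chain, llm, examples):
    """Run `qa_chain` on every example, then have an LLM grade each answer.

    Sketch only: assumes the legacy langchain 0.0.x API and examples shaped
    like {"query": ..., "answer": ...}.
    """
    # Imported inside the function so the sketch loads without LangChain installed.
    from langchain.evaluation.qa import QAEvalChain

    # One chain run per example; each prediction holds the chain's answer.
    predictions = qa_chain.apply(examples)

    eval_chain = QAEvalChain.from_llm(llm)
    # The evaluator compares each reference answer with the predicted one.
    graded_outputs = eval_chain.evaluate(examples, predictions)
    return predictions, graded_outputs
```

A temperature of 0 for the grading model is a common choice, since grades should be as repeatable as possible.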
Here are the model's outputs alongside the grades assigned by the evaluation chain:
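A loop along these lines can print that comparison. The sketch is plain Python and runs on its own. The `result` key for the predicted answer and the `text` key for the grade are assumptions about how the chain outputs are shaped, and they may differ between LangChain releases.

```python
def format_report(examples, predictions, graded_outputs):
    """Build a text report of reference answers, predictions and grades."""
    blocks = []
    for i, example in enumerate(examples):
        lines = [
            f"Example {i}:",
            f"Question: {example['query']}",
            f"Real Answer: {example['answer']}",
            # "result" and "text" are assumed key names (see the note above).
            f"Predicted Answer: {predictions[i]['result']}",
            f"Predicted Grade: {graded_outputs[i]['text']}",
        ]
        blocks.append("\n".join(lines))
    # Separate examples with a blank line so they do not run together.
    return "\n\n".join(blocks)


# Illustrative data copied from Example 0 of this lesson's output.
examples = [
    {"query": "Do the Cozy Comfort Pullover Set have side pockets?", "answer": "Yes"}
]
predictions = [
    {"result": "The Cozy Comfort Pullover Set, Stripe does have side pockets."}
]
graded_outputs = [{"text": "CORRECT"}]

report = format_report(examples, predictions, graded_outputs)
print(report)
```

The listing that follows uses the same five-line layout for each example.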
```
Example 0:
Question: Do the Cozy Comfort Pullover Set have side pockets?
Real Answer: Yes
Predicted Answer: The Cozy Comfort Pullover Set, Stripe does have side pockets.
Predicted Grade: CORRECT

Example 1:
Question: What collection is the Ultra-Lofty 850 Stretch Down Hooded Jacket from?
Real Answer: The DownTek collection
Predicted Answer: The Ultra-Lofty 850 Stretch Down Hooded Jacket is from the DownTek collection.
Predicted Grade: CORRECT

Example 2:
Question: What is the weight of each pair of Women's Campside Oxfords?
Real Answer: The approximate weight of each pair of Women's Campside Oxfords is 1 lb. 1 oz.
Predicted Answer: The weight of each pair of Women's Campside Oxfords is approximately 1 lb. 1 oz.
Predicted Grade: CORRECT

Example 3:
Question: What are the dimensions of the small and medium Recycled Waterhog Dog Mat?
Real Answer: The dimensions of the small Recycled Waterhog Dog Mat are 18" x 28" and the dimensions of the medium Recycled Waterhog Dog Mat are 22.5" x 34.5".
Predicted Answer: The small Recycled Waterhog Dog Mat has dimensions of 18" x 28" and the medium size has dimensions of 22.5" x 34.5".
Predicted Grade: CORRECT

Example 4:
Question: What are some features of the Infant and Toddler Girls' Coastal Chill Swimsuit?
Real Answer: The swimsuit features bright colors, ruffles, and exclusive whimsical prints. It is made of four-way-stretch and chlorine-resistant fabric, ensuring that it keeps its shape and resists snags. The swimsuit is also UPF 50+ rated, providing the highest rated sun protection possible by blocking 98% of the sun's harmful rays. The crossover no-slip straps and fully lined bottom ensure a secure fit and maximum coverage. Finally, it can be machine washed and line dried for best results.
Predicted Answer: The Infant and Toddler Girls' Coastal Chill Swimsuit is a two-piece swimsuit with bright colors, ruffles, and exclusive whimsical prints. It is made of four-way-stretch and chlorine-resistant fabric that keeps its shape and resists snags. The swimsuit has UPF 50+ rated fabric that provides the highest rated sun protection possible, blocking 98% of the sun's harmful rays. The crossover no-slip straps and fully lined bottom ensure a secure fit and maximum coverage. It is machine washable and should be line dried for best results.
Predicted Grade: CORRECT

Example 5:
Question: What is the fabric composition of the Refresh Swimwear V-Neck Tankini Contrasts?
Real Answer: The body of the Refresh Swimwear V-Neck Tankini Contrasts is made of 82% recycled nylon and 18% Lycra® spandex, while the lining is made of 90% recycled nylon and 10% Lycra® spandex.
Predicted Answer: The Refresh Swimwear V-Neck Tankini Contrasts is made of 82% recycled nylon with 18% Lycra® spandex for the body and 90% recycled nylon with 10% Lycra® spandex for the lining.
Predicted Grade: CORRECT

Example 6:
Question: What is the fabric composition of the EcoFlex 3L Storm Pants?
Real Answer: The EcoFlex 3L Storm Pants are made of 100% nylon, exclusive of trim.
Predicted Answer: The fabric composition of the EcoFlex 3L Storm Pants is 100% nylon, exclusive of trim.
Predicted Grade: CORRECT
```
The video then explains why an LLM is used for evaluation:

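In a natural language generation setting (such as the QA task above), the model's output can be any string, so correctness cannot be decided by exact string match (equality), partial match (substring containment), or regular expressions (more complex matching). Take Example 0 above: the real answer "Yes" and the model output "The Cozy Comfort Pullover Set, Stripe does have side pockets." are completely different strings and cannot be judged equal by string matching. An LLM with semantic understanding can judge them to be semantically equivalent, which traditional string matching cannot do.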
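The failure of string matching on this example is easy to check. The sketch below applies the three checks named above to the two strings from Example 0. The specific regular expression is just one plausible choice for this illustration.

```python
import re

real_answer = "Yes"
predicted_answer = "The Cozy Comfort Pullover Set, Stripe does have side pockets."

# Exact match: are the two strings identical?
exact_match = predicted_answer == real_answer

# Partial match: does the reference answer occur as a substring (ignoring case)?
substring_match = real_answer.lower() in predicted_answer.lower()

# Regular expression: does the prediction contain "yes" as a whole word?
regex_match = re.search(r"\byes\b", predicted_answer, flags=re.IGNORECASE) is not None

print(exact_match, substring_match, regex_match)  # prints: False False False
```

All three checks reject an answer that a human reader, or an LLM grader, would accept as correct.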
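### LangChain visual evaluation tool
LangChain provides a visual evaluation tool, `LangChainPlus` (which may require extra installation and configuration). It automatically records the history of runs made from a Python notebook.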

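You can click through the visualization to inspect the call chain, or click a node to see the details of that node's chain, including its input, output, latency, and extra information (such as the runtime environment), as shown in the figure below: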

Clicking the LLM Chain node shows the model input, including the SYSTEM and HUMAN messages, along with the model output and its metadata.


The top-right corner has a "to Dataset" button. Clicking it adds the current input and output to a dataset as a pair. The steps are as follows:

If no dataset exists yet, click "Create dataset" to create one:

Create the dataset:

Add the current QA Chain's input and output to the newly created dataset:
