# LangChain Notes: Evaluation

Compiled and translated from the official DeepLearning.AI × LangChain course lesson: Evaluation (source code available).

Evaluating LLM-based applications is a hard problem. This lesson introduces some approaches and tools for it.

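“When you move from traditional development to prompt-based development of LLM applications, the way the whole workflow is evaluated has to be rethought. This lesson introduces many exciting concepts.”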

## Evaluation

Build the QA chain introduced in the previous lesson. The only difference is one extra parameter, chain_type_kwargs, which specifies a document separator.

It helps to first look at a few sample records from the data.

### Hard-coded examples

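The most obvious evaluation method is to build evaluation data by hand and then check whether the LLM's output agrees with the answers given in that data. The drawback is cost: hand-built evaluation data is always expensive to produce.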

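A minimal sketch of what hand-written examples might look like, using two question/answer pairs that appear in the evaluation output later in this post. The `query`/`answer` key names and the `check_examples` helper are assumptions made for illustration; the post itself does not specify them.

```python
# Hand-written evaluation examples: each pairs a question with the answer
# we expect. The "query"/"answer" key names are an assumption.
examples = [
    {
        "query": "Do the Cozy Comfort Pullover Set have side pockets?",
        "answer": "Yes",
    },
    {
        "query": "What collection is the Ultra-Lofty 850 Stretch Down Hooded Jacket from?",
        "answer": "The DownTek collection",
    },
]


def check_examples(items):
    """Hypothetical helper: verify every example has a non-empty query and answer."""
    for i, item in enumerate(items):
        for key in ("query", "answer"):
            value = item.get(key)
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"example {i} is missing a usable {key!r}")
    return len(items)


print(check_examples(examples), "hand-written examples")
```

Every new case has to be written and checked by a person, which is where the cost mentioned above comes from.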

### LLM-generated examples

An LLM can generate examples in place of writing them by hand. LangChain's QAGenerationChain is a chain that generates QA examples.

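The hand-written and generated examples can be combined for evaluation. As a first check, run the first query through the chain and inspect the response.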
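Combining the two sources could be sketched as follows. The `combine_examples` helper and the `query`/`answer` keys are hypothetical, and the `generated` list is a hard-coded stand-in for what an LLM might return (its pairs are copied from the evaluation output later in this post), so no model is called here.

```python
# Hand-written examples (key names are an assumption).
hand_written = [
    {"query": "Do the Cozy Comfort Pullover Set have side pockets?", "answer": "Yes"},
]

# Stand-in for LLM-generated examples; a real run would produce these with a model.
generated = [
    {
        "query": "What is the fabric composition of the EcoFlex 3L Storm Pants?",
        "answer": "The EcoFlex 3L Storm Pants are made of 100% nylon, exclusive of trim.",
    },
    # A duplicate of a hand-written question, to show deduplication.
    {"query": "Do the Cozy Comfort Pullover Set have side pockets?", "answer": "Yes"},
]


def combine_examples(*groups):
    """Hypothetical helper: concatenate example lists, keeping the first copy of each query."""
    seen = set()
    combined = []
    for group in groups:
        for example in group:
            key = example["query"].strip().lower()
            if key not in seen:
                seen.add(key)
                combined.append(example)
    return combined


all_examples = combine_examples(hand_written, generated)
print(len(all_examples))
```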

### Manual evaluation

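LangChain provides a debug mode, enabled by setting the debug flag to True (`langchain.debug = True` in the LangChain version the course uses). Running the first query again then makes LangChain print information about every step of the run.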

Set the debug flag back to False to turn debug mode off.

### LLM-assisted evaluation

Because current LLMs are already quite capable, an LLM can be used to assist with evaluation.

First, generate results for all of the examples built earlier. There are 7 examples in total, so the chain runs 7 times.

LangChain provides QAEvalChain for evaluating QA scenarios: it takes the examples and the model's predictions and has an LLM grade each prediction.
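The general idea of LLM-assisted grading can be sketched without any library. This is not QAEvalChain's actual prompt or implementation: the template wording, the `query`/`answer`/`result` keys, and every function name below are assumptions, and `fake_llm` is a stand-in so the sketch runs offline.

```python
from typing import Callable, Dict, List

# Hypothetical grading prompt; a real evaluator's wording will differ.
GRADE_TEMPLATE = (
    "You are grading a quiz.\n"
    "QUESTION: {query}\n"
    "TRUE ANSWER: {answer}\n"
    "STUDENT ANSWER: {result}\n"
    "Reply with CORRECT or INCORRECT."
)


def build_grading_prompt(example: Dict[str, str], prediction: Dict[str, str]) -> str:
    """Fill the template with the question, reference answer, and model answer."""
    return GRADE_TEMPLATE.format(
        query=example["query"],
        answer=example["answer"],
        result=prediction["result"],
    )


def parse_grade(reply: str) -> str:
    """Extract the grade from the grader's free-text reply."""
    text = reply.upper()
    # Check INCORRECT first: the word CORRECT is a substring of it.
    if "INCORRECT" in text:
        return "INCORRECT"
    if "CORRECT" in text:
        return "CORRECT"
    return "UNKNOWN"


def grade_all(examples, predictions, llm: Callable[[str], str]) -> List[str]:
    """Ask the grader about each (example, prediction) pair in order."""
    return [
        parse_grade(llm(build_grading_prompt(e, p)))
        for e, p in zip(examples, predictions)
    ]


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call; always answers CORRECT."""
    return "GRADE: CORRECT"


examples = [{"query": "Do the Cozy Comfort Pullover Set have side pockets?", "answer": "Yes"}]
predictions = [{"result": "The Cozy Comfort Pullover Set, Stripe does have side pockets."}]
grades = grade_all(examples, predictions, fake_llm)
print(grades)
```

Passing the model in as a plain callable keeps the grading logic testable; swapping `fake_llm` for a real model call is the only change a live run would need.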

Below are the model's outputs together with the grades assigned by the evaluation chain:

```
Example 0:
Question: Do the Cozy Comfort Pullover Set have side pockets?
Real Answer: Yes
Predicted Answer: The Cozy Comfort Pullover Set, Stripe does have side pockets.
Predicted Grade: CORRECT

Example 1:
Question: What collection is the Ultra-Lofty 850 Stretch Down Hooded Jacket from?
Real Answer: The DownTek collection
Predicted Answer: The Ultra-Lofty 850 Stretch Down Hooded Jacket is from the DownTek collection.
Predicted Grade: CORRECT

Example 2:
Question: What is the weight of each pair of Women's Campside Oxfords?
Real Answer: The approximate weight of each pair of Women's Campside Oxfords is 1 lb. 1 oz.
Predicted Answer: The weight of each pair of Women's Campside Oxfords is approximately 1 lb. 1 oz.
Predicted Grade: CORRECT

Example 3:
Question: What are the dimensions of the small and medium Recycled Waterhog Dog Mat?
Real Answer: The dimensions of the small Recycled Waterhog Dog Mat are 18" x 28" and the dimensions of the medium Recycled Waterhog Dog Mat are 22.5" x 34.5".
Predicted Answer: The small Recycled Waterhog Dog Mat has dimensions of 18" x 28" and the medium size has dimensions of 22.5" x 34.5".
Predicted Grade: CORRECT

Example 4:
Question: What are some features of the Infant and Toddler Girls' Coastal Chill Swimsuit?
Real Answer: The swimsuit features bright colors, ruffles, and exclusive whimsical prints. It is made of four-way-stretch and chlorine-resistant fabric, ensuring that it keeps its shape and resists snags. The swimsuit is also UPF 50+ rated, providing the highest rated sun protection possible by blocking 98% of the sun's harmful rays. The crossover no-slip straps and fully lined bottom ensure a secure fit and maximum coverage. Finally, it can be machine washed and line dried for best results.
Predicted Answer: The Infant and Toddler Girls' Coastal Chill Swimsuit is a two-piece swimsuit with bright colors, ruffles, and exclusive whimsical prints. It is made of four-way-stretch and chlorine-resistant fabric that keeps its shape and resists snags. The swimsuit has UPF 50+ rated fabric that provides the highest rated sun protection possible, blocking 98% of the sun's harmful rays. The crossover no-slip straps and fully lined bottom ensure a secure fit and maximum coverage. It is machine washable and should be line dried for best results.
Predicted Grade: CORRECT

Example 5:
Question: What is the fabric composition of the Refresh Swimwear V-Neck Tankini Contrasts?
Real Answer: The body of the Refresh Swimwear V-Neck Tankini Contrasts is made of 82% recycled nylon and 18% Lycra® spandex, while the lining is made of 90% recycled nylon and 10% Lycra® spandex.
Predicted Answer: The Refresh Swimwear V-Neck Tankini Contrasts is made of 82% recycled nylon with 18% Lycra® spandex for the body and 90% recycled nylon with 10% Lycra® spandex for the lining.
Predicted Grade: CORRECT

Example 6:
Question: What is the fabric composition of the EcoFlex 3L Storm Pants?
Real Answer: The EcoFlex 3L Storm Pants are made of 100% nylon, exclusive of trim.
Predicted Answer: The fabric composition of the EcoFlex 3L Storm Pants is 100% nylon, exclusive of trim.
Predicted Grade: CORRECT
```

The video then explains why an LLM is used for evaluation:
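![Screenshot](https://img-blog.csdnimg.cn/73ac80581ea243d981b0db3ede2d5d8a.png)
In a natural language generation setting (such as the QA task above), the model's output can be any string, so correctness cannot be judged by exact string match (equality), partial match (substring containment), or regular expressions (more complex matching). Taking the figure above as an example, the real answer "Yes" and the model output "The Cozy Comfort Pullover Set, Stripe does have side pockets." are entirely different strings, so string matching cannot judge them equivalent. An LLM with semantic understanding can recognize that they mean the same thing, which traditional string matching cannot do.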
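The failure of string matching on this pair can be checked directly. The sketch below uses only the two strings quoted above; the particular regular expression is just one plausible attempt, chosen for illustration.

```python
import re

real_answer = "Yes"
predicted_answer = "The Cozy Comfort Pullover Set, Stripe does have side pockets."

# Exact match: the two strings must be identical.
exact_match = predicted_answer == real_answer

# Partial match: the reference answer must appear inside the prediction.
substring_match = real_answer.lower() in predicted_answer.lower()

# Regex match: look for the word "yes" anywhere, ignoring case.
regex_match = re.search(r"\byes\b", predicted_answer, flags=re.IGNORECASE) is not None

print(exact_match, substring_match, regex_match)  # False False False
```

All three checks reject an answer that a human reader, or an LLM grader, would accept as correct.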
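### LangChain visual evaluation tool
LangChain provides a visual evaluation tool, `LangChainPlus` (it may need to be installed and configured separately), which automatically records the history of runs made from a Python notebook.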
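![Screenshot](https://img-blog.csdnimg.cn/89a584e6f74843a9af67e719ff185cbb.png)
Click a run to view its call chain visually, or click a node to see the details of that chain, including its input, output, latency, and additional information (such as the runtime environment), as shown below:
![Screenshot](https://img-blog.csdnimg.cn/1bc61a5378934a248155957d17724f73.png)
Click the LLM Chain node to see the model input, including the SYSTEM and HUMAN messages, along with the model output and its metadata.
![Screenshot](https://img-blog.csdnimg.cn/da19b50c29d740cab5c498f25e688722.png)
![Screenshot](https://img-blog.csdnimg.cn/a9034d980ba54ddbb6ae8a136b2fe937.png)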
The upper-right corner has a "to Dataset" button. Clicking it adds the current input and output to a dataset as a pair, as follows:
![Screenshot](https://img-blog.csdnimg.cn/aac46bc18f6e4862bf6227e9ded7fb2c.png)
If no dataset exists yet, click "Create dataset" to create one:
![Screenshot](https://img-blog.csdnimg.cn/26e015fa2877407a90d03822d723bf7f.png)
Create the dataset:
![Screenshot](https://img-blog.csdnimg.cn/96c7b5798c68423a8427cd1376d9cf57.png)
Add the input and output of the current QA chain to the newly created dataset:
![Screenshot](https://img-blog.csdnimg.cn/827cf6901a9640478cc0b9888fa5f00d.png)
