Chapter 7 - Fine-tuning to follow instructions
7.7 Extracting and saving responses
-
In this section, we save the test-set responses so that they can be scored in the next section; in addition, we save a copy of the model for future use.
-
First, let's take a brief look at the responses generated by the finetuned model:
```python
torch.manual_seed(123)

for entry in test_data[:3]:
    input_text = format_input(entry)

    token_ids = generate(
        model=model,
        idx=text_to_token_ids(input_text, tokenizer).to(device),
        max_new_tokens=256,
        context_size=BASE_CONFIG["context_length"],
        eos_id=50256
    )
    generated_text = token_ids_to_text(token_ids, tokenizer)
    response_text = (
        generated_text[len(input_text):]
        .replace("### Response:", "")
        .strip()
    )

    print(input_text)
    print(f"\nCorrect response:\n>> {entry['output']}")
    print(f"\nModel response:\n>> {response_text.strip()}")
    print("-------------------------------------")
```

Output:

```
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Rewrite the sentence using a simile.

### Input:
The car is very fast.

Correct response:
>> The car is as fast as lightning.

Model response:
>> The car is as fast as a cheetah.
-------------------------------------
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
What type of cloud is typically associated with thunderstorms?

Correct response:
>> The type of cloud typically associated with thunderstorms is cumulonimbus.

Model response:
>> The type of cloud associated with thunderstorms is a cumulus cloud.
-------------------------------------
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Name the author of 'Pride and Prejudice'.

Correct response:
>> Jane Austen.

Model response:
>> The author of 'Pride and Prejudice' is Jane Austen.
-------------------------------------
```
As we can see from the test-set instructions, the given (correct) responses, and the model responses, the model performs relatively well. The answers to the first and last instructions are clearly correct, and the second answer is close: the model answers with "cumulus cloud" instead of "cumulonimbus" (note, however, that a cumulus cloud can develop into a cumulonimbus cloud, which is the type capable of producing thunderstorms).
-
Most importantly, we can see that evaluating the model is not as straightforward as in the previous chapter, where we only had to compute the percentage of correctly predicted spam/not-spam class labels to obtain the classification accuracy.
-
In practice, instruction-finetuned LLMs such as chatbots are evaluated through several approaches:
- Short-answer and multiple-choice benchmarks, such as MMLU ("Measuring Massive Multitask Language Understanding", [https://arxiv.org/pdf/2009.03300](https://arxiv.org/pdf/2009.03300)), which test a model's general knowledge
- Human preference comparisons against other LLMs, such as LMSYS Chatbot Arena ([https://arena.lmsys.org](https://arena.lmsys.org))
- Automated conversational benchmarks, where another LLM (such as GPT-4) is used to evaluate the responses, for example AlpacaEval ([https://tatsu-lab.github.io/alpaca_eval/](https://tatsu-lab.github.io/alpaca_eval/)); a minimal sketch of this idea is shown right after this list
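To make the third approach a bit more concrete, here is a minimal sketch of how a scoring prompt for an LLM judge could be assembled from one of our test entries. The `build_eval_prompt` helper and the 0-100 scoring scale are illustrative assumptions, not the exact prompt used by AlpacaEval or in the next section:

```python
def build_eval_prompt(entry, model_response):
    # Illustrative sketch: combine the instruction/input, the reference answer,
    # and the model's answer into a single prompt for a judge LLM.
    # `format_input` is the same helper used earlier in this chapter;
    # the 0-100 scale and wording here are assumptions for illustration.
    return (
        f"Given the input `{format_input(entry)}` "
        f"and the correct output `{entry['output']}`, "
        f"score the model response `{model_response}` "
        f"on a scale from 0 to 100, where 100 is the best score. "
        f"Respond with the integer number only."
    )

# Example usage: the resulting text would be sent to another LLM (e.g., GPT-4)
print(build_eval_prompt(test_data[0], "The car is as fast as a cheetah."))
```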
-
In the next section, we will use an approach similar to AlpacaEval and let another LLM evaluate our model's responses; however, instead of relying on a publicly available benchmark dataset, we will use our own test set. To do so, we add the model responses to the `test_data` dictionary and save everything as an `instruction-data-with-response.json` file for record keeping, so that we can load and analyze it in a separate Python session later if needed:
```python
import json

from tqdm import tqdm

for i, entry in tqdm(enumerate(test_data), total=len(test_data)):
    input_text = format_input(entry)

    token_ids = generate(
        model=model,
        idx=text_to_token_ids(input_text, tokenizer).to(device),
        max_new_tokens=256,
        context_size=BASE_CONFIG["context_length"],
        eos_id=50256
    )
    generated_text = token_ids_to_text(token_ids, tokenizer)
    response_text = generated_text[len(input_text):].replace("### Response:", "").strip()

    test_data[i]["model_response"] = response_text

with open("instruction-data-with-response.json", "w") as file:
    json.dump(test_data, file, indent=4)  # "indent" for pretty-printing
```

Output:

```
100%|██████████| 110/110 [00:59<00:00,  1.86it/s]
```
Let's inspect one of the entries to verify that the response has been added to the `test_data` dictionary correctly:
```python
print(test_data[0])
```

Output:

```
{'instruction': 'Rewrite the sentence using a simile.', 'input': 'The car is very fast.', 'output': 'The car is as fast as lightning.', 'model_response': 'The car is as fast as a cheetah.'}
```
We can see that both the original `output` and the model's `model_response` are present.
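As mentioned above, the saved file can be loaded back in a separate Python session for analysis. A minimal sketch (the file name matches the one written above; what gets printed is just an example of what one might inspect):

```python
import json

# Reload the saved test set with model responses in a fresh session
with open("instruction-data-with-response.json", "r") as file:
    test_data = json.load(file)

print("Number of entries:", len(test_data))

# Each entry is a dict with 'instruction', 'input', 'output', and 'model_response'
first = test_data[0]
print("Instruction:   ", first["instruction"])
print("Correct output:", first["output"])
print("Model response:", first["model_response"])
```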
-
Finally, we save the finetuned model so that we can reuse it in the future:
```python
import re

file_name = f"E:\\LLM\\gpt2\\{re.sub(r'[ ()]', '', CHOOSE_MODEL)}-sft.pth"
torch.save(model.state_dict(), file_name)
print(f"Model saved as {file_name}")

# Load model via
# model.load_state_dict(torch.load("E:\\LLM\\gpt2\\gpt2-medium355M-sft.pth"))
```

Output:

```
Model saved as E:\LLM\gpt2\gpt2-medium355M-sft.pth
```
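To reuse the saved weights later, they have to be loaded into a model instance with the same architecture. A minimal sketch, assuming the `GPTModel` class and `BASE_CONFIG` dictionary from the earlier chapters are available (the file path matches the one saved above):

```python
import torch

# Recreate the model architecture used for finetuning
# (GPTModel and BASE_CONFIG are assumed to come from the earlier chapters)
model = GPTModel(BASE_CONFIG)

# Load the finetuned weights; map_location keeps this working on CPU-only machines
state_dict = torch.load("E:\\LLM\\gpt2\\gpt2-medium355M-sft.pth", map_location="cpu")
model.load_state_dict(state_dict)

model.eval()  # disable dropout before running inference
```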