Chapter 7 - Fine-tuning to follow instructions
7.7 Extracting and saving responses
-
In this section, we save the test-set responses so that they can be scored in the next section; in addition, we save a copy of the model for future use.
-
First, let's take a brief look at the responses generated by the finetuned model:
```python
torch.manual_seed(123)

for entry in test_data[:3]:
    input_text = format_input(entry)

    token_ids = generate(
        model=model,
        idx=text_to_token_ids(input_text, tokenizer).to(device),
        max_new_tokens=256,
        context_size=BASE_CONFIG["context_length"],
        eos_id=50256
    )
    generated_text = token_ids_to_text(token_ids, tokenizer)
    response_text = (
        generated_text[len(input_text):]
        .replace("### Response:", "")
        .strip()
    )

    print(input_text)
    print(f"\nCorrect response:\n>> {entry['output']}")
    print(f"\nModel response:\n>> {response_text.strip()}")
    print("-------------------------------------")
```

Output:

```
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Rewrite the sentence using a simile.

### Input:
The car is very fast.

Correct response:
>> The car is as fast as lightning.

Model response:
>> The car is as fast as a cheetah.
-------------------------------------
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
What type of cloud is typically associated with thunderstorms?

Correct response:
>> The type of cloud typically associated with thunderstorms is cumulonimbus.

Model response:
>> The type of cloud associated with thunderstorms is a cumulus cloud.
-------------------------------------
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Name the author of 'Pride and Prejudice'.

Correct response:
>> Jane Austen.

Model response:
>> The author of 'Pride and Prejudice' is Jane Austen.
-------------------------------------
```
As we can see from the test-set instructions, the given (correct) responses, and the model responses, the model performs relatively well. The answers to the first and last instructions are clearly correct, and the second answer is close: the model answers with "cumulus cloud" instead of "cumulonimbus" (note, however, that a cumulus cloud can develop into a cumulonimbus cloud, which is the type capable of producing thunderstorms).
-
Most importantly, we can see that evaluating the model is not as straightforward as in the previous chapter, where we only had to compute the percentage of correctly predicted spam/not-spam class labels to obtain the classification accuracy.
-
In practice, instruction-finetuned LLMs such as chatbots are evaluated through several approaches:
- Short-answer and multiple-choice benchmarks, such as MMLU ("Measuring Massive Multitask Language Understanding", [https://arxiv.org/pdf/2009.03300](https://arxiv.org/pdf/2009.03300)), which test a model's general knowledge
- Human preference comparisons against other LLMs, such as LMSYS Chatbot Arena ([https://arena.lmsys.org](https://arena.lmsys.org))
- Automated conversational benchmarks, where another LLM (such as GPT-4) is used to evaluate the responses, for example AlpacaEval ([https://tatsu-lab.github.io/alpaca_eval/](https://tatsu-lab.github.io/alpaca_eval/)); a minimal sketch of this idea is shown right after this list
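To make the third approach a bit more concrete, here is a minimal sketch of how a scoring prompt for an LLM judge could be assembled from one of our test entries. The `build_eval_prompt` helper and the 0-100 scoring scale are illustrative assumptions, not the exact prompt used by AlpacaEval or in the next section:

```python
def build_eval_prompt(entry, model_response):
    # Illustrative sketch: combine the instruction/input, the reference answer,
    # and the model's answer into a single prompt for a judge LLM.
    # `format_input` is the same helper used earlier in this chapter;
    # the 0-100 scale and wording here are assumptions for illustration.
    return (
        f"Given the input `{format_input(entry)}` "
        f"and the correct output `{entry['output']}`, "
        f"score the model response `{model_response}` "
        f"on a scale from 0 to 100, where 100 is the best score. "
        f"Respond with the integer number only."
    )

# Example usage: the resulting text would be sent to another LLM (e.g., GPT-4)
print(build_eval_prompt(test_data[0], "The car is as fast as a cheetah."))
```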
-
In the next section, we will use an approach similar to AlpacaEval and let another LLM evaluate our model's responses; however, instead of relying on a publicly available benchmark dataset, we will use our own test set. To do so, we add the model responses to the `test_data` dictionary and save everything as an `instruction-data-with-response.json` file for record keeping, so that we can load and analyze it in a separate Python session later if needed:
```python
import json

from tqdm import tqdm

for i, entry in tqdm(enumerate(test_data), total=len(test_data)):
    input_text = format_input(entry)

    token_ids = generate(
        model=model,
        idx=text_to_token_ids(input_text, tokenizer).to(device),
        max_new_tokens=256,
        context_size=BASE_CONFIG["context_length"],
        eos_id=50256
    )
    generated_text = token_ids_to_text(token_ids, tokenizer)
    response_text = generated_text[len(input_text):].replace("### Response:", "").strip()

    test_data[i]["model_response"] = response_text

with open("instruction-data-with-response.json", "w") as file:
    json.dump(test_data, file, indent=4)  # "indent" for pretty-printing
```

Output:

```
100%|██████████| 110/110 [00:59<00:00,  1.86it/s]
```
Let's inspect one of the entries to verify that the response has been added to the `test_data` dictionary correctly:
```python
print(test_data[0])
```

Output:

```
{'instruction': 'Rewrite the sentence using a simile.', 'input': 'The car is very fast.', 'output': 'The car is as fast as lightning.', 'model_response': 'The car is as fast as a cheetah.'}
```
We can see that both the original `output` and the model's `model_response` are present.
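As mentioned above, the saved file can be loaded back in a separate Python session for analysis. A minimal sketch (the file name matches the one written above; what gets printed is just an example of what one might inspect):

```python
import json

# Reload the saved test set with model responses in a fresh session
with open("instruction-data-with-response.json", "r") as file:
    test_data = json.load(file)

print("Number of entries:", len(test_data))

# Each entry is a dict with 'instruction', 'input', 'output', and 'model_response'
first = test_data[0]
print("Instruction:   ", first["instruction"])
print("Correct output:", first["output"])
print("Model response:", first["model_response"])
```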
-
Finally, we save the finetuned model so that we can reuse it in the future:
```python
import re

file_name = f"E:\\LLM\\gpt2\\{re.sub(r'[ ()]', '', CHOOSE_MODEL)}-sft.pth"
torch.save(model.state_dict(), file_name)
print(f"Model saved as {file_name}")

# Load model via
# model.load_state_dict(torch.load("E:\\LLM\\gpt2\\gpt2-medium355M-sft.pth"))
```

Output:

```
Model saved as E:\LLM\gpt2\gpt2-medium355M-sft.pth
```
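To reuse the saved weights later, they have to be loaded into a model instance with the same architecture. A minimal sketch, assuming the `GPTModel` class and `BASE_CONFIG` dictionary from the earlier chapters are available (the file path matches the one saved above):

```python
import torch

# Recreate the model architecture used for finetuning
# (GPTModel and BASE_CONFIG are assumed to come from the earlier chapters)
model = GPTModel(BASE_CONFIG)

# Load the finetuned weights; map_location keeps this working on CPU-only machines
state_dict = torch.load("E:\\LLM\\gpt2\\gpt2-medium355M-sft.pth", map_location="cpu")
model.load_state_dict(state_dict)

model.eval()  # disable dropout before running inference
```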