[LLM Tutorial - llama] How to Fine-Tune a Large Language Model

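Today I'm bringing you a very detailed tutorial that walks you step by step through fine-tuning a large language model! (The code and detailed explanations come later in the post.)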

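What steps does fine-tuning a large language model involve?

Fine-tuning a large language model: training process and code

What steps does fine-tuning a large language model involve?

The main steps of fine-tuning a large language model 🤩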

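📚 Prepare the training dataset
First, prepare a high-quality training dataset, ideally one drawn from your own application domain. It can be plain text, dialogue data, and so on, usually stored as JSON or TXT.
📦 Choose a suitable base model
Next, pick a suitable pretrained base model as the starting point for fine-tuning. Common choices include GPT, BERT, and T5; choose according to the task.
Set the training hyperparameters
Then set the training hyperparameters, such as the learning rate, batch size, and number of training steps. Well-chosen hyperparameters make a big difference to the results.
🧑‍💻 Load the model and dataset
Use a library such as Hugging Face to load the chosen base model and the training dataset. Remember to preprocess the data and split it as needed.
Start the fine-tuning run
With the model, dataset, and hyperparameters in place, you can start fine-tuning! Training can run on a framework such as PyTorch or TensorFlow.
💾 Save the fine-tuned model
Once training finishes, don't forget to save the fine-tuned model so you can load it again later.
🧪 Evaluate the model on the test set
Finally, evaluate the fine-tuned model on the held-out test set. Does it clearly improve on the base model?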
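
The data preparation and splitting steps above can be sketched in plain Python. This is only an illustration and not part of the original tutorial: the `split_examples` helper, the 10% test fraction, the seed, and the sample records are all assumptions made for the example.

```python
import random


def split_examples(examples, test_fraction=0.1, seed=42):
    """Shuffle a list of records and split it into (train, test).

    Hypothetical helper for illustration only; the tutorial itself relies on
    tokenize_and_split_data from its utilities module.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(examples)              # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded RNG keeps the split reproducible
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Made-up question/answer records, shaped like the dataset used later on.
records = [{"question": f"Q{i}", "answer": f"A{i}"} for i in range(20)]
train_set, test_set = split_examples(records)
print(len(train_set), len(test_set))
```

Fixing the seed means the same records land in the test set on every run, which keeps before/after comparisons of the model fair.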

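Fine-tuning a large language model: training process and code

So how do you use the Lamini library to load the data, set up the model and training hyperparameters, define an inference function, fine-tune the base model, and evaluate the result?

First, import the required libraries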

import os
import lamini
import datasets
import tempfile
import logging
import random
import config
import yaml
import time
import torch
import transformers
import pandas as pd
import jsonlines
from utilities import *
from transformers import AutoTokenizer
from transformers import AutoModelForCausalLM
from transformers import TrainingArguments
from llama import BasicModelRunner

This part imports the required Python libraries, including Lamini and Hugging Face's Datasets and Transformers.

Load the Lamini docs dataset

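# Option 1: a local JSONL file
dataset_name = "lamini_docs.jsonl"
dataset_path = f"/content/{dataset_name}"
use_hf = False
# Option 2: the dataset hosted on Hugging Face (overrides the values above)
dataset_path = "lamini/lamini_docs"
use_hf = True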

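This sets the dataset path along with the use_hf flag, which indicates whether the data is loaded through Hugging Face's Datasets library. The second pair of assignments overrides the first, so the Hugging Face copy is the one actually used.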
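
To make the role of the flag concrete, here is a minimal sketch of a loader that branches on it. The `load_examples` helper is hypothetical (the tutorial's utilities module does this internally), and the assumption that the Hugging Face dataset exposes a "train" split is mine. The demo only exercises the local-file branch, using made-up records.

```python
import json
import os
import tempfile


def load_examples(dataset_path, use_hf):
    """Return a list of dict records from a local JSONL file or from Hugging Face.

    Hypothetical helper for illustration only.
    """
    if use_hf:
        # Imported lazily so the local-file branch works without the library.
        import datasets
        # Assumption: the hosted dataset has a "train" split.
        return list(datasets.load_dataset(dataset_path)["train"])
    with open(dataset_path, encoding="utf-8") as f:
        # JSONL: one JSON object per non-empty line.
        return [json.loads(line) for line in f if line.strip()]


# Demo on a throwaway local file with made-up records.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.jsonl")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write('{"question": "Q1", "answer": "A1"}\n')
        f.write('{"question": "Q2", "answer": "A2"}\n')
    rows = load_examples(sample_path, use_hf=False)

print(len(rows), rows[0]["question"])
```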

Set up the model, training config, and tokenizer

model_name = "EleutherAI/pythia-70m"
training_config = { ... }
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
train_dataset, test_dataset = tokenize_and_split_data(training_config, tokenizer)

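This part names the pretrained base model and sets the training config (maximum length and so on). It then loads the matching tokenizer with AutoTokenizer and adjusts it by reusing the end-of-sequence token as the padding token. Finally, it calls tokenize_and_split_data to tokenize the data and split it into training and test sets.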
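
Why does the pad token matter? Examples in a batch must all have the same length, so short sequences are padded and long ones are truncated. The toy sketch below shows the idea on plain lists of token IDs. It is not the Hugging Face tokenizer: the `pad_or_truncate` helper, the token IDs, and the EOS ID of 0 are all made up for illustration, and a real tokenizer may pad or truncate on the other side.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Force a list of token IDs to exactly max_length entries."""
    clipped = token_ids[:max_length]                         # drop tokens past the limit
    return clipped + [pad_id] * (max_length - len(clipped))  # right-pad the remainder


EOS_ID = 0  # assumed ID of the end-of-sequence token, reused as the pad token
batch = [[11, 12, 13], [21, 22, 23, 24, 25, 26]]
padded = [pad_or_truncate(ids, max_length=4, pad_id=EOS_ID) for ids in batch]
print(padded)  # [[11, 12, 13, 0], [21, 22, 23, 24]]
```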

Load the base model

base_model = AutoModelForCausalLM.from_pretrained(model_name)
device_count = torch.cuda.device_count()
if device_count > 0:
    device = torch.device("cuda")
else:
    device = torch.device("cpu")
base_model.to(device)

This loads the base model from the pretrained checkpoint with AutoModelForCausalLM and moves it to the GPU if one is available, otherwise to the CPU.

Define the inference function

def inference(text, model, tokenizer, max_input_tokens=1000, max_output_tokens=100):...

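Given an input text, this function uses the model and tokenizer to run inference and produce an output. It tokenizes the input text, generates output tokens with the model, and decodes them back into text.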
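
The body of the function is elided above. As a rough guide, the three steps just described might look like the sketch below. This is my own guess at a reasonable implementation, not the tutorial's code, which is why it is named `inference_sketch`. It assumes `model` and `tokenizer` are Hugging Face objects, and the choice of `max_new_tokens` and of slicing off the prompt by character count are assumptions too.

```python
def inference_sketch(text, model, tokenizer, max_input_tokens=1000, max_output_tokens=100):
    """Tokenize -> generate -> decode. An illustrative sketch, not the original function."""
    # 1. Tokenize the prompt, truncating inputs that are too long.
    input_ids = tokenizer.encode(
        text, return_tensors="pt", truncation=True, max_length=max_input_tokens
    )
    # 2. Generate on whichever device the model lives on.
    generated = model.generate(
        input_ids=input_ids.to(model.device),
        max_new_tokens=max_output_tokens,  # assumption: cap only the newly generated tokens
    )
    # 3. Decode, then drop the echoed prompt so only the answer remains.
    decoded = tokenizer.batch_decode(generated, skip_special_tokens=True)
    return decoded[0][len(text):]
```

Causal language models return the prompt followed by the continuation, which is why the last line slices the prompt off. The slice assumes the decoded text starts with the prompt exactly as it was typed.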

Try inference with the base model

test_text = test_dataset[0]['question']
print("Question input (test):", test_text)
print(f"Correct answer from Lamini docs: {test_dataset[0]['answer']}")
print("Model's answer: ")
print(inference(test_text, base_model, tokenizer))

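This part uses the inference function defined in the previous step to try the base model on the first example of the test dataset. It prints the input question, the correct answer, and the model's output.

Set the training arguments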

max_steps = 3
trained_model_name = f"lamini_docs_{max_steps}_steps"
output_dir = trained_model_name
training_args = TrainingArguments(
    # Learning rate
    learning_rate=1.0e-5,
    # Number of training epochs
    num_train_epochs=1,
    # Max steps to train for (each step is a batch of data)
    # Overrides num_train_epochs, if not -1
    max_steps=max_steps,
    # Batch size for training
    per_device_train_batch_size=1,
    # Directory to save model checkpoints
    output_dir=output_dir,
    # Other arguments
    overwrite_output_dir=False,  # Overwrite the content of the output directory
    disable_tqdm=False,  # Disable progress bars
    eval_steps=120,  # Number of update steps between two evaluations
    save_steps=120,  # After # steps model is saved
    warmup_steps=1,  # Number of warmup steps for learning rate scheduler
    per_device_eval_batch_size=1,  # Batch size for evaluation
    evaluation_strategy="steps",
    logging_strategy="steps",
    logging_steps=1,
    optim="adafactor",
    gradient_accumulation_steps=4,
    gradient_checkpointing=False,
    # Parameters for early stopping
    load_best_model_at_end=True,
    save_total_limit=1,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

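This part sets the training arguments, including the maximum number of training steps, the output directory for the model, the learning rate, and other hyperparameters.

Why set these training hyperparameters this way:

learning_rate=1.0e-5
The learning rate controls how large an update the model makes at each training step. 1e-5 is a relatively small value, which helps keep training stable and prevents divergence.

num_train_epochs=1
The number of epochs, that is, how many full passes the model makes over the data. It is set to 1 because we only want a light fine-tune and want to avoid overfitting.

max_steps=max_steps
The maximum number of training steps. When it is set to anything other than -1, it overrides num_train_epochs, which gives tighter control over the total amount of training.

per_device_train_batch_size=1
The training batch size on each device (GPU/CPU). Larger batches use more memory but can make training more stable.

output_dir=output_dir
The directory where checkpoints saved during training and the final model are written.

overwrite_output_dir=False
Whether to overwrite the directory if it already exists. False avoids accidentally overwriting earlier results.

eval_steps=120, save_steps=120
Evaluate the model and save a checkpoint every 120 steps. Regular checkpoints let you resume if training is interrupted. Note that with max_steps = 3 as set above, this run is shorter than either interval.

warmup_steps=1
The number of learning-rate warmup steps. Starting from a smaller learning rate helps stabilize the early phase of training.

per_device_eval_batch_size=1
The batch size on each device during evaluation. Usually the same as for training.

evaluation_strategy="steps", logging_strategy="steps"
Evaluate and log at step intervals rather than at epoch boundaries.

optim="adafactor"
Use the Adafactor optimizer, a memory-efficient optimizer often used for training large language models.

gradient_accumulation_steps=4
The number of gradient accumulation steps. Gradients from several small batches are accumulated before each weight update, which simulates a larger batch size without the extra memory.

load_best_model_at_end=True
When training ends, load the checkpoint that performed best on the validation set and use it as the final model.

metric_for_best_model="eval_loss", greater_is_better=False
Select the best model by validation loss; lower is better.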
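
Two of these settings are easier to grasp with numbers. The sketch below works out the effective batch size under gradient accumulation and the learning rate at each step, assuming a linear warmup followed by linear decay to zero. Both helper functions are written for this illustration and are not library calls; the values plugged in are the ones from the arguments above.

```python
def effective_batch_size(per_device_batch, accumulation_steps, num_devices=1):
    """Number of examples that contribute to a single optimizer update."""
    return per_device_batch * accumulation_steps * num_devices


def linear_warmup_factor(step, warmup_steps, total_steps):
    """Learning-rate multiplier: linear ramp up during warmup, then linear decay to zero."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


base_lr = 1.0e-5
print(effective_batch_size(per_device_batch=1, accumulation_steps=4))
for step in range(4):
    # Multiplier goes 0.0 -> 1.0 -> 0.5 -> 0.0 for warmup_steps=1, total_steps=3.
    print(step, base_lr * linear_warmup_factor(step, warmup_steps=1, total_steps=3))
```

So with per_device_train_batch_size=1 and gradient_accumulation_steps=4 on a single device, each weight update is based on 4 examples even though only one is in memory at a time.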

model_flops = (
    base_model.floating_point_ops(
        {"input_ids": torch.zeros((1, training_config["model"]["max_length"]))}
    )
    * training_args.gradient_accumulation_steps
)
print(base_model)
print("Memory footprint", base_model.get_memory_footprint() / 1e9, "GB")
print("Flops", model_flops / 1e9, "GFLOPs")

This also computes and prints the model's memory footprint and its compute cost (FLOPs).
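
For a feel of what the memory figure should roughly look like, here is a back-of-the-envelope estimate of the weights alone. The `estimate_weight_memory_gb` helper is hypothetical, and the parameter count of about 70 million is an assumption read off the model's name. The real get_memory_footprint call may report a somewhat different number, since it measures the loaded model rather than estimating.

```python
def estimate_weight_memory_gb(num_parameters, bytes_per_parameter=4):
    """Approximate memory taken by the weights alone, in GB (1e9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9


# Assumption: roughly 70 million parameters, as the model's name suggests.
fp32_gb = estimate_weight_memory_gb(70_000_000)                         # 4 bytes per float32 weight
fp16_gb = estimate_weight_memory_gb(70_000_000, bytes_per_parameter=2)  # 2 bytes per float16 weight
print(fp32_gb, fp16_gb)
```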

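Finally, these arguments are used to create a Trainer object that performs the actual training. This Trainer takes model_flops and total_steps, which the stock transformers.Trainer does not accept, so it is presumably the custom class pulled in by the wildcard import from utilities.

trainer = Trainer(
    model=base_model,
    model_flops=model_flops,
    total_steps=max_steps,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)

Train the model for a few steps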

training_output = trainer.train()

This line starts the fine-tuning run and stores the training output in training_output.

Save the fine-tuned model

save_dir = f'{output_dir}/final'
trainer.save_model(save_dir)
print("Saved model to:", save_dir)
finetuned_slightly_model = AutoModelForCausalLM.from_pretrained(save_dir, local_files_only=True)
finetuned_slightly_model.to(device)

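This part saves the fine-tuned model to the specified directory.

It then reloads the model from that directory with AutoModelForCausalLM.from_pretrained and moves it to the appropriate device.

Run inference with the fine-tuned model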

test_question = test_dataset[0]['question']
print("Question input (test):", test_question)
print("Finetuned slightly model's answer: ")
print(inference(test_question, finetuned_slightly_model, tokenizer))
test_answer = test_dataset[0]['answer']
print("Target answer output (test):", test_answer)

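This uses the inference function defined earlier to try the fine-tuned model on the first example of the test dataset.

It prints the input question, the model's output, and the correct answer.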
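
Looking at a single example is a sanity check rather than an evaluation. One simple way to score a whole test set is an exact-match rate, sketched below. The `exact_match_rate` helper and the sample strings are made up for illustration, and exact match is a strict metric for free-form answers, so treat it as a starting point rather than the tutorial's method.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial differences don't count as misses."""
    return " ".join(text.lower().split())


def exact_match_rate(predictions, targets):
    """Fraction of predictions that equal their target after normalization."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(normalize(p) == normalize(t) for p, t in zip(predictions, targets))
    return hits / len(targets)


# Made-up model outputs and reference answers.
predictions = ["Yes, it does.", "no"]
targets = ["yes, it  does.", "Maybe"]
score = exact_match_rate(predictions, targets)
print(score)  # 0.5
```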

Load and run other pretrained models

finetuned_longer_model = AutoModelForCausalLM.from_pretrained("lamini/lamini_docs_finetuned")
tokenizer = AutoTokenizer.from_pretrained("lamini/lamini_docs_finetuned")
finetuned_longer_model.to(device)
print("Finetuned longer model's answer: ")
print(inference(test_question, finetuned_longer_model, tokenizer))
bigger_finetuned_model = BasicModelRunner(model_name_to_id["bigger_model_name"])
bigger_finetuned_output = bigger_finetuned_model(test_question)
print("Bigger (2.8B) finetuned model (test): ", bigger_finetuned_output)

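This part loads another model that was fine-tuned for longer, as well as a larger fine-tuned model with 2.8B parameters. Both are run on the first example of the test dataset, and their outputs are printed.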