Fine-tuning QwQ-32B (4-bit) with unsloth
GPU: RTX 3090 (24 GB)
Installing unsloth
-
Install via pip (using the USTC mirror here):
pip install unsloth --index-url https://pypi.mirrors.ustc.edu.cn/simple
To get the latest build from GitHub (the first line enables the hosting platform's network accelerator, if your provider has one):
source /etc/network_turbo
pip install --force-reinstall --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git
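Before going further, it can help to confirm the environment actually sees the GPU. A minimal sketch using torch (which unsloth depends on):

import torch

print(torch.__version__)               # installed PyTorch version
print(torch.cuda.is_available())       # should be True
print(torch.cuda.get_device_name(0))   # should report the RTX 3090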
Sign up for wandb to monitor the fine-tuning run
-
wandb site:
https://wandb.ai/site
-
Log in
Install the client:
pip install wandb
Then log in with your API key:
wandb login
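If you prefer a non-interactive login, wandb also reads the WANDB_API_KEY environment variable; a minimal sketch (the key string is a placeholder):

import os
os.environ["WANDB_API_KEY"] = "api-key"  # placeholder: use your own key

import wandb
wandb.login()  # picks the key up from the environment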
-
Try the official example
Notes:
- Internet access is required
- Replace the key with your own
- The entity must be created on wandb in advance
import random
import wandb

wandb.login(key="api-key")

# Start a new wandb run to track this script.
run = wandb.init(
    # Set the wandb entity where your project will be logged (generally your team name).
    entity="qinchihongye-pa",
    # Set the wandb project where this run will be logged.
    project="project_test",
    # Track hyperparameters and run metadata.
    config={
        "learning_rate": 0.02,
        "architecture": "CNN",
        "dataset": "CIFAR-100",
        "epochs": 10,
    },
)

# Simulate training.
epochs = 10
offset = random.random() / 5
for epoch in range(2, epochs):
    acc = 1 - 2**-epoch - random.random() / epoch - offset
    loss = 2**-epoch + random.random() / epoch + offset
    # Log metrics to wandb.
    run.log({"acc": acc, "loss": loss})

# Finish the run and upload any remaining data.
run.finish()
Download the quantized QwQ-32B model
-
Hugging Face repo (unsloth's 4-bit quantization, which loses less precision than Q4_K_M):
https://huggingface.co/unsloth/QwQ-32B-unsloth-bnb-4bit
Copy the repo name:
unsloth/QwQ-32B-unsloth-bnb-4bit
-
Assume the current working directory is
/root/lanyun-tmp
-
Create a directory to keep all Hugging Face downloads in one place:
mkdir Hugging-Face
mkdir -p Hugging-Face/QwQ-32B-unsloth-bnb-4bit
-
Configure a mirror
vim ~/.bashrc
Add the following two lines to set the Hugging Face mirror endpoint and the default location where models are saved:
export HF_ENDPOINT=https://hf-mirror.com
export HF_HOME=/root/lanyun-tmp/Hugging-Face

Reload and check that the environment variables took effect:
source ~/.bashrc
echo $HF_ENDPOINT
echo $HF_HOME
-
Install the official Hugging Face download tool:
pip install -U huggingface_hub
-
Run the download command:
huggingface-cli download --resume-download unsloth/QwQ-32B-unsloth-bnb-4bit --local-dir /root/lanyun-tmp/Hugging-Face/QwQ-32B-unsloth-bnb-4bit
Or download with Python:
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="unsloth/QwQ-32B-unsloth-bnb-4bit",
    local_dir="/root/lanyun-tmp/Hugging-Face/QwQ-32B-unsloth-bnb-4bit",
)
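After either download finishes, it can be worth sanity-checking that the weight shards actually arrived; a small sketch (path as configured above):

import os

model_dir = "/root/lanyun-tmp/Hugging-Face/QwQ-32B-unsloth-bnb-4bit"
files = sorted(os.listdir(model_dir))
print(files)  # expect config.json, tokenizer files, and *.safetensors shards
total = sum(
    os.path.getsize(os.path.join(model_dir, f))
    for f in files if os.path.isfile(os.path.join(model_dir, f))
)
print(f"{total / 1024**3:.1f} GB on disk")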
Calling the model with the transformers library
-
Code:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "/root/lanyun-tmp/Hugging-Face/QwQ-32B-unsloth-bnb-4bit"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="cuda:0",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "你好"
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
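For QwQ's long chain-of-thought outputs it is often more pleasant to stream tokens as they are generated. A sketch using transformers' TextStreamer, reusing model, tokenizer, and model_inputs from the block above:

from transformers import TextStreamer

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
_ = model.generate(
    **model_inputs,
    max_new_tokens=2048,
    streamer=streamer,  # prints tokens to stdout as they arrive
)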
-
VRAM usage: about 23 GB.
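The figure can be checked from inside the process with torch's memory counters, for example:

import torch

print(f"allocated: {torch.cuda.memory_allocated() / 1024**3:.1f} GB")
print(f"reserved:  {torch.cuda.memory_reserved() / 1024**3:.1f} GB")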
Serving with vLLM
-
Launch:
cd /root/lanyun-tmp/Hugging-Face
vllm serve ./QwQ-32B-unsloth-bnb-4bit \
    --quantization bitsandbytes \
    --load-format bitsandbytes \
    --max-model-len 500 \
    --port 8081
-
Client code:
from openai import OpenAI
import openai

openai.api_key = '1111111'  # any placeholder value works here
openai.base_url = 'http://127.0.0.1:8081/v1'

def get_completion(prompt, model="QwQ-32B"):
    client = OpenAI(api_key=openai.api_key, base_url=openai.base_url)
    messages = [{"role": "user", "content": prompt}]
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        stream=False
    )
    return response.choices[0].message.content

prompt = '你好,請幽默的介紹下你自己,不少于300字'
get_completion(prompt, model="./QwQ-32B-unsloth-bnb-4bit")
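Because the vLLM server speaks the OpenAI protocol, streaming works too; a sketch of a streaming variant of get_completion (same client setup as above):

def get_completion_stream(prompt, model="./QwQ-32B-unsloth-bnb-4bit"):
    client = OpenAI(api_key=openai.api_key, base_url=openai.base_url)
    messages = [{"role": "user", "content": prompt}]
    stream = client.chat.completions.create(model=model, messages=messages, stream=True)
    for chunk in stream:
        # some chunks carry no content delta, so guard before printing
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)

get_completion_stream('你好')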
CoT dataset
-
FreedomIntelligence/medical-o1-reasoning-SFT
https://huggingface.co/datasets/FreedomIntelligence/medical-o1-reasoning-SFT
-
Download the English split:
from datasets import load_dataset
import rich

# Login using e.g. `huggingface-cli login` to access this dataset
ds = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT", "en")

rich.print(ds['train'][0])
-
Download the Chinese split:
from datasets import load_dataset
import rich

# Login using e.g. `huggingface-cli login` to access this dataset
ds = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT", "zh")

rich.print(ds['train'][0])
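Before writing the formatting function used later, it helps to confirm the column layout; a quick check on the ds object from above:

print(ds['train'].column_names)  # expect ['Question', 'Complex_CoT', 'Response']
print(len(ds['train']))          # total number of training examples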
-
After the download completes, the data shows up in the datasets directory under the HF_HOME path:
ll /root/lanyun-tmp/Hugging-Face/datasets/
Loading QwQ-32B with unsloth
-
unsloth can load the model and run inference directly. First, load the model:
from unsloth import FastLanguageModel

max_seq_length = 2048
dtype = None
load_in_4bit = True  # load in 4-bit

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "/root/lanyun-tmp/Hugging-Face/QwQ-32B-unsloth-bnb-4bit/",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)
VRAM usage: about 22 GB.
-
Inference:
# Switch the model to inference mode
FastLanguageModel.for_inference(model)

def QwQ32b_infer(question):
    # prompt template
    prompt_style_chat = """請寫出一個恰當的回復來完成當前對話任務。

### Instruction:
你是一名助人為樂的助手。

### Question:
{}

### Response:
<think>{}"""
    inputs = tokenizer(
        [prompt_style_chat.format(question, "")],
        return_tensors="pt"
    ).to("cuda")
    outputs = model.generate(
        input_ids=inputs.input_ids,
        max_new_tokens=2048,
        use_cache=True,
    )
    response = tokenizer.batch_decode(outputs)
    return response[0].split("### Response:")[1]

question = "證明根號2是無理數"
response = QwQ32b_infer(question)
Model fine-tuning
-
Baseline test: ask the base model two questions taken from the fine-tuning dataset
question_1 = "根據描述,一個1歲的孩子在夏季頭皮出現多處小結節,長期不愈合,且現在瘡大如梅,潰破流膿,口不收斂,頭皮下有空洞,患處皮膚增厚。這種病癥在中醫中診斷為什么病?"
question_2 = "一個生后8天的男嬰因皮膚黃染伴發熱和拒乳入院。體檢發現其皮膚明顯黃染,肝脾腫大和臍部少量滲液伴臍周紅腫。在此情況下,哪種檢查方法最有助于確診感染病因?"

response_1 = QwQ32b_infer(question_1)
response_2 = QwQ32b_infer(question_2)

print(response_1)
print(response_2)
-
Load and preprocess the data; take the first 500 training examples as a minimum-viability experiment:
import os
from datasets import load_dataset

# Q&A prompt template for training
train_prompt_style = """下面是描述任務的指令,與提供進一步上下文的輸入配對。編寫適當完成請求的響應。在回答之前,仔細思考問題,并創建逐步的思想鏈,以確保邏輯和準確的響應。

### Instruction:
您是一位在臨床推理、診斷和治療計劃方面擁有先進知識的醫學專家。請回答以下醫學問題。

### Question:
{}

### Response:
<think>
{}
</think>
{}"""

# token that marks the end of generated text
EOS_TOKEN = tokenizer.eos_token
tokenizer.eos_token  # '<|im_end|>'

# format each example into a single training string
def formatting_prompts_func(examples):
    inputs = examples["Question"]
    cots = examples["Complex_CoT"]
    outputs = examples["Response"]
    texts = []
    for input, cot, output in zip(inputs, cots, outputs):
        text = train_prompt_style.format(input, cot, output) + EOS_TOKEN
        texts.append(text)
    return {"text": texts}

# take only the first 500 training examples
dataset = load_dataset(
    "FreedomIntelligence/medical-o1-reasoning-SFT", "zh",
    split="train[0:500]",
    trust_remote_code=True,
)
dataset = dataset.map(formatting_prompts_func, batched=True)

import rich
rich.print(dataset[0])
rich.print(dataset[0]['text'])
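Since max_seq_length is 2048, it is worth confirming that the formatted samples actually fit; a small sketch reusing the tokenizer loaded earlier:

# token lengths of the first 50 formatted samples; anything beyond 2048 would be truncated
lengths = [len(tokenizer(t).input_ids) for t in dataset["text"][:50]]
print(max(lengths), sum(lengths) / len(lengths))  # max and mean token counts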
-
Switch the model to fine-tuning mode:
# Wrap the model with LoRA adapters for fine-tuning
model = FastLanguageModel.get_peft_model(
    model,
    r=4,  # rank of the low-rank matrices (r=16 would give more capacity)
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",  # True or "unsloth" for very long context
    random_state=1024,
    use_rslora=False,
    loftq_config=None,
)
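To see how small the LoRA update is relative to the 32B base, the standard PEFT helper reports the trainable fraction (assuming the wrapped model exposes it, as peft models do):

model.print_trainable_parameters()
# prints something like: trainable params: ... || all params: ... || trainable%: ...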
-
Create the trainer (the supervised fine-tuning object):
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model=model,                    # the pretrained model to fine-tune
    tokenizer=tokenizer,            # tokenizer
    train_dataset=dataset,          # training data
    dataset_text_field="text",      # dataset column holding the training text (built in formatting_prompts_func)
    max_seq_length=max_seq_length,  # maximum sequence length, caps the input token count
    dataset_num_proc=2,             # parallel processes for data loading
    args=TrainingArguments(
        per_device_train_batch_size=1,     # per-device training batch size (small values suit large models)
        gradient_accumulation_steps=4,     # gradient accumulation, effective batch size = 1 * 4 = 4
        # num_train_epochs=1,              # if num_train_epochs is set, max_steps is ignored
        warmup_steps=5,                    # warmup steps: start at a low learning rate, then ramp up
        max_steps=60,                      # maximum number of training steps
        learning_rate=2e-4,                # learning rate
        fp16=not is_bfloat16_supported(),  # fall back to fp16 if the GPU lacks bfloat16
        bf16=is_bfloat16_supported(),      # use bf16 when supported (more stable training)
        logging_steps=10,                  # log every 10 steps
        optim="adamw_8bit",                # 8-bit AdamW optimizer to reduce VRAM usage
        weight_decay=0.01,                 # weight decay (L2 regularization) against overfitting
        lr_scheduler_type="linear",        # linear learning-rate decay
        seed=1024,                         # random seed for reproducibility
        output_dir="/root/lanyun-tmp/outputs",  # output directory for training artifacts
    ),
)

# set up wandb (optional)
import wandb
wandb.login(key="api-key")
run = wandb.init(entity="qinchihongye-pa", project='QwQ-32B-4bit-FT')

# start fine-tuning
trainer_stats = trainer.train()
trainer_stats
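trainer.train() returns a TrainOutput whose metrics dict summarizes the run; for example:

print(trainer_stats.metrics)  # train_runtime, train_loss, steps per second, ...
print(trainer_stats.metrics.get("train_runtime", 0) / 60, "minutes")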
During the run, keep an eye on the VRAM usage and on the training logs (the loss is printed every 10 steps).
Click the wandb run link to inspect how the loss, learning rate, gradients, and so on evolve over the course of training.
-
After fine-tuning finishes, unsloth automatically updates the model weights in place (in the cache), so there is no need to merge manually; the fine-tuned model can be called directly:
FastLanguageModel.for_inference(model)

new_response_1 = QwQ32b_infer(question_1)
new_response_2 = QwQ32b_infer(question_2)

print(new_response_1)
print(new_response_2)
The first question is still answered incorrectly, and the second is unchanged as well; consider a larger fine-tuning run using the full dataset and multiple epochs.
-
Model merging
The locally saved model weights now live in
/root/lanyun-tmp/outputs
Note that unsloth saves a checkpoint every 100 steps by default; since this run used max_steps=60, there is only a single checkpoint.
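If a run is interrupted, it can be resumed from the most recent checkpoint under output_dir using the standard transformers argument; a hedged sketch:

# resume training from the latest checkpoint in output_dir
trainer_stats = trainer.train(resume_from_checkpoint=True)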
Merge and save as safetensors:
model.save_pretrained_merged(
    "/root/lanyun-tmp/QwQ-Medical-COT-Tiny",
    tokenizer,
    save_method="merged_4bit_forced",  # save as 4-bit quantized
)

# model.save_pretrained_merged(
#     "dir",
#     tokenizer,
#     save_method="merged_16bit",  # save as 16-bit
# )
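To verify the merge, the saved directory can be loaded back like any local model; a sketch using unsloth again (directory name from the call above):

from unsloth import FastLanguageModel

merged_model, merged_tokenizer = FastLanguageModel.from_pretrained(
    model_name="/root/lanyun-tmp/QwQ-Medical-COT-Tiny",
    max_seq_length=2048,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(merged_model)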
Merge to GGUF format (requires quantization, very time-consuming):
# model.save_pretrained_gguf("dir", tokenizer, quantization_method="q4_k_m")
# model.save_pretrained_gguf("dir", tokenizer, quantization_method="q8_0")
# model.save_pretrained_gguf("dir", tokenizer, quantization_method="f16")