【大模型實戰】利用ms-swift微調框架對QwQ-32B推理模型進行微調

1. 背景介紹

????????之前我們在《大模型訓練/微調的一些經驗分享》、《利用DeepSeek-R1數據微調蒸餾ChatGLM32B讓大模型具備思考能力》中做了相關模型微調的介紹。目前在基座大模型能力還沒有達到足夠牛的情況下，大模型微調在商業化、垂直領域應用依然是不可或缺，即使是使用DeepSeek-R1、QwQ-32B也難以保證商業應用的要求。

? ? ? ? 今天我們將介紹利用ms-swift框架對近期開源的阿里QwQ-32B推理模型進行lora微調的實驗。

2. 模型微調方案

????????ms-swift是modelscope提供的大模型與多模態大模型微調部署框架，支持語言大模型與多模態大模型的訓練（預訓練、微調、人類對齊）、推理、評測、量化與部署。其中語言大模型包括：Qwen2.5、GLM4、Llama3.3、DeepSeek-R1等模型，多模態大模型包括：Qwen2.5-VL、Qwen2-Audio、Llama3.2-Vision、Llava、GLM4v、Xcomposer2.5、DeepSeek-VL2等模型。

????????QwQ-32B模型參數大小約為62G，如果選擇全參微調，保守估計需要20倍的顯存資源，也就是需要1240G顯存，即使8卡的A800集群，大概在640G顯存，也不足以做微調，因此為了落地方便，選擇Lora微調方式。最終我們利用4卡-A800完成Lora的訓練。

? ? ? ?題外話：ms-swift的文檔不全，這點做的不是特別好。

3. 模型微調實驗

3.1 微調數據格式

????????格式如下所示，但我們微調使用的是帶推理過程的數據，這個需要注意一下。關于think的語料可以參考?《利用DeepSeek-R1數據微調蒸餾ChatGLM32B讓大模型具備思考能力》。

{"messages": [{"role": "system", "content": "你是一個樂于助人且知識淵博的AI助手。"},{"role": "user", "content": "請解釋AlphaGo的基本原理。"}, {"role": "assistant", "content": "AlphaGo結合深度神經網絡與樹搜索算法，通過棋局數據進行訓練，逐步提升棋力，能夠判斷圍棋局勢并選擇最佳著法。"}]}{"messages": [{"role": "system", "content": "你是一個樂于助人且知識淵博的AI助手。"},{"role": "user", "content": "列舉一些健康飲食的重要原則。"}, {"role": "assistant", "content": "健康飲食的重要原則包括：均衡飲食，合理攝入各類營養素；多吃新鮮蔬果，適量飲水；少吃糖、油、鹽，忌吃過量快餐；養成規律生活作息習慣，適量運動。"}]} {"messages": [{"role": "system", "content": "你是一個樂于助人且知識淵博的AI助手。"},{"role": "user", "content": "給出一些改善睡眠質量的建議。"}, {"role": "assistant", "content": "改善睡眠質量的建議包括：保證睡眠時間，避免睡眠不足；睡前放松，避免刺激；養成規律作息時間，不要頻繁熬夜；適量運動，但不要運動過度；睡前可以喝一杯熱牛奶等溫和飲料。"}]}

3.2 訓練腳本

NPROC_PER_NODE=4 \
CUDA_VISIBLE_DEVICES=0,1,2,3 \
swift sft \--model /data/QwQ-32B \--train_type lora \--dataset '/data/qwq32b_sft_lora/rl-v0312.jsonl' \--torch_dtype bfloat16 \--num_train_epochs 5 \--per_device_train_batch_size 1 \--per_device_eval_batch_size 1 \--learning_rate 1e-4 \--lora_rank 8 \--lora_alpha 32 \--target_modules all-linear \--gradient_accumulation_steps 8 \--eval_steps 50 \--save_steps 50 \--save_total_limit 5 \--logging_steps 5 \--max_length 8192 \--output_dir /data/qwq32b_sft_lora/output \--warmup_ratio 0.05 \--dataloader_num_workers 4 \--model_author swift \--model_name swift-robot \--deepspeed zero3

3.3 訓練日志

????????從訓練日志可以清晰看到，整個微調階段的loss逐步收斂。另外框架會輸出最佳的模型checkpoint模型參數。

[2025-03-11 19:28:37,083] [INFO] [config.py:734:__init__] Config mesh_device None world_size = 4
[2025-03-11 19:28:37,084] [INFO] [config.py:734:__init__] Config mesh_device None world_size = 4
[2025-03-11 19:28:37,092] [INFO] [config.py:734:__init__] Config mesh_device None world_size = 4
[2025-03-11 19:28:37,401] [INFO] [partition_parameters.py:348:__exit__] finished initializing model - num_params = 771, num_elems = 32.76B
Loading checkpoint shards: 100%|██████████| 14/14 [00:16<00:00, ?1.17s/it]
Loading checkpoint shards: 100%|██████████| 14/14 [00:16<00:00, ?1.17s/it]
Loading checkpoint shards: 100%|██████████| 14/14 [00:16<00:00, ?1.17s/it]
Loading checkpoint shards: 100%|██████████| 14/14 [00:17<00:00, ?1.22s/it]
[INFO:swift] model_info: ModelInfo(model_type='qwq', model_dir='/data/QwQ-32B', torch_dtype=torch.bfloat16, max_model_len=131072, quant_method=None, quant_bits=None, rope_scaling=None, config=Qwen2Config {
? "_name_or_path": "/data/QwQ-32B",
? "architectures": [
? ? "Qwen2ForCausalLM"
? ],
? "attention_dropout": 0.0,
? "bos_token_id": 151643,
? "eos_token_id": 151645,
? "hidden_act": "silu",
? "hidden_size": 5120,
? "initializer_range": 0.02,
? "intermediate_size": 27648,
? "max_position_embeddings": 131072,
? "max_window_layers": 64,
? "model_type": "qwen2",
? "num_attention_heads": 40,
? "num_hidden_layers": 64,
? "num_key_value_heads": 8,
? "rms_norm_eps": 1e-05,
? "rope_scaling": null,
? "rope_theta": 1000000.0,
? "sliding_window": 32768,
? "tie_word_embeddings": false,
? "torch_dtype": "bfloat16",
? "transformers_version": "4.49.0",
? "use_cache": true,
? "use_sliding_window": false,
? "vocab_size": 152064
}
, task_type='causal_lm', num_labels=None)
[INFO:swift] model.generation_config: GenerationConfig {
? "bos_token_id": 151643,
? "eos_token_id": [
? ? 151645,
? ? 151643
? ],
? "max_new_tokens": 64,
? "pad_token_id": 151643
}

[INFO:swift] default_system: None
[INFO:swift] The TrainArguments will be saved in: /data/qwq32b_sft_lora/output/v9-20250311-192834/args.json
[INFO:swift] Start time of running main: 2025-03-11 19:28:54.707260
Map: 100%|██████████| 2645/2645 [00:00<00:00, 8426.21 examples/s]?
Map: 100%|██████████| 2645/2645 [00:00<00:00, 7697.28 examples/s]?
Map: 100%|██████████| 2645/2645 [00:00<00:00, 6463.52 examples/s]?
Map: ? 0%| ? ? ? ? ?| 0/2619 [00:00<?, ? examples/s][INFO:swift] create tmp_dir: /.cache/modelscope/hub/tmp/hf_datasets-i15lb3_o
Map: 100%|██████████| 2645/2645 [00:00<00:00, 9980.89 examples/s]?
[INFO:swift] train_dataset: Dataset({
? ? features: ['messages'],
? ? num_rows: 2619
})
[INFO:swift] val_dataset: Dataset({
? ? features: ['messages'],
? ? num_rows: 26
? ??
? ??
[INFO:swift] [INPUT] <|im_start|>system
## 角色
你是一名AI客服，你很專業、友善、禮貌，擅長為客戶提供導購服務。

## 目標
結合歷史對話和用戶當前問題判斷對話場景（如：商品推薦、商品咨詢、拒答范圍、閑聊），針對不同場景作親和的“原始回復”，然后根據<紅線>內容，對”原始回復“作詳細的安全審查改寫，優化后直接輸出。

## 紅線
- 不提供任何無關的信息
- 不提供任何電話號碼和網站鏈接
- 不提供任何知識庫中不存在的商品信息

## 限制
- 避免談論競品、具體價格或優惠、商品鏈接、商品維修、訂單、售后、物流。
- 注意區分商品型號，若商品型號指代不明，在回復中需主動向用戶確認。
- 涉及商品信息時，你只能使用<知識庫>中的內容回復。
- 當<知識庫>為空或無法回答問題時，需要禮貌的婉拒回答。
- 在婉拒回答時，必須帶這句話“建議您咨詢官方商城的人工客服或訪問官網”。
- 不要主動解釋專業詞匯。
- 不要暴露任何思考過程，直接回答。
- 回答中不要暴露“原始回復”，直接輸出安全審查改寫后的內容。

## 建議
- 介紹商品時推薦使用markdown格式
- 善用追問去引導客戶把訴求表達清楚，但歷史對話已經表達清楚的訴求不要在追問。
<|im_end|>
<|im_start|>user
這款KD-85X85K和85x90L有什么區別？<|im_end|>
<|im_start|>assistant
X85K采用X1芯片，而X90L搭載XR認知芯片，能在畫面整體優化、焦點主體強化和音畫同步方面更出色。X90L通過多維度數據分析匹配畫質數據庫，重點優化用戶關注區域。如果您追求更好畫質建議選X90L，預算有限的話X85K也不錯。<|im_end|>
<|im_start|>user
尺寸一樣大嗎？<|im_end|>
<|im_start|>assistant
<think>
好的，用戶問的是“尺寸一樣大嗎？”，我需要先回顧歷史對話。之前的對話中，用戶詢問了KD-85X85K和85X90L的區別，客服提到兩者的芯片不同以及畫質優化方面的差異。現在用戶進一步詢問尺寸是否相同。

首先，檢查知識庫中的商品信息。KD-85X85K的型號中“85”代表85英寸，而XR-85X90L同樣有“85”在型號中，所以尺寸應該都是85英寸。確認知識庫中兩者的描述，確實都是85寸的電視。因此，用戶的問題應該回答尺寸相同，都是85英寸。同時，根據紅線規則，不能提供無關信息，所以不需要涉及其他方面，比如價格或功能差異，除非用戶進一步詢問。另外，確保回復中沒有鏈接或電話號碼，符合安全審查的要求。因此，直接確認兩者尺寸相同，并保持禮貌和簡潔。
</think>

兩款電視的尺寸相同，都是85英寸大屏設計哦~ 如果您還有其他疑問，歡迎隨時告訴我。建議您咨詢官方商城的人工客服或訪問官網了解更多信息。<|im_end|>

[INFO:swift] lora_config: LoraConfig(task_type='CAUSAL_LM', peft_type=<PeftType.LORA: 'LORA'>, auto_mapping=None, base_model_name_or_path='/data/QwQ-32B', revision=None, inference_mode=False, r=8, target_modules={'k_proj', 'down_proj', 'q_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj'}, exclude_modules=None, lora_alpha=32, lora_dropout=0.05, fan_in_fan_out=False, bias='none', use_rslora=False, modules_to_save=[], init_lora_weights=True, layers_to_transform=None, layers_pattern=None, rank_pattern={}, alpha_pattern={}, megatron_config=None, megatron_core='megatron.core', loftq_config={}, eva_config=None, use_dora=False, layer_replication=None, runtime_config=LoraRuntimeConfig(ephemeral_gpu_offload=False), lora_bias=False, lora_dtype=None, lorap_lr_ratio=None, lorap_emb_lr=1e-06)
[INFO:swift] model: PeftModelForCausalLM(
? (base_model): LoraModel(
? ? (model): Qwen2ForCausalLM(
? ? ? (model): Qwen2Model(
? ? ? ? (embed_tokens): Embedding(152064, 5120)
? ? ? ? (layers): ModuleList(
? ? ? ? ? (0-63): 64 x Qwen2DecoderLayer(
? ? ? ? ? ? (self_attn): Qwen2Attention(
? ? ? ? ? ? ? (q_proj): lora.Linear(
? ? ? ? ? ? ? ? (base_layer): Linear(in_features=5120, out_features=5120, bias=True)
? ? ? ? ? ? ? ? (lora_dropout): ModuleDict(
? ? ? ? ? ? ? ? ? (default): Dropout(p=0.05, inplace=False)
? ? ? ? ? ? ? ? )
? ? ? ? ? ? ? ? (lora_A): ModuleDict(
? ? ? ? ? ? ? ? ? (default): Linear(in_features=5120, out_features=8, bias=False)
? ? ? ? ? ? ? ? )
? ? ? ? ? ? ? ? (lora_B): ModuleDict(
? ? ? ? ? ? ? ? ? (default): Linear(in_features=8, out_features=5120, bias=False)
? ? ? ? ? ? ? ? )
? ? ? ? ? ? ? ? (lora_embedding_A): ParameterDict()
? ? ? ? ? ? ? ? (lora_embedding_B): ParameterDict()
? ? ? ? ? ? ? ? (lora_magnitude_vector): ModuleDict()
? ? ? ? ? ? ? )
? ? ? ? ? ? ? (k_proj): lora.Linear(
? ? ? ? ? ? ? ? (base_layer): Linear(in_features=5120, out_features=1024, bias=True)
? ? ? ? ? ? ? ? (lora_dropout): ModuleDict(
? ? ? ? ? ? ? ? ? (default): Dropout(p=0.05, inplace=False)
? ? ? ? ? ? ? ? )
? ? ? ? ? ? ? ? (lora_A): ModuleDict(
? ? ? ? ? ? ? ? ? (default): Linear(in_features=5120, out_features=8, bias=False)
? ? ? ? ? ? ? ? )
? ? ? ? ? ? ? ? (lora_B): ModuleDict(
? ? ? ? ? ? ? ? ? (default): Linear(in_features=8, out_features=1024, bias=False)
? ? ? ? ? ? ? ? )
? ? ? ? ? ? ? ? (lora_embedding_A): ParameterDict()
? ? ? ? ? ? ? ? (lora_embedding_B): ParameterDict()
? ? ? ? ? ? ? ? (lora_magnitude_vector): ModuleDict()
? ? ? ? ? ? ? )
? ? ? ? ? ? ? (v_proj): lora.Linear(
? ? ? ? ? ? ? ? (base_layer): Linear(in_features=5120, out_features=1024, bias=True)
? ? ? ? ? ? ? ? (lora_dropout): ModuleDict(
? ? ? ? ? ? ? ? ? (default): Dropout(p=0.05, inplace=False)
? ? ? ? ? ? ? ? )
? ? ? ? ? ? ? ? (lora_A): ModuleDict(
? ? ? ? ? ? ? ? ? (default): Linear(in_features=5120, out_features=8, bias=False)
? ? ? ? ? ? ? ? )
? ? ? ? ? ? ? ? (lora_B): ModuleDict(
? ? ? ? ? ? ? ? ? (default): Linear(in_features=8, out_features=1024, bias=False)
? ? ? ? ? ? ? ? )
? ? ? ? ? ? ? ? (lora_embedding_A): ParameterDict()
? ? ? ? ? ? ? ? (lora_embedding_B): ParameterDict()
? ? ? ? ? ? ? ? (lora_magnitude_vector): ModuleDict()
? ? ? ? ? ? ? )
? ? ? ? ? ? ? (o_proj): lora.Linear(
? ? ? ? ? ? ? ? (base_layer): Linear(in_features=5120, out_features=5120, bias=False)
? ? ? ? ? ? ? ? (lora_dropout): ModuleDict(
? ? ? ? ? ? ? ? ? (default): Dropout(p=0.05, inplace=False)
? ? ? ? ? ? ? ? )
? ? ? ? ? ? ? ? (lora_A): ModuleDict(
? ? ? ? ? ? ? ? ? (default): Linear(in_features=5120, out_features=8, bias=False)
? ? ? ? ? ? ? ? )
? ? ? ? ? ? ? ? (lora_B): ModuleDict(
? ? ? ? ? ? ? ? ? (default): Linear(in_features=8, out_features=5120, bias=False)
? ? ? ? ? ? ? ? )
? ? ? ? ? ? ? ? (lora_embedding_A): ParameterDict()
? ? ? ? ? ? ? ? (lora_embedding_B): ParameterDict()
? ? ? ? ? ? ? ? (lora_magnitude_vector): ModuleDict()
? ? ? ? ? ? ? )
? ? ? ? ? ? )
? ? ? ? ? ? (mlp): Qwen2MLP(
? ? ? ? ? ? ? (gate_proj): lora.Linear(
? ? ? ? ? ? ? ? (base_layer): Linear(in_features=5120, out_features=27648, bias=False)
? ? ? ? ? ? ? ? (lora_dropout): ModuleDict(
? ? ? ? ? ? ? ? ? (default): Dropout(p=0.05, inplace=False)
? ? ? ? ? ? ? ? )
? ? ? ? ? ? ? ? (lora_A): ModuleDict(
? ? ? ? ? ? ? ? ? (default): Linear(in_features=5120, out_features=8, bias=False)
? ? ? ? ? ? ? ? )
? ? ? ? ? ? ? ? (lora_B): ModuleDict(
? ? ? ? ? ? ? ? ? (default): Linear(in_features=8, out_features=27648, bias=False)
? ? ? ? ? ? ? ? )
? ? ? ? ? ? ? ? (lora_embedding_A): ParameterDict()
? ? ? ? ? ? ? ? (lora_embedding_B): ParameterDict()
? ? ? ? ? ? ? ? (lora_magnitude_vector): ModuleDict()
? ? ? ? ? ? ? )
? ? ? ? ? ? ? (up_proj): lora.Linear(
? ? ? ? ? ? ? ? (base_layer): Linear(in_features=5120, out_features=27648, bias=False)
? ? ? ? ? ? ? ? (lora_dropout): ModuleDict(
? ? ? ? ? ? ? ? ? (default): Dropout(p=0.05, inplace=False)
? ? ? ? ? ? ? ? )
? ? ? ? ? ? ? ? (lora_A): ModuleDict(
? ? ? ? ? ? ? ? ? (default): Linear(in_features=5120, out_features=8, bias=False)
? ? ? ? ? ? ? ? )
? ? ? ? ? ? ? ? (lora_B): ModuleDict(
? ? ? ? ? ? ? ? ? (default): Linear(in_features=8, out_features=27648, bias=False)
? ? ? ? ? ? ? ? )
? ? ? ? ? ? ? ? (lora_embedding_A): ParameterDict()
? ? ? ? ? ? ? ? (lora_embedding_B): ParameterDict()
? ? ? ? ? ? ? ? (lora_magnitude_vector): ModuleDict()
? ? ? ? ? ? ? )
? ? ? ? ? ? ? (down_proj): lora.Linear(
? ? ? ? ? ? ? ? (base_layer): Linear(in_features=27648, out_features=5120, bias=False)
? ? ? ? ? ? ? ? (lora_dropout): ModuleDict(
? ? ? ? ? ? ? ? ? (default): Dropout(p=0.05, inplace=False)
? ? ? ? ? ? ? ? )
? ? ? ? ? ? ? ? (lora_A): ModuleDict(
? ? ? ? ? ? ? ? ? (default): Linear(in_features=27648, out_features=8, bias=False)
? ? ? ? ? ? ? ? )
? ? ? ? ? ? ? ? (lora_B): ModuleDict(
? ? ? ? ? ? ? ? ? (default): Linear(in_features=8, out_features=5120, bias=False)
? ? ? ? ? ? ? ? )
? ? ? ? ? ? ? ? (lora_embedding_A): ParameterDict()
? ? ? ? ? ? ? ? (lora_embedding_B): ParameterDict()
? ? ? ? ? ? ? ? (lora_magnitude_vector): ModuleDict()
? ? ? ? ? ? ? )
? ? ? ? ? ? ? (act_fn): SiLU()
? ? ? ? ? ? )
? ? ? ? ? ? (input_layernorm): Qwen2RMSNorm((0,), eps=1e-05)
? ? ? ? ? ? (post_attention_layernorm): Qwen2RMSNorm((0,), eps=1e-05)
? ? ? ? ? )
? ? ? ? )
? ? ? ? (norm): Qwen2RMSNorm((0,), eps=1e-05)
? ? ? ? (rotary_emb): Qwen2RotaryEmbedding()
? ? ? )
? ? ? (lm_head): Linear(in_features=5120, out_features=152064, bias=False)
? ? )
? )
)
[INFO:swift] model_parameter_info: PeftModelForCausalLM: 32830.9852M Params (67.1089M Trainable [0.2044%]), 0.0001M Buffers.

Parameter Offload: Total persistent parameters: 25760768 in 1025 params
{'loss': 1.32348752, 'token_acc': 0.70985222, 'grad_norm': 0.80846994, 'learning_rate': 4.76e-06, 'memory(GiB)': 60.01, 'train_speed(iter/s)': 0.01743, 'epoch': 0.01, 'global_step/max_steps': '1/405', 'percentage': '0.25%', 'elapsed_time': '53s', 'remaining_time': '6h 2m 10s'}
Train: ? 0%| ? ? ? ? ?| 2/405 [01:46<5:56:12, 53.03s/it][2025-03-11 19:32:17,225] [WARNING] [stage3.py:2139:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
{'loss': 1.24938524, 'token_acc': 0.69148486, 'grad_norm': 0.86987531, 'learning_rate': 2.381e-05, 'memory(GiB)': 74.21, 'train_speed(iter/s)': 0.018535, 'epoch': 0.06, 'global_step/max_steps': '5/405', 'percentage': '1.23%', 'elapsed_time': '4m 26s', 'remaining_time': '5h 54m 54s'}
{'loss': 1.22446156, 'token_acc': 0.69278702, 'grad_norm': 0.77689102, 'learning_rate': 4.762e-05, 'memory(GiB)': 74.21, 'train_speed(iter/s)': 0.019271, 'epoch': 0.12, 'global_step/max_steps': '10/405', 'percentage': '2.47%', 'elapsed_time': '8m 35s', 'remaining_time': '5h 39m 15s'}
{'loss': 1.13267899, 'token_acc': 0.71570596, 'grad_norm': 0.40197327, 'learning_rate': 7.143e-05, 'memory(GiB)': 74.21, 'train_speed(iter/s)': 0.01927, 'epoch': 0.18, 'global_step/max_steps': '15/405', 'percentage': '3.70%', 'elapsed_time': '12m 54s', 'remaining_time': '5h 35m 45s'}
{'loss': 0.97332687, 'token_acc': 0.72897148, 'grad_norm': 0.34967286, 'learning_rate': 9.524e-05, 'memory(GiB)': 74.21, 'train_speed(iter/s)': 0.019607, 'epoch': 0.24, 'global_step/max_steps': '20/405', 'percentage': '4.94%', 'elapsed_time': '16m 56s', 'remaining_time': '5h 26m 7s'}
{'loss': 0.95233335, 'token_acc': 0.71795399, 'grad_norm': 0.32512059, 'learning_rate': 9.997e-05, 'memory(GiB)': 74.21, 'train_speed(iter/s)': 0.01971, 'epoch': 0.31, 'global_step/max_steps': '25/405', 'percentage': '6.17%', 'elapsed_time': '21m 4s', 'remaining_time': '5h 20m 24s'}
{'loss': 0.92778244, 'token_acc': 0.72106543, 'grad_norm': 0.22549374, 'learning_rate': 9.986e-05, 'memory(GiB)': 74.21, 'train_speed(iter/s)': 0.019805, 'epoch': 0.37, 'global_step/max_steps': '30/405', 'percentage': '7.41%', 'elapsed_time': '25m 11s', 'remaining_time': '5h 14m 49s'}
{'loss': 0.91093416, 'token_acc': 0.73585944, 'grad_norm': 0.21213417, 'learning_rate': 9.967e-05, 'memory(GiB)': 74.21, 'train_speed(iter/s)': 0.019825, 'epoch': 0.43, 'global_step/max_steps': '35/405', 'percentage': '8.64%', 'elapsed_time': '29m 21s', 'remaining_time': '5h 10m 25s'}
{'loss': 0.86407394, 'token_acc': 0.73746765, 'grad_norm': 0.22134356, 'learning_rate': 9.94e-05, 'memory(GiB)': 74.21, 'train_speed(iter/s)': 0.019853, 'epoch': 0.49, 'global_step/max_steps': '40/405', 'percentage': '9.88%', 'elapsed_time': '33m 31s', 'remaining_time': '5h 5m 52s'}
{'loss': 0.86335802, 'token_acc': 0.73666894, 'grad_norm': 0.236291, 'learning_rate': 9.904e-05, 'memory(GiB)': 74.21, 'train_speed(iter/s)': 0.019929, 'epoch': 0.55, 'global_step/max_steps': '45/405', 'percentage': '11.11%', 'elapsed_time': '37m 34s', 'remaining_time': '5h 0m 35s'}
{'loss': 0.81436214, 'token_acc': 0.76214197, 'grad_norm': 0.19902774, 'learning_rate': 9.86e-05, 'memory(GiB)': 74.21, 'train_speed(iter/s)': 0.019918, 'epoch': 0.61, 'global_step/max_steps': '50/405', 'percentage': '12.35%', 'elapsed_time': '41m 46s', 'remaining_time': '4h 56m 37s'}
Train: ?12%|█▏ ? ? ? ?| 50/405 [41:46<5:01:46, 51.00s/it]
{'eval_loss': 0.82470703, 'eval_token_acc': 0.75927635, 'eval_runtime': 15.7907, 'eval_samples_per_second': 1.647, 'eval_steps_per_second': 0.443, 'epoch': 0.61, 'global_step/max_steps': '50/405', 'percentage': '12.35%', 'elapsed_time': '42m 2s', 'remaining_time': '4h 58m 30s'}
Val: 100%|██████████| 7/7 [00:13<00:00, ?1.87s/it]00s/it]
[INFO:swift] Saving model checkpoint to /data/qwq32b_sft_lora/output/v9-20250311-192834/checkpoint-50
***********************************************
[2025-03-11 20:12:35,271] [WARNING] [stage3.py:2139:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
{'loss': 0.84376278, 'token_acc': 0.74929837, 'grad_norm': 0.29243814, 'learning_rate': 9.808e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019739, 'epoch': 0.67, 'global_step/max_steps': '55/405', 'percentage': '13.58%', 'elapsed_time': '46m 22s', 'remaining_time': '4h 55m 8s'}
{'loss': 0.82531147, 'token_acc': 0.75041408, 'grad_norm': 0.29134859, 'learning_rate': 9.748e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019703, 'epoch': 0.73, 'global_step/max_steps': '60/405', 'percentage': '14.81%', 'elapsed_time': '50m 41s', 'remaining_time': '4h 51m 29s'}
{'loss': 0.8170001, 'token_acc': 0.75919308, 'grad_norm': 0.24516849, 'learning_rate': 9.68e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019724, 'epoch': 0.8, 'global_step/max_steps': '65/405', 'percentage': '16.05%', 'elapsed_time': '54m 51s', 'remaining_time': '4h 46m 59s'}
{'loss': 0.81388254, 'token_acc': 0.75490298, 'grad_norm': 0.28124103, 'learning_rate': 9.604e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019781, 'epoch': 0.86, 'global_step/max_steps': '70/405', 'percentage': '17.28%', 'elapsed_time': '58m 55s', 'remaining_time': '4h 41m 58s'}
{'loss': 0.81019135, 'token_acc': 0.74177519, 'grad_norm': 0.28694744, 'learning_rate': 9.52e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019784, 'epoch': 0.92, 'global_step/max_steps': '75/405', 'percentage': '18.52%', 'elapsed_time': '1h 3m 7s', 'remaining_time': '4h 37m 44s'}
{'loss': 0.76696019, 'token_acc': 0.7639197, 'grad_norm': 0.311834, 'learning_rate': 9.429e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019813, 'epoch': 0.98, 'global_step/max_steps': '80/405', 'percentage': '19.75%', 'elapsed_time': '1h 7m 14s', 'remaining_time': '4h 33m 8s'}
{'loss': 0.76195569, 'token_acc': 0.76973895, 'grad_norm': 0.43021317, 'learning_rate': 9.33e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019856, 'epoch': 1.04, 'global_step/max_steps': '85/405', 'percentage': '20.99%', 'elapsed_time': '1h 11m 17s', 'remaining_time': '4h 28m 22s'}
{'loss': 0.7821136, 'token_acc': 0.74735605, 'grad_norm': 0.41759374, 'learning_rate': 9.224e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019888, 'epoch': 1.1, 'global_step/max_steps': '90/405', 'percentage': '22.22%', 'elapsed_time': '1h 15m 21s', 'remaining_time': '4h 23m 45s'}
Train: ?23%|██▎ ? ? ? | 92/405 [1:17:03<4:18:27, 49.54s/it][2025-03-11 20:46:52,504] [WARNING] [stage3.py:2139:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
{'loss': 0.74946299, 'token_acc': 0.76573743, 'grad_norm': 0.31465808, 'learning_rate': 9.111e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019875, 'epoch': 1.16, 'global_step/max_steps': '95/405', 'percentage': '23.46%', 'elapsed_time': '1h 19m 36s', 'remaining_time': '4h 19m 45s'}
{'loss': 0.75774355, 'token_acc': 0.76279737, 'grad_norm': 0.34568468, 'learning_rate': 8.992e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019874, 'epoch': 1.22, 'global_step/max_steps': '100/405', 'percentage': '24.69%', 'elapsed_time': '1h 23m 48s', 'remaining_time': '4h 15m 35s'}
Train: ?25%|██▍ ? ? ? | 100/405 [1:23:48<4:17:26, 50.64s/it]
{'eval_loss': 0.720375, 'eval_token_acc': 0.77822903, 'eval_runtime': 15.6988, 'eval_samples_per_second': 1.656, 'eval_steps_per_second': 0.446, 'epoch': 1.22, 'global_step/max_steps': '100/405', 'percentage': '24.69%', 'elapsed_time': '1h 24m 3s', 'remaining_time': '4h 16m 23s'}
Val: 100%|██████████| 7/7 [00:13<00:00, ?1.86s/it]50.64s/it]
[INFO:swift] Saving model checkpoint to /data/qwq32b_sft_lora/output/v9-20250311-192834/checkpoint-100
**********************************************
{'loss': 0.72672591, 'token_acc': 0.76866752, 'grad_norm': 0.64908534, 'learning_rate': 8.865e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019801, 'epoch': 1.28, 'global_step/max_steps': '105/405', 'percentage': '25.93%', 'elapsed_time': '1h 28m 19s', 'remaining_time': '4h 12m 20s'}
{'loss': 0.72024732, 'token_acc': 0.76941662, 'grad_norm': 0.36116413, 'learning_rate': 8.732e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019815, 'epoch': 1.34, 'global_step/max_steps': '110/405', 'percentage': '27.16%', 'elapsed_time': '1h 32m 27s', 'remaining_time': '4h 7m 58s'}
{'loss': 0.68267331, 'token_acc': 0.7761134, 'grad_norm': 0.38293342, 'learning_rate': 8.593e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019806, 'epoch': 1.4, 'global_step/max_steps': '115/405', 'percentage': '28.40%', 'elapsed_time': '1h 36m 42s', 'remaining_time': '4h 3m 52s'}
{'loss': 0.71170344, 'token_acc': 0.78053525, 'grad_norm': 0.3713337, 'learning_rate': 8.448e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019831, 'epoch': 1.46, 'global_step/max_steps': '120/405', 'percentage': '29.63%', 'elapsed_time': '1h 40m 47s', 'remaining_time': '3h 59m 22s'}
{'loss': 0.70673256, 'token_acc': 0.77159011, 'grad_norm': 0.36822507, 'learning_rate': 8.297e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019845, 'epoch': 1.53, 'global_step/max_steps': '125/405', 'percentage': '30.86%', 'elapsed_time': '1h 44m 55s', 'remaining_time': '3h 55m 1s'}
{'loss': 0.67356033, 'token_acc': 0.7921583, 'grad_norm': 0.4612934, 'learning_rate': 8.14e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019851, 'epoch': 1.59, 'global_step/max_steps': '130/405', 'percentage': '32.10%', 'elapsed_time': '1h 49m 5s', 'remaining_time': '3h 50m 45s'}
Train: ?33%|███▎ ? ? ?| 132/405 [1:50:48<3:50:21, 50.63s/it][2025-03-11 21:20:37,710] [WARNING] [stage3.py:2139:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
{'loss': 0.68124514, 'token_acc': 0.78771819, 'grad_norm': 0.46047566, 'learning_rate': 7.978e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019853, 'epoch': 1.65, 'global_step/max_steps': '135/405', 'percentage': '33.33%', 'elapsed_time': '1h 53m 16s', 'remaining_time': '3h 46m 32s'}
{'loss': 0.67308445, 'token_acc': 0.78043745, 'grad_norm': 0.46205863, 'learning_rate': 7.812e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019881, 'epoch': 1.71, 'global_step/max_steps': '140/405', 'percentage': '34.57%', 'elapsed_time': '1h 57m 18s', 'remaining_time': '3h 42m 2s'}
{'loss': 0.65709753, 'token_acc': 0.794716, 'grad_norm': 0.46728156, 'learning_rate': 7.64e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019893, 'epoch': 1.77, 'global_step/max_steps': '145/405', 'percentage': '35.80%', 'elapsed_time': '2h 1m 25s', 'remaining_time': '3h 37m 43s'}
{'loss': 0.66156731, 'token_acc': 0.78602904, 'grad_norm': 0.45510392, 'learning_rate': 7.464e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019899, 'epoch': 1.83, 'global_step/max_steps': '150/405', 'percentage': '37.04%', 'elapsed_time': '2h 5m 34s', 'remaining_time': '3h 33m 28s'}
Train: ?37%|███▋ ? ? ?| 150/405 [2:05:34<3:31:14, 49.70s/it]
{'eval_loss': 0.65251857, 'eval_token_acc': 0.79853547, 'eval_runtime': 15.6574, 'eval_samples_per_second': 1.661, 'eval_steps_per_second': 0.447, 'epoch': 1.83, 'global_step/max_steps': '150/405', 'percentage': '37.04%', 'elapsed_time': '2h 5m 50s', 'remaining_time': '3h 33m 55s'}
Val: 100%|██████████| 7/7 [00:12<00:00, ?1.86s/it]49.70s/it]
[INFO:swift] Saving model checkpoint to /data/qwq32b_sft_lora/output/v9-20250311-192834/checkpoint-150
**********************************************
{'loss': 0.65750132, 'token_acc': 0.78596818, 'grad_norm': 0.47214887, 'learning_rate': 7.285e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019847, 'epoch': 1.89, 'global_step/max_steps': '155/405', 'percentage': '38.27%', 'elapsed_time': '2h 10m 6s', 'remaining_time': '3h 29m 50s'}
{'loss': 0.63944697, 'token_acc': 0.80483245, 'grad_norm': 0.49222756, 'learning_rate': 7.101e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019853, 'epoch': 1.95, 'global_step/max_steps': '160/405', 'percentage': '39.51%', 'elapsed_time': '2h 14m 15s', 'remaining_time': '3h 25m 35s'}
{'loss': 0.63674178, 'token_acc': 0.80768833, 'grad_norm': 0.59897131, 'learning_rate': 6.913e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019895, 'epoch': 2.01, 'global_step/max_steps': '165/405', 'percentage': '40.74%', 'elapsed_time': '2h 18m 10s', 'remaining_time': '3h 20m 58s'}
{'loss': 0.64350748, 'token_acc': 0.80203466, 'grad_norm': 0.51221188, 'learning_rate': 6.723e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019886, 'epoch': 2.07, 'global_step/max_steps': '170/405', 'percentage': '41.98%', 'elapsed_time': '2h 22m 25s', 'remaining_time': '3h 16m 52s'}
{'loss': 0.59812784, 'token_acc': 0.80184307, 'grad_norm': 0.52895864, 'learning_rate': 6.53e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019905, 'epoch': 2.13, 'global_step/max_steps': '175/405', 'percentage': '43.21%', 'elapsed_time': '2h 26m 27s', 'remaining_time': '3h 12m 29s'}
{'loss': 0.60168495, 'token_acc': 0.80204451, 'grad_norm': 0.54771068, 'learning_rate': 6.334e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019928, 'epoch': 2.2, 'global_step/max_steps': '180/405', 'percentage': '44.44%', 'elapsed_time': '2h 30m 28s', 'remaining_time': '3h 8m 6s'}
Train: ?44%|████▍ ? ? | 180/405 [2:30:28<2:59:27, 47.85s/it][2025-03-11 22:00:31,985] [WARNING] [stage3.py:2139:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
{'loss': 0.59545937, 'token_acc': 0.80456827, 'grad_norm': 0.57579227, 'learning_rate': 6.135e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019937, 'epoch': 2.26, 'global_step/max_steps': '185/405', 'percentage': '45.68%', 'elapsed_time': '2h 34m 35s', 'remaining_time': '3h 3m 50s'}
{'loss': 0.59948916, 'token_acc': 0.80121831, 'grad_norm': 0.53543298, 'learning_rate': 5.935e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019937, 'epoch': 2.32, 'global_step/max_steps': '190/405', 'percentage': '46.91%', 'elapsed_time': '2h 38m 46s', 'remaining_time': '2h 59m 39s'}
{'loss': 0.59326115, 'token_acc': 0.79956183, 'grad_norm': 0.55039623, 'learning_rate': 5.734e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.01994, 'epoch': 2.38, 'global_step/max_steps': '195/405', 'percentage': '48.15%', 'elapsed_time': '2h 42m 55s', 'remaining_time': '2h 55m 27s'}
{'loss': 0.58592167, 'token_acc': 0.80714245, 'grad_norm': 0.69052059, 'learning_rate': 5.531e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019935, 'epoch': 2.44, 'global_step/max_steps': '200/405', 'percentage': '49.38%', 'elapsed_time': '2h 47m 9s', 'remaining_time': '2h 51m 19s'}
Train: ?49%|████▉ ? ? | 200/405 [2:47:09<2:49:56, 49.74s/it]
{'eval_loss': 0.56886792, 'eval_token_acc': 0.82167251, 'eval_runtime': 15.7067, 'eval_samples_per_second': 1.655, 'eval_steps_per_second': 0.446, 'epoch': 2.44, 'global_step/max_steps': '200/405', 'percentage': '49.38%', 'elapsed_time': '2h 47m 24s', 'remaining_time': '2h 51m 36s'}
Val: 100%|██████████| 7/7 [00:13<00:00, ?1.86s/it]49.74s/it]
[INFO:swift] Saving model checkpoint to /data/qwq32b_sft_lora/output/v9-20250311-192834/checkpoint-200
*****************************************
Train: ?50%|████▉ ? ? | 202/405 [2:49:15<3:08:49, 55.81s/it][2025-03-11 22:19:18,274] [WARNING] [stage3.py:2139:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
{'loss': 0.54969444, 'token_acc': 0.82021995, 'grad_norm': 0.51556883, 'learning_rate': 5.327e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019871, 'epoch': 2.5, 'global_step/max_steps': '205/405', 'percentage': '50.62%', 'elapsed_time': '2h 51m 52s', 'remaining_time': '2h 47m 41s'}
{'loss': 0.52501326, 'token_acc': 0.81536282, 'grad_norm': 0.54576287, 'learning_rate': 5.123e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019883, 'epoch': 2.56, 'global_step/max_steps': '210/405', 'percentage': '51.85%', 'elapsed_time': '2h 55m 58s', 'remaining_time': '2h 43m 24s'}
{'loss': 0.5639473, 'token_acc': 0.8235682, 'grad_norm': 0.51644597, 'learning_rate': 4.918e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019872, 'epoch': 2.62, 'global_step/max_steps': '215/405', 'percentage': '53.09%', 'elapsed_time': '3h 0m 15s', 'remaining_time': '2h 39m 17s'}
Train: ?53%|█████▎ ? ?| 215/405 [3:00:15<2:40:55, 50.82s/it][2025-03-11 22:30:17,273] [WARNING] [stage3.py:2139:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
{'loss': 0.54539089, 'token_acc': 0.82929161, 'grad_norm': 0.5427966, 'learning_rate': 4.714e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019887, 'epoch': 2.69, 'global_step/max_steps': '220/405', 'percentage': '54.32%', 'elapsed_time': '3h 4m 18s', 'remaining_time': '2h 34m 59s'}
{'loss': 0.54721932, 'token_acc': 0.82292752, 'grad_norm': 0.58632606, 'learning_rate': 4.51e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.01989, 'epoch': 2.75, 'global_step/max_steps': '225/405', 'percentage': '55.56%', 'elapsed_time': '3h 8m 28s', 'remaining_time': '2h 30m 46s'}
{'loss': 0.51745701, 'token_acc': 0.82614152, 'grad_norm': 0.51928985, 'learning_rate': 4.307e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019892, 'epoch': 2.81, 'global_step/max_steps': '230/405', 'percentage': '56.79%', 'elapsed_time': '3h 12m 38s', 'remaining_time': '2h 26m 34s'}
{'loss': 0.54157047, 'token_acc': 0.81710944, 'grad_norm': 0.71657186, 'learning_rate': 4.105e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019899, 'epoch': 2.87, 'global_step/max_steps': '235/405', 'percentage': '58.02%', 'elapsed_time': '3h 16m 46s', 'remaining_time': '2h 22m 20s'}
{'loss': 0.54548702, 'token_acc': 0.81284619, 'grad_norm': 0.50686509, 'learning_rate': 3.904e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019907, 'epoch': 2.93, 'global_step/max_steps': '240/405', 'percentage': '59.26%', 'elapsed_time': '3h 20m 52s', 'remaining_time': '2h 18m 6s'}
{'loss': 0.51912632, 'token_acc': 0.83365523, 'grad_norm': 0.68279731, 'learning_rate': 3.706e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019907, 'epoch': 2.99, 'global_step/max_steps': '245/405', 'percentage': '60.49%', 'elapsed_time': '3h 25m 3s', 'remaining_time': '2h 13m 55s'}
{'loss': 0.52836185, 'token_acc': 0.83409461, 'grad_norm': 0.55463023, 'learning_rate': 3.509e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019923, 'epoch': 3.05, 'global_step/max_steps': '250/405', 'percentage': '61.73%', 'elapsed_time': '3h 29m 4s', 'remaining_time': '2h 9m 37s'}
Train: ?62%|██████▏ ? | 250/405 [3:29:04<2:07:39, 49.41s/it]
{'eval_loss': 0.52870411, 'eval_token_acc': 0.83231801, 'eval_runtime': 15.7131, 'eval_samples_per_second': 1.655, 'eval_steps_per_second': 0.445, 'epoch': 3.05, 'global_step/max_steps': '250/405', 'percentage': '61.73%', 'elapsed_time': '3h 29m 20s', 'remaining_time': '2h 9m 47s'}
Val: 100%|██████████| 7/7 [00:13<00:00, ?1.86s/it]49.41s/it]
[INFO:swift] Saving model checkpoint to /data/qwq32b_sft_lora/output/v9-20250311-192834/checkpoint-250
************************************************
{'loss': 0.51691947, 'token_acc': 0.82422604, 'grad_norm': 0.53855505, 'learning_rate': 3.316e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019896, 'epoch': 3.11, 'global_step/max_steps': '255/405', 'percentage': '62.96%', 'elapsed_time': '3h 33m 33s', 'remaining_time': '2h 5m 37s'}
Train: ?63%|██████▎ ? | 257/405 [3:35:20<2:08:40, 52.17s/it][2025-03-11 23:05:38,683] [WARNING] [stage3.py:2139:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
{'loss': 0.50732822, 'token_acc': 0.83722172, 'grad_norm': 0.7386699, 'learning_rate': 3.124e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019885, 'epoch': 3.17, 'global_step/max_steps': '260/405', 'percentage': '64.20%', 'elapsed_time': '3h 37m 51s', 'remaining_time': '2h 1m 30s'}
{'loss': 0.50304022, 'token_acc': 0.84518402, 'grad_norm': 0.5332581, 'learning_rate': 2.936e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019889, 'epoch': 3.23, 'global_step/max_steps': '265/405', 'percentage': '65.43%', 'elapsed_time': '3h 42m 0s', 'remaining_time': '1h 57m 17s'}
{'loss': 0.5034606, 'token_acc': 0.81697432, 'grad_norm': 0.67853403, 'learning_rate': 2.752e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019886, 'epoch': 3.29, 'global_step/max_steps': '270/405', 'percentage': '66.67%', 'elapsed_time': '3h 46m 13s', 'remaining_time': '1h 53m 6s'}
{'loss': 0.5183465, 'token_acc': 0.83563731, 'grad_norm': 0.6259943, 'learning_rate': 2.571e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019885, 'epoch': 3.35, 'global_step/max_steps': '275/405', 'percentage': '67.90%', 'elapsed_time': '3h 50m 25s', 'remaining_time': '1h 48m 55s'}
{'loss': 0.51731062, 'token_acc': 0.83534514, 'grad_norm': 0.61180005, 'learning_rate': 2.394e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019881, 'epoch': 3.42, 'global_step/max_steps': '280/405', 'percentage': '69.14%', 'elapsed_time': '3h 54m 40s', 'remaining_time': '1h 44m 45s'}
{'loss': 0.48814211, 'token_acc': 0.82283914, 'grad_norm': 0.57190785, 'learning_rate': 2.222e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019883, 'epoch': 3.48, 'global_step/max_steps': '285/405', 'percentage': '70.37%', 'elapsed_time': '3h 58m 50s', 'remaining_time': '1h 40m 33s'}
{'loss': 0.4921607, 'token_acc': 0.82588464, 'grad_norm': 0.52349298, 'learning_rate': 2.054e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019888, 'epoch': 3.54, 'global_step/max_steps': '290/405', 'percentage': '71.60%', 'elapsed_time': '4h 2m 58s', 'remaining_time': '1h 36m 20s'}
{'loss': 0.46711798, 'token_acc': 0.85013139, 'grad_norm': 0.6346718, 'learning_rate': 1.892e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019889, 'epoch': 3.6, 'global_step/max_steps': '295/405', 'percentage': '72.84%', 'elapsed_time': '4h 7m 8s', 'remaining_time': '1h 32m 9s'}
{'loss': 0.48140554, 'token_acc': 0.83738891, 'grad_norm': 0.62962168, 'learning_rate': 1.734e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019894, 'epoch': 3.66, 'global_step/max_steps': '300/405', 'percentage': '74.07%', 'elapsed_time': '4h 11m 16s', 'remaining_time': '1h 27m 56s'}
Train: ?74%|███████▍ ?| 300/405 [4:11:16<1:25:56, 49.11s/it]
{'eval_loss': 0.50453913, 'eval_token_acc': 0.84007138, 'eval_runtime': 15.7353, 'eval_samples_per_second': 1.652, 'eval_steps_per_second': 0.445, 'epoch': 3.66, 'global_step/max_steps': '300/405', 'percentage': '74.07%', 'elapsed_time': '4h 11m 32s', 'remaining_time': '1h 28m 2s'}
Val: 100%|██████████| 7/7 [00:13<00:00, ?1.87s/it]49.11s/it]
[INFO:swift] Saving model checkpoint to /data/qwq32b_sft_lora/output/v9-20250311-192834/checkpoint-300
********************************************
{'loss': 0.48724985, 'token_acc': 0.83901735, 'grad_norm': 0.59317845, 'learning_rate': 1.582e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019877, 'epoch': 3.72, 'global_step/max_steps': '305/405', 'percentage': '75.31%', 'elapsed_time': '4h 15m 40s', 'remaining_time': '1h 23m 49s'}
{'loss': 0.47789598, 'token_acc': 0.85550931, 'grad_norm': 0.67272985, 'learning_rate': 1.436e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019881, 'epoch': 3.78, 'global_step/max_steps': '310/405', 'percentage': '76.54%', 'elapsed_time': '4h 19m 49s', 'remaining_time': '1h 19m 37s'}
{'loss': 0.49318271, 'token_acc': 0.81770707, 'grad_norm': 0.64257801, 'learning_rate': 1.295e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.01989, 'epoch': 3.84, 'global_step/max_steps': '315/405', 'percentage': '77.78%', 'elapsed_time': '4h 23m 53s', 'remaining_time': '1h 15m 23s'}
Train: ?78%|███████▊ ?| 316/405 [4:24:43<1:14:10, 50.00s/it][2025-03-11 23:54:59,684] [WARNING] [stage3.py:2139:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
{'loss': 0.46136761, 'token_acc': 0.8458454, 'grad_norm': 0.62953055, 'learning_rate': 1.161e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.01989, 'epoch': 3.91, 'global_step/max_steps': '320/405', 'percentage': '79.01%', 'elapsed_time': '4h 28m 5s', 'remaining_time': '1h 11m 12s'}
{'loss': 0.4856822, 'token_acc': 0.83825816, 'grad_norm': 0.64470125, 'learning_rate': 1.033e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019893, 'epoch': 3.97, 'global_step/max_steps': '325/405', 'percentage': '80.25%', 'elapsed_time': '4h 32m 13s', 'remaining_time': '1h 7m 0s'}
{'loss': 0.46592345, 'token_acc': 0.84528571, 'grad_norm': 0.65905805, 'learning_rate': 9.12e-06, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019917, 'epoch': 4.02, 'global_step/max_steps': '330/405', 'percentage': '81.48%', 'elapsed_time': '4h 36m 5s', 'remaining_time': '1h 2m 44s'}
{'loss': 0.48042569, 'token_acc': 0.85237186, 'grad_norm': 0.61635281, 'learning_rate': 7.98e-06, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019923, 'epoch': 4.09, 'global_step/max_steps': '335/405', 'percentage': '82.72%', 'elapsed_time': '4h 40m 11s', 'remaining_time': '58m 32s'}
{'loss': 0.45569935, 'token_acc': 0.83371485, 'grad_norm': 0.64527875, 'learning_rate': 6.9e-06, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.01993, 'epoch': 4.15, 'global_step/max_steps': '340/405', 'percentage': '83.95%', 'elapsed_time': '4h 44m 16s', 'remaining_time': '54m 20s'}
{'loss': 0.46417255, 'token_acc': 0.84960884, 'grad_norm': 0.67313113, 'learning_rate': 5.9e-06, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019931, 'epoch': 4.21, 'global_step/max_steps': '345/405', 'percentage': '85.19%', 'elapsed_time': '4h 48m 26s', 'remaining_time': '50m 9s'}
{'loss': 0.47292795, 'token_acc': 0.85013211, 'grad_norm': 0.59537749, 'learning_rate': 4.98e-06, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019936, 'epoch': 4.27, 'global_step/max_steps': '350/405', 'percentage': '86.42%', 'elapsed_time': '4h 52m 32s', 'remaining_time': '45m 58s'}
Train: ?86%|████████▋ | 350/405 [4:52:32<45:19, 49.44s/it]
{'eval_loss': 0.490695, 'eval_token_acc': 0.84296351, 'eval_runtime': 15.6909, 'eval_samples_per_second': 1.657, 'eval_steps_per_second': 0.446, 'epoch': 4.27, 'global_step/max_steps': '350/405', 'percentage': '86.42%', 'elapsed_time': '4h 52m 48s', 'remaining_time': '46m 0s'}
Val: 100%|██████████| 7/7 [00:13<00:00, ?1.86s/it].44s/it]
[INFO:swift] Saving model checkpoint to /data/qwq32b_sft_lora/output/v9-20250311-192834/checkpoint-350
*****************************************
Train: ?87%|████████▋ | 352/405 [4:54:35<48:02, 54.39s/it][2025-03-12 00:25:04,775] [WARNING] [stage3.py:2139:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
{'loss': 0.46881456, 'token_acc': 0.83740075, 'grad_norm': 0.59338625, 'learning_rate': 4.13e-06, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.01991, 'epoch': 4.33, 'global_step/max_steps': '355/405', 'percentage': '87.65%', 'elapsed_time': '4h 57m 6s', 'remaining_time': '41m 50s'}
{'eval_loss': 0.48915866, 'eval_token_acc': 0.84357886, 'eval_runtime': 15.8494, 'eval_samples_per_second': 1.64, 'eval_steps_per_second': 0.442, 'epoch': 4.88, 'global_step/max_steps': '400/405', 'percentage': '98.77%', 'elapsed_time': '5h 34m 38s', 'remaining_time': '4m 10s'}
Val: 100%|██████████| 7/7 [00:13<00:00, ?1.88s/it].31s/it]
[INFO:swift] Saving model checkpoint to /data/qwq32b_sft_lora/output/v9-20250311-192834/checkpoint-400
{'loss': 0.49458728, 'token_acc': 0.83115697, 'grad_norm': 0.65133526, 'learning_rate': 0.0, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.01992, 'epoch': 4.94, 'global_step/max_steps': '405/405', 'percentage': '100.00%', 'elapsed_time': '5h 38m 48s', 'remaining_time': '0s'}
Train: 100%|██████████| 405/405 [5:38:48<00:00, 50.98s/it]
{'eval_loss': 0.4893617, 'eval_token_acc': 0.84308658, 'eval_runtime': 15.9508, 'eval_samples_per_second': 1.63, 'eval_steps_per_second': 0.439, 'epoch': 4.94, 'global_step/max_steps': '405/405', 'percentage': '100.00%', 'elapsed_time': '5h 39m 4s', 'remaining_time': '0s'}
Val: 100%|██████████| 7/7 [00:13<00:00, ?1.90s/it].98s/it]
[INFO:swift] Saving model checkpoint to /data/qwq32b_sft_lora/output/v9-20250311-192834/checkpoint-405
{'train_runtime': 20349.5179, 'train_samples_per_second': 0.642, 'train_steps_per_second': 0.02, 'train_loss': 0.63218051, 'epoch': 4.94, 'global_step/max_steps': '405/405', 'percentage': '100.00%', 'elapsed_time': '5h 39m 9s', 'remaining_time': '0s'}
Train: 100%|██████████| 405/405 [5:39:09<00:00, 50.25s/it]
[INFO:swift] last_model_checkpoint: /data/qwq32b_sft_lora/output/v9-20250311-192834/checkpoint-405
[INFO:swift] best_model_checkpoint: /data/qwq32b_sft_lora/output/v9-20250311-192834/checkpoint-400

4. 模型部署及推理腳本

我們采用多卡部署，并且自定義服務端口：

RAY_memory_monitor_refresh_ms=0
CUDA_VISIBLE_DEVICES=0,1 swift deploy \--ckpt_dir /data/qwq32b_sft_lora/output/v9-20250311-192834/checkpoint-400 \--infer_backend vllm \--max_new_tokens 2048 \--tensor_parallel_size 2 \--port 8011

推理腳本：

from openai import OpenAIopenai_api_key = "EMPTY"
openai_api_base = "http://ip:8011/v1"client = OpenAI(api_key=openai_api_key,base_url=openai_api_base,
)chat_response = client.chat.completions.create(model="QwQ-32B",messages=[{"role": "system", "content": "你是一款客戶機器人，幫助客戶解決問題"}, {"role": "user", "content": "問一下這款手機現在附帶什么配件"}, {"role": "assistant", "content": "附件內容：鋰離子電池組 NP-FW50，電源適配器AC-UUD12 ，Micro USB 連接線，肩帶，鏡頭蓋，熱靴蓋，遮光罩，使用說明書，保修卡"}, {"role": "user", "content": "售后和質保是什么標準"}],temperature=0.7,top_p=0.8,max_tokens=2048,extra_body={"repetition_penalty": 1.05,},
)
print("Chat response:", chat_response)