【大模型實戰】利用ms-swift微調框架對QwQ-32B推理模型進行微調

1. 背景介紹

????????之前我們在《大模型訓練/微調的一些經驗分享》、《利用DeepSeek-R1數據微調蒸餾ChatGLM32B讓大模型具備思考能力》中做了相關模型微調的介紹。目前在基座大模型能力還沒有達到足夠牛的情況下,大模型微調在商業化、垂直領域應用依然是不可或缺,即使是使用DeepSeek-R1、QwQ-32B也難以保證商業應用的要求。

? ? ? ? 今天我們將介紹利用ms-swift框架對近期開源的阿里QwQ-32B推理模型進行lora微調的實驗。

2. 模型微調方案

????????ms-swift是modelscope提供的大模型與多模態大模型微調部署框架,支持語言大模型與多模態大模型的訓練(預訓練、微調、人類對齊)、推理、評測、量化與部署。其中語言大模型包括:Qwen2.5、GLM4、Llama3.3、DeepSeek-R1等模型,多模態大模型包括:Qwen2.5-VL、Qwen2-Audio、Llama3.2-Vision、Llava、GLM4v、Xcomposer2.5、DeepSeek-VL2等模型。

????????QwQ-32B模型參數大小約為62G,如果選擇全參微調,保守估計需要20倍的顯存資源,也就是需要1240G顯存,即使8卡的A800集群,大概在640G顯存,也不足以做微調,因此為了落地方便,選擇Lora微調方式。最終我們利用4卡-A800完成Lora的訓練。

? ? ? ?題外話:ms-swift的文檔不全,這點做的不是特別好。

3. 模型微調實驗

3.1 微調數據格式

????????格式如下所示,但我們微調使用的是帶推理過程的數據,這個需要注意一下。關于think的語料可以參考?《利用DeepSeek-R1數據微調蒸餾ChatGLM32B讓大模型具備思考能力》。

{"messages": [{"role": "system", "content": "你是一個樂于助人且知識淵博的AI助手。"},{"role": "user", "content": "請解釋AlphaGo的基本原理。"}, {"role": "assistant", "content": "AlphaGo結合深度神經網絡與樹搜索算法,通過棋局數據進行訓練,逐步提升棋力,能夠判斷圍棋局勢并選擇最佳著法。"}]}{"messages": [{"role": "system", "content": "你是一個樂于助人且知識淵博的AI助手。"},{"role": "user", "content": "列舉一些健康飲食的重要原則。"}, {"role": "assistant", "content": "健康飲食的重要原則包括:均衡飲食,合理攝入各類營養素;多吃新鮮蔬果,適量飲水;少吃糖、油、鹽,忌吃過量快餐;養成規律生活作息習慣,適量運動。"}]} {"messages": [{"role": "system", "content": "你是一個樂于助人且知識淵博的AI助手。"},{"role": "user", "content": "給出一些改善睡眠質量的建議。"}, {"role": "assistant", "content": "改善睡眠質量的建議包括:保證睡眠時間,避免睡眠不足;睡前放松,避免刺激;養成規律作息時間,不要頻繁熬夜;適量運動,但不要運動過度;睡前可以喝一杯熱牛奶等溫和飲料。"}]}

3.2 訓練腳本

NPROC_PER_NODE=4 \
CUDA_VISIBLE_DEVICES=0,1,2,3 \
swift sft \--model /data/QwQ-32B \--train_type lora \--dataset '/data/qwq32b_sft_lora/rl-v0312.jsonl' \--torch_dtype bfloat16 \--num_train_epochs 5 \--per_device_train_batch_size 1 \--per_device_eval_batch_size 1 \--learning_rate 1e-4 \--lora_rank 8 \--lora_alpha 32 \--target_modules all-linear \--gradient_accumulation_steps 8 \--eval_steps 50 \--save_steps 50 \--save_total_limit 5 \--logging_steps 5 \--max_length 8192 \--output_dir /data/qwq32b_sft_lora/output \--warmup_ratio 0.05 \--dataloader_num_workers 4 \--model_author swift \--model_name swift-robot \--deepspeed zero3

3.3 訓練日志

????????從訓練日志可以清晰看到,整個微調階段的loss逐步收斂。另外框架會輸出最佳的模型checkpoint模型參數。

[2025-03-11 19:28:37,083] [INFO] [config.py:734:__init__] Config mesh_device None world_size = 4
[2025-03-11 19:28:37,084] [INFO] [config.py:734:__init__] Config mesh_device None world_size = 4
[2025-03-11 19:28:37,092] [INFO] [config.py:734:__init__] Config mesh_device None world_size = 4
[2025-03-11 19:28:37,401] [INFO] [partition_parameters.py:348:__exit__] finished initializing model - num_params = 771, num_elems = 32.76B
Loading checkpoint shards: 100%|██████████| 14/14 [00:16<00:00, ?1.17s/it]
Loading checkpoint shards: 100%|██████████| 14/14 [00:16<00:00, ?1.17s/it]
Loading checkpoint shards: 100%|██████████| 14/14 [00:16<00:00, ?1.17s/it]
Loading checkpoint shards: 100%|██████████| 14/14 [00:17<00:00, ?1.22s/it]
[INFO:swift] model_info: ModelInfo(model_type='qwq', model_dir='/data/QwQ-32B', torch_dtype=torch.bfloat16, max_model_len=131072, quant_method=None, quant_bits=None, rope_scaling=None, config=Qwen2Config {
? "_name_or_path": "/data/QwQ-32B",
? "architectures": [
? ? "Qwen2ForCausalLM"
? ],
? "attention_dropout": 0.0,
? "bos_token_id": 151643,
? "eos_token_id": 151645,
? "hidden_act": "silu",
? "hidden_size": 5120,
? "initializer_range": 0.02,
? "intermediate_size": 27648,
? "max_position_embeddings": 131072,
? "max_window_layers": 64,
? "model_type": "qwen2",
? "num_attention_heads": 40,
? "num_hidden_layers": 64,
? "num_key_value_heads": 8,
? "rms_norm_eps": 1e-05,
? "rope_scaling": null,
? "rope_theta": 1000000.0,
? "sliding_window": 32768,
? "tie_word_embeddings": false,
? "torch_dtype": "bfloat16",
? "transformers_version": "4.49.0",
? "use_cache": true,
? "use_sliding_window": false,
? "vocab_size": 152064
}
, task_type='causal_lm', num_labels=None)
[INFO:swift] model.generation_config: GenerationConfig {
? "bos_token_id": 151643,
? "eos_token_id": [
? ? 151645,
? ? 151643
? ],
? "max_new_tokens": 64,
? "pad_token_id": 151643
}

[INFO:swift] default_system: None
[INFO:swift] The TrainArguments will be saved in: /data/qwq32b_sft_lora/output/v9-20250311-192834/args.json
[INFO:swift] Start time of running main: 2025-03-11 19:28:54.707260
Map: 100%|██████████| 2645/2645 [00:00<00:00, 8426.21 examples/s]?
Map: 100%|██████████| 2645/2645 [00:00<00:00, 7697.28 examples/s]?
Map: 100%|██████████| 2645/2645 [00:00<00:00, 6463.52 examples/s]?
Map: ? 0%| ? ? ? ? ?| 0/2619 [00:00<?, ? examples/s][INFO:swift] create tmp_dir: /.cache/modelscope/hub/tmp/hf_datasets-i15lb3_o
Map: 100%|██████████| 2645/2645 [00:00<00:00, 9980.89 examples/s]?
[INFO:swift] train_dataset: Dataset({
? ? features: ['messages'],
? ? num_rows: 2619
})
[INFO:swift] val_dataset: Dataset({
? ? features: ['messages'],
? ? num_rows: 26
? ??
? ??
[INFO:swift] [INPUT] <|im_start|>system
## 角色
你是一名AI客服,你很專業、友善、禮貌,擅長為客戶提供導購服務。

## 目標
結合歷史對話和用戶當前問題判斷對話場景(如:商品推薦、商品咨詢、拒答范圍、閑聊),針對不同場景作親和的“原始回復”,然后根據<紅線>內容,對”原始回復“作詳細的安全審查改寫,優化后直接輸出。

## 紅線
- 不提供任何無關的信息
- 不提供任何電話號碼和網站鏈接
- 不提供任何知識庫中不存在的商品信息

## 限制
- 避免談論競品、具體價格或優惠、商品鏈接、商品維修、訂單、售后、物流。
- 注意區分商品型號,若商品型號指代不明,在回復中需主動向用戶確認。
- 涉及商品信息時,你只能使用<知識庫>中的內容回復。
- 當<知識庫>為空或無法回答問題時,需要禮貌的婉拒回答。
- 在婉拒回答時,必須帶這句話“建議您咨詢官方商城的人工客服或訪問官網”。
- 不要主動解釋專業詞匯。
- 不要暴露任何思考過程,直接回答。
- 回答中不要暴露“原始回復”,直接輸出安全審查改寫后的內容。

## 建議
- 介紹商品時推薦使用markdown格式
- 善用追問去引導客戶把訴求表達清楚,但歷史對話已經表達清楚的訴求不要在追問。
<|im_end|>
<|im_start|>user
這款KD-85X85K和85x90L有什么區別?<|im_end|>
<|im_start|>assistant
X85K采用X1芯片,而X90L搭載XR認知芯片,能在畫面整體優化、焦點主體強化和音畫同步方面更出色。X90L通過多維度數據分析匹配畫質數據庫,重點優化用戶關注區域。如果您追求更好畫質建議選X90L,預算有限的話X85K也不錯。<|im_end|>
<|im_start|>user
尺寸一樣大嗎?<|im_end|>
<|im_start|>assistant
<think>
好的,用戶問的是“尺寸一樣大嗎?”,我需要先回顧歷史對話。之前的對話中,用戶詢問了KD-85X85K和85X90L的區別,客服提到兩者的芯片不同以及畫質優化方面的差異。現在用戶進一步詢問尺寸是否相同。

首先,檢查知識庫中的商品信息。KD-85X85K的型號中“85”代表85英寸,而XR-85X90L同樣有“85”在型號中,所以尺寸應該都是85英寸。確認知識庫中兩者的描述,確實都是85寸的電視。因此,用戶的問題應該回答尺寸相同,都是85英寸。同時,根據紅線規則,不能提供無關信息,所以不需要涉及其他方面,比如價格或功能差異,除非用戶進一步詢問。另外,確保回復中沒有鏈接或電話號碼,符合安全審查的要求。因此,直接確認兩者尺寸相同,并保持禮貌和簡潔。
</think>

兩款電視的尺寸相同,都是85英寸大屏設計哦~ 如果您還有其他疑問,歡迎隨時告訴我。建議您咨詢官方商城的人工客服或訪問官網了解更多信息。<|im_end|>

[INFO:swift] lora_config: LoraConfig(task_type='CAUSAL_LM', peft_type=<PeftType.LORA: 'LORA'>, auto_mapping=None, base_model_name_or_path='/data/QwQ-32B', revision=None, inference_mode=False, r=8, target_modules={'k_proj', 'down_proj', 'q_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj'}, exclude_modules=None, lora_alpha=32, lora_dropout=0.05, fan_in_fan_out=False, bias='none', use_rslora=False, modules_to_save=[], init_lora_weights=True, layers_to_transform=None, layers_pattern=None, rank_pattern={}, alpha_pattern={}, megatron_config=None, megatron_core='megatron.core', loftq_config={}, eva_config=None, use_dora=False, layer_replication=None, runtime_config=LoraRuntimeConfig(ephemeral_gpu_offload=False), lora_bias=False, lora_dtype=None, lorap_lr_ratio=None, lorap_emb_lr=1e-06)
[INFO:swift] model: PeftModelForCausalLM(
? (base_model): LoraModel(
? ? (model): Qwen2ForCausalLM(
? ? ? (model): Qwen2Model(
? ? ? ? (embed_tokens): Embedding(152064, 5120)
? ? ? ? (layers): ModuleList(
? ? ? ? ? (0-63): 64 x Qwen2DecoderLayer(
? ? ? ? ? ? (self_attn): Qwen2Attention(
? ? ? ? ? ? ? (q_proj): lora.Linear(
? ? ? ? ? ? ? ? (base_layer): Linear(in_features=5120, out_features=5120, bias=True)
? ? ? ? ? ? ? ? (lora_dropout): ModuleDict(
? ? ? ? ? ? ? ? ? (default): Dropout(p=0.05, inplace=False)
? ? ? ? ? ? ? ? )
? ? ? ? ? ? ? ? (lora_A): ModuleDict(
? ? ? ? ? ? ? ? ? (default): Linear(in_features=5120, out_features=8, bias=False)
? ? ? ? ? ? ? ? )
? ? ? ? ? ? ? ? (lora_B): ModuleDict(
? ? ? ? ? ? ? ? ? (default): Linear(in_features=8, out_features=5120, bias=False)
? ? ? ? ? ? ? ? )
? ? ? ? ? ? ? ? (lora_embedding_A): ParameterDict()
? ? ? ? ? ? ? ? (lora_embedding_B): ParameterDict()
? ? ? ? ? ? ? ? (lora_magnitude_vector): ModuleDict()
? ? ? ? ? ? ? )
? ? ? ? ? ? ? (k_proj): lora.Linear(
? ? ? ? ? ? ? ? (base_layer): Linear(in_features=5120, out_features=1024, bias=True)
? ? ? ? ? ? ? ? (lora_dropout): ModuleDict(
? ? ? ? ? ? ? ? ? (default): Dropout(p=0.05, inplace=False)
? ? ? ? ? ? ? ? )
? ? ? ? ? ? ? ? (lora_A): ModuleDict(
? ? ? ? ? ? ? ? ? (default): Linear(in_features=5120, out_features=8, bias=False)
? ? ? ? ? ? ? ? )
? ? ? ? ? ? ? ? (lora_B): ModuleDict(
? ? ? ? ? ? ? ? ? (default): Linear(in_features=8, out_features=1024, bias=False)
? ? ? ? ? ? ? ? )
? ? ? ? ? ? ? ? (lora_embedding_A): ParameterDict()
? ? ? ? ? ? ? ? (lora_embedding_B): ParameterDict()
? ? ? ? ? ? ? ? (lora_magnitude_vector): ModuleDict()
? ? ? ? ? ? ? )
? ? ? ? ? ? ? (v_proj): lora.Linear(
? ? ? ? ? ? ? ? (base_layer): Linear(in_features=5120, out_features=1024, bias=True)
? ? ? ? ? ? ? ? (lora_dropout): ModuleDict(
? ? ? ? ? ? ? ? ? (default): Dropout(p=0.05, inplace=False)
? ? ? ? ? ? ? ? )
? ? ? ? ? ? ? ? (lora_A): ModuleDict(
? ? ? ? ? ? ? ? ? (default): Linear(in_features=5120, out_features=8, bias=False)
? ? ? ? ? ? ? ? )
? ? ? ? ? ? ? ? (lora_B): ModuleDict(
? ? ? ? ? ? ? ? ? (default): Linear(in_features=8, out_features=1024, bias=False)
? ? ? ? ? ? ? ? )
? ? ? ? ? ? ? ? (lora_embedding_A): ParameterDict()
? ? ? ? ? ? ? ? (lora_embedding_B): ParameterDict()
? ? ? ? ? ? ? ? (lora_magnitude_vector): ModuleDict()
? ? ? ? ? ? ? )
? ? ? ? ? ? ? (o_proj): lora.Linear(
? ? ? ? ? ? ? ? (base_layer): Linear(in_features=5120, out_features=5120, bias=False)
? ? ? ? ? ? ? ? (lora_dropout): ModuleDict(
? ? ? ? ? ? ? ? ? (default): Dropout(p=0.05, inplace=False)
? ? ? ? ? ? ? ? )
? ? ? ? ? ? ? ? (lora_A): ModuleDict(
? ? ? ? ? ? ? ? ? (default): Linear(in_features=5120, out_features=8, bias=False)
? ? ? ? ? ? ? ? )
? ? ? ? ? ? ? ? (lora_B): ModuleDict(
? ? ? ? ? ? ? ? ? (default): Linear(in_features=8, out_features=5120, bias=False)
? ? ? ? ? ? ? ? )
? ? ? ? ? ? ? ? (lora_embedding_A): ParameterDict()
? ? ? ? ? ? ? ? (lora_embedding_B): ParameterDict()
? ? ? ? ? ? ? ? (lora_magnitude_vector): ModuleDict()
? ? ? ? ? ? ? )
? ? ? ? ? ? )
? ? ? ? ? ? (mlp): Qwen2MLP(
? ? ? ? ? ? ? (gate_proj): lora.Linear(
? ? ? ? ? ? ? ? (base_layer): Linear(in_features=5120, out_features=27648, bias=False)
? ? ? ? ? ? ? ? (lora_dropout): ModuleDict(
? ? ? ? ? ? ? ? ? (default): Dropout(p=0.05, inplace=False)
? ? ? ? ? ? ? ? )
? ? ? ? ? ? ? ? (lora_A): ModuleDict(
? ? ? ? ? ? ? ? ? (default): Linear(in_features=5120, out_features=8, bias=False)
? ? ? ? ? ? ? ? )
? ? ? ? ? ? ? ? (lora_B): ModuleDict(
? ? ? ? ? ? ? ? ? (default): Linear(in_features=8, out_features=27648, bias=False)
? ? ? ? ? ? ? ? )
? ? ? ? ? ? ? ? (lora_embedding_A): ParameterDict()
? ? ? ? ? ? ? ? (lora_embedding_B): ParameterDict()
? ? ? ? ? ? ? ? (lora_magnitude_vector): ModuleDict()
? ? ? ? ? ? ? )
? ? ? ? ? ? ? (up_proj): lora.Linear(
? ? ? ? ? ? ? ? (base_layer): Linear(in_features=5120, out_features=27648, bias=False)
? ? ? ? ? ? ? ? (lora_dropout): ModuleDict(
? ? ? ? ? ? ? ? ? (default): Dropout(p=0.05, inplace=False)
? ? ? ? ? ? ? ? )
? ? ? ? ? ? ? ? (lora_A): ModuleDict(
? ? ? ? ? ? ? ? ? (default): Linear(in_features=5120, out_features=8, bias=False)
? ? ? ? ? ? ? ? )
? ? ? ? ? ? ? ? (lora_B): ModuleDict(
? ? ? ? ? ? ? ? ? (default): Linear(in_features=8, out_features=27648, bias=False)
? ? ? ? ? ? ? ? )
? ? ? ? ? ? ? ? (lora_embedding_A): ParameterDict()
? ? ? ? ? ? ? ? (lora_embedding_B): ParameterDict()
? ? ? ? ? ? ? ? (lora_magnitude_vector): ModuleDict()
? ? ? ? ? ? ? )
? ? ? ? ? ? ? (down_proj): lora.Linear(
? ? ? ? ? ? ? ? (base_layer): Linear(in_features=27648, out_features=5120, bias=False)
? ? ? ? ? ? ? ? (lora_dropout): ModuleDict(
? ? ? ? ? ? ? ? ? (default): Dropout(p=0.05, inplace=False)
? ? ? ? ? ? ? ? )
? ? ? ? ? ? ? ? (lora_A): ModuleDict(
? ? ? ? ? ? ? ? ? (default): Linear(in_features=27648, out_features=8, bias=False)
? ? ? ? ? ? ? ? )
? ? ? ? ? ? ? ? (lora_B): ModuleDict(
? ? ? ? ? ? ? ? ? (default): Linear(in_features=8, out_features=5120, bias=False)
? ? ? ? ? ? ? ? )
? ? ? ? ? ? ? ? (lora_embedding_A): ParameterDict()
? ? ? ? ? ? ? ? (lora_embedding_B): ParameterDict()
? ? ? ? ? ? ? ? (lora_magnitude_vector): ModuleDict()
? ? ? ? ? ? ? )
? ? ? ? ? ? ? (act_fn): SiLU()
? ? ? ? ? ? )
? ? ? ? ? ? (input_layernorm): Qwen2RMSNorm((0,), eps=1e-05)
? ? ? ? ? ? (post_attention_layernorm): Qwen2RMSNorm((0,), eps=1e-05)
? ? ? ? ? )
? ? ? ? )
? ? ? ? (norm): Qwen2RMSNorm((0,), eps=1e-05)
? ? ? ? (rotary_emb): Qwen2RotaryEmbedding()
? ? ? )
? ? ? (lm_head): Linear(in_features=5120, out_features=152064, bias=False)
? ? )
? )
)
[INFO:swift] model_parameter_info: PeftModelForCausalLM: 32830.9852M Params (67.1089M Trainable [0.2044%]), 0.0001M Buffers.

Parameter Offload: Total persistent parameters: 25760768 in 1025 params
{'loss': 1.32348752, 'token_acc': 0.70985222, 'grad_norm': 0.80846994, 'learning_rate': 4.76e-06, 'memory(GiB)': 60.01, 'train_speed(iter/s)': 0.01743, 'epoch': 0.01, 'global_step/max_steps': '1/405', 'percentage': '0.25%', 'elapsed_time': '53s', 'remaining_time': '6h 2m 10s'}
Train: ? 0%| ? ? ? ? ?| 2/405 [01:46<5:56:12, 53.03s/it][2025-03-11 19:32:17,225] [WARNING] [stage3.py:2139:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
{'loss': 1.24938524, 'token_acc': 0.69148486, 'grad_norm': 0.86987531, 'learning_rate': 2.381e-05, 'memory(GiB)': 74.21, 'train_speed(iter/s)': 0.018535, 'epoch': 0.06, 'global_step/max_steps': '5/405', 'percentage': '1.23%', 'elapsed_time': '4m 26s', 'remaining_time': '5h 54m 54s'}
{'loss': 1.22446156, 'token_acc': 0.69278702, 'grad_norm': 0.77689102, 'learning_rate': 4.762e-05, 'memory(GiB)': 74.21, 'train_speed(iter/s)': 0.019271, 'epoch': 0.12, 'global_step/max_steps': '10/405', 'percentage': '2.47%', 'elapsed_time': '8m 35s', 'remaining_time': '5h 39m 15s'}
{'loss': 1.13267899, 'token_acc': 0.71570596, 'grad_norm': 0.40197327, 'learning_rate': 7.143e-05, 'memory(GiB)': 74.21, 'train_speed(iter/s)': 0.01927, 'epoch': 0.18, 'global_step/max_steps': '15/405', 'percentage': '3.70%', 'elapsed_time': '12m 54s', 'remaining_time': '5h 35m 45s'}
{'loss': 0.97332687, 'token_acc': 0.72897148, 'grad_norm': 0.34967286, 'learning_rate': 9.524e-05, 'memory(GiB)': 74.21, 'train_speed(iter/s)': 0.019607, 'epoch': 0.24, 'global_step/max_steps': '20/405', 'percentage': '4.94%', 'elapsed_time': '16m 56s', 'remaining_time': '5h 26m 7s'}
{'loss': 0.95233335, 'token_acc': 0.71795399, 'grad_norm': 0.32512059, 'learning_rate': 9.997e-05, 'memory(GiB)': 74.21, 'train_speed(iter/s)': 0.01971, 'epoch': 0.31, 'global_step/max_steps': '25/405', 'percentage': '6.17%', 'elapsed_time': '21m 4s', 'remaining_time': '5h 20m 24s'}
{'loss': 0.92778244, 'token_acc': 0.72106543, 'grad_norm': 0.22549374, 'learning_rate': 9.986e-05, 'memory(GiB)': 74.21, 'train_speed(iter/s)': 0.019805, 'epoch': 0.37, 'global_step/max_steps': '30/405', 'percentage': '7.41%', 'elapsed_time': '25m 11s', 'remaining_time': '5h 14m 49s'}
{'loss': 0.91093416, 'token_acc': 0.73585944, 'grad_norm': 0.21213417, 'learning_rate': 9.967e-05, 'memory(GiB)': 74.21, 'train_speed(iter/s)': 0.019825, 'epoch': 0.43, 'global_step/max_steps': '35/405', 'percentage': '8.64%', 'elapsed_time': '29m 21s', 'remaining_time': '5h 10m 25s'}
{'loss': 0.86407394, 'token_acc': 0.73746765, 'grad_norm': 0.22134356, 'learning_rate': 9.94e-05, 'memory(GiB)': 74.21, 'train_speed(iter/s)': 0.019853, 'epoch': 0.49, 'global_step/max_steps': '40/405', 'percentage': '9.88%', 'elapsed_time': '33m 31s', 'remaining_time': '5h 5m 52s'}
{'loss': 0.86335802, 'token_acc': 0.73666894, 'grad_norm': 0.236291, 'learning_rate': 9.904e-05, 'memory(GiB)': 74.21, 'train_speed(iter/s)': 0.019929, 'epoch': 0.55, 'global_step/max_steps': '45/405', 'percentage': '11.11%', 'elapsed_time': '37m 34s', 'remaining_time': '5h 0m 35s'}
{'loss': 0.81436214, 'token_acc': 0.76214197, 'grad_norm': 0.19902774, 'learning_rate': 9.86e-05, 'memory(GiB)': 74.21, 'train_speed(iter/s)': 0.019918, 'epoch': 0.61, 'global_step/max_steps': '50/405', 'percentage': '12.35%', 'elapsed_time': '41m 46s', 'remaining_time': '4h 56m 37s'}
Train: ?12%|█▏ ? ? ? ?| 50/405 [41:46<5:01:46, 51.00s/it]
{'eval_loss': 0.82470703, 'eval_token_acc': 0.75927635, 'eval_runtime': 15.7907, 'eval_samples_per_second': 1.647, 'eval_steps_per_second': 0.443, 'epoch': 0.61, 'global_step/max_steps': '50/405', 'percentage': '12.35%', 'elapsed_time': '42m 2s', 'remaining_time': '4h 58m 30s'}
Val: 100%|██████████| 7/7 [00:13<00:00, ?1.87s/it]00s/it]
[INFO:swift] Saving model checkpoint to /data/qwq32b_sft_lora/output/v9-20250311-192834/checkpoint-50
***********************************************
[2025-03-11 20:12:35,271] [WARNING] [stage3.py:2139:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
{'loss': 0.84376278, 'token_acc': 0.74929837, 'grad_norm': 0.29243814, 'learning_rate': 9.808e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019739, 'epoch': 0.67, 'global_step/max_steps': '55/405', 'percentage': '13.58%', 'elapsed_time': '46m 22s', 'remaining_time': '4h 55m 8s'}
{'loss': 0.82531147, 'token_acc': 0.75041408, 'grad_norm': 0.29134859, 'learning_rate': 9.748e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019703, 'epoch': 0.73, 'global_step/max_steps': '60/405', 'percentage': '14.81%', 'elapsed_time': '50m 41s', 'remaining_time': '4h 51m 29s'}
{'loss': 0.8170001, 'token_acc': 0.75919308, 'grad_norm': 0.24516849, 'learning_rate': 9.68e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019724, 'epoch': 0.8, 'global_step/max_steps': '65/405', 'percentage': '16.05%', 'elapsed_time': '54m 51s', 'remaining_time': '4h 46m 59s'}
{'loss': 0.81388254, 'token_acc': 0.75490298, 'grad_norm': 0.28124103, 'learning_rate': 9.604e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019781, 'epoch': 0.86, 'global_step/max_steps': '70/405', 'percentage': '17.28%', 'elapsed_time': '58m 55s', 'remaining_time': '4h 41m 58s'}
{'loss': 0.81019135, 'token_acc': 0.74177519, 'grad_norm': 0.28694744, 'learning_rate': 9.52e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019784, 'epoch': 0.92, 'global_step/max_steps': '75/405', 'percentage': '18.52%', 'elapsed_time': '1h 3m 7s', 'remaining_time': '4h 37m 44s'}
{'loss': 0.76696019, 'token_acc': 0.7639197, 'grad_norm': 0.311834, 'learning_rate': 9.429e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019813, 'epoch': 0.98, 'global_step/max_steps': '80/405', 'percentage': '19.75%', 'elapsed_time': '1h 7m 14s', 'remaining_time': '4h 33m 8s'}
{'loss': 0.76195569, 'token_acc': 0.76973895, 'grad_norm': 0.43021317, 'learning_rate': 9.33e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019856, 'epoch': 1.04, 'global_step/max_steps': '85/405', 'percentage': '20.99%', 'elapsed_time': '1h 11m 17s', 'remaining_time': '4h 28m 22s'}
{'loss': 0.7821136, 'token_acc': 0.74735605, 'grad_norm': 0.41759374, 'learning_rate': 9.224e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019888, 'epoch': 1.1, 'global_step/max_steps': '90/405', 'percentage': '22.22%', 'elapsed_time': '1h 15m 21s', 'remaining_time': '4h 23m 45s'}
Train: ?23%|██▎ ? ? ? | 92/405 [1:17:03<4:18:27, 49.54s/it][2025-03-11 20:46:52,504] [WARNING] [stage3.py:2139:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
{'loss': 0.74946299, 'token_acc': 0.76573743, 'grad_norm': 0.31465808, 'learning_rate': 9.111e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019875, 'epoch': 1.16, 'global_step/max_steps': '95/405', 'percentage': '23.46%', 'elapsed_time': '1h 19m 36s', 'remaining_time': '4h 19m 45s'}
{'loss': 0.75774355, 'token_acc': 0.76279737, 'grad_norm': 0.34568468, 'learning_rate': 8.992e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019874, 'epoch': 1.22, 'global_step/max_steps': '100/405', 'percentage': '24.69%', 'elapsed_time': '1h 23m 48s', 'remaining_time': '4h 15m 35s'}
Train: ?25%|██▍ ? ? ? | 100/405 [1:23:48<4:17:26, 50.64s/it]
{'eval_loss': 0.720375, 'eval_token_acc': 0.77822903, 'eval_runtime': 15.6988, 'eval_samples_per_second': 1.656, 'eval_steps_per_second': 0.446, 'epoch': 1.22, 'global_step/max_steps': '100/405', 'percentage': '24.69%', 'elapsed_time': '1h 24m 3s', 'remaining_time': '4h 16m 23s'}
Val: 100%|██████████| 7/7 [00:13<00:00, ?1.86s/it]50.64s/it]
[INFO:swift] Saving model checkpoint to /data/qwq32b_sft_lora/output/v9-20250311-192834/checkpoint-100
**********************************************
{'loss': 0.72672591, 'token_acc': 0.76866752, 'grad_norm': 0.64908534, 'learning_rate': 8.865e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019801, 'epoch': 1.28, 'global_step/max_steps': '105/405', 'percentage': '25.93%', 'elapsed_time': '1h 28m 19s', 'remaining_time': '4h 12m 20s'}
{'loss': 0.72024732, 'token_acc': 0.76941662, 'grad_norm': 0.36116413, 'learning_rate': 8.732e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019815, 'epoch': 1.34, 'global_step/max_steps': '110/405', 'percentage': '27.16%', 'elapsed_time': '1h 32m 27s', 'remaining_time': '4h 7m 58s'}
{'loss': 0.68267331, 'token_acc': 0.7761134, 'grad_norm': 0.38293342, 'learning_rate': 8.593e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019806, 'epoch': 1.4, 'global_step/max_steps': '115/405', 'percentage': '28.40%', 'elapsed_time': '1h 36m 42s', 'remaining_time': '4h 3m 52s'}
{'loss': 0.71170344, 'token_acc': 0.78053525, 'grad_norm': 0.3713337, 'learning_rate': 8.448e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019831, 'epoch': 1.46, 'global_step/max_steps': '120/405', 'percentage': '29.63%', 'elapsed_time': '1h 40m 47s', 'remaining_time': '3h 59m 22s'}
{'loss': 0.70673256, 'token_acc': 0.77159011, 'grad_norm': 0.36822507, 'learning_rate': 8.297e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019845, 'epoch': 1.53, 'global_step/max_steps': '125/405', 'percentage': '30.86%', 'elapsed_time': '1h 44m 55s', 'remaining_time': '3h 55m 1s'}
{'loss': 0.67356033, 'token_acc': 0.7921583, 'grad_norm': 0.4612934, 'learning_rate': 8.14e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019851, 'epoch': 1.59, 'global_step/max_steps': '130/405', 'percentage': '32.10%', 'elapsed_time': '1h 49m 5s', 'remaining_time': '3h 50m 45s'}
Train: ?33%|███▎ ? ? ?| 132/405 [1:50:48<3:50:21, 50.63s/it][2025-03-11 21:20:37,710] [WARNING] [stage3.py:2139:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
{'loss': 0.68124514, 'token_acc': 0.78771819, 'grad_norm': 0.46047566, 'learning_rate': 7.978e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019853, 'epoch': 1.65, 'global_step/max_steps': '135/405', 'percentage': '33.33%', 'elapsed_time': '1h 53m 16s', 'remaining_time': '3h 46m 32s'}
{'loss': 0.67308445, 'token_acc': 0.78043745, 'grad_norm': 0.46205863, 'learning_rate': 7.812e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019881, 'epoch': 1.71, 'global_step/max_steps': '140/405', 'percentage': '34.57%', 'elapsed_time': '1h 57m 18s', 'remaining_time': '3h 42m 2s'}
{'loss': 0.65709753, 'token_acc': 0.794716, 'grad_norm': 0.46728156, 'learning_rate': 7.64e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019893, 'epoch': 1.77, 'global_step/max_steps': '145/405', 'percentage': '35.80%', 'elapsed_time': '2h 1m 25s', 'remaining_time': '3h 37m 43s'}
{'loss': 0.66156731, 'token_acc': 0.78602904, 'grad_norm': 0.45510392, 'learning_rate': 7.464e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019899, 'epoch': 1.83, 'global_step/max_steps': '150/405', 'percentage': '37.04%', 'elapsed_time': '2h 5m 34s', 'remaining_time': '3h 33m 28s'}
Train: ?37%|███▋ ? ? ?| 150/405 [2:05:34<3:31:14, 49.70s/it]
{'eval_loss': 0.65251857, 'eval_token_acc': 0.79853547, 'eval_runtime': 15.6574, 'eval_samples_per_second': 1.661, 'eval_steps_per_second': 0.447, 'epoch': 1.83, 'global_step/max_steps': '150/405', 'percentage': '37.04%', 'elapsed_time': '2h 5m 50s', 'remaining_time': '3h 33m 55s'}
Val: 100%|██████████| 7/7 [00:12<00:00, ?1.86s/it]49.70s/it]
[INFO:swift] Saving model checkpoint to /data/qwq32b_sft_lora/output/v9-20250311-192834/checkpoint-150
**********************************************
{'loss': 0.65750132, 'token_acc': 0.78596818, 'grad_norm': 0.47214887, 'learning_rate': 7.285e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019847, 'epoch': 1.89, 'global_step/max_steps': '155/405', 'percentage': '38.27%', 'elapsed_time': '2h 10m 6s', 'remaining_time': '3h 29m 50s'}
{'loss': 0.63944697, 'token_acc': 0.80483245, 'grad_norm': 0.49222756, 'learning_rate': 7.101e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019853, 'epoch': 1.95, 'global_step/max_steps': '160/405', 'percentage': '39.51%', 'elapsed_time': '2h 14m 15s', 'remaining_time': '3h 25m 35s'}
{'loss': 0.63674178, 'token_acc': 0.80768833, 'grad_norm': 0.59897131, 'learning_rate': 6.913e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019895, 'epoch': 2.01, 'global_step/max_steps': '165/405', 'percentage': '40.74%', 'elapsed_time': '2h 18m 10s', 'remaining_time': '3h 20m 58s'}
{'loss': 0.64350748, 'token_acc': 0.80203466, 'grad_norm': 0.51221188, 'learning_rate': 6.723e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019886, 'epoch': 2.07, 'global_step/max_steps': '170/405', 'percentage': '41.98%', 'elapsed_time': '2h 22m 25s', 'remaining_time': '3h 16m 52s'}
{'loss': 0.59812784, 'token_acc': 0.80184307, 'grad_norm': 0.52895864, 'learning_rate': 6.53e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019905, 'epoch': 2.13, 'global_step/max_steps': '175/405', 'percentage': '43.21%', 'elapsed_time': '2h 26m 27s', 'remaining_time': '3h 12m 29s'}
{'loss': 0.60168495, 'token_acc': 0.80204451, 'grad_norm': 0.54771068, 'learning_rate': 6.334e-05, 'memory(GiB)': 76.21, 'train_speed(iter/s)': 0.019928, 'epoch': 2.2, 'global_step/max_steps': '180/405', 'percentage': '44.44%', 'elapsed_time': '2h 30m 28s', 'remaining_time': '3h 8m 6s'}
Train: ?44%|████▍ ? ? | 180/405 [2:30:28<2:59:27, 47.85s/it][2025-03-11 22:00:31,985] [WARNING] [stage3.py:2139:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
{'loss': 0.59545937, 'token_acc': 0.80456827, 'grad_norm': 0.57579227, 'learning_rate': 6.135e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019937, 'epoch': 2.26, 'global_step/max_steps': '185/405', 'percentage': '45.68%', 'elapsed_time': '2h 34m 35s', 'remaining_time': '3h 3m 50s'}
{'loss': 0.59948916, 'token_acc': 0.80121831, 'grad_norm': 0.53543298, 'learning_rate': 5.935e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019937, 'epoch': 2.32, 'global_step/max_steps': '190/405', 'percentage': '46.91%', 'elapsed_time': '2h 38m 46s', 'remaining_time': '2h 59m 39s'}
{'loss': 0.59326115, 'token_acc': 0.79956183, 'grad_norm': 0.55039623, 'learning_rate': 5.734e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.01994, 'epoch': 2.38, 'global_step/max_steps': '195/405', 'percentage': '48.15%', 'elapsed_time': '2h 42m 55s', 'remaining_time': '2h 55m 27s'}
{'loss': 0.58592167, 'token_acc': 0.80714245, 'grad_norm': 0.69052059, 'learning_rate': 5.531e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019935, 'epoch': 2.44, 'global_step/max_steps': '200/405', 'percentage': '49.38%', 'elapsed_time': '2h 47m 9s', 'remaining_time': '2h 51m 19s'}
Train: ?49%|████▉ ? ? | 200/405 [2:47:09<2:49:56, 49.74s/it]
{'eval_loss': 0.56886792, 'eval_token_acc': 0.82167251, 'eval_runtime': 15.7067, 'eval_samples_per_second': 1.655, 'eval_steps_per_second': 0.446, 'epoch': 2.44, 'global_step/max_steps': '200/405', 'percentage': '49.38%', 'elapsed_time': '2h 47m 24s', 'remaining_time': '2h 51m 36s'}
Val: 100%|██████████| 7/7 [00:13<00:00, ?1.86s/it]49.74s/it]
[INFO:swift] Saving model checkpoint to /data/qwq32b_sft_lora/output/v9-20250311-192834/checkpoint-200
*****************************************
Train: ?50%|████▉ ? ? | 202/405 [2:49:15<3:08:49, 55.81s/it][2025-03-11 22:19:18,274] [WARNING] [stage3.py:2139:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
{'loss': 0.54969444, 'token_acc': 0.82021995, 'grad_norm': 0.51556883, 'learning_rate': 5.327e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019871, 'epoch': 2.5, 'global_step/max_steps': '205/405', 'percentage': '50.62%', 'elapsed_time': '2h 51m 52s', 'remaining_time': '2h 47m 41s'}
{'loss': 0.52501326, 'token_acc': 0.81536282, 'grad_norm': 0.54576287, 'learning_rate': 5.123e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019883, 'epoch': 2.56, 'global_step/max_steps': '210/405', 'percentage': '51.85%', 'elapsed_time': '2h 55m 58s', 'remaining_time': '2h 43m 24s'}
{'loss': 0.5639473, 'token_acc': 0.8235682, 'grad_norm': 0.51644597, 'learning_rate': 4.918e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019872, 'epoch': 2.62, 'global_step/max_steps': '215/405', 'percentage': '53.09%', 'elapsed_time': '3h 0m 15s', 'remaining_time': '2h 39m 17s'}
Train: ?53%|█████▎ ? ?| 215/405 [3:00:15<2:40:55, 50.82s/it][2025-03-11 22:30:17,273] [WARNING] [stage3.py:2139:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
{'loss': 0.54539089, 'token_acc': 0.82929161, 'grad_norm': 0.5427966, 'learning_rate': 4.714e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019887, 'epoch': 2.69, 'global_step/max_steps': '220/405', 'percentage': '54.32%', 'elapsed_time': '3h 4m 18s', 'remaining_time': '2h 34m 59s'}
{'loss': 0.54721932, 'token_acc': 0.82292752, 'grad_norm': 0.58632606, 'learning_rate': 4.51e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.01989, 'epoch': 2.75, 'global_step/max_steps': '225/405', 'percentage': '55.56%', 'elapsed_time': '3h 8m 28s', 'remaining_time': '2h 30m 46s'}
{'loss': 0.51745701, 'token_acc': 0.82614152, 'grad_norm': 0.51928985, 'learning_rate': 4.307e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019892, 'epoch': 2.81, 'global_step/max_steps': '230/405', 'percentage': '56.79%', 'elapsed_time': '3h 12m 38s', 'remaining_time': '2h 26m 34s'}
{'loss': 0.54157047, 'token_acc': 0.81710944, 'grad_norm': 0.71657186, 'learning_rate': 4.105e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019899, 'epoch': 2.87, 'global_step/max_steps': '235/405', 'percentage': '58.02%', 'elapsed_time': '3h 16m 46s', 'remaining_time': '2h 22m 20s'}
{'loss': 0.54548702, 'token_acc': 0.81284619, 'grad_norm': 0.50686509, 'learning_rate': 3.904e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019907, 'epoch': 2.93, 'global_step/max_steps': '240/405', 'percentage': '59.26%', 'elapsed_time': '3h 20m 52s', 'remaining_time': '2h 18m 6s'}
{'loss': 0.51912632, 'token_acc': 0.83365523, 'grad_norm': 0.68279731, 'learning_rate': 3.706e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019907, 'epoch': 2.99, 'global_step/max_steps': '245/405', 'percentage': '60.49%', 'elapsed_time': '3h 25m 3s', 'remaining_time': '2h 13m 55s'}
{'loss': 0.52836185, 'token_acc': 0.83409461, 'grad_norm': 0.55463023, 'learning_rate': 3.509e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019923, 'epoch': 3.05, 'global_step/max_steps': '250/405', 'percentage': '61.73%', 'elapsed_time': '3h 29m 4s', 'remaining_time': '2h 9m 37s'}
Train: ?62%|██████▏ ? | 250/405 [3:29:04<2:07:39, 49.41s/it]
{'eval_loss': 0.52870411, 'eval_token_acc': 0.83231801, 'eval_runtime': 15.7131, 'eval_samples_per_second': 1.655, 'eval_steps_per_second': 0.445, 'epoch': 3.05, 'global_step/max_steps': '250/405', 'percentage': '61.73%', 'elapsed_time': '3h 29m 20s', 'remaining_time': '2h 9m 47s'}
Val: 100%|██████████| 7/7 [00:13<00:00, ?1.86s/it]49.41s/it]
[INFO:swift] Saving model checkpoint to /data/qwq32b_sft_lora/output/v9-20250311-192834/checkpoint-250
************************************************
{'loss': 0.51691947, 'token_acc': 0.82422604, 'grad_norm': 0.53855505, 'learning_rate': 3.316e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019896, 'epoch': 3.11, 'global_step/max_steps': '255/405', 'percentage': '62.96%', 'elapsed_time': '3h 33m 33s', 'remaining_time': '2h 5m 37s'}
Train: ?63%|██████▎ ? | 257/405 [3:35:20<2:08:40, 52.17s/it][2025-03-11 23:05:38,683] [WARNING] [stage3.py:2139:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
{'loss': 0.50732822, 'token_acc': 0.83722172, 'grad_norm': 0.7386699, 'learning_rate': 3.124e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019885, 'epoch': 3.17, 'global_step/max_steps': '260/405', 'percentage': '64.20%', 'elapsed_time': '3h 37m 51s', 'remaining_time': '2h 1m 30s'}
{'loss': 0.50304022, 'token_acc': 0.84518402, 'grad_norm': 0.5332581, 'learning_rate': 2.936e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019889, 'epoch': 3.23, 'global_step/max_steps': '265/405', 'percentage': '65.43%', 'elapsed_time': '3h 42m 0s', 'remaining_time': '1h 57m 17s'}
{'loss': 0.5034606, 'token_acc': 0.81697432, 'grad_norm': 0.67853403, 'learning_rate': 2.752e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019886, 'epoch': 3.29, 'global_step/max_steps': '270/405', 'percentage': '66.67%', 'elapsed_time': '3h 46m 13s', 'remaining_time': '1h 53m 6s'}
{'loss': 0.5183465, 'token_acc': 0.83563731, 'grad_norm': 0.6259943, 'learning_rate': 2.571e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019885, 'epoch': 3.35, 'global_step/max_steps': '275/405', 'percentage': '67.90%', 'elapsed_time': '3h 50m 25s', 'remaining_time': '1h 48m 55s'}
{'loss': 0.51731062, 'token_acc': 0.83534514, 'grad_norm': 0.61180005, 'learning_rate': 2.394e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019881, 'epoch': 3.42, 'global_step/max_steps': '280/405', 'percentage': '69.14%', 'elapsed_time': '3h 54m 40s', 'remaining_time': '1h 44m 45s'}
{'loss': 0.48814211, 'token_acc': 0.82283914, 'grad_norm': 0.57190785, 'learning_rate': 2.222e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019883, 'epoch': 3.48, 'global_step/max_steps': '285/405', 'percentage': '70.37%', 'elapsed_time': '3h 58m 50s', 'remaining_time': '1h 40m 33s'}
{'loss': 0.4921607, 'token_acc': 0.82588464, 'grad_norm': 0.52349298, 'learning_rate': 2.054e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019888, 'epoch': 3.54, 'global_step/max_steps': '290/405', 'percentage': '71.60%', 'elapsed_time': '4h 2m 58s', 'remaining_time': '1h 36m 20s'}
{'loss': 0.46711798, 'token_acc': 0.85013139, 'grad_norm': 0.6346718, 'learning_rate': 1.892e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019889, 'epoch': 3.6, 'global_step/max_steps': '295/405', 'percentage': '72.84%', 'elapsed_time': '4h 7m 8s', 'remaining_time': '1h 32m 9s'}
{'loss': 0.48140554, 'token_acc': 0.83738891, 'grad_norm': 0.62962168, 'learning_rate': 1.734e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019894, 'epoch': 3.66, 'global_step/max_steps': '300/405', 'percentage': '74.07%', 'elapsed_time': '4h 11m 16s', 'remaining_time': '1h 27m 56s'}
Train: ?74%|███████▍ ?| 300/405 [4:11:16<1:25:56, 49.11s/it]
{'eval_loss': 0.50453913, 'eval_token_acc': 0.84007138, 'eval_runtime': 15.7353, 'eval_samples_per_second': 1.652, 'eval_steps_per_second': 0.445, 'epoch': 3.66, 'global_step/max_steps': '300/405', 'percentage': '74.07%', 'elapsed_time': '4h 11m 32s', 'remaining_time': '1h 28m 2s'}
Val: 100%|██████████| 7/7 [00:13<00:00, ?1.87s/it]49.11s/it]
[INFO:swift] Saving model checkpoint to /data/qwq32b_sft_lora/output/v9-20250311-192834/checkpoint-300
********************************************
{'loss': 0.48724985, 'token_acc': 0.83901735, 'grad_norm': 0.59317845, 'learning_rate': 1.582e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019877, 'epoch': 3.72, 'global_step/max_steps': '305/405', 'percentage': '75.31%', 'elapsed_time': '4h 15m 40s', 'remaining_time': '1h 23m 49s'}
{'loss': 0.47789598, 'token_acc': 0.85550931, 'grad_norm': 0.67272985, 'learning_rate': 1.436e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019881, 'epoch': 3.78, 'global_step/max_steps': '310/405', 'percentage': '76.54%', 'elapsed_time': '4h 19m 49s', 'remaining_time': '1h 19m 37s'}
{'loss': 0.49318271, 'token_acc': 0.81770707, 'grad_norm': 0.64257801, 'learning_rate': 1.295e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.01989, 'epoch': 3.84, 'global_step/max_steps': '315/405', 'percentage': '77.78%', 'elapsed_time': '4h 23m 53s', 'remaining_time': '1h 15m 23s'}
Train: ?78%|███████▊ ?| 316/405 [4:24:43<1:14:10, 50.00s/it][2025-03-11 23:54:59,684] [WARNING] [stage3.py:2139:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
{'loss': 0.46136761, 'token_acc': 0.8458454, 'grad_norm': 0.62953055, 'learning_rate': 1.161e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.01989, 'epoch': 3.91, 'global_step/max_steps': '320/405', 'percentage': '79.01%', 'elapsed_time': '4h 28m 5s', 'remaining_time': '1h 11m 12s'}
{'loss': 0.4856822, 'token_acc': 0.83825816, 'grad_norm': 0.64470125, 'learning_rate': 1.033e-05, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019893, 'epoch': 3.97, 'global_step/max_steps': '325/405', 'percentage': '80.25%', 'elapsed_time': '4h 32m 13s', 'remaining_time': '1h 7m 0s'}
{'loss': 0.46592345, 'token_acc': 0.84528571, 'grad_norm': 0.65905805, 'learning_rate': 9.12e-06, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019917, 'epoch': 4.02, 'global_step/max_steps': '330/405', 'percentage': '81.48%', 'elapsed_time': '4h 36m 5s', 'remaining_time': '1h 2m 44s'}
{'loss': 0.48042569, 'token_acc': 0.85237186, 'grad_norm': 0.61635281, 'learning_rate': 7.98e-06, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019923, 'epoch': 4.09, 'global_step/max_steps': '335/405', 'percentage': '82.72%', 'elapsed_time': '4h 40m 11s', 'remaining_time': '58m 32s'}
{'loss': 0.45569935, 'token_acc': 0.83371485, 'grad_norm': 0.64527875, 'learning_rate': 6.9e-06, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.01993, 'epoch': 4.15, 'global_step/max_steps': '340/405', 'percentage': '83.95%', 'elapsed_time': '4h 44m 16s', 'remaining_time': '54m 20s'}
{'loss': 0.46417255, 'token_acc': 0.84960884, 'grad_norm': 0.67313113, 'learning_rate': 5.9e-06, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019931, 'epoch': 4.21, 'global_step/max_steps': '345/405', 'percentage': '85.19%', 'elapsed_time': '4h 48m 26s', 'remaining_time': '50m 9s'}
{'loss': 0.47292795, 'token_acc': 0.85013211, 'grad_norm': 0.59537749, 'learning_rate': 4.98e-06, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.019936, 'epoch': 4.27, 'global_step/max_steps': '350/405', 'percentage': '86.42%', 'elapsed_time': '4h 52m 32s', 'remaining_time': '45m 58s'}
Train: ?86%|████████▋ | 350/405 [4:52:32<45:19, 49.44s/it]
{'eval_loss': 0.490695, 'eval_token_acc': 0.84296351, 'eval_runtime': 15.6909, 'eval_samples_per_second': 1.657, 'eval_steps_per_second': 0.446, 'epoch': 4.27, 'global_step/max_steps': '350/405', 'percentage': '86.42%', 'elapsed_time': '4h 52m 48s', 'remaining_time': '46m 0s'}
Val: 100%|██████████| 7/7 [00:13<00:00, ?1.86s/it].44s/it]
[INFO:swift] Saving model checkpoint to /data/qwq32b_sft_lora/output/v9-20250311-192834/checkpoint-350
*****************************************
Train: ?87%|████████▋ | 352/405 [4:54:35<48:02, 54.39s/it][2025-03-12 00:25:04,775] [WARNING] [stage3.py:2139:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
{'loss': 0.46881456, 'token_acc': 0.83740075, 'grad_norm': 0.59338625, 'learning_rate': 4.13e-06, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.01991, 'epoch': 4.33, 'global_step/max_steps': '355/405', 'percentage': '87.65%', 'elapsed_time': '4h 57m 6s', 'remaining_time': '41m 50s'}
{'eval_loss': 0.48915866, 'eval_token_acc': 0.84357886, 'eval_runtime': 15.8494, 'eval_samples_per_second': 1.64, 'eval_steps_per_second': 0.442, 'epoch': 4.88, 'global_step/max_steps': '400/405', 'percentage': '98.77%', 'elapsed_time': '5h 34m 38s', 'remaining_time': '4m 10s'}
Val: 100%|██████████| 7/7 [00:13<00:00, ?1.88s/it].31s/it]
[INFO:swift] Saving model checkpoint to /data/qwq32b_sft_lora/output/v9-20250311-192834/checkpoint-400
{'loss': 0.49458728, 'token_acc': 0.83115697, 'grad_norm': 0.65133526, 'learning_rate': 0.0, 'memory(GiB)': 76.86, 'train_speed(iter/s)': 0.01992, 'epoch': 4.94, 'global_step/max_steps': '405/405', 'percentage': '100.00%', 'elapsed_time': '5h 38m 48s', 'remaining_time': '0s'}
Train: 100%|██████████| 405/405 [5:38:48<00:00, 50.98s/it]
{'eval_loss': 0.4893617, 'eval_token_acc': 0.84308658, 'eval_runtime': 15.9508, 'eval_samples_per_second': 1.63, 'eval_steps_per_second': 0.439, 'epoch': 4.94, 'global_step/max_steps': '405/405', 'percentage': '100.00%', 'elapsed_time': '5h 39m 4s', 'remaining_time': '0s'}
Val: 100%|██████████| 7/7 [00:13<00:00, ?1.90s/it].98s/it]
[INFO:swift] Saving model checkpoint to /data/qwq32b_sft_lora/output/v9-20250311-192834/checkpoint-405
{'train_runtime': 20349.5179, 'train_samples_per_second': 0.642, 'train_steps_per_second': 0.02, 'train_loss': 0.63218051, 'epoch': 4.94, 'global_step/max_steps': '405/405', 'percentage': '100.00%', 'elapsed_time': '5h 39m 9s', 'remaining_time': '0s'}
Train: 100%|██████████| 405/405 [5:39:09<00:00, 50.25s/it]
[INFO:swift] last_model_checkpoint: /data/qwq32b_sft_lora/output/v9-20250311-192834/checkpoint-405
[INFO:swift] best_model_checkpoint: /data/qwq32b_sft_lora/output/v9-20250311-192834/checkpoint-400

4. 模型部署及推理腳本

我們采用多卡部署,并且自定義服務端口:

RAY_memory_monitor_refresh_ms=0
CUDA_VISIBLE_DEVICES=0,1 swift deploy \--ckpt_dir /data/qwq32b_sft_lora/output/v9-20250311-192834/checkpoint-400 \--infer_backend vllm \--max_new_tokens 2048 \--tensor_parallel_size 2 \--port 8011

推理腳本:

from openai import OpenAIopenai_api_key = "EMPTY"
openai_api_base = "http://ip:8011/v1"client = OpenAI(api_key=openai_api_key,base_url=openai_api_base,
)chat_response = client.chat.completions.create(model="QwQ-32B",messages=[{"role": "system", "content": "你是一款客戶機器人,幫助客戶解決問題"}, {"role": "user", "content": "問一下這款手機現在附帶什么配件"}, {"role": "assistant", "content": "附件內容:鋰離子電池組 NP-FW50,電源適配器AC-UUD12 ,Micro USB 連接線,肩帶,鏡頭蓋,熱靴蓋,遮光罩,使用說明書,保修卡"}, {"role": "user", "content": "售后和質保是什么標準"}],temperature=0.7,top_p=0.8,max_tokens=2048,extra_body={"repetition_penalty": 1.05,},
)
print("Chat response:", chat_response)

5. 參考材料

【1】https://github.com/modelscope/ms-swift

【2】推理和部署

本文來自互聯網用戶投稿,該文觀點僅代表作者本人,不代表本站立場。本站僅提供信息存儲空間服務,不擁有所有權,不承擔相關法律責任。
如若轉載,請注明出處:http://www.pswp.cn/diannao/73538.shtml
繁體地址,請注明出處:http://hk.pswp.cn/diannao/73538.shtml
英文地址,請注明出處:http://en.pswp.cn/diannao/73538.shtml

如若內容造成侵權/違法違規/事實不符,請聯系多彩編程網進行投訴反饋email:809451989@qq.com,一經查實,立即刪除!

相關文章

【Unity3D】Addressables使用流程

Package Manager - 搜索 Addressables 安裝 Window -> Asset Management -> Addressables 打開窗口 New -> 新建Packed Assets 資源組 默認資源組Default xxx (Default) 將資源&#xff0c;如預制體直接拖拽進資源組 Build -> New Build -> Default Buil…

k8s serviceaccount在集群內指定apiserver時驗證錯誤的問題

在主機上&#xff0c;找到TOKEN&#xff0c;可以直接指定apiserver使用 rootubuntu-server:/home# kubectl auth can-i --list --server https://192.168.85.198:6443 --token"eyJhbGciOiJSUzI1NiIsImtpZCI6IlFlMHQ3TzhpcGw1SnRqbkYtOC1NUWlWNUpWdGo5SGRXeTBvZU9ib25iZD…

Linux驅動開發-①pinctrl 和 gpio 子系統②并發和競爭③內核定時器

Linux驅動開發-①pinctrl 和 gpio 子系統②并發和競爭③內核定時器 一&#xff0c;pinctrl 和 gpio 子系統1.pinctrl子系統2.GPIO子系統 二&#xff0c;并發和競爭1.原子操作2.自旋鎖3.信號量4.互斥體 三&#xff0c;按鍵實驗四&#xff0c;內核定時器1.關于定時器的有關概念1.…

數據庫的高階知識

目錄 一、case when二、幾種常見的嵌套查詢2.1 比較運算符2.2 ANY/ALL 關鍵詞2.3 in 關鍵詞2.4 EXISTS關鍵詞2.5 in和exists的異同點 三、開窗函數 數據庫的基本知識 數據庫的高階知識 一、case when 在實際工作中&#xff0c;經常會涉及以下兩類問題&#xff1a; 數據的映射…

【Kubernetes】Service 的類型有哪些?ClusterIP、NodePort 和 LoadBalancer 的區別?

在 Kubernetes 中&#xff0c;Service 是一種抽象的方式&#xff0c;用于將一組 Pod 進行連接并暴露給外部或集群內部訪問。它的主要目的是通過提供穩定的 IP 地址和端口來允許其他服務或客戶端與一組 Pod 進行通信。 Service 類型 Kubernetes 中 Service 有四種主要類型&…

MapReduce處理數據流程

&#xff08;一&#xff09;Shuffle MapReduce中的Shuffle過程指的是在Map方法執行后、Reduce方法執行前對數據進行分區排序的階段 &#xff08;二&#xff09;處理流程 1. 首先MapReduce會將處理的數據集劃分成多個split&#xff0c;split劃分是邏輯上進行劃分&#xff0c;…

OrioleDB: 新一代PostgreSQL存儲引擎

PostgreSQL 12 引入了可插拔式的表存儲方法接口&#xff0c;允許為不同的表選擇不同的存儲機制&#xff0c;例如用于 OLTP 操作的堆表&#xff08;HEAP、默認&#xff09;、用于 OLAP 操作的列式表&#xff08;Citus&#xff09;&#xff0c;以及用于超快速搜索處理的內存表。 …

電腦自動關機故障維修案例分享

電腦基本配置&#xff1a; C P U: AMD A10 9700 內存&#xff1a;8G 硬盤&#xff1a;金邦512G固態硬盤 主板&#xff1a;華碩 A320M-F 顯卡&#xff1a;集成&#xff08;核心顯卡&#xff09; 操作系統&#xff1a;Win10專業版 故障描述&#xff1a; 使用一段時間會黑屏…

JVM垃圾收集器相關面試題(1)

垃圾收集與內存管理摘要 一.核心垃圾收集算法對比 算法原理優點缺點適用場景標記-清除兩次遍歷&#xff08;標記存活對象→清除未標記對象&#xff09;實現簡單內存碎片化、雙遍歷效率低老年代&#xff08;結合整理&#xff09;標記-復制內存對半分&#xff0c;存活對象復制到…

棧(LIFO)算法題

1.刪除字符串中所有相鄰的重復字符 注意&#xff0c;我們需要重復處理&#xff0c;而不是處理一次相鄰的相同元素就結束了。對示例來說&#xff0c;如果只進行一次處理&#xff0c;結果為aaca&#xff0c;但是處理之后又出現了相鄰的重復元素&#xff0c;我們還得繼續處理&…

conda的基本使用及pycharm里設置conda環境

創建conda環境 conda create --name your_env_name python3.8 把your_env_name換成實際的conda環境名稱&#xff0c;python后邊的根據自己的需要&#xff0c;選擇python的版本。 激活conda環境 conda activate your_env_name 安裝相關的包、庫 conda install package_name …

Python基于深度學習的多模態人臉情緒識別研究與實現

一、系統架構設計 A[數據采集] --> B[預處理模塊] B --> C[特征提取] C --> D[多模態融合] D --> E[情緒分類] E --> F[系統部署] F --> G[用戶界面] 二、數據準備與處理 1. 數據收集 - 視頻數據&#xff1a;FER2013&#xff08;靜態圖像&#xff0…

synchronized與 Java內置鎖(未寫完)

文章目錄 一、 synchronized 關鍵字二、Java對象結構1. 對象頭2. 對象體3. 對齊字節4. 對象頭中的字段長度5. Mark Word 的結構信息6. 使用 JOL 工具查看對象的布局 三、Java 內置鎖機制3.1 內置鎖的演進過程1. 無鎖狀態2. 偏向鎖狀態3. 輕量級鎖狀態4. 重量級鎖狀態 一、 sync…

LLM(3): Transformer 架構

Transformer 架構是當前大語言模型的主力架構和基礎技術&#xff0c;本文以通俗易懂的方式&#xff0c;對此作簡要介紹。 1.4 介紹 Transformer 架構 大多數現代的大規模語言模型&#xff08;LLMs&#xff09;依賴于 Transformer 架構&#xff0c;這是一種在 2017 年的論文《…

11.【.NET 8 實戰--孢子記賬--從單體到微服務--轉向微服務】--微服務基礎工具與技術--Ocelot 網關--整合日志

網關作為微服務架構的入口&#xff0c;承載著各服務間的請求轉發與安全校驗&#xff0c;其日志信息尤為關鍵。通過整合網關日志&#xff0c;可以將分散在不同系統中的訪問記錄、錯誤提示和異常信息集中管理&#xff0c;為問題排查提供全景視角。在排查故障時&#xff0c;統一日…

88.HarmonyOS NEXT 性能監控與調試指南:構建高性能應用

溫馨提示&#xff1a;本篇博客的詳細代碼已發布到 git : https://gitcode.com/nutpi/HarmonyosNext 可以下載運行哦&#xff01; HarmonyOS NEXT 性能監控與調試指南&#xff1a;構建高性能應用 文章目錄 HarmonyOS NEXT 性能監控與調試指南&#xff1a;構建高性能應用1. 性能監…

012---狀態機的基本知識

1. 摘要 文章為學習記錄。主要介紹狀態機概述、狀態轉移圖、狀態編碼、狀態機寫法、狀態機代碼示例。 2. 狀態機概述 狀態機 &#xff08;Finite State Machine&#xff09;&#xff0c;也稱為同步有限狀態機&#xff0c;用于描述有先后順序或時序規律的事情。 “同步”&…

deepseek+kimi做ppt教程記錄

1.首先注冊deepseek和kimi deepseek官網&#xff1a;https://chat.deepseek.com/ kimi官網&#xff1a;https://kimi.moonshot.cn/ 以下以一篇工作總結報告為例 2.使用deepseek生成ppt大綱 讓deepseek生成kimi生成ppt所需要的內容時&#xff0c;需要注意提示詞內容&#xff0c;…

Java Module介紹

Java模塊系統自Java 9開始引入&#xff0c;旨在提供更強大的封裝機制、清晰的依賴關系定義以及可靠的配置。Java平臺本身也被模塊化了&#xff0c;提供了多個核心模塊以及其他用于支持不同功能的模塊。以下是一些重要的Java標準模塊&#xff1a; java.base - 這是最基礎的模塊…

SOME/IP:用Python實現協議訂閱、Offer、訂閱ACK與報文接收

文章目錄 前言一、代碼層次二、詳細代碼1. eth_scapy_sd.py2、eth_scapy_someip.py3、network_define.py4、packet_define.py5、unpack_define.py6、someip_controller.py 前言 1、需要pip安裝scapy庫 2、需要修改根據實際情況配置network_define.py 3、執行someip_controller…