Command used for SFT training
CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage sft \
    --model_name_or_path models/chatglm3-6b \
    --do_train \
    --dataset self_cognition \
    --template chatglm3 \
    --finetuning_type lora \
    --lora_target query_key_value \
    --output_dir output/chatglm3_sft_lora_self/ \
    --overwrite_cache \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 200 \
    --learning_rate 5e-5 \
    --num_train_epochs 100 \
    --plot_loss \
    --fp16
The base model was downloaded correctly, and torch.cuda.is_available() returns True.
Training fails with the following error:
11/21/2023 09:11:23 - INFO - llmtuner.data.loader - Loading dataset self_cognition.json...
Using custom data configuration default-aaabbbccc
Loading Dataset Infos from /usr/local/lib/python3.10/site-packages/datasets/packaged_modules/json
Generating dataset json (/root/.cache/huggingface/datasets/json/default-aaabbbccc/0.0.0/34bc96c741b2e8a1f18598ffdd8bb11242116d54740a1d4f2a2872c7a28b6900)
Downloading and preparing dataset json/default to /root/.cache/huggingface/datasets/json/default-aaabbbccc/0.0.0/34bc96c741b2e8a1f18598ffdd8bb11242116d54740a1d4f2a2872c7a28b6900...
Downloading data files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 6364.65it/s]
Downloading took 0.0 min
Checksum Computation took 0.0 min
Extracting data files: 0%| | 0/1 [00:00<?, ?it/s]train.sft.1gpu.lora.fp16.self.sh: line 19: 2551 Segmentation fault (core dumped) CUDA_VISIBLE_DEVICES=0 python src/train_bash.py --stage sft --model_name_or_path models/chatglm3-6b --do_train --dataset self_cognition --template chatglm3 --finetuning_type lora --lora_target query_key_value --output_dir output/chatglm3_sft_lora_self/ --overwrite_cache --per_device_train_batch_size 4 --gradient_accumulation_steps 4 --lr_scheduler_type cosine --logging_steps 10 --save_steps 200 --learning_rate 5e-5 --num_train_epochs 100 --plot_loss --fp16
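Debugging
- Open the project in PyCharm and trace from the entry point, src/train_bash.py, stepping through the code
- Use the error log to locate the "Loading dataset {}…" message in the source
- Set a breakpoint there and single-step from that point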
Debugging showed that the Segmentation fault (core dumped) occurs inside the load_dataset() function of the datasets library. Calling that function on its own also reproduces the problem: https://github.com/hiyouga/LLaMA-Factory/blob/main/src/llmtuner/data/loader.py#L56
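A standalone reproduction along these lines might look like the sketch below. It runs the load_dataset() call in a separate interpreter with Python's faulthandler enabled, so the parent script survives the crash and the child prints a Python-level traceback to stderr before dying. The data file path and the helper names (DATA_FILE, REPRO_SNIPPET, run_isolated, describe) are assumptions for illustration and do not come from LLaMA-Factory; the exact arguments LLaMA-Factory passes to load_dataset() may differ.

```python
import signal
import subprocess
import sys

# Hypothetical path: point this at self_cognition.json in your own checkout.
DATA_FILE = "data/self_cognition.json"

# Code executed in the child interpreter. Only this part needs `datasets`.
REPRO_SNIPPET = f"""
from datasets import load_dataset
load_dataset("json", data_files={DATA_FILE!r})
print("load_dataset finished without crashing")
"""


def run_isolated(snippet):
    """Run `snippet` in a fresh interpreter with faulthandler enabled.

    `-X faulthandler` makes the child dump a Python traceback to stderr
    when it receives a fatal signal such as SIGSEGV.
    """
    return subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", snippet],
        capture_output=True,
        text=True,
    )


def describe(proc):
    """Summarise how the child ended (POSIX: negative code = killed by signal)."""
    if proc.returncode < 0:
        return f"killed by {signal.Signals(-proc.returncode).name}"
    return f"exited with code {proc.returncode}"
```

Calling describe(run_isolated(REPRO_SNIPPET)) on the affected machine would be expected to report a fatal signal if the crash reproduces outside the training script, and the child's stderr (proc.stderr) would show which Python frame was active at the time.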
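I tried several fixes suggested for Segmentation fault (core dumped) errors in datasets, including upgrading the related libraries as recommended online, but none of them solved the problem.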
In the end I switched the operating system from Ubuntu 18.04, where the error occurred, to Ubuntu 20.04, and the error no longer appeared.
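Since the root cause was never identified, it can help to record what differs between the failing and the working machine. The sketch below prints the platform, Python and libc versions, plus the installed versions of a few packages. The package list is an assumption (datasets depends on pyarrow, the other two are used by the training command), and nothing here implies that any one of them was the culprit.

```python
import platform
from importlib import metadata

# Assumed list of packages worth comparing; extend as needed.
PACKAGES = ("datasets", "pyarrow", "torch", "transformers")


def environment_report(packages=PACKAGES):
    """Collect OS, Python, libc and package versions into a dict."""
    report = {
        "platform": platform.platform(),
        "python": platform.python_version(),
        "libc": " ".join(platform.libc_ver()),
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


for key, value in environment_report().items():
    print(f"{key}: {value}")
```

Running this on both systems and diffing the two outputs gives a concrete starting point for anyone who wants to narrow the problem down further than "change the OS".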