Using XTuner to Fine-tune InternLM2.5-7B-Chat into Your Own Personal Assistant
- 1 Environment Setup and Data Preparation
  - Step 0. Create a Python 3.10 virtual environment with conda
  - Step 1. Install XTuner
- Modify the Provided Data
  - Step 0. Create a new folder for the fine-tuning data
  - Step 1. Create the modification script
  - Step 2. Run the script
  - Step 3. Inspect the data
- Launch Training
  - Step 0. Copy the model
  - Step 1. Modify the Config
  - Step 2. Start fine-tuning
  - Step 3. Weight conversion
  - Step 4. Model merging
- Chat with the Model via WebUI

Reference: InternLM Training Camp (書生訓練營)
XTuner documentation link
1 Environment Setup and Data Preparation
Step 0. Create a Python 3.10 virtual environment with conda
cd ~
# git clone this repo
git clone https://github.com/InternLM/Tutorial.git -b camp4
mkdir -p /root/finetune && cd /root/finetune
conda create -n xtuner-env python=3.10 -y
conda activate xtuner-env
Step 1. Install XTuner
cd /root/Tutorial/docs/L1/XTuner
pip install -r requirements.txt
pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu121
# requirements.txt
accelerate==0.27.0
addict==2.4.0
aiohttp==3.9.3
aiosignal==1.3.1
aliyun-python-sdk-core==2.14.0
aliyun-python-sdk-kms==2.16.2
altair==5.2.0
annotated-types==0.6.0
anyio==4.2.0
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
arrow==1.3.0
arxiv==2.1.0
asttokens==2.4.1
async-lru==2.0.4
async-timeout==4.0.3
attrs==23.2.0
Babel==2.14.0
beautifulsoup4==4.12.3
bitsandbytes==0.42.0
bleach==6.1.0
blinker==1.7.0
cachetools==5.3.2
certifi==2024.2.2
cffi==1.16.0
charset-normalizer==3.3.2
click==8.1.7
colorama==0.4.6
comm==0.2.1
contourpy==1.2.0
crcmod==1.7
cryptography==42.0.2
cycler==0.12.1
datasets==2.17.0
debugpy==1.8.1
decorator==5.1.1
deepspeed==0.14.4
defusedxml==0.7.1
dill==0.3.8
distro==1.9.0
einops==0.8.0
einx==0.3.0
et-xmlfile==1.1.0
exceptiongroup==1.2.0
executing==2.0.1
fastapi==0.112.0
fastjsonschema==2.19.1
feedparser==6.0.10
filelock==3.14.0
fonttools==4.48.1
fqdn==1.5.1
frozendict==2.4.4
frozenlist==1.4.1
fsspec==2023.10.0
func-timeout==4.3.5
gast==0.5.4
gitdb==4.0.11
GitPython==3.1.41
google-search-results==2.4.2
griffe==0.40.1
h11==0.14.0
hjson==3.1.0
httpcore==1.0.3
httpx==0.26.0
huggingface-hub==0.24.2
idna==3.6
imageio==2.34.2
importlib-metadata==7.0.1
ipykernel==6.29.2
ipython==8.21.0
ipywidgets==8.1.2
isoduration==20.11.0
jedi==0.19.1
Jinja2==3.1.3
jmespath==0.10.0
json5==0.9.14
jsonpointer==2.4
jsonschema==4.21.1
jsonschema-specifications==2023.12.1
kiwisolver==1.4.5
lagent==0.2.1
lazy_loader==0.4
llvmlite==0.43.0
lxml==5.1.0
markdown-it-py==3.0.0
MarkupSafe==2.1.5
matplotlib==3.8.2
matplotlib-inline==0.1.6
mdurl==0.1.2
mistune==3.0.2
mmengine==0.10.3
modelscope==1.12.0
mpi4py_mpich==3.1.5
mpmath==1.3.0
multidict==6.0.5
multiprocess==0.70.16
nbclient==0.9.0
nbconvert==7.16.0
nbformat==5.9.2
nest-asyncio==1.6.0
networkx==3.2.1
ninja==1.11.1.1
notebook==7.0.8
notebook_shim==0.2.3
numba==0.60.0
numpy==1.26.4
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.19.3
nvidia-nvjitlink-cu12==12.3.101
nvidia-nvtx-cu12==12.1.105
openai==1.12.0
opencv-python==4.9.0.80
openpyxl==3.1.2
oss2==2.17.0
overrides==7.7.0
packaging==24.1
pandas==2.2.0
pandocfilters==1.5.1
parso==0.8.3
peft==0.8.2
pexpect==4.9.0
phx-class-registry==4.1.0
pillow==10.2.0
platformdirs==4.2.0
prometheus-client==0.19.0
prompt-toolkit==3.0.43
protobuf==4.25.2
psutil==5.9.8
ptyprocess==0.7.0
pure-eval==0.2.2
py-cpuinfo==9.0.0
pyarrow==15.0.0
pyarrow-hotfix==0.6
pybase16384==0.3.7
pycparser==2.21
pycryptodome==3.20.0
pydantic==2.6.1
pydantic_core==2.16.2
pydeck==0.8.1b0
Pygments==2.17.2
pynvml==11.5.0
pyparsing==3.1.1
python-dateutil==2.8.2
python-json-logger==2.0.7
python-pptx==0.6.23
PyYAML==6.0.1
pyzmq==25.1.2
qtconsole==5.5.1
QtPy==2.4.1
referencing==0.33.0
regex==2023.12.25
rfc3339-validator==0.1.4
rfc3986-validator==0.1.1
rich==13.4.2
rpds-py==0.17.1
safetensors==0.4.2
scikit-image==0.24.0
scipy==1.12.0
seaborn==0.13.2
Send2Trash==1.8.2
sentencepiece==0.1.99
sgmllib3k==1.0.0
simplejson==3.19.2
six==1.16.0
smmap==5.0.1
sniffio==1.3.0
sortedcontainers==2.4.0
soupsieve==2.5
stack-data==0.6.3
starlette==0.37.2
sympy==1.12
tenacity==8.2.3
termcolor==2.4.0
terminado==0.18.0
tifffile==2024.7.24
tiktoken==0.6.0
timeout-decorator==0.5.0
tinycss2==1.2.1
tokenizers==0.15.2
toml==0.10.2
tomli==2.0.1
toolz==0.12.1
torch==2.2.1
torchvision==0.17.1
tornado==6.4
tqdm==4.65.2
traitlets==5.14.1
transformers==4.39.0
transformers-stream-generator==0.0.4
triton==2.2.0
types-python-dateutil==2.8.19.20240106
typing_extensions==4.9.0
tzdata==2024.1
tzlocal==5.2
uri-template==1.3.0
urllib3==1.26.18
uvicorn==0.30.6
validators==0.22.0
watchdog==4.0.0
wcwidth==0.2.13
webcolors==1.13
webencodings==0.5.1
websocket-client==1.7.0
widgetsnbextension==4.0.10
XlsxWriter==3.1.9
xtuner==0.1.23
xxhash==3.4.1
yapf==0.40.2
yarl==1.9.4
zipp==3.17.0
Verify the installation:
xtuner list-cfg
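If the install succeeded, this prints XTuner's list of built-in configs. As an extra check, the snippet below is a minimal sketch, assuming the xtuner package exposes __version__ (as its releases normally do); run it inside the xtuner-env environment:

# Sanity check: confirm the package version and GPU visibility.
import torch
import xtuner

print(xtuner.__version__)         # expected: 0.1.23, per requirements.txt
print(torch.cuda.is_available())  # expected: True on a GPU development machine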
Modify the Provided Data
Step 0. Create a new folder for the fine-tuning data
mkdir -p /root/finetune/data && cd /root/finetune/data
cp -r /root/Tutorial/data/assistant_Tuner.jsonl /root/finetune/data
Step 1. Create the modification script
# Create the `change_script.py` file
touch /root/finetune/data/change_script.py
import json
import argparse
from tqdm import tqdm

def process_line(line, old_text, new_text):
    # Parse the JSON line
    data = json.loads(line)

    # Recursive helper to walk nested dicts and lists
    def replace_text(obj):
        if isinstance(obj, dict):
            return {k: replace_text(v) for k, v in obj.items()}
        elif isinstance(obj, list):
            return [replace_text(item) for item in obj]
        elif isinstance(obj, str):
            return obj.replace(old_text, new_text)
        else:
            return obj

    # Process the whole JSON object
    processed_data = replace_text(data)

    # Serialize the processed object back to a JSON string
    return json.dumps(processed_data, ensure_ascii=False)

def main(input_file, output_file, old_text, new_text):
    with open(input_file, 'r', encoding='utf-8') as infile, \
         open(output_file, 'w', encoding='utf-8') as outfile:
        # Count total lines for the progress bar
        total_lines = sum(1 for _ in infile)
        infile.seek(0)  # Reset the file pointer to the beginning

        # Create a progress bar with tqdm
        for line in tqdm(infile, total=total_lines, desc="Processing"):
            processed_line = process_line(line.strip(), old_text, new_text)
            outfile.write(processed_line + '\n')

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Replace text in a JSONL file.")
    parser.add_argument("input_file", help="Input JSONL file to process")
    parser.add_argument("output_file", help="Output file for processed JSONL")
    parser.add_argument("--old_text", default="尖米", help="Text to be replaced")
    parser.add_argument("--new_text", default="聞星", help="Text to replace with")
    args = parser.parse_args()

    main(args.input_file, args.output_file, args.old_text, args.new_text)
Then make one edit: open change_script.py and change default="聞星" in the --new_text argument to your own name.
Step 2. Run the script
# usage: python change_script.py {input_file.jsonl} {output_file.jsonl}
cd ~/finetune/data
python change_script.py ./assistant_Tuner.jsonl ./assistant_Tuner_change.jsonl
Step 3. Inspect the data
cat assistant_Tuner_change.jsonl | head -n 3
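Each output line should be one JSON object with your name substituted in. To look at a record more comfortably, here is a small sketch (the exact field names depend on assistant_Tuner.jsonl; since the config later sets dataset_map_fn=None, the records should already follow XTuner's conversation layout):

# Pretty-print the first record of the converted dataset.
import json

with open('/root/finetune/data/assistant_Tuner_change.jsonl', encoding='utf-8') as f:
    first = json.loads(f.readline())
print(json.dumps(first, ensure_ascii=False, indent=2))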
Launch Training
Step 0. Copy the model
The InternStudio development machine already provides the base model for fine-tuning, so a soft link is all that is needed.
The model is located at /root/share/new_models/Shanghai_AI_Laboratory/internlm2_5-7b-chat
mkdir /root/finetune/models
ln -s /root/share/new_models/Shanghai_AI_Laboratory/internlm2_5-7b-chat /root/finetune/models/internlm2_5-7b-chat
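The soft link costs no extra disk space. A quick sketch to confirm it resolves to a complete HuggingFace model directory:

# Confirm the symlink points at real model files.
import os

path = '/root/finetune/models/internlm2_5-7b-chat'
print(os.path.realpath(path))        # should resolve to the share directory
print(sorted(os.listdir(path))[:5])  # expect config.json, tokenizer files, weight shards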
Step 1. Modify the Config
# cd {path/to/finetune}
cd /root/finetune
mkdir ./config
cd config
xtuner copy-cfg internlm2_5_chat_7b_qlora_alpaca_e3 ./
Edit the config file as follows:
#######################################################################
#                          PART 1  Settings                           #
#######################################################################
- pretrained_model_name_or_path = 'internlm/internlm2_5-7b-chat'
+ pretrained_model_name_or_path = '/root/finetune/models/internlm2_5-7b-chat'

- alpaca_en_path = 'tatsu-lab/alpaca'
+ alpaca_en_path = '/root/finetune/data/assistant_Tuner_change.jsonl'

evaluation_inputs = [
-    '請給我介紹五個上海的景點', 'Please tell me five scenic spots in Shanghai'
+    '請介紹一下你自己', 'Please introduce yourself'
]

#######################################################################
#                      PART 3  Dataset & Dataloader                   #
#######################################################################
alpaca_en = dict(
    type=process_hf_dataset,
-   dataset=dict(type=load_dataset, path=alpaca_en_path),
+   dataset=dict(type=load_dataset, path='json', data_files=dict(train=alpaca_en_path)),
    tokenizer=tokenizer,
    max_length=max_length,
-   dataset_map_fn=alpaca_map_fn,
+   dataset_map_fn=None,
    template_map_fn=dict(
        type=template_map_fn_factory, template=prompt_template),
    remove_unused_columns=True,
    shuffle_before_pack=True,
    pack_to_max_length=pack_to_max_length,
    use_varlen_attn=use_varlen_attn)
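These dataset edits work because HuggingFace datasets ships a built-in 'json' loader that understands JSONL files, and dataset_map_fn becomes None because the converted records are already in XTuner's expected layout, so no Alpaca-style remapping is needed. A standalone sketch of what the new dataset entry does at load time:

# Standalone equivalent of the edited `dataset` entry in the config.
from datasets import load_dataset

ds = load_dataset(
    'json',
    data_files=dict(train='/root/finetune/data/assistant_Tuner_change.jsonl'))
print(ds)              # a DatasetDict with a single 'train' split
print(ds['train'][0])  # the first training record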
After the edits, the full config looks like this:
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
                            LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig)

from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
                                 VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE

#######################################################################
#                          PART 1  Settings                           #
#######################################################################
# Model
pretrained_model_name_or_path = '/root/finetune/models/internlm2_5-7b-chat'
use_varlen_attn = False

# Data
alpaca_en_path = '/root/finetune/data/assistant_Tuner_change.jsonl'
prompt_template = PROMPT_TEMPLATE.internlm2_chat
max_length = 2048
pack_to_max_length = True

# parallel
sequence_parallel_size = 1

# Scheduler & Optimizer
batch_size = 1  # per_device
accumulative_counts = 1
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1  # grad clip
warmup_ratio = 0.03

# Save
save_steps = 500
save_total_limit = 2  # Maximum checkpoints to keep (-1 means unlimited)

# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
    '請介紹一下你自己', 'Please introduce yourself'
]

#######################################################################
#                      PART 2  Model & Tokenizer                      #
#######################################################################
tokenizer = dict(
    type=AutoTokenizer.from_pretrained,
    pretrained_model_name_or_path=pretrained_model_name_or_path,
    trust_remote_code=True,
    padding_side='right')

model = dict(
    type=SupervisedFinetune,
    use_varlen_attn=use_varlen_attn,
    llm=dict(
        type=AutoModelForCausalLM.from_pretrained,
        pretrained_model_name_or_path=pretrained_model_name_or_path,
        trust_remote_code=True,
        torch_dtype=torch.float16,
        quantization_config=dict(
            type=BitsAndBytesConfig,
            load_in_4bit=True,
            load_in_8bit=False,
            llm_int8_threshold=6.0,
            llm_int8_has_fp16_weight=False,
            bnb_4bit_compute_dtype=torch.float16,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type='nf4')),
    lora=dict(
        type=LoraConfig,
        r=64,
        lora_alpha=16,
        lora_dropout=0.1,
        bias='none',
        task_type='CAUSAL_LM'))

#######################################################################
#                      PART 3  Dataset & Dataloader                   #
#######################################################################
alpaca_en = dict(
    type=process_hf_dataset,
    dataset=dict(type=load_dataset, path='json', data_files=dict(train=alpaca_en_path)),
    tokenizer=tokenizer,
    max_length=max_length,
    dataset_map_fn=None,
    template_map_fn=dict(
        type=template_map_fn_factory, template=prompt_template),
    remove_unused_columns=True,
    shuffle_before_pack=True,
    pack_to_max_length=pack_to_max_length,
    use_varlen_attn=use_varlen_attn)

sampler = SequenceParallelSampler \
    if sequence_parallel_size > 1 else DefaultSampler

train_dataloader = dict(
    batch_size=batch_size,
    num_workers=dataloader_num_workers,
    dataset=alpaca_en,
    sampler=dict(type=sampler, shuffle=True),
    collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))

#######################################################################
#                    PART 4  Scheduler & Optimizer                    #
#######################################################################
# optimizer
optim_wrapper = dict(
    type=AmpOptimWrapper,
    optimizer=dict(
        type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
    clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
    accumulative_counts=accumulative_counts,
    loss_scale='dynamic',
    dtype='float16')

# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md  # noqa: E501
param_scheduler = [
    dict(
        type=LinearLR,
        start_factor=1e-5,
        by_epoch=True,
        begin=0,
        end=warmup_ratio * max_epochs,
        convert_to_iter_based=True),
    dict(
        type=CosineAnnealingLR,
        eta_min=0.0,
        by_epoch=True,
        begin=warmup_ratio * max_epochs,
        end=max_epochs,
        convert_to_iter_based=True)
]

# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)

#######################################################################
#                           PART 5  Runtime                           #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
    dict(type=DatasetInfoHook, tokenizer=tokenizer),
    dict(
        type=EvaluateChatHook,
        tokenizer=tokenizer,
        every_n_iters=evaluation_freq,
        evaluation_inputs=evaluation_inputs,
        system=SYSTEM,
        prompt_template=prompt_template)
]

if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]

# configure default hooks
default_hooks = dict(
    # record the time of every iteration.
    timer=dict(type=IterTimerHook),
    # print log every 10 iterations.
    logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
    # enable the parameter scheduler.
    param_scheduler=dict(type=ParamSchedulerHook),
    # save checkpoint per `save_steps`.
    checkpoint=dict(
        type=CheckpointHook,
        by_epoch=False,
        interval=save_steps,
        max_keep_ckpts=save_total_limit),
    # set sampler seed in distributed environment.
    sampler_seed=dict(type=DistSamplerSeedHook),
)

# configure environment
env_cfg = dict(
    # whether to enable cudnn benchmark
    cudnn_benchmark=False,
    # set multi process parameters
    mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
    # set distributed parameters
    dist_cfg=dict(backend='nccl'),
)

# set visualizer
visualizer = None

# set log level
log_level = 'INFO'

# load from which checkpoint
load_from = None

# whether to resume training from the loaded checkpoint
resume = False

# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)

# set log processor
log_processor = dict(by_epoch=False)
Step 2. Start Fine-tuning
cd /root/finetune
conda activate xtuner-env

xtuner train ./config/internlm2_5_chat_7b_qlora_alpaca_e3_copy.py --deepspeed deepspeed_zero2 --work-dir ./work_dirs/assistTuner
The xtuner train command launches the fine-tuning run.
It takes one required argument, CONFIG, which specifies the fine-tuning configuration file; here we use the modified internlm2_5_chat_7b_qlora_alpaca_e3_copy.py.
Everything produced during training (logs, the config, checkpoints, the fine-tuned weights) is saved under the work_dirs directory by default; add --work-dir to choose a different location.
--deepspeed enables DeepSpeed, which reduces GPU memory usage.
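Once training is underway, a checkpoint is written every save_steps iterations under the work dir. The snippet below is a minimal sketch to list them, assuming the default iter_*.pth naming (with DeepSpeed these entries may be directories rather than single files):

# List checkpoints saved so far, oldest first.
from pathlib import Path

work_dir = Path('/root/finetune/work_dirs/assistTuner')
for ckpt in sorted(work_dir.glob('*.pth'), key=lambda p: p.stat().st_mtime):
    print(ckpt.name)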
Step 3. Weight Conversion
Weight conversion simply turns the weight file produced by PyTorch training into the now-standard HuggingFace format. XTuner does this in one step with the xtuner convert pth_to_hf command.
The command takes three arguments:
CONFIG: the fine-tuning config file,
PATH_TO_PTH_MODEL: the path to the fine-tuned weight file to be converted,
SAVE_PATH_TO_HF_MODEL: the save path for the converted HuggingFace-format files.
Optional flags:
--fp32: convert at fp32 precision; if omitted, fp16 is used by default
--max-shard-size {GB}: maximum size of each weight shard (default 2GB)
cd /root/finetune/work_dirs/assistTuner
conda activate xtuner-env

# First grab the most recently saved .pth checkpoint
pth_file=`ls -t /root/finetune/work_dirs/assistTuner/*.pth | head -n 1 | sed 's/:$//'`
export MKL_SERVICE_FORCE_INTEL=1
export MKL_THREADING_LAYER=GNU
xtuner convert pth_to_hf ./internlm2_5_chat_7b_qlora_alpaca_e3_copy.py ${pth_file} ./hf
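Since this run used QLoRA, the exported ./hf directory should hold a LoRA adapter in PEFT format rather than a full set of model weights; treat the exact file names as an assumption and take a quick look at what was produced:

# Inspect the conversion output; for a QLoRA run, expect PEFT adapter files
# (e.g. adapter_config.json), not full model weights. File names here are an
# assumption; adjust to what you actually see.
import os

print(sorted(os.listdir('/root/finetune/work_dirs/assistTuner/hf')))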
Step 4. Model Merging
A model fine-tuned with LoRA or QLoRA is not a complete model on its own; it is an extra Adapter layer, and after training this layer must be merged back into the base model before it can be used normally.
Fully fine-tuned (full) models skip this step: full fine-tuning updates the base weights directly instead of training a separate Adapter, so there is nothing to merge.
XTuner provides a one-step merge command, xtuner convert merge. Before using it, prepare three paths: the base model, the trained Adapter (after format conversion), and the final save location.
The xtuner convert merge command takes three arguments: LLM, the base model path; ADAPTER, the Adapter path; and SAVE_PATH, where to save the merged model.
Flag reference:
--max-shard-size {GB}: maximum size of each weight shard (default 2GB)
--device {device_name}: one of cuda, cpu, or auto; defaults to cuda, i.e. run the merge on the GPU
--is-clip: add this flag only when the model being merged is a CLIP model; omit it otherwise
cd /root/finetune/work_dirs/assistTuner
conda activate xtuner-env

export MKL_SERVICE_FORCE_INTEL=1
export MKL_THREADING_LAYER=GNU
xtuner convert merge /root/finetune/models/internlm2_5-7b-chat ./hf ./merged --max-shard-size 2GB
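Before wiring up the WebUI, you can smoke-test the merged model directly with transformers. This is a minimal sketch, assuming a CUDA device with room for the fp16 7B weights and that the InternLM2.5 remote code exposes its usual chat() helper:

# Quick smoke test of the merged model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = '/root/finetune/work_dirs/assistTuner/merged'
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_dir, torch_dtype=torch.float16, trust_remote_code=True).cuda().eval()

# InternLM chat models provide a chat() helper via trust_remote_code.
response, history = model.chat(tokenizer, '請介紹一下你自己', history=[])
print(response)  # should answer with the identity you trained in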
Chat with the Model via WebUI
cd ~/Tutorial/tools/L1_XTuner_code
# Edit line 18 of the xtuner_streamlit_demo.py script directly
- model_name_or_path = "Shanghai_AI_Laboratory/internlm2_5-7b-chat"
+ model_name_or_path = "/root/finetune/work_dirs/assistTuner/merged"

conda activate xtuner-env
pip install streamlit==1.31.0
streamlit run /root/Tutorial/tools/L1_XTuner_code/xtuner_streamlit_demo.py
# Once it is running, make sure the port mapping is still alive; if it has dropped, set it up again
ssh -CNg -L 8501:127.0.0.1:8501 root@ssh.intern-ai.org.cn -p *****
Finally, open http://127.0.0.1:8501 in your browser to chat with the model.