【InternLM 實戰營筆記】XTuner 大模型單卡低成本微調實戰

XTuner概述

一個大語言模型微調工具箱。由 MMRazor 和 MMDeploy 聯合開發。

支持的開源LLM (2023.11.01)

InternLM
Llama，Llama2
ChatGLM2，ChatGLM3
Qwen
Baichuan，Baichuan2
Zephyr

特色

傻瓜化：以配置文件的形式封裝了大部分微調場景，0基礎的非專業人員也能一鍵開始微調。
輕量級：對于 7B 參數量的LLM，微調所需的最小顯存僅為 8GB ：消費級顯卡，colab

微調原理

因此，你找到了一種叫 LoRA 的方法：只對玩具中的某些零件進行改動，而不是對整個玩具進行全面改動。
而 QLoRA 是 LoRA 的一種改進：如果你手里只有一把生銹的螺絲刀，也能改造你的玩具。

實踐過程

安裝

# 如果你是在 InternStudio 平臺，則從本地 clone 一個已有 pytorch 2.0.1 的環境：
/root/share/install_conda_env_internlm_base.sh xtuner0.1.9# 激活環境
conda activate xtuner0.1.9
# 進入家目錄 （~的意思是 “當前用戶的home路徑”）
cd ~
# 創建版本文件夾并進入，以跟隨本教程
mkdir xtuner019 && cd xtuner019# 拉取 0.1.9 的版本源碼
git clone -b v0.1.9  https://github.com/InternLM/xtuner
# 無法訪問github的用戶請從 gitee 拉取:
# git clone -b v0.1.9 https://gitee.com/Internlm/xtuner# 進入源碼目錄
cd xtuner# 從源碼安裝 XTuner
pip install -e '.[all]'

準備在 oasst1 數據集上微調 internlm-7b-chat

# 創建一個微調 oasst1 數據集的工作路徑，進入
mkdir ~/ft-oasst1 && cd ~/ft-oasst1

微調

XTuner 提供多個開箱即用的配置文件，用戶可以通過下列命令查看：

# 列出所有內置配置
xtuner list-cfg

拷貝一個配置文件到當前目錄：

# xtuner copy-cfg ${CONFIG_NAME} ${SAVE_PATH}

本例中

cd ~/ft-oasst1
xtuner copy-cfg internlm_chat_7b_qlora_oasst1_e3 .

模型下載

直接復制模型

ln -s /share/temp/model_repos/internlm-chat-7b ~/ft-oasst1/

數據集下載

cd ~/ft-oasst1
# ...-guanaco 后面有個空格和英文句號啊
cp -r /root/share/temp/datasets/openassistant-guanaco .

修改配置文件

cd ~/ft-oasst1
vim internlm_chat_7b_qlora_oasst1_e3_copy.py

開始微調

xtuner train ./internlm_chat_7b_qlora_oasst1_e3_copy.py

將得到的 PTH 模型轉換為 HuggingFace 模型，即：生成 Adapter 文件夾

mkdir hf
export MKL_SERVICE_FORCE_INTEL=1
export MKL_THREADING_LAYER=GNU
xtuner convert pth_to_hf ./internlm_chat_7b_qlora_oasst1_e3_copy.py ./work_dirs/internlm_chat_7b_qlora_oasst1_e3_copy/epoch_1.pth ./hf

部署與測試

將 HuggingFace adapter 合并到大語言模型

xtuner convert merge ./internlm-chat-7b ./hf ./merged --max-shard-size 2GB

與合并后的模型對話

xtuner chat ./merged --prompt-template internlm_chat

修改 cli_demo.py 中的模型路徑

- model_name_or_path = "/root/model/Shanghai_AI_Laboratory/internlm-chat-7b"
+ model_name_or_path = "merged"

運行

python ./cli_demo.py

自定義微調

基于 InternLM-chat-7B 模型，用 MedQA 數據集進行微調，將其往醫學問答領域對齊。

數據集

https://github.com/abachaa/Medication_QA_MedInfo2019

準備工作

將數據轉為 XTuner 的數據格式(jsonL
目標格式：(.jsonL)

[{"conversation":[{"system": "xxx","input": "xxx","output": "xxx"}]
},
{"conversation":[{"system": "xxx","input": "xxx","output": "xxx"}]
}]

通過 python 腳本：將 .xlsx 中的問題和回答兩列提取出來，再放入 .jsonL 文件的每個 conversation 的 input 和 output 中。
這一步的 python 腳本可以請 ChatGPT 來完成。

Write a python file for me. using openpyxl. input file name is MedQA2019.xlsx
Step1: The input file is .xlsx. Exact the column A and column D in the sheet named "DrugQA" .
Step2: Put each value in column A into each "input" of each "conversation". Put each value in column D into each "output" of each "conversation".
Step3: The output file is .jsonL. It looks like:
[{"conversation":[{"system": "xxx","input": "xxx","output": "xxx"}]
},
{"conversation":[{"system": "xxx","input": "xxx","output": "xxx"}]
}]
Step4: All "system" value changes to "You are a professional, highly experienced doctor professor. You always provide accurate, comprehensive, and detailed answers based on the patients' questions."

生成的代碼如下:

import openpyxl
import jsondef process_excel_to_json(input_file, output_file):# Load the workbookwb = openpyxl.load_workbook(input_file)# Select the "DrugQA" sheetsheet = wb["DrugQA"]# Initialize the output data structureoutput_data = []# Iterate through each row in column A and Dfor row in sheet.iter_rows(min_row=2, max_col=4, values_only=True):system_value = "You are a professional, highly experienced doctor professor. You always provide accurate, comprehensive, and detailed answers based on the patients' questions."# Create the conversation dictionaryconversation = {"system": system_value,"input": row[0],"output": row[3]}# Append the conversation to the output dataoutput_data.append({"conversation": [conversation]})# Write the output data to a JSON filewith open(output_file, 'w', encoding='utf-8') as json_file:json.dump(output_data, json_file, indent=4)print(f"Conversion complete. Output written to {output_file}")# Replace 'MedQA2019.xlsx' and 'output.jsonl' with your actual input and output file names
process_excel_to_json('MedQA2019.xlsx', 'output.jsonl')

執行腳本：

python xlsx2jsonl.py

劃分訓練集和測試集

my .jsonL file looks like:
[{"conversation":[{"system": "xxx","input": "xxx","output": "xxx"}]
},
{"conversation":[{"system": "xxx","input": "xxx","output": "xxx"}]
}]
Step1, read the .jsonL file.
Step2, count the amount of the "conversation" elements.
Step3, randomly split all "conversation" elements by 7:3. Targeted structure is same as the input.
Step4, save the 7/10 part as train.jsonl. save the 3/10 part as test.jsonl

代碼如下：

import json
import randomdef split_conversations(input_file, train_output_file, test_output_file):# Read the input JSONL filewith open(input_file, 'r', encoding='utf-8') as jsonl_file:data = json.load(jsonl_file)# Count the number of conversation elementsnum_conversations = len(data)# Shuffle the data randomlyrandom.shuffle(data)random.shuffle(data)random.shuffle(data)# Calculate the split points for train and testsplit_point = int(num_conversations * 0.7)# Split the data into train and testtrain_data = data[:split_point]test_data = data[split_point:]# Write the train data to a new JSONL filewith open(train_output_file, 'w', encoding='utf-8') as train_jsonl_file:json.dump(train_data, train_jsonl_file, indent=4)# Write the test data to a new JSONL filewith open(test_output_file, 'w', encoding='utf-8') as test_jsonl_file:json.dump(test_data, test_jsonl_file, indent=4)print(f"Split complete. Train data written to {train_output_file}, Test data written to {test_output_file}")# Replace 'input.jsonl', 'train.jsonl', and 'test.jsonl' with your actual file names
split_conversations('MedQA2019-structured.jsonl', 'MedQA2019-structured-train.jsonl', 'MedQA2019-structured-test.jsonl')

開始自定義微調

此時，我們重新建一個文件夾來玩“微調自定義數據集”

mkdir ~/ft-medqa && cd ~/ft-medqa

把前面下載好的internlm-chat-7b模型文件夾拷貝過來。

cp -r ~/ft-oasst1/internlm-chat-7b .

別忘了把自定義數據集，即幾個 .jsonL，也傳到服務器上。

git clone https://github.com/InternLM/tutorial
cp ~/tutorial/xtuner/MedQA2019-structured-train.jsonl .

準備配置文件

# 復制配置文件到當前目錄
xtuner copy-cfg internlm_chat_7b_qlora_oasst1_e3 .
# 改個文件名
mv internlm_chat_7b_qlora_oasst1_e3_copy.py internlm_chat_7b_qlora_medqa2019_e3.py# 修改配置文件內容
vim internlm_chat_7b_qlora_medqa2019_e3.py

減號代表要刪除的行，加號代表要增加的行。

# 修改import部分
- from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
+ from xtuner.dataset.map_fns import template_map_fn_factory# 修改模型為本地路徑
- pretrained_model_name_or_path = 'internlm/internlm-chat-7b'
+ pretrained_model_name_or_path = './internlm-chat-7b'# 修改訓練數據為 MedQA2019-structured-train.jsonl 路徑
- data_path = 'timdettmers/openassistant-guanaco'
+ data_path = 'MedQA2019-structured-train.jsonl'# 修改 train_dataset 對象
train_dataset = dict(type=process_hf_dataset,
-   dataset=dict(type=load_dataset, path=data_path),
+   dataset=dict(type=load_dataset, path='json', data_files=dict(train=data_path)),tokenizer=tokenizer,max_length=max_length,
-   dataset_map_fn=alpaca_map_fn,
+   dataset_map_fn=None,template_map_fn=dict(type=template_map_fn_factory, template=prompt_template),remove_unused_columns=True,shuffle_before_pack=True,pack_to_max_length=pack_to_max_length)

啟動

xtuner train internlm_chat_7b_qlora_medqa2019_e3.py --deepspeed deepspeed_zero2

用 MS-Agent 數據集賦予 LLM 以 Agent 能力

MSAgent 數據集每條樣本包含一個對話列表（conversations），其里面包含了 system、user、assistant 三種字段。其中：

system: 表示給模型前置的人設輸入，其中有告訴模型如何調用插件以及生成請求
user: 表示用戶的輸入 prompt，分為兩種，通用生成的prompt和調用插件需求的 prompt
assistant: 為模型的回復。其中會包括插件調用代碼和執行代碼，調用代碼是要 LLM 生成的，而執行代碼是調用服務來生成結果的

微調

xtuner 是從國內的 ModelScope 平臺下載 MS-Agent 數據集，因此不用提前手動下載數據集文件。

# 準備工作
mkdir ~/ft-msagent && cd ~/ft-msagent
cp -r ~/ft-oasst1/internlm-chat-7b .# 查看配置文件
xtuner list-cfg | grep msagent# 復制配置文件到當前目錄
xtuner copy-cfg internlm_7b_qlora_msagent_react_e3_gpu8 .# 修改配置文件中的模型為本地路徑
vim ./internlm_7b_qlora_msagent_react_e3_gpu8_copy.py

- pretrained_model_name_or_path = 'internlm/internlm-chat-7b'
+ pretrained_model_name_or_path = './internlm-chat-7b'

開始微調

xtuner train ./internlm_7b_qlora_msagent_react_e3_gpu8_copy.py --deepspeed deepspeed_zero2

使用

由于 msagent 的訓練非常費時，大家如果想盡快把這個教程跟完，可以直接從 modelScope 拉取咱們已經微調好了的 Adapter。如下演示。

下載 Adapter

cd ~/ft-msagent
apt install git git-lfs
git lfs install
git lfs clone https://www.modelscope.cn/xtuner/internlm-7b-qlora-msagent-react.git

OK，現在目錄應該長這樣：

internlm_7b_qlora_msagent_react_e3_gpu8_copy.py
internlm-7b-qlora-msagent-react
internlm-chat-7b
work_dir（可有可無）`

有了這個在 msagent 上訓練得到的Adapter，模型現在已經有 agent 能力了！就可以加 --lagent 以調用來自 lagent 的代理功能了！

添加 serper 環境變量

開始 chat 之前，還要加個 serper 的環境變量：

去 serper.dev 免費注冊一個賬號，生成自己的 api key。這個東西是用來給 lagent 去獲取 google 搜索的結果的。等于是 serper.dev 幫你去訪問 google，而不是從你自己本地去訪問 google 了。

添加 serper api key 到環境變量：

export SERPER_API_KEY=你的API_KEY

啟動

xtuner chat ./internlm-chat-7b --adapter internlm-7b-qlora-msagent-react --lagent

報錯處理
xtuner chat 增加 --lagent 參數后，報錯

TypeError: transfomers.modelsauto.auto factory. BaseAutoModelClass.from pretrained() got multiple values for keyword argument "trust remote code"

注釋掉已安裝包中的代碼：

vim /root/xtuner019/xtuner/xtuner/tools/chat.py

作業

conda create --name personal_assistant --clone=/root/share/conda_envs/internlm-base
# 如果在其他平臺：
# conda create --name personal_assistant python=3.10 -y# 激活環境
conda activate personal_assistant
# 進入家目錄 （~的意思是 “當前用戶的home路徑”）
cd ~
# 創建版本文件夾并進入，以跟隨本教程
# personal_assistant用于存放本教程所使用的東西
mkdir /root/personal_assistant && cd /root/personal_assistant
mkdir /root/personal_assistant/xtuner019 && cd /root/personal_assistant/xtuner019# 拉取 0.1.9 的版本源碼
git clone -b v0.1.9  https://github.com/InternLM/xtuner
# 無法訪問github的用戶請從 gitee 拉取:
# git clone -b v0.1.9 https://gitee.com/Internlm/xtuner# 進入源碼目錄
cd xtuner# 從源碼安裝 XTuner
pip install -e '.[all]'

數據準備

mkdir -p /root/personal_assistant/data && cd /root/personal_assistant/data

在data目錄下創建一個json文件personal_assistant.json作為本次微調所使用的數據集。json中內容可參考下方(復制粘貼n次做數據增廣，數據量小無法有效微調，下面僅用于展示格式，下面也有生成腳本)

其中conversation表示一次對話的內容，input為輸入，即用戶會問的問題，output為輸出，即想要模型回答的答案。

以下是一個python腳本，用于生成數據集。在data目錄下新建一個generate_data.py文件，將以下代碼復制進去，然后運行該腳本即可生成數據集。

import json# 輸入你的名字
name = 'Shengshenlan'
# 重復次數
n = 10000data = [{"conversation": [{"input": "請做一下自我介紹","output": "我是{}的小助手，內在是上海AI實驗室書生·浦語的7B大模型哦".format(name)}]}
]for i in range(n):data.append(data[0])with open('personal_assistant.json', 'w', encoding='utf-8') as f:json.dump(data, f, ensure_ascii=False, indent=4)

配置準備

mkdir -p /root/personal_assistant/model/Shanghai_AI_Laboratory
cp -r /root/share/temp/model_repos/internlm-chat-7b /root/personal_assistant/model/Shanghai_AI_Laboratory#創建用于存放配置的文件夾config并進入
mkdir /root/personal_assistant/config && cd /root/personal_assistant/configxtuner copy-cfg internlm_chat_7b_qlora_oasst1_e3 .

修改拷貝后的文件internlm_chat_7b_qlora_oasst1_e3_copy.py

# Copyright (c) OpenMMLab. All rights reserved.
import torch
from bitsandbytes.optim import PagedAdamW32bit
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR
from peft import LoraConfig
from transformers import (AutoModelForCausalLM, AutoTokenizer,BitsAndBytesConfig)from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
from xtuner.engine import DatasetInfoHook, EvaluateChatHook
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE#######################################################################
#                          PART 1  Settings                           #
#######################################################################
# Model
pretrained_model_name_or_path = '/root/personal_assistant/model/Shanghai_AI_Laboratory/internlm-chat-7b'# Data
data_path = '/root/personal_assistant/data/personal_assistant.json'
prompt_template = PROMPT_TEMPLATE.internlm_chat
max_length = 512
pack_to_max_length = True# Scheduler & Optimizer
batch_size = 2  # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = PagedAdamW32bit
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1  # grad clip# Evaluate the generation performance during the training
evaluation_freq = 90
SYSTEM = ''
evaluation_inputs = [ '請介紹一下你自己', '請做一下自我介紹' ]#######################################################################
#                      PART 2  Model & Tokenizer                      #
#######################################################################
tokenizer = dict(type=AutoTokenizer.from_pretrained,pretrained_model_name_or_path=pretrained_model_name_or_path,trust_remote_code=True,padding_side='right')model = dict(type=SupervisedFinetune,llm=dict(type=AutoModelForCausalLM.from_pretrained,pretrained_model_name_or_path=pretrained_model_name_or_path,trust_remote_code=True,torch_dtype=torch.float16,quantization_config=dict(type=BitsAndBytesConfig,load_in_4bit=True,load_in_8bit=False,llm_int8_threshold=6.0,llm_int8_has_fp16_weight=False,bnb_4bit_compute_dtype=torch.float16,bnb_4bit_use_double_quant=True,bnb_4bit_quant_type='nf4')),lora=dict(type=LoraConfig,r=64,lora_alpha=16,lora_dropout=0.1,bias='none',task_type='CAUSAL_LM'))#######################################################################
#                      PART 3  Dataset & Dataloader                   #
#######################################################################
train_dataset = dict(type=process_hf_dataset,dataset=dict(type=load_dataset, path='json', data_files=dict(train=data_path)),tokenizer=tokenizer,max_length=max_length,dataset_map_fn=None,template_map_fn=dict(type=template_map_fn_factory, template=prompt_template),remove_unused_columns=True,shuffle_before_pack=True,pack_to_max_length=pack_to_max_length)train_dataloader = dict(batch_size=batch_size,num_workers=dataloader_num_workers,dataset=train_dataset,sampler=dict(type=DefaultSampler, shuffle=True),collate_fn=dict(type=default_collate_fn))#######################################################################
#                    PART 4  Scheduler & Optimizer                    #
#######################################################################
# optimizer
optim_wrapper = dict(type=AmpOptimWrapper,optimizer=dict(type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),accumulative_counts=accumulative_counts,loss_scale='dynamic',dtype='float16')# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md  # noqa: E501
param_scheduler = dict(type=CosineAnnealingLR,eta_min=0.0,by_epoch=True,T_max=max_epochs,convert_to_iter_based=True)# train, val, test setting
train_cfg = dict(by_epoch=True, max_epochs=max_epochs, val_interval=1)#######################################################################
#                           PART 5  Runtime                           #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [dict(type=DatasetInfoHook, tokenizer=tokenizer),dict(type=EvaluateChatHook,tokenizer=tokenizer,every_n_iters=evaluation_freq,evaluation_inputs=evaluation_inputs,system=SYSTEM,prompt_template=prompt_template)
]# configure default hooks
default_hooks = dict(# record the time of every iteration.timer=dict(type=IterTimerHook),# print log every 100 iterations.logger=dict(type=LoggerHook, interval=10),# enable the parameter scheduler.param_scheduler=dict(type=ParamSchedulerHook),# save checkpoint per epoch.checkpoint=dict(type=CheckpointHook, interval=1),# set sampler seed in distributed evrionment.sampler_seed=dict(type=DistSamplerSeedHook),
)# configure environment
env_cfg = dict(# whether to enable cudnn benchmarkcudnn_benchmark=False,# set multi process parametersmp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),# set distributed parametersdist_cfg=dict(backend='nccl'),
)# set visualizer
visualizer = None# set log level
log_level = 'INFO'# load from which checkpoint
load_from = None# whether to resume training from the loaded checkpoint
resume = False# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)

微調啟動

xtuner train /root/personal_assistant/config/internlm_chat_7b_qlora_oasst1_e3_copy.py

微調后參數轉換/合并

# 創建用于存放Hugging Face格式參數的hf文件夾
mkdir /root/personal_assistant/config/work_dirs/hfexport MKL_SERVICE_FORCE_INTEL=1# 配置文件存放的位置
export CONFIG_NAME_OR_PATH=/root/personal_assistant/config/internlm_chat_7b_qlora_oasst1_e3_copy.py# 模型訓練后得到的pth格式參數存放的位置
export PTH=/root/personal_assistant/config/work_dirs/internlm_chat_7b_qlora_oasst1_e3_copy/epoch_3.pth# pth文件轉換為Hugging Face格式后參數存放的位置
export SAVE_PATH=/root/personal_assistant/config/work_dirs/hf# 執行參數轉換
xtuner convert pth_to_hf $CONFIG_NAME_OR_PATH $PTH $SAVE_PATH

網頁DEMO

安裝依賴

pip install streamlit==1.24.0

下載項目

# 創建code文件夾用于存放InternLM項目代碼
mkdir /root/personal_assistant/code && cd /root/personal_assistant/code
git clone https://github.com/InternLM/InternLM.git

將 /root/code/InternLM/web_demo.py 中 29 行和 33 行的模型路徑更換為Merge后存放參數的路徑 /root/personal_assistant/config/work_dirs/hf_merge

啟動

streamlit run /root/personal_assistant/code/InternLM/web_demo.py --server.address 127.0.0.1 --server.port 6006

運行效果

在這里插入圖片描述