[Machine Learning / Deep Learning] How to Fix Inconsistent Results Between LLaMA-Factory Fine-Tuning and vLLM Deployment

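Preface

1. The root of the problem

1.1 Problem description

1.2 An illustration of the problem

2. Common causes

2.2 Chat template definitions in LLaMA-Factory

2.3 Chat template definitions in the model itself

3. Solution

3.1 Extraction script mytest.py

3.2 Create the Jinja file

3.3 Install vLLM

3.4 Run vLLM

3.5 Install and run Open WebUI

4. Workflow summary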

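Preface

This article explains why a model fine-tuned in LLaMA-Factory can behave differently when tested inside LLaMA-Factory than when it is loaded into another inference framework (vLLM, LMDeploy), and how to fix it.

The main cause is a mismatch between the chat templates used on the two sides.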

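1. The root of the problem

1.1 Problem description

LLaMA-Factory formats training and inference data with its own built-in chat templates (chat_template). Inference frameworks such as vLLM and LMDeploy may use a different template system, or apply no template at all.
This is one of the core reasons why fine-tuning results and deployment results diverge.

LLaMA-Factory fine-tunes in the HuggingFace Transformers format, while vLLM at inference time depends on an exact match of:

tokenizer configuration
model config
weight format (especially with LoRA or QLoRA)
prompt template

If any of these differ between training and deployment, the generated output drifts.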

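1.2 An illustration of the problem

A training template used in LLaMA-Factory (for example, chatml):

<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant

vLLM, by contrast, may by default assume that the prompt only needs to be:

User: Hello
Assistant:

Or it may apply no template at all and simply concatenate the text. Either way, the model sees a prompt it was not trained on, and its output goes off track.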
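To make the mismatch concrete, the minimal sketch below renders the same conversation twice: once in ChatML form and once as a naive "User/Assistant" concatenation. The helpers `render_chatml` and `render_naive` are hypothetical names written for this illustration; they are not part of LLaMA-Factory or vLLM.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render messages in ChatML form (hypothetical helper for illustration)."""
    parts = []
    for message in messages:
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


def render_naive(messages):
    """Render messages as plain 'Role: text' lines, ignoring the training template."""
    labels = {"user": "User", "assistant": "Assistant"}
    # Messages with other roles (such as system) are silently dropped.
    lines = [f"{labels[m['role']]}: {m['content']}" for m in messages if m["role"] in labels]
    lines.append("Assistant:")
    return "\n".join(lines)


messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Hello"},
]
chatml_prompt = render_chatml(messages)
naive_prompt = render_naive(messages)
```

The two strings share almost no tokens apart from the user text itself, which is why a model fine-tuned on the first form can respond poorly to the second.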

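2. Common causes

2.1 Summary of causes

Category	Specific cause	Fix
Tokenizer mismatch	Information in `tokenizer_config.json` or `special_tokens_map.json` is missing or has been modified	Save the `tokenizer` together with the fine-tuned model and deploy both to vLLM
Prompt template mismatch	Templates such as `chatml`, `alpaca`, and `baichuan` format the context differently	Identify the prompt format used in training and apply the same `system prompt + user input + assistant` structure on the `vLLM` side
LoRA weights not merged	The LoRA layers were not merged after fine-tuning, so the inference side loads only the base model	Merge the adapter into the base weights first (for example with PEFT's `merge_and_unload`, or LLaMA-Factory's export step), then load the merged model
Quantization mismatch	Training used a quantization strategy such as QLoRA that the deployment side does not support or decodes incorrectly	Before deployment, export standard FP16 weights (or quantized weights in a format the deployment side supports) with LLaMA-Factory's export tool (`export_model.py` in older releases, `llamafactory-cli export` in newer ones)
vLLM loads the wrong configuration	vLLM does not read `config.json` or `generation_config.json` correctly when loading the model	Check whether parameters such as `--tokenizer` and `--trust-remote-code` are set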
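A quick way to check the first row of the table is to compare the tokenizer files of the training output with those in the directory handed to vLLM. The sketch below is one possible check, not a LLaMA-Factory or vLLM feature; the function names and the file list in `TOKENIZER_FILES` are assumptions chosen for illustration.

```python
import hashlib
from pathlib import Path

# File names commonly found next to a Hugging Face tokenizer (assumed list; adjust as needed).
TOKENIZER_FILES = ("tokenizer_config.json", "special_tokens_map.json", "tokenizer.json")


def file_digest(path):
    """Return the SHA-256 hex digest of a file, or None if the file does not exist."""
    path = Path(path)
    if not path.is_file():
        return None
    return hashlib.sha256(path.read_bytes()).hexdigest()


def diff_tokenizer_files(train_dir, deploy_dir, names=TOKENIZER_FILES):
    """List the file names whose contents differ, or that exist on one side only."""
    mismatched = []
    for name in names:
        if file_digest(Path(train_dir) / name) != file_digest(Path(deploy_dir) / name):
            mismatched.append(name)
    return mismatched
```

A reported difference is a prompt to inspect the file rather than proof of an error, since an export step may legitimately rewrite some fields.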

2.2 Chat template definitions in LLaMA-Factory

Template file: LLaMA-Factory/src/llamafactory/data/template.py

This file defines the chat template rules for each supported model and covers most mainstream, general-purpose models. Among them are several templates for Qwen models.

The `qwen` template, which produces the format shown in section 1.2, is registered as follows:

# copied from chatml template
register_template(
    name="qwen",
    format_user=StringFormatter(slots=["<|im_start|>user\n{{content}}<|im_end|>\n<|im_start|>assistant\n"]),
    format_assistant=StringFormatter(slots=["{{content}}<|im_end|>\n"]),
    format_system=StringFormatter(slots=["<|im_start|>system\n{{content}}<|im_end|>\n"]),
    format_function=FunctionFormatter(slots=["{{content}}<|im_end|>\n"], tool_format="qwen"),
    format_observation=StringFormatter(
        slots=["<|im_start|>user\n<tool_response>\n{{content}}\n</tool_response><|im_end|>\n<|im_start|>assistant\n"]
    ),
    format_tools=ToolFormatter(tool_format="qwen"),
    default_system="You are Qwen, created by Alibaba Cloud. You are a helpful assistant.",
    stop_words=["<|im_end|>"],
    replace_eos=True,
)
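The slot strings in this registration are plain text with a `{{content}}` placeholder. The sketch below imitates that substitution in a simplified way to show how the system and user slots combine into the prompt from section 1.2. `fill_slots` is a hypothetical stand-in written for this illustration, not LLaMA-Factory's actual `StringFormatter` implementation.

```python
def fill_slots(slots, content):
    """Replace the {{content}} placeholder in every slot and join the results."""
    return "".join(slot.replace("{{content}}", content) for slot in slots)


# Slot strings copied from the "qwen" registration above.
SYSTEM_SLOTS = ["<|im_start|>system\n{{content}}<|im_end|>\n"]
USER_SLOTS = ["<|im_start|>user\n{{content}}<|im_end|>\n<|im_start|>assistant\n"]

prompt = fill_slots(SYSTEM_SLOTS, "You are a helpful assistant") + fill_slots(USER_SLOTS, "Hello")
```

Note that the user slot already ends with the opening of the assistant turn, so no separate generation prompt is appended.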

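2.3 Chat template definitions in the model itself

Model chat template path: <model directory>/tokenizer_config.json

If the template built into LLaMA-Factory defines the same rules as the model's own template, the replies do not differ. In most cases, however, the two sets of rules are not identical.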
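One way to see whether the two sides agree is to read the template stored with the model and compare it with the one used in training. The sketch below assumes the template is stored as a single string under the `chat_template` key of tokenizer_config.json; some models store it elsewhere or as a list, in which case this check reports no match. The function names are hypothetical.

```python
import json
from pathlib import Path


def load_chat_template(model_dir):
    """Return the chat_template value stored in tokenizer_config.json, or None if absent."""
    config_path = Path(model_dir) / "tokenizer_config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    return config.get("chat_template")


def same_template(template_a, template_b):
    """Compare two templates, ignoring leading and trailing whitespace."""
    if not isinstance(template_a, str) or not isinstance(template_b, str):
        return False
    return template_a.strip() == template_b.strip()
```

For example, `same_template(load_chat_template(model_dir), extracted_text)` returns False whenever the deployed model would format prompts differently from the extracted training template.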

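3. Solution

Export the chat template that LLaMA-Factory used during fine-tuning, then pass that same template to the other inference framework at deployment time. With the same template on both sides, the model's output stays consistent.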

3.1 Extraction script mytest.py

# mytest.py
import os
import sys

# Add the LLaMA-Factory "src" directory to the Python path so that the
# llamafactory package can be imported even without an editable install.
# This file lives in src/llamafactory/data/, so go up three levels.
src_dir = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
sys.path.append(src_dir)

from llamafactory.data.template import TEMPLATES
from transformers import AutoTokenizer

# 1. Initialize the tokenizer (any supported tokenizer works)
tokenizer = AutoTokenizer.from_pretrained("/mnt/workspace/model/qwen/Qwen2.5-7B-Instruct")

# 2. Get the template object
template_name = "qwen"  # replace with the name of the template you want to inspect
template = TEMPLATES[template_name]

# 3. Fix the tokenizer's Jinja template
template.fix_jinja_template(tokenizer)

# 4. Print the template in Jinja format
print("=" * 40)
print(f"Template [{template_name}] in Jinja format:")
print("=" * 40)
print(tokenizer.chat_template)

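[What to replace]

# 1. Initialize the tokenizer (any supported tokenizer works)
tokenizer = AutoTokenizer.from_pretrained("/mnt/workspace/model/qwen/Qwen2.5-7B-Instruct")

▲ /mnt/workspace/model/qwen/Qwen2.5-7B-Instruct: the model path. The directory must contain the tokenizer files (such as tokenizer.json); a downloaded model ships with them.

template_name = "qwen"  # replace with the name of the template you want to inspect

▲ qwen: the name of a built-in chat template in LLaMA-Factory/src/llamafactory/data/template.py; choose the one that matches your model.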

[Usage]

1. Location: LLaMA-Factory/src/llamafactory/data/mytest.py

2. Create and activate a virtual environment

# Create the virtual environment llamafactory
conda create -n llamafactory python=3.11
# Activate the virtual environment
conda activate llamafactory

3. Install dependencies

Install the dependencies inside the virtual environment:

# Base dependency
pip install transformers
# Required dependencies (for LLaMA-Factory)
pip install jinja2
pip install torch  # PyTorch
# Install LLaMA-Factory and its dependencies (installing from source is recommended)
git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e .  # editable install
# Optional but recommended dependencies (for full functionality)
pip install datasets accelerate sentencepiece

4. Run the extraction script mytest.py

python /mnt/workspace/LLaMA-Factory/src/llamafactory/data/mytest.py

▲ /mnt/workspace/LLaMA-Factory/src/llamafactory/data/mytest.py: the path to mytest.py

5. Output

========================================
Template [qwen] in Jinja format:
========================================
{%- if tools %}{{- '<|im_start|>system\n' }}{%- if messages[0]['role'] == 'system' %}{{- messages[0]['content'] }}{%- else %}{{- 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' }}{%- endif %}{{- "\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}{%- for tool in tools %}{{- "\n" }}{{- tool | tojson }}{%- endfor %}{{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
{%- else %}{%- if messages[0]['role'] == 'system' %}{{- '<|im_start|>system\n' + messages[0]['content'] + '<|im_end|>\n' }}{%- else %}{{- '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n' }}{%- endif %}
{%- endif %}
{%- for message in messages %}{%- if (message.role == "user") or (message.role == "system" and not loop.first) or (message.role == "assistant" and not message.tool_calls) %}{{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}{%- elif message.role == "assistant" %}{{- '<|im_start|>' + message.role }}{%- if message.content %}{{- '\n' + message.content }}{%- endif %}{%- for tool_call in message.tool_calls %}{%- if tool_call.function is defined %}{%- set tool_call = tool_call.function %}{%- endif %}{{- '\n<tool_call>\n{"name": "' }}{{- tool_call.name }}{{- '", "arguments": ' }}{{- tool_call.arguments | tojson }}{{- '}\n</tool_call>' }}{%- endfor %}{{- '<|im_end|>\n' }}{%- elif message.role == "tool" %}{%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != "tool") %}{{- '<|im_start|>user' }}{%- endif %}{{- '\n<tool_response>\n' }}{{- message.content }}{{- '\n</tool_response>' }}{%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}{{- '<|im_end|>\n' }}{%- endif %}{%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}{{- '<|im_start|>assistant\n' }}
{%- endif %}

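3.2 Create the Jinja file

Create a .jinja file and paste the template text from the script output into it, leaving out the three banner lines at the top.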
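If copying by hand is error-prone, the same step can be scripted. The sketch below assumes the output of mytest.py has been captured as a string (for example by redirecting it to a file and reading it back) and that it begins with the three banner lines the script prints. `strip_banner` and `write_template` are hypothetical helper names.

```python
from pathlib import Path


def strip_banner(script_output):
    """Drop the three banner lines printed before the template, if present."""
    lines = script_output.split("\n")
    # The script prints a row of '=', a title line, then another row of '='.
    if len(lines) >= 3 and set(lines[0]) == {"="} and set(lines[2]) == {"="}:
        lines = lines[3:]
    text = "\n".join(lines)
    # Remove only the single trailing newline that print() appended.
    return text[:-1] if text.endswith("\n") else text


def write_template(script_output, target_path):
    """Write the banner-free template text to target_path and return the path."""
    target = Path(target_path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(strip_banner(script_output), encoding="utf-8")
    return target
```

Writing the file programmatically avoids accidental edits to whitespace or quotes, which matter inside a Jinja template.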

3.3 Install vLLM

Reference article: 【VLLM】大模型本地化部署_vllm部署大模型-CSDN博客

Section "三、部署VLLM" (Part 3, deploying vLLM) of that article covers everything needed here.

3.4 Run vLLM

vllm serve <model> --chat-template ./path-to-chat-template.jinja

▲ <model>: the path to the trained model;

▲ ./path-to-chat-template.jinja: the path to the Jinja file;

Example

vllm serve /mnt/workspace/model/qwen_7b --chat-template /mnt/workspace/model/jinja2/qwen.jinja
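Once the server is running, the deployment can be checked by sending the same question that was used for testing in LLaMA-Factory. The sketch below builds a request for vLLM's OpenAI-compatible chat endpoint using only the standard library. It assumes the default port 8000 and that the model is served under the path passed to `vllm serve` (that is, no custom served model name was set). `build_chat_request` and `send` are hypothetical helper names.

```python
import json
import urllib.request


def build_chat_request(base_url, model, messages, temperature=0.0):
    """Build a POST request for the OpenAI-compatible /chat/completions endpoint."""
    payload = {"model": model, "messages": messages, "temperature": temperature}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request to a running server and return the reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["choices"][0]["message"]["content"]


request = build_chat_request(
    "http://127.0.0.1:8000/v1",
    "/mnt/workspace/model/qwen_7b",  # assumed to match the path given to `vllm serve`
    [{"role": "user", "content": "Hello"}],
)
# print(send(request))  # uncomment once `vllm serve` is running
```

Setting the temperature to 0 makes the comparison with the LLaMA-Factory output easier, since sampling noise would otherwise mask template-related differences.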

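3.5 Install and run Open WebUI

# Create a conda environment for open-webui (Python 3.11 is assumed here, matching the environment above)
conda create -n open-webui python=3.11
# Switch to the new environment
conda activate open-webui
# Install open-webui
pip install -U open-webui torch transformers
# Configuration
export HF_ENDPOINT=https://hf-mirror.com
# open-webui targets Ollama by default, so set this to false when the model is served by vLLM
export ENABLE_OLLAMA_API=false
# Address of the model API; vLLM listens on port 8000 by default
export OPENAI_API_BASE_URL=http://127.0.0.1:8000/v1
# Start open-webui
open-webui serve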

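4. Workflow summary

For the full deployment procedure, see: 【VLLM】open-webui部署模型全流程-CSDN博客

1. Download the model locally (optional): choose a model that suits your use case;

2. Install vLLM: use an isolated virtual environment. Installation guide: 【VLLM】大模型本地化部署_vllm部署大模型-CSDN博客;

3. Install open-webui: use an isolated virtual environment;

4. Run vLLM and open-webui:

Method 1: in the vLLM environment, serve the downloaded local model

vllm serve <model path>

Method 2: in the vLLM environment, serve the local model exported after fine-tuning

vllm serve /mnt/workspace/model/qwen_7b --chat-template /mnt/workspace/model/jinja2/qwen.jinja

▲ /mnt/workspace/model/qwen_7b: the path to the model exported after fine-tuning;

▲ /mnt/workspace/model/jinja2/qwen.jinja: the path to the built-in chat template extracted from LLaMA-Factory

With the chat template exported from LLaMA-Factory passed to vLLM and the open-webui service started, the model's replies under vLLM match what was observed when testing in LLaMA-Factory.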