References:
SwanLab入門深度學習:Qwen3大模型指令微調 - 肖祥 - 博客園
vLLM:讓大語言模型推理更高效的新一代引擎 —— 原理詳解一_vllm 原理-CSDN博客
Overview
For a multi-label text classification task with 100+ labels, gpt-4o was initially called with tuned prompts, which achieved fairly high accuracy and accumulated a large corpus. That corpus was then used to fine-tune a model of our own and deploy it as a service, replacing the GPT API calls. Since only a single 16 GB GPU is available, the small Qwen3-1.7B model was chosen for instruction fine-tuning, and the 100+ labels were split into two sub-tasks, each with its own LoRA adapter. By loading the appropriate adapter at inference time, the full 100+-label multi-label classification task is covered.
Environment
python 3.12
torch 2.7.1+cu128
transformers 4.54.0
vllm 0.10.0
Implementation
For a quick implementation, the code from the reference article was reused with only minor adjustments, plus a metric-evaluation part and a vLLM-inference part. The project structure is as follows:
project/
├── train.py            # model training
├── predict.py          # transformers inference / prediction
├── predict_vllm.py     # vLLM inference / prediction
└── evaluate.py         # metric evaluation
- Model training
# train.py
import json
import os

import pandas as pd
import torch
from datasets import Dataset
from modelscope import snapshot_download, AutoTokenizer
from swanlab.integration.huggingface import SwanLabCallback
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer, DataCollatorForSeq2Seq
import swanlab


def dataset_jsonl_transfer(origin_path, new_path):
    """Convert the raw dataset into the instruction format required for fine-tuning."""
    messages = []
    # Read the original JSONL file
    with open(origin_path, "r", encoding="utf-8") as file:
        for line in file:
            # Parse each line as JSON
            data = json.loads(line)
            context = data["text"]
            catagory = data["category"]
            label = data["output"]
            message = {
                "instruction": "你是一個文本分類領域的專家,你會接收到一段文本和幾個潛在的分類選項,請輸出文本內容的正確類型",
                "input": f"文本:{context},類型選型:{catagory}",
                "output": str(label),
            }
            messages.append(message)
    # Save the restructured JSONL file
    with open(new_path, "w", encoding="utf-8") as file:
        for message in messages:
            file.write(json.dumps(message, ensure_ascii=False) + "\n")


def process_func(example):
    """Preprocess one example into input_ids / attention_mask / labels."""
    MAX_LENGTH = 800
    instruction = tokenizer(
        f"<|im_start|>system\n你是一個文本分類領域的專家,你會接收到一段文本和幾個潛在的分類選項,請輸出文本內容的正確類型<|im_end|>\n"
        f"<|im_start|>user\n{example['input']}<|im_end|>\n<|im_start|>assistant\n",
        add_special_tokens=False,
    )
    response = tokenizer(f"{example['output']}", add_special_tokens=False)
    input_ids = instruction["input_ids"] + response["input_ids"] + [tokenizer.pad_token_id]
    attention_mask = instruction["attention_mask"] + response["attention_mask"] + [1]
    # Build labels (loss is only computed on the answer part):
    # the prompt part is set to -100 so it is ignored by the loss
    labels = [-100] * len(instruction["input_ids"]) + response["input_ids"] + [tokenizer.pad_token_id]
    # To keep the labels intact, the truncation strategy could still be improved,
    # e.g. a priority-based truncation that keeps the first and last sentences
    if len(input_ids) > MAX_LENGTH:  # simple truncation
        input_ids = input_ids[:MAX_LENGTH]
        attention_mask = attention_mask[:MAX_LENGTH]
        labels = labels[:MAX_LENGTH]
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


def predict(messages, model, tokenizer):
    device = "cuda"
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    model_inputs = tokenizer([text], return_tensors="pt").to(device)
    generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512)
    generated_ids = [
        output_ids[len(input_ids):]
        for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]
    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    return response


# Download the Qwen model from ModelScope into a local directory
# model_dir = snapshot_download("qwen/Qwen2-1.5B-Instruct", cache_dir="./", revision="master")
model_name = "/home/emma/.cache/modelscope/hub/models/Qwen/Qwen3-1.7B"

# Load the model weights with Transformers
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype=torch.bfloat16)
model.enable_input_require_grads()  # required when gradient checkpointing is enabled

# Load and convert the training and test sets
train_dataset_path = "./multi_classification_data/train.jsonl"  # "./zh_cls_fudan-news/train.jsonl"
test_dataset_path = "./multi_classification_data/test.jsonl"    # "./zh_cls_fudan-news/test.jsonl"
train_jsonl_new_path = "new_train.jsonl"
test_jsonl_new_path = "new_test.jsonl"

if not os.path.exists(train_jsonl_new_path):
    dataset_jsonl_transfer(train_dataset_path, train_jsonl_new_path)
if not os.path.exists(test_jsonl_new_path):
    dataset_jsonl_transfer(test_dataset_path, test_jsonl_new_path)

# Build the training dataset
train_df = pd.read_json(train_jsonl_new_path, lines=True)
train_ds = Dataset.from_pandas(train_df)
train_dataset = train_ds.map(process_func, remove_columns=train_ds.column_names)

config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    inference_mode=False,  # training mode
    r=32,                  # LoRA rank
    lora_alpha=64,         # LoRA alpha, see the LoRA paper for details
    lora_dropout=0.1,      # dropout ratio
)
model = get_peft_model(model, config)

# Print trainable parameter statistics
print("=== PEFT模型參數統計 ===")
model.print_trainable_parameters()

args = TrainingArguments(
    output_dir="./output/Qwen3-cls_wanle_r64",  # "./output/Qwen3-zh_cls_fudan-news"
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    logging_steps=10,
    num_train_epochs=2,
    save_steps=100,
    learning_rate=1e-4,
    save_on_each_node=True,
    gradient_checkpointing=True,
    report_to="none",
    dataloader_drop_last=True,  # important setting
)

swanlab_callback = SwanLabCallback(
    project="Qwen3-fintune",
    experiment_name="Qwen3-1.7B-r32",
    description="使用通義千問Qwen3-1.7B模型在cls_wanle數據集上微調。",
    config={
        "model": "Qwen/Qwen3-1.7B",
        "dataset": "ly/cls_wanle_cls50",
    },
)

# Start fine-tuning
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, padding=True),
    callbacks=[swanlab_callback],
)
trainer.train()

# Save the model (adapter) and the tokenizer
output_dir = "./output/Qwen3-cls_wanle_r32"  # "./output/Qwen3-zh_cls_fudan-news"
model.save_pretrained(output_dir, save_config=True)
tokenizer.save_pretrained(output_dir)

# Sanity-check the model on the first 10 test samples
test_df = pd.read_json(test_jsonl_new_path, lines=True)[:10]

test_text_list = []
for index, row in test_df.iterrows():
    instruction = row["instruction"]
    input_value = row["input"]
    messages = [
        {"role": "system", "content": f"{instruction}"},
        {"role": "user", "content": f"{input_value}"},
    ]
    response = predict(messages, model, tokenizer)
    messages.append({"role": "assistant", "content": f"{response}"})
    result_text = f"{messages[0]}\n\n{messages[1]}\n\n{messages[2]}"
    test_text_list.append(swanlab.Text(result_text, caption=response))

swanlab.log({"Prediction": test_text_list})
swanlab.finish()
- Inference / prediction
# predict.py — inference / prediction
import json

import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Path of the fine-tuned model (the directory the adapter was saved to)
model_path = "./output/Qwen3-cls_wanle_r8"  # replace with your own save path

# Load the fine-tuned model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False, trust_remote_code=True, model_type="qwen")
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto", torch_dtype=torch.bfloat16,
                                             trust_remote_code=True, model_type="qwen")
device = model.device

# Switch to evaluation mode
model.eval()


def predict(model, input_text):
    # Build the prompt
    messages = [
        {"role": "system", "content": '你是一個文本分類領域的專家,你會接收到一段文本和幾個潛在的分類選項,請輸出文本內容的正確類型。'
                                      '請只輸出分類結果,不要包含任何其他內容。'},
        {"role": "user", "content": input_text},
    ]
    # Encode the input
    input_ids = tokenizer.apply_chat_template(messages, tokenize=True, return_tensors="pt").to(model.device)
    # Generate the response
    outputs = model.generate(
        input_ids,
        max_new_tokens=30,    # maximum number of newly generated tokens
        do_sample=False,      # disable sampling; with num_beams=5 this is deterministic beam search
        temperature=0.7,      # temperature; ignored when do_sample=False
        num_beams=5,          # beam width of 5
        early_stopping=True,  # enable early stopping
    )
    # Decode the generated response
    response = tokenizer.decode(outputs[0], skip_special_tokens=True).strip()
    # Take the last line as the final classification result
    last_line = response.split("\n")[-1].strip()
    final_result = last_line
    # print("\nFinal classification result:", final_result)
    return final_result


# Load the test set
test_jsonl_new_path = "new_test.jsonl"
test_df = pd.read_json(test_jsonl_new_path, lines=True)
predict_path = "new_test_res.jsonl"

# Write the predictions to a JSONL file
with open(predict_path, "w", encoding="utf-8") as file:
    for input_text, output_text in zip(test_df["input"], test_df["output"]):
        pred = predict(model, input_text)
        res = {"input": input_text, "output": output_text, "prediction": pred}
        print(output_text)
        print(pred)
        file.write(json.dumps(res, ensure_ascii=False) + "\n")
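The timing figures quoted below came from a simple timing wrapper that is not part of the listing above; a minimal sketch of what it could look like (the "vllm speed" label is just the print format reused from the vLLM script later on, even though this run uses transformers):

import time

start = time.time()
with open(predict_path, "w", encoding="utf-8") as file:
    for input_text, output_text in zip(test_df["input"], test_df["output"]):
        res = {"input": input_text, "output": output_text, "prediction": predict(model, input_text)}
        file.write(json.dumps(res, ensure_ascii=False) + "\n")
end = time.time()

print(f"predict time: {end - start:.2f} seconds")
print(f"total requests: {len(test_df)}")
# the label reads "vllm speed" only because the same print format is reused for the vLLM run
print(f"vllm speed: {len(test_df) / (end - start):.2f} samples/second")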
Inference speed:
predict time: 392.27 seconds
total requests: 1119
vllm speed: 2.85 samples/second
Accelerating inference with vLLM
vLLM inference is far faster than inference with transformers.
# predict_vllm.py — inference / prediction with vLLM
import json
import time

import pandas as pd
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Model loading
base_path = "/home/emma/.cache/modelscope/hub/models/Qwen/Qwen3-1.7B"
lora_path1 = "./output/Qwen3-cls_wanle_r8"  # adapter path

# Create the engine
llm = LLM(
    model=base_path,
    enable_lora=True,
    max_model_len=2048,
    dtype="auto",
    gpu_memory_utilization=0.7,  # default is 0.9
    # max_lora_rank=64
)
tokenizer = AutoTokenizer.from_pretrained(base_path)
print("base model load success!")

# Define the LoRA request
# lora_name="adapter_v1" is a custom name; lora_int_id=1 is a unique integer ID; lora_path=lora_path1 is the local adapter path
lora_request1 = LoRARequest("adapter_v1", 1, lora_path=lora_path1)

# Sampling parameters for generation
sampling_params = SamplingParams(
    max_tokens=20,  # may need more room
    temperature=0.0,
    # repetition_penalty=1.2
)


# Single-sample inference
def predict_single(prompt):
    # Build the prompt text with the chat template
    temp_prompts = [
        tokenizer.apply_chat_template(
            [
                {"role": "system", "content": '你是一個文本分類領域的專家,你會接收到一段文本和幾個潛在的分類選項,請輸出文本內容的正確類型。請只輸出分類結果,不要包含任何其他內容。'},
                {"role": "user", "content": prompt},
            ],
            tokenize=False, add_generation_prompt=True, enable_thinking=False,
        )
    ]
    # Pass the LoRA request when calling generate
    outputs = llm.generate(sampling_params=sampling_params, prompts=temp_prompts, lora_request=lora_request1)
    # Extract the generated text
    generated_text = outputs[0].outputs[0].text
    return generated_text


# Batch inference
def predict_batch(prompts):
    # Build the prompt texts with the chat template
    temp_prompts = [
        tokenizer.apply_chat_template(
            [
                {"role": "system", "content": '你是一個文本分類領域的專家,你會接收到一段文本和幾個潛在的分類選項,請輸出文本內容的正確類型。請只輸出分類結果,不要包含任何其他內容。'},
                {"role": "user", "content": prompt},
            ],
            tokenize=False, add_generation_prompt=True, enable_thinking=False,
        )
        for prompt in prompts
    ]
    # Pass the LoRA request when calling generate
    outputs = llm.generate(sampling_params=sampling_params, prompts=temp_prompts, lora_request=lora_request1)
    return outputs


# Load the test set
test_jsonl_new_path = "new_test.jsonl"
test_df = pd.read_json(test_jsonl_new_path, lines=True)
predict_path = "new_test_res_vllm.jsonl"

# Build the inputs
inputs, labels = [], []
for input_text, output_text in zip(test_df["input"], test_df["output"]):
    inputs.append(input_text)
    labels.append(output_text)

# Timing
start = time.time()
outputs = predict_batch(inputs)
end = time.time()
print(f"predict time: {end - start:.2f} seconds")
print(f"total requests: {len(inputs)}")
print(f"vllm speed: {len(inputs) / (end - start):.2f} samples/second")

# Save the predictions
with open(predict_path, "w", encoding="utf-8") as file:
    for input_text, label, output in zip(inputs, labels, outputs):
        pred = output.outputs[0].text
        if not pred:
            pred = "其他"  # fall back to the catch-all label when generation is empty
        res = {"input": input_text, "output": label, "prediction": pred}
        file.write(json.dumps(res, ensure_ascii=False) + "\n")
Logs
INFO 07-31 16:00:04 [config.py:2434] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 07-31 16:00:04 [core.py:572] Waiting for init message from front-end.
INFO 07-31 16:00:04 [core.py:71] Initializing a V1 LLM engine (v0.10.0) with config: model='/home/emma/.cache/modelscope/hub/models/Qwen/Qwen3-1.7B', speculative_config=None, tokenizer='/home/emma/.cache/modelscope/hub/models/Qwen/Qwen3-1.7B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/home/emma/.cache/modelscope/hub/models/Qwen/Qwen3-1.7B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":512,"local_cache_dir":null}
INFO 07-31 16:00:05 [parallel_state.py:1102] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
WARNING 07-31 16:00:05 [topk_topp_sampler.py:59] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
INFO 07-31 16:00:05 [gpu_model_runner.py:1843] Starting to load model /home/emma/.cache/modelscope/hub/models/Qwen/Qwen3-1.7B...
INFO 07-31 16:00:05 [gpu_model_runner.py:1875] Loading model from scratch...
INFO 07-31 16:00:05 [cuda.py:290] Using Flash Attention backend on V1 engine.
The log line above shows that the Flash Attention backend on the V1 engine is being used for attention computation.
GPU memory usage analysis:
INFO 07-31 10:45:03 [default_loader.py:262] Loading weights took 0.41 seconds
INFO 07-31 10:45:03 [punica_selector.py:19] Using PunicaWrapperGPU.
INFO 07-31 10:45:03 [gpu_model_runner.py:1892] Model loading took 3.2480 GiB and 0.509056 seconds
INFO 07-31 10:45:09 [backends.py:530] Using cache directory: /home/emma/.cache/vllm/torch_compile_cache/7fe05d07ef/rank_0_0/backbone for vLLM's torch.compile
INFO 07-31 10:45:09 [backends.py:541] Dynamo bytecode transform time: 5.43 s
INFO 07-31 10:45:12 [backends.py:161] Directly load the compiled graph(s) for dynamic shape from the cache, took 3.176 s
INFO 07-31 10:45:14 [monitor.py:34] torch.compile takes 5.43 s in total
INFO 07-31 10:45:15 [gpu_worker.py:255] Available KV cache memory: 6.15 GiB
INFO 07-31 10:45:15 [kv_cache_utils.py:833] GPU KV cache size: 57,600 tokens
INFO 07-31 10:45:15 [kv_cache_utils.py:837] Maximum concurrency for 2,048 tokens per request: 28.12x
INFO 07-31 10:45:22 [gpu_model_runner.py:2485] Graph capturing finished in 7 secs, took 0.84 GiB
INFO 07-31 10:45:22 [core.py:193] init engine (profile, create kv cache, warmup model) took 18.89 seconds
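The reported figures are consistent with each other: 6.15 GiB of KV cache holds 57,600 tokens, and 57,600 / 2,048 ≈ 28.12, which is exactly the reported maximum concurrency for 2,048-token requests.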
Inference speed details:
Adding requests: 100%|██████████| 1119/1119 [00:00<00:00, 1554.98it/s]
Processed prompts: 100%|██████████| 1119/1119 [00:25<00:00, 44.37it/s, est. speed input: 13767.59 toks/s, output: 244.55 toks/s]
predict time: 25.99 seconds
total requests: 1119
vllm speed: 43.05 samples/second
Inference speed comparison
| Metric     | transformers | vLLM          | Speedup |
| Throughput | 2.85 req/s   | 43.05 req/s   | ~15x    |
The end-to-end speedup is about 15x. A follow-up experiment showed that roughly 5x of it comes from switching to greedy decoding (i.e. beam search with a beam width of 1) and roughly 3x from vLLM itself; the two factors multiply to about 5 × 3 = 15x.
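A rough sketch of how the decoding-strategy contribution can be measured on its own, assuming the model and tokenizer objects from predict.py above and a list sample_inputs of test texts (time_generate is a hypothetical helper, not part of the original scripts):

import time

def time_generate(num_beams, sample_inputs):
    # Time generation over a fixed set of inputs with a given beam width
    start = time.time()
    for text in sample_inputs:
        input_ids = tokenizer.apply_chat_template(
            [{"role": "user", "content": text}], tokenize=True, return_tensors="pt"
        ).to(model.device)
        model.generate(input_ids, max_new_tokens=20, do_sample=False, num_beams=num_beams)
    return time.time() - start

t_beam = time_generate(5, sample_inputs)    # beam search, beam width 5
t_greedy = time_generate(1, sample_inputs)  # greedy decoding (beam width 1)
print(f"beam/greedy speedup: {t_beam / t_greedy:.1f}x")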
Metric evaluation
This part of the code was generated with Qwen; it evaluates classification performance after removing the '其他' (other) label.
# evaluate.py — metric evaluation
import json
from collections import defaultdict


def calculate_metrics_from_jsonl(file_path: str) -> dict:
    """Read results from a JSONL file and compute per-label precision and recall
    for the multi-label classification task.

    Args:
        file_path: path to the JSONL file
    Returns:
        dict with precision, recall, F1 and support for every label
    """
    # Per-label statistics
    tag_stats = defaultdict(lambda: {
        'tp': 0,       # true positives
        'fp': 0,       # false positives
        'fn': 0,       # false negatives
        'support': 0,  # number of samples that actually contain the label
    })
    total_samples = 0
    error_count = 0
    categories = ['三輪車游覽', '叢林飛躍', '主題公園', '乘雪橇', '體育場館游覽', '公共假期', '其他', '冬', '沖浪',
                  '動物園游覽', '博物館體驗', '歷史文化類體驗', '古城古鎮之旅', '品酒之旅', '嘟嘟車游覽', '坐船游覽',
                  '城堡之旅', '城市漫步', '城市騎行', '夏', '大眾文化類體驗', '寺廟教堂', '展覽館體驗', '工廠參觀之旅',
                  '帆傘', '帆船', '建築類體驗', '當地洗浴', '當地特色', '徒步', '快艇尾波沖浪', '戲劇演出', '戶外觀光',
                  '摩托艇', '攀巖', '旅拍', '時尚之旅', '春', '服飾體驗', '桑拿', '模擬飛行器', '水上樂園',
                  '水上飛機游覽', '水族館游覽', '水療', '水翼船', '沙漠游覽', '泡溫泉', '洞穴探秘', '浮潛']

    # Read the JSONL file line by line
    with open(file_path, 'r', encoding='utf-8') as f:
        for line_num, line in enumerate(f, 1):
            if not line.strip():
                continue
            try:
                item = json.loads(line.strip())
                total_samples += 1
                # Parse the ground-truth labels
                true_labels = item.get('output', [])
                if isinstance(true_labels, str):
                    # Stored as a string such as "['其他']"
                    true_labels = eval(true_labels) if true_labels.startswith('[') else [true_labels]
                # Parse the predicted labels: keep every known category that appears in the prediction text
                pred_labels = item.get('prediction', '')
                pred_labels = [x for x in categories if x in pred_labels]
                # Convert to sets
                true_labels_set = set(true_labels)
                pred_labels_set = set(pred_labels)
                # Update per-label counts for this sample
                all_labels = true_labels_set.union(pred_labels_set)
                for label in all_labels:
                    if label in true_labels_set and label in pred_labels_set:
                        tag_stats[label]['tp'] += 1   # true positive
                    elif label not in true_labels_set and label in pred_labels_set:
                        tag_stats[label]['fp'] += 1   # false positive
                    elif label in true_labels_set and label not in pred_labels_set:
                        tag_stats[label]['fn'] += 1   # false negative
                    # Count samples that actually contain the label
                    if label in true_labels_set:
                        tag_stats[label]['support'] += 1
            except Exception as e:
                error_count += 1
                print(f"第{line_num}行處理錯誤: {e}")
                continue

    print(f"總共處理了 {total_samples} 個樣本,錯誤 {error_count} 個")

    # Compute precision / recall / F1 for each label
    metrics = {}
    for label, stats in tag_stats.items():
        tp, fp, fn, support = stats['tp'], stats['fp'], stats['fn'], stats['support']
        # precision = TP / (TP + FP)
        precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
        # recall = TP / (TP + FN)
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
        # F1 score
        f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0.0
        metrics[label] = {'precision': precision, 'recall': recall, 'f1': f1,
                          'support': support, 'tp': tp, 'fp': fp, 'fn': fn}
    return metrics


def print_detailed_metrics(metrics: dict):
    """Print detailed per-label metrics."""
    if not metrics:
        print("沒有找到有效的標簽數據")
        return
    print(f"\n{'標簽':<35} {'準確率':<10} {'召回率':<10} {'F1分數':<10} {'支持數':<8} {'TP':<6} {'FP':<6} {'FN':<6}")
    print("=" * 110)
    # Sort by support, descending
    sorted_items = sorted(metrics.items(), key=lambda x: x[1]['support'], reverse=True)
    for label, stats in sorted_items:
        print(f"{label:<35} {stats['precision']:<10.4f} {stats['recall']:<10.4f} "
              f"{stats['f1']:<10.4f} {stats['support']:<8} {stats['tp']:<6} {stats['fp']:<6} {stats['fn']:<6}")


def print_summary_metrics(metrics: dict):
    """Print summary per-label metrics."""
    if not metrics:
        print("沒有找到有效的標簽數據")
        return
    print(f"\n{'標簽':<35} {'準確率':<10} {'召回率':<10} {'支持數':<8}")
    print("-" * 70)
    sorted_items = sorted(metrics.items(), key=lambda x: x[1]['support'], reverse=True)
    for label, stats in sorted_items:
        print(f"{label:<35} {stats['precision']:<10.4f} {stats['recall']:<10.4f} {stats['support']:<8}")


def calculate_overall_metrics(metrics: dict) -> dict:
    """Compute overall metrics (macro and micro averages)."""
    if not metrics:
        return {}
    # Macro averages
    macro_precision = sum(stats['precision'] for stats in metrics.values()) / len(metrics)
    macro_recall = sum(stats['recall'] for stats in metrics.values()) / len(metrics)
    macro_f1 = sum(stats['f1'] for stats in metrics.values()) / len(metrics)
    # Micro averages
    total_tp = sum(stats['tp'] for stats in metrics.values())
    total_fp = sum(stats['fp'] for stats in metrics.values())
    total_fn = sum(stats['fn'] for stats in metrics.values())
    micro_precision = total_tp / (total_tp + total_fp) if (total_tp + total_fp) > 0 else 0.0
    micro_recall = total_tp / (total_tp + total_fn) if (total_tp + total_fn) > 0 else 0.0
    micro_f1 = (2 * (micro_precision * micro_recall) / (micro_precision + micro_recall)
                if (micro_precision + micro_recall) > 0 else 0.0)
    return {'macro_precision': macro_precision, 'macro_recall': macro_recall, 'macro_f1': macro_f1,
            'micro_precision': micro_precision, 'micro_recall': micro_recall, 'micro_f1': micro_f1}


def save_metrics_to_csv(metrics: dict, output_file: str):
    """Save the per-label metrics as CSV."""
    with open(output_file, 'w', encoding='utf-8') as f:
        f.write("標簽,準確率,召回率,F1分數,支持數,TP,FP,FN\n")
        for label, stats in sorted(metrics.items(), key=lambda x: x[1]['support'], reverse=True):
            f.write(f'"{label}",{stats["precision"]:.4f},{stats["recall"]:.4f},'
                    f'{stats["f1"]:.4f},{stats["support"]},{stats["tp"]},{stats["fp"]},{stats["fn"]}\n')
    print(f"\n指標結果已保存到: {output_file}")


def main():
    # Replace with your actual file paths
    # input_file = "new_test_res.jsonl"
    input_file = "new_test_res_vllm.jsonl"
    output_csv = "metrics_result_vllm.csv"
    try:
        print("正在計算指標...")
        metrics = calculate_metrics_from_jsonl(input_file)
        if not metrics:
            print("未找到任何有效數據")
            return
        print("\n=== 詳細指標結果 ===")
        print_detailed_metrics(metrics)
        print("\n=== 摘要指標結果 ===")
        print_summary_metrics(metrics)
        print("\n=== 整體指標 ===")
        overall_metrics = calculate_overall_metrics(metrics)
        print(f"宏平均 - 準確率: {overall_metrics['macro_precision']:.4f}, "
              f"召回率: {overall_metrics['macro_recall']:.4f}, "
              f"F1: {overall_metrics['macro_f1']:.4f}")
        print(f"微平均 - 準確率: {overall_metrics['micro_precision']:.4f}, "
              f"召回率: {overall_metrics['micro_recall']:.4f}, "
              f"F1: {overall_metrics['micro_f1']:.4f}")
        # Save the results
        save_metrics_to_csv(metrics, output_csv)
    except FileNotFoundError:
        print(f"錯誤: 找不到文件 {input_file}")
    except Exception as e:
        print(f"處理過程中出現錯誤: {e}")


def quick_analysis(file_path: str):
    """Quick analysis helper."""
    metrics = calculate_metrics_from_jsonl(file_path)
    print_summary_metrics(metrics)
    return metrics


if __name__ == "__main__":
    # Usage 1: full analysis
    main()
    # Usage 2: quick analysis (uncomment the line below)
    # quick_analysis("your_file.jsonl")
Results before and after fine-tuning
Instruction fine-tuning was done on 50 labels; the '其他' (other) label is excluded from the evaluation. Results:
Results with transformers inference
Beam search configuration:
outputs = model.generate(
    input_ids,
    max_new_tokens=20,    # maximum number of newly generated tokens
    do_sample=False,      # disable sampling (deterministic decoding)
    temperature=0.7,      # ignored when do_sample=False
    num_beams=5,          # beam width of 5
    early_stopping=True,  # enable early stopping
)
Before fine-tuning
=== 整體指標 ===
宏平均 - 準確率: 0.7172, 召回率: 0.4394, F1: 0.4770
微平均 - 準確率: 0.6476, 召回率: 0.3949, F1: 0.4906
r=8
=== 整體指標 ===
宏平均 - 準確率: 0.9124, 召回率: 0.8723, F1: 0.8873
微平均 - 準確率: 0.9002, 召回率: 0.8542, F1: 0.8766
r=16
宏平均 - 準確率: 0.8904, 召回率: 0.8693, F1: 0.8731
微平均 - 準確率: 0.8981, 召回率: 0.8549, F1: 0.8760
After fine-tuning, both precision and recall improve significantly; r=8 is already sufficient and offers the best cost-effectiveness.
Results with vLLM inference
No equivalent beam search option was found in vLLM here, so greedy decoding was used instead:
# sampling parameters for generation
sampling_params = SamplingParams(
    max_tokens=20,
    temperature=0.0,
)
- Before fine-tuning
=== 整體指標 ===
宏平均 - 準確率: 0.6281, 召回率: 0.3851, F1: 0.4185
微平均 - 準確率: 0.6445, 召回率: 0.3980, F1: 0.4921
- After fine-tuning, r=8
=== PEFT模型參數統計 ===
trainable params: 8,716,288 || all params: 1,729,291,264 || trainable%: 0.5040
=== 整體指標 ===
宏平均 - 準確率: 0.8985, 召回率: 0.7724, F1: 0.8196
微平均 - 準確率: 0.8944, 召回率: 0.7752, F1: 0.8306
- After fine-tuning, r=16
=== PEFT模型參數統計 ===
trainable params: 17,432,576 || all params: 1,738,007,552 || trainable%: 1.0030
=== 整體指標 ===
宏平均 - 準確率: 0.8890, 召回率: 0.8043, F1: 0.8387
微平均 - 準確率: 0.8961, 召回率: 0.7957, F1: 0.8429
- After fine-tuning, both precision and recall improve significantly
- r=16 gives a small further improvement
- Greedy decoding performs somewhat worse than beam search
Notes
Issue 1:
During batched inference the following warning appears: A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
Why set padding_side='left'?
Decoder-only models use causal attention, so each token can only attend to the tokens before it (on the left). With right padding, the trailing pad tokens are treated as valid input, which leads to wrong or unstable generation results.
Example:
"input": "文本:在景區內自由活動,打卡拍照,休閑游玩,將在這里度過一個愉快的下午~農莊游玩臺球、乒乓球、麻將、釣魚(自備漁具)各項娛樂活動任其選,射箭、鄉村保齡球、羽毛球、農場 KTV 免費玩,免費提供拔河繩、跳繩等團隊活動道具,類型選型:['主題公園', '乘雪橇', ...... , '浮潛']"  -- only part of the label options is shown here; most labels are omitted
"output": "['其他']"
Inference results:
"prediction": "你是一個文本分類領域的專家,你會接收到一段文本和幾個潛在的分類選項,請"  -- batch mode, padding_side='right'
"prediction": "大眾文化類體驗"  -- batch mode, padding_side='right'
Issue 2: batched inference performs worse than single-sample inference
Same example as above:
"prediction": "你是一個文本分類領域的專家,你會接收到一段文本和幾個潛在的分類選項,請"  -- batch mode, padding_side='right'
"prediction": "其他"  -- single-sample mode
Issue 3: problems during vLLM inference
Default limit and how to raise it
vLLM's default maximum LoRA rank is 16. If the adapter was fine-tuned with a larger rank (e.g. 32 or 64), inference fails with:
ValueError: LoRA rank X is greater than max_lora_rank 16
The limit has to be raised explicitly. When launching the inference service, pass the --max-lora-rank flag, for example:
python -m vllm.entrypoints.openai.api_server --max-lora-rank 64 --model your_model --enable-lora
Or configure it in code:
llm = LLM(model="your_model", enable_lora=True, max_lora_rank=64)
支持的范圍
vLLM 官方支持的 rank 值包括?8、16、32、64。若需使用 rank=64,必須顯式聲明?--max-lora-rank=64
6。多 LoRA 場景的限制
同時加載多個 LoRA 適配器時,
--max-lora-rank
?需設置為所有適配器中的?最大 rank 值(例如一個 rank=16,另一個 rank=32,則需設為 32)6。單批次支持的 LoRA 適配器數量由?
--max-loras
?控制(默認上限 32),但硬件顯存可能限制實際可加載數量64。
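A minimal sketch of how the two-adapter setup from the overview maps onto these options (adapter paths and names below are hypothetical placeholders): set max_lora_rank to the largest rank in use and pass a different LoRARequest per sub-task.

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="/home/emma/.cache/modelscope/hub/models/Qwen/Qwen3-1.7B",
    enable_lora=True,
    max_loras=2,        # up to two adapters active in a batch
    max_lora_rank=32,   # largest rank among the adapters being served
    max_model_len=2048,
)
# One adapter per sub-task; the paths are placeholders
task_a = LoRARequest("cls_task_a", 1, lora_path="./output/adapter_task_a")
task_b = LoRARequest("cls_task_b", 2, lora_path="./output/adapter_task_b")
params = SamplingParams(max_tokens=20, temperature=0.0)

# Route each request to the adapter that covers its label subset
out_a = llm.generate(prompts=["文本A ..."], sampling_params=params, lora_request=task_a)
out_b = llm.generate(prompts=["文本B ..."], sampling_params=params, lora_request=task_b)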