【大模型實戰篇】使用GPTQ量化QwQ-32B微調后的推理模型

1. 量化背景

????????之所以做量化，就是希望在現有的硬件條件下，提升性能。量化能將模型權重從高精度（如FP32）轉換為低精度（如INT8/FP16），內存占用可減少50%~75%。低精度運算（如INT8）在GPU等硬件上計算效率更高，推理速度可提升2~4倍。

? ? ? ? 我們的任務是，將QwQ-32B微調后的推理模型，也就是bf16的精度，通過量化，壓縮到int4。關于QwQ-32B微調，可以參考《利用ms-swift微調框架對QwQ-32B推理模型進行微調》。關于推理模型吞吐性能對比，可以參考《對比包括QwQ-32B在內的不同推理模型的吞吐量表現》。

2. 量化流程

????????接下來進入量化介紹：

????????QwQ-32B的模型架構依然還是Qwen2系列，所以可以使用GPTQ進行量化。之前嘗試用AWQ，會報錯。下列內容是基于AutoGPTQ實現量化。

? ? ? ? 首先通過安裝源代碼的方式獲取并安裝最新版本的該軟件包。

git clone https://github.com/AutoGPTQ/AutoGPTQ
cd AutoGPTQ
pip install -e .

? ? ? ? 假設基于QwQ-32B模型進行微調，并將該微調后的模型命名為?QwQ-32B-finetuned?，且使用的是自己的帶推理鏈的數據集。要構建GPTQ量化模型，還需要使用訓練數據進行校準。

? ? ? ? 這里校準數據的設置，最好配置參數damp_percent=0.1，然后我采用的校準樣本量是128個sample。不然會報錯【1】：

torch._C._LinAlgError: linalg.cholesky: The factorization could not be completed because the input is not positive-definite

? ? ? ? 在我的場景中，damp_percent我設置0.01，通過調整校準樣本量解決了該報錯。

? ? ? ? 我們采用雙卡進行量化，腳本如下：????????

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer
import torch
import json# 設置路徑
model_path = "/data/QwQ-32B-finetuned"
quant_path = "/data/quantized_model"# 設置量化配置
quantize_config = BaseQuantizeConfig(bits=4,  # 可選擇4或8位量化group_size=128,damp_percent=0.01,desc_act=False,  # 為了加速推理，可將其設置為False，但可能會導致困惑度稍差static_groups=False,sym=True,true_sequential=True,model_name_or_path=None,model_file_base_name="model"
)max_len = 8192  # 設置最大文本長度# 加載tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)# 加載模型，并指定使用GPU 2和GPU 5
model = AutoGPTQForCausalLM.from_pretrained(model_path,quantize_config,max_memory={2: "80GB", 5: "80GB"}  # 使用GPU 2和GPU 5，各分配80GB顯存
)# 準備校準數據集
data = []
with open("/data/jz_v0303.jsonl", "r") as f:for line in f:msg = json.loads(line)text = tokenizer.apply_chat_template(msg["messages"], tokenize=False, add_generation_prompt=False)model_inputs = tokenizer([text])input_ids = torch.tensor(model_inputs.input_ids[:max_len], dtype=torch.int)data.append(dict(input_ids=input_ids, attention_mask=input_ids.ne(tokenizer.pad_token_id)))# 運行量化過程
import logginglogging.basicConfig(format="%(asctime)s %(levelname)s [%(name)s] %(message)s", level=logging.INFO, datefmt="%Y-%m-%d %H:%M:%S"
)
model.quantize(data, cache_examples_on_gpu=False)# 保存量化后的模型
model.save_quantized(quant_path, use_safetensors=True)
tokenizer.save_pretrained(quant_path)

量化日志：

? ? ? ? QwQ有64層transformer層，整個量化共花費約110分鐘。

Loading checkpoint shards: 100%|██████████| 14/14 [00:24<00:00, ?1.74s/it]
INFO - Start quantizing layer 1/64
INFO - Quantizing self_attn.k_proj in layer 1/64...
2025-03-16 13:08:27 INFO [auto_gptq.quantization.gptq] duration: 4.176503658294678
2025-03-16 13:08:27 INFO [auto_gptq.quantization.gptq] avg loss: 4.942690849304199
INFO - Quantizing self_attn.v_proj in layer 1/64...
2025-03-16 13:08:28 INFO [auto_gptq.quantization.gptq] duration: 1.400636911392212
2025-03-16 13:08:28 INFO [auto_gptq.quantization.gptq] avg loss: 1.4266357421875
INFO - Quantizing self_attn.q_proj in layer 1/64...
2025-03-16 13:08:30 INFO [auto_gptq.quantization.gptq] duration: 1.4035542011260986
2025-03-16 13:08:30 INFO [auto_gptq.quantization.gptq] avg loss: 14.252044677734375
INFO - Quantizing self_attn.o_proj in layer 1/64...
2025-03-16 13:08:35 INFO [auto_gptq.quantization.gptq] duration: 1.4259772300720215
2025-03-16 13:08:35 INFO [auto_gptq.quantization.gptq] avg loss: 21.492481231689453
INFO - Quantizing mlp.up_proj in layer 1/64...
2025-03-16 13:08:42 INFO [auto_gptq.quantization.gptq] duration: 1.4980144500732422
2025-03-16 13:08:42 INFO [auto_gptq.quantization.gptq] avg loss: 11.520009994506836
INFO - Quantizing mlp.gate_proj in layer 1/64...
2025-03-16 13:08:43 INFO [auto_gptq.quantization.gptq] duration: 1.4689013957977295
2025-03-16 13:08:43 INFO [auto_gptq.quantization.gptq] avg loss: 13.158416748046875
INFO - Quantizing mlp.down_proj in layer 1/64...
2025-03-16 13:09:36 INFO [auto_gptq.quantization.gptq] duration: 11.233691692352295
2025-03-16 13:09:36 INFO [auto_gptq.quantization.gptq] avg loss: 5.198782444000244
INFO - Start quantizing layer 2/64
INFO - Quantizing self_attn.k_proj in layer 2/64...
2025-03-16 13:09:50 INFO [auto_gptq.quantization.gptq] duration: 1.4270472526550293
2025-03-16 13:09:50 INFO [auto_gptq.quantization.gptq] avg loss: 0.25423723459243774
INFO - Quantizing self_attn.v_proj in layer 2/64...
2025-03-16 13:09:51 INFO [auto_gptq.quantization.gptq] duration: 1.377784252166748
2025-03-16 13:09:51 INFO [auto_gptq.quantization.gptq] avg loss: 0.12605950236320496
INFO - Quantizing self_attn.q_proj in layer 2/64...
2025-03-16 13:09:53 INFO [auto_gptq.quantization.gptq] duration: 1.3954062461853027
2025-03-16 13:09:53 INFO [auto_gptq.quantization.gptq] avg loss: 0.6923567056655884
INFO - Quantizing self_attn.o_proj in layer 2/64...
2025-03-16 13:09:58 INFO [auto_gptq.quantization.gptq] duration: 1.4187729358673096
2025-03-16 13:09:58 INFO [auto_gptq.quantization.gptq] avg loss: 0.21527329087257385
INFO - Quantizing mlp.up_proj in layer 2/64...
2025-03-16 13:10:05 INFO [auto_gptq.quantization.gptq] duration: 1.4918739795684814
2025-03-16 13:10:05 INFO [auto_gptq.quantization.gptq] avg loss: 42.98908615112305
INFO - Quantizing mlp.gate_proj in layer 2/64...
2025-03-16 13:10:07 INFO [auto_gptq.quantization.gptq] duration: 1.4632303714752197
2025-03-16 13:10:07 INFO [auto_gptq.quantization.gptq] avg loss: 254.09523010253906
INFO - Quantizing mlp.down_proj in layer 2/64...
2025-03-16 13:10:59 INFO [auto_gptq.quantization.gptq] duration: 11.405533790588379
2025-03-16 13:10:59 INFO [auto_gptq.quantization.gptq] avg loss: 1.4062278270721436
INFO - Start quantizing layer 3/64

.......

2025-03-16 14:33:08 INFO [auto_gptq.quantization.gptq] duration: 11.416744709014893
2025-03-16 14:33:08 INFO [auto_gptq.quantization.gptq] avg loss: 10015.05078125
INFO - Start quantizing layer 62/64
INFO - Quantizing self_attn.k_proj in layer 62/64...
2025-03-16 14:33:22 INFO [auto_gptq.quantization.gptq] duration: 1.4608099460601807
2025-03-16 14:33:22 INFO [auto_gptq.quantization.gptq] avg loss: 129.20584106445312
INFO - Quantizing self_attn.v_proj in layer 62/64...
2025-03-16 14:33:23 INFO [auto_gptq.quantization.gptq] duration: 1.417314052581787
2025-03-16 14:33:23 INFO [auto_gptq.quantization.gptq] avg loss: 834.720947265625
INFO - Quantizing self_attn.q_proj in layer 62/64...
2025-03-16 14:33:25 INFO [auto_gptq.quantization.gptq] duration: 1.4364099502563477
2025-03-16 14:33:25 INFO [auto_gptq.quantization.gptq] avg loss: 770.3301391601562
INFO - Quantizing self_attn.o_proj in layer 62/64...
2025-03-16 14:33:30 INFO [auto_gptq.quantization.gptq] duration: 1.4644238948822021
2025-03-16 14:33:30 INFO [auto_gptq.quantization.gptq] avg loss: 1413.948486328125
INFO - Quantizing mlp.up_proj in layer 62/64...
2025-03-16 14:33:38 INFO [auto_gptq.quantization.gptq] duration: 1.5320115089416504
2025-03-16 14:33:38 INFO [auto_gptq.quantization.gptq] avg loss: 7386.39453125
INFO - Quantizing mlp.gate_proj in layer 62/64...
2025-03-16 14:33:39 INFO [auto_gptq.quantization.gptq] duration: 1.5006358623504639
2025-03-16 14:33:39 INFO [auto_gptq.quantization.gptq] avg loss: 6787.9912109375
INFO - Quantizing mlp.down_proj in layer 62/64...
2025-03-16 14:34:32 INFO [auto_gptq.quantization.gptq] duration: 11.412427186965942
2025-03-16 14:34:32 INFO [auto_gptq.quantization.gptq] avg loss: 11235.9814453125
INFO - Start quantizing layer 63/64
INFO - Quantizing self_attn.k_proj in layer 63/64...
2025-03-16 14:34:46 INFO [auto_gptq.quantization.gptq] duration: 1.4546654224395752
2025-03-16 14:34:46 INFO [auto_gptq.quantization.gptq] avg loss: 130.98355102539062
INFO - Quantizing self_attn.v_proj in layer 63/64...
2025-03-16 14:34:48 INFO [auto_gptq.quantization.gptq] duration: 1.4156157970428467
2025-03-16 14:34:48 INFO [auto_gptq.quantization.gptq] avg loss: 958.8649291992188
INFO - Quantizing self_attn.q_proj in layer 63/64...
2025-03-16 14:34:49 INFO [auto_gptq.quantization.gptq] duration: 1.4323241710662842
2025-03-16 14:34:49 INFO [auto_gptq.quantization.gptq] avg loss: 780.7476196289062
INFO - Quantizing self_attn.o_proj in layer 63/64...
2025-03-16 14:34:55 INFO [auto_gptq.quantization.gptq] duration: 1.4556679725646973
2025-03-16 14:34:55 INFO [auto_gptq.quantization.gptq] avg loss: 2276.7041015625
INFO - Quantizing mlp.up_proj in layer 63/64...
2025-03-16 14:35:01 INFO [auto_gptq.quantization.gptq] duration: 1.533803939819336
2025-03-16 14:35:01 INFO [auto_gptq.quantization.gptq] avg loss: 7764.6142578125
INFO - Quantizing mlp.gate_proj in layer 63/64...
2025-03-16 14:35:03 INFO [auto_gptq.quantization.gptq] duration: 1.4962470531463623
2025-03-16 14:35:03 INFO [auto_gptq.quantization.gptq] avg loss: 7304.74365234375
INFO - Quantizing mlp.down_proj in layer 63/64...
2025-03-16 14:35:56 INFO [auto_gptq.quantization.gptq] duration: 11.429993629455566
2025-03-16 14:35:56 INFO [auto_gptq.quantization.gptq] avg loss: 17015.2734375
INFO - Start quantizing layer 64/64
INFO - Quantizing self_attn.k_proj in layer 64/64...
2025-03-16 14:36:10 INFO [auto_gptq.quantization.gptq] duration: 1.453392744064331
2025-03-16 14:36:10 INFO [auto_gptq.quantization.gptq] avg loss: 112.55108642578125
INFO - Quantizing self_attn.v_proj in layer 64/64...
2025-03-16 14:36:11 INFO [auto_gptq.quantization.gptq] duration: 1.4028844833374023
2025-03-16 14:36:11 INFO [auto_gptq.quantization.gptq] avg loss: 509.4556884765625
INFO - Quantizing self_attn.q_proj in layer 64/64...
2025-03-16 14:36:12 INFO [auto_gptq.quantization.gptq] duration: 1.434821605682373
2025-03-16 14:36:12 INFO [auto_gptq.quantization.gptq] avg loss: 685.0777587890625
INFO - Quantizing self_attn.o_proj in layer 64/64...
2025-03-16 14:36:18 INFO [auto_gptq.quantization.gptq] duration: 1.4707720279693604
2025-03-16 14:36:18 INFO [auto_gptq.quantization.gptq] avg loss: 990.3109130859375
INFO - Quantizing mlp.up_proj in layer 64/64...
2025-03-16 14:36:25 INFO [auto_gptq.quantization.gptq] duration: 1.572035312652588
2025-03-16 14:36:25 INFO [auto_gptq.quantization.gptq] avg loss: 8309.283203125
INFO - Quantizing mlp.gate_proj in layer 64/64...
2025-03-16 14:36:27 INFO [auto_gptq.quantization.gptq] duration: 1.8046717643737793
2025-03-16 14:36:27 INFO [auto_gptq.quantization.gptq] avg loss: 7995.7509765625
INFO - Quantizing mlp.down_proj in layer 64/64...
2025-03-16 14:37:20 INFO [auto_gptq.quantization.gptq] duration: 11.410486698150635
2025-03-16 14:37:20 INFO [auto_gptq.quantization.gptq] avg loss: 27875.2734375
INFO - Packing model...
2025-03-16 14:37:25 INFO [auto_gptq.modeling._utils] Packing model...
Packing model.layers.63.mlp.down_proj...: 100%|██████████| 448/448 [20:01<00:00, ?2.68s/it] ??
INFO - Model packed.
2025-03-16 14:57:31 INFO [auto_gptq.modeling._utils] Model packed.

量化前模型大小為62G：

total 62G
-rw-r--r-- 1 research research ?707 Mar 12 10:19 added_tokens.json
-rw-r--r-- 1 research research ?16K Mar 12 10:19 args.json
-rw-r--r-- 1 research research ?785 Mar 12 10:15 config.json
-rw-r--r-- 1 research research ?214 Mar 12 10:15 generation_config.json
-rw-r--r-- 1 research research 1.6M Mar 12 10:19 merges.txt
-rw-r--r-- 1 research research 4.6G Mar 12 10:15 model-00001-of-00014.safetensors
-rw-r--r-- 1 research research 4.6G Mar 12 10:16 model-00002-of-00014.safetensors
-rw-r--r-- 1 research research 4.6G Mar 12 10:16 model-00003-of-00014.safetensors
-rw-r--r-- 1 research research 4.6G Mar 12 10:16 model-00004-of-00014.safetensors
-rw-r--r-- 1 research research 4.6G Mar 12 10:16 model-00005-of-00014.safetensors
-rw-r--r-- 1 research research 4.6G Mar 12 10:17 model-00006-of-00014.safetensors
-rw-r--r-- 1 research research 4.6G Mar 12 10:17 model-00007-of-00014.safetensors
-rw-r--r-- 1 research research 4.6G Mar 12 10:17 model-00008-of-00014.safetensors
-rw-r--r-- 1 research research 4.6G Mar 12 10:18 model-00009-of-00014.safetensors
-rw-r--r-- 1 research research 4.6G Mar 12 10:18 model-00010-of-00014.safetensors
-rw-r--r-- 1 research research 4.6G Mar 12 10:18 model-00011-of-00014.safetensors
-rw-r--r-- 1 research research 4.6G Mar 12 10:18 model-00012-of-00014.safetensors
-rw-r--r-- 1 research research 4.6G Mar 12 10:19 model-00013-of-00014.safetensors
-rw-r--r-- 1 research research 2.0G Mar 12 10:19 model-00014-of-00014.safetensors
-rw-r--r-- 1 research research ?62K Mar 12 10:19 model.safetensors.index.json
-rw-r--r-- 1 research research ?613 Mar 12 10:19 special_tokens_map.json
-rw-r--r-- 1 research research 8.0K Mar 12 10:19 tokenizer_config.json
-rw-r--r-- 1 research research ?11M Mar 12 10:19 tokenizer.json
-rw-r--r-- 1 research research 2.7M Mar 12 10:19 vocab.json

量化后模型大小為19G：

total 19G
-rw-r--r-- 1 research research ?707 Mar 16 14:58 added_tokens.json
-rw-r--r-- 1 research research 1.2K Mar 16 14:58 config.json
-rw-r--r-- 1 research research 1.6M Mar 16 14:58 merges.txt
-rw-r--r-- 1 research research ?19G Mar 16 14:58 model.safetensors
-rw-r--r-- 1 research research ?271 Mar 16 14:58 quantize_config.json
-rw-r--r-- 1 research research ?613 Mar 16 14:58 special_tokens_map.json
-rw-r--r-- 1 research research 8.0K Mar 16 14:58 tokenizer_config.json
-rw-r--r-- 1 research research ?11M Mar 16 14:58 tokenizer.json
-rw-r--r-- 1 research research 2.7M Mar 16 14:58 vocab.json

3. 量化模型部署

????????vLLM已支持GPTQ，可以直接使用AutoGPTQ量化的模型。使用GPTQ模型與vLLM的基本用法相同。

CUDA_VISIBLE_DEVICES=0,1,2,3 \
vllm serve /data/quantized_model \
--tensor-parallel-size 4 \
--port 8001

????????另外對api調用的model id，可以通過設置別名方式，而不需要暴露完整路徑：

vllm serve my_model --served-model-name my_alias

????????隨后，可以這樣調用API：

curl http://localhost:8001/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "quantized_model","messages": [{"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},{"role": "user", "content": "推薦一款防水耳機."}],"temperature": 0.7,"top_p": 0.8,"repetition_penalty": 1.05,"max_tokens": 512
}'

? ? ? ? 也可以使用?openai?Python包中的API客戶端：

from openai import OpenAIopenai_api_key = "EMPTY"
openai_api_base = "http://localhost:8001/v1"client = OpenAI(api_key=openai_api_key,base_url=openai_api_base,
)chat_response = client.chat.completions.create(model="/data/quantized_model",messages=[{"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},{"role": "user", "content": "推薦一款防水耳機"},],temperature=0.7,top_p=0.8,max_tokens=512,extra_body={"repetition_penalty": 1.05,},
)
print("Chat response:", chat_response)

????????實測了下，模型生成吞吐量可以在92 tokens/s，還是很不錯的。

? ? ? ?注意：需要注意下，百億參數的模型，一般還是選擇int8量化比較合適。int4更適合是千億模型，百億規模損失會有點大。

? ? ? ? 以下是int8的量化loss表現：

? ? ? ?還有一個需要注意的是，量化后用vllm推理，默認會在prompt中添加<|im_start>assistant\n<think>這段，其實就是強制模型先輸出推理鏈，本質上是指令遵循。所以你推理拿到的生成結果看起來是丟了<think>這個特殊token，實際上是已經在prompt中體現了。

4. 參考材料

【1】https://github.com/AutoGPTQ/AutoGPTQ/issues/196? ?

【2】GPTQ - Qwen? ??