GPTQ is a quantization method for GPT-like LLMs that uses one-shot weight quantization based on approximate second-order information. This document shows how to use the quantized models and how to quantize your own models with AutoGPTQ.
AutoGPTQ: an easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm (weight-only quantization).
Contents
Using GPTQ Models with Hugging Face transformers
Using GPTQ Models with vLLM
Quantizing Your Own Models with AutoGPTQ
Using GPTQ Models with Hugging Face transformers
Note
To use the official Qwen2.5 GPTQ models with transformers, please make sure that optimum>=1.20.0 and compatible versions of transformers and auto_gptq are installed. You can do that by running:
pip install -U "optimum>=1.20.0"
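If you are not sure which versions are currently installed, a quick check with the Python standard library can help. This is a minimal sketch; the names below are the PyPI distribution names, which is an assumption about how the packages were installed:

from importlib.metadata import PackageNotFoundError, version

# Print the installed versions of the packages needed for GPTQ support;
# optimum should be >= 1.20.0
for pkg in ("optimum", "transformers", "auto-gptq"):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "is not installed")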
Now, transformers has official support for AutoGPTQ, which means that you can use the quantized models directly with transformers. For each size of Qwen2.5, both Int4 and Int8 GPTQ quantized models are provided. The following is a very simple code snippet showing how to run Qwen2.5-7B-Instruct-GPTQ-Int4:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4"

# Load the GPTQ-quantized model and its tokenizer
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
)
# Strip the prompt tokens so that only the newly generated text is decoded
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
Using GPTQ Models with vLLM
vLLM supports GPTQ, which means that you can directly use the provided GPTQ models, or models quantized with AutoGPTQ, with vLLM. When possible, it automatically uses the GPTQ Marlin kernel, which is more efficient.
In fact, the usage is the same as the basic usage of vLLM. Here is a simple example of serving Qwen2.5-7B-Instruct-GPTQ-Int4 with vLLM:
Run the following command in a shell to start an OpenAI-compatible API service:
vllm serve Qwen2.5-7B-Instruct-GPTQ-Int4
Then you can call the API as follows:
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "Qwen2.5-7B-Instruct-GPTQ-Int4",
  "messages": [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": "Tell me something about large language models."}
  ],
  "temperature": 0.7,
  "top_p": 0.8,
  "repetition_penalty": 1.05,
  "max_tokens": 512
}'
Or you can use an API client as shown below:
from openai import OpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
chat_response = client.chat.completions.create(
    model="Qwen2.5-7B-Instruct-GPTQ-Int4",
    messages=[
        {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
        {"role": "user", "content": "Tell me something about large language models."},
    ],
    temperature=0.7,
    top_p=0.8,
    max_tokens=512,
    extra_body={
        "repetition_penalty": 1.05,
    },
)
print("Chat response:", chat_response)
Quantizing Your Own Models with AutoGPTQ
If you want to quantize your own model into a GPTQ quantized model, we advise you to use AutoGPTQ. It is recommended to install the latest version of the package from source:
git clone https://github.com/AutoGPTQ/AutoGPTQ
cd AutoGPTQ
pip install -e .
Suppose you have finetuned the model Qwen2.5-7B on your own dataset, and the finetuned model is named Qwen2.5-7B-finetuned. To build your own GPTQ quantized model, you need to use the training data for calibration. Below is a simple example:
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

# Specify paths and hyperparameters for quantization
model_path = "your_model_path"
quant_path = "your_quantized_model_path"
quantize_config = BaseQuantizeConfig(
    bits=8,  # 4 or 8
    group_size=128,
    damp_percent=0.01,
    desc_act=False,  # setting this to False can significantly speed up inference, but perplexity may be slightly worse
    static_groups=False,
    sym=True,
    true_sequential=True,
    model_name_or_path=None,
    model_file_base_name="model"
)
max_len = 8192

# Load your tokenizer and model with AutoGPTQ
# To learn about loading the model onto multiple GPUs,
# visit https://github.com/AutoGPTQ/AutoGPTQ/blob/main/docs/tutorial/02-Advanced-Model-Loading-and-Best-Practice.md
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoGPTQForCausalLM.from_pretrained(model_path, quantize_config)
However, if you want to load the model across multiple GPUs, you need to use max_memory instead of device_map. Here is an example:
model = AutoGPTQForCausalLM.from_pretrained(
    model_path,
    quantize_config,
    max_memory={i: "20GB" for i in range(4)}
)
Then you need to prepare the data for calibration. All you need to do is put the samples into a list, each of which is a piece of text. Since we directly use the finetuning data for calibration, we first format it with the ChatML template. For example:
import torch

data = []
for msg in dataset:
    text = tokenizer.apply_chat_template(msg, tokenize=False, add_generation_prompt=False)
    model_inputs = tokenizer([text])
    # Truncate each tokenized sample to at most max_len tokens (keeping the batch dimension of 1)
    input_ids = torch.tensor([model_inputs.input_ids[0][:max_len]], dtype=torch.int)
    data.append(dict(input_ids=input_ids, attention_mask=input_ids.ne(tokenizer.pad_token_id)))
where each msg is a typical chat message, as shown below:
[{"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},{"role": "user", "content": "Tell me who you are."},{"role": "assistant", "content": "I am a large language model named Qwen..."}
]
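In the calibration loop above, dataset is assumed to be an iterable of such message lists. As a purely hypothetical sketch, if each conversation were stored as one JSON line in a file (the file name below is made up), it could be loaded like this:

import json

# Hypothetical loader: each line of calibration_data.jsonl holds one conversation,
# i.e. a list of {"role": ..., "content": ...} messages as shown above.
dataset = []
with open("calibration_data.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        dataset.append(json.loads(line))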
Then just run the calibration process with one line of code:
import logging

logging.basicConfig(
    format="%(asctime)s %(levelname)s [%(name)s] %(message)s", level=logging.INFO, datefmt="%Y-%m-%d %H:%M:%S"
)
model.quantize(data, cache_examples_on_gpu=False)
Finally, save the quantized model:
model.save_quantized(quant_path, use_safetensors=True)
tokenizer.save_pretrained(quant_path)
Note that the save_quantized method does not support model sharding. For sharding, you need to load the model and use save_pretrained to save and shard it, as sketched below.
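The following is a minimal sketch of that step, assuming transformers (with GPTQ support via optimum and auto_gptq) can load the saved quantized checkpoint; the output directory name and shard size are illustrative choices, not part of the original:

from transformers import AutoModelForCausalLM, AutoTokenizer

# "your_quantized_model_path" is the directory written by save_quantized above
model = AutoModelForCausalLM.from_pretrained("your_quantized_model_path", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("your_quantized_model_path")

# save_pretrained writes the weights in shards no larger than max_shard_size
model.save_pretrained("your_sharded_model_path", max_shard_size="4GB")
tokenizer.save_pretrained("your_sharded_model_path")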
That's all for this guide.