Run QwQ-32B effectively + Bug Fixes

Table of contents

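- QwQ-32B Bug Fixes
- ⚙️ Official Recommended Settings
- 👍 Recommended llama.cpp settings
- 📖 Tutorial: Run and Fix QwQ-32B
  - 1. For llama.cpp and engines that use llama.cpp:
  - 2. Download the model + test
  - 3. Test/Evaluate
  - 4. Try it without our fixes:
- 💡 `<think>` token not shown?
- 🧪 Experimental results + notes
- 🦥 Dynamic 4-bit quants
- 🛠️ Fine-tuning QwQ-32B
- Performance benchmarks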

This article is translated and adapted from: Run QwQ-32B effectively + Bug Fixes (Mar 7, 2025 • By Daniel & Michael)
https://unsloth.ai/blog/qwq-32b

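Qwen released QwQ-32B, a powerful reasoning model with performance comparable to DeepSeek-R1. You may have run into issues such as infinite loops, repetitions, token errors, and fine-tuning challenges; these do not reflect the model's true quality. We hope this blog post helps you debug and fix most of them! [See the tutorial](https://unsloth.ai/blog/qwq-32b#Tutorial%20QwQ)
Our model uploads include bug fixes and work well for fine-tuning, vLLM, and Transformers. If you use llama.cpp, or an engine that uses llama.cpp as its backend, however, you have probably run into issues. To resolve them, follow the tutorial below or read the detailed guide and analysis in our docs.
All Unsloth-fixed QwQ-32B uploads, including GGUF and dynamic 4-bit versions, are linked from the original post.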

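QwQ-32B Bug Fixes

We found a few issues, especially ones that affect fine-tuning! The EOS token is correct, but the PAD token should probably be "<|vision_pad|>" instead. We have already updated it in our uploads. The values shipped in the original tokenizer configuration are: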

"eos_token": "<|im_end|>",
"pad_token": "<|endoftext|>",

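A minimal sketch of the PAD token change described above, applied to a tokenizer configuration held as a plain Python dictionary. The helper name `fix_pad_token` and the in-memory dictionary are illustrative assumptions; only the token strings come from the text.

```python
def fix_pad_token(config, new_pad="<|vision_pad|>"):
    """Return a copy of a tokenizer config with the PAD token replaced.

    `config` is assumed to be a dict with "eos_token" and "pad_token" keys.
    """
    fixed = dict(config)  # shallow copy so the caller's dict is untouched
    fixed["pad_token"] = new_pad
    return fixed


original = {"eos_token": "<|im_end|>", "pad_token": "<|endoftext|>"}
fixed = fix_pad_token(original)
print(fixed)
```

The EOS token is left alone on purpose, since only the PAD token is reported as questionable.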
⚙️ Official Recommended Settings

According to Qwen, these are the recommended inference settings:

Temperature of 0.6
Top_K of 40 (or 20 to 40)
Min_P of 0.0
Top_P of 0.95
Repetition penalty of 1.0 (1.0 means disabled in llama.cpp and transformers)
Chat template: <|im_start|>user\nCreate a Flappy Bird game in Python.<|im_end|>\n<|im_start|>assistant\n<think>\n
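As a rough sketch, the settings and chat template above can be captured in Python like this. The dictionary key names and the helper `build_qwq_prompt` are illustrative assumptions and are not tied to any particular library; the values and the template string come from the list above.

```python
# Recommended inference settings from the list above (key names are illustrative).
RECOMMENDED_SETTINGS = {
    "temperature": 0.6,
    "top_k": 40,            # anywhere from 20 to 40 is suggested
    "min_p": 0.0,
    "top_p": 0.95,
    "repetition_penalty": 1.0,  # 1.0 means disabled
}


def build_qwq_prompt(user_message):
    """Wrap a single user message in the chat template shown above.

    The prompt ends with "<think>\n" so the model starts in reasoning mode.
    """
    return (
        f"<|im_start|>user\n{user_message}<|im_end|>\n"
        "<|im_start|>assistant\n<think>\n"
    )


print(build_qwq_prompt("Create a Flappy Bird game in Python."))
```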

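👍 Recommended llama.cpp settings

We noticed that many people use a repetition penalty greater than 1.0, for example 1.1 to 1.5. This actually interferes with llama.cpp's sampling mechanisms. The goal of a repetition penalty is to penalize repeated generations, but we found that it does not work as expected.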

Turning the repetition penalty off (that is, setting it to 1.0) also works, but we found it useful for penalizing endless generations.
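To make the role of the penalty value concrete, here is a minimal sketch of the common repetition penalty formulation: logits of already-seen tokens are divided by the penalty when positive and multiplied by it when negative. This is an illustration under that assumption, not llama.cpp's actual code, which has further parameters such as a look-back window. The function name is hypothetical.

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty):
    """Return new logits with previously generated tokens made less likely."""
    out = list(logits)
    for token_id in set(seen_token_ids):
        if out[token_id] > 0:
            out[token_id] /= penalty  # shrink positive logits
        else:
            out[token_id] *= penalty  # push negative logits further down
    return out


logits = [2.0, -1.0, 0.5]
print(apply_repetition_penalty(logits, [0, 1], 1.0))  # penalty 1.0 changes nothing
print(apply_repetition_penalty(logits, [0, 1], 1.5))  # tokens 0 and 1 are penalized
```

With a penalty of 1.0 both branches are no-ops, which is why 1.0 means "disabled".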

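To use it, we found that you must also edit the order of the samplers in llama.cpp so that they are applied before the repetition penalty; otherwise you will get endless generations. So add this: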

--samplers "top_k;top_p;min_p;temperature;dry;typ_p;xtc"

By default, llama.cpp uses this ordering:

--samplers "dry;top_k;typ_p;top_p;min_p;xtc;temperature"

We essentially reorder temperature and dry, and move min_p forward. This means the samplers are applied in this order:

top_k=40
top_p=0.95
min_p=0.0
temperature=0.6
dry
typ_p
xtc
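The difference between the two orderings can be checked with a short sketch that reports where each sampler sits before and after the change. The two strings are copied from the `--samplers` values above; the helper name `position_changes` is illustrative.

```python
DEFAULT_ORDER = "dry;top_k;typ_p;top_p;min_p;xtc;temperature".split(";")
FIXED_ORDER = "top_k;top_p;min_p;temperature;dry;typ_p;xtc".split(";")


def position_changes(default, fixed):
    """Map each sampler name to its (default index, fixed index)."""
    if sorted(default) != sorted(fixed):
        raise ValueError("both orderings must contain the same samplers")
    return {name: (default.index(name), fixed.index(name)) for name in fixed}


for name, (old, new) in position_changes(DEFAULT_ORDER, FIXED_ORDER).items():
    print(f"{name:12s} {old} -> {new}")
```

The output shows dry moving from the first position to after temperature, and temperature no longer running last.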

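📖 Tutorial: Run and Fix QwQ-32B

1. For llama.cpp and engines that use llama.cpp:

You can read our full guide in our docs. Get the latest llama.cpp at: github.com/ggml-org/llama.cpp.

You can also follow the build instructions below. If you do not have a GPU or only want CPU inference, change -DGGML_CUDA=ON to -DGGML_CUDA=OFF.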

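# Install build dependencies (Debian/Ubuntu)
apt-get update
apt-get install build-essential cmake curl libcurl4-openssl-dev -y
# Fetch the llama.cpp sources
git clone https://github.com/ggml-org/llama.cpp
# Configure the build; use -DGGML_CUDA=OFF for CPU-only inference
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=ON -DGGML_CUDA=ON -DLLAMA_CURL=ON
# Build only the tools used in this tutorial
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split
# Copy the binaries next to the sources for convenience
cp llama.cpp/build/bin/llama-* llama.cpp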
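If you script the build, the CUDA switch mentioned above can be toggled with a small helper like the one below. This is only a sketch: the function name is hypothetical, and the flags are exactly the ones from the configure step above.

```python
import shlex


def cmake_configure_command(use_cuda=True):
    """Build the cmake configure command, with GGML_CUDA set to ON or OFF."""
    cuda = "ON" if use_cuda else "OFF"
    return [
        "cmake", "llama.cpp", "-B", "llama.cpp/build",
        "-DBUILD_SHARED_LIBS=ON",
        f"-DGGML_CUDA={cuda}",
        "-DLLAMA_CURL=ON",
    ]


# CPU-only variant, printed as a shell command.
print(shlex.join(cmake_configure_command(use_cuda=False)))
```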

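2. Download the model + test

Download the model with the script below (after running pip install huggingface_hub hf_transfer). You can choose Q4_K_M or another quantized version (or BF16 full precision). Other variants: huggingface.co/unsloth/QwQ-32B-GGUF
Then run Unsloth's Flappy Bird test, which saves the output to Q4_K_M_yes_samplers.txt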

# !pip install huggingface_hub hf_transfer
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"  # set before importing huggingface_hub
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="unsloth/QwQ-32B-GGUF",
    local_dir="unsloth-QwQ-32B-GGUF",
    allow_patterns=["*Q4_K_M*"],  # For Q4_K_M
)
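After the download finishes, a quick check that the expected file is in place can save a failed llama-cli run. The sketch below uses only the standard library; the helper name `find_gguf_files` is illustrative, and the directory name matches the `local_dir` used above.

```python
from pathlib import Path


def find_gguf_files(local_dir, quant="Q4_K_M"):
    """Return sorted paths of .gguf files under local_dir whose name contains quant."""
    root = Path(local_dir)
    if not root.is_dir():
        return []  # nothing downloaded yet
    return sorted(root.rglob(f"*{quant}*.gguf"))


for path in find_gguf_files("unsloth-QwQ-32B-GGUF"):
    print(path)
```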

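3. Test/Evaluate

Edit --threads 32 to set the number of CPU threads, --ctx-size 16384 to set the context length, and --n-gpu-layers 99 to set how many layers are offloaded to the GPU.

Try lowering --n-gpu-layers if your GPU runs out of memory, and remove it entirely for CPU-only inference.
We use --repeat-penalty 1.1 and --dry-multiplier 0.5, which you can adjust.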

./llama.cpp/llama-cli \
    --model unsloth-QwQ-32B-GGUF/QwQ-32B-Q4_K_M.gguf \
    --threads 32 \
    --ctx-size 16384 \
    --n-gpu-layers 99 \
    --seed 3407 \
    --prio 2 \
    --temp 0.6 \
    --repeat-penalty 1.1 \
    --dry-multiplier 0.5 \
    --min-p 0.0 \
    --top-k 40 \
    --top-p 0.95 \
    -no-cnv \
    --samplers "top_k;top_p;min_p;temperature;dry;typ_p;xtc" \
    --prompt "<|im_start|>user\nCreate a Flappy Bird game in Python. You must include these things:\n1. You must use pygame.\n2. The background color should be randomly chosen and is a light shade. Start with a light blue color.\n3. Pressing SPACE multiple times will accelerate the bird.\n4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.\n5. Place on the bottom some land colored as dark brown or yellow chosen randomly.\n6. Make a score shown on the top right side. Increment if you pass pipes and don't hit them.\n7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.\n8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.\nThe final game should be inside a markdown section in Python. Check your code for errors and fix them before the final markdown section.<|im_end|>\n<|im_start|>assistant\n<think>\n" \
    2>&1 | tee Q4_K_M_yes_samplers.txt
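For repeated experiments it can help to assemble the same command in Python. The sketch below builds the argument list for the llama-cli call above, using --repeat-penalty 1.1 as stated in the text. The function name, its parameters, and the `use_fixed_samplers` switch are illustrative assumptions; every flag is taken from the command above.

```python
FIXED_SAMPLERS = "top_k;top_p;min_p;temperature;dry;typ_p;xtc"


def build_llama_cli_args(model_path, prompt, use_fixed_samplers=True,
                         threads=32, ctx_size=16384, n_gpu_layers=99):
    """Return the llama-cli argument list used in this tutorial."""
    args = [
        "./llama.cpp/llama-cli",
        "--model", model_path,
        "--threads", str(threads),
        "--ctx-size", str(ctx_size),
        "--n-gpu-layers", str(n_gpu_layers),
        "--seed", "3407",
        "--prio", "2",
        "--temp", "0.6",
        "--repeat-penalty", "1.1",
        "--dry-multiplier", "0.5",
        "--min-p", "0.0",
        "--top-k", "40",
        "--top-p", "0.95",
        "-no-cnv",
    ]
    if use_fixed_samplers:
        args += ["--samplers", FIXED_SAMPLERS]  # the reordered sampler chain
    args += ["--prompt", prompt]
    return args


args = build_llama_cli_args(
    "unsloth-QwQ-32B-GGUF/QwQ-32B-Q4_K_M.gguf",
    "<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\n<think>\n",
)
print(args)
# To actually run it: import subprocess; subprocess.run(args, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting issues with the long prompt.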

An example of the final Python output is linked in the original post. The full input is:

<|im_start|>user
Create a Flappy Bird game in Python. You must include these things:
1. You must use pygame.
2. The background color should be randomly chosen and is a light shade. Start with a light blue color.
3. Pressing SPACE multiple times will accelerate the bird.
4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.
5. Place on the bottom some land colored as dark brown or yellow chosen randomly.
6. Make a score shown on the top right side. Increment if you pass pipes and don't hit them.
7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.
8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.
The final game should be inside a markdown section in Python. Check your code for errors and fix them before the final markdown section.<|im_end|>
<|im_start|>assistant
<think>

When we run it, we get a runnable game!


4. Try it without our fixes:

Now try the same without our fix! Remove --samplers "top_k;top_p;min_p;temperature;dry;typ_p;xtc". This saves the output to Q4_K_M_no_samplers.txt

./llama.cpp/llama-cli \
    --model unsloth-QwQ-32B-GGUF/QwQ-32B-Q4_K_M.gguf \
    --threads 32 \
    --ctx-size 16384 \
    --n-gpu-layers 99 \
    --seed 3407 \
    --prio 2 \
    --temp 0.6 \
    --repeat-penalty 1.1 \
    --dry-multiplier 0.5 \
    --min-p 0.1 \
    --top-k 40 \
    --top-p 0.95 \
    -no-cnv \
    --prompt "<|im_start|>user\nCreate a Flappy Bird game in Python. You must include these things:\n1. You must use pygame.\n2. The background color should be randomly chosen and is a light shade. Start with a light blue color.\n3. Pressing SPACE multiple times will accelerate the bird.\n4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.\n5. Place on the bottom some land colored as dark brown or yellow chosen randomly.\n6. Make a score shown on the top right side. Increment if you pass pipes and don't hit them.\n7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.\n8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.\nThe final game should be inside a markdown section in Python. Check your code for errors and fix them before the final markdown section.<|im_end|>\n<|im_start|>assistant\n<think>\n" \
    2>&1 | tee Q4_K_M_no_samplers.txt

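You will see some looping, but more importantly problematic, incorrect Python syntax and many other issues. The generated code can look correct and still be wrong!

For example, in our run line 39, pipes.clear(), throws the error: NameError: name 'pipes' is not defined. Did you forget to import 'pipes'? The original post links to an example with completely incorrect results.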

If you use --repeat-penalty 1.5, it gets even worse and more obvious, with syntax that is completely wrong.

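You might wonder whether Q4_K_M is to blame. Surely BF16, that is, full precision, works fine? It does not: when a repetition penalty is used without our fix --samplers "top_k;top_p;min_p;temperature;dry;typ_p;xtc", the outputs fail again.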
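One way to compare the two runs automatically is to pull the final Python block out of each saved output and check whether it even parses. The sketch below does that with the standard library; the helper names are illustrative. Note that a syntax check would not catch the NameError mentioned above, which only appears when the game is executed.

```python
import ast
import re

FENCE = "`" * 3  # built programmatically so this example can sit inside a fenced block


def extract_last_python_block(text):
    """Return the body of the last fenced Python block in text, or None."""
    pattern = re.compile(FENCE + r"python\n(.*?)" + FENCE, re.DOTALL)
    blocks = pattern.findall(text)
    return blocks[-1] if blocks else None


def has_valid_syntax(code):
    """True if code parses as Python; the code is not executed."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False


sample = "Some reasoning...\n" + FENCE + "python\nprint('hello')\n" + FENCE + "\n"
code = extract_last_python_block(sample)
print(has_valid_syntax(code))
```

In practice you would read Q4_K_M_yes_samplers.txt and Q4_K_M_no_samplers.txt and pass their contents to `extract_last_python_block`.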

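💡 `<think>` token not shown?

Some people report that, because <think> is added by default in the chat template, some systems do not output the thinking traces correctly. You will have to manually edit the Jinja template from: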

{%- if tools %}
    {{- '<|im_start|>system\n' }}
    {%- if messages[0]['role'] == 'system' %}
        {{- messages[0]['content'] }}
    {%- else %}
        {{- '' }}
    {%- endif %}
    {{- "\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
    {%- for tool in tools %}
        {{- "\n" }}
        {{- tool | tojson }}
    {%- endfor %}
    {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
{%- else %}
    {%- if messages[0]['role'] == 'system' %}
        {{- '<|im_start|>system\n' + messages[0]['content'] + '<|im_end|>\n' }}
    {%- endif %}
{%- endif %}
{%- for message in messages %}
    {%- if (message.role == "user") or (message.role == "system" and not loop.first) %}
        {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
    {%- elif message.role == "assistant" and not message.tool_calls %}
        {%- set content = message.content.split('</think>')[-1].lstrip('\n') %}
        {{- '<|im_start|>' + message.role + '\n' + content + '<|im_end|>' + '\n' }}
    {%- elif message.role == "assistant" %}
        {%- set content = message.content.split('</think>')[-1].lstrip('\n') %}
        {{- '<|im_start|>' + message.role }}
        {%- if message.content %}
            {{- '\n' + content }}
        {%- endif %}
        {%- for tool_call in message.tool_calls %}
            {%- if tool_call.function is defined %}
                {%- set tool_call = tool_call.function %}
            {%- endif %}
            {{- '\n<tool_call>\n{"name": "' }}
            {{- tool_call.name }}
            {{- '", "arguments": ' }}
            {{- tool_call.arguments | tojson }}
            {{- '}\n</tool_call>' }}
        {%- endfor %}
        {{- '<|im_end|>\n' }}
    {%- elif message.role == "tool" %}
        {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != "tool") %}
            {{- '<|im_start|>user' }}
        {%- endif %}
        {{- '\n<tool_response>\n' }}
        {{- message.content }}
        {{- '\n</tool_response>' }}
        {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
            {{- '<|im_end|>\n' }}
        {%- endif %}
    {%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
    {{- '<|im_start|>assistant\n<think>\n' }}
{%- endif %}


to a version without the <think>\n at the end. The model will then have to add <think>\n on its own during inference, which might not always succeed.

DeepSeek also edited all of its models to add a <think> token by default, which forces the model into reasoning mode.

So change {%- if add_generation_prompt %}{{- '<|im_start|>assistant\n<think>\n' }} {%- endif %} to {%- if add_generation_prompt %} {{- '<|im_start|>assistant\n' }} {%- endif %}, i.e. remove <think>\n.

The full Jinja template with the <think> part removed is linked in the original post.
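If you hold the template as a string, the edit described above is a single substring replacement. The sketch below assumes the generation prompt appears in the template exactly as quoted in the text; the constant and function names are illustrative.

```python
# The template source contains a literal backslash followed by "n", hence "\\n".
WITH_THINK = "{{- '<|im_start|>assistant\\n<think>\\n' }}"
WITHOUT_THINK = "{{- '<|im_start|>assistant\\n' }}"


def remove_think_from_template(template):
    """Drop the trailing <think> from the generation prompt of a chat template."""
    if WITH_THINK not in template:
        raise ValueError("generation prompt with <think> not found")
    return template.replace(WITH_THINK, WITHOUT_THINK)


tail = "{%- if add_generation_prompt %} " + WITH_THINK + " {%- endif %}"
print(remove_think_from_template(tail))
```

Raising an error when the pattern is missing avoids silently leaving a differently formatted template unchanged.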

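🧪 Experimental results + notes

Our first thoughts were:

1. QwQ's context length is not natively 128K; it is 32K, extended with YaRN. We tried overriding llama.cpp's YaRN handling, but nothing changed. For example, the QwQ-32B README shows the following: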

{
  ...,
  "rope_scaling": {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
  }
}
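As a quick arithmetic check of the 32K and 128K figures, the extended context is the original position limit multiplied by the scaling factor. This sketch uses only the values from the configuration above.

```python
rope_scaling = {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn",
}

# Extended context length = original length x YaRN scaling factor.
extended = int(rope_scaling["factor"] * rope_scaling["original_max_position_embeddings"])
print(extended)          # 131072 tokens
print(extended // 1024)  # 128, i.e. "128K"
```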

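2. We also thought the RMS Layernorm epsilon might be wrong: not 1e-5 but perhaps 1e-6. For example, one configuration has rms_norm_eps=1e-06 while another has rms_norm_eps=1e-05. We overrode it as well, but that did not help either.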

3. We also tested whether the tokenizer IDs match between llama.cpp and plain Transformers, with thanks to @kalomaze. They matched, so this was not the culprit.
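The tokenizer comparison in item 3 comes down to checking two ID sequences for the first point of disagreement. The sketch below shows that comparison on placeholder lists; in a real check the two lists would come from tokenizing the same text with llama.cpp and with Transformers, which is not shown here. The function name and the example IDs are hypothetical.

```python
def first_mismatch(ids_a, ids_b):
    """Return the index of the first differing token ID, or None if identical."""
    for i, (a, b) in enumerate(zip(ids_a, ids_b)):
        if a != b:
            return i
    if len(ids_a) != len(ids_b):
        return min(len(ids_a), len(ids_b))  # one sequence is a prefix of the other
    return None


# Placeholder IDs for illustration only.
print(first_mismatch([10, 20, 30], [10, 20, 30]))
print(first_mismatch([10, 20, 30], [10, 25, 30]))
```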

We provide our experimental results in our docs.
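For the epsilon check in item 2 above, a small pure-Python RMSNorm sketch shows where the value enters the computation. This is the textbook formulation written for illustration, not code from llama.cpp or Transformers, and the input vector is made up.

```python
import math


def rms_norm(x, weight, eps):
    """Scale x by the reciprocal of its root mean square, then by weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)  # eps guards against division by zero
    return [v * scale * w for v, w in zip(x, weight)]


x = [0.5, -1.0, 2.0]
w = [1.0, 1.0, 1.0]
out_1e5 = rms_norm(x, w, 1e-5)
out_1e6 = rms_norm(x, w, 1e-6)
print(max(abs(a - b) for a, b in zip(out_1e5, out_1e6)))
```

For inputs of ordinary magnitude the two epsilon values give almost identical outputs, which is consistent with the override making no difference.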

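🦥 Dynamic 4-bit quants

We also uploaded dynamic 4-bit quants, which improve accuracy over naive 4-bit quantization! They are linked from the original post, which also includes a QwQ quantization error analysis plot covering both activation and weight quantization errors.
Since vLLM 0.7.3 (February 20, 2025), vLLM supports loading Unsloth dynamic 4-bit quants!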


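🛠️ Fine-tuning QwQ-32B

QwQ-32B fine-tuning works in Unsloth with under 20GB of VRAM! It is also 2x faster, and by default it uses our dynamic 4-bit quants for better QLoRA accuracy.
Because of the model's size, it unfortunately does not fit on the free Google Colab GPUs with 16GB of VRAM, so you will need a GPU with at least 20GB of VRAM. To see our other notebooks and model uploads, visit our docs.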
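A back-of-the-envelope estimate suggests why 16GB is not enough. The sketch below assumes roughly 32 billion parameters stored at 4 bits each and ignores everything else (adapters, activations, optimizer state, and any layers kept in higher precision), so the real footprint is larger. The function name is illustrative.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate weight storage in decimal gigabytes."""
    return n_params * bits_per_param / 8 / 1e9


weights_gb = weight_memory_gb(32e9, 4)
print(weights_gb)  # 16.0
```

The 4-bit weights alone already take about 16GB, which leaves no headroom on a 16GB GPU.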