1. Model Introduction
Kimi K2 is a state-of-the-art mixture-of-experts (MoE) language model with 32 billion activated parameters and 1 trillion total parameters. Trained with the Muon optimizer, it achieves exceptional performance across frontier knowledge, reasoning, and coding tasks, while being meticulously optimized for agentic capabilities.
Key Features
- Large-Scale Training: Pre-trained a 1T-parameter MoE model on 15.5T tokens with full training stability throughout.
- MuonClip Optimizer: Applied the Muon optimizer at an unprecedented scale, with novel optimization techniques developed to resolve instabilities while scaling up.
- Agentic Intelligence: Specifically designed for tool use, reasoning, and autonomous problem-solving.
Model Variants
- Kimi-K2-Base: The foundation model, a strong starting point for researchers and developers who want full control over fine-tuning and custom solutions.
- Kimi-K2-Instruct: The post-trained model, best for drop-in, general-purpose chat and agentic experiences. It is a reflex-grade model that does not require long thinking.
2. Model Summary

| Architecture | Mixture-of-Experts (MoE) |
| --- | --- |
| Total Parameters | 1T |
| Activated Parameters | 32B |
| Number of Layers (Dense layer included) | 61 |
| Number of Dense Layers | 1 |
| Attention Hidden Dimension | 7168 |
| MoE Hidden Dimension (per Expert) | 2048 |
| Number of Attention Heads | 64 |
| Number of Experts | 384 |
| Selected Experts per Token | 8 |
| Number of Shared Experts | 1 |
| Vocabulary Size | 160K |
| Context Length | 128K |
| Attention Mechanism | MLA (Multi-head Latent Attention) |
| Activation Function | SwiGLU |
3. Evaluation Results
Instruct Model Evaluation Results
Benchmark | Metric | Kimi K2 Instruct | DeepSeek-V3-0324 | Qwen3-235B-A22B (non-thinking) | Claude Sonnet 4 (w/o extended thinking) | Claude Opus 4 (w/o extended thinking) | GPT-4.1 | Gemini 2.5 Flash Preview (05-20) |
---|---|---|---|---|---|---|---|---|
Coding Tasks | ||||||||
LiveCodeBench v6 (Aug 24 - May 25) | Pass@1 | 53.7 | 46.9 | 37.0 | 48.5 | 47.4 | 44.7 | 44.7 |
OJBench | Pass@1 | 27.1 | 24.0 | 11.3 | 15.3 | 19.6 | 19.5 | 19.5 |
MultiPL-E | Pass@1 | 85.7 | 83.1 | 78.2 | 88.6 | 89.6 | 86.7 | 85.6 |
SWE-bench Verified (Agentless Coding) | Single Patch | 51.8 | 36.6 | 39.4 | 50.2 | 53.0 | 40.8 | 32.6 |
SWE-bench Verified (Agentic Coding) | Single Attempt (Acc) | 65.8 | 38.8 | 34.4 | 72.7* | 72.5* | 54.6 | — |
 | Multiple Attempts (Acc) | 71.6 | — | — | 80.2 | 79.4* | — | — |
SWE-bench Multilingual (Agentic Coding) | Single Attempt (Acc) | 47.3 | 25.8 | 20.9 | 51.0 | — | 31.5 | — |
TerminalBench | Inhouse Framework (Acc) | 30.0 | — | — | 35.5 | 43.2 | 8.3 | — |
 | Acc | 25.0 | 16.3 | 6.6 | — | — | 30.3 | 16.8 |
Aider-Polyglot | Acc | 60.0 | 55.1 | 61.8 | 56.4 | 70.7 | 52.4 | 44.0 |
Tool Use Tasks | ||||||||
Tau2 retail | Avg@4 | 70.6 | 69.1 | 57.0 | 75.0 | 81.8 | 74.8 | 64.3 |
Tau2 airline | Avg@4 | 56.5 | 39.0 | 26.5 | 55.5 | 60.0 | 54.5 | 42.5 |
Tau2 telecom | Avg@4 | 65.8 | 32.5 | 22.1 | 45.2 | 57.0 | 38.6 | 16.9 |
AceBench | Acc | 76.5 | 72.7 | 70.5 | 76.2 | 75.6 | 80.1 | 74.5 |
Math & STEM Tasks | ||||||||
AIME 2024 | Avg@64 | 69.6 | 59.4* | 40.1* | 43.4 | 48.2 | 46.5 | 61.3 |
AIME 2025 | Avg@64 | 49.5 | 46.7 | 24.7* | 33.1* | 33.9* | 37.0 | 46.6 |
MATH-500 | Acc | 97.4 | 94.0* | 91.2* | 94.0 | 94.4 | 92.4 | 95.4 |
HMMT 2025 | Avg@32 | 38.8 | 27.5 | 11.9 | 15.9 | 15.9 | 19.4 | 34.7 |
CNMO 2024 | Avg@16 | 74.3 | 74.7 | 48.6 | 60.4 | 57.6 | 56.6 | 75.0 |
PolyMath-en | Avg@4 | 65.1 | 59.5 | 51.9 | 52.8 | 49.8 | 54.0 | 49.9 |
ZebraLogic | Acc | 89.0 | 84.0 | 37.7* | 73.7 | 59.3 | 58.5 | 57.9 |
AutoLogi | Acc | 89.5 | 88.9 | 83.3 | 89.8 | 86.1 | 88.2 | 84.1 |
GPQA-Diamond | Avg@8 | 75.1 | 68.4* | 62.9* | 70.0* | 74.9* | 66.3 | 68.2 |
SuperGPQA | Acc | 57.2 | 53.7 | 50.2 | 55.7 | 56.5 | 50.8 | 49.6 |
Humanity's Last Exam (Text Only) | - | 4.7 | 5.2 | 5.7 | 5.8 | 7.1 | 3.7 | 5.6 |
General Tasks | ||||||||
MMLU | EM | 89.5 | 89.4 | 87.0 | 91.5 | 92.9 | 90.4 | 90.1 |
MMLU-Redux | EM | 92.7 | 90.5 | 89.2 | 93.6 | 94.2 | 92.4 | 90.6 |
MMLU-Pro | EM | 81.1 | 81.2* | 77.3 | 83.7 | 86.6 | 81.8 | 79.4 |
IFEval | Prompt Strict | 89.8 | 81.1 | 83.2* | 87.6 | 87.4 | 88.0 | 84.3 |
Multi-Challenge | Acc | 54.1 | 31.4 | 34.0 | 46.8 | 49.0 | 36.4 | 39.5 |
SimpleQA | Correct | 31.0 | 27.7 | 13.2 | 15.9 | 22.8 | 42.3 | 23.3 |
Livebench | Pass@1 | 76.4 | 72.4 | 67.6 | 74.8 | 74.6 | 69.8 | 67.8 |
- Data points marked with * are taken directly from the model's technical report or blog.
- All metrics, except for SWE-bench Verified (Agentless), are evaluated with an 8k output token length; SWE-bench Verified (Agentless) is limited to a 16k output token length.
- Kimi K2 achieves a 65.8% single-attempt patch pass rate (without test-time compute) on SWE-bench Verified using bash/editor tools. Under the same conditions, its single-attempt pass rate on SWE-bench Multilingual is 47.3%. We additionally report the SWE-bench Verified result with parallel test-time compute (71.6%), obtained by sampling multiple sequences and selecting the single best via an internal scoring model.
- To ensure evaluation stability, we adopt avg@k on AIME, HMMT, CNMO, PolyMath-en, GPQA-Diamond, EvalPlus, and Tau2.
- Some data points have been omitted due to prohibitively high evaluation cost.
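The parallel test-time compute procedure mentioned in the notes above (sample several candidate patches, then keep the one ranked highest by a scoring model) can be sketched as follows. The candidates and the `score_patch` ranker here are hypothetical stand-ins, not Moonshot AI's internal scoring model:

```python
# Toy stand-in for a scoring/reward model: a real scorer would be a learned model.
# Here we simply prefer shorter patches, to make the selection step concrete.
def score_patch(patch: str) -> float:
    return -len(patch)

def best_of_n(candidates: list[str], score) -> str:
    # Sample N candidate patches elsewhere, then pick the highest-scoring one.
    return max(candidates, key=score)

patches = ["fix A with a long diff....", "fix B", "fix C plus extras"]
print(best_of_n(patches, score_patch))  # prints: fix B
```

Only the selection step is shown; generating the N candidate sequences and the scoring model itself are outside the scope of this sketch.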
Base Model Evaluation Results
Benchmark | Metric | Shot | Kimi K2 Base | Deepseek-V3-Base | Qwen2.5-72B | Llama 4 Maverick |
---|---|---|---|---|---|---|
General Tasks | ||||||
MMLU | EM | 5-shot | 87.8 | 87.1 | 86.1 | 84.9 |
MMLU-pro | EM | 5-shot | 69.2 | 60.6 | 62.8 | 63.5 |
MMLU-redux-2.0 | EM | 5-shot | 90.2 | 89.5 | 87.8 | 88.2 |
SimpleQA | Correct | 5-shot | 35.3 | 26.5 | 10.3 | 23.7 |
TriviaQA | EM | 5-shot | 85.1 | 84.1 | 76.0 | 79.3 |
GPQA-Diamond | Avg@8 | 5-shot | 48.1 | 50.5 | 40.8 | 49.4 |
SuperGPQA | EM | 5-shot | 44.7 | 39.2 | 34.2 | 38.8 |
Code Tasks | ||||||
LiveCodeBench v6 | Pass@1 | 1-shot | 26.3 | 22.9 | 21.1 | 25.1 |
EvalPlus | Pass@1 | - | 80.3 | 65.6 | 66.0 | 65.5 |
Mathematics Tasks | ||||||
MATH | EM | 4-shot | 70.2 | 60.1 | 61.0 | 63.0 |
GSM8k | EM | 8-shot | 92.1 | 91.7 | 90.4 | 86.3 |
Chinese Tasks | ||||||
C-Eval | EM | 5-shot | 92.5 | 90.0 | 90.9 | 80.9 |
CSimpleQA | Correct | 5-shot | 77.6 | 72.1 | 50.5 | 53.5 |
- All models are evaluated under the same evaluation protocol.
4. Deployment

> [!Note]
> You can access Kimi K2's API at https://platform.moonshot.ai; we provide OpenAI/Anthropic-compatible APIs. For the Anthropic-compatible API, the temperature is mapped as `real_temperature = request_temperature * 0.6`, for better compatibility with existing applications.
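Concretely, this means an Anthropic-style request temperature of 1.0 is served at an effective temperature of 0.6. A minimal sketch of the mapping (the helper name is ours, not part of the API):

```python
def map_anthropic_temperature(request_temperature: float) -> float:
    # Anthropic-compatible endpoint: real_temperature = request_temperature * 0.6
    return request_temperature * 0.6

print(map_anthropic_temperature(1.0))  # 0.6
```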
Our model checkpoints are stored in block-fp8 format; you can find them on Hugging Face.
Currently, Kimi-K2 is recommended to run on the following inference engines:
- vLLM
- SGLang
- KTransformers
- TensorRT-LLM
Deployment examples for vLLM and SGLang can be found in the Model Deployment Guide.
5. Model Usage
Chat Completion
Once the local inference service is up, you can interact with it through the chat endpoint:
```python
from openai import OpenAI

def simple_chat(client: OpenAI, model_name: str):
    messages = [
        {"role": "system", "content": "You are Kimi, an AI assistant created by Moonshot AI."},
        {"role": "user", "content": [{"type": "text", "text": "Please give a brief self-introduction."}]},
    ]
    response = client.chat.completions.create(
        model=model_name,
        messages=messages,
        stream=False,
        temperature=0.6,
        max_tokens=256,
    )
    print(response.choices[0].message.content)
```
> [!Note]
> The recommended temperature for Kimi-K2-Instruct is `temperature = 0.6`.
> If no special instructions are needed, the system prompt above is a good default.
Tool Calling
Kimi-K2-Instruct has strong tool-calling capabilities.
To enable them, pass the list of available tools in each request; the model will then autonomously decide when and how to invoke them.
The following example demonstrates calling a weather tool end-to-end:
```python
import json

from openai import OpenAI

# Your tool implementation
def get_weather(city: str) -> dict:
    return {"weather": "Sunny"}

# Tool schema definition
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Retrieve current weather information. Call this when the user asks about the weather.",
        "parameters": {
            "type": "object",
            "required": ["city"],
            "properties": {
                "city": {
                    "type": "string",
                    "description": "Name of the city"
                }
            }
        }
    }
}]

# Map tool names to their implementations
tool_map = {
    "get_weather": get_weather
}

def tool_call_with_client(client: OpenAI, model_name: str):
    messages = [
        {"role": "system", "content": "You are Kimi, an AI assistant created by Moonshot AI."},
        {"role": "user", "content": "What's the weather like in Beijing today? Use the tool to check."}
    ]
    finish_reason = None
    while finish_reason is None or finish_reason == "tool_calls":
        completion = client.chat.completions.create(
            model=model_name,
            messages=messages,
            temperature=0.6,
            tools=tools,          # tool list defined above
            tool_choice="auto"
        )
        choice = completion.choices[0]
        finish_reason = choice.finish_reason
        # If the model wants to call tools, run them and feed the results back.
        if finish_reason == "tool_calls":
            messages.append(choice.message)
            for tool_call in choice.message.tool_calls:
                tool_call_name = tool_call.function.name
                tool_call_arguments = json.loads(tool_call.function.arguments)
                tool_function = tool_map[tool_call_name]
                tool_result = tool_function(**tool_call_arguments)
                print("tool_result:", tool_result)
                messages.append({
                    "role": "tool",
                    "tool_call_id": tool_call.id,
                    "name": tool_call_name,
                    "content": json.dumps(tool_result)
                })
    print("-" * 100)
    print(choice.message.content)
```
The `tool_call_with_client` function implements the complete pipeline from user query to tool execution.
This pipeline requires the inference engine to support Kimi-K2's native tool-parsing logic.
For streaming output and manual tool parsing, see the Tool Calling Guide.
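When parsing tool calls manually, it is worth validating the model-produced arguments against the tool schema before dispatching. A minimal sketch using only the `required`/`properties` fields of a schema like the `get_weather` one above (the `validate_args` helper is ours, not part of any Kimi SDK):

```python
import json

# JSON-schema fragment matching the get_weather tool's parameters.
schema = {
    "type": "object",
    "required": ["city"],
    "properties": {"city": {"type": "string", "description": "Name of the city"}},
}

# Map JSON-schema type names to Python types for a basic isinstance check.
_TYPES = {"object": dict, "string": str, "number": (int, float)}

def validate_args(raw_arguments: str, schema: dict) -> dict:
    # Decode the model-produced JSON arguments, then check required keys and types.
    args = json.loads(raw_arguments)
    for key in schema.get("required", []):
        if key not in args:
            raise ValueError(f"missing required argument: {key}")
    for key, spec in schema.get("properties", {}).items():
        if key in args and not isinstance(args[key], _TYPES[spec["type"]]):
            raise ValueError(f"argument {key} has wrong type")
    return args

print(validate_args('{"city": "Beijing"}', schema))  # {'city': 'Beijing'}
```

Rejecting malformed arguments before calling the tool function keeps a bad generation from crashing the tool itself, and lets you return a clean error message to the model instead.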
6. License
Both the code repository and the model weights are released under the Modified MIT License.