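LLMs and RLVR: Translation and Commentary on "Absolute Zero: Reinforced Self-play Reasoning with Zero Data"

Overview: The Absolute Zero paradigm lets a model propose and solve its own tasks without any external data, and this yields a marked improvement in reasoning ability. Absolute Zero Reasoner (AZR), one concrete implementation of the paradigm, achieves strong results on code and mathematical reasoning tasks, showing how much potential there is in model self-evolution through self-play and interaction with an environment. The work points to a new direction for future AI reasoning models: dropping the dependence on human-annotated data and exploring a broader space of self-directed learning.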

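>> Background and pain points:

● Limits of human supervision:

●● High annotation cost: as reasoning models become more capable, the cost of building large, high-quality datasets becomes increasingly hard to bear.

●● Limits of human intelligence: once AI systems surpass human intelligence, relying on human-designed tasks may restrict their capacity for autonomous learning and growth.

● Shortcomings of existing RLVR methods: reliance on expert knowledge. Existing reinforcement learning with verifiable rewards (RLVR) methods still depend on human-curated reasoning question-answer pairs, which limits their long-term scalability.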

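>> Proposed solution:

● The Absolute Zero paradigm:

●● Self-play learning: the model simultaneously learns to define tasks that maximize its own learning progress and improves its reasoning by solving them, without any external data.

●● Environment feedback: verifiable feedback from the environment serves as the reward source, which supports open-ended yet grounded learning.

● Absolute Zero Reasoner (AZR):

●● Unified model: a single LLM acts as both task proposer and task solver.

●● Three reasoning modes: the model learns from three types of coding tasks, corresponding to the three basic modes of reasoning: induction, deduction, and abduction.

●● Code executor: a code executor serves as the open-ended yet grounded environment, validating task integrity and providing verifiable feedback.

●● Multitask learning: model updates use a new reinforcement learning advantage estimator designed specifically for the multitask setting.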
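The estimator itself is not spelled out in this summary. As a rough illustration of what a multitask-aware advantage could look like, the sketch below normalizes each reward against the mean and standard deviation of its own (task type, role) group instead of a single global baseline. The function name `task_relative_advantages`, the tuple layout, and the `eps` term are assumptions made for this example, not the authors' code.

```python
from collections import defaultdict
from statistics import mean, pstdev


def task_relative_advantages(samples, eps=1e-8):
    """Normalize rewards within each (task_type, role) group.

    samples: list of (task_type, role, reward) tuples (hypothetical layout).
    Returns one advantage per sample, in the same order.
    """
    # Collect rewards per group, e.g. ("deduction", "solve").
    groups = defaultdict(list)
    for task_type, role, reward in samples:
        groups[(task_type, role)].append(reward)

    # One baseline (mean) and one scale (population std) per group.
    stats = {key: (mean(vals), pstdev(vals)) for key, vals in groups.items()}

    advantages = []
    for task_type, role, reward in samples:
        mu, sigma = stats[(task_type, role)]
        advantages.append((reward - mu) / (sigma + eps))
    return advantages


batch = [
    ("deduction", "solve", 1.0),
    ("deduction", "solve", 0.0),
    ("abduction", "solve", 0.5),
    ("abduction", "solve", 0.5),
]
advs = task_relative_advantages(batch)
```

With per-group baselines, an easy task type with uniformly high rewards does not drown out the signal from a harder one; a group whose rewards are all identical contributes zero advantage.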

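>> Core steps:

● Task proposal: the proposer policy (π^propose), conditioned on a variable z, proposes a task τ.

● Task validation: the environment (e) validates τ, constructs a valid reasoning task (x, y*) in which y* is the reference answer, and issues a learnability reward r^propose.

● Task solving: the solver policy (π^solve) answers the query x with y and receives a reward r^solve from the environment.

● Joint training: π^propose and π^solve are trained jointly, optimizing the model parameters θ by maximizing the objective J(θ).

● Iteration: the process repeats, so the model keeps evolving and improving.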
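The steps above can be sketched as a small loop. This is a toy under stated assumptions, not the authors' implementation: the proposer and solver are hard-coded stand-ins for the LLM, `exec` stands in for a sandboxed code executor, and the learnability reward (zero when a task is always or never solved, otherwise one minus the solve rate) is an assumed form. All function names are hypothetical.

```python
import random


def run_program(program_src, arg):
    """Toy environment: run a proposed program and return f(arg).

    WARNING: plain exec() is not a sandbox; this is for illustration only.
    """
    scope = {}
    exec(program_src, scope)
    return scope["f"](arg)


def learnability_reward(solve_rate):
    """Assumed proposer reward: tasks always or never solved teach nothing."""
    if solve_rate in (0.0, 1.0):
        return 0.0
    return 1.0 - solve_rate


def self_play_step(propose, solve, n_rollouts=4):
    """One propose -> validate -> solve round.

    Returns both rewards, or None if the environment rejects the task.
    """
    program_src, arg = propose()              # proposer emits a task
    try:
        gold = run_program(program_src, arg)  # environment builds (x, y*)
    except Exception:
        return None                           # invalid task: rejected
    r_solve = [float(solve(program_src, arg) == gold) for _ in range(n_rollouts)]
    solve_rate = sum(r_solve) / n_rollouts
    return {"r_propose": learnability_reward(solve_rate), "r_solve": r_solve}


def toy_propose():
    """Hypothetical proposer: always proposes the same program and input."""
    return "def f(x):\n    return x * 2 + 1", 3


def make_toy_solver(rng, p_correct):
    """Hypothetical solver that answers correctly with probability p_correct."""
    def solve(program_src, arg):
        truth = run_program(program_src, arg)
        return truth if rng.random() < p_correct else truth + 1
    return solve


if __name__ == "__main__":
    solver = make_toy_solver(random.Random(0), p_correct=0.5)
    for step in range(3):
        # A real system would update the shared model on both rewards here.
        print(step, self_play_step(toy_propose, solver))
```

The point of the sketch is the data flow: one task yields a reward for the proposing role and a separate reward per solver rollout, and both would feed the same model update.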

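>> Advantages:

● No external data: the model learns entirely through self-play and interaction with the environment, with no human annotation or external datasets.

● Strong performance: state-of-the-art overall results on coding and mathematical reasoning, ahead of existing zero-setting models that rely on tens of thousands of human-curated in-domain examples.

● Cross-domain generalization: models trained only in the code environment show much larger gains on math tasks than expert code models trained with RLVR.

● Scalability: performance gains grow with model size.

● Emergent behaviors: during training the model naturally develops intermediate plans (resembling the ReAct prompting framework) and cognitive behaviors such as step-by-step reasoning, enumeration, and trial and error.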

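>> Conclusions and takeaways:

● Code priors matter: models with stronger coding ability show larger gains in overall reasoning after AZR training.

● Exploring the space of learning tasks: AI reasoning agents can benefit from dynamically defining and refining their own learning tasks, which opens a promising direction for future research.

● Safety: AZR training with the Llama-3.1-8B model produced some concerning chains of thought, which highlights the need for safety-aware training in future work.

● Environment diversity: the authors suggest exploring more varied environments that can provide verifiable feedback, such as the World Wide Web, formal math languages, world simulators, or even the real world.

● Multimodal reasoning: future work could explore multimodal reasoning models and design exploration/diversity rewards to further improve performance.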

Translation and Commentary on "Absolute Zero: Reinforced Self-play Reasoning with Zero Data"

Link	Paper: [2505.03335] Absolute Zero: Reinforced Self-play Reasoning with Zero Data
Date	May 6, 2025
Authors	Tsinghua University, Beijing Institute for General Artificial Intelligence, Pennsylvania State University

Abstract

Reinforcement learning with verifiable rewards (RLVR) has shown promise in enhancing the reasoning capabilities of large language models by learning directly from outcome-based rewards. Recent RLVR works that operate under the zero setting avoid supervision in labeling the reasoning process, but still depend on manually curated collections of questions and answers for training. The scarcity of high-quality, human-produced examples raises concerns about the long-term scalability of relying on human supervision, a challenge already evident in the domain of language model pretraining. Furthermore, in a hypothetical future where AI surpasses human intelligence, tasks provided by humans may offer limited learning potential for a superintelligent system. To address these concerns, we propose a new RLVR paradigm called Absolute Zero, in which a single model learns to propose tasks that maximize its own learning progress and improves reasoning by solving them, without relying on any external data. Under this paradigm, we introduce the Absolute Zero Reasoner (AZR), a system that self-evolves its training curriculum and reasoning ability by using a code executor to both validate proposed code reasoning tasks and verify answers, serving as a unified source of verifiable reward to guide open-ended yet grounded learning. Despite being trained entirely without external data, AZR achieves overall SOTA performance on coding and mathematical reasoning tasks, outperforming existing zero-setting models that rely on tens of thousands of in-domain human-curated examples. Furthermore, we demonstrate that AZR can be effectively applied across different model scales and is compatible with various model classes.

Figure 1: Absolute Zero Reasoner (AZR) achieves state-of-the-art performance with ZERO DATA. Without relying on any gold labels or human-defined queries, Absolute Zero Reasoner, trained using our proposed self-play approach, demonstrates impressive improvements in general reasoning capabilities in both math and coding, despite operating entirely out-of-distribution. Remarkably, AZR surpasses models trained on tens of thousands of expert-labeled in-domain examples in the combined average score across both domains.

Figure 2: Absolute Zero Paradigm. Supervised learning relies on human-curated reasoning traces for behavior cloning. Reinforcement learning from verified rewards enables agents to self-learn reasoning, but still depends on an expert-defined learning distribution and a corresponding set of curated QA pairs, demanding domain expertise and manual effort. In contrast, we introduce a new paradigm, Absolute Zero, for training reasoning models without any human-curated data. We envision that the agent should autonomously propose tasks optimized for learnability and learn how to solve them using a unified model. The agent learns by interacting with an environment that provides verifiable feedback, enabling reliable and continuous self-improvement entirely without human intervention.

1. Introduction

Large language models (LLMs) have recently achieved remarkable improvements in reasoning capabilities by employing Reinforcement Learning with Verifiable Rewards (RLVR) (Lambert et al., 2024). Unlike methods that explicitly imitate intermediate reasoning steps, RLVR uses only outcome-based feedback, enabling large-scale reinforcement learning over vast task datasets (DeepSeek-AI et al., 2025; Team et al., 2025; Jaech et al., 2024; OpenAI, 2025b; a). A particularly compelling variant is the “zero” RLVR paradigm (DeepSeek-AI et al., 2025), which forgoes any cold-start distillation data, using neither human-generated nor AI-generated reasoning traces, and applies RLVR directly on the base model with task rewards. However, these methods still depend heavily on expertly curated distributions of reasoning question–answer pairs, which raises serious concerns about their long-term scalability (Villalobos et al., 2024). As reasoning models continue to advance, the effort required to construct large-scale, high-quality datasets may soon become unsustainable (Yue et al., 2025). A similar scalability bottleneck has already been identified in the domain of LLM pretraining (Sutskever et al., 2024). Furthermore, as AI systems continue to evolve and potentially exceed human intellect, an exclusive dependence on human-designed tasks risks imposing constraints on their capacity for autonomous learning and growth (Hughes et al., 2024). This underscores the need for a new paradigm that begins to explore possibilities beyond the constraints of human-designed tasks and prepares for a future in which AI systems may surpass human intelligence.

To this end, we propose “Absolute Zero”, a new paradigm for reasoning models in which the model simultaneously learns to define tasks that maximize learnability and to solve them effectively, enabling self-evolution through self-play without relying on external data. In contrast to prior self-play methods that are limited to narrow domains, fixed functionalities, or learned reward models that are prone to hacking (Silver et al., 2017; Chen et al., 2025; 2024), the Absolute Zero paradigm is designed to operate in open-ended settings while remaining grounded in a real environment. It relies on feedback from the environment as a verifiable source of reward, mirroring how humans learn and reason through interaction with the world, and helps prevent issues such as hacking with neural reward models (Hughes et al., 2024). Similar to AlphaZero (Silver et al., 2017), which improves through self-play, our proposed paradigm requires no human supervision and learns entirely through self-interaction. We believe the Absolute Zero paradigm represents a promising step toward enabling large language models to autonomously achieve superhuman reasoning capabilities.

Building on this new reasoning paradigm, we introduce the Absolute Zero Reasoner (AZR), which proposes and solves coding tasks. We cast the code executor as an open-ended yet grounded environment, sufficient both to validate task integrity and to provide verifiable feedback for stable training. We let AZR construct three types of coding tasks: inferring and reasoning about one particular element of a (program, input, output) triplet, which corresponds to three complementary modes of reasoning: induction, abduction, and deduction. We train the entire system end-to-end with a newly proposed reinforcement learning advantage estimator tailored to the multitask nature of the proposed approach.
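To make the triplet framing concrete, here is a minimal sketch of how a code executor could verify each of the three task types. It assumes that deduction predicts the output from the program and input, abduction predicts an input that reproduces the given output, and induction synthesizes a program from input/output pairs; the function names are hypothetical, and `exec` is used without any sandboxing purely for illustration.

```python
def execute(program_src, arg):
    """Run a program that defines f and return f(arg). Not a sandbox."""
    scope = {}
    exec(program_src, scope)
    return scope["f"](arg)


def check_deduction(program_src, arg, predicted_output):
    """Given program and input, the solver predicts the output."""
    return execute(program_src, arg) == predicted_output


def check_abduction(program_src, predicted_input, output):
    """Given program and output, the solver predicts an input.

    Any input that reproduces the output is accepted (assumed rule),
    because several inputs can map to the same output.
    """
    return execute(program_src, predicted_input) == output


def check_induction(predicted_program_src, io_pairs):
    """Given input/output pairs, the solver writes a program matching all of them."""
    return all(execute(predicted_program_src, i) == o for i, o in io_pairs)


square = "def f(x):\n    return x * x"

ded_ok = check_deduction(square, 3, 9)
abd_ok = check_abduction(square, -3, 9)   # -3 differs from 3 but still yields 9
ind_ok = check_induction("def f(x):\n    return x ** 2", [(2, 4), (5, 25)])
ind_bad = check_induction("def f(x):\n    return x + 2", [(2, 4), (5, 25)])
```

All three checks reduce to running code and comparing values, which is what makes a single executor usable both for validating proposed tasks and for scoring answers.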

Despite being trained entirely without any in-distribution data, AZR demonstrates remarkable capabilities across diverse reasoning tasks in math and coding. In mathematics, AZR achieves competitive performance compared to zero reasoner models explicitly fine-tuned with domain-specific supervision. In coding tasks, AZR establishes a new state-of-the-art performance, surpassing models specifically trained with code datasets using RLVR. Furthermore, AZR outperforms all previous models trained in the “zero” setting using in-domain data by an average of 1.8 absolute points. These surprising results highlight that general reasoning skills can emerge without human-curated, domain-targeted data, positioning Absolute Zero as a promising research direction and AZR as a first pivotal milestone. Besides the remarkable results AZR achieved with zero human data for reasoning, we also report several interesting findings, summarized below:

● Code priors amplify reasoning. The base Qwen-Coder-7b model started with math performance 3.6 points lower than Qwen-7b. But after AZR training for both models, the coder variant surpassed the base by 0.7 points, suggesting that strong coding capabilities may amplify overall reasoning improvements after AZR training.

● Cross-domain transfer is more pronounced for AZR. After RLVR, expert code models raise math accuracy by only 0.65 points on average, whereas AZR-Base-7B and AZR-Coder-7B trained on self-proposed code reasoning tasks improve math average by 10.9 and 15.2, respectively, demonstrating much stronger generalized reasoning capability gains.

● Bigger bases yield bigger gains. Performance improvements scale with model size: the 3B, 7B, and 14B coder models gain +5.7, +10.2, and +13.2 points respectively, suggesting continued scaling is advantageous for AZR.

● Comments as intermediate plans emerge naturally. When solving code induction tasks, AZR often interleaves step-by-step plans as comments and code (Figure 19), resembling the ReAct prompting framework (Yao et al., 2023). Similar behavior has been observed in much larger formal-math models such as DeepSeek Prover v2 (671B) (Ren et al., 2025). We therefore believe that allowing the model to use intermediate scratch-pads when generating long-form answers may be beneficial in other domains as well.

● Cognitive behaviors and token length depend on reasoning mode. Distinct cognitive behaviors, such as step-by-step reasoning, enumeration, and trial-and-error, all emerged through AZR training, but different behaviors are particularly evident across different types of tasks. Furthermore, token counts grow over AZR training, but the magnitude of the increase also differs by task type: abduction grows the most because the model performs trial-and-error until the output matches, whereas deduction and induction grow modestly.

● Safety alarms ringing. We observe that AZR with Llama3.1-8b occasionally produces concerning chains of thought, which we term the “uh-oh moment” (example shown in Figure 32), highlighting the need for future work on safety-aware training (Zhang et al., 2025a).
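The numbers in the first two findings can be cross-checked with a few lines of arithmetic. The snippet below only uses figures quoted above; the variable names are ours.

```python
# Figures quoted from the findings above (all in accuracy points).
gap_before = -3.6   # Qwen-Coder-7b minus Qwen-7b on math, before AZR
gap_after = 0.7     # the same difference after AZR training
base_gain, coder_gain = 10.9, 15.2  # math gains of AZR-Base-7B and AZR-Coder-7B
expert_gain = 0.65  # average math gain of expert code models after RLVR

# The coder model closes a 3.6-point deficit and ends 0.7 ahead.
swing = round(gap_after - gap_before, 1)

# That swing should equal the difference between the two reported math gains.
gain_diff = round(coder_gain - base_gain, 1)

# How many times larger the AZR math gains are than the expert-model gain.
ratio_base = base_gain / expert_gain
ratio_coder = coder_gain / expert_gain

print(swing, gain_diff, round(ratio_base, 1), round(ratio_coder, 1))
```

Both routes give a 4.3-point relative swing in favor of the coder variant, so the two bullets are consistent with each other, and the AZR math gains come out roughly 17 and 23 times the 0.65-point average of the expert code models.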

6. Conclusion and Discussion

In this work, we proposed the Absolute Zero paradigm, a novel setting that addresses the data limitations of existing RLVR frameworks. In this paradigm, reasoning agents are tasked with generating their own learning task distributions and improving their reasoning abilities with environmental guidance. We then presented our own instantiation, the Absolute Zero Reasoner (AZR), which is trained by having agents propose and solve code-related reasoning tasks grounded by a code executor.

We evaluated our trained models on out-of-distribution benchmarks in both the code generation and mathematical reasoning domains. Remarkably, even though our models were not directly trained on these tasks and lacked human expert-curated datasets, our reasoning agents achieved exceptional performance, surpassing the state-of-the-art in combined general reasoning scores and in coding. This demonstrates the potential of the absolute zero paradigm to drive superior reasoning capabilities without the need for extensive domain-specific training data. Furthermore, we showed that AZR scales efficiently, offering strong performance across varying model sizes, and can enhance the capabilities of other model classes as well. To foster further exploration and advancement of this emerging paradigm, we are releasing the code, models, and logs as open-source, encouraging the research community to build upon our findings.

We believe there remains much to explore, such as altering the environment from which the reasoner receives verifiable feedback, including sources like the world wide web, formal math languages (Sutton, 2001; Ren et al., 2025), world simulators, or even the real world. Furthermore, AZ’s generality could possibly be extended to domains such as embodied AI (Zitkovich et al., 2023; Yue et al., 2024). Additionally, more complex agentic tasks or scientific experiments present exciting opportunities to further advance the absolute zero setting to different application domains (Wu et al., 2024; 2023). Beyond that, future directions could include exploring multimodal reasoning models, modifying the distribution p(z) to incorporate privileged information, defining or even letting the model dynamically learn how to define f (Equation 3), or designing exploration/diversity rewards for both the propose and solve roles.
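As a reference for the symbols p(z) and f mentioned here, the Absolute Zero objective can be written roughly as follows. This is a reconstruction from the paradigm as described in this post and should be checked against Equation 3 of the paper; the weighting coefficient λ and the exact subscripts are assumptions.

```latex
\mathcal{J}(\theta) := \max_{\theta}\;
\mathbb{E}_{z \sim p(z)} \Big[
  \mathbb{E}_{(x,\, y^{\star}) \sim f_e(\cdot \mid \tau),\;
              \tau \sim \pi_{\theta}^{\text{propose}}(\cdot \mid z)}
  \Big[
    r_e^{\text{propose}}(\tau, \pi_{\theta})
    + \lambda\, \mathbb{E}_{y \sim \pi_{\theta}^{\text{solve}}(\cdot \mid x)}
      \big[ r_e^{\text{solve}}(y, y^{\star}) \big]
  \Big]
\Big]
```

Here z conditions task proposal, f_e is the environment's transformation of a proposed task τ into a validated pair (x, y*), and the two reward terms correspond to the propose and solve roles.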

While underappreciated in current reasoning literature, the exploration component of RL has long been recognized as a critical driver for emergent behavior in traditional RL (Yue et al., 2025; Silver et al., 2016; Ladosz et al., 2022). Years of research have examined various forms of exploration, even in related subfields using LLMs such as red teaming (Zhao et al., 2025a), yet its role in LLM reasoning models remains underexplored. Taking this a step further, our framework investigates an even more meta-level exploration problem: exploration within the learning task space, where the agent learns not just how to solve tasks, but what tasks to learn from and how to find them. Rather than being confined to a fixed problem set, AI reasoner agents may benefit from dynamically defining and refining their own learning tasks. This shift opens a powerful new frontier, where agents not only explore solution spaces but also expand the boundaries of problem spaces. We believe this is a promising and important direction for future research.

One limitation of our work is that we did not address how to safely manage a system composed of such self-improving components. To our surprise, we observed several instances of safety-concerning CoT from the Llama-3.1-8B model, which we term the “uh-oh moment”. These findings suggest that the proposed absolute zero paradigm, while reducing the need for human intervention in curating tasks, still necessitates oversight due to lingering safety concerns, which remains a critical direction for future research (Wang et al., 2024; 2025a).

As a final note, we explored reasoning models that possess experience: models that not only solve given tasks, but also define and evolve their own learning task distributions with the help of an environment. Our results with AZR show that this shift enables strong performance across diverse reasoning tasks, even with significantly fewer privileged resources, such as curated human data. We believe this could finally free reasoning models from the constraints of human-curated data (Morris, 2025) and marks the beginning of a new chapter for reasoning models: “welcome to the era of experience” (Silver & Sutton, 2025; Zhao et al., 2024).
