LLMs and Hallucination: Translation and Commentary on "Why Language Models Hallucinate"

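Overview: The paper analyzes in depth why language models hallucinate. It argues that hallucinations originate in statistical pressures during pretraining and persist because post-training evaluations reward guessing. As a socio-technical mitigation, it proposes modifying evaluation scoring so that it no longer penalizes uncertainty and instead rewards appropriate expressions of uncertainty.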

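The paper lifts the question "why do hallucinations occur?" from an empirical observation to a provable statistical and institutional problem. Its core contribution is reducing generative errors to Is-It-Valid (IIV) binary classification, which yields a lower bound and an analysis of causes. It then argues that what actually suppresses hallucination is not one more dedicated evaluation but a change to the benchmark and leaderboard mechanisms that steer research and products (giving abstentions and expressions of uncertainty reasonable credit). Combined with validity discrimination and interactive supervision during training, this is what can move models, over the long run, from test-taking guesswork toward more trustworthy and honest behavior.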

>> Background and pain points:

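● Anomalous model behavior: models are overconfident and generate statements that look plausible but are wrong (hallucinations). These errors reduce utility and erode trust (example: when asked to answer only if it knows, a model still gives specific but incorrect dates).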

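● Complex root causes: the problem is not only errors in the training data. Even if the training corpus were error-free, modern pretraining objectives would still lead to generated errors in a statistical sense.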

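● Evaluations reward the wrong behavior: most mainstream benchmarks (leaderboards) rely on 0-1, accuracy-style binary scoring, which penalizes "I don't know" and abstention and rewards guessing. Models are therefore pushed to become good test-taking guessers rather than to express uncertainty honestly.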

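● Existing mitigations fall short: current hallucination evaluations and techniques cannot fundamentally change the misaligned incentives created by the widely adopted primary scoring schemes.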
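The incentive problem described in the bullets above can be made concrete with a small calculation. The following is a minimal sketch, assuming a model whose answer is correct with probability p and a grader that awards 1 point for a correct answer and 0 for anything else (wrong answer or abstention). The function name `expected_binary_score` is hypothetical and not taken from the paper.

```python
def expected_binary_score(p_correct, answer):
    """Expected score under 0-1 grading.

    A correct answer earns 1 point; a wrong answer and an abstention
    ("I don't know") both earn 0 points.
    """
    return p_correct if answer else 0.0


if __name__ == "__main__":
    # Compare guessing against abstaining at several confidence levels.
    for p in (0.0, 0.1, 0.5, 0.9):
        guess = expected_binary_score(p, answer=True)
        abstain = expected_binary_score(p, answer=False)
        print(f"p={p:.1f}  guess={guess:.2f}  abstain={abstain:.2f}")
```

Under this grading, guessing never scores below abstaining, so a score-maximizing model has no reason to say "I don't know", however low its confidence.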

>> Proposed solutions:

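● Theoretical analysis: reduce generative errors to the "Is-It-Valid" (IIV) binary classification problem. The valid/error outputs of a generation task are treated as a binary classification, and a mathematical relationship is established (generative error rate ≳ 2 × IIV misclassification rate), which turns hallucination at generation time into the well-understood framework of classification error analysis.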

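● Evaluation and institutional level: pursue a socio-technical mitigation. Adjust the scoring and leaderboard rules of existing mainstream benchmarks and stop using purely binary 0-1 scoring that penalizes abstention and uncertainty, thereby changing the optimal strategy that training and fine-tuning (including RLHF, DPO, and similar methods) converge to under leaderboard pressure.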

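● Modify evaluation methods: change how existing benchmarks are scored so that they no longer penalize uncertain responses and instead reward appropriate expressions of uncertainty. This requires socio-technical changes to influential leaderboards, not just additional hallucination evaluations: the scoring of the primary evaluations that steer research and product direction (such as the subsets used by leaderboards) should be modified first, so that incentives shift quickly and systematically across the whole ecosystem.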

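● State explicit confidence targets: spell out the confidence target in the evaluation instructions, for example: answer only if your confidence exceeds a given threshold, because incorrect answers are penalized, correct answers are rewarded, and "I don't know" is neither rewarded nor penalized.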
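The confidence-target idea in the last bullet can be sketched as a scoring rule. The sketch below assumes a threshold t, 1 point for a correct answer, 0 points for "I don't know", and a penalty of t/(1−t) points for a wrong answer. That penalty is the form I recall from the paper's suggested instruction; the text above only says that wrong answers are penalized, so treat the exact value as an assumption. All function names are hypothetical.

```python
def wrong_answer_penalty(t):
    """Assumed penalty for a wrong answer at confidence target t (0 <= t < 1)."""
    return t / (1.0 - t)


def expected_score(p_correct, t, answer):
    """Expected score: +1 if correct, -penalty if wrong, 0 for "I don't know"."""
    if not answer:
        return 0.0
    return p_correct - (1.0 - p_correct) * wrong_answer_penalty(t)


def should_answer(p_correct, t):
    """Answering beats abstaining exactly when its expected score is positive."""
    return expected_score(p_correct, t, answer=True) > 0.0


if __name__ == "__main__":
    t = 0.75
    for p in (0.5, 0.7, 0.8, 0.95):
        print(f"p={p:.2f}  answer? {should_answer(p, t)}")
```

With this penalty, the expected score of answering is positive only when p_correct > t, so the best strategy under the stated rule is to answer when confidence exceeds the target and abstain otherwise.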

>> Core ideas and steps:

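● Connect supervised and unsupervised learning: relate the errors of a generative model to binary classification errors in supervised learning. Through the "Is-It-Valid" (IIV) binary classification problem, the generative error rate is tied mathematically to the IIV misclassification rate.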

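● Analyze pretraining and post-training: split language model training into pretraining and post-training and analyze the factors that lead to hallucination in each stage. The pretraining analysis covers errors in general; the post-training analysis covers overconfident hallucinations.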

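● Identify the statistical drivers: identify the main statistical drivers of hallucination, covering both its origin in pretraining and its persistence through post-training.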

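● Modeling and formalization: define the error set E and the valid output set V (the plausible strings are X = E ∪ V) and include the prompt in the formalism, so that the analysis covers both intrinsic and extrinsic hallucinations.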

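● Reduction: construct the Is-It-Valid (IIV) supervised problem. Label a large number of examples as + (valid) or − (error) and show that any generator can be interpreted as an IIV classifier, which ties the generative error rate to the IIV misclassification rate (yielding a lower bound).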

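● Statistical factor analysis: analyze the statistical causes of IIV misclassification (roughly: separable cases, cases where the model fits poorly, and "indistinguishable" cases where there is no learnable pattern at all), and use Good-Turing / missing-mass style intuition to show that if many facts appear only once in the training corpus, a minimum hallucination rate is unavoidable.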

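● Post-training and evaluation interaction: examine why post-training (RLHF, RLAIF, DPO, and so on) still leaves overconfident hallucinations. The objective of training and evaluation (the leaderboard score) itself favors guessing over honest abstention, so models learn to "answer when confident, and guess when not" in order to maximize leaderboard performance.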

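● Mitigation paths: work on the statistical side and the evaluation side in parallel. On one hand, training can incorporate effective validity discrimination or abstention mechanisms (or an interactive validity oracle). On the other hand, the scoring of mainstream evaluations must change (give IDK partial or full reasonable credit, or design positive incentives for expressing uncertainty) in order to change the long-term trend.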
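The Good-Turing / missing-mass intuition in the steps above can be illustrated with a singleton count. In the classic Good-Turing estimate, the fraction of observations that occur exactly once approximates the probability mass of items not seen at all. The sketch below is a toy illustration under that reading, not the paper's formal definition; the corpus contents and the function name `singleton_rate` are hypothetical.

```python
from collections import Counter


def singleton_rate(facts):
    """Fraction of training observations whose fact appears exactly once.

    This is the Good-Turing style estimate N1 / N, where N1 is the number
    of facts seen exactly once and N is the total number of observations.
    """
    if not facts:
        raise ValueError("facts must be non-empty")
    counts = Counter(facts)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(facts)


if __name__ == "__main__":
    # Hypothetical corpus: each string stands for one occurrence of a fact.
    corpus = ["a", "a", "a", "b", "b", "c", "c", "c", "d", "e"]
    print(f"singleton rate: {singleton_rate(corpus):.2f}")
```

In this toy corpus two of the ten observations are singletons. Under the argument summarized above, a rate like this acts as a rough floor on how often a pretrained model errs on such arbitrary facts, since a fact seen once gives the model no pattern to learn from.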

>> Strengths:

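● Explains the origin of hallucination: shows that hallucinations are not mysterious but arise as binary classification errors under natural statistical pressure. The paper provides the mathematical tools and a lower bound for reducing an unsupervised generation problem to supervised binary classification, moving "why hallucinations occur" from empirical summary to a provable statistical conclusion.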

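● Proposes a workable solution: proposes a socio-technical mitigation that suppresses hallucination by modifying evaluation methods, with concrete suggestions for the changes.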

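● Applies to many kinds of models: the analysis covers a broad range of language models, including reasoning and search-and-retrieval models. Its conclusions do not depend on next-word prediction or Transformer architectures; they apply to the modern pretraining plus post-training paradigm.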

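● Actionable: it offers not only a diagnosis (why hallucinations happen) but also a feasible institution-level mitigation (changing leaderboard scoring), which can directly change the incentives of research and development.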

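● Quantitative perspective: links the hallucination rate quantitatively to the facts that appear only once in the corpus and to the IIV misclassification rate, which helps assess how much improvement is possible and where the limits lie.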
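The claim that any generator can be read as an IIV classifier can be sketched by thresholding the generator's probabilities. The toy example below makes several assumptions that are mine, not confirmed by the text above: the threshold is set to 1/|E| (as I recall the paper doing), and the IIV test distribution is a 50/50 mix of valid samples drawn from the true distribution and uniformly random errors. All names and numbers are hypothetical, and the paper's precise bound contains additional terms that are not reproduced here.

```python
def make_iiv_classifier(model_prob, threshold):
    """Label a string '+' (valid) if the generator gives it enough probability."""
    def classify(x):
        return "+" if model_prob.get(x, 0.0) > threshold else "-"
    return classify


def generative_error_rate(model_prob, errors):
    """Probability mass the generator places on the error set E."""
    return sum(p for x, p in model_prob.items() if x in errors)


def iiv_error_rate(classify, true_prob, errors):
    """Misclassification rate on an assumed 50/50 mix of valid and error examples."""
    valid_missed = sum(p for x, p in true_prob.items() if classify(x) == "-")
    errors_accepted = sum(1 for x in errors if classify(x) == "+") / len(errors)
    return 0.5 * valid_missed + 0.5 * errors_accepted


if __name__ == "__main__":
    errors = {"e1", "e2", "e3", "e4"}                # hypothetical error set E
    true_prob = {"v1": 0.5, "v2": 0.5}               # hypothetical distribution over V
    model_prob = {"v1": 0.4, "v2": 0.2, "e1": 0.3, "e2": 0.1}

    classify = make_iiv_classifier(model_prob, threshold=1.0 / len(errors))
    print("generative error rate:", generative_error_rate(model_prob, errors))
    print("IIV error rate:", iiv_error_rate(classify, true_prob, errors))
```

The point of the sketch is only the direction of the reduction: a generator that spreads probability onto errors, or starves valid strings of probability, produces a thresholded classifier that misclassifies them, which is why classification difficulty carries over to generation.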

>> Conclusions and viewpoints from the paper (with an emphasis on lessons and recommendations):

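● Rewarding guessing is the root cause of persistent hallucination: current evaluations reward language models for guessing when uncertain rather than admitting uncertainty, which is why hallucinations persist. Their origin is not mysterious either; it is a natural product of statistical learning: when errors and facts are hard to tell apart under the training distribution, pretraining produces unavoidable generative errors.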

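● Changing the evaluation system is key: adding hallucination evaluations is not enough. The primary evaluations must be modified so that they no longer penalize uncertainty. Dedicated hallucination benchmarks alone cannot fix the problem, because the number and influence of mainstream evaluations (leaderboards) determine what models are optimized for.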

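● Explicit confidence targets help calibrate models: stating the confidence target in the evaluation encourages behavioral calibration, that is, giving an answer only when confidence is above the target threshold.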

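● Prioritize changing the scoring rules of mainstream evaluations: give abstention and uncertainty reasonable credit, or design evaluations that reward honest uncertainty, to remove the "guess first" incentive at the institutional level.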

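● At the model development level, classification-style validity discrimination (IIV / validity oracle), interactive labeling, and more cautious post-training objectives should be adopted together and advanced in step with evaluation reform.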

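● Acknowledge the inherent trade-off between consistency (avoiding invalid outputs) and diversity or breadth (generating varied responses). Some theoretical results indicate that fully avoiding hallucination without sacrificing breadth is difficult, which makes institutional and evaluation reform all the more important.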
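The behavioral-calibration point in the list above suggests a simple audit: at a stated confidence target t, the answers a model does give should be correct at least a fraction t of the time. The sketch below is one simplified reading of that idea, not the paper's formal procedure; the `Response` record and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Response:
    answered: bool   # False means the model said "I don't know"
    correct: bool    # only meaningful when answered is True


def answered_accuracy(responses):
    """Accuracy over the questions the model chose to answer (None if it answered none)."""
    answered = [r for r in responses if r.answered]
    if not answered:
        return None
    return sum(1 for r in answered if r.correct) / len(answered)


def meets_confidence_target(responses, t):
    """True if the model's answered accuracy is at least the target t."""
    acc = answered_accuracy(responses)
    return acc is None or acc >= t


if __name__ == "__main__":
    # Hypothetical run: 4 answers (3 correct) and 2 abstentions.
    run = [
        Response(True, True), Response(True, True), Response(True, True),
        Response(True, False), Response(False, False), Response(False, False),
    ]
    print("answered accuracy:", answered_accuracy(run))
    print("meets t=0.75:", meets_confidence_target(run, 0.75))
    print("meets t=0.90:", meets_confidence_target(run, 0.90))
```

A check like this only looks at answered questions, so it should be read together with the abstention rate: a model that abstains on everything passes trivially.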


Translation and Commentary on "Why Language Models Hallucinate"

Link	https://www.arxiv.org/abs/2509.04664
Date	September 4, 2025
Author	OpenAI

Abstract

Like students facing hard exam questions, large language models sometimes guess when uncertain, producing plausible yet incorrect statements instead of admitting uncertainty. Such "hallucinations" persist even in state-of-the-art systems and undermine trust. We argue that language models hallucinate because the training and evaluation procedures reward guessing over acknowledging uncertainty, and we analyze the statistical causes of hallucinations in the modern training pipeline. Hallucinations need not be mysterious -- they originate simply as errors in binary classification. If incorrect statements cannot be distinguished from facts, then hallucinations in pretrained language models will arise through natural statistical pressures. We then argue that hallucinations persist due to the way most evaluations are graded -- language models are optimized to be good test-takers, and guessing when uncertain improves test performance. This "epidemic" of penalizing uncertain responses can only be addressed through a socio-technical mitigation: modifying the scoring of existing benchmarks that are misaligned but dominate leaderboards, rather than introducing additional hallucination evaluations. This change may steer the field toward more trustworthy AI systems.


1. Introduction

Language models are known to produce overconfident, plausible falsehoods, which diminish their utility and trustworthiness. This error mode is known as “hallucination,” though it differs fundamentally from the human perceptual experience. Despite significant progress, hallucinations continue to plague the field, and are still present in the latest models (OpenAI, 2025a). Consider the prompt:

What is Adam Tauman Kalai’s birthday? If you know, just respond with DD-MM.

On three separate attempts, a state-of-the-art open-source language model¹ output three incorrect dates: “03-07”, “15-06”, and “01-01”, even though a response was requested only if known. The correct date is in Autumn. Footnote 4 provides an example of more elaborate hallucinations.

Hallucinations are an important special case of errors produced by language models, which we analyze more generally using computational learning theory (e.g., Kearns and Vazirani, 1994). We consider general sets of errors E, an arbitrary subset of plausible strings X = E ∪ V, with the other plausible strings V being called valid. We then analyze the statistical nature of these errors, and apply the results for the type of errors of interest: plausible falsehoods called hallucinations. Our formalism also includes the notion of a prompt to which a language model must respond.

The distribution of language is initially learned from a corpus of training examples, which inevitably contains errors and half-truths. However, we show that even if the training data were error-free, the objectives optimized during language model training would lead to errors being generated. With realistic training data containing shades of error, one may expect even higher error rates. Thus, our lower bounds on errors apply to more realistic settings, as in traditional computational learning theory (Kearns and Vazirani, 1994).

Our error analysis is general yet has specific implications for hallucination. It applies broadly, including to reasoning and search-and-retrieval language models, and the analysis does not rely on properties of next-word prediction or Transformer-based neural networks. It only considers the two stages of the modern training paradigm: pretraining and post-training, described below. For hallucinations, taxonomies (Maynez et al., 2020; Ji et al., 2023) often further distinguish intrinsic hallucinations that contradict the user’s prompt, such as:

How many Ds are in DEEPSEEK? If you know, just say the number with no commentary.

DeepSeek-V3 returned “2” or “3” in ten independent trials; Meta AI and Claude 3.7 Sonnet² performed similarly, including answers as large as “6” and “7”. Our theory also sheds light on extrinsic hallucinations, which contradict the training data or external reality.

Figure 1: Is-It-Valid requires learning to identify valid generations using labeled ± examples (left). Classifiers (dashed lines) may be accurate on certain concepts like spelling (top) but errors often arise due to poor models (middle) or arbitrary facts when there is no pattern in the data (bottom).

Table 1: Excerpts from responses to “What was the title of Adam Kalai’s dissertation?” from three popular language models.⁴ None generated the correct title or year (Kalai, 2001).

ChatGPT (GPT-4o): Adam Tauman Kalai’s Ph.D. dissertation (completed in 2002 at CMU) is entitled: “Boosting, Online Algorithms, and Other Topics in Machine Learning.”
DeepSeek: “Algebraic Methods in Interactive Machine Learning”…at Harvard University in 2005.
Llama: “Efficient Algorithms for Learning and Playing Games”…in 2007 at MIT.

6. Conclusions

This paper demystifies hallucinations in modern language models, from their origin during pretraining to their persistence through post-training. In pretraining, we show that generative errors parallel misclassifications in supervised learning, which are not mysterious, and naturally arise due to the minimization of cross-entropy loss.

Many language model shortcomings can be captured by a single evaluation. For example, overuse of the opener “Certainly” can be addressed by a single “Certainly” eval (Amodei and Fridman, 2024) because starting responses with “Certainly” does not significantly impact other evaluations. In contrast, we argue that the majority of mainstream evaluations reward hallucinatory behavior. Simple modifications of mainstream evaluations can realign incentives, rewarding appropriate expressions of uncertainty rather than penalizing them. This can remove barriers to the suppression of hallucinations, and open the door to future work on nuanced language models, e.g., with richer pragmatic competence (Ma et al., 2025).