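LLMs and Memory: Translation and Interpretation of “LLMs Do Not Have Human-Like Working Memory”

Overview: Through three carefully designed experiments, the paper argues that current large language models (LLMs) lack human-like working memory. The results indicate that LLMs cannot maintain and manipulate information internally unless it is written out explicitly in the context, which leads to unrealistic responses and self-contradictions. The study highlights this limitation in the cognitive abilities of LLMs and encourages future work on integrating working memory mechanisms into LLMs to improve their reasoning and human-likeness.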

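>> Background and pain points

● Human working memory is an active cognitive system that not only stores information temporarily but also processes and uses it.

● Without working memory, a speaker may produce unrealistic conversations, contradict themselves, and struggle with tasks that require mental reasoning.

● Existing large language models (LLMs) show human-like abilities to some degree, but whether they have human-like working memory remains unknown.

● Earlier studies used the N-back task to evaluate the working memory of LLMs, but in those tests the key information sits in the input context that the LLM can access, which differs markedly from how humans are tested.

● New experiments are needed in which the key information is not explicitly present in the context, so that working memory can be assessed more accurately.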

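>> Proposed approach

● Three experiments are designed to test whether LLMs lack human-like cognitive abilities, specifically working memory:

● Number Guessing Game

● Yes-No Game

● Math Magic

● These experiments test whether LLMs can hold information internally without explicitly externalizing it in the context.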

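>> Core idea and steps

* Number Guessing Game:

● Ask the LLM to “think of” a number between 1 and 10.

● Ask the LLM a series of questions, such as “Is the number you are thinking of 7? Answer yes or no.”

● Record how often the LLM answers “Yes” for each number and treat that frequency as a probability.

● A player who truly holds one number answers “Yes” only for that number, so the ten probabilities should sum to 1. If the sum deviates significantly from 1, the LLM either did not have a specific number in mind or is lying.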
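
The sum-to-one check above can be illustrated with a small simulation. This is a minimal sketch, not the paper's evaluation code: `yes_frequencies`, `committed_player`, and `uncommitted_player` are hypothetical names, the two players are toy stand-ins for an LLM, and each call to `ask(n)` is assumed to represent one fresh, independent session.

```python
import random
from typing import Callable, Dict


def yes_frequencies(ask: Callable[[int], bool], trials: int = 2000,
                    low: int = 1, high: int = 10) -> Dict[int, float]:
    """Estimate, for each candidate n, how often the player answers 'Yes'.

    Every call to ask(n) stands for one fresh, independent session.
    """
    return {
        n: sum(ask(n) for _ in range(trials)) / trials
        for n in range(low, high + 1)
    }


def committed_player(rng: random.Random) -> Callable[[int], bool]:
    """Toy player (assumption) that really picks a secret number each session."""
    def ask(n: int) -> bool:
        secret = rng.randint(1, 10)  # inclusive on both ends
        return secret == n
    return ask


def uncommitted_player(rng: random.Random,
                       p_yes: float = 0.02) -> Callable[[int], bool]:
    """Toy player (assumption) that holds no number and mostly answers 'No'."""
    def ask(n: int) -> bool:
        return rng.random() < p_yes
    return ask


rng = random.Random(0)
committed_sum = sum(yes_frequencies(committed_player(rng)).values())
uncommitted_sum = sum(yes_frequencies(uncommitted_player(rng)).values())
print(f"committed player:   sum of Yes-probabilities = {committed_sum:.2f}")
print(f"uncommitted player: sum of Yes-probabilities = {uncommitted_sum:.2f}")
```

The committed player's probabilities sum to roughly 1 regardless of which numbers it favors, while the player that never commits lands far from 1. That gap is the signal the experiment looks for.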

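* Yes-No Game:

● Ask the LLM to “imagine” an object.

● Ask a series of comparison questions relating that object to other reference objects, such as “Is the object heavier than an elephant?”

● Record how many questions the LLM completes before its first self-contradiction.

● A trial counts as passed if the LLM answers every question without contradicting itself.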
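
One way to detect a self-contradiction in a run of comparison answers is to track the range of values still consistent with them. The sketch below is an assumption about how such a check could work for a single attribute (weight), not the paper's implementation. The function name `questions_before_contradiction` is hypothetical and the reference weights in kilograms are illustrative.

```python
from typing import Iterable, Optional, Tuple


def questions_before_contradiction(
    answers: Iterable[Tuple[float, bool]]
) -> Optional[int]:
    """Count the answers given before the first contradiction.

    Each item is (reference_weight, said_yes) for the question
    'Is the object heavier than <reference>?'.
    Returns None when all answers are mutually consistent.
    """
    lower = float("-inf")  # the object is strictly heavier than this
    upper = float("inf")   # the object is at most this heavy
    for index, (reference, said_yes) in enumerate(answers):
        if said_yes:
            lower = max(lower, reference)
        else:
            upper = min(upper, reference)
        if lower >= upper:  # no weight can satisfy every answer so far
            return index
    return None


# Illustrative reference weights in kg (assumed values, not from the paper).
ELEPHANT, CAT, ADULT_HUMAN, DOG = 5000.0, 4.0, 70.0, 30.0

consistent_run = [(ELEPHANT, False), (CAT, True), (ADULT_HUMAN, False)]
contradictory_run = [(ELEPHANT, False), (CAT, True),
                     (ADULT_HUMAN, True), (DOG, False)]

print(questions_before_contradiction(consistent_run))
print(questions_before_contradiction(contradictory_run))
```

In the second run the object is claimed to be heavier than a 70 kg human but not heavier than a 30 kg dog, so the feasible interval becomes empty at the fourth answer.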

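* Math Magic:

● Ask the LLM to “think of” four numbers and carry out a sequence of operations on them, including copying, rotating, and removing.

● The operations are designed so that exactly two numbers remain at the end, and those two numbers are identical.

● Measure the proportion of trials in which the LLM completes the task correctly.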
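
To see why a copy, rotate, and remove sequence can force the last two numbers to match, here is a simplified variant. It is a sketch under stated assumptions: the exact operation sequence used in the paper is not given in this summary, so the steps below and the function name `math_magic` are hypothetical. The key observation is that after duplicating four numbers, the sequence repeats with period 4, and rotation preserves that property.

```python
from collections import deque
from typing import List, Sequence


def math_magic(numbers: Sequence[int], rotations: Sequence[int]) -> List[int]:
    """Simplified copy / rotate / remove routine (hypothetical variant).

    1. Copy:   a b c d  ->  a b c d a b c d
    2. Rotate: move the top k items to the bottom, for each k in rotations
    3. Remove: drop every item except positions 0 and 4
    Positions 0 and 4 always hold the same value, because the duplicated
    sequence has period 4 and rotation preserves that period.
    """
    if len(numbers) != 4:
        raise ValueError("expected exactly four numbers")
    sequence = deque(list(numbers) * 2)
    for k in rotations:
        sequence.rotate(-k)  # negative rotates left: top k items go to the bottom
    return [value for position, value in enumerate(sequence)
            if position % 4 == 0]


remaining = math_magic([3, 1, 4, 1], rotations=[2, 5])
print(remaining, "match" if remaining[0] == remaining[1] else "mismatch")
```

A solver that tracks the hidden sequence step by step always ends with a matching pair, so the fraction of trials that end in a mismatch gives a direct failure rate for that kind of internal state tracking.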

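* Model evaluation:

● Run the experiments across several model families (GPT, Qwen, DeepSeek, LLaMA) and reasoning approaches (CoT, o1-like reasoning).

● Analyze the results to determine whether LLMs exhibit human-like cognitive behavior.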

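>> Strengths

● Well-designed experiments: all three avoid placing the key information explicitly in the context, so they assess LLM working memory more accurately.

● Diverse model evaluation: the experiments cover multiple model families and reasoning approaches, giving a fuller picture of LLM working memory.

● Consistent results: every experiment indicates that LLMs lack human-like working memory.

● A clearly stated limitation: the work pinpoints the weakness of LLMs in internally representing and manipulating transient information.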

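>> Conclusions and takeaways

● LLMs do not have human-like working memory.

● LLMs cannot internally represent and manipulate transient information across multiple reasoning steps; they rely on the immediate prompt context instead.

● Even advanced prompting strategies, such as CoT prompting, yield only marginal improvements on tasks that require internal state management.

● This limitation leads to unrealistic responses, self-contradictions, and an inability to perform mental manipulation.

● Future research should focus on integrating working memory mechanisms into LLMs to strengthen their reasoning and human-likeness.

● The experimental results show that newer models do not necessarily outperform older ones.

● Using CoT reasoning does not meaningfully improve performance.

● LLMs are biased in the numbers they pick and tend to choose 7.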

“LLMs Do Not Have Human-Like Working Memory”: Translation and Interpretation

Link: [2505.10571] LLMs Do Not Have Human-Like Working Memory

Date: April 30, 2025

Authors: Jen-tse Huang, Kaiser Sun, Wenxuan Wang, Mark Dredze

Abstract

Human working memory is an active cognitive system that enables not only the temporary storage of information but also its processing and utilization. Without working memory, individuals may produce unreal conversations, exhibit self-contradiction, and struggle with tasks requiring mental reasoning. In this paper, we demonstrate that Large Language Models (LLMs) lack this human-like cognitive ability, posing a significant challenge to achieving artificial general intelligence. We validate this claim through three experiments: (1) Number Guessing Game, (2) Yes or No Game, and (3) Math Magic. Experimental results on several model families indicate that current LLMs fail to exhibit human-like cognitive behaviors in these scenarios. By highlighting this limitation, we aim to encourage further research in developing LLMs with improved working memory capabilities.

1. Introduction

Imagine the following scenario: You select a number between one and ten. When ready, you are asked, “Is the number greater than five?” Upon answering, others can infer that the number has likely entered your conscious awareness, as its clear perception is necessary for making the comparison and providing a response.

Such conscious awareness is commonly referred to as Working Memory (Atkinson & Shiffrin, 1968). In contrast to long-term memory, working memory is the system required to maintain and manipulate information during complex tasks such as reasoning, comprehension, and learning (Baddeley, 2010). Deficits in working memory can impair information processing and hinder effective communication (Gruszka & Nęcka, 2017; Cowan, 2014).

In this paper, we demonstrate that, despite exhibiting increasingly human-like abilities (Huang et al., 2024b, a), Large Language Models (LLMs) lack a fundamental aspect of human cognition: working memory. As a result, LLMs generate unrealistic responses, display self-contradictions, and fail in tasks requiring mental manipulation.

Previous studies have used the N-back task (Kirchner, 1958) to evaluate the working memory of LLMs (Gong et al., 2024; Zhang et al., 2024). However, a fundamental limitation of these tests is that the critical information needed to answer correctly is in the input context accessible to LLMs. This differs markedly from human testing, where participants cannot view prior steps. LLMs, in contrast, can simply attend to earlier input tokens retained in their context window. To more accurately assess working memory, it is essential to design experiments in which the key information is not explicitly present in the context.

The core challenge is to prove that there is actually nothing in LLMs’ mind without knowing what exactly is in their mind. To investigate this limitation, we design three experiments—(1) Number Guessing Game, (2) Yes-No Game, and (3) Math Magic—that test whether LLMs can maintain information internally without explicitly externalizing it in the context. Experimental results reveal that LLMs consistently fail in this capacity, regardless of model family (GPT (Hurst et al., 2024), Qwen (Yang et al., 2024), DeepSeek (Liu et al., 2024; Guo et al., 2025), LLaMA (Grattafiori et al., 2024)) or reasoning approach (Chain-of-Thought (CoT) (Wei et al., 2022) or o1-like reasoning (Jaech et al., 2024)).

Figure 1: When ChatGPT says it already has a number in mind, and it is not 4, how can we know whether ChatGPT is lying?

Figure 2: Probabilities of model answering “Yes” for each number from one to ten.

Conclusion

In this study, we present three carefully designed experiments to investigate whether LLMs possess human-like working memory for processing information and generating responses. Across all experiments, the results reveal a consistent pattern: LLMs do not exhibit behavior indicative of a functional working memory. They fail to internally represent and manipulate transient information across multiple reasoning steps, relying instead on the immediate prompt context. Even advanced prompting strategies, such as CoT prompting, yield only marginal improvements on tasks requiring internal state management. This limitation leads to unrealistic responses, self-contradictions, and an inability to perform mental manipulations. We hope these findings encourage future research on integrating working memory mechanisms into LLMs to enhance their reasoning capabilities and human-likeness.
