1 MemGPT: Towards LLMs as Operating Systems
Paper: MemGPT: Towards LLMs as Operating Systems (https://arxiv.org/abs/2310.08560)
Code: https://github.com/letta-ai/letta
1.1 MemGPT
MemGPT (MemoryGPT) borrows the tiered memory-management ideas of traditional operating systems (the paging mechanism between physical memory and disk). Through "virtual context management", it gives an LLM with a fixed context window the illusion of unlimited context. The core idea: treat the LLM's context window as "physical memory" and external storage as "disk", use function calls to "page" information in and out between the two, and manage the control flow to make the most efficient use of the context budget.
MemGPT manages context intelligently through a multi-tier memory design and cooperating functional modules, including the following (a minimal data-structure sketch follows the list):
- Main Context: the analog of RAM. The LLM's prompt tokens, directly visible to LLM inference, split into three parts:
  - System instructions: read-only static content covering MemGPT's control flow, the purpose of each memory tier, and the function-calling rules;
  - Working context: a fixed-size read/write block storing key user information (preferences, facts) and the agent's persona;
  - FIFO queue: a rolling message history (user-agent dialogue, system prompts, function-call records), headed by a recursive summary of evicted messages.
- External Context: the analog of disk. Information beyond the main context window, usable only after a function call pages it into the main context. Two stores:
  - Recall storage: a message database the queue manager automatically writes conversation history into; supports paged search and re-paging into the main context;
  - Archival storage: a vector-search database (e.g., PostgreSQL + pgvector) for long documents, key-value pairs, and other large-scale data; accessed via explicit function calls.
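To make the tiers concrete, here is a minimal data-structure sketch; the class and field names are illustrative, not Letta's actual schema:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class MainContext:
    """Everything the LLM sees directly in its prompt window."""
    system_instructions: str  # read-only: control flow, memory-tier rules, function specs
    working_context: str      # read/write scratchpad for key facts and persona
    fifo_queue: List[str] = field(default_factory=list)  # rolling message history
    recursive_summary: str = ""  # head-of-queue summary of evicted messages

@dataclass
class ExternalContext:
    """Storage outside the window; reachable only via function calls."""
    recall_storage: List[str] = field(default_factory=list)    # full message history DB
    archival_storage: List[str] = field(default_factory=list)  # vector-searchable documents

def compose_prompt(main: MainContext) -> str:
    # Only the main context is ever concatenated into the prompt tokens;
    # external context must be paged in first.
    return "\n".join([main.system_instructions, main.working_context,
                      main.recursive_summary, *main.fifo_queue])
```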
Core functional modules
- Queue Manager:
  - Message handling: appends new messages to the FIFO queue, assembles the prompt tokens to trigger LLM inference, and writes both the messages and the inference results to recall storage;
  - Context-overflow control: when main-context token usage reaches a "warning threshold" (e.g., 70% of the window), it inserts a "memory pressure" system prompt nudging the LLM to save key information to working context or archival storage; at the "flush threshold" (e.g., 100%), it evicts a portion of the messages (e.g., 50% of the window), writes a new recursive summary to the head of the FIFO queue, and keeps the evicted messages permanently in recall storage (see the sketch after this list).
- Function Executor:
  - Parses the LLM's output, executes memory-management functions (searching external storage, editing working context, paging data in), and feeds the results (including errors) back to the LLM, closing a "decide - execute - feedback" loop. Retrieval results are paginated so that a single lookup cannot overflow the main context.
- Control Flow & Function Chaining:
  - Event-triggered inference: user messages, system warnings, and timed tasks all trigger LLM inference; each event is parsed into text and appended to the main context;
  - Sequential multi-function execution: via the `request_heartbeat=true` flag, the LLM can chain multiple function calls (multi-page document retrieval, multi-hop key-value lookups) before returning a response to the user, improving its handling of complex tasks.
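A minimal sketch of the queue manager's pressure control described above, with illustrative thresholds and a crude stand-in token counter (the real manager counts prompt tokens exactly and persists evictions to recall storage):

```python
WARNING_THRESHOLD = 0.7  # warn at 70% of the window (the paper's example)
FLUSH_TARGET = 0.5       # after a flush, retain roughly 50% of the window

def count_tokens(messages):
    # crude stand-in: roughly 4 characters per token
    return sum(len(m) for m in messages) // 4

def on_new_message(queue, msg, window_size, summarize):
    """Append a message, then apply MemGPT-style memory-pressure control."""
    queue.append(msg)
    if count_tokens(queue) > WARNING_THRESHOLD * window_size:
        # warning threshold: nudge the LLM to persist key facts before eviction
        queue.append("[system] Memory pressure: save important info to working/archival memory.")
    if count_tokens(queue) >= window_size:
        # flush threshold: evict the oldest messages and fold them into a
        # recursive summary at the head (evicted messages remain in recall storage)
        evicted = []
        while count_tokens(queue) > FLUSH_TARGET * window_size and len(queue) > 1:
            evicted.append(queue.pop(0))
        queue.insert(0, "[summary] " + summarize(evicted))
    return queue
```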
1.2 Source Code Analysis
1.2.1 Memory Hierarchy: Main Memory / Archival / Recall Tiers
The memory hierarchy (the paper's main-memory / external-storage split) is implemented with Memory Blocks plus Archival Memory:
- Core context (main memory) consists of memory blocks. Each block is a small editable, shareable unit; once created with `client.blocks.create()` and bound to an agent via its `block_ids` field, it is concatenated into the agent's in-context prompt.

```python
agent_state = client.agents.create(
    memory_blocks=[
        {"label": "human", "value": "The human's name is Bob the Builder.", "limit": 5000},
        {"label": "persona", "value": "My name is Sam, the all-knowing sentient AI.", "limit": 5000},
    ],
    model="openai/gpt-4o-mini",
    embedding="openai/text-embedding-3-small",
)
```
- External context: long-term storage beyond the context window (the external tier) is implemented by the `filesystem / folders / passages` modules. File contents are chunked, indexed, and archived; when needed, the agent pulls them back into main memory via tool calls. Letta thereby manages the "finite prompt context" and the "unbounded external persistent store" as separate tiers.

```python
# create the folder
folder = client.folders.create(
    name="my_folder",
    embedding_config=embedding_config,
)

# upload a file into the folder
job = client.folders.files.upload(
    folder_id=folder.id,
    file=open("my_file.txt", "rb"),
)
```
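Archival memory can also be written and queried directly, without going through files. The sketch below assumes the letta-client passages endpoints (`client.agents.passages.create` / `list`); method names may differ across SDK versions, so treat it as illustrative:

```python
# Assumption: archival memory is exposed as "passages" in the client SDK.
# Insert a fact into the agent's archival (vector) storage
passage = client.agents.passages.create(
    agent_id=agent_state.id,
    text="Bob's favorite color is blue.",
)

# Semantic search over archival storage, paged back into the main context on demand
results = client.agents.passages.list(
    agent_id=agent_state.id,
    search="favorite color",
)
```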
MemoryBlock
MemGPT defines a family of Memory classes, all built on MemoryBlock, while external memory is implemented directly via FileProcess.
```python
class BasicBlockMemory(Memory):
    """BasicBlockMemory is a basic implementation of the Memory class, which takes in a
    list of blocks and links them to the memory object. These are editable by the agent
    via the core memory functions.

    Attributes:
        memory (Dict[str, Block]): Mapping from memory block section to memory block.

    Methods:
        core_memory_append: Append to the contents of core memory.
        core_memory_replace: Replace the contents of core memory.
    """

    def __init__(self, blocks: List[Block] = []):
        """Initialize the BasicBlockMemory object with a list of pre-defined blocks.

        Args:
            blocks (List[Block]): List of blocks to be linked to the memory object.
        """
        super().__init__(blocks=blocks)
```
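The two editing tools named in the docstring behave roughly as follows. This is a sketch of their semantics only (assuming a `get_block(label)` accessor), not the actual implementation:

```python
def core_memory_append(memory, label: str, content: str) -> None:
    """Append text to a core memory block, respecting its character limit."""
    block = memory.get_block(label)
    new_value = block.value + "\n" + content
    if len(new_value) > block.limit:
        raise ValueError(f"Edit exceeds the {block.limit}-character limit of block '{label}'")
    block.value = new_value

def core_memory_replace(memory, label: str, old_content: str, new_content: str) -> None:
    """Replace a substring in a core memory block (used to correct stored facts)."""
    block = memory.get_block(label)
    if old_content not in block.value:
        raise ValueError(f"'{old_content}' not found in block '{label}'")
    block.value = block.value.replace(old_content, new_content)
```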
Scheduling
The agent's core scheduler. Its memory-scheduling role (the orchestrator of the memory hierarchy) shows up mainly in the "rebuild the context window" logic: stitching together the memory blocks (main memory), the message history, and the archival/summary content, then sending the result to the LLM.
```python
# letta/agents/letta_agent.py (simplified pseudocode)
class LettaAgent:
    ...
    async def step(...):
        # Each agent step checks/updates the context window
        ...
        await self.rebuild_context_window()
        ...

    async def rebuild_context_window(self):
        """Core memory-scheduling logic:
        1. Fetch the agent's memory blocks (main memory) from the database/storage
        2. Load the most recent conversation messages (short-term memory)
        3. Check context token usage
           - if over the threshold, have the summarizer summarize/evict
        4. Compose everything into the prompt handed to the model
        """
        blocks = await self.get_attached_blocks()
        messages = await self.load_recent_messages()

        # Is the token budget exceeded?
        if self.exceeds_context_limit(blocks, messages):
            summarized = await self.summarizer.summarize(messages)
            messages = summarized

        # Compose the final context
        self.context = self.compose_context(blocks, messages)
```
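The token check in step 3 could be implemented along these lines; a minimal sketch assuming tiktoken for counting and `.value`/`.content` attributes on blocks and messages (the real code counts tokens with the model's own tokenizer and limits):

```python
import tiktoken

CONTEXT_WINDOW = 8192   # illustrative window size
SUMMARIZE_AT = 0.75     # illustrative pressure threshold

def exceeds_context_limit(blocks, messages) -> bool:
    enc = tiktoken.get_encoding("cl100k_base")
    text = "\n".join([b.value for b in blocks] + [m.content for m in messages])
    return len(enc.encode(text)) > SUMMARIZE_AT * CONTEXT_WINDOW
```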
Summarization
When context pressure gets too high, memory is summarized or evicted via the LLM. The summarization prompt is:
```python
WORD_LIMIT = 100
SYSTEM = f"""Your job is to summarize a history of previous messages in a conversation between an AI persona and a human.
The conversation you are given is a from a fixed context window and may not be complete.
Messages sent by the AI are marked with the 'assistant' role.
The AI 'assistant' can also make calls to tools, whose outputs can be seen in messages with the 'tool' role.
Things the AI says in the message content are considered inner monologue and are not seen by the user.
The only AI messages seen by the user are from when the AI uses 'send_message'.
Messages the user sends are in the 'user' role.
The 'user' role is also used for important system events, such as login events and heartbeat events (heartbeats run the AI's program without user action, allowing the AI to act without prompting from the user sending them a message).
Summarize what happened in the conversation from the perspective of the AI (use the first person from the perspective of the AI).
Keep your summary less than {WORD_LIMIT} words, do NOT exceed this word limit.
Only output the summary, do NOT include anything else in your output."""
```
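A sketch of how this prompt produces the recursive summary: serialize the evicted messages into a transcript and send them, with `SYSTEM` above as the system prompt, to a chat-completions model. Letta wraps this call in its `simple_summary` helper; the OpenAI client below is just one way to issue it:

```python
from openai import OpenAI

def summarize_messages(messages: list[dict]) -> str:
    client = OpenAI()
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM},  # the summarization prompt above
            {"role": "user", "content": transcript},
        ],
    )
    # the <=100-word, first-person summary that goes to the head of the FIFO queue
    return resp.choices[0].message.content
```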
Shared Memory
Implementing shared memory is straightforward: add the memory block's id to each agent's block_ids.
```python
# create a shared memory block
shared_block = client.blocks.create(
    label="organization",
    description="Shared information between all agents within the organization.",
    value="Nothing here yet, we should update this over time.",
)

# create a supervisor agent
supervisor_agent = client.agents.create(
    model="anthropic/claude-3-5-sonnet-20241022",
    embedding="openai/text-embedding-3-small",
    # blocks created for this agent
    memory_blocks=[{"label": "persona", "value": "I am a supervisor"}],
    # pre-existing shared block that is "attached" to this agent
    block_ids=[shared_block.id],
)

# create a worker agent
worker_agent = client.agents.create(
    model="openai/gpt-4.1-mini",
    embedding="openai/text-embedding-3-small",
    # blocks created for this agent
    memory_blocks=[{"label": "persona", "value": "I am a worker"}],
    # pre-existing shared block that is "attached" to this agent
    block_ids=[shared_block.id],
)
```
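Since both agents reference the same block id, an edit made through either path is visible to every attached agent the next time its context window is rebuilt. Assuming the SDK's `client.blocks.modify` endpoint (the method name may vary by version):

```python
# Hypothetical update call: both supervisor_agent and worker_agent will see
# the new value once their context windows are rebuilt.
client.blocks.modify(
    block_id=shared_block.id,
    value="Org update: quarterly planning starts Monday.",
)
```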
1.2.2 Queue Manager
MemGPT's queue manager corresponds to the management of the conversation message queue/buffer: the agent's context keeps only a recent slice of messages, while overflow is cleaned up, archived, or summarized. This maps one-to-one to the paper's FIFO queue plus memory-pressure control.
- Message storage. All conversation messages are persisted to a database (Postgres/SQLite; schema in the messages table). On each run the agent does not load everything, only a recent window.
- Queue check during context rebuild. Fetch the latest N messages from the database (the tail of the queue). If the tokens exceed the limit, call the summarizer on the old messages and replace the head of the queue, keeping the queue size bounded.
```python
async def _rebuild_context_window(
    self,
    summarizer: Summarizer,
    in_context_messages: List[Message],
    letta_message_db_queue: List[Message],
) -> None:
    new_letta_messages = await self.message_manager.create_many_messages_async(
        letta_message_db_queue, actor=self.actor
    )

    # TODO: Make this more general and configurable, less brittle
    new_in_context_messages, updated = await summarizer.summarize(
        in_context_messages=in_context_messages, new_letta_messages=new_letta_messages
    )

    await self.agent_manager.update_message_ids_async(
        agent_id=self.agent_id,
        message_ids=[m.id for m in new_in_context_messages],
        actor=self.actor,
    )
```
- Eviction/summarization policy. When the message count or token count crosses a threshold, partial evict buffer summarization is triggered: old messages are merged into a single "summary message" that is put back at the head of the queue.
```python
async def _partial_evict_buffer_summarization(
    self,
    in_context_messages: List[Message],
    new_letta_messages: List[Message],
    force: bool = False,
    clear: bool = False,
) -> Tuple[List[Message], bool]:
    """Summarization as implemented in the original MemGPT loop, but using message count instead of token count.

    Evict a partial amount of messages, and replace message[1] with a recursive summary.
    Note that this can't be made sync, because we're waiting on the summary to inject it into the context window,
    unlike the version that writes it to a block.

    Unless force is True, don't summarize.
    Ignore clear, we don't use it.
    """
    all_in_context_messages = in_context_messages + new_letta_messages

    if not force:
        logger.debug("Not forcing summarization, returning in-context messages as is.")
        return all_in_context_messages, False

    # First step: determine how many messages to retain
    total_message_count = len(all_in_context_messages)
    assert self.partial_evict_summarizer_percentage >= 0.0 and self.partial_evict_summarizer_percentage <= 1.0
    target_message_start = round((1.0 - self.partial_evict_summarizer_percentage) * total_message_count)
    logger.info(f"Target message count: {total_message_count}->{(total_message_count - target_message_start)}")

    # The summary message we'll insert is role 'user' (vs 'assistant', 'tool', or 'system')
    # We are going to put it at index 1 (index 0 is the system message)
    # That means that index 2 needs to be role 'assistant', so walk up the list starting at
    # the target_message_count and find the first assistant message
    for i in range(target_message_start, total_message_count):
        if all_in_context_messages[i].role == MessageRole.assistant:
            assistant_message_index = i
            break
    else:
        raise ValueError(f"No assistant message found from indices {target_message_start} to {total_message_count}")

    # The sequence to summarize is index 1 -> assistant_message_index
    messages_to_summarize = all_in_context_messages[1:assistant_message_index]
    logger.info(f"Eviction indices: {1}->{assistant_message_index}(/{total_message_count})")

    # Dynamically get the LLMConfig from the summarizer agent
    # Pretty cringe code here that we need the agent for this but we don't use it
    agent_state = await self.agent_manager.get_agent_by_id_async(agent_id=self.agent_id, actor=self.actor)

    # TODO if we do this via the "agent", then we can more easily allow toggling on the memory block version
    summary_message_str = await simple_summary(
        messages=messages_to_summarize,
        llm_config=agent_state.llm_config,
        actor=self.actor,
        include_ack=True,
    )
```
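As a worked example: with 20 in-context messages and partial_evict_summarizer_percentage = 0.75, target_message_start = round(0.25 * 20) = 5. The loop then scans forward from index 5 for the first assistant-role message (say it sits at index 7); messages 1 through 6 are summarized by simple_summary, and the resulting summary is inserted as a user-role message at index 1, right after the system message at index 0.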
1.2.3 Function Executor
MemGPT's function executor is a baseline capability of every agent: it performs function calls while handling the LLM's responses. The actual execution is dispatched to different executors.
```python
@trace_method
async def _handle_ai_response(...):  # signature truncated in the original excerpt
    # ... (some checks and parameters omitted)
    # 1. Execute the tool (or synthesize an error result if disallowed)
    tool_rule_violated = tool_call_name not in valid_tool_names and not is_approval
    if tool_rule_violated:
        tool_execution_result = _build_rule_violation_result(tool_call_name, valid_tool_names, tool_rules_solver)
    else:
        # Track tool execution time
        tool_start_time = get_utc_timestamp_ns()
        tool_execution_result = await self._execute_tool(
            tool_name=tool_call_name,
            tool_args=tool_args,
            agent_state=agent_state,
            agent_step_span=agent_step_span,
            step_id=step_id,
        )
```
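In essence the executor is a registry of callables with error feedback; a self-contained sketch of that "decide - execute - feedback" shape (names are illustrative, Letta's real tool-execution manager is richer):

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

@dataclass
class ToolExecutionResult:
    status: str   # "success" or "error"
    output: Any

class FunctionExecutor:
    """Minimal sketch: a registry of callables plus error feedback to the LLM."""

    def __init__(self):
        self.tools: Dict[str, Callable[..., Any]] = {}

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        self.tools[name] = fn

    def execute(self, tool_name: str, tool_args: dict) -> ToolExecutionResult:
        if tool_name not in self.tools:
            return ToolExecutionResult("error", f"Unknown tool: {tool_name}")
        try:
            return ToolExecutionResult("success", self.tools[tool_name](**tool_args))
        except Exception as e:
            # errors go back into the context so the LLM can retry or adapt
            return ToolExecutionResult("error", f"{type(e).__name__}: {e}")

executor = FunctionExecutor()
# the `page` parameter mirrors the paging mechanism that keeps results from
# overflowing the main context
executor.register("archival_memory_search", lambda query, page=0: f"results for {query!r}, page {page}")
print(executor.execute("archival_memory_search", {"query": "Bob's birthday"}))
```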
1.2.4 Control Flow and Function Chaining
MemGPT's control flow and function chaining are what give the agent "programmable conversation logic". The core is driven by step(): each call to agent.step() is one iteration of the event loop. The basic flow is: LLM output → scheduler parses → executor executes → queue updates → next round. Pseudocode:
```python
# letta/agents/letta_agent.py (simplified)
async def step(self, user_input=None):
    # 1. Build the context
    await self.rebuild_context_window()

    # 2. Call the model
    model_output = await self.model.generate(self.context, user_input)

    # 3. Choose the control flow based on the output type
    if model_output.function_call:
        response = await self.execute_function_call(model_output.function_call)
    else:
        response = model_output.text

    # 4. Update the queue (short-term memory)
    await self.message_queue.enqueue(response)
    return response
```
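The `request_heartbeat` flag from §1.1 is what turns single steps into function chains; a compressed, illustrative sketch (`output` is assumed to be a dict carrying the flag and the function result):

```python
# Illustrative sketch (not Letta's actual loop): when a tool call sets
# request_heartbeat=True, control returns to the LLM instead of the user.
async def run_until_user_turn(agent, user_input, max_steps=10):
    event = user_input
    for _ in range(max_steps):          # cap to prevent runaway chains
        output = await agent.step(event)
        if not output.get("request_heartbeat"):
            return output               # LLM yielded the turn back to the user
        # feed the function result back as a synthetic event and continue
        event = f"[heartbeat] function returned: {output['result']}"
    raise RuntimeError("max chained steps exceeded")
```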
1.3 Key Takeaways
- Memory hierarchy: main context memory, external memory, and archival/summarized memory;
- Memory scheduling: frequently accessed content stays in the context; infrequently accessed content is pushed to external storage;
- Queue management: manages the incoming message stream and function-call results, ensuring the context the LLM sees each turn is the "most useful" subset.