
Most agent memory is retrieval-augmented guessing. You store facts in a vector database, shove the top-k results into the context window on every turn, and hope the model notices. The retrieval score becomes a lottery. The relevance threshold becomes a knob you tweak. The LLM, the most expensive component in the entire stack, is treated as a passive reader of whatever you decided was important.
Letta, the open-source descendant of the MemGPT paper, takes a different bet: give the model explicit memory-management tool calls and let it page its own context the way a kernel pages RAM. That sounds like a small architectural choice. It is not. It is the only design I've seen in open-source agent infrastructure that takes the LLM seriously as a system, not a function call.
The original MemGPT paper, from Charles Packer and the Berkeley group in 2023, framed the problem precisely: a fixed context window is a hard memory limit, and bolting external retrieval onto a stateless model is a hack. The fix was to give the model a tiered memory hierarchy it could manage itself:
- **Core memory**: a small set of editable blocks that always sit in the context window.
- **Recall memory**: the full conversation history, searchable on demand.
- **Archival memory**: an external long-term store for anything that does not fit in context.
The model decides when to read from recall, when to write to core, when to evict to archival. It is doing its own memory management. The framework just gives it the syscalls.
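To make the "syscalls" framing concrete, here is a minimal sketch of a three-tier store. It is a toy, not Letta's implementation: the `TieredMemory` class, its method names, and the character limit are all assumptions made up for illustration. The point it tries to show is that the store never evicts on its own. It refuses the write and leaves the eviction decision to the caller, which in the real system is the model.

```python
from dataclasses import dataclass, field


@dataclass
class TieredMemory:
    """Toy three-tier memory. Names and limits are illustrative, not Letta's."""
    core_limit: int = 200                                # max characters per core block
    core: dict[str, str] = field(default_factory=dict)   # always rendered into the prompt
    recall: list[str] = field(default_factory=list)      # full message history
    archival: list[str] = field(default_factory=list)    # long-term overflow store

    def core_append(self, label: str, text: str) -> None:
        """Add a line to a core block, or refuse if the block would overflow."""
        updated = (self.core.get(label, "") + "\n" + text).strip()
        if len(updated) > self.core_limit:
            raise ValueError(f"core block '{label}' is full; evict something first")
        self.core[label] = updated

    def evict(self, label: str, text: str) -> None:
        """Move one line out of a core block into archival storage."""
        lines = self.core[label].split("\n")
        lines.remove(text)  # raises ValueError if the line is not in the block
        self.core[label] = "\n".join(lines)
        self.archival.append(text)

    def recall_search(self, query: str) -> list[str]:
        """Naive substring search over past messages (a real system would rank)."""
        return [m for m in self.recall if query.lower() in m.lower()]

    def archival_search(self, query: str) -> list[str]:
        """Naive substring search over evicted facts."""
        return [m for m in self.archival if query.lower() in m.lower()]


mem = TieredMemory(core_limit=48)
mem.recall.append("user: we migrated off MongoDB in 2022")
mem.core_append("human", "Prefers Postgres.")
mem.core_append("human", "Dislikes ORMs.")
try:
    mem.core_append("human", "Works on billing service.")
except ValueError:
    # The caller (the model, in the real system) chooses what to page out.
    mem.evict("human", "Prefers Postgres.")
    mem.core_append("human", "Works on billing service.")
```

After this runs, the Postgres preference lives in archival storage and is only visible again if something calls `archival_search`. That is the trade the hierarchy makes: a bounded prompt in exchange for an explicit lookup.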
The API is deceptively simple. Create an agent with named memory blocks, then send messages:
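```python
import os

from letta_client import Letta

# Read the API key from the environment rather than hard-coding it.
client = Letta(token=os.environ["LETTA_API_KEY"])

# Create an agent whose core memory starts with two labeled blocks.
agent = client.agents.create(
    model="openai/gpt-5.2",
    memory_blocks=[
        {"label": "human", "value": "Name: Aashi. Senior backend engineer. Prefers Postgres."},
        {"label": "persona", "value": "I am a precise, no-fluff engineering assistant."},
    ],
    tools=["web_search", "memory_insert", "memory_replace"],
)

# Send a message; the agent may call its memory tools before replying.
response = client.agents.messages.create(agent.id, input="What database should I use?")
```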
Behind that API, every turn is a structured loop. The model is prompted with the current state of every memory block, the message buffer, and a toolset that includes `core_memory_append`, `core_memory_replace`, `archival_memory_insert`, `archival_memory_search`, and `conversation_search`. The framework parses the tool calls, mutates persisted state, and re-injects updated blocks into the next prompt. The model does the work; the framework is the persistence layer.
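The loop is easier to reason about with the LLM swapped for a stub. The sketch below is a rough approximation under stated assumptions, not Letta's code: `render_prompt`, `step`, and `stub_model` are hypothetical names, the "model" is a hard-coded function, and the tool-call format is a plain dict rather than any real provider schema. Only the two tool names are borrowed from the list above.

```python
from typing import Callable


def render_prompt(blocks: dict[str, str], buffer: list[str]) -> str:
    """Re-inject the current state of every memory block ahead of the message buffer."""
    memory = "\n".join(f"<{label}>\n{value}\n</{label}>" for label, value in blocks.items())
    return memory + "\n" + "\n".join(buffer)


def core_memory_append(blocks: dict[str, str], label: str, content: str) -> None:
    blocks[label] = (blocks.get(label, "") + "\n" + content).strip()


def core_memory_replace(blocks: dict[str, str], label: str, old: str, new: str) -> None:
    blocks[label] = blocks[label].replace(old, new)


TOOLS: dict[str, Callable[..., None]] = {
    "core_memory_append": core_memory_append,
    "core_memory_replace": core_memory_replace,
}


def step(model: Callable[[str], dict], blocks: dict[str, str], buffer: list[str],
         user_msg: str, max_tool_calls: int = 5) -> str:
    """One turn: prompt, execute any tool calls, re-prompt until the model answers."""
    buffer.append(f"user: {user_msg}")
    for _ in range(max_tool_calls + 1):
        reply = model(render_prompt(blocks, buffer))
        if "tool" not in reply:
            buffer.append(f"assistant: {reply['content']}")
            return reply["content"]
        TOOLS[reply["tool"]](blocks, **reply["args"])  # mutate persisted state
        buffer.append(f"tool: {reply['tool']} ok")
    raise RuntimeError("model kept calling tools without answering")


def stub_model(prompt: str) -> dict:
    """Hypothetical stand-in for an LLM: fixes a stale fact once, then answers."""
    if "Prefers Postgres" in prompt and "switched to MySQL" in prompt:
        return {"tool": "core_memory_replace",
                "args": {"label": "human", "old": "Prefers Postgres", "new": "Prefers MySQL"}}
    return {"content": "Noted."}


blocks = {"human": "Name: Aashi. Prefers Postgres."}
buffer: list[str] = []
answer = step(stub_model, blocks, buffer, "I switched to MySQL last month.")
```

Note that the second prompt already contains the edited block, which is why the stub stops calling the tool. The same re-injection is what lets a real model see the effect of its own writes on the very next call.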
The 2026 addition that actually changed how I think about this is **sleep-time compute**. Instead of forcing the model to consolidate memory during a live conversation (which adds latency while the user waits for the LLM to tidy its notes), Letta spins up a background agent that processes, summarizes, and rewrites memory blocks while the user is idle. MemGPT's original design bundled everything into one agent; sleep-time compute separates the concerns.
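A rough way to picture the split is a hot path that only records and a background path that rewrites. The sketch below assumes a hypothetical `SleepTimeConsolidator` class, an explicit clock passed in as `now`, and a deduplicating function standing in for the background agent's LLM call. None of these names come from Letta.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class SleepTimeConsolidator:
    """Toy scheduler: defer memory rewriting until the user has gone idle."""
    summarize: Callable[[str, list[str]], str]  # (current block, new notes) -> new block
    idle_seconds: float = 30.0
    pending: list[str] = field(default_factory=list)
    last_activity: float = 0.0

    def on_user_message(self, text: str, now: float) -> None:
        """Hot path: record the note and return immediately, no rewriting."""
        self.pending.append(text)
        self.last_activity = now

    def tick(self, blocks: dict[str, str], label: str, now: float) -> bool:
        """Background path: consolidate only when there is work and the user is idle."""
        if not self.pending or now - self.last_activity < self.idle_seconds:
            return False
        blocks[label] = self.summarize(blocks.get(label, ""), self.pending)
        self.pending = []
        return True


def dedupe_summarizer(current: str, notes: list[str]) -> str:
    """Stand-in for the background agent: merge notes and drop duplicate lines."""
    seen: set[str] = set()
    merged: list[str] = []
    for line in current.split("\n") + notes:
        key = line.strip().lower()
        if key and key not in seen:
            seen.add(key)
            merged.append(line.strip())
    return "\n".join(merged)


blocks = {"human": "Prefers Postgres."}
worker = SleepTimeConsolidator(summarize=dedupe_summarizer, idle_seconds=30)
worker.on_user_message("Prefers Postgres.", now=0)
worker.on_user_message("Uses pytest.", now=5)
ran_while_active = worker.tick(blocks, "human", now=10)  # user still active
ran_when_idle = worker.tick(blocks, "human", now=40)     # 35 seconds of silence
```

The live conversation never pays for `summarize`. The cost is that the block is stale between the last message and the next idle window, which is a trade worth knowing about before you rely on it.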
Letta Code, the terminal coding agent shipped in December 2025 and the #1 model-agnostic open-source agent on Terminal-Bench as of mid-2026, is the proof of concept. A coding agent that retains your repo conventions, your naming preferences, and the half-finished refactor from last Tuesday across sessions is qualitatively different from one that starts cold every morning.
The Letta API decouples agent identity, memory, and state from the underlying model. Swap GPT-5.2 for Opus 4.5 or a local Qwen3.5 and the agent keeps its memory and history. No other major framework does this cleanly.
The MemGPT approach has real costs. The model is doing memory work that competes with the actual task. Without sleep-time compute, you pay a latency tax on every turn as the model decides what to evict, recall, or archive. Memory quality is bounded by the model's tool-use capability. Letta's own leaderboard ranks GPT-5.2 and Opus 4.5 at the top and notes that smaller open models degrade the experience. And this is not a drop-in replacement for RAG. If you have a static document corpus, you want a vector database with hybrid search. Letta's archival memory is for knowledge the agent has produced or consumed, not arbitrary retrieval.
The deeper limitation: memory management is still a context engineering problem, and the LLM is doing it heuristically. There is no guarantee the model writes the right fact to the right block, or recalls the right fact at the right time. The benchmarks (LoCoMo, LongMemEval, BEAM) measure recall, not judgment. Production systems still need to instrument for memory quality.
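One cheap form of that instrumentation is to log every memory write and audit the log against facts you know the session contained. The sketch below is an assumption-heavy illustration: `MemoryWrite`, `audit_writes`, and the substring matching are hypothetical, and a production check would need fuzzier matching than `in`.

```python
from dataclasses import dataclass


@dataclass
class MemoryWrite:
    """One logged memory mutation (hypothetical log record)."""
    turn: int
    block: str
    content: str


def audit_writes(writes: list[MemoryWrite],
                 expected: dict[str, str]) -> tuple[list[str], list[tuple[str, str]]]:
    """Compare logged writes against facts that should have been stored.

    `expected` maps a fact substring to the block it belongs in.
    Returns (missing facts, (fact, wrong block) pairs).
    """
    missing: list[str] = []
    misplaced: list[tuple[str, str]] = []
    for fact, block in expected.items():
        hits = [w for w in writes if fact.lower() in w.content.lower()]
        if not hits:
            missing.append(fact)                    # never written anywhere
        elif all(w.block != block for w in hits):
            misplaced.append((fact, hits[-1].block))  # written, but to the wrong block
    return missing, misplaced


writes = [
    MemoryWrite(turn=1, block="human", content="Prefers Postgres"),
    MemoryWrite(turn=3, block="persona", content="User is on-call this week"),
]
expected = {
    "prefers postgres": "human",
    "on-call": "human",
    "billing service": "human",
}
missing, misplaced = audit_writes(writes, expected)
```

Here the audit reports one fact that was never stored and one that landed in the persona block instead of the human block. Neither failure would show up in a recall benchmark, which only asks whether a stored fact can be found again.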
Letta is the only open-source agent framework I've seen that treats memory as a first-class system component rather than a feature bolted on top of an LLM call. The OS-inspired hierarchy is the right abstraction. Sleep-time compute is the right production pattern. Model-agnostic persistence is the right bet for a world where the frontier model changes every six months. The framework has rough edges — latency cost, model-dependence, evolving developer experience — but the architectural foundation is sound in a way that most of the agent-framework field is not.
If you are building an agent that needs to remember anything beyond the current conversation, run `pip install letta-client` and pay attention to the memory block design. That is where the actual work is. The LLM is the easy part.
*Repo: github.com/letta-ai/letta — Apache 2.0, 19K+ stars, self-hostable, Python and TypeScript SDKs, Letta Code CLI, sleep-time compute. Original MemGPT paper: Packer et al., 2023. Letta Code launched December 2025; #1 on Terminal-Bench open-source category as of May 2026.*