Hierarchical Retention
Memory × Route
Why this pattern exists
The single most common cause of agent failure beyond the first 20 turns is that the agent cannot remember the right thing at the right time. The problem is not recall as such: the agent can remember something, usually the wrong thing. What it cannot do is route the right memory tier to the current question. A customer asks “what about the refund I requested last Tuesday?” The chat history with the customer is in working memory. The refund policy is in long-term reference memory. The customer’s preference for callbacks instead of email is in procedural memory. The agent needs all three tiers, and it needs them routed differently.
For leadership: this is the pattern that decides whether your agent can hold a relationship across sessions, or whether it starts every conversation from zero. A relationship-keeping agent is a different product than a one-shot Q&A agent. Hierarchical Retention is the architectural prerequisite for the former. Manus, the team behind one of 2025’s benchmark-leading agents, reports that KV-cache hit rate — the operational metric for whether memory routing is working — is the most important cost lever in their stack.
The agent-design problem it solves
A naive agent stuffs everything into the prompt every turn. Per-turn cost grows linearly with conversation length, so cumulative cost grows quadratically, and quality degrades as relevant signals drown in irrelevant repetition. A slightly less naive agent summarizes the whole history each turn. Cost is bounded, but nuance is lost with every compression pass. Neither extreme is right.
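The cost contrast between the two extremes can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names and the token figures are hypothetical, and it ignores prompt caching and output tokens.

```python
def naive_total_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total prompt tokens when every turn resends the full history."""
    # Turn t resends t turns of history, so the total is
    # tokens_per_turn * (1 + 2 + ... + turns).
    return tokens_per_turn * turns * (turns + 1) // 2


def summarized_total_tokens(turns: int, tokens_per_turn: int, summary_tokens: int) -> int:
    """Total prompt tokens when each turn sends a fixed-size summary plus the new turn."""
    return turns * (summary_tokens + tokens_per_turn)


if __name__ == "__main__":
    # Hypothetical sizes: 500 tokens per turn, 1000-token rolling summary.
    for turns in (20, 40):
        print(turns, naive_total_tokens(turns, 500), summarized_total_tokens(turns, 500, 1000))
```

Under these assumed numbers, doubling the conversation from 20 to 40 turns roughly quadruples the naive total while only doubling the summarized one.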
Hierarchical Retention sets up four tiers, each with its own retention policy, each accessed by a different routing rule:
- Working memory — current turn, full fidelity, expires at turn end.
- Short-term — recent N turns, structured summary, expires at session end.
- Long-term — durable facts, retrieved by routing rule, never auto-expires.
- Procedural — playbooks and templates, loaded on task-type match.
The routing rule (the Route topology dimension) is what makes this distinct from RAG. RAG retrieves from one undifferentiated store. Hierarchical Retention chooses which store based on the current query type, then retrieves with the policy appropriate to that tier.
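A minimal sketch of the four tiers and a routing rule might look like the following. Everything here is an assumption for illustration: the `TieredMemory` class, the query-type names in `ROUTES`, and the use of in-process dicts in place of real stores. It is not a reference implementation of the pattern.

```python
from dataclasses import dataclass, field
from enum import Enum


class Tier(Enum):
    WORKING = "working"
    SHORT_TERM = "short_term"
    LONG_TERM = "long_term"
    PROCEDURAL = "procedural"


# Hypothetical query types mapped to the tiers each one should read.
ROUTES = {
    "follow_up": [Tier.WORKING, Tier.SHORT_TERM],
    "policy_question": [Tier.WORKING, Tier.LONG_TERM],
    "task_request": [Tier.WORKING, Tier.PROCEDURAL],
}


def _empty_stores() -> dict:
    return {tier: {} for tier in Tier}


@dataclass
class TieredMemory:
    # One plain dict per tier; a real system would back these with real stores.
    stores: dict = field(default_factory=_empty_stores)

    def write(self, tier: Tier, key: str, value: str) -> None:
        self.stores[tier][key] = value

    def end_turn(self) -> None:
        # Working memory expires at turn end.
        self.stores[Tier.WORKING].clear()

    def end_session(self) -> None:
        # Short-term memory expires at session end; long-term and procedural persist.
        self.end_turn()
        self.stores[Tier.SHORT_TERM].clear()

    def route(self, query_type: str) -> dict:
        # Choose the tiers first, then read only from those tiers.
        # Unknown query types fall back to working memory alone.
        tiers = ROUTES.get(query_type, [Tier.WORKING])
        return {tier: dict(self.stores[tier]) for tier in tiers}
```

The point of the sketch is the order of operations: the query type selects the tiers before any retrieval happens, which is the difference from querying a single undifferentiated store.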
Deep thinking direction
The hardest part of Hierarchical Retention is not the storage. Storage is solved — SQLite for short-term, vector DBs for long-term, file system for procedural, in-context for working. The hard part is promotion and eviction: when does a fact in short-term earn its way into long-term? When does a stale long-term fact get demoted? Get these wrong and the agent slowly accumulates contradictions — remembering both the customer’s old address and the new one, with no way to choose which is current.
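One way to avoid the old-address/new-address contradiction is to key long-term facts by subject and attribute and let a newer observation supersede the older one instead of sitting beside it. The sketch below assumes this design; the `Fact` and `FactStore` names and the timestamp-based rule are hypothetical choices, not something the pattern prescribes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Fact:
    subject: str
    attribute: str
    value: str
    observed_at: float  # Unix timestamp of the observation
    superseded: bool = False


class FactStore:
    """Keeps at most one current fact per (subject, attribute) pair."""

    def __init__(self) -> None:
        self.history: List[Fact] = []

    def current(self, subject: str, attribute: str) -> Optional[Fact]:
        for fact in self.history:
            if (fact.subject, fact.attribute) == (subject, attribute) and not fact.superseded:
                return fact
        return None

    def assert_fact(self, subject: str, attribute: str, value: str, observed_at: float) -> None:
        existing = self.current(subject, attribute)
        incoming = Fact(subject, attribute, value, observed_at)
        if existing is not None:
            if observed_at < existing.observed_at:
                # A late-arriving older observation is kept as history only.
                incoming.superseded = True
            else:
                # The newer observation demotes the stale one.
                existing.superseded = True
        self.history.append(incoming)
```

Usage: after asserting an old address at time 100 and a new address at time 200, `current("cust-1", "address")` returns only the new one, while the old value stays in `history` for audit.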
Three failure modes recur:
- Promotion Inflation: every fact is promoted to long-term, the store bloats, and retrieval relevance collapses. The discipline is explicit promotion criteria: access count, confidence threshold, and an audit step before promotion.
- Phantom Memory: the agent remembers something the user did not say because hallucinated facts got promoted on first sight. The discipline is verification on promotion.
- Tier Confusion: working memory leaking into long-term causes the agent to treat “something said five minutes ago” with the weight of “something established as fact.” The discipline is hard tier boundaries enforced by the harness, not the model.
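The three disciplines can be combined into a single promotion gate that the harness runs before anything reaches long-term memory. The following is a sketch under stated assumptions: the `Candidate` fields, the threshold values, and the idea that `verified` is set by some upstream check against what the user actually said are all hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical thresholds; real values would be tuned per deployment.
MIN_ACCESS_COUNT = 3
MIN_CONFIDENCE = 0.8


@dataclass(frozen=True)
class Candidate:
    text: str
    source_tier: str   # "working" or "short_term"
    access_count: int
    confidence: float
    verified: bool     # set by an upstream check against the user's actual words


def promotion_decision(candidate: Candidate) -> Tuple[bool, str]:
    """Return (promote?, reason). Enforced by the harness, not the model."""
    # Tier Confusion guard: working memory never jumps straight to long-term.
    if candidate.source_tier != "short_term":
        return False, "only short-term facts are eligible"
    # Phantom Memory guard: nothing unverified is promoted.
    if not candidate.verified:
        return False, "not verified"
    # Promotion Inflation guard: explicit criteria must be met.
    if candidate.access_count < MIN_ACCESS_COUNT:
        return False, "accessed too rarely"
    if candidate.confidence < MIN_CONFIDENCE:
        return False, "confidence below threshold"
    return True, "promote"
```

Returning a reason alongside the decision gives the audit step something to log when a candidate is rejected.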
The architectural insight is that Hierarchical Retention is the CPU cache hierarchy pattern reborn. L1 cache (working), L2 (short-term), L3 (long-term), main memory (procedural) — with promotion driven by access frequency, eviction by LRU. Engineers who have written cache replacement policies recognize this in under a minute. The medium changed; the algorithm structure did not.
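The cache analogy can be sketched directly: a bounded short-term tier with LRU eviction, where an entry that is read often enough moves up to long-term. The class name, the `capacity` and `promote_after` parameters, and the plain dict standing in for the long-term store are assumptions for illustration. The sketch shows only the cache mechanics; a real system would also apply the verification discipline described earlier before promoting.

```python
from collections import OrderedDict
from typing import Optional


class ShortTermCache:
    """Bounded short-term tier: LRU eviction, promotion by access count."""

    def __init__(self, capacity: int, promote_after: int) -> None:
        self.capacity = capacity
        self.promote_after = promote_after
        self._items = OrderedDict()  # key -> (value, hit count), oldest first
        self.long_term = {}          # stand-in for a durable store

    def put(self, key: str, value: str) -> None:
        hits = self._items[key][1] if key in self._items else 0
        self._items[key] = (value, hits)
        self._items.move_to_end(key)  # mark as most recently used
        while len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used entry

    def get(self, key: str) -> Optional[str]:
        if key in self._items:
            value, hits = self._items.pop(key)
            hits += 1
            if hits >= self.promote_after:
                # Frequently read entries leave short-term and become durable.
                self.long_term[key] = value
            else:
                self._items[key] = (value, hits)  # re-insert as most recently used
            return value
        # Miss in short-term: fall through to the long-term tier.
        return self.long_term.get(key)
```

With `capacity=2` and `promote_after=2`, writing `a` and `b`, reading `a`, then writing `c` evicts `b`; a second read of `a` promotes it to `long_term`.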
Engineering blog posts — curated
- MemGPT: Towards LLMs as Operating Systems. The OS-style tiered-memory metaphor in working code. MemGPT explicitly models LLM context as RAM and external store as disk.
- CLAUDE.md and project memory in Claude Code. Three-tier (global / project / session) memory implemented as plain files. The most widely-deployed Hierarchical Retention pattern in production today.
- KV-Cache Hit Rate as the Production Cost Lever (Manus). Reports KV-cache hit rate as the single highest-leverage cost metric in production multi-turn agents. Directly motivates the tier discipline.
- Long-term Memory for Agents (LangChain). Framework-side decomposition: semantic, episodic, procedural memory types and the access patterns each demands.
- Memory components (Lilian Weng). Early decomposition of agent memory into sensory / short-term / long-term, drawing on cognitive science. Pre-dates but anticipates the production-engineering version.
Latest paper progress (arXiv)
- MemGPT: Towards LLMs as Operating Systems. The foundational paper for OS-style virtual context. Introduces explicit page-in / page-out for LLM memory.
- Cognitive Architectures for Language Agents (CoALA). Sumers et al. organize agent memory by cognitive function (procedural / semantic / episodic). The taxonomy half of the paper feeds directly into the tier design.
- A Survey on the Memory Mechanism of LLM Agents. Catalogues memory mechanisms across recent agents. The promotion/eviction policy section is the most directly relevant for engineering.
- Generative Agents with Hierarchical Memory. Explicit working/short/long-term separation with consolidation passes. Demonstrates hierarchical retention beats flat retrieval on long-horizon tasks.
- A-Mem: Agent-Mediated Memory with Hierarchical Routing. Routing-based memory access where the agent itself decides which tier to query. The closest model-side ancestor to the production-engineering version.
Related patterns
Where this pattern is developed
- Manning book — Designing AI Agents, Chapter 4 §4.2 (Memory / Hierarchical Retention).
- Paper — Huang & Zhou (2026), §4.2 Pattern 2.