Hierarchical Retention — Pattern Reference

Position in the matrix

Memory row of the matrix, by topology column:

Chain: RAG Pipeline
Route: Hierarchical Retention
Parallel: (none)
Orchestrate: Progress Tracking
Loop: Failure Journals
Hierarchy: (none)

Why this pattern exists

The single most common cause of agent failure beyond the first 20 turns is that the agent cannot remember the right thing at the right time. The problem is not that it remembers nothing. It remembers something, usually the wrong thing, because it cannot route the right memory tier to the current question. A customer asks “what about the refund I requested last Tuesday?” The chat history with the customer is in working memory. The refund policy is in long-term reference memory. The customer’s preference for callbacks instead of email is in procedural memory. The agent needs all three tiers, and it needs them routed differently.

For leadership: this is the pattern that decides whether your agent can hold a relationship across sessions, or whether it starts every conversation from zero. A relationship-keeping agent is a different product than a one-shot Q&A agent. Hierarchical Retention is the architectural prerequisite for the former. Manus, the team behind one of 2025’s benchmark-leading agents, reports that KV-cache hit rate — the operational metric for whether memory routing is working — is the most important cost lever in their stack.

The agent-design problem it solves

A naive agent stuffs everything into the prompt every turn. Prompt size grows linearly with conversation length, so cumulative cost grows roughly quadratically, and quality degrades as relevant signals drown in irrelevant repetition. A slightly less naive agent summarizes the whole history each turn. Cost is bounded, but nuance is lost in compression. Neither extreme is right.
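To make the cost argument concrete, here is a small arithmetic sketch. The numbers (50 turns, 200 tokens per turn, a 1,000-token cap) are illustrative assumptions rather than measurements, and `cumulative_prompt_tokens` is a hypothetical helper, not part of any library.

```python
from typing import Optional


def cumulative_prompt_tokens(turns: int, tokens_per_turn: int,
                             cap: Optional[int] = None) -> int:
    """Total prompt tokens paid across a whole conversation.

    cap=None models the naive agent that resends the full history.
    A cap models any bounded-context strategy (summary or tiered).
    """
    total = 0
    for turn in range(1, turns + 1):
        prompt = turn * tokens_per_turn  # full history so far
        if cap is not None:
            prompt = min(prompt, cap)    # bounded context window
        total += prompt
    return total


naive = cumulative_prompt_tokens(turns=50, tokens_per_turn=200)
bounded = cumulative_prompt_tokens(turns=50, tokens_per_turn=200, cap=1000)
print(naive)    # 255000
print(bounded)  # 48000
```

Under these assumptions the naive agent pays roughly five times more over 50 turns, and the gap widens as the conversation grows. The sketch says nothing about quality, which is the other half of the trade-off.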

Hierarchical Retention sets up four tiers, each with its own retention policy, each accessed by a different routing rule:

Working memory — current turn, full fidelity, expires at turn end.
Short-term — recent N turns, structured summary, expires at session end.
Long-term — durable facts, retrieved by routing rule, never auto-expires.
Procedural — playbooks and templates, loaded on task-type match.
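The four tiers can be written down as data, so the harness rather than the model owns the retention rules. This is a minimal sketch: the names `Tier`, `RetentionPolicy`, `POLICIES`, and `tiers_to_clear` are hypothetical, and treating procedural memory as never expiring is an assumption, since the list above does not state its expiry.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Tier(Enum):
    WORKING = "working"
    SHORT_TERM = "short_term"
    LONG_TERM = "long_term"
    PROCEDURAL = "procedural"


@dataclass(frozen=True)
class RetentionPolicy:
    fidelity: str  # what form the content is stored in
    expires: str   # lifecycle event that clears the tier, or "never"
    access: str    # how the tier is read


POLICIES = {
    Tier.WORKING: RetentionPolicy("full", "turn_end", "always_in_context"),
    Tier.SHORT_TERM: RetentionPolicy("structured_summary", "session_end", "recent_n_turns"),
    Tier.LONG_TERM: RetentionPolicy("durable_facts", "never", "routing_rule"),
    # Assumption: playbooks do not auto-expire.
    Tier.PROCEDURAL: RetentionPolicy("playbooks_and_templates", "never", "task_type_match"),
}


def tiers_to_clear(event: str) -> List[Tier]:
    """Tiers the harness should clear when a lifecycle event fires."""
    return [tier for tier, policy in POLICIES.items() if policy.expires == event]


print(tiers_to_clear("turn_end"))     # [<Tier.WORKING: 'working'>]
print(tiers_to_clear("session_end"))  # [<Tier.SHORT_TERM: 'short_term'>]
```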

The routing rule (the Route topology dimension) is what makes this distinct from RAG. RAG retrieves from one undifferentiated store. Hierarchical Retention chooses which store based on the current query type, then retrieves with the policy appropriate to that tier.
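A rough sketch of that two-step shape: classify the query, pick the tier, then call that tier's own retrieval function. The query types, keyword rules, and stub stores below are all hypothetical placeholders. A real router would more likely use a trained classifier or the model itself, and could return several tiers for a single question.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical mapping from query type to memory tier.
ROUTES: Dict[str, str] = {
    "current": "working",
    "recent": "short_term",
    "reference": "long_term",
    "task": "procedural",
}


def classify(query: str) -> str:
    """Toy keyword classifier; stands in for a real query-type model."""
    q = query.lower()
    if any(word in q for word in ("policy", "allowed", "terms")):
        return "reference"
    if any(word in q for word in ("how do i", "steps", "procedure")):
        return "task"
    if any(word in q for word in ("earlier", "you said", "just now")):
        return "recent"
    return "current"


def retrieve(query: str,
             stores: Dict[str, Callable[[str], List[str]]]) -> Tuple[str, List[str]]:
    """Choose the tier first, then retrieve with that tier's own policy."""
    tier = ROUTES[classify(query)]
    return tier, stores[tier](query)


# Stub stores: each tier exposes its own retrieval function.
stores: Dict[str, Callable[[str], List[str]]] = {
    "working": lambda q: ["<current turn, verbatim>"],
    "short_term": lambda q: ["<summary of recent turns>"],
    "long_term": lambda q: ["<top matching durable facts>"],
    "procedural": lambda q: ["<playbook for this task type>"],
}

print(retrieve("What is the refund policy?", stores)[0])   # long_term
print(retrieve("How do I issue a refund?", stores)[0])     # procedural
```

The point of the shape is that the store is chosen before any retrieval happens, which is the difference from a single undifferentiated RAG index.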

Deep thinking direction

The hardest part of Hierarchical Retention is not the storage. Storage is solved — SQLite for short-term, vector DBs for long-term, file system for procedural, in-context for working. The hard part is promotion and eviction: when does a fact in short-term earn its way into long-term? When does a stale long-term fact get demoted? Get these wrong and the agent slowly accumulates contradictions — remembering both the customer’s old address and the new one, with no way to choose which is current.
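One way to avoid the old-address and new-address contradiction is to key long-term facts by subject and attribute, and mark older values as superseded instead of storing them side by side. The sketch below assumes every fact carries a comparable `observed_at` timestamp. `Fact` and `LongTermStore` are hypothetical names, and the in-memory dict stands in for whatever store is actually used.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Fact:
    value: str
    observed_at: int          # assumption: larger means more recent
    superseded: bool = False


class LongTermStore:
    def __init__(self) -> None:
        self._facts: Dict[Tuple[str, str], List[Fact]] = {}

    def promote(self, subject: str, attribute: str,
                value: str, observed_at: int) -> None:
        """Add a fact; only the most recently observed value stays current."""
        history = self._facts.setdefault((subject, attribute), [])
        history.append(Fact(value, observed_at))
        newest = max(history, key=lambda f: f.observed_at)
        for fact in history:
            fact.superseded = fact is not newest

    def current(self, subject: str, attribute: str) -> Optional[str]:
        """Return the single non-superseded value, if any."""
        for fact in self._facts.get((subject, attribute), []):
            if not fact.superseded:
                return fact.value
        return None


store = LongTermStore()
store.promote("customer-42", "address", "12 Old Road", observed_at=1)
store.promote("customer-42", "address", "9 New Street", observed_at=2)
print(store.current("customer-42", "address"))  # 9 New Street
```

Keeping superseded facts in the history, rather than deleting them, leaves an audit trail while still giving retrieval exactly one current answer.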

Three failure modes recur:

Promotion Inflation: every fact gets promoted to long-term, the store bloats, and retrieval relevance collapses. The discipline is explicit promotion criteria: access count, a confidence threshold, and an audit step before promotion.
Phantom Memory: the agent remembers something the user did not say, because hallucinated facts were promoted on first sight. The discipline is verification on promotion.
Tier Confusion: working memory leaks into long-term, so the agent treats “something said five minutes ago” with the weight of “something established as fact.” The discipline is hard tier boundaries enforced by the harness, not the model.
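The three disciplines above can be sketched as a single promotion gate that the harness runs before anything reaches long-term memory. The thresholds (`MIN_ACCESSES`, `MIN_CONFIDENCE`) and the `Candidate` fields are illustrative assumptions. In particular, how `verified` gets set, for example by checking the claim against the transcript, is left out.

```python
from dataclasses import dataclass
from typing import Tuple

MIN_ACCESSES = 3       # assumed threshold, tune per deployment
MIN_CONFIDENCE = 0.8   # assumed threshold, tune per deployment


@dataclass
class Candidate:
    text: str
    source_tier: str    # tier the fact currently lives in
    access_count: int   # how often it has been retrieved
    confidence: float   # extractor's confidence in the fact
    verified: bool      # passed the audit / verification step


def promotion_decision(c: Candidate) -> Tuple[bool, str]:
    """Return (promote?, reason). Checks run from hardest rule to softest."""
    if c.source_tier != "short_term":
        # Guards against Tier Confusion: working memory never jumps tiers.
        return False, "tier boundary"
    if not c.verified:
        # Guards against Phantom Memory.
        return False, "unverified"
    if c.access_count < MIN_ACCESSES or c.confidence < MIN_CONFIDENCE:
        # Guards against Promotion Inflation.
        return False, "below promotion criteria"
    return True, "promote"


fact = Candidate("prefers callbacks over email", "short_term", 4, 0.93, True)
print(promotion_decision(fact))  # (True, 'promote')
```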

The architectural insight is that Hierarchical Retention is the CPU cache hierarchy pattern reborn. L1 cache (working), L2 (short-term), L3 (long-term), main memory (procedural) — with promotion driven by access frequency, eviction by LRU. Engineers who have written cache replacement policies recognize this in under a minute. The medium changed; the algorithm structure did not.
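For readers who have not written a cache replacement policy, here is a minimal two-tier sketch of the mechanics: LRU eviction in the short-term tier and promotion once an entry has been read often enough. `TwoTierCache`, `short_capacity`, and `promote_after` are hypothetical names. A production version would combine this with verification before promotion.

```python
from collections import OrderedDict
from typing import Dict, Optional, Tuple


class TwoTierCache:
    def __init__(self, short_capacity: int, promote_after: int) -> None:
        self.short_capacity = short_capacity
        self.promote_after = promote_after
        # key -> (value, hit count); insertion order tracks recency.
        self.short: "OrderedDict[str, Tuple[str, int]]" = OrderedDict()
        self.long: Dict[str, str] = {}

    def put(self, key: str, value: str) -> None:
        if key in self.long:
            self.long[key] = value
            return
        _, hits = self.short.pop(key, ("", 0))
        self.short[key] = (value, hits)      # most recently used
        if len(self.short) > self.short_capacity:
            self.short.popitem(last=False)   # evict least recently used

    def get(self, key: str) -> Optional[str]:
        if key in self.long:
            return self.long[key]
        if key not in self.short:
            return None
        value, hits = self.short.pop(key)
        hits += 1
        if hits >= self.promote_after:
            self.long[key] = value           # promoted by access frequency
        else:
            self.short[key] = (value, hits)  # reinserted as most recent
        return value


cache = TwoTierCache(short_capacity=2, promote_after=2)
cache.put("a", "callback preferred")
cache.put("b", "asked about shipping")
cache.get("a")                 # "a" becomes most recently used
cache.put("c", "refund filed") # evicts "b", the least recently used
print(cache.get("b"))          # None
cache.get("a")                 # second read promotes "a"
print("a" in cache.long)       # True
```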

Engineering blog posts — curated

MemGPT: Towards LLMs as Operating Systems. cpacker · 2023-2024 · ongoing. The OS-style tiered-memory metaphor in working code. MemGPT explicitly models LLM context as RAM and external store as disk.
CLAUDE.md and project memory in Claude Code. Anthropic · 2024-2025. Three-tier (global / project / session) memory implemented as plain files. The most widely-deployed Hierarchical Retention pattern in production today.
KV-Cache Hit Rate as the Production Cost Lever (Manus). Manus Research · 2025. Reports KV-cache hit rate as the single highest-leverage cost metric in production multi-turn agents. Directly motivates the tier discipline.
Long-term Memory for Agents (LangChain). LangChain Blog · 2025. Framework-side decomposition: semantic, episodic, procedural memory types and the access patterns each demands.
Memory components (Lilian Weng). OpenAI · June 2023. Early decomposition of agent memory into sensory / short-term / long-term, drawing on cognitive science. Pre-dates but anticipates the production-engineering version.

Latest paper progress (arXiv)

MemGPT: Towards LLMs as Operating Systems. arXiv:2310.08560 · October 2023. The foundational paper for OS-style virtual context. Introduces explicit page-in / page-out for LLM memory.
Cognitive Architectures for Language Agents (CoALA). arXiv:2309.02427 · September 2023. Sumers et al. organize agent memory by cognitive function (procedural / semantic / episodic). The taxonomy feeds directly into the tier design.
A Survey on the Memory Mechanism of LLM Agents. arXiv:2404.13501 · April 2024. Catalogues memory mechanisms across recent agents. The promotion/eviction policy section is the most directly relevant for engineering.
Generative Agents with Hierarchical Memory. arXiv:2501.05813 · January 2025. Explicit working/short/long-term separation with consolidation passes. Demonstrates hierarchical retention beats flat retrieval on long-horizon tasks.
A-Mem: Agent-Mediated Memory with Hierarchical Routing. arXiv:2502.18482 · February 2025. Routing-based memory access where the agent itself decides which tier to query. The closest model-side ancestor to the production-engineering version.

Where this pattern is developed

Manning book — Designing AI Agents, Chapter 4 §4.2 (Memory / Hierarchical Retention).
Paper — Huang & Zhou (2026), §4.2 Pattern 2.