
Context Triage

Perception × Route

A priority queue at the door of the model: who enters, who waits, who is dropped — all before any reasoning runs.

Why this pattern exists

A production agent does not get to read everything. Its context window is a fixed budget — 200K tokens for Claude 3.7, 1M for Gemini, but always finite, always expensive at the margin. When a single PR runs to 50,000 lines, when a compliance bundle arrives with 43 documents, when 100 log lines stream in per second during an outage, the agent cannot see all of it. Something has to decide what enters and what does not. That something is the triage layer, and it sits at the door of the model, not inside it.

This is the pattern that separates a demo agent from a production agent. Demo agents work because the engineer hand-picked the inputs. Production agents face whatever the user supplies, and without explicit routing the harness has only one mechanism available: silent truncation from the back of the window. That discards the most recent context, which is usually the context that matters most.

Picture a loan-evaluation agent built for a regional bank. A new commercial application arrives with 43 documents — financial statements, valuation reports, covenants, environmental assessments — totalling well over the model's 200K-token window. With no triage layer, the harness has nothing to do but truncate. The most recent valuation falls off the end. The model approves the loan based on stale numbers, and the loan defaults two weeks later. The model did nothing wrong — it answered the prompt it was given. The harness simply had no priority logic. Failures like this do not make the front page; they live in postmortems that never leave the building, and they are precisely why context triage has become a first-class concern in 2026.

The agent-design problem it solves

Context Triage answers four questions that show up in every production agent:

  1. What is most important right now? — explicit priority labels (P0/P1/P2/P3) attached at ingestion.
  2. What fits the budget? — a planner that knows the token budget and prunes accordingly.
  3. What is dropped on the floor? — explicit, logged, auditable, reversible.
  4. What can be lazily fetched if needed? — handles for P3 items the model can request later.

The structural decision is to route inputs before the model sees them, with the routing rule itself owned by engineering (the harness), not the model. That is the distinguishing feature: triage is not summarization, not compression, not embedding-based retrieval. It is routing under a hard budget, executed by code that the engineer can audit, version, and adjust without retraining anything.
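The four questions above can be sketched as a small planner owned by the harness. This is a minimal illustration under stated assumptions, not a reference implementation: the names `Item`, `TriagePlan`, and `plan_context` are hypothetical, token counts are assumed to be measured at ingestion, and the greedy fill order is one possible policy among several.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Item:
    item_id: str
    priority: int  # 0 = P0 (must see) ... 3 = P3 (lazy fetch only)
    tokens: int    # token count measured by the harness at ingestion


@dataclass
class TriagePlan:
    admitted: list = field(default_factory=list)  # ids that enter the prompt
    handles: list = field(default_factory=list)   # P3 ids the model may request later
    dropped: list = field(default_factory=list)   # (id, reason) pairs for the audit log


def plan_context(items, budget_tokens):
    """Greedy fill in priority order; every exclusion is recorded, never silent."""
    plan = TriagePlan()
    remaining = budget_tokens
    # sorted() is stable, so arrival order is preserved within a priority level.
    for item in sorted(items, key=lambda i: i.priority):
        if item.priority == 3:
            plan.handles.append(item.item_id)
        elif item.tokens <= remaining:
            plan.admitted.append(item.item_id)
            remaining -= item.tokens
        else:
            reason = f"over budget: needs {item.tokens}, {remaining} left"
            plan.dropped.append((item.item_id, reason))
    return plan


# Hypothetical loan bundle, loosely following the example earlier on this page.
items = [
    Item("old_valuation", 2, 60_000),
    Item("latest_valuation", 0, 50_000),
    Item("covenants", 1, 100_000),
    Item("environmental_assessment", 3, 90_000),
    Item("financial_statements", 0, 40_000),
]
plan = plan_context(items, budget_tokens=200_000)
```

Here the latest valuation is admitted because it is P0, regardless of where it arrived in the bundle, and the older valuation is the item that gets dropped, with a logged reason. Because the rule is plain code, it can be versioned and audited like any other harness logic.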

Deep thinking direction

What makes Context Triage hard is that the priority labels are not given. The agent must decide them, and the decision sets the ceiling on every downstream stage. If a P0 label is wrong, no amount of reasoning can recover — the relevant information was never in the room.

Three failure modes recur, each with a matching discipline:

  1. Priority Inflation. When the rule for assigning P0 is fuzzy, every team adds “just in case” items; within months everything is P0 and the queue collapses to FIFO. The discipline that defeats it is a hard schema with low cardinality (at most five P0 slots, with further claims refused), not soft heuristics.
  2. Stale Handles. A P3 handle points to a snapshot that has changed since ingestion; the agent fetches it later and operates on stale state. The discipline is an explicit TTL on every handle plus revalidation on fetch.
  3. Triage Thrash. When priorities are revised every turn, the model sees a moving target and confidence collapses. The discipline is stickiness: a triage decision holds for the session unless a downstream signal explicitly invalidates it.
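Two of these disciplines are small enough to sketch directly: a P0 registry that refuses a sixth claim, and a handle that is only valid while its TTL holds and its source is unchanged. The class names, the `version` field (imagined here as a content hash recorded at ingestion), and the explicit `now` argument are all assumptions made for illustration. Stickiness against thrash is not covered by this sketch.

```python
from dataclasses import dataclass

MAX_P0_SLOTS = 5  # low-cardinality cap taken from the text


class P0SlotsExhausted(Exception):
    """Raised instead of silently accepting a sixth P0 item."""


class P0Registry:
    def __init__(self, max_slots=MAX_P0_SLOTS):
        self.max_slots = max_slots
        self._claimed = []

    def claim(self, item_id):
        if item_id in self._claimed:
            return  # re-claiming an existing slot is a no-op
        if len(self._claimed) >= self.max_slots:
            raise P0SlotsExhausted(f"{item_id}: all {self.max_slots} P0 slots are taken")
        self._claimed.append(item_id)


@dataclass(frozen=True)
class Handle:
    item_id: str
    version: str        # hypothetical content hash recorded at ingestion
    issued_at: float    # seconds, on whatever clock the harness uses
    ttl_seconds: float

    def is_valid(self, now, current_version):
        """Valid only if the TTL has not expired and the source is unchanged."""
        within_ttl = (now - self.issued_at) <= self.ttl_seconds
        return within_ttl and current_version == self.version


registry = P0Registry()
for n in range(MAX_P0_SLOTS):
    registry.claim(f"doc-{n}")
try:
    registry.claim("doc-just-in-case")
    refused = False
except P0SlotsExhausted:
    refused = True

handle = Handle("spec-v1", version="abc123", issued_at=0.0, ttl_seconds=300.0)
```

Raising an exception rather than evicting an existing P0 is a deliberate choice in this sketch: it forces whoever is adding the item to decide which current P0 it should replace.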

The deeper architectural insight is that Context Triage is the operating-system scheduler pattern reborn in agent form. OS schedulers solved “which process gets the CPU?” under a finite CPU budget with priority queues, aging, and preemption. The agent version solves “which information gets the model?” under a finite context budget with the same primitives. Engineers who have written kernel schedulers find this pattern in five minutes; engineers who have only worked above the OS layer have to learn it from scratch. The patterns transferred; only the medium changed.
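To make the scheduler analogy concrete, here is a minimal sketch of aging applied to waiting context items. It is an illustration only: the function names, the aging interval of four turns, and the rule that aging never promotes an item into P0 are assumptions, and preemption is left out. It also assumes `current_turn` is never earlier than the turn an item was enqueued.

```python
import heapq


def effective_priority(base, waited_turns, aging_interval=4, ceiling=1):
    """Promote one level per `aging_interval` turns waited, but never past `ceiling`.

    Lower numbers are more urgent. Items already at or above the ceiling are unchanged.
    """
    if base <= ceiling:
        return base
    return max(ceiling, base - waited_turns // aging_interval)


def next_to_admit(waiting, current_turn):
    """waiting: list of (item_id, base_priority, enqueued_turn) tuples."""
    heap = [
        (effective_priority(base, current_turn - enqueued), enqueued, item_id)
        for item_id, base, enqueued in waiting
    ]
    heapq.heapify(heap)
    return heapq.heappop(heap)[2]  # most urgent first; the oldest item wins ties


# A P3 spec that has waited eight turns now outranks a P2 log line that just arrived.
waiting = [("fresh_log", 2, 10), ("old_spec", 3, 2)]
chosen = next_to_admit(waiting, current_turn=10)
```

As in an OS scheduler, aging keeps low-priority items from starving indefinitely, while the ceiling keeps the small set of P0 slots under explicit control.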

Engineering blog posts — curated

Latest paper progress (arXiv)

2026 frontier — deep reads

The five papers above are the foundation. The five below are what changed in the nine months between October 2025 and June 2026 — each one shifts how a working engineer should design the triage layer this quarter. For each paper, what follows is the new idea the paper introduces, the specific failure mode it removes from a production triage layer, the concrete design move you should consider in response, and the single number that quantifies the gain. Each pipeline diagram is hand-drawn so the architecture is visible at a glance.
DEEP READ 1 · FOUNDATIONS

Agentic Context Engineering (ACE): Evolving Contexts for Self-Improving Language Models

arXiv:2510.04618 · Zhang, Hu, Upasani et al. · ICLR 2026 (32 pages)

What changes. Today most production agents assign a priority to each piece of context at the moment it enters the system — a document is P0, a log line is P2, a stale spec is P3 — and then treat that label as ground truth for the rest of the session. This works for the first thirty turns. It breaks for the next thirty, because the agent has by then learned things: a P0 document turned out to mislead it, a P3 fact turned out to be load-bearing. ACE introduces a separate, persistent object — the paper calls it a playbook — that the agent rewrites between turns based on what just happened. The next turn ingests the rewritten playbook as part of its context, so the agent's beliefs about what is important are no longer frozen at the moment of ingestion.

Why it matters for triage. The core mistake the field has been making is treating “priority assigned at ingestion” as if it were “priority that holds for the session.” Priority is in fact a live hypothesis, and that hypothesis decays as the agent learns more. A triage layer that never revises its labels is, in effect, asking the model to keep trusting a guess it made an hour ago — under inputs that the guess could not have anticipated. By the second hour, this is the single most common reason agents start hallucinating obvious facts: the relevant context is technically still in the window, but at a priority so low that the model has stopped attending to it.

What you do differently. Stop storing priorities in a hashmap. Store them in an append-only log keyed by (token-id, turn, evidence-of-revision). Every change to a priority is a new entry, with a record of why — which observation revealed the previous label was wrong. The reflection step at the end of each turn becomes the natural place to write the log. Two consequences follow. Revisions are auditable in postmortems, so a regression caused by a wrong priority can be traced back to the turn where the decision was made. And the agent itself can re-read its own history and ask “why did I demote this fact earlier?” That is the structural prerequisite for an agent to improve within a session; without it, the agent learns nothing across the conversation.
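One way to picture the append-only log is the sketch below. It is not ACE's implementation: `PriorityLog` and `Revision` are hypothetical names, entries are keyed by an item id rather than a token id for readability, and the evidence field is free text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Revision:
    item_id: str
    turn: int
    priority: int
    evidence: str  # which observation justified this label


class PriorityLog:
    def __init__(self):
        self._entries = []  # only ever appended to, never edited in place

    def record(self, item_id, turn, priority, evidence):
        self._entries.append(Revision(item_id, turn, priority, evidence))

    def current(self, item_id):
        """Latest label wins; earlier entries remain available for audit."""
        for rev in reversed(self._entries):
            if rev.item_id == item_id:
                return rev.priority
        raise KeyError(item_id)

    def history(self, item_id):
        return [rev for rev in self._entries if rev.item_id == item_id]


log = PriorityLog()
log.record("valuation_2024", turn=1, priority=0, evidence="ingestion rule: valuation report")
log.record("valuation_2026", turn=1, priority=3, evidence="ingestion rule: unclassified attachment")
log.record("valuation_2024", turn=7, priority=2, evidence="turn 7: superseded by valuation_2026")
log.record("valuation_2026", turn=7, priority=0, evidence="turn 7: most recent appraisal")
```

Calling `log.history("valuation_2024")` returns both entries in order, which is what lets a postmortem, or the agent itself, trace a demotion back to the turn and observation that caused it.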

Headline. +10.6% on agent benchmarks, +8.6% on finance benchmarks. The number that should make a serious buyer pause is on AppWorld: an open-backbone agent running ACE matches the top closed-source production systems on overall performance and surpasses them on the harder evaluation splits. The mechanism is not a bigger model and not more tokens. It is a triage layer that is allowed to change its mind.

Figure — ACE generate·reflect·curate loop with playbook recycled into next turn. Stages: Generate (draft action) → Reflect (critique outcome) → Curate (incremental update) → Playbook (evolved strategies), which re-enters as context on the next turn.
DEEP READ 2 · PLAN-AWARE PRUNING

PAACE: A Plan-Aware Automated Agent Context Engineering Framework

arXiv:2512.16970 · December 18, 2025 · PAACE-Syn + PAACE-FT

What changes. Until PAACE, the priority a piece of context received depended only on the fact itself: “this is a Stripe transaction ID, score it.” PAACE adds a second dimension — the agent's current plan step. The same Stripe transaction ID gets scored P0 when the agent's next step is “diagnose this specific failed payment,” and P3 when the next step is “summarize last week's infrastructure latency trends.” The fact has not changed; what has changed is what the agent is about to do with it. PAACE makes this dependence explicit by training a small distilled scorer that takes the (fact, current-plan-step) tuple as input, with the agent's full plan supplied as supporting context.

Why it matters for triage. Without plan awareness, every triage layer falls back to the same defensive policy: keep everything that might be relevant to any of the next ten steps. This is the policy that quietly destroys token budgets. A 50-step plan touching ten different subsystems will accumulate context from all ten subsystems within the first five steps, and the triage layer has no way to say “we are now firmly inside subsystem A; subsystems B through J can be moved to lower priority.” The result is that the agent carries dead weight for the rest of the session. PAACE is the first paper to give the triage layer a clean signal for shedding that weight.

What you do differently. Train (or distil) a small router whose input key is (fact, plan-step), not (fact) alone. PAACE-FT is the open recipe. Run that small router on every turn, not just at session boundaries. One critical detail: the router must take the agent's current plan as an explicit input, not as a soft hint embedded somewhere in the system prompt. The dependence between plan and priority is too strong to leave to in-context learning; the router has to be conditioned on the plan directly.
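The interface change, scoring a (fact, plan-step) pair instead of a fact alone, can be shown with a deliberately crude stand-in. The sketch below uses tag overlap where PAACE uses a trained, distilled scorer; the function names, tag sets, and thresholds are all assumptions for illustration and do not come from the paper.

```python
def relevance(fact_tags, step_tags):
    """Stand-in scorer: fraction of the fact's tags that the plan step shares."""
    if not fact_tags:
        return 0.0
    return len(fact_tags & step_tags) / len(fact_tags)


def priority_for(fact_tags, step_tags):
    """Map the score to a label. The thresholds are illustrative, not from the paper."""
    score = relevance(fact_tags, step_tags)
    if score >= 0.5:
        return 0
    if score > 0:
        return 2
    return 3


# Same fact, two different plan steps.
stripe_txn = {"stripe", "payment", "transaction"}
diagnose_step = {"payment", "stripe", "failure"}          # "diagnose this failed payment"
latency_step = {"latency", "infrastructure", "trend"}     # "summarize latency trends"

during_diagnosis = priority_for(stripe_txn, diagnose_step)
during_summary = priority_for(stripe_txn, latency_step)
```

The point of the sketch is the function signature: the plan step is an explicit argument to the scorer, so the same fact can land at P0 in one step and P3 in another without any change to the fact itself.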

Why this matters economically. Re-triaging on every turn used to be prohibitively expensive — it required a full frontier-model call per turn, and the cost crowded out the actual work. PAACE-FT cuts that cost to roughly the cost of an embedding lookup. That is the threshold below which “every turn re-triages” becomes a default architectural choice rather than an aspirational one. For long-running agents, this is what makes plan-aware triage affordable in the first place.

Headline. The distilled compressor retains 97% of the teacher model's accuracy at one-tenth the inference cost. Tested on AppWorld, OfficeBench, and an 8-objective multi-hop QA benchmark, it improves both accuracy and F1 while using fewer agent steps and lower peak tokens than every baseline in the comparison.

Figure — PAACE's five-stage compression conditioned on the plan's next-k tasks. Stages: Plan Structure (parse next-k tasks) → Relevance Score (per-token utility) → Rewrite + Summarize (function-preserving) → Prune (drop low utility) → Co-refine (tune instruction).
DEEP READ 3 · CODING-AGENT TRIAGE

SWE-Pruner: Self-Adaptive Context Pruning for Coding Agents

arXiv:2601.16746 · Wang et al. · January 23, 2026

What changes. SWE-Pruner introduces a triage layer specifically for coding agents — a 0.6-billion-parameter neural model (the paper calls it the “skimmer”) that sits in front of the frontier model, reads source code line by line, and decides which lines the frontier model actually gets to see. The skimmer is small enough that the latency cost of running it on every turn is effectively invisible. Crucially, the skimmer takes the agent's current goal as an explicit input — for example, “find the cause of this null-pointer error” — and uses that goal to score each line. Lines with no plausible relevance to the current goal are dropped before the frontier model loads them.

Why it matters for triage. The paper documents an economic fact most teams have intuited but few have measured: a frontier coding model spends the dominant share of its token budget on read operations — looking at code it will not ultimately modify. If reads dominate the monthly bill, a triage layer placed in front of the model is not a marginal optimisation; it is the single biggest cost lever in the entire system. Most teams under-invest here because they assume the model is doing something useful with all that context. The data says it usually is not. The model is reading defensively because it has no choice — every read is “just in case” — and a well-built triage layer is precisely what gives the model permission to stop reading defensively.

What you do differently. Stop treating triage as a heavy preprocessing step that runs once per task. Make it as cheap as tokenisation, and run it on every turn. Every time the agent's goal narrows, the triage layer should re-score what it is keeping. The 0.6B skimmer is the reference design: small enough to fine-tune for your domain, small enough to run on CPU if you want, and structured enough that you can swap it out as your task definition evolves. Pass the goal prompt explicitly into the skimmer — do not bury it inside the agent's system prompt where the skimmer has to fish it out.
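The shape of a goal-conditioned, line-level pruner can be sketched without a neural model. In the example below a keyword-overlap function stands in for the 0.6B skimmer, and the scorer is a parameter so a learned model could be swapped in. Everything here (`prune`, `keyword_score`, the elision marker, and the rule that `def` and `class` lines are always kept) is an assumption for illustration, not SWE-Pruner's actual method.

```python
import re


def terms(text):
    return set(re.findall(r"[a-z_]+", text.lower()))


def keyword_score(line, goal):
    """Stand-in for a learned skimmer: count of goal terms that appear in the line."""
    return len(terms(line) & terms(goal))


def prune(source, goal, score=keyword_score, marker="# ... pruned ..."):
    kept, in_gap = [], False
    for line in source.splitlines():
        # Keep definitions so the pruned file still shows its structure.
        structural = line.lstrip().startswith(("def ", "class "))
        if structural or score(line, goal) > 0:
            kept.append(line)
            in_gap = False
        elif not in_gap:
            kept.append(marker)  # one marker per run of dropped lines
            in_gap = True
    return "\n".join(kept)


source = '''def load(path):
    data = open(path).read()
    return data

def greet(user):
    name = user.name
    return "hi " + name
'''
pruned = prune(source, goal="user is None inside greet")
```

The goal is passed to `prune` as its own argument, which mirrors the advice above: the scorer should receive the current goal directly rather than having to recover it from a system prompt.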

Headline. On SWE-Bench Verified, the standard coding-agent benchmark, SWE-Pruner cuts token usage by 23–54% while holding or improving success rate. On LongCodeQA, it achieves up to 14.84× compression. The number worth carrying into your next planning meeting: a coding agent that reads 14.84× less code does not perform worse. At production scale, that ratio is often the difference between losing money on every customer and being profitable.

Figure — SWE-Pruner's skimmer-first triage: a 0.6B model is the “door guard” before the frontier model sees code. Stages: Task + Goal (e.g. fix error) → 0.6B Skimmer (light neural model) → Score Lines (per-line utility) → Drop / Keep (preserve structure) → Pruned Ctx (to coding agent).
DEEP READ 4 · KV-CACHE-LEVEL TRIAGE

SideQuest: Model-Driven KV Cache Management for Long-Horizon Agentic Reasoning

arXiv:2602.22603 · Kariyappa & Suh · February 2026

What changes. Until 2026, almost every triage decision was made at prompt-assembly time — before the model began reasoning. SideQuest moves the decision into inference itself. A second thread — the paper calls it the “side thread” — runs in parallel with the main reasoning loop, continuously scoring tokens in the KV cache for utility and evicting low-utility tokens after the agent has had a chance to see how the reasoning is unfolding. The main reasoning thread never blocks on the side thread; eviction happens in the background, and the compressed cache flows back into the next step of reasoning.

Why it matters for triage. Prompt-time triage suffers from a structural blindness: at the moment you assemble the prompt, you have no idea which pieces of context will turn out to be load-bearing and which will turn out to be decorative. That information only exists mid-reasoning. The earliest moment you can know “step 7's observation actually mattered” is at step 8 — and by step 8, a one-shot ingestion filter has long since locked in its choices and cannot revise them. SideQuest's insight is that the right place for triage in a long-horizon agent is the same place a CPU's branch predictor lives: inside the inference path, watching what is actually happening, with the freedom to revise the cache as reasoning proceeds.

What you do differently. The architecture splits into two clear cases. If you control the inference stack — meaning you run vLLM, SGLang, TensorRT-LLM, or a similar serving framework you can modify — SideQuest's evict-in-flight design is now reachable, and for any agent that runs more than 50 steps it is probably the right architecture. If you do not control inference (which is most teams using the Anthropic or OpenAI APIs directly), the lesson is different but important: force the binding constraint back into the prompt window, where you can triage. Do not pretend the KV cache is infinite just because the API does not expose it.

The deeper point. Triage is no longer “a thing you do before the model runs.” It is “a thing you do continuously while the model runs.” This is a more demanding architecture, and it does not pay off for short-horizon agents. But for the long-horizon agents that increasingly define the frontier — multi-hour coding sessions, day-long research agents, week-long compliance reviews — it is the only architecture that holds up at depth.

Headline. SideQuest cuts peak token usage by up to 65% on long-horizon agentic tasks with minimal accuracy loss. The number that should pause your roadmap committee: the entire system was trained on 215 samples. The technique is within reach of any team with a modest fine-tuning budget.

Figure — SideQuest's side-thread keeps triage off the critical path while compressing the KV cache. Main reasoning loop (step n, step n+1, step n+2, …): a long-horizon agentic task whose KV cache grows monotonically without intervention. The auxiliary thread peeks at the cache, a model-driven judge scores KV utility in parallel and makes a per-token Evict / Retain decision, and the compressed KV cache is injected back into the main loop.
DEEP READ 5 · TRIAGE AS MODEL-TIER ROUTING

Triage: Routing Software Engineering Tasks to Cost-Effective LLM Tiers via Code Quality Signals

arXiv:2604.07494 · April 2026 · SWE-bench Lite (300 tasks)

What changes. The April 2026 paper called simply “Triage” generalises the pattern in an important direction. Until now, “context triage” meant deciding which tokens enter a given model. This paper asks the prior question: which model gets to process this task at all? A cheap signal computed over the incoming task — code-health metrics in their case, though the framework generalises to any cheap proxy — routes each task to one of three model tiers: Light (Haiku-shaped), Standard (Sonnet-shaped), or Heavy (Opus-shaped). Every tier's output passes the same verification gate. The verification gate is the safety mechanism: it is what makes safe-but-cheaper routing possible, because a wrong route gets caught at verification rather than at customer impact.

Why it matters for triage. Every team running a serious agent eventually discovers the same invoice pattern: 80% of the monthly cost comes from 20% of the calls — specifically, the heavy-model calls. Every dollar of savings in the system lives in pushing some fraction of that 20% down a tier. The hard question has always been “how do we push a call down without quietly degrading quality?” The natural temptation is to gate it before — predict the task's difficulty and route based on the prediction. The problem is that the prediction is imperfect, and an under-routed hard task does not become visibly wrong; it becomes subtly wrong, and the regression hides until it lands in front of a customer. The Triage paper's contribution is a discipline: verify after, do not gate before. Let the cheap tier try, and let the verification step catch the cases where the cheap tier was not good enough.

What you do differently. Build your agent as two concentric layers of triage. Outer layer: model-tier routing on cheap signals (this paper). Inner layer: context triage inside whichever tier got selected (the four papers above). Have the two layers report metrics separately so you can debug them independently; otherwise a routing regression and a context-pruning regression will look identical from outside. When you present this work to leadership (and you will, because the savings are large), the two numbers to put on the slide are the falsifiable conditions the paper spells out: the light-tier pass rate must exceed the inter-tier cost ratio (meaning the cheaper tier is right often enough to make routing economical), and the code-health effect size must satisfy p̂ ≥ 0.56 (meaning the signal can actually distinguish easy from hard tasks). Those are two conditions a CFO can underwrite.
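The first condition has a simple arithmetic reading under one assumed policy: try the light tier, verify, and escalate to the heavy tier only on failure. If the light tier costs c_L, the heavy tier c_H, and the light tier passes verification with probability p, the expected cost is c_L + (1 − p)·c_H, which is below the always-heavy cost c_H exactly when p > c_L / c_H. The sketch below encodes that check and a generic escalation loop. The escalation policy, the function names, and the decision to ignore verification cost are assumptions of this sketch, not details taken from the paper.

```python
def run_with_escalation(task, tiers, verify):
    """tiers: list of (name, cost, solve) ordered cheapest first.

    Every tier's output goes through the same verify() gate.
    Returns (tier_name, output, total_cost); tier_name is None if all tiers fail.
    """
    spent = 0.0
    for name, cost, solve in tiers:
        spent += cost
        output = solve(task)
        if verify(task, output):
            return name, output, spent
    return None, None, spent


def light_first_is_cheaper(pass_rate, light_cost, heavy_cost):
    """True when expected cost of try-light-then-escalate beats always-heavy."""
    expected = light_cost + (1 - pass_rate) * heavy_cost
    return expected < heavy_cost


# Toy tiers: the light tier cannot solve hard tasks; the gate rejects empty output.
tiers = [
    ("light", 1.0, lambda t: None if t["hard"] else "patch"),
    ("heavy", 5.0, lambda t: "patch"),
]
verify = lambda task, output: output is not None

easy_result = run_with_escalation({"hard": False}, tiers, verify)
hard_result = run_with_escalation({"hard": True}, tiers, verify)
```

With a cost ratio of 1:5, `light_first_is_cheaper` flips from false to true once the light-tier pass rate rises above 0.2. Reporting `total_cost` per tier, separately from any context-pruning metrics, is one way to keep the two layers debuggable on their own.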

The deeper move. Triage stops being a context filter and becomes a budget allocator across heterogeneous models. Once you see it this way, the same pattern applies to embedding models, retrieval indexes, image generators — any system with multiple tiers of cost and quality. The pattern transfers; the medium does not.

Headline. Evaluated on SWE-bench Lite (300 tasks). Three routing policies are compared head-to-head: a heuristic threshold, a trained ML classifier, and a perfect-hindsight oracle. The oracle gap is what makes the paper publishable: it tells you exactly how much savings remain on the table once a real router has been built, and gives you a concrete number to aim at as your routing improves.

Figure — Triage routes by cheap code-health signal to one of three tiers; all converge at the same verification gate. Stages: SE Task (issue + repo) → Code-Health (cheap signal) → Tier Classifier (heuristic / ML) → Light tier | Standard tier | Heavy tier → Verification (same gate for all).

Synthesis — the three convergent moves of 2026

The five papers above are independent contributions, but they line up onto a single trajectory. Read together they describe a single architectural shift in how the field thinks about triage.

(1) The priority signal has become cheap. Nobody in 2026 calls a frontier model to decide what enters another frontier model. The signal is either a small distilled scorer (SWE-Pruner's 0.6B skimmer), a metric computed directly over the input (the Triage paper's code-health proxy), or a side-thread running for free in parallel with main inference (SideQuest). If your triage layer is itself a bottleneck, your architecture is from 2024.

(2) The signal has become plan-aware. The same fact has different utility under different plans, and the field has stopped pretending otherwise. PAACE made this explicit by training a small router on (fact, plan-step) tuples; ACE achieved the same effect differently, by writing the plan-induced revisions into a persistent playbook. Either way, the scorer's key now includes the agent's current objective, not just the content being scored.

(3) Triage decisions now accumulate across turns. They are revised, not reset. Each turn reads the previous turn's priority log. When the agent learns that a P0 was mistaken, the mistake gets inscribed in the log as a revision, with a record of which observation revealed it. The triage layer becomes a learning surface in its own right — not as good as a fine-tuned model, but several times more responsive to what the agent has just discovered.

The combined recipe. Two layers (model-tier routing outside, context routing inside), one cheap shared scorer that fires every turn, an append-only priority log so revisions are auditable, and a verification gate that catches the routes that went wrong. This is the architecture that has emerged in the serious production agents of 2026 — at Anthropic, at Cognition, at Manus — and it is the architecture the next twelve months of papers will refine rather than displace. If you are starting an agent design today, this is the starting point. The open questions in the field are now about scaling each piece, not about which pieces to include.

Where this pattern is developed

A note on authorship. This page was developed as a human–AI collaboration. The paper selection, editorial framing, and engineering judgments are mine; the prose drafting and SVG flow diagrams were produced in dialogue with Claude (Anthropic). Factual claims about each paper are traceable to the linked arXiv source — please consult the originals when the details matter.