Why Every Agent Team I Know Runs Out of Context Before They Run Out of Compute

by Jia Huang · Originally on Substack

A team I reviewed last quarter had spent three months scaling their agent. They were on H100s. They had budget. They were optimistic.

What was failing was not the model. The model worked fine on shorter inputs. What was failing was the agent’s ability to keep track of what it had already seen. The team’s system prompt had grown to forty thousand tokens. The conversation history was at fifty thousand. Tool definitions ate another thirty thousand. By the time a real user query arrived, the model was reading something that looked less like a task description and more like a poorly indexed wiki.

They had not run out of compute. They had run out of room to think.

Versions of this show up on every agent project that crosses my desk. Different teams, different domains, different stacks. Same wall. They keep hitting it well before they hit any compute or model capability limit.

This essay is about that wall, and why it changes what we should consider the actual bottleneck in agent engineering.

We were trained to think the bottleneck was compute

If you grew up with the cloud era of software, your instinct about scaling is shaped by one assumption: compute is the limit, and you can throw money at it. Need more throughput? More cores. Need more memory? Bigger instance. Need more parallelism? More replicas. The cloud bill is annoying, but the curve is smooth.

Agent systems break that assumption.

You can buy bigger GPUs. You can buy faster GPUs. You can run more inferences in parallel. None of that helps when the problem is what your model can pay attention to in a single forward pass. The context window is the new bottleneck, and unlike compute, you cannot scale it horizontally.

There is no AWS instance type that gives you a context window twice as big.

There is no “Premium Tier” that lets you fit more relevant information into the same prompt slot.

Even when frontier labs raise the limit (200K, 400K, 1M, 2M tokens), real-world useful capacity does not scale linearly with the headline number. Performance degrades as the window fills with stale conversation, irrelevant tool output, and accumulated chain-of-thought. By the time the window is at 80% of the published limit, the model is operating in a different regime from the one the benchmarks measured.

So the question is not “how much context can you buy?” The question is: how do you allocate the context you have?

That is a different question from any question the cloud era asked.
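
To make the arithmetic concrete, here is a minimal sketch using the numbers from the opening example. The function name and the `usable_percent` knob are hypothetical, and treating 80% of the published limit as a usable ceiling is an assumption for illustration, not a measured threshold.

```python
# Fixed overhead from the opening example, in tokens.
FIXED_OVERHEAD = {
    "system_prompt": 40_000,
    "conversation_history": 50_000,
    "tool_definitions": 30_000,
}


def remaining_budget(window: int, overhead: dict[str, int], usable_percent: int = 100) -> int:
    """Tokens left for the actual task once the fixed overhead is loaded.

    usable_percent is an assumed derating of the published window size.
    """
    usable = window * usable_percent // 100
    return usable - sum(overhead.values())


# Headline window sizes mentioned above; the derated column assumes 80%.
for window in (200_000, 400_000, 1_000_000):
    full = remaining_budget(window, FIXED_OVERHEAD)
    derated = remaining_budget(window, FIXED_OVERHEAD, usable_percent=80)
    print(f"{window:>9,} window: {full:>9,} left at face value, {derated:>9,} at 80%")
```

On a 200K window, 120,000 tokens are spent before the user says anything. At face value that leaves 80,000; under the assumed 80% ceiling it leaves 40,000.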

The thesis

Stripped to one sentence: agent architecture is context-budget allocation under uncertainty.

The model spends. The harness budgets.

The next few weeks of essays defend this premise more fully (the MEAP for Designing AI Agents launches with the full argument). For today the goal is simpler: show the symptom. Once seen, it shows up everywhere.

Five places the context wall shows up

1. Sub-agent proliferation that nobody planned.

A team starts with one agent. Three months in, they have a planner agent, an executor agent, a critic agent, a coordinator agent, and a “router” agent. Ask any individual engineer why each one exists, and the answer is the same: the parent agent’s context window was getting full. Each sub-agent is a forced spawn. Nobody designed a multi-agent architecture; the architecture grew because the context kept overflowing. This is not multi-agent design. This is involuntary process forking.

2. RAG systems with five retrieval stages.

The team adds a vector store. Then a reranker. Then a summarizer in front of the reranker because the reranker’s input was too big. Then a query rewriter in front of all of it. Each stage is a patch on a context-budget problem the previous stage didn’t fully solve. None of them would exist if the model could just read the whole knowledge base.

3. The summarization tax that nobody measures.

Every time conversation history gets compacted into a summary, a tax is paid. The summary is shorter (good) but lossy (bad). Most teams do not measure how much information their summarizer destroys per compaction. They simply notice the agent “forgets things” or “starts hallucinating after turn fifteen” and assume the model degraded. The model did not degrade. The summarizer did its job: it compensated for a context shortage by trading information for tokens.
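
One rough way to put a number on that tax is to pick a handful of facts the agent must not lose and check how many survive each compaction. The sketch below is an assumption about how such a check could look, not a description of any team's tooling. The names `retention` and `compaction_report` are hypothetical, and verbatim substring matching is a deliberately crude stand-in for a real probe (for example, asking the agent questions whose answers depend on those facts).

```python
def retention(probe_facts: list[str], summary: str) -> float:
    """Fraction of probe facts that still appear (case-insensitive) in the summary."""
    if not probe_facts:
        return 1.0
    text = summary.lower()
    kept = sum(1 for fact in probe_facts if fact.lower() in text)
    return kept / len(probe_facts)


def compaction_report(history_tokens: int, summary_tokens: int,
                      probe_facts: list[str], summary: str) -> dict[str, float]:
    """Pair what a compaction saved (tokens) with what it cost (lost facts)."""
    return {
        "tokens_saved": history_tokens - summary_tokens,
        "retention": retention(probe_facts, summary),
    }


# Illustrative data only: four facts from a made-up support conversation.
probes = ["order 4417", "refund approved", "ships to Oslo", "prefers email"]
summary = "User asked about order 4417. Refund approved."
report = compaction_report(history_tokens=12_000, summary_tokens=400,
                           probe_facts=probes, summary=summary)
print(report)  # retention is 0.5: two of the four probe facts survived
```

Logged per compaction, a number like this turns "the agent forgets things" into a tradeoff that can be inspected: tokens saved on one side, facts lost on the other.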

4. The token cost of being thorough.

Want a thorough agent that double-checks its work? That is another reasoning trace consuming the same context window. Want it to consult three tools before answering? Each tool call’s output occupies real estate. By the time an agent thinks carefully, validates against three sources, and writes back a coherent justification, the useful context for the next user message is already half-spent.

5. Memory schema arguments that go in circles.

Listen to a senior engineering meeting on agent memory. The team will fight for ninety minutes about what to store: “we should keep the user’s full history” vs “no, just the last five turns” vs “but what about the system facts we extracted in turn three?” Underneath every memory architecture argument is the same constraint: there is not room for everything; choose what to keep. This is not a knowledge management problem. It is a budget allocation problem dressed up in knowledge management vocabulary.

If you have lived through any of these, you have already lived the thesis. Context, not compute, is the limit. Allocation, not accumulation, is the engineering problem.

What changes when this is taken seriously

Once context is the scarce resource, several things start looking different.

The patterns the AI engineering community keeps reinventing (context triage, hierarchical memory, complexity routing, sub-agent isolation, declarative skills loaded on demand) stop looking like unrelated tricks. They start looking like what they actually are: strategies for allocating a fixed context budget across competing demands. Different patterns, same underlying constraint. Same questions: what gets in, what gets out, what gets summarized, what gets sharded across processes.

The harness, the layer wrapping the model that the community has started naming explicitly, stops looking like infrastructure plumbing. It starts looking like the place where the budgeting happens. The model is not making context allocation decisions. The harness is. The harness decides what to load, what to prune, what to delegate, what to escalate. The model spends; the harness budgets.
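
A minimal sketch of what "the harness budgets" can mean in code, under stated assumptions: the harness reserves fixed slices of the window (system prompt, tool definitions, room for the reply) and then keeps only as much recent history as still fits. `Turn`, `fit_history`, and the reservation names are all hypothetical, and token counts are assumed to come from whatever tokenizer the model uses. Dropping the oldest turns is only one possible pruning policy.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    text: str
    tokens: int  # assumed to be pre-counted with the model's tokenizer


def fit_history(turns: list[Turn], window: int, reserved: dict[str, int]) -> list[Turn]:
    """Keep the most recent contiguous turns that fit after fixed reservations."""
    budget = window - sum(reserved.values())
    kept: list[Turn] = []
    for turn in reversed(turns):   # walk from newest to oldest
        if turn.tokens > budget:
            break                  # stop here so the kept history has no gaps
        kept.append(turn)
        budget -= turn.tokens
    kept.reverse()                 # restore chronological order
    return kept


# Illustrative numbers only.
history = [Turn("a", 500), Turn("b", 300), Turn("c", 400)]
reserved = {"system_prompt": 200, "tool_definitions": 100}
print([t.text for t in fit_history(history, window=1_000, reserved=reserved)])
# prints ['b', 'c']: 700 tokens remain after reservations, and the oldest turn no longer fits
```

The model never sees this decision. It receives whatever the harness let through, which is why a wrong budget or an ad hoc pruning rule shows up downstream as a model that "forgot."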

And the role of the engineer designing one of these systems stops being “build the smartest agent.” It becomes something stranger and more interesting: decide the budget, design the allocator, make the consequences observable. Most agent failure modes that recur in production reduce to a budget that was wrong, an allocator that ran ad hoc, or consequences that were invisible.

What this is building toward

Several engineering teams have worked through this allocation problem in the last year, and the book with Manning lays out a framework for it. The framework has two axes (one for what the agent is doing, one for how the work is wired) and twenty-seven recurring patterns, all of which turn out to be consequences of the same allocation premise once they are seen through the right lens.

That book, Designing AI Agents, launches in MEAP on May 26. The Chinese edition has shipped 9,000+ copies in two months and is in its 4th print run; the English edition gives the framework its first international readership.

In the meantime, one essay a week appears on this Substack until the MEAP drops, and one engineering idea a week for ten weeks after. Each builds on the same thesis: agent architecture is context-budget allocation under uncertainty.

If you are designing or operating agents in production and have hit some version of the context wall, subscribe. The next two essays go deeper into specific allocation strategies and into a related failure mode that is the single most underrated risk in agent systems today.

Jia Huang (黄佳)
AI Researcher, A*STAR Singapore
Author of Designing AI Agents (Manning, MEAP launches May 26) and RAG from First Principles (Packt)