What 19 Papers Taught Me About Where Agent Engineering Is Really Going

March 30, 2026 · by Jia Huang · Originally on Substack

This is the inaugural post of Agent Design Patterns. If you build agent systems (or soon will), this newsletter brings structure to the chaos.

The reference library behind a book on agent design patterns kept growing for two years. Anthropic, OpenAI, Berkeley, Simon Willison, Swyx, Lilian Weng, Eugene Yan, Chip Huyen, Martin Fowler’s site, a dozen more. At some point it became clear the organizing had to stop and the writing had to start.

Nineteen sources later, a pattern had emerged that was not in any single source. The field is converging, and in directions most practitioners have not yet noticed.

Here is what the literature is saying, read in one sitting.

1. The competition has shifted, permanently, from models to systems

This is the single most repeated finding, supported by three independent data points that collectively defeat the “just wait for a better model” strategy:

Andrew Ng (March 2024): GPT-3.5 with an agentic workflow scored 95.1% on HumanEval. GPT-4 zero-shot scored 67%. A weaker model with a good system beat a stronger model without one, by about 28 percentage points.

AlphaCodium (2024): GPT-4 accuracy went from 19% to 44% through workflow decomposition alone. No model upgrade. No fine-tuning. Just better system design.

LangChain Deep Agents (March 2026): Same model, same benchmark, different harness. Terminal Bench score jumped 13.7 points, enough to move from Top 30 to Top 5. The only variable was the harness.

Three teams. Different benchmarks. Different years. The conclusion is consistent: system design returns now exceed model scaling returns. In advisory conversations, a recurring pattern is teams waiting months for a better model when a two-day harness redesign would have closed the gap. And when a stronger model does arrive, a well-designed harness will stack additional gains on top of it.

Swyx put this most plainly: “Harnesses can survive even reasoning paradigm changes.”

2. Harness Engineering became a discipline in 90 days

In December 2025, “harness engineering” was not a phrase anyone used. By March 2026 it appeared in:

Anthropic (January): “Effective Harnesses for Long-Running Agents.” Defined the two-phase architecture (Initializer + Coding Agent).
OpenAI (February): “Harness Engineering.” Described building a million-line internal product with zero manually written code, in one-tenth the development time.
Martin Fowler’s site (February): Birgitta Böckeler’s independent critique. Distilled three core harness components (Context Engineering, Architectural Constraints, Entropy Management).
Simon Willison (February): full guide on “Agentic Engineering Patterns.” The best practical tutorial in the field.
arXiv (March): first formal academic definition. Scaffolding (before the first prompt) versus harness (everything after).

When OpenAI, Anthropic, academia, and independent practitioners converge on the same concept within 90 days, it is not hype. It is discovery.

Four of these sources describe the same elephant from different angles. Anthropic emphasizes architecture: two-phase design, feature lists. OpenAI emphasizes operations: three components, automated PR workflows. Simon Willison emphasizes implementation: “runs tools in a loop to achieve a goal.” Martin Fowler’s site emphasizes enterprise reality: legacy challenges and verification gaps.

Same concept, four angles. Nobody has unified them yet. That is exactly what a framework is for.
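Willison's phrase, "runs tools in a loop to achieve a goal," is compact enough to sketch directly. What follows is a minimal illustration, not any vendor's implementation. The tool registry, the `scripted_model` stub standing in for a real LLM call, and the step budget are all assumptions made for the example.

```python
from typing import Callable

# Hypothetical tool registry: the tool name and behaviour are invented here.
TOOLS: dict[str, Callable[[str], str]] = {
    "add_one": lambda arg: str(int(arg) + 1),
}


def scripted_model(history: list[str]) -> tuple[str, str]:
    """Stand-in for an LLM call: returns (action, argument).

    A real harness would call a model API here. This stub keeps asking
    for the add_one tool until the latest observation is "3".
    """
    last = history[-1]
    if last == "3":
        return ("finish", last)
    return ("add_one", last)


def run_agent(goal: str, model=scripted_model, max_steps: int = 10) -> str:
    """Run tools in a loop until the model declares the goal met."""
    history = [goal]
    for _ in range(max_steps):
        action, arg = model(history)
        if action == "finish":
            return arg
        observation = TOOLS[action](arg)  # execute the chosen tool
        history.append(observation)       # feed the result back to the model
    raise RuntimeError("step budget exhausted before the goal was met")


result = run_agent("0")
```

Everything outside the model call (the registry, the history, the step budget, the stopping rule) is harness, which is why the same model can score differently under different harnesses.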

3. Authority is the most underrated dimension in agent design

Swyx’s IMPACT framework (Intent, Memory, Planning, Authority, Control Flow, Tools) makes a striking claim: Authority is “the most overlooked element” in agent systems.

Read against the other 18 sources, the claim holds.

The literature is saturated with discussions of tool calling, reasoning chains, and memory architectures. Almost nobody discusses: who authorized the agent to do this? To what extent? What happens when it exceeds its authority?

Chip Huyen’s taxonomy clarifies why this matters. She distinguishes three types of tools: Knowledge Augmentation (reading), Capability Extension (computing), and Write Actions (modifying the environment by executing SQL, sending emails, transferring money). Most agent frameworks treat these with identical permission models. That is like giving a junior employee the same database access as the DBA because “they both use SQL.”

This blind spot is the single biggest barrier to enterprise adoption. It is also why Governance occupies a full chapter in the book under preparation, as the seventh cognitive dimension, not a footnote.
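One way to picture the difference is to tag each tool with Huyen's three types and let the permission check depend on the tag. This is a sketch under stated assumptions: the tool names, the `is_authorized` function, and the "writes need human approval" policy are hypothetical, not taken from any framework or from Huyen's post.

```python
from enum import Enum


class ToolKind(Enum):
    KNOWLEDGE = "knowledge_augmentation"  # read-only lookups
    CAPABILITY = "capability_extension"   # computation, no side effects
    WRITE = "write_action"                # changes the environment


# Hypothetical tools, each tagged with its kind.
TOOL_KINDS = {
    "search_docs": ToolKind.KNOWLEDGE,
    "run_calculator": ToolKind.CAPABILITY,
    "send_email": ToolKind.WRITE,
}


def is_authorized(tool: str, approved_by_human: bool = False) -> bool:
    """Allow reads and computation freely; gate writes behind approval."""
    kind = TOOL_KINDS.get(tool)
    if kind is None:
        return False  # unknown tools are denied by default
    if kind is ToolKind.WRITE:
        return approved_by_human
    return True
```

The policy itself is deliberately crude. The point is only that authority becomes an explicit, testable property of each tool rather than something every tool inherits by default.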

4. The verification gap is the field’s open wound

Birgitta Böckeler, writing on Martin Fowler’s site, delivered the sharpest critique of any source in the set. Her assessment of OpenAI’s Harness Engineering post: it lacks “verification of functionality and behaviour.”

OpenAI showed how to make agents write good code (linters pass, architecture is consistent). They did not show how to verify that the code does the right thing. Quality is not correctness.

The gap echoes across the literature:

Eugene Yan et al. (2024): hallucination baseline rates sit at 5–10%, hard to push below 2%.
Kamoi et al. (TACL 2025): LLM self-evaluation has systematic biases.
A recurring agent failure mode: phantom issues, where the agent reflects on its work, identifies problems that do not exist, and “fixes” them, making the code worse.

Automated verification of agent output, not just quality but correctness, will be the core battlefield for the next 18 months. Whoever solves “did the agent do the right thing?” at scale will define the next generation of harness architecture.
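The phantom-issue failure mode suggests one narrow guard: accept an agent's self-revision only if every behavioural check that passed before still passes afterward. The sketch below assumes such checks already exist, which is the hard part. The names `accept_revision`, `checks`, `original`, and `phantom_fix` are hypothetical.

```python
from typing import Callable

Func = Callable[[int], int]
Check = Callable[[Func], bool]


def passed(candidate: Func, checks: list[Check]) -> set[int]:
    """Return the indices of the checks the candidate satisfies."""
    return {i for i, check in enumerate(checks) if check(candidate)}


def accept_revision(original: Func, revised: Func, checks: list[Check]) -> bool:
    """Accept a self-'fix' only if it breaks nothing that passed before."""
    return passed(original, checks) <= passed(revised, checks)


# Hypothetical behavioural checks for a function that should double its input.
checks: list[Check] = [
    lambda f: f(2) == 4,
    lambda f: f(0) == 0,
]


def original(x: int) -> int:
    return x * 2


def phantom_fix(x: int) -> int:
    return x * 2 + 1  # "fixes" a problem that never existed
```

Here `accept_revision(original, phantom_fix, checks)` rejects the revision because both checks regress. A gate like this only catches regressions against checks someone has written. It does not answer whether the checks capture the right behaviour, which is the open problem described above.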

5. Two metaphors are becoming industry vocabulary

Good metaphors spread faster than good definitions. Two are winning right now.

“Shift engineers” (Anthropic): each new context window is a new engineer arriving for a shift with no memory of what the previous engineer did. One sentence, and everyone understands why a harness is needed: LLMs have no persistent memory.

“Three eras” (Epsilla): Prompt Engineering is writing the perfect email. Context Engineering is attaching all the right files. Harness Engineering is designing the entire email system. Three levels of abstraction, one analogy.

These metaphors matter because they become the vocabulary that non-technical stakeholders use to make budget decisions. When a CTO can say “we need to move from era 2 to era 3,” the architecture argument is already half-won.
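The "shift engineers" metaphor maps onto a simple mechanism: the outgoing context window writes handoff notes, and the incoming one reads them before doing anything else. The sketch below is illustrative only. The file name `handoff.json`, the `done`/`todo` fields, and both function names are assumptions, not the format any published harness uses.

```python
import json
import tempfile
from pathlib import Path


def end_shift(notes_path: Path, done: list[str], todo: list[str]) -> None:
    """Persist what this context window finished and what remains."""
    notes_path.write_text(json.dumps({"done": done, "todo": todo}))


def start_shift(notes_path: Path) -> dict:
    """Load the previous shift's notes, or start fresh if none exist."""
    if not notes_path.exists():
        return {"done": [], "todo": []}
    return json.loads(notes_path.read_text())


with tempfile.TemporaryDirectory() as tmp:
    notes = Path(tmp) / "handoff.json"  # hypothetical file name
    first = start_shift(notes)          # the first shift finds no notes
    end_shift(notes, done=["set up repo"], todo=["add login page"])
    second = start_shift(notes)         # the next shift picks up the thread
```

The second shift starts with no memory of the first, yet it knows what was finished and what comes next, because the state lives in the harness rather than in the model.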

6. Agent engineering is eating software engineering

This may be the deepest signal in the entire set, and it comes from Simon Willison.

His Agentic Engineering Patterns guide is structured exactly like a software engineering textbook: Principles → Working with Agents → Testing and QA → Understanding Code. But the subject is not “how to write software.” It is “how to develop software with agents.”

Martin Fowler’s site reinforces this: harnesses may become the next generation of service templates, standardized project structures that define how every AI project is organized, just as microservice templates define how every service is organized today.

The implication: within two to three years, “can you work with coding agents?” will be as basic as “can you use Git?” Not a specialty. A prerequisite.

7. The gap nobody has filled

Every source in this set answers how: how to build agents, design harnesses, manage context. Almost nobody answers when and which: when to use which pattern, when to add complexity, when to stop.

Anthropic gives five workflow patterns. Swyx gives six IMPACT elements. Chip Huyen gives four control flow types. Andrew Ng gives four agentic design patterns. All useful. None composable. None with selection criteria.

That is the gap the framework in Designing AI Agents (Manning, forthcoming) is built to fill: a methodology for choosing between patterns, not another catalog of them. From “I have a problem” to “I need this pattern, configured this way, with these governance constraints.” This newsletter explores the ideas behind it, one thread per week, free, no paywall.

If you build agent systems, or plan to, subscribe. The chaos is about to get some structure.

Next: The Five Chaos Problems in Agent Design. Why “agent” has become the most overloaded word in software since “cloud.”

Appendix: The 19 Sources

For those who want to go deeper, here are the sources in recommended reading order:

Start here (2 hours):
1. Andrew Ng: Agentic Design Patterns (2024)
2. Lilian Weng: LLM Powered Autonomous Agents (2023)
3. Anthropic: Building Effective Agents (2024)

Harness deep dive (2 hours):
4. Anthropic: Effective Harnesses for Long-Running Agents (2025)
5. OpenAI: Harness Engineering (2026)
6. Böckeler / Martin Fowler: Harness Engineering (2026)
7. Inngest: Your Agent Needs a Harness, Not a Framework
8. Simon Willison: Agentic Engineering Patterns (2026)

Engineering practice (3 hours):
9. Eugene Yan et al.: What We Learned from a Year of Building with LLMs (2024)
10. Eugene Yan: Patterns for Building LLM-Based Systems (2023)
11. Chip Huyen: AI Agents (2025)
12. BAIR: The Shift from Models to Compound AI Systems (2024)

Reasoning foundations (1.5 hours):
13. Wei et al.: Chain-of-Thought Prompting (NeurIPS 2022)
14. Yao et al.: ReAct (ICLR 2023)
15. Shinn et al.: Reflexion (NeurIPS 2023)

Architecture theory (1 hour):
16. Sumers et al.: CoALA (2023)
17. arXiv 2603.05344: Building Effective AI Coding Agents (2026)

Industry perspective (30 min):
18. Swyx / Latent Space: Agent Engineering
19. Epsilla: The Three Eras (2026)