Observability Harness
Governance × Orchestrate
Why this pattern exists
A traditional service that returns wrong results 0.5% of the time is something you can debug: you log the request, log the response, find the divergence, and fix the code. An agent that returns wrong results 0.5% of the time is a different problem entirely. The failure is non-deterministic, multi-step, and often invisible until a cascading downstream consequence surfaces, sometimes weeks later. The four golden signals of SRE (latency, traffic, errors, saturation) capture none of it.
For leadership: this is the pattern that makes “the agent is working” into a statement you can defend with evidence. Without Observability Harness, the answer to “how is the agent doing today?” is anecdotes. With it, the answer is dashboards that show cost trends, quality slopes, and drift signals before they become incidents. Galileo’s 2025 industry report found that 88% of agent incidents had observable precursors in the trace stream that nobody was watching. Observability Harness is the pattern that turns those precursors into alerts.
The agent-design problem it solves
The pattern orchestrates four signal tiers that production agents need simultaneously:
- Tier 1 · Latency — per-step and end-to-end. P50, P95, P99. The familiar SRE tier, but now broken out by LLM call vs tool call vs context-build — because the bottleneck is rarely uniform.
- Tier 2 · Cost — input tokens, output tokens, cached vs uncached, model tier, dollar cost per request. Agent cost is non-deterministic; cost observability is operationally as important as latency in agent SRE.
- Tier 3 · Quality — citation coverage, hallucination rate, user satisfaction signal, approval-rate, false-positive rate. These metrics did not exist in the traditional SRE world; they require explicit harness instrumentation.
- Tier 4 · Behaviour drift — trajectory slope. Is the agent’s accuracy on the same task distribution slowly degrading? Are decision patterns shifting? This tier is unique to the agent era and is where most cascade failures originate.
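As a minimal sketch of what Tier 1 and Tier 2 instrumentation might record, the snippet below breaks latency percentiles out by step kind and computes dollar cost per request. The `StepRecord` fields, the step-kind labels, and the prices in `PRICE_PER_MTOK` are all hypothetical placeholders chosen for illustration, not a schema or pricing taken from any vendor.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class StepRecord:
    """One step of an agent run (hypothetical schema)."""
    request_id: str
    kind: str                     # "llm", "tool", or "context_build"
    latency_ms: float
    input_tokens: int = 0
    cached_input_tokens: int = 0  # subset of input_tokens served from cache
    output_tokens: int = 0


# Assumed prices in dollars per million tokens; illustrative only.
PRICE_PER_MTOK = {"input": 3.00, "cached_input": 0.30, "output": 15.00}


def latency_percentiles(steps):
    """Tier 1: P50 / P95 / P99 latency, broken out by step kind."""
    by_kind = defaultdict(list)
    for s in steps:
        by_kind[s.kind].append(s.latency_ms)
    result = {}
    for kind, values in by_kind.items():
        if len(values) < 2:
            # statistics.quantiles needs at least two data points.
            result[kind] = {"p50": values[0], "p95": values[0], "p99": values[0]}
            continue
        cuts = quantiles(values, n=100, method="inclusive")  # 99 cut points
        result[kind] = {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
    return result


def cost_per_request(steps):
    """Tier 2: dollar cost per request, pricing cached input separately."""
    totals = defaultdict(float)
    for s in steps:
        uncached = s.input_tokens - s.cached_input_tokens
        totals[s.request_id] += (
            uncached * PRICE_PER_MTOK["input"]
            + s.cached_input_tokens * PRICE_PER_MTOK["cached_input"]
            + s.output_tokens * PRICE_PER_MTOK["output"]
        ) / 1_000_000
    return dict(totals)
```

Keeping cost keyed by `request_id` rather than summed per day is what later makes per-request outlier alerts possible.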
The pattern is Orchestrate, not Chain, because the four tiers feed a single observation layer that must aggregate, correlate, and alert across them. Latency spike + cost spike + quality dip happening together means something different than any one alone. The orchestrator is the SLO-evaluation logic that watches the four tiers as a system.
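One way to picture the orchestrator is as a function that evaluates all four tiers against their SLOs in one pass and escalates when breaches coincide. The sketch below assumes a hypothetical `TierSnapshot` and `Slo` shape and an invented severity rule (one breached tier warns, two or more page); real correlation logic would likely be richer.

```python
from dataclasses import dataclass


@dataclass
class TierSnapshot:
    """Current reading of the four tiers (hypothetical field names)."""
    p95_latency_ms: float         # Tier 1
    cost_per_request_usd: float   # Tier 2
    quality_score: float          # Tier 3, 0..1
    quality_slope_per_day: float  # Tier 4, negative means degrading


@dataclass
class Slo:
    """Thresholds for each tier (illustrative values are set by the caller)."""
    max_p95_latency_ms: float
    max_cost_per_request_usd: float
    min_quality_score: float
    min_quality_slope_per_day: float


def breached_tiers(snap, slo):
    """Return the names of the tiers whose SLO is currently violated."""
    breaches = []
    if snap.p95_latency_ms > slo.max_p95_latency_ms:
        breaches.append("latency")
    if snap.cost_per_request_usd > slo.max_cost_per_request_usd:
        breaches.append("cost")
    if snap.quality_score < slo.min_quality_score:
        breaches.append("quality")
    if snap.quality_slope_per_day < slo.min_quality_slope_per_day:
        breaches.append("drift")
    return breaches


def evaluate(snap, slo):
    """Correlate across tiers: co-occurring breaches escalate severity.

    Assumed rule: no breach is "ok", a single breach is "warn",
    two or more together are "page".
    """
    breaches = breached_tiers(snap, slo)
    if not breaches:
        severity = "ok"
    elif len(breaches) == 1:
        severity = "warn"
    else:
        severity = "page"
    return severity, breaches
```

The point of the design is that `evaluate` sees all tiers at once, so a combined latency, cost, and quality breach is reported as one event rather than three unrelated alerts.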
Deep thinking direction
Tier 4 — trajectory slope — is the conceptual contribution that distinguishes agent SRE from traditional SRE. Traditional SRE asks “is the service healthy right now?” Agent SRE asks “is the agent’s behaviour drifting away from where it was last week?” The slope is more important than the snapshot because agent failures are slow: an agent that was 94% accurate two months ago and is now 91% is a problem, even though both numbers look acceptable on a snapshot dashboard.
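The difference between a snapshot check and a slope check can be sketched with an ordinary least-squares fit over periodic accuracy samples. The function names and both thresholds (`snapshot_floor`, `slope_floor`) are assumptions made for illustration; the statistical method a production harness uses may well differ.

```python
def slope_per_day(days, values):
    """Least-squares slope of values against days (units: value per day)."""
    n = len(days)
    if n < 2 or n != len(values):
        raise ValueError("need at least two paired samples")
    mean_x = sum(days) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in days)
    if sxx == 0:
        raise ValueError("all samples share the same day")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, values))
    return sxy / sxx


def drift_alert(days, values, snapshot_floor=0.90, slope_floor=-0.0003):
    """Compare a snapshot threshold with a slope threshold.

    snapshot_floor and slope_floor are hypothetical thresholds.
    """
    slope = slope_per_day(days, values)
    return {
        "slope_per_day": slope,
        "snapshot_alert": values[-1] < snapshot_floor,  # looks only at "now"
        "slope_alert": slope < slope_floor,             # looks at the trend
    }


# Accuracy sliding from 94% to 91% over 60 days, sampled every 10 days.
days = list(range(0, 61, 10))
accuracy = [0.94 - 0.0005 * d for d in days]
report = drift_alert(days, accuracy)
```

With these inputs the latest accuracy is still above the assumed 90% floor, so the snapshot check stays silent, while the fitted slope of about -0.0005 per day crosses the assumed slope floor and raises the drift alert.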
Three failure modes recur.
- Snapshot Trap: monitoring only the current quality number, missing the slope. By the time the number is bad enough to alert, the cascade has been building for weeks. The discipline is alerting on slope, not on snapshot threshold.
- Citation Theatre: tracking citation coverage as 100% without verifying that the citations point to real source positions. The model can synthesize plausible-looking citations indefinitely. The discipline is hard-match citation verification at write time, treated as a quality signal not just a presentation feature.
- Cost Blindness: monitoring cost in aggregate not per-request. A 10x outlier request is invisible in daily totals until billing arrives. The discipline is per-request cost distribution monitoring with explicit outlier alerts.
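Two of these disciplines lend themselves to a short sketch: hard-match citation verification and per-request cost outlier detection. The `Citation` shape (source id plus character offsets and the quoted text) and the rule of flagging any request above a multiple of the median cost are assumptions for illustration, not a prescribed schema or detector.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Citation:
    """A citation as emitted by the agent (hypothetical schema)."""
    source_id: str
    start: int   # character offset into the source text
    end: int     # exclusive end offset
    quote: str   # text the agent claims is at [start:end]


def citation_is_real(citation, sources):
    """Hard match: the quoted text must sit at the claimed source position."""
    text = sources.get(citation.source_id)
    if text is None:
        return False  # cited source does not exist
    if not (0 <= citation.start < citation.end <= len(text)):
        return False  # offsets fall outside the source
    return text[citation.start:citation.end] == citation.quote


def verified_citation_rate(citations, sources):
    """Share of citations that pass the hard match.

    Assumption: an output with no citations counts as 0.0, not 1.0.
    """
    if not citations:
        return 0.0
    verified = sum(citation_is_real(c, sources) for c in citations)
    return verified / len(citations)


def cost_outliers(cost_by_request, multiple=10.0):
    """Request ids whose cost exceeds `multiple` times the median cost."""
    if not cost_by_request:
        return []
    threshold = multiple * median(cost_by_request.values())
    return sorted(r for r, c in cost_by_request.items() if c > threshold)
```

Running `verified_citation_rate` at write time yields a Tier 3 quality signal that can differ sharply from raw citation coverage, and `cost_outliers` surfaces the single expensive request that a daily total would hide.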
The architectural insight is that Observability Harness is the OpenTelemetry / Datadog APM pattern reborn with two extra tiers. Tiers 1 and 2 (latency, cost) map cleanly onto existing SRE infrastructure. Tiers 3 and 4 (quality, drift) require new instrumentation that only the harness can produce — the model itself does not know whether its citation was real or hallucinated. Engineers who built APM systems for microservices in 2014-2020 recognize the orchestration pattern in minutes; the harness layer is where the new work lives.
Engineering blog posts — curated
- OpenTelemetry Generative AI semantic conventions. Standardized span attributes for LLM and agent traces. The de facto schema for Tier 1 + Tier 2 instrumentation; vendor-neutral, increasingly supported by Datadog, Honeycomb, Jaeger.
- Observability for LLM applications — Honeycomb. The clearest practitioner argument for high-cardinality observability over pre-aggregated dashboards in the agent era. Reasoning traces are inherently high-cardinality.
- Cascade Failures Have Observable Precursors — Galileo. Empirical study: 88% of agent incidents had detectable precursor patterns in trace data; only 12% of teams were monitoring for them.
- Agent Observability Patterns — Langfuse. Reference implementation for agent traces with explicit session / trace / span / generation hierarchy. Open-source so the schema is inspectable.
- LLM Observability — Datadog. Enterprise-vendor framing of the four-tier observability story. Useful as a counterweight reference when proposing Observability Harness in regulated industries.
Latest paper progress (arXiv)
- Trajectory-Level Evaluation of LLM Agents. Foundational paper for evaluating agents along their full reasoning trajectory, not just final answer. Trajectory-level scoring is the Tier 4 conceptual foundation.
- A Survey on Evaluation of Large Language Model Agents. Catalogues evaluation metrics across agent benchmarks. Tier 3 quality-signal candidates are surveyed in section 4.
- Detecting Distribution Shift in LLM Agent Trajectories. Statistical tests for detecting behaviour drift in agent trajectories. Direct ancestor for the Tier 4 slope-monitoring metric.
- Cost-Quality Frontier Analysis for Production LLM Agents. Methodology for placing cost-quality tradeoffs on a Pareto curve. Useful for setting SLO targets that span Tier 2 and Tier 3 simultaneously.
- Hallucination Detection at Scale: Citation Verification in Agent Outputs. Hard-match verification techniques for citation-bearing agent outputs. The reference for implementing the citation-theatre defense.
Related patterns
Where this pattern is developed
- Manning book — Designing AI Agents, Chapter 9 §9.4 (Governance / Observability Harness).
- Paper — Huang & Zhou (2026), §4.8 Pattern 8.