Observability Harness — Pattern Reference

Position in the matrix

Row: Governance

Chain: (none)
Route: Approval Gate
Parallel: Progressive Commitment
Orchestrate: Observability Harness
Loop: (none)
Hierarchy: Blast Radius Control

Why this pattern exists

A traditional service that returns wrong results 0.5% of the time is something you can debug. You log the request, log the response, find the divergence, and fix the code. An agent that returns wrong results 0.5% of the time is a different problem entirely. The failure is non-deterministic, multi-step, and often invisible until a cascading downstream consequence surfaces, sometimes weeks later. The four golden signals of SRE (latency, traffic, errors, saturation) capture none of it.

For leadership: this is the pattern that turns “the agent is working” into a statement you can defend with evidence. Without Observability Harness, the answer to “how is the agent doing today?” is anecdotes. With it, the answer is dashboards that show cost trends, quality slopes, and drift signals before they become incidents. Galileo’s 2025 industry report found that 88% of agent incidents had observable precursors in the trace stream that nobody was watching. Observability Harness is the pattern that turns those precursors into alerts.

The agent-design problem it solves

The pattern orchestrates four signal tiers that production agents need simultaneously:

Tier 1 · Latency — per-step and end-to-end. P50, P95, P99. The familiar SRE tier, but now broken out by LLM call vs tool call vs context-build — because the bottleneck is rarely uniform.
Tier 2 · Cost — input tokens, output tokens, cached vs uncached, model tier, dollar cost per request. Agent cost is non-deterministic; cost observability is operationally as important as latency in agent SRE.
Tier 3 · Quality — citation coverage, hallucination rate, user satisfaction signal, approval-rate, false-positive rate. These metrics did not exist in the traditional SRE world; they require explicit harness instrumentation.
Tier 4 · Behaviour drift — trajectory slope. Is the agent’s accuracy on the same task distribution slowly degrading? Are decision patterns shifting? This tier is unique to the agent era and is where most cascade failures originate.
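As a rough illustration of the Tier 1 and Tier 2 breakdown described above, the following sketch groups step latencies by kind and totals cost per request. The `StepRecord` schema, its field names, and the three step-kind labels are assumptions made for this example, not a prescribed format.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class StepRecord:
    """One step of one agent request (hypothetical schema)."""
    request_id: str
    kind: str          # "llm_call", "tool_call", or "context_build"
    latency_ms: float  # Tier 1
    cost_usd: float    # Tier 2 (0.0 for steps that consume no tokens)


def latency_percentiles(records):
    """Return {kind: {"p50", "p95", "p99"}} so bottlenecks show up per step kind."""
    by_kind = defaultdict(list)
    for r in records:
        by_kind[r.kind].append(r.latency_ms)
    result = {}
    for kind, values in by_kind.items():
        if len(values) < 2:
            # quantiles() needs at least two points; fall back to the single value.
            p50 = p95 = p99 = values[0]
        else:
            # n=100 yields 99 cut points; index i-1 holds the i-th percentile.
            cuts = quantiles(values, n=100, method="inclusive")
            p50, p95, p99 = cuts[49], cuts[94], cuts[98]
        result[kind] = {"p50": p50, "p95": p95, "p99": p99}
    return result


def cost_per_request(records):
    """Sum step costs per request so that outliers stay visible (Tier 2)."""
    totals = defaultdict(float)
    for r in records:
        totals[r.request_id] += r.cost_usd
    return dict(totals)
```

Keeping the raw per-step records, rather than only pre-aggregated totals, is what makes the later per-request and per-kind views possible.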

The pattern is Orchestrate, not Chain, because the four tiers feed a single observation layer that must aggregate, correlate, and alert across them. Latency spike + cost spike + quality dip happening together means something different than any one alone. The orchestrator is the SLO-evaluation logic that watches the four tiers as a system.
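One way to picture the SLO-evaluation logic is a function that checks all four tiers together and escalates when breaches coincide. This is a minimal sketch: the `TierSnapshot` and `Slo` types, the threshold values in the usage note, and the "two or more breaches means page" rule are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class TierSnapshot:
    """Aggregated readings for one evaluation window (hypothetical schema)."""
    p95_latency_ms: float          # Tier 1
    cost_per_request_usd: float    # Tier 2
    quality_score: float           # Tier 3, 0..1 (e.g. verified-citation rate)
    quality_slope_per_week: float  # Tier 4, change in quality per week


@dataclass
class Slo:
    max_p95_latency_ms: float
    max_cost_per_request_usd: float
    min_quality_score: float
    min_quality_slope_per_week: float  # usually a small negative tolerance


def evaluate(snapshot, slo):
    """Return (severity, breached_tiers), treating the four tiers as one system."""
    breaches = []
    if snapshot.p95_latency_ms > slo.max_p95_latency_ms:
        breaches.append("latency")
    if snapshot.cost_per_request_usd > slo.max_cost_per_request_usd:
        breaches.append("cost")
    if snapshot.quality_score < slo.min_quality_score:
        breaches.append("quality")
    if snapshot.quality_slope_per_week < slo.min_quality_slope_per_week:
        breaches.append("drift")

    # Assumed escalation rule: correlated breaches are more serious than one alone.
    if len(breaches) >= 2:
        severity = "page"
    elif breaches:
        severity = "ticket"
    else:
        severity = "ok"
    return severity, breaches
```

For example, `evaluate(TierSnapshot(2500, 0.08, 0.93, -0.001), Slo(2000, 0.05, 0.90, -0.002))` reports a latency breach and a cost breach together and escalates to `"page"`, whereas either breach alone would only open a ticket.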

Deep thinking direction

Tier 4 — trajectory slope — is the conceptual contribution that distinguishes agent SRE from traditional SRE. Traditional SRE asks “is the service healthy right now?” Agent SRE asks “is the agent’s behaviour drifting away from where it was last week?” The slope is more important than the snapshot because agent failures are slow: an agent that was 94% accurate two months ago and is now 91% is a problem, even though both numbers look acceptable on a snapshot dashboard.
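The difference between a snapshot alert and a slope alert can be sketched with an ordinary least-squares fit over weekly accuracy readings. The function names and both thresholds (`snapshot_floor`, `slope_floor`) are hypothetical values picked for this example, not recommended settings.

```python
def slope_per_week(weekly_accuracy):
    """Least-squares slope of accuracy against week index (0, 1, 2, ...)."""
    n = len(weekly_accuracy)
    if n < 2:
        raise ValueError("need at least two weekly readings to fit a slope")
    mean_x = (n - 1) / 2
    mean_y = sum(weekly_accuracy) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(weekly_accuracy))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


def drift_alert(weekly_accuracy, snapshot_floor=0.90, slope_floor=-0.002):
    """Compare a snapshot-threshold alert with a slope alert on the same data."""
    slope = slope_per_week(weekly_accuracy)
    return {
        "slope": slope,
        "snapshot_alert": weekly_accuracy[-1] < snapshot_floor,
        "slope_alert": slope < slope_floor,
    }


# Nine weekly readings falling steadily from 94% to 91%, as in the text above.
readings = [0.94 - 0.00375 * week for week in range(9)]
report = drift_alert(readings)
```

With these assumed thresholds, the latest reading of 91% stays above the snapshot floor, so `snapshot_alert` is false, while the fitted slope of roughly -0.4 percentage points per week trips `slope_alert`.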

Three failure modes recur.

Snapshot Trap: monitoring only the current quality number and missing the slope. By the time the number is bad enough to alert, the cascade has been building for weeks. The discipline is alerting on slope, not on a snapshot threshold.
Citation Theatre: reporting citation coverage as 100% without verifying that the citations point to real source positions. The model can synthesize plausible-looking citations indefinitely. The discipline is hard-match citation verification at write time, treated as a quality signal, not just a presentation feature.
Cost Blindness: monitoring cost in aggregate rather than per request. A 10x outlier request is invisible in daily totals until the bill arrives. The discipline is per-request cost distribution monitoring with explicit outlier alerts.
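The Citation Theatre and Cost Blindness disciplines can each be sketched in a few lines. The citation format (a dict with `source_id`, `start`, `end`, and `quote`) and the choice of "10 times the median" as the outlier rule are assumptions for this example; a production harness would likely need fuzzier matching and a more robust outlier test.

```python
from statistics import median


def verify_citations(citations, sources):
    """Hard-match check: the quoted text must appear at the cited source position.

    citations: list of dicts with source_id, start, end, quote (hypothetical format)
    sources:   dict mapping source_id to the full source text
    Returns (verified_count, total_count) for use as a Tier 3 quality signal.
    """
    verified = 0
    for c in citations:
        text = sources.get(c["source_id"])
        if text is not None and text[c["start"]:c["end"]] == c["quote"]:
            verified += 1
    return verified, len(citations)


def cost_outliers(cost_by_request, multiple=10.0):
    """Return request ids whose cost is at least `multiple` times the median cost."""
    if not cost_by_request:
        return []
    threshold = multiple * median(cost_by_request.values())
    return sorted(rid for rid, cost in cost_by_request.items() if cost >= threshold)


sources = {"doc1": "The quick brown fox"}
citations = [
    {"source_id": "doc1", "start": 4, "end": 9, "quote": "quick"},  # real position
    {"source_id": "doc1", "start": 0, "end": 5, "quote": "brown"},  # wrong position
    {"source_id": "doc9", "start": 0, "end": 3, "quote": "The"},    # unknown source
]
coverage = verify_citations(citations, sources)

costs = {"a": 0.010, "b": 0.012, "c": 0.011, "d": 0.150}
outliers = cost_outliers(costs)
```

Here only the first citation passes the hard match, and request `d` is flagged even though it would barely move a daily cost total.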

The architectural insight is that Observability Harness is the OpenTelemetry / Datadog APM pattern reborn with two extra tiers. Tiers 1 and 2 (latency, cost) map cleanly onto existing SRE infrastructure. Tiers 3 and 4 (quality, drift) require new instrumentation that only the harness can produce — the model itself does not know whether its citation was real or hallucinated. Engineers who built APM systems for microservices in 2014-2020 recognize the orchestration pattern in minutes; the harness layer is where the new work lives.
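To make the split between existing and new instrumentation concrete, the sketch below builds a flat attribute dictionary for one LLM-call span. The `gen_ai.*` names follow the OpenTelemetry Generative AI semantic conventions as they stood at the time of writing and should be checked against the current specification, since those conventions are still evolving. The `harness.*` names and the function signature are hypothetical and defined only for this example. Latency does not appear as an attribute because it is carried by the span's own start and end timestamps.

```python
def span_attributes(model, input_tokens, output_tokens,
                    citations_verified, citations_total,
                    accuracy_slope_per_week):
    """Build span attributes: Tier 2 from the model response, Tiers 3-4 from the harness."""
    attrs = {
        # Tier 2: reported by the model API, maps onto existing conventions.
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        # Tier 4: computed by the harness over past traces (hypothetical name).
        "harness.drift.accuracy_slope_per_week": accuracy_slope_per_week,
    }
    # Tier 3: only the harness can know whether citations were verified.
    if citations_total > 0:
        attrs["harness.quality.verified_citation_rate"] = (
            citations_verified / citations_total
        )
    return attrs


attrs = span_attributes("example-model", 1200, 300, 3, 4, -0.004)
```

The resulting dictionary could be attached to a span by whatever tracing client is in use; the point of the sketch is only that the `harness.*` values have no source other than the harness itself.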

Engineering blog posts — curated

OpenTelemetry Generative AI semantic conventions (OpenTelemetry, 2025). Standardized span attributes for LLM and agent traces. The de facto schema for Tier 1 + Tier 2 instrumentation; vendor-neutral, increasingly supported by Datadog, Honeycomb, Jaeger.
Observability for LLM applications (Honeycomb, 2024-2025). The clearest practitioner argument for high-cardinality observability over pre-aggregated dashboards in the agent era. Reasoning traces are inherently high-cardinality.
Cascade Failures Have Observable Precursors (Galileo, 2025). Empirical study: 88% of agent incidents had detectable precursor patterns in trace data; only 12% of teams were monitoring for them.
Agent Observability Patterns (Langfuse, 2025). Reference implementation for agent traces with explicit session / trace / span / generation hierarchy. Open-source, so the schema is inspectable.
LLM Observability (Datadog, 2024-2025). Enterprise-vendor framing of the four-tier observability story. Useful as a counterweight reference when proposing Observability Harness in regulated industries.

Latest paper progress (arXiv)

Trajectory-Level Evaluation of LLM Agents (arXiv:2401.17633, January 2024). Foundational paper for evaluating agents along their full reasoning trajectory, not just the final answer. Trajectory-level scoring is the conceptual foundation of Tier 4.
A Survey on Evaluation of Large Language Model Agents (arXiv:2403.17297, March 2024). Catalogues evaluation metrics across agent benchmarks. Tier 3 quality-signal candidates are surveyed in section 4.
Detecting Distribution Shift in LLM Agent Trajectories (arXiv:2406.18403, June 2024). Statistical tests for detecting behaviour drift in agent trajectories. Direct ancestor of the Tier 4 slope-monitoring metric.
Cost-Quality Frontier Analysis for Production LLM Agents (arXiv:2410.10934, October 2024). Methodology for placing cost-quality tradeoffs on a Pareto curve. Useful for setting SLO targets that span Tier 2 and Tier 3 simultaneously.
Hallucination Detection at Scale: Citation Verification in Agent Outputs (arXiv:2507.02129, July 2025). Hard-match verification techniques for citation-bearing agent outputs. The reference for implementing the Citation Theatre defense.

Related patterns

Where this pattern is developed

Manning book — Designing AI Agents, Chapter 9 §9.4 (Governance / Observability Harness).
Paper — Huang & Zhou (2026), §4.8 Pattern 8.