Observability

Agents are non-deterministic. Without observability, failures are silent and costs invisible. The discipline is small: trace everything, aggregate into metrics, alert on anomalies.

What to Monitor

Signal	How	Reference
Task completion	End-to-end success rate	SWE-Bench Verified / GAIA / WebArena SOTA
Tool correctness	Well-formed calls, valid args	ToolBench, τ2-bench, BFCL v3
Cost per task	Token + API spend per run	Per-task ceiling
Context utilization	% of window consumed at task end	Status-line monitoring
Safety violations	Guardrail trigger rate	Audit trails
Performance degradation	Success rate vs. step count	Step / time budgets

The Observability Stack

graph LR
    subgraph traces ["Traces (per-step)"]
        T1["Prompt"] --> T2["Tool Call"] --> T3["Result"] --> T4["Reasoning"]
    end
    subgraph metrics ["Metrics (aggregate)"]
        M1["Cost"]
        M2["Latency"]
        M3["Success Rate"]
    end
    subgraph alerts ["Alerts (real-time)"]
        A1["Safety Violation"]
        A2["Cost Spike"]
        A3["Stuck Agent"]
    end
    traces --> metrics --> alerts

Traces (per-step)

Each entry captures the prompt, the tool call, its arguments, the raw result, and the agent’s reasoning toward the next step. Traces are the atomic unit of agent observability: they support debugging, evaluation, and post-hoc analysis without re-running the task.

Metrics (aggregate)

Roll traces into cost, latency, success rate, and context utilization. Dashboards expose trends — a creeping cost-per-task or sliding success rate — that no individual trace surfaces. Slice by agent version and model version so regressions have a place to land.
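
As a minimal sketch of that roll-up, the snippet below groups per-task summaries by agent version and model version and computes success rate, mean cost, and mean latency per slice. The field names (`agent_version`, `cost_usd`, `latency_s`, and so on) and the sample values are assumptions made for illustration, not a schema from any particular tool.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-task summaries, each derived from one task's trace.
tasks = [
    {"agent_version": "1.2", "model_version": "m-a", "cost_usd": 0.40, "latency_s": 30.0, "success": True},
    {"agent_version": "1.2", "model_version": "m-a", "cost_usd": 0.60, "latency_s": 50.0, "success": True},
    {"agent_version": "1.3", "model_version": "m-a", "cost_usd": 0.80, "latency_s": 60.0, "success": True},
    {"agent_version": "1.3", "model_version": "m-a", "cost_usd": 1.20, "latency_s": 90.0, "success": False},
]


def aggregate(rows):
    """Group task summaries by (agent_version, model_version) and average them."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["agent_version"], row["model_version"])].append(row)
    return {
        key: {
            "tasks": len(members),
            "success_rate": mean(1.0 if m["success"] else 0.0 for m in members),
            "mean_cost_usd": mean(m["cost_usd"] for m in members),
            "mean_latency_s": mean(m["latency_s"] for m in members),
        }
        for key, members in groups.items()
    }


metrics = aggregate(tasks)
for key, values in sorted(metrics.items()):
    print(key, values)
```

Keying every aggregate on the version pair is what lets a drop in success rate or a rise in cost be attributed to a specific release rather than to the fleet as a whole.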

Alerts (real-time)

Fire on a tripped guardrail, a task that blows its cost ceiling, or a stuck agent (no new tool calls, or repeated identical actions, over a configured window). Keep alerts conservative: false positives train operators to ignore the signal.
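
The three alert conditions can be sketched as plain predicates. This is an illustrative sketch only: the function names, the alert labels, and the default thresholds (a window of 3 repeated actions, 120 idle seconds) are assumptions, and real thresholds would need tuning against observed false-positive rates.

```python
def is_stuck(actions, idle_seconds, window=3, idle_limit_s=120.0):
    """Stuck means no new tool call for too long, or the same action repeated.

    `actions` is a list of (tool_name, serialized_args) tuples, oldest first.
    """
    if idle_seconds > idle_limit_s:
        return True
    if len(actions) < window:
        return False
    tail = actions[-window:]
    return all(action == tail[0] for action in tail)


def check_alerts(actions, idle_seconds, cost_usd, ceiling_usd, guardrail_tripped):
    """Return the list of alert labels that should fire for one running task."""
    alerts = []
    if guardrail_tripped:
        alerts.append("safety_violation")
    if cost_usd > ceiling_usd:
        alerts.append("cost_spike")
    if is_stuck(actions, idle_seconds):
        alerts.append("stuck_agent")
    return alerts


looping = [("search", '{"q": "a"}')] * 3
print(check_alerts(looping, idle_seconds=10.0, cost_usd=0.2,
                   ceiling_usd=1.0, guardrail_tripped=False))
```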

Trace-First

Log everything as structured records (JSON or protocol buffers, not free-form log lines), tagged with task ID, agent version, and model version. Include timestamps, step indices, token counts, and per-step cost. The same trace becomes a debugger, training corpus, audit artifact, and offline evaluation substrate.
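
One possible shape for such a record, sketched with the standard library only. The `TraceStep` class and its field names are hypothetical, chosen to mirror the fields listed above. They are not taken from any tracing product or standard.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class TraceStep:
    # Tags used to slice and join records later.
    task_id: str
    agent_version: str
    model_version: str
    step_index: int
    # What happened at this step.
    prompt: str
    tool_name: Optional[str]
    tool_args: dict
    result: str
    reasoning: str
    # Accounting for this step only.
    input_tokens: int
    output_tokens: int
    cost_usd: float
    # UTC timestamp in ISO 8601 form, filled in at creation time.
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def to_json_line(step: TraceStep) -> str:
    """Serialize one step as a single JSON object."""
    return json.dumps(asdict(step), sort_keys=True)


step = TraceStep(
    task_id="task-001",
    agent_version="1.3",
    model_version="m-a",
    step_index=0,
    prompt="Find the order status for customer 42.",
    tool_name="lookup_order",
    tool_args={"customer_id": 42},
    result='{"status": "shipped"}',
    reasoning="Order found; next step is to summarize for the user.",
    input_tokens=120,
    output_tokens=35,
    cost_usd=0.0009,
)
record = to_json_line(step)
print(record)
```

Because each step carries its own tags and accounting, a record stays interpretable when it is pulled out of context for evaluation or audit.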

OpenTelemetry GenAI

OpenTelemetry’s GenAI semantic conventions (SemConv 1.40.0, mid-April 2026) are the emerging vendor-neutral substrate. Status is mixed:

Client spans and metrics (gen_ai.client.*) — Stable since early 2026. One span per LLM round-trip, interoperable across backends.
Agent and framework spans (gen_ai.agent.*, invoke_agent) — still Development, but widely adopted by Datadog, Honeycomb, Jaeger v2, LangSmith, OpenLLMetry, Langfuse, and Arize.

Canonical trace tree: invoke_agent at the top, with chat, execute_tool, and retrieval spans as children. Three reasons to adopt OTel — unified instrumentation (same pipeline as HTTP services), vendor neutrality (export to any compatible backend), and shared semantic conventions for cross-team comparison. Client-level traces are portable today; agent-level trees are converging but not yet 1:1 portable across backends.
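
To make the tree shape concrete, here is a plain-Python model of it. This sketch deliberately does not use the OpenTelemetry SDK: the `Span` class, the `render` helper, and the span names are hypothetical, and only the operation names (`invoke_agent`, `chat`, `execute_tool`, retrieval) come from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    operation: str  # e.g. "invoke_agent", "chat", "execute_tool", "retrieval"
    name: str       # illustrative label for this span
    children: list = field(default_factory=list)


def render(span, depth=0):
    """Return the tree as indented lines, parent first."""
    lines = ["  " * depth + f"{span.operation}: {span.name}"]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


# One agent invocation with its LLM calls, retrieval, and tool execution
# recorded as child spans.
root = Span("invoke_agent", "support-agent", [
    Span("chat", "plan next step"),
    Span("retrieval", "search knowledge base"),
    Span("execute_tool", "lookup_order"),
    Span("chat", "compose answer"),
])

print("\n".join(render(root)))
```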

Vendor Landscape (May 2026)

The original five have become a broader tier-1:

LangSmith — LangChain/LangGraph default; per-trace pricing.
Langfuse — leading open-source / self-host; broadest framework support.
Arize (AX + Phoenix) — eval-rigor leader, strongest for RAG.
Maxim AI — end-to-end simulation + eval + observability.
Braintrust — production-trace-to-eval loop with PR gating.
Helicone — proxy-based, zero-instrumentation observability.
Galileo — sub-200ms Luna-2 eval models for online scoring.
W&B Weave — research/experimentation slant.
Datadog LLM Observability — APM incumbent, first-class for GenAI.

Comet Opik still ships but has slipped in mindshare and no longer belongs at the top. Select on: OTel ingestion, replay/diff, evaluation integration, cost tracking, team workflows.

Benchmark SOTA (May 2026)

External reference points, with the current honest numbers:

Benchmark	SOTA	Notes
SWE-Bench Verified	Claude Mythos Preview 93.9% / Opus 4.7 87.6%	Opus 4.5’s ~80% is well below current frontier
GAIA	OPS-Agentic-Search 92.36% (agent system) / 52.3% (base model)	The “90%” claim only holds for agent-system scoring
WebArena	OpAgent 71.6% (single-agent SOTA) / frontier models 65–69%	Human baseline ~78%
Tool use	ToolBench still cited; largely displaced by τ2-bench, BFCL v3, MCP-Bench, FinTrace	The standard suite has fractured

Caveat. A 2026 UC Berkeley RDI study showed eight major agent benchmarks — SWE-Bench Verified, Terminal-Bench, WebArena, OSWorld, GAIA among them — can be gamed to near-perfect scores without solving the underlying tasks. Treat raw SOTA as orientation, not truth.

Evaluation Dimensions

End-to-end. Did the agent reach the goal, stay within boundaries, and do so under budget? Success, safety, and cost together define operational quality.
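
A minimal sketch of scoring the three conditions together, so that a task counts as passed only when it succeeds, stays safe, and stays under budget. `TaskOutcome`, `passes`, and the sample ceiling are hypothetical names and values for illustration.

```python
from dataclasses import dataclass


@dataclass
class TaskOutcome:
    reached_goal: bool
    guardrail_violations: int
    cost_usd: float


def passes(outcome, cost_ceiling_usd):
    """A task passes only if all three conditions hold at once."""
    return (
        outcome.reached_goal
        and outcome.guardrail_violations == 0
        and outcome.cost_usd <= cost_ceiling_usd
    )


def pass_rate(outcomes, cost_ceiling_usd):
    """Fraction of tasks that pass the combined check."""
    return sum(passes(o, cost_ceiling_usd) for o in outcomes) / len(outcomes)


outcomes = [
    TaskOutcome(True, 0, 0.50),   # passes
    TaskOutcome(True, 1, 0.40),   # reached the goal but tripped a guardrail
    TaskOutcome(True, 0, 3.00),   # reached the goal but blew the budget
    TaskOutcome(False, 0, 0.20),  # cheap and safe but did not finish
]
print(pass_rate(outcomes, cost_ceiling_usd=1.0))
```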

Tool use. Argument validity, correct tool selection, and recovery from failures. Instrument the execution layer to log every call — including malformed ones rejected before execution.
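
One way to instrument the execution layer, sketched under simple assumptions: every call is validated against a schema and logged whether or not it runs. `TOOL_SCHEMAS`, `validate_call`, `execute`, and the `search` tool are all hypothetical. A production system would more likely validate against JSON Schema or typed models.

```python
# Hypothetical schema: argument name -> expected Python type.
TOOL_SCHEMAS = {"search": {"query": str, "limit": int}}

call_log = []  # every attempted call lands here, executed or not


def validate_call(tool_name, args):
    """Return a list of problems; an empty list means the call is well-formed."""
    schema = TOOL_SCHEMAS.get(tool_name)
    if schema is None:
        return [f"unknown tool: {tool_name}"]
    problems = []
    for key, expected in schema.items():
        if key not in args:
            problems.append(f"missing argument: {key}")
        elif not isinstance(args[key], expected):
            problems.append(f"wrong type for {key}: expected {expected.__name__}")
    for key in args:
        if key not in schema:
            problems.append(f"unexpected argument: {key}")
    return problems


def execute(tool_name, args, tools):
    """Log the call, then run it only if validation found no problems."""
    problems = validate_call(tool_name, args)
    call_log.append({"tool": tool_name, "args": args,
                     "problems": problems, "executed": not problems})
    if problems:
        return None
    return tools[tool_name](**args)


tools = {"search": lambda query, limit: [f"{query}-{i}" for i in range(limit)]}
ok = execute("search", {"query": "otel", "limit": 2}, tools)
bad = execute("search", {"query": "otel", "limit": "2"}, tools)  # rejected
```

The rejected call never reaches the tool, but it still appears in `call_log`, which is what makes malformed-call rates measurable.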

Efficiency. Two agents that both succeed can differ by an order of magnitude in cost and step count. Track latency, total cost per task (including retries), and step count.
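
A small sketch of the accounting, with retries counted toward the task total. The attempt records and their numbers are invented for illustration.

```python
def task_totals(attempts):
    """Sum cost, steps, and latency across every attempt, retries included."""
    return {
        "attempts": len(attempts),
        "cost_usd": sum(a["cost_usd"] for a in attempts),
        "steps": sum(a["steps"] for a in attempts),
        "latency_s": sum(a["latency_s"] for a in attempts),
    }


# Both agents eventually succeed on the same task.
agent_a = task_totals([
    {"cost_usd": 0.05, "steps": 4, "latency_s": 12.0},
])
agent_b = task_totals([
    {"cost_usd": 0.20, "steps": 15, "latency_s": 40.0},  # failed attempt
    {"cost_usd": 0.30, "steps": 25, "latency_s": 70.0},  # retry that succeeded
])

cost_ratio = agent_b["cost_usd"] / agent_a["cost_usd"]
step_ratio = agent_b["steps"] / agent_a["steps"]
print(f"cost ratio: {cost_ratio:.1f}x, step ratio: {step_ratio:.1f}x")
```

Counting only the successful attempt would understate the second agent's cost, which is why the totals are summed over all attempts.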

Robustness. Layout changes, tool outages, ambiguous inputs. Measuring this requires adversarial evaluation: apply deliberate perturbations and measure the impact on success rate.
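
A toy sketch of a perturbation harness: run the same tasks with and without each perturbation and report the drop in success rate. The `toy_agent`, the tasks, and the two perturbations are stand-ins invented for illustration. A real harness would call the actual agent and use perturbations drawn from its real failure modes.

```python
def success_rate(agent, tasks):
    """Fraction of tasks where the agent's output matches the expected value."""
    return sum(1 for t in tasks if agent(t["input"]) == t["expected"]) / len(tasks)


def evaluate(agent, tasks, perturbations):
    """Return the baseline success rate and the drop under each perturbation."""
    baseline = success_rate(agent, tasks)
    drops = {}
    for name, perturb in perturbations.items():
        perturbed = [{"input": perturb(t["input"]), "expected": t["expected"]}
                     for t in tasks]
        drops[name] = baseline - success_rate(agent, perturbed)
    return baseline, drops


def toy_agent(text):
    # Stand-in agent: tolerates surrounding whitespace but is case-sensitive.
    commands = {"open settings": "settings", "open profile": "profile"}
    return commands.get(text.strip(), "unknown")


tasks = [
    {"input": "open settings", "expected": "settings"},
    {"input": "open profile", "expected": "profile"},
]
perturbations = {
    "padding": lambda s: f"  {s}  ",  # extra whitespace around the input
    "uppercase": str.upper,           # change of letter case
}

baseline, drops = evaluate(toy_agent, tasks, perturbations)
print(baseline, drops)
```

Reporting the drop per perturbation, rather than one blended score, shows which kind of change the agent is brittle to.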