Observability
Agents are non-deterministic. Without observability, failures are silent and costs invisible. The discipline is small: trace everything, aggregate into metrics, alert on anomalies.
What to Monitor
| Signal | How | Reference |
|---|---|---|
| Task completion | End-to-end success rate | SWE-Bench Verified / GAIA / WebArena SOTA |
| Tool correctness | Well-formed calls, valid args | ToolBench, τ2-bench, BFCL v3 |
| Cost per task | Token + API spend per run | Per-task ceiling |
| Context utilization | % of window consumed at task end | Status-line monitoring |
| Safety violations | Guardrail trigger rate | Audit trails |
| Performance degradation | Success rate vs. step count | Step / time budgets |
The Observability Stack
graph LR
subgraph traces ["Traces (per-step)"]
T1["Prompt"] --> T2["Tool Call"] --> T3["Result"] --> T4["Reasoning"]
end
subgraph metrics ["Metrics (aggregate)"]
M1["Cost"]
M2["Latency"]
M3["Success Rate"]
end
subgraph alerts ["Alerts (real-time)"]
A1["Safety Violation"]
A2["Cost Spike"]
A3["Stuck Agent"]
end
traces --> metrics --> alerts
Traces (per-step)
Each entry captures the prompt, the tool call, its arguments, the raw result, and the agent’s reasoning toward the next step. Traces are the atomic unit of agent observability: they support debugging, evaluation, and post-hoc analysis without re-running the task.
Metrics (aggregate)
Roll traces into cost, latency, success rate, and context utilization. Dashboards expose trends, such as a creeping cost-per-task or a sliding success rate, that no individual trace surfaces. Slice by agent version and model version so a regression can be attributed to a specific release.
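The roll-up can be sketched in a few lines. This is a minimal illustration, not a prescribed pipeline: the per-task summary fields (`success`, `cost_usd`, `latency_s`) and the version strings are assumed names chosen for the example.

```python
from collections import defaultdict
from statistics import mean


def aggregate(task_summaries):
    """Group per-task summaries by (agent_version, model_version) and roll them up."""
    groups = defaultdict(list)
    for task in task_summaries:
        groups[(task["agent_version"], task["model_version"])].append(task)

    metrics = {}
    for key, tasks in groups.items():
        metrics[key] = {
            "tasks": len(tasks),
            "success_rate": sum(1 for t in tasks if t["success"]) / len(tasks),
            "mean_cost_usd": mean(t["cost_usd"] for t in tasks),
            "mean_latency_s": mean(t["latency_s"] for t in tasks),
        }
    return metrics


# Hypothetical summaries, one dict per finished task.
summaries = [
    {"agent_version": "1.4", "model_version": "m-a", "success": True, "cost_usd": 0.5, "latency_s": 10.0},
    {"agent_version": "1.4", "model_version": "m-a", "success": False, "cost_usd": 1.5, "latency_s": 30.0},
    {"agent_version": "1.5", "model_version": "m-a", "success": True, "cost_usd": 1.0, "latency_s": 20.0},
]
for key, row in sorted(aggregate(summaries).items()):
    print(key, row)
```

Keying on the version pair means a drop in success rate shows up against one row rather than being averaged away across releases.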
Alerts (real-time)
Fire on a tripped guardrail, a task that exceeds its cost ceiling, or a stuck agent (no new tool calls, or repeated identical actions, across a configured window of steps or time). Keep alerts conservative: false positives train operators to ignore the signal.
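A minimal sketch of these three checks, assuming each step is a dict with hypothetical keys `guardrail_tripped`, `cost_usd`, `tool_name`, and `tool_args`. The window here is counted in steps; a time-based window would work the same way.

```python
import json


def check_alerts(steps, cost_ceiling_usd, window=3):
    """Return the names of alerts that should fire for one task's steps so far."""
    alerts = []

    if any(s.get("guardrail_tripped") for s in steps):
        alerts.append("safety_violation")

    if sum(s.get("cost_usd", 0.0) for s in steps) > cost_ceiling_usd:
        alerts.append("cost_spike")

    recent = steps[-window:]
    if len(recent) == window:
        # A step with no tool call has signature (None, "{}"), so a run of
        # idle steps and a run of identical calls both collapse to one signature.
        signatures = {
            (s.get("tool_name"), json.dumps(s.get("tool_args") or {}, sort_keys=True))
            for s in recent
        }
        if len(signatures) == 1:
            alerts.append("stuck_agent")

    return alerts


# Hypothetical task: the same search repeated three times.
looping = [{"tool_name": "search", "tool_args": {"q": "x"}, "cost_usd": 0.4}] * 3
print(check_alerts(looping, cost_ceiling_usd=1.0))
```

A larger `window` makes the stuck-agent check more conservative, which is the direction to lean if operators are already seeing false positives.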
Trace-First
Log everything as structured records — JSON or protocol buffers, not log lines — tagged with task ID, agent version, and model version. Include timestamps, step indices, token counts, and per-step cost. The same trace becomes a debugger, training corpus, audit artifact, and offline evaluation substrate.
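One possible shape for such a record, written as one JSON object per step. The class name, field names, and sample values below are illustrative assumptions, not a fixed schema.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class TraceStep:
    # Tags that let the record be sliced later.
    task_id: str
    agent_version: str
    model_version: str
    step_index: int
    # What happened in this step.
    prompt: str
    tool_name: Optional[str]
    tool_args: dict
    result: str
    reasoning: str
    # Accounting.
    input_tokens: int
    output_tokens: int
    cost_usd: float
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def to_json_line(step: TraceStep) -> str:
    """Serialize one step as a single JSON object, suitable for a JSONL file."""
    return json.dumps(asdict(step), sort_keys=True)


# Hypothetical step from a hypothetical task.
step = TraceStep(
    task_id="task-001",
    agent_version="1.4",
    model_version="m-a",
    step_index=0,
    prompt="Find the failing test.",
    tool_name="run_tests",
    tool_args={"path": "tests/"},
    result="1 failed, 12 passed",
    reasoning="Inspect the failing test next.",
    input_tokens=812,
    output_tokens=64,
    cost_usd=0.003,
)
print(to_json_line(step))
```

Because every record carries the same keys, the same file can be replayed in a debugger, filtered for audit, or fed to an offline evaluator without a parsing step per consumer.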
OpenTelemetry GenAI
OpenTelemetry’s GenAI semantic conventions (SemConv 1.40.0, mid-April 2026) are the emerging vendor-neutral substrate. Status is mixed:
- Client spans and metrics (gen_ai.client.*): Stable since early 2026. One span per LLM round-trip, interoperable across backends.
- Agent and framework spans (gen_ai.agent.*, invoke_agent): still Development, but widely adopted by Datadog, Honeycomb, Jaeger v2, LangSmith, OpenLLMetry, Langfuse, and Arize.
Canonical trace tree: invoke_agent at the top, with chat, execute_tool, and retrieval spans as children. Three reasons to adopt OTel — unified instrumentation (same pipeline as HTTP services), vendor neutrality (export to any compatible backend), and shared semantic conventions for cross-team comparison. Client-level traces are portable today; agent-level trees are converging but not yet 1:1 portable across backends.
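The shape of that tree can be shown without any SDK. The sketch below is a plain-Python model, not OpenTelemetry code: the `Span` class is hypothetical, the agent, model, and tool names are made up, and the `gen_ai.*` attribute keys are written as understood here and should be checked against the published conventions before use.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """Hypothetical stand-in for a tracing span: a name, attributes, children."""
    name: str
    attributes: dict
    children: list = field(default_factory=list)

    def child(self, name, attributes):
        span = Span(name, attributes)
        self.children.append(span)
        return span


def build_example_tree():
    # Root span for the whole agent invocation.
    root = Span("invoke_agent", {
        "gen_ai.operation.name": "invoke_agent",
        "gen_ai.agent.name": "example-agent",      # made-up agent name
    })
    # One child span per LLM round-trip, tool call, or retrieval.
    root.child("chat", {
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": "example-model",   # made-up model name
    })
    root.child("execute_tool", {
        "gen_ai.operation.name": "execute_tool",
        "gen_ai.tool.name": "search",              # made-up tool name
    })
    root.child("retrieval", {"gen_ai.operation.name": "retrieval"})
    return root


def render(span, depth=0):
    """Return the tree as indented lines, one per span."""
    lines = ["  " * depth + span.name]
    for c in span.children:
        lines.extend(render(c, depth + 1))
    return lines


print("\n".join(render(build_example_tree())))
```

In a real deployment the same nesting would come from an OpenTelemetry tracer, with the child spans started inside the parent's context.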
Vendor Landscape (May 2026)
The original five have become a broader tier-1:
- LangSmith — LangChain/LangGraph default; per-trace pricing.
- Langfuse — leading open-source / self-host; broadest framework support.
- Arize (AX + Phoenix) — eval-rigor leader, strongest for RAG.
- Maxim AI — end-to-end simulation + eval + observability.
- Braintrust — production-trace-to-eval loop with PR gating.
- Helicone — proxy-based, zero-instrumentation observability.
- Galileo — sub-200ms Luna-2 eval models for online scoring.
- W&B Weave — research/experimentation slant.
- Datadog LLM Observability — APM incumbent, first-class for GenAI.
Comet Opik still ships but has slipped in mindshare and no longer belongs at the top. Select on: OTel ingestion, replay/diff, evaluation integration, cost tracking, team workflows.
Benchmark SOTA (May 2026)
External reference points, with the current honest numbers:
| Benchmark | SOTA | Notes |
|---|---|---|
| SWE-Bench Verified | Claude Mythos Preview 93.9% / Opus 4.7 87.6% | Opus 4.5’s ~80% is well below current frontier |
| GAIA | OPS-Agentic-Search 92.36% (agent system) / 52.3% (base model) | The “90%” claim only holds for agent-system scoring |
| WebArena | OpAgent 71.6% (single-agent SOTA) / frontier models 65–69% | Human baseline ~78% |
| Tool use | ToolBench still cited; largely displaced by τ2-bench, BFCL v3, MCP-Bench, FinTrace | The standard suite has fractured |
Caveat. A 2026 UC Berkeley RDI study showed eight major agent benchmarks — SWE-Bench Verified, Terminal-Bench, WebArena, OSWorld, GAIA among them — can be gamed to near-perfect scores without solving the underlying tasks. Treat raw SOTA as orientation, not truth.
Evaluation Dimensions
End-to-end. Did the agent reach the goal, stay within boundaries, and do so under budget? Success, safety, and cost together define operational quality.
Tool use. Argument validity, correct tool selection, and recovery from failures. Instrument the execution layer to log every call — including malformed ones rejected before execution.
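As a sketch of logging at the execution layer, the check below validates a call before it runs and records it either way. The tool registry, its argument types, and the log record shape are all hypothetical.

```python
# Hypothetical registry: tool name -> required argument names and types.
TOOL_SCHEMAS = {
    "read_file": {"path": str},
    "search": {"query": str, "limit": int},
}


def validate_call(tool_name, args):
    """Return a list of problems; an empty list means the call is well-formed."""
    schema = TOOL_SCHEMAS.get(tool_name)
    if schema is None:
        return [f"unknown tool: {tool_name}"]

    problems = []
    for key, expected in schema.items():
        if key not in args:
            problems.append(f"missing argument: {key}")
        elif not isinstance(args[key], expected):
            problems.append(f"wrong type for {key}: expected {expected.__name__}")
    for key in args:
        if key not in schema:
            problems.append(f"unexpected argument: {key}")
    return problems


def record_call(log, tool_name, args):
    """Log every call, including rejected ones, and report whether it may run."""
    problems = validate_call(tool_name, args)
    log.append({
        "tool_name": tool_name,
        "tool_args": args,
        "well_formed": not problems,
        "problems": problems,
    })
    return not problems


call_log = []
record_call(call_log, "search", {"query": "otel genai", "limit": 5})
record_call(call_log, "search", {"query": "otel genai", "limit": "five"})
record_call(call_log, "delete_db", {})
well_formed_rate = sum(c["well_formed"] for c in call_log) / len(call_log)
print(well_formed_rate)
```

Rejected calls never reach the tool, but they still land in the log, so the well-formed rate reflects what the agent attempted rather than only what executed.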
Efficiency. Two agents that both succeed can differ by an order of magnitude in cost and step count. Track latency, total cost per task (including retries), and step count.
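The arithmetic is simple but easy to get wrong if retries are dropped. A minimal sketch, assuming each attempt is a list of step dicts with hypothetical token and latency fields, and using made-up per-million-token prices:

```python
def task_efficiency(attempts, usd_per_mtok_in, usd_per_mtok_out):
    """Total cost, latency, and step count for one task, counting every attempt."""
    steps = [s for attempt in attempts for s in attempt]
    tokens_in = sum(s["input_tokens"] for s in steps)
    tokens_out = sum(s["output_tokens"] for s in steps)
    cost = (tokens_in * usd_per_mtok_in + tokens_out * usd_per_mtok_out) / 1_000_000
    return {
        "attempts": len(attempts),
        "steps": len(steps),
        "latency_s": sum(s["latency_s"] for s in steps),
        "cost_usd": cost,
    }


# Hypothetical task: the first attempt failed and was retried.
attempts = [
    [{"input_tokens": 400_000, "output_tokens": 50_000, "latency_s": 1.5}],
    [
        {"input_tokens": 300_000, "output_tokens": 25_000, "latency_s": 2.5},
        {"input_tokens": 300_000, "output_tokens": 25_000, "latency_s": 2.0},
    ],
]
# Prices are placeholders, not any vendor's actual rates.
print(task_efficiency(attempts, usd_per_mtok_in=2.0, usd_per_mtok_out=10.0))
```

Reporting only the successful attempt here would understate cost and step count, because the failed first attempt still consumed tokens and time.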
Robustness. Does the agent survive layout changes, tool outages, and ambiguous inputs? Answering that requires adversarial evaluation: apply deliberate perturbations and measure their impact on success rate.
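Measuring that impact comes down to comparing success rates on the same task set with and without each perturbation. A minimal sketch, with hypothetical perturbation names and made-up outcomes:

```python
def success_rate(outcomes):
    """Fraction of tasks that succeeded; outcomes is a list of booleans."""
    return sum(1 for ok in outcomes if ok) / len(outcomes)


def robustness_report(baseline, perturbed):
    """Success rate per perturbation and its change relative to the baseline run."""
    base = success_rate(baseline)
    report = {}
    for name, outcomes in perturbed.items():
        rate = success_rate(outcomes)
        report[name] = {"success_rate": rate, "delta": rate - base}
    return report


# Made-up outcomes for the same four tasks under each condition.
baseline = [True, True, True, True]
perturbed = {
    "layout_change": [True, True, False, False],
    "tool_outage": [True, False, False, False],
}
for name, row in robustness_report(baseline, perturbed).items():
    print(name, row)
```

Running the perturbed conditions on the same tasks as the baseline keeps the delta attributable to the perturbation rather than to task mix.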