Observability

Agents are non-deterministic. Without observability, failures are silent and costs invisible. The discipline is small: trace everything, aggregate into metrics, alert on anomalies.


What to Monitor

| Signal | How | Reference |
| --- | --- | --- |
| Task completion | End-to-end success rate | SWE-Bench Verified / GAIA / WebArena SOTA |
| Tool correctness | Well-formed calls, valid args | ToolBench, τ2-bench, BFCL v3 |
| Cost per task | Token + API spend per run | Per-task ceiling |
| Context utilization | % of window consumed at task end | Status-line monitoring |
| Safety violations | Guardrail trigger rate | Audit trails |
| Performance degradation | Success rate vs. step count | Step / time budgets |
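
The "success rate vs. step count" signal can be computed by bucketing finished runs by how many steps they took. The following is a minimal sketch, not a reference implementation: the function name, the `(step_count, succeeded)` input shape, and the default bucket size of 10 are all assumptions made for illustration.

```python
from collections import defaultdict


def success_by_step_bucket(runs, bucket_size=10):
    """Return {bucket_start: success_rate} for runs grouped by step count.

    `runs` is an iterable of (step_count, succeeded) pairs (hypothetical shape).
    A run with 12 steps and bucket_size=10 lands in the bucket starting at 10.
    """
    totals = defaultdict(lambda: [0, 0])  # bucket_start -> [successes, runs]
    for step_count, succeeded in runs:
        start = (step_count // bucket_size) * bucket_size
        totals[start][0] += int(succeeded)
        totals[start][1] += 1
    return {start: s / n for start, (s, n) in sorted(totals.items())}
```

A success rate that falls as the bucket start rises suggests that long runs are where the agent degrades, which is what a step or time budget is meant to bound.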

The Observability Stack

graph LR
    subgraph traces ["Traces (per-step)"]
        T1["Prompt"] --> T2["Tool Call"] --> T3["Result"] --> T4["Reasoning"]
    end
    subgraph metrics ["Metrics (aggregate)"]
        M1["Cost"]
        M2["Latency"]
        M3["Success Rate"]
    end
    subgraph alerts ["Alerts (real-time)"]
        A1["Safety Violation"]
        A2["Cost Spike"]
        A3["Stuck Agent"]
    end
    traces --> metrics --> alerts

Traces (per-step)

Each entry captures the prompt, the tool call, its arguments, the raw result, and the agent's reasoning toward the next step. Traces are the atomic unit of agent observability: they support debugging, evaluation, and post-hoc analysis without re-running the task.

Metrics (aggregate)

Roll traces into cost, latency, success rate, and context utilization. Dashboards expose trends — a creeping cost-per-task or sliding success rate — that no individual trace surfaces. Slice by agent version and model version so regressions have a place to land.
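
As an illustration of the roll-up described above, the sketch below groups per-task summaries by agent version and model version and computes the four aggregates. The `TaskSummary` shape and its field names are assumptions for this example; a real pipeline would derive them from its own trace schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class TaskSummary:
    """One finished task, already reduced from its per-step trace (hypothetical shape)."""
    agent_version: str
    model_version: str
    succeeded: bool
    cost_usd: float
    latency_s: float
    context_used_tokens: int
    context_window_tokens: int


def rollup(tasks):
    """Aggregate task summaries per (agent_version, model_version) slice."""
    groups = defaultdict(list)
    for t in tasks:
        groups[(t.agent_version, t.model_version)].append(t)

    report = {}
    for key, items in groups.items():
        report[key] = {
            "tasks": len(items),
            "success_rate": sum(t.succeeded for t in items) / len(items),
            "mean_cost_usd": mean(t.cost_usd for t in items),
            "mean_latency_s": mean(t.latency_s for t in items),
            # Fraction of the context window consumed at task end.
            "mean_context_utilization": mean(
                t.context_used_tokens / t.context_window_tokens for t in items
            ),
        }
    return report
```

Keying the report on the version pair means a regression shows up as a difference between two rows rather than as a blur in a single global average.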

Alerts (real-time)

Fire on a tripped guardrail, a task that blows its cost ceiling, or a stuck agent (no new tool calls, or repeated identical actions for some window). Keep alerts conservative — false positives train operators to ignore the signal.
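
The three alert conditions can be expressed as a single check over the steps seen so far. This is a sketch under stated assumptions: the `Step` fields, the alert names, and the choice to measure the stuck-agent window in steps (default 3) rather than wall-clock time are illustrative, not prescribed by the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Step:
    """Minimal view of one trace step (hypothetical fields)."""
    tool_name: Optional[str]  # None when the step made no tool call
    tool_args: str            # canonicalised arguments, e.g. sorted-key JSON
    cost_usd: float
    guardrail_tripped: bool = False


def check_alerts(steps, cost_ceiling_usd, stuck_window=3):
    """Return the names of the alerts that the steps so far would fire."""
    alerts = []

    if any(s.guardrail_tripped for s in steps):
        alerts.append("safety_violation")

    if sum(s.cost_usd for s in steps) > cost_ceiling_usd:
        alerts.append("cost_ceiling_exceeded")

    # Stuck agent: the last `stuck_window` steps made no tool calls at all,
    # or all repeated the exact same (tool, args) action.
    recent = steps[-stuck_window:]
    if len(recent) == stuck_window:
        no_tool_calls = all(s.tool_name is None for s in recent)
        distinct_actions = {(s.tool_name, s.tool_args) for s in recent}
        repeated = len(distinct_actions) == 1 and not no_tool_calls
        if no_tool_calls or repeated:
            alerts.append("stuck_agent")

    return alerts
```

Requiring a full window before the stuck check runs is one way to stay conservative: a task with only one or two steps cannot trigger it.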


Trace-First

Log everything as structured records — JSON or protocol buffers, not log lines — tagged with task ID, agent version, and model version. Include timestamps, step indices, token counts, and per-step cost. The same trace becomes a debugger, training corpus, audit artifact, and offline evaluation substrate.
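
One possible shape for such a record, serialised as one JSON object per step, is sketched below. The class name and field names are assumptions chosen to mirror the list above; they are not a standard schema.

```python
import json
import time
from dataclasses import asdict, dataclass, field


@dataclass
class TraceRecord:
    """One structured trace entry for a single agent step (hypothetical schema)."""
    task_id: str
    agent_version: str
    model_version: str
    step_index: int
    prompt: str
    tool_name: str
    tool_args: dict
    tool_result: str
    reasoning: str
    input_tokens: int
    output_tokens: int
    cost_usd: float
    timestamp: float = field(default_factory=time.time)  # seconds since epoch

    def to_json_line(self) -> str:
        """Serialise to a single-line JSON object with stable key order."""
        return json.dumps(asdict(self), sort_keys=True)
```

Because every field is typed and named, the same record can be loaded back with `json.loads` for replay, filtered by `task_id` for an audit, or joined on `model_version` for offline evaluation.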


OpenTelemetry GenAI

OpenTelemetry's GenAI semantic conventions (SemConv 1.40.0, mid-April 2026) are the emerging vendor-neutral substrate. Status is mixed: client-level traces are portable today; agent-level trees are converging but not yet 1:1 portable across backends.

The canonical trace tree has invoke_agent at the top, with chat, execute_tool, and retrieval spans as children. There are three reasons to adopt OTel: unified instrumentation (same pipeline as HTTP services), vendor neutrality (export to any compatible backend), and shared semantic conventions for cross-team comparison.
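
To make the tree shape concrete, the sketch below models it with a plain dataclass. It deliberately does not use the OpenTelemetry SDK: `Span`, `child`, `render`, and `build_example_tree` are hypothetical helpers, and the only convention-derived detail assumed here is that each span carries its operation under the `gen_ai.operation.name` attribute. Check the current GenAI conventions before relying on attribute names.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """Toy stand-in for a trace span; not the OpenTelemetry Span class."""
    operation: str
    attributes: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

    def __post_init__(self):
        # Assumed attribute key for the operation name.
        self.attributes.setdefault("gen_ai.operation.name", self.operation)

    def child(self, operation):
        """Create a child span, attach it, and return it."""
        span = Span(operation)
        self.children.append(span)
        return span


def render(span, depth=0):
    """Return the tree as indented lines, one per span."""
    lines = ["  " * depth + span.operation]
    for c in span.children:
        lines.extend(render(c, depth + 1))
    return lines


def build_example_tree():
    """One agent invocation: model call, retrieval, tool execution, model call."""
    root = Span("invoke_agent")
    root.child("chat")
    root.child("retrieval")
    root.child("execute_tool")
    root.child("chat")
    return root
```

Calling `"\n".join(render(build_example_tree()))` gives an indented outline with invoke_agent at the top and the four child spans beneath it.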


Vendor Landscape (May 2026)

The original five have become a broader tier-1. Comet Opik still ships but has slipped in mindshare and no longer belongs at the top.

Select on: OTel ingestion, replay/diff, evaluation integration, cost tracking, and team workflows.


Benchmark SOTA (May 2026)

External reference points, with the current honest numbers:

| Benchmark | SOTA | Notes |
| --- | --- | --- |
| SWE-Bench Verified | Claude Mythos Preview 93.9% / Opus 4.7 87.6% | Opus 4.5's ~80% is well below current frontier |
| GAIA | OPS-Agentic-Search 92.36% (agent system) / 52.3% (base model) | The "90%" claim only holds for agent-system scoring |
| WebArena | OpAgent 71.6% (single-agent SOTA) / frontier models 65–69% | Human baseline ~78% |
| Tool use | ToolBench still cited; largely displaced by τ2-bench, BFCL v3, MCP-Bench, FinTrace | The standard suite has fractured |

Caveat. A 2026 UC Berkeley RDI study showed eight major agent benchmarks — SWE-Bench Verified, Terminal-Bench, WebArena, OSWorld, GAIA among them — can be gamed to near-perfect scores without solving the underlying tasks. Treat raw SOTA as orientation, not truth.


Evaluation Dimensions

End-to-end. Did the agent reach the goal, stay within boundaries, and do so under budget? Success, safety, and cost together define operational quality.

Tool use. Argument validity, correct tool selection, and recovery from failures. Instrument the execution layer to log every call — including malformed ones rejected before execution.
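
A sketch of what logging every call at the execution layer might look like, including calls rejected before execution. The `TOOLS` registry, the `{"tool": ..., "args": ...}` call format, and the status labels are all assumptions for illustration.

```python
import json

# Hypothetical tool registry: tool name -> required argument names and types.
TOOLS = {
    "search": {"query": str},
    "read_file": {"path": str, "max_bytes": int},
}


def validate_call(raw_call):
    """Classify one raw tool call and return a log record. Never raises."""
    record = {"raw": raw_call, "tool": None, "status": "ok", "error": None}

    try:
        call = json.loads(raw_call)
    except json.JSONDecodeError as exc:
        record.update(status="malformed", error=f"invalid JSON: {exc.msg}")
        return record

    if (
        not isinstance(call, dict)
        or not isinstance(call.get("tool"), str)
        or not isinstance(call.get("args"), dict)
    ):
        record.update(status="malformed", error="expected object with 'tool' and 'args'")
        return record

    record["tool"] = call["tool"]
    schema = TOOLS.get(call["tool"])
    if schema is None:
        record.update(status="unknown_tool", error=f"no tool named {call['tool']!r}")
        return record

    for name, expected_type in schema.items():
        if name not in call["args"]:
            record.update(status="invalid_args", error=f"missing argument {name!r}")
            return record
        if not isinstance(call["args"][name], expected_type):
            record.update(
                status="invalid_args",
                error=f"{name!r} should be {expected_type.__name__}",
            )
            return record

    return record
```

Every path returns a record, so a malformed call leaves the same kind of evidence in the log as a successful one, and the rate of each status can be tracked over time.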

Efficiency. Two agents that both succeed can differ by an order of magnitude in cost and step count. Track latency, total cost per task (including retries), and step count.

Robustness. Does the agent cope with layout changes, tool outages, and ambiguous inputs? Answering that requires adversarial evaluation: apply deliberate perturbations and measure their impact on success rate.
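
Measured impact can be reported as the drop in success rate between a baseline run and each perturbed run of the same task set. The sketch below assumes outcomes are recorded as lists of booleans; the function names and the perturbation labels used with it are hypothetical.

```python
def success_rate(outcomes):
    """Fraction of True values in a non-empty list of task outcomes."""
    return sum(outcomes) / len(outcomes)


def robustness_report(baseline, perturbed):
    """Compare each perturbed condition against the unperturbed baseline.

    `baseline` is a list of booleans; `perturbed` maps a perturbation name
    to the list of booleans observed under that perturbation.
    """
    base = success_rate(baseline)
    report = {}
    for name, outcomes in perturbed.items():
        rate = success_rate(outcomes)
        report[name] = {"success_rate": rate, "drop": base - rate}
    return report
```

A positive `drop` means the perturbation hurt the agent; comparing drops across perturbation types shows which kind of disruption the agent tolerates least.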