Observability

Agents are non-deterministic systems that make autonomous decisions. Without observability, failures are silent, costs are invisible, and debugging is guesswork. A robust observability practice turns opaque agent behavior into auditable, measurable, and improvable execution.


What to Monitor

| Metric | How | Benchmark / Signal |
|---|---|---|
| Task completion | End-to-end success rate | SWE-Bench (80.9% SOTA), GAIA (90% SOTA) |
| Tool correctness | Well-formed tool calls | ToolBench |
| Cost per task | Token usage tracking | $3.2–13K/month operational |
| Context health | Context % utilization | Status line monitoring |
| Safety violations | Guardrail trigger rate | Audit trails |
| Performance degradation | Success vs. step count | Drops after ~35 min of sustained operation |

Agents that run too long exhibit a performance cliff: success rates drop after roughly 35 minutes of sustained operation. Monitor step count and elapsed time to detect this before it wastes tokens.
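A minimal sketch of such a guard, using a hypothetical `RunBudget` class (the class name and thresholds are illustrative, not from any specific framework):

```python
import time

# The ~35-minute cliff noted above; tune for your workload.
MAX_STEPS = 50
MAX_ELAPSED_SECONDS = 35 * 60

class RunBudget:
    """Tracks step count and wall-clock time for one agent run."""

    def __init__(self, max_steps=MAX_STEPS, max_elapsed=MAX_ELAPSED_SECONDS):
        self.max_steps = max_steps
        self.max_elapsed = max_elapsed
        self.steps = 0
        self.started = time.monotonic()

    def tick(self):
        """Call once per agent step; returns False once the budget is spent."""
        self.steps += 1
        elapsed = time.monotonic() - self.started
        return self.steps <= self.max_steps and elapsed <= self.max_elapsed

budget = RunBudget(max_steps=3, max_elapsed=60)
print([budget.tick() for _ in range(4)])  # [True, True, True, False]
```

Checking the budget on every step lets the agent loop stop cleanly, rather than burning tokens past the point where success becomes unlikely.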


The Observability Stack

```mermaid
graph LR
    subgraph traces ["Traces (per-step)"]
        T1["Prompt"] --> T2["Tool Call"] --> T3["Result"] --> T4["Reasoning"]
    end
    subgraph metrics ["Metrics (aggregate)"]
        M1["Cost"]
        M2["Latency"]
        M3["Success Rate"]
    end
    subgraph alerts ["Alerts (real-time)"]
        A1["Safety Violation"]
        A2["Cost Spike"]
        A3["Stuck Agent"]
    end
    traces --> metrics --> alerts
```

Agent observability operates at three layers, each serving a distinct purpose.

Traces (Per-Step)

Traces capture the full sequence of decisions an agent makes during a task. Each trace entry records:

- The prompt sent to the model
- The tool call made, with its arguments
- The result returned by the tool
- The reasoning the agent produced at that step

Traces are the atomic unit of agent observability. They power debugging, evaluation, and post-hoc analysis.

Metrics (Aggregate)

Metrics aggregate trace data into operational signals:

- Cost per task and total spend
- Latency per step and end-to-end
- Success rate over time

These metrics feed dashboards and trend analysis. A sudden spike in average cost or a gradual decline in success rate both indicate problems that individual traces alone would not surface.
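A sketch of rolling per-task traces up into these aggregates; the field names (`success`, `cost_usd`, `latency_s`) are an illustrative schema, not a standard:

```python
from statistics import mean

# Illustrative per-task trace summaries.
traces = [
    {"task_id": "t1", "success": True,  "cost_usd": 0.42, "latency_s": 31.0},
    {"task_id": "t2", "success": False, "cost_usd": 1.10, "latency_s": 95.0},
    {"task_id": "t3", "success": True,  "cost_usd": 0.38, "latency_s": 28.0},
]

def aggregate(traces):
    """Roll per-task traces up into the core operational metrics."""
    return {
        "success_rate": mean(1.0 if t["success"] else 0.0 for t in traces),
        "avg_cost_usd": mean(t["cost_usd"] for t in traces),
        "avg_latency_s": mean(t["latency_s"] for t in traces),
    }

print(aggregate(traces))
```

Computed over a sliding time window, the same aggregation surfaces both sudden spikes and gradual drift.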

Alerts (Real-Time)

Alerts trigger on conditions that demand immediate attention:

- Safety violations (guardrail triggers)
- Cost spikes beyond an expected baseline
- Stuck agents (no progress across many steps)

Alerts should be conservative. False positives erode trust in the alerting system and train operators to ignore signals.
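One way to keep alert logic explicit and reviewable is a table of named rules. The rule names, metric fields, and thresholds below are hypothetical and should be tuned conservatively, per the note above:

```python
# Hypothetical alert rules over a metrics snapshot.
ALERT_RULES = {
    "safety_violation": lambda m: m["guardrail_triggers"] > 0,
    "cost_spike":       lambda m: m["cost_usd"] > 5 * m["baseline_cost_usd"],
    "stuck_agent":      lambda m: m["steps_without_progress"] >= 10,
}

def evaluate_alerts(metrics):
    """Return the names of all alert rules the current metrics trip."""
    return [name for name, rule in ALERT_RULES.items() if rule(metrics)]

snapshot = {
    "guardrail_triggers": 0,
    "cost_usd": 12.0,
    "baseline_cost_usd": 1.5,
    "steps_without_progress": 3,
}
print(evaluate_alerts(snapshot))  # cost is 8x baseline -> ['cost_spike']
```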


Trace-First Practices

The foundational rule of agent observability is: log everything.

Every prompt, every tool call, every argument, every outcome should be captured in a structured trace. This practice pays off in multiple ways:

- Debugging: replay the exact sequence that led to a failure
- Evaluation: build datasets and quality scores from real runs
- Post-hoc analysis: mine failure patterns to improve the agent

Structure traces as machine-readable records (JSON, protocol buffers) rather than unstructured log lines. Include timestamps, step indices, token counts, and cost estimates at each step. Tag traces with task identifiers, agent versions, and model versions to enable slicing and comparison.
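A minimal sketch of such a record; every field name here is an illustrative schema of my own, not a standard:

```python
import json
import time
import uuid

def trace_step(task_id, step_index, prompt, tool_call, result,
               tokens_in, tokens_out, cost_usd,
               agent_version="agent-1.0", model_version="model-x"):
    """Emit one machine-readable trace entry as JSON."""
    return json.dumps({
        "trace_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "task_id": task_id,           # enables slicing by task
        "step_index": step_index,
        "agent_version": agent_version,   # enables version comparison
        "model_version": model_version,
        "prompt": prompt,
        "tool_call": tool_call,       # tool name plus its arguments
        "result": result,
        "tokens": {"in": tokens_in, "out": tokens_out},
        "cost_usd": cost_usd,
    })

entry = trace_step("task-42", 0, "List open issues",
                   {"tool": "issue_search", "args": {"state": "open"}},
                   "17 issues found", tokens_in=812, tokens_out=64,
                   cost_usd=0.004)
print(entry)
```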


OpenTelemetry

OpenTelemetry (OTel) is emerging as the standard instrumentation framework for agent observability. The GenAI semantic conventions define a shared vocabulary for recording LLM interactions, tool calls, and agent reasoning steps.

Key advantages of adopting OTel for agent systems:

- Vendor neutrality: traces export to any OTel-compatible backend
- Shared vocabulary: the GenAI semantic conventions standardize attribute names across tools
- Reuse: agent spans integrate with existing application traces and APM tooling

When instrumenting an agent with OTel, create a span for each reasoning step. Attach tool call details, token usage, and latency as span attributes. Use span events to record guardrail checks and their outcomes.


Debugging Platforms

Several platforms provide specialized tooling for inspecting, replaying, and analyzing agent traces.

| Platform | Key Capabilities |
|---|---|
| LangSmith | Trace visualization, dataset management, evaluation runs, Polly AI assistant for trace analysis |
| Maxim AI | Agent evaluation, simulation testing, quality scoring |
| Arize | Production monitoring, drift detection, embedding analysis, LLM observability |
| Langfuse | Open-source tracing, prompt management, scoring, session replay |
| Comet Opik | Experiment tracking, prompt versioning, evaluation pipelines |

When selecting a platform, evaluate along these axes:

- Tracing depth and replay support
- Evaluation and dataset tooling
- Production monitoring and alerting
- Deployment model (hosted vs. open-source/self-hosted)


Major Benchmarks

Benchmarks provide external reference points for agent capability. They answer the question: how does your agent compare to the state of the art on standardized tasks?

SWE-Bench Verified

A human-validated subset of 500 issues curated from SWE-Bench's 2,294 real GitHub issues, each confirmed to be solvable. The leading score exceeds 80% (Claude Opus 4.5). SWE-Bench is the primary benchmark for software engineering agents.

GAIA

General agentic capability benchmark testing tool use, document reasoning, and multi-step workflows. SOTA: 90%. Tests breadth across task types.

WebArena

Web automation benchmark. Agents navigate realistic web interfaces to accomplish goals. Tests dynamic layouts, form interactions, and multi-page workflows.

ToolBench

Tool-use correctness benchmark. Agents must select the right tool, construct valid arguments, and chain multiple calls. Evaluates operational reliability.


Evaluation Metrics

Beyond benchmarks, production agents require ongoing evaluation across four dimensions.

End-to-End Performance

The most important metric is whether the agent accomplishes the task. Measure:

- Task success rate (did the agent accomplish the goal?)
- Safety violation rate (did it stay within guardrails?)
- Cost per task (what did the outcome cost in tokens and dollars?)

These three metrics together — success, safety, cost — define the operational quality of an agent deployment.

Tool-Use Correctness

Tool-use errors are a major failure mode. Track:

- Tool selection accuracy (was the right tool chosen?)
- Argument validity (were calls well-formed?)
- Chaining correctness (did multi-call sequences compose properly?)
- Rejected-call rate (malformed calls caught before execution)

Instrument your tool execution layer to log every call attempt, including malformed calls that were rejected before execution.
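A minimal sketch of such a wrapper; the function and log names are illustrative, and in production the log would feed your trace store rather than an in-memory list:

```python
import time

TOOL_CALL_LOG = []  # stand-in for a real trace store

def logged_tool_call(tool, tool_name, **kwargs):
    """Wrap a tool function so every attempt is recorded,
    including calls that fail before producing a result."""
    entry = {"tool": tool_name, "args": kwargs, "ts": time.time()}
    try:
        entry["result"] = tool(**kwargs)
        entry["status"] = "ok"
    except Exception as exc:     # malformed or failing calls are logged too
        entry["status"] = "error"
        entry["error"] = str(exc)
    TOOL_CALL_LOG.append(entry)
    return entry

def divide(a, b):
    return a / b

logged_tool_call(divide, "divide", a=6, b=3)
logged_tool_call(divide, "divide", a=1, b=0)   # failing call still logged
print([e["status"] for e in TOOL_CALL_LOG])    # ['ok', 'error']
```

The key property is that the log captures attempts, not just successes: a rising error rate in this log is often the earliest signal of a tool-use regression.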

Efficiency

Two agents can both succeed at a task while differing dramatically in cost and speed. Track:

- Steps to completion
- Tokens consumed and cost per task
- Wall-clock latency, per step and end-to-end

Efficiency metrics reveal optimization opportunities. An agent that succeeds in 5 steps is preferable to one that succeeds in 50.

Robustness

Agents operate in environments that change. Evaluate resilience to:

- Tool failures and degraded tool outputs
- Changed interfaces or environment state
- Noisy, ambiguous, or adversarial inputs

Robustness testing typically requires adversarial evaluation — deliberately introducing perturbations and measuring the impact on success rate.
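A toy sketch of that pattern: run the same task many times with and without an injected perturbation and compare success rates. The agent and the perturbation here are deliberately trivial stand-ins:

```python
import random

def toy_agent(tool_available):
    """Stand-in agent that succeeds only when its tool works."""
    return tool_available

def success_rate(runs, perturb_prob):
    """Success rate when each run's tool fails with probability perturb_prob."""
    rng = random.Random(0)   # fixed seed for reproducible evaluation
    ok = sum(toy_agent(rng.random() >= perturb_prob) for _ in range(runs))
    return ok / runs

baseline = success_rate(1000, perturb_prob=0.0)
perturbed = success_rate(1000, perturb_prob=0.3)  # 30% tool-failure injection
print(baseline, perturbed)   # perturbed rate drops to roughly 0.7
```

The gap between the baseline and perturbed rates is the robustness measurement; a real harness would substitute actual agent runs and realistic perturbations (flaky tools, changed interfaces, noisy inputs) for the toy pieces above.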


Observability is not optional. It is the mechanism by which agents move from prototype to production: instrument with traces, aggregate into metrics, alert on anomalies, benchmark periodically, debug via replay, and improve from failure patterns.