Observability
Agents are non-deterministic systems that make autonomous decisions. Without observability, failures are silent, costs are invisible, and debugging is guesswork. A robust observability practice turns opaque agent behavior into auditable, measurable, and improvable execution.
What to Monitor
| Metric | How | Benchmark |
|---|---|---|
| Task completion | End-to-end success rate | SWE-Bench (80.9% SOTA), GAIA (90% SOTA) |
| Tool correctness | Well-formed tool calls | ToolBench |
| Cost per task | Token usage tracking | $3.2K–$13K/month typical operational range |
| Context health | Context % utilization | Status line monitoring |
| Safety violations | Guardrail trigger rate | Audit trails |
| Performance degradation | Success vs step count | Drops after ~35 min of sustained operation |
Agents that run too long exhibit a performance cliff: success rates drop after roughly 35 minutes of sustained operation. Monitor step count and elapsed time to detect this before it wastes tokens.
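A minimal sketch of this guard, assuming a simple step/time budget with illustrative thresholds (the 35-minute figure comes from the cliff described above; the step cap is a hypothetical default):

```python
import time

class RunBudget:
    """Flag an agent run before the long-horizon performance cliff.
    Thresholds are illustrative defaults, not tuned values."""

    def __init__(self, max_steps=50, max_seconds=35 * 60):
        self.max_steps = max_steps
        self.max_seconds = max_seconds
        self.steps = 0
        self.started = time.monotonic()

    def tick(self):
        """Call once per reasoning step; returns False once the budget is exhausted."""
        self.steps += 1
        elapsed = time.monotonic() - self.started
        return self.steps <= self.max_steps and elapsed <= self.max_seconds

budget = RunBudget(max_steps=3)
results = [budget.tick() for _ in range(4)]  # fourth step exceeds the step cap
```

The agent loop checks `tick()` each iteration and terminates (or hands off to a human) when it returns `False`, rather than burning tokens past the cliff.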
The Observability Stack
```mermaid
graph LR
subgraph traces ["Traces (per-step)"]
T1["Prompt"] --> T2["Tool Call"] --> T3["Result"] --> T4["Reasoning"]
end
subgraph metrics ["Metrics (aggregate)"]
M1["Cost"]
M2["Latency"]
M3["Success Rate"]
end
subgraph alerts ["Alerts (real-time)"]
A1["Safety Violation"]
A2["Cost Spike"]
A3["Stuck Agent"]
end
traces --> metrics --> alerts
```
Agent observability operates at three layers, each serving a distinct purpose.
Traces (Per-Step)
Traces capture the full sequence of decisions an agent makes during a task. Each trace entry records:
- The prompt or instruction the agent received at that step
- The tool selected and the arguments passed
- The raw result returned by the tool
- The agent’s interpretation of the result and its next-step reasoning
Traces are the atomic unit of agent observability. They power debugging, evaluation, and post-hoc analysis.
Metrics (Aggregate)
Metrics aggregate trace data into operational signals:
- Cost: Token consumption per task, per agent, per time window
- Latency: Wall-clock time from task start to completion
- Success rate: Fraction of tasks that reach their goal state
- Context usage: Percentage of the context window consumed at task end
These metrics feed dashboards and trend analysis. A sudden spike in average cost or a gradual decline in success rate both indicate problems that individual traces alone would not surface.
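The rollup from traces to metrics can be sketched as follows; the per-task summary fields are hypothetical examples of what a trace store might expose:

```python
from statistics import mean

# Hypothetical per-task trace summaries, as a trace store might expose them.
traces = [
    {"task_id": "t1", "success": True,  "cost_usd": 0.42, "latency_s": 38.0, "context_pct": 61},
    {"task_id": "t2", "success": False, "cost_usd": 1.90, "latency_s": 95.0, "context_pct": 97},
    {"task_id": "t3", "success": True,  "cost_usd": 0.55, "latency_s": 41.0, "context_pct": 70},
]

def aggregate(traces):
    """Roll per-task trace summaries up into the four operational metrics."""
    return {
        "success_rate":    mean(1.0 if t["success"] else 0.0 for t in traces),
        "avg_cost_usd":    mean(t["cost_usd"] for t in traces),
        "avg_latency_s":   mean(t["latency_s"] for t in traces),
        "avg_context_pct": mean(t["context_pct"] for t in traces),
    }

metrics = aggregate(traces)
```

In production this runs over a time window (per hour, per day) so that trend lines, not point values, drive the dashboards.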
Alerts (Real-Time)
Alerts trigger on conditions that demand immediate attention:
- Safety violations: A guardrail fires, indicating the agent attempted a prohibited action
- Cost spikes: A single task exceeds a cost threshold, suggesting a loop or degenerate behavior
- Stuck agents: An agent has not made meaningful progress (no new tool calls, repeated identical actions) for a configurable duration
Alerts should be conservative. False positives erode trust in the alerting system and train operators to ignore signals.
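One conservative stuck-agent check is a sliding window over recent actions that fires only when the window contains a single repeated action; the window size and action encoding below are illustrative:

```python
from collections import deque

class StuckDetector:
    """Fires only when the last `window` steps are all the identical action,
    which keeps false positives rare. Window size is an illustrative default."""

    def __init__(self, window=5):
        self.recent = deque(maxlen=window)

    def observe(self, tool_name, args_repr):
        """Record one step; returns True when the full window holds one
        distinct action, i.e. the agent is repeating itself."""
        self.recent.append((tool_name, args_repr))
        return len(self.recent) == self.recent.maxlen and len(set(self.recent)) == 1

d = StuckDetector(window=3)
signals = [d.observe("grep", "foo") for _ in range(3)] + [d.observe("read_file", "bar")]
```

Note that the detector resets as soon as a genuinely new action appears, so normal exploratory behavior does not trip it.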
Trace-First Practices
The foundational rule of agent observability is: log everything.
Every prompt, every tool call, every argument, every outcome should be captured in a structured trace. This practice pays off in multiple ways:
- Debugging: When a task fails, the trace provides a complete replay of what the agent did and why. There is no need to reproduce the failure — the trace is the reproduction.
- Training data: Traces from successful runs become demonstrations for fine-tuning or few-shot prompting. Traces from failures become negative examples.
- Audit trails: For agents operating in regulated or high-stakes domains, traces serve as auditable decision artifacts that document every action the system took.
- Evaluation: Traces enable offline evaluation of agent behavior against new scoring criteria without re-running tasks.
Structure traces as machine-readable records (JSON, protocol buffers) rather than unstructured log lines. Include timestamps, step indices, token counts, and cost estimates at each step. Tag traces with task identifiers, agent versions, and model versions to enable slicing and comparison.
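A trace record with those fields might look like the following sketch; the schema and field names are illustrative, and the model and version tags are placeholders:

```python
import json
import time
import uuid

def trace_step(task_id, step, tool, args, result, tokens_in, tokens_out, cost_usd):
    """Emit one machine-readable trace record per agent step.
    Field names are illustrative; pick a schema and keep it stable."""
    record = {
        "trace_id": str(uuid.uuid4()),
        "task_id": task_id,
        "step": step,
        "ts": time.time(),
        "tool": tool,
        "args": args,
        "result": result,
        "usage": {"input_tokens": tokens_in, "output_tokens": tokens_out},
        "cost_usd": cost_usd,
        "agent_version": "1.4.0",   # tag for slicing and comparison
        "model": "example-model",   # placeholder model identifier
    }
    return json.dumps(record)  # append to a log stream or trace store

line = trace_step("t-42", 3, "read_file", {"path": "README.md"}, "ok", 812, 64, 0.004)
```

Because each record is self-describing JSON, offline evaluation and slicing by `agent_version` or `model` need no re-execution of the task.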
OpenTelemetry
OpenTelemetry (OTel) is emerging as the standard instrumentation framework for agent observability. The GenAI semantic conventions define a shared vocabulary for recording LLM interactions, tool calls, and agent reasoning steps.
Key advantages of adopting OTel for agent systems:
- Unified instrumentation: The same tracing and metrics pipeline that monitors your HTTP services can monitor your agent loops. No separate observability stack required.
- Vendor neutrality: OTel exports to any compatible backend (Jaeger, Prometheus, Grafana, Datadog, and the debugging platforms listed below). Switching backends does not require re-instrumenting your agent.
- Semantic conventions: The GenAI semantic conventions standardize attribute names for model, token count, tool name, and tool arguments. This enables cross-team and cross-organization comparison of agent telemetry.
- Distributed tracing: Agents that call other agents or external services can propagate trace context, producing a single trace that spans the full execution graph.
When instrumenting an agent with OTel, create a span for each reasoning step. Attach tool call details, token usage, and latency as span attributes. Use span events to record guardrail checks and their outcomes.
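The shape of that instrumentation can be sketched without the SDK installed. The `Span` class and `start_span` context manager below are stdlib stand-ins for the OTel tracer's `start_as_current_span`, while the attribute names follow the GenAI semantic conventions (`gen_ai.request.model`, `gen_ai.usage.input_tokens`, and so on):

```python
# Stand-in for an OTel tracer: one span per reasoning step, tool details
# as span attributes, guardrail checks as span events. The Span class is
# a stdlib sketch, not the real opentelemetry SDK.
import time
from contextlib import contextmanager

SPANS = []  # stand-in for a span exporter

class Span:
    def __init__(self, name):
        self.name, self.attributes, self.events = name, {}, []
        self.start = time.monotonic()

    def set_attribute(self, key, value):
        self.attributes[key] = value

    def add_event(self, name, attributes=None):
        self.events.append((name, attributes or {}))

@contextmanager
def start_span(name):  # analogous to tracer.start_as_current_span(name)
    span = Span(name)
    try:
        yield span
    finally:
        span.attributes["duration_s"] = time.monotonic() - span.start
        SPANS.append(span)

with start_span("agent.step") as span:
    span.set_attribute("gen_ai.request.model", "example-model")
    span.set_attribute("gen_ai.tool.name", "search")
    span.set_attribute("gen_ai.usage.input_tokens", 812)
    span.set_attribute("gen_ai.usage.output_tokens", 64)
    span.add_event("guardrail.check", {"rule": "no_file_delete", "passed": True})
```

Swapping the stand-in for a real tracer (`trace.get_tracer(__name__)` from `opentelemetry`) keeps the instrumentation points identical.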
Debugging Platforms
Several platforms provide specialized tooling for inspecting, replaying, and analyzing agent traces.
| Platform | Key Capabilities |
|---|---|
| LangSmith | Trace visualization, dataset management, evaluation runs, Polly AI assistant for trace analysis |
| Maxim AI | Agent evaluation, simulation testing, quality scoring |
| Arize | Production monitoring, drift detection, embedding analysis, LLM observability |
| Langfuse | Open-source tracing, prompt management, scoring, session replay |
| Comet Opik | Experiment tracking, prompt versioning, evaluation pipelines |
When selecting a platform, evaluate along these axes:
- Trace ingestion: Does it accept OTel or require a proprietary SDK?
- Replay and diff: Can you compare two traces side-by-side to understand behavioral differences?
- Evaluation integration: Can you run scoring functions against stored traces without re-executing the agent?
- Cost tracking: Does it compute and display per-task and per-step cost breakdowns?
- Team workflows: Does it support annotations, comments, and shared investigations?
Major Benchmarks
Benchmarks provide external reference points for agent capability. They answer the question: how does your agent compare to the state of the art on standardized tasks?
SWE-Bench Verified
A human-validated subset of SWE-Bench: real GitHub issues whose reference solutions and tests have been verified by engineers, drawn from the full SWE-Bench set of roughly 2,300 issues. The leading score exceeds 80% (Claude Opus 4.5). SWE-Bench is the primary benchmark for software engineering agents.
GAIA
General agentic capability benchmark testing tool use, document reasoning, and multi-step workflows. SOTA: 90%. Tests breadth across task types.
WebArena
Web automation benchmark. Agents navigate realistic web interfaces to accomplish goals. Tests dynamic layouts, form interactions, and multi-page workflows.
ToolBench
Tool-use correctness benchmark. Agents must select the right tool, construct valid arguments, and chain multiple calls. Evaluates operational reliability.
Evaluation Metrics
Beyond benchmarks, production agents require ongoing evaluation across four dimensions.
End-to-End Performance
The most important metric is whether the agent accomplishes the task. Measure:
- Success rate: Did the agent reach the goal state?
- Safety: Did the agent stay within defined boundaries? Were any guardrails triggered?
- Cost: What was the total token and API cost for the task?
These three metrics together — success, safety, cost — define the operational quality of an agent deployment.
Tool-Use Correctness
Tool-use errors are a major failure mode. Track:
- Argument validity: Are all required arguments present and well-typed?
- Invocation correctness: Did the agent call the right tool for the situation?
- Error handling: When a tool call fails, does the agent recover or spiral?
Instrument your tool execution layer to log every call attempt, including malformed calls that were rejected before execution.
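A sketch of such an execution layer, assuming a hypothetical schema registry that lists each tool's required arguments:

```python
# Wrapper around the tool execution layer that logs every call attempt,
# including malformed calls rejected before execution. The schema format
# and tool registry are illustrative.
TOOL_SCHEMAS = {"read_file": {"required": {"path"}}}
CALL_LOG = []

def execute_tool(name, args, registry):
    entry = {"tool": name, "args": args, "status": None}
    CALL_LOG.append(entry)  # log BEFORE validation, so rejections are captured
    schema = TOOL_SCHEMAS.get(name)
    if schema is None:
        entry["status"] = "rejected:unknown_tool"
        return None
    missing = schema["required"] - set(args)
    if missing:
        entry["status"] = f"rejected:missing_args:{sorted(missing)}"
        return None
    entry["status"] = "executed"
    return registry[name](**args)

registry = {"read_file": lambda path: f"<contents of {path}>"}
ok = execute_tool("read_file", {"path": "a.txt"}, registry)
bad = execute_tool("read_file", {}, registry)  # rejected, but still logged
```

Logging before validation is the key design choice: the malformed-call rate is itself a tool-use correctness metric, and it vanishes if only executed calls are recorded.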
Efficiency
Two agents can both succeed at a task while differing dramatically in cost and speed. Track:
- Latency: Wall-clock time from task start to completion
- Cost per task: Total token cost, including retries and failed branches
- Step count: Number of reasoning steps taken (a proxy for context consumption)
Efficiency metrics reveal optimization opportunities. An agent that succeeds in 5 steps is preferable to one that succeeds in 50.
Robustness
Agents operate in environments that change. Evaluate resilience to:
- Layout changes: For web agents, do minor UI changes cause failures?
- Tool failures: When an API returns an error or times out, does the agent retry, fall back, or crash?
- Ambiguous inputs: When task descriptions are underspecified, does the agent ask for clarification or guess badly?
Robustness testing typically requires adversarial evaluation — deliberately introducing perturbations and measuring the impact on success rate.
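A minimal perturbation harness might look like this; the agent, the perturbation (a small word-order shuffle), and the trial count are toy stand-ins for real task mutations:

```python
import random

def perturb(task, rng):
    """Illustrative perturbation: swap one adjacent word pair."""
    words = task.split()
    i = rng.randrange(len(words) - 1)
    words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)

def robustness(agent, task, trials=20, seed=0):
    """Compare baseline success to success under perturbed inputs.
    The gap between the two numbers measures sensitivity."""
    rng = random.Random(seed)
    baseline = agent(task)
    perturbed = sum(agent(perturb(task, rng)) for _ in range(trials)) / trials
    return baseline, perturbed

# Toy agent: succeeds only if the task starts with the word "summarize".
toy_agent = lambda t: t.startswith("summarize")
base, pert = robustness(toy_agent, "summarize the report")
```

For a web agent the perturbations would instead mutate the DOM (renamed CSS classes, reordered elements); the harness structure stays the same.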
Observability is not optional. It is the mechanism by which agents move from prototype to production: instrument with traces, aggregate into metrics, alert on anomalies, benchmark periodically, debug via replay, and improve from failure patterns.