Cost
Agent costs scale with usage in ways traditional compute doesn’t. A single poorly-shaped prompt in a high-traffic loop can burn thousands of dollars before anyone notices. Without cost discipline, agents stop being economically viable regardless of capability — and most teams discover this on the invoice.
The Financial Reality
Production deployments typically run $3.2K-$13K/month depending on workload volume and model tier; small pilots run $200-$2K/month, and enterprise multi-agent systems easily clear $10K/month. Build costs range from $15K for a simple MVP to $400K+ for an enterprise system.
The numbers that matter:
- 84% of companies report >6% gross margin erosion directly attributable to AI inference costs; 26% report 16%+ erosion (Mavvrik / Benchmarkit, 2025 State of AI Cost Management).
- 80-85% of companies miss their AI cost forecasts by more than 10%; only 15% forecast within ±10% (Mavvrik, 2025).
- Only 25% of enterprises have full AI cost visibility (Optro, 2026). The remaining 75% cannot fully attribute spend to specific agents, tasks, or users.
- Gartner forecasts 40%+ of agentic AI projects will be canceled by end of 2027 due to escalating costs, unclear value, or risk gaps.
These are not growing pains. They are the default outcome when cost is not actively managed.
Optimization Techniques
| Technique | Typical Savings | Notes |
|---|---|---|
| Prompt Caching | 90% off cached reads | Anthropic 0.1x; OpenAI 75-90% off on GPT-5.4/5.5+; Gemini 90% off via implicit caching, on by default, no storage cost. |
| Plan Caching (APC) | 50.31% cost, 27.28% latency | Stanford, NeurIPS 2025 (arXiv 2506.14852). 76.42% cost reduction on GAIA at 0.61% accuracy drop. |
| Context Compression (ACON) | 26-54% peak-token reduction | arXiv 2510.00615. Preserves >95% accuracy when distilled. |
| Model Routing | Variable | Now built-in across major coding agent CLIs (see below). |
| Off-peak scheduling | 40-60% infra | Batch and non-urgent Kubernetes workloads during off-peak compute windows. |
Layer multiple techniques. No single lever is sufficient, and some interact poorly — most notably caching and compaction.
Caching vs. Compaction
```mermaid
graph LR
    subgraph stable ["Stable Zone (cacheable)"]
        S1["System prompt"]
        S2["Tool definitions"]
        S3["Few-shot examples"]
    end
    subgraph dynamic ["Dynamic Zone (compactable)"]
        D1["Conversation history"]
        D2["Tool results"]
        D3["Retrieved docs"]
    end
    stable -->|"cache boundary"| dynamic
    dynamic -->|"compact when >70%"| dynamic
    style stable fill:#f0fdf4,stroke:#bbf7d0
    style dynamic fill:#fefce8,stroke:#fde68a
```
Caching and compaction conflict. Prompt caches match on an exact prefix, so compacting any part of the context invalidates the cache from the first rewritten token onward, and everything after that point is billed at the full input rate on the next request. Cache the stable prefix (system prompt, tool definitions, static few-shot examples) and compact only below the cache boundary. Never compact through it.
Anthropic constraint, Feb 5, 2026: prompt caches moved from organization-level to workspace-level isolation. Multi-team accounts can no longer share cache hits across workspaces — relevant if you architected around the old behavior.
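The cache-boundary rule can be sketched in a few lines. This is a minimal illustration, not any vendor's API: `Context`, `estimate_tokens`, and `maybe_compact` are hypothetical names, the 4-characters-per-token estimate is a crude placeholder for a real tokenizer, and the summary string stands in for a call to a summarizer model. The 70% threshold is taken from the diagram above.

```python
from dataclasses import dataclass, field


@dataclass
class Context:
    """Hypothetical context layout: a stable prefix plus a dynamic tail."""
    stable: list[str]  # system prompt, tool definitions, few-shot examples
    dynamic: list[str] = field(default_factory=list)  # history, tool results, docs


def estimate_tokens(text: str) -> int:
    # Crude placeholder (about 4 characters per token); swap in a real tokenizer.
    return max(1, len(text) // 4)


def total_tokens(ctx: Context) -> int:
    return sum(estimate_tokens(m) for m in ctx.stable + ctx.dynamic)


def maybe_compact(ctx: Context, window: int,
                  threshold: float = 0.70, keep_recent: int = 2) -> bool:
    """Compact only the dynamic zone once usage exceeds `threshold` of the window.

    The stable prefix is never rewritten, so its cached tokens stay valid.
    Returns True if compaction happened.
    """
    if total_tokens(ctx) <= threshold * window:
        return False
    split = len(ctx.dynamic) - keep_recent
    if split <= 0:
        return False  # nothing older than the recent tail to fold away
    old, recent = ctx.dynamic[:split], ctx.dynamic[split:]
    # Placeholder summary; a real harness would call a summarizer model here.
    summary = f"[summary of {len(old)} earlier messages]"
    ctx.dynamic = [summary] + recent
    return True
```

Because `maybe_compact` only ever reassigns `ctx.dynamic`, the serialized prefix built from `ctx.stable` is byte-identical before and after compaction, which is what keeps the cache hit alive.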
Model Routing Is Infrastructure Now
Routing used to be a DIY problem. As of mid-2026, it ships built-in across the major coding agent CLIs:
- Claude Code: `opusplan` mode uses Opus for planning, Sonnet for execution; simple file reads and quick edits route to Haiku. Field reports cite 60-70% of tokens hitting cheaper models without quality loss.
- Codex CLI: ships built-in routing that decides internally whether a task needs reasoning or a fast path.
- Gemini CLI: Auto routing has been the default since Feb 2026. Simple → Flash, complex → Pro, with automatic fallback for resilience.
- GitHub Copilot CLI: auto model selection went GA April 17, 2026.
For DIY routing (or platforms that don’t ship it), the heuristics are straightforward: simple tasks → small models (Haiku, Flash); complex reasoning → frontier (Opus, Pro); validation and judging → small models; code generation → code-specialized models when available.
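A minimal sketch of those DIY heuristics as a lookup table. The `TaskKind` enum, the tier names (`small`, `frontier`, `code`), and the `route` function are hypothetical. How a task gets classified is out of scope here, and falling back to the frontier tier when no code-specialized model is available is an assumption, not something the heuristics above prescribe.

```python
from enum import Enum


class TaskKind(Enum):
    SIMPLE = "simple"
    COMPLEX_REASONING = "complex_reasoning"
    VALIDATION = "validation"
    CODE_GENERATION = "code_generation"


# Tier names are placeholders; map them to concrete model IDs elsewhere.
ROUTES = {
    TaskKind.SIMPLE: "small",               # e.g. Haiku, Flash
    TaskKind.COMPLEX_REASONING: "frontier",  # e.g. Opus, Pro
    TaskKind.VALIDATION: "small",            # validation and judging
    TaskKind.CODE_GENERATION: "code",        # code-specialized, when available
}


def route(kind: TaskKind, available_tiers: set[str]) -> str:
    """Return the tier for a task, honoring the "when available" caveat."""
    tier = ROUTES[kind]
    if tier == "code" and "code" not in available_tiers:
        # Assumption: with no code-specialized model, use the frontier tier.
        return "frontier"
    return tier
```

Keeping the table as data rather than branching logic makes it easy to log which route each task took, which in turn feeds cost attribution.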
Model Pricing Snapshot
Snapshot date: 2026-05-20. OpenAI doubled GPT-5 pricing on April 23, 2026 with the GPT-5.5 release ($2.50/$15 → $5/$30 per MTok); pricing below reflects the post-doubling tier.
| Model | Input $/MTok | Output $/MTok | Cached $/MTok |
|---|---|---|---|
| Claude Opus 4.7 | $5.00 | $25.00 | $0.50 (read) |
| Claude Sonnet 4.6 | $3.00 | $15.00 | $0.30 (read) |
| Claude Haiku 4.5 | $1.00 | $5.00 | $0.10 (read) |
| GPT-5.5 | $5.00 | $30.00 | $1.25 |
| GPT-5.4 | $2.50 | $15.00 | $0.25 |
| GPT-5.2 | $0.875 | $7.00 | n/a |
| Gemini 3.1 Pro | $1.25-$2.00 (tiered) | varies | $0.20 (read) |
| Gemini 2.5 Pro | $1.25 (≤200K) | $10.00 | 10% of input |
| Gemini 2.5 Flash Lite | $0.10 | $0.40 | 10% of input |
Opus 4.7, Opus 4.6, and Sonnet 4.6 now include 1M context at standard pricing.
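To make the table concrete, here is a small worked calculation using the Claude Sonnet 4.6 row. The `Price` type and `request_cost` function are hypothetical helpers, and the token counts are invented for illustration. The sketch only models cached reads; any cache-write surcharge a provider applies is deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """USD per million tokens (MTok), taken from the snapshot table."""
    input: float
    output: float
    cached_read: float


SONNET_4_6 = Price(input=3.00, output=15.00, cached_read=0.30)


def request_cost(price: Price, input_tokens: int, output_tokens: int,
                 cached_tokens: int = 0) -> float:
    """Cost in USD of one request.

    `cached_tokens` is the portion of the input served from the prompt cache.
    """
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    fresh = input_tokens - cached_tokens
    total = (fresh * price.input
             + cached_tokens * price.cached_read
             + output_tokens * price.output)
    return total / 1_000_000


# Illustrative request: 50K input tokens (40K of them cached), 2K output tokens.
with_cache = request_cost(SONNET_4_6, 50_000, 2_000, cached_tokens=40_000)
without_cache = request_cost(SONNET_4_6, 50_000, 2_000)
```

Under these assumptions the request costs about $0.072 with the cache hit versus $0.18 without, a 60% reduction. The saving is smaller than the 90% discount on cached reads because output tokens and the uncached part of the input are still billed at full price.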
Model Abstraction
If you support more than one model — and any non-trivial harness does — pin the abstraction explicitly rather than scattering model IDs through prompt code. The ModelSpec pattern:
```
ModelSpec {
    id,                     // "claude-opus-4-7-20260501"
    provider,               // anthropic | openai | google | ...
    max_tokens,
    supports_tools,
    supports_vision,
    cost_per_input_token,
    cost_per_output_token,
}
```
Resolve human-friendly aliases (opus, opusplan, flash) once at session start and pin the resolved ID for the session. Don’t re-resolve mid-conversation — alias drift inside a single conversation produces costing and capability bugs that are nearly impossible to reproduce. Abstract API clients behind a Provider interface so adding a new provider means implementing one trait, not modifying the conversation loop. Capability detection (tools, vision, streaming) and fallback chains for rate limits and outages belong in the same layer.
See Implementation Delta §9 for the broader pattern.
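One way the pin-at-session-start rule and the provider interface might look in Python. Everything here is a sketch: `Session`, `Provider`, `EchoProvider`, and `run_turn` are hypothetical names, the registry is a plain dict, and the `max_tokens` and capability values are placeholders. Only the model ID and the per-token prices (derived from the Opus 4.7 row of the pricing table) come from the text above.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class ModelSpec:
    id: str
    provider: str
    max_tokens: int
    supports_tools: bool
    supports_vision: bool
    cost_per_input_token: float
    cost_per_output_token: float


class Provider(Protocol):
    """One implementation per vendor; the conversation loop sees only this."""
    def complete(self, spec: ModelSpec, prompt: str) -> str: ...


class Session:
    """Resolves an alias once and pins the resulting spec for the session."""

    def __init__(self, alias: str, registry: dict[str, ModelSpec]) -> None:
        # Resolved exactly once; an unknown alias raises KeyError up front.
        self._spec = registry[alias]

    @property
    def spec(self) -> ModelSpec:
        return self._spec


class EchoProvider:
    """Hypothetical stand-in used only to exercise the interface."""
    def complete(self, spec: ModelSpec, prompt: str) -> str:
        return f"[{spec.id}] {prompt}"


def run_turn(provider: Provider, session: Session, prompt: str) -> str:
    # The loop depends only on Provider and the pinned spec, never on aliases.
    return provider.complete(session.spec, prompt)


registry = {
    "opus": ModelSpec(
        id="claude-opus-4-7-20260501", provider="anthropic",
        max_tokens=8192,            # placeholder value
        supports_tools=True,        # placeholder value
        supports_vision=True,       # placeholder value
        cost_per_input_token=5e-6,  # $5.00 per MTok
        cost_per_output_token=25e-6,  # $25.00 per MTok
    ),
}
session = Session("opus", registry)
```

If the registry entry for `opus` is later repointed at a newer model, `session.spec` still returns the spec resolved at construction, so cost accounting and capability checks stay consistent for the whole conversation.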
Cost Attribution
Track cost per task, per user, per agent, and per failure. In multi-agent systems, propagate a cost context through delegation chains and aggregate on completion — when Agent A delegates to B which delegates to C, C’s tokens should roll up through B to A.
Cost per failure is the most informative single metric. It reveals tokens burned on tasks that ultimately didn’t complete — retry loops, hallucinated tool calls, sessions that hit context limits and restarted. High cost-per-failure means better routing, error handling, or guardrails will pay for themselves quickly.
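A sketch of cost-context propagation and a cost-per-failure calculation. `CostContext` and `cost_per_failure` are hypothetical names, costs are tracked in USD for simplicity, and defining cost per failure as the average total spend across failed top-level tasks is one reasonable reading of the metric rather than a standard formula.

```python
from dataclasses import dataclass, field


@dataclass
class CostContext:
    """Spend for one agent plus everything it delegated to."""
    agent: str
    own_usd: float = 0.0
    failed: bool = False
    children: list["CostContext"] = field(default_factory=list)

    def delegate(self, agent: str) -> "CostContext":
        """Create a child context for a sub-agent and link it for roll-up."""
        child = CostContext(agent=agent)
        self.children.append(child)
        return child

    def record(self, usd: float) -> None:
        self.own_usd += usd

    def total_usd(self) -> float:
        # Aggregate recursively: C rolls up through B to A.
        return self.own_usd + sum(c.total_usd() for c in self.children)


def cost_per_failure(tasks: list[CostContext]) -> float:
    """Average total spend across failed top-level tasks (0.0 if none failed)."""
    failed = [t for t in tasks if t.failed]
    if not failed:
        return 0.0
    return sum(t.total_usd() for t in failed) / len(failed)


# Delegation chain A -> B -> C with illustrative spend at each level.
a = CostContext("A")
b = a.delegate("B")
c = b.delegate("C")
a.record(0.01)
b.record(0.02)
c.record(0.04)
```

Here `b.total_usd()` covers B and C, and `a.total_usd()` covers the whole chain, so spend can be reported at whichever level of the delegation tree is useful.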
Budget Management
- Per-session token limits: hard caps that terminate or escalate to a human rather than burning indefinitely.
- Real-time monitoring: `/cost` for session spend; status line for continuous visibility.
- Automatic compaction: configure `CLAUDE_AUTOCOMPACT_PCT_OVERRIDE` to compact at a percentage of the window rather than waiting for a hard limit (which wastes the tokens already spent building context).
- Alerting thresholds: 50/75/90% of budget. By the time you notice a cost problem organically, the damage is already done.
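The hard cap and the 50/75/90% alert thresholds can be combined in one small tracker. This is a sketch under assumptions: `SessionBudget` and `BudgetExceeded` are hypothetical names, the budget is counted in tokens, and what happens on an alert or at the cap (notify, terminate, escalate to a human) is left to the caller.

```python
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    """Raised when a session reaches its hard token cap."""


@dataclass
class SessionBudget:
    limit_tokens: int
    thresholds: tuple[float, ...] = (0.50, 0.75, 0.90)
    used_tokens: int = 0
    fired: set[float] = field(default_factory=set)

    def __post_init__(self) -> None:
        if self.limit_tokens <= 0:
            raise ValueError("limit_tokens must be positive")

    def record(self, tokens: int) -> list[float]:
        """Add usage and return the alert thresholds crossed by this call.

        Each threshold fires at most once. Raises BudgetExceeded at the cap
        so the caller can terminate or escalate instead of burning on.
        """
        self.used_tokens += tokens
        fraction = self.used_tokens / self.limit_tokens
        newly = [t for t in self.thresholds
                 if fraction >= t and t not in self.fired]
        self.fired.update(newly)
        if self.used_tokens >= self.limit_tokens:
            raise BudgetExceeded(
                f"used {self.used_tokens} of {self.limit_tokens} tokens")
        return newly
```

Calling `record` after every model response keeps the check on the hot path, so an alert fires on the request that crosses a threshold rather than when someone next looks at a dashboard.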
Treat budget management as infrastructure, not an afterthought. The teams succeeding with agents apply the same discipline to tokens that mature engineering organizations apply to AWS or GCP spend: metered, attributed, budgeted, and optimized continuously.