Cost

Agent costs scale with usage in ways traditional compute doesn’t. A single poorly-shaped prompt in a high-traffic loop can burn thousands of dollars before anyone notices. Without cost discipline, agents stop being economically viable regardless of capability — and most teams discover this on the invoice.

The Financial Reality

Production deployments operate in the $3.2K-$13K/month range depending on workload volume and model tier; small pilots run $200-$2K, enterprise multi-agent systems clear $10K easily. Build costs span $15K for a simple MVP to $400K+ for enterprise.

The numbers that matter:

84% of companies report >6% gross margin erosion directly attributable to AI inference costs; 26% report 16%+ erosion (Mavvrik / Benchmarkit, 2025 State of AI Cost Management).
80-85% miss AI cost forecasts by more than 10%; only 15% forecast within ±10% (Mavvrik, 2025).
Only 25% of enterprises have full AI cost visibility (Optro, 2026). They cannot attribute spend to specific agents, tasks, or users.
Gartner forecasts 40%+ of agentic AI projects will be canceled by end of 2027 due to escalating costs, unclear value, or risk gaps.

These are not growing pains. They are the default outcome when cost is not actively managed.

Optimization Techniques

Technique	Typical Savings	Notes
Prompt Caching	90% off cached reads	Anthropic 0.1x; OpenAI 75-90% off on GPT-5.4/5.5+; Gemini 90% off via implicit caching, on by default, no storage cost.
Plan Caching (APC)	50.31% cost, 27.28% latency	Stanford, NeurIPS 2025 (arXiv 2506.14852). 76.42% cost reduction on GAIA at 0.61% accuracy drop.
Context Compression (ACON)	26-54% peak-token reduction	arXiv 2510.00615. Preserves >95% accuracy when distilled.
Model Routing	Variable	Now built-in across major coding agent CLIs (see below).
Off-peak scheduling	40-60% infra	Batch and non-urgent Kubernetes workloads during off-peak compute windows.

Layer multiple techniques. No single lever is sufficient, and some interact poorly — most notably caching and compaction.

Caching vs. Compaction

graph LR
    subgraph stable ["Stable Zone (cacheable)"]
        S1["System prompt"]
        S2["Tool definitions"]
        S3["Few-shot examples"]
    end
    subgraph dynamic ["Dynamic Zone (compactable)"]
        D1["Conversation history"]
        D2["Tool results"]
        D3["Retrieved docs"]
    end
    stable -->|"cache boundary"| dynamic
    dynamic -->|"compact when >70%"| dynamic

    style stable fill:#f0fdf4,stroke:#bbf7d0
    style dynamic fill:#fefce8,stroke:#fde68a

Caching and compaction conflict. Every time the context window is compacted, previously cached prefixes go stale and must be re-paid on the next request. Cache the stable prefix — system prompt, tool definitions, static few-shot examples — and compact only below the cache boundary. Never compact through it.

Anthropic constraint, Feb 5, 2026: prompt caches moved from organization-level to workspace-level isolation. Multi-team accounts can no longer share cache hits across workspaces — relevant if you architected around the old behavior.

Model Routing Is Infrastructure Now

Routing used to be a DIY problem. As of mid-2026, it ships built-in across the major coding agent CLIs:

Claude Code: opusplan mode uses Opus for planning, Sonnet for execution; simple file reads and quick edits route to Haiku. Field reports cite 60-70% of tokens hitting cheaper models without quality loss.
Codex CLI: ships built-in routing — decides internally whether a task needs reasoning or fast path.
Gemini CLI: Auto routing is the default since Feb 2026. Simple → Flash, complex → Pro, with automatic fallback for resilience.
GitHub Copilot CLI: auto model selection went GA April 17, 2026.

For DIY routing (or platforms that don’t ship it), the heuristics are simple: simple tasks → small models (Haiku, Flash); complex reasoning → frontier (Opus, Pro); validation and judging → small models; code generation → code-specialized models when available.

Model Pricing Snapshot

Snapshot date: 2026-05-20. OpenAI doubled GPT-5 pricing on April 23, 2026 with the GPT-5.5 release ($2.50/$15 → $5/$30 per MTok); pricing below reflects the post-doubling tier.

Model	Input $/MTok	Output $/MTok	Cached $/MTok
Claude Opus 4.7	$5.00	$25.00	$0.50 (read)
Claude Sonnet 4.6	$3.00	$15.00	$0.30 (read)
Claude Haiku 4.5	$1.00	$5.00	$0.10 (read)
GPT-5.5	$5.00	$30.00	$1.25
GPT-5.4	$2.50	$15.00	$0.25
GPT-5.2	$0.875	$7.00	n/a
Gemini 3.1 Pro	$1.25-$2.00 (tiered)	varies	$0.20 (read)
Gemini 2.5 Pro	$1.25 (≤200K)	$10.00	10% of input
Gemini 2.5 Flash Lite	$0.10	$0.40	10% of input

Opus 4.7, Opus 4.6, and Sonnet 4.6 now include 1M context at standard pricing.

Model Abstraction

If you support more than one model — and any non-trivial harness does — pin the abstraction explicitly rather than scattering model IDs through prompt code. The ModelSpec pattern:

ModelSpec {
    id,                       // "claude-opus-4-7-20260501"
    provider,                 // anthropic | openai | google | ...
    max_tokens,
    supports_tools,
    supports_vision,
    cost_per_input_token,
    cost_per_output_token,
}

Resolve human-friendly aliases (opus, opusplan, flash) once at session start and pin the resolved ID for the session. Don’t re-resolve mid-conversation — alias drift inside a single conversation produces costing and capability bugs that are nearly impossible to reproduce. Abstract API clients behind a Provider interface so adding a new provider means implementing one trait, not modifying the conversation loop. Capability detection (tools, vision, streaming) and fallback chains for rate limits and outages belong in the same layer.

See Implementation Delta §9 for the broader pattern.

Cost Attribution

Track cost per task, per user, per agent, and per failure. In multi-agent systems, propagate a cost context through delegation chains and aggregate on completion — when Agent A delegates to B which delegates to C, C’s tokens should roll up through B to A.

Cost per failure is the most informative single metric. It reveals tokens burned on tasks that ultimately didn’t complete — retry loops, hallucinated tool calls, sessions that hit context limits and restarted. High cost-per-failure means better routing, error handling, or guardrails will pay for themselves quickly.

Budget Management

Per-session token limits: hard caps that terminate or escalate to a human rather than burning indefinitely.
Real-time monitoring: /cost for session spend; status line for continuous visibility.
Automatic compaction: configure CLAUDE_AUTOCOMPACT_PCT_OVERRIDE to compact at a percentage of the window rather than waiting for a hard limit (which wastes the tokens already spent building context).
Alerting thresholds: 50/75/90% of budget. By the time you notice a cost problem organically, the damage is already done.

Treat budget management as infrastructure, not an afterthought. The teams succeeding with agents apply the same discipline to tokens that mature engineering organizations apply to AWS or GCP spend: metered, attributed, budgeted, and optimized continuously.