Cost

Agent costs scale with usage in ways traditional compute doesn’t. A single poorly-shaped prompt in a high-traffic loop can burn thousands of dollars before anyone notices. Without cost discipline, agents stop being economically viable regardless of capability — and most teams discover this on the invoice.

The Financial Reality

Production deployments operate in the $3.2K-$13K/month range depending on workload volume and model tier; small pilots run $200-$2K, enterprise multi-agent systems clear $10K easily. Build costs span $15K for a simple MVP to $400K+ for enterprise.

The numbers that matter:

These are not growing pains. They are the default outcome when cost is not actively managed.

Optimization Techniques

TechniqueTypical SavingsNotes
Prompt Caching90% off cached readsAnthropic 0.1x; OpenAI 75-90% off on GPT-5.4/5.5+; Gemini 90% off via implicit caching, on by default, no storage cost.
Plan Caching (APC)50.31% cost, 27.28% latencyStanford, NeurIPS 2025 (arXiv 2506.14852). 76.42% cost reduction on GAIA at 0.61% accuracy drop.
Context Compression (ACON)26-54% peak-token reductionarXiv 2510.00615. Preserves >95% accuracy when distilled.
Model RoutingVariableNow built-in across major coding agent CLIs (see below).
Off-peak scheduling40-60% infraBatch and non-urgent Kubernetes workloads during off-peak compute windows.

Layer multiple techniques. No single lever is sufficient, and some interact poorly — most notably caching and compaction.

Caching vs. Compaction

graph LR
    subgraph stable ["Stable Zone (cacheable)"]
        S1["System prompt"]
        S2["Tool definitions"]
        S3["Few-shot examples"]
    end
    subgraph dynamic ["Dynamic Zone (compactable)"]
        D1["Conversation history"]
        D2["Tool results"]
        D3["Retrieved docs"]
    end
    stable -->|"cache boundary"| dynamic
    dynamic -->|"compact when >70%"| dynamic

    style stable fill:#f0fdf4,stroke:#bbf7d0
    style dynamic fill:#fefce8,stroke:#fde68a

Caching and compaction conflict. Every time the context window is compacted, previously cached prefixes go stale and must be re-paid on the next request. Cache the stable prefix — system prompt, tool definitions, static few-shot examples — and compact only below the cache boundary. Never compact through it.

Anthropic constraint, Feb 5, 2026: prompt caches moved from organization-level to workspace-level isolation. Multi-team accounts can no longer share cache hits across workspaces — relevant if you architected around the old behavior.

Model Routing Is Infrastructure Now

Routing used to be a DIY problem. As of mid-2026, it ships built-in across the major coding agent CLIs:

For DIY routing (or platforms that don’t ship it), the heuristics are simple: simple tasks → small models (Haiku, Flash); complex reasoning → frontier (Opus, Pro); validation and judging → small models; code generation → code-specialized models when available.

Model Pricing Snapshot

Snapshot date: 2026-05-20. OpenAI doubled GPT-5 pricing on April 23, 2026 with the GPT-5.5 release ($2.50/$15 → $5/$30 per MTok); pricing below reflects the post-doubling tier.

ModelInput $/MTokOutput $/MTokCached $/MTok
Claude Opus 4.7$5.00$25.00$0.50 (read)
Claude Sonnet 4.6$3.00$15.00$0.30 (read)
Claude Haiku 4.5$1.00$5.00$0.10 (read)
GPT-5.5$5.00$30.00$1.25
GPT-5.4$2.50$15.00$0.25
GPT-5.2$0.875$7.00n/a
Gemini 3.1 Pro$1.25-$2.00 (tiered)varies$0.20 (read)
Gemini 2.5 Pro$1.25 (≤200K)$10.0010% of input
Gemini 2.5 Flash Lite$0.10$0.4010% of input

Opus 4.7, Opus 4.6, and Sonnet 4.6 now include 1M context at standard pricing.

Model Abstraction

If you support more than one model — and any non-trivial harness does — pin the abstraction explicitly rather than scattering model IDs through prompt code. The ModelSpec pattern:

ModelSpec {
    id,                       // "claude-opus-4-7-20260501"
    provider,                 // anthropic | openai | google | ...
    max_tokens,
    supports_tools,
    supports_vision,
    cost_per_input_token,
    cost_per_output_token,
}

Resolve human-friendly aliases (opus, opusplan, flash) once at session start and pin the resolved ID for the session. Don’t re-resolve mid-conversation — alias drift inside a single conversation produces costing and capability bugs that are nearly impossible to reproduce. Abstract API clients behind a Provider interface so adding a new provider means implementing one trait, not modifying the conversation loop. Capability detection (tools, vision, streaming) and fallback chains for rate limits and outages belong in the same layer.

See Implementation Delta §9 for the broader pattern.

Cost Attribution

Track cost per task, per user, per agent, and per failure. In multi-agent systems, propagate a cost context through delegation chains and aggregate on completion — when Agent A delegates to B which delegates to C, C’s tokens should roll up through B to A.

Cost per failure is the most informative single metric. It reveals tokens burned on tasks that ultimately didn’t complete — retry loops, hallucinated tool calls, sessions that hit context limits and restarted. High cost-per-failure means better routing, error handling, or guardrails will pay for themselves quickly.

Budget Management

Treat budget management as infrastructure, not an afterthought. The teams succeeding with agents apply the same discipline to tokens that mature engineering organizations apply to AWS or GCP spend: metered, attributed, budgeted, and optimized continuously.