Memory
Memory is the discipline of deciding what enters the context window each turn. The window is finite, attention degrades non-linearly as it fills, and the degradation is silent. There is no error message when the model starts ignoring instructions buried 80K tokens back in the conversation. Memory is not “the stuff outside the window” — it is the policy that decides what gets in.
The Performance Curve
Agent performance does not collapse at a single clean cliff. The popular “35-minute / 100-tool-call” threshold is uncited. The defensible numbers come from METR and Anthropic’s own telemetry.
- METR Time Horizons (TH1.1, Jan 2026). The gap between 50%-success and 80%-success time horizons is large and model-dependent. Claude 3.7 Sonnet has a 50%-success time horizon of 59 minutes but an 80%-success horizon of only 15 minutes. Degradation is gradual and probabilistic, not a wall-clock cliff.
- Doubling is accelerating. The historical 2019-2026 doubling time is 196.5 days. Post-2023 it falls to 130.8 days, and post-2024 to 88.6 days (~3 months), not the older “every 7 months” number that still floats around. The latest 50% horizon is 320 min for Claude Opus 4.5 and 214 min for GPT-5.
- Canonical-path deviation compounds. Each off-canonical tool call raises the probability the next call is also off-canonical by 22.7 percentage points. There is no fixed step threshold; errors compound.
- Anthropic’s own data, Jan 2026. The 99.9th-percentile turn duration nearly doubled from Oct 2025 to Jan 2026 (under 25 min to over 45 min), and human interventions per Claude Code session dropped from 5.4 to 3.3. The trend is the opposite of a fixed-time cliff: the ceiling is moving.
The shape: budget your turns assuming non-linear, model-generation-specific degradation. Decompose long-horizon work into sub-tasks that each fit well inside the safe zone for the model you actually run. Re-measure every release. Do not anchor on a number you saw on a blog in 2024.
The practical consequence is architectural, not numerical. The performance curve is the strongest argument for sub-task decomposition: no single agent invocation should run long enough to reach the 80%-horizon shoulder. The orchestrator passes forward only checkpoint summaries from previous sub-tasks — not full execution traces — keeping each invocation in the band where the model is still reliable. The alternative — letting a single agent run until it either succeeds or exhausts its context — remains the most common failure mode in production agent systems. Context management is not an optimization. It is a correctness requirement.
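A minimal sketch of that orchestration loop, under stated assumptions: the `Checkpoint` shape, the `Worker` signature, and `fake_worker` are hypothetical names invented for illustration, and a real worker would wrap one bounded agent invocation rather than a stub.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Checkpoint:
    """Structured summary of one finished sub-task (hypothetical shape)."""
    task: str
    outcome: str
    decisions: list[str] = field(default_factory=list)
    artifacts: list[str] = field(default_factory=list)

    def render(self) -> str:
        lines = [f"[{self.task}] {self.outcome}"]
        lines += [f"  decision: {d}" for d in self.decisions]
        lines += [f"  artifact: {a}" for a in self.artifacts]
        return "\n".join(lines)


# Assumption: a worker takes (sub-task, prior context) and returns a Checkpoint.
Worker = Callable[[str, str], Checkpoint]


def run_plan(subtasks: list[str], worker: Worker) -> list[Checkpoint]:
    """Run sub-tasks in order, forwarding only checkpoint summaries."""
    checkpoints: list[Checkpoint] = []
    for task in subtasks:
        # Context for the next invocation is built from summaries only;
        # the full execution trace of earlier sub-tasks is never carried over.
        context = "\n".join(cp.render() for cp in checkpoints)
        checkpoints.append(worker(task, context))
    return checkpoints


def fake_worker(task: str, context: str) -> Checkpoint:
    # Stand-in for a model call: records how much context it received.
    return Checkpoint(task=task, outcome=f"done (saw {len(context)} chars of context)")


if __name__ == "__main__":
    for cp in run_plan(["inventory endpoints", "migrate handlers"], fake_worker):
        print(cp.render())
```

The design choice worth noting is that `run_plan` never sees a trace at all: the worker's return type makes it impossible to leak raw tool output into the next invocation.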
The Three Tiers, Sharpened
The L1/L2/L3 framing is correct, but the primitives most teams reach for at each tier are wrong. Use these instead.
| Tier | Role | Right primitive | Wrong primitive |
|---|---|---|---|
| L1 Working | The current context window. Everything the model is attending to this turn. Note: Opus 4.7’s tokenizer can use up to 35% more tokens for the same text vs prior generations — your budget math is model-specific. | Token budget allocation per component (see below). | Treating the window as infinite and letting tool output flood it. |
| L2 Compacted | Recent history, condensed but still query-shaped. | A hierarchy of summary nodes — leaf chunks sealed at a token threshold, fanout-N parents above them, multiple roots scoped by source / topic / session. | A single linear summary string that gets re-summarized each compaction. Information melts. |
| L3 Persistent | What survives across sessions. | Human-auditable markdown files where the workload allows it; otherwise a SQLite-backed tree of summary nodes with embedding tables as an index over the tree. | A vector database treated as the source of truth, with no hierarchy and no audit path. |
The mistake most projects make is collapsing L2 and L3 into “the vector store.” A vector store is an index. It is not memory.
Context Budget Bands (community heuristic, not authoritative)
Multiple practitioner sources echo the same bands. Anthropic does not publish them. Treat them as a starting point that you re-calibrate to your model and workload.
| Utilization | Action |
|---|---|
| 0-50% | Work freely. Full instructions, history, and tool results fit comfortably. |
| 50-70% | Monitor. Log size per turn; identify the fastest-growing component. |
| 70-80% | Compact. Summarize history. Preserve decisions, file paths, current task state. Drop verbose tool output. |
| 80%+ | Reset. Serialize critical state externally. Start fresh and reload only what the next sub-task needs. |
A reasonable per-component split inside L1 for 128K-200K windows: system instructions 10-15%, active conversation 30-40%, tool results 20-30%, retrieved memory 10-20%, safety margin 10-15%.
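The bands and the per-component split can be sketched as two small functions. This is an illustration, not a calibrated policy: the 50/70/80 boundaries follow the monitor, compact, and reset rows of the table, the specific percentages are one arbitrary pick from inside the ranges above, and the names `band_action` and `component_budgets` are hypothetical.

```python
def band_action(used_tokens: int, window_tokens: int) -> str:
    """Map window utilization to the heuristic action bands."""
    utilization = used_tokens / window_tokens
    if utilization >= 0.80:
        return "reset"
    if utilization >= 0.70:
        return "compact"
    if utilization >= 0.50:
        return "monitor"
    return "work"


# Assumption: one illustrative point inside each range; integer percents sum to 100.
DEFAULT_SHARES_PCT = {
    "system": 10,
    "conversation": 35,
    "tool_results": 25,
    "retrieved_memory": 15,
    "safety_margin": 15,
}


def component_budgets(window_tokens: int,
                      shares_pct: dict[str, int] = DEFAULT_SHARES_PCT) -> dict[str, int]:
    """Split a window into per-component token budgets."""
    if sum(shares_pct.values()) != 100:
        raise ValueError("shares must sum to 100 percent")
    return {name: window_tokens * pct // 100 for name, pct in shares_pct.items()}


if __name__ == "__main__":
    print(band_action(150_000, 200_000))   # 75% utilization
    print(component_budgets(200_000))
```

Re-calibrating for a different model means changing the thresholds and shares, not the structure.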
Compaction-Resident State
This is the single most important pattern to come out of forensic analysis of real long-horizon agents. Name it explicitly: compaction-resident state or, equivalently, out-of-history state.
Compaction operates on conversation history. Anything in history is a candidate for summarization, dilution, or deletion. Long-running goals, agent identity, the current active-task pointer, and the user’s standing constraints must not live in history. They live in session metadata and are re-injected into the system prompt every turn.
The shape:
- A `session.metadata` map holds keys like `goal_state`, `active_task`, `user_constraints`.
- A `runtime_lines()` function reads those keys and emits a Runtime Context block appended to the system prompt at every turn.
- Compaction never touches metadata. The goal survives every pass, not because the summarizer was clever, but because the summarizer never saw it.
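The shape above can be sketched as follows. This is a toy under stated assumptions, not nanobot's code: the `Session` class, `build_system_prompt`, and the `compact` stub are hypothetical, and only the key names and `runtime_lines()` come from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    system_prompt: str
    metadata: dict[str, str] = field(default_factory=dict)  # never compacted
    history: list[str] = field(default_factory=list)        # compactable


def runtime_lines(session: Session) -> list[str]:
    """Emit the Runtime Context block from session metadata."""
    keys = ("goal_state", "active_task", "user_constraints")
    lines = [f"{k}: {session.metadata[k]}" for k in keys if k in session.metadata]
    return ["## Runtime Context", *lines] if lines else []


def build_system_prompt(session: Session) -> str:
    """Rebuilt every turn, so metadata is re-injected every turn."""
    return "\n".join([session.system_prompt, *runtime_lines(session)])


def compact(session: Session, keep_last: int = 2) -> None:
    """Toy compaction: rewrites history only and never reads metadata."""
    if len(session.history) <= keep_last:
        return
    dropped = len(session.history) - keep_last
    session.history = [f"[summary of {dropped} earlier messages]",
                       *session.history[-keep_last:]]


if __name__ == "__main__":
    s = Session("You are a coding agent.",
                metadata={"goal_state": "migrate the auth module to OAuth2"})
    s.history = [f"message {i}" for i in range(10)]
    compact(s)
    print(build_system_prompt(s))
    print(s.history)
```

Because `compact` takes no path to `metadata`, the goal cannot be summarized away regardless of how the summarizer behaves.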
The pattern is implemented in production by HKUDS/nanobot (v0.2.0, May 2026), whose `/goal` feature holds sustained objectives across turns and mirrors the compaction summary into session metadata so it survives process restarts. Auto-compact also skips sessions with active tasks.
What this fixes: in eager agents that store the goal as a user message (“migrate the auth module to OAuth2…”), the goal gets compacted away around turn 40, the agent drifts, and nobody can tell when it happened. Out-of-history state makes the goal a property of the session, not a memory.
If you take one pattern from this page, take this one. It composes cleanly with every other tier.
Compaction vs. Caching
Modern providers cache repeated prompt prefixes. The system prompt is identical across every turn, which makes it the ideal cache prefix — but compaction breaks the cache: the summarized content differs from the original, and every token after the divergence point becomes a cache miss. The fix is structural: place the system prompt first as a stable, cache-controlled prefix, and treat everything after that boundary — memory, conversation, tool results — as the dynamic region that can be compacted freely.
Key insight: Separate stable (cached) content from dynamic (compactable) content. Only compact the dynamic portion.
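A small sketch of why the ordering matters, assuming prompts can be modeled as lists of segments and that a prefix cache reuses only the longest shared leading run. The `assemble` and `common_prefix_len` helpers are hypothetical and do not model any provider's actual cache.

```python
def assemble(stable: list[str], dynamic: list[str]) -> list[str]:
    """Stable prefix first, compactable region after the boundary."""
    return [*stable, *dynamic]


def common_prefix_len(a: list[str], b: list[str]) -> int:
    """Number of leading segments two prompts share (the reusable part)."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


system = ["SYS-1", "SYS-2", "SYS-3"]
before = ["msg-1", "msg-2", "msg-3", "msg-4"]
after = ["summary(msg-1..3)", "msg-4", "msg-5"]  # same conversation after compaction

# Stable content first: compaction only disturbs what follows the boundary.
good = common_prefix_len(assemble(system, before), assemble(system, after))

# Dynamic content first: the first compacted segment already diverges.
bad = common_prefix_len([*before, *system], [*after, *system])

if __name__ == "__main__":
    print(good, bad)
```

In this toy, the well-ordered prompt keeps all three system segments reusable across the compaction, while the badly ordered one keeps none.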
Pricing reality, May 2026:
- Anthropic: cache read is still 0.1x base input (the “10% of uncached” rule holds). 5-min cache write 1.25x, 1-hour cache write 2x.
- OpenAI: discounts vary by model — GPT-5.5 cached input runs 90% off ($0.50/M vs $5/M); older defaults sit around 50%. Automatic prefix caching kicks in at 1,024 tokens, extends in 128-token increments, 5-10 min idle TTL, 1 hr cap.
Across hundreds of turns the boundary discipline compounds into substantial savings.
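A back-of-envelope sketch of that compounding, using the Anthropic multipliers quoted above (1.25x for a 5-minute cache write, 0.1x for a read). It assumes the cache stays warm between turns and counts only the stable prefix; `prefix_cost` is a hypothetical helper, and costs are expressed in units of one base input token.

```python
def prefix_cost(turns: int, prefix_tokens: int, cached: bool,
                write_mult: float = 1.25, read_mult: float = 0.1) -> float:
    """Cost of re-sending a stable prefix, in base-input-token units."""
    if turns <= 0:
        return 0.0
    if not cached:
        return float(turns * prefix_tokens)
    # Assumption: the first turn writes the cache and every later turn
    # arrives before the TTL expires, so it is billed as a cache read.
    return write_mult * prefix_tokens + (turns - 1) * read_mult * prefix_tokens


if __name__ == "__main__":
    uncached = prefix_cost(100, 10_000, cached=False)
    cached = prefix_cost(100, 10_000, cached=True)
    print(uncached, cached, cached / uncached)
```

Under these assumptions a 10,000-token prefix re-sent over 100 turns costs about 111,500 token-equivalents cached versus 1,000,000 uncached, roughly 11%. A session with long idle gaps would pay the write multiplier more than once, which this sketch does not model.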
Hierarchical Summarization
Vector retrieval alone fails on multi-hop and temporal queries. Recent Mem0 benchmarks show hybrid systems (vector + graph + summarization) gaining up to 26% accuracy over plain vector approaches on multi-hop and temporal queries. The defensible claim isn’t “hierarchical strictly beats vector” — it’s that vector by itself doesn’t integrate; hierarchy does, and the hybrid wins where it matters.
Build the hierarchy first. Add vectors over it as an index.
graph TB
subgraph L0 ["L0 — Leaf Chunks (seal at ~3k tokens)"]
C1["Chunk 1"]
C2["Chunk 2"]
C3["Chunk 3"]
C4["Chunk 4"]
Cn["…"]
end
subgraph L1 ["L1 — Summary Nodes (fanout ~8)"]
S1["Summary A"]
S2["Summary B"]
end
subgraph L2 ["L2 — Summary Trees (scope: source / topic / global)"]
R1["Source Root"]
R2["Topic Root"]
R3["Global Root"]
end
subgraph IDX ["Vector Index (per-model embeddings)"]
V["Embeddings table<br/>keyed by node id"]
end
C1 --> S1
C2 --> S1
C3 --> S2
C4 --> S2
Cn --> S2
S1 --> R1
S2 --> R2
S2 --> R3
L0 -.embed.-> V
L1 -.embed.-> V
V -.lookup.-> L1
Properties verified against working implementations (tinyhumansai/openhuman README + gitbook architecture):
- Seal threshold at the leaf, ≤3k tokens. Once a chunk crosses the threshold it is sealed and a summary is produced; the seal is the unit of admission to the tree. (The 50k-token figure that circulated in earlier docs — including a prior version of this page — was wrong.)
- Three scoped summary trees. Source / topic / global. Retrieval picks the scope, not the chunk.
- SQLite storage. Hierarchical summary trees on disk, on the user’s machine.
- Per-model embedding tables and an async job queue are common in adjacent implementations (not in openhuman’s published docs); treat as engineering choices, not requirements.
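A minimal in-memory sketch of leaf sealing and fanout, under loud assumptions: this is not openhuman's implementation, the `SummaryTree` and `Node` classes are hypothetical, token counting is a whitespace split standing in for a real tokenizer, and `summarize` is whatever callable you supply (an LLM call in practice). Scoped roots and SQLite storage are left out.

```python
from dataclasses import dataclass, field
from typing import Callable

Summarizer = Callable[[list[str]], str]


@dataclass
class Node:
    level: int                      # 0 = sealed leaf, 1+ = summary of summaries
    summary: str
    children: list["Node"] = field(default_factory=list)


@dataclass
class SummaryTree:
    summarize: Summarizer
    seal_tokens: int = 3000         # seal threshold at the leaf
    fanout: int = 8                 # children per parent
    open_chunk: list[str] = field(default_factory=list)
    pending: dict[int, list[Node]] = field(default_factory=dict)  # level -> unparented nodes

    def add(self, text: str) -> None:
        """Append text to the open chunk; seal it once it crosses the threshold."""
        self.open_chunk.append(text)
        # Assumption: whitespace tokens approximate model tokens.
        if sum(len(t.split()) for t in self.open_chunk) >= self.seal_tokens:
            leaf = Node(level=0, summary=self.summarize(self.open_chunk))
            self.open_chunk = []
            self._admit(leaf)

    def _admit(self, node: Node) -> None:
        """Admit a sealed node; roll up a parent when fanout siblings exist."""
        siblings = self.pending.setdefault(node.level, [])
        siblings.append(node)
        if len(siblings) == self.fanout:
            parent = Node(node.level + 1,
                          self.summarize([n.summary for n in siblings]),
                          list(siblings))
            siblings.clear()
            self._admit(parent)


if __name__ == "__main__":
    tree = SummaryTree(summarize=lambda parts: f"summary of {len(parts)} parts",
                       seal_tokens=4, fanout=2)
    for _ in range(4):
        tree.add("two tokens")
    print({level: len(nodes) for level, nodes in tree.pending.items()})
```

Each piece of text is summarized once at its own level and never re-summarized in place, which is the property that separates this from a linear summary string.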
The contrary pattern — many flat memory tables with overlapping columns, dispatched by a hardcoded router — does not scale and does not integrate. MIRIX’s six stores (Core, Episodic, Semantic, Procedural, Resource Memory, Knowledge Vault) are differentiated by per-store fields and managed by six dedicated Memory Managers plus a Meta Memory Manager. The “~70% shared columns” critique that circulated in earlier writeups is overstated — actual overlap is closer to 20-30%, since each store carries specialized fields. The deeper architectural critique stands: a six-table polymorphic layout dispatched by a hardcoded router is taxonomy masquerading as architecture.
Advanced Strategies (Verified)
| Strategy | Mechanism | Reported impact |
|---|---|---|
| AgentFold (arXiv 2510.24699) | Multi-scale folding: Granular Condensation preserves fine-grained detail per turn; Deep Consolidation collapses entire sub-tasks. | AgentFold-30B-A3B: 36.2% on BrowseComp, 47.3% on BrowseComp-ZH, surpassing DeepSeek-V3.1-671B at release. |
| Agentic Plan Caching (APC) (arXiv 2506.14852, NeurIPS 2025) | Extract plan templates from completed runs, keyword-match new requests, adapt via a lightweight model. | 50.31% cost reduction, 27.28% latency reduction, retaining 96.61% of accuracy-optimal performance. 1.04% overhead. |
| ACON (arXiv 2510.00615) | Compresses environment observations + interaction history in natural-language space via guideline optimization. | 26-54% peak-token reduction with >95% accuracy retention; up to 46% gain when distilled into smaller compressor models. |
| Checkpoint Summarization | Completed sub-tasks replaced by structured summaries (inputs, outputs, decisions, artifacts). Full trace discarded. | Use for multi-phase workflows with clear sub-task boundaries. |
Auditable Persistence
For personal-assistant and single-user workloads, store L3 as a small set of human-editable markdown files. One open-source implementation uses three: an agent-voice file, a user-facts file, and a project-knowledge file, edited surgically by a cron-scheduled “dream” job that runs an LLM over the recent session and produces small diffs.
You can `cat` your memory. You can `git diff` your memory. You can hand-edit your memory when the agent gets something wrong. This is the contrarian bet: for workloads where the LLM integrates curated prose into its reasoning each turn, a few thousand tokens of well-edited markdown outperform thousands of fragment retrievals from a vector store. Integration beats recall.
The pattern fails for high-cardinality enterprise observation streams — millions of events, many users, queries that need recency and recall metrics. There, fall back to the hierarchical summary tree with vector index. Do not reach for the enterprise shape on a personal-assistant workload. The cost is real and the integration is worse.
The sharper way to see this is as a question about what the LLM is good at. The LLM is excellent at integrating a small amount of well-curated prose and bad at integrating a large number of approximate-nearest-neighbor fragments. A memory system whose primary mechanism is retrieval returns a thousand fragments. A memory system whose primary mechanism is hierarchical summarization returns one paragraph that captures their meaning. The first looks impressive in a benchmark of recall-at-k. The second is what makes the agent feel like it remembers you.
See also: Instruction Files for the related pattern of versioned, human-edited agent context.
Across Sessions
Users stop working and come back. The harness must serialize full conversation state and rehydrate it later — into a world that changed in between. This is across-session memory, and it is the load-bearing pattern that compaction theory doesn’t cover.
What goes wrong on resume:
- The codebase changed. The agent’s compacted history references files that no longer exist or whose content has shifted.
- Token archaeology. The resumed session carries summary-of-summary history. The agent “remembers” decisions it cannot fully reconstruct.
- OAuth tokens expired during the gap. The first API call after resume fails with a 401 the agent doesn’t expect.
Mitigations:
- Serialize a checkpoint: messages, tool results, compaction summaries, and a manifest of referenced files with content hashes.
- On resume, diff the manifest against current state. Surface changes to the user: “3 files referenced in this session have changed since you last worked.”
- Refresh credentials before replaying the first message, not after the first 401.
- Consider a soft resume: load the checkpoint summary as fresh context for a new session rather than replaying full history. You lose fine-grained turn detail; you gain freedom from stale-context drift. For long gaps, soft resume is almost always the right call.
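The manifest-and-diff mitigations can be sketched with the standard library alone. The checkpoint layout (a JSON file with `summary` and `manifest` keys) and the function names are assumptions made for illustration; credential refresh is out of scope here.

```python
import hashlib
import json
from pathlib import Path


def file_hash(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()


def save_checkpoint(dest: Path, summary: str, referenced: list[Path]) -> None:
    """Serialize the summary plus a manifest of referenced files and hashes."""
    manifest = {str(p): file_hash(p) for p in referenced}
    dest.write_text(json.dumps({"summary": summary, "manifest": manifest}, indent=2))


def diff_manifest(manifest: dict[str, str]) -> dict[str, list[str]]:
    """Compare recorded hashes against the file system as it is now."""
    changed, missing = [], []
    for name, old_hash in manifest.items():
        path = Path(name)
        if not path.exists():
            missing.append(name)
        elif file_hash(path) != old_hash:
            changed.append(name)
    return {"changed": changed, "missing": missing}


def soft_resume(checkpoint: Path) -> str:
    """Build fresh context from the summary plus a drift notice; no replay."""
    data = json.loads(checkpoint.read_text())
    drift = diff_manifest(data["manifest"])
    notice = (f"{len(drift['changed'])} referenced file(s) changed and "
              f"{len(drift['missing'])} are missing since the checkpoint.")
    return "\n".join([data["summary"], notice])
```

The string returned by `soft_resume` would seed a new session, and the same drift counts can be surfaced to the user before the first model call.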
Across-session state is where memory architecture meets the rest of the harness — credentials, file system, conversation log. The error surface compounds; the discipline pays back every time the user comes back tomorrow.
Cost-Bounded Ingest
Do not pay an LLM call per observation. Cost must be a function of wall time or queue depth, not user activity.
- Eager pipelines make 2+ LLM calls plus 2+ embedding calls per event. Cost scales linearly with usage. A chatty user becomes a billing incident.
- Interval-driven ingest (a cron job every N hours that reads recent traces and emits small memory diffs) bounds cost by wall time. Idle users are free.
- Queue-driven ingest (an async SQLite job queue with retry, dedupe keys, idempotent admission) survives crashes, batches work, decouples ingest latency from user-facing latency.
The durable shape is queue + interval. Eager ingest is the wrong default.
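A minimal sketch of the queue half of that shape, using the standard-library `sqlite3` module. The table name, columns, and the `enqueue` / `drain` helpers are hypothetical; `drain` is meant to be called by whatever interval scheduler you already run, and `handle_batch` stands in for one batched LLM call.

```python
import sqlite3
import time
from typing import Callable

SCHEMA = """
CREATE TABLE IF NOT EXISTS ingest_jobs (
    id         INTEGER PRIMARY KEY,
    dedupe_key TEXT NOT NULL UNIQUE,
    payload    TEXT NOT NULL,
    status     TEXT NOT NULL DEFAULT 'pending',
    attempts   INTEGER NOT NULL DEFAULT 0,
    created_at REAL NOT NULL
)
"""


def open_queue(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def enqueue(conn: sqlite3.Connection, dedupe_key: str, payload: str) -> bool:
    """Idempotent admission: a repeated dedupe_key is silently ignored."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO ingest_jobs (dedupe_key, payload, created_at) "
        "VALUES (?, ?, ?)",
        (dedupe_key, payload, time.time()),
    )
    conn.commit()
    return cur.rowcount == 1


def drain(conn: sqlite3.Connection, handle_batch: Callable[[list[str]], None],
          batch_size: int = 50, max_attempts: int = 3) -> int:
    """Process one batch per call, so cost tracks the interval, not activity."""
    rows = conn.execute(
        "SELECT id, payload FROM ingest_jobs "
        "WHERE status = 'pending' AND attempts < ? ORDER BY id LIMIT ?",
        (max_attempts, batch_size),
    ).fetchall()
    if not rows:
        return 0
    ids = [(row[0],) for row in rows]
    try:
        handle_batch([row[1] for row in rows])  # one call for the whole batch
    except Exception:
        # Leave jobs pending and count the attempt so they are retried later.
        conn.executemany("UPDATE ingest_jobs SET attempts = attempts + 1 WHERE id = ?", ids)
        conn.commit()
        return 0
    conn.executemany("UPDATE ingest_jobs SET status = 'done' WHERE id = ?", ids)
    conn.commit()
    return len(rows)
```

With a file path instead of `:memory:`, pending jobs survive a crash because admission and completion are both committed rows rather than in-process state.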
What Not to Do
- Linear summarize-when-full with no hierarchy. Each compaction degrades the previous summary. After three passes, the original information is unrecoverable. Use a tree.
- Vector-default with no rerank and no hierarchy. Cosine similarity over fragments returns plausible-sounding noise. If you must use vectors, use them as an index over summary nodes, not as the memory itself.
- Ingest-time LLM calls without batching or queueing. Synchronous extract-and-embed on the hot path turns memory into a tax on every turn.
- Long-running goals in user messages. The goal will be compacted away. Put it in session metadata and re-inject it. See Compaction-Resident State above.
- One memory table per “type” with overlapping columns and a hardcoded router. Taxonomy is not architecture. Collapse to one table with a `scope` column and a real index.
- Treating multi-agent shared state as a substitute for memory. Shared state is for coordination across agents within a session (see Multi-Agent Patterns). It is not L3.
- Trusting the “35-minute cliff” number you read somewhere. It’s uncited. Measure your own model + workload. Re-measure every release. The METR data is the closest thing to a public number, and even that doubles every ~3 months.
Memory is the part of the system the user notices last and feels most. Get the tiers right, keep the goal out of history, build the tree before the index, and let the markdown file be the source of truth when you can.