Memory

Memory is the discipline of deciding what enters the context window each turn. The window is finite, attention degrades non-linearly as it fills, and the degradation is silent. There is no error message when the model starts ignoring instructions buried 80K tokens back in the conversation. Memory is not “the stuff outside the window” — it is the policy that decides what gets in.


The Performance Curve

Agent performance does not collapse at a single clean cliff. The popular “35-minute / 100-tool-call” threshold is uncited. The defensible numbers come from METR and Anthropic’s own telemetry.

The shape: budget your turns assuming non-linear, model-generation-specific degradation. Decompose long-horizon work into sub-tasks that each fit well inside the safe zone for the model you actually run. Re-measure every release. Do not anchor on a number you saw on a blog in 2024.

The practical consequence is architectural, not numerical. The performance curve is the strongest argument for sub-task decomposition: no single agent invocation should run long enough to reach the 80%-horizon shoulder. The orchestrator passes forward only checkpoint summaries from previous sub-tasks — not full execution traces — keeping each invocation in the band where the model is still reliable. The alternative — letting a single agent run until it either succeeds or exhausts its context — remains the most common failure mode in production agent systems. Context management is not an optimization. It is a correctness requirement.
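
A minimal sketch of the orchestrator pattern described above, assuming a hypothetical `Checkpoint` schema and a stubbed `run_subtask` standing in for a real agent invocation. It shows only the data flow: each sub-task receives rendered checkpoint summaries, never the full trace of earlier sub-tasks.

```python
from dataclasses import dataclass, field


@dataclass
class Checkpoint:
    """Structured summary of one finished sub-task (hypothetical schema)."""
    task: str
    outcome: str
    artifacts: list[str] = field(default_factory=list)

    def render(self) -> str:
        files = ", ".join(self.artifacts) or "none"
        return f"[{self.task}] {self.outcome} (artifacts: {files})"


def run_subtask(task: str, context: str) -> tuple[list[str], Checkpoint]:
    """Stand-in for one agent invocation; returns (full trace, checkpoint)."""
    trace = [f"step {i} of {task}" for i in range(50)]  # verbose, stays local
    return trace, Checkpoint(task=task, outcome="done", artifacts=[f"{task}.md"])


def orchestrate(tasks: list[str]) -> list[str]:
    """Run sub-tasks in order, forwarding only checkpoint summaries."""
    checkpoints: list[Checkpoint] = []
    contexts: list[str] = []
    for task in tasks:
        context = "\n".join(cp.render() for cp in checkpoints)
        contexts.append(context)  # recorded so the data flow can be inspected
        _trace, checkpoint = run_subtask(task, context)  # trace is discarded
        checkpoints.append(checkpoint)
    return contexts


contexts = orchestrate(["plan", "migrate", "verify"])
print(contexts[2])
```

The context handed to the third sub-task is two short lines, regardless of how many steps the first two sub-tasks took. That is the property that keeps each invocation inside the reliable band.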


The Three Tiers, Sharpened

The L1/L2/L3 framing is correct, but the primitives most teams reach for at each tier are wrong. Use these instead.

| Tier | Role | Right primitive | Wrong primitive |
| --- | --- | --- | --- |
| L1 Working | The current context window. Everything the model is attending to this turn. Note: Opus 4.7’s tokenizer can use up to 35% more tokens for the same text vs prior generations, so your budget math is model-specific. | Token budget allocation per component (see below). | Treating the window as infinite and letting tool output flood it. |
| L2 Compacted | Recent history, condensed but still query-shaped. | A hierarchy of summary nodes: leaf chunks sealed at a token threshold, fanout-N parents above them, multiple roots scoped by source / topic / session. | A single linear summary string that gets re-summarized each compaction. Information melts. |
| L3 Persistent | What survives across sessions. | Human-auditable markdown files where the workload allows it; otherwise a SQLite-backed tree of summary nodes with embedding tables as an index over the tree. | A vector database treated as the source of truth, with no hierarchy and no audit path. |

The mistake most projects make is collapsing L2 and L3 into “the vector store.” A vector store is an index. It is not memory.

Context Budget Bands (community heuristic, not authoritative)

Multiple practitioner sources echo the same bands. Anthropic does not publish them. Treat them as a starting point that you re-calibrate to your model and workload.

| Utilization | Action |
| --- | --- |
| 0-50% | Work freely. Full instructions, history, and tool results fit comfortably. |
| 50-70% | Monitor. Log size per turn; identify the fastest-growing component. |
| 70-80% | Compact. Summarize history. Preserve decisions, file paths, current task state. Drop verbose tool output. |
| 80%+ | Reset. Serialize critical state externally. Start fresh and reload only what the next sub-task needs. |

A reasonable per-component split inside L1 for 128K-200K windows: system instructions 10-15%, active conversation 30-40%, tool results 20-30%, retrieved memory 10-20%, safety margin 10-15%.
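
The bands and the split can be sketched as two small functions. This is an illustration under stated assumptions: the monitor band is taken to start at 50% utilization, the percentages in `SPLIT_PERCENT` are one arbitrary choice inside the ranges quoted above, and the names `band_action` and `component_budgets` are hypothetical.

```python
# One possible split inside the quoted ranges; the values are an assumption
# and should be re-calibrated per model and workload. They sum to 100.
SPLIT_PERCENT = {
    "system": 12,
    "conversation": 35,
    "tool_results": 25,
    "retrieved_memory": 15,
    "safety_margin": 13,
}


def band_action(used_tokens: int, window_tokens: int) -> str:
    """Map window utilization to the heuristic action bands."""
    percent = used_tokens * 100 // window_tokens  # integer math avoids float edge cases
    if percent < 50:
        return "work"
    if percent < 70:
        return "monitor"
    if percent < 80:
        return "compact"
    return "reset"


def component_budgets(window_tokens: int) -> dict[str, int]:
    """Token budget per L1 component for a given window size."""
    return {name: window_tokens * pct // 100 for name, pct in SPLIT_PERCENT.items()}


budgets = component_budgets(200_000)
print(budgets["tool_results"], band_action(150_000, 200_000))
```

Because tokenizers differ between model generations, `used_tokens` should come from the tokenizer of the model actually in use, not from a character-count estimate.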


Compaction-Resident State

This is the single most important pattern to come out of forensic analysis of real long-horizon agents. Name it explicitly: compaction-resident state or, equivalently, out-of-history state.

Compaction operates on conversation history. Anything in history is a candidate for summarization, dilution, or deletion. Long-running goals, agent identity, the current active-task pointer, and the user’s standing constraints must not live in history. They live in session metadata and are re-injected into the system prompt every turn.

The shape:

The pattern is implemented in production by HKUDS/nanobot (v0.2.0, May 2026), whose /goal feature holds sustained objectives across turns and mirrors the compaction summary into session metadata so it survives process restarts. Auto-compact also skips sessions with active tasks.

What this fixes: in eager agents that store the goal as a user message (“migrate the auth module to OAuth2…”), the goal gets compacted away around turn 40, the agent drifts, and nobody can tell when it happened. Out-of-history state makes the goal a property of the session, not a memory.
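
A minimal sketch of out-of-history state, assuming a hypothetical `Session` record and a deliberately naive `compact` function. It is not nanobot's implementation. The point it illustrates is that the system prompt is rebuilt from session metadata every turn, so compacting history cannot touch the goal.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    """Session metadata lives outside the compactable history."""
    identity: str
    goal: str
    active_task: str
    constraints: list[str] = field(default_factory=list)
    history: list[str] = field(default_factory=list)


def build_system_prompt(session: Session) -> str:
    """Re-inject out-of-history state; call this on every turn."""
    lines = [
        session.identity,
        f"Goal: {session.goal}",
        f"Active task: {session.active_task}",
    ]
    lines += [f"Constraint: {c}" for c in session.constraints]
    return "\n".join(lines)


def compact(session: Session, keep_last: int = 2) -> None:
    """Naive compaction: replace older turns with a one-line summary."""
    if len(session.history) <= keep_last:
        return
    dropped = len(session.history) - keep_last
    session.history = [f"(summary of {dropped} earlier turns)"] + session.history[-keep_last:]


session = Session(
    identity="You are a migration agent.",
    goal="Migrate the auth module to OAuth2",
    active_task="Replace session-cookie login",
    constraints=["Do not change the public API"],
)
session.history = [f"turn {i}" for i in range(40)]

before = build_system_prompt(session)
compact(session)
after = build_system_prompt(session)
print(before == after, len(session.history))
```

Forty turns collapse to three history entries, while the system prompt is byte-for-byte the same before and after. Persisting the `Session` fields other than `history` is what would let the goal survive a process restart.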

If you take one pattern from this page, take this one. It composes cleanly with every other tier.


Compaction vs. Caching

Modern providers cache repeated prompt prefixes. The system prompt is identical across every turn, which makes it the ideal cache prefix — but compaction breaks the cache: the summarized content differs from the original, and every token after the divergence point becomes a cache miss. The fix is structural: place the system prompt first as a stable, cache-controlled prefix, and treat everything after that boundary — memory, conversation, tool results — as the dynamic region that can be compacted freely.

Key insight: Separate stable (cached) content from dynamic (compactable) content. Only compact the dynamic portion.
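
The boundary can be illustrated with a toy model in which a prompt is a list of segments and the cacheable portion is the longest shared prefix between two consecutive requests. This is a simplification: real provider caches match on token prefixes and have their own minimum lengths and expiry rules. The helper names are hypothetical.

```python
def shared_prefix_len(old: list[str], new: list[str]) -> int:
    """Count leading segments that are identical, i.e. still cacheable."""
    count = 0
    for a, b in zip(old, new):
        if a != b:
            break
        count += 1
    return count


def build_prompt(system: str, dynamic: list[str]) -> list[str]:
    """Stable prefix first; everything after it may be compacted."""
    return [system] + dynamic


def compact(dynamic: list[str], keep_last: int) -> list[str]:
    """Toy compaction applied to the dynamic region only."""
    if len(dynamic) <= keep_last:
        return list(dynamic)
    return [f"summary of {len(dynamic) - keep_last} items"] + dynamic[-keep_last:]


SYSTEM = "You are a coding agent. Follow the repository conventions."
turns = [
    "memory: user prefers tabs",
    "user: fix the test",
    "tool: 400 lines of output",
    "assistant: patched",
]

before = build_prompt(SYSTEM, turns)
after = build_prompt(SYSTEM, compact(turns, keep_last=1))
print(shared_prefix_len(before, after))
```

After compaction the dynamic region is entirely different, but the first segment still matches, so the system prompt remains a cache hit. Had any compactable content been placed ahead of the system prompt, the shared prefix would have dropped to zero.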

Pricing reality, May 2026:

Across hundreds of turns the boundary discipline compounds into substantial savings.


Hierarchical Summarization

Vector retrieval alone fails on multi-hop and temporal queries. Recent Mem0 benchmarks show hybrid systems (vector + graph + summarization) gaining up to 26% accuracy over plain vector approaches on multi-hop and temporal queries. The defensible claim isn’t “hierarchical strictly beats vector” — it’s that vector by itself doesn’t integrate; hierarchy does, and the hybrid wins where it matters.

Build the hierarchy first. Add vectors over it as an index.

graph TB
    subgraph L0 ["L0 — Leaf Chunks (seal at ~3k tokens)"]
        C1["Chunk 1"]
        C2["Chunk 2"]
        C3["Chunk 3"]
        C4["Chunk 4"]
        Cn["…"]
    end

    subgraph L1 ["L1 — Summary Nodes (fanout ~8)"]
        S1["Summary A"]
        S2["Summary B"]
    end

    subgraph L2 ["L2 — Summary Trees (scope: source / topic / global)"]
        R1["Source Root"]
        R2["Topic Root"]
        R3["Global Root"]
    end

    subgraph IDX ["Vector Index (per-model embeddings)"]
        V["Embeddings table<br/>keyed by node id"]
    end

    C1 --> S1
    C2 --> S1
    C3 --> S2
    C4 --> S2
    Cn --> S2
    S1 --> R1
    S2 --> R2
    S2 --> R3
    L0 -.embed.-> V
    L1 -.embed.-> V
    V -.lookup.-> L1

Properties verified against working implementations (tinyhumansai/openhuman README + gitbook architecture):

The contrary pattern — many flat memory tables with overlapping columns, dispatched by a hardcoded router — does not scale and does not integrate. MIRIX’s six stores (Core, Episodic, Semantic, Procedural, Resource Memory, Knowledge Vault) are differentiated by per-store fields and managed by six dedicated Memory Managers plus a Meta Memory Manager. The “~70% shared columns” critique that circulated in earlier writeups is overstated — actual overlap is closer to 20-30%, since each store carries specialized fields. The deeper architectural critique stands: a six-table polymorphic layout dispatched by a hardcoded router is taxonomy masquerading as architecture.
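
The leaf-seal and fanout rules from the diagram can be sketched as a small in-memory tree. This is a sketch under stated assumptions, not the openhuman implementation: token counts are approximated by whitespace-separated words, `summarize` is a caller-supplied stub where an LLM call would go, and only two levels (leaves and their summary parents) are built.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Node:
    node_id: int
    level: int  # 0 = leaf chunk, 1 = summary node
    text: str
    children: list[int] = field(default_factory=list)


class SummaryTree:
    def __init__(self, summarize: Callable[[list[str]], str],
                 seal_tokens: int = 3000, fanout: int = 8) -> None:
        self.summarize = summarize
        self.seal_tokens = seal_tokens
        self.fanout = fanout
        self.nodes: dict[int, Node] = {}
        self._open: list[str] = []     # text buffered for the unsealed leaf
        self._orphans: list[int] = []  # sealed leaves without a parent yet

    def _add(self, level: int, text: str, children: list[int]) -> Node:
        node = Node(len(self.nodes), level, text, children)
        self.nodes[node.node_id] = node
        return node

    def append(self, text: str) -> None:
        """Buffer text; seal a leaf at the threshold, a parent at the fanout."""
        self._open.append(text)
        if sum(len(t.split()) for t in self._open) < self.seal_tokens:
            return
        leaf = self._add(0, "\n".join(self._open), [])
        self._open = []
        self._orphans.append(leaf.node_id)
        if len(self._orphans) == self.fanout:
            texts = [self.nodes[i].text for i in self._orphans]
            self._add(1, self.summarize(texts), list(self._orphans))
            self._orphans = []


# Tiny thresholds so the behavior is visible: every 4-word event seals a leaf,
# and every 2 sealed leaves produce one summary node.
tree = SummaryTree(lambda texts: f"summary of {len(texts)} chunks", seal_tokens=4, fanout=2)
for i in range(4):
    tree.append(f"event {i} happened today")
print([n.level for n in tree.nodes.values()])
```

Because every node has a stable `node_id`, an embeddings table keyed by that id can be added later as an index without changing the tree itself.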


Advanced Strategies (Verified)

| Strategy | Mechanism | Reported impact |
| --- | --- | --- |
| AgentFold (arXiv 2510.24699) | Multi-scale folding: Granular Condensation preserves fine-grained detail per turn; Deep Consolidation collapses entire sub-tasks. | AgentFold-30B-A3B: 36.2% on BrowseComp, 47.3% on BrowseComp-ZH, surpassing DeepSeek-V3.1-671B at release. |
| Agentic Plan Caching (APC) (arXiv 2506.14852, NeurIPS 2025) | Extract plan templates from completed runs, keyword-match new requests, adapt via a lightweight model. | 50.31% cost reduction, 27.28% latency reduction, retaining 96.61% of accuracy-optimal performance. 1.04% overhead. |
| ACON (arXiv 2510.00615) | Compresses environment observations + interaction history in natural-language space via guideline optimization. | 26-54% peak-token reduction with >95% accuracy retention; up to 46% gain when distilled into smaller compressor models. |
| Checkpoint Summarization | Completed sub-tasks replaced by structured summaries (inputs, outputs, decisions, artifacts). Full trace discarded. | Use for multi-phase workflows with clear sub-task boundaries. |

Auditable Persistence

For personal-assistant and single-user workloads, store L3 as a small set of human-editable markdown files. One open-source implementation uses three: an agent-voice file, a user-facts file, and a project-knowledge file, edited surgically by a cron-scheduled “dream” job that runs an LLM over the recent session and produces small diffs.

You can cat your memory. You can git diff your memory. You can hand-edit your memory when the agent gets something wrong. This is the contrarian bet: for workloads where the LLM integrates curated prose into its reasoning each turn, a few thousand tokens of well-edited markdown outperforms thousands of fragment retrievals from a vector store. Integration beats recall.
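
A minimal sketch of loading markdown memory into the prompt, assuming three hypothetical file names and a character cap as a rough stand-in for a token budget. The files are read whole and in a fixed order; there is no retrieval step.

```python
import tempfile
from pathlib import Path

# Hypothetical file names; the source only says there are three such files.
MEMORY_FILES = ["agent_voice.md", "user_facts.md", "project_knowledge.md"]


def load_memory(memory_dir: Path, max_chars: int = 12_000) -> str:
    """Concatenate the markdown memory files into one prompt section."""
    sections = []
    for name in MEMORY_FILES:
        path = memory_dir / name
        if path.exists():
            body = path.read_text(encoding="utf-8").strip()
            sections.append(f"## {name}\n{body}")
    memory = "\n\n".join(sections)
    if len(memory) > max_chars:
        # Fail loudly instead of truncating: the fix is to edit the files.
        raise ValueError(f"memory is {len(memory)} chars; limit is {max_chars}")
    return memory


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "user_facts.md").write_text("- Prefers concise answers\n", encoding="utf-8")
    (root / "project_knowledge.md").write_text("- Auth module lives in src/auth\n", encoding="utf-8")
    memory = load_memory(root)

print(memory)
```

Raising on overflow rather than truncating is a design choice in this sketch: silent truncation would reintroduce the invisible loss that auditable files are meant to prevent.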

The pattern fails for high-cardinality enterprise observation streams — millions of events, many users, queries that need recency and recall metrics. There, fall back to the hierarchical summary tree with vector index. Do not reach for the enterprise shape on a personal-assistant workload. The cost is real and the integration is worse.

The sharper way to see this is as a question about what the LLM is good at. The LLM is excellent at integrating a small amount of well-curated prose and bad at integrating a large number of approximate-nearest-neighbor fragments. A memory system whose primary mechanism is retrieval returns a thousand fragments. A memory system whose primary mechanism is hierarchical summarization returns one paragraph that captures their meaning. The first looks impressive in a benchmark of recall-at-k. The second is what makes the agent feel like it remembers you.

See also: Instruction Files for the related pattern of versioned, human-edited agent context.


Across Sessions

Users stop working and come back. The harness must serialize full conversation state and rehydrate it later — into a world that changed in between. This is across-session memory, and it is the load-bearing pattern that compaction theory doesn’t cover.

What goes wrong on resume:

Mitigations:

Across-session state is where memory architecture meets the rest of the harness — credentials, file system, conversation log. The error surface compounds; the discipline pays back every time the user comes back tomorrow.


Cost-Bounded Ingest

Do not pay an LLM call per observation. Cost must be a function of wall time or queue depth, not user activity.

The durable shape is queue + interval. Eager ingest is the wrong default.
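
A minimal sketch of queue + interval ingest, assuming a hypothetical `IntervalIngest` class driven by an external scheduler that calls `tick` with the current time. The `summarize` callable marks the only place an LLM call would happen, so the number of calls is bounded by elapsed time divided by the interval, however many observations arrive.

```python
from collections import deque
from typing import Callable, Optional


class IntervalIngest:
    """Queue observations; spend at most one LLM call per interval."""

    def __init__(self, summarize: Callable[[list[str]], str],
                 interval_s: float = 300.0, max_batch: int = 200) -> None:
        self.summarize = summarize  # the only place an LLM would be called
        self.interval_s = interval_s
        self.max_batch = max_batch
        self.queue: deque = deque()
        self.last_flush = 0.0
        self.llm_calls = 0

    def observe(self, event: str) -> None:
        """Enqueue only; never calls the LLM."""
        self.queue.append(event)

    def tick(self, now: float) -> Optional[str]:
        """Flush one batch if the queue is non-empty and the interval has elapsed."""
        if not self.queue or now - self.last_flush < self.interval_s:
            return None
        size = min(self.max_batch, len(self.queue))
        batch = [self.queue.popleft() for _ in range(size)]
        self.last_flush = now
        self.llm_calls += 1
        return self.summarize(batch)


ingest = IntervalIngest(lambda batch: f"{len(batch)} events", interval_s=300.0)
for i in range(1000):
    ingest.observe(f"event {i}")

first = ingest.tick(now=100.0)   # interval not yet elapsed
second = ingest.tick(now=300.0)  # one call covering up to max_batch events
third = ingest.tick(now=310.0)   # too soon after the last flush
print(first, second, third, ingest.llm_calls)
```

A thousand observations cost one call in the first interval. The queue here is unbounded, so a real deployment would also need a policy for backlog, such as dropping or coarsening the oldest events.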


What Not to Do

Memory is the part of the system the user notices last and feels most. Get the tiers right, keep the goal out of history, build the tree before the index, and let the markdown file be the source of truth when you can.