Sandboxing

Instruction files define what an agent should do; sandboxing defines what it can do. Letting an agent execute code, write files, and make network calls without an isolation layer is equivalent to handing root access to a probabilistic system: one misread prompt can exfiltrate secrets, delete production state, or provision cloud resources. Sandboxing is the part of the harness that still holds when the model gets it wrong.

Cross-Platform Comparison

System	Sandboxing Approach	Permission Model
Claude Code	Seatbelt (macOS) / bubblewrap (Linux/WSL2) + local domain-allowlist network proxy. Open-sourced as `@anthropic-ai/sandbox-runtime`. WSL1 unsupported; native Windows planned.	Allow / deny / ask rules in `settings.json` + lifecycle hooks
OpenAI Codex / Responses API	OS-enforced containers + cloud sandboxes with two-phase runtime	Composable `--sandbox <profile> --ask-for-approval <mode>`; legacy Suggest / Auto-Edit / Full-Auto deprecated (`--full-auto` warns on use)
Google ADK	Code execution sandbox (Agent Engine Executor managed; GKE Executor self-managed) + VPC-SC + hermetic mode	Agent-auth vs user-auth (delegated OAuth)
LangGraph	Middleware policies + tool restrictions	Per-tool gates + human-in-the-loop interrupts

All four converge on the same principle: the agent’s execution environment is a strict subset of the host’s capabilities. No major platform ships “run anything, anywhere” as the default any longer.

The Two-Phase Runtime

sequenceDiagram
    participant R as Runtime
    participant S as Setup Phase
    participant E as Execution Phase
    participant P as Policy Proxy

    R->>S: Start container
    Note over S: Network: ON<br/>Secrets: AVAILABLE
    S->>S: Install deps, clone repos, auth APIs
    S->>R: Setup complete

    R->>E: Flip to execution
    Note over E: Network: OFF (default)<br/>Secrets: PLACEHOLDERS ONLY
    E->>P: Outbound request (if enabled)
    Note over P: Allow-list check<br/>Substitute real secret
    P-->>E: Response (raw secret never seen by model)

OpenAI’s Responses API (and the 2026 cohort that copied it) splits agent execution into two phases. Setup has network access and can read configured secrets to install dependencies and authenticate APIs. When setup completes, the runtime flips to execution: network is off by default, and the model only sees secret placeholders --- raw values stay in an external substitution layer. When network is re-enabled, outbound traffic is forced through a centralized policy proxy that enforces domain allow-lists and swaps placeholders for real credentials only at the egress boundary. The model can’t exfiltrate what it can’t see.
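
The egress step can be sketched in a few lines. This is a minimal illustration of the pattern, not OpenAI's implementation: `ALLOWED_DOMAINS`, `SECRET_STORE`, `prepare_egress`, and the `{{secret:NAME}}` placeholder format are all hypothetical names chosen for the example.

```python
from urllib.parse import urlsplit

# Hypothetical configuration: both tables live in the proxy, outside the sandbox.
ALLOWED_DOMAINS = {"api.github.com", "registry.npmjs.org"}
SECRET_STORE = {"GITHUB_TOKEN": "real-token-held-by-proxy"}


class EgressDenied(Exception):
    """Raised when a request targets a host outside the allow-list."""


def prepare_egress(url: str, headers: dict[str, str]) -> dict[str, str]:
    """Check the allow-list, then swap placeholders for real secrets.

    The sandboxed process only ever holds strings like
    "{{secret:GITHUB_TOKEN}}" (an assumed placeholder format).
    """
    host = urlsplit(url).hostname or ""
    if host not in ALLOWED_DOMAINS:
        raise EgressDenied(f"host not on allow-list: {host!r}")

    resolved = {}
    for name, value in headers.items():
        for secret_name, secret_value in SECRET_STORE.items():
            value = value.replace("{{secret:" + secret_name + "}}", secret_value)
        resolved[name] = value
    return resolved
```

The allow-list check runs before any substitution, so a request to an unlisted host fails without the real credential ever being materialized.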

Container & Isolation Tier: the 2026 Shift

Shared-kernel Docker/runc is insufficient for untrusted agent code: a kernel exploit inside the container compromises the host. Production stacks now rely on stronger boundaries:

Firecracker microVMs: ~125 ms boot, hardware-enforced boundary (AWS, Northflank, Docker Sandboxes).
gVisor: user-space kernel, 10-30% I/O overhead, minimal compute cost.
Kata Containers: VM-grade isolation with container UX.
Wasm isolates: Wassette, Extism --- sub-10 ms startup, memory-safe.
New entrants (2026): Docker Sandboxes (microVM-per-sandbox with private daemon), Microsoft LiteBox library-OS.

The rule of thumb: if untrusted code might run in it, the boundary should be hardware-enforced or formally isolated, not just namespaced.

Permission Models

Model	How It Works	Default Posture
Allow / Deny (Claude Code)	Rules in `settings.json` using `Tool(pattern:*)` syntax (note the colon, e.g. `Bash(git:*)`). Evaluation order: deny -> ask -> allow, first match wins. Hooks can override the result.	Deny unless allowed
Trust-Based (Codex)	Project must be explicitly trusted; per-invocation `--sandbox` and `--ask-for-approval` flags pick autonomy level	Untrusted until approved
Identity-Based (ADK)	Agent-auth (service account, agent’s own actions) vs user-auth (delegated OAuth on behalf of the user). VPC-SC at the network boundary	Scoped per tool to declared auth context
Role-Based (general)	RBAC; roles bundle tool access, FS scope, network reach	Zero permissions; explicit grants only

Different surfaces, same principle: least privilege, explicit grants.
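
The deny -> ask -> allow ordering is easy to get wrong, so here is a minimal sketch of it. It assumes tool calls are rendered as strings like `Bash(git status)` and uses shell-style globs via `fnmatch`; that is a simplification for illustration, not Claude Code's actual matcher, and `decide` is a hypothetical name.

```python
from fnmatch import fnmatchcase


def decide(tool_call: str, deny: list[str], ask: list[str], allow: list[str],
           default: str = "deny") -> str:
    """Return the first verdict whose rule list matches, checked in
    deny -> ask -> allow order. Falls back to `default` (least privilege)."""
    for verdict, patterns in (("deny", deny), ("ask", ask), ("allow", allow)):
        if any(fnmatchcase(tool_call, pattern) for pattern in patterns):
            return verdict
    return default


# Illustrative rule sets (glob syntax is an assumption of this sketch).
DENY = ["Bash(git push*)"]
ASK = ["Bash(npm install*)"]
ALLOW = ["Bash(git *)", "Read(*)"]
```

Because deny is checked first, `Bash(git push origin main)` is rejected even though the broader `Bash(git *)` allow rule also matches it.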

Plugin Trust Boundary

Sandboxing the agent core is only half the job. Plugins, skills, and MCP servers extend the harness with new code --- and the isolation story collapses if you don’t think about them explicitly.

More isolated <----------------------------------------> Less isolated

Cloud sandbox     MCP server          Plugin (Wasm)        Plugin (in-process)
  (Codex)      (subprocess/stdio)   (Wassette/Extism)     (dynamic linking)

The load-bearing fact: subprocess MCP via stdio inherits the parent process’s environment, filesystem permissions, and network access. That is not isolation. It is process separation with full privilege passthrough. A malicious MCP server can read your ~/.aws/credentials because the harness can.
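
One partial mitigation is to stop passing the parent environment through. The sketch below launches a stdio child with an explicit, minimal environment; `SAFE_ENV_KEYS`, `scrubbed_env`, and `spawn_stdio_server` are hypothetical names, and the key list is an assumption to adjust per platform. It only removes environment-variable secrets: filesystem and network access are still inherited, so it is not a substitute for a real sandbox.

```python
import os
import subprocess
import sys

# Assumed minimal set of variables a child needs to start; everything else
# (AWS_*, GITHUB_TOKEN, ...) is dropped rather than inherited.
SAFE_ENV_KEYS = ("PATH", "LANG", "TMPDIR", "SYSTEMROOT")


def scrubbed_env(extra=None):
    """Build a child environment from an explicit allow-list of variables."""
    env = {key: os.environ[key] for key in SAFE_ENV_KEYS if key in os.environ}
    env.update(extra or {})
    return env


def spawn_stdio_server(argv, extra_env=None):
    """Start a stdio subprocess without passing the parent's environment."""
    return subprocess.Popen(
        argv,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        env=scrubbed_env(extra_env),
        text=True,
    )
```

Credentials a server legitimately needs go in through `extra_env`, one variable at a time, which makes each grant visible in the launch code.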

Emerging Wasm-isolated alternatives --- Wassette (Microsoft’s Rust+Wasm+MCP bridge), Extism, Hyper MCP --- give per-plugin memory safety, capability-scoped imports, and Sigstore/Cosign signature verification. These are the right substrate for untrusted third-party tools.

The ecosystem is not safe by default. The Claude plugin marketplace has version pinning but no binary signing. Snyk’s February 2026 ToxicSkills audit of 3,984 ClawHub skills found 13.4% with critical vulnerabilities and 76 with confirmed malicious payloads. CVE-2025-54136 covered Cursor’s permanent-trust bug: once approved, never revalidated. Treat every installed plugin as untrusted code that nonetheless runs with your full credentials.

Human-in-the-Loop Patterns

Permission gates: specific (per action, not category), contextual (show the exact command and reasoning), blocking (no silent auto-approve on timeout).
Approval workflows for destructive ops: single approval with undo for file deletion, dual approval for production config, multi-party thresholds for financial transactions.
Time-bounded grants: 5-minute windows for customer-facing actions and ~60-minute windows for internal approvals are common defaults. Auto-escalate to backup approvers on timeout; never silently extend.
Audit trails: timestamp, action, permission basis, result --- logged for every gated decision. Required for incident response and increasingly for compliance.
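
A time-bounded grant and its audit record fit in one small gate. This is a sketch under stated assumptions: `Grant`, `gate`, and the audit field names are hypothetical, and the caller supplies `now` so that expiry is explicit and testable.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Grant:
    """Approval for one specific action, valid for a fixed window."""
    action: str
    approver: str
    issued_at: float      # seconds since epoch
    ttl_seconds: float    # e.g. 300 for a 5-minute window

    def is_valid(self, action: str, now: float) -> bool:
        return action == self.action and now < self.issued_at + self.ttl_seconds


def gate(action: str, grant: Optional[Grant], now: float, audit_log: list) -> bool:
    """Allow the action only under a matching, unexpired grant.

    Every decision, allowed or blocked, appends one audit record.
    """
    if grant is None:
        basis, result = "no grant", "blocked"
    elif grant.is_valid(action, now):
        basis, result = f"grant by {grant.approver}", "allowed"
    else:
        basis, result = f"expired or mismatched grant by {grant.approver}", "blocked"
    audit_log.append({
        "timestamp": now,
        "action": action,
        "permission_basis": basis,
        "result": result,
    })
    return result == "allowed"
```

An expired grant blocks rather than extends, and the blocked attempt is logged with the same fields as a successful one.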

Regulatory Context

The EU AI Act becomes generally applicable on 2 August 2026, with some high-risk obligations for regulated products following in August 2027; prohibitions on unacceptable-risk systems have applied since February 2025, and general-purpose AI obligations since August 2025. California SB-833 (effective 2026-07-01) is the first US state law to codify human-in-the-loop requirements for agentic systems performing high-risk actions: time-boxed approval, audit trails, identifiable accountable humans. The OWASP Top 10 for Agentic Applications (9 December 2025) maps excessive agency, privilege escalation, and secret exfiltration directly to the controls above. The OAuth 2.1 draft requires PKCE for every client using the authorization code flow, confidential server-side clients included, and RFC 9700 requires it for public clients and recommends it for confidential ones, so any CLI agent doing browser-based login should treat non-PKCE flows as a defect.
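
For the PKCE point, the S256 derivation defined in RFC 7636 is short enough to show in full. The function names below are hypothetical; the transform itself (SHA-256 of the verifier, base64url-encoded without padding) follows the RFC.

```python
import base64
import hashlib
import secrets


def _b64url(data: bytes) -> str:
    """Base64url-encode without '=' padding, as RFC 7636 requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def make_code_verifier(n_bytes: int = 32) -> str:
    """Generate a high-entropy verifier (32 random bytes -> 43 characters)."""
    return _b64url(secrets.token_bytes(n_bytes))


def code_challenge_s256(verifier: str) -> str:
    """Derive the challenge sent in the authorization request."""
    return _b64url(hashlib.sha256(verifier.encode("ascii")).digest())
```

The client sends the challenge with `code_challenge_method=S256` in the authorization request and presents the verifier only when redeeming the code, so an intercepted code is useless on its own.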

Example: Locked-Down Claude Code Project

// .claude/settings.json
{
  "permissions": {
    "allow": ["Read", "Glob", "Grep", "Bash(git:*)", "Bash(npm test)", "Bash(npm run build)"],
    "deny":  ["Bash(curl:*)", "Bash(wget:*)", "Bash(rm -rf:*)", "Bash(ssh:*)", "Bash(git push:*)"]
  },
  "sandbox": {
    "enabled": true,
    "allowUnsandboxedCommands": false,
    "network": { "allowedDomains": ["registry.npmjs.org", "github.com"] }
  },
  "hooks": {
    "PreToolUse": [{ "matcher": "Bash",
      "hooks": [{ "type": "command", "command": "python3 .claude/hooks/check_no_secrets.py" }] }]
  }
}

Read and search are unrestricted, a curated set of commands is allowed, network and destructive operations are denied, commands cannot fall back to running outside the sandbox, and a pre-tool hook scans for secrets the static lists can’t catch. Static rules express policy; hooks enforce runtime invariants; the OS-level sandbox is the floor underneath both.
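
The hook script itself is not shown above, so here is one possible shape for `check_no_secrets.py`. It assumes the hook receives the tool call as JSON on stdin with the Bash command under `tool_input.command`, and that exit code 2 blocks the call, which is how Claude Code's hook documentation describes the contract; verify both against the version in use. The patterns are illustrative, not exhaustive.

```python
import json
import re
import sys

# Illustrative patterns only; a real deployment would use a maintained scanner.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                    # AWS access key ID shape
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),  # PEM private key header
    re.compile(r"(?i)\b(api[_-]?key|token|password)\s*=\s*\S{8,}"),
]

BLOCK = 2  # assumed hook contract: exit code 2 blocks the tool call
ALLOW = 0


def find_secret(text: str):
    """Return the first matching pattern's source, or None if clean."""
    for pattern in SECRET_PATTERNS:
        if pattern.search(text):
            return pattern.pattern
    return None


def evaluate(payload: dict) -> int:
    """Map one hook payload to an exit code, explaining blocks on stderr."""
    command = payload.get("tool_input", {}).get("command", "")
    hit = find_secret(command)
    if hit is not None:
        print(f"blocked: command matches secret pattern {hit}", file=sys.stderr)
        return BLOCK
    return ALLOW


def main() -> int:
    """Entry point: read the JSON payload from stdin and evaluate it."""
    return evaluate(json.load(sys.stdin))
```

When saved as the hook script, the file would end with `sys.exit(main())` so the exit code reaches the harness.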