Sandboxing

Instruction files define what an agent should do; sandboxing defines what it can do. Running an agent that can execute code, write files, and make network calls without an isolation layer amounts to handing root access to a probabilistic system: one misread prompt can exfiltrate secrets, delete production state, or provision cloud resources. Sandboxing is the part of the harness that survives the model getting it wrong.


Cross-Platform Comparison

| System | Sandboxing Approach | Permission Model |
| --- | --- | --- |
| Claude Code | Seatbelt (macOS) / bubblewrap (Linux/WSL2) + local domain-allowlist network proxy. Open-sourced as @anthropic-ai/sandbox-runtime. WSL1 unsupported; native Windows planned. | Allow / deny / ask rules in settings.json + lifecycle hooks |
| OpenAI Codex / Responses API | OS-enforced containers + cloud sandboxes with two-phase runtime | Composable --sandbox <profile> --ask-for-approval <mode>; legacy Suggest / Auto-Edit / Full-Auto deprecated (--full-auto warns on use) |
| Google ADK | Code execution sandbox (Agent Engine Executor managed; GKE Executor self-managed) + VPC-SC + hermetic mode | Agent-auth vs user-auth (delegated OAuth) |
| LangGraph | Middleware policies + tool restrictions | Per-tool gates + human-in-the-loop interrupts |

All four converge on the same principle: the agent’s execution environment is a strict subset of the host’s capabilities. No major platform ships “run anything, anywhere” as the default any longer.


The Two-Phase Runtime

sequenceDiagram
    participant R as Runtime
    participant S as Setup Phase
    participant E as Execution Phase
    participant P as Policy Proxy

    R->>S: Start container
    Note over S: Network: ON<br/>Secrets: AVAILABLE
    S->>S: Install deps, clone repos, auth APIs
    S->>R: Setup complete

    R->>E: Flip to execution
    Note over E: Network: OFF (default)<br/>Secrets: PLACEHOLDERS ONLY
    E->>P: Outbound request (if enabled)
    Note over P: Allow-list check<br/>Substitute real secret
    P-->>E: Response (raw secret never seen by model)

OpenAI’s Responses API (and the 2026 cohort that copied it) splits agent execution into two phases. Setup has network access and can read configured secrets to install dependencies and authenticate APIs. When setup completes, the runtime flips to execution: network is off by default, and the model only sees secret placeholders --- raw values stay in an external substitution layer. When network is re-enabled, outbound traffic is forced through a centralized policy proxy that enforces domain allow-lists and swaps placeholders for real credentials only at the egress boundary. The model can’t exfiltrate what it can’t see.
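The egress step described above can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation: the domain list, the `{{API_TOKEN}}` placeholder format, and the names `ALLOWED_DOMAINS`, `SECRET_STORE`, and `prepare_outbound` are all assumptions made for the example.

```python
from urllib.parse import urlparse

# Hypothetical allow-list and secret store; every name here is illustrative.
ALLOWED_DOMAINS = {"api.example.com"}
SECRET_STORE = {"{{API_TOKEN}}": "s3cr3t-value"}


class EgressDenied(Exception):
    """Raised when a request targets a host outside the allow-list."""


def prepare_outbound(url: str, headers: dict) -> dict:
    """Check the destination, then swap placeholders for real secrets.

    The returned headers exist only on the proxy side of the boundary;
    the caller inside the sandbox keeps its placeholder-only copy.
    """
    host = urlparse(url).hostname
    if host not in ALLOWED_DOMAINS:
        raise EgressDenied(f"host not on allow-list: {host}")
    resolved = {}
    for name, value in headers.items():
        for placeholder, secret in SECRET_STORE.items():
            value = value.replace(placeholder, secret)
        resolved[name] = value
    return resolved


sandbox_headers = {"Authorization": "Bearer {{API_TOKEN}}"}
wire_headers = prepare_outbound("https://api.example.com/v1/items", sandbox_headers)
```

The allow-list check runs before any substitution, so a request to an unlisted host fails without the real secret ever being resolved. The sandbox-side dictionary is never mutated, which mirrors the property that the model only handles placeholders.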


Container & Isolation Tier: the 2026 Shift

Shared-kernel Docker/runc is insufficient for untrusted agent code --- a kernel exploit in the container compromises the host. The production stack now:

The rule of thumb: if untrusted code might run in it, the boundary should be hardware-enforced or formally isolated, not just namespaced.


Permission Models

| Model | How It Works | Default Posture |
| --- | --- | --- |
| Allow / Deny (Claude Code) | Rules in settings.json using Tool(pattern:*) syntax (note the colon, e.g. Bash(git:*)). Evaluation order: deny -> ask -> allow, first match wins. Hooks override. | Deny unless allowed |
| Trust-Based (Codex) | Project must be explicitly trusted; per-invocation --sandbox and --ask-for-approval flags pick autonomy level | Untrusted until approved |
| Identity-Based (ADK) | Agent-auth (service account, agent’s own actions) vs user-auth (delegated OAuth on behalf of the user). VPC-SC at the network boundary | Scoped per tool to declared auth context |
| Role-Based (general) | RBAC; roles bundle tool access, FS scope, network reach | Zero permissions; explicit grants only |

Different surfaces, same principle: least privilege, explicit grants.
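The deny -> ask -> allow ordering with a default-deny fallback can be sketched as a small evaluator. The rule set and the `decide` function below are hypothetical, and the patterns use shell-style globs from Python's `fnmatch` for illustration rather than the rule syntax of any specific product.

```python
from fnmatch import fnmatchcase

# Hypothetical rule set; glob patterns are an assumption of this sketch.
RULES = {
    "deny": ["curl *", "rm -rf *"],
    "ask": ["git push *"],
    "allow": ["git *", "npm test"],
}


def decide(command: str, rules: dict = RULES) -> str:
    """Return "deny", "ask", or "allow" for a command string.

    Tiers are checked in deny -> ask -> allow order and the first tier
    with a matching pattern wins. Anything unmatched is denied.
    """
    for tier in ("deny", "ask", "allow"):
        if any(fnmatchcase(command, pattern) for pattern in rules.get(tier, [])):
            return tier
    return "deny"
```

Because tiers are checked in a fixed order, `git push origin main` resolves to "ask" even though the broader `git *` allow rule also matches it. A command that matches nothing falls through to "deny", which is the least-privilege default the table describes.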


Plugin Trust Boundary

Sandboxing the agent core is only half the job. Plugins, skills, and MCP servers extend the harness with new code --- and the isolation story collapses if you don’t think about them explicitly.

More isolated <----------------------------------------> Less isolated

Cloud sandbox     MCP server          Plugin (Wasm)        Plugin (in-process)
  (Codex)      (subprocess/stdio)   (Wassette/Extism)     (dynamic linking)

The load-bearing fact: subprocess MCP via stdio inherits the parent process’s environment, filesystem permissions, and network access. That is not isolation. It is process separation with full privilege passthrough. A malicious MCP server can read your ~/.aws/credentials because the harness can.
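The environment half of that passthrough is easy to demonstrate. In the sketch below a Python one-liner stands in for an MCP server process; the `PASSTHROUGH` list and the helper names are assumptions for illustration, and the credential value is a dummy.

```python
import os
import subprocess
import sys
from typing import Optional

# Hypothetical allow-list of variables a tool server needs to start.
PASSTHROUGH = ("PATH", "LANG", "SYSTEMROOT")


def scrubbed_env(extra: Optional[dict] = None) -> dict:
    """Build a minimal environment instead of inheriting os.environ."""
    env = {k: os.environ[k] for k in PASSTHROUGH if k in os.environ}
    env.update(extra or {})
    return env


def run_child(code: str, env: dict) -> str:
    """Run a Python one-liner as a stand-in for launching a tool server."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


os.environ["AWS_SECRET_ACCESS_KEY"] = "dummy-value-for-demo"
probe = "import os; print('AWS_SECRET_ACCESS_KEY' in os.environ)"
inherited = run_child(probe, dict(os.environ))  # child sees the variable
scrubbed = run_child(probe, scrubbed_env())     # child does not
```

Scrubbing the environment only closes one channel. The child still runs as the same user with the same filesystem and network access, so files such as ~/.aws/credentials remain readable unless an OS-level sandbox restricts them.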

Emerging Wasm-isolated alternatives --- Wassette (Microsoft’s Rust+Wasm+MCP bridge), Extism, Hyper MCP --- give per-plugin memory safety, capability-scoped imports, and Sigstore/Cosign signature verification. These are the right substrate for untrusted third-party tools.

The ecosystem is not safe by default. The Claude plugin marketplace has version pinning but no binary signing. Snyk’s February 2026 ToxicSkills audit of 3,984 ClawHub skills found 13.4% with critical vulnerabilities and 76 with confirmed malicious payloads. CVE-2025-54136 covered Cursor’s permanent-trust bug: once approved, never revalidated. Assume every installed plugin runs with your full credentials, and vet it as you would any code you grant that access.
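One way to avoid the approve-once, never-revalidate failure mode is to pin the digest of the approved artifact and re-check it on every load. The sketch below is a simplified illustration under that assumption; `PinnedPlugins` is a hypothetical class, and digest pinning is a weaker control than signature verification.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex digest of the file's current contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


class PinnedPlugins:
    """Hypothetical registry recording the digest approved for each plugin."""

    def __init__(self) -> None:
        self._approved = {}

    def approve(self, name: str, path: Path) -> None:
        self._approved[name] = sha256_of(path)

    def verify(self, name: str, path: Path) -> bool:
        """Re-hash on every load; approval covers one exact artifact only."""
        return self._approved.get(name) == sha256_of(path)


with tempfile.TemporaryDirectory() as tmp:
    plugin = Path(tmp) / "plugin.py"
    plugin.write_text("print('hello')\n")
    registry = PinnedPlugins()
    registry.approve("demo", plugin)
    ok_before = registry.verify("demo", plugin)   # unchanged since approval
    plugin.write_text("print('tampered')\n")
    ok_after = registry.verify("demo", plugin)    # contents changed
```

Any change to the file after approval makes `verify` return False, so the user is asked again instead of silently trusting the new contents.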


Human-in-the-Loop Patterns


Regulatory Context

The EU AI Act reaches full enforcement on 2 August 2026; prohibitions on unacceptable-risk systems have been live since February 2025, and general-purpose AI obligations since August 2025. California SB-833 (effective 2026-07-01) is the first US state law to codify human-in-the-loop requirements for agentic systems performing high-risk actions --- time-boxed approval, audit trails, identifiable accountable humans. The OWASP Top 10 for Agentic Applications (9 December 2025) maps excessive agency, privilege escalation, and secret exfiltration directly to the controls above. PKCE is now mandatory under OAuth 2.1 and RFC 9700 --- even for confidential server-side clients --- so any CLI agent doing browser-based login should treat non-PKCE flows as a defect.
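For reference, the PKCE values a CLI agent would generate before opening the browser follow RFC 7636: a random code verifier and its S256 challenge. The function names below are illustrative; the derivation itself is the one the RFC specifies.

```python
import base64
import hashlib
import secrets


def make_code_verifier() -> str:
    """Return a high-entropy verifier (43 chars, URL-safe, unpadded)."""
    return base64.urlsafe_b64encode(secrets.token_bytes(32)).rstrip(b"=").decode("ascii")


def code_challenge_s256(verifier: str) -> str:
    """Derive the S256 challenge: BASE64URL(SHA256(ASCII(verifier)))."""
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


verifier = make_code_verifier()
challenge = code_challenge_s256(verifier)
```

The client sends `challenge` with `code_challenge_method=S256` in the authorization request and presents `verifier` when redeeming the code, so an intercepted authorization code is useless without the verifier.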


Example: Locked-Down Claude Code Project

// .claude/settings.json
{
  "permissions": {
    "allow": ["Read", "Glob", "Grep", "Bash(git:*)", "Bash(npm test)", "Bash(npm run build)"],
    "deny":  ["Bash(curl:*)", "Bash(wget:*)", "Bash(rm -rf:*)", "Bash(ssh:*)", "Bash(git push:*)"]
  },
  "sandbox": {
    "enabled": true,
    "allowUnsandboxedCommands": false,
    "network": { "allowedDomains": ["registry.npmjs.org", "github.com"] }
  },
  "hooks": {
    "PreToolUse": [{ "matcher": "Bash",
      "hooks": [{ "type": "command", "command": "python3 .claude/hooks/check_no_secrets.py" }] }]
  }
}

Read and search are free, a curated set of commands is allowed, network and destructive ops are denied, the sandbox runtime is hard-enabled, and a pre-tool hook scans for secrets the static lists can’t catch. Static rules express policy; hooks enforce runtime invariants; the OS-level sandbox is the floor underneath both.
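The hook script itself is not shown above, so the following is only a guess at what the core of check_no_secrets.py might look like. It assumes the hook payload arrives as JSON on standard input with the shell command under `tool_input.command`, and that exit code 2 signals "block this call"; the two regular expressions are illustrative and far from a complete scanner.

```python
import json
import re

# Illustrative patterns only; a real scanner would need a broader set.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                    # AWS access key ID shape
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),  # PEM private key header
]


def find_secrets(text: str) -> list:
    """Return the patterns that match anywhere in the text."""
    return [p.pattern for p in SECRET_PATTERNS if p.search(text)]


def check_payload(stream) -> int:
    """Read a hook payload from a text stream and return an exit code.

    Assumed payload shape: {"tool_input": {"command": "..."}}.
    Returns 2 (assumed to mean "block") on a match, 0 otherwise.
    """
    payload = json.load(stream)
    command = payload.get("tool_input", {}).get("command", "")
    return 2 if find_secrets(command) else 0
```

A complete script would finish with `sys.exit(check_payload(sys.stdin))` and print the reason for a block to stderr. Keeping the check in a function that takes a stream makes it testable without spawning the agent.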