Sandboxing
Instruction files define what an agent should do; sandboxing defines what it can do. An agent that can execute code, write files, and make network calls with no isolation layer is equivalent to handing root access to a probabilistic system: one misread prompt can exfiltrate secrets, delete production state, or provision cloud resources. Sandboxing is the part of the harness that still holds when the model gets it wrong.
Cross-Platform Comparison
| System | Sandboxing Approach | Permission Model |
|---|---|---|
| Claude Code | Seatbelt (macOS) / bubblewrap (Linux/WSL2) + local domain-allowlist network proxy. Open-sourced as @anthropic-ai/sandbox-runtime. WSL1 unsupported; native Windows planned. | Allow / deny / ask rules in settings.json + lifecycle hooks |
| OpenAI Codex / Responses API | OS-enforced local sandbox (Seatbelt on macOS, Landlock + seccomp on Linux) + cloud sandboxes with two-phase runtime | Composable --sandbox <profile> --ask-for-approval <mode>; legacy Suggest / Auto-Edit / Full-Auto deprecated (--full-auto warns on use) |
| Google ADK | Code execution sandbox (Agent Engine Executor managed; GKE Executor self-managed) + VPC-SC + hermetic mode | Agent-auth vs user-auth (delegated OAuth) |
| LangGraph | Middleware policies + tool restrictions | Per-tool gates + human-in-the-loop interrupts |
All four converge on the same principle: the agent’s execution environment is a strict subset of the host’s capabilities. No major platform ships “run anything, anywhere” as the default any longer.
The Two-Phase Runtime
sequenceDiagram
participant R as Runtime
participant S as Setup Phase
participant E as Execution Phase
participant P as Policy Proxy
R->>S: Start container
Note over S: Network: ON<br/>Secrets: AVAILABLE
S->>S: Install deps, clone repos, auth APIs
S->>R: Setup complete
R->>E: Flip to execution
Note over E: Network: OFF (default)<br/>Secrets: PLACEHOLDERS ONLY
E->>P: Outbound request (if enabled)
Note over P: Allow-list check<br/>Substitute real secret
P-->>E: Response (raw secret never seen by model)
OpenAI’s Responses API (and the 2026 cohort that copied it) splits agent execution into two phases. Setup has network access and can read configured secrets to install dependencies and authenticate APIs. When setup completes, the runtime flips to execution: network is off by default, and the model only sees secret placeholders --- raw values stay in an external substitution layer. When network is re-enabled, outbound traffic is forced through a centralized policy proxy that enforces domain allow-lists and swaps placeholders for real credentials only at the egress boundary. The model can’t exfiltrate what it can’t see.
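The egress step described above can be sketched in a few lines of Python. This is a minimal illustration, not OpenAI's implementation: the `PolicyProxy` class, its field names, the `{{API_TOKEN}}` placeholder format, and the example host are all hypothetical.

```python
from dataclasses import dataclass, field
from urllib.parse import urlsplit


class EgressDenied(Exception):
    """Raised when a request targets a host outside the allow-list."""


@dataclass
class PolicyProxy:
    # Hypothetical structure: names and placeholder format are illustrative.
    allowed_hosts: frozenset
    secrets: dict = field(default_factory=dict)  # placeholder -> real value

    def prepare(self, url: str, headers: dict) -> dict:
        """Return the headers to send upstream, or raise EgressDenied."""
        host = urlsplit(url).hostname or ""
        if host not in self.allowed_hosts:
            raise EgressDenied(f"host not on allow-list: {host!r}")
        resolved = {}
        for name, value in headers.items():
            # Real values are swapped in only here, at the egress boundary.
            for placeholder, real in self.secrets.items():
                value = value.replace(placeholder, real)
            resolved[name] = value
        return resolved


proxy = PolicyProxy(
    allowed_hosts=frozenset({"api.example.com"}),
    secrets={"{{API_TOKEN}}": "s3cr3t-value"},
)
# The sandboxed side only ever handles the placeholder string.
model_headers = {"Authorization": "Bearer {{API_TOKEN}}"}
upstream_headers = proxy.prepare("https://api.example.com/v1/items", model_headers)
```

The design point is that `prepare` returns a new dictionary: the headers the model constructed are never mutated, so the real credential exists only on the proxy side of the boundary.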
Container & Isolation Tier: the 2026 Shift
Shared-kernel Docker/runc is insufficient for untrusted agent code: a kernel exploit inside the container compromises the host. The production stack now looks like this:
- Firecracker microVMs: ~125 ms boot, hardware-enforced boundary (AWS, Northflank, Docker Sandboxes).
- gVisor: user-space kernel, 10-30% I/O overhead, minimal compute cost.
- Kata Containers: VM-grade isolation with container UX.
- Wasm isolates: Wassette, Extism --- sub-10 ms startup, memory-safe.
- New entrants (2026): Docker Sandboxes (microVM-per-sandbox with private daemon), Microsoft LiteBox library-OS.
The rule of thumb: if untrusted code might run in it, the boundary should be hardware-enforced or formally isolated, not just namespaced.
Permission Models
| Model | How It Works | Default Posture |
|---|---|---|
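| Allow / Deny (Claude Code) | Rules in settings.json using Tool(specifier) syntax; for Bash, a trailing :* turns the rule into a prefix match (e.g. Bash(git:*) covers any command starting with git). Evaluation order: deny -> ask -> allow, first match wins, so deny always takes precedence. Hooks can override. | Prompt unless allowed; deny rules always win |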
| Trust-Based (Codex) | Project must be explicitly trusted; per-invocation --sandbox and --ask-for-approval flags pick autonomy level | Untrusted until approved |
| Identity-Based (ADK) | Agent-auth (service account, agent’s own actions) vs user-auth (delegated OAuth on behalf of the user). VPC-SC at the network boundary | Scoped per tool to declared auth context |
| Role-Based (general) | RBAC; roles bundle tool access, FS scope, network reach | Zero permissions; explicit grants only |
Different surfaces, same principle: least privilege, explicit grants.
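The deny -> ask -> allow ordering from the table can be sketched as a small evaluator. This is a simplified model, assuming shell-style glob patterns from Python's `fnmatch` in place of any platform's real rule syntax; the `decide` function and its `default` parameter are hypothetical.

```python
from fnmatch import fnmatchcase


def decide(command, deny=(), ask=(), allow=(), default="deny"):
    """Return 'deny', 'ask', or 'allow' for a command string.

    Rule lists are checked in deny -> ask -> allow order and the first
    list containing a matching pattern wins. `default` applies when
    nothing matches (least privilege: no grant, no access).
    """
    for decision, patterns in (("deny", deny), ("ask", ask), ("allow", allow)):
        if any(fnmatchcase(command, pattern) for pattern in patterns):
            return decision
    return default


rules = {
    "deny": ["git push*", "curl*"],
    "ask": ["npm install*"],
    "allow": ["git *", "npm test"],
}

status = decide("git status", **rules)            # matched only by allow
push = decide("git push origin main", **rules)    # deny beats the broader allow
install = decide("npm install left-pad", **rules)
unknown = decide("rm -rf /tmp/x", **rules)        # no rule matches
```

Because deny is checked first, a broad allow such as `git *` cannot accidentally re-enable a narrower denied command such as `git push`.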
Plugin Trust Boundary
Sandboxing the agent core is only half the job. Plugins, skills, and MCP servers extend the harness with new code --- and the isolation story collapses if you don’t think about them explicitly.
More isolated <------------------------------------------------------------> Less isolated
Cloud sandbox      Plugin (Wasm)         MCP server            Plugin (in-process)
(Codex)            (Wassette/Extism)     (subprocess/stdio)    (dynamic linking)
The load-bearing fact: subprocess MCP via stdio inherits the parent process’s environment, filesystem permissions, and network access. That is not isolation. It is process separation with full privilege passthrough. A malicious MCP server can read your ~/.aws/credentials because the harness can.
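The environment half of that passthrough is easy to demonstrate, and easy to narrow. The sketch below spawns a child process twice, once with the inherited environment and once with a scrubbed one. The `SAFE_ENV_KEYS` allow-list and helper names are hypothetical, and scrubbing the environment does nothing about filesystem or network access, which still need an OS-level sandbox.

```python
import os
import subprocess
import sys

# Hypothetical minimal allow-list of variables a child process may see.
SAFE_ENV_KEYS = ("PATH", "LANG", "SYSTEMROOT")


def scrubbed_env(extra=None):
    """Build an environment containing only allow-listed variables."""
    env = {k: os.environ[k] for k in SAFE_ENV_KEYS if k in os.environ}
    env.update(extra or {})
    return env


def child_sees(var, env=None):
    """Spawn a child Python process and report whether `var` is set in it.

    With env=None, subprocess passes the parent's full environment through,
    which is what a stdio-launched server gets by default.
    """
    code = "import os, sys; print(sys.argv[1] in os.environ)"
    out = subprocess.run(
        [sys.executable, "-c", code, var],
        env=env, capture_output=True, text=True, check=True,
    )
    return out.stdout.strip() == "True"


os.environ["DEMO_SECRET_TOKEN"] = "not-a-real-secret"
inherited = child_sees("DEMO_SECRET_TOKEN")                  # default passthrough
scrubbed = child_sees("DEMO_SECRET_TOKEN", scrubbed_env())   # explicit allow-list
```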
Emerging Wasm-isolated alternatives --- Wassette (Microsoft’s Rust+Wasm+MCP bridge), Extism, Hyper MCP --- give per-plugin memory safety, capability-scoped imports, and Sigstore/Cosign signature verification. These are the right substrate for untrusted third-party tools.
The ecosystem is not safe by default. The Claude plugin marketplace has version pinning but no binary signing. Snyk’s February 2026 ToxicSkills audit of 3,984 ClawHub skills found 13.4% with critical vulnerabilities and 76 with confirmed malicious payloads. CVE-2025-54136 covered Cursor’s permanent-trust bug: once an MCP configuration was approved, later changes to it were never revalidated. Assume every installed plugin runs as fully trusted code with your credentials, and vet it accordingly.
Human-in-the-Loop Patterns
- Permission gates: specific (per action, not category), contextual (show the exact command and reasoning), blocking (no silent auto-approve on timeout).
- Approval workflows for destructive ops: single approval with undo for file deletion, dual approval for production config, multi-party thresholds for financial transactions.
- Time-bounded grants: 5-minute windows for customer-facing actions and ~60-minute windows for internal approvals are common defaults. Auto-escalate to backup approvers on timeout; never silently extend.
- Audit trails: timestamp, action, permission basis, result --- logged for every gated decision. Required for incident response and increasingly for compliance.
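A time-bounded grant and its audit record can be combined in one gate function. The following is a minimal sketch: the `Grant` dataclass, the `gate` function, and the JSON field names are hypothetical, and the "escalated" result only marks where a real system would notify backup approvers.

```python
import json
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    action: str          # the exact action approved, not a category
    approver: str
    issued_at: float     # seconds since the epoch
    ttl_seconds: float

    def is_valid(self, now: float) -> bool:
        return now - self.issued_at < self.ttl_seconds


def gate(action, grant, audit_log, now=None):
    """Decide whether `action` may run and append one audit record."""
    now = time.time() if now is None else now
    if grant is None or grant.action != action:
        result, basis = "blocked", "no matching grant"
    elif not grant.is_valid(now):
        # Expired grants are escalated, never silently extended.
        result, basis = "escalated", f"grant from {grant.approver} expired"
    else:
        result, basis = "allowed", f"grant from {grant.approver}"
    audit_log.append(json.dumps({
        "timestamp": now,
        "action": action,
        "permission_basis": basis,
        "result": result,
    }))
    return result


log = []
grant = Grant("refund:order-123", "alice", issued_at=1000.0, ttl_seconds=300)
within_window = gate("refund:order-123", grant, log, now=1100.0)
after_window = gate("refund:order-123", grant, log, now=1400.0)
other_action = gate("refund:order-999", grant, log, now=1100.0)
```

Every call writes a record, including blocked and escalated ones, so the log covers each gated decision and not just the approvals.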
Regulatory Context
The EU AI Act becomes generally applicable on 2 August 2026, with some high-risk obligations for regulated products following in August 2027; prohibitions on unacceptable-risk systems have applied since February 2025, and general-purpose AI obligations since August 2025. California SB-833 (effective 2026-07-01) is the first US state law to codify human-in-the-loop requirements for agentic systems performing high-risk actions: time-boxed approval, audit trails, identifiable accountable humans. The OWASP Top 10 for Agentic Applications (9 December 2025) maps excessive agency, privilege escalation, and secret exfiltration directly to the controls above. PKCE is required for authorization-code flows in the OAuth 2.1 draft, including for confidential server-side clients, and RFC 9700 requires it for public clients and recommends it for confidential ones, so any CLI agent doing browser-based login should treat non-PKCE flows as a defect.
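For the PKCE point, the client-side work is small. The sketch below generates a verifier and its S256 challenge as defined in RFC 7636; the function names are illustrative, and the surrounding authorization request and token exchange are omitted.

```python
import base64
import hashlib
import secrets


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as RFC 7636 requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def challenge_for(verifier: str) -> str:
    """S256 code_challenge: BASE64URL(SHA256(ASCII(code_verifier)))."""
    return _b64url(hashlib.sha256(verifier.encode("ascii")).digest())


def make_pkce_pair():
    """Return (code_verifier, code_challenge) for one login attempt."""
    verifier = _b64url(secrets.token_bytes(32))  # 43 characters of high entropy
    return verifier, challenge_for(verifier)


verifier, challenge = make_pkce_pair()
```

The client sends `code_challenge` with `code_challenge_method=S256` in the authorization request and the matching `code_verifier` in the token request, so an intercepted authorization code cannot be redeemed without the verifier.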
Example: Locked-Down Claude Code Project
// .claude/settings.json
{
  "permissions": {
    "allow": ["Read", "Glob", "Grep", "Bash(git:*)", "Bash(npm test)", "Bash(npm run build)"],
    "deny": ["Bash(curl:*)", "Bash(wget:*)", "Bash(rm -rf:*)", "Bash(ssh:*)", "Bash(git push:*)"]
  },
  "sandbox": {
    "enabled": true,
    "allowUnsandboxedCommands": false,
    "network": { "allowedDomains": ["registry.npmjs.org", "github.com"] }
  },
  "hooks": {
    "PreToolUse": [{ "matcher": "Bash",
      "hooks": [{ "type": "command", "command": "python3 .claude/hooks/check_no_secrets.py" }] }]
  }
}
Read and search are free, a curated set of commands is allowed, network and destructive ops are denied, the sandbox is enabled with no unsandboxed escape hatch, and a pre-tool hook (which receives the pending tool call as JSON on stdin) scans for secrets the static lists can’t catch. Static rules express policy; hooks enforce runtime invariants; the OS-level sandbox is the floor underneath both.
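The hook script itself is not shown in the configuration, so here is one possible shape for check_no_secrets.py. It is a sketch under stated assumptions: the hook receives the pending tool call as JSON on stdin with the Bash command under `tool_input.command`, and exit code 2 blocks the call with stderr passed back to the agent. The three regexes are illustrative and far from a complete secret scanner.

```python
#!/usr/bin/env python3
import json
import re
import sys

# Illustrative patterns only; a real scanner needs a much broader set.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                    # AWS access key ID shape
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),  # PEM private key header
    re.compile(r"(?i)\b(api[_-]?key|token|secret)\s*=\s*\S{16,}"),
]


def find_secrets(text):
    """Return the patterns that match anywhere in `text`."""
    return [p.pattern for p in SECRET_PATTERNS if p.search(text)]


def main(stdin=sys.stdin, stderr=sys.stderr):
    """Read one hook event from stdin; return 2 to block, 0 to allow."""
    event = json.load(stdin)
    command = event.get("tool_input", {}).get("command", "")
    hits = find_secrets(command)
    if hits:
        print(
            f"Blocked: command appears to contain a secret "
            f"({len(hits)} pattern(s) matched)",
            file=stderr,
        )
        return 2
    return 0

# When installed as a hook, finish the script with: raise SystemExit(main())
```

Taking the streams as parameters keeps `main` testable without spawning a process; the entry-point line is left as a comment so the sketch can be imported or pasted into a REPL without blocking on stdin.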