AI Engineering Curriculum
Phase 5: AI Security & Safety

Module 5.5

Sandboxing Autonomous Agents

When your agent executes code, the output of that code runs on a real machine. The model doesn't know that. It generates subprocess.run(["rm", "-rf", "/"]) the same way it generates print("hello") — they're both just tokens. The difference between them isn't in the model. It's in whether you let either of them actually run.

Sandboxing is the architectural answer to that problem. It means: the agent's code runs in an isolated environment where even the worst possible output can't escape to hurt the host system or other users.

Real-World Use Cases

  • Code interpreter agents (think: AI that writes and runs Python on your behalf) need full isolation — the user's code is untrusted by definition
  • Research agents that scrape the web, parse PDFs, and process arbitrary file formats need containment in case a malicious file exploits a parser vulnerability
  • Customer-facing agents where multiple users share infrastructure need isolation between sessions — one user's injected code can't affect another's environment
  • DevOps agents that run shell commands need strict limits on what commands can even be attempted, let alone succeed

Key Terms

Namespace — a Linux kernel feature that isolates process views of resources (filesystem, network, process IDs). Docker uses namespaces extensively.

cgroup (control group) — limits how much CPU, memory, and I/O a process can consume. The enforcement mechanism for resource quotas.

Syscall (system call) — how a process asks the OS kernel to do something: read a file, open a socket, allocate memory. The kernel is the gatekeeper. If you control which syscalls are allowed, you control what the process can do.

microVM — a minimal virtual machine. Runs its own kernel, isolated at the hardware virtualization layer. Much lighter than a full VM, heavier than a container.

Cold start — the time from "start the sandbox" to "code can run." Critical for interactive agent workflows.
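On a Linux host you can see the namespace machinery directly: each entry under /proc/self/ns names one isolated resource view the current process belongs to. A quick, Linux-specific illustration:

```python
import os

# Each entry in /proc/self/ns names one namespace this process lives in:
# mnt = filesystem view, pid = process IDs, net = network stack, and so on.
# A containerized process gets fresh copies of these. Linux-specific.
namespaces = sorted(os.listdir("/proc/self/ns"))
print(namespaces)
```

A process inside a Docker container lists the same entries, but they point at different namespace objects than the host's.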


The Isolation Stack

There's a clear hierarchy of isolation technologies, from fastest-but-weakest to slowest-but-strongest:

Docker  <  gVisor  <  Firecracker / Kata microVMs
(weakest)   (mid)          (strongest)

Each step up the stack adds stronger isolation and slightly higher overhead. The right choice depends on how much you trust what's running inside.


Docker — Fast, Good Enough for Trusted Code

Docker uses Linux namespaces and cgroups to isolate containers. Processes inside can't see each other's filesystems or process trees. Resource consumption is capped.

The critical limitation: all containers on the same host share the host kernel. There is no separate kernel per container. If a piece of code inside the container exploits a kernel vulnerability, it can potentially escape to the host.

This is acceptable when you trust the code. It's not acceptable when the code was generated by an LLM based on instructions from an untrusted user.

Python
import docker

client = docker.from_env()

def run_code_in_docker(code: str, timeout: int = 10) -> str:
    container = client.containers.run(
        image="python:3.12-slim",
        command=["python", "-c", code],
        mem_limit="128m",       # hard memory cap
        cpu_quota=50000,        # 50% of one CPU core
        network_disabled=True,  # no network access
        read_only=True,         # read-only filesystem
        detach=True,            # detach so we can enforce our own timeout
    )
    try:
        container.wait(timeout=timeout)  # run() itself has no timeout kwarg
        return container.logs().decode("utf-8")
    finally:
        container.remove(force=True)     # always clean up, even on timeout

Use Docker when: you control the code being run, cold start speed matters, or you're running in a trusted internal environment.


gVisor — Better Isolation, Moderate Overhead

gVisor intercepts syscalls in user space before they reach the host kernel. It implements a synthetic kernel layer: the sandboxed process thinks it's talking to a real kernel, but it's actually talking to gVisor's user-space implementation, which then decides what to pass through.

The result: the host kernel never directly receives syscalls from sandboxed code. A kernel exploit inside the sandbox fails because the real kernel never sees it.

gVisor is used in production at Modal and Northflank. It's a reasonable middle ground — meaningfully better than Docker's isolation, without the full overhead of booting a separate kernel per execution.

The trade-off: not all syscalls are implemented. Some programs that rely on obscure kernel interfaces break inside gVisor. You need to test your specific workload.

Use gVisor when: code is "probably fine" but not fully trusted, or you need better-than-Docker isolation without the complexity of microVMs.


Firecracker microVMs — Maximum Isolation

Firecracker is the strongest isolation available for production use. It's the technology behind AWS Lambda — each invocation gets its own dedicated Linux kernel running in a hardware-virtualized microVM.

The key property: kernel exploits inside one microVM cannot reach other microVMs or the host, because the isolation boundary is hardware virtualization, not OS-level namespacing. There's no shared kernel to exploit.

E2B and Sprites.dev are managed platforms built on Firecracker specifically for agent workloads. They handle the infrastructure complexity and expose a clean SDK.

Python
from e2b_code_interpreter import Sandbox

def run_code_isolated(code: str) -> dict:
    # Each Sandbox() call spins up a fresh Firecracker microVM:
    # ~150ms cold start, its own Linux kernel, complete isolation
    with Sandbox() as sandbox:
        execution = sandbox.run_code(code)
        result = {
            "stdout": execution.logs.stdout,
            "stderr": execution.logs.stderr,
            "error": execution.error,
            "results": execution.results,
        }
    # The microVM is automatically destroyed when the context manager exits;
    # nothing that ran inside can persist to the host
    return result

E2B sessions also support pause and resume — you can snapshot a microVM's state mid-session, pause it, and resume later. Useful for long-running agent workflows that span multiple interactions.

Use Firecracker when: executing untrusted LLM-generated code, operating in compliance environments (SOC2, HIPAA), building multi-tenant systems where user isolation is a hard requirement, or when a security breach would be catastrophic.


Platform Comparison

Platform             Isolation                Cold Start   Session Limit            Best For
E2B                  Firecracker microVM      ~150ms       24hr (Pro)               Untrusted code, compliance
Daytona              Docker (Kata optional)   <90ms        Long-running             Persistent workspaces, speed
Sprites.dev          Firecracker microVM      Fast         Unlimited + checkpoint   Stateful long sessions
Modal                gVisor                   Fast         Per-invocation           Serverless Python, scale
Self-hosted Docker   Docker                   <50ms        Unlimited                Internal trusted workloads

The right answer isn't always the strongest isolation. A research agent running in a controlled internal environment doesn't need Firecracker. A public-facing code execution product absolutely does.


Rate Limiting and Circuit Breakers

Sandboxing isolates the blast radius of bad code. Rate limiting and circuit breakers control the volume of agent actions — preventing runaway loops, cost explosions, and denial-of-wallet attacks.

The circuit breaker pattern is borrowed from electrical engineering: if the system starts failing repeatedly, trip a breaker and stop execution rather than letting it spiral.

Python
import time
from collections import deque
from threading import Lock

class AgentRateLimiter:
    def __init__(
        self,
        max_actions_per_minute: int = 20,
        max_spend_per_hour_usd: float = 1.00,
        circuit_breaker_threshold: int = 5,  # consecutive failures before open
    ):
        self.max_actions = max_actions_per_minute
        self.max_spend = max_spend_per_hour_usd
        self.cb_threshold = circuit_breaker_threshold
        self._action_times: deque = deque()
        self._spend_log: list = []
        self._consecutive_failures = 0
        self._lock = Lock()

    def check_action(self) -> tuple[bool, str]:
        with self._lock:
            # Circuit breaker: too many consecutive failures → stop everything
            if self._consecutive_failures >= self.cb_threshold:
                return False, f"Circuit breaker open: {self._consecutive_failures} failures"
            # Rate check: prune old entries, check current window
            cutoff = time.time() - 60
            while self._action_times and self._action_times[0] < cutoff:
                self._action_times.popleft()
            if len(self._action_times) >= self.max_actions:
                return False, f"Rate limit: {self.max_actions} actions/minute exceeded"
            self._action_times.append(time.time())
            return True, "ok"

    def check_spend(self, estimated_cost_usd: float) -> tuple[bool, str]:
        with self._lock:
            cutoff = time.time() - 3600
            self._spend_log = [(t, c) for t, c in self._spend_log if t > cutoff]
            total = sum(c for _, c in self._spend_log)
            if total + estimated_cost_usd > self.max_spend:
                return False, f"Spend limit: ${self.max_spend}/hr exceeded (current: ${total:.4f})"
            self._spend_log.append((time.time(), estimated_cost_usd))
            return True, "ok"

    def record_failure(self):
        with self._lock:
            self._consecutive_failures += 1

    def record_success(self):
        with self._lock:
            self._consecutive_failures = 0

Wrap every significant agent action with check_action() before it runs. If it returns False, stop and surface the reason. The circuit breaker catches runaway loops that a simple rate limit won't — an agent that fails 5 times in a row is probably broken, not just slow.
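The breaker's behavior is easiest to see with a stripped-down stand-in for the class above (a minimal sketch; names are illustrative):

```python
# A bare-bones circuit breaker: after `threshold` consecutive failures it
# opens and refuses further actions until reset. Illustrative only.
class CircuitBreaker:
    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failures = 0

    def allow(self) -> bool:
        return self.failures < self.threshold

    def record(self, success: bool) -> None:
        self.failures = 0 if success else self.failures + 1

breaker = CircuitBreaker(threshold=3)
for ok in (True, False, False, False):  # one success, then a failing streak
    if breaker.allow():
        breaker.record(ok)
print(breaker.allow())  # False: the breaker opened after 3 straight failures
```

A plain rate limit would keep admitting these actions at 20/minute forever; the breaker stops the loop after the third consecutive failure.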


The Three-Layer Security Architecture

Sandboxing, rate limiting, and guardrails don't exist in isolation — they're part of a coordinated security architecture. In production, think of it as three layers:

Layer 1: Policy — what is allowed in principle

  • Data classification tiers (what data can this agent touch?)
  • Decision boundary rules (what categories of action are permitted?)
  • Compliance checkpoints (GDPR, HIPAA, SOC2 if relevant)

Layer 2: Configuration — enforced at startup, before any code runs

  • IAM / RBAC role binding (what credentials does the agent get?)
  • Prompt filtering rules (what topics are in-scope?)
  • Sandboxed execution environment (which isolation technology?)
  • Model registry (only approved models can be invoked)
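Concretely, Layer 2 can be captured in a single startup-time config object that is validated before the agent takes any action. A sketch; every name here is hypothetical, not a real framework's schema:

```python
# Hypothetical Layer 2 configuration: everything here is fixed at startup,
# before the agent executes anything. All names are illustrative.
AGENT_CONFIG = {
    "iam_role": "agent-readonly",             # least-privilege credentials
    "allowed_topics": ["billing", "orders"],  # prompt filtering scope
    "sandbox": "firecracker",                 # isolation technology
    "allowed_models": ["approved-model-v1"],  # model registry allowlist
}

def validate_model(model: str) -> None:
    # Enforced once, up front: an unapproved model never gets invoked
    if model not in AGENT_CONFIG["allowed_models"]:
        raise PermissionError(f"model {model!r} is not in the registry")
```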

Layer 3: Runtime — enforced continuously during execution

  • Rate limiters and circuit breakers
  • Continuous anomaly detection (unusual action patterns)
  • Immutable audit trail (every action logged, tamper-evident)
  • Kill switch / automated incident isolation

The failure mode to avoid: putting all your security at Layer 3. Runtime monitoring catches problems as they happen, but by then the agent may have already taken irreversible actions. Layers 1 and 2 are where you prevent the conditions for those actions to exist at all.


Choosing Your Sandbox

A simple decision tree:

  • The code was written by you or your team → Docker is fine
  • The code was written by an LLM but the user is internal/trusted → Docker with tight resource limits, or gVisor
  • The code was generated by an LLM for an untrusted external user → Firecracker (E2B, Sprites.dev)
  • You're building a multi-tenant product where users share infrastructure → Firecracker, mandatory
  • You have compliance requirements (SOC2, HIPAA) → Firecracker; E2B specifically has these certifications
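The decision tree above can be encoded as a small helper (a sketch; the inputs and return labels are illustrative, not a real API):

```python
# Encodes the decision tree above. Inputs and labels are illustrative.
def choose_sandbox(code_author: str, user_trusted: bool,
                   multi_tenant: bool = False, compliance: bool = False) -> str:
    if multi_tenant or compliance:
        return "firecracker"          # hard isolation requirement
    if code_author == "llm" and not user_trusted:
        return "firecracker"          # untrusted LLM-generated code
    if code_author == "llm":
        return "docker-or-gvisor"     # LLM code, trusted internal user
    return "docker"                   # code written by you or your team

print(choose_sandbox("llm", user_trusted=False))  # firecracker
```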

No sandbox is a substitute for least-privilege tool design. An agent that has no file delete tool can't delete files — regardless of what isolation technology surrounds it. Sandboxing is the last line of defense, not the first.
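In practice that principle looks like a tool registry that simply omits dangerous capabilities (a minimal sketch with hypothetical tool names):

```python
import pathlib

# Least-privilege tool design: the registry never contains a delete tool,
# so no injected instruction can invoke one. Names are hypothetical.
TOOLS = {
    "read_file": lambda path: pathlib.Path(path).read_text(),
    "list_dir": lambda path: [p.name for p in pathlib.Path(path).iterdir()],
}

def call_tool(name: str, *args):
    if name not in TOOLS:
        raise PermissionError(f"tool {name!r} is not registered")
    return TOOLS[name](*args)
```

An attacker who fully controls the model's output still can't reach a capability that was never wired in.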

Sources