
Most "code execution for AI agents" is a security incident waiting to ship. A subprocess.run in a try/except, a Docker container sharing a kernel with your app, a Modal function anyone in your org can invoke with whatever code they want. The dream is an agent that writes and runs code. The reality is usually a privilege escalation you have not noticed.
E2B, the open-source sandbox runtime built on Firecracker microVMs, is one of the few projects that treats the problem as the problem. It might not be the cheapest. It is the one that does not lie to you about isolation.
When an LLM generates Python and you execute it, you are running untrusted code on infrastructure you also run trusted code on. Chroot, namespaces, subprocess.run, even full Docker containers — they all share a kernel with the host. A container escape is a kernel vulnerability. ptrace is one misconfiguration away from "your agent just read /etc/shadow."
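The shared-kernel claim is easy to check on your own machine. The sketch below uses only the standard library: it launches a child interpreter with subprocess.run and compares the kernel release the child reports with the one the parent sees. The helper name `kernel_seen_by_child` is made up for illustration.

```python
import platform
import subprocess
import sys


def kernel_seen_by_child() -> str:
    """Run a child Python process and return the kernel release it reports."""
    result = subprocess.run(
        [sys.executable, "-c", "import platform; print(platform.release())"],
        capture_output=True,
        text=True,
        check=True,
        timeout=30,
    )
    return result.stdout.strip()


host_kernel = platform.release()
child_kernel = kernel_seen_by_child()

print(f"host:  {host_kernel}")
print(f"child: {child_kernel}")
# The child is a separate process, but it talks to the very same kernel.
print("shared kernel:", host_kernel == child_kernel)
```

A process boundary is not a kernel boundary. A kernel bug reachable from the child is reachable from whatever the model wrote.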
Firecracker, the microVM hypervisor Amazon built for Lambda, sidesteps the shared-kernel problem by running each sandbox as a guest VM with its own minimal kernel. The host runs KVM; the guest runs a stripped-down image. No docker.sock to escape through. No shared PID namespace. The isolation boundary is the same one separating EC2 instances — a boundary Amazon has spent a decade hardening.
E2B wraps that primitive in a small SDK. Sandbox.create() boots a new microVM and gives you commands.run, files.write, and runCode against an environment that cannot see your host. Cold start is 150-200ms. That is fast for a VM, and fast for anything that does not lie about isolation.
```python
from e2b import Sandbox

with Sandbox.create() as sandbox:
    result = sandbox.commands.run("python -c 'print(2+2)'")
    print(result.stdout)  # 4
```
That is the whole surface for the common case. Every line of the SDK is attack surface, and E2B keeps it minimal.
The Code Interpreter SDK is what most teams actually use. It ships a sandbox pre-loaded with Python, Node, and the usual data-science libraries. The agent calls runCode("import pandas as pd; df = pd.read_csv(...)") and gets back stdout, stderr, the rich output, and the final expression's value. No exec(), no manual sandboxing, no 2am __import__('os').system('rm -rf /') debugging.
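Here is a rough sketch of how an agent loop might consume that result. It rests on assumptions you should check against the SDK version you install: that the Python package imports as `e2b_code_interpreter`, that `run_code` is the Python spelling of runCode, and that the result exposes `.logs.stdout`, `.logs.stderr`, `.text`, and `.error`. The helper names `summarize_execution` and `run_in_sandbox` are mine, not the SDK's. A stand-in object lets the helper run without a sandbox or an API key.

```python
from types import SimpleNamespace
from typing import Any


def summarize_execution(execution: Any) -> dict:
    """Flatten an execution result into plain data an agent loop can consume.

    Assumption: the result exposes .logs.stdout / .logs.stderr (lists of
    strings), .text (value of the final expression, or None) and .error
    (None on success).
    """
    return {
        "stdout": "".join(execution.logs.stdout),
        "stderr": "".join(execution.logs.stderr),
        "value": execution.text,
        "ok": execution.error is None,
    }


def run_in_sandbox(code: str) -> dict:
    """Run model-generated code in a fresh sandbox (needs the SDK and an API key)."""
    # Imported lazily so the rest of this sketch works without the SDK installed.
    from e2b_code_interpreter import Sandbox  # assumption: package/import name

    with Sandbox.create() as sandbox:
        return summarize_execution(sandbox.run_code(code))


# Offline stand-in with the assumed shape, so the helper can be exercised locally.
fake = SimpleNamespace(
    logs=SimpleNamespace(stdout=["4\n"], stderr=[]),
    text=None,
    error=None,
)
summary = summarize_execution(fake)
print(summary)
```

Handing the agent a plain dict instead of the SDK object keeps the model-facing contract small and easy to log.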
The supporting infrastructure is what makes it production-grade. Sandboxes have a 24-hour lifetime and configurable CPU/memory. The file system is ephemeral by default; mount a template or persist a directory if you need state. Network access is off by default — opt in per sandbox — which is the right default for model-generated code. The orchestrator is open source, deployable to AWS, GCP, Azure, or generic Linux via Terraform.
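One way to keep those defaults honest in your own code is to pin them in a small policy object that every sandbox request has to pass through. The sketch below is hypothetical: `SandboxPolicy` and its fields are not E2B types, and the 300-second default and the resource floors are my assumptions. Only the 24-hour ceiling and the network-off default come from the description above.

```python
from dataclasses import dataclass

MAX_LIFETIME_SECONDS = 24 * 60 * 60  # the 24-hour ceiling described above


@dataclass(frozen=True)
class SandboxPolicy:
    """Hypothetical per-sandbox settings; not an E2B type."""

    timeout_seconds: int = 300   # assumption: a short default lifetime
    allow_network: bool = False  # network must be opted into, per sandbox
    cpu_count: int = 1           # assumption: illustrative resource defaults
    memory_mb: int = 512

    def __post_init__(self) -> None:
        if not 0 < self.timeout_seconds <= MAX_LIFETIME_SECONDS:
            raise ValueError("timeout_seconds must be between 1 and 86400")
        if self.cpu_count < 1 or self.memory_mb < 128:
            raise ValueError("cpu_count must be >= 1 and memory_mb >= 128")


default_policy = SandboxPolicy()
scraper_policy = SandboxPolicy(timeout_seconds=3600, allow_network=True)

print(default_policy)
print(scraper_policy)
```

Anything that needs the network then has to say so in a place a reviewer can grep for.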
The self-hosting story matters for regulated use. Run the entire E2B stack inside your own VPC. Agent code never leaves your network. For finance, healthcare, or defense-adjacent work, that is the entire decision.
Firecracker is the right primitive, and it is not free. The 150-200ms cold start is fast for a VM and slow for a tight agent loop. The optimization is to keep sandboxes warm and reuse them, but if a warm sandbox is reused across users, executions from different users share a guest kernel again, which is exactly what Firecracker was meant to prevent. The unit of isolation in E2B is the sandbox lifetime, not the runCode call.
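If you do keep sandboxes warm, the least bad version I know of is to key them by user, so reuse only ever happens inside one trust domain. The sketch below shows that idea with a stand-in factory. `PerUserSandboxPool` and the `create` and `destroy` callables are hypothetical names, not part of any SDK. A real factory would boot a microVM where the stand-in makes a plain object.

```python
from typing import Callable, Dict, Generic, TypeVar

S = TypeVar("S")


class PerUserSandboxPool(Generic[S]):
    """Keep at most one warm sandbox per user; never share one across users."""

    def __init__(self, create: Callable[[], S], destroy: Callable[[S], None]) -> None:
        self._create = create
        self._destroy = destroy
        self._by_user: Dict[str, S] = {}

    def acquire(self, user_id: str) -> S:
        # Pay the cold start once per user, then reuse within that user's session.
        if user_id not in self._by_user:
            self._by_user[user_id] = self._create()
        return self._by_user[user_id]

    def release(self, user_id: str) -> None:
        # End of session: destroy the sandbox so its state cannot leak forward.
        sandbox = self._by_user.pop(user_id, None)
        if sandbox is not None:
            self._destroy(sandbox)


# Stand-in factory for local experimentation.
created: list = []
destroyed: list = []


def fake_create() -> object:
    box = object()
    created.append(box)
    return box


pool = PerUserSandboxPool(create=fake_create, destroy=destroyed.append)
alice_first = pool.acquire("alice")
alice_again = pool.acquire("alice")  # warm reuse, same user
bob = pool.acquire("bob")            # different user, different sandbox
pool.release("alice")

print(len(created), len(destroyed))
```

This trades memory for latency, and it still leaves state from one call visible to the next call by the same user. That is the sandbox-lifetime boundary in practice.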
The open-source stack assumes KVM. macOS developers on Apple Silicon are out of luck on the self-hosted path; the hosted version handles it, but on-prem means x86 hardware. There is no path around it — that is what the architecture requires.
The biggest gap is observability. The SDK gives you stdout, stderr, and exit codes, not a sysdig-grade trace of every syscall. For most teams this is fine. For teams that need to prove to a security reviewer that an agent did not exfiltrate data, you are doing the instrumentation yourself.
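A first step toward doing it yourself is an audit record for every execution. The sketch below wraps any runner in a decorator that appends one JSON line per call. It is a minimal illustration, not a syscall trace: it records what was submitted and what came back, and nothing about what the code did in between. `Runner`, `audited`, and `fake_run` are hypothetical names, and the stand-in runner takes the place of a real sandbox call.

```python
import hashlib
import json
import time
from typing import Callable, List, Tuple

# A runner takes source code and returns (stdout, stderr, exit_code).
Runner = Callable[[str], Tuple[str, str, int]]


def audited(run: Runner, sink: List[str]) -> Runner:
    """Wrap a runner so every execution appends one JSON line to `sink`."""

    def wrapper(code: str) -> Tuple[str, str, int]:
        started = time.time()
        stdout, stderr, exit_code = run(code)
        sink.append(json.dumps({
            "code_sha256": hashlib.sha256(code.encode("utf-8")).hexdigest(),
            "started_at": started,
            "duration_s": round(time.time() - started, 6),
            "exit_code": exit_code,
            "stdout_bytes": len(stdout.encode("utf-8")),
            "stderr_bytes": len(stderr.encode("utf-8")),
        }))
        return stdout, stderr, exit_code

    return wrapper


# Stand-in runner; in practice this would call into a sandbox.
def fake_run(code: str) -> Tuple[str, str, int]:
    return ("4\n", "", 0)


audit_log: List[str] = []
run = audited(fake_run, audit_log)
output = run("print(2+2)")
record = json.loads(audit_log[0])

print(record["code_sha256"][:12], record["exit_code"])
```

Hashing the code rather than storing it keeps the log small. Store the full source somewhere append-only if a reviewer will want to read it.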
Use E2B when you are running untrusted code where a sandbox escape has real consequences. Customer-facing agents. Internal tools where the model touches production data. Any workflow where the model has access to credentials, network, or a file system that matters.
Skip it when the code is trusted (just use subprocess) or when you need sub-50ms latency in a tight loop (keep one warm sandbox and accept the weaker isolation). And do not treat it as a boundary against an actively hostile model — the threat model for a state-level attacker is "is your hypervisor patched this week."
Most of the "AI agent execution" layer in 2026 is someone who watched a tutorial, ran subprocess, and shipped it. Some wrapped that in a Docker container and called it a sandbox. E2B started from the right primitive — Firecracker microVMs, the same thing Amazon uses to isolate Lambda — and built a developer experience small enough to audit and fast enough to be useful.
Daytona, Modal, and open-source alternatives exist. E2B is the one I would bet a security review on, because the architecture forces the right decisions. If you are building agents that execute code and you are not using microVMs, you have not understood the threat model.
— Mr. Technology
E2B is open source at github.com/e2b-dev/E2B. Apache 2.0 SDK, MIT-licensed infra, self-hostable on AWS, GCP, Azure, or generic Linux via Terraform. Firecracker microVMs, 150-200ms cold start, 24h sandbox lifetime, opt-in network access. Code Interpreter SDK for Python and Node. Used in production by Hugging Face, Perplexity, and a long tail of agent startups. 8.5K+ stars, active 2026 development.