Code execution with MCP: How sandboxed Python replaces tool schema bloat in AI agents

Apr 23, 2026 | AI

As the number of tools connected to an AI agent grows, JSON Schema definitions become a massive scaling bottleneck. Every tool carries a full schema that gets loaded into the LLM’s context window on every turn. Our tests show that replacing these schemas with a sandboxed Python approach can cut token overhead by 53% on identical tasks, and the savings only increase with scale.

Here at Red Hat’s Emerging Technologies group, we built codemode-lite, an open source Model Context Protocol (MCP) server, to test a different approach. Instead of exposing individual tools to the agent, we expose one tool: run_python. The agent writes Python code that executes inside an isolated sandbox, and the code calls all the tools through a remote procedure call (RPC). In our tests with 38 tools across 4 MCP servers, this reduced token overhead by 53% on identical tasks, and the savings grew as tool count increased.

Note: Red Hat’s Emerging Technologies blog includes posts that discuss technologies that are under active development in upstream open source communities and at Red Hat. We believe in sharing early and often the things we’re working on, but we want to note that unless otherwise stated the technologies and how-tos shared here aren’t part of supported products, nor promised to be in the future.

The approach draws from research published by both Anthropic and Cloudflare. Both teams independently reached the same conclusion: LLMs are better at writing code to call APIs than they are at making structured tool calls, because they have been trained on millions of lines of real API-calling code but very few examples of tool-calling formats. We built codemode-lite to validate these claims with real measurements.

Tool schemas don’t scale

MCP’s tool-calling pattern is straightforward. The LLM receives a list of available tools with their schemas, picks one, calls it, reads the result, and decides what to do next. Three things break down as you add more tools.

Schema overhead compounds every turn. Each tool’s JSON Schema includes parameter names, types, descriptions, and enum values. With dozens of tools across multiple MCP servers, this can exceed 15,000 tokens of schema definitions in the context window. These tokens ship with every API call to the LLM, whether the agent uses those tools or not.

Results accumulate in context. Each tool call appends its response to the conversation history. By the seventh call in a multistep task, the LLM is rereading six previous results just to decide what to call next. The context grows linearly with each call, so the total number of tokens processed over the task grows roughly quadratically with the number of sequential tool calls.
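A rough way to see the effect is to model it. The sketch below assumes every tool result has the same size and that the schemas are resent on every turn. The token counts are illustrative assumptions, not measurements from our tests.

```python
# Illustrative model of context growth under sequential tool calling.
# All sizes are hypothetical; they are not measurements from this post.
SCHEMA_TOKENS = 15_000   # schemas resent on every turn (assumed)
RESULT_TOKENS = 1_000    # size of each tool result (assumed)


def context_at_turn(turn: int) -> int:
    """Tokens the LLM reads on a given turn (1-indexed)."""
    # Before call N, the history already holds the N-1 earlier results.
    return SCHEMA_TOKENS + (turn - 1) * RESULT_TOKENS


def cumulative_tokens(turns: int) -> int:
    """Total tokens read across all turns of the task."""
    return sum(context_at_turn(t) for t in range(1, turns + 1))


for n in (1, 7, 14):
    print(n, context_at_turn(n), cumulative_tokens(n))
```

Under these assumptions the per-turn context rises by a fixed amount each call, while the cumulative total picks up a term proportional to the square of the number of calls.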

No parallel execution. Traditional MCP tool calling is one call per LLM turn. If two operations are independent, they still require two sequential round-trips. The agent cannot express “do these three things at the same time.”

One tool, sandboxed execution

codemode-lite implements the code execution pattern as a standalone MCP server with three layers.

Figure 1: The agent sees one tool (run_python). The codemode server connects to MCP servers lazily and routes tool calls from the sandbox via RPC.

The agent LLM sees a single run_python tool consuming roughly 220 tokens of schema. That number stays constant whether you have 3 MCP servers behind it or 30. The LLM writes Python, the code runs in isolation, and tool calls proxy back to real MCP servers through JSON-RPC over stdin/stdout.

This works with any LLM and any MCP client. We tested with Claude (via Claude Code) and Gemini (via a LangChain agent), and the architecture imposes no model-specific constraints. If your client speaks MCP over stdio, it works.

Why lazy server loading

Early versions connected to all configured MCP servers at startup. This meant every session paid the connection cost for every server, even if the agent only needed one for a given query. We changed this to lazy loading: servers register their configs at startup but connect only when the agent explicitly requests them via a servers parameter.

The practical effect is that a query touching only one service connects to that server and ignores everything else. Connection state is cached, so the second query touching the same server pays zero overhead.

Figure 2: Servers register at startup but connect only when the agent passes servers=[…]. Unused servers never connect.
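The register-now, connect-later behavior can be sketched as follows. This is a minimal illustration, not the codemode-lite code: the class name LazyServerRegistry, its methods, and the fake connector are all hypothetical.

```python
import asyncio
from typing import Any, Awaitable, Callable

# Hypothetical names throughout; this is not the codemode-lite API.
Connector = Callable[[dict], Awaitable[Any]]


class LazyServerRegistry:
    """Stores server configs eagerly and opens connections on first use."""

    def __init__(self, connect: Connector) -> None:
        self._connect = connect
        self._configs: dict[str, dict] = {}
        self._connections: dict[str, Any] = {}

    def register(self, name: str, config: dict) -> None:
        # Cheap: only stores the config, no network or process activity.
        self._configs[name] = config

    async def get(self, name: str) -> Any:
        # Connect on the first request, then reuse the cached connection.
        if name not in self._connections:
            self._connections[name] = await self._connect(self._configs[name])
        return self._connections[name]

    def connected(self) -> list[str]:
        return sorted(self._connections)


async def demo() -> tuple[list[str], int]:
    attempts = 0

    async def fake_connect(config: dict) -> str:
        # Stand-in for opening a real MCP session.
        nonlocal attempts
        attempts += 1
        return f"session:{config['url']}"

    registry = LazyServerRegistry(fake_connect)
    registry.register("calendar", {"url": "http://localhost:4000/sse"})
    registry.register("search", {"url": "http://localhost:4001/sse"})

    await registry.get("calendar")   # first use: connects
    await registry.get("calendar")   # second use: served from the cache
    return registry.connected(), attempts


connected_servers, connect_attempts = asyncio.run(demo())
print(connected_servers, connect_attempts)
```

In this sketch the search server is registered but never connected, and the calendar server is connected exactly once.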

Runtime discovery instead of listing tool names

An earlier iteration listed all tool names in the run_python description. With dozens of tools, that added significant tokens to the description itself, partially defeating the purpose. We removed tool names entirely and instead gave the agent three discovery functions available inside the sandbox:

  • _discover() lists all available tools with names and descriptions
  • _schema('tool-name') returns the full input schema including required and optional parameters
  • _search('keyword') finds tools by name or description

The agent sees server names with short descriptions in the tool definition, so it knows which servers to load. Then it discovers actual tools and their schemas at runtime. This is progressive disclosure: context tokens are spent only on tools the agent actually uses.

In the Podman backend, _schema() reads from JSON metadata baked into the container entrypoint at startup, avoiding an RPC round-trip. For tools loaded on demand after the container started, it falls back to RPC.

Two sandbox backends

The sandbox executes the agent’s code. codemode-lite supports two backends, selectable via an environment variable.

Podman runs a rootless container that stays alive between calls. Variables persist across the session, so the agent can fetch data in one call and process it in the next without re-querying. Top-level await is supported, eliminating the need for an async def main() wrapper. Server-scoped proxies (mcp_servername.tool_name()) group tools by the service they belong to. The container launches with --network none, --read-only, --cap-drop ALL, --security-opt no-new-privileges, and runs as an unprivileged user. Credentials never enter the container. API keys stay on the host side and are injected at the RPC layer when the host calls the actual MCP server.

The container boots in roughly 2 seconds on the first call and is instant on every subsequent call. If it crashes, the next run_python call auto-restarts it.
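For reference, a host process could assemble the hardening flags listed above along these lines. The sketch only builds the argument list and does not start a container. The image name, the UID:GID, the entrypoint path, and the --rm and -i flags are assumptions, not details taken from codemode-lite.

```python
# Builds (but does not run) a podman command line with the hardening flags
# named in the text. Image, user, and entrypoint are hypothetical placeholders.
def build_podman_argv(image: str = "localhost/sandbox:latest") -> list[str]:
    return [
        "podman", "run",
        "--rm",                                  # assumed: remove container on exit
        "-i",                                    # assumed: keep stdin open for RPC
        "--network", "none",                     # no network access from the sandbox
        "--read-only",                           # root filesystem is immutable
        "--cap-drop", "ALL",                     # drop every Linux capability
        "--security-opt", "no-new-privileges",   # block privilege escalation
        "--user", "1000:1000",                   # assumed unprivileged UID:GID
        image,
        "python", "/entrypoint.py",              # hypothetical entrypoint path
    ]


argv = build_podman_argv()
print(" ".join(argv))
```

Passing the arguments as a list rather than a shell string avoids quoting problems if the list is later handed to subprocess.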

Pyodide WASM runs Python inside WebAssembly via Node.js. A fresh process spawns per call, providing true memory isolation with zero state leakage between executions. Code must be wrapped in async def main() and return a dict. No container runtime is needed, just Node.js with the pyodide npm package. Boot time is approximately 0.7 seconds per call.
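The code shape the Pyodide backend expects looks roughly like this. The tools dictionary and the stub behind it are hypothetical stand-ins so the example runs outside the sandbox, and we assume here that tool proxies are awaitable.

```python
import asyncio


# Stand-in for a tool proxy the sandbox would inject (hypothetical stub).
async def _fake_list_events(**kwargs):
    return [{"summary": "Team Sync"}, {"summary": "1:1"}]

tools = {"list-events": _fake_list_events}


# Shape required by the Pyodide backend: an async main() that returns a dict.
async def main() -> dict:
    events = await tools["list-events"](calendarId="primary")
    return {"count": len(events), "titles": [e["summary"] for e in events]}


# The backend would invoke main() itself; asyncio.run stands in for that here.
result = asyncio.run(main())
print(result)
```

In the Podman backend the same logic could be written at the top level without the main() wrapper, because top-level await is supported there.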

Both backends use the same JSON-framed RPC protocol over stdin/stdout. When sandbox code calls tools['create-event'](...), a proxy function sends a JSON request to the host process, which routes it through a ToolProxy to the real MCP server, unwraps the response, and returns a clean Python object. The sandbox code never sees MCP protocol details.

A security fix found along the way

Pyodide was designed for browsers where JavaScript is already sandboxed. Running it in Node.js changes the threat model. We discovered that Pyodide’s js module lets Python code run the statement from js import require, which in Node.js gives access to the filesystem, network, and process APIs. A prompt-injected code block could read environment variables, exfiltrate API keys, or execute arbitrary commands.

We fixed this by setting js, pyodide_js, and pyodide to None in sys.modules before user code runs. We also added these to the forbidden imports list in the code validator, which performs AST-based checking before any code reaches the sandbox.
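Both layers of the fix can be sketched in plain CPython. The function names below are hypothetical and the real validator may check more than imports. The module names come from the fix described above.

```python
import ast
import sys

# Module names taken from the fix described in the text.
FORBIDDEN_MODULES = {"js", "pyodide_js", "pyodide"}


def find_forbidden_imports(source: str) -> list[str]:
    """Return forbidden top-level module names imported by the source."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]   # module is None for "from . import x"
        else:
            continue
        for name in names:
            root = name.split(".")[0]
            if root in FORBIDDEN_MODULES:
                found.append(root)
    return found


def block_modules(names: set[str]) -> None:
    """A None entry in sys.modules makes a later import raise ImportError."""
    for name in names:
        sys.modules[name] = None


print(find_forbidden_imports("from js import require"))   # flagged
print(find_forbidden_imports("import json"))              # allowed

block_modules({"js"})
try:
    import js  # noqa: F401
    blocked = False
except ImportError:
    blocked = True
print(blocked)
```

The static check alone would miss dynamic imports such as a call to __import__ with a computed name, which is one reason to keep the sys.modules layer as well.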

Podman does not have this issue. Its isolation is at the kernel level through cgroups and namespaces, and there is no bridge from the container to the host runtime.

Why we kept both backends

Podman is the better sandbox for almost every use case: it offers persistent state, top-level await, server-scoped proxies, and faster execution after the first call. But it requires Podman or Docker installed on the machine.

Pyodide exists for environments where you cannot run containers: lightweight deployments, CI pipelines, or machines without container runtimes. It trades persistence and convenience for portability. The code validation and tool descriptions adapt dynamically based on which backend is selected.

How tool calls cross the container boundary

The RPC protocol handles four message types.

call_tool is the primary message. The sandbox sends a tool name and arguments. The host calls the real MCP server and returns the result:

Container -> {"type":"rpc_request","id":1,"payload":
              {"type":"call_tool","tool":"create-event",
               "arguments":{"calendarId":"primary","summary":"Team Sync"}}}

Host      -> {"type":"rpc_response","id":1,"payload":
              {"success":true,"result":{"id":"evt_abc","summary":"Team Sync"}}}

discover, schema, and search handle runtime tool discovery. In Podman, schema checks baked metadata first (zero RPC). In Pyodide, all three go through RPC.
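The sandbox side of the call_tool exchange can be sketched as a small proxy factory. The message fields match the example above. The newline-delimited framing, the error key, and the function name make_tool_proxy are assumptions, and in-memory streams stand in for the container's stdin and stdout.

```python
import io
import json
from itertools import count

_ids = count(1)


def make_tool_proxy(tool_name: str, out_stream, in_stream):
    """Return a callable that forwards a tool call as one JSON line."""

    def call(**arguments):
        request = {
            "type": "rpc_request",
            "id": next(_ids),
            "payload": {"type": "call_tool", "tool": tool_name, "arguments": arguments},
        }
        out_stream.write(json.dumps(request) + "\n")
        out_stream.flush()

        response = json.loads(in_stream.readline())
        if response["id"] != request["id"]:
            raise RuntimeError("response id does not match request id")
        payload = response["payload"]
        if not payload.get("success"):
            # The "error" key is an assumed field for failed calls.
            raise RuntimeError(payload.get("error", "tool call failed"))
        return payload["result"]

    return call


# In-memory streams stand in for the container's stdout and stdin.
to_host = io.StringIO()
from_host = io.StringIO(
    json.dumps({
        "type": "rpc_response", "id": 1,
        "payload": {"success": True, "result": {"id": "evt_abc", "summary": "Team Sync"}},
    }) + "\n"
)

create_event = make_tool_proxy("create-event", to_host, from_host)
event = create_event(calendarId="primary", summary="Team Sync")
print(event)
```

Matching the response id against the request id is a small guard that becomes important once several calls are in flight at the same time.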

Figure 3: Tool calls cross the container boundary via JSON-RPC over stdin/stdout. The host-side ToolProxy normalizes responses before returning them to the sandbox.

The host-side ToolProxy normalizes responses before they reach the sandbox. It unwraps MCP content blocks, parses JSON strings, strips None values from arguments that LLMs commonly pass for optional parameters, and handles three different calling conventions: dict arguments, keyword arguments, and positional arguments mapped to schema parameter names.
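A simplified version of that normalization might look like the following. The function names are hypothetical, and the sketch handles only text content blocks. It is meant to illustrate the three calling conventions and the unwrapping step, not to reproduce the actual ToolProxy.

```python
import json
from typing import Any


def strip_none(arguments: dict) -> dict:
    """Drop optional parameters that the LLM passed explicitly as None."""
    return {k: v for k, v in arguments.items() if v is not None}


def bind_arguments(schema: dict, *args: Any, **kwargs: Any) -> dict:
    """Accept a single dict, keyword arguments, or positional arguments."""
    if len(args) == 1 and isinstance(args[0], dict) and not kwargs:
        bound = dict(args[0])
    else:
        # Positional values are mapped to schema property names in order.
        names = list(schema.get("properties", {}))
        bound = dict(zip(names, args))
        bound.update(kwargs)
    return strip_none(bound)


def unwrap_result(result: dict) -> Any:
    """Turn MCP text content blocks into plain Python values."""
    values = []
    for block in result.get("content", []):
        if block.get("type") != "text":
            continue
        try:
            values.append(json.loads(block["text"]))   # JSON string -> object
        except json.JSONDecodeError:
            values.append(block["text"])                # keep plain text as-is
    return values[0] if len(values) == 1 else values


schema = {"properties": {"calendarId": {}, "summary": {}, "location": {}}}
print(bind_arguments(schema, "primary", "Team Sync", location=None))
print(unwrap_result({"content": [{"type": "text", "text": '{"id": "evt_abc"}'}]}))
```

Mapping positional arguments by property order relies on the schema preserving the order in which the server declared its parameters.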

What the measurements show

We ran the same multistep task using both direct MCP tool calling and codemode-lite. The task involved search, calendar scheduling, and repository operations across multiple MCP servers. Everything else was the same: same LLM (Claude Opus), same servers, same task description.

Metric                     Direct MCP (38 tools)   codemode-lite
First-turn context         41,205 tokens           26,635 tokens
Peak context               49,697 tokens           36,518 tokens
Schema tokens per turn     ~15,000                 ~220
Turns                      21                      25
Tool calls                 10                      15
Cache create tokens        321,053                 119,804
Token overhead reduction                           53%

The first-turn context dropped from 41,205 to 26,635 tokens. That gap of roughly 15,000 tokens is the tool schemas eliminated from every turn. With codemode-lite, the agent carries about 220 tokens of schema regardless of how many tools sit behind the server.

Direct MCP was more precise in execution: 10 tool calls, 0 failures, because the LLM had full schema information upfront. codemode-lite used 15 calls, including discovery and some retries. But the per-turn schema savings compound across every round-trip, and the advantage grows with tool count.
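The differences between the two columns can be recomputed directly from the table. The snippet below does only that arithmetic. It does not attempt to rederive the 53% figure, since the table does not state which quantity that figure is measured against.

```python
# Figures copied from the table above: (direct MCP, codemode-lite).
MEASUREMENTS = {
    "first_turn_context": (41_205, 26_635),
    "peak_context": (49_697, 36_518),
    "cache_create_tokens": (321_053, 119_804),
}


def reduction(direct: int, codemode: int) -> tuple[int, float]:
    """Absolute token difference and percentage reduction versus direct MCP."""
    saved = direct - codemode
    return saved, round(100 * saved / direct, 1)


for name, (direct, codemode) in MEASUREMENTS.items():
    saved, pct = reduction(direct, codemode)
    print(f"{name}: {saved:,} fewer tokens ({pct}%)")
```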

Context accumulation

In a test with three parallel API calls, the sandbox processed 5,984 characters of raw API responses internally and returned only 147 characters of summary to the agent. That is a 97.5% reduction in what enters the agent’s context window.

With direct MCP, all three responses would have entered the conversation history as individual tool results, growing the context for every subsequent turn. With codemode-lite, the raw data stays in the sandbox. Only the print() summary crosses the boundary.
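The pattern is easy to reproduce with stubs. In this sketch the three tools, their payloads, and the summary format are all hypothetical, so the character counts it prints are not the figures from our test. The point is the shape: gather the calls concurrently, keep the raw data local, and print only a short summary.

```python
import asyncio
import json


# Hypothetical stand-in for an independent tool call.
async def fake_tool(name: str, rows: int) -> list[dict]:
    await asyncio.sleep(0)  # yield control, as a real RPC call would
    return [{"source": name, "row": i, "payload": "x" * 40} for i in range(rows)]


async def run_in_sandbox() -> tuple[int, str]:
    # The three calls run concurrently inside a single run_python invocation.
    search, events, issues = await asyncio.gather(
        fake_tool("search", 10),
        fake_tool("calendar", 5),
        fake_tool("repo", 8),
    )
    raw_chars = sum(len(json.dumps(r)) for r in (search, events, issues))
    # Only this short summary would be printed back to the agent.
    summary = f"search={len(search)} events={len(events)} issues={len(issues)}"
    return raw_chars, summary


raw_chars, summary = asyncio.run(run_in_sandbox())
print(summary)
print(f"raw={raw_chars} chars, returned={len(summary)} chars")
```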

Why cache-created tokens matter

This is worth explaining because it is not obvious. When the LLM provider receives a request, unchanged content from prior turns gets cached. Cache reads are cheap. But new content in each turn, including new tool results and schema additions, requires cache creation, which is billed at a premium.

Direct MCP generates more cache creation overhead because each individual tool result is new content. With codemode-lite, tool results stay inside the sandbox. Only the compact print() summary enters the conversation history, producing less new content to cache per turn.

What ships in codemode-lite

codemode-lite is 9 files and roughly 2,000 lines of Python:

codemode-lite/
+-- server.py                          MCP server, tool registration, lazy loading
+-- codemode/
|   +-- engine.py                      CodeMode class, run_code(), discovery helpers
|   +-- proxy.py                       Tool call routing, logging, response unwrapping
|   +-- mcp_adapter.py                 MCPToolLoader (SSE + stdio connections)
|   +-- backends/
|       +-- podman_backend.py          Rootless Podman container, RPC, entrypoint generation
|       +-- pyodide_wasm_backend.py    Pyodide WASM via Node.js
|       +-- pyodide_runner.js          Node.js Pyodide bootstrap
+-- requirements.txt                   mcp + aiohttp

There are zero LLM dependencies. No OpenAI SDK, no LangChain, no Pydantic. The server speaks standard MCP protocol over stdio. Any MCP-compatible client works: Claude Code, Cursor, LangChain agents, OpenAI Agents SDK, or a custom orchestrator.

Adding a new MCP server requires dropping a single JSON file into a configurable directory:

{ 
  "mcpServers": {
    "MyService": {
      "type": "sse",
      "url": "http://localhost:4000/sse",
      "description": "Internal API for customer data"
    }
  }
}

The agent can load it on demand in the next run_python call. No restart needed.
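One plausible way to pick up such files is to rescan the directory whenever a server is requested. The sketch below assumes one or more *.json files shaped like the example above. The function name load_server_configs and the file name are hypothetical, and a temporary directory stands in for the configured one.

```python
import json
import tempfile
from pathlib import Path


def load_server_configs(config_dir: Path) -> dict[str, dict]:
    """Merge the mcpServers entries from every *.json file in a directory."""
    servers: dict[str, dict] = {}
    for path in sorted(config_dir.glob("*.json")):
        data = json.loads(path.read_text(encoding="utf-8"))
        servers.update(data.get("mcpServers", {}))
    return servers


# Demo with a temporary directory standing in for the configured one.
with tempfile.TemporaryDirectory() as tmp:
    config_dir = Path(tmp)
    (config_dir / "myservice.json").write_text(json.dumps({
        "mcpServers": {
            "MyService": {
                "type": "sse",
                "url": "http://localhost:4000/sse",
                "description": "Internal API for customer data",
            }
        }
    }), encoding="utf-8")
    servers = load_server_configs(config_dir)

print(sorted(servers))
```

Because the directory is read at request time in this sketch, a file added between two calls is visible to the second call without restarting anything.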

Honest trade-offs

This approach is not universally better than direct tool calling.

Discovery adds overhead. The agent needs _discover() and _schema() calls to learn tool names and parameters. Direct MCP does not need these. For simple single-tool tasks, direct calling may be more efficient.

Code quality depends on the model. The agent must write syntactically correct Python with exact tool names and parameter formats. Stronger models handle this well. Smaller models may struggle, producing tool name mismatches (we saw Gemini Flash use list_events instead of list-events) or incorrect response key access.

Conclusion

Our tests, along with research from other leaders in the AI space, show that tool schema overhead is the dominant scaling bottleneck in multitool agent pipelines. LLMs are effective at writing code to call APIs, and wrapping tool access behind a single code execution tool reduces context overhead significantly.

codemode-lite validates this with a 53% token reduction and extends the pattern with lazy server loading (connect only what you need), persistent sandbox state (variables survive between calls), and runtime tool discovery (no schema tokens in the agent’s context). The project is open source at github.com/redhat-et/codemode-lite.