The Debate AI Engineers Are Having
Open any AI engineering forum or Discord server in 2025-2026 and you'll find the same argument playing out repeatedly. One camp says CLI tools are the right way to give language models access to external systems — they're predictable, cacheable, and the agent always knows exactly what operation it's performing. The other camp says MCP (Model Context Protocol) is the future — standardized, composable, and interoperable across any LLM client that speaks the protocol.
Both camps are partially right. And both are answering the wrong question.
The benchmark we're analyzing today — run by Patrick Kelly at Port of Context against 12 real Stripe API tasks — reveals something that neither side has fully reckoned with: the dominant variable isn't the transport format (CLI vs MCP). It's the client-side execution architecture. Specifically, whether the LLM client allows the model to write code that executes tool calls, or forces it to make tool calls one turn at a time.
This distinction matters enormously in production. We're talking about the difference between a 19-step operation costing $1.52 and a 4-step operation costing $0.13 — for the exact same task, with the exact same model, connecting to the exact same API.
What MCP Actually Is (And What It Isn't)
Before we can understand the benchmark, we need to be clear about what MCP is architecturally — not in the marketing sense, but in the engineering sense.
The Model Context Protocol, introduced by Anthropic in late 2024 and since adopted by OpenAI, Google, and most major AI tooling vendors, is a standardized wire protocol for connecting LLM applications to external data sources and tools. Anthropic's own description is deliberately modest: it's "an open standard that enables developers to build secure, two-way connections between their data sources and AI-powered tools."
The USB-C analogy that appears in nearly every MCP explainer is apt: just as USB-C standardized the physical and electrical interface between devices so that any device can talk to any peripheral, MCP standardizes the interface between LLM clients and tool servers. A Stripe MCP server can be used by Claude, GPT-4, or any other MCP-capable client without modification.
What MCP is not is an execution strategy. The protocol defines how tools are advertised (schema), how they're invoked (request format), and how results are returned (response format). It says nothing about how many LLM turns it takes to accomplish a goal, or whether the LLM should write code that orchestrates multiple tool calls or invoke them individually one by one.
This is the crucial distinction that the CLI-vs-MCP debate misses. MCP is infrastructure — like HTTP is infrastructure for the web. HTTP doesn't make web applications fast; architecture does. Similarly, MCP doesn't determine agent efficiency; the client execution model does.
How a Typical MCP Interaction Works
When an MCP client connects to a server (say, the Stripe MCP server), the server advertises its available tools with full JSON schema definitions. This schema includes the tool name, description, parameter names, types, and documentation — everything the LLM needs to understand how to use the tool.
In a standard "raw MCP" setup, this schema is injected into the LLM's context on every request. If the Stripe MCP server has 40 tools, the LLM receives schema definitions for all 40 tools on every single turn. The LLM then decides which tool to call, generates a structured tool call, the client executes it, and the result is fed back into the next LLM turn. Then the cycle repeats.
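The loop above can be sketched in a few lines. Everything here is illustrative — `runRawMcpLoop`, `callModel`, and `executeTool` are hypothetical stand-ins, not actual MCP SDK APIs — but the shape of the cycle is the point: the full schema catalog and the entire history are re-read on every model turn.

```typescript
// Sketch of the "one tool call per LLM turn" loop a raw MCP client runs.
// All function and type names are illustrative, not real MCP SDK APIs.
type ToolSchema = { name: string; description: string; inputSchema: object };
type Turn = { role: "assistant" | "tool"; content: string };

async function runRawMcpLoop(
  task: string,
  schemas: ToolSchema[], // ALL advertised tools, re-sent on every turn
  callModel: (ctx: string) => Promise<{ tool?: string; args?: object; answer?: string }>,
  executeTool: (name: string, args: object) => Promise<string>
): Promise<string> {
  const history: Turn[] = [];
  for (;;) {
    // Each turn's context = task + full schema catalog + all prior turns.
    const context = JSON.stringify({ task, schemas, history });
    const step = await callModel(context);
    if (step.answer !== undefined) return step.answer; // model is done
    const result = await executeTool(step.tool!, step.args ?? {});
    history.push({ role: "assistant", content: step.tool! });
    history.push({ role: "tool", content: result });
  }
}
```

Note that nothing in the protocol forces this loop — it is simply the default client behavior when tools are invoked one at a time.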
This is the architecture that the benchmark calls "Raw MCP" — and understanding this loop is essential to understanding why it's not the most efficient approach for multi-step tasks.
CLI Tools: The Alternative
The CLI (command-line interface) tool approach predates MCP and is what most early AI agent frameworks used. In this model, tools are implemented as functions or executables that the agent can call, typically with a simpler and more compact schema than an MCP server would advertise. The LLM generates a tool call, the runtime executes it, the result comes back, and the conversation continues.
The key difference from Raw MCP is that CLI tool schemas are usually hand-written and optimized — an engineer decided exactly which parameters to expose and how to describe them. MCP server schemas are typically auto-generated from API definitions and tend to be more verbose and comprehensive, which has implications for token usage that we'll quantify shortly.
The Benchmark: Methodology and Setup
Patrick Kelly at Port of Context ran a rigorous controlled experiment: 12 real-world Stripe API tasks, three different agent configurations, one model (Claude Sonnet 4.6), and full accounting of tokens consumed and dollars spent. Every configuration solved every task — this is not a benchmark about task success rates, but about the cost and efficiency of accomplishing the same work through different architectures.
The Three Configurations
Configuration 1: CLI Tool
Standard CLI-style tool calling where the agent has access to Stripe operations as discrete functions. Each operation is one tool call, and each tool call requires one LLM turn to generate and one turn to process the result. Schemas are relatively compact and hand-crafted.
Configuration 2: Raw MCP
The agent connects directly to the Stripe MCP server. Full tool schemas are injected into context on every request. The agent calls one MCP tool per LLM turn, receives the result, and decides the next step. This is the "naive" way of using MCP — correct, but not architecturally optimized.
Configuration 3: Code Mode MCP
The Port of Context pctx framework presents MCP tools as a code API rather than direct invocations. The LLM writes a TypeScript program that calls MCP tools in a loop, handles state management in code, and executes the entire workflow in a small number of LLM turns. The pctx runtime then executes that program in a secure sandbox.
The 12 Stripe Tasks
The benchmark tasks span the Stripe API's breadth and complexity, from trivially simple read operations to multi-step write workflows:
- Simple reads: Balance lookup, customer list retrieval, payment intent status checks
- Simple writes: Customer creation, single charge creation
- Multi-step workflows: Invoice creation (with line items and finalization), subscription management, refund processing with conditional logic
- Compound queries: Filtering and aggregating payment data across multiple API calls
This mix is important: it includes tasks where CLI should theoretically excel (simple single-operation reads) and tasks where Code Mode should have a structural advantage (multi-step workflows requiring loops or conditionals). The benchmark tests both ends of the complexity spectrum.
Results Overview: The Numbers
All three configurations passed all 12 tasks. There is no accuracy difference to report. The entire story is in cost and efficiency:
| Configuration | Tasks Passed | Total Tokens | Avg Tokens/Task | Total Cost | Avg Cost/Task |
|---|---|---|---|---|---|
| CLI Tool | 12/12 ✅ | 711,555 | 59,296 | $2.22 | $0.185 |
| Raw MCP | 12/12 ✅ | 506,970 | 42,248 | $1.60 | $0.133 |
| Code Mode MCP | 12/12 ✅ | 294,924 | 24,577 | $0.98 | $0.082 |
Code Mode MCP uses 294,924 tokens total versus CLI's 711,555 — that's 58.5% fewer tokens for identical outcomes. At current Claude Sonnet 4.6 pricing, this translates to $0.98 versus $2.22 — a 56% cost reduction. For teams running agents at scale, this delta compounds fast: at the per-task averages above, 1,000 such tasks per day means roughly $185/day with CLI versus $82/day with Code Mode. That's over $37,000/year in savings on a single workflow.
There's also a notable gap between Raw MCP and CLI that deserves attention. Raw MCP uses 29% fewer tokens than CLI despite the overhead of injecting full tool schemas into context. The reason, as we'll explore, is that MCP's structured responses are more compact than CLI's free-form text outputs — and across the full benchmark, that per-result saving outweighs CLI's leaner schema definitions. The net effect favors Raw MCP, but neither approaches Code Mode's structural advantage.
Why CLI Looks Good on Simple Tasks
The CLI approach isn't uniformly worse — and understanding when it performs well is important for making sensible architecture decisions.
Consider the simplest task in the benchmark: a Stripe balance lookup. This is a single API call with no parameters — you just call stripe.balance.retrieve() and get back a JSON object. Here's how the configurations compare on this task:
| Configuration | Tokens Used | LLM Turns |
|---|---|---|
| CLI Tool | 3,001 | 2 |
| Raw MCP | 19,172 | 2 |
| Code Mode MCP | ~8,000 | 2 |
On a simple balance lookup, CLI is 6x more token-efficient than Raw MCP. Both take 2 LLM turns (one to decide to call the tool, one to process the result), but the token counts are wildly different. Why?
The Schema Size Problem
The answer is in the system prompt. MCP servers advertise their full tool catalog — every available operation with its complete JSON schema definition — on every request. The Stripe MCP server has dozens of tools covering customers, charges, invoices, subscriptions, payment intents, refunds, and more. Each tool definition includes its name, a description, every parameter with its type, whether it's required, and documentation strings.
For a single balance lookup, Raw MCP injects all of those tool definitions into context even though only one tool (retrieve_balance) will be used. This overhead — roughly 16,000 tokens of schema documentation that's irrelevant to this specific call — is paid on every single request, regardless of task complexity.
CLI tool schemas, by contrast, are hand-written and minimal. Engineers craft exactly the parameters the agent needs, with concise descriptions, and don't include documentation for operations that are out of scope for the current task. For a balance lookup, the CLI tool might be defined in 200 tokens rather than 16,000.
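As an illustration of how small a hand-written definition can be, here is a minimal balance-lookup tool in the shape common to LLM function-calling runtimes. The exact field names are an assumption (conventions vary by runtime), not the benchmark's actual schema:

```typescript
// A minimal hand-written tool definition for a balance lookup.
// Field names follow common function-calling conventions; this is an
// illustration, not the benchmark's actual schema.
const retrieveBalanceTool = {
  name: "retrieve_balance",
  description: "Fetch the current Stripe account balance.",
  parameters: { type: "object", properties: {}, required: [] as string[] },
};

// Well under 200 characters of schema, versus thousands of tokens for a
// full auto-generated MCP catalog injected on every turn.
const approxSchemaChars = JSON.stringify(retrieveBalanceTool).length;
```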
This is why CLI looks deceptively efficient on simple tasks: it's not that the architecture is better, it's that compact schemas produce small contexts. But this advantage evaporates when tasks become complex and multi-step, as we're about to see.
Where Code Mode Dominates: The create_invoice Deep Dive
The benchmark includes a create_invoice task that is representative of real-world Stripe workflows: create a customer, create an invoice, add line items, set billing parameters, and finalize the invoice. This requires multiple Stripe API calls with dependencies between them (you can't add line items before the invoice exists, can't finalize before the line items are added).
Here's the comparison on this single task:
| Configuration | LLM Turns | Total Tokens | Cost |
|---|---|---|---|
| CLI Tool | 19 turns | 497,556 | $1.52 |
| Raw MCP | 12 turns | 168,480 | $0.53 |
| Code Mode MCP | 4 turns | 38,847 | $0.13 |
This one task alone accounts for the majority of CLI's total cost in the benchmark ($1.52 of $2.22 total). Code Mode completes the same task for $0.13. That's a 12x cost reduction on a single task.
Why 19 Turns Becomes 4
The LLM turn count is the key variable here. Every LLM turn means: the model receives the full conversation history (all prior turns, all tool results), generates a response (either a tool call or a final answer), and that response is processed. As the conversation grows, each turn becomes more expensive because the context window carrying all previous steps must be re-read on every invocation.
In the CLI configuration, each step in the invoice workflow is one LLM turn. Step 1: create customer. Step 2: create invoice for that customer. Steps 3-8: add each line item. Steps 9-12: set billing parameters. Steps 13-19: verify, adjust, finalize. By turn 10, the model is reading 9 prior turns of tool calls and results as input to decide step 10. By turn 19, it's reading 18 turns of history. The token cost of the context window compounds with every step.
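This compounding can be captured in a toy model: if each turn appends roughly a fixed number of tokens of call-plus-result, turn t re-reads everything accumulated so far, so total tokens read grow quadratically in turn count. The numbers below are illustrative, not the benchmark's exact accounting:

```typescript
// Toy model of context re-reads: each turn appends ~perTurn tokens of
// tool call + result, and every turn re-reads the whole history so far.
// Illustrative only — real accounting includes schemas and prompts.
function totalTokensRead(turns: number, perTurn: number): number {
  let total = 0;
  let context = 0;
  for (let t = 0; t < turns; t++) {
    total += context + perTurn; // re-read history, plus this turn's output
    context += perTurn;
  }
  return total; // = perTurn * turns * (turns + 1) / 2 — quadratic growth
}
```

Under this model, a 19-turn workflow reads 19x the tokens of a 4-turn one at the same per-turn size — which is why turn count, not transport, dominates the bill.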
In Code Mode, the LLM writes a TypeScript function that does the entire workflow in a single program. That program calls the MCP tools internally, in a tight loop, with state managed in variables rather than in LLM context. The LLM's 4 turns are: (1) understand the task and write the program, (2-3) minor corrections or refinements if needed, (4) receive and report the final result. The intermediate API calls happen inside the program's execution — not in LLM context.
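A sketch of the kind of program a Code Mode agent might emit for this workflow is below. The `StripeTools` interface and its method names are hypothetical stand-ins for whatever signatures pctx actually generates — the structural point is that the loop and all intermediate state live in code, not in the model's context:

```typescript
// Hypothetical Code Mode program for the invoice workflow. The tool
// interface is an assumed stand-in, not pctx's actual generated API.
interface StripeTools {
  createCustomer(p: { email: string }): Promise<{ id: string }>;
  createInvoice(p: { customer: string }): Promise<{ id: string }>;
  addLineItem(p: { invoice: string; amount: number; description: string }): Promise<void>;
  finalizeInvoice(p: { invoice: string }): Promise<{ status: string }>;
}

async function createInvoiceWorkflow(
  tools: StripeTools,
  email: string,
  items: { amount: number; description: string }[]
): Promise<string> {
  const customer = await tools.createCustomer({ email });
  const invoice = await tools.createInvoice({ customer: customer.id });
  // N line items = N tool calls, zero additional LLM turns: state is
  // held in variables here, not in the model's context window.
  for (const item of items) {
    await tools.addLineItem({ invoice: invoice.id, ...item });
  }
  const { status } = await tools.finalizeInvoice({ invoice: invoice.id });
  return `invoice ${invoice.id}: ${status}`;
}
```

Whether the invoice has 2 line items or 200, the program above costs the same number of LLM turns to produce.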
What "Code Mode" Actually Is Architecturally
Code Mode is a client-side execution architecture, not a model feature. It emerged from the Port of Context team's observation that the most token-efficient way to use a capable LLM with tool access is to ask the LLM to write a program that uses the tools, rather than to use the tools directly.
The pctx (Port of Context) framework implements this by presenting MCP tools as a TypeScript API rather than as callable LLM tools. When the agent receives a task, instead of planning "first I'll call tool A, then tool B, then tool C," it plans "I'll write a TypeScript program that calls A, then B, then C, handling the outputs in between." The pctx runtime executes this program in a secure sandbox and returns the final output.
The Mechanics
In a Code Mode interaction, the flow looks like this:
- Task receipt: The LLM receives the task description and a TypeScript type definition for the available MCP tools (not their full schemas — just their function signatures).
- Program generation: The LLM writes a TypeScript async function that orchestrates all necessary tool calls, including error handling, conditional logic, and loops.
- Sandbox execution: The pctx runtime executes the program in an isolated Node.js sandbox. Tool calls within the program invoke actual MCP server endpoints.
- Result return: The program's return value is the task output, returned to the LLM as a single result.
- Reporting: The LLM formats and reports the result to the user.
The key insight is that the sandbox-execution step can involve an arbitrarily large number of API calls — 5, 50, or 500 — without any additional LLM turns. The LLM wrote the program once; the program runs as many API calls as needed.
Why Type Signatures Beat Full Schemas
Code Mode also solves the schema bloat problem differently. Instead of injecting full JSON schemas for every tool into every LLM turn, pctx generates compact TypeScript function signatures. A Stripe tool that might require 3,000 tokens to describe as a JSON schema with full parameter documentation might be represented in 150 tokens as a TypeScript function signature with JSDoc comments. The LLM still understands what the tool does and how to call it — TypeScript types are an extremely dense information format for LLMs trained extensively on code — but the context overhead is a fraction of the raw MCP approach.
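To make the density gap concrete, here is the same hypothetical balance tool in both forms. The names and field layout are illustrative — this is not the Stripe MCP server's actual schema — but the contrast in size is representative:

```typescript
// Form 1: a compact TypeScript signature with a JSDoc comment — roughly
// what a Code Mode client would show the model. (Type-only; erased at runtime.)
/** Retrieve the current account balance. */
type RetrieveBalance = () => Promise<{ available: number; pending: number }>;

// Form 2: the JSON-schema description a raw MCP server might advertise
// for the same tool. Illustrative shape, not Stripe's actual schema.
const retrieveBalanceSchema = {
  name: "retrieve_balance",
  description: "Retrieve the current account balance.",
  inputSchema: { type: "object", properties: {}, required: [] },
  outputSchema: {
    type: "object",
    properties: {
      available: { type: "number", description: "Funds available now" },
      pending: { type: "number", description: "Funds not yet settled" },
    },
  },
};
```

The signature carries the same calling information in a fraction of the tokens — and this gap widens for tools with many documented parameters.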
Code Mode vs Open Interpreter and Similar Systems
Code Mode shares philosophical DNA with Open Interpreter and similar "LLM writes and runs code" systems. The key difference is that Code Mode is specifically designed around MCP tool access rather than general-purpose code execution. Where Open Interpreter might spin up a Python REPL for arbitrary computation, Code Mode targets the specific pattern of "use these MCP tools to accomplish a task" — making it more constrained and more auditable in production contexts.
The pctx Python SDK (released December 2025) extended Code Mode to Python as well, allowing Python ML pipelines and business logic to be combined with MCP tool access in the same Code Mode environment — addressing the gap between data-science workflows (typically Python) and API orchestration (which pctx originally targeted with TypeScript).
The Architecture Insight: MCP Is Transport, Not Strategy
The benchmark's most important finding isn't the specific numbers — it's the framework they point to. The CLI vs. MCP debate is a debate about transport format. Code Mode vs. direct-call architectures is a debate about execution strategy. The second debate is the one that matters.
Think about it this way: HTTP is the transport protocol for the web. But no engineer would argue that choosing HTTP makes a web application fast or slow. What determines web application performance is how you architect queries, how you cache, how you pipeline requests, whether you use streaming. The transport is a prerequisite, not a determinant.
MCP is in the same position. Whether your tools are exposed via MCP or as CLI functions, the dominant cost variable is how many LLM turns the task requires and how large the context window is at each turn. These are determined by your client execution architecture — not your transport protocol.
⚡ The Reframe
The question "should I use CLI tools or MCP?" is like asking "should I use HTTP or FTP?" — it addresses real concerns (interoperability, schema design, tooling ecosystem), but it doesn't address the question that actually determines your agent's economics: does your client execute one tool per LLM turn, or does it write code that executes many tools per LLM turn?
Raw MCP and CLI tools both have the "one tool per LLM turn" property. Code Mode breaks this constraint. That's the variable that explains the 56% cost reduction.
Why Raw MCP Still Beats CLI Overall
It's worth noting that Raw MCP outperforms CLI overall (506,970 tokens vs. 711,555, a 29% reduction) despite the schema overhead. The reason is that MCP's structured response format is more compact than CLI's typical free-form text output. When CLI tools return results as human-readable text (which many CLI wrappers do), each result carries more tokens than an MCP server returning structured JSON that the model can parse efficiently.
On individual simple tasks, CLI wins due to schema compactness. Across the full benchmark including multi-step tasks, Raw MCP wins on total tokens because of more efficient result encoding. But Code Mode beats both by reducing the number of LLM turns required, which is the dominant cost factor.
Practical Guidance: When to Use Each Approach
Based on the benchmark data, here's a framework for choosing your agent architecture:
Use CLI Tools When:
- Tasks are genuinely single-operation. If 90% of your use cases involve one or two API calls with no inter-call dependencies, CLI's compact schemas make it the most token-efficient choice.
- You need tight schema control. Hand-crafted CLI schemas let you expose exactly the parameters and documentation relevant to your use case, nothing more.
- Tooling ecosystem matters. Existing CLI tool libraries for your stack are mature and well-tested.
- You're prototyping. CLI tools are faster to write and easier to debug for individual operations.
Use Raw MCP When:
- Interoperability is the priority. If you need the same tool server to work with multiple different LLM clients without modification, MCP's standardization delivers real value.
- The MCP ecosystem has the server you need. Hundreds of MCP servers exist for popular services (GitHub, Notion, Stripe, etc.). Using them is faster than building CLI wrappers.
- Tasks are moderately complex but not highly looping. Raw MCP handles sequential multi-step tasks reasonably, and beats CLI on overall efficiency once you get beyond trivial operations.
Use Code Mode When:
- Tasks involve loops or conditionals. Any workflow where "do this for each item in a list" appears is a Code Mode candidate. The model writes the loop; the loop executes N API calls without N LLM turns.
- Task complexity is variable. If sometimes you need 2 API calls and sometimes you need 20, Code Mode handles both efficiently. The LLM writes exactly the code the task requires.
- You're operating at scale. At 1,000+ tasks per day, the 56% cost difference becomes tens of thousands of dollars annually. The investment in Code Mode architecture pays back quickly.
- Multi-step write operations are common. The create_invoice example is representative: any time you're creating entities, relating them to other entities, and performing sequential operations on them, Code Mode's 12x efficiency advantage on complex tasks is decisive.
The Mixed-Mode Strategy
The benchmark suggests a practical heuristic: route tasks by complexity. For simple reads and single-operation writes, CLI is fine and avoids Code Mode's small additional overhead for program generation. For anything multi-step — especially anything with loops, conditionals, or more than 3 sequential API calls — route to Code Mode. A simple classifier on task type (or even a heuristic based on estimated step count) can capture most of Code Mode's cost savings while preserving CLI's efficiency on simple tasks.
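A minimal version of that router might look like the sketch below. The threshold and the inputs are assumptions to tune per workload, not values from the benchmark:

```typescript
// Minimal complexity-based router, per the mixed-mode heuristic above.
// The step threshold (3) and the inputs are assumptions to tune per workload.
type ExecutionMode = "cli" | "code_mode";

function routeTask(estimatedSteps: number, hasLoopsOrConditionals: boolean): ExecutionMode {
  // Branching or multi-step work goes to Code Mode; trivial ops stay on CLI
  // to keep schema overhead and program-generation overhead minimal.
  if (hasLoopsOrConditionals || estimatedSteps > 3) return "code_mode";
  return "cli";
}
```

Even this crude rule captures most of the savings, because the expensive tasks (like create_invoice) are exactly the ones that trip the multi-step condition.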
Implications for AI Engineering
The broader implication of this benchmark is that the AI engineering community has been optimizing the wrong variable. The explosion of MCP server development — hundreds of servers for popular APIs, tooling to auto-generate MCP servers from OpenAPI specs, MCP server marketplaces — addresses a real problem (interoperability and tool discoverability) but doesn't address the cost and efficiency problem that teams building production agents face.
The parallel in web engineering: in the late 1990s, there was enormous debate about which application server framework to use, which web server, which database connector. The question that actually mattered — and that the industry took years to converge on — was application architecture: stateless vs. stateful, query optimization, caching strategies, asynchronous processing. Transport and framework debates were proxies for the architectural question that was harder to articulate.
The AI agent community appears to be in a similar phase. MCP vs. CLI is the surface debate. Client execution architecture — specifically whether you reduce LLM turns through code execution — is the underlying question.
For Teams Evaluating AI Agent Frameworks
When evaluating any agent framework, add these questions to your checklist:
- Does the framework support code-mode execution? Can the LLM write a program that executes multiple tool calls without returning to the LLM for each one?
- What is the schema injection strategy? Does the framework inject all tool schemas on every request, or does it use selective schema injection or compact type definitions?
- How does context window size grow with task complexity? Does the framework's design contain or compound the context growth problem?
- Can you route tasks by architecture? Does the framework support different execution strategies for different task types?
For Teams Already Using MCP
If you've already built on MCP (which is a reasonable choice given its rapid ecosystem growth), the transition to Code Mode doesn't require abandoning your MCP servers. The pctx framework wraps existing MCP servers — it's a client-side change, not a server-side change. Your Stripe MCP server continues to work; you're simply changing how the client interacts with it.
This is an important point: MCP server investment is not wasted if you switch to Code Mode clients. The ecosystem value of standardized tool servers remains. You're only changing the execution strategy layer above the transport.
The Token Economics of Scale
To make the economic argument concrete: suppose you're running an AI agent that handles customer support workflows for a SaaS business, and each complex support ticket requires a Stripe invoice check, a subscription status check, and potentially a refund creation — a 3-5 step workflow. At 500 such tickets per day:
- CLI approach: ~500 × $0.25/task = ~$125/day = ~$45,600/year
- Code Mode approach: ~500 × $0.11/task = ~$55/day = ~$20,000/year
That's $25,000/year in savings on a modest-scale deployment, from a single architectural choice. At the scale where real enterprise AI agents operate — millions of tasks per month — these numbers become material budget considerations.
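The back-of-envelope arithmetic behind those figures is simple to reproduce. The per-task costs are the hypothetical scenario values above, not benchmark measurements:

```typescript
// Annualized agent cost from daily volume and per-task cost.
// Inputs are the hypothetical scenario figures, not benchmark data.
function annualCost(tasksPerDay: number, costPerTask: number): number {
  return Math.round(tasksPerDay * costPerTask * 365);
}

const cliAnnual = annualCost(500, 0.25);      // ~ $45,625/year
const codeModeAnnual = annualCost(500, 0.11); // ~ $20,075/year
```

Swap in your own volumes and the benchmark's measured per-task averages to estimate your deployment's delta.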
References
- Kelly, Patrick. "MCP is a Primitive, Not a Strategy — CLI vs MCP Benchmark." Port of Context, March 16, 2026.
- Port of Context. "pctx: The execution layer for agentic tool calls." GitHub, 2025-2026.
- Anthropic. "Introducing the Model Context Protocol." Anthropic, November 2024.
- Model Context Protocol. "What is the Model Context Protocol?" Official Documentation, 2025.
- IBM. "What is Model Context Protocol (MCP)?" IBM Think, February 2026.
- Port of Context. "Python SDK for Code Mode." Port of Context Blog, December 2025.