← Back to Research

🏆 Local Coding Agent: Building a Claude Code Killer That Runs Entirely on Your Hardware

A fully local, provider-agnostic coding agent with 719 tests, 29+ components, multi-agent orchestration, RAG, MCP support, and browser automation — built on proven agentic architectural patterns to rival the paid tools without sending your code to the cloud.

May 22, 2026 · 12 min read · SWE-bench target: 70%+

🎧 Listen to this article:

Introduction

There's a fundamental tension at the heart of the AI coding agent boom. Tools like Claude Code (80.8% on SWE-bench), Cursor, and OpenAI Codex are genuinely impressive — they can write, refactor, debug, and deploy code with an ability that was unthinkable even a year ago. But they all share one dealbreaker for a significant segment of developers: everything your code touches flows to a cloud provider.

Your proprietary algorithms. Your internal APIs. Your client data. Every prompt you send, every file the agent reads, every terminal command it executes — it all passes through Anthropic's, OpenAI's, or Cursor's servers. For startups, researchers, and privacy-conscious engineers, that's not an acceptable trade-off.

Enter local-coding-agent — a full-featured, local-first coding agent that runs entirely on your hardware. Built by Michel Laclay, this project isn't a toy wrapper around a local model. It's a 31,000+ line software architecture with 719 passing tests, 29+ components, and a phased build plan grounded in the agentic patterns from Dr. Ali Arsanjani and Juan Pablo Bustos's 2026 Packt publication: "Agentic Architectural Patterns for Building Multi-Agent Systems."

This article is a deep technical review of the project as it stands today — what's built, what's architecturally sound, what's missing, and whether it can realistically close the gap with the cloud-hosted leaders it's designed to compete against.

Why a Local Agent?

The market case is clear. The current landscape breaks down into three camps:

CLI-first tools include Claude Code (Anthropic), OpenCode (open-source, BYO provider), OpenAI Codex, and Aider. These are terminal-native and favored by operators who want automation without leaving the command line. Claude Code leads on SWE-bench at 80.8%, and has a one-million-token context window. But it locks you into Anthropic's API.

IDE-native tools like Cursor, Windsurf, and Zed are embedded directly in your editor. They offer better UX — code navigation, IDE-integrated chat, generative UI — but they're closed ecosystems with no CLI mode and no local model support.

Cloud platforms like Devin (Cognition) and OpenHands offer fully autonomous agents that plan, execute, and deploy end-to-end. But they're cloud-only, expensive, and require active internet. None offer a meaningful local-first option.

The gap? Truly local-first agents with deep git integration, multi-agent orchestration, and a provider-agnostic design. OpenCode is closest but less mature. local-coding-agent aims to fill that gap — and to do it in a way that exposes its own agentic patterns, something no competitor does.

Competitive Landscape

SWE-bench (May 2026 benchmark scores for real-world software engineering tasks):

Rank	Agent	SWE-bench	Interface
1	Claude Code (Anthropic)	80.8%	CLI
2	OpenCode	~76%	CLI
3	Cursor	~74%	IDE
4	Windsurf	~73%	IDE

The project's SWE-bench target is 70%+ — within striking distance of the leaders. But SWE-bench measures the model's underlying capability more than the agent's architecture. What's more interesting about this project is the engineering: how it structures itself to make whatever model you plug in as effective as possible.

Architecture & Agentic Patterns

The project follows a clean three-layer architecture, directly mapped to the patterns from Arsanjani & Bustos's taxonomy:

1. Presentation Layer — The terminal UI, CLI commands, and conversational interface. Built with Rich for streaming token rendering, syntax highlighting, and structured output tables. This is your front door — token-by-token streaming so you see results as they arrive, not after the whole response finishes.

2. Orchestration Layer — The brain. Contains the agent core (multi-turn tool chaining with max-turns safety), the task planner (decomposes goals into steps with complexity estimation), multi-agent coordination (delegates to isolated subagent sessions with batch parallel execution up to 3 concurrent), persistent memory (user profile + agent notes), skill management, human-in-the-loop interruption, and in-context learning.

3. Infrastructure Layer — The toolset. Model router, tool registry, vector store (for RAG), file system operations, terminal execution, git client, MCP server support, browser engine, and embedding API. Every tool has a typed schema, validation, and a dedicated test suite.

This is not an afterthought. The PRD explicitly maps 19 functional requirements to documented agentic patterns — meaning every feature has an architectural precedent, not just a feature checklist. That's a significant differentiator.

Core Components

Here's what's shipped as of May 22, 2026 — every module with its test count:

Foundation (Phase 1)

ModelRouter — Routes tasks to the right model based on complexity; supports Ollama, OpenAI-compatible SDKs; 120s timeout. (16 tests)
ToolRegistry — Typed tool schemas with validation and execution pipeline. (14 tests)
FileTools — Read, write, patch (fuzzy find-and-replace), diff display. (24 tests)
TerminalTools — Foreground and background execution, process management. (19 tests)
GitTools — Init, add, commit, status, diff, branches, push, log, merge with conflict detection. (18 tests)
SearchTools — Ripgrep integration for content and file search. (15 tests)
AgentCore — The main multi-turn loop; chains tool calls with max-turns safety. (17 tests)
TerminalUI — Streaming token rendering with Rich console. (14 tests)
RAG Pipeline — Document indexer, vector store, Ollama/TF-IDF embeddings. (35 tests)

Core Capabilities (Phase 2)

Multi-Agent Orchestration — DelegateAgent with isolated sessions, batch parallel mode (up to 3 concurrent), structured summaries. (12 tests)
Persistent Memory — Two stores (user profile + agent notes), CRUD, size limits, auto-inject. (10 tests)
Skill System — SKILL.md format, create/update/delete/list, categorized library, in-repo SKILL.md. (11 tests)
Human-in-the-Loop — Multi-choice and open-ended prompts, confirmation on destructive ops. (10 tests)
Adaptive Retry — Exponential backoff, context-aware retry, max retry escalation. (12 tests)
MCP Client — stdio and HTTP MCP server connections, auto-discover tools. (12 tests)

Safety & Polish (Phase 3)

Explainability — Decision logging, chain of thought, JSON Lines audit trail, self-assessment. (37 tests)
Safety Module — 12 prompt injection detection patterns, command allowlists, rate limiting. (45 tests)
Browser Engine — Playwright-based: navigate, click, type, screenshot, accessibility tree, JS evaluation. (10 tests)
Cron Scheduler — Job CRUD, cron expressions, persistence, chained jobs. (40 tests)
Config System — YAML, env interpolation, multi-source merge, hot-reload, schema validation. (28 tests)
Config Manager — Multi-source config merging with interpolation. (28 tests)

Advanced Features (Phase 4 & 5)

Task Planner — Goal decomposition via LLM, per-step complexity estimation, plan persistence. (28 tests)
Complexity Routing — Routing rules map complexity tiers to model configs, fallback chains, per-model stats. (18 tests)
Session Persistence — Auto-save conversation history, restore on startup. (15 tests)
Token Tracking — Prompt/completion counting, latency, error counts. (18 tests)
LSP Client — JSON-RPC stdio client supporting pyright, typescript-language-server, rust-analyzer. (Phase 5)
Project Context Auto-loading — Scans for AGENTS.md, CLAUDE.md, SKILL.md, pyproject.toml, etc. (Phase 5)
Diff Review Workflow — Propose changes, preview unified diffs, approve/reject per-file. (Phase 5)
In-Context Learning — Tracks user corrections, categorizes them, injects few-shot examples into system prompt. (Phase 5)

Total: 719 passing tests across 29+ components. That's serious test coverage for a project of this scope.

The Build Timeline

What's remarkable about this project is not just what it builds, but how systematically it was planned. The PRD laid out an 8-week build plan mapped to the agentic patterns from Arsanjani & Bustos:

Phase 1 (Weeks 1–3): Foundation — Model router, streaming terminal UI, file operations, terminal execution, function calling framework, git integration, and RAG. The goal: make an agent that can read a codebase, run commands, and call an LLM. Completed.

Phase 2 (Weeks 4–6): Core Capabilities — Multi-agent delegation, persistent memory, skill system, human-in-the-loop, adaptive retry, and MCP server support. The goal: make an agent that delegats to subagents, remembers across sessions, and learns workflows as skills. Completed.

Phase 3 (Weeks 7–8): Safety & Polish — Explainability, adversarial protection, browser automation, cron scheduling, and the config system. Completed.

Phase 4: Advanced Features — Task planner, complexity routing, session persistence, token tracking. Completed.

Phase 5: Advanced IDE-like Features

The most recent additions are Phase 5 work — features that push this agent into true IDE-parity territory:

LSP Client — This is a big one. A Language Server Protocol client means the agent can understand your code at a structural level, not just as text. It connects to pyright, typescript-language-server, rust-analyzer, and others to get real-time diagnostics, go-to-definition, references, symbols, and hover information. This is what separates a "grep agent" from a "code-aware agent."

Project Context Auto-loading — Instead of requiring you to copy-paste project conventions, the agent scans for AGENTS.md, CLAUDE.md, .cursorrules, pyproject.toml, package.json, and README.md files, then formats them into system prompt blocks with priority ordering and size limits. This is exactly what Claude Code does for its context window.

Diff Review Workflow — Mirrors Claude Code's review UX: the agent proposes changes, you preview unified diffs, and approve or reject individual files before they're applied. This is a critical UX pattern that prevents the agent from doing damage before you see it.

In-Context Learning — The agent tracks corrections you make during a session, categorizes them (style, naming, logic, security), and injects few-shot examples into the system prompt. This means the agent gets better mid-session without needing fine-tuning. It's a clever form of lightweight personalization.

Technical Deep Dive

The Model Router — Your Agent's Nervous System

One of the most architecturally sophisticated components is the ModelRouter. It doesn't just plug in one model — it routes tasks based on complexity tiers. Simple tasks go to small, fast models (local llama or Qwen quantized variants). Complex coding tasks route to larger models. If the primary model fails, it chains through fallback providers.

This is significant because local LLMs vary wildly in capability. An 8B parameter model can refactor simple functions but chokes on architectural design decisions. The router lets the agent be pragmatic — use the minimum model that gets the job done, reserving capacity for hard tasks. It also tracks per-model token usage, latency, and error rates, giving you observability into what the agent is actually doing.

RAG with Local Embeddings

The RAG pipeline indexes local documents — PDFs, markdown, code docs — and runs vector similarity search over them. Embeddings come from local models (nomic-embed-text via Ollama) with a TF-IDF fallback. This means the agent can reference project documentation, APIs, or any indexed knowledge base without sending it to the cloud. The pipeline has its own 35 tests, which is notable for an ML component.

MCP — The Agent Protocol

Support for the Model Context Protocol (MCP) is a forward-thinking addition. MCP is becoming the de facto standard for how agents discover and call tools. By supporting both stdio and HTTP transports, local-coding-agent can connect to external MCP servers — databases, file systems, web services — and call their tools alongside its native tools. This means the agent's capabilities are extensible without code changes.

Explainability & Audit Trail

With 37 tests, the explainability module is one of the most thoroughly tested components. It logs every decision, the reasoning chain, and confidence scores as JSON Lines. This is arguably the project's strongest differentiator. No major coding agent exposes this level of transparency — Claude Code operates as a black box. If something goes wrong, you get a full audit trail showing exactly why the agent chose a specific tool, what alternatives it considered, and its self-assessment of the outcome.

Success Criteria: Can It Compete?

Let's break down the targets outlined in the PRD:

SWE-bench ≥ 70% — This is ambitious but reachable. The current SWE-bench leaders (Claude Code at 80.8%, OpenCode at ~76%) get their scores from the underlying model. What local-coding-agent has going for it is the context management: RAG, in-context learning, project context auto-loading, and skill injection. These all effectively widen the agent's attention window, which correlates strongly with SWE-bench performance. The question is whether a locally-run 70B quantized model can close the gap with Opus 4.6 running in full precision on Anthropic's infrastructure.
First token latency < 500ms (local) — This depends entirely on your GPU. On the Mac Studio you're running (M3 Ultra with up to 256 GB unified memory), a quantized Qwen 3.6 27B would comfortably achieve this. On an RTX 4070 with 8 GB, you'd need heavier quantization and might see higher latency. This is a hardware-variable metric.
Tool call success rate ≥ 95% — The combination of typed schemas, validation, adaptive retry, and complexity routing makes this achievable. 719 tests across the tool ecosystem gives confidence that edge cases are covered.
Session memory recall accuracy ≥ 90% — The dual memory system (user profile + agent notes) with forced injection into the system prompt should handle this well, as long as the pruning policies keep the injected memory concise.
Subagent task completion rate ≥ 85% — Multi-agent coordination is inherently tricky. The project starts with depth-1 (children can't spawn grandchildren), which is the right call for v1. The structured summary format should help keep orchestration coherent.

Risks & Mitigations

The PRD identifies several risks with sensible mitigations:

Local models underperforming on complex tasks — The model router's fallback chains are the safety net here. You can configure cloud models as a fallback for when the local model can't handle the complexity. The question is whether this is acceptable for users who wanted to avoid cloud providers entirely.

Context window overflow on large codebases — Addressed via RAG, smart indexing, and depth-limited file reading. The project context auto-loading with size limits is also a good defense. But very large monorepos (think 100K+ LOC) will still pressure the context window, especially with quantized models that have shorter effective windows than their FP16 counterparts.

Prompt injection via retrieved content — The 12-pattern detection engine and command allowlists are solid for v1. This is an area where ongoing research is rapidly evolving — new injection techniques emerge monthly.

Performance regression as features add up — The PRD mentions a benchmark suite and performance budgets. With the existing modular architecture, this is manageable, but the 40+ tests on the cron scheduler and config system suggest the codebase is getting substantial. Regression testing will matter.

The Verdict

local-coding-agent is arguably the most structurally complete local coding agent in the open-source landscape right now.

What's impressive: The three-layer architecture mapped to documented agentic patterns. The test coverage (719 tests is no joke). The feature set is genuinely comprehensive — multi-agent orchestration, RAG, MCP, browser automation, cron jobs, skill management, explainability, adaptive retry, in-context learning. The Phase 5 additions (LSP, diff review, project context auto-loading, in-context learning) close the biggest UX gaps relative to Claude Code.

Where it's still behind: The underlying model is the bottleneck. No amount of architecture can make a locally-run 32B quantized model match Opus 4.6 on complex software engineering tasks. The SWE-bench target of 70% is achievable, but it's still ~10 points behind Claude Code. If you're a privacy-conscious developer who values control over raw capability, this gap is worth accepting. If you need maximum code correctness on hard problems, the cloud agents still win.

Who this is for: Developers who value privacy above all, teams working on proprietary code, researchers who can't send prompts to external APIs, or anyone tired of $20–$200/month subscriptions. It's also interesting for builders who want to learn — the architecture is a masterclass in agentic software design, and the pattern-mapping approach is something every agent builder should study.

Who it's not for yet: Teams that need Claude Code's level of reliability on production code. Until local models close the capability gap, this is best positioned as a co-pilot rather than a full autopilot for your most critical work.

The project is actively developed — the latest commits (as of this writing) are from 20 minutes ago, adding Phase 5 features. At this pace, and with the clear architectural vision, this is one to watch closely. The question isn't whether a local coding agent can be built. The question is now whether it can be built well — and by the tests, architecture, and feature set, local-coding-agent is proving that it can.

Links & References

GitHub: Local Coding Agent — main repository
Product Requirements Document (PRD) — full requirements, architecture, phased build plan
Implementation Plan — Phase 1–5 progress tracker with test counts
README — project overview and artifacts
INSTALL.md — setup instructions, tool call parsing fixes, default model config
Agentic Architectural Patterns for Building Multi-Agent Systems — Arsanjani & Bustos (Packt, 2026) — source material