1. Why Go Local?
There's a quiet revolution happening in garages, home offices, and server rooms around the world. Developers, researchers, and tinkerers are unplugging from the cloud and standing up their own AI workstations. The reason isn't contrarianism; it's a cold, rational calculation. For an increasing number of use cases, running AI locally is simply better in every dimension that matters: privacy, cost, latency, and control.
Privacy first. When you run a local model, your data never leaves your machine. Not one prompt, not one token, not one document you feed it for context. This isn't a minor feature; it's a fundamental shift in the trust model. Companies processing sensitive customer data, lawyers reviewing privileged documents, researchers handling pre-publication findings: all of these use cases are incompatible with cloud-hosted inference by default. Local models eliminate that exposure entirely.
Cost at scale. Cloud inference pricing is reasonable for occasional use; it becomes brutal at scale. Running a 70B-class model in production, answering thousands of queries per day, can run you tens of thousands of dollars per month on hosted APIs. A one-time hardware investment in a capable local rig amortizes to near-zero marginal cost per query. At high-volume usage, local wins in 3–6 months. At very high volume, it wins faster.
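The amortization claim is easy to sanity-check with a toy break-even model. Every figure below is an illustrative assumption, not a quoted price:

```python
# Toy break-even estimate for local vs. hosted inference.
# All dollar figures are illustrative assumptions.

def breakeven_months(hardware_cost, monthly_api_cost, monthly_power_cost):
    """Months until a one-time hardware buy beats recurring API spend."""
    monthly_savings = monthly_api_cost - monthly_power_cost
    if monthly_savings <= 0:
        return float("inf")  # at this volume, local never pays off
    return hardware_cost / monthly_savings

# Hypothetical: a ~$6,000 used-market 4x RTX 3090 rig vs. ~$2,000/mo
# of hosted 70B-class inference, minus ~$150/mo in electricity.
print(f"break-even: ~{breakeven_months(6000, 2000, 150):.1f} months")
```

At ten times the query volume, the API bill scales linearly while the hardware cost doesn't, which is the whole argument in one line.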
Latency with no ceiling. Cloud inference adds network round-trips, rate limits, queue delays, and the unpredictability of shared infrastructure. A well-configured local setup, such as a 4× RTX 3090 rig, starts streaming a response in tens of milliseconds: no network hop, no cold start, no API throttle. For real-time applications (voice assistants, coding agents, live document analysis), this matters enormously.
Total control. Want to use a specific quantization? Fine-tune on your data? Run a custom LoRA? Serve 50 concurrent users? Mix two models in a pipeline? With local hardware, you make every call. No model deprecations, no policy changes, no surprise capacity limits.
The hardware landscape has also reached a critical inflection point. Two archetypes dominate the conversation: the GPU rig (raw throughput via multiple discrete GPUs) and the Mac Studio (elegant, high-capacity unified memory). Each has a clear use case, and this guide covers both.
2. Hardware Tier 1 – The GPU Rig: 4× RTX 3090
If you want raw, multi-GPU inference horsepower that can run quantized 70B+ models and serve production workloads, nothing in the consumer or prosumer space competes with a well-built 4× RTX 3090 rig. The 3090 brought datacenter-class GA102 silicon and 24GB of VRAM to a consumer card, and for AI inference via tensor parallelism over PCIe, it's still king at the price point.
The VRAM Math
Each RTX 3090 carries 24GB of GDDR6X VRAM. Across four cards, that's 96GB of combined addressable VRAM. With vLLM's tensor parallelism, this is treated as a unified pool rather than separate islands. What this means in practice: a Qwen3.5-72B model in BF16 (~144GB raw) doesn't fit, but the same model in Q4_K_M quantization (~40GB) fits with headroom on just two cards. With all four, you're running 110B-class models comfortably and touching 120B territory via REAP-compressed weights.
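The arithmetic behind these numbers is worth internalizing. A rough estimator, where effective bits-per-weight and the overhead factor are assumptions (real GGUF/AWQ files vary by a few percent):

```python
# Back-of-envelope weight footprint for a quantized model.
# Q4_K_M averages roughly 4.5 effective bits per weight once
# quantization scales and mixed-precision layers are counted.

def model_size_gb(params_billion, bits_per_weight, overhead=1.1):
    """overhead approximates embeddings, norms, and quant scales."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

print(f"72B @ BF16:   ~{model_size_gb(72, 16, overhead=1.0):.0f} GB")
print(f"72B @ Q4_K_M: ~{model_size_gb(72, 4.5, overhead=1.0):.0f} GB")
```

The same formula explains why 110B at ~4.5 bits (~62GB) is comfortable on four cards and why 397B is out of reach without streaming.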
PCIe Bandwidth and Topology
NVLink is not required. Unlike training, which demands constant gradient synchronization, inference tensor parallelism requires far less inter-GPU bandwidth. PCIe 4.0 x16 per slot (~32 GB/s in each direction) is ideal; x8 lanes are acceptable for most inference workloads. The key is that your motherboard physically provides four full PCIe slots; many ATX boards only offer two.
Recommended motherboards:
- ASUS Pro WS X570-ACE – 4× PCIe 4.0 x16 slots, workstation-grade VRM, excellent stability
- MSI MEG Z790 GODLIKE – Intel platform, 4 PCIe 5.0/4.0 slots, PCIe bifurcation support
CPU, RAM, and Storage
The CPU doesn't need to be top-tier for inference. Your GPU is doing the heavy lifting. An AMD Ryzen 9 5950X or Intel Core i9-13900K is more than sufficient. The CPU's job is managing the memory bus, feeding data to the GPUs, and running your orchestration layer.
RAM: 128GB DDR4/5 is the target. When a model's VRAM footprint exceeds what your GPUs can hold, modern inference engines spill layers to system RAM. This is slower, but it keeps large models runnable. 128GB gives you meaningful overflow capacity. Do not go below 64GB on a 4× 3090 rig.
Storage: 2TB+ NVMe SSD is essential, not optional. Model weights are large: a 70B model in Q4 quantization is ~40GB, and a 120B model is ~60–70GB. SSD streaming (used by MLX for models too large to hold in memory) reads weights directly from NVMe during inference. Fast sequential read speeds (7+ GB/s on PCIe 4.0 NVMe) determine how quickly large models load and stream.
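Load time follows directly from sequential read throughput. A toy estimate, assuming the weights are read exactly once and the disk is the bottleneck:

```python
# Time to stream a model's weights off disk, read-bound.

def load_seconds(model_gb, read_gb_per_s):
    return model_gb / read_gb_per_s

print(f"40 GB on PCIe 4.0 NVMe (7 GB/s): ~{load_seconds(40, 7):.0f} s")
print(f"40 GB on SATA SSD (0.55 GB/s):   ~{load_seconds(40, 0.55):.0f} s")
```

The gap widens for streaming workloads, where the disk is hit continuously rather than just at startup.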
Power Requirements
Each RTX 3090 has a TDP of approximately 350W under load. Four cards running simultaneously draw ~1,400W, before counting CPU, RAM, storage, and fans. You need a 1600W+ PSU, ideally 1800W for headroom and power supply longevity. The EVGA SuperNOVA 1600 T2 and Seasonic PRIME TX-1600 are proven choices. Proper cable management and a well-ventilated case (the Fractal Design Meshify is popular) are critical: these cards run hot.
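A quick sizing sketch. The component figures are rough assumptions (check your actual parts), and many builders also power-limit their 3090s to around 280W to tame the total draw:

```python
# Rough PSU sizing for a multi-GPU rig; all figures are estimates.

def psu_watts(gpu_tdp, gpu_count, cpu_tdp=125, base_w=75, headroom=1.15):
    """base_w covers motherboard, RAM, storage, fans; headroom keeps
    the PSU out of its noisy, inefficient top range."""
    load = gpu_tdp * gpu_count + cpu_tdp + base_w
    return load * headroom

print(f"stock 350W cards:   ~{psu_watts(350, 4):.0f} W PSU")
print(f"power-limited 280W: ~{psu_watts(280, 4):.0f} W PSU")
```

Power-limiting is why a quality 1600W unit can be workable despite the stock-TDP math pointing higher.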
3. Hardware Tier 2 – Mac Studio M3/M4 Ultra
The Mac Studio with an M3 or M4 Ultra chip represents a fundamentally different design philosophy than the GPU rig, and for many workloads it's actually the superior choice. Where the GPU rig wins on raw multi-GPU tensor parallelism for batch throughput, the Mac Studio wins on sheer memory capacity, power efficiency, form factor, and silent operation.
Unified Memory: The Key Insight
Apple Silicon's defining architectural advantage for LLM inference is unified memory. The CPU and GPU share one physical memory pool; there are no PCIe transfers between host and device RAM. On an M4 Ultra with 192GB of unified memory, a 120B-parameter model in BF16 (~240GB raw) doesn't fit, but the same model at Q4 (~60GB) or Q8 (~120GB) sits entirely in memory with fast GPU access.
The M3 Ultra offers up to 192GB unified memory with approximately 800 GB/s memory bandwidth. The M4 Ultra pushes similar specs with architectural improvements in efficiency and throughput. This bandwidth matters enormously: inference speed is often memory-bandwidth-bound, not compute-bound. Apple's unified architecture means no transfer bottleneck between the memory pool and the GPU cores.
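The bandwidth-bound intuition can be turned into a quick upper-bound estimate. It's a simplification that ignores KV-cache traffic and compute, so treat it as a ceiling, not a prediction:

```python
# Upper bound on single-stream decode speed: each generated token
# must stream (roughly) all weight bytes through memory once.

def max_tok_per_s(bandwidth_gb_s, model_weights_gb):
    return bandwidth_gb_s / model_weights_gb

# ~800 GB/s unified memory vs. a 120B model at Q4 (~60 GB of weights):
print(f"ceiling: ~{max_tok_per_s(800, 60):.1f} tok/s")
```

This is also why quantization speeds up generation even when compute is plentiful: fewer weight bytes per token means more tokens per second.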
What Models Run Well
The sweet spot on a 128GB Mac Studio M3/M4 Ultra:
- Up to 72B in Q8 – fully resident, fast generation (~20 tokens/sec)
- Up to 120B in Q4 – fully resident, good for heavy reasoning tasks
- 397B via SSD streaming – Qwen3.5-397B runs at approximately 3.4 tok/s using MLX's Metal-accelerated SSD streaming. Slow, but unprecedented at this parameter count on consumer hardware.
Inference on Mac: MLX and Ollama
Two inference paths dominate on Apple Silicon. MLX is Apple's own machine learning framework, built on Metal for Apple-silicon GPUs. It offers the highest throughput for Apple-native weight formats and is the right choice for serious Mac inference workloads. The mlx-community organization on HuggingFace maintains pre-converted MLX weights for most popular models.
Ollama on Apple Silicon uses Metal acceleration automatically and provides a clean, beginner-friendly interface. For most users, Ollama is the right starting point: ollama run qwen3.5:27b just works, and the OpenAI-compatible API means your tooling integrates immediately.
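Because the API is OpenAI-compatible, any HTTP client works. A minimal sketch of a chat call against a local Ollama daemon; the model tag is an assumption, so substitute whatever you've pulled:

```python
import json
import urllib.request

# Build an OpenAI-style chat request for a local Ollama daemon.

def chat_request(prompt, model="qwen3.5:27b",
                 base="http://localhost:11434/v1"):
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        f"{base}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

# With the daemon running, send it and read the reply:
# with urllib.request.urlopen(chat_request("Say hi")) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Swap the base URL for a vLLM server and the same code works unchanged, which is the practical payoff of the shared API surface.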
4. Inference Engines
The inference engine is the layer between your model weights and your application. It handles model loading, quantization, batching, KV cache management, and serving. Choosing the right engine for your hardware and workload dramatically impacts both throughput and ease of setup. Here are the five you need to know.
⚡ Ollama
Best for: Beginners, Mac, single-GPU, quick setup
Ollama is the fastest path from zero to running a local model. It wraps llama.cpp in a clean daemon with a REST API that's OpenAI-compatible out of the box. One command, ollama run qwen3.5:7b, pulls the model, loads it, and starts serving. The model library at ollama.com/library covers every major open-source model in pre-quantized GGUF format.
On Apple Silicon, Ollama automatically uses Metal acceleration. On CUDA systems, it offloads layers to GPU automatically. The server listens at localhost:11434, and ollama ps reports running models and their memory usage.
Limitations: No real multi-GPU tensor parallelism (it treats multiple GPUs as overflow, not as a unified pool). No continuous batching or PagedAttention. Not suitable for high-throughput production serving. For a 4× 3090 rig at scale, use vLLM.
🚀 vLLM
Best for: 4× RTX 3090, high-throughput, multi-GPU tensor parallel, production
vLLM is the gold standard for high-throughput local inference on CUDA hardware. Its two key innovations are PagedAttention (which manages KV cache memory as non-contiguous pages, dramatically improving utilization) and continuous batching (which interleaves requests dynamically rather than waiting for a full batch).
For the 4× RTX 3090 rig, vLLM's tensor parallelism is the key feature: --tensor-parallel-size 4 splits the model's weights across all four GPUs simultaneously, treating them as a unified accelerator. Note that a 72B model at BF16 (~144GB) exceeds the rig's 96GB of total VRAM, so 70B+ models need quantized weights (e.g. AWQ or GPTQ): run a 72B at 4-bit across two GPUs, or across all four for extra KV-cache headroom:
docker run --gpus all -p 8000:8000 vllm/vllm-openai \
--model Qwen/Qwen3.5-72B-Instruct-AWQ \
--quantization awq \
--tensor-parallel-size 4
The resulting server is OpenAI API-compatible at localhost:8000. Drop it in as a replacement for any OpenAI SDK call. vLLM also supports speculative decoding, LoRA adapters at serving time, and prefix caching for long-context workloads.
🔧 llama.cpp
Best for: CPU+GPU hybrid, edge, GGUF models, fine-grained control
llama.cpp is the bedrock of local inference. Written in pure C++ with no Python dependencies, it runs on everything from a Raspberry Pi to a Linux server. Its GGUF quantization format (Q2 through Q8, with mixed-precision K-quants like Q4_K_M) makes it the most portable model format in the ecosystem.
The partial GPU offload feature is uniquely valuable for systems with limited VRAM: -ngl 40 offloads the first 40 transformer layers to GPU while keeping the rest in system RAM. A 70B model in Q4_K_M (~40GB) can partially fit on a single 24GB card, running much faster than pure CPU inference.
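Picking the -ngl value can be estimated rather than guessed. A crude split, assuming uniform layer sizes and ignoring the embedding and output matrices (an illustrative simplification, not llama.cpp's actual allocator):

```python
# Estimate how many transformer layers fit in VRAM for -ngl.

def layers_that_fit(model_gb, n_layers, vram_gb, reserve_gb=3.0):
    """reserve_gb leaves room for KV cache and CUDA buffers."""
    per_layer_gb = model_gb / n_layers
    return int((vram_gb - reserve_gb) / per_layer_gb)

# ~40 GB Q4_K_M model with 80 layers on a single 24 GB card:
print(f"-ngl {layers_that_fit(40, 80, 24)}")
```

Start near the estimate and back off a few layers if you hit out-of-memory errors at long context lengths, since KV cache grows with context.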
Ollama is built on llama.cpp. ExLlamaV2 is its main competitor for GPU-heavy use. For edge devices, network-constrained environments, or Windows machines, llama.cpp is the right call.
⚙️ ExLlamaV2
Best for: Maximum single-machine throughput, EXL2 quantization
ExLlamaV2 competes with vLLM on multi-GPU setups and often wins on single-machine throughput benchmarks. Its EXL2 quantization format achieves higher accuracy than GGUF at the same bit-width by using non-uniform per-layer quantization; a 70B model in EXL2 at ~4 bits per weight typically measures better than GGUF Q4_K_M in perplexity benchmarks.
The Tabby API wrapper provides an OpenAI-compatible server on top of ExLlamaV2. ExLlamaV2 also supports tensor parallelism across multiple GPUs, making it a genuine vLLM alternative for setups where you want maximum single-machine throughput over multi-node scalability.
🍎 MLX (Apple Silicon Only)
Best for: Apple Silicon – Metal-native inference, large-model streaming
Apple's MLX framework is the fastest inference engine for Apple Silicon by a meaningful margin. It uses Metal compute shaders tuned for the M-series GPU and unified-memory architecture. For models in MLX format (available on huggingface.co/mlx-community), it outperforms llama.cpp on Mac by 20–40% depending on the model and quantization.
MLX's SSD-streaming capability is what enables Qwen3.5-397B to run on a Mac Studio: weights stream from NVMe into unified memory in chunks as inference proceeds. It's slow (~3.4 tok/s), but it works. That's a frontier-class model running locally on consumer hardware.
| Engine | Multi-GPU | Platform | API | Best Use Case |
|---|---|---|---|---|
| Ollama | No | All | OpenAI-compat | Quick setup, Mac daily use |
| vLLM | ✅ Tensor parallel | CUDA | OpenAI-compat | Production, 4× 3090 rig |
| llama.cpp | Partial | All | OpenAI-compat | Edge, CPU+GPU hybrid |
| ExLlamaV2 | ✅ Tensor parallel | CUDA | Via Tabby | Max throughput, EXL2 quant |
| MLX | No (M-series) | Apple Silicon | Via Ollama/API | Mac Studio, SSD streaming |
5. Models for Local Agentic Use
Choosing a model for a local agentic deployment involves balancing capability, VRAM footprint, and task fit. The model landscape in early 2026 is rich; here are the families worth knowing.
Qwen3.5 Family (Recommended for Agents)
The Qwen3.5 series from Alibaba Cloud is the benchmark-leader for open-weight models and the top recommendation for local agentic deployments. The family spans an extraordinary range of sizes, making it ideal for resource-constrained and resource-rich environments alike.
- Qwen3.5-7B – 4–5GB VRAM in Q4. Fast, capable, great for lightweight classification, summarization, and routing tasks.
- Qwen3.5-14B – ~8GB VRAM in Q4. Strong reasoning, fits on a single RTX 3090 with headroom for context.
- Qwen3.5-27B – ~16GB VRAM in Q4. Excellent all-around model; runs on a single RTX 3090 at Q4, or at Q8 with overflow to system RAM.
- Qwen3.5-72B – ~40GB VRAM in Q4. Fits on two RTX 3090s with tensor parallelism. The best local agent model for most serious workloads.
- Qwen3.5-110B – Tensor parallel across 4× RTX 3090, or fully resident on a 128GB Mac Studio at Q4.
- Qwen3.5-397B – SSD streaming via MLX on a Mac Studio M4 Ultra 192GB. ~3.4 tok/s: slow, but unprecedented frontier-class local inference.
All Qwen3.5 models are trained for strong tool calling, JSON mode, and structured output: essential capabilities for agentic pipelines.
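Even with models trained for tool calling, an agent pipeline should validate what the model emits before executing anything. A minimal guardrail sketch; the tool name and JSON shape are illustrative, not any specific model's exact format:

```python
import json

# Validate a model-emitted tool call before dispatching it.

def parse_tool_call(raw, allowed_tools):
    """Return (name, args) if raw is a well-formed, permitted call."""
    call = json.loads(raw)  # raises ValueError on malformed JSON
    name = call["name"]
    if name not in allowed_tools:
        raise ValueError(f"model requested unknown tool: {name}")
    return name, call.get("arguments", {})

raw = '{"name": "web_search", "arguments": {"query": "RTX 3090 TDP"}}'
name, args = parse_tool_call(raw, {"web_search", "read_file"})
print(name, args["query"])
```

On a malformed or unexpected call, re-prompt the model with the error rather than crashing the loop; reliable local agents are built on exactly this kind of retry discipline.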
Nemotron Family (NVIDIA)
NVIDIA's Nemotron family covers a wide spectrum of deployment targets, from browser-capable models to dual-GPU heavyweights:
- Nemotron-Nano 4B – 3GB VRAM. Mamba hybrid architecture, designed to run in-browser or on low-power hardware.
- Nemotron-Cascade 2 (30B MoE) – only 3B parameters active per forward pass. Single-GPU efficient reasoning without the VRAM cost of a dense 30B model.
- Nemotron-Super 120B – runs via REAP compression on 2× RTX 3090 (~60GB VRAM). NVIDIA's most capable open model.
Hermes Models (Nous Research)
Nous Research's Hermes series is specifically optimized for agentic and tool-calling tasks. Hermes 3 and the newer Hermes Pro are fine-tuned on large datasets of tool-call traces, function specifications, and structured output examples. They reliably produce valid JSON function calls, handle complex multi-turn agentic dialogues, and work well in constrained-output pipelines. If your primary use case is an agent that calls tools and returns structured data, Hermes is worth evaluating alongside Qwen3.5.
Qwen3-Coder (Coding Tasks)
For coding agents and developer tooling, the Qwen3-Coder family (7B through 32B) is optimized specifically for code generation, completion, and editing. It's not a general-purpose model; it's trained on a massive corpus of code and technical documentation. For a local coding assistant powering an IDE plugin or terminal agent, Qwen3-Coder-14B via Ollama is a strong default.
6. TTS/STT: Local Voice Stack
A complete local AI stack includes voice: both text-to-speech and speech-to-text. As of 2026, the open-source voice stack has reached production quality. The three components worth deploying are Kokoro TTS, faster-whisper STT, and Cacique as the glue layer.
Kokoro TTS
Kokoro is an open-source TTS model released under the Apache 2.0 license. It offers over 50 voice presets across multiple accents and speaking styles, from professional narration voices to conversational tones. GPU-accelerated inference on an RTX 3090 delivers sub-100ms latency for short utterances, making it suitable for real-time voice assistant responses.
Kokoro exposes an OpenAI-compatible /v1/audio/speech API. Any application built against the OpenAI TTS API can switch to Kokoro with a single endpoint change. The af_sky voice is widely regarded as the cleanest and most natural female voice in the library. The am_adam and bm_george voices are the most natural male options.
faster-whisper STT
faster-whisper is an optimized implementation of OpenAI's Whisper speech recognition model using CTranslate2 as the backend. It achieves 4–8× the throughput of the original Whisper implementation with identical accuracy. The large-v3 model provides near-perfect transcription in dozens of languages and handles real-world audio conditions (background noise, accents, varied microphone quality) with remarkable robustness.
On a CUDA-enabled GPU, faster-whisper large-v3 transcribes a 60-second audio clip in under 3 seconds. It natively handles OGG, MP3, WAV, M4A, and most other common audio formats. This makes it ideal for voice message processing in agent pipelines: Telegram voice notes, phone recordings, meeting audio.
Cacique Server
Cacique is a combined TTS + STT server that runs Kokoro and faster-whisper together on a single GPU. It exposes a fully OpenAI-compatible API:
POST /v1/audio/speech → TTS (same spec as OpenAI TTS)
POST /v1/audio/transcriptions → STT (same spec as OpenAI Whisper API)
A simple TTS request from the command line:
curl http://10.0.0.79:8880/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{"model":"kokoro","voice":"af_sky","input":"Hello from your local AI stack.","response_format":"aac"}' \
-o speech.aac
Deployment Strategy
For a 4× RTX 3090 rig, dedicating one card to Cacique (Kokoro + faster-whisper) and three to your LLM inference engine is a reasonable allocation. Both TTS and STT are lightweight compared to LLM inference; one RTX 3090 handles both with capacity to spare.
Alternatively, a dedicated Mac Mini M4 (running mlx-whisper and a local Kokoro port) makes an excellent always-on audio server that's silent and power-efficient. For a Mac Studio setup where all GPU resources go to LLM inference, the Mac Mini M4 audio server is the natural complement.
7. Agent Frameworks
An inference engine gives you a model. An agent framework gives you a model that can do things: call tools, search the web, manage memory, send messages, and orchestrate multi-step tasks. Here are the four frameworks worth knowing for local deployments.
OpenClaw
OpenClaw is a personal AI agent runtime designed for self-hosted deployments. Its key architectural concept is the Skills system: modular, shareable packages that define what the agent can do (research a topic, generate a video, fetch your bookmarks, analyze a PDF). Skills are plain directories with a SKILL.md specification and supporting scripts, so they're easy to write and easy to share.
OpenClaw connects to messaging platforms (Telegram, Discord, Signal) and becomes your always-on personal assistant. It's heartbeat-driven: it proactively checks calendars and email and monitors tasks without requiring you to initiate every interaction. The OpenClaw gateway connects local tools to remote channels with zero infrastructure overhead.
For local inference, OpenClaw connects to any OpenAI-compatible endpoint: point it at your Ollama instance or vLLM server and it works immediately. This makes it the natural "application layer" on top of the inference stack described in this guide.
Hermes Agent (Nous Research)
Built by the same team that trained the Hermes model family, Hermes Agent is architected around programmatic tool calling. Its primary interface is execute_code, an RPC mechanism that allows the agent to run code in isolated environments with 5 different backend options (Python, bash, JavaScript, etc.).
Hermes Agent shines in multi-agent pipelines: you can spawn sub-agents, assign them tasks, and aggregate their outputs in a structured orchestration loop. It supports scheduled automations and long-running research tasks. The natural pairing is Hermes Pro (the model) with Hermes Agent (the framework), each optimized for the other.
NemoClaw (NVIDIA OpenShell)
NemoClaw is NVIDIA's enterprise-grade agent runtime, built on OpenShell. It focuses on safe, sandboxed shell operations: executing system commands in controlled environments with audit logging and permission management. Designed for data center deployment alongside Nemotron models, NemoClaw is the right choice when you need fine-grained control over what the agent is allowed to do at the system level. Less "personal assistant," more "ops automation."
Swama
Swama is a lightweight agent orchestrator that serves as the routing layer in the SmarterClaw stack. It's simple by design: route a request to the right model, aggregate results, and return structured output. No complex framework, no opaque abstractions; just clean routing between local Ollama-hosted models. For straightforward multi-model pipelines, Swama's simplicity is its advantage.
8. Coding Agents & Developer Tools
The local AI stack isn't complete without developer tooling that connects your models to your codebase. The open-source ecosystem has produced genuine Copilot and Cursor alternatives that work entirely with local inference. Here's the current landscape.
OpenCode
OpenCode is a terminal-native, open-source coding agent built for provider-agnostic local inference. It connects to any OpenAI-compatible endpoint; point it at your vLLM server or Ollama instance and it works immediately. The TUI (terminal user interface) is polished and intuitive: multi-file editing, inline diffs, conversation history, context management.
OpenCode's key design principle is that the AI model is just a backend: you bring your own, whether that's a cloud API or a local Qwen3-Coder-14B via Ollama. Available at github.com/opencode-ai/opencode.
Aider
Aider is a terminal coding assistant with unusually deep Git integration. It's commit-aware: it understands your repository's history, can reference past commits in its context, and automatically creates commits for the changes it makes. Aider works with any OpenAI-compatible endpoint; configure it to point at your local Ollama server and it runs entirely offline.
aider --openai-api-base http://localhost:11434/v1 \
--openai-api-key ollama \
--model ollama/qwen3-coder:14b
For codebases with meaningful Git history that you want the AI to understand, Aider is the strongest choice in the local-first ecosystem.
Continue.dev
Continue is an open-source IDE extension for VS Code and JetBrains that functions as a privacy-respecting Copilot alternative. Connect it to a local Ollama instance in the settings, and you get inline autocomplete, a chat sidebar, and context-aware suggestions, all running on your machine. The config.json setup takes about two minutes:
{
"models": [
{
"title": "Qwen3-Coder 14B",
"provider": "ollama",
"model": "qwen3-coder:14b"
}
]
}
Continue.dev is the best option for developers who want to stay in VS Code or JetBrains and get Copilot-class inline suggestions powered by local models.
Void
Void is an open-source fork of VS Code with AI deeply integrated at the IDE level: the closest open-source equivalent to Cursor. It provides inline completions, agent-mode code generation, multi-file context, and supports any OpenAI-compatible endpoint. For developers who want a full AI-native IDE experience without sending their code to Anthropic or OpenAI, Void is the answer. Point it at your local vLLM or Ollama endpoint in settings.
9. Putting It Together: Two Reference Stacks
All the components above come together into two coherent, deployable stacks. These are battle-tested reference configurations, not theoretical ones. Each reflects what actually runs well in practice on the specified hardware as of March 2026.
🖥️ Stack A – The GPU Rig (4× RTX 3090)
Hardware: 4× RTX 3090 (96GB VRAM), 128GB DDR4, Ryzen 9 5950X, 2TB NVMe
ASUS Pro WS X570-ACE, 1800W PSU, Fractal Meshify case
Inference: vLLM (--tensor-parallel-size 4) → Qwen3.5-72B, Nemotron-Super 120B
llama.cpp → GGUF models, edge/hybrid tasks
1× RTX 3090 dedicated to Cacique (TTS + STT)
Audio: Cacique server: Kokoro TTS (af_sky) + faster-whisper large-v3
GPU: RTX 3090 card 4 (dedicated)
Endpoint: http://localhost:8880/v1/audio/speech
Agent: OpenClaw → vLLM → Qwen3.5-72B (daily reasoning)
Hermes Agent → vLLM → Hermes Pro (tool calling pipelines)
Models: Qwen3.5-72B-Instruct (Q4) – daily general agent
Nemotron-Super-120B (REAP) – heavy reasoning, 2× GPU
Qwen3-Coder-14B (Q4) – code tasks
Hermes Pro (Q4) – structured outputs, JSON mode
Coding: OpenCode → Ollama → Qwen3-Coder-14B (lightweight)
Aider → Ollama → Qwen3-Coder-14B (Git-aware)
vLLM → Qwen3.5-72B (complex refactors)
IDE: VS Code + Continue.dev, or Void
Model: Qwen3-Coder-14B via Ollama for autocomplete
🍎 Stack B – Mac Studio M3/M4 Ultra (192GB)
Hardware: Mac Studio M4 Ultra, 192GB unified memory, 2TB SSD
Silent, 60W load power, macOS ecosystem
Inference: Ollama (Metal) → Qwen3.5-27B, Qwen3.5-110B (daily use)
MLX → Qwen3.5-397B via SSD streaming (~3.4 tok/s, frontier)
MLX → Qwen3-Coder-27B (coding tasks)
Audio: Option A: Cacique on remote GPU rig (http://rig:8880)
Option B: local mlx-whisper + Kokoro port for Mac
Option C: Dedicated Mac Mini M4 for always-on TTS/STT
Agent: OpenClaw → Ollama → Qwen3.5-110B (resident, 60GB Q4)
SmarterClaw (Swama + Ollama) for multi-model routing
Models: Qwen3.5-110B-Instruct (Q4, 64GB) – primary agent
Qwen3.5-397B (MLX streaming) – frontier tasks
Qwen3-Coder-27B (Q4) – code assistance
Coding: OpenCode → Ollama → Qwen3-Coder-27B
Aider → Ollama → Qwen3-Coder-27B
IDE: Void (full AI-native IDE, points at local Ollama)
Continue.dev (VS Code/JetBrains, local model)
Model: Qwen3-Coder-27B for autocomplete + chat
🧭 Which Stack Should You Build?
GPU Rig (4× RTX 3090): Choose this if you need high batch throughput, plan to serve multiple users or pipelines simultaneously, want maximum flexibility with quantization and model formats, are comfortable with Linux, and want the lowest cost per inference at scale.
Mac Studio (M3/M4 Ultra): Choose this if you need to run very large models (110B+) with simple setup, value silent operation and low power consumption, work primarily in macOS, want the simplest path to running frontier-class models locally, and don't need maximum throughput.
Many serious local AI practitioners run both: the Mac Studio as an always-on personal inference node for the agent layer, and a GPU rig for batch processing, training, and high-throughput tasks. They're complementary, not competing.
References
- vLLM Documentation – docs.vllm.ai
- Ollama GitHub – github.com/ollama/ollama
- llama.cpp GitHub – github.com/ggerganov/llama.cpp
- NousResearch Hermes Agent – hermes-agent.nousresearch.com
- OpenClaw Documentation – docs.openclaw.ai
- OpenCode GitHub – github.com/opencode-ai/opencode
- Continue.dev – continue.dev
- Kokoro TTS FastAPI – github.com/remsky/Kokoro-FastAPI
- faster-whisper – github.com/SYSTRAN/faster-whisper
- Qwen3.5 Models – huggingface.co/Qwen