1. Why Go Local?
There's a quiet revolution happening in garages, home offices, and server rooms around the world. Developers, researchers, and tinkerers are unplugging from the cloud and standing up their own AI workstations. The reason isn't contrarianism; it's a cold, rational calculation. For an increasing number of use cases, running AI locally is simply better in every dimension that matters: privacy, cost, latency, and control.
Privacy first. When you run a local model, your data never leaves your machine. Not one prompt, not one token, not one document you feed it for context. This isn't a minor feature; it's a fundamental shift in the trust model. Companies processing sensitive customer data, lawyers reviewing privileged documents, researchers handling pre-publication findings: all of these use cases are incompatible with cloud-hosted inference by default. Local models eliminate that exposure entirely.
Cost at scale. Cloud inference pricing is reasonable for occasional use; it becomes brutal at scale. Running a 70B-class model in production, answering thousands of queries per day, can run you tens of thousands of dollars per month on hosted APIs. A one-time hardware investment in a capable local rig amortizes to near-zero marginal cost per query. At high-volume usage, local wins in 3–6 months. At very high volume, it wins faster.
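The amortization claim is easy to sanity-check with a toy break-even model. Every figure below is an illustrative assumption, not a quoted price:

```python
# Toy break-even estimate for local vs. hosted inference.
# All dollar figures are illustrative assumptions.

def breakeven_months(hardware_cost, monthly_api_cost, monthly_power_cost):
    """Months until a one-time hardware buy beats recurring API spend."""
    monthly_savings = monthly_api_cost - monthly_power_cost
    if monthly_savings <= 0:
        return float("inf")  # at this volume, local never pays off
    return hardware_cost / monthly_savings

# Hypothetical: a ~$6,000 used-market 4x RTX 3090 rig vs. ~$2,000/mo
# of hosted 70B-class inference, minus ~$150/mo in electricity.
print(f"break-even: ~{breakeven_months(6000, 2000, 150):.1f} months")
```

At ten times the query volume, the API bill scales linearly while the hardware cost doesn't, which is the whole argument in one line.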
Latency with no ceiling. Cloud inference adds network round-trips, rate limits, queue delays, and the unpredictability of shared infrastructure. A well-configured local setup, such as a 4× RTX 3090 rig, starts streaming a response in tens of milliseconds: no network hop, no cold start, no API throttle. For real-time applications (voice assistants, coding agents, live document analysis), this matters enormously.
Total control. Want to use a specific quantization? Fine-tune on your data? Run a custom LoRA? Serve 50 concurrent users? Mix two models in a pipeline? With local hardware, you make every call. No model deprecations, no policy changes, no surprise capacity limits.
The hardware landscape has also reached a critical inflection point. Two archetypes dominate the conversation: the GPU rig (raw throughput via multiple discrete GPUs) and the Mac Studio (elegant, high-capacity unified memory). Each has a clear use case, and this guide covers both.
2. Hardware Tier 1 – The GPU Rig: 4× RTX 3090
If you want raw, multi-GPU inference horsepower that can run quantized 70B+ models and serve production workloads, nothing in the consumer or prosumer space competes with a well-built 4× RTX 3090 rig. The 3090 brought datacenter-class GA102 silicon and 24GB of VRAM to a consumer card, and for AI inference via tensor parallelism over PCIe, it's still king at the price point.
The VRAM Math
Each RTX 3090 carries 24GB of GDDR6X VRAM. Across four cards, that's 96GB of combined addressable VRAM. With vLLM's tensor parallelism, this is treated as a unified pool rather than separate islands. What this means in practice: a Qwen3.5-72B model in BF16 (~144GB raw) doesn't fit, but the same model in Q4_K_M quantization (~40GB) fits with headroom on just two cards. With all four, you're running 110B-class models comfortably and touching 120B territory via REAP-compressed weights.
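The arithmetic behind these numbers is worth internalizing. A rough estimator, where effective bits-per-weight and the overhead factor are assumptions (real GGUF/AWQ files vary by a few percent):

```python
# Back-of-envelope weight footprint for a quantized model.
# Q4_K_M averages roughly 4.5 effective bits per weight once
# quantization scales and mixed-precision layers are counted.

def model_size_gb(params_billion, bits_per_weight, overhead=1.1):
    """overhead approximates embeddings, norms, and quant scales."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

print(f"72B @ BF16:   ~{model_size_gb(72, 16, overhead=1.0):.0f} GB")
print(f"72B @ Q4_K_M: ~{model_size_gb(72, 4.5, overhead=1.0):.0f} GB")
```

The same formula explains why 110B at ~4.5 bits (~62GB) is comfortable on four cards and why 397B is out of reach without streaming.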
PCIe Bandwidth and Topology
NVLink is not required. Unlike training, which demands constant gradient synchronization, inference tensor parallelism requires far less inter-GPU bandwidth. PCIe 4.0 x16 per slot (~32 GB/s in each direction) is ideal; x8 lanes are acceptable for most inference workloads. The key is that your motherboard physically provides four full PCIe slots; many ATX boards only offer two.
Recommended motherboards:
- ASUS Pro WS X570-ACE – 4× PCIe 4.0 x16 slots, workstation-grade VRM, excellent stability
- MSI MEG Z790 GODLIKE – Intel platform, 4 PCIe 5.0/4.0 slots, PCIe bifurcation support
CPU, RAM, and Storage
The CPU doesn't need to be top-tier for inference. Your GPU is doing the heavy lifting. An AMD Ryzen 9 5950X or Intel Core i9-13900K is more than sufficient. The CPU's job is managing the memory bus, feeding data to the GPUs, and running your orchestration layer.
RAM: 128GB DDR4/5 is the target. When a model's VRAM footprint exceeds what your GPUs can hold, modern inference engines spill layers to system RAM. This is slower, but it keeps large models runnable. 128GB gives you meaningful overflow capacity. Do not go below 64GB on a 4× 3090 rig.
Storage: 2TB+ NVMe SSD is essential, not optional. Model weights are large: a 70B model in Q4 quantization is ~40GB, and a 120B model is ~60–70GB. SSD streaming (used by MLX for models too large to hold in memory) reads weights directly from NVMe during inference. Fast sequential read speeds (7+ GB/s on PCIe 4.0 NVMe) determine how quickly large models load and stream.
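Load time follows directly from sequential read throughput. A toy estimate, assuming the weights are read exactly once and the disk is the bottleneck:

```python
# Time to stream a model's weights off disk, read-bound.

def load_seconds(model_gb, read_gb_per_s):
    return model_gb / read_gb_per_s

print(f"40 GB on PCIe 4.0 NVMe (7 GB/s): ~{load_seconds(40, 7):.0f} s")
print(f"40 GB on SATA SSD (0.55 GB/s):   ~{load_seconds(40, 0.55):.0f} s")
```

The gap widens for streaming workloads, where the disk is hit continuously rather than just at startup.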
Power Requirements
Each RTX 3090 has a TDP of approximately 350W under load. Four cards running simultaneously draw ~1,400W, before counting CPU, RAM, storage, and fans. You need a 1600W+ PSU, ideally 1800W for headroom and power supply longevity. The EVGA SuperNOVA 1600 T2 and Seasonic PRIME TX-1600 are proven choices. Proper cable management and a well-ventilated case (the Fractal Design Meshify is popular) are critical: these cards run hot.
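A quick sizing sketch. The component figures are rough assumptions (check your actual parts), and many builders also power-limit their 3090s to around 280W to tame the total draw:

```python
# Rough PSU sizing for a multi-GPU rig; all figures are estimates.

def psu_watts(gpu_tdp, gpu_count, cpu_tdp=125, base_w=75, headroom=1.15):
    """base_w covers motherboard, RAM, storage, fans; headroom keeps
    the PSU out of its noisy, inefficient top range."""
    load = gpu_tdp * gpu_count + cpu_tdp + base_w
    return load * headroom

print(f"stock 350W cards:   ~{psu_watts(350, 4):.0f} W PSU")
print(f"power-limited 280W: ~{psu_watts(280, 4):.0f} W PSU")
```

Power-limiting is why a quality 1600W unit can be workable despite the stock-TDP math pointing higher.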
3. Hardware Tier 2 – Mac Studio M3/M4 Ultra
The Mac Studio with an M3 or M4 Ultra chip represents a fundamentally different design philosophy than the GPU rig, and for many workloads it's actually the superior choice. Where the GPU rig wins on raw multi-GPU tensor parallelism for batch throughput, the Mac Studio wins on sheer memory capacity, power efficiency, form factor, and silent operation.
Unified Memory: The Key Insight
Apple Silicon's defining architectural advantage for LLM inference is unified memory. The CPU and GPU share one physical memory pool; there are no PCIe transfers between host and device RAM. On an M4 Ultra with 192GB of unified memory, a 120B-parameter model in BF16 (~240GB raw) doesn't fit, but the same model at Q4 (~60GB) or Q8 (~120GB) sits entirely in memory with fast GPU access.
The M3 Ultra offers up to 192GB unified memory with approximately 800 GB/s memory bandwidth. The M4 Ultra pushes similar specs with architectural improvements in efficiency and throughput. This bandwidth matters enormously: inference speed is often memory-bandwidth-bound, not compute-bound. Apple's unified architecture means no transfer bottleneck between the memory pool and the GPU cores.
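The bandwidth-bound intuition can be turned into a quick upper-bound estimate. It's a simplification that ignores KV-cache traffic and compute, so treat it as a ceiling, not a prediction:

```python
# Upper bound on single-stream decode speed: each generated token
# must stream (roughly) all weight bytes through memory once.

def max_tok_per_s(bandwidth_gb_s, model_weights_gb):
    return bandwidth_gb_s / model_weights_gb

# ~800 GB/s unified memory vs. a 120B model at Q4 (~60 GB of weights):
print(f"ceiling: ~{max_tok_per_s(800, 60):.1f} tok/s")
```

This is also why quantization speeds up generation even when compute is plentiful: fewer weight bytes per token means more tokens per second.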
What Models Run Well
The sweet spot on a 128GB Mac Studio M3/M4 Ultra:
- Up to 72B in Q8 – fully resident, fast generation (~20 tokens/sec)
- Up to 120B in Q4 – fully resident, good for heavy reasoning tasks
- 397B via SSD streaming – Qwen3.5-397B runs at approximately 3.4 tok/s using MLX's Metal-accelerated SSD streaming. Slow, but unprecedented at this parameter count on consumer hardware.
Inference on Mac: MLX and Ollama
Two inference paths dominate on Apple Silicon. MLX is Apple's own machine learning framework, built on Metal for Apple-silicon GPUs. It offers the highest throughput for Apple-native weight formats and is the right choice for serious Mac inference workloads. The mlx-community organization on HuggingFace maintains pre-converted MLX weights for most popular models.
Ollama on Apple Silicon uses Metal acceleration automatically and provides a clean, beginner-friendly interface. For most users, Ollama is the right starting point: ollama run qwen3.5:27b just works, and the OpenAI-compatible API means your tooling integrates immediately.
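Because the API is OpenAI-compatible, any HTTP client works. A minimal sketch of a chat call against a local Ollama daemon; the model tag is an assumption, so substitute whatever you've pulled:

```python
import json
import urllib.request

# Build an OpenAI-style chat request for a local Ollama daemon.

def chat_request(prompt, model="qwen3.5:27b",
                 base="http://localhost:11434/v1"):
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        f"{base}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

# With the daemon running, send it and read the reply:
# with urllib.request.urlopen(chat_request("Say hi")) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Swap the base URL for a vLLM server and the same code works unchanged, which is the practical payoff of the shared API surface.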
4. Inference Engines
The inference engine is the layer between your model weights and your application. It handles model loading, quantization, batching, KV cache management, and serving. Choosing the right engine for your hardware and workload dramatically impacts both throughput and ease of setup. Here are the five you need to know.
⚡ Ollama
Best for: Beginners, Mac, single-GPU, quick setup
Ollama is the fastest path from zero to running a local model. It wraps llama.cpp in a clean daemon with a REST API that's OpenAI-compatible out of the box. One command, ollama run qwen3.5:7b, pulls the model, loads it, and starts serving. The model library at ollama.com/library covers every major open-source model in pre-quantized GGUF format.
On Apple Silicon, Ollama automatically uses Metal acceleration. On CUDA systems, it offloads layers to GPU automatically. The server listens at localhost:11434, and ollama ps reports running models and their memory usage.
Limitations: No real multi-GPU tensor parallelism (it treats multiple GPUs as overflow, not as a unified pool). No continuous batching or PagedAttention. Not suitable for high-throughput production serving. For a 4× 3090 rig at scale, use vLLM.
🚀 vLLM
Best for: 4× RTX 3090, high-throughput, multi-GPU tensor parallel, production
vLLM is the gold standard for high-throughput local inference on CUDA hardware. Its two key innovations are PagedAttention (which manages KV cache memory as non-contiguous pages, dramatically improving utilization) and continuous batching (which interleaves requests dynamically rather than waiting for a full batch).
For the 4× RTX 3090 rig, vLLM's tensor parallelism is the key feature: --tensor-parallel-size 4 splits the model's weights across all four GPUs simultaneously, treating them as a unified accelerator. Note that a 72B model at BF16 (~144GB) exceeds the rig's 96GB of total VRAM, so 70B+ models need quantized weights (e.g. AWQ or GPTQ): run a 72B at 4-bit across two GPUs, or across all four for extra KV-cache headroom:
docker run --gpus all -p 8000:8000 vllm/vllm-openai \
--model Qwen/Qwen3.5-72B-Instruct-AWQ \
--quantization awq \
--tensor-parallel-size 4
The resulting server is OpenAI API-compatible at localhost:8000. Drop it in as a replacement for any OpenAI SDK call. vLLM also supports speculative decoding, LoRA adapters at serving time, and prefix caching for long-context workloads.
🔧 llama.cpp
Best for: CPU+GPU hybrid, edge, GGUF models, fine-grained control
llama.cpp is the bedrock of local inference. Written in pure C++ with no Python dependencies, it runs on everything from a Raspberry Pi to a Linux server. Its GGUF quantization format (Q2 through Q8, with mixed-precision K-quants like Q4_K_M) makes it the most portable model format in the ecosystem.
The partial GPU offload feature is uniquely valuable for systems with limited VRAM: -ngl 40 offloads the first 40 transformer layers to GPU while keeping the rest in system RAM. A 70B model in Q4_K_M (~40GB) can partially fit on a single 24GB card, running much faster than pure CPU inference.
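Picking the -ngl value can be estimated rather than guessed. A crude split, assuming uniform layer sizes and ignoring the embedding and output matrices (an illustrative simplification, not llama.cpp's actual allocator):

```python
# Estimate how many transformer layers fit in VRAM for -ngl.

def layers_that_fit(model_gb, n_layers, vram_gb, reserve_gb=3.0):
    """reserve_gb leaves room for KV cache and CUDA buffers."""
    per_layer_gb = model_gb / n_layers
    return int((vram_gb - reserve_gb) / per_layer_gb)

# ~40 GB Q4_K_M model with 80 layers on a single 24 GB card:
print(f"-ngl {layers_that_fit(40, 80, 24)}")
```

Start near the estimate and back off a few layers if you hit out-of-memory errors at long context lengths, since KV cache grows with context.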
Ollama is built on llama.cpp. ExLlamaV2 is its main competitor for GPU-heavy use. For edge devices, network-constrained environments, or Windows machines, llama.cpp is the right call.
⚙️ ExLlamaV2
Best for: Maximum single-machine throughput, EXL2 quantization
ExLlamaV2 competes with vLLM on multi-GPU setups and often wins on single-machine throughput benchmarks. Its EXL2 quantization format achieves higher accuracy than GGUF at the same bit-width by using non-uniform per-layer quantization; a 70B model in EXL2 at ~4 bits per weight typically measures better than GGUF Q4_K_M in perplexity benchmarks.
The Tabby API wrapper provides an OpenAI-compatible server on top of ExLlamaV2. ExLlamaV2 also supports tensor parallelism across multiple GPUs, making it a genuine vLLM alternative for setups where you want maximum single-machine throughput over multi-node scalability.
🍎 MLX (Apple Silicon Only)
Best for: Apple Silicon – Metal-native inference, large-model streaming
Apple's MLX framework is the fastest inference engine for Apple Silicon by a meaningful margin. It uses Metal compute shaders tuned for the M-series GPU and unified-memory architecture. For models in MLX format (available on huggingface.co/mlx-community), it outperforms llama.cpp on Mac by 20–40% depending on the model and quantization.
MLX's SSD-streaming capability is what enables Qwen3.5-397B to run on a Mac Studio: weights stream from NVMe into unified memory in chunks as inference proceeds. It's slow (~3.4 tok/s), but it works. That's a frontier-class model running locally on consumer hardware.
| Engine | Multi-GPU | Platform | API | Best Use Case |
|---|---|---|---|---|
| Ollama | No | All | OpenAI-compat | Quick setup, Mac daily use |
| vLLM | ✅ Tensor parallel | CUDA | OpenAI-compat | Production, 4× 3090 rig |
| llama.cpp | Partial | All | OpenAI-compat | Edge, CPU+GPU hybrid |
| ExLlamaV2 | ✅ Tensor parallel | CUDA | Via Tabby | Max throughput, EXL2 quant |
| MLX | No (M-series) | Apple Silicon | Via Ollama/API | Mac Studio, SSD streaming |
5. Models for Local Agentic Use
Choosing a model for a local agentic deployment involves balancing capability, VRAM footprint, and task fit. The model landscape in early 2026 is rich; here are the families worth knowing.
Qwen3.5 Family (Recommended for Agents)
The Qwen3.5 series from Alibaba Cloud is the benchmark-leader for open-weight models and the top recommendation for local agentic deployments. The family spans an extraordinary range of sizes, making it ideal for resource-constrained and resource-rich environments alike.
- Qwen3.5-7B – 4–5GB VRAM in Q4. Fast, capable, great for lightweight classification, summarization, and routing tasks.
- Qwen3.5-14B – ~8GB VRAM in Q4. Strong reasoning, fits on a single RTX 3090 with headroom for context.
- Qwen3.5-27B – ~16GB VRAM in Q4. Excellent all-around model; runs on a single RTX 3090 at Q4, or at Q8 with overflow to system RAM.
- Qwen3.5-72B – ~40GB VRAM in Q4. Fits on two RTX 3090s with tensor parallelism. The best local agent model for most serious workloads.
- Qwen3.5-110B – Tensor parallel across 4× RTX 3090, or fully resident on a 128GB Mac Studio at Q4.
- Qwen3.5-397B – SSD streaming via MLX on a Mac Studio M4 Ultra 192GB. ~3.4 tok/s: slow, but unprecedented frontier-class local inference.
All Qwen3.5 models are trained for strong tool calling, JSON mode, and structured output: essential capabilities for agentic pipelines.
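Even with models trained for tool calling, an agent pipeline should validate what the model emits before executing anything. A minimal guardrail sketch; the tool name and JSON shape are illustrative, not any specific model's exact format:

```python
import json

# Validate a model-emitted tool call before dispatching it.

def parse_tool_call(raw, allowed_tools):
    """Return (name, args) if raw is a well-formed, permitted call."""
    call = json.loads(raw)  # raises ValueError on malformed JSON
    name = call["name"]
    if name not in allowed_tools:
        raise ValueError(f"model requested unknown tool: {name}")
    return name, call.get("arguments", {})

raw = '{"name": "web_search", "arguments": {"query": "RTX 3090 TDP"}}'
name, args = parse_tool_call(raw, {"web_search", "read_file"})
print(name, args["query"])
```

On a malformed or unexpected call, re-prompt the model with the error rather than crashing the loop; reliable local agents are built on exactly this kind of retry discipline.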
Nemotron Family (NVIDIA)
NVIDIA's Nemotron family covers a wide spectrum of deployment targets, from browser-capable models to dual-GPU heavyweights:
- Nemotron-Nano 4B – 3GB VRAM. Mamba hybrid architecture, designed to run in-browser or on low-power hardware.
- Nemotron-Cascade 2 (30B MoE) – only 3B parameters active per forward pass. Single-GPU efficient reasoning without the VRAM cost of a dense 30B model.
- Nemotron-Super 120B – runs via REAP compression on 2× RTX 3090 (~60GB VRAM). NVIDIA's most capable open model.
Hermes Models (Nous Research)
Nous Research's Hermes series is specifically optimized for agentic and tool-calling tasks. Hermes 3 and the newer Hermes Pro are fine-tuned on large datasets of tool-call traces, function specifications, and structured output examples. They reliably produce valid JSON function calls, handle complex multi-turn agentic dialogues, and work well in constrained-output pipelines. If your primary use case is an agent that calls tools and returns structured data, Hermes is worth evaluating alongside Qwen3.5.
Qwen3-Coder (Coding Tasks)
For coding agents and developer tooling, the Qwen3-Coder family (7B through 32B) is optimized specifically for code generation, completion, and editing. It's not a general-purpose model; it's trained on a massive corpus of code and technical documentation. For a local coding assistant powering an IDE plugin or terminal agent, Qwen3-Coder-14B via Ollama is a strong default.
6. TTS/STT: Local Voice Stack
A complete local AI stack includes voice: both text-to-speech and speech-to-text. As of 2026, the open-source voice stack has reached production quality. The three components worth deploying are Kokoro TTS, faster-whisper STT, and Cacique as the glue layer.
Kokoro TTS
Kokoro is an open-source TTS model released under the Apache 2.0 license. It offers over 50 voice presets across multiple accents and speaking styles, from professional narration voices to conversational tones. GPU-accelerated inference on an RTX 3090 delivers sub-100ms latency for short utterances, making it suitable for real-time voice assistant responses.
Kokoro exposes an OpenAI-compatible /v1/audio/speech API. Any application built against the OpenAI TTS API can switch to Kokoro with a single endpoint change. The af_sky voice is widely regarded as the cleanest and most natural female voice in the library. The am_adam and bm_george voices are the most natural male options.
faster-whisper STT
faster-whisper is an optimized implementation of OpenAI's Whisper speech recognition model using CTranslate2 as the backend. It achieves 4–8× the throughput of the original Whisper implementation with identical accuracy. The large-v3 model provides near-perfect transcription in dozens of languages and handles real-world audio conditions (background noise, accents, varied microphone quality) with remarkable robustness.
On a CUDA-enabled GPU, faster-whisper large-v3 transcribes a 60-second audio clip in under 3 seconds. It natively handles OGG, MP3, WAV, M4A, and most other common audio formats. This makes it ideal for voice message processing in agent pipelines: Telegram voice notes, phone recordings, meeting audio.
Cacique Server
Cacique is a combined TTS + STT server that runs Kokoro and faster-whisper together on a single GPU. It exposes a fully OpenAI-compatible API:
POST /v1/audio/speech → TTS (same spec as OpenAI TTS)
POST /v1/audio/transcriptions → STT (same spec as OpenAI Whisper API)
A simple TTS request from the command line:
curl http://10.0.0.79:8880/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{"model":"kokoro","voice":"af_sky","input":"Hello from your local AI stack.","response_format":"aac"}' \
-o speech.aac
Deployment Strategy
For a 4× RTX 3090 rig, dedicating one card to Cacique (Kokoro + faster-whisper) and three to your LLM inference engine is a reasonable allocation. Both TTS and STT are lightweight compared to LLM inference; one RTX 3090 handles both with capacity to spare.
Alternatively, a dedicated Mac Mini M4 (running mlx-whisper and a local Kokoro port) makes an excellent always-on audio server that's silent and power-efficient. For a Mac Studio setup where all GPU resources go to LLM inference, the Mac Mini M4 audio server is the natural complement.
7. Agent Frameworks
An inference engine gives you a model. An agent framework gives you a model that can do things: call tools, search the web, manage memory, send messages, and orchestrate multi-step tasks. Here are the four frameworks worth knowing for local deployments.
OpenClaw
OpenClaw is a personal AI agent runtime designed for self-hosted deployments. Its key architectural concept is the Skills system: modular, shareable packages that define what the agent can do (research a topic, generate a video, fetch your bookmarks, analyze a PDF). Skills are plain directories with a SKILL.md specification and supporting scripts, so they're easy to write and easy to share.
OpenClaw connects to messaging platforms (Telegram, Discord, Signal) and becomes your always-on personal assistant. It's heartbeat-driven: it proactively checks calendars and email and monitors tasks without requiring you to initiate every interaction. The OpenClaw gateway connects local tools to remote channels with zero infrastructure overhead.
For local inference, OpenClaw connects to any OpenAI-compatible endpoint: point it at your Ollama instance or vLLM server and it works immediately. This makes it the natural "application layer" on top of the inference stack described in this guide.
Hermes Agent (Nous Research)
Built by the same team that trained the Hermes model family, Hermes Agent is architected around programmatic tool calling. Its primary interface is execute_code, an RPC mechanism that allows the agent to run code in isolated environments with 5 different backend options (Python, bash, JavaScript, etc.).
Hermes Agent shines in multi-agent pipelines: you can spawn sub-agents, assign them tasks, and aggregate their outputs in a structured orchestration loop. It supports scheduled automations and long-running research tasks. The natural pairing is Hermes Pro (the model) with Hermes Agent (the framework), each optimized for the other.
NemoClaw (NVIDIA OpenShell)
NemoClaw is NVIDIA's enterprise-grade agent runtime, built on OpenShell. It focuses on safe, sandboxed shell operations: executing system commands in controlled environments with audit logging and permission management. Designed for data center deployment alongside Nemotron models, NemoClaw is the right choice when you need fine-grained control over what the agent is allowed to do at the system level. Less "personal assistant," more "ops automation."
Swama
Swama is a lightweight agent orchestrator that serves as the routing layer in the SmarterClaw stack. It's simple by design: route a request to the right model, aggregate results, and return structured output. No complex framework, no opaque abstractions; just clean routing between local Ollama-hosted models. For straightforward multi-model pipelines, Swama's simplicity is its advantage.
8. Coding Agents & Developer Tools
The local AI stack isn't complete without developer tooling that connects your models to your codebase. The open-source ecosystem has produced genuine Copilot and Cursor alternatives that work entirely with local inference. Here's the current landscape.
OpenCode
OpenCode is a terminal-native, open-source coding agent built for provider-agnostic local inference. It connects to any OpenAI-compatible endpoint; point it at your vLLM server or Ollama instance and it works immediately. The TUI (terminal user interface) is polished and intuitive: multi-file editing, inline diffs, conversation history, context management.
OpenCode's key design principle is that the AI model is just a backend: you bring your own, whether that's a cloud API or a local Qwen3-Coder-14B via Ollama. Available at github.com/opencode-ai/opencode.
Aider
Aider is a terminal coding assistant with unusually deep Git integration. It's commit-aware: it understands your repository's history, can reference past commits in its context, and automatically creates commits for the changes it makes. Aider works with any OpenAI-compatible endpoint; configure it to point at your local Ollama server and it runs entirely offline.
aider --openai-api-base http://localhost:11434/v1 \
--openai-api-key ollama \
--model ollama/qwen3-coder:14b
For codebases with meaningful Git history that you want the AI to understand, Aider is the strongest choice in the local-first ecosystem.
Continue.dev
Continue is an open-source IDE extension for VS Code and JetBrains that functions as a privacy-respecting Copilot alternative. Connect it to a local Ollama instance in the settings, and you get inline autocomplete, a chat sidebar, and context-aware suggestions, all running on your machine. The config.json setup takes about two minutes:
{
"models": [
{
"title": "Qwen3-Coder 14B",
"provider": "ollama",
"model": "qwen3-coder:14b"
}
]
}
Continue.dev is the best option for developers who want to stay in VS Code or JetBrains and get Copilot-class inline suggestions powered by local models.
Void
Void is an open-source fork of VS Code with AI deeply integrated at the IDE level: the closest open-source equivalent to Cursor. It provides inline completions, agent-mode code generation, multi-file context, and supports any OpenAI-compatible endpoint. For developers who want a full AI-native IDE experience without sending their code to Anthropic or OpenAI, Void is the answer. Point it at your local vLLM or Ollama endpoint in settings.
9. Putting It Together: Two Reference Stacks
All the components above come together into two coherent, deployable stacks. These are battle-tested reference configurations, not theoretical ones. Each reflects what actually runs well in practice on the specified hardware as of March 2026.
🖥️ Stack A – The GPU Rig (4× RTX 3090)
Hardware: 4× RTX 3090 (96GB VRAM), 128GB DDR4, Ryzen 9 5950X, 2TB NVMe
ASUS Pro WS X570-ACE, 1800W PSU, Fractal Meshify case
Inference: vLLM (--tensor-parallel-size 4) → Qwen3.5-72B, Nemotron-Super 120B
llama.cpp → GGUF models, edge/hybrid tasks
1× RTX 3090 dedicated to Cacique (TTS + STT)
Audio: Cacique server: Kokoro TTS (af_sky) + faster-whisper large-v3
GPU: RTX 3090 card 4 (dedicated)
Endpoint: http://localhost:8880/v1/audio/speech
Agent: OpenClaw → vLLM → Qwen3.5-72B (daily reasoning)
Hermes Agent → vLLM → Hermes Pro (tool calling pipelines)
Models: Qwen3.5-72B-Instruct (Q4) – daily general agent
Nemotron-Super-120B (REAP) – heavy reasoning, 2× GPU
Qwen3-Coder-14B (Q4) – code tasks
Hermes Pro (Q4) – structured outputs, JSON mode
Coding: OpenCode → Ollama → Qwen3-Coder-14B (lightweight)
Aider → Ollama → Qwen3-Coder-14B (Git-aware)
vLLM → Qwen3.5-72B (complex refactors)
IDE: VS Code + Continue.dev, or Void
Model: Qwen3-Coder-14B via Ollama for autocomplete
🍎 Stack B – Mac Studio M3/M4 Ultra (192GB)
Hardware: Mac Studio M4 Ultra, 192GB unified memory, 2TB SSD
Silent, 60W load power, macOS ecosystem
Inference: Ollama (Metal) → Qwen3.5-27B, Qwen3.5-110B (daily use)
MLX → Qwen3.5-397B via SSD streaming (~3.4 tok/s, frontier)
MLX → Qwen3-Coder-27B (coding tasks)
Audio: Option A: Cacique on remote GPU rig (http://rig:8880)
Option B: local mlx-whisper + Kokoro port for Mac
Option C: Dedicated Mac Mini M4 for always-on TTS/STT
Agent: OpenClaw → Ollama → Qwen3.5-110B (resident, 60GB Q4)
SmarterClaw (Swama + Ollama) for multi-model routing
Models: Qwen3.5-110B-Instruct (Q4, 64GB) – primary agent
Qwen3.5-397B (MLX streaming) – frontier tasks
Qwen3-Coder-27B (Q4) – code assistance
Coding: OpenCode → Ollama → Qwen3-Coder-27B
Aider → Ollama → Qwen3-Coder-27B
IDE: Void (full AI-native IDE, points at local Ollama)
Continue.dev (VS Code/JetBrains, local model)
Model: Qwen3-Coder-27B for autocomplete + chat
🧭 Which Stack Should You Build?
GPU Rig (4× RTX 3090): Choose this if you need high batch throughput, plan to serve multiple users or pipelines simultaneously, want maximum flexibility with quantization and model formats, are comfortable with Linux, and want the lowest cost per inference at scale.
Mac Studio (M3/M4 Ultra): Choose this if you need to run very large models (110B+) with simple setup, value silent operation and low power consumption, work primarily in macOS, want the simplest path to running frontier-class models locally, and don't need maximum throughput.
Many serious local AI practitioners run both: the Mac Studio as an always-on personal inference node for the agent layer, and a GPU rig for batch processing, training, and high-throughput tasks. They're complementary, not competing.
References
- vLLM Documentation – docs.vllm.ai
- Ollama GitHub – github.com/ollama/ollama
- llama.cpp GitHub – github.com/ggerganov/llama.cpp
- NousResearch Hermes Agent – hermes-agent.nousresearch.com
- OpenClaw Documentation – docs.openclaw.ai
- OpenCode GitHub – github.com/opencode-ai/opencode
- Continue.dev – continue.dev
- Kokoro TTS FastAPI – github.com/remsky/Kokoro-FastAPI
- faster-whisper – github.com/SYSTRAN/faster-whisper
- Qwen3.5 Models – huggingface.co/Qwen