1. The GPU Landscape in 2025–2026
We're in a golden age for local AI. Consumer GPUs now rival enterprise hardware from just two years ago. NVIDIA's Blackwell architecture has arrived in both consumer (RTX 50-series) and professional (RTX PRO 6000) form, and the competition from Apple Silicon and AMD has never been stronger.
Here's the current hierarchy of GPUs that matter for LLM work:
| GPU | VRAM | Memory BW | CUDA/Stream | TDP | Street Price |
|---|---|---|---|---|---|
| RTX PRO 6000 Blackwell | 96 GB GDDR7 ECC | ~1,790 GB/s | 24,064 CUDA | 600W | ~$6,800–$7,500 |
| NVIDIA RTX 5090 | 32 GB GDDR7 | 1,792 GB/s | 21,760 CUDA | 575W | $1,999–$3,800 |
| NVIDIA RTX 4090 | 24 GB GDDR6X | 1,008 GB/s | 16,384 CUDA | 450W | $1,600–$2,000 |
| RTX 6000 Ada (prev gen) | 48 GB GDDR6 ECC | 960 GB/s | 18,176 CUDA | 300W | ~$6,000 |
| NVIDIA H100 SXM | 80 GB HBM3 | 3,350 GB/s | 16,896 CUDA | 700W | $25,000–$35,000 |
| NVIDIA H200 | 141 GB HBM3e | 4,800 GB/s | 16,896 CUDA | 700W | $40,000–$55,000 |
| NVIDIA B200 | 192 GB HBM3e | 8,000 GB/s | — | 1000W | $30,000–$35,000 |
| AMD MI300X | 192 GB HBM3 | 5,300 GB/s | 19,456 Stream | 750W | $10,000–$15,000 |
RTX PRO 6000 Blackwell: The New Professional King
The RTX PRO 6000 Blackwell is NVIDIA's new flagship professional GPU and it's a monster: 96 GB of GDDR7 ECC memory, 24,064 CUDA cores, 752 fifth-generation Tensor Cores, and fourth-generation RT cores. Built on the full GB202 die, it delivers up to 3× the AI performance of the previous-generation RTX 6000 Ada [1].
At roughly $6,800–$7,500, it's not cheap, but it's the only professional card that can hold a full 70B-parameter model at 8-bit precision on a single GPU (70B at FP16 is ~140 GB, beyond even its 96 GB). For serious AI labs and studios, one RTX PRO 6000 replaces what previously required two RTX 4090s.
RTX 5090: The Consumer Champion
The RTX 5090 is the new consumer king at $1,999 MSRP (though street prices often hit $2,500–$3,800 due to demand). Its 32 GB of GDDR7 with 1,792 GB/s bandwidth enables running quantized 70B models on a single GPU — something the 24GB RTX 4090 can't do [2].
Benchmarks tell the story clearly:
| Benchmark | RTX 4090 | RTX 5090 | RTX PRO 6000 |
|---|---|---|---|
| 8B model (tok/s) | 128 | 213 | — |
| Qwen3-Coder-30B (1x, tok/s) | 2,259 | 4,570 | 8,425 |
| Llama 3.3 70B AWQ (2x, tok/s) | 467 | 1,230 | 1,031 (1x) |
| Cost per 1M tokens (30B model) | $0.048 | $0.040 | $0.043 |
Source: CloudRift benchmark using vLLM, October 2025 [3]
Key takeaway: the RTX 5090 outperforms the A100 80GB on LLM inference and matches the H100 on 32B models at a fraction of the cost [4]. Two RTX 5090s achieve 27 tok/s on 70B evaluation, matching single H100 performance [5].
RTX 4090: Still the Workhorse
Don't sleep on the RTX 4090. At $1,600–$2,000, its 24GB VRAM handles every 7–13B model comfortably and can run quantized 30B models. It delivers 128 tok/s on 8B models — more than fast enough for interactive use. It's the most popular GPU in the local LLM community for good reason: mature ecosystem, rock-solid llama.cpp support, and wide availability [2].
2. Why VRAM Is King for LLMs
If there's one thing to understand about running LLMs locally, it's this: VRAM is the single most important spec. Not CUDA cores, not clock speeds, not TFLOPs. VRAM.
Here's why: a large language model must load its entire weight matrix into memory for inference. If the model doesn't fit in VRAM, it "spills" to system RAM over the PCIe bus — and performance drops by 10–50×.
VRAM usage during inference has two components:
- Fixed cost: Model weights + CUDA overhead + scratchpad activations
- Variable cost: KV cache that grows linearly with context length
The formula [2]:
VRAM (GB) = (P × b_w) + (0.55 + 0.08 × P) + KV_cache
Where P = parameters in billions, b_w = bytes per weight (0.57 for Q4_K_M, 2.0 for FP16), and KV cache depends on context length, batch size, and model architecture.
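As a sanity check, the formula is easy to script. A minimal sketch (the function name and defaults are illustrative, not from any library):

```python
def vram_estimate_gb(params_b: float, bytes_per_weight: float,
                     kv_cache_gb: float = 0.0) -> float:
    """Estimate inference VRAM: weights + fixed overhead + KV cache.

    params_b:         parameter count in billions (P)
    bytes_per_weight: 0.57 for Q4_K_M, 1.0 for Q8_0, 2.0 for FP16
    kv_cache_gb:      context-dependent KV cache, estimated separately
    """
    weights = params_b * bytes_per_weight
    overhead = 0.55 + 0.08 * params_b  # CUDA context + scratchpad activations
    return weights + overhead + kv_cache_gb

# Llama 3.1 70B at Q4_K_M, ignoring KV cache:
print(round(vram_estimate_gb(70, 0.57), 1))  # → 46.0 GB: needs two 32 GB cards
```

For Llama 3.1 8B at FP16 (P = 8, b_w = 2.0) the same function gives about 17 GB: the ~16 GB of weights from the section 4 tables plus overhead.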
The second killer is memory bandwidth. LLM inference is memory-bound — the GPU spends most of its time waiting for weights to be read from VRAM, not computing. This is why the RTX 5090's 1,792 GB/s bandwidth gives it a 67% speed advantage over the RTX 4090's 1,008 GB/s [2].
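That memory-bound behavior gives a handy back-of-envelope ceiling: each generated token must stream every weight from VRAM once, so single-stream speed can't exceed bandwidth divided by model size in memory. A rough sketch (dense models only; MoE routing and batching change the math):

```python
def max_tokens_per_s(bandwidth_gbs: float, model_size_gb: float) -> float:
    """Bandwidth roofline for dense-model decoding: every weight is read
    once per generated token, so tok/s <= bandwidth / model size."""
    return bandwidth_gbs / model_size_gb

# 8B model at Q4_K_M (~4.9 GB of weights) on both cards:
print(round(max_tokens_per_s(1008, 4.9)))  # → 206, RTX 4090 ceiling
print(round(max_tokens_per_s(1792, 4.9)))  # → 366, RTX 5090 ceiling
```

The measured 128 and 213 tok/s from the benchmark table sit below these ceilings, as expected once compute and overhead are counted, but the ratio between the two cards tracks the bandwidth ratio closely.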
3. The Competitor Landscape
AMD: Instinct for Data Centers, Radeon for Consumers
AMD Instinct MI300X is a legitimate enterprise contender with 192 GB HBM3 and 5,300 GB/s bandwidth. It's excellent for large-scale LLM inference, and AMD's ROCm software stack has matured significantly with vLLM support [6]. At $10,000–$15,000, it's cheaper than an H100 with more VRAM.
The newer MI350 (CDNA 4) pushes to 288 GB HBM3e with FP4/FP6/FP8 support — targeting frontier model training [7].
On the consumer side, the AMD RX 9070 XT ($549, 16GB GDDR6) is rated "excellent for LLM tasks" but limited by 16GB VRAM — enough for 7–13B quantized models but you'll hit the wall fast on anything larger [8]. The bigger issue is software: ROCm support for consumer cards is improving but still behind NVIDIA's CUDA ecosystem for llama.cpp and Ollama.
Intel Arc: Budget Entry Point
The Intel Arc B580 at just $249 is the sleeper pick for experimentation. It delivers 62 tokens/second on 8B models — surprisingly competitive for the price. SYCL/oneAPI support is improving. It's not for serious production work, but for learning and running small models, it's hard to beat the price-to-performance ratio [2].
Apple Silicon: The Unified Memory Advantage
Apple Silicon takes a fundamentally different approach. Instead of discrete VRAM, the M-series chips use unified memory — a shared pool accessible to both CPU and GPU cores without PCIe transfer overhead. This means a Mac with 128GB of unified memory can load a 128GB model without spilling [9].
The tradeoff: memory bandwidth is lower than discrete GPUs. An M4 Max delivers ~550 GB/s vs. the RTX 5090's 1,792 GB/s. This means slower token generation, but you can run much larger models that simply won't fit on consumer GPUs.
| Apple Config | Unified Memory | Bandwidth | Starting Price | Best For |
|---|---|---|---|---|
| Mac Mini M4 Pro | Up to 64 GB | 273 GB/s | $1,399 | Small-medium models, dev |
| Mac Studio M4 Max | Up to 128 GB | 546 GB/s | $1,999 | 70B quantized models |
| Mac Studio M3/M4 Ultra | Up to 512 GB | 819 GB/s | $3,999 | 405B+ quantized models |
| Mac Pro M2/M3 Ultra | Up to 512 GB | 819 GB/s | $6,999 | Maximum memory, expansion |
Real-world results: an M4 Pro (64GB) runs Qwen 2.5 32B at 11–12 tok/s. An M3 Ultra with 512GB can load DeepSeek-R1 671B in quantized form, a model whose full-precision weights (~1.3 TB) would otherwise demand a multi-node GPU cluster [5].
Cloud Alternatives: When They Make More Sense
Sometimes the best local rig is no local rig. Cloud GPU rental makes sense when:
- You need it temporarily — fine-tuning a model for a week doesn't justify $5,000 in hardware
- You need massive VRAM — H100/B200 for 405B models without buying a server
- You're experimenting — try different GPU configs before committing
Current cloud GPU pricing (approximate):
- RunPod: RTX 4090 from $0.59/hr, RTX 5090 from $0.65/hr, H100 from $2.49/hr
- Vast.ai: RTX 4090 from $0.25/hr (community, variable quality)
- Lambda: H100 from $2.49/hr, B200 available
Break-even math: An RTX 5090 at $2,500 vs. cloud rental at $0.65/hr breaks even at ~3,846 hours (~160 days of 24/7 use, or ~16 months at 8 hours/day). If you're using GPUs daily for AI work, buying hardware pays for itself within a year.
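The break-even arithmetic generalizes to any card and rental rate. A quick sketch (electricity, resale value, and cloud egress fees ignored):

```python
def breakeven_hours(hardware_cost: float, cloud_rate_per_hr: float) -> float:
    """Hours of GPU time at which buying beats renting."""
    return hardware_cost / cloud_rate_per_hr

hours = breakeven_hours(2500, 0.65)  # RTX 5090 street price vs. RunPod rate
print(round(hours))                  # → 3846 hours
print(round(hours / 24))             # → 160 days of 24/7 use
print(round(hours / 8 / 30))         # → 16 months at 8 h/day
```

Swap in your own numbers: a used RTX 3090 at $850 against Vast.ai's $0.25/hr breaks even at 3,400 hours, a similar horizon despite the much cheaper card.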
4. Hardware Requirements by Model Size
Small Models (1–8B Parameters)
Models: Llama 3.2 (1B, 3B, 8B), Phi-3 Mini (3.8B), Gemma 2 (2B, 7B), Qwen 2.5 (7B), Mistral 7B
| Model | Q4_K_M | Q8_0 | FP16 | Minimum GPU |
|---|---|---|---|---|
| Llama 3.2 3B | ~2.0 GB | ~3.5 GB | ~6.5 GB | RTX 3060 12GB, Arc B580 |
| Llama 3.2 8B | ~4.9 GB | ~8.5 GB | ~16 GB | RTX 4060 Ti 16GB |
| Phi-3 Mini 3.8B | ~2.3 GB | ~4.0 GB | ~7.6 GB | RTX 3060 12GB, Arc B580 |
| Gemma 2 7B | ~4.5 GB | ~7.8 GB | ~14 GB | RTX 4060 Ti 16GB |
Speed: With an RTX 4090, expect 100–130 tok/s on 8B models. RTX 5090 pushes 200+ tok/s. Even a $249 Intel Arc B580 does 62 tok/s. These models are fast and accessible on virtually any modern GPU with 8+ GB VRAM.
Medium Models (13–32B Parameters)
Models: Llama 3.1 13B, Qwen3-Coder-30B, Mixtral 8×7B (47B total, ~13B active), Qwen 2.5 32B, DeepSeek-Coder-33B
| Model | Q4_K_M | Q8_0 | FP16 | Minimum GPU |
|---|---|---|---|---|
| Llama 3.1 13B | ~8.0 GB | ~14 GB | ~26 GB | RTX 4090 (Q4), RTX 5090 (Q8) |
| Mixtral 8×7B | ~26 GB | ~47 GB | ~94 GB | RTX 5090 (Q4), 2× 4090 (Q8) |
| Qwen 2.5 32B | ~19 GB | ~34 GB | ~64 GB | RTX 4090 (Q4), RTX 5090 (Q8) |
| Qwen3-Coder-30B-A3B (MoE) | ~18 GB | ~32 GB | — | RTX 4090 (Q4), RTX 5090 (Q8) |
Speed: RTX 5090 delivers ~61 tok/s on 32B models. RTX 4090 runs Q4 quantized 32B at ~35–40 tok/s. Mixtral's MoE architecture only activates ~13B parameters at a time, so it's faster than its total size suggests — but you still need the full model in VRAM.
Large Models (65B–405B+ Parameters)
Models: Llama 3.1 70B, Llama 3.1 405B, DeepSeek V3/R1 (671B), Qwen 2.5 72B
| Model | Q4_K_M | FP16 | Hardware Required |
|---|---|---|---|
| Llama 3.1 70B | ~40 GB | ~140 GB | 2× RTX 5090 (Q4), 2× H100 (FP16) |
| Llama 3.1 405B | ~230 GB | ~810 GB | Mac Studio 512GB (Q4), 8× H100 (FP8) |
| DeepSeek V3 671B | ~380 GB | ~1,340 GB | Multi-node or 8× H100 (FP8) |
| Qwen 2.5 72B | ~42 GB | ~144 GB | 2× RTX 5090 (Q4), 1× RTX PRO 6000 (Q8) |
5. Quantization: Trading Precision for Accessibility
Quantization reduces model precision from 16-bit floats to lower bit widths, dramatically cutting VRAM requirements with varying quality tradeoffs:
| Format | Bits/Weight | VRAM Savings | Quality Impact | Best For |
|---|---|---|---|---|
| FP16 | 16 (2 bytes) | Baseline | None (full precision) | Research, maximum quality |
| FP8 | 8 (1 byte) | 2× reduction | Minimal | H100/B200 native, training |
| Q8_0 (GGUF) | 8 | 2× reduction | Near-lossless | When VRAM allows, quality-critical |
| Q5_K_M (GGUF) | ~5.5 | ~3× reduction | Very low degradation | Quality-conscious with limited VRAM |
| Q4_K_M (GGUF) | ~4.5 | ~4× reduction | Slight degradation | Best balance of quality/size |
| AWQ INT4 | 4 | 4× reduction | Better than GPTQ at same bits | vLLM serving, GPU inference |
| IQ2_M (GGUF) | ~2.7 | ~6× reduction | Noticeable degradation | Fitting huge models on limited HW |
Practical advice: For most users, Q4_K_M is the sweet spot — it delivers ~4× VRAM savings with quality that's difficult to distinguish from FP16 in most tasks. Q5_K_M is worth the extra VRAM if you can afford it. Avoid going below Q3 unless you're desperate to fit a model that's otherwise too large.
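The bits-per-weight column translates directly into weight size. A small sketch using the table's approximate values (weights only, no runtime overhead or KV cache):

```python
# Approximate effective bits per weight, from the table above
BITS_PER_WEIGHT = {"FP16": 16, "Q8_0": 8, "Q5_K_M": 5.5, "Q4_K_M": 4.5, "IQ2_M": 2.7}

def quantized_size_gb(params_b: float, fmt: str) -> float:
    """Weight size in GB: params (billions) x bits per weight / 8 bits per byte."""
    return params_b * BITS_PER_WEIGHT[fmt] / 8

for fmt, bits in BITS_PER_WEIGHT.items():
    print(f"70B @ {fmt}: {quantized_size_gb(70, fmt):.0f} GB")
```

This reproduces the ~140 GB FP16 and ~40 GB Q4_K_M figures from the 70B rows in section 4, and shows why IQ2_M (~24 GB for a 70B) is the only single-24GB-card option for that class.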
GGUF vs. AWQ: GGUF (from llama.cpp) is the format of choice for Ollama and local inference. AWQ is optimized for GPU-accelerated serving with vLLM. If you're running a personal model on Ollama, use GGUF. If you're serving models to an API, use AWQ with vLLM [5].
6. Build Guide: Starter (~$1,500–$2,500)
🟢 Tier 1: The AI Starter Rig
Runs: All 7–8B models at full speed, 13B models quantized, 32B models at Q4 with some compromise
| Component | Pick | Price |
|---|---|---|
| GPU | NVIDIA RTX 4090 24GB (or used RTX 3090 24GB ~$800–$900) | $1,600–$2,000 |
| CPU | AMD Ryzen 7 7700X (8C/16T) or Intel i5-14600K | $220–$290 |
| RAM | 64GB DDR5-5600 (2×32GB) — CPU offloading needs headroom | $120–$160 |
| Motherboard | B650 (AMD) or Z790 (Intel) — PCIe 4.0 x16 slot minimum | $150–$200 |
| Storage | 1TB NVMe Gen4 SSD (model loading speed matters) | $80–$100 |
| PSU | 850W 80+ Gold (1000W for 5090 future-proofing) | $100–$130 |
| Case | Mid-tower with good airflow (Fractal Meshify 2, etc.) | $100–$130 |
| Total (with new RTX 4090) | | ~$2,370–$3,010 |
| Total (with used RTX 3090) | | ~$1,570–$1,910 |
Budget option: A used RTX 3090 at $800–$900 gives you the same 24GB VRAM as the 4090, just with lower bandwidth (936 GB/s) and older architecture. For interactive chat with 7–13B models, you'll barely notice the difference.
Performance expectations:
- Llama 3.2 8B Q4: ~100–130 tok/s (RTX 4090), instant responses
- Qwen 2.5 32B Q4: ~35–40 tok/s (RTX 4090), very usable
- Llama 3.1 70B Q4: Won't fit in 24GB — need CPU offload (~5–8 tok/s, painful)
7. Build Guide: Enthusiast (~$4,000–$7,000)
🔵 Tier 2: The Serious AI Workstation
Runs: Everything up to 70B models comfortably, 405B at aggressive quantization with offloading
| Component | Pick | Price |
|---|---|---|
| GPU | 2× NVIDIA RTX 5090 32GB (64GB total) | $4,000–$7,600 |
| CPU | AMD Ryzen 9 9900X (12C/24T) or Ryzen 9 9950X (16C/32T) | $400–$550 |
| RAM | 128GB DDR5-5600 (4×32GB) | $250–$350 |
| Motherboard | X870E with 2× PCIe 5.0 x16 slots (ASUS ProArt, MSI MEG) | $350–$500 |
| Storage | 2TB NVMe Gen4 SSD + 4TB SATA SSD for model library | $200–$300 |
| PSU | 1600W 80+ Platinum (2× 5090 = ~1150W GPU alone) | $250–$350 |
| Case | Full tower (Corsair 7000D, be quiet! Dark Base Pro 901) | $200–$280 |
| Cooling | 360mm AIO for CPU + aggressive case fan config | $100–$180 |
| Total | | ~$5,750–$10,110 |
Performance expectations:
- Llama 3.1 70B Q4 AWQ: ~1,230 tok/s throughput (batched via vLLM), ~25–30 tok/s single-user
- Qwen3-Coder-30B: ~9,000+ tok/s batched (2× instances)
- Llama 3.1 405B Q4: Won't fully fit in 64GB — need aggressive quantization + offload
Why two 5090s instead of one PRO 6000? Two 5090s give you 64GB total at $4,000–$7,600 vs. one PRO 6000 with 96GB at ~$7,000. The PRO 6000 avoids multi-GPU communication overhead and has ECC memory, but two 5090s offer higher combined throughput for models that fit in 32GB each (you run two replicas). It depends on your typical model size.
8. Build Guide: Go For Broke (~$10,000–$25,000+)
🟣 Tier 3: The Local AI Server
Runs: Everything. 405B quantized, 70B at FP16, multiple models simultaneously.
| Component | Pick | Price |
|---|---|---|
| GPU Option A | 4× RTX 5090 32GB (128GB total) | $8,000–$15,200 |
| GPU Option B | 2× RTX PRO 6000 96GB (192GB total) | $13,600–$15,000 |
| CPU | AMD Threadripper 7970X (32C/64T) or EPYC 9354 (32C) | $2,000–$3,500 |
| RAM | 256GB DDR5 ECC (Threadripper) or 512GB (EPYC) | $600–$1,500 |
| Motherboard | TRX50 (Threadripper) or SP5 (EPYC) with 4+ PCIe 5.0 x16 | $700–$1,500 |
| Storage | 2TB Gen5 NVMe + 8TB SATA SSD model library | $400–$600 |
| PSU | 2000W+ 80+ Titanium (or dual PSU with adapter) | $400–$600 |
| Case/Chassis | 4U server chassis or full tower (Corsair 9000D) | $300–$500 |
| Cooling | Custom loop or industrial cooling solution | $300–$500 |
| Total (Option A: 4× 5090) | | ~$12,700–$23,900 |
| Total (Option B: 2× PRO 6000) | | ~$18,300–$23,700 |
Can it run 405B? Yes — with caveats. Llama 3.1 405B at Q4_K_M needs ~230GB. Option A (4× 5090, 128GB) can't fit it in VRAM alone — you'd need aggressive Q2/Q3 quantization or significant CPU offloading. Option B (2× PRO 6000, 192GB) fits it comfortably at Q4_K_M with room for KV cache. This is the primary argument for the PRO 6000 path.
The EPYC advantage: AMD EPYC platforms support 128 PCIe 5.0 lanes — enough for 4× x16 GPU slots without lane splitting. Threadripper offers 64 lanes with more consumer-friendly pricing. Both support massive RAM for CPU offloading fallback.
Performance expectations (Option B):
- Llama 3.1 70B FP16: Fully in VRAM, ~40–60 tok/s
- Llama 3.1 405B Q4: Runs in 192GB, ~8–15 tok/s (usable for interactive chat)
- DeepSeek V3 671B: Needs Q2 quantization to squeeze in, or CPU offloading — experimental
9. The Apple Silicon Path
🍎 Alternative: Mac Studio / Mac Pro
For: Maximum model size on a single quiet machine, macOS ecosystem users, avoiding driver/Linux headaches
| Configuration | Memory | Largest Model (Q4) | Speed (70B Q4) | Price |
|---|---|---|---|---|
| Mac Mini M4 Pro 64GB | 64 GB | ~32B FP16, ~70B Q4 | ~8–10 tok/s | ~$2,200 |
| Mac Studio M4 Max 128GB | 128 GB | ~70B Q8, ~140B Q4 | ~15–20 tok/s | ~$3,500–$4,000 |
| Mac Studio M3 Ultra 192GB | 192 GB | ~200B Q4 | ~12–15 tok/s | ~$5,500 |
| Mac Studio M3 Ultra 512GB | 512 GB | ~405B Q4, ~671B Q3 | ~5–8 tok/s (405B) | ~$9,500–$14,000 |
Software: Use MLX (Apple's ML framework), Ollama (has native Metal support), or llama.cpp with Metal backend. MLX is optimized for Apple Silicon's unified memory architecture and often outperforms llama.cpp on Macs [9].
Pros: Silent operation, unified memory eliminates spill cliff, great build quality, macOS convenience, no driver issues, single-machine 405B capability at 512GB.
Cons: Lower tok/s than equivalent-cost GPU rigs, no CUDA, limited fine-tuning support, not upgradeable, Apple tax on RAM (going from 192GB to 512GB costs $2,400+).
10. The Software Stack
Hardware is only half the equation. Here's the software that makes local LLMs work:
Ollama — The Easiest On-Ramp
Ollama is the Docker of LLMs. One command to download and run any model: `ollama run llama3.2`. It handles model management, GGUF quantization, GPU offloading, and exposes an OpenAI-compatible API. If you're new to local AI, start here.
- Best for: Personal use, development, quick prototyping
- Supports: NVIDIA CUDA, Apple Metal, AMD ROCm
- Limitations: Single-model serving, no batching, limited multi-GPU
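Because Ollama exposes an OpenAI-compatible endpoint (by default on port 11434), any HTTP client works. A minimal sketch using only the standard library — the endpoint path and port are Ollama's documented defaults; adjust `base_url` for a remote host:

```python
import json
import urllib.request

def chat_request(prompt: str, model: str = "llama3.2",
                 base_url: str = "http://localhost:11434") -> urllib.request.Request:
    """Build a request for Ollama's OpenAI-compatible chat endpoint."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )

req = chat_request("Explain KV cache in one sentence.")
# Requires a running `ollama serve` with the model pulled:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```

Because the payload shape matches OpenAI's API, the official `openai` client also works by pointing its base URL at the same endpoint, which is what makes local-first development against cloud-compatible code practical.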
llama.cpp — The Performance Foundation
llama.cpp is the C/C++ inference engine that powers Ollama, LM Studio, and many other tools. It's the reference implementation for GGUF format and supports virtually every quantization method. For maximum performance and control, use llama.cpp directly.
- Best for: Maximum inference speed, custom setups, CPU inference
- Supports: CUDA, Metal, Vulkan, SYCL (Intel), OpenCL
- Key feature: Partial GPU offloading — split model layers between GPU and CPU
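Partial offloading is controlled by llama.cpp's layer count setting (`-ngl` / `--n-gpu-layers` on the CLI). A rough way to pick the value, assuming uniformly sized layers — a simplification, since embeddings and the KV cache need their own headroom, covered here only by a flat reserve:

```python
def gpu_layers_that_fit(model_size_gb: float, n_layers: int,
                        free_vram_gb: float, reserve_gb: float = 2.0) -> int:
    """Rough planner for llama.cpp's n-gpu-layers: how many of the model's
    layers fit in free VRAM, keeping a reserve for KV cache and scratch
    buffers. Assumes uniformly sized layers (a simplification)."""
    per_layer = model_size_gb / n_layers
    budget = max(free_vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(budget / per_layer))

# Llama 3.1 70B at Q4_K_M (~40 GB, 80 layers) on a 24 GB RTX 4090:
print(gpu_layers_that_fit(40, 80, 24))  # → 44 layers on GPU, rest on CPU
```

Passing the result as `-ngl 44` puts just over half the model on the GPU; expect single-digit tok/s whenever a large fraction stays on the CPU, per the Tier 1 notes above.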
vLLM — Production Serving
vLLM is the gold standard for serving LLMs at scale. It supports continuous batching, tensor/pipeline parallelism, PagedAttention for efficient KV cache management, and an OpenAI-compatible API. If you're serving models to multiple users, vLLM is the answer.
- Best for: Multi-user serving, API endpoints, maximum throughput
- Supports: NVIDIA CUDA, AMD ROCm, AWQ/GPTQ quantization
- Key feature: Continuous batching delivers 5–10× throughput vs. Ollama under load
Text Generation WebUI (Oobabooga) — The Swiss Army Knife
Text Generation WebUI provides a Gradio-based interface with support for every backend (llama.cpp, transformers, ExLlamaV2, etc.), character chat, extensions, and model management. Think of it as the "everything app" for local LLMs.
- Best for: Experimentation, roleplay/character chat, comparing models
- Supports: Every format and backend
- Limitations: Resource-heavy UI, not optimized for API serving
Best Practices for Local Deployment
- Use GGUF Q4_K_M as your default — best quality-to-size ratio for most uses
- Monitor VRAM usage — use `nvidia-smi` or `nvtop` to ensure you're not spilling
- Set context length wisely — KV cache grows linearly with context. Don't default to 128K if you only need 4K
- Use mmap for large models — llama.cpp can memory-map model files for faster loading
- Consider serving vs. chat — Ollama for personal use, vLLM for API/multi-user
- Keep models on NVMe — model loading from NVMe vs. HDD is 10× faster
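The "set context length wisely" point is worth quantifying. Per token, an FP16 KV cache costs 2 (K and V) × layers × KV heads × head dimension × 2 bytes. Using Llama 3.1 8B's published shape (32 layers, 8 KV heads via grouped-query attention, head dimension 128) as a worked example:

```python
def kv_cache_gb(context_len: int, n_layers: int, n_kv_heads: int,
                head_dim: int, bytes_per_elem: int = 2) -> float:
    """KV cache size: 2 (K and V) x layers x kv_heads x head_dim x bytes
    per token, times the context length."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return context_len * per_token / 1e9

# Llama 3.1 8B, FP16 cache:
print(round(kv_cache_gb(4_096, 32, 8, 128), 2))    # → 0.54 GB at 4K context
print(round(kv_cache_gb(131_072, 32, 8, 128), 2))  # ≈ 17 GB at 128K context
```

Roughly half a gigabyte at 4K versus about 17 GB at the full 128K: on a 24 GB card, that's the difference between fitting comfortably and spilling.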
11. Pre-Built AI Workstations
Don't want to build? Several companies sell ready-to-go AI workstations:
| Vendor | Starting Price | GPU Options | Notes |
|---|---|---|---|
| Puget Systems | ~$3,100 | RTX 4090, RTX 5090, RTX PRO 6000, up to 4 GPUs | Premium support, custom configs, extensive testing. Up to ~$61K for maxed systems. |
| Lambda | ~$7,000 | RTX 4090, A6000, H100 (server) | Comes with Lambda Stack (CUDA, PyTorch pre-installed). Strong AI focus. |
| Thinkmate | ~$5,000 | Various NVIDIA professional | Enterprise focus, rack-mount options, custom builds. |
| Bizon | ~$5,000 | RTX 4090, 5090, A6000, up to 8 GPUs | Liquid cooling, AI-specific configs, good reviews. |
Build vs. Buy tradeoffs:
- Build: 20–40% cheaper, fully customizable, you learn the hardware, you can upgrade piecemeal
- Buy: Warranty, professional support, guaranteed compatibility, pre-installed software, saves 10–20 hours of assembly and troubleshooting
For most individuals and small teams, building is worth it — the community knowledge for AI rigs is extensive and the process is well-documented. For companies that need reliability guarantees and tax-deductible invoices, pre-builts make sense.
12. Cloud vs. Local: The Decision Framework
| Factor | Local | Cloud |
|---|---|---|
| Upfront cost | $1,500–$25,000+ | $0 |
| Ongoing cost | Electricity (~$30–$100/mo under heavy use) | $0.25–$10+/hr while running |
| Privacy | Complete — data never leaves your machine | Depends on provider, shared hardware risks |
| Availability | 24/7 once built | Subject to spot pricing, availability, outages |
| Flexibility | Upgrade path, your hardware | Switch GPU types instantly, scale to multi-node |
| Best for | Daily use, privacy-critical, long-term cost savings | Occasional use, massive compute needs, experimentation |
13. Final Recommendations
Here's our opinionated take on what to buy based on your use case:
🎯 "I just want to chat with AI privately"
Get: Mac Mini M4 Pro with 64GB ($2,200) or used RTX 3090 build ($1,600). Run Ollama. You'll have access to excellent 7–32B models that rival ChatGPT for most tasks. Done.
🎯 "I'm a developer building AI-powered apps"
Get: RTX 5090 build ($3,500–$5,000) or Mac Studio M4 Max 128GB ($3,500–$4,000). The 5090 gives you speed; the Mac gives you model size. Either way, you get a local API endpoint via Ollama that mimics OpenAI's API for seamless development.
🎯 "I want to run the best open-source models available"
Get: 2× RTX 5090 build ($5,500–$10,000) for 70B models at excellent speed, or Mac Studio M3 Ultra 192GB ($5,500) if you prioritize fitting larger models. The 70B class (Llama 3.1 70B, Qwen 2.5 72B) is the sweet spot where open-source approaches frontier-model quality.
🎯 "I want to run 405B+ models locally"
Get: Mac Studio M3 Ultra 512GB ($9,500–$14,000) for simplicity, or 2× RTX PRO 6000 build ($18,000–$24,000) for performance. There's no cheap way to run 400B+ models. The Mac path is easier; the NVIDIA path is faster. Cloud (8× H100 for a few hours) is worth considering if this is occasional.
🎯 "I have unlimited budget"
Get: DGX B200 ($515,000), or more practically, a Lambda or Puget Systems server with 4–8× H100/B200 GPUs. At this level, you're running an AI lab, not a personal rig. Consider Lambda Cloud on-demand unless you have consistent 24/7 utilization.
References
- NVIDIA, "RTX PRO 6000 Blackwell Workstation Edition," nvidia.com, 2025.
- LocalLLM.in, "The Best GPUs for Local LLM Inference in 2025," localllm.in, August 2025.
- D. Trifonov, "RTX 4090 vs RTX 5090 vs RTX PRO 6000: Comprehensive LLM Inference Benchmark," CloudRift.ai, October 2025.
- DatabaseMart, "RTX 5090 Ollama Benchmark: Extreme Performance Faster Than H100," databasemart.com, 2025.
- Introl, "Local LLM Hardware Guide 2025: GPU Specs & Pricing," introl.com, August 2025.
- AMD, "vLLM x AMD: Highly Efficient LLM Inference on AMD Instinct MI300X GPUs," amd.com, April 2025.
- BestGPUsForAI, "Best AMD GPUs for AI Training & Deep Learning in 2026," bestgpusforai.com, 2026.
- TechReviewer, "Is the Radeon RX 9070 XT Good for Running LLMs?" techreviewer.com, October 2025.
- M. Schall, "Apple MLX vs. NVIDIA: How local AI inference works on the Mac," markus-schall.de, November 2025.
- Central Computer, "Understanding the NVIDIA RTX 6000 PRO Blackwell Lineup," centralcomputer.com, September 2025.
- Puget Systems, "Workstations for Machine Learning / AI," pugetsystems.com, 2025.
- RunPod, "RTX 5090 LLM Benchmarks," runpod.io, 2025.
This article was written collaboratively by Michel (human) and Yaneth (AI agent) as part of ThinkSmart.Life's research initiative. Hardware prices reflect market conditions as of February 2026 and may fluctuate.