1. Why Run AI Locally?
Every serious AI practitioner I know eventually builds a local inference rig. Not because it's cheaper than the cloud (it usually isn't, at first), but because of what you get when the model lives on your machine instead of someone else's server.
Privacy. When you run a model locally, your prompts never leave your machine. No API logs, no training data harvesting, no compliance nightmares. For lawyers, doctors, writers, and anyone who handles sensitive information, this is the entire argument.
Speed. API calls have latency: round-trips to a data center, rate limits, queue delays. A local model answers with no network in the way. Once you've used a local model that generates at 30 tokens per second, going back to a sluggish API feels painful.
Cost. GPT-4o costs money per token. A local model, once you've paid for the hardware, costs little beyond electricity. If you generate a lot of text (coding assistance, document processing, agentic workflows), local inference pays for itself quickly.
Capability. The open-source model ecosystem in 2026 is astonishing. Qwen3, Llama 4, DeepSeek, Phi-4: these models compete with frontier APIs on most benchmarks. You're not settling for a lesser model. You're getting the real thing, privately, on your own hardware.
Control. You choose the model. You choose the context length. You run it 24/7 inside your own agent pipelines without worrying about someone changing pricing, breaking an API, or deprecating a model you depend on.
What This Guide Covers
We're going to walk through every component that matters for local AI inference: GPU, system RAM, CPU, storage, PSU, cooling, and OS. We'll cover the Mac path separately because Apple Silicon plays by different rules. Then we'll look at quantization (the technique that lets you run models several times larger in the same GPU memory), the software stack you need to get started, and four concrete builds for every budget.
By the end, you'll know exactly what to buy โ or whether your existing machine is already good enough to start.
2. The One Rule: VRAM Is Everything
Before you spend a single dollar, understand this: the single most important number for local AI is VRAM, the amount of memory on your GPU.
Not CPU speed. Not system RAM. Not even GPU clock speed or core count. VRAM.
Here's why. When you run a language model, the model's weights (billions of numbers that encode everything it knows) need to live somewhere while the GPU processes them. That somewhere is VRAM. If the model fits entirely in VRAM, the GPU can process tokens at full speed, accessing weights at hundreds of gigabytes per second. If the model doesn't fit, the system has to shuffle weights back and forth between GPU memory and system RAM, a process called CPU offloading that's 10–100× slower.
The Cliff Effect
This creates what I call the VRAM cliff. Below the threshold, performance is painful (2 to 5 tokens per second, sometimes slower) as the system strains to move data around. The moment your model fits entirely in VRAM, performance jumps to 15–60 tokens per second. It's not a gradual improvement. It's a cliff.
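The cliff is easy to sanity-check with arithmetic: each generated token has to stream roughly the full set of weights past the compute units, so peak generation speed is approximately memory bandwidth divided by model size in memory. A rough sketch (the ~30 GB/s offload figure is an illustrative assumption for PCIe/system-RAM traffic, not a measured number):

```python
def peak_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on generation speed: every token reads the full weights once."""
    return bandwidth_gb_s / model_size_gb

# A 20GB Q4 model entirely in an RTX 3090's VRAM (936 GB/s):
in_vram = peak_tokens_per_sec(936, 20)      # ~47 tokens/s ceiling

# The same model bottlenecked by ~30 GB/s of CPU-offload traffic:
offloaded = peak_tokens_per_sec(30, 20)     # ~1.5 tokens/s ceiling

print(f"in VRAM: ~{in_vram:.0f} tok/s, offloaded: ~{offloaded:.1f} tok/s")
```

Real throughput lands below these ceilings, but the ratio is the point: the same model gets roughly 30× slower the moment it spills out of VRAM.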
VRAM vs Model Size (with Q4 quantization)

| Model size | VRAM needed (Q4) | Runs fully in VRAM on |
|---|---|---|
| 7B | ~4GB | any modern GPU (8GB+) |
| 14B | ~8GB | 12GB cards |
| 32B | ~20GB | 24GB cards |
| 70B | ~40GB | 48GB+, or partial offload on 24–32GB |

These numbers assume Q4 quantization (more on that in Section 6). At full precision (FP16), every weight takes 2 bytes, so multiply the memory requirement by roughly four: a 7B model needs ~14GB, a 70B model needs ~140GB. Quantization is why consumer hardware can run these models at all.
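The numbers above are just bytes-per-weight arithmetic, which you can reproduce for any model (weights only; the KV cache and activations add a few GB more depending on context length):

```python
def weight_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Memory needed to hold the weights alone."""
    return params_billions * bits_per_weight / 8

for params in (7, 14, 32, 70):
    fp16 = weight_size_gb(params, 16)
    q4 = weight_size_gb(params, 4.5)  # Q4_K_M averages ~4.5 bits/weight in practice
    print(f"{params}B  FP16: {fp16:.0f}GB   Q4: ~{q4:.0f}GB")
```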
3. GPU: The Core Decision
Pick your GPU first. Everything else follows from this decision. Here's how the landscape breaks down in 2026.
The Budget Tiers
Don't bother zone (under $300 for AI): The Intel Arc B580 at ~$250 offers 12GB, which is workable for small models, but cards with 8GB of VRAM are nearly useless for AI. If you already own one, fine: you can learn on it. Don't buy one specifically for local AI.
Entry tier ($600–$1,200 build): RTX 4060 Ti 16GB or RTX 5080 16GB. You can run 14B models at Q4, do coding assistance, and run agents. Real performance, real models. This is a good starting point if you're not sure how deep you're going.
Sweet spot ($1,500–$2,500 build): This is where local AI starts breathing. A used RTX 3090 (24GB, ~$700) or new RTX 4070 Ti Super (16GB, ~$800) puts you in a completely different league. With 24GB, you can run 30B-class models such as Qwen3-35B and DeepSeek distillations at high quality, and use the best open-source models available. The used RTX 3090 is the best value in 2026 for anyone serious about local AI.
Power user ($3,000–$5,000 build): RTX 4090 (24GB, ~$1,600) or RTX 5090 (32GB, ~$3,000). The 4090 is the fastest 24GB card you can buy. The 5090 adds 32GB of GDDR7: finally enough to hold a 70B model in VRAM at aggressive (~3-bit) quantization, with no CPU offloading. If you're running production workloads, agentic pipelines, or embedding servers alongside inference, this is worth it.
Why 24GB Is the Magic Number
24GB is the sweet spot where local AI becomes genuinely capable. At 24GB with Q4 quantization, you can run Qwen3-35B, one of the best open-source models in existence, entirely in VRAM. That's a top-tier coding assistant, reasoning engine, and writing partner running locally, privately, for free (after hardware cost), at 15–25 tokens per second.
The 12GB vs 24GB gap is enormous. With 12GB you're running smaller models that feel limited. With 24GB you're running models that compete with GPT-4. The jump in capability is not proportional to the jump in cost.
GPU Comparison Table
| GPU | VRAM | Bandwidth | Price (2026) | Best For |
|---|---|---|---|---|
| Intel Arc B580 | 12 GB GDDR6 | 456 GB/s | ~$250 | Absolute budget – 13B models only |
| RTX 4060 Ti 16GB | 16 GB GDDR6 | 288 GB/s | ~$450 | Entry, learning, small models |
| RTX 3090 (used) | 24 GB GDDR6X | 936 GB/s | ~$700 | ⭐ Best budget 24GB – serious AI |
| RTX 4070 Ti Super | 16 GB GDDR6X | 672 GB/s | ~$800 | New build sweet spot (16GB) |
| RTX 5080 | 16 GB GDDR7 | 960 GB/s | ~$1,100 | Fast 16GB, good mid-tier |
| RTX 4090 | 24 GB GDDR6X | 1,008 GB/s | ~$1,600 | ⭐ Top consumer 24GB GPU |
| RTX 5090 | 32 GB GDDR7 | 1,792 GB/s | ~$3,000 | 70B in VRAM (tight quant) – the new king |
| Mac Studio M3 Max | 128 GB unified | ~400 GB/s | ~$3,500 | macOS path, huge model capacity |
| Mac Studio M4 Ultra | 192 GB unified | 800 GB/s | ~$6,000+ | Runs 405B models – Mac endgame |
4. The Rest of the System
VRAM is the critical variable. Everything else matters less, but it still matters. Here's what you need to know about each component.
System RAM: 64GB Minimum
This surprises people. They think about GPU memory and forget that the rest of the system needs RAM too. Here's the reality: you need RAM for the operating system, for loading models before they're transferred to VRAM, for embeddings, for vector databases, for the apps running alongside your AI stack. On 32GB, you'll constantly hit the wall. The minimum I recommend is 64GB DDR5. For the Mac path (where unified memory is both system RAM and VRAM), 128GB or more is ideal.
CPU: Boring Is Right
The CPU is almost irrelevant for GPU inference. Once the model is loaded into VRAM, the GPU does all the work. The CPU sits largely idle, tokenizing your prompts and managing I/O. You need a modern CPU with enough PCIe lanes to run your GPU at full bandwidth, and that's basically it.
For a new Windows/Linux build in 2026, the Ryzen 7 7700 (AM5, ~$180) is the boring-right answer. It's fast, it's cheap, it'll never be your bottleneck, and AM5 has years of upgrade path ahead. Don't overthink this.
Storage: 2TB NVMe Minimum
Models are large. A 7B model at Q4 is about 4GB. A 14B model is 8GB. A 70B model is 40GB. You'll download more models than you think โ trying different versions, different quantization levels, different architectures. A 1TB drive fills up fast. Start with 2TB NVMe SSD; 4TB is better if you're building for serious use.
Speed matters here too. Loading a 40GB model from a slow drive takes minutes. An NVMe SSD loads the same model in 20โ30 seconds. Don't use a hard drive or old SATA SSD for model storage if you can avoid it.
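A quick sketch of why drive speed matters (the sustained-read figures are rough assumptions; real-world model loading runs well below marketing sequential speeds):

```python
def load_time_s(model_gb: float, sustained_read_gb_s: float) -> float:
    """Seconds to read a model file at a given sustained throughput."""
    return model_gb / sustained_read_gb_s

# Loading a 40GB 70B-class model:
print(f"SATA SSD  (~0.5 GB/s): {load_time_s(40, 0.5):.0f}s")   # 80s
print(f"Gen4 NVMe (~1.8 GB/s): {load_time_s(40, 1.8):.0f}s")   # ~22s
```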
Power Supply: Size It Right
This is the component people get wrong most often. AI workloads drive GPUs to near-maximum TDP for long periods; this isn't a gaming load that spikes and drops. You need headroom.
- RTX 3090 (350W TDP) → 750W PSU minimum, 850W recommended
- RTX 4090 (450W TDP) → 850W PSU minimum, 1000W recommended
- RTX 5090 (575W TDP) → 1000W PSU minimum, 1200W if you want margin
Get an 80+ Gold or Platinum rated PSU from a reputable brand (Seasonic, Corsair, EVGA, be quiet!). An underpowered or cheap PSU is a fire risk and will cause system instability under load.
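The recommendations above follow from a simple rule of thumb, sketched here (the 100W platform draw, 105W CPU figure, and 1.5× margin are assumptions that happen to reproduce the "recommended" wattages in the list):

```python
STANDARD_WATTAGES = [650, 750, 850, 1000, 1200, 1600]

def recommend_psu_w(gpu_tdp_w: int, cpu_tdp_w: int = 105,
                    platform_w: int = 100, margin: float = 1.5) -> int:
    """Sustained system draw plus 50% headroom, rounded up to a standard PSU size."""
    sustained = gpu_tdp_w + cpu_tdp_w + platform_w
    return next(w for w in STANDARD_WATTAGES if w >= sustained * margin)

print(recommend_psu_w(350))  # RTX 3090 -> 850
print(recommend_psu_w(450))  # RTX 4090 -> 1000
print(recommend_psu_w(575))  # RTX 5090 -> 1200
```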
Cooling: GPU Temps Are What Matter
For the CPU, any decent tower cooler works fine โ Noctua NH-U12S, Deepcool AK400, or similar. The CPU won't get hot during AI inference.
The GPU is different. During AI workloads, your GPU runs at maximum power continuously. Keep your GPU below 83°C. This means good case airflow (positive pressure, front intake fans), not cramming the card into a tight case, and cleaning dust filters regularly. If you're in a warm environment, consider an aftermarket GPU cooler or a case specifically designed for airflow.
Operating System: Linux Is Best, Windows Works
For the best local AI performance, Ubuntu 22.04 or 24.04 LTS is the clear choice. CUDA drivers, Ollama, and most inference tools are built and tested on Linux first. Performance is typically 10–20% better than Windows on the same hardware.
That said, Windows works fine. Ollama runs on Windows, LM Studio has a polished native Windows app, and most tools have Windows support. If you're already on Windows and don't want to dual-boot, don't let Linux intimidation stop you from getting started.
5. The Mac Path
Apple Silicon Macs deserve their own section because they play by completely different rules.
Unified Memory: The Mac's Secret Weapon
In a traditional PC, VRAM (GPU memory) and system RAM are physically separate. The GPU can only use what's in VRAM; moving data between the two is a bottleneck. On Apple Silicon Macs, this distinction doesn't exist. The M-series CPU, GPU, and "GPU memory" all share the same physical memory pool, at up to ~800 GB/s of bandwidth on Ultra-class chips.
This means a Mac Studio M3 Max with 128GB of RAM has 128GB of effective "VRAM": more than any single consumer NVIDIA GPU. You can run a 70B model at 8-bit precision entirely in this memory pool (full FP16, at ~140GB, needs a 192GB Ultra configuration). You can load multiple models simultaneously. You can run a 120B+ model with quantization on a single device.
When to Choose Mac Over GPU
The Mac path wins when:
- You need more than 24–32GB of effective VRAM without building a multi-GPU rig
- You want a complete, quiet, power-efficient workstation (not just an inference box)
- You need macOS for development (Xcode, Final Cut, etc.)
- You want to run very large models (70B+) without quantization quality loss
- You prefer a supported, integrated system over DIY PC building
The Tradeoffs
The Mac path has real disadvantages too. Apple Silicon has far lower raw compute than a high-end NVIDIA GPU: an RTX 4090 delivers ~82 TFLOPS of FP16, while an M4 Ultra delivers ~21 TFLOPS. For inference tasks dominated by memory bandwidth (which most LLM workloads are), this matters less, and both deliver broadly similar tokens per second for the same model size. But for training or fine-tuning, NVIDIA wins decisively.
The CUDA ecosystem is also massive. Most AI research tools, training frameworks, and optimization libraries are built for NVIDIA first. macOS/Metal support is improving but always lags behind.
Finally, price. A Mac Studio M4 Max (128GB) costs ~$3,500. A DIY Linux PC with a used RTX 3090 and 64GB of RAM costs ~$1,500; it has far less model memory (24GB vs 128GB) but matches or beats the Mac on anything that fits in VRAM. The Mac is more polished and runs much larger models; the PC is better value for pure inference on models up to the ~30B class.
6. Quantization: The Free Upgrade
Quantization is the most important concept to understand after VRAM. It's the technique that makes local AI practical, and it's completely free: it's just how you download and run models.
What Quantization Does
A language model's "weights" are billions of decimal numbers. At full precision (FP16), each number takes 2 bytes of memory. A 70B model at FP16 is ~140GB, far more than any consumer GPU.
Quantization compresses these numbers. Instead of storing each weight as a precise 16-bit float, you store it as a rougher 4-bit integer, rounding it to one of 16 possible values instead of 65,536. This reduces memory by roughly 4×. That 140GB 70B model becomes ~40GB at Q4: small enough to fit on a 48GB GPU or run with partial CPU offloading on a 24GB card.
The Quality Tradeoff
Quantization loses information โ you're rounding numbers, after all. But the practical quality loss is smaller than you'd expect:
- Q4_K_M – The standard starting point. ~4 bits per weight. About 3–5% quality degradation on most benchmarks. This is what you'll use most often.
- Q6_K – 6 bits per weight. Barely perceptible quality difference from FP16 on most tasks. Use this if you have extra VRAM headroom. It's the sweet spot of size vs quality.
- Q8_0 – 8 bits per weight. Nearly identical to FP16. Use only if you have plenty of VRAM and want maximum quality.
- IQ2 / IQ3 – Extreme compression, 2–3 bits per weight. Significant quality loss. Only useful when you're desperately constrained on memory.
Quantization Multiplies Your Effective VRAM
Think of it this way: with Q4 quantization, a 24GB GPU can run models whose full-precision weights would need roughly 96GB. You haven't added any hardware; you've just learned to use what you have. Q4 is the reason the RTX 3090's 24GB is genuinely powerful for local AI, not just adequate.
When you download models from Ollama or Hugging Face, they're typically already quantized. You just choose the quant level based on your available VRAM.
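Choosing a quant level can be sketched as a lookup: take the highest-quality quant whose weights fit comfortably in your VRAM. (The effective bits-per-weight values below are approximations, and the 85% headroom factor is an assumption to leave room for the KV cache.)

```python
# Approximate effective bits per weight for common GGUF quant levels
QUANT_BITS = {"Q8_0": 8.5, "Q6_K": 6.6, "Q4_K_M": 4.8, "IQ3_XS": 3.3}

def pick_quant(params_billions: float, vram_gb: float, headroom: float = 0.85):
    """Highest-quality quant whose weights fit in ~85% of VRAM, else None."""
    for name, bits in sorted(QUANT_BITS.items(), key=lambda kv: -kv[1]):
        if params_billions * bits / 8 <= vram_gb * headroom:
            return name
    return None  # nothing fits: pick a smaller model or accept offloading

print(pick_quant(32, 24))  # Q4_K_M: a 32B model on a 24GB card
print(pick_quant(7, 12))   # Q8_0: small model, plenty of room
print(pick_quant(70, 24))  # None: 70B misses 24GB even at ~3 bits
```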
7. Your First Software Stack
You have the hardware. Now what? Here's the minimal software stack to go from zero to running a local AI model in about 10 minutes.
📦 Ollama – The Starting Point
The easiest way to download and run open-source models. One command downloads the model, handles quantization, and starts an inference server. Supports NVIDIA, AMD, and Apple Silicon.
Install and run your first model:
```shell
# Install Ollama (Linux/Mac)
curl -fsSL https://ollama.com/install.sh | sh

# Run a model (downloads automatically)
ollama run llama3.2

# Try a bigger model
ollama run qwen2.5:32b

# List what you've downloaded
ollama list
```
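The same models are also reachable programmatically: Ollama serves a local HTTP API (default port 11434) alongside the CLI. A minimal sketch using the documented `/api/generate` endpoint (assumes the Ollama server is running and the model is already pulled):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_request(model: str, prompt: str) -> bytes:
    """JSON body for a single, non-streaming completion."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

def generate(model: str, prompt: str) -> str:
    """Send a prompt to the local Ollama server and return the completion text."""
    req = urllib.request.Request(OLLAMA_URL, data=build_request(model, prompt),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Example (requires a running server):
# print(generate("llama3.2", "Explain VRAM in one sentence."))
```

This is the hook that tools like Open WebUI and Continue.dev use under the hood.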
🌐 Open WebUI – The ChatGPT Interface
A free, self-hosted web interface for Ollama. Opens in your browser and looks like ChatGPT. Handles conversation history, multiple models, file uploads. Run it with Docker in one command.
```shell
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main
```
Then open http://localhost:3000 in your browser. You now have a local ChatGPT.
🖥️ LM Studio – Windows GUI Alternative
If you're on Windows and prefer a native GUI over Docker, LM Studio is the go-to. Download models from Hugging Face directly inside the app, switch between them easily, and get a chat interface built-in.
⌨️ Continue.dev / Cursor – AI Coding
Continue.dev is a VS Code extension that connects to your local Ollama instance for AI code completion and chat; Cursor is a standalone AI editor with a similar feel. With a local model behind Continue, you get a GitHub Copilot-style experience running entirely on your own hardware.
First Models to Try
Once Ollama is running, try these in order of increasing model size:
- `ollama run llama3.2` – 3B, Meta's latest, excellent baseline
- `ollama run phi4-mini` – 3.8B, very fast, great for quick tasks
- `ollama run qwen2.5:14b` – 14B, strong for coding and reasoning (needs 16GB VRAM)
- `ollama run qwen2.5:32b` – 32B, excellent all-rounder (needs 24GB VRAM)
- `ollama run llama3.3:70b` – 70B, near-frontier quality (needs 40GB+ VRAM)
8. Recommended Builds by Budget
Here are four concrete builds. Prices are approximate as of early 2026 and will vary by region and availability.
🟢 Build 1: The Starter

~$1,200

| Component | Pick | Price |
|---|---|---|
| GPU | RTX 4060 Ti 16GB | ~$450 |
| CPU | Ryzen 5 7600 (AM5) | ~$150 |
| Motherboard | Budget B650 AM5 | ~$120 |
| RAM | 32GB DDR5-5600 | ~$80 |
| Storage | 2TB NVMe SSD | ~$100 |
| PSU | 650W 80+ Gold | ~$80 |
| Case + Cooler | Mid-tower + tower cooler | ~$80 |
🔵 Build 2: The Sweet Spot

~$1,600

| Component | Pick | Price |
|---|---|---|
| GPU | RTX 3090 24GB (used) | ~$700 |
| CPU | Ryzen 7 7700 (AM5) | ~$180 |
| Motherboard | B650 AM5 | ~$140 |
| RAM | 64GB DDR5-5600 | ~$150 |
| Storage | 2TB NVMe SSD | ~$100 |
| PSU | 850W 80+ Gold | ~$110 |
| Case + Cooler | Mid-tower + tower cooler | ~$80 |
🟣 Build 3: The Power User

~$3,500

| Component | Pick | Price |
|---|---|---|
| GPU | RTX 4090 24GB | ~$1,600 |
| CPU | Ryzen 9 7950X (AM5) | ~$400 |
| Motherboard | X670E AM5 | ~$280 |
| RAM | 128GB DDR5-6000 | ~$300 |
| Storage | 4TB NVMe SSD | ~$180 |
| PSU | 1000W 80+ Platinum | ~$170 |
| Case + Cooler | Full tower + 360mm AIO | ~$200 |
🍎 Build 4: The Mac Path

~$3,500–$6,000+

| Option | Unified Memory | Price |
|---|---|---|
| Mac Studio M3 Max | 128GB | ~$3,500 |
| Mac Studio M4 Max | 128GB | ~$3,500 |
| Mac Studio M3 Ultra | 192GB | ~$5,000 |
| Mac Studio M4 Ultra | 192GB | ~$6,000+ |
Conclusion
Local AI in 2026 is genuinely accessible. The open-source model ecosystem is excellent. The software stack (Ollama + Open WebUI) takes 10 minutes to set up. And the hardware, while a real investment, pays dividends in privacy, speed, and capability that compound over time.
The key things to remember:
- VRAM is everything. Size it for the models you want to run.
- 24GB is the sweet spot. The used RTX 3090 makes this accessible at ~$700.
- 64GB system RAM minimum. Don't cheap out here.
- Quantization is free. Q4_K_M lets you run models roughly 4× larger than full precision would allow in the same VRAM.
- Start with Ollama. You can be running a local model today.
The local AI revolution isn't coming. It's already here. The question is whether you're running it on someone else's hardware, or your own.
References
- NVIDIA RTX 3090 Product Page – NVIDIA Corporation (2020). Memory bandwidth specs: 936 GB/s GDDR6X. nvidia.com
- NVIDIA RTX 4090 Product Page – NVIDIA Corporation (2022). 24GB GDDR6X, 1,008 GB/s bandwidth. nvidia.com
- NVIDIA RTX 5090 Product Page – NVIDIA Corporation (2025). 32GB GDDR7, 1,792 GB/s bandwidth. nvidia.com
- Apple Mac Studio – Apple Inc. (2024). M3/M4 Max and Ultra unified memory specs. apple.com
- Ollama – Run large language models locally. ollama.com
- Open WebUI – Community-developed web interface for Ollama. github.com/open-webui/open-webui
- LM Studio – Discover, download, and run local LLMs. lmstudio.ai
- Frantar, E. et al. (2022). GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. arXiv:2210.17323. Foundational quantization research.
- GGUF format specification (2023) – llama.cpp project. The GGUF format used by Ollama and most local inference tools. github.com/ggerganov/llama.cpp
- ThinkSmart.Life – AI Model Quantization: The Complete Guide (2026). Companion article covering quantization in depth. thinksmart.life/research/posts/quantization-guide/
- ThinkSmart.Life – How GPUs Actually Work: A Deep Dive for AI Engineers (2026). GPU architecture, memory bandwidth, and CUDA cores explained. thinksmart.life/research/posts/gpu-architecture-deep-dive/
- ThinkSmart.Life – Mac Studio M3 Ultra vs DIY GPU Rig (2026). Head-to-head comparison of Mac vs NVIDIA paths. thinksmart.life/research/posts/mac-studio-ultra-vs-diy-gpu-rig/