1. You Have $4,500 — Build or Buy?
You've decided to run large language models locally. Maybe you want to serve Llama 3 70B to your team, experiment with Stable Diffusion, or fine-tune models on your own data without paying cloud bills. You have about $4,500 to spend. The question every AI enthusiast faces: do you build a DIY GPU rig from parts, or buy something off the shelf?
This isn't a theoretical exercise. We've actually built the DIY rig — our Pro Tier build delivers 4× RTX 3090 with 96GB of total VRAM on a server-grade platform for ~$4,314. Now we're going to honestly compare it against every off-the-shelf option at the same price point.
The key metric for LLM inference is VRAM. The model has to fit in GPU memory (or unified memory on Apple Silicon). A Llama 3 70B model at Q4 quantization needs ~40GB. At full FP16 precision, it needs ~140GB. Llama 3 405B at Q4 needs ~230GB. VRAM determines which models you can run — everything else is secondary.
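These requirements follow from simple arithmetic: a model's weight footprint is roughly parameter count times bits per weight. A back-of-envelope sketch (assuming Q4 quantization averages ~4.5 bits/weight, as llama.cpp's K-quants roughly do; KV cache and activations add several GB on top):

```python
def weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Weight footprint in GB: params * bits / 8 bytes per weight.
    KV cache and activation memory come on top of this."""
    return params_billions * bits_per_weight / 8

print(weight_gb(70, 4.5))   # 39.375   -> the ~40GB figure for 70B Q4
print(weight_gb(70, 16))    # 140.0    -> 70B FP16
print(weight_gb(405, 4.5))  # 227.8125 -> the ~230GB figure for 405B Q4
```

The same formula tells you instantly whether a given model/quantization combination fits a given memory budget.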
2. Our DIY Pro Tier Build (~$4,314)
Here's what ~$4,500 gets you when you build it yourself. Full details in our Pro Tier shopping list:
- GPUs: 4× NVIDIA RTX 3090 (24GB GDDR6X each) — $3,000
- Motherboard: ASRock Rack ROMED8-2T — 7× PCIe 4.0 x16, IPMI, dual 10GbE
- CPU: AMD EPYC 7252 (8-core, 128 PCIe lanes)
- RAM: 32GB DDR4 ECC (expandable to 2TB)
- PSU: Super Flower 1600W 80+ Titanium
- Frame: Veddha V3D open-air frame
- Models it runs: Llama 3 70B Q4 ✅ (16.9 tok/s) | Llama 3 70B FP16 ❌ (needs 140GB) | Llama 3 405B Q4 ❌ (needs 230GB)
- Expandable to: 7 GPUs (168GB VRAM), or 13 GPUs with bifurcation (312GB)
Real-world llama.cpp benchmarks: the 4× RTX 3090 rig generates 70B Q4_K_M output at 16.89 tokens/second, with prompt eval at 350 tok/s. For 8B models, it hits 105 tok/s generation. These numbers come from the comprehensive GPU-Benchmarks-on-LLM-Inference project[1].
3. Apple Mac Studio M3 Ultra
Apple's biggest competitor in this space is the Mac Studio with M3 Ultra chip (released March 2025). It replaces the M2 Ultra and is the first realistic off-the-shelf option for running large models locally.
What $4,500 Gets You
The Mac Studio M3 Ultra starts at $3,999 with 96GB unified memory, 28-core CPU, 60-core GPU, and 1TB SSD[2]. That's within our $4,500 budget. Upgrading to 256GB unified memory pushes the price to ~$5,599 — over budget but worth discussing since memory capacity is the whole game for LLMs.
| Config | Memory | GPU Cores | Bandwidth | Price |
|---|---|---|---|---|
| M3 Ultra (base) | 96GB unified | 60-core | 819 GB/s | $3,999 ✅ In budget |
| M3 Ultra (256GB) | 256GB unified | 60-core | 819 GB/s | ~$5,599 ❌ Over budget |
| M3 Ultra (512GB) | 512GB unified | 80-core | 819 GB/s | ~$8,099 ❌ Way over |
| M4 Max (base) | 36GB unified | 32-core | 410 GB/s | $1,999 |
AI Performance: The Honest Numbers
Using the M2 Ultra 192GB as a proxy (same memory bandwidth class), the llama.cpp benchmarks show[1]:
- 70B Q4_K_M generation: 12.13 tok/s (vs 16.89 tok/s on 4× RTX 3090) — DIY is 39% faster
- 70B Q4_K_M prompt eval: 117.76 tok/s (vs 350 tok/s on 4× RTX 3090) — DIY is 3× faster
- 70B FP16 generation: 4.71 tok/s — Mac wins here because 4× RTX 3090 can't even run it (OOM at 96GB)
- 8B Q4_K_M generation: 76.28 tok/s (vs 104.94 tok/s on 4× 3090)
Mac Studio Pros & Cons
- ✅ Near-silent operation — the fan is effectively inaudible at idle
- ✅ Tiny form factor (7.7" × 7.7" × 3.7")
- ✅ Can run 70B FP16 on 96GB unified (can't on discrete GPUs)
- ✅ Built-in 10GbE, Thunderbolt 5, WiFi 6E
- ✅ macOS + no driver headaches
- ✅ 1-year warranty + AppleCare option
- ❌ Not upgradeable — what you buy is what you get forever
- ❌ 39% slower than 4× 3090 on quantized models
- ❌ 3× slower prompt processing
- ❌ No CUDA — limited training/fine-tuning ecosystem
- ❌ Cannot run Llama 3 405B even at Q4 (needs 230GB, only 96GB in budget)
4. NVIDIA Jetson AGX Orin 64GB
The Jetson AGX Orin 64GB Developer Kit costs approximately $1,999[3]. At first glance, it's compelling — an NVIDIA GPU with 64GB of unified memory, CUDA support, designed for edge AI. But the reality is more nuanced.
- Memory: 64GB LPDDR5 unified (shared between CPU and GPU)
- GPU: 2048 CUDA cores + 64 Tensor cores (Ampere architecture)
- Compute: ~275 TOPS (INT8), ~5.3 TFLOPS FP32
- Memory bandwidth: 204.8 GB/s
- Power: 15-60W
- Form factor: Credit card-sized module
The Jetson was designed for robotics and edge deployment, not desktop LLM inference. Community benchmarks show disappointing results: Deepseek-R1-Distill-Qwen-7B runs at about 10 tokens/second, and even tiny 1.5B models struggle to exceed 20 tok/s[4]. For comparison, a single RTX 3090 hits 111 tok/s on 8B Q4.
At $2,000, you could buy two Jetsons for $4,000 — but they don't pool their memory. You'd have two independent 64GB machines. The better move at this price point is simply buying the GPUs for a DIY build.
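The Jetson's low numbers are predictable from first principles: single-stream token generation is memory-bandwidth bound, because every generated token streams the full set of weights from memory once. A rough ceiling estimate (bandwidth figures from the spec sheets quoted in this article; real throughput typically lands at half to two-thirds of the ceiling):

```python
def ceiling_tok_s(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on generation speed: each token reads all weights once,
    so tokens/s <= memory bandwidth / bytes read per token."""
    return bandwidth_gb_s / model_gb

MODEL_GB = 40  # 70B at Q4
for name, bw in [("Jetson AGX Orin", 204.8),
                 ("Mac Studio M3 Ultra", 819.0),
                 ("RTX 3090 (per card)", 936.0)]:
    print(f"{name}: <= {ceiling_tok_s(bw, MODEL_GB):.1f} tok/s")
```

The Jetson's ~5 tok/s ceiling and the Mac's ~20 tok/s ceiling line up well with the observed ~2-3 tok/s and 12.1 tok/s respectively — both around 50-60% of the theoretical bound.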
5. Pre-Built Workstations: What $4,500 Actually Buys
HP Z4 G5 Workstation
HP's Z4 G5 is the entry-level professional workstation. For ~$4,500, a typical configuration includes[5]:
- Intel Xeon W3-2425 (6-core)
- 32-64GB DDR5 ECC
- 1 × NVIDIA RTX A4000 (16GB VRAM) or 1 × RTX 4000 Ada (20GB)
- 1TB NVMe SSD
- Windows 11 Pro license
Total VRAM: 16-20GB — enough for 7B-13B models, but a 70B model will not fit in 20GB at any usable quantization. For $4,500, you get a single professional GPU in a well-built, quiet, warrantied tower, but its AI inference capability is severely limited compared to the DIY rig's 96GB.
Dell Precision 3680 / 5860
Dell's workstation story is similar. A Precision 3680 tower with an RTX 4060 Ti (16GB) runs ~$2,500-3,500. Step up to a Precision 5860 tower with dual GPU support and you're looking at $5,000+ to get two RTX A5000 cards (48GB total)[6]. At the $4,500 budget, you'll land on a single RTX A5000 (24GB) or RTX 4000 Ada (20GB) — similar to the HP.
Lenovo ThinkStation P3
Lenovo's ThinkStation P3 tower configured with an RTX A4000 (16GB) and Xeon processor comes in around $3,500-4,500[7]. Same story — one professional GPU, 16-20GB VRAM. Excellent build quality and support, but not enough memory for serious LLM work.
Lambda Labs — Discontinued Hardware
Lambda Labs was once the go-to for pre-built ML workstations (Vector, Vector One, Vector Pro). However, Lambda ended its on-premise hardware business in August 2025[8] and now focuses exclusively on cloud GPU compute. Their cloud pricing starts at ~$0.50/hr for an A10G, or $2.49/hr for an H100. If you'd spend $4,500 on cloud GPUs at H100 rates, you'd get about 1,800 hours (~75 days) of compute — after which your money is gone and you own nothing.
Puget Systems
Puget Systems builds custom workstations with legendary support. Their single-GPU AI workstation starts around $3,100 with an RTX 4090 (24GB)[9]. For $4,500, you could get:
- 1× RTX 4090 (24GB) + Intel Core i9 + 64GB DDR5 + 1TB NVMe
- Professional cable management, quiet cooling, 3-year warranty
- Lifetime phone/email support from actual humans
Total VRAM: 24GB. A single RTX 4090 is faster per-GPU than a single RTX 3090, but 24GB vs our 96GB means dramatically fewer models you can run. Multi-GPU Puget configs with 2× RTX 4090 start around $7,000-8,000. With 4× GPUs, you're looking at $12,000+.
Used Enterprise Servers on eBay
What about used data center hardware? The NVIDIA A100 80GB PCIe — the gold standard — currently sells for a median of ~$18,500 used on eBay[10]. That's 4× our entire budget for a single card. Even the older A100 40GB runs $3,000-5,000 used. Enterprise GPU hardware holds its value stubbornly because data centers still need it.
You can find used dual-GPU servers (e.g., Dell R740 with 2× Tesla V100 32GB) for $3,000-5,000, but V100s are significantly slower than RTX 3090s and have less VRAM per card.
6. The Big Comparison Table
| Feature | DIY Pro Tier (4× RTX 3090) | Mac Studio M3 Ultra 96GB | Jetson AGX Orin 64GB | HP Z4 G5 (RTX A4000) | Puget Systems (1× RTX 4090) |
|---|---|---|---|---|---|
| Price | $4,314 | $3,999 | $1,999 | ~$4,500 | ~$4,500 |
| Total VRAM / Memory | 96GB Best | 96GB unified | 64GB shared | 16GB | 24GB |
| Memory Bandwidth | 3,744 GB/s combined | 819 GB/s | 204.8 GB/s | 448 GB/s | 1,008 GB/s |
| FP16 TFLOPS | ~140 | ~27* | ~5.3 (FP32) | ~19.2 | ~82.6 |
| 70B Q4 (tok/s) | 16.9 Fastest | 12.1 | ~2-3 (est.) | ❌ OOM | ❌ OOM |
| 70B FP16 | ❌ OOM | 4.7 tok/s Only option | ❌ OOM | ❌ OOM | ❌ OOM |
| Llama 405B Q4 | ❌ (needs 230GB) | ❌ (needs 230GB) | ❌ | ❌ | ❌ |
| Power Draw | ~1,600W peak | ~480W max | 15-60W | ~350W | ~600W |
| Noise Level | Loud (open frame, 4 GPUs) | Silent Best | Silent (fanless) | Quiet (enclosed) | Quiet (enclosed) |
| Form Factor | Open frame (large) | Desktop cube (tiny) | Module (tiny) | Tower | Tower |
| Expandability | Up to 7-13 GPUs Best | None — soldered | None | 1-2 GPUs max | 1-2 GPUs max |
| CUDA Support | ✅ Full | ❌ Metal only | ✅ Full | ✅ Full | ✅ Full |
| Training / Fine-tuning | ✅ Excellent | ⚠️ Limited (MLX) | ⚠️ Slow | ⚠️ Small models only | ✅ Good (24GB limit) |
| Warranty | Individual parts only | 1-year + AppleCare | 1-year NVIDIA | 3-year on-site | 3-year + lifetime support |
| Setup Time | 4-8 hours | 5 minutes | 30 minutes | 30 minutes | 30 minutes |
* Apple doesn't publish TFLOPS for M3 Ultra GPU in a directly comparable way. The 60-core GPU is estimated at ~27 TFLOPS FP16 based on per-core Apple GPU architecture numbers. Actual AI inference performance depends heavily on memory bandwidth, not raw TFLOPS.
7. Performance per Dollar Analysis
Let's normalize performance to dollars spent. The metric that matters most for LLM inference: tokens per second per $1,000 spent.
| System | Price | 70B Q4 tok/s | tok/s per $1K | VRAM per $1K |
|---|---|---|---|---|
| DIY Pro Tier (4× 3090) | $4,314 | 16.89 | 3.91 Best | 22.3 GB Best |
| Mac Studio M3 Ultra 96GB | $3,999 | 12.13 | 3.03 | 24.0 GB |
| Jetson AGX Orin 64GB | $1,999 | ~2.5 (est.) | 1.25 | 32.0 GB |
| Puget (1× RTX 4090) | $4,500 | OOM | 0 (can't run) | 5.3 GB |
| HP Z4 G5 (RTX A4000) | $4,500 | OOM | 0 (can't run) | 3.6 GB |
The story is clear: the DIY build dominates on performance per dollar for 70B models — 29% more tok/s per dollar than the Mac Studio. The pre-built workstations from HP, Dell, and Puget literally cannot run the benchmark model at all because they don't have enough VRAM.
However, look at VRAM per $1K: the Mac Studio and Jetson offer competitive memory density thanks to unified architectures. The Mac Studio's 96GB at $3,999 gives you 24GB per $1K — slightly ahead of the DIY build. The difference is that the Mac's unified memory is slower for GPU compute than dedicated VRAM.
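The normalization behind the table is straightforward; a quick sketch that reproduces the two headline columns from the prices and benchmark numbers quoted above (the Jetson's tok/s figure is the rough estimate, not a measurement):

```python
# price (USD), 70B Q4 generation speed (tok/s), usable memory (GB)
systems = {
    "DIY Pro Tier (4x 3090)":   {"price": 4314, "tok_s": 16.89, "mem_gb": 96},
    "Mac Studio M3 Ultra 96GB": {"price": 3999, "tok_s": 12.13, "mem_gb": 96},
    "Jetson AGX Orin 64GB":     {"price": 1999, "tok_s": 2.5,   "mem_gb": 64},
}
for name, s in systems.items():
    per_k = s["price"] / 1000  # thousands of dollars spent
    print(f"{name}: {s['tok_s'] / per_k:.2f} tok/s per $1K, "
          f"{s['mem_gb'] / per_k:.1f} GB per $1K")
```

Running it shows the trade-off at a glance: the DIY rig leads on speed per dollar, while the unified-memory machines lead on memory per dollar.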
8. Hidden Costs: What the Sticker Price Doesn't Tell You
Electricity
This is the biggest ongoing cost difference. Assuming 8 hours of active inference per day at US average $0.16/kWh:
- DIY 4× RTX 3090: ~1,400W under load × 8h × 30 days × $0.16 = $53.76/month
- Mac Studio M3 Ultra: ~300W average × 8h × 30 days × $0.16 = $11.52/month
- Puget/HP (single GPU): ~250W × 8h × 30 days × $0.16 = $9.60/month
The Mac Studio saves you ~$42/month in electricity. Over 3 years, that's $1,512 in electricity savings — significant, but the Mac still can't match the DIY rig's inference speed.
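These figures come from a simple kWh calculation; a sketch of the arithmetic under the same assumptions (8 active hours/day, 30 days, $0.16/kWh):

```python
def monthly_usd(watts: float, hours_per_day: float = 8,
                days: float = 30, usd_per_kwh: float = 0.16) -> float:
    """Monthly electricity cost for a constant load during active hours."""
    return watts / 1000 * hours_per_day * days * usd_per_kwh

print(round(monthly_usd(1400), 2))  # 53.76 -> DIY 4x RTX 3090 under load
print(round(monthly_usd(300), 2))   # 11.52 -> Mac Studio average
print(round(monthly_usd(250), 2))   # 9.6   -> single-GPU workstation
```

Plugging in your local electricity rate and duty cycle is the fastest way to see whether the power gap matters for your situation.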
Cooling & Noise
Four RTX 3090s in an open frame generate significant heat (~1,400W = ~4,780 BTU/hr). This is equivalent to a small space heater. In summer, your AC will work harder. The Mac Studio is silent and produces minimal heat. If you're putting this in a living space, the Mac wins decisively on livability.
Warranty & Support
- DIY: Individual component warranties (GPU: 3-year if new, none if used from eBay). If something breaks, you troubleshoot yourself.
- Mac Studio: 1-year limited + optional AppleCare for 3 years. Walk into an Apple Store for help.
- HP/Dell: 3-year on-site service with next-business-day response. An engineer comes to your office.
- Puget: 3-year parts and labor + lifetime phone/email support. The gold standard for workstation support.
Time to Build
The DIY rig takes 4-8 hours to assemble, plus time researching parts, waiting for deliveries, and troubleshooting any BIOS/driver issues. The Mac Studio takes 5 minutes to unbox and plug in. If your time is worth $100/hour, the build time alone costs $400-800 in opportunity cost.
Risk of Used Parts
Our build uses RTX 3090s at $750 each — these are used cards, many from cryptocurrency mining. While mining cards are generally reliable (they run at constant temperatures, which is gentler than gaming thermal cycling), there's inherent risk. A dead GPU means $750 lost and troubleshooting time. New professional GPUs from HP/Dell/Puget don't carry this risk.
9. When to Buy Off-the-Shelf vs DIY
Buy the Mac Studio M3 Ultra ($3,999) if:
- You need a silent, desk-friendly AI machine
- You want to run 70B models at full FP16 precision (unified memory advantage)
- You value zero maintenance and Apple ecosystem
- You won't need to expand beyond 96GB (unless you pay $5,599+ for 256GB)
- You primarily do inference, not training (MLX is maturing but CUDA ecosystem is deeper)
- You need a machine that also works as a general workstation (video editing, development, etc.)
Buy a Pre-Built Workstation (HP/Dell/Puget) if:
- You work in a corporate environment that requires vendor support contracts
- You need on-site warranty service and can't afford downtime
- Your AI workloads are small models (7B-13B) and you need the machine for other professional work too
- IT policy prohibits custom builds
- You can expense the purchase and support contract matters more than VRAM
Build the DIY Pro Tier Rig ($4,314) if:
- You need maximum VRAM per dollar — 96GB for $4,314
- You want to run 70B+ models at the fastest possible speed
- You plan to expand to 7+ GPUs over time
- You need CUDA for training and fine-tuning
- You have a dedicated space (closet, garage, basement) where noise and heat are fine
- You enjoy building and maintaining your own hardware
- You're running 24/7 inference servers and need IPMI remote management
10. Verdict & Recommendation
After researching every option at the ~$4,500 price point, here's our honest assessment:
The bottom line: The used RTX 3090 market has created an extraordinary value proposition for DIY builders. At $750 per card, you get 24GB of fast VRAM with full CUDA support — something that costs $2,000+ in professional GPU form factors. The DIY build exploits this gap. Until used GPU prices rise or professional GPUs drop, the DIY advantage is enormous.
11. Our Build Guides
Ready to build? We have complete, step-by-step guides with buy links for every component:
- 📦 Budget Build ($3,500) — Complete Shopping List — The entry-level 4× RTX 3090 build with consumer parts
- 🏗️ Pro Tier Build ($4,314) — Server-Grade Shopping List — ROMED8-2T, EPYC, PCIe 4.0, IPMI, dual 10GbE
- ⚔️ Budget vs Pro Tier Comparison — Which build is right for you?
- 🔧 Multi-GPU Setup Guide — How to connect and run multiple GPUs together
- 🖥️ Local LLM Hardware Guide — GPU deep dive, build tiers, model requirements
References
- XiongjieDai, "GPU Benchmarks on LLM Inference — Multiple NVIDIA GPUs or Apple Silicon," github.com. LLaMA 3 benchmarks across 30+ GPU configurations.
- Apple, "Mac Studio (2025) — Tech Specs," support.apple.com. M3 Ultra and M4 Max configurations.
- NVIDIA, "Jetson AGX Orin for Next-Gen Robotics," nvidia.com.
- NVIDIA Developer Forums, "The token speed of LLM on Jetson AGX Orin," forums.developer.nvidia.com, 2025.
- HP, "HP Z4 Workstation," hp.com.
- PromiseGulf, "Dell Precision Workstation Price List 2025," blog.promisegulf.com, September 2025.
- Lenovo, "ThinkStation P3 Tower Workstation," lenovo.com.
- Lambda, "Legacy Hardware," lambda.ai. Lambda ended on-premise hardware business August 29, 2025.
- Puget Systems, "Single GPU Tower Workstation for AI Development," pugetsystems.com.
- r/LocalLLaMA, "Used A100 80 GB Prices Don't Make Sense," reddit.com, May 2025. Median A100 80GB PCIe price: $18,502.
- r/LocalLLaMA, "Speed Test: Llama-3.3-70b on 2xRTX-3090 vs M3-Max 64GB," reddit.com, December 2024.
- PCMag, "Apple Mac Studio (2025, M3 Ultra) Review," pcmag.com, March 2025. Base price $3,999.
- TechRadar, "Puget Systems Workstation Review," techradar.com, January 2024. Pricing from $3,132 to $61,000.
- markus-schall.de, "AI Studio 2025: The best hardware for LLMs and image AI," markus-schall.de, November 2025.
- ggml-org, "Performance of llama.cpp on Apple Silicon M-series," github.com. Comprehensive Apple Silicon benchmarks.