
1. Why Multiple GPUs? The AI Compute Arms Race

We're in the middle of a GPU arms race. As AI models grow larger — 70 billion, 405 billion, even 671 billion parameters — a single GPU simply can't hold the entire model in memory. The solution? Connect multiple GPUs together and split the workload across them.

This isn't just for data centers anymore. People are building multi-GPU rigs at home. @TheAhmadOsman runs 33 GPUs at home — 21× RTX 3090s, 4× RTX 4090s, 4× RTX 5090s, and 4× Tenstorrent Blackhole p150a accelerators. He's not alone: the r/LocalLLaMA and r/homelab communities are filled with people building GPU farms for AI inference, model training, 3D rendering, and scientific computing [1].

The motivations are clear: pool enough VRAM to run models no single card can hold, keep data private on your own hardware, and replace recurring cloud bills with a one-time hardware cost.

This guide covers everything: the hardware interconnects (NVLink, PCIe, InfiniBand), the physical builds (motherboards, frames, power, cooling), the software frameworks (vLLM, DeepSpeed, Exo, llama.cpp), and practical build tiers from $1,500 to $15,000+.

2. Hardware: How GPUs Talk to Each Other

The single most important factor in a multi-GPU setup is how the GPUs communicate. Different interconnects offer wildly different bandwidth, and bandwidth directly determines performance when splitting models across cards.

| Interconnect | Bandwidth | Latency | Use Case | Cost |
| --- | --- | --- | --- | --- |
| NVLink 4.0 | 900 GB/s | Very low | Tensor parallelism, HPC | $80–$150 (bridge) |
| NVLink 3.0 | 600 GB/s | Very low | A100/RTX 3090 pairs | $80–$120 (bridge) |
| PCIe 5.0 x16 | 64 GB/s | Low | Standard multi-GPU | Included (motherboard) |
| PCIe 4.0 x16 | 32 GB/s | Low | Mining rigs, inference | Included (motherboard) |
| PCIe Riser (x1) | ~1–4 GB/s | Medium | Mining, independent tasks | $5–$15 per riser |
| InfiniBand HDR | 200 Gbps (25 GB/s) | 1–5 μs | Multi-node clusters | $200–$500 (used cards) |
| 100GbE Ethernet | 100 Gbps (12.5 GB/s) | 10–50 μs | Multi-node, budget | $50–$200 (used NICs) |

NVLink: The Gold Standard

NVLink is NVIDIA's proprietary high-speed GPU-to-GPU interconnect. It bypasses the PCIe bus entirely, giving GPUs a direct memory access path to each other's VRAM. With NVLink 4.0 on H100s, that's 900 GB/s of bidirectional bandwidth — 14× faster than PCIe 5.0 [2].

For consumer GPUs, NVLink support is limited: the RTX 3090 was the last consumer card with a bridge connector (its cut-down NVLink 3.0 link delivers roughly 112 GB/s per pair, not the A100's full 600 GB/s), and the RTX 40- and 50-series dropped NVLink entirely.

⚠️ NVLink Only Bridges Identical GPUs You cannot NVLink an RTX 3090 with an RTX 4090. Both cards must be the same model. The bridge size must also match the card spacing on your motherboard (3-slot or 4-slot) [3].

Do you need NVLink? For tensor parallelism (splitting one model across GPUs), NVLink helps significantly — 30–50% better performance than PCIe on large models. For pipeline parallelism or independent workloads (different models on different GPUs), PCIe is perfectly fine [4].

PCIe: The Universal Standard

Every GPU plugs into a PCIe slot. Most multi-GPU setups rely on PCIe — it's what your motherboard provides natively. The key considerations: total CPU lanes (they determine how many cards can run at full x16), PCIe generation (5.0 doubles 4.0's per-lane bandwidth), and physical slot spacing for cards that are 2–3 slots wide.

For AI inference, PCIe bandwidth is usually sufficient. LLM inference is memory-bandwidth bound (how fast the GPU reads its own VRAM), not interconnect-bound. The inter-GPU communication during tensor parallelism adds overhead, but on PCIe 4.0 x16 it's manageable for up to 4 GPUs [5].
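That claim is easy to sanity-check with back-of-envelope arithmetic (illustrative spec-sheet numbers for a 4-bit 70B model split across two RTX 3090s over PCIe 4.0, not measurements):

```python
# Back-of-envelope: why LLM decoding is memory-bandwidth bound, not
# interconnect bound. Spec-sheet numbers, not benchmarks.

model_bytes = 70e9 * 0.5          # 70B params at 4-bit quantization (~0.5 byte/param)
gpu_mem_bw  = 936e9               # RTX 3090 VRAM bandwidth, bytes/s
pcie_bw     = 32e9                # PCIe 4.0 x16, bytes/s

# Each decoded token must stream every weight once from VRAM,
# split across 2 GPUs working in parallel:
t_weights = model_bytes / (2 * gpu_mem_bw)

# Tensor parallelism exchanges activations per layer: assume 80 layers,
# hidden size 8192, fp16 (2 bytes) -- tiny compared to the weights.
t_comm = 80 * 8192 * 2 / pcie_bw

print(f"weight streaming per token:  {t_weights * 1e3:.1f} ms")
print(f"PCIe activation traffic per token: {t_comm * 1e3:.3f} ms")
```

Even with generous assumptions, the per-token interconnect traffic is orders of magnitude smaller than the time spent streaming weights out of VRAM.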

PCIe Risers: More GPUs, Less Bandwidth

PCIe risers are adapter cables that connect a GPU to a PCIe x1 slot via a USB-style cable. They were the backbone of cryptocurrency mining rigs — allowing 6, 8, or even 13 GPUs on a single motherboard.

⚠️ Never Use SATA-Powered Risers SATA connectors are rated for 54W. PCIe risers can draw up to 75W. The result: melted connectors and potential fires. Always use 6-pin or 8-pin PCIe power for risers.

SLI: A Brief History (Deprecated)

Before NVLink, there was SLI (Scalable Link Interface) — NVIDIA's original multi-GPU technology for gaming. SLI split rendering frames across two (or more) GPUs. It was killed off because game support was inconsistent, frame pacing suffered (the infamous micro-stutter), real-world scaling rarely approached 2×, and DirectX 12 and Vulkan shifted multi-GPU responsibility onto game developers, who mostly declined to do the work.

NVIDIA officially dropped SLI support after the RTX 30-series. The RTX 3090 was the last consumer card to support NVLink (SLI's successor). Today, multi-GPU is about compute, not gaming [2].

InfiniBand & High-Speed Ethernet: Multi-Node Setups

When you run out of PCIe slots on a single machine, you connect multiple machines together. This is where InfiniBand and high-speed Ethernet come in: used InfiniBand HDR cards and 100GbE NICs (see the table above) are cheap on the secondhand market, and both support RDMA for low-latency GPU-to-GPU transfers between nodes.

💡 When to Go Multi-Node Multi-node only makes sense when you need more GPUs than fit in one machine (typically 4–8 in a server chassis). For most homelab users, a single 4–8 GPU machine is simpler and faster than a cluster. Multi-node shines at 16+ GPUs.

3. Motherboards & Frames for Multi-GPU

Server Motherboards

Consumer motherboards typically have 2–3 PCIe x16 slots. For serious multi-GPU builds, you need server or workstation boards:

| Board | CPU Socket | PCIe x16 Slots | Max GPUs | Price |
| --- | --- | --- | --- | --- |
| ASUS Pro WS WRX90E-SAGE SE | sTR5 (Threadripper) | 7× PCIe 5.0 | 7 | ~$1,100 |
| Supermicro H13SSL-N | SP5 (EPYC) | 6× PCIe 5.0 | 6 | ~$800 |
| ASRock Rack ROMED8-2T | SP3 (EPYC 7003) | 7× PCIe 4.0 | 7 | ~$500 (used) |
| Gigabyte MC62-G40 | SP3 (EPYC 7003) | 6× PCIe 4.0 | 6 | ~$400 (used) |
| ASUS B250 Mining Expert | LGA 1151 | 19× PCIe (x1 risers) | 19 | ~$200 (used) |
💡 PCIe Lane Math An AMD EPYC 9004 gives you 128 PCIe 5.0 lanes. That's 8 GPUs at full x16 speed. A Threadripper PRO 7000 gives 128 lanes as well. Consumer Ryzen/Intel? Only 20–28 lanes — enough for 1–2 GPUs at full speed before lane splitting kicks in [8].
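The lane math in the tip above, as a quick calculator (a sketch that ignores lanes reserved for NVMe and the chipset; subtract those for a real build):

```python
# Quick PCIe lane budgeting: what link width does each GPU get?
def lanes_per_gpu(cpu_lanes, gpu_count):
    # PCIe links negotiate widths in powers of two: x16, x8, x4, x2, x1
    for width in (16, 8, 4, 2, 1):
        if width * gpu_count <= cpu_lanes:
            return width
    return 0

print(lanes_per_gpu(128, 8))   # EPYC 9004: every GPU at full x16
print(lanes_per_gpu(24, 4))    # consumer CPU: each GPU falls to x4
```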

Mining Motherboards (Budget Multi-GPU)

For workloads that don't need full PCIe bandwidth (mining, independent inference tasks, or pipeline parallelism), mining motherboards with tons of x1 slots + risers are an economical choice. Boards like the ASUS B250 Mining Expert support up to 19 GPUs via risers. The BTC-37 and similar Chinese boards are even cheaper (~$80) with 8 GPU support [6].

Open-Air Mining Frames

You can't shove 8 GPUs into a regular PC case. Open-air frames solve this with an aluminum rack design that holds GPUs vertically with plenty of airflow.

Advantages of open-air frames: excellent airflow (GPUs run 10–20°C cooler than in enclosed cases), easy access for maintenance, and no compatibility issues with GPU length. The downside: dust accumulation and noise — these rigs are not quiet [6].

4. Power & Cooling Considerations

Power Requirements

GPUs are power-hungry. Here's the math for common setups:

| Setup | GPU Power | System Total | Recommended PSU | Electrical Circuit |
| --- | --- | --- | --- | --- |
| 2× RTX 4090 | 900W | ~1,100W | 1,200W+ ATX | 1× 20A circuit |
| 4× RTX 3090 | 1,400W | ~1,600W | 2× 850W or 1× 1,600W | 1× 20A circuit |
| 6× RTX 3090 | 2,100W | ~2,400W | 2× 1,200W or server PSU | 2× 15A circuits |
| 8× RTX 5090 | 4,600W | ~5,000W | Multiple server PSUs | 2× 30A circuits |
⚠️ Don't Overlook Your Electrical Panel A standard US 15A/120V outlet provides 1,800W max (1,440W sustained at 80% rule). A 6-GPU rig easily exceeds that. You may need dedicated 20A circuits, 240V outlets, or multiple circuits. Consult an electrician before running 2,000W+ of GPU hardware [9].

Power Supply Options

Three common approaches: a single high-wattage ATX unit (1,200–1,600W) for builds up to about 4 GPUs; multiple ATX PSUs joined with a dual-PSU adapter; or used HP/Dell server PSUs with breakout boards, the mining-era favorite (1,200W units for roughly $40 each plus a breakout board). Server PSUs are loud, but nothing beats their cost per watt.

Power Limiting: The Secret Weapon

You don't have to run GPUs at full TDP. Power-limiting by 10–20% reduces performance by less than 5% while cutting heat and power draw significantly. Four RTX 3090s power-limited from 350W to 280W each saves 280W total — the difference between needing one circuit or two [9].

# Set power limit on Linux (requires root)
nvidia-smi -i 0 -pl 280   # GPU 0: limit to 280W
nvidia-smi -i 1 -pl 280   # GPU 1: limit to 280W

# Verify
nvidia-smi --query-gpu=power.limit --format=csv

Cooling Solutions

Open-air frames with a row of intake fans handle most builds. Densely packed cards in an enclosed case need 2-slot spacing, blower-style coolers, or water cooling. On GDDR6X cards, watch VRAM temperatures closely — the memory often throttles before the core does.

💡 Temperature Targets GPU core: under 80°C. VRAM (GDDR6X): under 95°C (check with nvidia-smi -q -d TEMPERATURE). Thermal throttling starts at 83°C on most NVIDIA cards, drastically reducing performance.

5. Software: Making Your GPUs Work Together

Hardware is only half the battle. You need software that knows how to split workloads across multiple GPUs. Here are the major frameworks:

vLLM — High-Throughput Multi-GPU Inference

vLLM is the leading open-source framework for serving large language models. It supports tensor parallelism — splitting a single model's layers across GPUs so each GPU processes part of every token simultaneously [10].

# Serve a 70B model across 2 GPUs with tensor parallelism
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.9

vLLM's PagedAttention algorithm manages GPU memory like an OS manages RAM — with virtual pages that eliminate memory waste. This means you can serve more concurrent requests per GPU than with naive approaches [10].
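A toy sketch of the PagedAttention idea (illustrative only, not vLLM's actual implementation): the KV cache is carved into fixed-size blocks drawn from a shared pool, so each sequence only ever holds the blocks it has filled.

```python
# Toy paged KV-cache allocator: blocks come from a shared free pool and
# are handed out on demand, eliminating per-sequence over-allocation.
class BlockPool:
    def __init__(self, num_blocks, block_tokens=16):
        self.free = list(range(num_blocks))   # ids of unused blocks
        self.block_tokens = block_tokens
        self.tables = {}    # seq_id -> list of block ids ("block table")
        self.lengths = {}   # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        n = self.lengths.get(seq_id, 0)
        if n % self.block_tokens == 0:        # last block full: grab a new one
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id):                # sequence done: blocks return to pool
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

pool = BlockPool(num_blocks=64)
for _ in range(40):           # a 40-token sequence needs ceil(40/16) = 3 blocks
    pool.append_token("req-1")
print(len(pool.tables["req-1"]))   # 3
```

The contrast with naive serving is that nothing is reserved up front for a maximum context length; memory is committed one block at a time as sequences actually grow.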

💡 vLLM vs. llama.cpp for Multi-GPU Ahmad Osman's benchmarks show vLLM with tensor parallelism is significantly faster than llama.cpp's pipeline parallelism on multi-GPU setups. llama.cpp splits layers sequentially — GPU 0 processes layers 1–40, then GPU 1 processes 41–80. Each GPU is idle half the time. Tensor parallelism keeps all GPUs busy simultaneously [1].

llama.cpp / Ollama — Simple Multi-GPU Inference

llama.cpp uses pipeline parallelism — assigning different layers to different GPUs. It's simpler to set up and works with heterogeneous GPUs (mix RTX 3090 + RTX 4090), but each GPU is idle while waiting for the other to finish its layers [11].

# llama.cpp: split 80 layers across 2 GPUs
./llama-server -m model.gguf \
  --n-gpu-layers 80 \
  --tensor-split 0.5,0.5

# Ollama: automatically uses all visible GPUs
CUDA_VISIBLE_DEVICES=0,1 ollama run llama3.1:70b
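The half-idle claim above can be made concrete with a toy utilization formula (a sketch assuming perfectly balanced stages and no batching within a stage):

```python
# Pipeline parallelism: with S sequential stages, a single request keeps
# each GPU busy only 1/S of the time; concurrency fills the pipeline.
def pipeline_utilization(stages, concurrent_requests):
    # at most `stages` requests can be in flight at once
    return min(concurrent_requests, stages) / stages

print(pipeline_utilization(2, 1))   # one chat session on 2 GPUs: 0.5
print(pipeline_utilization(2, 2))   # two concurrent requests: 1.0
```

This is why pipeline parallelism hurts most for the single-user homelab case, and why tensor parallelism (all GPUs on every token) wins there.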

Recent breakthroughs in llama.cpp (January 2026) have improved multi-GPU performance significantly, with automatic tensor split optimization and better NUMA-aware scheduling [11].

DeepSpeed — Distributed Training at Scale

Microsoft's DeepSpeed is the go-to library for training large models across multiple GPUs. Its ZeRO (Zero Redundancy Optimizer) stages progressively partition model states across GPUs [12]:

# DeepSpeed training with 4 GPUs
deepspeed --num_gpus=4 train.py \
  --deepspeed ds_config.json \
  --model_name_or_path meta-llama/Llama-3.1-8B
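A minimal ds_config.json to pair with the command above, written from Python so the keys are visible (the batch sizes and ZeRO stage are placeholders to tune for your hardware):

```python
# Generate a minimal DeepSpeed config. ZeRO stage 1 partitions optimizer
# states across GPUs; stage 2 adds gradients; stage 3 adds the parameters.
import json

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,            # partition optimizer states + gradients
        "overlap_comm": True,  # overlap gradient all-reduce with backward pass
    },
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```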

PyTorch Distributed — Native Multi-GPU

PyTorch provides built-in multi-GPU support with two main approaches [13]: DataParallel (single-process; simple but slow, since one Python process replicates the model and scatters batches) and DistributedDataParallel (DDP: one process per GPU, with gradients synchronized via all-reduce — the recommended approach).

# PyTorch DDP launch with 4 GPUs
torchrun --nproc_per_node=4 train.py
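Conceptually, what DDP does after every backward pass is average the gradients across ranks with an all-reduce; a pure-Python sketch of that operation (real DDP uses NCCL over the GPU interconnect):

```python
# All-reduce-mean: every rank contributes its local gradients and every
# rank receives the same averaged result, so all replicas stay in sync.
def allreduce_mean(grads_per_rank):
    n = len(grads_per_rank)
    summed = [sum(g) for g in zip(*grads_per_rank)]   # elementwise sum across ranks
    return [s / n for s in summed]

# 4 ranks, each holding gradients computed on its own data shard
ranks = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
print(allreduce_mean(ranks))   # [4.0, 5.0]
```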

Exo — Distributed Inference Across Heterogeneous Devices

Exo is fascinating — it lets you pool compute across completely different devices over a network. Connect a Mac Studio, a PC with 2× RTX 4090s, and a laptop with an RTX 3060 into one inference cluster. Exo handles splitting the model and routing tokens between devices [14].

# Node 1 (Mac Studio)
exo run llama-3.1-70b --node

# Node 2 (PC with GPUs)  
exo run llama-3.1-70b --node

# They auto-discover each other and pool VRAM

Petals — Collaborative Distributed Inference

Petals takes distributed inference to the internet scale. Multiple people contribute their GPUs to serve parts of a large model. Think BitTorrent, but for LLM inference. You contribute GPU time and can run models that no single person could afford [15].

Other Notable Frameworks

Also worth knowing: Megatron-LM (NVIDIA's tensor- and pipeline-parallel training library), PyTorch FSDP (fully sharded data parallel, similar in spirit to ZeRO stage 3), TensorRT-LLM (NVIDIA's optimized inference engine), and Ray (general-purpose distributed compute that many serving stacks build on).

GPU Selection & CUDA Basics

# See all GPUs
nvidia-smi

# Select specific GPUs for a process
CUDA_VISIBLE_DEVICES=0,2 python my_script.py  # Uses GPU 0 and GPU 2

# Monitor GPU usage in real-time
watch -n 1 nvidia-smi

6. Use Cases: Why People Build Multi-GPU Rigs

🤖 Local AI Inference

The #1 reason people build multi-GPU rigs today. A 70B parameter model at Q4 quantization needs ~40 GB of VRAM. Two RTX 5090s (64 GB total) can run it at 27+ tokens/second. Four RTX 3090s (96 GB total) can run even larger models. This is how you run GPT-4-class models privately at home.

🧠 AI Model Training & Fine-Tuning

Fine-tuning a 7B model on a single RTX 4090 takes hours. Four GPUs with DeepSpeed DDP cut that to a quarter. Full pre-training of models like LLaMA requires hundreds of GPUs — but even fine-tuning and LoRA training benefit enormously from 2–8 GPUs with DeepSpeed ZeRO.

🎨 3D Rendering

Blender's Cycles renderer scales nearly linearly with GPU count. OctaneRender and Redshift are also multi-GPU native. Adding a second GPU literally halves render times. Studios routinely use 4–8 GPU workstations for production rendering [16].

⛏️ Cryptocurrency Mining

The original multi-GPU use case. While Ethereum moved to Proof of Stake in 2022, other coins (Ravencoin, Ergo, Kaspa) remain GPU-minable. Many homelab GPU farms were originally built for mining and later repurposed for AI. The hardware is identical — open-air frames, server PSUs, risers, cooling [6].

🔬 Scientific Computing

Molecular dynamics (GROMACS, AMBER), climate modeling, computational fluid dynamics — these workloads scale across GPUs using CUDA and MPI. University research labs run 4–8 GPU workstations for simulations that would take weeks on CPUs.

🎥 Video Encoding

NVIDIA's NVENC hardware encoder is per-GPU. Two GPUs = two simultaneous encode streams. Professional video workflows use multi-GPU for real-time 4K/8K encoding. FFmpeg can target specific GPU encoders with -gpu flag.
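A sketch of how a batch of encode jobs might be fanned out round-robin across GPUs (file names are hypothetical; the -gpu option belongs to FFmpeg's NVENC encoders such as h264_nvenc):

```python
# Build one ffmpeg NVENC command per input, assigning jobs to GPUs
# round-robin so both encoders stay busy.
def encode_commands(inputs, gpu_count):
    cmds = []
    for i, src in enumerate(inputs):
        gpu = i % gpu_count   # round-robin GPU assignment
        cmds.append(
            f"ffmpeg -y -i {src} -c:v h264_nvenc -gpu {gpu} out_{i}.mp4"
        )
    return cmds

for cmd in encode_commands(["a.mov", "b.mov", "c.mov"], gpu_count=2):
    print(cmd)
```

Each command would then be launched as a separate process, one stream per NVENC engine.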

7. Build Tiers: Starter to Beast Mode

🟢 Tier 1: Starter (2 GPUs) — $1,500–$3,000

Runs: 70B models at Q4, all 7–32B models at full speed, 3D rendering at 2× single-GPU

  • GPUs: 2× RTX 3090 24GB ($700 each used) or 2× RTX 5090 32GB ($2,000 each)
  • Motherboard: Any ATX board with 2× PCIe x16 slots (e.g., ASUS TUF B650, $150)
  • CPU: AMD Ryzen 7 7700X or Intel i5-13600K ($200–$300)
  • RAM: 64GB DDR5 ($120)
  • PSU: 1,200W ATX (Corsair RM1200x, $180)
  • Case: Full tower with good airflow (Fractal Meshify 2 XL, $180) or open-air frame ($50)
  • Total: ~$2,000 with used RTX 3090s, up to ~$5,000 with RTX 5090s

This is the sweet spot for most people. Two RTX 3090s give you 48 GB VRAM for $1,400 in GPUs — enough to run Llama 3.1 70B at Q4. Software: Ollama or vLLM with tensor-parallel-size 2.

🔵 Tier 2: Enthusiast (4–8 GPUs) — $5,000–$15,000

Runs: 405B models quantized, multi-user AI serving, large-scale rendering, distributed training

  • GPUs: 4–6× RTX 3090 ($700 each) or 4× RTX 4090 ($1,800 each)
  • Motherboard: ASRock Rack ROMED8-2T or Gigabyte MC62-G40 ($400–$500 used)
  • CPU: AMD EPYC 7313 or 7443 ($200–$400 used)
  • RAM: 128–256GB ECC DDR4 ($200–$400 used)
  • PSU: 2× HP 1200W server PSUs with breakout boards ($80 total) or 1× EVGA 1600W
  • Frame: Open-air 8-GPU mining frame ($80–$120)
  • Cooling: 4× 140mm fans at base ($50)
  • Total: ~$4,000–$15,000

This is where things get serious. Six RTX 3090s = 144 GB VRAM for ~$4,200 in GPUs. You can run a fully unquantized 70B model or a quantized 405B model. Use EPYC CPUs for the PCIe lanes (128 lanes = 8 GPUs at x16 each) [8].

🟣 Tier 3: Homelab Beast (8+ GPUs) — $15,000+

Runs: DeepSeek V3 671B, full-precision 70B+ models, production AI serving, multi-node training

  • Node 1: 8× RTX 3090 on ASUS WRX90E-SAGE + Threadripper PRO 7965WX
  • Node 2: 4× RTX 5090 on Supermicro H13SSL-N + EPYC 9354
  • Networking: Mellanox ConnectX-5 100GbE cards ($100 each used) + direct cable
  • Power: Multiple server PSUs, dedicated 30A circuits
  • Software: Exo for distributed inference, DeepSpeed for training
  • Rack: 42U server rack ($200–$500)
  • Total: $15,000–$50,000+

This is @TheAhmadOsman territory. At this scale, you're managing a mini data center — with power distribution, cooling infrastructure, networking, and monitoring. But you have compute power that rivals a small cloud provider.

8. How People Do It on X

The homelab GPU community is thriving on X/Twitter, where builders like @TheAhmadOsman post full part lists, benchmarks, and lessons learned from their rigs.

🔥 Trending: Used RTX 3090s The RTX 3090 at $600–$800 used is widely considered the best value for multi-GPU AI builds in 2025–2026. 24 GB VRAM, NVLink support (unique among consumer cards), and proven reliability. The mining bust flooded the used market, and smart builders are snapping them up.

9. Common Mistakes & Tips

  1. Ignoring PCIe lane counts — Putting 4 GPUs on a consumer CPU with 24 PCIe lanes means each GPU gets x4 instead of x16. Performance drops 10–30% for bandwidth-sensitive workloads.
  2. Using SATA-powered risers — Fire hazard. Always use 6-pin PCIe powered risers.
  3. Undersized PSU — GPUs have massive transient power spikes. A 4× RTX 3090 build with a 1,200W PSU will crash under load. Add 20% headroom minimum.
  4. Poor cooling = thermal throttling — GPUs packed together with 1-slot spacing will throttle. Use 2-slot spacing or water cooling.
  5. Wrong software choice — Using llama.cpp pipeline parallelism when vLLM tensor parallelism is available wastes half your GPU compute time.
  6. Forgetting about RAM — Multi-GPU AI inference still needs lots of system RAM for KV cache overflow and CPU preprocessing. Budget 32–64 GB minimum, 128 GB for large models.
  7. Not power-limiting — Running all GPUs at stock TDP when a 10% power limit saves 15% power for 3% performance loss.
  8. Mixing GPU generations without proper software — vLLM tensor parallelism requires identical GPUs. llama.cpp and Exo handle heterogeneous GPUs.

10. Getting Started Step-by-Step

Step 1: Define Your Goal

What do you want to run? A 70B model needs ~40 GB VRAM. Training needs more than inference. Rendering scales linearly. Start with your target workload and work backwards to GPU count.
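That working-backwards step can be captured in a rule-of-thumb estimator (the 1.2× overhead factor is an assumption covering activations and runtime buffers; KV cache at long contexts needs more):

```python
# Rough VRAM estimate for dense-model inference.
# params_b: parameter count in billions; bits_per_weight: quantization level.
def vram_gb(params_b, bits_per_weight, overhead=1.2):
    return params_b * bits_per_weight / 8 * overhead

print(f"70B @ 4-bit:  ~{vram_gb(70, 4):.0f} GB")    # fits in 2x 24 GB cards
print(f"70B @ 16-bit: ~{vram_gb(70, 16):.0f} GB")   # needs 6+ cards
```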

Step 2: Choose Your GPUs

For most builders: used RTX 3090s ($700 each, 24 GB VRAM, NVLink support). If budget allows: RTX 5090 ($2,000, 32 GB VRAM, best single-card performance). For maximum VRAM: RTX PRO 6000 ($6,800, 96 GB, one card to rule them all).

Step 3: Pick Your Platform

For 2 GPUs, any consumer ATX board with two x16 slots works. For 4+ GPUs at full bandwidth, go EPYC or Threadripper PRO for the PCIe lanes. For bandwidth-insensitive workloads, a mining board with x1 risers is the cheapest path.

Step 4: Power Planning

Calculate total wattage: (GPU count × TDP) + 200W for system. Add 20% headroom. Verify your electrical circuits can handle it.
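The same formula as code (the 200W system figure and 20% headroom are this guide's rules of thumb):

```python
# PSU sizing: (GPU count x TDP) + system overhead, plus headroom for
# transient power spikes.
def psu_watts(gpu_count, tdp, system_w=200, headroom=1.2):
    return (gpu_count * tdp + system_w) * headroom

w = psu_watts(4, 350)   # 4x RTX 3090 at stock 350W TDP
print(f"recommended PSU capacity: {w:.0f} W")
# ~1,920 W sustained exceeds one 15A/120V circuit (1,440 W at the 80% rule)
```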

Step 5: Build & Configure

  1. Assemble hardware (mount GPUs, connect power, risers if needed)
  2. Install Linux (Ubuntu 22.04 or 24.04 recommended)
  3. Install NVIDIA drivers: sudo apt install nvidia-driver-560
  4. Verify GPUs: nvidia-smi (all cards should appear)
  5. Install frameworks: pip install vllm or pip install deepspeed
  6. Run your first multi-GPU workload

Step 6: Optimize

Power-limit the GPUs (Section 4), prefer tensor parallelism over pipeline parallelism when your GPUs are identical, and watch temperatures and utilization with nvidia-smi to find your bottleneck.

11. Pros & Cons of Multi-GPU Builds

| ✅ Pros | ❌ Cons |
| --- | --- |
| Pool VRAM to run larger models | High upfront cost ($1,500–$50,000+) |
| Near-linear scaling for many workloads | Significant power consumption & electricity bills |
| Complete data privacy — nothing leaves your building | Noise and heat — not apartment-friendly |
| No ongoing cloud costs after hardware purchase | Requires Linux knowledge and debugging skills |
| Full control over hardware and software stack | Hardware maintenance and potential failures |
| Can be repurposed (AI → rendering → mining) | GPUs depreciate; new generations arrive yearly |
| Breaks even vs. cloud in months of heavy use | Electrical infrastructure may need upgrading |