1. Why Multiple GPUs? The AI Compute Arms Race
We're in the middle of a GPU arms race. As AI models grow larger — 70 billion, 405 billion, even 671 billion parameters — a single GPU simply can't hold the entire model in memory. The solution? Connect multiple GPUs together and split the workload across them.
This isn't just for data centers anymore. People are building multi-GPU rigs at home. @TheAhmadOsman runs 33 GPUs at home — 21× RTX 3090s, 4× RTX 4090s, 4× RTX 5090s, and 4× Tenstorrent Blackhole p150a accelerators. He's not alone: the r/LocalLLaMA and r/homelab communities are filled with people building GPU farms for AI inference, model training, 3D rendering, and scientific computing [1].
The motivations are clear:
- VRAM pooling — A 70B parameter model needs ~40 GB of VRAM at Q4 quantization. No single consumer GPU has that, but 2× RTX 5090s (64 GB total) do.
- Throughput — Tensor parallelism across GPUs delivers near-linear speedup, so two GPUs can serve nearly twice the requests per second.
- Cost — Cloud GPU rental at $2.49/hr for an H100 adds up fast. A home rig pays for itself within months of heavy use.
- Privacy — Your data never leaves your building.
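That VRAM figure can be sketched with a back-of-the-envelope formula. This is a rough estimate only; the 15% overhead factor is an assumption covering KV cache, activations, and framework buffers, and real usage varies by engine and context length:

```python
def vram_estimate_gb(params_billions: float, bits_per_weight: float,
                     overhead: float = 1.15) -> float:
    """Rough VRAM needed for model weights, plus ~15% overhead
    (an assumed fudge factor for KV cache and framework buffers)."""
    bytes_per_weight = bits_per_weight / 8
    return params_billions * bytes_per_weight * overhead

# 70B model at Q4 (4-bit) quantization: roughly 40 GB
print(round(vram_estimate_gb(70, 4), 1))
# Same model at FP16: roughly 160 GB, far beyond any single consumer card
print(round(vram_estimate_gb(70, 16), 1))
```

This is why 2× RTX 5090s (64 GB pooled) clear the ~40 GB bar that no single consumer card reaches.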
This guide covers everything: the hardware interconnects (NVLink, PCIe, InfiniBand), the physical builds (motherboards, frames, power, cooling), the software frameworks (vLLM, DeepSpeed, Exo, llama.cpp), and practical build tiers from $1,500 to $15,000+.
2. Hardware: How GPUs Talk to Each Other
The single most important factor in a multi-GPU setup is how the GPUs communicate. Different interconnects offer wildly different bandwidth, and bandwidth directly determines performance when splitting models across cards.
| Interconnect | Bandwidth | Latency | Use Case | Cost |
|---|---|---|---|---|
| NVLink 4.0 | 900 GB/s | Very low | Tensor parallelism, HPC | $80–$150 (bridge) |
| NVLink 3.0 | 600 GB/s | Very low | A100/RTX 3090 pairs | $80–$120 (bridge) |
| PCIe 5.0 x16 | 64 GB/s | Low | Standard multi-GPU | Included (motherboard) |
| PCIe 4.0 x16 | 32 GB/s | Low | Mining rigs, inference | Included (motherboard) |
| PCIe Riser (x1) | ~1–4 GB/s | Medium | Mining, independent tasks | $5–$15 per riser |
| InfiniBand HDR | 200 Gbps (25 GB/s) | 1–5 μs | Multi-node clusters | $200–$500 (used cards) |
| 100GbE Ethernet | 100 Gbps (12.5 GB/s) | 10–50 μs | Multi-node, budget | $50–$200 (used NICs) |
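To see why bandwidth matters, compare how long a fixed payload takes to cross each link using the table's numbers. This is a simplified sketch that ignores latency and protocol overhead:

```python
# Time to move a 2 GB tensor shard over each interconnect
# (bandwidths taken from the table above; latency ignored for simplicity)
links_gb_per_s = {
    "NVLink 4.0": 900,
    "PCIe 5.0 x16": 64,
    "PCIe 4.0 x16": 32,
    "InfiniBand HDR": 25,
    "PCIe 3.0 x1 riser": 1,
}

payload_gb = 2.0
for name, bw in links_gb_per_s.items():
    ms = payload_gb / bw * 1000
    print(f"{name:>18}: {ms:8.1f} ms")
```

The spread is three orders of magnitude: a transfer that takes ~2 ms over NVLink takes ~2 seconds over a x1 mining riser, which is exactly why risers are fine for independent workloads but hopeless for tensor parallelism.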
NVLink: The Gold Standard
NVLink is NVIDIA's proprietary high-speed GPU-to-GPU interconnect. It bypasses the PCIe bus entirely, giving GPUs a direct memory access path to each other's VRAM. With NVLink 4.0 on H100s, that's 900 GB/s of bidirectional bandwidth — roughly 7× the total bandwidth of a PCIe 5.0 x16 link [2].
For consumer GPUs, NVLink support is limited:
- RTX 3090 — supports NVLink 3.0 via a 2-slot bridge. The only RTX 30-series card with NVLink. Up to 112.5 GB/s.
- RTX 4090 — no NVLink support. NVIDIA removed it from the consumer 40-series.
- RTX 5090 — no NVLink support. Same story as the 4090.
- Professional GPUs — RTX 6000 Ada, RTX PRO 6000, A100, H100, H200, B200 all support NVLink.
Do you need NVLink? For tensor parallelism (splitting one model across GPUs), NVLink helps significantly — 30–50% better performance than PCIe on large models. For pipeline parallelism or independent workloads (different models on different GPUs), PCIe is perfectly fine [4].
PCIe: The Universal Standard
Every GPU plugs into a PCIe slot. Most multi-GPU setups rely on PCIe — it's what your motherboard provides natively. The key considerations:
- Lane count matters — A GPU in a x16 slot gets full bandwidth. A GPU in a x8 slot gets half. A GPU via a x1 riser gets 1/16th.
- PCIe generation matters — PCIe 4.0 x16 = 32 GB/s. PCIe 5.0 x16 = 64 GB/s. PCIe 3.0 x1 (typical mining riser) = ~1 GB/s.
- CPU lane count — Most consumer CPUs provide 16–24 PCIe lanes. Server CPUs (EPYC, Xeon) provide 64–128 lanes. More lanes = more GPUs at full speed.
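The lane math works out like this. Per-lane figures are approximate effective rates (after encoding overhead), and the 4-GPU lane split is a hypothetical example:

```python
# Approximate usable PCIe bandwidth per lane, per direction (GB/s)
LANE_GB_S = {3: 0.985, 4: 1.969, 5: 3.938}

def gpu_bandwidth(gen: int, lanes: int) -> float:
    """Per-GPU PCIe bandwidth for a given generation and lane count."""
    return LANE_GB_S[gen] * lanes

# 24 consumer-CPU lanes split across 4 GPUs -> roughly x4 each: ~7.9 GB/s
print(round(gpu_bandwidth(4, 4), 1))
# EPYC with 128 lanes -> 8 GPUs at full x16: ~31.5 GB/s each
print(round(gpu_bandwidth(4, 16), 1))
```

Same GPUs, same motherboard generation, 4× the interconnect bandwidth per card just from the CPU's lane budget.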
For AI inference, PCIe bandwidth is usually sufficient. LLM inference is memory-bandwidth bound (how fast the GPU reads its own VRAM), not interconnect-bound. The inter-GPU communication during tensor parallelism adds overhead, but on PCIe 4.0 x16 it's manageable for up to 4 GPUs [5].
PCIe Risers: More GPUs, Less Bandwidth
PCIe risers are adapter cables that connect a GPU to a PCIe x1 slot via a USB-style cable. They were the backbone of cryptocurrency mining rigs — allowing 6, 8, or even 13 GPUs on a single motherboard.
- Pro: Cheap ($5–$15), lets you mount GPUs away from the motherboard in open-air frames
- Con: Only x1 bandwidth (~1–4 GB/s depending on PCIe gen). Fine for mining and independent inference, terrible for tensor parallelism.
- Power risk: Cheap risers with SATA power have caused fires. Use only risers with 6-pin PCIe power connectors [6].
SLI: A Brief History (Deprecated)
Before NVLink, there was SLI (Scalable Link Interface) — NVIDIA's original multi-GPU technology for gaming. SLI split rendering frames across two (or more) GPUs. It was killed off because:
- Game developers had to explicitly support it
- Micro-stuttering and frame pacing issues plagued it
- Single-GPU performance improved fast enough to make it unnecessary
NVIDIA officially dropped SLI support after the RTX 30-series. The RTX 3090 was the last consumer card to support NVLink (SLI's successor). Today, multi-GPU is about compute, not gaming [2].
InfiniBand & High-Speed Ethernet: Multi-Node Setups
When you run out of PCIe slots on a single machine, you connect multiple machines together. This is where InfiniBand and high-speed Ethernet come in:
- InfiniBand — The gold standard for GPU clusters. Uses RDMA (Remote Direct Memory Access) for ultra-low latency. ConnectX-6 cards provide 200 Gbps. Used cards (ConnectX-5, 100 Gbps) can be found on eBay for $50–$200 [7].
- 100GbE Ethernet — Cheaper than InfiniBand, more familiar networking. With RoCE (RDMA over Converged Ethernet), you get near-InfiniBand performance. Mellanox ConnectX-4/5 cards with 100GbE are $50–$150 used.
- Standard 10GbE/25GbE — Works for distributed inference with Exo or Petals, but too slow for tensor parallelism across nodes.
3. Motherboards & Frames for Multi-GPU
Server Motherboards
Consumer motherboards typically have 2–3 PCIe x16 slots. For serious multi-GPU builds, you need server or workstation boards:
| Board | CPU Socket | PCIe x16 Slots | Max GPUs | Price |
|---|---|---|---|---|
| ASUS Pro WS WRX90E-SAGE SE | sTR5 (Threadripper) | 7× PCIe 5.0 | 7 | ~$1,100 |
| Supermicro H13SSL-N | SP5 (EPYC) | 6× PCIe 5.0 | 6 | ~$800 |
| ASRock Rack ROMED8-2T | SP3 (EPYC 7003) | 7× PCIe 4.0 | 7 | ~$500 (used) |
| Gigabyte MC62-G40 | SP3 (EPYC 7003) | 6× PCIe 4.0 | 6 | ~$400 (used) |
| ASUS B250 Mining Expert | LGA 1151 | 19× PCIe (x1 risers) | 19 | ~$200 (used) |
Mining Motherboards (Budget Multi-GPU)
For workloads that don't need full PCIe bandwidth (mining, independent inference tasks, or pipeline parallelism), mining motherboards with tons of x1 slots + risers are an economical choice. Boards like the ASUS B250 Mining Expert support up to 19 GPUs via risers. The BTC-37 and similar Chinese boards are even cheaper (~$80) with 8 GPU support [6].
Open-Air Mining Frames
You can't shove 8 GPUs into a regular PC case. Open-air frames solve this with an aluminum rack design that holds GPUs vertically with plenty of airflow:
- 6-GPU frames — $50–$100. The most common size. Fits most mining boards.
- 8-GPU frames — $80–$150. Extended version for larger boards.
- 12-GPU frames — $100–$200. For massive single-node builds or dual-PSU setups.
- Custom rack-mount — 4U server chassis from Supermicro or Rosewill can hold 4–8 GPUs in a rack-mountable form factor. $200–$500.
Advantages of open-air frames: excellent airflow (GPUs run 10–20°C cooler than in enclosed cases), easy access for maintenance, and no compatibility issues with GPU length. The downside: dust accumulation and noise — these rigs are not quiet [6].
4. Power & Cooling Considerations
Power Requirements
GPUs are power-hungry. Here's the math for common setups:
| Setup | GPU Power | System Total | Recommended PSU | Electrical Circuit |
|---|---|---|---|---|
| 2× RTX 4090 | 900W | ~1,100W | 1,200W+ ATX | 1× 20A circuit |
| 4× RTX 3090 | 1,400W | ~1,600W | 2× 850W or 1× 1,600W | 1× 20A circuit |
| 6× RTX 3090 | 2,100W | ~2,400W | 2× 1,200W or server PSU | 2× 15A circuits |
| 8× RTX 5090 | 4,600W | ~5,000W | Multiple server PSUs | 2× 30A circuits |
Power Supply Options
- ATX PSUs (up to 1,600W) — Corsair HX1500i, be quiet! Dark Power Pro 1500W. Good for 2–4 GPU builds. Use quality 80+ Gold or better units.
- Server PSUs (750W–2,400W) — HP 1200W server PSUs are the mining community's secret weapon. $20–$40 used on eBay, incredibly efficient (80+ Platinum), with breakout boards for PCIe connectors. Stack 2–4 of them.
- Dual PSU setups — Use an Add2PSU adapter or a jumper wire to start a second PSU from the first. Split GPUs between them.
Power Limiting: The Secret Weapon
You don't have to run GPUs at full TDP. Power-limiting by 10–20% reduces performance by less than 5% while cutting heat and power draw significantly. Four RTX 3090s power-limited from 350W to 280W each saves 280W total — the difference between needing one circuit or two [9].
# Set power limit on Linux (requires root)
nvidia-smi -i 0 -pl 280 # GPU 0: limit to 280W
nvidia-smi -i 1 -pl 280 # GPU 1: limit to 280W
# Verify
nvidia-smi --query-gpu=power.limit --format=csv
Cooling Solutions
- Open-air (best for most) — GPUs in open frames with ambient airflow. Add 120mm or 140mm fans at the bottom blowing up through the cards. Maintains 65–75°C under load.
- Blower-style GPUs — Exhaust heat out the back of the card. Better for enclosed server chassis with front-to-back airflow.
- Water cooling — For dense setups where airflow is restricted. Expensive ($100–$200 per GPU block) but keeps temps under 55°C. The EKWB and Alphacool ecosystems support many GPU models.
- Spot cooling — For VRAM hotspots, add thermal pads and heatsinks to the back of the PCB. GDDR6X on RTX 3090/4090 runs hot (90°C+) without backplate cooling.
Monitor temperatures regularly (nvidia-smi -q -d TEMPERATURE). Thermal throttling starts at 83°C on most NVIDIA cards, drastically reducing performance.
5. Software: Making Your GPUs Work Together
Hardware is only half the battle. You need software that knows how to split workloads across multiple GPUs. Here are the major frameworks:
vLLM — High-Throughput Multi-GPU Inference
vLLM is the leading open-source framework for serving large language models. It supports tensor parallelism — splitting a single model's layers across GPUs so each GPU processes part of every token simultaneously [10].
# Serve a 70B model across 2 GPUs with tensor parallelism
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.9
vLLM's PagedAttention algorithm manages GPU memory like an OS manages RAM — with virtual pages that eliminate memory waste. This means you can serve more concurrent requests per GPU than with naive approaches [10].
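The paging idea can be illustrated with a toy allocator. This is a simplified sketch of the concept only, not vLLM's actual implementation; the PagedKVCache class and its methods are hypothetical:

```python
# Toy sketch of the PagedAttention memory idea: the KV cache is carved into
# fixed-size blocks, and each sequence holds a block table mapping logical
# positions to physical blocks, so memory grows on demand instead of being
# reserved for the worst case.
BLOCK_TOKENS = 16  # tokens per KV block (vLLM's default block size)

class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))  # pool of physical block ids
        self.tables = {}                     # seq_id -> list of block ids
        self.lengths = {}                    # seq_id -> tokens stored

    def append_token(self, seq_id: int) -> None:
        n = self.lengths.get(seq_id, 0)
        if n % BLOCK_TOKENS == 0:  # last block full (or sequence is new)
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id: int) -> None:
        # freed blocks become immediately reusable by other sequences
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=8)
for _ in range(20):              # a 20-token sequence...
    cache.append_token(seq_id=0)
print(len(cache.tables[0]))      # ...occupies only ceil(20/16) = 2 blocks
```

Because no sequence reserves contiguous worst-case memory, the freed blocks of a finished request can be handed to new requests immediately, which is what lets vLLM pack more concurrent sequences into the same VRAM.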
llama.cpp / Ollama — Simple Multi-GPU Inference
llama.cpp uses pipeline parallelism — assigning different layers to different GPUs. It's simpler to set up and works with heterogeneous GPUs (mix RTX 3090 + RTX 4090), but each GPU is idle while waiting for the other to finish its layers [11].
# llama.cpp: split 80 layers across 2 GPUs
./llama-server -m model.gguf \
--n-gpu-layers 80 \
--tensor-split 0.5,0.5
# Ollama: automatically uses all visible GPUs
CUDA_VISIBLE_DEVICES=0,1 ollama run llama3.1:70b
Recent breakthroughs in llama.cpp (January 2026) have improved multi-GPU performance significantly, with automatic tensor split optimization and better NUMA-aware scheduling [11].
DeepSpeed — Distributed Training at Scale
Microsoft's DeepSpeed is the go-to library for training large models across multiple GPUs. Its ZeRO (Zero Redundancy Optimizer) stages progressively partition model states across GPUs [12]:
- ZeRO Stage 1 — Partition optimizer states. ~4× memory reduction.
- ZeRO Stage 2 — + partition gradients. ~8× reduction.
- ZeRO Stage 3 — + partition model parameters. Enables training models larger than any single GPU's memory.
- ZeRO-Infinity — Offload to CPU/NVMe. Train trillion-parameter models.
# DeepSpeed training with 4 GPUs
deepspeed --num_gpus=4 train.py \
--deepspeed ds_config.json \
--model_name_or_path meta-llama/Llama-3.1-8B
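The launch command above points at ds_config.json. A minimal ZeRO Stage 2 config could look like the following sketch; the field names follow DeepSpeed's documented config schema, but the batch sizes and flags are illustrative assumptions, not tuned values:

```python
import json

# Minimal DeepSpeed ZeRO Stage 2 config (a sketch; values are illustrative)
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 4,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                    # partition optimizer states + gradients
        "overlap_comm": True,          # overlap communication with backward pass
        "contiguous_gradients": True,  # reduce memory fragmentation
    },
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```

Bumping "stage" to 3 additionally partitions the model parameters themselves, at the cost of more inter-GPU communication.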
PyTorch Distributed — Native Multi-GPU
PyTorch provides built-in multi-GPU support with two main approaches [13]:
- DataParallel (DP) — Easy but slow. Replicates model on all GPUs, splits data batches. One GPU becomes a bottleneck for gradient reduction.
- DistributedDataParallel (DDP) — Production-grade. Each GPU runs its own process. Uses NCCL for efficient all-reduce operations. Near-linear scaling to 8+ GPUs.
- FSDP (Fully Sharded Data Parallel) — Like DeepSpeed ZeRO Stage 3 but native to PyTorch. Shards model parameters across GPUs.
# PyTorch DDP launch with 4 GPUs
torchrun --nproc_per_node=4 train.py
Exo — Distributed Inference Across Heterogeneous Devices
Exo is fascinating — it lets you pool compute across completely different devices over a network. Connect a Mac Studio, a PC with 2× RTX 4090s, and a laptop with an RTX 3060 into one inference cluster. Exo handles splitting the model and routing tokens between devices [14].
# Node 1 (Mac Studio)
exo run llama-3.1-70b --node
# Node 2 (PC with GPUs)
exo run llama-3.1-70b --node
# They auto-discover each other and pool VRAM
Petals — Collaborative Distributed Inference
Petals takes distributed inference to the internet scale. Multiple people contribute their GPUs to serve parts of a large model. Think BitTorrent, but for LLM inference. You contribute GPU time and can run models that no single person could afford [15].
Other Notable Frameworks
- Hugging Face TGI (Text Generation Inference) — Docker-based, supports tensor parallelism.
docker run --gpus all -e NUM_SHARD=2 ghcr.io/huggingface/text-generation-inference
- Hugging Face Accelerate — Simple wrapper for multi-GPU training. Auto-detects and distributes across available GPUs.
- ExLlamaV2 — Optimized multi-GPU inference with tensor parallelism. Excellent for quantized models.
GPU Selection & CUDA Basics
# See all GPUs
nvidia-smi
# Select specific GPUs for a process
CUDA_VISIBLE_DEVICES=0,2 python my_script.py # Uses GPU 0 and GPU 2
# Monitor GPU usage in real-time
watch -n 1 nvidia-smi
6. Use Cases: Why People Build Multi-GPU Rigs
🤖 Local AI Inference
The #1 reason people build multi-GPU rigs today. A 70B parameter model at Q4 quantization needs ~40 GB of VRAM. Two RTX 5090s (64 GB total) can run it at 27+ tokens/second. Four RTX 3090s (96 GB total) can run even larger models. This is how you run GPT-4-class models privately at home.
🧠 AI Model Training & Fine-Tuning
Fine-tuning a 7B model on a single RTX 4090 takes hours. Four GPUs with DDP or DeepSpeed cut that to roughly a quarter. Full pre-training of models like LLaMA requires hundreds of GPUs — but even fine-tuning and LoRA training benefit enormously from 2–8 GPUs with DeepSpeed ZeRO.
🎨 3D Rendering
Blender's Cycles renderer scales nearly linearly with GPU count. OctaneRender and Redshift are also multi-GPU native. Adding a second GPU literally halves render times. Studios routinely use 4–8 GPU workstations for production rendering [16].
⛏️ Cryptocurrency Mining
The original multi-GPU use case. While Ethereum moved to Proof of Stake in 2022, other coins (Ravencoin, Ergo, Kaspa) remain GPU-minable. Many homelab GPU farms were originally built for mining and later repurposed for AI. The hardware is identical — open-air frames, server PSUs, risers, cooling [6].
🔬 Scientific Computing
Molecular dynamics (GROMACS, AMBER), climate modeling, computational fluid dynamics — these workloads scale across GPUs using CUDA and MPI. University research labs run 4–8 GPU workstations for simulations that would take weeks on CPUs.
🎥 Video Encoding
NVIDIA's NVENC hardware encoder is per-GPU, so two GPUs give you two simultaneous encode streams. Professional video workflows use multi-GPU for real-time 4K/8K encoding. FFmpeg can target a specific GPU's encoder with the -gpu option.
7. Build Tiers: Starter to Beast Mode
🟢 Tier 1: Starter (2 GPUs) — $1,500–$3,000
Runs: 70B models at Q4, all 7–32B models at full speed, 3D rendering at 2× single-GPU
- GPUs: 2× RTX 3090 24GB ($700 each used) or 2× RTX 5090 32GB ($2,000 each)
- Motherboard: Any ATX board with 2× PCIe x16 slots (e.g., ASUS TUF B650, $150)
- CPU: AMD Ryzen 7 7700X or Intel i5-13600K ($200–$300)
- RAM: 64GB DDR5 ($120)
- PSU: 1,200W ATX (Corsair RM1200x, $180)
- Case: Full tower with good airflow (Fractal Meshify 2 XL, $180) or open-air frame ($50)
- Total: ~$2,000–$5,000 depending on GPU choice
This is the sweet spot for most people. Two RTX 3090s give you 48 GB VRAM for $1,400 in GPUs — enough to run Llama 3.1 70B at Q4. Software: Ollama or vLLM with tensor-parallel-size 2.
🔵 Tier 2: Enthusiast (4–8 GPUs) — $5,000–$15,000
Runs: 405B models quantized, multi-user AI serving, large-scale rendering, distributed training
- GPUs: 4–6× RTX 3090 ($700 each) or 4× RTX 4090 ($1,800 each)
- Motherboard: ASRock Rack ROMED8-2T or Gigabyte MC62-G40 ($400–$500 used)
- CPU: AMD EPYC 7313 or 7443 ($200–$400 used)
- RAM: 128–256GB ECC DDR4 ($200–$400 used)
- PSU: 2× HP 1200W server PSUs with breakout boards ($80 total) or 1× EVGA 1600W
- Frame: Open-air 8-GPU mining frame ($80–$120)
- Cooling: 4× 140mm fans at base ($50)
- Total: ~$4,000–$15,000
This is where things get serious. Six RTX 3090s = 144 GB VRAM for ~$4,200 in GPUs. You can run a fully unquantized 70B model or a quantized 405B model. Use EPYC CPUs for the PCIe lanes (128 lanes = 8 GPUs at x16 each) [8].
🟣 Tier 3: Homelab Beast (8+ GPUs) — $15,000+
Runs: DeepSeek V3 671B, full-precision 70B+ models, production AI serving, multi-node training
- Node 1: 8× RTX 3090 on ASUS WRX90E-SAGE + Threadripper PRO 7965WX
- Node 2: 4× RTX 5090 on Supermicro H13SSL-N + EPYC 9354
- Networking: Mellanox ConnectX-5 100GbE cards ($100 each used) + direct cable
- Power: Multiple server PSUs, dedicated 30A circuits
- Software: Exo for distributed inference, DeepSpeed for training
- Rack: 42U server rack ($200–$500)
- Total: $15,000–$50,000+
This is @TheAhmadOsman territory. At this scale, you're managing a mini data center — with power distribution, cooling infrastructure, networking, and monitoring. But you have compute power that rivals a small cloud provider.
8. How People Do It on X
The homelab GPU community is thriving on X/Twitter. Here are the key voices and trends:
- @TheAhmadOsman — Runs 33 GPUs at home. His blog post on why vLLM beats llama.cpp for multi-GPU is required reading. He benchmarks tensor parallelism vs. pipeline parallelism and shows 25%+ performance gains with vLLM/ExLlamaV2 [1].
- r/LocalLLaMA — The subreddit is filled with multi-GPU build logs, benchmarks, and troubleshooting. Common advice: "Get the most VRAM per dollar. Used RTX 3090s are the sweet spot" [11].
- r/homelab — Server rack builds with GPU nodes. People share power consumption data, cooling solutions, and rack layouts.
- Tenstorrent community — Ahmad's 4× Blackhole p150a cards represent the emerging open-source AI accelerator ecosystem. Tenstorrent's chips are designed specifically for inference, potentially offering better performance-per-watt than NVIDIA for certain workloads.
9. Common Mistakes & Tips
- Ignoring PCIe lane counts — Putting 4 GPUs on a consumer CPU with 24 PCIe lanes means each GPU gets x4 instead of x16. Performance drops 10–30% for bandwidth-sensitive workloads.
- Using SATA-powered risers — Fire hazard. Always use 6-pin PCIe powered risers.
- Undersized PSU — GPUs have massive transient power spikes. A 4× RTX 3090 build with a 1,200W PSU will crash under load. Add 20% headroom minimum.
- Poor cooling = thermal throttling — GPUs packed together with 1-slot spacing will throttle. Use 2-slot spacing or water cooling.
- Wrong software choice — Using llama.cpp pipeline parallelism when vLLM tensor parallelism is available wastes half your GPU compute time.
- Forgetting about RAM — Multi-GPU AI inference still needs lots of system RAM for KV cache overflow and CPU preprocessing. Budget 32–64 GB minimum, 128 GB for large models.
- Not power-limiting — Running all GPUs at stock TDP when a 10% power-limit reduction can save ~15% power at a ~3% performance cost.
- Mixing GPU generations without proper software — vLLM tensor parallelism requires identical GPUs. llama.cpp and Exo handle heterogeneous GPUs.
10. Getting Started Step-by-Step
Step 1: Define Your Goal
What do you want to run? A 70B model needs ~40 GB VRAM. Training needs more than inference. Rendering scales linearly. Start with your target workload and work backwards to GPU count.
Step 2: Choose Your GPUs
For most builders: used RTX 3090s ($700 each, 24 GB VRAM, NVLink support). If budget allows: RTX 5090 ($2,000, 32 GB VRAM, best single-card performance). For maximum VRAM: RTX PRO 6000 ($6,800, 96 GB, one card to rule them all).
Step 3: Pick Your Platform
- 2 GPUs: Any ATX motherboard with 2× PCIe x16. Consumer CPU is fine.
- 4 GPUs: Workstation board (WRX90) or server board (EPYC). Need 64+ PCIe lanes.
- 6+ GPUs: Server board mandatory. Open-air frame. Server PSUs. Dedicated circuits.
Step 4: Power Planning
Calculate total wattage: (GPU count × TDP) + 200W for system. Add 20% headroom. Verify your electrical circuits can handle it.
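That formula is easy to turn into a quick calculator. A minimal sketch, using the 200 W system allowance and 20% headroom from the rule of thumb above:

```python
def psu_watts(gpu_count: int, tdp_watts: int, system_watts: int = 200,
              headroom: float = 1.2) -> int:
    """Recommended PSU capacity: (GPUs x TDP) + system draw, plus 20% headroom."""
    return round((gpu_count * tdp_watts + system_watts) * headroom)

print(psu_watts(4, 350))  # 4x RTX 3090 at stock TDP: 1920 W -> dual PSUs
print(psu_watts(4, 280))  # same cards power-limited: 1584 W -> one 1600 W unit
```

Note how power-limiting the cards (Section 4) drops the same build from dual-PSU territory to a single 1,600 W unit.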
Step 5: Build & Configure
- Assemble hardware (mount GPUs, connect power, risers if needed)
- Install Linux (Ubuntu 22.04 or 24.04 recommended)
- Install NVIDIA drivers: sudo apt install nvidia-driver-560
- Verify GPUs: nvidia-smi (all cards should appear)
- Install frameworks: pip install vllm or pip install deepspeed
- Run your first multi-GPU workload
Step 6: Optimize
- Power-limit GPUs with nvidia-smi -pl
- Monitor temperatures and adjust cooling
- Benchmark with your actual workloads
- Set up monitoring (Grafana + Prometheus + nvidia_gpu_exporter)
11. Pros & Cons of Multi-GPU Builds
| ✅ Pros | ❌ Cons |
|---|---|
| Pool VRAM to run larger models | High upfront cost ($1,500–$50,000+) |
| Near-linear scaling for many workloads | Significant power consumption & electricity bills |
| Complete data privacy — nothing leaves your building | Noise and heat — not apartment-friendly |
| No ongoing cloud costs after hardware purchase | Requires Linux knowledge and debugging skills |
| Full control over hardware and software stack | Hardware maintenance and potential failures |
| Can be repurposed (AI → rendering → mining) | GPUs depreciate; new generations arrive yearly |
| Breaks even vs. cloud in months of heavy use | Electrical infrastructure may need upgrading |
References
- Ahmad Osman, "Stop Wasting Your Multi-GPU Setup With llama.cpp: Use vLLM or ExLlamaV2," ahmadosman.com, February 2025.
- IntuitionLabs, "NVIDIA NVLink Explained: A Guide to the GPU Interconnect," October 2025.
- Chaos Group, "NVLink FAQ" — NVLink only bridges identical cards.
- SabrePC, "Do You Really Need NVLink for Multi-GPU Setups?"
- dasroot.net, "Multi-GPU Setups for LLM Development: When and How," February 2026.
- Einstein@Home Community, "Troubleshooting Multiple GPU Setups Using Riser Cards"
- RunPod, "Do I Need InfiniBand for Distributed AI Training?"
- Hardware Corner, "Building a Multi-GPU LLM Workstation: Choosing the Right Motherboard," November 2025.
- Towards Data Science, "How to Build a Multi-GPU System for Deep Learning," January 2025.
- vLLM Blog, "Distributed Inference with vLLM," February 2025.
- László Jagusztin, "llama.cpp Performance Breakthrough for Multi-GPU Setups," Medium, January 2026.
- Microsoft, "DeepSpeed: Deep Learning Optimization Library," GitHub.
- DigitalOcean, "Splitting LLMs Across Multiple GPUs: Techniques, Tools, and Best Practices," April 2025.
- Exo, "Distributed AI Inference Across Heterogeneous Devices," GitHub.
- Petals, "Collaborative Distributed LLM Inference," GitHub.
- RunPod, "The Complete Guide to Multi-GPU Training"