1. Why Multiple GPUs? The AI Compute Arms Race
We're in the middle of a GPU arms race. As AI models grow larger — 70 billion, 405 billion, even 671 billion parameters — a single GPU simply can't hold the entire model in memory. The solution? Connect multiple GPUs together and split the workload across them.
This isn't just for data centers anymore. People are building multi-GPU rigs at home. @TheAhmadOsman runs 33 GPUs at home — 21× RTX 3090s, 4× RTX 4090s, 4× RTX 5090s, and 4× Tenstorrent Blackhole p150a accelerators. He's not alone: the r/LocalLLaMA and r/homelab communities are filled with people building GPU farms for AI inference, model training, 3D rendering, and scientific computing [1].
The motivations are clear:
- VRAM pooling — A 70B parameter model needs ~40 GB of VRAM at Q4 quantization. No single consumer GPU has that, but 2× RTX 5090s (64 GB total) do.
- Throughput — Tensor parallelism across GPUs delivers near-linear speedup, so two GPUs can serve nearly twice the requests per second.
- Cost — Cloud GPU rental at $2.49/hr for an H100 adds up fast. A home rig pays for itself within months of heavy use.
- Privacy — Your data never leaves your building.
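That VRAM figure can be sketched with a back-of-the-envelope formula. This is a rough estimate only; the 15% overhead factor is an assumption covering KV cache, activations, and framework buffers, and real usage varies by engine and context length:

```python
def vram_estimate_gb(params_billions: float, bits_per_weight: float,
                     overhead: float = 1.15) -> float:
    """Rough VRAM needed for model weights, plus ~15% overhead
    (an assumed fudge factor for KV cache and framework buffers)."""
    bytes_per_weight = bits_per_weight / 8
    return params_billions * bytes_per_weight * overhead

# 70B model at Q4 (4-bit) quantization: roughly 40 GB
print(round(vram_estimate_gb(70, 4), 1))
# Same model at FP16: roughly 160 GB, far beyond any single consumer card
print(round(vram_estimate_gb(70, 16), 1))
```

This is why 2× RTX 5090s (64 GB pooled) clear the ~40 GB bar that no single consumer card reaches.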
This guide covers everything: the hardware interconnects (NVLink, PCIe, InfiniBand), the physical builds (motherboards, frames, power, cooling), the software frameworks (vLLM, DeepSpeed, Exo, llama.cpp), and practical build tiers from $1,500 to $15,000+.
2. Hardware: How GPUs Talk to Each Other
The single most important factor in a multi-GPU setup is how the GPUs communicate. Different interconnects offer wildly different bandwidth, and bandwidth directly determines performance when splitting models across cards.
| Interconnect | Bandwidth | Latency | Use Case | Cost |
|---|---|---|---|---|
| NVLink 4.0 | 900 GB/s | Very low | Tensor parallelism, HPC | $80–$150 (bridge) |
| NVLink 3.0 | 600 GB/s | Very low | A100/RTX 3090 pairs | $80–$120 (bridge) |
| PCIe 5.0 x16 | 64 GB/s | Low | Standard multi-GPU | Included (motherboard) |
| PCIe 4.0 x16 | 32 GB/s | Low | Mining rigs, inference | Included (motherboard) |
| PCIe Riser (x1) | ~1–4 GB/s | Medium | Mining, independent tasks | $5–$15 per riser |
| InfiniBand HDR | 200 Gbps (25 GB/s) | 1–5 μs | Multi-node clusters | $200–$500 (used cards) |
| 100GbE Ethernet | 100 Gbps (12.5 GB/s) | 10–50 μs | Multi-node, budget | $50–$200 (used NICs) |
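To see why bandwidth matters, compare how long a fixed payload takes to cross each link using the table's numbers. This is a simplified sketch that ignores latency and protocol overhead:

```python
# Time to move a 2 GB tensor shard over each interconnect
# (bandwidths taken from the table above; latency ignored for simplicity)
links_gb_per_s = {
    "NVLink 4.0": 900,
    "PCIe 5.0 x16": 64,
    "PCIe 4.0 x16": 32,
    "InfiniBand HDR": 25,
    "PCIe 3.0 x1 riser": 1,
}

payload_gb = 2.0
for name, bw in links_gb_per_s.items():
    ms = payload_gb / bw * 1000
    print(f"{name:>18}: {ms:8.1f} ms")
```

The spread is three orders of magnitude: a transfer that takes ~2 ms over NVLink takes ~2 seconds over a x1 mining riser, which is exactly why risers are fine for independent workloads but hopeless for tensor parallelism.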
NVLink: The Gold Standard
NVLink is NVIDIA's proprietary high-speed GPU-to-GPU interconnect. It bypasses the PCIe bus entirely, giving GPUs a direct memory access path to each other's VRAM. With NVLink 4.0 on H100s, that's 900 GB/s of bidirectional bandwidth — roughly 7× the total bandwidth of a PCIe 5.0 x16 link [2].
For consumer GPUs, NVLink support is limited:
- RTX 3090 — supports NVLink 3.0 via a 2-slot bridge. The only RTX 30-series card with NVLink. Up to 112.5 GB/s.
- RTX 4090 — no NVLink support. NVIDIA removed it from the consumer 40-series.
- RTX 5090 — no NVLink support. Same story as the 4090.
- Professional GPUs — RTX 6000 Ada, RTX PRO 6000, A100, H100, H200, B200 all support NVLink.
Do you need NVLink? For tensor parallelism (splitting one model across GPUs), NVLink helps significantly — 30–50% better performance than PCIe on large models. For pipeline parallelism or independent workloads (different models on different GPUs), PCIe is perfectly fine [4].
PCIe: The Universal Standard
Every GPU plugs into a PCIe slot. Most multi-GPU setups rely on PCIe — it's what your motherboard provides natively. The key considerations:
- Lane count matters — A GPU in a x16 slot gets full bandwidth. A GPU in a x8 slot gets half. A GPU via a x1 riser gets 1/16th.
- PCIe generation matters — PCIe 4.0 x16 = 32 GB/s. PCIe 5.0 x16 = 64 GB/s. PCIe 3.0 x1 (typical mining riser) = ~1 GB/s.
- CPU lane count — Most consumer CPUs provide 16–24 PCIe lanes. Server CPUs (EPYC, Xeon) provide 64–128 lanes. More lanes = more GPUs at full speed.
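The lane math works out like this. Per-lane figures are approximate effective rates (after encoding overhead), and the 4-GPU lane split is a hypothetical example:

```python
# Approximate usable PCIe bandwidth per lane, per direction (GB/s)
LANE_GB_S = {3: 0.985, 4: 1.969, 5: 3.938}

def gpu_bandwidth(gen: int, lanes: int) -> float:
    """Per-GPU PCIe bandwidth for a given generation and lane count."""
    return LANE_GB_S[gen] * lanes

# 24 consumer-CPU lanes split across 4 GPUs -> roughly x4 each: ~7.9 GB/s
print(round(gpu_bandwidth(4, 4), 1))
# EPYC with 128 lanes -> 8 GPUs at full x16: ~31.5 GB/s each
print(round(gpu_bandwidth(4, 16), 1))
```

Same GPUs, same motherboard generation, 4× the interconnect bandwidth per card just from the CPU's lane budget.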
For AI inference, PCIe bandwidth is usually sufficient. LLM inference is memory-bandwidth bound (how fast the GPU reads its own VRAM), not interconnect-bound. The inter-GPU communication during tensor parallelism adds overhead, but on PCIe 4.0 x16 it's manageable for up to 4 GPUs [5].
PCIe Risers: More GPUs, Less Bandwidth
PCIe risers are adapter cables that connect a GPU to a PCIe x1 slot via a USB-style cable. They were the backbone of cryptocurrency mining rigs — allowing 6, 8, or even 13 GPUs on a single motherboard.
- Pro: Cheap ($5–$15), lets you mount GPUs away from the motherboard in open-air frames
- Con: Only x1 bandwidth (~1–4 GB/s depending on PCIe gen). Fine for mining and independent inference, terrible for tensor parallelism.
- Power risk: Cheap risers with SATA power have caused fires. Use only risers with 6-pin PCIe power connectors [6].
SLI: A Brief History (Deprecated)
Before NVLink, there was SLI (Scalable Link Interface) — NVIDIA's original multi-GPU technology for gaming. SLI split rendering frames across two (or more) GPUs. It was killed off because:
- Game developers had to explicitly support it
- Micro-stuttering and frame pacing issues plagued it
- Single-GPU performance improved fast enough to make it unnecessary
NVIDIA officially dropped SLI support after the RTX 30-series. The RTX 3090 was the last consumer card to support NVLink (SLI's successor). Today, multi-GPU is about compute, not gaming [2].
InfiniBand & High-Speed Ethernet: Multi-Node Setups
When you run out of PCIe slots on a single machine, you connect multiple machines together. This is where InfiniBand and high-speed Ethernet come in:
- InfiniBand — The gold standard for GPU clusters. Uses RDMA (Remote Direct Memory Access) for ultra-low latency. ConnectX-6 cards provide 200 Gbps. Used cards (ConnectX-5, 100 Gbps) can be found on eBay for $50–$200 [7].
- 100GbE Ethernet — Cheaper than InfiniBand, more familiar networking. With RoCE (RDMA over Converged Ethernet), you get near-InfiniBand performance. Mellanox ConnectX-4/5 cards with 100GbE are $50–$150 used.
- Standard 10GbE/25GbE — Works for distributed inference with Exo or Petals, but too slow for tensor parallelism across nodes.
3. Motherboards & Frames for Multi-GPU
Server Motherboards
Consumer motherboards typically have 2–3 PCIe x16 slots. For serious multi-GPU builds, you need server or workstation boards:
| Board | CPU Socket | PCIe x16 Slots | Max GPUs | Price |
|---|---|---|---|---|
| ASUS Pro WS WRX90E-SAGE SE | sTR5 (Threadripper) | 7× PCIe 5.0 | 7 | ~$1,100 |
| Supermicro H13SSL-N | SP5 (EPYC) | 6× PCIe 5.0 | 6 | ~$800 |
| ASRock Rack ROMED8-2T | SP3 (EPYC 7003) | 7× PCIe 4.0 | 7 | ~$500 (used) |
| Gigabyte MC62-G40 | SP3 (EPYC 7003) | 6× PCIe 4.0 | 6 | ~$400 (used) |
| ASUS B250 Mining Expert | LGA 1151 | 19× PCIe (x1 risers) | 19 | ~$200 (used) |
Mining Motherboards (Budget Multi-GPU)
For workloads that don't need full PCIe bandwidth (mining, independent inference tasks, or pipeline parallelism), mining motherboards with tons of x1 slots + risers are an economical choice. Boards like the ASUS B250 Mining Expert support up to 19 GPUs via risers. The BTC-37 and similar Chinese boards are even cheaper (~$80) with 8 GPU support [6].
Open-Air Mining Frames
You can't shove 8 GPUs into a regular PC case. Open-air frames solve this with an aluminum rack design that holds GPUs vertically with plenty of airflow:
- 6-GPU frames — $50–$100. The most common size. Fits most mining boards.
- 8-GPU frames — $80–$150. Extended version for larger boards.
- 12-GPU frames — $100–$200. For massive single-node builds or dual-PSU setups.
- Custom rack-mount — 4U server chassis from Supermicro or Rosewill can hold 4–8 GPUs in a rack-mountable form factor. $200–$500.
Advantages of open-air frames: excellent airflow (GPUs run 10–20°C cooler than in enclosed cases), easy access for maintenance, and no compatibility issues with GPU length. The downside: dust accumulation and noise — these rigs are not quiet [6].
4. Power & Cooling Considerations
Power Requirements
GPUs are power-hungry. Here's the math for common setups:
| Setup | GPU Power | System Total | Recommended PSU | Electrical Circuit |
|---|---|---|---|---|
| 2× RTX 4090 | 900W | ~1,100W | 1,200W+ ATX | 1× 20A circuit |
| 4× RTX 3090 | 1,400W | ~1,600W | 2× 850W or 1× 1,600W | 1× 20A circuit |
| 6× RTX 3090 | 2,100W | ~2,400W | 2× 1,200W or server PSU | 2× 15A circuits |
| 8× RTX 5090 | 4,600W | ~5,000W | Multiple server PSUs | 2× 30A circuits |
Power Supply Options
- ATX PSUs (up to 1,600W) — Corsair HX1500i, be quiet! Dark Power Pro 1500W. Good for 2–4 GPU builds. Use quality 80+ Gold or better units.
- Server PSUs (750W–2,400W) — HP 1200W server PSUs are the mining community's secret weapon. $20–$40 used on eBay, incredibly efficient (80+ Platinum), with breakout boards for PCIe connectors. Stack 2–4 of them.
- Dual PSU setups — Use an Add2PSU adapter or a jumper wire to start a second PSU from the first. Split GPUs between them.
Power Limiting: The Secret Weapon
You don't have to run GPUs at full TDP. Power-limiting by 10–20% reduces performance by less than 5% while cutting heat and power draw significantly. Four RTX 3090s power-limited from 350W to 280W each saves 280W total — the difference between needing one circuit or two [9].
# Set power limit on Linux (requires root)
nvidia-smi -i 0 -pl 280 # GPU 0: limit to 280W
nvidia-smi -i 1 -pl 280 # GPU 1: limit to 280W
# Verify
nvidia-smi --query-gpu=power.limit --format=csv
Cooling Solutions
- Open-air (best for most) — GPUs in open frames with ambient airflow. Add 120mm or 140mm fans at the bottom blowing up through the cards. Maintains 65–75°C under load.
- Blower-style GPUs — Exhaust heat out the back of the card. Better for enclosed server chassis with front-to-back airflow.
- Water cooling — For dense setups where airflow is restricted. Expensive ($100–$200 per GPU block) but keeps temps under 55°C. The EKWB and Alphacool ecosystems support many GPU models.
- Spot cooling — For VRAM hotspots, add thermal pads and heatsinks to the back of the PCB. GDDR6X on RTX 3090/4090 runs hot (90°C+) without backplate cooling.
Monitor temperatures regularly (nvidia-smi -q -d TEMPERATURE). Thermal throttling starts at 83°C on most NVIDIA cards, drastically reducing performance.
5. Software: Making Your GPUs Work Together
Hardware is only half the battle. You need software that knows how to split workloads across multiple GPUs. Here are the major frameworks:
vLLM — High-Throughput Multi-GPU Inference
vLLM is the leading open-source framework for serving large language models. It supports tensor parallelism — splitting a single model's layers across GPUs so each GPU processes part of every token simultaneously [10].
# Serve a 70B model across 2 GPUs with tensor parallelism
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.9
vLLM's PagedAttention algorithm manages GPU memory like an OS manages RAM — with virtual pages that eliminate memory waste. This means you can serve more concurrent requests per GPU than with naive approaches [10].
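The paging idea can be illustrated with a toy allocator. This is a simplified sketch of the concept only, not vLLM's actual implementation; the PagedKVCache class and its methods are hypothetical:

```python
# Toy sketch of the PagedAttention memory idea: the KV cache is carved into
# fixed-size blocks, and each sequence holds a block table mapping logical
# positions to physical blocks, so memory grows on demand instead of being
# reserved for the worst case.
BLOCK_TOKENS = 16  # tokens per KV block (vLLM's default block size)

class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))  # pool of physical block ids
        self.tables = {}                     # seq_id -> list of block ids
        self.lengths = {}                    # seq_id -> tokens stored

    def append_token(self, seq_id: int) -> None:
        n = self.lengths.get(seq_id, 0)
        if n % BLOCK_TOKENS == 0:  # last block full (or sequence is new)
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id: int) -> None:
        # freed blocks become immediately reusable by other sequences
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=8)
for _ in range(20):              # a 20-token sequence...
    cache.append_token(seq_id=0)
print(len(cache.tables[0]))      # ...occupies only ceil(20/16) = 2 blocks
```

Because no sequence reserves contiguous worst-case memory, the freed blocks of a finished request can be handed to new requests immediately, which is what lets vLLM pack more concurrent sequences into the same VRAM.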
llama.cpp / Ollama — Simple Multi-GPU Inference
llama.cpp uses pipeline parallelism — assigning different layers to different GPUs. It's simpler to set up and works with heterogeneous GPUs (mix RTX 3090 + RTX 4090), but each GPU is idle while waiting for the other to finish its layers [11].
# llama.cpp: split 80 layers across 2 GPUs
./llama-server -m model.gguf \
--n-gpu-layers 80 \
--tensor-split 0.5,0.5
# Ollama: automatically uses all visible GPUs
CUDA_VISIBLE_DEVICES=0,1 ollama run llama3.1:70b
Recent breakthroughs in llama.cpp (January 2026) have improved multi-GPU performance significantly, with automatic tensor split optimization and better NUMA-aware scheduling [11].
DeepSpeed — Distributed Training at Scale
Microsoft's DeepSpeed is the go-to library for training large models across multiple GPUs. Its ZeRO (Zero Redundancy Optimizer) stages progressively partition model states across GPUs [12]:
- ZeRO Stage 1 — Partition optimizer states. ~4× memory reduction.
- ZeRO Stage 2 — + partition gradients. ~8× reduction.
- ZeRO Stage 3 — + partition model parameters. Enables training models larger than any single GPU's memory.
- ZeRO-Infinity — Offload to CPU/NVMe. Train trillion-parameter models.
# DeepSpeed training with 4 GPUs
deepspeed --num_gpus=4 train.py \
--deepspeed ds_config.json \
--model_name_or_path meta-llama/Llama-3.1-8B
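The launch command above points at ds_config.json. A minimal ZeRO Stage 2 config could look like the following sketch; the field names follow DeepSpeed's documented config schema, but the batch sizes and flags are illustrative assumptions, not tuned values:

```python
import json

# Minimal DeepSpeed ZeRO Stage 2 config (a sketch; values are illustrative)
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 4,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                    # partition optimizer states + gradients
        "overlap_comm": True,          # overlap communication with backward pass
        "contiguous_gradients": True,  # reduce memory fragmentation
    },
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```

Bumping "stage" to 3 additionally partitions the model parameters themselves, at the cost of more inter-GPU communication.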
PyTorch Distributed — Native Multi-GPU
PyTorch provides built-in multi-GPU support with two main approaches [13]:
- DataParallel (DP) — Easy but slow. Replicates model on all GPUs, splits data batches. One GPU becomes a bottleneck for gradient reduction.
- DistributedDataParallel (DDP) — Production-grade. Each GPU runs its own process. Uses NCCL for efficient all-reduce operations. Near-linear scaling to 8+ GPUs.
- FSDP (Fully Sharded Data Parallel) — Like DeepSpeed ZeRO Stage 3 but native to PyTorch. Shards model parameters across GPUs.
# PyTorch DDP launch with 4 GPUs
torchrun --nproc_per_node=4 train.py
Exo — Distributed Inference Across Heterogeneous Devices
Exo is fascinating — it lets you pool compute across completely different devices over a network. Connect a Mac Studio, a PC with 2× RTX 4090s, and a laptop with an RTX 3060 into one inference cluster. Exo handles splitting the model and routing tokens between devices [14].
# Node 1 (Mac Studio)
exo run llama-3.1-70b --node
# Node 2 (PC with GPUs)
exo run llama-3.1-70b --node
# They auto-discover each other and pool VRAM
Petals — Collaborative Distributed Inference
Petals takes distributed inference to the internet scale. Multiple people contribute their GPUs to serve parts of a large model. Think BitTorrent, but for LLM inference. You contribute GPU time and can run models that no single person could afford [15].
Other Notable Frameworks
- Hugging Face TGI (Text Generation Inference) — Docker-based, supports tensor parallelism.
docker run --gpus all -e NUM_SHARD=2 ghcr.io/huggingface/text-generation-inference
- Hugging Face Accelerate — Simple wrapper for multi-GPU training. Auto-detects and distributes across available GPUs.
- ExLlamaV2 — Optimized multi-GPU inference with tensor parallelism. Excellent for quantized models.
GPU Selection & CUDA Basics
# See all GPUs
nvidia-smi
# Select specific GPUs for a process
CUDA_VISIBLE_DEVICES=0,2 python my_script.py # Uses GPU 0 and GPU 2
# Monitor GPU usage in real-time
watch -n 1 nvidia-smi
6. Use Cases: Why People Build Multi-GPU Rigs
🤖 Local AI Inference
The #1 reason people build multi-GPU rigs today. A 70B parameter model at Q4 quantization needs ~40 GB of VRAM. Two RTX 5090s (64 GB total) can run it at 27+ tokens/second. Four RTX 3090s (96 GB total) can run even larger models. This is how you run GPT-4-class models privately at home.
🧠 AI Model Training & Fine-Tuning
Fine-tuning a 7B model on a single RTX 4090 takes hours. Four GPUs with DDP or DeepSpeed cut that to roughly a quarter. Full pre-training of models like LLaMA requires hundreds of GPUs — but even fine-tuning and LoRA training benefit enormously from 2–8 GPUs with DeepSpeed ZeRO.
🎨 3D Rendering
Blender's Cycles renderer scales nearly linearly with GPU count. OctaneRender and Redshift are also multi-GPU native. Adding a second GPU literally halves render times. Studios routinely use 4–8 GPU workstations for production rendering [16].
⛏️ Cryptocurrency Mining
The original multi-GPU use case. While Ethereum moved to Proof of Stake in 2022, other coins (Ravencoin, Ergo, Kaspa) remain GPU-minable. Many homelab GPU farms were originally built for mining and later repurposed for AI. The hardware is identical — open-air frames, server PSUs, risers, cooling [6].
🔬 Scientific Computing
Molecular dynamics (GROMACS, AMBER), climate modeling, computational fluid dynamics — these workloads scale across GPUs using CUDA and MPI. University research labs run 4–8 GPU workstations for simulations that would take weeks on CPUs.
🎥 Video Encoding
NVIDIA's NVENC hardware encoder is per-GPU, so two GPUs give you two simultaneous encode streams. Professional video workflows use multi-GPU for real-time 4K/8K encoding. FFmpeg can target a specific GPU's encoder with the -gpu option.
7. Build Tiers: Starter to Beast Mode
🟢 Tier 1: Starter (2 GPUs) — $1,500–$3,000
Runs: 70B models at Q4, all 7–32B models at full speed, 3D rendering at 2× single-GPU
- GPUs: 2× RTX 3090 24GB ($700 each used) or 2× RTX 5090 32GB ($2,000 each)
- Motherboard: Any ATX board with 2× PCIe x16 slots (e.g., ASUS TUF B650, $150)
- CPU: AMD Ryzen 7 7700X or Intel i5-13600K ($200–$300)
- RAM: 64GB DDR5 ($120)
- PSU: 1,200W ATX (Corsair RM1200x, $180)
- Case: Full tower with good airflow (Fractal Meshify 2 XL, $180) or open-air frame ($50)
- Total: ~$2,000–$5,000 depending on GPU choice
This is the sweet spot for most people. Two RTX 3090s give you 48 GB VRAM for $1,400 in GPUs — enough to run Llama 3.1 70B at Q4. Software: Ollama or vLLM with tensor-parallel-size 2.
🔵 Tier 2: Enthusiast (4–8 GPUs) — $5,000–$15,000
Runs: 405B models quantized, multi-user AI serving, large-scale rendering, distributed training
- GPUs: 4–6× RTX 3090 ($700 each) or 4× RTX 4090 ($1,800 each)
- Motherboard: ASRock Rack ROMED8-2T or Gigabyte MC62-G40 ($400–$500 used)
- CPU: AMD EPYC 7313 or 7443 ($200–$400 used)
- RAM: 128–256GB ECC DDR4 ($200–$400 used)
- PSU: 2× HP 1200W server PSUs with breakout boards ($80 total) or 1× EVGA 1600W
- Frame: Open-air 8-GPU mining frame ($80–$120)
- Cooling: 4× 140mm fans at base ($50)
- Total: ~$4,000–$15,000
This is where things get serious. Six RTX 3090s = 144 GB VRAM for ~$4,200 in GPUs. You can run a fully unquantized 70B model or a quantized 405B model. Use EPYC CPUs for the PCIe lanes (128 lanes = 8 GPUs at x16 each) [8].
🟣 Tier 3: Homelab Beast (8+ GPUs) — $15,000+
Runs: DeepSeek V3 671B, full-precision 70B+ models, production AI serving, multi-node training
- Node 1: 8× RTX 3090 on ASUS WRX90E-SAGE + Threadripper PRO 7965WX
- Node 2: 4× RTX 5090 on Supermicro H13SSL-N + EPYC 9354
- Networking: Mellanox ConnectX-5 100GbE cards ($100 each used) + direct cable
- Power: Multiple server PSUs, dedicated 30A circuits
- Software: Exo for distributed inference, DeepSpeed for training
- Rack: 42U server rack ($200–$500)
- Total: $15,000–$50,000+
This is @TheAhmadOsman territory. At this scale, you're managing a mini data center — with power distribution, cooling infrastructure, networking, and monitoring. But you have compute power that rivals a small cloud provider.
8. How People Do It on X
The homelab GPU community is thriving on X/Twitter. Here are the key voices and trends:
- @TheAhmadOsman — Runs 33 GPUs at home. His blog post on why vLLM beats llama.cpp for multi-GPU is required reading. He benchmarks tensor parallelism vs. pipeline parallelism and shows 25%+ performance gains with vLLM/ExLlamaV2 [1].
- r/LocalLLaMA — The subreddit is filled with multi-GPU build logs, benchmarks, and troubleshooting. Common advice: "Get the most VRAM per dollar. Used RTX 3090s are the sweet spot" [11].
- r/homelab — Server rack builds with GPU nodes. People share power consumption data, cooling solutions, and rack layouts.
- Tenstorrent community — Ahmad's 4× Blackhole p150a cards represent the emerging open-source AI accelerator ecosystem. Tenstorrent's chips are designed specifically for inference, potentially offering better performance-per-watt than NVIDIA for certain workloads.
9. Common Mistakes & Tips
- Ignoring PCIe lane counts — Putting 4 GPUs on a consumer CPU with 24 PCIe lanes means each GPU gets x4 instead of x16. Performance drops 10–30% for bandwidth-sensitive workloads.
- Using SATA-powered risers — Fire hazard. Always use 6-pin PCIe powered risers.
- Undersized PSU — GPUs have massive transient power spikes. A 4× RTX 3090 build with a 1,200W PSU will crash under load. Add 20% headroom minimum.
- Poor cooling = thermal throttling — GPUs packed together with 1-slot spacing will throttle. Use 2-slot spacing or water cooling.
- Wrong software choice — Using llama.cpp pipeline parallelism when vLLM tensor parallelism is available wastes half your GPU compute time.
- Forgetting about RAM — Multi-GPU AI inference still needs lots of system RAM for KV cache overflow and CPU preprocessing. Budget 32–64 GB minimum, 128 GB for large models.
- Not power-limiting — Running all GPUs at stock TDP when a 10% power-limit reduction can save ~15% power at a ~3% performance cost.
- Mixing GPU generations without proper software — vLLM tensor parallelism requires identical GPUs. llama.cpp and Exo handle heterogeneous GPUs.
10. Getting Started Step-by-Step
Step 1: Define Your Goal
What do you want to run? A 70B model needs ~40 GB VRAM. Training needs more than inference. Rendering scales linearly. Start with your target workload and work backwards to GPU count.
Step 2: Choose Your GPUs
For most builders: used RTX 3090s ($700 each, 24 GB VRAM, NVLink support). If budget allows: RTX 5090 ($2,000, 32 GB VRAM, best single-card performance). For maximum VRAM: RTX PRO 6000 ($6,800, 96 GB, one card to rule them all).
Step 3: Pick Your Platform
- 2 GPUs: Any ATX motherboard with 2× PCIe x16. Consumer CPU is fine.
- 4 GPUs: Workstation board (WRX90) or server board (EPYC). Need 64+ PCIe lanes.
- 6+ GPUs: Server board mandatory. Open-air frame. Server PSUs. Dedicated circuits.
Step 4: Power Planning
Calculate total wattage: (GPU count × TDP) + 200W for system. Add 20% headroom. Verify your electrical circuits can handle it.
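That formula is easy to turn into a quick calculator. A minimal sketch, using the 200 W system allowance and 20% headroom from the rule of thumb above:

```python
def psu_watts(gpu_count: int, tdp_watts: int, system_watts: int = 200,
              headroom: float = 1.2) -> int:
    """Recommended PSU capacity: (GPUs x TDP) + system draw, plus 20% headroom."""
    return round((gpu_count * tdp_watts + system_watts) * headroom)

print(psu_watts(4, 350))  # 4x RTX 3090 at stock TDP: 1920 W -> dual PSUs
print(psu_watts(4, 280))  # same cards power-limited: 1584 W -> one 1600 W unit
```

Note how power-limiting the cards (Section 4) drops the same build from dual-PSU territory to a single 1,600 W unit.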
Step 5: Build & Configure
- Assemble hardware (mount GPUs, connect power, risers if needed)
- Install Linux (Ubuntu 22.04 or 24.04 recommended)
- Install NVIDIA drivers: sudo apt install nvidia-driver-560
- Verify GPUs: nvidia-smi (all cards should appear)
- Install frameworks: pip install vllm or pip install deepspeed
- Run your first multi-GPU workload
Step 6: Optimize
- Power-limit GPUs with nvidia-smi -pl
- Monitor temperatures and adjust cooling
- Benchmark with your actual workloads
- Set up monitoring (Grafana + Prometheus + nvidia_gpu_exporter)
11. Pros & Cons of Multi-GPU Builds
| ✅ Pros | ❌ Cons |
|---|---|
| Pool VRAM to run larger models | High upfront cost ($1,500–$50,000+) |
| Near-linear scaling for many workloads | Significant power consumption & electricity bills |
| Complete data privacy — nothing leaves your building | Noise and heat — not apartment-friendly |
| No ongoing cloud costs after hardware purchase | Requires Linux knowledge and debugging skills |
| Full control over hardware and software stack | Hardware maintenance and potential failures |
| Can be repurposed (AI → rendering → mining) | GPUs depreciate; new generations arrive yearly |
| Breaks even vs. cloud in months of heavy use | Electrical infrastructure may need upgrading |
References
- Ahmad Osman, "Stop Wasting Your Multi-GPU Setup With llama.cpp: Use vLLM or ExLlamaV2," ahmadosman.com, February 2025.
- IntuitionLabs, "NVIDIA NVLink Explained: A Guide to the GPU Interconnect," October 2025.
- Chaos Group, "NVLink FAQ" — NVLink only bridges identical cards.
- SabrePC, "Do You Really Need NVLink for Multi-GPU Setups?"
- dasroot.net, "Multi-GPU Setups for LLM Development: When and How," February 2026.
- Einstein@Home Community, "Troubleshooting Multiple GPU Setups Using Riser Cards"
- RunPod, "Do I Need InfiniBand for Distributed AI Training?"
- Hardware Corner, "Building a Multi-GPU LLM Workstation: Choosing the Right Motherboard," November 2025.
- Towards Data Science, "How to Build a Multi-GPU System for Deep Learning," January 2025.
- vLLM Blog, "Distributed Inference with vLLM," February 2025.
- László Jagusztin, "llama.cpp Performance Breakthrough for Multi-GPU Setups," Medium, January 2026.
- Microsoft, "DeepSpeed: Deep Learning Optimization Library," GitHub.
- DigitalOcean, "Splitting LLMs Across Multiple GPUs: Techniques, Tools, and Best Practices," April 2025.
- Exo, "Distributed AI Inference Across Heterogeneous Devices," GitHub.
- Petals, "Collaborative Distributed LLM Inference," GitHub.
- RunPod, "The Complete Guide to Multi-GPU Training"