๐Ÿ“บ Watch the video version: ThinkSmart.Life/youtube

Introduction

A 4ร— RTX 3090 rig puts 96GB of aggregate VRAM in your hands โ€” enough to run 70B parameter models in Q4 quantization, serve multiple concurrent LLM requests, and host models that would have required cloud data center access two years ago. But somewhere along the way, inference feels slower than expected. Token throughput plateaus. You wonder: is the hardware the limit, or is there something I can change?

A bottleneck in LLM inference is whichever resource hits saturation first and throttles everything downstream. For most people's intuition, that's GPU compute โ€” CUDA cores, TFLOPS, tensor core utilization. That intuition is wrong. For LLM inference (as opposed to training), the primary bottleneck is almost always GPU memory bandwidth: the rate at which the GPU can read model weights from VRAM into its compute registers. Every token generated requires loading the model's weight matrices into the GPU's arithmetic units. The speed of that transfer โ€” not the arithmetic itself โ€” sets your tokens-per-second ceiling.

Secondary bottlenecks stack up behind that: inter-GPU communication latency and bandwidth, VRAM capacity (which determines what models you can run at all), CPU orchestration overhead, and PCIe transfer speed. The relative importance of each depends heavily on which parallelism strategy your inference stack uses.

This article maps all of it: where a 4ร— RTX 3090 rig specifically bottlenecks, what each hardware upgrade actually changes, and a ranked upgrade priority list grounded in benchmark data โ€” not spec sheets.

How Multi-GPU Inference Actually Works

The moment you split a model across multiple GPUs, you introduce inter-GPU communication into every forward pass. But not all multi-GPU strategies are equal โ€” the communication volume varies by orders of magnitude depending on the approach.

Layer Parallelism (Pipeline Parallelism)

In layer parallelism, each GPU owns a subset of the model's transformer layers. GPU 0 processes layers 0โ€“19, GPU 1 processes layers 20โ€“39, and so on. A token flows through the GPUs sequentially: each GPU completes its layers, then passes the activation tensor to the next GPU. Inter-GPU communication is minimal โ€” just one activation tensor (typically kilobytes to a few megabytes) passed at each GPU boundary per token.

This is how Ollama splits models by default. The practical consequence: PCIe bandwidth is almost entirely irrelevant for layer-parallel inference. Community benchmarks on LocalLLaMA document that layer-sharded workloads function correctly even when GPUs communicate through PCIe x1 slots โ€” and some configurations have even demonstrated workable (if slow) inference across 10GbE Ethernet between separate machines.[3] If you primarily use Ollama, your PCIe topology has negligible impact on token throughput.

Tensor Parallelism

Tensor parallelism splits each individual matrix multiplication across all GPUs simultaneously. Every GPU works on every layer in parallel, each handling a horizontal slice of the weight matrices and attention heads. After each layer, an all-reduce operation synchronizes results across all GPUs before proceeding โ€” meaning every GPU sends data to every other GPU, at every layer, for every token.

This is vLLM's default strategy for multi-GPU setups. It delivers lower latency and higher throughput than layer parallelism because the full compute capacity of all GPUs applies to every token. The cost: tensor parallelism requires approximately 20 Gbps of sustained inter-GPU bandwidth to avoid communication becoming the bottleneck.[3] PCIe 4.0 x16 provides roughly 32 GB/s theoretical bidirectional throughput โ€” which sounds adequate but degrades substantially under shared access with 4 GPUs competing for the same root complex.

Where NVLink Fits In

NVLink is NVIDIA's dedicated high-speed GPU-to-GPU interconnect. On RTX 3090s, the third-generation NVLink bridge provides approximately 112 GB/s bidirectional bandwidth between a connected pair โ€” roughly 3โ€“4ร— the practical bandwidth of PCIe 4.0 x16. For tensor-parallel workloads, this is the difference between communication being a minor overhead and being a hard ceiling.

The hard constraint with RTX 3090 NVLink: bridges only connect pairs. You can bridge GPU 0โ†”1 and GPU 2โ†”3, but there is no mechanism to bridge the two pairs together. Cross-pair communication (GPU 0 โ†” GPU 2, GPU 1 โ†” GPU 3, etc.) must travel through PCIe. In a 4-GPU all-reduce, roughly 50% of inter-GPU traffic crosses pair boundaries and thus traverses PCIe regardless of how many NVLink bridges you install.

Your Rig Today: Where the Bottlenecks Are

Memory Bandwidth โ€” The Primary Speed Limit

Each RTX 3090 provides 936 GB/s of GDDR6X memory bandwidth.[5] This is your fundamental per-card speed constraint for token generation. In autoregressive inference, generating each token requires loading the relevant weight matrices from VRAM. For a 70B model at Q4 quantization (~40GB), a rough upper-bound calculation: 40GB รท 936 GB/s โ‰ˆ 43ms per token, or about 23 tok/s theoretical maximum per card. Real-world numbers sit lower due to compute overhead, KV cache access patterns, and communication costs. With 4 cards, the aggregate bandwidth is 4ร— 936 = 3,744 GB/s โ€” but only if all 4 cards are working on the same request, which requires effective parallelism.

VRAM Capacity โ€” The Model Size Gate

96GB total gives you comfortable headroom for most current models. A 70B model at Q4_K_M quantization requires ~40GB. A 32B model at FP8 requires ~32GB. You have room for these plus KV cache. Where you start hitting limits: long-context workloads. According to benchmark data from Modal's inference research, an 8B model's KV cache alone climbs from ~0.3GB at 2K context to over 20GB at 128K context.[5] Serving multiple concurrent long-context sessions on the same 96GB pool can exhaust VRAM unexpectedly.

NVLink Topology โ€” The Multi-GPU Communication Ceiling

Benchmarks from Himesh Prasad's vLLM testing on a 4ร— RTX 3090 rig (running QwQ-32B fp8 and Qwen2.5-7B-1M via vLLM) quantify this precisely:[1]

ConfigurationOutput tok/sThroughput tok/s
2 GPUs, NVLink ON716,790
2 GPUs, NVLink OFF483,583
4 GPUs, NVLink ON535,093
4 GPUs, NVLink OFF494,669

Two things stand out. First, NVLink delivers a ~48% output throughput boost for a 2-GPU tensor-parallel pair โ€” this is real and reproducible. Second, going from 2 to 4 GPUs with NVLink actually reduces single-request output speed (71 โ†’ 53 tok/s) because cross-pair PCIe communication introduces synchronization latency that partial NVLink coverage cannot fix. The 4-GPU NVLink advantage over 4-GPU PCIe-only is just ~8% on output tok/s โ€” a fraction of the 2-GPU benefit.[1]

โš ๏ธ The 4-GPU NVLink Trap Installing NVLink bridges on all 4 cards of a 3090 rig does not give you 4ร— NVLink bandwidth. It gives you 2 isolated NVLink pairs that still communicate with each other via PCIe. The topology ceiling is structural, not fixable with more bridges.

Software Stack Implications

Ollama (layer parallelism): PCIe topology is irrelevant. Performance is dominated by per-GPU memory bandwidth and total VRAM. Simpler to operate, but leaves tensor-parallel throughput gains on the table.

vLLM (tensor parallelism): Sensitive to inter-GPU bandwidth. Achieves significantly higher throughput, especially under concurrent request load. More complex to configure. The NVLink topology limits on a 4ร— 3090 rig mean vLLM is not extracting its full potential โ€” cross-pair PCIe latency is the constraint.

Cacique TTS: Typically single-GPU. TTS model inference is lightweight enough that a single 3090 handles the load without meaningful bottlenecking. Not the constraining workload on this rig.

Cost

NVLink bridges for RTX 3090s cost $30โ€“$80 on eBay.[3] This is the lowest-cost hardware upgrade available for this rig.

Impact

For a 2-GPU tensor-parallel pair, NVLink delivers a ~48โ€“50% boost in output throughput. The benchmark numbers are direct: 48 โ†’ 71 tok/s output, 3,583 โ†’ 6,790 tok/s total throughput on QwQ-32B fp8 in vLLM.[1] For 4-GPU tensor-parallel runs, the benefit drops to ~8โ€“10% because cross-pair PCIe traffic persists regardless of how many bridges are installed.

When to Get NVLink Bridges

When NVLink Bridges Won't Help

Verdict: Buy them regardless. At $30โ€“$80, the downside is near-zero. The upside is a genuine 50% throughput improvement on 2-GPU vLLM workloads โ€” the highest ROI per dollar of any upgrade on this list.

Hardware Upgrade Option 2: More RTX 3090s

What You Gain

Each additional RTX 3090 adds 24GB of VRAM. Going from 96GB to 120GB (5 cards) or 144GB (6 cards) unlocks models that don't fit today: 100B+ parameter models at Q4, or 70B models at Q6/Q8 for better quality. More VRAM also means more KV cache headroom per session and lower risk of OOM on long-context requests.

The builder who benchmarked a 4โ†’6 GPU 3090 upgrade found that 6 GPUs forms three complete NVLink pairs, restoring pairing symmetry and making tensor-parallel partitioning cleaner โ€” though vLLM doesn't natively support 6-way tensor-parallel for most models (attention heads must divide evenly by GPU count).[1]

Cost

RTX 3090s run ~$699โ€“$750 on eBay.[4] A 5th card costs ~$730. A 6th brings total additional spend to ~$1,460.

Tradeoffs

Verdict: Right choice if VRAM capacity is your bottleneck โ€” you're hitting OOM or need to run larger models. Wrong choice if you want faster inference on models that already fit.

Hardware Upgrade Option 3: Replace with RTX 4090s

Per-Card Performance

A single RTX 4090 achieves approximately 52 tok/s on Llama 3.1 70B Q4, versus 42 tok/s for a single RTX 3090 โ€” roughly 19% faster per card.[4] Most of this advantage comes from memory bandwidth: 1,008 GB/s (4090) vs 936 GB/s (3090),[5] an ~8% bandwidth improvement, with the remainder coming from improved tensor core efficiency in the Ada Lovelace architecture.

VRAM

The RTX 4090 has 24GB VRAM โ€” identical to the 3090. A 4ร— 4090 rig gives you the same 96GB total. You buy speed, not capacity.

Cost Analysis

RTX 4090 new: $1,599โ€“$2,200.[4] Replacing 4 RTX 3090s with 4 RTX 4090s at ~$1,800 each = $7,200 outlay, minus ~$2,800 recovered from 3090 resale (4 ร— $700). Net cost: ~$4,400 for a ~19% per-card throughput improvement.

NVLink

RTX 4090s also support NVLink bridges in pairs โ€” same topology constraints as the 3090. A 4ร— 4090 rig with NVLink bridges has the same cross-pair PCIe limitation.

Verdict: A 19% per-card speed improvement at $4,400 net cost is a poor ROI unless you have a specific per-request latency target that requires it. The 4090's cost-per-improvement ratio is poor compared to NVLink bridges ($60 for 50%) or software tuning (free). Consider only if 3090 resale value is high and you're running latency-sensitive production workloads.

Hardware Upgrade Option 4: Replace with RTX 5090s

Per-Card Performance

The RTX 5090 (Blackwell architecture) provides 1,792 GB/s memory bandwidth[5] โ€” a 91% increase over the 3090's 936 GB/s. Independent AI benchmarks consistently show 2โ€“3ร— faster inference versus RTX 3090 across standard LLM workloads.[6] This is a genuine architectural generational leap, not incremental: 5th-generation Tensor Cores, FP4 precision support (3,352 TOPS[6]), 32GB of GDDR7 (8GB more per card), and a 512-bit memory interface.

The NVLink Problem

RTX 5090 does not support NVLink. NVIDIA reserved NVLink for workstation-class (RTX 6000 Ada, etc.) and data center cards starting with the Blackwell consumer lineup.[5][6] In a 4ร— 5090 tensor-parallel configuration, all inter-GPU communication runs over PCIe โ€” negating a significant portion of the per-card bandwidth advantage under vLLM's tensor parallelism.

Power

RTX 5090 TDP is 575W[6] versus 350W for the 3090. Four RTX 5090s draw ~2,300W at load โ€” a significant infrastructure upgrade requirement (PSU, power circuits, cooling).

Cost

RTX 5090 MSRP is ~$2,000, but street prices run $3,000+ due to supply constraints.[5] Four cards: ~$12,000โ€“$14,000. Minus 3090 resale (~$2,800): net ~$9,200โ€“$11,200.

VRAM

32GB per card vs 24GB โ€” 128GB total vs 96GB with the 3090 rig. This opens additional model headroom.

๐Ÿ’ก The 5090 Sweet Spot: Ollama users If you use Ollama (layer-parallel), the 5090's missing NVLink doesn't hurt you โ€” layer parallelism doesn't need high-bandwidth inter-GPU communication. The 2ร— memory bandwidth per card translates almost directly to 2ร— token throughput. For Ollama-primary users, 4ร— 5090 is the clearest path to doubling inference speed.

Verdict: Compelling per-card performance, but expensive and high-power. Best justified for Ollama-primary workloads where NVLink absence doesn't matter. For vLLM-heavy tensor-parallel workloads, the missing NVLink is a real constraint on 4-GPU configurations.

Hardware Upgrade Option 5: CPU / PCIe / RAM Platform Upgrades

When CPU Matters

For Ollama (layer-parallel), the CPU does almost nothing during token generation โ€” the GPUs do all the work. CPU matters at model load time and for context preprocessing, but not for sustained inference throughput. A modern mid-range CPU (Ryzen 7000 series, Intel 13th/14th gen) is sufficient.

For vLLM (tensor-parallel), the CPU orchestrates the all-reduce operations between GPUs. Under heavy tensor-parallel load, CPU orchestration overhead becomes measurable. Community benchmarks note that single-core CPU performance and PCIe root complex topology begin to matter once you push 4+ GPUs under tensor parallelism.[3] A Threadripper or EPYC platform provides more PCIe lanes (128+ vs 24 on consumer AMD), cleaner topology, and better multi-GPU memory access โ€” but these platforms cost $3,000โ€“$8,000 for CPU + motherboard alone.

PCIe Generation

For layer-parallel Ollama workloads, PCIe 3.0 vs 4.0 vs 5.0 makes no measurable difference. The communication volume is too small. For tensor-parallel vLLM workloads, PCIe 4.0 x16 per GPU is the target โ€” not because bandwidth is the bottleneck, but because running GPUs at x8 or x4 can introduce asymmetry and latency spikes. PCIe 5.0 doesn't move the needle for current model sizes.

RAM Speed

System RAM has minimal impact on GPU-resident model inference. Once the model is loaded into VRAM, system RAM is barely touched during token generation. DDR5-6000 over DDR4-3200 will not improve your LLM tokens-per-second numbers in any measurable way for GPU-resident inference.

Verdict: Platform upgrades (Threadripper/EPYC) are relevant only if you're running dense 4-GPU vLLM workloads and have already exhausted GPU and NVLink options. For most people, this is premature optimization at high cost. Consumer AM5 or Intel 13th/14th gen platforms are not the bottleneck for typical LLM inference rigs.

Non-GPU Quick Wins (Software & Tuning)

Before any hardware spend, these software changes are free and measurable:

Power Limits โ€” 220W Sweet Spot

Benchmarks on the 4ร— RTX 3090 vLLM rig show the efficiency sweet spot at 220W per GPU โ€” not the default 350W TDP:[1]

Power Limit (W)Output tok/sThroughput tok/s
200W32287
220W39353
275W43392
300W44400

At 220W, you get 88% of the throughput at 63% of the power. Set this with: nvidia-smi -pl 220 -i 0,1,2,3

Quantization Choice

Quantization is the process of reducing the numeric precision of model weights from floating-point (FP16/FP32) to lower-bit integer representations (Q8, Q4, Q2). With 4ร— 24GB cards and 96GB total, you can run 70B models at Q4_K_M (~40GB) with near-FP16 quality for most tasks โ€” this is the practical sweet spot. Q8 on 70B (~80GB) fits in your VRAM pool but leaves little room for KV cache. Q2 fits easily but quality degrades noticeably on reasoning and code tasks. Choose Q4_K_M for 70B as the default unless you have a specific quality requirement driving you higher.

vLLM Over Ollama for High-Throughput Workloads

If you're serving multiple concurrent users or running batch workloads, vLLM's PagedAttention and continuous batching will substantially outperform Ollama. Switching to vLLM for batch/production use is a free throughput improvement โ€” at the cost of more complex configuration.

Resizable BAR (ReBAR)

Enable Resizable BAR in BIOS if your platform supports it. This allows the CPU to access the full GPU VRAM directly rather than through a 256MB aperture window, reducing paging overhead for large context windows. It costs nothing and provides a small but real improvement on long-context workloads.

Recommended Upgrade Priority

Based on cost-to-impact ratio, here is the ranked upgrade order for a 4ร— RTX 3090 inference rig:

PriorityUpgradeEst. CostImpactBest For
1NVLink bridges (2 pairs)~$60โ€“$160+50% throughput on 2-GPU pairsvLLM users, models โ‰ค48GB
2Power limits to 220WFreeBetter efficiency, thermal headroomEveryone
3Switch to vLLM for batchFreeHigher throughput under loadMulti-user / batch workloads
4Quantization tuning (Q4_K_M)FreeBetter quality-to-VRAM tradeoffEveryone
5Add 5th/6th RTX 3090~$700โ€“$1,500+24โ€“48GB VRAM, bigger modelsVRAM-constrained users
6Replace with RTX 5090s~$9,000โ€“$11,000 net2โ€“3ร— per-card throughputOllama-primary, long-term bet
7Replace with RTX 4090s~$4,400 net~19% per-card throughputProduction latency requirements
8Platform (CPU/PCIe)$3,000+Marginal for most workloadsDense vLLM, many GPUs
โœ… The practical starting point Before any hardware spend: set power limits to 220W per GPU, enable ReBAR in BIOS, and evaluate whether your workload benefits from vLLM over Ollama. Then add NVLink bridges โ€” $60โ€“$160 for a 50% throughput improvement on 2-GPU vLLM pairs is the highest ROI upgrade available. Everything else is diminishing returns or a significant capital commitment.

Appendix: How Quantization Makes Models Smaller (and Faster)

Quantization is the process of reducing the numeric precision of a model's weight values. A standard FP32 model stores each weight as a 32-bit floating-point number. FP16 uses 16 bits. Q8 uses 8 bits. Q4 uses 4 bits. Q2 uses 2 bits. The memory footprint shrinks proportionally:

The speed benefit comes from memory bandwidth: since the bottleneck is loading weights from VRAM, a Q4 model generates tokens roughly 2ร— faster than its FP16 equivalent on the same hardware โ€” because you're reading half the bytes. The "K" and "M" suffixes in formats like Q4_K_M refer to the quantization strategy within each bit depth: K-quants use a more sophisticated calibrated approach that preserves model quality better than naive quantization, and "M" (medium) is typically the practical sweet spot between size and quality retention.


References

  1. Himesh Prasad โ€” VLLM Performance Benchmarks 4x RTX 3090 (Power Limits, and NVLINK) โ€” March 2025. Real-world vLLM benchmarks on a 4/6ร— RTX 3090 rig measuring power limit sweet spots and NVLink impact under tensor parallelism. โ†— link
  2. ywian.com โ€” RTX 3090 AI Inference Benchmarks (2024) โ€” Practical benchmarks for 1ร— and 2ร— RTX 3090 configurations including NVLink impact analysis and setup tips. โ†— link
  3. LocalLLaMA (Reddit) โ€” Multi-GPU inference and PCIe/NVLink community benchmarks โ€” Community-sourced insights on layer vs tensor parallelism, PCIe slot bandwidth requirements, and NVLink ROI for inference workloads. โ†— link
  4. LocalAIMaster.com โ€” Best GPUs for AI 2025 โ€” October 2025 head-to-head benchmark data for RTX 3090 vs 4090 on Llama 3.1 70B Q4 inference, including cost and power analysis. โ†— link
  5. ComputeMarket.com โ€” Best GPU for AI in 2026 โ€” March 2026 comprehensive GPU rankings with memory bandwidth data, VRAM requirements by model size and context length, and H100/RTX 5090 analysis. โ†— link
  6. BestGPUsForAI.com โ€” RTX 5090 vs RTX 3090 for AI (2026) โ€” Architecture comparison covering Blackwell vs Ampere, TOPS, NVLink absence on consumer Blackwell, and energy efficiency tradeoffs. โ†— link