Introduction
A 4× RTX 3090 rig puts 96GB of aggregate VRAM in your hands: enough to run 70B parameter models in Q4 quantization, serve multiple concurrent LLM requests, and host models that would have required cloud data center access two years ago. But somewhere along the way, inference feels slower than expected. Token throughput plateaus. You wonder: is the hardware the limit, or is there something I can change?
A bottleneck in LLM inference is whichever resource hits saturation first and throttles everything downstream. Most people's intuition says that resource is GPU compute: CUDA cores, TFLOPS, tensor core utilization. That intuition is wrong. For LLM inference (as opposed to training), the primary bottleneck is almost always GPU memory bandwidth: the rate at which the GPU can read model weights from VRAM into its compute registers. Every token generated requires loading the model's weight matrices into the GPU's arithmetic units. The speed of that transfer, not the arithmetic itself, sets your tokens-per-second ceiling.
Secondary bottlenecks stack up behind that: inter-GPU communication latency and bandwidth, VRAM capacity (which determines what models you can run at all), CPU orchestration overhead, and PCIe transfer speed. The relative importance of each depends heavily on which parallelism strategy your inference stack uses.
This article maps all of it: where a 4× RTX 3090 rig specifically bottlenecks, what each hardware upgrade actually changes, and a ranked upgrade priority list grounded in benchmark data, not spec sheets.
How Multi-GPU Inference Actually Works
The moment you split a model across multiple GPUs, you introduce inter-GPU communication into every forward pass. But not all multi-GPU strategies are equal: the communication volume varies by orders of magnitude depending on the approach.
Layer Parallelism (Pipeline Parallelism)
In layer parallelism, each GPU owns a subset of the model's transformer layers. GPU 0 processes layers 0-19, GPU 1 processes layers 20-39, and so on. A token flows through the GPUs sequentially: each GPU completes its layers, then passes the activation tensor to the next GPU. Inter-GPU communication is minimal: just one activation tensor (typically kilobytes to a few megabytes) passed at each GPU boundary per token.
This is how Ollama splits models by default. The practical consequence: PCIe bandwidth is almost entirely irrelevant for layer-parallel inference. Community benchmarks on LocalLLaMA document that layer-sharded workloads function correctly even when GPUs communicate through PCIe x1 slots, and some configurations have even demonstrated workable (if slow) inference across 10GbE links between separate machines.[3] If you primarily use Ollama, your PCIe topology has negligible impact on token throughput.
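The split described above can be sketched in a few lines. This is only an illustration: the 80-layer count and the function name `partition_layers` are assumptions for the example, and real runtimes may use different placement heuristics.

```python
def partition_layers(num_layers: int, num_gpus: int) -> list[range]:
    """Assign contiguous blocks of layers to GPUs, as evenly as possible."""
    base, extra = divmod(num_layers, num_gpus)
    ranges = []
    start = 0
    for gpu in range(num_gpus):
        # The first `extra` GPUs take one additional layer when the split is uneven.
        count = base + (1 if gpu < extra else 0)
        ranges.append(range(start, start + count))
        start += count
    return ranges


if __name__ == "__main__":
    # Hypothetical 80-layer model on 4 GPUs.
    for gpu, layers in enumerate(partition_layers(80, 4)):
        print(f"GPU {gpu}: layers {layers.start}-{layers.stop - 1}")
```

With 80 layers and 4 GPUs, each GPU receives 20 consecutive layers, so only three activation hand-offs occur per token.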
Tensor Parallelism
Tensor parallelism splits each individual matrix multiplication across all GPUs simultaneously. Every GPU works on every layer in parallel, each handling a horizontal slice of the weight matrices and attention heads. After each layer, an all-reduce operation synchronizes results across all GPUs before proceeding, meaning every GPU exchanges data with the other GPUs at every layer, for every token.
This is vLLM's default strategy for multi-GPU setups. It delivers lower latency and higher throughput than layer parallelism because the full compute capacity of all GPUs applies to every token. The cost: tensor parallelism requires approximately 20 Gbps (about 2.5 GB/s) of sustained inter-GPU bandwidth to avoid communication becoming the bottleneck.[3] PCIe 4.0 x16 provides roughly 32 GB/s of theoretical throughput in each direction, which sounds adequate but degrades substantially under shared access with 4 GPUs competing for the same root complex.
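To make the all-reduce step concrete, here is a toy sketch in plain Python. It shards one matrix-vector product along the input dimension, which is one common way to split a linear layer and is chosen here only because it makes the reduction visible. The function names are illustrative and the "GPUs" are just loop iterations.

```python
def matvec(weights, x):
    """Plain matrix-vector product; `weights` is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def tensor_parallel_matvec(weights, x, num_gpus):
    """Shard the input dimension across GPUs, then all-reduce the partial outputs."""
    shard = len(x) // num_gpus  # assumes len(x) divides evenly by num_gpus
    partials = []
    for gpu in range(num_gpus):
        lo, hi = gpu * shard, (gpu + 1) * shard
        # Each GPU holds only its slice of every weight row and of the input.
        local_weights = [row[lo:hi] for row in weights]
        partials.append(matvec(local_weights, x[lo:hi]))
    # All-reduce: every GPU ends up with the element-wise sum of all partials.
    return [sum(values) for values in zip(*partials)]


if __name__ == "__main__":
    w = [[1, 2, 3, 4], [5, 6, 7, 8]]
    x = [1, 1, 1, 1]
    print(matvec(w, x), tensor_parallel_matvec(w, x, num_gpus=2))
```

Each simulated GPU produces a full-length but incomplete output vector, and the sum across GPUs is the data that must cross NVLink or PCIe after every sharded layer.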
Where NVLink Fits In
NVLink is NVIDIA's dedicated high-speed GPU-to-GPU interconnect. On RTX 3090s, the third-generation NVLink bridge provides approximately 112 GB/s bidirectional bandwidth between a connected pair, roughly 3-4× the practical bandwidth of PCIe 4.0 x16. For tensor-parallel workloads, this is the difference between communication being a minor overhead and being a hard ceiling.
The hard constraint with RTX 3090 NVLink: bridges only connect pairs. You can bridge GPU 0-1 and GPU 2-3, but there is no mechanism to bridge the two pairs together. Cross-pair communication (GPU 0 ↔ GPU 2, GPU 1 ↔ GPU 3, etc.) must travel through PCIe. In a 4-GPU all-reduce, roughly 50% of inter-GPU traffic crosses pair boundaries and thus traverses PCIe regardless of how many NVLink bridges you install.
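One way to sanity-check the 50% figure is to count links. The sketch below assumes a simple ring all-reduce over GPUs 0-1-2-3, which is an assumption for illustration; the communication library may choose a different pattern at runtime. It also counts the all-pairs case for comparison.

```python
from itertools import combinations

# Bridged pairs as described above: GPU 0-1 and GPU 2-3.
NVLINK_PAIRS = {frozenset((0, 1)), frozenset((2, 3))}


def ring_links(num_gpus):
    """Neighbour-to-neighbour links of a simple ring: 0-1, 1-2, ..., (n-1)-0."""
    return [frozenset((i, (i + 1) % num_gpus)) for i in range(num_gpus)]


def all_pair_links(num_gpus):
    """Every distinct GPU pair, as used by a naive all-to-all exchange."""
    return [frozenset(pair) for pair in combinations(range(num_gpus), 2)]


def pcie_fraction(links, nvlink_pairs=NVLINK_PAIRS):
    """Fraction of links that are not covered by an NVLink bridge."""
    over_pcie = [link for link in links if link not in nvlink_pairs]
    return len(over_pcie) / len(links)


if __name__ == "__main__":
    print(f"ring:      {pcie_fraction(ring_links(4)):.0%} of links over PCIe")
    print(f"all pairs: {pcie_fraction(all_pair_links(4)):.0%} of links over PCIe")
```

Under the ring assumption, two of the four links are cross-pair. Under an all-pairs exchange the PCIe share is higher, so the 50% figure is best read as an approximation.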
Your Rig Today: Where the Bottlenecks Are
Memory Bandwidth: The Primary Speed Limit
Each RTX 3090 provides 936 GB/s of GDDR6X memory bandwidth.[5] This is your fundamental per-card speed constraint for token generation. In autoregressive inference, generating each token requires loading the relevant weight matrices from VRAM. For a 70B model at Q4 quantization (~40GB), a rough upper-bound calculation: 40GB ÷ 936 GB/s ≈ 43ms per token, or about 23 tok/s theoretical maximum at a single card's bandwidth. (A 40GB model does not fit in one card's 24GB, so treat this as a reference ceiling rather than a runnable configuration.) Real-world numbers sit lower due to compute overhead, KV cache access patterns, and communication costs. With 4 cards, the aggregate bandwidth is 4 × 936 = 3,744 GB/s, but only if all 4 cards are working on the same request, which requires effective parallelism.
VRAM Capacity: The Model Size Gate
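The back-of-envelope ceiling above is easy to reproduce. This sketch assumes every generated token reads all model weights exactly once and ignores KV cache traffic and compute, so it is an upper bound only. The function names are illustrative.

```python
def ms_per_token(model_gb: float, bandwidth_gb_s: float) -> float:
    """Time to stream the full set of weights once, in milliseconds."""
    return 1000.0 * model_gb / bandwidth_gb_s


def max_tokens_per_second(model_gb: float, bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed if each token reads all weights once."""
    return bandwidth_gb_s / model_gb


if __name__ == "__main__":
    single_card = 936.0          # GB/s, RTX 3090
    aggregate = 4 * single_card  # GB/s, only reachable with ideal parallelism
    print(f"{ms_per_token(40, single_card):.1f} ms/token at one card's bandwidth")
    print(f"{max_tokens_per_second(40, single_card):.1f} tok/s ceiling (one card's bandwidth)")
    print(f"{max_tokens_per_second(40, aggregate):.1f} tok/s ceiling (ideal 4-card aggregate)")
```

The gap between the single-card and aggregate ceilings is what tensor parallelism tries to capture and what sequential layer parallelism leaves unused for a single request.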
96GB total gives you comfortable headroom for most current models. A 70B model at Q4_K_M quantization requires ~40GB. A 32B model at FP8 requires ~32GB. You have room for these plus KV cache. Where you start hitting limits: long-context workloads. According to benchmark data from Modal's inference research, an 8B model's KV cache alone climbs from ~0.3GB at 2K context to over 20GB at 128K context.[5] Serving multiple concurrent long-context sessions on the same 96GB pool can exhaust VRAM unexpectedly.
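KV cache growth follows directly from the model's shape. The sketch below uses the standard keys-plus-values sizing formula with a hypothetical 8B-class configuration (32 layers, 8 KV heads, head dimension 128, 16-bit values); those numbers are assumptions for illustration, not taken from the cited benchmark.

```python
def kv_cache_gb(tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """KV cache size in GB (10^9 bytes) for one sequence of `tokens` tokens."""
    # Factor of 2 covers keys and values, stored per layer and per KV head.
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return tokens * bytes_per_token / 1e9


if __name__ == "__main__":
    # Hypothetical 8B-class model shape; adjust to the model you actually run.
    shape = dict(num_layers=32, num_kv_heads=8, head_dim=128)
    for context in (2_048, 32_768, 131_072):
        print(f"{context:>7} tokens: {kv_cache_gb(context, **shape):.2f} GB")
```

Under these assumptions the cache grows linearly from a fraction of a gigabyte at 2K context to well over ten gigabytes at 128K, in the same range as the figures quoted above. Exact values depend on the model's architecture and cache precision, and the total scales with the number of concurrent sessions.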
NVLink Topology: The Multi-GPU Communication Ceiling
Benchmarks from Himesh Prasad's vLLM testing on a 4× RTX 3090 rig (running QwQ-32B fp8 and Qwen2.5-7B-1M via vLLM) quantify this precisely:[1]
| Configuration | Output tok/s | Throughput tok/s |
|---|---|---|
| 2 GPUs, NVLink ON | 71 | 6,790 |
| 2 GPUs, NVLink OFF | 48 | 3,583 |
| 4 GPUs, NVLink ON | 53 | 5,093 |
| 4 GPUs, NVLink OFF | 49 | 4,669 |
Two things stand out. First, NVLink delivers a ~48% output throughput boost for a 2-GPU tensor-parallel pair: this is real and reproducible. Second, going from 2 to 4 GPUs with NVLink actually reduces single-request output speed (71 → 53 tok/s) because cross-pair PCIe communication introduces synchronization latency that partial NVLink coverage cannot fix. The 4-GPU NVLink advantage over 4-GPU PCIe-only is just ~8% on output tok/s, a fraction of the 2-GPU benefit.[1]
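The percentages quoted above follow from the table. The short script below recomputes them from the same four rows; only the dictionary layout and the helper name `pct_gain` are introduced here.

```python
# (output tok/s, throughput tok/s) copied from the benchmark table above.
BENCH = {
    ("2 GPUs", "NVLink ON"): (71, 6790),
    ("2 GPUs", "NVLink OFF"): (48, 3583),
    ("4 GPUs", "NVLink ON"): (53, 5093),
    ("4 GPUs", "NVLink OFF"): (49, 4669),
}


def pct_gain(new: float, old: float) -> float:
    """Percentage change from `old` to `new`."""
    return 100.0 * (new - old) / old


if __name__ == "__main__":
    for gpus in ("2 GPUs", "4 GPUs"):
        on, off = BENCH[(gpus, "NVLink ON")], BENCH[(gpus, "NVLink OFF")]
        print(f"{gpus}: output {pct_gain(on[0], off[0]):+.0f}%, "
              f"throughput {pct_gain(on[1], off[1]):+.0f}%")
    two, four = BENCH[("2 GPUs", "NVLink ON")], BENCH[("4 GPUs", "NVLink ON")]
    print(f"2 -> 4 GPUs with NVLink: output {pct_gain(four[0], two[0]):+.0f}%")
```

Note that the aggregate throughput column moves more than the single-request output column in the 2-GPU case, so the benefit depends on which metric matters for your workload.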
Software Stack Implications
Ollama (layer parallelism): PCIe topology is irrelevant. Performance is dominated by per-GPU memory bandwidth and total VRAM. Simpler to operate, but leaves tensor-parallel throughput gains on the table.
vLLM (tensor parallelism): Sensitive to inter-GPU bandwidth. Achieves significantly higher throughput, especially under concurrent request load. More complex to configure. The NVLink topology limits on a 4× 3090 rig mean vLLM is not extracting its full potential: cross-pair PCIe latency is the constraint.
Cacique TTS: Typically single-GPU. TTS model inference is lightweight enough that a single 3090 handles the load without meaningful bottlenecking. Not the constraining workload on this rig.
Hardware Upgrade Option 1: NVLink Bridges
Cost
NVLink bridges for RTX 3090s cost $30-$80 on eBay.[3] This is the lowest-cost hardware upgrade available for this rig.
Impact
For a 2-GPU tensor-parallel pair, NVLink delivers a ~48-50% boost in single-request output speed. The benchmark numbers are direct: 48 → 71 tok/s output, and 3,583 → 6,790 tok/s total throughput (about +90%) on QwQ-32B fp8 in vLLM.[1] For 4-GPU tensor-parallel runs, the benefit drops to ~8-10% because cross-pair PCIe traffic persists regardless of how many bridges are installed.
When to Get NVLink Bridges
- You use vLLM with tensor parallelism
- Your target models fit within 2 GPUs (≤48GB), which covers most models at 32B and below at reasonable quantization
- Budget is the constraint: this is about $60 for a meaningful throughput gain
When NVLink Bridges Won't Help
- You use Ollama (layer-parallel): improvement will be negligible
- Your models span all 4 GPUs in tensor-parallel mode: cross-pair PCIe traffic still dominates
- Your bottleneck is VRAM capacity, not communication speed
Verdict: Buy them regardless. At $30-$80, the downside is near-zero. The upside is a genuine ~50% throughput improvement on 2-GPU vLLM workloads, the highest ROI per dollar of any upgrade on this list.
Hardware Upgrade Option 2: More RTX 3090s
What You Gain
Each additional RTX 3090 adds 24GB of VRAM. Going from 96GB to 120GB (5 cards) or 144GB (6 cards) unlocks models that don't fit today: 100B+ parameter models at Q4, or 70B models at Q6/Q8 for better quality. More VRAM also means more KV cache headroom per session and lower risk of OOM on long-context requests.
The builder who benchmarked a 4 → 6 GPU 3090 upgrade found that 6 GPUs form three complete NVLink pairs, restoring pairing symmetry and making tensor-parallel partitioning cleaner, though vLLM doesn't natively support 6-way tensor parallelism for most models (the attention head count must divide evenly by the GPU count).[1]
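The divisibility rule is simple to check before buying hardware. The sketch below lists which tensor-parallel sizes divide a given attention head count evenly. The head counts in the example are hypothetical, and an inference engine may impose additional constraints beyond this one.

```python
def valid_tp_sizes(num_attention_heads: int, max_gpus: int) -> list[int]:
    """Tensor-parallel sizes (up to max_gpus) that split the heads evenly."""
    return [n for n in range(1, max_gpus + 1) if num_attention_heads % n == 0]


if __name__ == "__main__":
    # Hypothetical head counts; check your model's config for the real value.
    for heads in (40, 48, 64):
        print(f"{heads} heads -> valid TP sizes up to 6 GPUs: {valid_tp_sizes(heads, 6)}")
```

A model with 64 heads, for example, cannot be split 6 ways, so a 6-GPU rig would run it on 4 GPUs in tensor-parallel mode or fall back to layer parallelism.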
Cost
RTX 3090s run ~$699-$750 on eBay.[4] A 5th card costs ~$730. A 6th brings total additional spend to ~$1,460.
Tradeoffs
- PCIe slots: A 5th card may require an x4 or x1 slot, or a PLX riser. For Ollama (layer-parallel) this is fine. For vLLM (tensor-parallel), a narrow slot becomes a communication bottleneck.
- Power: 5 cards × ~370W sustained = ~1,850W GPU draw. PSU and case cooling must support this.
- Speed unchanged: Adding cards does not improve per-token speed for models that already fit. It only expands what models you can run.
- Odd-GPU count: A 5th card creates an unpartnered NVLink orphan under tensor parallelism.
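The power bullet above can be turned into a quick sizing check. In this sketch the 250W allowance for CPU, drives and fans and the 20% PSU headroom are assumptions chosen for illustration, not measured values.

```python
def gpu_power_draw(num_gpus: int, watts_per_gpu: float) -> float:
    """Total sustained GPU draw in watts."""
    return num_gpus * watts_per_gpu


def required_psu_watts(num_gpus, watts_per_gpu, system_watts=250.0, headroom=0.2):
    """PSU capacity with headroom; system_watts and headroom are assumed values."""
    load = gpu_power_draw(num_gpus, watts_per_gpu) + system_watts
    return load * (1.0 + headroom)


if __name__ == "__main__":
    for cards, watts in ((5, 370), (5, 220)):
        print(f"{cards} cards at {watts}W: {gpu_power_draw(cards, watts):.0f}W GPU draw, "
              f"~{required_psu_watts(cards, watts):.0f}W PSU capacity suggested")
```

Running the same numbers with a lower per-GPU power limit shows how much PSU capacity a cap can free up when adding a card.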
Verdict: Right choice if VRAM capacity is your bottleneck: you're hitting OOM or need to run larger models. Wrong choice if you want faster inference on models that already fit.
Hardware Upgrade Option 3: Replace with RTX 4090s
Per-Card Performance
A single RTX 4090 achieves approximately 52 tok/s on Llama 3.1 70B Q4, versus 42 tok/s for a single RTX 3090, which makes the 4090 roughly 24% faster per card (put the other way, the 3090 is about 19% slower).[4] Part of this advantage comes from memory bandwidth: 1,008 GB/s (4090) vs 936 GB/s (3090),[5] an ~8% bandwidth improvement, with the remainder coming from improved tensor core efficiency in the Ada Lovelace architecture.
VRAM
The RTX 4090 has 24GB VRAM, identical to the 3090. A 4× 4090 rig gives you the same 96GB total. You buy speed, not capacity.
Cost Analysis
RTX 4090 new: $1,599-$2,200.[4] Replacing 4 RTX 3090s with 4 RTX 4090s at ~$1,800 each = $7,200 outlay, minus ~$2,800 recovered from 3090 resale (4 × $700). Net cost: ~$4,400 for a ~24% per-card throughput improvement.
NVLink
RTX 4090s do not support NVLink: NVIDIA removed the connector from its consumer cards with the Ada Lovelace generation. A 4× 4090 rig therefore runs all inter-GPU traffic over PCIe, so you give up the 2-GPU NVLink pairing that the 3090 offers.
Verdict: A ~24% per-card speed improvement at $4,400 net cost is a poor ROI unless you have a specific per-request latency target that requires it. The 4090's cost-per-improvement ratio is poor compared to NVLink bridges ($60 for 50%) or software tuning (free). Consider only if 3090 resale value is high and you're running latency-sensitive production workloads.
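The cost comparison can be made explicit as dollars per percentage point of speedup. The sketch below uses the prices and tok/s figures quoted in this section; the helper names are illustrative, and the metric itself is a rough way to rank options rather than a rigorous one.

```python
def net_upgrade_cost(new_price: float, resale_price: float, num_cards: int) -> float:
    """Outlay for new cards minus what the old cards recover on resale."""
    return num_cards * (new_price - resale_price)


def pct_faster(new_tok_s: float, old_tok_s: float) -> float:
    """Percentage speedup of the new card over the old one."""
    return 100.0 * (new_tok_s - old_tok_s) / old_tok_s


def cost_per_percent(cost: float, pct: float) -> float:
    """Dollars spent per percentage point of improvement."""
    return cost / pct


if __name__ == "__main__":
    cost_4090 = net_upgrade_cost(new_price=1800, resale_price=700, num_cards=4)
    gain_4090 = pct_faster(52, 42)
    print(f"4x 4090 swap: ${cost_4090:,.0f} net, {gain_4090:.0f}% faster, "
          f"${cost_per_percent(cost_4090, gain_4090):,.0f} per point")
    print(f"NVLink bridge: $60, ~48% faster on a 2-GPU pair, "
          f"${cost_per_percent(60, 48):.2f} per point")
```

The two results differ by roughly two orders of magnitude, which is the basis for ranking the bridges far ahead of a card swap.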
Hardware Upgrade Option 4: Replace with RTX 5090s
Per-Card Performance
The RTX 5090 (Blackwell architecture) provides 1,792 GB/s memory bandwidth,[5] a 91% increase over the 3090's 936 GB/s. Independent AI benchmarks consistently show 2-3× faster inference versus RTX 3090 across standard LLM workloads.[6] This is a genuine generational leap, not an incremental one: 5th-generation Tensor Cores, FP4 precision support (3,352 TOPS[6]), 32GB of GDDR7 (8GB more per card), and a 512-bit memory interface.
The NVLink Problem
RTX 5090 does not support NVLink. NVIDIA dropped NVLink from its consumer cards after the RTX 3090 generation, and consumer Blackwell continues without it.[5][6] In a 4× 5090 tensor-parallel configuration, all inter-GPU communication runs over PCIe, negating a significant portion of the per-card bandwidth advantage under vLLM's tensor parallelism.
Power
RTX 5090 TDP is 575W[6] versus 350W for the 3090. Four RTX 5090s draw ~2,300W at load, a significant infrastructure upgrade requirement (PSU, power circuits, cooling).
Cost
RTX 5090 MSRP is ~$2,000, but street prices run $3,000+ due to supply constraints.[5] Four cards: ~$12,000-$14,000. Minus 3090 resale (~$2,800): net ~$9,200-$11,200.
VRAM
32GB per card versus 24GB, so 128GB total versus 96GB with the 3090 rig. This opens additional model headroom.
Verdict: Compelling per-card performance, but expensive and high-power. Best justified for Ollama-primary workloads where NVLink absence doesn't matter. For vLLM-heavy tensor-parallel workloads, the missing NVLink is a real constraint on 4-GPU configurations.
Hardware Upgrade Option 5: CPU / PCIe / RAM Platform Upgrades
When CPU Matters
For Ollama (layer-parallel), the CPU does almost nothing during token generation: the GPUs do all the work. CPU matters at model load time and for context preprocessing, but not for sustained inference throughput. A modern mid-range CPU (Ryzen 7000 series, Intel 13th/14th gen) is sufficient.
For vLLM (tensor-parallel), the CPU orchestrates the all-reduce operations between GPUs. Under heavy tensor-parallel load, CPU orchestration overhead becomes measurable. Community benchmarks note that single-core CPU performance and PCIe root complex topology begin to matter once you push 4+ GPUs under tensor parallelism.[3] A Threadripper or EPYC platform provides more PCIe lanes (128+ vs 24 on consumer AMD), cleaner topology, and better multi-GPU memory access, but these platforms cost $3,000-$8,000 for CPU + motherboard alone.
PCIe Generation
For layer-parallel Ollama workloads, PCIe 3.0 vs 4.0 vs 5.0 makes no measurable difference. The communication volume is too small. For tensor-parallel vLLM workloads, PCIe 4.0 x16 per GPU is the target, not because bandwidth is the bottleneck, but because running GPUs at x8 or x4 can introduce asymmetry and latency spikes. PCIe 5.0 doesn't move the needle for current model sizes.
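For reference, theoretical PCIe bandwidth per slot can be derived from the per-lane transfer rate and the 128b/130b line encoding used by PCIe 3.0 through 5.0. The sketch below gives one-direction figures only and ignores protocol overhead, so real transfers will be lower.

```python
# Per-lane raw transfer rate in GT/s for each PCIe generation.
PCIE_GT_PER_S = {3: 8.0, 4: 16.0, 5: 32.0}


def pcie_gb_per_s(generation: int, lanes: int) -> float:
    """Theoretical one-direction bandwidth in GB/s after 128b/130b encoding."""
    return PCIE_GT_PER_S[generation] * lanes * (128 / 130) / 8


if __name__ == "__main__":
    for generation in (3, 4, 5):
        row = ", ".join(f"x{lanes}: {pcie_gb_per_s(generation, lanes):5.1f}"
                        for lanes in (1, 4, 8, 16))
        print(f"PCIe {generation}.0  {row}  (GB/s per direction)")
```

Comparing the x4 and x16 rows shows how far a narrow slot falls below the x16 target described above.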
RAM Speed
System RAM has minimal impact on GPU-resident model inference. Once the model is loaded into VRAM, system RAM is barely touched during token generation. DDR5-6000 over DDR4-3200 will not improve your LLM tokens-per-second numbers in any measurable way for GPU-resident inference.
Verdict: Platform upgrades (Threadripper/EPYC) are relevant only if you're running dense 4-GPU vLLM workloads and have already exhausted GPU and NVLink options. For most people, this is premature optimization at high cost. Consumer AM5 or Intel 13th/14th gen platforms are not the bottleneck for typical LLM inference rigs.
Non-GPU Quick Wins (Software & Tuning)
Before any hardware spend, these software changes are free and measurable:
Power Limits: The 220W Sweet Spot
Benchmarks on the 4× RTX 3090 vLLM rig show the efficiency sweet spot at 220W per GPU, not the default 350W TDP:[1]
| Power Limit (W) | Output tok/s | Throughput tok/s |
|---|---|---|
| 200W | 32 | 287 |
| 220W | 39 | 353 |
| 275W | 43 | 392 |
| 300W | 44 | 400 |
At 220W, you get 88% of the 300W throughput (353 vs 400 tok/s) at 63% of the default 350W power limit. Set this with: nvidia-smi -pl 220 -i 0,1,2,3 (changing power limits requires root privileges).
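The sweet-spot claim can be checked by dividing throughput by the power limit for each row of the table. The sketch below uses the table's numbers as-is; the dictionary layout and function names are introduced here for illustration.

```python
# power limit (W) -> (output tok/s, throughput tok/s), from the table above.
POWER_SWEEP = {200: (32, 287), 220: (39, 353), 275: (43, 392), 300: (44, 400)}
DEFAULT_LIMIT_W = 350  # stock RTX 3090 power limit


def tokens_per_watt(limit_w: int) -> float:
    """Throughput tok/s per watt of per-GPU power limit."""
    return POWER_SWEEP[limit_w][1] / limit_w


def relative_throughput(limit_w: int) -> float:
    """Throughput as a fraction of the best row in the sweep."""
    best = max(throughput for _, throughput in POWER_SWEEP.values())
    return POWER_SWEEP[limit_w][1] / best


if __name__ == "__main__":
    for limit in POWER_SWEEP:
        print(f"{limit}W: {tokens_per_watt(limit):.2f} tok/s per W, "
              f"{relative_throughput(limit):.0%} of peak throughput, "
              f"{limit / DEFAULT_LIMIT_W:.0%} of default power")
    print("most efficient limit:", max(POWER_SWEEP, key=tokens_per_watt))
```

Efficiency peaks at the 220W row and falls off on both sides, which is what makes it the sweet spot rather than simply the lowest setting.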
Quantization Choice
Quantization is the process of reducing the numeric precision of model weights from floating-point (FP16/FP32) to lower-bit integer representations (Q8, Q4, Q2). With 4× 24GB cards and 96GB total, you can run 70B models at Q4_K_M (~40GB) with near-FP16 quality for most tasks: this is the practical sweet spot. Q8 on 70B (~70GB) fits in your VRAM pool but leaves little room for KV cache. Q2 fits easily but quality degrades noticeably on reasoning and code tasks. Choose Q4_K_M for 70B as the default unless you have a specific quality requirement driving you higher.
vLLM Over Ollama for High-Throughput Workloads
If you're serving multiple concurrent users or running batch workloads, vLLM's PagedAttention and continuous batching will substantially outperform Ollama. Switching to vLLM for batch/production use is a free throughput improvement, at the cost of more complex configuration.
Resizable BAR (ReBAR)
Enable Resizable BAR in BIOS if your platform supports it. This allows the CPU to access the full GPU VRAM directly rather than through a 256MB aperture window, reducing paging overhead for large context windows. It costs nothing and provides a small but real improvement on long-context workloads.
Recommended Upgrade Priority
Based on cost-to-impact ratio, here is the ranked upgrade order for a 4× RTX 3090 inference rig:
| Priority | Upgrade | Est. Cost | Impact | Best For |
|---|---|---|---|---|
| 1 | NVLink bridges (2 pairs) | ~$60-$160 | +50% throughput on 2-GPU pairs | vLLM users, models ≤48GB |
| 2 | Power limits to 220W | Free | Better efficiency, thermal headroom | Everyone |
| 3 | Switch to vLLM for batch | Free | Higher throughput under load | Multi-user / batch workloads |
| 4 | Quantization tuning (Q4_K_M) | Free | Better quality-to-VRAM tradeoff | Everyone |
| 5 | Add 5th/6th RTX 3090 | ~$700-$1,500 | +24-48GB VRAM, bigger models | VRAM-constrained users |
| 6 | Replace with RTX 5090s | ~$9,000-$11,000 net | 2-3× per-card throughput | Ollama-primary, long-term bet |
| 7 | Replace with RTX 4090s | ~$4,400 net | ~24% per-card throughput (52 vs 42 tok/s) | Production latency requirements |
| 8 | Platform (CPU/PCIe) | $3,000+ | Marginal for most workloads | Dense vLLM, many GPUs |
Appendix: How Quantization Makes Models Smaller (and Faster)
Quantization is the process of reducing the numeric precision of a model's weight values. A standard FP32 model stores each weight as a 32-bit floating-point number. FP16 uses 16 bits. Q8 uses 8 bits. Q4 uses 4 bits. Q2 uses 2 bits. The memory footprint shrinks proportionally:
- 70B model at FP32: ~280GB (doesn't fit anywhere but data centers)
- 70B model at FP16: ~140GB
- 70B model at Q8: ~70GB (fits in your 96GB rig with minimal KV cache)
- 70B model at Q4_K_M: ~40GB (fits comfortably, room for KV cache)
- 70B model at Q2: ~18GB (fits easily, but quality degrades noticeably)
The speed benefit comes from memory bandwidth: since the bottleneck is loading weights from VRAM, fewer bytes per weight means faster tokens. A Q4 model reads roughly a quarter to a third of the bytes of its FP16 equivalent (about 40GB versus 140GB for 70B), so its bandwidth-bound ceiling is up to ~3.5× higher on the same hardware, and roughly 2× higher than Q8. Measured gains are smaller once dequantization and other overheads are included. The "K" and "M" suffixes in formats like Q4_K_M refer to the quantization strategy within each bit depth: K-quants use a more sophisticated block-wise scheme that preserves model quality better than naive quantization, and "M" (medium) is typically the practical sweet spot between size and quality retention.
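The footprint list above is just parameter count times bits per weight. The sketch below reproduces it; the 4.5 bits-per-weight value used for Q4_K_M is an illustrative assumption chosen to land near the ~40GB figure, since K-quant formats average somewhat more than their nominal bit depth.

```python
def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB: parameters x bits / 8 bits per byte."""
    return params_billions * bits_per_weight / 8


if __name__ == "__main__":
    formats = {
        "FP32": 32,
        "FP16": 16,
        "Q8": 8,
        "Q4_K_M (assumed ~4.5 bits)": 4.5,
        "Q2": 2,
    }
    for name, bits in formats.items():
        print(f"70B at {name}: ~{weights_gb(70, bits):.0f} GB")
```

These are weight-only estimates; KV cache and runtime buffers come on top, which is why a format that nominally fits can still run out of memory.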
References
- [1] Himesh Prasad, "VLLM Performance Benchmarks 4x RTX 3090 (Power Limits, and NVLINK)", March 2025. Real-world vLLM benchmarks on a 4/6× RTX 3090 rig measuring power limit sweet spots and NVLink impact under tensor parallelism.
- [2] ywian.com, "RTX 3090 AI Inference Benchmarks (2024)". Practical benchmarks for 1× and 2× RTX 3090 configurations including NVLink impact analysis and setup tips.
- [3] LocalLLaMA (Reddit), "Multi-GPU inference and PCIe/NVLink community benchmarks". Community-sourced insights on layer vs tensor parallelism, PCIe slot bandwidth requirements, and NVLink ROI for inference workloads.
- [4] LocalAIMaster.com, "Best GPUs for AI 2025", October 2025. Head-to-head benchmark data for RTX 3090 vs 4090 on Llama 3.1 70B Q4 inference, including cost and power analysis.
- [5] ComputeMarket.com, "Best GPU for AI in 2026", March 2026. Comprehensive GPU rankings with memory bandwidth data, VRAM requirements by model size and context length, and H100/RTX 5090 analysis.
- [6] BestGPUsForAI.com, "RTX 5090 vs RTX 3090 for AI (2026)". Architecture comparison covering Blackwell vs Ampere, TOPS, NVLink absence on consumer Blackwell, and energy efficiency tradeoffs.