Why Qwen3.5-35B-A3B
Qwen3.5-35B-A3B is an unusual model. The "35B" refers to total parameter count; the "A3B" tells you that only approximately 3 billion of those parameters are active on any given forward pass. This is Alibaba's Qwen3.5 flagship using a sparse Mixture-of-Experts architecture: 256 experts are defined, but only 8 routed + 1 shared activate per token. The rest of the weights sit in VRAM unused for that token.
Why does this matter for hardware planning? Because the two numbers that govern inference performance — memory required and compute per token — are both determined by the active parameter count, not the total. Loading and computing 3B parameters per token on 24GB of GDDR6X is very different from loading and computing 35B. You get near-35B quality at near-3B inference cost.
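The gap between total and active parameters can be made concrete with back-of-envelope arithmetic. This sketch assumes the rough rule of ~2 FLOPs per active parameter per generated token; the parameter counts come from the article's description (27B dense vs ~3B active), not official specs.

```python
# Rough per-token compute comparison: dense vs sparse MoE (illustrative only).
def per_token_gflops(active_params_b: float) -> float:
    """~2 FLOPs per active parameter per generated token (multiply + add)."""
    return 2 * active_params_b  # result in GFLOPs since params are in billions

dense_27b = per_token_gflops(27.0)  # dense model: every weight fires every token
moe_a3b = per_token_gflops(3.0)     # MoE: only ~3B of 35B weights fire per token

print(f"dense 27B : {dense_27b:.0f} GFLOPs/token")
print(f"MoE A3B   : {moe_a3b:.0f} GFLOPs/token")
print(f"ratio     : {dense_27b / moe_a3b:.0f}x less compute per token")
```

The naive ratio is 9×, while the measured speedup in the benchmark below is closer to 3×; the difference is plausible because attention, KV-cache reads, and routing overhead don't shrink with expert sparsity.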
The model also has a native context length of 262,144 tokens, extensible to over 1 million. That's not a marketing claim — it's the result of architectural decisions (Gated DeltaNet + RoPE with extended theta) that affect how VRAM consumption scales with context length. More on that in the context window section.
The MoE Advantage: 3× Faster Than Dense at the Same File Size
A concrete comparison illustrates this well. A community benchmark on a single RTX 3090 directly compared Qwen3.5-27B (a dense model) versus Qwen3.5-35B-A3B (MoE) at similar file sizes and identical 120K context length:[1]
| Model | Quant | File Size | VRAM Used | Prompt tok/s | Gen tok/s |
|---|---|---|---|---|---|
| Qwen3.5-27B (dense) | Q4_K_M | 15.58 GiB | 23.79 GiB | 509 | 29 |
| Qwen3.5-35B-A3B (MoE) | UD-Q3_K_XL | 15.45 GiB | 18.68 GiB | 1,408 | 94 |
Same file size. Same context length. The MoE model is 2.7× faster at prefill and 3.2× faster at generation — and uses 5GB less VRAM. The dense 27B is nearly maxing out the 24GB card while the 35B-A3B has comfortable headroom. This is the core reason to choose this model for a 3090-based rig.
Single GPU: Ollama + Unsloth GGUF
On a single RTX 3090, the right tool is Ollama with an Unsloth GGUF. vLLM's GGUF support is experimental at best; to serve with vLLM you'd want AWQ or GPTQ quantization instead, which gives similar VRAM usage but less flexibility on context window and typically less mature quantization quality for this architecture. The Unsloth GGUFs use imatrix-calibrated quantization with updated tool-calling and improved long-context performance as of the March 5, 2026 update.[2]
Unsloth GGUF Quantization Benchmarks on RTX 3090
All benchmarks measured on a single RTX 3090 (24GB), 10K context, all layers on GPU, flash attention enabled:[1]
| Quantization | File Size | Est. VRAM (10K ctx) | Gen tok/s | Perplexity | Notes |
|---|---|---|---|---|---|
| Q3_K_S | 15.3 GB | ~16 GB | 117 | 6.765 | Smallest, lowest quality |
| Q3_K_M | 16.4 GB | ~17 GB | 120 | 6.683 | — |
| UD-Q3_K_XL | 16.6 GB | 18.7 GB | 119 | 6.692 | ✅ Best for max context |
| UD-IQ4_XS | 17.5 GB | ~19 GB | 118 | 6.629 | — |
| UD-IQ4_NL | 17.8 GB | ~19 GB | 120 | 6.630 | — |
| UD-Q4_K_L | 20.2 GB | ~21 GB | 128 | 6.589 | ✅ Best speed + quality balance |
| Q4_K_S | 20.7 GB | ~22 GB | 121 | 6.589 | — |
| Q4_K_M | 22.0 GB | ~23 GB | 121 | 6.559 | Tight — limited context headroom |
| UD-Q4_K_XL | 22.2 GB | ~23 GB | 119 | 6.552 | Best quality, very tight |
Note: UD-Q4_K_M was deleted by Unsloth. UD-Q4_K_L is the intended replacement.
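The table above reduces to a simple selection rule: take the lowest-perplexity quant that still leaves your required VRAM headroom. A minimal sketch, with the VRAM and perplexity figures copied from the 10K-context benchmark; the `headroom_gb` knob is a hypothetical stand-in for KV-cache growth at longer contexts.

```python
# Pick the lowest-perplexity Unsloth quant that fits a VRAM budget.
QUANTS = [  # (name, est. VRAM at 10K ctx in GB, perplexity) -- from the table above
    ("Q3_K_S", 16.0, 6.765),
    ("Q3_K_M", 17.0, 6.683),
    ("UD-Q3_K_XL", 18.7, 6.692),
    ("UD-IQ4_XS", 19.0, 6.629),
    ("UD-IQ4_NL", 19.0, 6.630),
    ("UD-Q4_K_L", 21.0, 6.589),
    ("Q4_K_S", 22.0, 6.589),
    ("Q4_K_M", 23.0, 6.559),
    ("UD-Q4_K_XL", 23.0, 6.552),
]

def best_quant(vram_budget_gb: float, headroom_gb: float = 1.0) -> str:
    """Lowest-perplexity quant whose 10K-ctx VRAM leaves the requested headroom."""
    fitting = [(ppl, name) for name, vram, ppl in QUANTS
               if vram + headroom_gb <= vram_budget_gb]
    return min(fitting)[1] if fitting else "nothing fits"

print(best_quant(24.0))                   # tight fit on a 3090: UD-Q4_K_XL
print(best_quant(24.0, headroom_gb=5.0))  # with long-context margin: UD-IQ4_XS
```

Note this picks purely on perplexity; the article's UD-Q3_K_XL recommendation additionally weights measured (rather than estimated) VRAM at very long contexts.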
Context Window Deep Dive
The MoE architecture's lower VRAM usage per token means far more headroom for the KV cache — which is what determines how long a context you can actually serve. The benchmark at full native 262,144 token context on a single 3090:[1]
| Model | Context | VRAM Used | Gen tok/s |
|---|---|---|---|
| UD-Q3_K_XL | 120,000 | 18.68 GiB | 94 |
| UD-Q3_K_XL | 262,144 (max native) | 21.70 GiB | 71 |
Running the full 262K native context costs 3GB of additional VRAM over 120K context — and still fits inside 24GB with 2.3GB to spare. Generation speed drops from 94 to 71 tok/s, which is still excellent for a model of this capability. No other 24GB consumer GPU setup comes close to this context/quality combination.
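The two measured points above imply a per-token context cost, which is worth computing explicitly. Pure arithmetic on the benchmark numbers; nothing here is an official spec.

```python
# Context cost per token implied by 18.68 GiB @ 120K vs 21.70 GiB @ 262,144.
low_ctx, low_gib = 120_000, 18.68
high_ctx, high_gib = 262_144, 21.70

kib_per_token = (high_gib - low_gib) * 1024**2 / (high_ctx - low_ctx)
print(f"~{kib_per_token:.0f} KiB of VRAM per token of context")
```

Roughly 22 KiB per token is unusually low for a model of this size, consistent with the article's note that the Gated DeltaNet layers keep full-attention KV cache small.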
The recommended single-card configuration is UD-Q3_K_XL with num_ctx 262144 and flash_attn true: the full native context window at 71 tok/s generation, using 21.7GB of the 24GB card. This is the sweet spot; no other single-card setup at this price point gives you 262K context at near-35B quality.
When to Move from Ollama to vLLM
Ollama's layer-parallel approach is optimal for single-user interactive use. The moment you need to serve multiple concurrent clients — multiple AI agents, an API endpoint, several users — the calculus changes:
- Ollama processes requests sequentially by default. Client 2 waits for client 1 to finish.
- vLLM's PagedAttention and continuous batching serve multiple requests simultaneously, keeping GPU utilization high across all concurrent sessions.
- vLLM's tensor parallelism across 4 GPUs means the model runs in BF16 (no quantization loss) with 96GB of VRAM headroom — far beyond what a single card provides.
For a multi-agent local stack where several agents are constantly hitting the LLM endpoint, vLLM is the right tool. The decision point is roughly: more than 2 concurrent users → vLLM.
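The "more than 2 concurrent users" rule of thumb can be sketched with a toy queueing model. The 15% per-extra-request batching overhead below is a made-up illustrative number; real continuous-batching cost depends on batch size and prompt mix.

```python
# Toy latency model: sequential serving vs continuous batching.
def sequential_wait(n_clients: int, secs_per_request: float) -> float:
    """Ollama-style default: the last client waits for every request ahead of it."""
    return n_clients * secs_per_request

def batched_wait(n_clients: int, secs_per_request: float, overhead: float = 0.15) -> float:
    """vLLM-style continuous batching: requests run together, each slightly slower."""
    return secs_per_request * (1 + overhead * (n_clients - 1))

request = 500 / 94  # 500 generated tokens at the 94 tok/s measured above
for n in (1, 2, 4, 8):
    print(f"{n} clients: sequential {sequential_wait(n, request):5.1f}s"
          f" vs batched {batched_wait(n, request):4.1f}s")
```

Even under this crude model the sequential queue grows linearly while the batched latency grows slowly, which is why the crossover point arrives at just a handful of concurrent clients.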
PCIe Topology: The Hidden Variable in Multi-GPU Performance
A typical consumer or HEDT 4-GPU build does not give all four GPUs equal PCIe bandwidth. The physical slot layout usually looks like this:
| Slot | Lanes | Effective Bandwidth | Typical on consumer board |
|---|---|---|---|
| GPU 0 (primary) | x16 | ~32 GB/s (PCIe 4.0) | CPU-direct, full speed |
| GPU 1 (second full-size) | x8 (electrical) | ~16 GB/s | x16 physical, x8 electrical |
| GPU 2 | x4 | ~8 GB/s | Smaller slot or PLX-switched |
| GPU 3 | x4 | ~8 GB/s | Smaller slot or PLX-switched |
This asymmetry has no effect on layer-parallel (pipeline) workloads because inter-GPU traffic is minimal — just an activation tensor at layer boundaries. It has a significant effect on tensor parallelism, where every GPU must participate in an all-reduce synchronization on every layer, every token. The all-reduce is bottlenecked by the slowest link in the ring. If GPUs 2 and 3 communicate at 8 GB/s, your entire 4-GPU tensor parallel run is capped by 8 GB/s inter-GPU bandwidth regardless of how fast GPUs 0 and 1 are.
For a dense model like Llama-70B, this is a serious problem. For Qwen3.5-35B-A3B — MoE, only 3B active params — the all-reduce communication volume per token is proportionally much smaller than a dense 35B, so the PCIe x4 bottleneck is less severe in practice. But it's still present and worth engineering around.
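The slowest-link effect can be quantified with the standard ring all-reduce cost model: each GPU moves 2·(N−1)/N of the tensor, and every step runs at the pace of the slowest hop. The hidden size (4096) and layer count (48) below are assumptions for scale, not published specs for this model.

```python
# Bandwidth term of a ring all-reduce, bottlenecked by the slowest link.
def ring_allreduce_s(tensor_bytes: int, n_gpus: int, slowest_gb_s: float) -> float:
    traffic = 2 * (n_gpus - 1) / n_gpus * tensor_bytes  # per-GPU send volume
    return traffic / (slowest_gb_s * 1e9)

HIDDEN, LAYERS, FP16 = 4096, 48, 2
decode_tensor = HIDDEN * FP16  # one token's activations: 8 KiB
per_token_s = 2 * LAYERS * ring_allreduce_s(decode_tensor, 4, 8.0)  # ~2 all-reduces/layer

print(f"{per_token_s * 1e3:.2f} ms of all-reduce per decoded token at x4 PCIe")
```

At batch size 1 this bandwidth term is tiny next to the ~14 ms/token of a 71 tok/s decode, but prefill multiplies it by the prompt length, and per-transfer launch latency (ignored here) often dominates for many small messages over PCIe.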
The Three Parallelism Strategies
Tensor Parallelism (TP)
Every layer runs on all GPUs simultaneously. Each GPU handles a horizontal slice of weight matrices and attention heads. Results synchronize via all-reduce after each layer. Requires high-bandwidth inter-GPU communication. PCIe x4 hurts here. Best for: low latency, homogeneous high-bandwidth hardware.
Pipeline Parallelism (PP)
Model layers are distributed across GPUs in sequence. GPU 0 runs layers 0–9, GPU 1 runs layers 10–19, etc. A token passes through GPUs in order. Inter-GPU traffic is one activation tensor per GPU boundary — kilobytes, not gigabytes. PCIe x4 is completely fine here. Downside: sequentiality adds latency; GPUs sit idle while waiting for the previous stage.
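The "kilobytes, not gigabytes" claim is easy to check: a stage handoff is one activation tensor. Hidden size 4096 and fp16 are again assumptions for illustration.

```python
# Size of one pipeline-stage handoff.
hidden, dtype_bytes = 4096, 2
decode_handoff = hidden * dtype_bytes          # one token crossing a boundary
prefill_handoff = 4096 * hidden * dtype_bytes  # a 4K-token prompt, all at once

print(f"decode : {decode_handoff / 1024:.0f} KiB per token per boundary")
print(f"prefill: {prefill_handoff / 2**20:.0f} MiB for a 4K prompt per boundary")
```

Even the 32 MiB prefill case is only a few milliseconds over a PCIe 4.0 x4 link at ~8 GB/s, which is why pipeline stages tolerate slow slots.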
Hybrid TP + PP
Combines both. A subset of GPUs forms a tensor-parallel group (fast, NVLink-connected), and multiple such groups form a pipeline. Gives you tensor-parallel speed on the fast lanes while using pipeline parallelism across the slow PCIe connections. This is the optimal strategy for mixed PCIe slot rigs.
The Recommended Strategy: TP=2 + PP=2
For a 4× RTX 3090 rig with two full-speed slots and two x4 slots:
Hardware Setup
- GPUs 0 + 1: full-size PCIe slots, NVLink bridge connected (112 GB/s bidirectional)
- GPUs 2 + 3: smaller slots, x4 PCIe (~8 GB/s), NVLink optional (second pair)
With --tensor-parallel-size 2 --pipeline-parallel-size 2:
- GPUs 0+1 form one tensor-parallel group — all-reduce happens over NVLink at 112 GB/s ✅
- GPUs 2+3 form another tensor-parallel group — their all-reduce also uses NVLink if bridged ✅
- The two groups communicate as pipeline stages — only activation tensors cross the slow PCIe x4 lanes ✅
- Total VRAM pool: 4 × 24GB = 96GB — run BF16 with no quantization penalty
This routes the high-bandwidth communication (all-reduce) through NVLink and the low-bandwidth communication (pipeline stage handoffs) through PCIe. The x4 slots contribute their VRAM without becoming a bottleneck for the hot path.
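The GPU-to-group assignment this relies on can be sketched as below. The convention shown (consecutive visible device indices form a tensor-parallel group, groups chain into the pipeline) is the common one, but it is an assumption here; confirm against vLLM's startup logs for your version.

```python
# Assumed rank layout under TP=2 x PP=2 over CUDA_VISIBLE_DEVICES=0,1,2,3.
def placement(n_gpus: int, tp: int) -> dict:
    """Map each visible GPU index to (pipeline stage, tensor-parallel rank)."""
    return {gpu: (gpu // tp, gpu % tp) for gpu in range(n_gpus)}

for gpu, (stage, rank) in placement(4, tp=2).items():
    print(f"GPU {gpu}: pipeline stage {stage}, tensor-parallel rank {rank}")
```

Under this layout GPUs 0+1 land in stage 0 and GPUs 2+3 in stage 1, matching the NVLink pairing described above.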
Ready-to-Run Commands
Option A: Single RTX 3090 — Ollama + Unsloth GGUF
```bash
# Download the GGUF (best context window option)
huggingface-cli download unsloth/Qwen3.5-35B-A3B-GGUF \
  Qwen3.5-35B-A3B-UD-Q3_K_XL.gguf \
  --local-dir ~/models/qwen3.5-35b

# Create Modelfile
cat > ~/models/qwen3.5-35b/Modelfile << 'EOF'
FROM ./Qwen3.5-35B-A3B-UD-Q3_K_XL.gguf
PARAMETER num_ctx 262144
PARAMETER num_gpu 99
PARAMETER flash_attn true
EOF

# Import and run
ollama create qwen3.5-35b-q3xl -f ~/models/qwen3.5-35b/Modelfile
ollama run qwen3.5-35b-q3xl
```
```bash
# Alternative: best quality/speed balance on single card
huggingface-cli download unsloth/Qwen3.5-35B-A3B-GGUF \
  Qwen3.5-35B-A3B-UD-Q4_K_L.gguf \
  --local-dir ~/models/qwen3.5-35b

# This one fits ~80–100K context comfortably
cat > ~/models/qwen3.5-35b/Modelfile-q4 << 'EOF'
FROM ./Qwen3.5-35B-A3B-UD-Q4_K_L.gguf
PARAMETER num_ctx 81920
PARAMETER num_gpu 99
PARAMETER flash_attn true
EOF

ollama create qwen3.5-35b-q4kl -f ~/models/qwen3.5-35b/Modelfile-q4
```
Option B: 4× RTX 3090 — vLLM, Pipeline Parallel (simplest safe option)
```bash
# Pure pipeline parallelism — PCIe speed doesn't matter at all
vllm serve Qwen/Qwen3.5-35B-A3B \
  --pipeline-parallel-size 4 \
  --max-model-len 65536 \
  --enable-prefix-caching \
  --host 0.0.0.0 \
  --port 8000
```
Option C: 4× RTX 3090 — vLLM, Hybrid TP=2+PP=2 (recommended)
```bash
# Tensor parallel within NVLink pairs, pipeline parallel between pairs
# Set CUDA_VISIBLE_DEVICES so NVLink pairs are adjacent
CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve Qwen/Qwen3.5-35B-A3B \
  --tensor-parallel-size 2 \
  --pipeline-parallel-size 2 \
  --max-model-len 131072 \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.92 \
  --host 0.0.0.0 \
  --port 8000
```
Option D: 4× RTX 3090 — vLLM, Full Tensor Parallel (if you want to test)
```bash
# All 4 GPUs tensor parallel — x4 PCIe slots will bottleneck all-reduce
# Try this first; if throughput is disappointing, fall back to Option C
vllm serve Qwen/Qwen3.5-35B-A3B \
  --tensor-parallel-size 4 \
  --max-model-len 131072 \
  --enable-prefix-caching \
  --host 0.0.0.0 \
  --port 8000
```
To decide between the two, run `vllm bench throughput` with TP=4 vs TP=2+PP=2. If TP=2+PP=2 is faster, your x4 slots are the bottleneck. If the results are similar, the MoE's small all-reduce volume is surviving fine on PCIe x4.
Disabling Thinking Mode (for agent workloads)
Qwen3.5 includes a reasoning/thinking mode that generates chain-of-thought tokens. For agent workloads where you want fast tool-calling responses rather than extended reasoning, disable it:
```bash
# With vLLM / OpenAI-compatible API
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3.5-35B-A3B",
    "chat_template_kwargs": {"enable_thinking": false},
    "messages": [{"role": "user", "content": "your prompt"}]
  }'
```
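The same request from Python, using only the standard library. The payload mirrors the curl example above, and the endpoint and model name match the vLLM commands in this guide; the network call is commented out so the snippet runs offline.

```python
import json
import urllib.request

# Build the OpenAI-compatible chat payload with thinking mode disabled.
payload = {
    "model": "Qwen/Qwen3.5-35B-A3B",
    "chat_template_kwargs": {"enable_thinking": False},
    "messages": [{"role": "user", "content": "your prompt"}],
}

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# Uncomment with a running vLLM server:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```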
Decision Matrix
| Use Case | Hardware | Stack | Config |
|---|---|---|---|
| Single user, interactive, max context | 1× RTX 3090 | Ollama | UD-Q3_K_XL, ctx=262144, flash_attn |
| Single user, best quality | 1× RTX 3090 | Ollama | UD-Q4_K_L, ctx=81920 |
| Multi-agent, max throughput, mixed PCIe rig | 4× RTX 3090 | vLLM | TP=2, PP=2, BF16, prefix cache |
| Multi-agent, simplicity over speed | 4× RTX 3090 | vLLM | PP=4, BF16, prefix cache |
| Experimenting / comparing configs | 4× RTX 3090 | vLLM | Try TP=4 first, benchmark vs TP=2+PP=2 |
References
- [1] LocalLLaMA Reddit — "Benchmarked all unsloth Qwen3.5-35B-A3B Q4 models on a 3090" — Community benchmark with VRAM usage, generation speed, and perplexity for all Q3–Q4 quants on a single RTX 3090. Includes 27B vs 35B-A3B head-to-head at 120K context and full 262K native context test.
- [2] Unsloth — Qwen3.5-35B-A3B-GGUF on Hugging Face — Official Unsloth GGUF repository with imatrix quantization (Mar 5 update), tool-calling fixes, and model architecture details.
- Unsloth — Qwen3.5 Run Locally Guide — Official documentation for running Qwen3.5 models locally with Unsloth GGUFs, including thinking mode toggle and Ollama setup.
- vLLM Documentation — Distributed Inference — Official docs for tensor parallelism, pipeline parallelism, and hybrid configurations.