Why Qwen3.5-35B-A3B
Qwen3.5-35B-A3B is an unusual model. The "35B" refers to total parameter count; the "A3B" tells you that only approximately 3 billion of those parameters are active on any given forward pass. This is Alibaba's Qwen3.5 flagship using a sparse Mixture-of-Experts architecture: 256 experts are defined, but only 8 routed + 1 shared activate per token. The rest of the weights sit in VRAM unused for that token.
Why does this matter for hardware planning? Because the two numbers that govern inference performance — memory required and compute per token — are both determined by the active parameter count, not the total. Loading and computing 3B parameters per token on 24GB of GDDR6X is very different from loading and computing 35B. You get near-35B quality at near-3B inference cost.
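The gap between total and active parameters can be made concrete with back-of-envelope arithmetic. This sketch assumes the rough rule of ~2 FLOPs per active parameter per generated token; the parameter counts come from the article's description (27B dense vs ~3B active), not official specs.

```python
# Rough per-token compute comparison: dense vs sparse MoE (illustrative only).
def per_token_gflops(active_params_b: float) -> float:
    """~2 FLOPs per active parameter per generated token (multiply + add)."""
    return 2 * active_params_b  # result in GFLOPs since params are in billions

dense_27b = per_token_gflops(27.0)  # dense model: every weight fires every token
moe_a3b = per_token_gflops(3.0)     # MoE: only ~3B of 35B weights fire per token

print(f"dense 27B : {dense_27b:.0f} GFLOPs/token")
print(f"MoE A3B   : {moe_a3b:.0f} GFLOPs/token")
print(f"ratio     : {dense_27b / moe_a3b:.0f}x less compute per token")
```

The naive ratio is 9×, while the measured speedup in the benchmark below is closer to 3×; the difference is plausible because attention, KV-cache reads, and routing overhead don't shrink with expert sparsity.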
The model also has a native context length of 262,144 tokens, extensible to over 1 million. That's not a marketing claim — it's the result of architectural decisions (Gated DeltaNet + RoPE with extended theta) that affect how VRAM consumption scales with context length. More on that in the context window section.
The MoE Advantage: 3× Faster Than Dense at the Same File Size
A concrete comparison illustrates this well. A community benchmark on a single RTX 3090 directly compared Qwen3.5-27B (a dense model) versus Qwen3.5-35B-A3B (MoE) at similar file sizes and identical 120K context length:[1]
| Model | Quant | File Size | VRAM Used | Prompt tok/s | Gen tok/s |
|---|---|---|---|---|---|
| Qwen3.5-27B (dense) | Q4_K_M | 15.58 GiB | 23.79 GiB | 509 | 29 |
| Qwen3.5-35B-A3B (MoE) | UD-Q3_K_XL | 15.45 GiB | 18.68 GiB | 1,408 | 94 |
Same file size. Same context length. The MoE model is 2.7× faster at prefill and 3.2× faster at generation — and uses 5GB less VRAM. The dense 27B is nearly maxing out the 24GB card while the 35B-A3B has comfortable headroom. This is the core reason to choose this model for a 3090-based rig.
Single GPU: Ollama + Unsloth GGUF
On a single RTX 3090, the right tool is Ollama with an Unsloth GGUF. vLLM's GGUF support is experimental at best; to serve with vLLM you'd want AWQ or GPTQ quantization instead, which gives similar VRAM usage but less flexibility on context window and typically less mature quantization quality for this architecture. The Unsloth GGUFs use imatrix-calibrated quantization with updated tool-calling and improved long-context performance as of the March 5, 2026 update.[2]
Unsloth GGUF Quantization Benchmarks on RTX 3090
All benchmarks measured on a single RTX 3090 (24GB), 10K context, all layers on GPU, flash attention enabled:[1]
| Quantization | File Size | Est. VRAM (10K ctx) | Gen tok/s | Perplexity | Notes |
|---|---|---|---|---|---|
| Q3_K_S | 15.3 GB | ~16 GB | 117 | 6.765 | Smallest, lowest quality |
| Q3_K_M | 16.4 GB | ~17 GB | 120 | 6.683 | — |
| UD-Q3_K_XL | 16.6 GB | 18.7 GB | 119 | 6.692 | ✅ Best for max context |
| UD-IQ4_XS | 17.5 GB | ~19 GB | 118 | 6.629 | — |
| UD-IQ4_NL | 17.8 GB | ~19 GB | 120 | 6.630 | — |
| UD-Q4_K_L | 20.2 GB | ~21 GB | 128 | 6.589 | ✅ Best speed + quality balance |
| Q4_K_S | 20.7 GB | ~22 GB | 121 | 6.589 | — |
| Q4_K_M | 22.0 GB | ~23 GB | 121 | 6.559 | Tight — limited context headroom |
| UD-Q4_K_XL | 22.2 GB | ~23 GB | 119 | 6.552 | Best quality, very tight |
Note: UD-Q4_K_M was deleted by Unsloth. UD-Q4_K_L is the intended replacement.
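The table above reduces to a simple selection rule: take the lowest-perplexity quant that still leaves your required VRAM headroom. A minimal sketch, with the VRAM and perplexity figures copied from the 10K-context benchmark; the `headroom_gb` knob is a hypothetical stand-in for KV-cache growth at longer contexts.

```python
# Pick the lowest-perplexity Unsloth quant that fits a VRAM budget.
QUANTS = [  # (name, est. VRAM at 10K ctx in GB, perplexity) -- from the table above
    ("Q3_K_S", 16.0, 6.765),
    ("Q3_K_M", 17.0, 6.683),
    ("UD-Q3_K_XL", 18.7, 6.692),
    ("UD-IQ4_XS", 19.0, 6.629),
    ("UD-IQ4_NL", 19.0, 6.630),
    ("UD-Q4_K_L", 21.0, 6.589),
    ("Q4_K_S", 22.0, 6.589),
    ("Q4_K_M", 23.0, 6.559),
    ("UD-Q4_K_XL", 23.0, 6.552),
]

def best_quant(vram_budget_gb: float, headroom_gb: float = 1.0) -> str:
    """Lowest-perplexity quant whose 10K-ctx VRAM leaves the requested headroom."""
    fitting = [(ppl, name) for name, vram, ppl in QUANTS
               if vram + headroom_gb <= vram_budget_gb]
    return min(fitting)[1] if fitting else "nothing fits"

print(best_quant(24.0))                   # tight fit on a 3090: UD-Q4_K_XL
print(best_quant(24.0, headroom_gb=5.0))  # with long-context margin: UD-IQ4_XS
```

Note this picks purely on perplexity; the article's UD-Q3_K_XL recommendation additionally weights measured (rather than estimated) VRAM at very long contexts.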
Context Window Deep Dive
The MoE architecture's lower VRAM usage per token means far more headroom for the KV cache — which is what determines how long a context you can actually serve. The benchmark at full native 262,144 token context on a single 3090:[1]
| Model | Context | VRAM Used | Gen tok/s |
|---|---|---|---|
| UD-Q3_K_XL | 120,000 | 18.68 GiB | 94 |
| UD-Q3_K_XL | 262,144 (max native) | 21.70 GiB | 71 |
Running the full 262K native context costs 3GB of additional VRAM over 120K context — and still fits inside 24GB with 2.3GB to spare. Generation speed drops from 94 to 71 tok/s, which is still excellent for a model of this capability. No other 24GB consumer GPU setup comes close to this context/quality combination.
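The two measured points above imply a per-token context cost, which is worth computing explicitly. Pure arithmetic on the benchmark numbers; nothing here is an official spec.

```python
# Context cost per token implied by 18.68 GiB @ 120K vs 21.70 GiB @ 262,144.
low_ctx, low_gib = 120_000, 18.68
high_ctx, high_gib = 262_144, 21.70

kib_per_token = (high_gib - low_gib) * 1024**2 / (high_ctx - low_ctx)
print(f"~{kib_per_token:.0f} KiB of VRAM per token of context")
```

Roughly 22 KiB per token is unusually low for a model of this size, consistent with the article's note that the Gated DeltaNet layers keep full-attention KV cache small.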
The recommended single-card configuration is UD-Q3_K_XL with num_ctx 262144 and flash_attn true: the full native context window at 71 tok/s generation, using 21.7GB of the 24GB card. This is the sweet spot; no other single-card setup at this price point gives you 262K context at near-35B quality.
When to Move from Ollama to vLLM
Ollama's layer-parallel approach is optimal for single-user interactive use. The moment you need to serve multiple concurrent clients — multiple AI agents, an API endpoint, several users — the calculus changes:
- Ollama processes requests sequentially by default. Client 2 waits for client 1 to finish.
- vLLM's PagedAttention and continuous batching serve multiple requests simultaneously, keeping GPU utilization high across all concurrent sessions.
- vLLM's tensor parallelism across 4 GPUs means the model runs in BF16 (no quantization loss) with 96GB of VRAM headroom — far beyond what a single card provides.
For a multi-agent local stack where several agents are constantly hitting the LLM endpoint, vLLM is the right tool. The decision point is roughly: more than 2 concurrent users → vLLM.
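The "more than 2 concurrent users" rule of thumb can be sketched with a toy queueing model. The 15% per-extra-request batching overhead below is a made-up illustrative number; real continuous-batching cost depends on batch size and prompt mix.

```python
# Toy latency model: sequential serving vs continuous batching.
def sequential_wait(n_clients: int, secs_per_request: float) -> float:
    """Ollama-style default: the last client waits for every request ahead of it."""
    return n_clients * secs_per_request

def batched_wait(n_clients: int, secs_per_request: float, overhead: float = 0.15) -> float:
    """vLLM-style continuous batching: requests run together, each slightly slower."""
    return secs_per_request * (1 + overhead * (n_clients - 1))

request = 500 / 94  # 500 generated tokens at the 94 tok/s measured above
for n in (1, 2, 4, 8):
    print(f"{n} clients: sequential {sequential_wait(n, request):5.1f}s"
          f" vs batched {batched_wait(n, request):4.1f}s")
```

Even under this crude model the sequential queue grows linearly while the batched latency grows slowly, which is why the crossover point arrives at just a handful of concurrent clients.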
PCIe Topology: The Hidden Variable in Multi-GPU Performance
A typical consumer or HEDT 4-GPU build does not give all four GPUs equal PCIe bandwidth. The physical slot layout usually looks like this:
| Slot | Lanes | Effective Bandwidth | Typical on consumer board |
|---|---|---|---|
| GPU 0 (primary) | x16 | ~32 GB/s (PCIe 4.0) | CPU-direct, full speed |
| GPU 1 (second full-size) | x8 (electrical) | ~16 GB/s | x16 physical, x8 electrical |
| GPU 2 | x4 | ~8 GB/s | Smaller slot or PLX-switched |
| GPU 3 | x4 | ~8 GB/s | Smaller slot or PLX-switched |
This asymmetry has no effect on layer-parallel (pipeline) workloads because inter-GPU traffic is minimal — just an activation tensor at layer boundaries. It has a significant effect on tensor parallelism, where every GPU must participate in an all-reduce synchronization on every layer, every token. The all-reduce is bottlenecked by the slowest link in the ring. If GPUs 2 and 3 communicate at 8 GB/s, your entire 4-GPU tensor parallel run is capped by 8 GB/s inter-GPU bandwidth regardless of how fast GPUs 0 and 1 are.
For a dense model like Llama-70B, this is a serious problem. For Qwen3.5-35B-A3B — MoE, only 3B active params — the all-reduce communication volume per token is proportionally much smaller than a dense 35B, so the PCIe x4 bottleneck is less severe in practice. But it's still present and worth engineering around.
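The slowest-link effect can be quantified with the standard ring all-reduce cost model: each GPU moves 2·(N−1)/N of the tensor, and every step runs at the pace of the slowest hop. The hidden size (4096) and layer count (48) below are assumptions for scale, not published specs for this model.

```python
# Bandwidth term of a ring all-reduce, bottlenecked by the slowest link.
def ring_allreduce_s(tensor_bytes: int, n_gpus: int, slowest_gb_s: float) -> float:
    traffic = 2 * (n_gpus - 1) / n_gpus * tensor_bytes  # per-GPU send volume
    return traffic / (slowest_gb_s * 1e9)

HIDDEN, LAYERS, FP16 = 4096, 48, 2
decode_tensor = HIDDEN * FP16  # one token's activations: 8 KiB
per_token_s = 2 * LAYERS * ring_allreduce_s(decode_tensor, 4, 8.0)  # ~2 all-reduces/layer

print(f"{per_token_s * 1e3:.2f} ms of all-reduce per decoded token at x4 PCIe")
```

At batch size 1 this bandwidth term is tiny next to the ~14 ms/token of a 71 tok/s decode, but prefill multiplies it by the prompt length, and per-transfer launch latency (ignored here) often dominates for many small messages over PCIe.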
The Three Parallelism Strategies
Tensor Parallelism (TP)
Every layer runs on all GPUs simultaneously. Each GPU handles a horizontal slice of weight matrices and attention heads. Results synchronize via all-reduce after each layer. Requires high-bandwidth inter-GPU communication. PCIe x4 hurts here. Best for: low latency, homogeneous high-bandwidth hardware.
Pipeline Parallelism (PP)
Model layers are distributed across GPUs in sequence. GPU 0 runs layers 0–9, GPU 1 runs layers 10–19, etc. A token passes through GPUs in order. Inter-GPU traffic is one activation tensor per GPU boundary — kilobytes, not gigabytes. PCIe x4 is completely fine here. Downside: sequentiality adds latency; GPUs sit idle while waiting for the previous stage.
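The "kilobytes, not gigabytes" claim is easy to check: a stage handoff is one activation tensor. Hidden size 4096 and fp16 are again assumptions for illustration.

```python
# Size of one pipeline-stage handoff.
hidden, dtype_bytes = 4096, 2
decode_handoff = hidden * dtype_bytes          # one token crossing a boundary
prefill_handoff = 4096 * hidden * dtype_bytes  # a 4K-token prompt, all at once

print(f"decode : {decode_handoff / 1024:.0f} KiB per token per boundary")
print(f"prefill: {prefill_handoff / 2**20:.0f} MiB for a 4K prompt per boundary")
```

Even the 32 MiB prefill case is only a few milliseconds over a PCIe 4.0 x4 link at ~8 GB/s, which is why pipeline stages tolerate slow slots.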
Hybrid TP + PP
Combines both. A subset of GPUs forms a tensor-parallel group (fast, NVLink-connected), and multiple such groups form a pipeline. Gives you tensor-parallel speed on the fast lanes while using pipeline parallelism across the slow PCIe connections. This is the optimal strategy for mixed PCIe slot rigs.
The Recommended Strategy: TP=2 + PP=2
For a 4× RTX 3090 rig with two full-speed slots and two x4 slots:
Hardware Setup
- GPUs 0 + 1: full-size PCIe slots, NVLink bridge connected (112 GB/s bidirectional)
- GPUs 2 + 3: smaller slots, x4 PCIe (~8 GB/s), NVLink optional (second pair)
With --tensor-parallel-size 2 --pipeline-parallel-size 2:
- GPUs 0+1 form one tensor-parallel group — all-reduce happens over NVLink at 112 GB/s ✅
- GPUs 2+3 form another tensor-parallel group — their all-reduce also uses NVLink if bridged ✅
- The two groups communicate as pipeline stages — only activation tensors cross the slow PCIe x4 lanes ✅
- Total VRAM pool: 4 × 24GB = 96GB — run BF16 with no quantization penalty
This routes the high-bandwidth communication (all-reduce) through NVLink and the low-bandwidth communication (pipeline stage handoffs) through PCIe. The x4 slots contribute their VRAM without becoming a bottleneck for the hot path.
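The GPU-to-group assignment this relies on can be sketched as below. The convention shown (consecutive visible device indices form a tensor-parallel group, groups chain into the pipeline) is the common one, but it is an assumption here; confirm against vLLM's startup logs for your version.

```python
# Assumed rank layout under TP=2 x PP=2 over CUDA_VISIBLE_DEVICES=0,1,2,3.
def placement(n_gpus: int, tp: int) -> dict:
    """Map each visible GPU index to (pipeline stage, tensor-parallel rank)."""
    return {gpu: (gpu // tp, gpu % tp) for gpu in range(n_gpus)}

for gpu, (stage, rank) in placement(4, tp=2).items():
    print(f"GPU {gpu}: pipeline stage {stage}, tensor-parallel rank {rank}")
```

Under this layout GPUs 0+1 land in stage 0 and GPUs 2+3 in stage 1, matching the NVLink pairing described above.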
Ready-to-Run Commands
Option A: Single RTX 3090 — Ollama + Unsloth GGUF
```bash
# Download the GGUF (best context window option)
huggingface-cli download unsloth/Qwen3.5-35B-A3B-GGUF \
  Qwen3.5-35B-A3B-UD-Q3_K_XL.gguf \
  --local-dir ~/models/qwen3.5-35b

# Create Modelfile
cat > ~/models/qwen3.5-35b/Modelfile << 'EOF'
FROM ./Qwen3.5-35B-A3B-UD-Q3_K_XL.gguf
PARAMETER num_ctx 262144
PARAMETER num_gpu 99
PARAMETER flash_attn true
EOF

# Import and run
ollama create qwen3.5-35b-q3xl -f ~/models/qwen3.5-35b/Modelfile
ollama run qwen3.5-35b-q3xl
```
```bash
# Alternative: best quality/speed balance on single card
huggingface-cli download unsloth/Qwen3.5-35B-A3B-GGUF \
  Qwen3.5-35B-A3B-UD-Q4_K_L.gguf \
  --local-dir ~/models/qwen3.5-35b

# This one fits ~80–100K context comfortably
cat > ~/models/qwen3.5-35b/Modelfile-q4 << 'EOF'
FROM ./Qwen3.5-35B-A3B-UD-Q4_K_L.gguf
PARAMETER num_ctx 81920
PARAMETER num_gpu 99
PARAMETER flash_attn true
EOF

ollama create qwen3.5-35b-q4kl -f ~/models/qwen3.5-35b/Modelfile-q4
```
Option B: 4× RTX 3090 — vLLM, Pipeline Parallel (simplest safe option)
```bash
# Pure pipeline parallelism — PCIe speed doesn't matter at all
vllm serve Qwen/Qwen3.5-35B-A3B \
  --pipeline-parallel-size 4 \
  --max-model-len 65536 \
  --enable-prefix-caching \
  --host 0.0.0.0 \
  --port 8000
```
Option C: 4× RTX 3090 — vLLM, Hybrid TP=2+PP=2 (recommended)
```bash
# Tensor parallel within NVLink pairs, pipeline parallel between pairs
# Set CUDA_VISIBLE_DEVICES so NVLink pairs are adjacent
CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve Qwen/Qwen3.5-35B-A3B \
  --tensor-parallel-size 2 \
  --pipeline-parallel-size 2 \
  --max-model-len 131072 \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.92 \
  --host 0.0.0.0 \
  --port 8000
```
Option D: 4× RTX 3090 — vLLM, Full Tensor Parallel (if you want to test)
```bash
# All 4 GPUs tensor parallel — x4 PCIe slots will bottleneck all-reduce
# Try this first; if throughput is disappointing, fall back to Option C
vllm serve Qwen/Qwen3.5-35B-A3B \
  --tensor-parallel-size 4 \
  --max-model-len 131072 \
  --enable-prefix-caching \
  --host 0.0.0.0 \
  --port 8000
```
To decide between the two, run `vllm bench throughput` with TP=4 vs TP=2+PP=2. If TP=2+PP=2 is faster, your x4 slots are the bottleneck. If the results are similar, the MoE's small all-reduce volume is surviving fine on PCIe x4.
Disabling Thinking Mode (for agent workloads)
Qwen3.5 includes a reasoning/thinking mode that generates chain-of-thought tokens. For agent workloads where you want fast tool-calling responses rather than extended reasoning, disable it:
```bash
# With vLLM / OpenAI-compatible API
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3.5-35B-A3B",
    "chat_template_kwargs": {"enable_thinking": false},
    "messages": [{"role": "user", "content": "your prompt"}]
  }'
```
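The same request from Python, using only the standard library. The payload mirrors the curl example above, and the endpoint and model name match the vLLM commands in this guide; the network call is commented out so the snippet runs offline.

```python
import json
import urllib.request

# Build the OpenAI-compatible chat payload with thinking mode disabled.
payload = {
    "model": "Qwen/Qwen3.5-35B-A3B",
    "chat_template_kwargs": {"enable_thinking": False},
    "messages": [{"role": "user", "content": "your prompt"}],
}

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# Uncomment with a running vLLM server:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```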
Decision Matrix
| Use Case | Hardware | Stack | Config |
|---|---|---|---|
| Single user, interactive, max context | 1× RTX 3090 | Ollama | UD-Q3_K_XL, ctx=262144, flash_attn |
| Single user, best quality | 1× RTX 3090 | Ollama | UD-Q4_K_L, ctx=81920 |
| Multi-agent, max throughput, mixed PCIe rig | 4× RTX 3090 | vLLM | TP=2, PP=2, BF16, prefix cache |
| Multi-agent, simplicity over speed | 4× RTX 3090 | vLLM | PP=4, BF16, prefix cache |
| Experimenting / comparing configs | 4× RTX 3090 | vLLM | Try TP=4 first, benchmark vs TP=2+PP=2 |
References
- [1] LocalLLaMA Reddit — "Benchmarked all unsloth Qwen3.5-35B-A3B Q4 models on a 3090" — Community benchmark with VRAM usage, generation speed, and perplexity for all Q3–Q4 quants on a single RTX 3090. Includes 27B vs 35B-A3B head-to-head at 120K context and full 262K native context test.
- [2] Unsloth — Qwen3.5-35B-A3B-GGUF on Hugging Face — Official Unsloth GGUF repository with imatrix quantization (Mar 5 update), tool-calling fixes, and model architecture details.
- Unsloth — Qwen3.5 Run Locally Guide — Official documentation for running Qwen3.5 models locally with Unsloth GGUFs, including thinking mode toggle and Ollama setup.
- vLLM Documentation — Distributed Inference — Official docs for tensor parallelism, pipeline parallelism, and hybrid configurations.