The Speed Paradox
A 35-billion-parameter model running at 90 tokens per second on a single RTX 3090. A community member on r/LocalLLaMA tested it on a real coding task (a complex PDF merger app with drag-and-drop, dark GUI, venv isolation, and a .bat installer), and it delivered a working app in 3 outputs, while GPT-5 failed all three attempts.
How is a 35B model that fast? How does it beat a model with far more compute budget at a real-world coding task? The answer is Mixture-of-Experts (MoE), and understanding it changes how you think about model selection, hardware requirements, and the trajectory of local AI.
Dense Transformers: The Compute Wall
Every standard transformer-based LLM (GPT, Llama, Mistral, original Claude) uses what's called a dense architecture. Every input token passes through every parameter in every layer, every time. The model has N parameters, and processing one token costs compute proportional to N.
This creates a hard tradeoff: more parameters means better quality, but also proportionally more compute, more memory bandwidth consumption, and slower token generation. A 70B dense model needs 70B worth of weight reads from VRAM per forward pass. A 7B dense model needs 7B. The relationship is linear and unavoidable.
For local deployment on consumer hardware, this ceiling became painful fast. A 70B model at 4-bit quantization requires ~40GB of VRAM, more than any single consumer GPU. Even at 8-bit, a 34B model barely fits on a dual-3090 setup. And fitting it isn't the same as running it fast: memory bandwidth determines speed, and consumer GPUs have limited bandwidth.
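The memory figures above follow from simple arithmetic: parameter count times bits per weight. The sketch below is a rough estimate only; the function name `weight_memory_gb` is made up for illustration, and it counts raw weights without KV cache, quantization metadata, or runtime buffers.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Raw weight storage in GB (1 GB = 1e9 bytes).

    Excludes KV cache, quantization scales, and runtime overhead.
    """
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


# (label, parameters in billions, bits per weight)
examples = [
    ("70B at 4-bit", 70, 4),
    ("34B at 8-bit", 34, 8),
    ("7B at 16-bit", 7, 16),
]

for label, params, bits in examples:
    print(f"{label}: {weight_memory_gb(params, bits):.1f} GB of weights")
```

The 70B case comes out at 35 GB of raw weights. The gap to the ~40GB quoted above would be consistent with the overhead this sketch ignores.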
The dense architecture hit a wall. MoE is the way through it.
How MoE Routing Works
The core insight of Mixture-of-Experts is simple: not every token needs every neuron. A token representing a Python keyword doesn't need the same neural pathways as a token representing a Chinese character or a mathematical symbol. Different kinds of content benefit from different kinds of processing.
In a standard transformer, the feed-forward network (FFN), the largest component in each layer and roughly two-thirds of its parameters, is a single network that all tokens pass through. MoE replaces this single FFN with multiple independent FFN networks, called experts. A small, lightweight router (gating network) then decides which experts each token should use.
The key constraint: only K of N experts activate per token. If a layer has 8 experts and K=2, each token uses 2 experts and ignores 6. The model learns which routing decisions lead to better outputs during training. Over time, experts specialize: some become better at code, others at reasoning, others at language tasks.
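To make the routing step concrete, here is a toy top-K MoE layer in plain Python. It is a minimal sketch, not any particular model's implementation: the "experts" are stand-in functions that just scale the input, the router is a random linear map, and renormalizing the selected weights to sum to 1 is one common convention assumed here. All names (`route_top_k`, `moe_layer`) are hypothetical.

```python
import math
import random


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Return [(expert_index, weight)] for the k highest-scoring experts.

    Weights are the softmax probabilities of the chosen experts,
    renormalized to sum to 1 (an assumed convention).
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_layer(token, router_weights, experts, k=2):
    """Run one token through a toy MoE layer; only k experts are evaluated."""
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    out = [0.0] * len(token)
    for idx, weight in route_top_k(logits, k):
        expert_out = experts[idx](token)  # the other experts are never called
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out


# Toy setup: 8 experts, each scaling the input by a different factor.
random.seed(0)
dim, n_experts = 4, 8
experts = [lambda x, s=s: [s * v for v in x] for s in range(1, n_experts + 1)]
router_weights = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_experts)]
token = [0.5, -1.0, 0.25, 2.0]

logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
chosen = route_top_k(logits, k=2)
print("experts used:", [i for i, _ in chosen])
print("output:", moe_layer(token, router_weights, experts, k=2))
```

The point to notice is in `moe_layer`: six of the eight expert functions are never evaluated for this token, which is where the compute savings come from.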
The Sparsity Insight: Separating Parameters from FLOPS
This is the insight that changes everything: MoE separates parameter count from floating-point operations per token.
In a dense model, parameters = compute budget. You can't have one without the other. In a MoE model, total parameters can be an order of magnitude or more larger than active parameters. A model like Qwen3.5-35B-A3B stores 35 billion parameters, but each token activates only about 3 billion of them.
| Model | Total Params | Active Params/Token | Ratio | Context |
|---|---|---|---|---|
| Mixtral 8x7B | 46.7B | 12.9B | 3.6x | 32K |
| DeepSeek-V3 | 671B | 37B | 18x | 128K |
| Qwen3.5-35B-A3B | 35B | ~3B | ~12x | 262K |
| Qwen3.5-122B-A10B | 122B | 10B | 12x | 262K |
Why does this matter for speed? LLM inference on consumer hardware is memory-bandwidth-bound, not compute-bound. The GPU spends most of its time reading model weights from VRAM into compute cores, not doing the arithmetic. If only 3B parameters are active per token instead of 35B, the model reads roughly 12x fewer bytes per forward pass, and in principle generates tokens roughly 12x faster than a dense model of the same total size.
In practice, gains are real but not perfectly 12x: routing overhead, expert loading, and architectural factors reduce the gap. But the 90 tok/sec figure on a 35B-A3B model (versus ~8 to 12 tok/sec for a comparable dense 35B model) is consistent with the theory.
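The bandwidth argument can be turned into a back-of-the-envelope ceiling: if every active weight is read from VRAM once per token, decode speed cannot exceed bandwidth divided by bytes read. The sketch below assumes a peak bandwidth of 936 GB/s (the published RTX 3090 figure) and roughly 4.5 bits per weight for a 4-bit quant; both are assumptions for illustration, and measured speeds land well below these idealized ceilings.

```python
def max_tokens_per_sec(active_params_billions, bits_per_weight, bandwidth_gb_s):
    """Idealized decode ceiling: every active weight is read once per token."""
    bytes_per_token = active_params_billions * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


BANDWIDTH_GB_S = 936.0  # assumed: published peak memory bandwidth of an RTX 3090
BITS_PER_WEIGHT = 4.5   # assumed: approximate size of a 4-bit quantized weight

dense_ceiling = max_tokens_per_sec(35, BITS_PER_WEIGHT, BANDWIDTH_GB_S)
moe_ceiling = max_tokens_per_sec(3, BITS_PER_WEIGHT, BANDWIDTH_GB_S)

print(f"dense 35B ceiling: {dense_ceiling:.0f} tok/s")
print(f"35B-A3B ceiling:   {moe_ceiling:.0f} tok/s")
print(f"ratio:             {moe_ceiling / dense_ceiling:.1f}x")
```

The ratio depends only on active parameter counts, not on the bandwidth or quantization chosen, which is why the ~12x figure carries across hardware even though absolute speeds do not.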
From Switch Transformer to DeepSeek to Qwen3.5
MoE in neural networks dates to the 1990s, but its application to large language models has a cleaner lineage.
Switch Transformer (Google, 2021)
Google's Switch Transformer was the proof of concept that MoE could work at LLM scale. Key innovation: top-1 routing (each token goes to exactly one expert). This maximized sparsity but introduced load balancing problems: some experts would get far more tokens than others, becoming overloaded while others sat idle. Switch Transformer introduced auxiliary loss terms to penalize imbalanced routing during training.
Mixtral 8x7B (Mistral, 2023)
Mistral's Mixtral 8x7B brought MoE to the open-source community at a practical scale. 8 experts per layer, top-2 routing (each token uses 2 experts), standard GQA attention. At 46.7B total parameters with 12.9B active, it ran comfortably on consumer hardware while matching or exceeding Llama 2 70B on most benchmarks. It became the template for what community-scale MoE could look like.
DeepSeekMoE (DeepSeek, 2024)
DeepSeek's MoE research introduced two significant refinements: fine-grained experts (segment the experts into more, smaller units, so instead of 8 large experts, use 64 smaller ones and activate more of them) and shared experts (a small set of experts that are always active for every token, capturing common knowledge that shouldn't be routed). DeepSeek-V3 operationalized these at 671B total / 37B active scale, with auxiliary-loss-free load balancing that prevents expert collapse without degrading quality.
Qwen3.5-35B-A3B (Alibaba, 2026)
The current consumer MoE gold standard takes DeepSeek's shared expert approach and pushes it further: 256 total experts per layer, 8 routed + 1 shared active per token. The 256-expert count is unusually large, more fine-grained than anything that came before it. Each expert is also unusually small (intermediate dimension 512), which means the routing can be very precise without activating large chunks of the network unnecessarily.
Qwen3.5-35B-A3B: Current Local MoE Gold Standard
The full architecture spec reveals how carefully engineered this model is:
- 35B total / ~3B active parameters
- 40 layers, hidden dimension 2048
- 256 experts per MoE layer, 8 routed + 1 shared active per token
- Expert intermediate dimension: 512 (much smaller than typical, enabling fine-grained routing)
- 262,144 token context natively, extensible to 1M
- Multimodal: early fusion training on text + image + video tokens
- 201 languages
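The headline numbers can be sanity-checked from the spec list. The sketch below counts expert weights only, under two loudly stated assumptions that the source does not confirm: each expert is a gated FFN with three weight matrices (gate, up, down), and every one of the 40 layers carries an MoE block. Attention, embeddings, routers, and the shared expert's place in the total are left out.

```python
HIDDEN = 2048            # hidden dimension, from the spec list
EXPERT_DIM = 512         # expert intermediate dimension, from the spec list
LAYERS = 40              # from the spec list
EXPERTS_PER_LAYER = 256  # from the spec list
ACTIVE_EXPERTS = 8 + 1   # 8 routed + 1 shared, from the spec list

# Assumption: a gated FFN expert has three HIDDEN x EXPERT_DIM matrices
# (gate, up, down). The source does not state the expert's internal layout.
MATRICES_PER_EXPERT = 3

params_per_expert = MATRICES_PER_EXPERT * HIDDEN * EXPERT_DIM
total_expert_params = params_per_expert * EXPERTS_PER_LAYER * LAYERS
active_expert_params = params_per_expert * ACTIVE_EXPERTS * LAYERS

print(f"per expert:     {params_per_expert / 1e6:.2f}M")
print(f"all experts:    {total_expert_params / 1e9:.2f}B")
print(f"active experts: {active_expert_params / 1e9:.2f}B")
```

Under these assumptions the experts alone account for about 32B of the 35B total, while the active experts contribute only about 1.1B per token; the remainder of the ~3B active figure would then come from attention, embeddings, and other always-on components.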
The benchmark results are equally strong. On MMLU-Pro (85.3), GPQA Diamond (84.2), and SWE-bench Verified (69.2), it matches or exceeds models with far more active compute. On TAU2-Bench (81.2), an agentic task benchmark, it actually outperforms the larger Qwen3.5-122B-A10B (79.5), suggesting the routing efficiency is particularly well-suited to agentic workloads.
The DeltaNet Twist: Linear Attention + MoE
Qwen3.5's most novel contribution isn't the MoE layer. It's the hybrid attention mechanism that combines MoE with an alternative to standard softmax attention.
Standard transformer attention is O(N²): cost scales quadratically with context length. For a 262K context, this is prohibitive. Qwen3.5 addresses this with Gated DeltaNet, a form of linear attention with O(N) complexity.
The layer layout is a repeating 4-layer cycle:
- 3× Gated DeltaNet → MoE (linear attention, O(N))
- 1× Gated Attention → MoE (full softmax attention, O(N²))
Three out of every four layers use linear attention. The DeltaNet mechanism maintains a fixed-size recurrent state: rather than attending to all previous tokens, it updates a compressed representation using a delta rule. This dramatically reduces memory pressure for long contexts while preserving most of the representational power.
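The delta rule itself is easy to show in miniature. The sketch below is a simplified, ungated version (not Qwen3.5's Gated DeltaNet, whose gating and normalization details are not covered here): the state is a small matrix mapping keys to values, and each step nudges the value stored under a key toward the new value. The helper names `delta_rule_step` and `read` are hypothetical.

```python
def delta_rule_step(state, key, value, beta):
    """One simplified (ungated) delta-rule update.

    state: d_v x d_k matrix (list of rows) that maps keys to values.
    The value currently stored under `key` moves toward `value`
    by a fraction `beta`; the state's size never changes.
    """
    predicted = [sum(s * k for s, k in zip(row, key)) for row in state]
    error = [v - p for v, p in zip(value, predicted)]
    return [[s + beta * e * k for s, k in zip(row, key)]
            for row, e in zip(state, error)]


def read(state, query):
    """Retrieve the value associated with `query` from the state."""
    return [sum(s * q for s, q in zip(row, query)) for row in state]


d_k, d_v = 3, 2
state = [[0.0] * d_k for _ in range(d_v)]

# Write two associations with orthonormal keys, then overwrite the first.
state = delta_rule_step(state, [1.0, 0.0, 0.0], [5.0, -1.0], beta=1.0)
state = delta_rule_step(state, [0.0, 1.0, 0.0], [2.0, 2.0], beta=1.0)
state = delta_rule_step(state, [1.0, 0.0, 0.0], [9.0, 9.0], beta=1.0)

print(read(state, [1.0, 0.0, 0.0]))  # value stored under the first key
print(read(state, [0.0, 1.0, 0.0]))  # value stored under the second key
```

After three tokens the state is still a 2x3 matrix. That is the contrast with softmax attention, where the cache of keys and values grows with every token processed.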
The 1-in-4 full attention layers provide the global context sensitivity that linear attention can miss, without paying the quadratic cost on every layer. It's a deliberate architectural compromise: linear attention where you can afford to approximate, full attention where you can't.
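The repeating cycle is simple enough to write down directly. This sketch just expands the 3-to-1 pattern over the model's 40 layers; the string labels and the function name `layer_layout` are placeholders, not identifiers from any real config file.

```python
CYCLE = ("deltanet", "deltanet", "deltanet", "full_attention")


def layer_layout(num_layers, cycle=CYCLE):
    """Expand a repeating attention-type cycle across num_layers layers."""
    return [cycle[i % len(cycle)] for i in range(num_layers)]


layout = layer_layout(40)
print(layout[:8])
print("linear attention layers:", layout.count("deltanet"))
print("full attention layers:  ", layout.count("full_attention"))
```

For 40 layers this gives 30 linear-attention layers and 10 full-attention layers, so only a quarter of the stack pays the quadratic cost.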
Real-World Performance on Consumer Hardware
The community benchmarks tell the story clearly. On a system with an RTX 3090 Ti (24GB VRAM) running LM Studio:
- Qwen3.5-27B (dense): ~31 tok/sec at full 262K context
- Qwen3.5-35B-A3B Q4 (MoE): ~90 tok/sec at full 262K context
The MoE model is nearly 3x faster than the dense model that's actually smaller. On a complex real-world coding task (PDF merger app with GUI, drag-and-drop, venv isolation), the 35B-A3B delivered a working solution while GPT-5 failed. The active parameter efficiency of MoE, combined with the O(N) attention for most layers, makes this hardware performance possible.
For the Q4_K_M quant, the full 35B model fits in approximately 20GB of VRAM, within range of a single RTX 3090/4090 or Apple Silicon with 24GB unified memory.
Tradeoffs and Open Problems
MoE is not a free lunch. Understanding the real tradeoffs matters for deployment decisions.
Load Balancing
If routing always selects the same experts, others never get trained. This is expert collapse. The standard fix is auxiliary loss during training that penalizes imbalanced routing. DeepSeek's approach eliminates the auxiliary loss entirely, using a bias-adjustment mechanism that rebalances routing without a training objective conflict. Qwen3.5 appears to use a similar approach, but published details on load balancing methodology are limited.
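For a feel of what an auxiliary balancing loss measures, here is a sketch in the style of the Switch Transformer formulation: the number of experts times the sum, over experts, of (fraction of tokens sent to that expert) times (mean router probability for that expert). It is illustrative only; the scaling coefficient applied in real training is omitted, and the function name `load_balance_loss` is made up.

```python
def load_balance_loss(assignments, router_probs, n_experts):
    """Switch-style auxiliary loss for top-1 routing (unscaled).

    assignments: chosen expert index for each token.
    router_probs: per-token router probabilities over all experts.
    Perfectly uniform routing gives 1.0; concentration pushes it higher.
    """
    n_tokens = len(assignments)
    frac_tokens = [assignments.count(i) / n_tokens for i in range(n_experts)]
    mean_prob = [sum(p[i] for p in router_probs) / n_tokens
                 for i in range(n_experts)]
    return n_experts * sum(f * p for f, p in zip(frac_tokens, mean_prob))


# Balanced: four tokens spread evenly over four experts.
balanced = load_balance_loss(
    assignments=[0, 1, 2, 3],
    router_probs=[[0.25, 0.25, 0.25, 0.25]] * 4,
    n_experts=4,
)

# Collapsed: every token goes to expert 0.
collapsed = load_balance_loss(
    assignments=[0, 0, 0, 0],
    router_probs=[[0.7, 0.1, 0.1, 0.1]] * 4,
    n_experts=4,
)

print(f"balanced:  {balanced:.2f}")
print(f"collapsed: {collapsed:.2f}")
```

Because the loss grows as routing concentrates on a few experts, adding it to the training objective pushes the router back toward even usage, at the cost of competing with the main objective. That conflict is what the bias-adjustment approach is meant to avoid.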
Multi-GPU Communication Overhead
In multi-GPU setups, different experts may reside on different GPUs. Each routing decision potentially requires cross-GPU data transfer (all-to-all communication). At scale, this can become the bottleneck. Expert parallelism strategies try to minimize it by co-locating frequently co-activated experts, but it's an active engineering challenge.
Memory Footprint vs. Active Compute
MoE separates active compute from total parameters, but all parameters still need to be in memory. The 35B-A3B at BF16 is ~70GB, too large for a single consumer GPU. At Q4, it's ~20GB, which fits on a 3090. But if you want the full quality of BF16, you need datacenter hardware or multi-GPU setups. The compression tradeoff is real.
Does MoE Scale Like Dense?
The open question in the research community: does MoE follow the same scaling laws as dense models? Current evidence suggests MoE is more sample-efficient (achieves similar quality with fewer training FLOPs) but may have a lower quality ceiling per unit of total parameters at extreme scale. The jury is still out: DeepSeek-V3 at 671B total suggests the ceiling is high, but we don't have clean controlled comparisons at frontier scale.
What's Next
The trajectory is clear: MoE is becoming the default architecture for large-scale open models. The reasons are straightforward: inference efficiency matters more than training efficiency for models that are deployed millions of times, and MoE's active-parameter separation is a fundamental advantage for serving.
Qwen3.5's hybrid approach, combining MoE with linear attention, points toward the next generation of architectural exploration. If the O(N²) attention bottleneck can be further reduced, the practical context length and speed floor rise further. Models that fit on a single 24GB GPU while handling million-token contexts and running at interactive speeds are not theoretical.
The local AI renaissance isn't driven by faster GPUs alone. It's driven by architectures that make better use of the hardware we already have. MoE is the most important of those architectural shifts.
Sources: Qwen3.5-35B-A3B (HuggingFace) · Mixture-of-Experts LLMs (Cameron Wolfe) · DeepSeekMoE (arXiv) · r/LocalLLaMA community benchmark · Qwen3.5 Benchmarks Guide