Introduction
On February 16, 2026, the eve of Chinese New Year, Alibaba's Qwen team dropped the first bombshell: Qwen3.5-397B-A17B, a 397-billion-parameter flagship that activates 17 billion parameters per token. Then came the Medium series on February 24, with models that fit on a single consumer GPU yet outperform Claude Sonnet 4.5 and GPT-5-mini on multiple benchmarks. And today, March 2, the Small series completes the lineup: four models from 0.8B to 9B parameters that are natively multimodal, the smallest of which runs on a phone.
This isn't incremental progress. This is a paradigm shift. Alibaba has shown that with the right architecture (Mixture-of-Experts combined with Gated Delta Networks) you don't need a hyperscale data center to run frontier-class AI. You need a GPU rig in your office and an Apache 2.0 license. Per-token fees: $0. You pay only for hardware and electricity.
⚡ The TL;DR
Qwen3.5-35B-A3B has 35 billion parameters but only activates 3 billion per token. It beats Claude Sonnet 4.5 on MMMLU and MMMU-Pro. It runs on a single GPU with 32GB VRAM using 4-bit quantization. It supports 1M token context. It's Apache 2.0. It does agentic tool calling natively. This is the model that makes "local-first AI" a real production strategy.
The Qwen3.5 Timeline
The Qwen3.5 rollout happened in three waves over two weeks:
| Date | Release | Models |
|---|---|---|
| Feb 16, 2026 | Flagship | Qwen3.5-397B-A17B (397B total, 17B active) — the full-size beast |
| Feb 24, 2026 | Medium Series | Qwen3.5-35B-A3B, Qwen3.5-122B-A10B, Qwen3.5-27B, Qwen3.5-Flash (API-only) |
| Mar 2, 2026 | Small Series | Qwen3.5-9B, Qwen3.5-4B, Qwen3.5-2B, Qwen3.5-0.8B |
All open-weight models are released under the Apache 2.0 license — full commercial use, fine-tuning, and redistribution allowed. Available on Hugging Face and ModelScope.
Medium Series: The Sweet Spot
The Medium series is where things get interesting for production teams. These models target the "Goldilocks zone" — small enough to self-host, powerful enough to replace proprietary APIs.
Qwen3.5-35B-A3B — The Star of the Show
This is the model everyone's talking about. Here's why:
MoE Architecture
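35 billion total parameters, but only 3 billion activate per token. The model has 256 experts and routes each token to 8 of them, plus 1 shared expert that always runs. Per-token compute is therefore comparable to a 3B dense model, with intelligence closer to a 35B one, though all 35B parameters still have to sit in memory.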
1M Token Context
With 4-bit quantization of both weights and KV cache, the model handles over 1 million tokens on a consumer GPU with 32GB VRAM. No need for complex RAG chunking — just feed it the whole codebase.
Native Tool Calling
Built-in support for function calling and API interaction. No prompt engineering hacks — the model natively understands tool schemas and can plan multi-step agentic workflows.
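To make "tool schema" concrete, here is a minimal sketch of one in the OpenAI-style function-calling format, which this article says self-hosted Qwen3.5 accepts. The `get_weather` tool, its parameters, and the `build_chat_request` helper are hypothetical and only illustrate the shape of the payload.

```python
import json

# Hypothetical tool: the name, description, and parameters are illustrative only.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {  # JSON Schema describing the arguments
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}


def build_chat_request(model: str, user_message: str, tools: list) -> str:
    """Serialize a chat-completions request body that advertises `tools`."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "tools": tools,
        "tool_choice": "auto",  # let the model decide whether to call a tool
    }
    return json.dumps(body)


request_body = build_chat_request(
    "Qwen/Qwen3.5-35B-A3B", "Do I need an umbrella in Hangzhou?", [get_weather_tool]
)
print(request_body[:80])
```

The model never runs the function itself. It returns the tool name and JSON arguments, and your code decides whether and how to execute them.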
Thinking Mode
Default "thinking" mode generates internal reasoning chains (<think> tags) before answering. Works like chain-of-thought prompting but baked into the model weights.
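If you only want the final answer, you can strip the reasoning out after decoding. The sketch below assumes both the opening and closing `<think>` tags appear in the decoded text, which may not hold for every chat template or serving stack. `split_thinking` is a hypothetical helper, not part of any Qwen library.

```python
import re

# Non-greedy match so multiple <think> blocks are handled separately.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a response that may contain <think> blocks."""
    reasoning = "\n".join(block.strip() for block in THINK_BLOCK.findall(text))
    answer = THINK_BLOCK.sub("", text).strip()
    return reasoning, answer


sample = "<think>35B total, 3B active.\nSo it is sparse.</think>\nIt is a sparse MoE model."
reasoning, answer = split_thinking(sample)
print(answer)
```

Text without any `<think>` block passes through unchanged as the answer, with an empty reasoning string.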
Qwen3.5-122B-A10B — The Server-Grade Option
For teams with more hardware (80GB VRAM GPUs like A100/H100), the 122B model activates 10 billion parameters per token and supports 1M+ context lengths. It narrows the gap with the flagship 397B model and outperforms many proprietary alternatives. This is the model for enterprises running inference clusters.
Qwen3.5-27B — Dense and Efficient
A traditional dense architecture (no MoE) optimized for high throughput. Supports 800K+ token context. Best suited for scenarios where MoE routing overhead matters or when you want maximum predictability in inference latency.
Qwen3.5-Flash — API Only
The production-hosted version, available through Alibaba Cloud Model Studio. It's the 35B-A3B model optimized for production with a default 1M context window, built-in tools (web search, code interpreter), and aggressive pricing: $0.10/1M input tokens, $0.40/1M output tokens. Summing input and output list prices ($0.50 vs. $18.00 for 1M tokens of each), that's 36x cheaper than Claude Sonnet 4.5.
Small Series: AI on Every Device
Released today (March 2, 2026), the Small series brings the Qwen3.5 architecture to edge devices and resource-constrained environments. All four models are natively multimodal — text, images, and video from the same weights. No bolted-on vision adapter.
| Model | Parameters | Context | VRAM (BF16) | Use Case |
|---|---|---|---|---|
| Qwen3.5-9B | 9B | 262K (1M ext.) | ~18 GB | Compact reasoning powerhouse — beats prev-gen Qwen3-30B |
| Qwen3.5-4B | 4B | 262K | ~8 GB | Multimodal base for lightweight agents |
| Qwen3.5-2B | 2B | 262K | ~4 GB | Fast prototyping, edge deployment |
| Qwen3.5-0.8B | 0.8B | 262K | ~1.6 GB | Runs on a phone — ultra-low latency |
The headliner: Qwen3.5-9B beats the previous-generation Qwen3-30B, a model more than 3x its size, on most benchmarks. It also beats OpenAI's gpt-oss-120B, a model with roughly 13x more parameters, on GPQA Diamond (81.7 vs 80.1).
Architecture: MoE + Gated Delta Networks
What makes Qwen3.5 different from standard transformers? Two key innovations working together:
Mixture-of-Experts (MoE)
Instead of activating all parameters for every token, MoE routes each token to a subset of "expert" sub-networks. The 35B-A3B model has 256 experts but only routes to 8 per token (plus 1 shared expert that always activates). Result: you get the knowledge capacity of 35B parameters with the inference cost of 3B.
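A toy version of that routing step is sketched below. The expert count and top-k come from the article. Everything else is an assumption for illustration: the softmax over only the selected logits, the scalar "experts", and the function names. None of it is taken from Qwen's implementation.

```python
import math
import random

NUM_EXPERTS = 256  # routed experts, per the article
TOP_K = 8          # routed experts chosen per token, per the article


def route_token(router_logits: list[float], top_k: int = TOP_K) -> dict[int, float]:
    """Pick the top_k experts for one token and return {expert_index: weight}.

    Weights are a softmax over the selected logits only, so they sum to 1.
    """
    order = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    chosen = order[:top_k]
    peak = max(router_logits[i] for i in chosen)  # subtract the max for numerical stability
    exps = {i: math.exp(router_logits[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_forward(x: float, router_logits, experts, shared_expert) -> float:
    """Add the always-on shared expert to the weighted sum of the routed experts."""
    weights = route_token(router_logits)
    routed = sum(w * experts[i](x) for i, w in weights.items())
    return shared_expert(x) + routed


random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
# Toy experts: each one just scales its input by a different factor between 1 and 2.
experts = [lambda x, s=i: x * (1 + s / NUM_EXPERTS) for i in range(NUM_EXPERTS)]
weights = route_token(logits)
output = moe_forward(1.0, logits, experts, shared_expert=lambda x: x)
print(len(weights), round(sum(weights.values()), 6), round(output, 3))
```

Only 8 of the 256 expert functions are evaluated for the token, which is where the compute saving comes from. All 256 still have to be held in memory.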
Gated Delta Networks (Linear Attention)
The Qwen3.5 architecture uses a 3:1 ratio of Gated DeltaNet (linear attention) blocks to standard full-attention blocks. A linear-attention block keeps a fixed-size state, so its memory does not grow with sequence length, whereas a full-attention block's KV cache grows linearly with it. With only one block in four paying that cost, the massive 262K-1M+ context windows become feasible even on small models. The full-attention blocks provide precision for complex reasoning. This hybrid approach eases the "memory wall" that typically limits context length on consumer hardware.
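The effect of the 3:1 layout on cache size can be sketched with simple counting. The layer count, `kv_dim`, and `state_dim` below are hypothetical values chosen only to show the trend. They are not Qwen3.5's real dimensions, and the exact position of the full-attention layer within each group of four is also an assumption.

```python
def layer_pattern(num_layers: int, linear_per_full: int = 3) -> list[str]:
    """Lay out layers so every (linear_per_full + 1)-th layer is full attention."""
    period = linear_per_full + 1
    return ["full" if (i + 1) % period == 0 else "linear" for i in range(num_layers)]


def cache_elements(pattern: list[str], seq_len: int, kv_dim: int, state_dim: int) -> int:
    """Count cached values across all layers.

    Full-attention layers store a key and a value vector for every token.
    Linear-attention layers keep one fixed-size state regardless of seq_len.
    """
    full = pattern.count("full") * 2 * seq_len * kv_dim
    linear = pattern.count("linear") * state_dim
    return full + linear


# Hypothetical sizes, for illustration only.
SEQ_LEN = 262_144
pattern = layer_pattern(num_layers=48)
hybrid = cache_elements(pattern, SEQ_LEN, kv_dim=512, state_dim=128 * 128 * 16)
all_full = cache_elements(["full"] * 48, SEQ_LEN, kv_dim=512, state_dim=0)
print(pattern[:4], f"all-full cache is {all_full / hybrid:.2f}x the hybrid cache")
```

At long sequence lengths the fixed linear-attention state becomes negligible, so the hybrid cache approaches one quarter of the all-full-attention cache.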
Near-Lossless 4-bit Quantization
Alibaba specifically engineered these models to maintain accuracy under aggressive quantization. The 35B-A3B model running in 4-bit (weights + KV cache) loses minimal quality while cutting VRAM usage dramatically. This is what makes the "1M context on 32GB VRAM" claim possible.
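The VRAM savings follow from straightforward arithmetic on the weights. The sketch below covers weights only. It ignores quantization metadata (scales and zero points), the KV cache, and activations, so real usage will be somewhat higher. `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(total_params_billion: float, bits_per_weight: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes).

    Uses total parameters, not active ones: an MoE model must keep every expert loaded.
    """
    return total_params_billion * bits_per_weight / 8


for name, params in [("Qwen3.5-9B", 9), ("Qwen3.5-35B-A3B", 35), ("Qwen3.5-122B-A10B", 122)]:
    bf16 = weight_memory_gb(params, 16)
    int4 = weight_memory_gb(params, 4)
    print(f"{name}: BF16 {bf16:.1f} GB, 4-bit {int4:.1f} GB")
```

For the 35B-A3B this gives 70 GB in BF16 versus 17.5 GB at 4 bits, which is why the quantized model fits on a 24GB or 32GB card and the unquantized one does not.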
Multi-Token Prediction
All models include multi-token prediction (MTP) for faster inference — the model predicts multiple upcoming tokens simultaneously, reducing the number of forward passes needed.
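One common way extra predicted tokens translate into fewer forward passes is draft-and-verify decoding: cheap guesses for the next few tokens are checked by the main model, and the matching prefix is accepted in one step. Whether Qwen3.5's MTP is used exactly this way is an assumption. The toy below only illustrates the accounting, with plain functions standing in for the model and its draft head.

```python
def generate_with_mtp(target_next, draft_next, prompt, num_new, k=3):
    """Toy greedy decoding that drafts k tokens per verification step.

    target_next(seq) -> the token the main model would emit after seq.
    draft_next(seq)  -> a cheap guess for that token (stands in for an MTP head).
    Returns (tokens, number_of_verification_steps).
    """
    seq = list(prompt)
    steps = 0
    while len(seq) - len(prompt) < num_new:
        draft = []
        for _ in range(k):
            draft.append(draft_next(seq + draft))
        # One verification step. A real model would score all k positions in a
        # single batched forward pass; this toy calls target_next per position.
        steps += 1
        accepted = []
        for token in draft:
            expected = target_next(seq + accepted)
            if token != expected:
                accepted.append(expected)  # keep the main model's token and stop
                break
            accepted.append(token)
        seq.extend(accepted)
    return seq[: len(prompt) + num_new], steps


def toy_target(seq):
    return (seq[-1] + 1) % 10  # "main model": counts upward modulo 10


def toy_draft(seq):
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10  # wrong whenever the last token is 4


tokens, steps = generate_with_mtp(toy_target, toy_draft, prompt=[0], num_new=12, k=3)
print(tokens, steps)
```

The output matches plain greedy decoding token for token, but it takes 5 verification steps instead of 12 single-token passes. A wrong draft costs only the tokens after the mismatch.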
Benchmarks That Matter
Medium Series vs. Proprietary Models
The 35B-A3B model doesn't just compete with proprietary models — it beats them:
| Benchmark | Qwen3.5-35B-A3B | Claude Sonnet 4.5 | GPT-5-mini |
|---|---|---|---|
| MMMLU (multilingual knowledge) | ✅ Higher | Lower | Lower |
| MMMU-Pro (visual reasoning) | ✅ Higher | Lower | Lower |
The model also surpasses its own predecessor, the much larger Qwen3-235B-A22B, which suggests that architecture can matter more than raw scale.
Small Series: Punching Way Above Weight
| Benchmark | Qwen3.5-9B | Qwen3.5-4B | GPT-5-Nano | Gemini 2.5 Flash-Lite |
|---|---|---|---|---|
| MMMU-Pro | 70.1 | 66.3 | 57.2 | 59.7 |
| MathVision | 78.9 | 74.6 | 62.2 | 52.1 |
| GPQA Diamond | 81.7 | 76.2 | — | — |
| OmniDocBench 1.5 | 87.7 | 86.2 | 55.9 | 79.4 |
| VideoMME (w/sub) | 84.5 | 83.5 | — | 74.6 |
| MMMLU | 81.2 | — | — | — |
The 9B model beats GPT-5-Nano by 13 points on MMMU-Pro and nearly 17 points on MathVision. These aren't marginal gains. And the 9B model runs on a single consumer GPU.
Running It Locally
Here's the practical guide for running Qwen3.5 on your own hardware:
Hardware Requirements
| Model | Quantization | VRAM Needed | Example GPU |
|---|---|---|---|
| Qwen3.5-0.8B | BF16 | ~1.6 GB | Any modern GPU / phone |
| Qwen3.5-4B | BF16 | ~8 GB | RTX 3060 / RTX 4060 |
| Qwen3.5-9B | BF16 | ~18 GB | RTX 3090 / RTX 4090 |
| Qwen3.5-35B-A3B | 4-bit GPTQ | ~20 GB | RTX 4090 (24GB) / RTX 3090 |
| Qwen3.5-35B-A3B | 4-bit weights + KV cache (1M ctx) | ~32 GB | 2× RTX 3090 or 1× A6000 |
| Qwen3.5-122B-A10B | 4-bit | ~61 GB (weights only) | A100 80GB / H100 80GB |
Quick Start with vLLM
```shell
# Install vLLM
pip install vllm

# Serve Qwen3.5-35B-A3B with 4-bit quantization
vllm serve Qwen/Qwen3.5-35B-A3B-AWQ \
  --quantization awq \
  --max-model-len 131072 \
  --tensor-parallel-size 1

# Or use llama.cpp for GGUF quantized versions
./llama-server -m qwen3.5-35b-a3b-q4_k_m.gguf \
  --ctx-size 131072 --n-gpu-layers 99
```
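Once the vLLM server is running, it exposes an OpenAI-compatible HTTP API, on port 8000 by default. Below is a minimal client sketch using only the standard library. The helper names are hypothetical, and the model ID simply mirrors the one passed to `vllm serve` above.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # vLLM's default port; change it if you passed --port


def make_chat_request(prompt: str, model: str = "Qwen/Qwen3.5-35B-A3B-AWQ") -> urllib.request.Request:
    """Build a POST request for the chat-completions endpoint."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def send(request: urllib.request.Request, timeout: float = 120.0) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return payload["choices"][0]["message"]["content"]


# Usage, once the server is up:
#   print(send(make_chat_request("Explain MoE architecture in 3 sentences.")))
```

Because the endpoint follows the OpenAI chat-completions shape, existing OpenAI client libraries should also work if you point their base URL at the local server.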
Quick Start with Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model, letting Transformers pick the dtype and device placement
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.5-35B-A3B",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-35B-A3B")

# Format the conversation with the model's chat template
messages = [{"role": "user", "content": "Explain MoE architecture in 3 sentences."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate and print the decoded sequence (prompt included)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Agentic Tool Calling
One of Qwen3.5's standout features is native tool calling support. Unlike models that require elaborate prompt engineering to use tools, Qwen3.5 understands tool schemas natively and can:
- Plan multi-step workflows — decompose complex tasks into tool calls
- Call APIs directly — generate properly formatted function calls with correct arguments
- Handle tool responses — incorporate results from tool calls into ongoing reasoning
- Chain tools — use the output of one tool as input to another
The Flash API version goes further with built-in official tools including web search ($10/1,000 calls) and a code interpreter (currently free). This makes Qwen3.5-Flash one of the most capable and affordable agentic models available via API.
For self-hosted deployments, the tool calling works with standard OpenAI-compatible function calling schemas, making it a drop-in replacement for existing agentic pipelines that currently use GPT or Claude.
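On the application side, handling a tool call means parsing the model's request, running the matching function, and sending the result back as a `tool` message. The sketch below assumes the OpenAI chat-completions message shape. The two tools and the hand-written assistant message are hypothetical stand-ins for real model output.

```python
import json


# Hypothetical local tools; names and return values are illustrative.
def search_orders(customer_id: str) -> dict:
    return {"customer_id": customer_id, "order_ids": ["A-100", "A-101"]}


def get_order_status(order_id: str) -> dict:
    return {"order_id": order_id, "status": "shipped"}


TOOLS = {"search_orders": search_orders, "get_order_status": get_order_status}


def run_tool_calls(assistant_message: dict) -> list[dict]:
    """Execute each tool call in an assistant message and return `tool` messages."""
    results = []
    for call in assistant_message.get("tool_calls") or []:
        function = TOOLS[call["function"]["name"]]
        arguments = json.loads(call["function"]["arguments"])  # arguments arrive as a JSON string
        results.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(function(**arguments)),
        })
    return results


# A hand-written assistant message in the shape a server would return.
assistant_message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "search_orders", "arguments": '{"customer_id": "c-42"}'},
    }],
}
tool_messages = run_tool_calls(assistant_message)
print(tool_messages[0]["content"])
```

Chaining falls out of the loop: append the assistant message and the tool messages to the conversation, call the model again, and repeat until it replies without `tool_calls`. In this example the next turn might call `get_order_status` with one of the returned order IDs.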
Pricing: The Cost Comparison
This is where the picture gets stark. If you self-host Qwen3.5-35B-A3B, your marginal per-token cost is effectively $0 (you pay for hardware and electricity instead). But even the hosted Flash API is dramatically cheaper than the proprietary Western alternatives:
| Model | Input / 1M tokens | Output / 1M tokens | Input + Output (1M each) |
|---|---|---|---|
| Qwen3.5-Flash | $0.10 | $0.40 | $0.50 |
| DeepSeek V3.2 | $0.28 | $0.42 | $0.70 |
| Grok 4.1 Fast | $0.20 | $0.50 | $0.70 |
| Claude Haiku 4.5 | $1.00 | $5.00 | $6.00 |
| Claude Sonnet 4.5 | $3.00 | $15.00 | $18.00 |
| GPT-5.2 | $1.75 | $14.00 | $15.75 |
| Claude Opus 4.6 | $5.00 | $25.00 | $30.00 |
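Qwen3.5-Flash is 36x cheaper than Claude Sonnet 4.5 on combined input and output list price. And if you self-host the equivalent model (35B-A3B), there is no per-token fee at all, only hardware and electricity. For a team running 30M input and 30M output tokens per day through agentic workflows, this is the difference between $540/day on Sonnet 4.5, $15/day on Flash, and $0/day in API fees when self-hosting.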
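The $540/day figure works out to 30M input plus 30M output tokens per day at Claude Sonnet 4.5 list prices. That workload is an assumption back-solved from the number, not a measured one. A small calculator makes it easy to plug in your own volumes.

```python
# List prices in USD per 1M tokens, copied from the table above.
PRICES = {
    "Qwen3.5-Flash": (0.10, 0.40),
    "Claude Sonnet 4.5": (3.00, 15.00),
}


def daily_cost(model: str, input_millions: float, output_millions: float) -> float:
    """API cost in USD for one day's traffic, given token volumes in millions."""
    price_in, price_out = PRICES[model]
    return input_millions * price_in + output_millions * price_out


sonnet = daily_cost("Claude Sonnet 4.5", 30, 30)
flash = daily_cost("Qwen3.5-Flash", 30, 30)
print(f"Sonnet 4.5: ${sonnet:.2f}/day, Flash: ${flash:.2f}/day, ratio {sonnet / flash:.0f}x")
```

The 36x ratio holds only for an even input/output split. Input-heavy workloads land closer to 30x (the input price ratio) and output-heavy ones closer to 37.5x.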
Pros & Cons
✅ Pros
- Frontier-level performance at a fraction of the compute
- Apache 2.0 — full commercial freedom, fine-tuning, redistribution
- MoE architecture = blazing fast inference for the intelligence level
- 1M token context on consumer hardware (32GB VRAM)
- Near-lossless 4-bit quantization engineered from the ground up
- Native tool calling for agentic workflows
- Natively multimodal (text + image + video) in small series
- Full model lineup from 0.8B to 397B — something for every use case
- Active community, strong Hugging Face ecosystem support
- Base models also released for research and fine-tuning
❌ Cons
- Alibaba Cloud API mainly targets APAC — latency from US/EU may vary
- MoE models need more total VRAM than a dense model with the same active parameter count (every expert stays loaded, even though only a few run per token)
- Qwen3.5-Flash is proprietary (API-only) — can't self-host that specific variant
- Benchmark performance doesn't always translate to real-world agentic tasks
- Chinese-origin model may face regulatory scrutiny in some enterprises
- Community tooling (quantizations, adapters) still catching up vs. Llama ecosystem
- Long-context performance under 4-bit quantization needs real-world validation
How It Compares
| Model | Params (Active) | Architecture | License | Context | Best For |
|---|---|---|---|---|---|
| Qwen3.5-35B-A3B | 35B (3B) | MoE + DeltaNet | Apache 2.0 | 1M+ | Local agentic AI, cost-zero inference |
| Llama 4 Scout | 109B (17B) | MoE | Meta Community | 10M | Long-context research |
| DeepSeek V3.2 | 671B (37B) | MoE | MIT | 128K | Coding, reasoning |
| Claude Sonnet 4.5 | Unknown | Dense (assumed) | Proprietary | 200K | Creative writing, code, general |
| GPT-5-mini | Unknown | Undisclosed | Proprietary | 128K | General purpose, API access |
| Gemini 3 Flash | Unknown | Undisclosed | Proprietary | 1M | Multimodal, long context |
Getting Started
Ready to try Qwen3.5? Here are your options:
Option 1: Self-Host (Free)
- Download from Hugging Face
- Use vLLM, llama.cpp, or Transformers to serve
- For the 35B-A3B with 4-bit quant: need ~20GB VRAM (one RTX 4090)
- For 1M context: need ~32GB VRAM
Option 2: Alibaba Cloud API
- Sign up at Alibaba Cloud Model Studio
- Use Qwen3.5-Flash for $0.10/1M input, $0.40/1M output tokens
- OpenAI-compatible API format — drop-in replacement
Option 3: Third-Party Hosting
Providers like Together AI, Fireworks, and Groq typically add new Qwen models within days of release. Check their model catalogs for availability and pricing.