Introduction

On February 16, 2026 — Chinese New Year's Day — Alibaba's Qwen team dropped the first bombshell: Qwen3.5-397B-A17B, a 397-billion-parameter flagship that sent shockwaves through the AI community. Then came the Medium series on February 24, with models that fit on a single consumer GPU yet outperform Claude Sonnet 4.5 and GPT-5-mini on multiple benchmarks. And today, March 2, the Small series completes the lineup — four tiny models from 0.8B to 9B parameters that are natively multimodal and run on a phone.

This isn't incremental progress. This is a paradigm shift. Alibaba has shown that with the right architecture (Mixture-of-Experts combined with Gated Delta Networks) you don't need a hyperscale data center to run frontier-class AI. You need a GPU rig in your office and an Apache 2.0 license. Once the hardware and electricity are paid for, the marginal cost is $0 per token.

⚡ The TL;DR

Qwen3.5-35B-A3B has 35 billion parameters but only activates 3 billion per token. It beats Claude Sonnet 4.5 on MMMLU and MMMU-Pro. It runs on a single GPU with 32GB VRAM using 4-bit quantization. It supports 1M token context. It's Apache 2.0. It does agentic tool calling natively. This is the model that makes "local-first AI" a real production strategy.

The Qwen3.5 Timeline

The Qwen3.5 rollout happened in three waves over two weeks:

Date | Release | Models
Feb 16, 2026 | Flagship | Qwen3.5-397B-A17B (397B total, 17B active), the full-size beast
Feb 24, 2026 | Medium Series | Qwen3.5-35B-A3B, Qwen3.5-122B-A10B, Qwen3.5-27B, Qwen3.5-Flash (API-only)
Mar 2, 2026 | Small Series | Qwen3.5-9B, Qwen3.5-4B, Qwen3.5-2B, Qwen3.5-0.8B

All open-weight models are released under the Apache 2.0 license — full commercial use, fine-tuning, and redistribution allowed. Available on Hugging Face and ModelScope.

Medium Series: The Sweet Spot

The Medium series is where things get interesting for production teams. These models target the "Goldilocks zone" — small enough to self-host, powerful enough to replace proprietary APIs.

Qwen3.5-35B-A3B — The Star of the Show

This is the model everyone's talking about. Here's why:

🧠
MoE Architecture

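35 billion total parameters, but only 3 billion activate per token. Uses 256 experts, with 8 routed experts + 1 shared expert active per token. This means per-token compute comparable to a 3B dense model, with intelligence closer to a 35B one. All 35B parameters still have to sit in memory, so VRAM needs follow the total count, not the active count.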

📏
1M Token Context

With 4-bit quantization of both weights and KV cache, the model handles over 1 million tokens on a consumer GPU with 32GB VRAM. No need for complex RAG chunking — just feed it the whole codebase.

🔧
Native Tool Calling

Built-in support for function calling and API interaction. No prompt engineering hacks — the model natively understands tool schemas and can plan multi-step agentic workflows.

💭
Thinking Mode

Default "thinking" mode generates internal reasoning chains (<think> tags) before answering. Works like chain-of-thought prompting but baked into the model weights.
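If you consume raw completions, you will usually want to separate the reasoning from the final answer. The following is a minimal sketch, assuming the output contains a single `<think>...</think>` span ahead of the answer as described above; the helper name `split_thinking` is hypothetical and not part of any Qwen tooling.

```python
import re

# Assumption: reasoning is wrapped in one <think>...</think> span
# that comes before the final answer.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) extracted from raw model output."""
    match = _THINK_RE.search(text)
    if match is None:
        # No reasoning span found: treat everything as the answer.
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


raw = "<think>2 + 2 is basic arithmetic.</think>\nThe answer is 4."
reasoning, answer = split_thinking(raw)
print("reasoning:", reasoning)
print("answer:", answer)
```

Keeping the reasoning separate lets you log it for debugging while showing users only the answer.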

Qwen3.5-122B-A10B — The Server-Grade Option

For teams with more hardware (80GB VRAM GPUs like A100/H100), the 122B model activates 10 billion parameters per token and supports 1M+ context lengths. It narrows the gap with the flagship 397B model and outperforms many proprietary alternatives. This is the model for enterprises running inference clusters.

Qwen3.5-27B — Dense and Efficient

A traditional dense architecture (no MoE) optimized for high throughput. Supports 800K+ token context. Best suited for scenarios where MoE routing overhead matters or when you want maximum predictability in inference latency.

Qwen3.5-Flash — API Only

The production-hosted version, available through Alibaba Cloud Model Studio. It's the 35B-A3B model optimized for production with a default 1M context window, built-in tools (web search, code interpreter), and aggressive pricing: $0.10/1M input tokens, $0.40/1M output tokens. That's 30x cheaper than Claude Sonnet 4.5 on input and 37.5x cheaper on output (36x when the two per-million prices are summed).

Small Series: AI on Every Device

Released today (March 2, 2026), the Small series brings the Qwen3.5 architecture to edge devices and resource-constrained environments. All four models are natively multimodal — text, images, and video from the same weights. No bolted-on vision adapter.

Model | Parameters | Context | VRAM (BF16) | Use Case
Qwen3.5-9B | 9B | 262K (1M ext.) | ~18 GB | Compact reasoning powerhouse; beats previous-gen Qwen3-30B
Qwen3.5-4B | 4B | 262K | ~8 GB | Multimodal base for lightweight agents
Qwen3.5-2B | 2B | 262K | ~4 GB | Fast prototyping, edge deployment
Qwen3.5-0.8B | 0.8B | 262K | ~1.6 GB | Runs on a phone; ultra-low latency

The headliner: Qwen3.5-9B beats the previous-generation Qwen3-30B, a model more than 3x its size, on most benchmarks. It also beats OpenAI's gpt-oss-120B, a model with roughly 13x more parameters, on GPQA Diamond (81.7 vs 80.1).

Architecture: MoE + Gated Delta Networks

What makes Qwen3.5 different from standard transformers? Two key innovations working together:

Mixture-of-Experts (MoE)

Instead of activating all parameters for every token, MoE routes each token to a subset of "expert" sub-networks. The 35B-A3B model has 256 experts but only routes to 8 per token (plus 1 shared expert that always activates). Result: you get the knowledge capacity of 35B parameters with the inference cost of 3B.
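To make the routing step concrete, here is a toy sketch in plain Python. It assumes the router produces one logit per expert and that gate weights are a softmax over the selected experts only; the article does not specify Qwen3.5's exact gating formula, and the function names and scalar "experts" are purely illustrative.

```python
import math
import random

NUM_EXPERTS = 256   # routed experts, per the article
TOP_K = 8           # routed experts activated per token, per the article


def route_token(router_logits: list[float], top_k: int = TOP_K) -> list[tuple[int, float]]:
    """Pick the top_k experts and softmax-normalise their gate weights."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:top_k]
    peak = max(router_logits[i] for i in top)          # subtract max for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_output(x, experts, shared_expert, router_logits):
    """Weighted sum of the chosen experts plus the always-on shared expert."""
    out = shared_expert(x)
    for idx, weight in route_token(router_logits):
        out += weight * experts[idx](x)
    return out


# Toy usage: each "expert" is a scalar function standing in for a feed-forward block.
random.seed(0)
experts = [lambda x, s=i: x * (1 + s / NUM_EXPERTS) for i in range(NUM_EXPERTS)]
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
chosen = route_token(logits)
print("experts used:", len(chosen), "of", NUM_EXPERTS)
print("output:", moe_output(1.0, experts, lambda x: x, logits))
```

Only the 8 selected experts (plus the shared one) run for a given token, which is where the compute saving comes from; the other 248 still occupy memory.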

Gated Delta Networks (Linear Attention)

The Qwen3.5 architecture uses a 3:1 ratio of Gated DeltaNet (linear attention) blocks to standard full-attention blocks. A linear-attention block keeps a fixed-size recurrent state, so its memory does not grow with sequence length; only the full-attention blocks (one in four) carry a KV cache that grows with every token. That is what enables the massive 262K to 1M+ context windows even on small models. The full-attention blocks provide precision for complex reasoning. This hybrid approach eases the "memory wall" that typically limits context length on consumer hardware.
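A back-of-the-envelope calculation shows why the hybrid layout matters at long context. The sketch below compares an all-attention stack with a 3:1 hybrid. Every dimension in it (layer count, head counts, head sizes) is a hypothetical value chosen for illustration, not Qwen3.5's real configuration.

```python
def kv_cache_bytes(tokens: int, attn_layers: int, kv_heads: int,
                   head_dim: int, bytes_per_value: float) -> float:
    """KV cache for full-attention layers: keys + values stored for every token."""
    return 2 * attn_layers * kv_heads * head_dim * tokens * bytes_per_value


def linear_state_bytes(linear_layers: int, heads: int, key_dim: int,
                       value_dim: int, bytes_per_value: float) -> float:
    """Recurrent state for linear-attention layers: one key_dim x value_dim matrix per head."""
    return linear_layers * heads * key_dim * value_dim * bytes_per_value


# Hypothetical dimensions for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 48, 4, 128
LINEAR_HEADS, KEY_DIM, VALUE_DIM = 16, 128, 128
FULL_ATTN_LAYERS = LAYERS // 4              # 3:1 ratio -> one layer in four is full attention
LINEAR_LAYERS = LAYERS - FULL_ATTN_LAYERS

GIB = 1024 ** 3
for tokens in (262_144, 1_048_576):
    all_full = kv_cache_bytes(tokens, LAYERS, KV_HEADS, HEAD_DIM, 2.0)         # BF16 cache
    hybrid_kv = kv_cache_bytes(tokens, FULL_ATTN_LAYERS, KV_HEADS, HEAD_DIM, 2.0)
    state = linear_state_bytes(LINEAR_LAYERS, LINEAR_HEADS, KEY_DIM, VALUE_DIM, 2.0)
    print(f"{tokens:>9} tokens: all-attention {all_full / GIB:.1f} GiB, "
          f"hybrid {(hybrid_kv + state) / GIB:.2f} GiB")
```

The linear-attention state is the same size at both context lengths; only the full-attention quarter of the stack scales with tokens, and quantizing that cache to 4 bits shrinks it by a further 4x relative to BF16.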

Near-Lossless 4-bit Quantization

Alibaba specifically engineered these models to maintain accuracy under aggressive quantization. The 35B-A3B model running in 4-bit (weights + KV cache) loses minimal quality while cutting VRAM usage dramatically. This is what makes the "1M context on 32GB VRAM" claim possible.
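The VRAM saving from quantization is simple arithmetic: parameters times bits per weight, divided by eight. The sketch below covers weights only; real quantized checkpoints add some overhead for scales and zero-points, and the KV cache and activations come on top, so treat these as lower bounds.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB (10^9 bytes), excluding KV cache and activations."""
    return params_billion * bits_per_weight / 8


for name, params in [("Qwen3.5-0.8B", 0.8), ("Qwen3.5-9B", 9), ("Qwen3.5-35B-A3B", 35)]:
    print(f"{name}: BF16 ~{weight_gb(params, 16):.1f} GB, "
          f"4-bit ~{weight_gb(params, 4):.1f} GB")
```

For the 35B-A3B model this gives about 70 GB in BF16 against about 17.5 GB at 4 bits, which is why the quantized build is the one that fits on a single consumer card.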

Multi-Token Prediction

All models include multi-token prediction (MTP) for faster inference — the model predicts multiple upcoming tokens simultaneously, reducing the number of forward passes needed.
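The article does not say how the extra predictions are used at inference time. One common pattern, assumed here, is speculative decoding: the extra heads draft several tokens, and a single forward pass verifies them and keeps the longest correct prefix. Under the simplifying assumption that each drafted token is accepted independently with the same probability, the expected number of tokens produced per verification pass is a short geometric sum. The acceptance rates below are hypothetical.

```python
def expected_tokens_per_pass(accept_prob: float, draft_tokens: int) -> float:
    """Expected tokens emitted per verification pass.

    The pass always yields at least one token; drafted token i counts only
    if every earlier draft was accepted, hence the sum of accept_prob ** i.
    """
    return sum(accept_prob ** i for i in range(draft_tokens + 1))


for p in (0.5, 0.7, 0.9):   # hypothetical acceptance rates
    print(f"accept={p}: {expected_tokens_per_pass(p, 3):.2f} tokens per pass with 3 drafts")
```

The speedup therefore depends heavily on how often the drafts are right, which varies by task and is something to measure on your own workload.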

Benchmarks That Matter

Medium Series vs. Proprietary Models

The 35B-A3B model doesn't just compete with proprietary models — it beats them:

Benchmark | Qwen3.5-35B-A3B | Claude Sonnet 4.5 | GPT-5-mini
MMMLU (multilingual knowledge) | ✅ Higher | Lower | Lower
MMMU-Pro (visual reasoning) | ✅ Higher | Lower | Lower

The model also surpasses its own predecessor, the much larger Qwen3-235B-A22B, proving that architecture beats raw scale.

Small Series: Punching Way Above Weight

Benchmark Qwen3.5-9B Qwen3.5-4B GPT-5-Nano Gemini 2.5 Flash-Lite
MMMU-Pro 70.1 66.3 57.2 59.7
MathVision 78.9 74.6 62.2 52.1
GPQA Diamond 81.7 76.2
OmniDocBench 1.5 87.7 86.2 55.9 79.4
VideoMME (w/sub) 84.5 83.5 74.6
MMMLU 81.2

The 9B model beats GPT-5-Nano by 13 points on MMMU-Pro and nearly 17 points on MathVision. These aren't marginal gains. And the 9B model runs on a single consumer GPU.

Running It Locally

Here's the practical guide for running Qwen3.5 on your own hardware:

Hardware Requirements

Model | Quantization | VRAM Needed | Example GPU
Qwen3.5-0.8B | BF16 | ~1.6 GB | Any modern GPU / phone
Qwen3.5-4B | BF16 | ~8 GB | RTX 3060 / RTX 4060
Qwen3.5-9B | BF16 | ~18 GB | RTX 3090 / RTX 4090
Qwen3.5-35B-A3B | 4-bit GPTQ | ~20 GB | RTX 4090 (24GB) / RTX 3090
Qwen3.5-35B-A3B | 4-bit weights + KV cache (1M ctx) | ~32 GB | 2× RTX 3090 or 1× A6000
Qwen3.5-122B-A10B | 4-bit | ~61 GB (weights alone) | A100 / H100 80GB

Quick Start with vLLM

# Install vLLM
pip install vllm

# Serve Qwen3.5-35B-A3B with 4-bit quantization
vllm serve Qwen/Qwen3.5-35B-A3B-AWQ \
  --quantization awq \
  --max-model-len 131072 \
  --tensor-parallel-size 1

# Or use llama.cpp for GGUF quantized versions
./llama-server -m qwen3.5-35b-a3b-q4_k_m.gguf \
  --ctx-size 131072 --n-gpu-layers 99
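
Once `vllm serve` is running, it exposes an OpenAI-compatible HTTP API. The sketch below builds and sends a chat request using only the standard library. It assumes the server is on vLLM's default address (`localhost:8000`, i.e. no `--host` or `--port` override) and that the model name matches the one passed to `vllm serve` above; the helper names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"   # assumption: vLLM default host and port
MODEL = "Qwen/Qwen3.5-35B-A3B-AWQ"      # must match the name given to `vllm serve`


def build_chat_request(prompt: str, max_tokens: int = 256) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str) -> str:
    """Send the request; only works while the server is running."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


request = build_chat_request("Explain MoE architecture in 3 sentences.")
print(request.full_url)
# With the server up: print(ask("Explain MoE architecture in 3 sentences."))
```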

Quick Start with Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3.5-35B-A3B"

# Load weights in the checkpoint's native dtype and spread them across available devices
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Render the conversation with the model's chat template
messages = [{"role": "user", "content": "Explain MoE architecture in 3 sentences."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)
# Drop the prompt tokens so only the newly generated reply is printed
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))

Agentic Tool Calling

One of Qwen3.5's standout features is native tool calling support. Unlike models that require elaborate prompt engineering to use tools, Qwen3.5 understands tool schemas natively and can:

  • Plan multi-step workflows — decompose complex tasks into tool calls
  • Call APIs directly — generate properly formatted function calls with correct arguments
  • Handle tool responses — incorporate results from tool calls into ongoing reasoning
  • Chain tools — use the output of one tool as input to another

The Flash API version goes further with built-in official tools including web search ($10/1,000 calls) and a code interpreter (currently free). This makes Qwen3.5-Flash one of the most capable and affordable agentic models available via API.

For self-hosted deployments, the tool calling works with standard OpenAI-compatible function calling schemas, making it a drop-in replacement for existing agentic pipelines that currently use GPT or Claude.
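On the client side, the loop is the same as with any OpenAI-style API: declare tools, let the model return `tool_calls`, run them, and send the results back as `tool` messages. The sketch below shows the declare-and-dispatch half without contacting a server. The `get_weather` tool, the registry, and the sample assistant message are all hypothetical stand-ins, and inference servers typically need tool calling enabled at startup, so check your server's documentation.

```python
import json


# Hypothetical tool for illustration; not part of Qwen3.5 or any library.
def get_weather(city: str) -> dict:
    return {"city": city, "forecast": "sunny", "temp_c": 21}


# Tool schema in the OpenAI function-calling format, sent with each request.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

REGISTRY = {"get_weather": get_weather}


def run_tool_calls(assistant_message: dict) -> list[dict]:
    """Execute each requested tool and build the `tool` role replies."""
    replies = []
    for call in assistant_message.get("tool_calls") or []:
        fn = REGISTRY[call["function"]["name"]]
        args = json.loads(call["function"]["arguments"])   # arguments arrive as a JSON string
        replies.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(fn(**args)),
        })
    return replies


# Hand-written assistant message standing in for a real server response.
sample = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_0",
        "type": "function",
        "function": {"name": "get_weather", "arguments": "{\"city\": \"Hangzhou\"}"},
    }],
}
print(run_tool_calls(sample))
```

Appending the returned messages to the conversation and calling the model again gives it the tool results to reason over, which is how chained, multi-step workflows are built.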

Pricing: The Cost Comparison

This is where the picture gets stark. If you self-host Qwen3.5-35B-A3B, your per-token cost is effectively $0 (just hardware and electricity). But even the hosted Flash API is cheaper than every other model in this comparison, and dramatically cheaper than the Western frontier models:

Model | Input / 1M tokens | Output / 1M tokens | Total (1M in + 1M out)
Qwen3.5-Flash | $0.10 | $0.40 | $0.50
DeepSeek V3.2 | $0.28 | $0.42 | $0.70
Grok 4.1 Fast | $0.20 | $0.50 | $0.70
Claude Haiku 4.5 | $1.00 | $5.00 | $6.00
Claude Sonnet 4.5 | $3.00 | $15.00 | $18.00
GPT-5.2 | $1.75 | $14.00 | $15.75
Claude Opus 4.6 | $5.00 | $25.00 | $30.00

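Summing the input and output prices per 1M tokens, Qwen3.5-Flash is 36x cheaper than Claude Sonnet 4.5 ($0.50 vs. $18.00). And if you self-host the equivalent model (35B-A3B), per-token fees disappear entirely, leaving only hardware and electricity. For teams running tens of millions of tokens per day through agentic workflows, the gap adds up: for example, 30M input plus 30M output tokens per day costs $540 on Claude Sonnet 4.5, $15 on Qwen3.5-Flash, and $0 in per-token fees when self-hosted.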
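To run the same comparison for your own traffic, a few lines are enough. The prices come from the table above. The 30M-in / 30M-out daily workload is an assumption; it happens to reproduce the $540/day figure at Claude Sonnet 4.5 prices, but your input/output mix will differ.

```python
# USD per 1M tokens (input, output), taken from the pricing table.
PRICES = {
    "Qwen3.5-Flash": (0.10, 0.40),
    "Claude Sonnet 4.5": (3.00, 15.00),
}


def daily_cost(model: str, input_millions: float, output_millions: float) -> float:
    """API cost in USD for one day's traffic, given volumes in millions of tokens."""
    price_in, price_out = PRICES[model]
    return input_millions * price_in + output_millions * price_out


# Assumed workload: 30M input and 30M output tokens per day.
for model in PRICES:
    print(f"{model}: ${daily_cost(model, 30, 30):.2f}/day")
```

Agentic workloads are often input-heavy because tool results and history are resent on every turn, so it is worth plugging in your measured ratio rather than a 1:1 split.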

Pros & Cons

✅ Pros

  • Frontier-level performance at a fraction of the compute
  • Apache 2.0 — full commercial freedom, fine-tuning, redistribution
  • MoE architecture = blazing fast inference for the intelligence level
  • 1M token context on consumer hardware (32GB VRAM)
  • Near-lossless 4-bit quantization engineered from the ground up
  • Native tool calling for agentic workflows
  • Natively multimodal (text + image + video) in small series
  • Full model lineup from 0.8B to 397B — something for every use case
  • Active community, strong Hugging Face ecosystem support
  • Base models also released for research and fine-tuning

❌ Cons

  • Alibaba Cloud API mainly targets APAC — latency from US/EU may vary
  • MoE models need far more VRAM than a dense model with the same active parameter count (35B of weights to hold for 3B of compute), even if they run faster
  • Qwen3.5-Flash is proprietary (API-only) — can't self-host that specific variant
  • Benchmark performance doesn't always translate to real-world agentic tasks
  • Chinese-origin model may face regulatory scrutiny in some enterprises
  • Community tooling (quantizations, adapters) still catching up vs. Llama ecosystem
  • Long-context performance under 4-bit quantization needs real-world validation

How It Compares

Model | Params (Active) | Architecture | License | Context | Best For
Qwen3.5-35B-A3B | 35B (3B) | MoE + DeltaNet | Apache 2.0 | 1M+ | Local agentic AI, cost-zero inference
Llama 4 Scout | 109B (17B) | MoE | Meta Community | 10M | Long-context research
DeepSeek V3.2 | 671B (37B) | MoE | MIT | 128K | Coding, reasoning
Claude Sonnet 4.5 | Unknown | Dense (assumed) | Proprietary | 200K | Creative writing, code, general
GPT-5-mini | Unknown | Proprietary | Proprietary | 128K | General purpose, API access
Gemini 3 Flash | Unknown | Proprietary | Proprietary | 1M | Multimodal, long context

Getting Started

Ready to try Qwen3.5? Here are your options:

Option 1: Self-Host (Free)

  1. Download from Hugging Face
  2. Use vLLM, llama.cpp, or Transformers to serve
  3. For the 35B-A3B with 4-bit quant: need ~20GB VRAM (one RTX 4090)
  4. For 1M context: need ~32GB VRAM

Option 2: Alibaba Cloud API

  1. Sign up at Alibaba Cloud Model Studio
  2. Use Qwen3.5-Flash for $0.10/1M input, $0.40/1M output tokens
  3. OpenAI-compatible API format — drop-in replacement

Option 3: Third-Party Hosting

Providers like Together AI, Fireworks, and Groq typically add new Qwen models within days of release. Check their model catalogs for availability and pricing.
