What Is DFlash? The Core Idea
Speculative decoding is one of the most practical speedups for LLM inference today. A fast "draft" model predicts several tokens ahead, then the target model verifies them in parallel. The catch: the draft model must be fast and must produce tokens the target model will accept. The bottleneck has always been that autoregressive draft models generate tokens one by one, which defeats the parallelism they are meant to enable.
DFlash solves this by replacing autoregressive drafting with block diffusion. A diffusion-based draft model generates a block of tokens simultaneously in one forward pass, conditioned on context features from the target model. The claimed payoff is higher acceptance rates, much faster drafting, and no additional training overhead.
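The draft-then-verify cycle described above can be sketched in a few lines. This is a minimal illustration assuming greedy decoding, not DFlash's actual implementation: `verify_greedy` and `toy_target` are hypothetical names, and the target model is called once per position for readability, whereas a real system scores all draft positions in one batched forward pass.

```python
from typing import Callable, List

Token = int


def verify_greedy(
    target_next: Callable[[List[Token]], Token],
    context: List[Token],
    draft: List[Token],
) -> List[Token]:
    """Return the tokens kept from one draft-then-verify cycle (greedy case)."""
    accepted: List[Token] = []
    for proposed in draft:
        expected = target_next(context + accepted)
        if proposed != expected:
            # First mismatch: keep the target's own token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(proposed)
    # Every draft token matched, so the target contributes one extra token.
    accepted.append(target_next(context + accepted))
    return accepted


def toy_target(tokens: List[Token]) -> Token:
    """Hypothetical stand-in for the target model: next token = last token + 1."""
    return tokens[-1] + 1


# The draft is right for two tokens, then wrong; the target corrects the third.
print(verify_greedy(toy_target, [1, 2, 3], [4, 5, 9, 10]))  # [4, 5, 6]
```

Because every kept token is one the target would have produced itself, the output matches plain greedy decoding, which is what "lossless" refers to. Sampling-based verification uses a rejection-sampling rule instead of the equality check shown here.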
⚡ TL;DR
DFlash from z-lab/Alibaba Research replaces the autoregressive drafting step in speculative decoding with a lightweight block diffusion model. Draft tokens are generated in a single forward pass (not token by token), conditioned on context features from the target LLM. Result: 6x lossless acceleration, with up to 2.5x higher speedup than EAGLE-3, the previous best autoregressive speculative decoding method.
How DFlash Works
The target model (e.g., Qwen3.6-35B-A3B or Qwen3.5-27B) runs normally. The DFlash draft model, a purpose-built diffusion network, receives contextual embeddings from the target model and produces a block of draft tokens in a single forward pass. The target model then validates them in parallel.
Key insight: the DFlash model isn't an LLM in the traditional sense. It's a diffusion model optimized for the specific task of predicting next-token blocks given target model hidden states. This lets DFlash be much smaller and faster than any autoregressive draft model it replaces.
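To make the difference in drafting cost concrete, the toy sketch below counts draft-model forward passes for the two approaches. It does not model the diffusion process itself: `PassCounter`, `autoregressive_draft`, and `block_draft` are hypothetical names, integers stand in for tokens and for the target model's hidden states, and the "+1" rule is an arbitrary placeholder for a real model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PassCounter:
    """Counts draft-model forward passes (hypothetical bookkeeping helper)."""
    draft_passes: int = 0


def autoregressive_draft(context: List[int], k: int, counter: PassCounter) -> List[int]:
    """Toy autoregressive drafter: one forward pass per drafted token."""
    drafted: List[int] = []
    for _ in range(k):
        counter.draft_passes += 1
        # Each step depends on the token produced by the previous step.
        drafted.append((context + drafted)[-1] + 1)
    return drafted


def block_draft(target_features: List[int], k: int, counter: PassCounter) -> List[int]:
    """Toy block drafter: all k tokens come out of a single forward pass."""
    counter.draft_passes += 1
    last = target_features[-1]
    # Every position is filled in at once; no token waits on its neighbour.
    return [last + offset for offset in range(1, k + 1)]


context = [1, 2, 3]
ar_counter, block_counter = PassCounter(), PassCounter()
ar_tokens = autoregressive_draft(context, 15, ar_counter)
block_tokens = block_draft(context, 15, block_counter)

print(ar_tokens == block_tokens)                             # same draft block
print(ar_counter.draft_passes, block_counter.draft_passes)   # 15 passes vs 1
```

The block size of 15 mirrors the `num_speculative_tokens` value used in the vLLM quick start; the point is only that drafting cost stops growing with block length once the whole block is produced in one pass.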
Supported Target/Draft Pairs
| Target Model | DFlash Draft Model | Backend |
|---|---|---|
| Qwen3.5-122B-A10B | z-lab/Qwen3.5-122B-A10B-DFlash | vLLM, SGLang |
| Qwen3.6-35B-A3B | z-lab/Qwen3.6-35B-A3B-DFlash | vLLM, SGLang, MLX |
| Kimi-K2.5 | z-lab/Kimi-K2.5-DFlash | vLLM, SGLang |
| Qwen3-Coder-Next | z-lab/Qwen3-Coder-Next-DFlash | vLLM, SGLang |
| gpt-oss-20b / 120b | z-lab/gpt-oss-20b-DFlash | vLLM, SGLang |
| LLaMA-3.1-8B-Instruct | z-lab/LLaMA3.1-8B-DFlash-UltraChat | Transformers, SGLang |
| Qwen3-4B (non-thinking) | z-lab/Qwen3-4B-DFlash-b16 | Transformers, MLX |
Benchmark Results: 6x Lossless, 2.5x Faster Than EAGLE-3
DFlash was evaluated on GSM8K, MATH500, HumanEval, MBPP, and MT-Bench. The results:
- Over 6x lossless acceleration: quality metrics unchanged, speedup verified
- Up to 2.5x higher speedup than EAGLE-3 (the previous state-of-the-art autoregressive speculative decoding method)
- Evaluated via Transformers, SGLang, vLLM, and MLX backends
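A back-of-the-envelope model helps explain why cheaper drafting translates into a higher end-to-end speedup. The sketch below assumes that one verification pass costs the same as one ordinary target decoding step and that draft cost is expressed as a fraction of that step. The function name and all numbers are illustrative assumptions, not measurements from the DFlash evaluation.

```python
def estimated_speedup(mean_tokens_per_cycle: float, draft_cost_per_cycle: float) -> float:
    """Rough speedup over plain autoregressive decoding.

    Plain decoding yields 1 token per target pass. A speculative cycle yields
    `mean_tokens_per_cycle` tokens for one target pass plus the draft cost,
    where draft cost is measured in units of one target pass.
    """
    if mean_tokens_per_cycle <= 0:
        raise ValueError("mean_tokens_per_cycle must be positive")
    if draft_cost_per_cycle < 0:
        raise ValueError("draft_cost_per_cycle must be non-negative")
    return mean_tokens_per_cycle / (1.0 + draft_cost_per_cycle)


# Illustrative, made-up inputs (NOT benchmark data):
block_size = 15          # tokens drafted per cycle
per_pass_cost = 0.1      # one draft pass costs 10% of a target pass
tokens_per_cycle = 6.0   # mean tokens kept per cycle

# An autoregressive drafter pays one draft pass per drafted token.
ar_speedup = estimated_speedup(tokens_per_cycle, block_size * per_pass_cost)
# A block drafter pays a single draft pass for the whole block.
block_speedup = estimated_speedup(tokens_per_cycle, per_pass_cost)

print(f"autoregressive drafter: {ar_speedup:.2f}x")
print(f"block drafter:          {block_speedup:.2f}x")
```

With the same acceptance behaviour, the block drafter comes out ahead purely because its draft cost no longer scales with block size. Real systems also differ in how many tokens are accepted per cycle, which this model takes as an input rather than predicting.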
Installation & Backends
DFlash supports four installation backends, each with its own pip install command:
Transformers
For Qwen3 and LLaMA-3.1 models only. Full Python API with draft.spec_generate(). Supports CPU and CUDA.
SGLang
Server-mode serving with --speculative-algorithm DFLASH. Best for production serving.
vLLM
Production serving with --speculative-config. Uses a modified vLLM branch (PR #40898).
MLX
Apple Silicon support, tested on Apple M5 Pro. Use stream_generate() for live output.
vLLM Quick Start
vllm serve Qwen/Qwen3.5-27B \
    --speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.5-27B-DFlash", "num_speculative_tokens": 15}' \
    --attention-backend flash_attn \
    --max-num-batched-tokens 32768
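When the command is launched from a script, building the JSON for `--speculative-config` programmatically avoids shell-quoting mistakes. The helper below is a small sketch that reproduces the command above. `build_vllm_command` is a hypothetical name, and the flags and values are copied from the quick start rather than taken from a full list of vLLM options.

```python
import json
import shlex
from typing import List


def build_vllm_command(target: str, draft: str, num_speculative_tokens: int = 15) -> List[str]:
    """Assemble the quick-start `vllm serve` command as an argv list."""
    speculative_config = {
        "method": "dflash",
        "model": draft,
        "num_speculative_tokens": num_speculative_tokens,
    }
    return [
        "vllm", "serve", target,
        # json.dumps guarantees valid JSON regardless of the model names used.
        "--speculative-config", json.dumps(speculative_config),
        "--attention-backend", "flash_attn",
        "--max-num-batched-tokens", "32768",
    ]


argv = build_vllm_command("Qwen/Qwen3.5-27B", "z-lab/Qwen3.5-27B-DFlash")

# shlex.join quotes the JSON argument so the line can be pasted into a shell.
print(shlex.join(argv))
```

The argv list can be passed directly to `subprocess.run(argv)`, which sidesteps shell quoting entirely.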
Why This Matters
Every GPU-inference user chasing lower latency benefits here. DFlash is also available on MLX for Apple Silicon, which suits local deployment. The draft models are tiny (A3B experts) but accelerate massive base models dramatically.
Citation: arXiv:2602.06036
Try it today
HuggingFace: z-lab/dflash | GitHub: z-lab/dflash