๐ŸŽง
Listen to this article
AI-generated narration using OpenAI TTS

What Is DFlash? The Core Idea

Speculative decoding is one of the most practical speedups for LLM inference today. An LLM runs a fast "draft" model to predict tokens, then the target model verifies them in parallel. The trick: draft models must be fast and produce tokens the target model will accept. The bottleneck has always been that autoregressive draft models generate tokens one-by-one โ€” defeating the parallelism they're meant to enable.

DFlash solves this by replacing autoregressive drafting with block diffusion. A diffusion-based draft model generates a block of tokens simultaneously in one forward pass, conditioned on context features from the target model. Higher acceptance rates, much faster drafting, and no additional training overhead.

โšก TL;DR

DFlash from z-lab/Alibaba Research replaces the autoregressive drafting step in speculative decoding with a lightweight block diffusion model. Draft tokens are generated in a single forward pass (not token-by-token), conditioned on context features from the target LLM. Result: 6x lossless acceleration, beating the previous best โ€” autoregressive speculative decoding methods like EAGLE-3 โ€” by 2.5x.

How DFlash Works

The target model (e.g., Qwen3.6-35B-A3B, Qwen3.5-27B) runs normally. The DFlash draft model โ€” a purpose-built diffusion network โ€” receives contextual embeddings from the target model and produces a block of draft tokens in a single forward pass. The target model then validates them in parallel.

Key insight: the DFlash model isn't an LLM in the traditional sense. It's a diffusion model optimized for the specific task of predicting next-token blocks given target model hidden states. This lets DFlash be much smaller and faster than any autoregressive draft model it replaces.

Supported Target/Draft Pairs

Target ModelDFlash Draft ModelBackend
Qwen3.5-122B-A10Bz-lab/Qwen3.5-122B-A10B-DFlashvLLM, SGLang
Qwen3.6-35B-A3Bz-lab/Qwen3.6-35B-A3B-DFlashvLLM, SGLang, MLX
Kimi-K2.5z-lab/Kimi-K2.5-DFlashvLLM, SGLang
Qwen3-Coder-Nextz-lab/Qwen3-Coder-Next-DFlashvLLM, SGLang
gpt-oss-20b / 120bz-lab/gpt-oss-20b-DFlashvLLM, SGLang
LLaMA-3.1-8B-Instructz-lab/LLaMA3.1-8B-DFlash-UltraChatTransformers, SGLang
Qwen3-4B (non-thinking)z-lab/Qwen3-4B-DFlash-b16Transformers, MLX

Benchmark Results: 6x Lossless, 2.5x Faster Than EAGLE-3

DFlash was evaluated on GSM8K, MATH500, HumanEval, MBPP, and MT-Bench. The results:

  • Over 6x lossless acceleration โ€” quality metrics unchanged, speedup verified
  • Up to 2.5x higher speedup than EAGLE-3 (previous SOTA autoregressive speculative decoding)
  • Evaluated via transformers, SGLang, vLLM, and MLX backends

Installation & Backends

DFlash supports four installation backends, each with its own pip install command:

๐Ÿค— Transformers

For Qwen3 and LLaMA-3.1 models only. Full Python API with draft.spec_generate(). Supports CPU and CUDA.

โšก SGLang

Server-mode serving with --speculative-algorithm DFLASH. Best for production serving.

๐Ÿš€ vLLM

Production serving with --speculative-config. Uses modified vLLM branch (PR #40898).

๐ŸŽ MLX

Apple Silicon support. Tested on Apple M5 Pro. stream_generate() for live output.

vLLM Quick Start

vllm serve Qwen/Qwen3.5-27B \
   --speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.5-27B-DFlash",
     "num_speculative_tokens": 15}' \
   --attention-backend flash_attn \
   --max-num-batched-tokens 32768

Why This Matters

Every GPU-inference user chasing lower latency benefits here. DFlash is available on MLX for Apple Silicon โ€” perfect for local deployment. The draft models are tiny (A3B experts) but accelerate massive base models dramatically. Citation: arXiv:2602.06036

๐Ÿ”ง Try it today

HuggingFace: z-lab/dflash | GitHub: z-lab/dflash

References

  1. HuggingFace โ€” z-lab (Official Organization)
  2. arXiv:2602.06036 โ€” DFlash: Block Diffusion for Flash Speculative Decoding (Official Paper)
  3. GitHub โ€” z-lab/dflash (Official Repository)