Watch the video version: ThinkSmart.Life/youtube

Introduction: The Promise and the Pain of Sparse Models

Mixture-of-Experts (MoE) architectures have become a dominant pattern in frontier model design. The appeal is clear: by routing each token to only a subset of specialized expert sub-networks, you can grow total parameter count dramatically without proportionally increasing per-token compute. DeepSeek-V3 illustrates this in its most extreme form: 685 billion total parameters, but only 37 billion activated per token. That is roughly an 18× gap between the model you store and the model you run.[1]

For inference, this trade-off is largely favorable. For training, the gap between stored and activated parameters creates a three-dimensional systems problem that simply does not exist in dense transformers. Every expert's parameters, gradients, and optimizer states must reside in GPU memory during training, not just the ones used for any given batch. Tokens must be dynamically routed across potentially hundreds of experts distributed across thousands of GPUs. The expert computations that result are often small matrix multiplications that leave GPU arithmetic units underutilized. And these three pressures (memory, communication, compute) are tightly coupled: optimizing one often shifts the bottleneck to another.

NVIDIA's paper "Scalable Training of Mixture-of-Experts Models with Megatron-Core" (arXiv:2603.07685) describes an integrated framework that addresses all three constraints simultaneously.[1] At 88 pages with 42 figures, it is the most comprehensive open systems reference for large-scale MoE training available. It covers Parallel Folding (a novel parallelism decoupling strategy), a full suite of memory and communication optimizations, reduced-precision training at FP8 and NVFP4, and production features from upcycling to reinforcement learning. The result: 1,233 TFLOPS/GPU for DeepSeek-V3-685B on NVIDIA GB300.

88 pages · 42 figures
18× param/compute gap (DeepSeek-V3)
1,233 TFLOPS/GPU on GB300
1,024+ GPU scale supported

The Three Walls: Why MoE Training Is a Systems Problem

MoE sparsity creates what the paper calls a Parameter-Compute Mismatch: total parameters scale with the number of experts E, while per-token compute only scales with the number of activated experts K. With K ≪ E, training MoE models at scale creates three tightly coupled constraints the paper calls The Three Walls.

Memory Wall

All E experts must stay in GPU memory, but only K activate per token. That is 18× more parameters than an equivalent dense model, all requiring gradients and optimizer states.

Communication Wall

Expert Parallelism requires all-to-all collectives at every MoE layer. In DeepSeek-V3, unoptimized all-to-all consumed up to 60% of total training step time.

Compute Wall

Fine-grained experts produce small GEMMs. Only ~50% GPU time in compute (vs ~70% in dense models). Routing adds ~9% overhead. Dynamic shapes block CUDA Graph optimization.

The coupling problem: These three walls are not independent. Increasing batch size improves GEMM utilization (compute wall) but amplifies memory pressure and communication volume. CUDA Graphs eliminate host overhead (compute wall) but require static tensor shapes, which conflicts with dynamic MoE routing. Every optimization forces a trade-off with at least one other wall.

The Memory Wall in Detail

In a dense transformer, per-token compute and memory scale together: add more parameters, and each forward pass uses them. MoE breaks this coupling. A model with 685B total parameters and 37B active per token must still store all 685B parameters in GPU memory across all training ranks, along with their gradient accumulators and Adam-family optimizer states (typically two additional parameter-sized buffers, the first and second moments). The parameter-related memory footprint of an MoE model with the same per-token compute as a 37B dense model is therefore approximately 18× larger.
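A back-of-the-envelope sketch makes the footprint concrete. It assumes the common mixed-precision accounting of 16 bytes per parameter (2 for BF16 weights, 2 for BF16 gradients, 12 for an FP32 master copy plus two FP32 Adam moments). That figure is an assumption borrowed from the ZeRO analysis [2], not a number from the Megatron-Core paper, and the function name is illustrative. Activations are ignored.

```python
# Rough training-state estimate. 16 bytes/param is an ASSUMED mixed-precision
# Adam accounting (2 weights + 2 grads + 12 optimizer), not a paper figure.
BYTES_PER_PARAM = 16

def training_state_terabytes(num_params: float) -> float:
    """Weights + gradients + optimizer state, in TB (1e12 bytes)."""
    return num_params * BYTES_PER_PARAM / 1e12

TOTAL_PARAMS = 685e9   # DeepSeek-V3 total parameters (from the text)
ACTIVE_PARAMS = 37e9   # parameters activated per token (from the text)

moe_tb = training_state_terabytes(TOTAL_PARAMS)
dense_tb = training_state_terabytes(ACTIVE_PARAMS)
ratio = TOTAL_PARAMS / ACTIVE_PARAMS

print(f"MoE state:   {moe_tb:.2f} TB")
print(f"Dense state: {dense_tb:.2f} TB")
print(f"Ratio:       {ratio:.1f}x")
```

Under these assumptions the MoE training state is close to 11 TB before any activations are counted, which is why it has to be sharded across many GPUs.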

Dynamic routing compounds the problem. At any given step, some experts receive far more tokens than others. This load imbalance causes unpredictable activation memory spikes: the intermediate activations of an overloaded expert temporarily exceed what a balanced load would require, leading to memory fragmentation and occasional out-of-memory events that are difficult to predict statically.
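The imbalance can be illustrated with a toy simulation. The sketch below is not the paper's router: it draws random Gaussian scores as a stand-in for a learned gating network (an assumption), picks the top-K experts per token, and compares the busiest expert against a perfectly balanced load. All names and sizes are hypothetical.

```python
import random
from collections import Counter

def route_top_k(scores, k):
    """Return the indices of the k highest-scoring experts for one token."""
    return sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:k]

def expert_loads(num_tokens, num_experts, k, seed=0):
    """Count how many tokens each expert receives under random router scores."""
    rng = random.Random(seed)
    loads = Counter()
    for _ in range(num_tokens):
        # Random scores stand in for a learned router (illustrative only).
        scores = [rng.gauss(0.0, 1.0) for _ in range(num_experts)]
        loads.update(route_top_k(scores, k))
    return [loads[e] for e in range(num_experts)]

loads = expert_loads(num_tokens=1024, num_experts=16, k=2)
balanced = 1024 * 2 / 16  # tokens per expert if routing were perfectly even
print("max load / balanced load:", max(loads) / balanced)
```

Even with unbiased random scores the busiest expert exceeds the balanced share, and activation buffers have to be sized for that peak rather than the average. A learned router may be more or less skewed than this toy.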

The Communication Wall in Detail

Expert Parallelism (EP) solves the memory wall by distributing experts across GPU ranks, but mandates a new communication pattern at every MoE layer. Before expert computation, every token must be dispatched to its assigned expert on potentially any GPU in the EP group. After computation, expert outputs must be gathered back to the originating GPU. This requires two all-to-all collective operations per MoE layer.

The data volume is substantial. For T tokens, Top-K routing, and hidden dimension h, each dispatch all-to-all moves approximately T·K·h·(EP−1)/EP values. A full dispatch-and-combine cycle doubles this. At large EP, this communication increasingly moves from high-bandwidth intra-node NVLink (450 GB/s on DGX H100) to narrower inter-node InfiniBand (400 Gbps, about 50 GB/s, roughly 9× less bandwidth per GPU). The paper's measurements show that unoptimized all-to-all in DeepSeek-V3 consumed up to 60% of total training step time.[1]
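To get a feel for the magnitudes, the volume formula can be evaluated directly. The workload below (token count, Top-K, hidden size, EP degree) is hypothetical and chosen only for illustration. The sketch also assumes 2-byte BF16 values, ideal link bandwidth, and that all traffic uses a single link type, none of which holds exactly in practice.

```python
def dispatch_values(tokens: int, top_k: int, hidden: int, ep: int) -> float:
    """Approximate values moved by one dispatch: T*K*h*(EP-1)/EP."""
    return tokens * top_k * hidden * (ep - 1) / ep

def transfer_seconds(values: float, bytes_per_value: int, gb_per_s: float) -> float:
    """Idealized transfer time at a given bandwidth in GB/s (1e9 bytes/s)."""
    return values * bytes_per_value / (gb_per_s * 1e9)

# Hypothetical workload, for illustration only.
T, K, H, EP = 4096, 8, 7168, 64

one_way = dispatch_values(T, K, H, EP)
round_trip = 2 * one_way                              # dispatch + combine
nvlink_s = transfer_seconds(round_trip, 2, 450.0)     # intra-node NVLink
ib_s = transfer_seconds(round_trip, 2, 50.0)          # 400 Gbps InfiniBand

print(f"values per dispatch: {one_way:,.0f}")
print(f"NVLink: {nvlink_s * 1e3:.2f} ms, InfiniBand: {ib_s * 1e3:.2f} ms")
```

The two estimates differ by exactly the bandwidth ratio, so the same layer costs about nine times more communication time once its traffic leaves the node.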

The Compute Efficiency Wall in Detail

Dense transformers achieve high Model FLOPS Utilization (MFU) on large, regular GEMM operations. MoE undermines this in several ways:

Parallel Folding: Rethinking Multi-Dimensional Parallelism

The central architectural innovation in the paper is Parallel Folding, a strategy for decoupling the parallelism configurations of Attention and MoE layers.[1,6]

In conventional distributed training frameworks, Expert Parallelism (EP) is constrained to be a sub-group of Data Parallelism (DP): EP ≤ DP. This arises because EP groups are formed by splitting DP groups. Every expert-parallel rank corresponds to a data-parallel rank, meaning the number of expert shards is bounded by the data parallel degree. For models with large numbers of experts or configurations where optimal DP is small, this constraint prevents reaching the EP degree needed to fit expert parameters into GPU memory.

Parallel Folding eliminates this constraint by assigning each layer type its own independent parallelism group:

| Layer Type | Parallelism Group | Dimensions |
|---|---|---|
| Attention layers | TP × CP × DP × PP | 4D, standard dense parallelism |
| MoE layers | TP × EP × DP × PP | 4D, independent EP configuration |

The key insight: Attention computation benefits from TP (for weight sharding) and CP (for long-context sequence splitting). Expert computation benefits from large EP (to distribute memory) but does not benefit from CP in the same way. By letting each layer type use what suits it best, Parallel Folding unlocks configurations that are simply not expressible in standard frameworks.
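A minimal sketch of the idea, under stated assumptions: both layer types must tile the same set of GPUs, and the pipeline degree is assumed to be shared between them. The class and function names are hypothetical and do not correspond to Megatron-Core's API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AttentionLayout:
    tp: int  # tensor parallel
    cp: int  # context parallel
    dp: int  # data parallel
    pp: int  # pipeline parallel

    def world_size(self) -> int:
        return self.tp * self.cp * self.dp * self.pp

@dataclass(frozen=True)
class MoELayout:
    tp: int  # tensor parallel (for experts)
    ep: int  # expert parallel
    dp: int  # data parallel (for experts)
    pp: int  # pipeline parallel

    def world_size(self) -> int:
        return self.tp * self.ep * self.dp * self.pp

def is_consistent(world_size: int, attn: AttentionLayout, moe: MoELayout) -> bool:
    """Both layouts must cover the same GPUs; PP is ASSUMED shared here."""
    return (attn.world_size() == world_size
            and moe.world_size() == world_size
            and attn.pp == moe.pp)

# Hypothetical 64-GPU job: attention uses CP, experts use a large EP instead.
attn = AttentionLayout(tp=2, cp=4, dp=4, pp=2)
moe = MoELayout(tp=1, ep=16, dp=2, pp=2)

print(is_consistent(64, attn, moe), moe.ep > attn.dp)
```

In this example the MoE layers run with EP = 16 while the attention layers run with DP = 4, a combination that the EP ≤ DP rule would forbid.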

The reported MFU improvements are meaningful: 49.3% for Mixtral 8×22B and 39.0% for Qwen2-57B-A14B on H100 clusters, versus 46.3% and 35.3% respectively for the prior MCore baseline.[1] These are MFU (normalized against theoretical peak), making them architecture- and hardware-comparable.

Tradeoff: Parallel Folding adds implementation complexity. Managing two independent parallelism group hierarchies within a single training job requires careful coordination of gradient accumulation, optimizer steps, and checkpoint sharding. The paper describes these interactions but they are non-trivial to implement correctly.

Breaking the Memory Wall

Megatron-Core attacks the memory wall with a layered strategy: no single technique is sufficient, and the techniques interact in ways that require careful co-design.

Fine-Grained Activation Recomputation

Standard gradient checkpointing saves only layer-boundary activations and recomputes intermediate activations during the backward pass, trading compute for memory. Megatron-Core extends this to be selective and per-layer, allowing practitioners to configure exactly which activation tensors to retain versus recompute. For MoE layers specifically, this is valuable because expert activations scale with E (all expert intermediate states) rather than just K, making selective recomputation of low-priority activations disproportionately effective.

Memory-Efficient Permutation

Naive token permutation (grouping tokens by assigned expert before computation) allocates a new buffer of size T×K×h, creating a memory spike precisely when expert activations also need to exist. The paper's memory-efficient permutation achieves zero-overhead activation reduction by reusing existing buffers in-place, eliminating the spike entirely.[1]
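For reference, the naive permutation that the paper improves on can be sketched as a stable sort by expert id plus its inverse. This is the buffer-allocating baseline, not the paper's in-place variant, and the function names are illustrative.

```python
def permute_by_expert(expert_ids, num_experts):
    """Return (order, counts): token indices grouped by expert, tokens per expert."""
    # sorted() is stable, so tokens keep their relative order within an expert.
    order = sorted(range(len(expert_ids)), key=lambda i: expert_ids[i])
    counts = [0] * num_experts
    for e in expert_ids:
        counts[e] += 1
    return order, counts

def unpermute(values, order):
    """Scatter expert-ordered values back to the original token positions."""
    out = [None] * len(values)
    for dst, src in enumerate(order):
        out[src] = values[dst]
    return out

tokens = ["t0", "t1", "t2", "t3", "t4"]
expert_ids = [2, 0, 1, 0, 2]          # one assigned expert per token

order, counts = permute_by_expert(expert_ids, num_experts=3)
grouped = [tokens[i] for i in order]  # NEW buffer: this is the memory spike
restored = unpermute(grouped, order)

print(grouped, counts)
```

The `grouped` list is the extra allocation the text refers to. With Top-K routing each token appears K times, which is where the T×K×h size comes from.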

Activation Offloading

Activation tensors are moved to host (CPU) DRAM during the forward pass and brought back as needed during the backward pass. Host memory is typically 10–20× larger than GPU memory per device. The cost is PCIe bandwidth; the paper uses asynchronous transfers to overlap the host-GPU traffic with ongoing computation, partially hiding latency. On newer hardware where NVLink bandwidth dwarfs PCIe, the off-GPU path is relatively more expensive, a tradeoff that practitioners must evaluate per configuration.

FSDP for MoE

ZeRO-style (Zero Redundancy Optimizer) parameter sharding distributes optimizer states, gradients, and optionally parameters across DP ranks, reducing per-GPU memory proportionally to DP group size. Megatron-Core's implementation handles the EP-DP interaction correctly: expert parameters are already sharded by EP, so FSDP must coordinate with this existing sharding without creating redundancy or inconsistency.
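The effect of each sharding level can be sketched with the accounting from the ZeRO paper [2]: 2 bytes of BF16/FP16 weights, 2 bytes of gradients, and 12 bytes of FP32 optimizer state per parameter. This is a simplified model that ignores activations and the EP sharding discussed above; the function name is illustrative.

```python
def zero_bytes_per_param(stage: int, dp: int) -> float:
    """Per-GPU bytes per parameter under ZeRO stage 0-3 with DP degree `dp`."""
    weights, grads, optimizer = 2.0, 2.0, 12.0  # assumed mixed-precision Adam
    if stage >= 1:
        optimizer /= dp   # stage 1: shard optimizer states
    if stage >= 2:
        grads /= dp       # stage 2: also shard gradients
    if stage >= 3:
        weights /= dp     # stage 3: also shard parameters
    return weights + grads + optimizer

for stage in range(4):
    print(f"stage {stage}: {zero_bytes_per_param(stage, dp=64):.4f} bytes/param")
```

With DP = 64 the per-parameter cost falls from 16 bytes with no sharding to 0.25 bytes at stage 3, which is the "proportional to DP group size" reduction described above.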

Breaking the Communication Wall

DeepEP and HybridEP

DeepEP and HybridEP are custom all-to-all dispatchers designed to maximize bandwidth utilization on NVIDIA interconnects.[1] DeepEP targets single-node EP deployments where all traffic traverses NVLink, tuning buffer management and transfer scheduling for the 450 GB/s intra-node bandwidth. HybridEP handles cross-node EP where traffic mixes NVLink (intra-node) and InfiniBand (inter-node), optimizing the topology-aware routing that standard all-to-all implementations miss. Standard collective implementations do not differentiate between these topologies, leaving significant bandwidth on the table.

EP Communication Overlapping

In a baseline MoE layer, dispatch → compute → combine execute sequentially. The GPU sits idle waiting for all-to-all to complete before expert GEMMs can start. Communication overlapping restructures this pipeline so that token routing and expert computation partially overlap: while one batch of tokens is being dispatched, the previous batch is already being processed by its experts. This requires careful scheduling and buffer management but can substantially reduce the effective communication overhead visible to end-to-end throughput. The paper targets the 60% baseline figure specifically with this technique.
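A toy timing model shows why overlapping helps. It assumes one communication stream and one compute stream, fixed per-chunk durations, and a simple in-order schedule (dispatch the next chunk before combining the previous one). This is an illustrative simplification, not the paper's scheduler, and the durations are made up.

```python
def sequential_time(n, dispatch, compute, combine):
    """Baseline: every chunk runs dispatch -> compute -> combine back to back."""
    return n * (dispatch + compute + combine)

def overlapped_time(n, dispatch, compute, combine):
    """Two-stream pipeline: communication of one chunk overlaps compute of another."""
    comm_t = 0   # time at which the communication stream becomes free
    comp_t = 0   # time at which the compute stream becomes free
    compute_done = [0] * n
    for i in range(n):
        comm_t += dispatch                        # dispatch chunk i
        comp_t = max(comp_t, comm_t) + compute    # compute waits for its dispatch
        compute_done[i] = comp_t
        if i >= 1:
            # combine the previous chunk once its compute has finished
            comm_t = max(comm_t, compute_done[i - 1]) + combine
    # combine the last chunk
    return max(comm_t, compute_done[n - 1]) + combine

# Hypothetical durations in arbitrary time units.
print(sequential_time(4, 1, 2, 1), overlapped_time(4, 1, 2, 1))
```

With a single chunk there is nothing to overlap and both models agree. With several chunks, most of the communication hides behind expert compute, and only the first dispatch and the last combine remain exposed.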

Breaking the Compute Efficiency Wall

Grouped GEMM

Rather than launching a separate GEMM kernel for each expert's token batch, Grouped GEMM batches all expert computations into a single kernel call with grouped matrix multiply semantics. This amortizes kernel launch overhead and allows the GPU to better schedule work across experts, improving utilization of arithmetic units even when individual expert batches are small. This is the primary tool for recovering the roughly 20-point gap in compute-time share between MoE (~50%) and dense (~70%) models.
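The semantics of a grouped GEMM can be sketched in plain Python: tokens are already sorted by expert, and a single call multiplies each contiguous group by that expert's weight matrix. A real implementation is a single GPU kernel; this reference loop only illustrates the interface, and all names are hypothetical.

```python
def matmul(a, b):
    """Dense matrix product of two lists-of-rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def grouped_gemm(tokens, group_sizes, weights):
    """Multiply each contiguous token group by its expert's weight matrix.

    tokens:      rows sorted by expert, shape [sum(group_sizes), h_in]
    group_sizes: number of tokens routed to each expert (may be 0)
    weights:     one [h_in, h_out] matrix per expert
    """
    out, start = [], 0
    for size, w in zip(group_sizes, weights):
        out.extend(matmul(tokens[start:start + size], w))
        start += size
    return out

tokens = [[1, 0], [0, 1], [1, 1]]   # first two rows -> expert 0, last -> expert 1
weights = [
    [[1, 2], [3, 4]],               # expert 0
    [[10, 0], [0, 10]],             # expert 1
]
result = grouped_gemm(tokens, [2, 1], weights)
print(result)
```

The caller makes one call regardless of how many experts exist or how unevenly tokens are distributed, which is what lets a fused kernel amortize launch overhead.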

Kernel Fusion

The router computation, auxiliary loss calculation, and token permutation are fused into fewer kernel launches. Each fusion reduces host-side overhead (kernel launch latency, between-kernel synchronization) and allows intermediate results to remain in fast register or L1 cache rather than round-tripping through global memory.

CUDA Graphs and ECHO

CUDA Graph optimization captures a computation graph once and replays it without host-side kernel launch overhead. This is highly effective for dense models with static shapes, but MoE's dynamic routing means token counts per expert vary each step, producing variable tensor shapes that CUDA Graph cannot handle natively.

The paper introduces ECHO (full CUDA Graph coverage for dropless MoE) to address this directly. ECHO combines sync-free kernels (that do not require host-device synchronization for shape information) with Paged Stashing (a technique for managing dynamic buffer allocations within a CUDA Graph context). Together, they extend CUDA Graph coverage to the MoE layers that previously required dynamic dispatch, capturing the kernel launch savings that were previously unavailable for dropless MoE configurations.[1]

FP8 and NVFP4: Precision as a Systems Lever

Reduced-precision training matters more for MoE than for dense models because it simultaneously addresses all three walls, not just compute efficiency:

The paper describes four precision recipes across hardware generations:

| Recipe | Hardware | Notes |
|---|---|---|
| Per-Tensor FP8 | H100 (Hopper) | Single scale factor per tensor. Simple, good baseline. |
| Blockwise FP8 | H100 (Hopper) | Scale factor per block of weights. Better accuracy at cost of metadata overhead. |
| MXFP8 | GB200/GB300 (Blackwell) | Microscaling FP8. Hardware-native on Blackwell; no software overhead. |
| NVFP4 | GB200/GB300 (Blackwell) | 4-bit. Maximum compression. Requires careful per-layer calibration. |

MoE-specific challenges arise from dynamic token routing. Grouped quantization, which applies quantization scaling factors at the expert-group level rather than globally, is required because different experts see different activation distributions. Standard per-tensor quantization applied across experts would sacrifice accuracy. The paper describes grouped quantization kernels that maintain per-expert scaling without introducing additional synchronization overhead.[1]
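The difference between a single global scale and per-expert scales can be sketched as follows. The code assumes the FP8 E4M3 format, whose largest finite value is 448, and a simple amax-based scale; it computes scale factors only and does not emulate FP8 rounding. The activation values and names are made up for illustration.

```python
FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def amax(values):
    """Largest absolute value in a list of activations."""
    return max(abs(v) for v in values)

def per_expert_scales(expert_activations):
    """One scale per expert: that expert's amax maps onto the FP8 range."""
    return {e: amax(v) / FP8_E4M3_MAX for e, v in expert_activations.items()}

def per_tensor_scale(expert_activations):
    """One shared scale: the global amax maps onto the FP8 range."""
    return max(amax(v) for v in expert_activations.values()) / FP8_E4M3_MAX

# Hypothetical activations: expert 0 sees small values, expert 1 sees large ones.
acts = {0: [0.5, -1.0], 1: [100.0, -448.0]}

grouped = per_expert_scales(acts)
shared = per_tensor_scale(acts)
print(grouped, shared)
```

With the shared scale, expert 0's values occupy only a tiny slice of the representable range because expert 1's outliers set the scale. The per-expert scale lets each expert use the full range, which is the accuracy argument made above.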

FP8 Primary Weights is a storage optimization that eliminates the redundant BF16 master weight copy maintained by some training frameworks. Instead, FP8 tensors serve as the primary (and only) weight storage, with conversion to higher precision only when needed for gradient accumulation. This removes an entire parameter-sized memory allocation that previously coexisted with the FP8 copy.

Performance Results and Case Studies

1,233 TFLOPS/GPU · DeepSeek-V3 · GB300
1,048 TFLOPS/GPU · DeepSeek-V3 · GB200
974 TFLOPS/GPU · Qwen3-235B · GB300
49.3% MFU · Mixtral 8x22B · H100

The TFLOPS/GPU figures are absolute compute throughput, not MFU: they are not expressed as a fraction of theoretical peak. GB300 has a theoretical peak of ~2,800 TFLOPS in FP8, so 1,233 TFLOPS/GPU for DeepSeek-V3 represents roughly 44% hardware utilization at full model scale, which is strong for a 685B-parameter sparse model with complex routing overhead.

The DeepSeek-V3 Case Study (Section 9.2)

The paper provides a detailed walkthrough of optimizing DeepSeek-V3 training on both GB200 and H100 clusters using a three-phase workflow:[1]

  1. Phase 1: Establish memory-feasible parallelism. Find a parallelism configuration where the model fits in GPU memory without OOM. This sets the baseline; no throughput optimization yet.
  2. Phase 2: Select optimal parallelism strategy. Given the memory-feasible space, search for the configuration that maximizes MFU. For DeepSeek-V3, Parallel Folding significantly expands the available configurations relative to the EP ≤ DP baseline.
  3. Phase 3: Profile and optimize bottlenecks. Apply communication overlapping, kernel fusion, CUDA Graphs (where applicable), and precision tuning based on profiling data from Phase 2.

The paper reports that fine-grained MoE models (like Qwen2-57B-A14B with 64 experts, 8 active per token) consistently achieve lower MFU than coarse-grained models (like Mixtral 8×22B with 8 experts, 2 active per token) due to smaller expert hidden dimensions (lower GEMM efficiency) and higher communication volume from more expert activations per token. This is an architectural trade-off, not a framework limitation: fine-grained architectures trade training efficiency for model quality gains.

Production Features: The Complete Toolkit

Beyond the core three-walls optimizations, Section 7 of the paper catalogs production features that are required for real training runs at scale:

Limitations and Open Questions

The fine-grained MoE efficiency gap is architectural, not fixable by the framework. Models like DeepSeek-V3 and Qwen3 use fine-grained expert architectures (many small experts) that trade training efficiency for model quality. Megatron-Core optimizes what it can, but the gap between 49.3% MFU (Mixtral coarse-grained) and 39% MFU (Qwen fine-grained) on H100 reflects the architecture, not the framework.

Conclusion

NVIDIA's Megatron-Core MoE technical report is the most comprehensive open systems reference for large-scale sparse model training published to date. Its value is not in any single technique (Parallel Folding, DeepEP, ECHO, or FP8 primary weights) but in the integrated treatment of all three walls simultaneously with explicit analysis of their interactions and trade-offs.

For practitioners training MoE models today, the actionable takeaways are concrete. The three-phase optimization workflow (memory feasibility → parallelism selection → bottleneck profiling) provides a replicable methodology. Parallel Folding's removal of the EP ≤ DP constraint directly expands the configuration search space for large models. Communication overlapping and DeepEP/HybridEP target the 60%-of-training-time problem that otherwise dominates cost at scale. And FP8/NVFP4 on Blackwell hardware compounds all three improvements simultaneously.

The limitations are real: fine-grained MoE architectures trade training efficiency for model quality in ways the framework cannot fully compensate for; GB300 performance numbers require hardware most practitioners don't have; and reinforcement learning integration remains an early-stage component. But the framework itself, available open-source in Megatron-Core, is production-ready, and the paper's 88 pages represent genuine engineering depth worth studying for anyone building at the MoE frontier.

Key Takeaways for ML Engineers

1. The Three Walls framework is a useful mental model for MoE training constraints. Memory, communication, and compute are coupled: address all three or your optimization shifts the bottleneck rather than removing it.

2. Parallel Folding breaks the EP ≤ DP constraint. If you're training MoE on Megatron-Core, update to a version that supports it and re-run your parallelism search.

3. Communication is the dominant cost at large EP. 60% of training time in all-to-all is not an edge case; it's the default without active optimization. DeepEP/HybridEP + overlapping is not optional for production runs.

4. FP8 hits all three walls at once. On Blackwell hardware, MXFP8 and NVFP4 are the right default starting points, not a later optimization. On Hopper, per-tensor FP8 is a solid default for compute-bound workloads.

References

  1. Yan et al. (2026). Scalable Training of Mixture-of-Experts Models with Megatron Core. arXiv:2603.07685
  2. Rajbhandari et al. (2020). ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
  3. Shoeybi et al. (2019). Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
  4. DeepSeek-AI (2025). DeepSeek-V3 Technical Report
  5. Jiang et al. (2024). Mixtral of Experts
  6. Liu et al. (2025). MoE Parallel Folding: Heterogeneous Parallelism Mappings for Efficient Large-Scale MoE Model Training with Megatron Core