🧠 Scalable MoE Training: Inside NVIDIA's Megatron-Core Technical Report

DeepSeek-V3 has 685B parameters but only 37B active per token — an 18× gap that breaks every assumption in dense model training. NVIDIA's 88-page technical report explains how to train at this scale without running into walls.

March 10, 2026 · min read

📺 Watch the video version:

Introduction: The Promise and the Pain of Sparse Models

Mixture-of-Experts (MoE) architectures have become a dominant pattern in frontier model design. The appeal is clear: by routing each token to only a subset of specialized expert sub-networks, you can grow total parameter count dramatically without proportionally increasing per-token compute. DeepSeek-V3 illustrates this in its most extreme form — 685 billion total parameters, but only 37 billion activated per token. That is an 18× gap between the model you store and the model you run.^[1]

For inference, this trade-off is largely favorable. For training, the gap between stored and activated parameters creates a three-dimensional systems problem that simply does not exist in dense transformers. Every expert's parameters, gradients, and optimizer states must reside in GPU memory during training — not just the ones used for any given batch. Tokens must be dynamically routed across potentially hundreds of experts distributed across thousands of GPUs. The expert computations that result are often small matrix multiplications that leave GPU arithmetic units underutilized. And these three pressures — memory, communication, compute — are tightly coupled: optimizing one often shifts the bottleneck to another.

NVIDIA's paper "Scalable Training of Mixture-of-Experts Models with Megatron-Core" (arXiv:2603.07685) describes an integrated framework that addresses all three constraints simultaneously.^[1] At 88 pages with 42 figures, it is the most comprehensive open systems reference for large-scale MoE training available. It covers Parallel Folding (a novel parallelism decoupling strategy), a full suite of memory and communication optimizations, reduced-precision training at FP8 and NVFP4, and production features from upcycling to reinforcement learning. The result: 1,233 TFLOPS/GPU for DeepSeek-V3-685B on NVIDIA GB300.

Pages · 42 Figures

18×

Param/Compute Gap (DeepSeek-V3)

1,233

TFLOPS/GPU on GB300

1,024+

GPU Scale Supported

The Three Walls: Why MoE Training Is a Systems Problem

MoE sparsity creates what the paper calls a Parameter-Compute Mismatch: total parameters scale with the number of experts E, while per-token compute only scales with the number of activated experts K. With K ≪ E, training MoE models at scale creates three tightly coupled constraints the paper calls The Three Walls.

🔴 Memory Wall

All E experts must stay in GPU memory. Only K activate per token. 18× more parameters than an equivalent dense model — all requiring gradients and optimizer states.

🟡 Communication Wall

Expert Parallelism requires all-to-all collectives every MoE layer. In DeepSeek-V3, unoptimized all-to-all consumed 60% of total training time.

🟣 Compute Wall

Fine-grained experts produce small GEMMs. Only ~50% GPU time in compute (vs ~70% in dense models). Routing adds ~9% overhead. Dynamic shapes block CUDA Graph optimization.

The coupling problem: These three walls are not independent. Increasing batch size improves GEMM utilization (compute wall) but amplifies memory pressure and communication volume. CUDA Graphs eliminate host overhead (compute wall) but require static tensor shapes, which conflicts with dynamic MoE routing. Every optimization forces a trade-off with at least one other wall.

The Memory Wall in Detail

In a dense transformer, per-token compute and memory scale together. Add more parameters, and each forward pass uses them. MoE breaks this coupling. A model with 685B total parameters and 37B active per token must still store all 685B parameters in GPU memory across all training ranks, along with their gradient accumulators and Adam-family optimizer states (roughly 2 additional parameter-sized buffers per parameter). The memory footprint of a MoE model with the same per-token compute as a 37B dense model is approximately 18× larger.

Dynamic routing compounds the problem. At any given step, some experts receive far more tokens than others. This load imbalance causes unpredictable activation memory spikes — the intermediate activations of an overloaded expert temporarily exceed what a balanced load would require, leading to memory fragmentation and occasional out-of-memory events that are difficult to predict statically.

The Communication Wall in Detail

Expert Parallelism (EP) solves the memory wall by distributing experts across GPU ranks — but mandates a new communication pattern at every MoE layer. Before expert computation, every token must be dispatched to its assigned expert on potentially any GPU in the EP group. After computation, expert outputs must be gathered back to the originating GPU. This requires two all-to-all collective operations per MoE layer.

The data volume is substantial. For T tokens, Top-K routing, and hidden dimension h, each dispatch all-to-all moves approximately T·K·h·(EP−1)/EP values. A full dispatch-and-combine cycle doubles this. At large EP, this communication increasingly moves from high-bandwidth intra-node NVLink (450 GB/s on DGX H100) to narrower inter-node InfiniBand (400 Gbps — roughly 10× less bandwidth per GPU). The paper's measurements show that unoptimized all-to-all in DeepSeek-V3 consumed up to 60% of total training step time.^[1]

The Compute Efficiency Wall in Detail

Dense transformers achieve high Model FLOPS Utilization (MFU) on large, regular GEMM operations. MoE undermines this in several ways:

Small GEMMs: Fine-grained experts with many small weight matrices produce matrix multiplications too small to saturate GPU arithmetic units. The paper measures ~50% of execution time in GEMMs for DeepSeek-V3 (MoE) vs ~70% for Llama-3 405B (dense).^[1]
Routing and permutation overhead: Tokens must be permuted into expert groups before computation and un-permuted afterward. The paper quantifies this at approximately 9% of per-layer execution time after optimization.^[1]
Host overhead: MoE launches more kernels per FLOP due to sparsity. Each launch carries fixed host-side cost. In dropless MoE, dynamic tensor shapes require costly host-device synchronization, preventing CUDA Graph optimization.
Load imbalance: Dynamic routing assigns uneven token counts to experts, leaving some idle while others are overloaded.

Parallel Folding: Rethinking Multi-Dimensional Parallelism

The central architectural innovation in the paper is Parallel Folding — a strategy for decoupling the parallelism configurations of Attention and MoE layers.^[1,6]

In conventional distributed training frameworks, Expert Parallelism (EP) is constrained to be a sub-group of Data Parallelism (DP): EP ≤ DP. This arises because EP groups are formed by splitting DP groups. Every expert-parallel rank corresponds to a data-parallel rank, meaning the number of expert shards is bounded by the data parallel degree. For models with large numbers of experts or configurations where optimal DP is small, this constraint prevents reaching the EP degree needed to fit expert parameters into GPU memory.

Parallel Folding eliminates this constraint by assigning each layer type its own independent parallelism group:

Layer Type	Parallelism Group	Dimensions
Attention layers	TP × CP × DP × PP	4D — standard dense parallelism
MoE layers	TP × EP × DP × PP	4D — independent EP configuration

The key insight: Attention computation benefits from TP (for weight sharding) and CP (for long-context sequence splitting). Expert computation benefits from large EP (to distribute memory) but does not benefit from CP in the same way. By letting each layer type use what suits it best, Parallel Folding unlocks configurations that are simply not expressible in standard frameworks.

The reported MFU improvements are meaningful: 49.3% for Mixtral 8×22B and 39.0% for Qwen2-57B-A14B on H100 clusters, versus 46.3% and 35.3% respectively for the prior MCore baseline.^[1] These are MFU (normalized against theoretical peak), making them architecture- and hardware-comparable.

Tradeoff: Parallel Folding adds implementation complexity. Managing two independent parallelism group hierarchies within a single training job requires careful coordination of gradient accumulation, optimizer steps, and checkpoint sharding. The paper describes these interactions but they are non-trivial to implement correctly.

Breaking the Memory Wall

Megatron-Core attacks the memory wall with a layered strategy — no single technique is sufficient, and the techniques interact in ways that require careful co-design.

Fine-Grained Activation Recomputation

Standard gradient checkpointing saves only layer-boundary activations and recomputes intermediate activations during the backward pass, trading compute for memory. Megatron-Core extends this to be selective and per-layer, allowing practitioners to configure exactly which activation tensors to retain versus recompute. For MoE layers specifically, this is valuable because expert activations scale with E (all expert intermediate states) rather than just K, making selective recomputation of low-priority activations disproportionately effective.

Memory-Efficient Permutation

Naive token permutation (grouping tokens by assigned expert before computation) allocates a new buffer of size T×K×h, creating a memory spike precisely when expert activations also need to exist. The paper's memory-efficient permutation achieves zero-overhead activation reduction by reusing existing buffers in-place, eliminating the spike entirely.^[1]

Activation Offloading

Activation tensors are moved to host (CPU) DRAM during the forward pass and brought back as needed during the backward pass. Host memory is typically 10–20× larger than GPU memory per device. The cost is PCIe bandwidth; the paper uses asynchronous transfers to overlap the host-GPU traffic with ongoing computation, partially hiding latency. This technique becomes increasingly attractive on newer hardware where NVLink bandwidth dwarfs PCIe, making the off-GPU path relatively more expensive — a tradeoff that practitioners must evaluate per-configuration.

FSDP for MoE

ZeRO-style (Zero Redundancy Optimizer) parameter sharding distributes optimizer states, gradients, and optionally parameters across DP ranks, reducing per-GPU memory proportionally to DP group size. Megatron-Core's implementation handles the EP–DP interaction correctly — expert parameters are already sharded by EP, so FSDP must coordinate with this existing sharding without creating redundancy or inconsistency.

Breaking the Communication Wall

DeepEP and HybridEP

Custom all-to-all dispatchers designed to maximize bandwidth utilization on NVIDIA interconnects.^[1] DeepEP targets single-node EP deployments where all traffic traverses NVLink — tuning buffer management and transfer scheduling for the 450 GB/s intra-node bandwidth. HybridEP handles cross-node EP where traffic mixes NVLink (intra-node) and InfiniBand (inter-node), optimizing the topology-aware routing that standard all-to-all implementations miss. Standard collective implementations do not differentiate between these topologies, leaving significant bandwidth on the table.

EP Communication Overlapping

In a baseline MoE layer, dispatch → compute → combine execute sequentially. The GPU sits idle waiting for all-to-all to complete before expert GEMMs can start. Communication overlapping restructures this pipeline so that token routing and expert computation partially overlap: while one batch of tokens is being dispatched, the previous batch is already being processed by its experts. This requires careful scheduling and buffer management but can substantially reduce the effective communication overhead visible to end-to-end throughput. The paper targets the 60% baseline figure specifically with this technique.

Breaking the Compute Efficiency Wall

Grouped GEMM

Rather than launching a separate GEMM kernel for each expert's token batch, Grouped GEMM batches all expert computations into a single kernel call with grouped matrix multiply semantics. This amortizes kernel launch overhead and allows the GPU to better schedule work across experts, improving utilization of arithmetic units even when individual expert batches are small. This is the primary tool for recovering the ~20-point MFU gap between MoE and dense compute patterns.

Kernel Fusion

The router computation, auxiliary loss calculation, and token permutation are fused into fewer kernel launches. Each fusion reduces host-side overhead (kernel launch latency, between-kernel synchronization) and allows intermediate results to remain in fast register or L1 cache rather than round-tripping through global memory.

CUDA Graphs and ECHO

CUDA Graph optimization captures a computation graph once and replays it without host-side kernel launch overhead. This is highly effective for dense models with static shapes — but MoE's dynamic routing means token counts per expert vary each step, producing variable tensor shapes that CUDA Graph cannot handle natively.

The paper introduces ECHO (full CUDA Graph coverage for dropless MoE) to address this directly. ECHO combines sync-free kernels (that do not require host-device synchronization for shape information) with Paged Stashing (a technique for managing dynamic buffer allocations within a CUDA Graph context). Together, they extend CUDA Graph coverage to the MoE layers that previously required dynamic dispatch — capturing the kernel launch savings that were previously unavailable for dropless MoE configurations.^[1]

FP8 and NVFP4: Precision as a Systems Lever

Reduced-precision training matters more for MoE than for dense models because it simultaneously addresses all three walls, not just compute efficiency:

Memory wall: FP8 weights are half the size of BF16, halving parameter and activation memory. NVFP4 takes this further to 4-bit weights.
Communication wall: Smaller activations mean less data moved in all-to-all collectives — directly reducing the dominant training cost.
Compute wall: FP8 and FP4 tensor cores execute at higher throughput than BF16 on Hopper (FP8) and Blackwell (FP8/FP4) hardware.

The paper describes four precision recipes across hardware generations:

Recipe	Hardware	Notes
Per-Tensor FP8	H100 (Hopper)	Single scale factor per tensor. Simple, good baseline.
Blockwise FP8	H100 (Hopper)	Scale factor per block of weights. Better accuracy at cost of metadata overhead.
MXFP8	GB200/GB300 (Blackwell)	Microscaling FP8. Hardware-native on Blackwell; no software overhead.
NVFP4	GB200/GB300 (Blackwell)	4-bit. Maximum compression. Requires careful per-layer calibration.

MoE-specific challenges arise from dynamic token routing. Grouped quantization — applying quantization scaling factors at the expert-group level rather than globally — is required because different experts see different activation distributions. Standard per-tensor quantization applied across experts would sacrifice accuracy. The paper describes grouped quantization kernels that maintain per-expert scaling without introducing additional synchronization overhead.^[1]

FP8 Primary Weights is a storage optimization that eliminates the redundant BF16 master weight copy maintained by some training frameworks. Instead, FP8 tensors serve as the primary (and only) weight storage, with conversion to higher precision only when needed for gradient accumulation. This removes an entire parameter-sized memory allocation that previously coexisted with the FP8 copy.

Performance Results and Case Studies

1,233

TFLOPS/GPU · DeepSeek-V3 · GB300

1,048

TFLOPS/GPU · DeepSeek-V3 · GB200

974

TFLOPS/GPU · Qwen3-235B · GB300

49.3%

MFU · Mixtral 8x22B · H100

These numbers are TFLOPS throughput, not MFU — they represent absolute compute throughput, not a fraction of theoretical peak. GB300 has a theoretical peak of ~2,800 TFLOPS in FP8, so 1,233 TFLOPS/GPU for DeepSeek-V3 represents roughly 44% hardware utilization at full model scale, which is strong for a 685B-parameter sparse model with complex routing overhead.

The DeepSeek-V3 Case Study (Section 9.2)

The paper provides a detailed walkthrough of optimizing DeepSeek-V3 training on both GB200 and H100 clusters using a three-phase workflow:^[1]

Phase 1: Establish memory-feasible parallelism. Find a parallelism configuration where the model fits in GPU memory without OOM. This sets the baseline; no throughput optimization yet.
Phase 2: Select optimal parallelism strategy. Given the memory-feasible space, search for the configuration that maximizes MFU. For DeepSeek-V3, Parallel Folding significantly expands the available configurations relative to the EP ≤ DP baseline.
Phase 3: Profile and optimize bottlenecks. Apply communication overlapping, kernel fusion, CUDA Graphs (where applicable), and precision tuning based on profiling data from Phase 2.

The paper reports that fine-grained MoE models (like Qwen2-57B-A14B with 64 experts, 8 active per token) consistently achieve lower MFU than coarse-grained models (like Mixtral 8×22B with 8 experts, 2 active per token) due to smaller expert hidden dimensions (lower GEMM efficiency) and higher communication volume from more expert activations per token. This is an architectural trade-off, not a framework limitation — fine-grained architectures trade training efficiency for model quality gains.

Production Features: The Complete Toolkit

Beyond the core three-walls optimizations, Section 7 of the paper catalogs production features that are required for real training runs at scale:

Load balancing: Auxiliary loss terms penalize uneven token-to-expert routing. Token dropping (with configurable capacity factors) provides a hard cap on per-expert load. The paper describes both strategies and their interaction with CUDA Graphs.
Shared experts: Certain experts are always active regardless of routing (DeepSeek-style). Megatron-Core supports these as a separate execution path from standard Top-K routing.
Latent MoE: NVIDIA's Nemotron-3 architecture uses LatentMoE, which learns routing in a latent space rather than input space. Full support is included.
Distributed checkpoint: Asynchronous checkpoint writes that do not block training. At 685B+ parameters, synchronous checkpointing would stall training for minutes per save. Async checkpoint is an operational requirement at this scale.
Upcycling: Converting a pre-trained dense model into a MoE model by replicating dense FFN weights into multiple experts, then fine-tuning. This significantly reduces the compute cost of training a new MoE model from scratch.
Multi-Token Prediction (MTP): Training the model to predict multiple tokens per forward pass. Improves training signal density without proportional compute increase.
Muon optimizer: An alternative to Adam that applies different update rules to different parameter groups, showing improved convergence on some MoE configurations.
Megatron-Bridge (RL): Integration layer for reinforcement learning post-training, described in Section 10. Allows Megatron-Core to serve as the policy model backend in RL pipelines.

Limitations and Open Questions

The fine-grained MoE efficiency gap is architectural, not fixable by the framework. Models like DeepSeek-V3 and Qwen3 use fine-grained expert architectures (many small experts) that trade training efficiency for model quality. Megatron-Core optimizes what it can, but the gap between 49.3% MFU (Mixtral coarse-grained) and 39% MFU (Qwen fine-grained) on H100 reflects the architecture, not the framework.

Hardware dependency: The headline 1,233 TFLOPS/GPU result requires NVIDIA GB300 with NVFP4 and MXFP8. Training DeepSeek-V3 on H100 clusters will not approach this figure. The paper's H100 results are more representative for most practitioners today.
Communication wall is reduced but not eliminated: At extreme EP scales (hundreds of nodes), inter-node InfiniBand bandwidth remains a fundamental bottleneck. DeepEP and communication overlapping reduce the cost but cannot overcome the physical bandwidth ceiling.
ECHO is a workaround, not a fundamental solution: The sync-free kernel and Paged Stashing approach enables CUDA Graphs for dropless MoE, but the underlying tension between dynamic routing and static computation graphs remains. Future hardware or framework changes may offer cleaner solutions.
RL via Megatron-Bridge is underspecified: Section 10 describes the RL integration at high level but provides limited benchmark data. Practitioners looking to use Megatron-Core for RL post-training will need to rely on community documentation and experimentation.
Load balancing remains an open problem: The paper covers auxiliary loss and token dropping but notes that neither fully solves load imbalance at scale. Some experts consistently receive more tokens than others regardless of balancing strategy, leaving hardware underutilized.

Conclusion

NVIDIA's Megatron-Core MoE technical report is the most comprehensive open systems reference for large-scale sparse model training published to date. Its value is not in any single technique — Parallel Folding, DeepEP, ECHO, or FP8 primary weights — but in the integrated treatment of all three walls simultaneously with explicit analysis of their interactions and trade-offs.

For practitioners training MoE models today, the actionable takeaways are concrete. The three-phase optimization workflow (memory feasibility → parallelism selection → bottleneck profiling) provides a replicable methodology. Parallel Folding's removal of the EP ≤ DP constraint directly expands the configuration search space for large models. Communication overlapping and DeepEP/HybridEP target the 60%-of-training-time problem that otherwise dominates cost at scale. And FP8/NVFP4 on Blackwell hardware compounds all three improvements simultaneously.

The limitations are real: fine-grained MoE architectures trade training efficiency for model quality in ways the framework cannot fully compensate for; GB300 performance numbers require hardware most practitioners don't have; and reinforcement learning integration remains an early-stage component. But the framework itself — available open-source in Megatron-Core — is production-ready and the paper's 88 pages represent genuine engineering depth worth studying for anyone building at the MoE frontier.

Key Takeaways for ML Engineers

1. The Three Walls framework is a useful mental model for MoE training constraints. Memory, communication, and compute are coupled — address all three or your optimization shifts the bottleneck rather than removing it.

2. Parallel Folding breaks the EP ≤ DP constraint. If you're training MoE on Megatron-Core, update to a version that supports it and re-run your parallelism search.

3. Communication is the dominant cost at large EP. 60% of training time in all-to-all is not an edge case — it's the default without active optimization. DeepEP/HybridEP + overlapping is not optional for production runs.

4. FP8 hits all three walls at once. On Blackwell hardware, MXFP8 and NVFP4 are the right default starting points, not a later optimization. On Hopper, per-tensor FP8 is a solid default for compute-bound workloads.

🧠 Scalable MoE Training: Inside NVIDIA's Megatron-Core Technical Report

Introduction: The Promise and the Pain of Sparse Models

The Three Walls: Why MoE Training Is a Systems Problem

The Memory Wall in Detail

The Communication Wall in Detail

The Compute Efficiency Wall in Detail

Parallel Folding: Rethinking Multi-Dimensional Parallelism

Breaking the Memory Wall

Fine-Grained Activation Recomputation

Memory-Efficient Permutation

Activation Offloading

FSDP for MoE

Breaking the Communication Wall

DeepEP and HybridEP

EP Communication Overlapping

Breaking the Compute Efficiency Wall

Grouped GEMM

Kernel Fusion

CUDA Graphs and ECHO

FP8 and NVFP4: Precision as a Systems Lever

Performance Results and Case Studies

The DeepSeek-V3 Case Study (Section 9.2)

Production Features: The Complete Toolkit

Limitations and Open Questions

Conclusion

Key Takeaways for ML Engineers

References