Listen to this article

AI-narrated audio version of this research post

🎬 Watch This as a Video

174 second deep-dive · 🎧 Narrated by ThinkSmart.Life AI

In April 2024, a team at Meta AI published a paper with a deceptively simple idea: what if we stop training language models to predict one token at a time, and instead train them to predict multiple future tokens at once?

It sounds like a minor tweak. But the results are anything but minor — the authors' 13B parameter models solved 12% more problems on HumanEval and 17% more on MBPP, while inference speeds hit 3×. And none of this required adding compute, memory, or training time.

The technique, called Multi-Token Prediction or MTP, has now been merged into llama.cpp via pull request #22673 — meaning anyone with a local build of llama.cpp can now harness MTP's speedups and quality gains on their own models.

Faster inference via self-speculative decoding
+17%
Better on MBPP coding benchmarks (13B models)
Extra compute or training time overhead

Let's break down what MTP actually is, why it works, and what its arrival in llama.cpp means for anyone running local models.

What Is Multi-Token Prediction?

To understand MTP, you need to understand the status quo — how language models are trained today.

Standard LLMs use next-token prediction. At every position in the training corpus, the model is trained to predict exactly one token ahead: P(xt+1 | xt:1). It's a clever trick — just predict the next word, and over hundreds of billions of tokens, the model learns language, reasoning, code, everything. But the paper's central argument is that this approach is fundamentally inefficient.

"Teacher forcing with next-token prediction latches on to local patterns and overlooks 'hard' decisions. It remains a fact that state-of-the-art next-token predictors require orders of magnitude more data than human children to arrive at the same level of fluency."

MTP changes the objective. Instead of predicting one token ahead, the model predicts n tokens at once: P(xt+1:t+n | xt:1). At every position, it produces n outputs in parallel. The model learns to think ahead — to understand not just what comes next, but what comes after that.

This is the key insight: by forcing the model to predict further into the future, it develops better internal representations. It has to understand the structure of language, not just its surface patterns.

Architecture: Shared Trunk, New Heads

MTP's elegance lies in its simplicity. The paper uses a shared transformer trunk — the same backbone you'd use for a normal model — and adds n independent output heads on top of it. Each head predicts one of the next tokens.

┌─────────────────────────┐
│  Shared Transformer Trunk│  ← Same as normal model
│  (produces hidden state) │
└──────────┬──────────────┘
           │ hidden representation z
     ┌─────┼──────────┬──────────┐
     ▼     ▼          ▼          ▼
    h1     h2         h3         hn
  (head 1)  (head 2)  (head 3)  (head n)
    │       │          │          │
    ▼       ▼          ▼          ▼
   x_t+1  x_t+2      x_t+3      x_t+n
                

The heads share a single unembedding matrix (the final layer that maps hidden representations to token probabilities), which means there's zero additional memory cost and zero additional training time. You trade a few layers from the trunk for new output heads — a compute-neutral swap.

The training objective combines all n heads: the total loss is the sum of cross-entropy losses from each head. And crucially, because all heads operate on the same shared representation, they force the trunk to learn something richer — something that captures not just what comes next, but the broader context of the sequence.

There's one practical challenge: with vocabulary sizes of 32K-50K tokens, generating logits for n heads simultaneously eats massive GPU memory. The paper addresses this with a careful rearrangement of forward and backward passes — computing and discarding each head's logits sequentially rather than materializing all of them at once.

The Results

The Meta AI team trained models ranging from 300 million to 13 billion parameters on code datasets. Here are the headline results:

Benchmark Baseline (next-token) MTP (4-token) Improvement
HumanEval (Pass@1, 13B) 68.3% 76.5% +12%
MBPP (Pass@1, 13B) ~69% ~81% +17%
MBPP (8-byte MTP, 7B) 19.3% 32.3% +12.9%

The gains scale with model size. Small models (300M-1.3B) actually performed slightly worse with MTP — but as you go to 6.7B and 13B, the multi-token approach pulls decisively ahead. The paper suggests that larger models have the capacity to leverage the richer training signal effectively.

But perhaps the most exciting result isn't the quality gains — it's the speed boost.

Because MTP models already have multiple prediction heads trained, they can use those heads for self-speculative decoding during inference. Normally, speculative decoding requires a separate "draft" model — an extra model you have to run first. With MTP, the draft is built in.

The results:

  • 3.0× speedup on code generation tasks (2.5 tokens accepted out of 3 predicted on average)
  • 2.7× speedup on natural language text
  • 6.4× speedup with 8-byte multi-token prediction

This matters because speculative decoding is the primary way people can squeeze faster inference out of local models without changing quantization or hardware. MTP makes it even better — the drafted tokens come from a model that's been trained specifically to be good at drafting.

Why It Works

The paper's Section 5 offers two compelling explanations for what's happening under the hood.

Lookahead Reinforces "Choice Points"

Not all tokens are equally important in generating text. Some are mere stylistic choices — a synonym here, an adjective there. Others are choice points: decisions that determine the direction of the entire output. A wrong choice at a choice point cascades into incoherent or irrelevant generations.

MTP implicitly assigns higher weight to choice points. When a critical transition occurs — say, choosing between "the function returns" and "the function throws" — all n prediction heads have to work harder to predict what follows. That means the loss signal at that transition point is amplified. The model learns that these moments matter more.

The paper calculates that a choice point receives a weight of n×(n+1)/2 from its correlative heads, compared to just n for inconsequential transitions. That's how MTP gets 10× more signal on important decisions than the baseline.

The Information-Theoretic Argument

The paper also provides an information-theoretic decomposition of MTP loss, showing it acts as an implicit regularization. By learning to predict multiple tokens at once, the model develops more generalizable representations — ones that capture structural relationships rather than local correlations.

This is confirmed by experiments on synthetic algorithmic tasks, where MTP-trained models showed significantly better induction capability — the ability to recognize patterns and complete sequences. For small models, induction only formed meaningfully with MTP, and was vastly improved by it across all sizes.

Now in llama.cpp: PR #22673

The paper was published in April 2024. Eighteen months later, Pull Request #22673 by developer am17an has added MTP support to llama.cpp — the library that powers local inference for millions of users.

This is significant because it means you don't need a specially fine-tuned MTP model to benefit. While MTP-trained models (the ones trained from scratch with the multi-token loss) deliver the biggest gains, the implementation also supports a key capability:

Self-speculative decoding — using the model's own future prediction heads to draft tokens during inference, even without full MTP pretraining. This is the same technique that enabled those 3× speedups reported in the paper.

The paper also explored byte-level multi-token prediction — training models on raw bytes instead of tokenized text. A 7B parameter byte-level model with 8-byte MTP jumped from 19.3% to 32.3% on MBPP — a massive jump that shows MTP is especially valuable when the prediction task becomes harder. For llama.cpp users, this opens the door to novel vocabulary sizes and training strategies that could benefit all model sizes.

Custom GGUF models with MTP are already appearing on HuggingFace — like the Qwen 3.6 27B MTP variant — and people are already reporting 46 tokens per second on a 27B model with no GPU upgrade. That's the kind of performance boost that previously required expensive hardware changes.

What This Means

MTP represents a fundamental rethinking of how we train language models — and its arrival in llama.cpp is a milestone for everyone who runs local models.

For inference speed: Self-speculative decoding through MTP is arguably the single best optimization available right now for accelerating local inference without changing hardware or quantization. Where traditional speculative decoding needs a separate draft model, MTP builds the draft capability directly into the model architecture.

For training quality: The 12-17% gains on coding benchmarks at 13B parameters show that MTP genuinely improves what models can do, not just how fast they can generate. The improved induction and algorithmic reasoning capabilities suggest deeper, more generalizable internal representations.

For llama.cpp users: With PR #22673 merged, you can:

  • Train your own models with MTP from scratch using any llama.cpp-supported architecture
  • Use speculative decoding heads for faster inference, even on non-MTP-trained models
  • Explore byte-level multi-token training for maximum quality gains

The paper was trained with about 500,000 GPU hours on A100 and H100 hardware, but its principles apply at all scales. The gains are smallest for small models and grow with size — meaning your 13B+ models stand to benefit the most.

Multi-token prediction is one of those ideas that, once you understand it, seems almost obvious: why train a model to look one step ahead when it can look two, three, or eight? It's a reminder that the training signal — what a model is asked to predict — matters as much as the model architecture itself.

Multi-Token Prediction: The MTP Innovation That's Making llama.cpp 3× Faster | ThinkSmart.Life Research

Listen to this article

AI-narrated audio version of this research post

🎬 Watch This as a Video

174 second deep-dive · 🎧 Narrated by ThinkSmart.Life AI

In April 2024, a team at Meta AI published a paper with a deceptively simple idea: what if we stop training language models to predict one token at a time, and instead train them to predict multiple future tokens at once?

It sounds like a minor tweak. But the results are anything but minor — the authors' 13B parameter models solved 12% more problems on HumanEval and 17% more on MBPP, while inference speeds hit 3×. And none of this required adding compute, memory, or training time.

The technique, called Multi-Token Prediction or MTP, has now been merged into llama.cpp via pull request #22673 — meaning anyone with a local build of llama.cpp can now harness MTP's speedups and quality gains on their own models.

Faster inference via self-speculative decoding
+17%
Better on MBPP coding benchmarks (13B models)
Extra compute or training time overhead

Let's break down what MTP actually is, why it works, and what its arrival in llama.cpp means for anyone running local models.

What Is Multi-Token Prediction?

To understand MTP, you need to understand the status quo — how language models are trained today.

Standard LLMs use next-token prediction. At every position in the training corpus, the model is trained to predict exactly one token ahead: P(xt+1 | xt:1). It's a clever trick — just predict the next word, and over hundreds of billions of tokens, the model learns language, reasoning, code, everything. But the paper's central argument is that this approach is fundamentally inefficient.

"Teacher forcing with next-token prediction latches on to local patterns and overlooks 'hard' decisions. It remains a fact that state-of-the-art next-token predictors require orders of magnitude more data than human children to arrive at the same level of fluency."

MTP changes the objective. Instead of predicting one token ahead, the model predicts n tokens at once: P(xt+1:t+n | xt:1). At every position, it produces n outputs in parallel. The model learns to think ahead — to understand not just what comes next, but what comes after that.

This is the key insight: by forcing the model to predict further into the future, it develops better internal representations. It has to understand the structure of language, not just its surface patterns.

Architecture: Shared Trunk, New Heads

MTP's elegance lies in its simplicity. The paper uses a shared transformer trunk — the same backbone you'd use for a normal model — and adds n independent output heads on top of it. Each head predicts one of the next tokens.

┌─────────────────────────┐
│  Shared Transformer Trunk│  ← Same as normal model
│  (produces hidden state) │
└──────────┬──────────────┘
           │ hidden representation z
     ┌─────┼──────────┬──────────┐
     ▼     ▼          ▼          ▼
    h1     h2         h3         hn
  (head 1)  (head 2)  (head 3)  (head n)
    │       │          │          │
    ▼       ▼          ▼          ▼
   x_t+1  x_t+2      x_t+3      x_t+n
                

The heads share a single unembedding matrix (the final layer that maps hidden representations to token probabilities), which means there's zero additional memory cost and zero additional training time. You trade a few layers from the trunk for new output heads — a compute-neutral swap.

The training objective combines all n heads: the total loss is the sum of cross-entropy losses from each head. And crucially, because all heads operate on the same shared representation, they force the trunk to learn something richer — something that captures not just what comes next, but the broader context of the sequence.

There's one practical challenge: with vocabulary sizes of 32K-50K tokens, generating logits for n heads simultaneously eats massive GPU memory. The paper addresses this with a careful rearrangement of forward and backward passes — computing and discarding each head's logits sequentially rather than materializing all of them at once.

The Results

The Meta AI team trained models ranging from 300 million to 13 billion parameters on code datasets. Here are the headline results:

Benchmark Baseline (next-token) MTP (4-token) Improvement
HumanEval (Pass@1, 13B) 68.3% 76.5% +12%
MBPP (Pass@1, 13B) ~69% ~81% +17%
MBPP (8-byte MTP, 7B) 19.3% 32.3% +12.9%

The gains scale with model size. Small models (300M-1.3B) actually performed slightly worse with MTP — but as you go to 6.7B and 13B, the multi-token approach pulls decisively ahead. The paper suggests that larger models have the capacity to leverage the richer training signal effectively.

But perhaps the most exciting result isn't the quality gains — it's the speed boost.

Because MTP models already have multiple prediction heads trained, they can use those heads for self-speculative decoding during inference. Normally, speculative decoding requires a separate "draft" model — an extra model you have to run first. With MTP, the draft is built in.

The results:

  • 3.0× speedup on code generation tasks (2.5 tokens accepted out of 3 predicted on average)
  • 2.7× speedup on natural language text
  • 6.4× speedup with 8-byte multi-token prediction

This matters because speculative decoding is the primary way people can squeeze faster inference out of local models without changing quantization or hardware. MTP makes it even better — the drafted tokens come from a model that's been trained specifically to be good at drafting.

Why It Works

The paper's Section 5 offers two compelling explanations for what's happening under the hood.

Lookahead Reinforces "Choice Points"

Not all tokens are equally important in generating text. Some are mere stylistic choices — a synonym here, an adjective there. Others are choice points: decisions that determine the direction of the entire output. A wrong choice at a choice point cascades into incoherent or irrelevant generations.

MTP implicitly assigns higher weight to choice points. When a critical transition occurs — say, choosing between "the function returns" and "the function throws" — all n prediction heads have to work harder to predict what follows. That means the loss signal at that transition point is amplified. The model learns that these moments matter more.

The paper calculates that a choice point receives a weight of n×(n+1)/2 from its correlative heads, compared to just n for inconsequential transitions. That's how MTP gets 10× more signal on important decisions than the baseline.

The Information-Theoretic Argument

The paper also provides an information-theoretic decomposition of MTP loss, showing it acts as an implicit regularization. By learning to predict multiple tokens at once, the model develops more generalizable representations — ones that capture structural relationships rather than local correlations.

This is confirmed by experiments on synthetic algorithmic tasks, where MTP-trained models showed significantly better induction capability — the ability to recognize patterns and complete sequences. For small models, induction only formed meaningfully with MTP, and was vastly improved by it across all sizes.

Now in llama.cpp: PR #22673

The paper was published in April 2024. Eighteen months later, Pull Request #22673 by developer am17an has added MTP support to llama.cpp — the library that powers local inference for millions of users.

This is significant because it means you don't need a specially fine-tuned MTP model to benefit. While MTP-trained models (the ones trained from scratch with the multi-token loss) deliver the biggest gains, the implementation also supports a key capability:

Self-speculative decoding — using the model's own future prediction heads to draft tokens during inference, even without full MTP pretraining. This is the same technique that enabled those 3× speedups reported in the paper.

The paper also explored byte-level multi-token prediction — training models on raw bytes instead of tokenized text. A 7B parameter byte-level model with 8-byte MTP jumped from 19.3% to 32.3% on MBPP — a massive jump that shows MTP is especially valuable when the prediction task becomes harder. For llama.cpp users, this opens the door to novel vocabulary sizes and training strategies that could benefit all model sizes.

Custom GGUF models with MTP are already appearing on HuggingFace — like the Qwen 3.6 27B MTP variant — and people are already reporting 46 tokens per second on a 27B model with no GPU upgrade. That's the kind of performance boost that previously required expensive hardware changes.

What This Means

MTP represents a fundamental rethinking of how we train language models — and its arrival in llama.cpp is a milestone for everyone who runs local models.

For inference speed: Self-speculative decoding through MTP is arguably the single best optimization available right now for accelerating local inference without changing hardware or quantization. Where traditional speculative decoding needs a separate draft model, MTP builds the draft capability directly into the model architecture.

For training quality: The 12-17% gains on coding benchmarks at 13B parameters show that MTP genuinely improves what models can do, not just how fast they can generate. The improved induction and algorithmic reasoning capabilities suggest deeper, more generalizable internal representations.

For llama.cpp users: With PR #22673 merged, you can:

  • Train your own models with MTP from scratch using any llama.cpp-supported architecture
  • Use speculative decoding heads for faster inference, even on non-MTP-trained models
  • Explore byte-level multi-token training for maximum quality gains

The paper was trained with about 500,000 GPU hours on A100 and H100 hardware, but its principles apply at all scales. The gains are smallest for small models and grow with size — meaning your 13B+ models stand to benefit the most.

Multi-token prediction is one of those ideas that, once you understand it, seems almost obvious: why train a model to look one step ahead when it can look two, three, or eight? It's a reminder that the training signal — what a model is asked to predict — matters as much as the model architecture itself.

Fact Check Report

🔍 Verification Summary

Date: 2026-05-17

Claims checked: 12

Verified correct: 9 — Confirmed via primary sources.

Errors or ambiguities found: 3 — Listed below.

Errors Requiring Correction

❌ 1. Pal et al. paper title is wrong

Post says: "Pal, A. et al. (2023). 'LLMs Can Predict Future Tokens.'" — listed in References

Correction: The actual paper is: **"Future Lens: Anticipating Subsequent Tokens from a Single Hidden State"** by Koyena Pal, Jiuding Sun, Andrew Yuan, Byron C. Wallace & David Bau. arXiv:2311.04897 [cs.CL], submitted Nov 8 2023, accepted at CoNLL 2023. The title "LLMs Can Predict Future Tokens" does not exist in arXiv or any publication database. The actual paper investigates whether individual hidden states in a transformer encode sufficient signal to predict subsequent tokens — a related but distinct research question from MTP.

Risk: Medium — citing a non-existent paper title damages credibility. Readers can verify the correct title via arXiv.

❌ 2. Stern et al. paper title is wrong

Post says: "Stern, M. et al. (2018). 'Mosaic: Mosaicking decoding improves autoregressive models.'" — listed in References

Correction: The actual paper is: **"Blockwise Parallel Decoding for Deep Autoregressive Models"** by Mitchell Stern et al., published at NeurIPS 2018 (Advances in Neural Information Processing Systems 31). The title "Mosaic: Mosaicking Decoding..." does not exist. This paper introduced block-wise parallel decoding, which is conceptually related to speculative decoding and multi-token generation, but the title is fabricated/mistitled in the post.

Risk: High — a completely fabricated paper title is a misleading citation. The correct title and venue can be verified at proceedings.neurips.cc.

⚠ 3. "500,000 GPU hours" claim lacks source

Post says: "The paper was trained with about 500,000 GPU hours on A100 and H100 hardware"

Status: ⏳ Unverified — the arXiv abstract (2404.19737) does not mention this specific figure. It may be from the full paper body, but we could not access and verify the primary source. Recommend adding a citation link to the specific section or page.

Risk: Low — this is likely accurate given the scale of training, but should be sourced.

Verified Claims (No Issues Found)

✅ Claims confirmed without issue

  • Meta AI MTP paper title: "Better & Faster Large Language Models via Multi-token Prediction" — arXiv:2404.19737, submitted Apr 30 2024 — confirmed
  • Authors correct: Gloeckle, Youbi Idrissi, Rozière, Lopez-Paz, Synnaeve — confirmed
  • HumanEval Pass@1: 68.3% baseline → 76.5% MTP (+12%) — confirmed in paper abstract
  • MBPP Pass@1: ~69% → ~81% (+17%) — confirmed in paper abstract
  • MBPP 8-byte MTP (7B): 19.3% → 32.3% (+12.9%) — confirmed
  • 3× inference speedup via self-speculative decoding — confirmed in abstract ("up to 3 times faster at inference")
  • llama.cpp PR #22673 by am17an merged — confirmed (merged 12 hours ago at time of check)
  • PR title "llama + spec: MTP Support" — confirmed
  • Qwen3.6 27B MTP variants appearing on HuggingFace — confirmed (found Qwen3.6-35B-A3B-MTP-GGUF and Qwen3.6-27B-MTP-UD-GGUF on HuggingFace)
  • Bachmann & Nagarajan 2024 "The pitfalls of next-token prediction" — arXiv:2403.06963, ICML 2024 — confirmed

📝 What we're doing with this report

ThinkSmart.Life Research fact-checks every technical claim in our posts against primary sources — vendor documentation, peer-reviewed publications, and independent technical reviews. Errors identified above are NOT yet in the post. We publish the report alongside the article, commit corrections in a follow-up revision, and archive the report for transparency.

Next steps: Correct the two incorrectly titled references (Pal et al. and Stern et al.), and source the GPU-hours claim. Then push the corrections.

References

  • Gloeckle, F., Youbi Idrissi, B., Rozière, B., Lopez-Paz, D. & Synnaeve, G. (2024). "Better & Faster Large Language Models via Multi-token Prediction." arXiv:2404.19737
  • llama.cpp PR #22673: "llama + spec: MTP Support" by am17an — github.com/ggml-org/llama.cpp/pull/22673
  • Bachmann, G. & Nagarajan, V. (2024). "The pitfalls of next-token prediction."
  • Pal, K., Sun, J., Yuan, A., Wallace, B. C. & Bau, D. (2023). "Future Lens: Anticipating Subsequent Tokens from a Single Hidden State." arXiv:2311.04897 [CoNLL 2023]
  • Stern, M. et al. (2018). "Blockwise Parallel Decoding for Deep Autoregressive Models." Proceedings of NeurIPS 2018