
Google Drops Gemma 4 β€” Apache 2.0, Gemini 3 Research, Four Models

On April 2, 2026, Google DeepMind released Gemma 4 β€” the most capable open model family it has ever shipped. Four models, one license change, and a clear message to the developer community: the best open AI is now genuinely competitive with the best closed AI.

The headline numbers are striking. The 31B Dense model ranks #3 on the LMArena open model leaderboard, just behind GLM-5 and Kimi 2.5 β€” both dramatically larger models. At 31 billion parameters, Gemma 4's flagship competes with models 20Γ— its size. That efficiency story is the heart of this release.

But there's more than raw performance. For the first time in Gemma's history, Google is shipping under the Apache 2.0 license β€” abandoning the custom "Gemma license" that had frustrated the community with MAU caps, acceptable-use restrictions, and commercial ambiguity. Full commercial freedom, no strings attached.

The Gemma 4 family is built on the same research and architectural innovations that power Gemini 3, Google's latest frontier closed model. You're not getting a stripped-down version β€” you're getting the technology transfer. Every model in the family includes multimodal capabilities: images and audio across all four models, video on the two largest. The smaller E2B and E4B models are specifically optimized for smartphones, Raspberry Pi, and Jetson Nano β€” with near-zero latency on modern mobile hardware.

Previous Gemma versions accumulated 400 million downloads and spawned over 100,000 community variants in what Google calls the Gemmaverse. Gemma 4 launches into that ecosystem with a dramatically stronger foundation.

  β€’ #3 β€” LMArena open model rank (31B Dense)
  β€’ 4 β€” models released
  β€’ 400M β€” downloads across previous Gemma versions
  β€’ Apache 2.0 β€” license (new!)

⚑ TL;DR

Gemma 4 is Google DeepMind's newest open model family: four models (E2B, E4B, 26B MoE, 31B Dense) built from Gemini 3 research, licensed under Apache 2.0. The 31B Dense ranks #3 globally among open models on LMArena. All models support images and audio; the larger two add video. Architecture highlights include Per-Layer Embeddings, alternating local/global attention, dual RoPE for 256K context, and sparse MoE routing. This guide covers every model in the family, the architectural innovations, benchmark results, hardware requirements, and who should use which.

The Model Family: Four Models, Four Deployment Targets

Gemma 4 comes in four distinct models, each designed for a different deployment environment. The naming convention uses "E" for "Effective" models (E2B, E4B) β€” a new class optimized for on-device inference β€” and straightforward parameter counts for the larger workstation-class models.

| Model | Active Params | Total Params | Context | Modalities | Best For |
| --- | --- | --- | --- | --- | --- |
| Gemma 4 E2B | 2.3B | 5.1B | 128K | Text, Image, Audio | Mobile, IoT, always-on apps |
| Gemma 4 E4B | 4.5B | 8B | 128K | Text, Image, Audio | Smartphones, Raspberry Pi, edge inference |
| Gemma 4 26B MoE | 3.8B | 25.2B | 256K | Text, Image, Audio, Video | Single-GPU workstation, high throughput |
| Gemma 4 31B Dense | 30.7B | 30.7B | 256K | Text, Image, Audio, Video | Quality-first, fine-tuning, research |
The distinction between E2B/E4B and the larger models goes deeper than just parameter count. The "E" models use a novel architecture called Per-Layer Embeddings (PLE) that gives them a kind of dual identity: small active parameter counts (for fast inference) but dramatically larger representational depth. At 2.3B active parameters, the E2B model carries 5.1 billion parameters worth of knowledge β€” more than double what you'd expect from a 2B-class model.

The 26B MoE occupies a unique niche: it activates only 3.8B parameters per token (similar to E4B!), but its 128-expert MoE pool gives it a much richer knowledge base. This translates to unusually high tokens-per-second throughput for a model of its total size. The 31B Dense, by contrast, activates all 30.7B parameters for every token β€” sacrificing throughput for quality, which shows in its benchmark leadership.

Model Tiers at a Glance

πŸ“±
E2B β€” Always-On Mobile

2.3B active parameters with PLE giving 5.1B depth. Optimized for Android and ultra-low-power deployment. Near-zero latency on modern smartphones and Jetson Nano. The model Google built by working directly with the Pixel team, Qualcomm, and MediaTek.

πŸ”‹
E4B β€” Edge Champion

4.5B active, 8B total. The sweet spot for Raspberry Pi, single-board computers, and mid-range smartphones. Multimodal from day one β€” can process images and audio without specialized hardware accelerators.

⚑
26B MoE β€” Efficiency King

Only 3.8B parameters active per token from a pool of 25.2B. Runs unquantized on a single H100 80GB. Achieves 256K context and video understanding. Ranks #6 on Arena. The model for high-throughput production deployments.

πŸ†
31B Dense β€” Quality Leader

All 30.7B parameters active. 256K context, video support, AIME 89.2%, LMArena #3. Designed for fine-tuning and quality-critical workloads. If you want the best open model available today and have a single H100, this is it.

Architecture Deep Dive

Gemma 4 isn't just a bigger Gemma 3. Google DeepMind introduced several significant architectural innovations across the family, each solving a specific problem: how to run efficiently on mobile, how to scale context to 256K tokens, and how to pack maximum knowledge into sparse models.

Per-Layer Embeddings (PLE) β€” The E-Model Secret

The E2B and E4B models introduce Per-Layer Embeddings, a technique that injects a secondary embedding signal into every decoder layer rather than only the first. Traditional transformers embed tokens once at the input and pass that representation through every layer. PLE gives each layer its own view of the token β€” essentially a per-layer context signal that enriches the residual stream at each depth.

The practical result: a model with 2.3B active parameters can carry 5.1 billion parameters worth of representational capacity. The model "knows more" than its active count suggests, because information is distributed across the per-layer embedding tables as well as the weight matrices. This is what enables E2B to score 60.0% on MMLU Pro β€” respectable for a sub-3B active parameter model.
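The mechanism can be sketched in a few lines. This is a toy illustration of the PLE idea as described above β€” per-layer embedding tables enriching the residual stream β€” not Gemma 4's actual implementation; vocabulary size, dimensions, and layer count are made-up stand-ins.

```python
# Toy sketch of Per-Layer Embeddings (PLE). A classic transformer embeds
# each token once at the input; PLE additionally gives every decoder
# layer its own small embedding table, so each layer adds a fresh
# per-layer view of the token to the residual stream.

import random

VOCAB, DIM, LAYERS = 100, 8, 4
random.seed(0)

def make_table():
    return [[random.gauss(0, 0.02) for _ in range(DIM)] for _ in range(VOCAB)]

input_table = make_table()                           # classic input embedding
ple_tables = [make_table() for _ in range(LAYERS)]   # one extra table per layer

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def forward(token_id):
    h = input_table[token_id][:]                     # embed once at the input
    for layer in range(LAYERS):
        h = add(h, ple_tables[layer][token_id])      # PLE: per-layer token signal
        # ... the layer's attention and MLP blocks would run here ...
    return h

hidden = forward(42)
print(len(hidden))   # hidden state dimension
```

Note where the capacity comes from: the per-layer tables store knowledge but are simple lookups, so they add representational depth without adding matrix-multiply compute per token.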

Alternating Attention β€” Local + Global

All Gemma 4 models use alternating attention layers: a pattern that interleaves local sliding-window attention with global full-context attention. Local layers use a sliding window of 512–1,024 tokens, attending only to nearby context. Global layers attend to the full sequence length.

This hybrid approach dramatically reduces the computational and memory cost of long-context inference. Instead of paying the full O(nΒ²) cost of global attention at every layer, local layers handle the majority of processing while global layers periodically integrate the full context. The result: 256K context windows become practical on a single GPU.
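The savings are easy to see with rough arithmetic. The 5:1 local-to-global ratio and the window size below are illustrative assumptions, not confirmed Gemma 4 hyperparameters:

```python
# Sketch of an alternating local/global attention schedule and its cost.
# Local layers pay roughly O(n * window); global layers pay O(n^2).

def layer_pattern(n_layers, local_per_global=5):
    # Every (local_per_global + 1)-th layer is global; the rest are local.
    return ["global" if (i + 1) % (local_per_global + 1) == 0 else "local"
            for i in range(n_layers)]

def attention_cost(seq_len, pattern, window=1024):
    return sum(seq_len * window if kind == "local" else seq_len ** 2
               for kind in pattern)

pattern = layer_pattern(12)
full = attention_cost(262_144, ["global"] * 12)   # all-global baseline at 256K
hybrid = attention_cost(262_144, pattern)         # alternating schedule
print(pattern.count("global"), "global layers,", f"{full / hybrid:.1f}x cheaper")
```

At 256K tokens the quadratic term dominates, so even keeping a few global layers in the schedule captures most of the savings.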

Dual RoPE β€” 256K Without Quality Degradation

Extending context windows in transformers is notoriously tricky β€” naive RoPE interpolation degrades quality at long ranges. Gemma 4 solves this with Dual RoPE: standard rotary positional embeddings for local (sliding-window) layers, and proportional RoPE scaling for global (full-context) layers.

By using different positional encoding strategies for different attention types, the model maintains full precision within local windows while still propagating correct positional information at 256K range in global layers. This is why the 26B MoE and 31B Dense can reliably retrieve information from documents tens of thousands of tokens away.
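A minimal sketch of the idea: compute the standard RoPE rotation angles, but divide positions by a scale factor on global layers only. The dimension, base, and scale factor here are illustrative assumptions, not Gemma 4's published values.

```python
# Toy Dual RoPE sketch: local layers use plain RoPE; global layers scale
# positions down so 256K-range positions map into the angle range the
# base RoPE frequencies handle well.

def rope_angles(pos, dim=8, base=10_000.0, scale=1.0):
    # One rotation angle per pair of channels at a (possibly scaled) position.
    return [(pos / scale) / base ** (2 * i / dim) for i in range(dim // 2)]

local_angles = rope_angles(pos=500)                   # local layer: plain RoPE
global_angles = rope_angles(pos=200_000, scale=8.0)   # global layer: scaled RoPE

print(local_angles[0], global_angles[0])
```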

Sparse MoE Design β€” 128 Experts, 8+1 Active

The 26B MoE model uses a classic sparse Mixture-of-Experts design with 128 small expert networks. For each token, a learned router selects 8 domain experts plus 1 shared expert (a permanently active expert that handles cross-domain reasoning). Only those 9 experts activate β€” the remaining 119 stay dormant.

This is why 25.2B total parameters collapse to only 3.8B active per token. The experts specialize: some become adept at code, others at mathematics, others at language understanding. The shared expert acts as a universal integrator. The result is a model that stores the knowledge of a 25B parameter model but pays only the inference cost of a 4B one.
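The routing step itself is simple; here is a toy sketch of the 8+1 pattern described above, with random numbers standing in for the learned router's scores:

```python
# Toy sketch of Gemma 4-style sparse MoE routing: a router scores 128
# experts, the top 8 plus one always-on shared expert execute, and the
# remaining 119 stay dormant for this token.

import random

N_EXPERTS, TOP_K = 128, 8
random.seed(1)

def route():
    scores = [random.random() for _ in range(N_EXPERTS)]   # stand-in router logits
    top = sorted(range(N_EXPERTS), key=lambda i: scores[i], reverse=True)[:TOP_K]
    return top

active = route()
# 8 routed domain experts + 1 shared expert actually run for this token
print(len(active) + 1, "of", N_EXPERTS + 1, "experts active")
```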

Shared KV Cache

Gemma 4 also introduces a shared KV cache across the last N decoder layers. Instead of each layer maintaining its own separate key-value cache, the final layers reuse KV tensors computed earlier in the network. This reduces memory pressure significantly during inference β€” particularly important for long-context generation where KV cache can dominate GPU memory allocation.
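A back-of-envelope estimate shows why this matters at long context. The layer count, KV head count, head dimension, and the "last 8 layers share" choice below are illustrative assumptions, not Gemma 4's actual configuration:

```python
# Toy estimate of KV cache memory with and without sharing across the
# last N decoder layers, at 256K context in bfloat16 (2 bytes/value).

def kv_cache_gb(n_layers, seq_len=262_144, n_kv_heads=8, head_dim=128,
                bytes_per=2):
    # Per layer: seq_len * heads * head_dim values, x2 for K and V tensors.
    return n_layers * seq_len * n_kv_heads * head_dim * 2 * bytes_per / 1024**3

full = kv_cache_gb(48)              # every layer keeps its own KV cache
shared = kv_cache_gb(48 - 8 + 1)    # last 8 layers reuse a single KV set
print(f"{full:.1f} GB vs {shared:.1f} GB")
```

Under these assumptions the cache shrinks by several gigabytes β€” memory that goes straight back to batch size or context length during serving.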

Vision & Audio Encoders

Every Gemma 4 model ships with a vision encoder and an audio encoder. This isn't an afterthought β€” multimodal capability is a first-class feature across the entire family.

Vision Encoder

The vision encoder uses a 2D position encoder with multidimensional RoPE, allowing the model to represent spatial position in two dimensions rather than collapsing the image into a 1D sequence. Images are tokenized with a configurable token budget ranging from 70 to 1,120 tokens per image β€” lower budgets for faster inference, higher budgets for detailed visual analysis.

This configurable token budget is a practical feature for production deployment: you can tune the accuracy/speed tradeoff at inference time without retraining. A thumbnail image might need only 70 tokens; a technical diagram requiring OCR or precise object detection might warrant the full 1,120.
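Some quick arithmetic makes the tradeoff concrete. The 70 and 1,120 budgets come from the announcement; the intermediate 280 and the rounded 256,000-token context are assumptions for illustration:

```python
# Illustrative arithmetic: how many images fit in a ~256K context at
# different per-image token budgets (ignoring text tokens).

CONTEXT = 256_000

for budget in (70, 280, 1_120):
    print(f"{budget:>5} tokens/image -> {CONTEXT // budget} images per context")
```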

Audio Encoder

The audio encoder is based on a USM-style conformer architecture β€” the same class of model Google uses in its production speech recognition systems. It handles speech recognition as well as general audio understanding, processing up to 30 seconds of audio per input.

The conformer architecture combines convolutional layers (for local acoustic patterns) with transformer self-attention (for long-range audio dependencies). This makes Gemma 4 E2B and E4B genuinely useful for on-device voice assistants: the entire pipeline β€” audio encoding, understanding, and response generation β€” runs in a single model on a smartphone.

Video (Large Models Only)

The 26B MoE and 31B Dense models add video understanding on top of the vision encoder. Video frames are sampled and processed through the vision encoder, with temporal position encoded via the multidimensional RoPE scheme. The 256K context window gives these models substantial capacity to reason across longer video sequences.

Benchmark Breakdown

Benchmark results tell the Gemma 4 story clearly: the 31B Dense is a genuine top-3 open model, and the 26B MoE competes at an efficiency level that's hard to match.

Reasoning

| Benchmark | 31B Dense | 26B MoE | E4B | E2B |
| --- | --- | --- | --- | --- |
| MMLU Pro | 85.2% | 82.6% | 69.4% | 60.0% |
| AIME 2026 (no tools) | 89.2% | 88.3% | β€” | β€” |
| GPQA Diamond | 84.3% | 82.3% | β€” | β€” |
| BigBench Extra Hard | 74.4% | β€” | β€” | β€” |
| LMArena | ~1452 ELO (#3) | #6 | β€” | β€” |

The BigBench Extra Hard number deserves special attention. Gemma 3 scored 19.3% on this benchmark; Gemma 4 31B scores 74.4%. That's not an incremental improvement β€” it's a nearly 4Γ— jump on one of the hardest reasoning benchmarks. BigBench Extra Hard is specifically designed to be resistant to memorization and shallow pattern matching, making this result a meaningful signal about genuine reasoning capability.

The AIME 2026 score of 89.2% (without tools) is equally remarkable. AIME β€” the American Invitational Mathematics Examination β€” is an elite competition whose problems defeat the large majority of strong high-school mathematicians. Scoring nearly 90% without any external tools puts the 31B Dense in genuinely elite mathematical reasoning territory.

Coding

| Benchmark | 31B Dense | 26B MoE |
| --- | --- | --- |
| LiveCodeBench v6 | 80.0% | 77.1% |
| Codeforces ELO | 2150 | β€” |

A Codeforces ELO of 2150 places the 31B Dense in the Master tier on competitive programming (Grandmaster begins at 2400) β€” higher than the vast majority of rated human contestants. LiveCodeBench v6 at 80.0% reflects real-world coding quality across diverse programming challenges, not cherry-picked tasks.

Vision & Multimodal

| Benchmark | 31B Dense | 26B MoE |
| --- | --- | --- |
| MMMU Pro (vision) | 76.9% | β€” |
| MATH-Vision | 85.6% | β€” |

MATH-Vision at 85.6% β€” solving mathematical problems presented as images β€” demonstrates that Gemma 4's vision encoder isn't just for image description. It can genuinely parse and reason about diagrams, equations, and visual mathematical structure.

Honest Analysis: Where the Numbers Come From

The 31B ranks #3 on LMArena behind GLM-5 and Kimi 2.5. Both of those models are substantially larger than 31B β€” making the efficiency story compelling. But it's worth noting: #3 globally still means there are closed models well above it. This is the best open model for most use cases, but it's not the best model, full stop.

The 26B MoE's rank of #6 with only 3.8B active parameters is arguably the more impressive result from a pure efficiency standpoint. You're getting top-6 open model quality while paying approximately the compute cost of a 4B model.

Multimodal Capabilities in Practice

What does it mean that all four Gemma 4 models ship with vision and audio? In practice, it reshapes what you can build.

For the E2B and E4B models, multimodal on-device means applications that were previously impossible at this size. An E4B deployment on a Raspberry Pi can now handle voice input, process images from a connected camera, and respond in text β€” all without hitting a remote API. The latency and privacy implications are significant.

For the 26B MoE and 31B Dense, video adds a new dimension. A document analysis pipeline can now ingest video recordings of presentations, extract frames, and reason across them within a single 256K context window. Customer service applications can analyze video submitted by users. Scientific workflows can process experimental recordings.

"The configurable token budget for vision β€” 70 to 1,120 tokens per image β€” is a feature that deserves more attention. You're trading off accuracy for speed dynamically, without model retraining."

Audio support across all models opens up speech-native interfaces. The USM conformer encoder means you can send raw audio directly to the model rather than running a separate speech-to-text pipeline first. This is meaningfully simpler to build and maintain, especially on edge devices.

Agentic Capabilities

Gemma 4 ships with first-class support for agentic workflows β€” not as fine-tuned variants, but baked into the base models.

Native Function Calling

All four models support structured function calling, allowing them to invoke external tools in a typed, predictable format. The function call interface is designed to be compatible with existing OpenAI-style function calling conventions, reducing migration friction for teams already using agentic frameworks.
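For reference, this is the general shape of an OpenAI-style tool definition of the kind the announcement says Gemma 4 is compatible with. The weather tool and the sample model response are made-up illustrations, not an actual Gemma API:

```python
# An OpenAI-style tool definition plus parsing of a hypothetical
# structured tool call a model might emit in response.

import json

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# Hypothetical structured tool call emitted by the model:
model_output = '{"name": "get_weather", "arguments": {"city": "Oslo"}}'
call = json.loads(model_output)
print(call["name"], call["arguments"]["city"])
```

Because the format is typed JSON rather than free text, the application layer can dispatch the call and validate arguments without brittle string parsing.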

Structured JSON Output

Gemma 4 models can generate structured JSON on demand, constrained by a provided schema. This is essential for production pipelines where downstream systems expect reliable, typed output β€” API responses, database entries, configuration objects. The model doesn't just attempt JSON; it respects schema constraints during generation.
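A pipeline consuming such output would still typically verify it. A minimal sketch of that downstream check, with a made-up schema and sample output:

```python
# Minimal validation of schema-constrained model output: parse the JSON
# and check that required fields exist with the expected types.

import json

schema_required = {"title": str, "year": int}   # hypothetical schema

sample = '{"title": "Quarterly report", "year": 2026}'   # model's output
obj = json.loads(sample)

ok = all(isinstance(obj.get(k), t) for k, t in schema_required.items())
print("valid:", ok)
```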

System Instructions

System instruction following in Gemma 4 is significantly improved over Gemma 3. The models reliably maintain persona, output format constraints, safety rules, and domain restrictions across multi-turn conversations. This matters for agents that need to stay on-task across many turns of tool calls and user interactions.

πŸ”§
Function Calling

Native, typed tool invocation compatible with OpenAI-style conventions. Supports multi-tool calls in a single response and parallel tool execution planning.

{}
Structured JSON

Schema-constrained JSON generation. The model respects field types, required vs. optional fields, and nested schemas during autoregressive generation.

πŸ“‹
System Instructions

Reliable instruction following across multi-turn conversations. Persona, format, safety, and domain constraints persist through long agentic sessions.

🧩
Framework Compatibility

Works with LangChain, LlamaIndex, CrewAI, and other major agentic frameworks out of the box. Hugging Face and Ollama support on day one.

Hardware Requirements

Gemma 4's model range covers the full hardware spectrum, from a Raspberry Pi to a data center GPU.

Gemma 4 E2B β€” 2.3B Active / 5.1B Total

  β€’ πŸ“± Android smartphone (2022+)
  β€’ πŸ“ Raspberry Pi 4/5
  β€’ πŸ€– Jetson Nano
  β€’ πŸ’» MacBook Air (M1+, 8GB)

Gemma 4 E4B β€” 4.5B Active / 8B Total

  β€’ πŸ“± Flagship smartphones (8GB RAM)
  β€’ πŸ“ Raspberry Pi 5 (8GB)
  β€’ πŸ’» MacBook Air (M2+, 16GB)
  β€’ πŸ–₯️ RTX 3060 (12GB)

Gemma 4 26B MoE β€” 3.8B Active / 25.2B Total

  β€’ πŸ–₯️ H100 80GB (unquantized)
  β€’ πŸ–₯️ RTX 4090 (quantized 4-bit)
  β€’ πŸ’» MacBook Pro M3 Max (128GB)
  β€’ ⚑ High-throughput production

Gemma 4 31B Dense β€” 30.7B Active / 30.7B Total

  β€’ πŸ–₯️ H100 80GB (unquantized)
  β€’ πŸ–₯️ RTX 4090 + RTX 4080 (quantized)
  β€’ πŸ’» MacBook Pro M3 Ultra (192GB)
  β€’ 🎯 Quality-critical, fine-tuning

A key detail from Ars Technica's reporting: both the 26B MoE and 31B Dense run unquantized in bfloat16 on a single H100 80GB. This matters because quantization always trades some quality for size β€” being able to run the full-precision model on a single enterprise GPU is significant for research and fine-tuning use cases.
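The single-H100 claim is easy to sanity-check with back-of-envelope arithmetic: bfloat16 stores each parameter in 2 bytes, so weight memory is simply total parameters times two (KV cache and activations come on top and depend on batch size and context length):

```python
# Rough bfloat16 weight-memory check for the two workstation models.
# Weights only; runtime memory (KV cache, activations) is extra.

def bf16_weight_gb(params_billion):
    return params_billion * 1e9 * 2 / 1024**3   # 2 bytes per parameter

for name, total in [("26B MoE", 25.2), ("31B Dense", 30.7)]:
    print(f"{name}: {bf16_weight_gb(total):.1f} GB of weights")
```

Both models land comfortably under 80 GB of weights, which is consistent with the reported single-GPU fit, with headroom left for the KV cache at moderate context lengths.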

The 26B MoE's advantage here is throughput: with only 3.8B parameters active per token, it generates tokens much faster than the 31B Dense at the same hardware level. For production serving where you need high requests-per-second, the 26B MoE is the better choice.

The Apache 2.0 Moment

The license change in Gemma 4 deserves its own section. It's not a footnote β€” it's a fundamental shift in how Google is positioning Gemma in the open ecosystem.

Previous Gemma releases used a custom "Gemma License" that included:

  • Monthly Active User (MAU) caps β€” free up to a threshold, paid above it
  • Acceptable-use restrictions that limited certain commercial applications
  • Attribution and modification requirements that complicated redistribution
  • Legal ambiguity that made enterprise legal teams nervous

βœ… What Apache 2.0 Actually Means

  β€’ No MAU caps β€” deploy to 10 users or 10 million users, same license.
  β€’ No acceptable-use restrictions β€” commercial use, SaaS products, integrations, all permitted.
  β€’ Full modification rights β€” fine-tune, quantize, distill, build derivatives freely.
  β€’ Sublicensing permitted β€” include in your own product without licensing requirements passing through.
  β€’ Patent grant included β€” Apache 2.0 includes an explicit patent license, which the Gemma License did not.

Ars Technica characterized this as Google "acknowledging developer frustrations" β€” a diplomatic way of saying the previous license was unpopular enough to represent a real adoption barrier. Enterprise customers were avoiding Gemma specifically because of legal uncertainty. Research teams were choosing Llama or Mistral models not because of quality, but because the license was cleaner.

The switch to Apache 2.0 removes all of that friction. For the 100,000+ existing Gemma community variants (the Gemmaverse), this is a retroactive gift: they can now relicense their work under Apache 2.0 as well, since the base model allows it.

This is Google making a strategic bet on the open ecosystem. By choosing Apache 2.0 β€” the most permissive common open source license β€” they're prioritizing adoption over control. The calculus is simple: a widely adopted open model generates more goodwill, more fine-tunes, more ecosystem tooling, and ultimately more infrastructure spend on Google Cloud than a restrictive license that drives developers to competitors.

Use Cases & Real-World Fine-tuning

Google's announcement highlighted two real-world fine-tuning examples that illustrate Gemma 4's range.

BgGPT β€” Bulgarian Language Model

A team used Gemma 4 as the base for BgGPT, a Bulgarian-language model. This is a demonstration of the multilingual foundation Gemma 4 provides: you can fine-tune for low-resource languages without starting from scratch, leveraging the broad multilingual pretraining from Gemini 3 research. For language communities without large commercial AI investments, this matters enormously.

Cell2Sentence-Scale β€” Yale Cancer Therapy Discovery

Yale researchers used Gemma 4 as the base for Cell2Sentence-Scale, a fine-tuned model for cancer therapy discovery. This application β€” using a language model to reason about cellular biology β€” is exactly the kind of high-stakes scientific deployment where Apache 2.0 licensing, strong reasoning capabilities, and fine-tuning support all converge.

Deployment Patterns

Beyond fine-tuning, the model family enables several deployment patterns:

  • Mobile AI assistants: E2B/E4B with audio input and image analysis, fully on-device, privacy-preserving
  • Edge analytics: E4B on Jetson Nano for real-time visual inspection in manufacturing or security
  • Agentic pipelines: 26B MoE as a high-throughput orchestrator in multi-agent systems
  • Research & RAG: 31B Dense with 256K context for long-document analysis and retrieval-augmented generation
  • Code assistants: 31B Dense (Codeforces ELO 2150) for competitive programming and complex code generation
  • Video understanding: 26B MoE or 31B Dense for video QA, summarization, and content analysis pipelines

Who Should Use Which Model

βœ… Use E2B When…

  • Building Android or iOS apps with on-device AI
  • Deploying on Raspberry Pi or Jetson Nano
  • Need near-zero latency with voice/audio input
  • Running always-on background AI (battery matters)
  • Privacy-sensitive applications (data never leaves device)

⚠️ E2B Limitations

  • MMLU Pro at 60% β€” limited complex reasoning
  • 128K context only (no video)
  • Not suitable for expert-level coding or math

βœ… Use E4B When…

  • Flagship smartphone or Raspberry Pi 5 (8GB)
  • Need meaningfully better quality than E2B
  • Image + audio multimodal at the edge
  • Consumer-facing apps needing mid-tier quality

⚠️ E4B Limitations

  • No video support
  • 128K context limit
  • MMLU Pro ~69% β€” weaker than full workstation models

βœ… Use 26B MoE When…

  • High-throughput production serving
  • Single H100 or RTX 4090 (quantized)
  • Need 256K context + video
  • Cost-sensitive at scale β€” fast tokens/sec
  • Arena-quality (#6) with low active compute

⚠️ 26B MoE Limitations

  • MoE models harder to fine-tune than dense
  • Slightly below 31B Dense on all benchmarks

βœ… Use 31B Dense When…

  • Maximum quality is the priority
  • Fine-tuning for domain-specific tasks
  • Research and academic applications
  • Expert-level coding (Codeforces 2150)
  • Complex mathematical reasoning (AIME 89.2%)

⚠️ 31B Dense Limitations

  • Slower inference than 26B MoE (all params active)
  • Requires H100 for unquantized use
  • Still behind top closed models (GPT-5, Gemini 3, etc.)

The Bottom Line

Gemma 4 represents a genuine step-change in what's available under open licenses. For mobile developers, the E2B and E4B models eliminate the need for remote API calls entirely β€” full multimodal capability on-device, at near-zero latency. For workstation and cloud deployments, the 26B MoE and 31B Dense are legitimate top-ranked open models (#6 and #3 on LMArena), not consolation prizes.

The Apache 2.0 license removes the last real objection to enterprise adoption. The research pedigree β€” built from Gemini 3 technology β€” ensures this isn't a rushed release. And the 400 million downloads from previous Gemma versions means tooling, integrations, and community support are already in place.

The open AI landscape has never been more competitive. Gemma 4 is a strong entry β€” and the first open model family from a major lab that honestly competes across the full hardware spectrum from smartphone to data center.

References

  1. Google Official Blog β€” Gemma 4 Announcement (April 2, 2026)
  2. Ars Technica β€” Google Announces Gemma 4, Switches to Apache 2.0 License (April 2, 2026)
  3. WaveSpeed AI β€” What is Google Gemma 4? Technical Analysis (April 2, 2026)
  4. Hugging Face β€” google/gemma-4-31b (Model Card)
  5. Google AI for Developers β€” Gemma Documentation