
Inside the Mamba-MoE Engine of Nemotron 3

Last Updated on February 3, 2026 by Editorial Team

Author(s): Kyouma45

Originally published on Towards AI.


TL;DR

The Models: The family includes Nano, Super, and Ultra.
The Architecture: A Hybrid Mamba-Transformer Mixture-of-Experts (MoE) design that replaces most attention layers with Mamba-2 layers for high throughput.

Key Innovations:

  • LatentMoE: A new expert routing mechanism in Super/Ultra that projects tokens into a smaller latent space to improve accuracy-per-byte.
  • MTP (Multi-Token Prediction): Enables faster generation via native speculative decoding.
  • NVFP4: Native 4-bit floating-point training for the larger models.
  • Capabilities: Supports 1M token context windows and granular inference-time reasoning budget control.

Paper-explained Series 5

Nemotron 3 is a new family of open models that introduces radical architectural shifts, including a specialized Mixture-of-Experts (MoE) design, native NVFP4 training, and a massive 1-million-token context window.

1. The Core Architecture: Hybrid Mamba-Transformer MoE

The defining feature of the Nemotron 3 family is its Hybrid Mamba-Transformer Mixture-of-Experts (MoE) architecture.

Breaking the Attention Bottleneck

Standard Transformer models rely on self-attention layers, which require a Key-Value (KV) cache that grows linearly during generation. This growth creates a memory bottleneck that hampers inference throughput, especially for long-context reasoning.

To solve this, Nemotron 3 predominantly interleaves MoE layers with Mamba-2 layers.

  • Mamba-2 Layers: These layers process sequences using a constant state during generation, avoiding the massive memory footprint of a growing cache.
  • Sparse Attention: The models retain only a “select few” self-attention layers to handle high-fidelity, all-to-all information routing where absolutely necessary.
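The memory argument above can be made concrete with a toy sketch (not Nemotron 3's actual code; all dimensions are illustrative): a self-attention layer must cache one key/value pair per generated token, while a Mamba-style state-space layer updates a single fixed-size state no matter how long the sequence gets.

```python
# Sketch: per-token memory of attention (growing KV cache) vs. an
# SSM-style layer (constant state). Dimensions are illustrative.
import numpy as np

d_model, d_state = 64, 16
rng = np.random.default_rng(0)

# Attention: the KV cache grows by one (key, value) pair per token.
kv_cache = []
def attention_step(x, kv_cache):
    kv_cache.append((x.copy(), x.copy()))      # cache grows linearly
    keys = np.stack([k for k, _ in kv_cache])
    vals = np.stack([v for _, v in kv_cache])
    scores = keys @ x / np.sqrt(d_model)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ vals

# SSM: a single fixed-size state is updated in place.
A = 0.9 * np.eye(d_state)
B = rng.normal(size=(d_state, d_model)) * 0.01
C = rng.normal(size=(d_model, d_state)) * 0.01
state = np.zeros(d_state)
def ssm_step(x, state):
    state = A @ state + B @ x                  # constant memory footprint
    return C @ state, state

for t in range(100):
    x = rng.normal(size=d_model)
    attention_step(x, kv_cache)
    _, state = ssm_step(x, state)

print(len(kv_cache))   # 100 cached (k, v) pairs after 100 tokens
print(state.shape)     # (16,) -- unchanged, however long the sequence
```

After 100 decode steps the attention cache holds 100 entries while the SSM state is still a single 16-dimensional vector; this is the asymmetry that makes Mamba-heavy stacks attractive for long-context generation.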

Interestingly, the Jet-Nemotron paper reported that Gated DeltaNet outperformed Mamba in its comparisons, so the choice here may come down to Mamba simply pairing well with the MoE design.

The Result

This design delivers best-in-class throughput. For example, the Nemotron-3-Nano-30B-A3B (30B total parameters, ~3B active) achieves 3.3x higher throughput compared to the similarly sized Qwen3-30B-A3B model.

2. LatentMoE: Compressing the Router (Super & Ultra)

While Nano uses a standard hybrid MoE design, the larger Super and Ultra models introduce a novel architecture called LatentMoE.

The Problem with Standard MoE

In large-scale deployments, MoE layers face distinct bottlenecks depending on the workload:

  • Latency-focused: Even though an MoE model only uses a few “active” experts per token, the GPU still has to fetch those specific expert weights from its main memory (VRAM) into the compute cores. These weight matrices are massive. Their size is determined by the model’s hidden dimension d and the expert’s internal size m. Because the batch size is small, the GPU spends more time waiting for these massive weights to arrive from memory than it spends actually doing the math.
  • Throughput-focused: In a large-scale MoE, experts are often distributed across different GPUs or chips. For every layer, the model must check which expert is needed for which token and “dispatch” that token to the correct GPU. This creates a massive traffic jam of data moving between GPUs. If the “lanes” between GPUs (interconnect bandwidth) get clogged, the powerful compute cores sit idle waiting for data to arrive.
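The latency-focused bottleneck can be checked with back-of-envelope arithmetic. The sketch below is a rough roofline-style estimate, not a benchmark; the expert sizes and the memory-bandwidth and FLOP figures are illustrative assumptions, not measured numbers for any specific GPU.

```python
# Back-of-envelope sketch of why small-batch MoE decoding is memory-bound.
# Hardware numbers are illustrative assumptions (~3 TB/s HBM, ~1 PFLOP/s).

def moe_expert_cost(d, m, batch, bytes_per_param=2,
                    mem_bw=3.0e12, flops=1.0e15):
    """Estimate weight-load time vs compute time for one expert FFN
    (an up-projection d x m plus a down-projection m x d)."""
    params = 2 * d * m
    load_s = params * bytes_per_param / mem_bw   # time to fetch weights
    compute_s = 2 * params * batch / flops       # 2 FLOPs per param per token
    return load_s, compute_s

# Latency-focused serving: a batch of 1 token routed to this expert.
load, compute = moe_expert_cost(d=4096, m=14336, batch=1)
print(load > 100 * compute)   # True: loading weights dwarfs the math
```

With a batch of one token, fetching the expert's weights takes orders of magnitude longer than the matrix multiplies themselves, which is exactly the "GPU waits for memory" regime described above; only at large batch sizes does the compute term catch up.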

The LatentMoE Solution

LatentMoE addresses these issues by compressing the routing mechanism. Instead of performing routing and computation in the full model hidden dimension d, the model:

  • Projects the token embedding into a smaller latent dimension l.
  • Routes and Computes entirely within this compressed latent space.
  • Projects back to the original hidden dimension.

This compression reduces parameter loads and communication payloads by a factor of roughly 4x. NVIDIA reinvests these savings by scaling up the total number of experts N and the active experts per token K by that same factor. The result is improved accuracy per byte without sacrificing inference throughput or latency.
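The three steps above can be sketched in a few lines. This is a minimal illustration of the LatentMoE idea as described in the paper, not NVIDIA's implementation; all shapes, names, and the ReLU expert FFN are illustrative assumptions.

```python
# Minimal LatentMoE sketch: project d -> l, route and compute experts
# entirely in the latent space, then project back to d.
import numpy as np

rng = np.random.default_rng(0)
d, l, n_experts, top_k = 512, 128, 16, 4     # l = d / 4 compression

W_down = rng.normal(size=(d, l)) * 0.02      # hidden -> latent
W_up = rng.normal(size=(l, d)) * 0.02        # latent -> hidden
router = rng.normal(size=(l, n_experts)) * 0.02
experts = [(rng.normal(size=(l, 2 * l)) * 0.02,   # latent-sized expert FFNs
            rng.normal(size=(2 * l, l)) * 0.02) for _ in range(n_experts)]

def latent_moe(x):
    z = x @ W_down                            # (1) project to latent space
    logits = z @ router                       # (2) route in latent space
    top = np.argsort(logits)[-top_k:]
    gates = np.exp(logits[top])
    gates /= gates.sum()
    out = np.zeros(l)
    for g, e in zip(gates, top):              # (3) compute in latent space
        w1, w2 = experts[e]
        out += g * (np.maximum(z @ w1, 0) @ w2)
    return out @ W_up                         # (4) project back to hidden

y = latent_moe(rng.normal(size=d))
print(y.shape)  # (512,): same hidden size, but expert weights are latent-sized
```

Because every expert matrix is sized by `l` rather than `d`, both the weights fetched from memory and the token payloads dispatched between devices shrink by the compression factor, which is the budget NVIDIA reinvests into more (and more active) experts.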

MoE models also face the "expert imbalance" problem (router collapse) during training. The router (the part of the network that decides which expert gets which token) may discover early on that one or two experts are slightly better than the others and start sending all tokens to just those few favored experts. Consequently, only those experts get trained and improve, while the rest are starved of data and remain "dumb." To counteract this, researchers typically add load-balancing loss functions (auxiliary losses) that penalize the model for distributing tokens unevenly. Tuning these balancers is difficult: force the balance too hard and the model ignores the actual data; force it too little and the experts collapse.
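A standard formulation of such an auxiliary loss (the Switch Transformer version; Nemotron 3's exact recipe may differ) multiplies, per expert, the fraction of tokens routed to it by the mean router probability it received, so the loss is minimized when both are uniform:

```python
# Sketch of a Switch-style load-balancing auxiliary loss. A healthy router
# scores ~1.0; a collapsed router that favors one expert scores much higher.
import numpy as np

def load_balancing_loss(router_logits, expert_ids, n_experts):
    """Per-expert (fraction of tokens routed) x (mean router probability)."""
    probs = np.exp(router_logits)
    probs /= probs.sum(axis=-1, keepdims=True)        # softmax over experts
    frac_tokens = np.bincount(expert_ids, minlength=n_experts) / len(expert_ids)
    mean_prob = probs.mean(axis=0)
    return n_experts * float(frac_tokens @ mean_prob)  # ~1.0 when uniform

rng = np.random.default_rng(0)
n_tokens, n_experts = 1024, 8

# Healthy router: roughly uniform probabilities and assignments.
logits = rng.normal(size=(n_tokens, n_experts))
balanced = load_balancing_loss(
    logits, rng.integers(0, n_experts, n_tokens), n_experts)

# Collapsed router: strongly favors expert 0 and sends every token there.
biased = logits.copy()
biased[:, 0] += 3.0
collapsed = load_balancing_loss(biased, np.zeros(n_tokens, dtype=int), n_experts)

print(balanced < collapsed)   # True: collapse onto one expert raises the loss
```

Scaling this term's weight is exactly the difficult tuning knob mentioned above: too large and the balancing signal drowns out the task loss, too small and it fails to prevent collapse.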

3. Multi-Token Prediction (MTP)

To further accelerate generation, the Super and Ultra models incorporate Multi-Token Prediction (MTP) layers.

Rather than predicting only the single next token, the model is trained to predict multiple future tokens simultaneously. This serves two critical functions:

  • Richer Training Signal: It forces the model to plan several steps ahead, improving reasoning capabilities.
  • Native Speculative Decoding: The auxiliary predictions serve naturally as “draft tokens”. In ablation studies, the MTP module achieved a 97% acceptance rate on the first two predicted tokens, enabling substantial speedups without requiring a separate draft model.
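The accept/reject mechanics of using MTP drafts for speculative decoding can be shown with a toy loop. Both "models" below are deterministic stand-in functions, not real networks (a real system verifies all draft tokens in one batched forward pass); the point is only the logic: accept the longest draft prefix the target model agrees with, then let the target contribute one token itself.

```python
# Toy speculative decoding loop with an MTP-style draft head.

def target_next_token(prefix):
    """Stand-in for the full model's greedy next-token choice."""
    return (sum(prefix) * 31 + len(prefix)) % 100

def mtp_draft(prefix, k=2):
    """Stand-in MTP head: proposes k future tokens at once.
    Deliberately right most of the time, but not always."""
    out, p = [], list(prefix)
    for _ in range(k):
        guess = target_next_token(p)
        if len(p) % 10 == 0:              # occasionally drafts a wrong token
            guess = (guess + 1) % 100
        out.append(guess)
        p.append(guess)
    return out

def speculative_step(prefix, k=2):
    """Accept the longest draft prefix the target agrees with, then
    append one token from the target model itself."""
    p = list(prefix)
    for t in mtp_draft(prefix, k):
        if t == target_next_token(p):
            p.append(t)                   # draft token accepted for free
        else:
            break                         # first mismatch rejects the rest
    p.append(target_next_token(p))        # target always contributes one token
    return p

seq = [1, 2, 3]
for _ in range(5):
    seq = speculative_step(seq)
print(len(seq))   # 17: 5 steps produced 14 tokens vs 5 with plain decoding
```

Each step yields up to `k + 1` tokens per target-model invocation, which is where the speedup comes from; the 97% first-two-token acceptance rate reported for the MTP module means the happy path above is taken almost every step.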

Read more about Speculative Decoding here

4. NVFP4 Training: Pushing Hardware Limits

Nemotron 3 pushes training efficiency to the limit by utilizing NVFP4 (NVIDIA 4-bit Floating Point) for the Super and Ultra models.

Unlike previous works that simulated low-precision training, Nemotron 3 uses native NVFP4 GEMMs for forward propagation, gradient calculation, and weight updates. To maintain stability, the team developed a specific mixed-precision recipe:

  • Quantized: Weight, activation, and gradient tensors are quantized to NVFP4.
  • High Precision: Sensitive layers — specifically Mamba output projections (which are prone to flushing to zero), QKV projections, and Attention projections — are kept in higher precision (BF16 or MXFP8).

This approach resulted in a training loss difference of <1% compared to standard BF16 training, with comparable downstream task accuracy.
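To see why 4-bit floating point is workable at all, the sketch below simulates FP4 (E2M1) quantization with a per-block scale, in the spirit of NVFP4. It is only a numerical illustration: real NVFP4 runs in hardware GEMMs with FP8 block scales, and the block size and error figures here are illustrative assumptions.

```python
# Rough simulation of FP4 (E2M1) block quantization.
import numpy as np

# The magnitudes representable in E2M1: sign x {0, 0.5, 1, 1.5, 2, 3, 4, 6}.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_GRID = np.concatenate([-FP4_GRID[::-1], FP4_GRID])

def quantize_fp4_blocked(x, block=16):
    """Scale each block so its max maps to 6 (FP4's max magnitude),
    round every element to the nearest grid point, then rescale."""
    x = x.reshape(-1, block)
    scale = np.abs(x).max(axis=1, keepdims=True) / 6.0
    scale[scale == 0] = 1.0
    idx = np.abs(x / scale - FP4_GRID[:, None, None]).argmin(axis=0)
    return FP4_GRID[idx] * scale            # dequantized values

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 64))
wq = quantize_fp4_blocked(w.ravel()).reshape(w.shape)
rel_err = np.linalg.norm(w - wq) / np.linalg.norm(w)
print(f"relative error: {rel_err:.3f}")   # typically around 0.1
```

The per-block scale is what keeps the error tolerable: each small group of weights gets its own dynamic range, so outliers in one block don't crush the precision of another. Tensors where even this error is fatal (the Mamba output projections above) stay in BF16 or MXFP8.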

5. 1M Context & Agentic Capabilities

Extreme Context Length

Nemotron 3 supports a context length of up to 1 million tokens, enabling the processing of large codebases and extensive documents. Notably, because the Mamba layers provide implicit positional information, the attention layers do not require Rotary Position Embeddings (RoPE). This eliminates the out-of-distribution issues often seen when extending Transformer context windows.

Multi-Environment RL

Post-training involves Multi-environment Reinforcement Learning (RL). Instead of a staged approach (e.g., learning coding, then math), Nemotron 3 is trained on diverse environments simultaneously. This method was found to be more stable and less prone to “reward hacking” than staged training.

Reward hacking occurs when an AI model discovers a “loophole” or unintended strategy to maximize its reward score without actually achieving the desired goal or behaving correctly. This happens because the reward function is often an imperfect proxy for the true objective, leading the model to exploit flaws in how success is measured rather than learning the actual task.

Granular “Thinking” Control

Similar to other recent reasoning models, Nemotron 3 allows for inference-time reasoning budget control. Users can set a specific token budget for the model’s “thinking trace.” When the model reaches this limit, a </think> token is appended, forcing the model to conclude its reasoning and generate a response.
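The mechanism can be sketched as a generation loop that counts thinking tokens and force-closes the trace once the budget is spent. The model below is a trivial stand-in function, not a real decoder; only the budget bookkeeping is the point.

```python
# Toy sketch of inference-time reasoning budget control.

def fake_model_step(tokens):
    """Stand-in for one decode step: 'reasons' until the trace is closed."""
    return "reason" if "</think>" not in tokens else "answer"

def generate(budget, max_answer=4):
    tokens = ["<think>"]
    thinking_tokens = 0
    while True:
        tok = fake_model_step(tokens)
        if "</think>" not in tokens:          # still inside the thinking trace
            if thinking_tokens >= budget:
                tok = "</think>"              # budget spent: force the close
            else:
                thinking_tokens += 1
        tokens.append(tok)
        if tokens.count("answer") >= max_answer:
            break
    return tokens

out = generate(budget=8)
print(out.index("</think>"))   # 9: trace cut off after exactly 8 thinking tokens
```

In the real model the forced `</think>` token steers the network itself into answer mode; here the stand-in just switches on the token's presence, but the budget accounting is the same.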

Link to original paper: https://arxiv.org/abs/2512.20856

A massive congratulations as always to the research team at NVIDIA — specifically the leadership team including Andrew Tao, Bita Darvish Rouhani, Boris Ginsburg, Bryan Catanzaro, Carlo del Mundo, Eileen Long, Eric Chung, Jane Polak Scowcroft, Jan Kautz, Jian Zhang, Joey Conway, Jonathan Cohen, Kari Briski, Mohammad Shoeybi, Mostofa Patwary, Oleksii Kuchaiev, Oluwatobi Olabiyi, Pavlo Molchanov, Ran El-Yaniv, Ran Zilberstein, Yonatan Geifman, and Yejin Choi, alongside the extensive teams across Data, Architecture, Pretraining, and Infrastructure.

Until next time folks…
El Psy Congroo
