
Inside the Mamba-MoE Engine of Nemotron 3

Last Updated on February 3, 2026 by Editorial Team

Author(s): Kyouma45

Originally published on Towards AI.


TL;DR

The Models: The family includes Nano, Super, and Ultra.
The Architecture: A Hybrid Mamba-Transformer Mixture-of-Experts (MoE) design that replaces most attention layers with Mamba-2 layers for high throughput.

Key Innovations:

  • LatentMoE: A new expert routing mechanism in Super/Ultra that projects tokens into a smaller latent space to improve accuracy-per-byte.
  • MTP (Multi-Token Prediction): Enables faster generation via native speculative decoding.
  • NVFP4: Native 4-bit floating-point training for the larger models.
  • Capabilities: Supports 1M token context windows and granular inference-time reasoning budget control.

Paper-explained Series 5

Nemotron 3 is a new family of open models that introduces radical architectural shifts, including a specialized Mixture-of-Experts (MoE) design, native NVFP4 training, and a massive 1-million-token context window.

1. The Core Architecture: Hybrid Mamba-Transformer MoE

The defining feature of the Nemotron 3 family is its Hybrid Mamba-Transformer Mixture-of-Experts (MoE) architecture.

Breaking the Attention Bottleneck

Standard Transformer models rely on self-attention layers, which require a Key-Value (KV) cache that grows linearly during generation. This growth creates a memory bottleneck that hampers inference throughput, especially for long-context reasoning.

To solve this, Nemotron 3 predominantly interleaves MoE layers with Mamba-2 layers.

  • Mamba-2 Layers: These layers process sequences using a constant state during generation, avoiding the massive memory footprint of a growing cache.
  • Sparse Attention: The models retain only a “select few” self-attention layers to handle high-fidelity, all-to-all information routing where absolutely necessary.
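The memory argument above can be made concrete with a toy sketch (not Nemotron 3's actual code; all dimensions are illustrative): a self-attention layer must cache one key/value pair per generated token, while a Mamba-style state-space layer updates a single fixed-size state no matter how long the sequence gets.

```python
# Sketch: per-token memory of attention (growing KV cache) vs. an
# SSM-style layer (constant state). Dimensions are illustrative.
import numpy as np

d_model, d_state = 64, 16
rng = np.random.default_rng(0)

# Attention: the KV cache grows by one (key, value) pair per token.
kv_cache = []
def attention_step(x, kv_cache):
    kv_cache.append((x.copy(), x.copy()))      # cache grows linearly
    keys = np.stack([k for k, _ in kv_cache])
    vals = np.stack([v for _, v in kv_cache])
    scores = keys @ x / np.sqrt(d_model)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ vals

# SSM: a single fixed-size state is updated in place.
A = 0.9 * np.eye(d_state)
B = rng.normal(size=(d_state, d_model)) * 0.01
C = rng.normal(size=(d_model, d_state)) * 0.01
state = np.zeros(d_state)
def ssm_step(x, state):
    state = A @ state + B @ x                  # constant memory footprint
    return C @ state, state

for t in range(100):
    x = rng.normal(size=d_model)
    attention_step(x, kv_cache)
    _, state = ssm_step(x, state)

print(len(kv_cache))   # 100 cached (k, v) pairs after 100 tokens
print(state.shape)     # (16,) -- unchanged, however long the sequence
```

After 100 decode steps the attention cache holds 100 entries while the SSM state is still a single 16-dimensional vector; this is the asymmetry that makes Mamba-heavy stacks attractive for long-context generation.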

Interestingly, the Jet-Nemotron paper reported that Gated DeltaNet outperformed Mamba in its comparisons, so the choice here may come down to Mamba simply pairing well with the MoE design.

The Result

This design delivers best-in-class throughput. For example, the Nemotron-3-Nano-30B-A3B (30B total parameters, ~3B active) achieves 3.3x higher throughput compared to the similarly sized Qwen3-30B-A3B model.

2. LatentMoE: Compressing the Router (Super & Ultra)

While Nano uses a standard hybrid MoE design, the larger Super and Ultra models introduce a novel architecture called LatentMoE.

The Problem with Standard MoE

In large-scale deployments, MoE layers face distinct bottlenecks depending on the workload:

  • Latency-focused: Even though an MoE model only uses a few “active” experts per token, the GPU still has to fetch those specific expert weights from its main memory (VRAM) into the compute cores. These weight matrices are massive. Their size is determined by the model’s hidden dimension d and the expert’s internal size m. Because the batch size is small, the GPU spends more time waiting for these massive weights to arrive from memory than it spends actually doing the math.
  • Throughput-focused: In a large-scale MoE, experts are often distributed across different GPUs or chips. For every layer, the model must check which expert is needed for which token and “dispatch” that token to the correct GPU. This creates a massive traffic jam of data moving between GPUs. If the “lanes” between GPUs (interconnect bandwidth) get clogged, the powerful compute cores sit idle waiting for data to arrive.
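The latency-focused bottleneck can be checked with back-of-envelope arithmetic. The sketch below is a rough roofline-style estimate, not a benchmark; the expert sizes and the memory-bandwidth and FLOP figures are illustrative assumptions, not measured numbers for any specific GPU.

```python
# Back-of-envelope sketch of why small-batch MoE decoding is memory-bound.
# Hardware numbers are illustrative assumptions (~3 TB/s HBM, ~1 PFLOP/s).

def moe_expert_cost(d, m, batch, bytes_per_param=2,
                    mem_bw=3.0e12, flops=1.0e15):
    """Estimate weight-load time vs compute time for one expert FFN
    (an up-projection d x m plus a down-projection m x d)."""
    params = 2 * d * m
    load_s = params * bytes_per_param / mem_bw   # time to fetch weights
    compute_s = 2 * params * batch / flops       # 2 FLOPs per param per token
    return load_s, compute_s

# Latency-focused serving: a batch of 1 token routed to this expert.
load, compute = moe_expert_cost(d=4096, m=14336, batch=1)
print(load > 100 * compute)   # True: loading weights dwarfs the math
```

With a batch of one token, fetching the expert's weights takes orders of magnitude longer than the matrix multiplies themselves, which is exactly the "GPU waits for memory" regime described above; only at large batch sizes does the compute term catch up.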

The LatentMoE Solution

LatentMoE addresses these issues by compressing the routing mechanism. Instead of performing routing and computation in the full model hidden dimension d, the model:

  • Projects the token embedding into a smaller latent dimension l.
  • Routes and Computes entirely within this compressed latent space.
  • Projects back to the original hidden dimension.

This compression reduces parameter loads and communication payloads by a factor of roughly 4x. NVIDIA reinvests these savings by scaling up the total number of experts N and the active experts per token K by that same factor. The result is improved accuracy per byte without sacrificing inference throughput or latency.
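The three steps above can be sketched in a few lines. This is a minimal illustration of the LatentMoE idea as described in the paper, not NVIDIA's implementation; all shapes, names, and the ReLU expert FFN are illustrative assumptions.

```python
# Minimal LatentMoE sketch: project d -> l, route and compute experts
# entirely in the latent space, then project back to d.
import numpy as np

rng = np.random.default_rng(0)
d, l, n_experts, top_k = 512, 128, 16, 4     # l = d / 4 compression

W_down = rng.normal(size=(d, l)) * 0.02      # hidden -> latent
W_up = rng.normal(size=(l, d)) * 0.02        # latent -> hidden
router = rng.normal(size=(l, n_experts)) * 0.02
experts = [(rng.normal(size=(l, 2 * l)) * 0.02,   # latent-sized expert FFNs
            rng.normal(size=(2 * l, l)) * 0.02) for _ in range(n_experts)]

def latent_moe(x):
    z = x @ W_down                            # (1) project to latent space
    logits = z @ router                       # (2) route in latent space
    top = np.argsort(logits)[-top_k:]
    gates = np.exp(logits[top])
    gates /= gates.sum()
    out = np.zeros(l)
    for g, e in zip(gates, top):              # (3) compute in latent space
        w1, w2 = experts[e]
        out += g * (np.maximum(z @ w1, 0) @ w2)
    return out @ W_up                         # (4) project back to hidden

y = latent_moe(rng.normal(size=d))
print(y.shape)  # (512,): same hidden size, but expert weights are latent-sized
```

Because every expert matrix is sized by `l` rather than `d`, both the weights fetched from memory and the token payloads dispatched between devices shrink by the compression factor, which is the budget NVIDIA reinvests into more (and more active) experts.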

MoE models also face the "expert imbalance" problem (router collapse) during training. The router (the part of the network that decides which expert gets which token) may discover early on that one or two experts are slightly better than the others and start sending all tokens to just those few favored experts. Consequently, only those experts get trained and improve, while the rest are starved of data and remain "dumb." To counteract this, researchers typically add load-balancing loss functions (auxiliary losses) that penalize the model for distributing tokens unevenly. Tuning these balancers is difficult: force the balance too hard and the model ignores the actual data; force it too little and the experts collapse.
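A standard formulation of such an auxiliary loss (the Switch Transformer version; Nemotron 3's exact recipe may differ) multiplies, per expert, the fraction of tokens routed to it by the mean router probability it received, so the loss is minimized when both are uniform:

```python
# Sketch of a Switch-style load-balancing auxiliary loss. A healthy router
# scores ~1.0; a collapsed router that favors one expert scores much higher.
import numpy as np

def load_balancing_loss(router_logits, expert_ids, n_experts):
    """Per-expert (fraction of tokens routed) x (mean router probability)."""
    probs = np.exp(router_logits)
    probs /= probs.sum(axis=-1, keepdims=True)        # softmax over experts
    frac_tokens = np.bincount(expert_ids, minlength=n_experts) / len(expert_ids)
    mean_prob = probs.mean(axis=0)
    return n_experts * float(frac_tokens @ mean_prob)  # ~1.0 when uniform

rng = np.random.default_rng(0)
n_tokens, n_experts = 1024, 8

# Healthy router: roughly uniform probabilities and assignments.
logits = rng.normal(size=(n_tokens, n_experts))
balanced = load_balancing_loss(
    logits, rng.integers(0, n_experts, n_tokens), n_experts)

# Collapsed router: strongly favors expert 0 and sends every token there.
biased = logits.copy()
biased[:, 0] += 3.0
collapsed = load_balancing_loss(biased, np.zeros(n_tokens, dtype=int), n_experts)

print(balanced < collapsed)   # True: collapse onto one expert raises the loss
```

Scaling this term's weight is exactly the difficult tuning knob mentioned above: too large and the balancing signal drowns out the task loss, too small and it fails to prevent collapse.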

3. Multi-Token Prediction (MTP)

To further accelerate generation, the Super and Ultra models incorporate Multi-Token Prediction (MTP) layers.

Rather than predicting only the single next token, the model is trained to predict multiple future tokens simultaneously. This serves two critical functions:

  • Richer Training Signal: It forces the model to plan several steps ahead, improving reasoning capabilities.
  • Native Speculative Decoding: The auxiliary predictions serve naturally as “draft tokens”. In ablation studies, the MTP module achieved a 97% acceptance rate on the first two predicted tokens, enabling substantial speedups without requiring a separate draft model.
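The accept/reject mechanics of using MTP drafts for speculative decoding can be shown with a toy loop. Both "models" below are deterministic stand-in functions, not real networks (a real system verifies all draft tokens in one batched forward pass); the point is only the logic: accept the longest draft prefix the target model agrees with, then let the target contribute one token itself.

```python
# Toy speculative decoding loop with an MTP-style draft head.

def target_next_token(prefix):
    """Stand-in for the full model's greedy next-token choice."""
    return (sum(prefix) * 31 + len(prefix)) % 100

def mtp_draft(prefix, k=2):
    """Stand-in MTP head: proposes k future tokens at once.
    Deliberately right most of the time, but not always."""
    out, p = [], list(prefix)
    for _ in range(k):
        guess = target_next_token(p)
        if len(p) % 10 == 0:              # occasionally drafts a wrong token
            guess = (guess + 1) % 100
        out.append(guess)
        p.append(guess)
    return out

def speculative_step(prefix, k=2):
    """Accept the longest draft prefix the target agrees with, then
    append one token from the target model itself."""
    p = list(prefix)
    for t in mtp_draft(prefix, k):
        if t == target_next_token(p):
            p.append(t)                   # draft token accepted for free
        else:
            break                         # first mismatch rejects the rest
    p.append(target_next_token(p))        # target always contributes one token
    return p

seq = [1, 2, 3]
for _ in range(5):
    seq = speculative_step(seq)
print(len(seq))   # 17: 5 steps produced 14 tokens vs 5 with plain decoding
```

Each step yields up to `k + 1` tokens per target-model invocation, which is where the speedup comes from; the 97% first-two-token acceptance rate reported for the MTP module means the happy path above is taken almost every step.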

Read more about Speculative Decoding here

4. NVFP4 Training: Pushing Hardware Limits

Nemotron 3 pushes training efficiency to the limit by utilizing NVFP4 (NVIDIA 4-bit Floating Point) for the Super and Ultra models.

Unlike previous works that simulated low-precision training, Nemotron 3 uses native NVFP4 GEMMs for forward propagation, gradient calculation, and weight updates. To maintain stability, the team developed a specific mixed-precision recipe:

  • Quantized: Weight, activation, and gradient tensors are quantized to NVFP4.
  • High Precision: Sensitive layers — specifically Mamba output projections (which are prone to flushing to zero), QKV projections, and Attention projections — are kept in higher precision (BF16 or MXFP8).

This approach resulted in a training loss difference of <1% compared to standard BF16 training, with comparable downstream task accuracy.
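To see why 4-bit floating point is workable at all, the sketch below simulates FP4 (E2M1) quantization with a per-block scale, in the spirit of NVFP4. It is only a numerical illustration: real NVFP4 runs in hardware GEMMs with FP8 block scales, and the block size and error figures here are illustrative assumptions.

```python
# Rough simulation of FP4 (E2M1) block quantization.
import numpy as np

# The magnitudes representable in E2M1: sign x {0, 0.5, 1, 1.5, 2, 3, 4, 6}.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_GRID = np.concatenate([-FP4_GRID[::-1], FP4_GRID])

def quantize_fp4_blocked(x, block=16):
    """Scale each block so its max maps to 6 (FP4's max magnitude),
    round every element to the nearest grid point, then rescale."""
    x = x.reshape(-1, block)
    scale = np.abs(x).max(axis=1, keepdims=True) / 6.0
    scale[scale == 0] = 1.0
    idx = np.abs(x / scale - FP4_GRID[:, None, None]).argmin(axis=0)
    return FP4_GRID[idx] * scale            # dequantized values

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 64))
wq = quantize_fp4_blocked(w.ravel()).reshape(w.shape)
rel_err = np.linalg.norm(w - wq) / np.linalg.norm(w)
print(f"relative error: {rel_err:.3f}")   # typically around 0.1
```

The per-block scale is what keeps the error tolerable: each small group of weights gets its own dynamic range, so outliers in one block don't crush the precision of another. Tensors where even this error is fatal (the Mamba output projections above) stay in BF16 or MXFP8.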

5. 1M Context & Agentic Capabilities

Extreme Context Length

Nemotron 3 supports a context length of up to 1 million tokens, enabling the processing of large codebases and extensive documents. Notably, because the Mamba layers provide implicit positional information, the attention layers do not require Rotary Position Embeddings (RoPE). This eliminates the out-of-distribution issues often seen when extending Transformer context windows.

Multi-Environment RL

Post-training involves Multi-environment Reinforcement Learning (RL). Instead of a staged approach (e.g., learning coding, then math), Nemotron 3 is trained on diverse environments simultaneously. This method was found to be more stable and less prone to “reward hacking” than staged training.

Reward hacking occurs when an AI model discovers a “loophole” or unintended strategy to maximize its reward score without actually achieving the desired goal or behaving correctly. This happens because the reward function is often an imperfect proxy for the true objective, leading the model to exploit flaws in how success is measured rather than learning the actual task.

Granular “Thinking” Control

Similar to other recent reasoning models, Nemotron 3 allows for inference-time reasoning budget control. Users can set a specific token budget for the model’s “thinking trace.” When the model reaches this limit, a </think> token is appended, forcing the model to conclude its reasoning and generate a response.
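The mechanism can be sketched as a generation loop that counts thinking tokens and force-closes the trace once the budget is spent. The model below is a trivial stand-in function, not a real decoder; only the budget bookkeeping is the point.

```python
# Toy sketch of inference-time reasoning budget control.

def fake_model_step(tokens):
    """Stand-in for one decode step: 'reasons' until the trace is closed."""
    return "reason" if "</think>" not in tokens else "answer"

def generate(budget, max_answer=4):
    tokens = ["<think>"]
    thinking_tokens = 0
    while True:
        tok = fake_model_step(tokens)
        if "</think>" not in tokens:          # still inside the thinking trace
            if thinking_tokens >= budget:
                tok = "</think>"              # budget spent: force the close
            else:
                thinking_tokens += 1
        tokens.append(tok)
        if tokens.count("answer") >= max_answer:
            break
    return tokens

out = generate(budget=8)
print(out.index("</think>"))   # 9: trace cut off after exactly 8 thinking tokens
```

In the real model the forced `</think>` token steers the network itself into answer mode; here the stand-in just switches on the token's presence, but the budget accounting is the same.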

Link to original paper: https://arxiv.org/abs/2512.20856

A massive congratulations as always to the research team at NVIDIA — specifically the leadership team including Andrew Tao, Bita Darvish Rouhani, Boris Ginsburg, Bryan Catanzaro, Carlo del Mundo, Eileen Long, Eric Chung, Jane Polak Scowcroft, Jan Kautz, Jian Zhang, Joey Conway, Jonathan Cohen, Kari Briski, Mohammad Shoeybi, Mostofa Patwary, Oleksii Kuchaiev, Oluwatobi Olabiyi, Pavlo Molchanov, Ran El-Yaniv, Ran Zilberstein, Yonatan Geifman, and Yejin Choi, alongside the extensive teams across Data, Architecture, Pretraining, and Infrastructure.

Until next time folks…
El Psy Congroo
