Why Google Thinks Our Entire Approach to Training LLMs Needs to Change
Last Updated on December 4, 2025 by Editorial Team
Author(s): Harsh Chandekar
Originally published on Towards AI.
Google AI Says Deep Learning is an ‘Illusion’. Here Are Ideas That Could Change Everything.
Hey, so I came across this new paper from Google researchers, and it’s actually pretty interesting.
They’re basically asking: can we train LLMs in a different way so they become smarter, more adaptable, better with context, and actually remember things long-term?
They propose this idea called Nested Learning, which is kind of like giving the model different layers of memory that learn at different speeds, instead of everything being trained the same way. It’s a different take on how we build and train these models, and it might actually change a lot if it works well.
If you want to check out the paper yourself, here’s the link:
https://openreview.net/pdf?id=nbMeRvNb7A
Why LLMs Have a Memory Problem
Large Language Models (LLMs) exhibit a memory-processing pattern strikingly similar to anterograde amnesia, a neurological disorder in which a person cannot form new long-term memories. They retain the vast knowledge acquired before the "onset" at the end of their training, but they cannot permanently store new experiences. It is as if you could only remember things learned before age 20, even once you are 40.
This is the core reason why models like ChatGPT are largely static after their pre-training phase. The information you provide in a conversation, which exists within a temporary “context window,” does not permanently update the model’s core knowledge base. This limitation is a major roadblock to creating AI that can continually learn from experience without suffering from “catastrophic forgetting” — the tendency to lose old information when learning new things.
Current LLMs are flat: they have a single main learning process, one "gradient flow."
How are current LLMs trained?

The approach, called "Nested Learning," takes inspiration from the human brain's multi-layered memory systems. This post breaks down the concepts behind the new approach and how it could shape the future of AI.
1. “Deeper” AI Isn’t About More Layers, It’s About More Levels
The traditional view in AI holds that making models more powerful means stacking more and more computational layers. The paper argues this is a "flattened image" of learning, an illusion of progress. Simply stacking layers is not a universal solution because it fails to address fundamental challenges: it often does not change a model's "ability to fast adapt…or continually learn" and may not even increase its "computational depth…leaving their ability to implement complex algorithms untouched."
The architectural shift proposed is from "stacked layers" to "Nested Learning" (NL). Instead of a flat architecture, NL reframes a model as an interconnected system of components organized in hierarchical "levels." Different modules (A, B, and C) are nested and connected, but each learns at a different speed. An update in the fastest module, C, not only changes C itself; it also actively changes what has been learned in the slower, more stable modules (B and A) above it.

This is a significant shift because it’s not about making models bigger, but architecting them to be smarter. The goal is to build models from “stacked learning process components,” where each component has its own learning rate and context. This structure is designed to mimic the brain’s ability to process information on different timescales.
In the Nested Learning framework, Module C is the fast-learning system, similar to our working memory. It updates instantly and handles short-term information, like when you remember a phone number just long enough to type it.
Module B learns at a medium speed and acts more like the hippocampus, consolidating information that repeats — similar to how you gradually start remembering a new colleague’s name after hearing it a few times.
Module A is the slowest system, responsible for long-term, stable knowledge, much like the neocortex in humans; this is the part that stores skills or facts you retain for years, like knowing how to swim or ride a bicycle.
Together, these three modules mirror the human brain’s multi-timescale learning process, but inside an AI model.
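As a rough illustration (my own sketch, not the paper's actual algorithm), the three modules can be pictured as state vectors that consolidate at different periods: C tracks every input, while B and A only absorb a summary of the faster level below them at longer intervals. All names, rates, and update rules here are invented for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy state for each module: a small parameter vector.
A = np.zeros(4)   # slow, long-term knowledge (neocortex-like)
B = np.zeros(4)   # medium-speed consolidation (hippocampus-like)
C = np.zeros(4)   # fast, working memory

FREQ_B, FREQ_A = 10, 100   # update periods (in steps) for the slower levels

for step in range(1, 301):
    x = rng.normal(size=4)          # incoming "experience"
    C = 0.5 * C + 0.5 * x           # fast level tracks the immediate input
    if step % FREQ_B == 0:
        B = 0.9 * B + 0.1 * C       # B consolidates what C has accumulated
    if step % FREQ_A == 0:
        A = 0.99 * A + 0.01 * B     # A slowly absorbs B's stable patterns
```

The key structural point is the one-way cascade of timescales: information reaches A only after surviving consolidation in B, just as the post describes long-term knowledge forming from repeated shorter-term traces.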
The paper makes a critical distinction between memorization and true learning:
Memory is a neural update caused by an input, while learning is the process of acquiring effective and useful memory. Current LLMs often exhibit memorization instead of true learning, much like a student who crams the night before an exam, rote-learning everything rather than studying throughout the semester.

2. Secret Learners: Even Your AI’s Optimizer is a Memory System
One of the most surprising insights from the Nested Learning paper is that components we don't typically think of as "learners," such as the optimizers that train the model, are actually simple learning systems themselves. This validates the NL framing: optimizers that keep their own memory and update it already work better in practice.
Consider the evolution of optimizers as a multi-level learning process:
- Level 1 (Simple Gradient Descent): This is a naive, one-level update. The model's parameters are adjusted based only on the error from the most recent piece of data; it has no memory of past updates. Its only hyperparameter is the learning rate α.
- Level 2 (Momentum): Adding momentum is like adding a second level of learning. The optimizer now has a simple "associative memory" that considers the direction of past gradients, not just the present one, which prevents sharp, erratic changes based on a single input. It has two hyperparameters: the learning rate α and the momentum weight β.
- Level 3 (Adam): The Adam optimizer is an even more complex nested system. It builds on momentum by also tracking the variance of past gradients. By considering both momentum and variance, it acts as a more intelligent "learning component" that adapts its updates more effectively. In other words, giving the optimizer a history of its previous updates almost always makes it perform better than running it without that history.
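The three levels above can be written out as plain update rules on a one-dimensional parameter. This is a minimal sketch with illustrative, untuned hyperparameter values; each function adds one more piece of "memory" to the previous one.

```python
import math

def sgd_step(w, grad, alpha=0.1):
    # Level 1: no memory, only the current gradient.
    return w - alpha * grad

def momentum_step(w, grad, m, alpha=0.1, beta=0.9):
    # Level 2: m is a running memory of past gradients.
    m = beta * m + grad
    return w - alpha * m, m

def adam_step(w, grad, m, v, t, alpha=0.1, b1=0.9, b2=0.999, eps=1e-8):
    # Level 3: track both momentum (m) and gradient variance (v).
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)          # bias correction
    v_hat = v / (1 - b2 ** t)
    return w - alpha * m_hat / (math.sqrt(v_hat) + eps), m, v

# Minimize f(w) = w^2 (gradient 2w) with each optimizer.
w_sgd = w_mom = w_adam = 5.0
m = 0.0
am, av = 0.0, 0.0
for t in range(1, 101):
    w_sgd = sgd_step(w_sgd, 2 * w_sgd)
    w_mom, m = momentum_step(w_mom, 2 * w_mom, m)
    w_adam, am, av = adam_step(w_adam, 2 * w_adam, am, av, t)
```

Note how each level carries strictly more state than the one below it: SGD carries none, momentum carries `m`, and Adam carries `m`, `v`, and the step counter `t`. That growing internal memory is exactly the nesting the paper points at.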

This progression from simple gradient descent to Adam isn’t just an anecdotal example; it is concrete, existing evidence that nesting learning processes — giving a system memory of its own learning — yields more powerful and intelligent behavior. It’s the core principle of Nested Learning, already hiding in plain sight.
3. Models Need Different “Brain Waves” to Truly Learn
The core inspiration for Nested Learning comes from neuroscience. The human brain operates on different frequencies, often visualized as brain waves. Fast “Gamma” waves are associated with processing immediate sensory input, requiring rapid reaction. Slower “Delta” waves are linked to deep sleep and memory consolidation, where the brain reorganizes knowledge for long-term storage.

Nested Learning translates this brain wave analogy into a core AI concept: Update Frequency. The idea is that different parts of an AI model should learn and update at different rates, just as the brain processes information on different timescales. This led the researchers to propose a new formulation for memory called a Continuum Memory System (CMS), which generalizes beyond the traditional “long-term/short-term” binary to include a whole spectrum of update speeds.
They put this idea into practice with a self-referential learning module called HOPE. HOPE stands for Hierarchical Optimizing Processing Ensemble; it builds on the earlier Titans architecture for smart memory management and is tightly coupled with, and best suited to, the nested architecture.
Current LLMs also compute three internal quantities when generating the next token: Query, Key, and Value (Q, K, V), which together drive next-word prediction.
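For readers who haven't seen it spelled out, here is a toy version of that (Q, K, V) step: each token's query is scored against all keys, and the resulting softmax weights mix the values. Shapes and numbers are made up for clarity; real models do this per head, at scale, with learned projections.

```python
import numpy as np

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # similarity of each query to each key
    # Numerically stable softmax over the key axis.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                # weighted mix of value vectors

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))   # 3 tokens, head dimension 4
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out = attention(Q, K, V)      # shape (3, 4): one context vector per token
```

The point of showing this is the contrast drawn below: in a Transformer, this computation happens in activations and leaves the weights untouched, whereas HOPE is claimed to also update internal memory as it reads.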
Structured as a CMS, HOPE has multiple learnable levels, each updating at a different frequency:
- Level 1 (Low Frequency): This level acts like long-term memory, updating its knowledge very slowly (e.g., once every 60 million tokens).
- Level 2 & 3 (Mid Frequency): These levels update more regularly, acting as an intermediate memory system (e.g., every 1 million tokens).
- Level 4 (High Frequency): This level updates very quickly, processing information from the immediate context, much like short-term working memory.
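The frequency schedule above can be sketched as a simple lookup: each level has an update period in tokens, and a level consolidates only when that many tokens have streamed past. The periods echo the examples in the list, but the mechanics here (names, the modulo rule) are my own invention for illustration, not the paper's implementation.

```python
LEVEL_PERIODS = {
    "level_4_working": 1,             # every token (high frequency)
    "level_3_mid": 1_000_000,         # roughly every 1M tokens
    "level_2_mid": 1_000_000,
    "level_1_long_term": 60_000_000,  # roughly every 60M tokens (low frequency)
}

def levels_to_update(tokens_seen):
    """Return which memory levels consolidate at this token count."""
    return [name for name, period in LEVEL_PERIODS.items()
            if tokens_seen % period == 0]

# The working-memory level fires on every token; slower levels fire rarely,
# and all levels align only at the slowest period.
```

This is the "continuum" in Continuum Memory System: instead of two buckets (short-term vs. long-term), you get an arbitrary spectrum of periods between 1 and 60M.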
Here are the key differences between using a Transformer backbone and using HOPE:
- Transformers don’t learn during a forward pass — HOPE does.
- Transformers train on one timescale — HOPE trains on multiple timescales.
- Transformers store context in activations only — HOPE stores it in changeable internal memory (its parameters).

When the researchers tested the HOPE architecture using the Nested Learning methodology, the results were pretty convincing.
Here is Google's report on testing HOPE: https://research.google/blog/introducing-nested-learning-a-new-ml-paradigm-for-continual-learning/

From Static Tools to Self-Improving Systems
The central idea of Nested Learning is that the future of AI may not be about simply making models deeper by adding more layers, but about adding a new dimension to deep learning: nested levels of learning that operate at different speeds.
This shift in perspective could be the key to moving beyond the static AI models of today and toward the “self-modifying titans” envisioned by the researchers — truly self-improving systems that can continually learn from new experiences without forgetting the past.
If deep learning as we know it is just one flat level of learning, what new capabilities might be unlocked when our models can learn how to learn, all the way down?
Will this be a shift in how LLMs are trained, or will it remain just another trial by the giants to assert their dominance? Let me know your thoughts!
Thank you for reading!