Why Google Thinks Our Entire Approach to Training LLMs Needs to Change
Last Updated on December 4, 2025 by Editorial Team
Author(s): Harsh Chandekar
Originally published on Towards AI.
Google AI Says Deep Learning is an ‘Illusion’. Here Are Ideas That Could Change Everything.
Hey, so I came across this new paper from Google researchers, and it’s actually pretty interesting.
They’re basically asking: can we train LLMs in a different way so they become smarter, more adaptable, better with context, and actually remember things long-term?
They propose this idea called Nested Learning, which is kind of like giving the model different layers of memory that learn at different speeds, instead of everything being trained the same way. It’s a different take on how we build and train these models, and it might actually change a lot if it works well.
If you want to check out the paper yourself, here’s the link:
https://openreview.net/pdf?id=nbMeRvNb7A
Why LLMs Have a Memory Problem
Large Language Models (LLMs) exhibit a memory-processing pattern strikingly similar to anterograde amnesia, a neurological disorder in which a person cannot form new long-term memories. They retain the vast knowledge acquired before the "onset" at the end of their training, but they cannot permanently store new experiences. It is as if you could only remember things learned before age 20, even once you are 40.
This is the core reason why models like ChatGPT are largely static after their pre-training phase. The information you provide in a conversation, which exists within a temporary “context window,” does not permanently update the model’s core knowledge base. This limitation is a major roadblock to creating AI that can continually learn from experience without suffering from “catastrophic forgetting” — the tendency to lose old information when learning new things.
Current LLMs are flat: they have a single main learning process, one "gradient flow."
How are current LLMs trained?

The approach, called "Nested Learning," takes inspiration from the human brain's multi-layered memory systems. This post breaks down the concepts behind the new approach and how it could shape the future of AI.
1. “Deeper” AI Isn’t About More Layers, It’s About More Levels
The traditional view in AI holds that making models more powerful means stacking more and more computational layers. The paper argues this is a "flattened image" of learning, an illusion of progress. Simply stacking layers is not a universal solution because it fails to address fundamental challenges: it often does not change a model's "ability to fast adapt…or continually learn" and may not even increase its "computational depth…leaving their ability to implement complex algorithms untouched."
The architectural shift proposed is from "stacked layers" to "Nested Learning" (NL). Instead of a flat architecture, NL reframes a model as an interconnected system of components organized in hierarchical "levels." Different modules (A, B, and C) are nested and connected, but each learns at a different speed. An update in the fastest module, C, not only changes C itself; it also actively changes what has been learned in the slower, more stable modules (B and A) above it.

This is a significant shift because it’s not about making models bigger, but architecting them to be smarter. The goal is to build models from “stacked learning process components,” where each component has its own learning rate and context. This structure is designed to mimic the brain’s ability to process information on different timescales.
In the Nested Learning framework, Module C is the fast-learning system, similar to our working memory. It updates instantly and handles short-term information, like when you remember a phone number just long enough to type it.
Module B learns at a medium speed and acts more like the hippocampus, consolidating information that repeats — similar to how you gradually start remembering a new colleague’s name after hearing it a few times.
Module A is the slowest system, responsible for long-term, stable knowledge, much like the neocortex in humans; this is the part that stores skills or facts you retain for years, like knowing how to swim or ride a bicycle.
Together, these three modules mirror the human brain’s multi-timescale learning process, but inside an AI model.
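As a rough illustration (my own sketch, not the paper's actual algorithm), the three modules can be pictured as state vectors that consolidate at different periods: C tracks every input, while B and A only absorb a summary of the faster level below them at longer intervals. All names, rates, and update rules here are invented for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy state for each module: a small parameter vector.
A = np.zeros(4)   # slow, long-term knowledge (neocortex-like)
B = np.zeros(4)   # medium-speed consolidation (hippocampus-like)
C = np.zeros(4)   # fast, working memory

FREQ_B, FREQ_A = 10, 100   # update periods (in steps) for the slower levels

for step in range(1, 301):
    x = rng.normal(size=4)          # incoming "experience"
    C = 0.5 * C + 0.5 * x           # fast level tracks the immediate input
    if step % FREQ_B == 0:
        B = 0.9 * B + 0.1 * C       # B consolidates what C has accumulated
    if step % FREQ_A == 0:
        A = 0.99 * A + 0.01 * B     # A slowly absorbs B's stable patterns
```

The key structural point is the one-way cascade of timescales: information reaches A only after surviving consolidation in B, just as the post describes long-term knowledge forming from repeated shorter-term traces.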
The paper makes a critical distinction between memorization and true learning:
Memory is a neural update caused by an input, while learning is the process of acquiring effective and useful memory. Current LLMs often exhibit memorization instead of true learning, much like a student who crams the night before an exam, rote-learning everything rather than studying throughout the semester.

2. Secret Learners: Even Your AI’s Optimizer is a Memory System
One of the most surprising insights from the Nested Learning paper is that components we don't typically think of as "learners," such as the optimizers that train the model, are actually simple learning systems themselves. This validates the NL framing: optimizers that keep their own memory and update it already work better in practice.
Consider the evolution of optimizers as a multi-level learning process:
- Level 1 (Simple Gradient Descent): This is a naive, one-level update. The model's parameters are adjusted based only on the error from the most recent piece of data; it has no memory of past updates. Its only hyperparameter is the learning rate α.
- Level 2 (Momentum): Adding momentum is like adding a second level of learning. The optimizer now has a simple "associative memory" that considers the direction of past gradients, not just the present one, which prevents sharp, erratic changes based on a single input. It has two hyperparameters: the learning rate α and the momentum weight β.
- Level 3 (Adam): The Adam optimizer is an even more complex nested system. It builds on momentum by also tracking the variance of past gradients. By considering both momentum and variance, it acts as a more intelligent "learning component" that adapts its updates more effectively. In other words, giving the optimizer a history of its previous updates almost always makes it perform better than running it without that history.
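The three levels above can be written out as plain update rules on a one-dimensional parameter. This is a minimal sketch with illustrative, untuned hyperparameter values; each function adds one more piece of "memory" to the previous one.

```python
import math

def sgd_step(w, grad, alpha=0.1):
    # Level 1: no memory, only the current gradient.
    return w - alpha * grad

def momentum_step(w, grad, m, alpha=0.1, beta=0.9):
    # Level 2: m is a running memory of past gradients.
    m = beta * m + grad
    return w - alpha * m, m

def adam_step(w, grad, m, v, t, alpha=0.1, b1=0.9, b2=0.999, eps=1e-8):
    # Level 3: track both momentum (m) and gradient variance (v).
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)          # bias correction
    v_hat = v / (1 - b2 ** t)
    return w - alpha * m_hat / (math.sqrt(v_hat) + eps), m, v

# Minimize f(w) = w^2 (gradient 2w) with each optimizer.
w_sgd = w_mom = w_adam = 5.0
m = 0.0
am, av = 0.0, 0.0
for t in range(1, 101):
    w_sgd = sgd_step(w_sgd, 2 * w_sgd)
    w_mom, m = momentum_step(w_mom, 2 * w_mom, m)
    w_adam, am, av = adam_step(w_adam, 2 * w_adam, am, av, t)
```

Note how each level carries strictly more state than the one below it: SGD carries none, momentum carries `m`, and Adam carries `m`, `v`, and the step counter `t`. That growing internal memory is exactly the nesting the paper points at.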

This progression from simple gradient descent to Adam isn’t just an anecdotal example; it is concrete, existing evidence that nesting learning processes — giving a system memory of its own learning — yields more powerful and intelligent behavior. It’s the core principle of Nested Learning, already hiding in plain sight.
3. Models Need Different “Brain Waves” to Truly Learn
The core inspiration for Nested Learning comes from neuroscience. The human brain operates on different frequencies, often visualized as brain waves. Fast “Gamma” waves are associated with processing immediate sensory input, requiring rapid reaction. Slower “Delta” waves are linked to deep sleep and memory consolidation, where the brain reorganizes knowledge for long-term storage.

Nested Learning translates this brain wave analogy into a core AI concept: Update Frequency. The idea is that different parts of an AI model should learn and update at different rates, just as the brain processes information on different timescales. This led the researchers to propose a new formulation for memory called a Continuum Memory System (CMS), which generalizes beyond the traditional “long-term/short-term” binary to include a whole spectrum of update speeds.
They put this idea into practice with a self-referential learning module called HOPE. HOPE stands for Hierarchical Optimizing Processing Ensemble; it builds on the earlier Titans architecture for smart memory management and is tightly coupled with, and best suited to, the nested architecture.
Current LLMs also compute three internal quantities when generating the next token: Query, Key, and Value (Q, K, V), which together drive next-word prediction.
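For readers who haven't seen it spelled out, here is a toy version of that (Q, K, V) step: each token's query is scored against all keys, and the resulting softmax weights mix the values. Shapes and numbers are made up for clarity; real models do this per head, at scale, with learned projections.

```python
import numpy as np

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # similarity of each query to each key
    # Numerically stable softmax over the key axis.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                # weighted mix of value vectors

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))   # 3 tokens, head dimension 4
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out = attention(Q, K, V)      # shape (3, 4): one context vector per token
```

The point of showing this is the contrast drawn below: in a Transformer, this computation happens in activations and leaves the weights untouched, whereas HOPE is claimed to also update internal memory as it reads.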
Structured as a CMS, HOPE has multiple learnable levels, each updating at a different frequency:
- Level 1 (Low Frequency): This level acts like long-term memory, updating its knowledge very slowly (e.g., once every 60 million tokens).
- Level 2 & 3 (Mid Frequency): These levels update more regularly, acting as an intermediate memory system (e.g., every 1 million tokens).
- Level 4 (High Frequency): This level updates very quickly, processing information from the immediate context, much like short-term working memory.
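The frequency schedule above can be sketched as a simple lookup: each level has an update period in tokens, and a level consolidates only when that many tokens have streamed past. The periods echo the examples in the list, but the mechanics here (names, the modulo rule) are my own invention for illustration, not the paper's implementation.

```python
LEVEL_PERIODS = {
    "level_4_working": 1,             # every token (high frequency)
    "level_3_mid": 1_000_000,         # roughly every 1M tokens
    "level_2_mid": 1_000_000,
    "level_1_long_term": 60_000_000,  # roughly every 60M tokens (low frequency)
}

def levels_to_update(tokens_seen):
    """Return which memory levels consolidate at this token count."""
    return [name for name, period in LEVEL_PERIODS.items()
            if tokens_seen % period == 0]

# The working-memory level fires on every token; slower levels fire rarely,
# and all levels align only at the slowest period.
```

This is the "continuum" in Continuum Memory System: instead of two buckets (short-term vs. long-term), you get an arbitrary spectrum of periods between 1 and 60M.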
Here are the key differences between using a Transformer backbone and using HOPE:
- Transformers don’t learn during a forward pass — HOPE does.
- Transformers train on one timescale — HOPE trains on multiple timescales.
- Transformers store context in activations only — HOPE stores it in changeable internal memory (its parameters).

When the researchers tested the HOPE architecture using the Nested Learning methodology, the results were pretty convincing.
Here is Google's report on testing HOPE: https://research.google/blog/introducing-nested-learning-a-new-ml-paradigm-for-continual-learning/

From Static Tools to Self-Improving Systems
The central idea of Nested Learning is that the future of AI may not be about simply making models deeper by adding more layers, but about adding a new dimension to deep learning: nested levels of learning that operate at different speeds.
This shift in perspective could be the key to moving beyond the static AI models of today and toward the “self-modifying titans” envisioned by the researchers — truly self-improving systems that can continually learn from new experiences without forgetting the past.
If deep learning as we know it is just one flat level of learning, what new capabilities might be unlocked when our models can learn how to learn, all the way down?
Will this be a shift in how LLMs are trained, or will it remain just another trial by the giants to assert their dominance? Let me know your thoughts!
Thank you for reading!