
Retaining Knowledge in AI: Solving Catastrophic Forgetting in LLMs
Last Updated on May 12, 2025 by Editorial Team
Author(s): Sanket Rajaram
Originally published on Towards AI.
Part 1: The Learning Journey of a Kid in School
Imagine a kid in school learning about basic arithmetic in one semester. By the next year, they move on to geometry and algebra, but in the process, they seem to forget how to add and subtract. Teachers must frequently re-teach old concepts because the kid struggles to retain prior knowledge while learning new skills.
Now, consider the same kid given a strategy to retain their foundational knowledge, perhaps through periodic revision, linking new knowledge with old, or using visual aids to cement learning. The kid becomes better at building on previous lessons without losing what they've learned.
This simple analogy mirrors a fundamental problem in the field of machine learning: catastrophic forgetting.

Connecting the Analogy to Machine Learning
Just like a kid, AI models, especially large language models (LLMs), "learn" by adjusting their parameters while training on data. When fine-tuned on new tasks or specialized domains, these models often overwrite existing knowledge, much like the kid forgetting arithmetic when learning geometry. This phenomenon is called catastrophic forgetting.
But what if these models, like the kid, could retain their foundational knowledge while acquiring new skills? This is where techniques like Elastic Weight Consolidation (EWC), Replay Methods, and Parameter-Efficient Fine-Tuning (PEFT) come into play. These methods act as the teacher's revision strategies, helping the model retain and build upon its knowledge.
Why Does Catastrophic Forgetting Matter?
- In Education: A kid who forgets foundational knowledge struggles in advanced subjects.
- In AI Systems: Consider a virtual assistant fine-tuned to answer legal questions but losing its general conversational ability; this impacts its usability and reliability.
The rest of this article explores how we can teach AI to retain knowledge like a disciplined student, enabling it to learn new tasks without forgetting what it already knows.
Part 2: Why AI Models Forget Like Students
As we dive deeper into the issue of catastrophic forgetting, let's understand why AI models face this problem.
Think of an AI model like the kid's notebook. Every time the kid learns something new, they erase parts of the notebook to make space for fresh lessons.
Unlike humans, who can store information in separate parts of their brain and recall older concepts when needed, most AI models overwrite existing knowledge when fine-tuned for new tasks.

The Problem of Overwriting
In machine learning, this "overwriting" happens because:
- Shared Parameters: The same model parameters are used for both old and new tasks. When new data updates these parameters, it inadvertently destroys patterns learned for older tasks.
- Sequential Learning: AI models, like students, often learn tasks in sequence. Without mechanisms to recall previous tasks, the model treats new tasks as if they exist in isolation.
- Lack of Structured Retention: Unlike humans, AI models don't naturally "distill" past knowledge into foundational memories, making them vulnerable to forgetting.
A Quick Dive Into the Science of Forgetting
Catastrophic forgetting is closely tied to how neural networks learn:
- When fine-tuning, the model adjusts weights (parameters) based on the new task's data.
- This process doesn't distinguish between weights important for old tasks and those less critical, leading to unintended loss of previous knowledge.
For example:
- A large language model trained to understand general-purpose text might be fine-tuned on medical reports. After fine-tuning, it becomes great at analyzing medical language but might forget how to handle general queries.
This behavior is similar to a kid who, after months of focusing on one subject, struggles to recall what they learned last year.
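To see this concretely, here is a minimal, self-contained toy experiment in PyTorch (the two synthetic tasks and all hyperparameters are invented purely for illustration). A small network is trained on task A, then fine-tuned on task B with no retention strategy, and its task-A accuracy typically collapses to chance:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two conflicting synthetic tasks sharing one output head:
# task A labels points by the sign of x[0], task B by the sign of x[1].
def make_task(dim, n=512):
    X = torch.randn(n, 2)
    y = (X[:, dim] > 0).long()
    return X, y

model = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 2))
loss_fn = nn.CrossEntropyLoss()

def train(X, y, epochs=200):
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(model(X), y).backward()
        opt.step()

def accuracy(X, y):
    with torch.no_grad():
        return (model(X).argmax(dim=1) == y).float().mean().item()

Xa, ya = make_task(0)  # "arithmetic"
Xb, yb = make_task(1)  # "geometry"

train(Xa, ya)
print("Task A accuracy after learning A:", accuracy(Xa, ya))  # near 1.0

train(Xb, yb)  # naive sequential fine-tuning: no replay, no regularization
print("Task A accuracy after learning B:", accuracy(Xa, ya))  # ~0.5 (chance)
print("Task B accuracy after learning B:", accuracy(Xb, yb))  # near 1.0
```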
Part 3: How Can We Help AI Retain Knowledge?
Just as teachers use revision techniques or structured curricula to help kids retain foundational knowledge, researchers have developed several approaches to tackle catastrophic forgetting. These approaches ensure that while the model learns new skills, it preserves its ability to perform previously mastered tasks.

Techniques to Retain AI Knowledge
Let's now transition into how we can "teach" AI to retain and distill knowledge effectively.
Below are three practical techniques, explained through our analogy:
1. Replay Method: Revisiting Old and New Knowledge
Imagine the teacher periodically revisiting old concepts during lessons about new topics. Similarly, replay methods allow AI models to "revisit" old knowledge while learning new tasks.
This is done by including a mix of old and new data during fine-tuning.
- Example in AI: Fine-tuning a language model for legal queries while also including a small portion of general text data in the training process.
- Benefit: The model learns new skills without overwriting prior knowledge.
- Drawback: This approach requires storing or accessing old data, which can be expensive or impractical.
How It Works
The replay method combines data from the original (general-purpose) dataset with the new (domain-specific) dataset during fine-tuning.
By interleaving examples from both datasets, the model simultaneously learns the new task while retaining the knowledge it acquired from the original data.
This approach mimics human learning, where revisiting older concepts reinforces memory while acquiring new information.
Why It Helps
Catastrophic forgetting happens when the modelβs parameters are overwritten during fine-tuning on the new data.
By exposing the model to examples from the original task, it's forced to maintain a balance, ensuring older knowledge isn't entirely discarded.
Implementation in Practice
- Step 1: Load the original dataset (e.g., a general-purpose corpus like Wikipedia).
- Step 2: Load the domain-specific dataset (e.g., a specialized dataset like legal or medical text).
- Step 3: Merge the datasets to create a combined training set.
- Step 4: Fine-tune the model on this combined dataset, ensuring both types of knowledge are retained.
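Here is how these four steps might look in code, sketched with the Hugging Face datasets library. The legal dataset name is a hypothetical placeholder, the 10% replay ratio is just an illustrative starting point, and both datasets are assumed to share a single text column:

```python
from datasets import load_dataset, concatenate_datasets

# Step 1: the original, general-purpose corpus.
general = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

# Step 2: the new, domain-specific corpus (hypothetical dataset name).
legal = load_dataset("my-org/legal-corpus", split="train")  # placeholder

# Step 3: sample a small "replay" slice of the old data, then merge and
# shuffle so old and new examples are interleaved within each batch.
replay_size = int(0.10 * len(legal))  # ~10% replay; tune for your use case
replay = general.shuffle(seed=42).select(range(replay_size))
combined = concatenate_datasets([legal, replay]).shuffle(seed=42)

# Step 4: fine-tune on `combined` with your usual training loop or Trainer;
# gradients from replayed examples keep general knowledge from being erased.
```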
Pros
- Easy to implement using existing datasets.
- Balances general-purpose and domain-specific performance.
- Helps the model generalize better, especially in multi-task environments.
Cons
- Requires access to the original dataset, which might be too large or unavailable.
- Increases computational cost due to the larger training data.
- The model might underperform in highly specialized tasks since it splits focus between old and new data.
2. Elastic Weight Consolidation (EWC): Protecting Key Lessons
In our analogy, this is like the teacher emphasizing key lessons from last year, marking them as "important" so the student doesn't forget them.
EWC works by identifying parameters (weights) that are critical for older tasks and penalizing changes to these weights during fine-tuning.
- Example in AI: Protecting weights that are essential for general language understanding while fine-tuning for domain-specific tasks.
- Benefit: No need to store old data; effectively retains critical knowledge.
- Drawback: Computational overhead, especially when managing many tasks.
How It Works
EWC is inspired by the way the human brain prioritizes retaining essential knowledge while learning new skills.
In this method, the model identifies which parameters are crucial for its original task and penalizes changes to these parameters during fine-tuning.
This is done using the Fisher Information Matrix, which measures the importance of each parameter.
Imagine your model is a painter, and its parameters are paintbrushes.
EWC ensures the painter doesn't drastically alter their most effective brushes while trying to paint a new masterpiece.
Why It Helps
Catastrophic forgetting occurs because the model doesn't distinguish between critical and non-critical parameters, treating all of them equally during fine-tuning.
By applying a penalty (regularization) to the critical parameters, EWC ensures these aren't overwritten.
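In symbols (this is the standard formulation from the original EWC paper), fine-tuning minimizes L(θ) = L_new(θ) + (λ/2) · Σᵢ Fᵢ · (θᵢ − θᵢ*)², where θᵢ* is parameter i's value after the original task, Fᵢ is its Fisher importance, and λ sets how strongly old knowledge is protected.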
Implementation in Practice
- Compute Importance Weights: Before fine-tuning, calculate the Fisher Information Matrix to determine the importance of each parameter for the original task.
- Fine-Tuning with Regularization: During fine-tuning, add a regularization term to the loss function. This term penalizes changes to critical parameters based on their importance.
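Continuing the toy setup from the earlier sketch (so model and loss_fn are the same objects, and old_loader can be as simple as [(Xa, ya)]), here is a condensed PyTorch sketch of both steps using a diagonal Fisher approximation; the λ value is an illustrative assumption:

```python
import torch

def compute_fisher(model, old_loader, loss_fn):
    """Estimate each parameter's importance as the average squared gradient
    of the original task's loss (the diagonal of the Fisher matrix)."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for X, y in old_loader:
        model.zero_grad()
        loss_fn(model(X), y).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    return {n: f / len(old_loader) for n, f in fisher.items()}

# Before fine-tuning: importance weights plus a snapshot of the old parameters.
fisher = compute_fisher(model, old_loader, loss_fn)
old_params = {n: p.detach().clone() for n, p in model.named_parameters()}

def ewc_penalty(model, lam=1000.0):
    """Regularization term: penalize moving important parameters away
    from the values that served the original task."""
    penalty = sum((fisher[n] * (p - old_params[n]) ** 2).sum()
                  for n, p in model.named_parameters())
    return (lam / 2) * penalty

# During fine-tuning on the new task, the total loss becomes:
#   loss = loss_fn(model(X_new), y_new) + ewc_penalty(model)
```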
Pros
- Doesnβt require access to the original data, which is beneficial for privacy-sensitive tasks.
- Effectively retains critical knowledge from the original task.
- Allows specialized fine-tuning without losing general-purpose capabilities.
Cons
- Computationally expensive to compute the Fisher Information Matrix.
- Requires maintaining a copy of the original model for comparison.
- Might not scale well for very large models or complex tasks.
3. Parameter-Efficient Fine-Tuning (PEFT): Specialized Notebooks
Think of PEFT as giving the kid a separate notebook for each subject. The student doesn't overwrite old notes but adds new ones in a dedicated space.
In AI, PEFT methods like Adapters, LoRA, or Prompt Tuning update only a small subset of task-specific parameters while keeping the base model frozen.
- Example in AI: Adding lightweight adapters for a legal domain while keeping the main language model intact.
- Benefit: Highly memory-efficient and avoids catastrophic forgetting entirely.
- Drawback: Requires specialized frameworks (e.g., adapter-transformers) and adds slight inference latency.
PEFT represents a family of methods designed to fine-tune specific parts of an LLM while leaving the rest of the model untouched.
Instead of retraining the entire model, PEFT introduces additional, lightweight parameters that adapt the model to new tasks.
This drastically reduces computational costs and prevents overwriting the original weights, thereby avoiding catastrophic forgetting.
How PEFT Works
PEFT methods add a small number of trainable parameters to the model, which are fine-tuned while keeping the base model frozen. These parameters are task-specific and do not interfere with the general-purpose knowledge stored in the pretrained layers.
Subtypes of PEFT
Adapters
- Description: Adapters are additional layers inserted into the transformer architecture. These layers capture task-specific knowledge while leaving the base model intact.
- Advantages: Efficient in terms of parameter usage; adapters can be swapped in and out for different tasks.
- Example: Fine-tuning a chatbot for multiple industries (e.g., legal, medical) by simply switching adapters.
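A brief sketch of that swap, using the adapter-transformers API mentioned above (its successor, the adapters library, exposes a very similar interface; the model and adapter names here are illustrative):

```python
# Assumes the adapter-transformers fork of transformers is installed.
from transformers import AutoAdapterModel

model = AutoAdapterModel.from_pretrained("bert-base-uncased")

# Add one adapter per domain; the base model's weights stay frozen.
model.add_adapter("legal")
model.add_adapter("medical")

# Train only the "legal" adapter (base parameters receive no gradients).
model.train_adapter("legal")
# ... run your fine-tuning loop on legal-domain data here ...

# At inference time, swap adapters to switch domains without retraining.
model.set_active_adapters("legal")    # legal assistant
model.set_active_adapters("medical")  # medical assistant
```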
LoRA (Low-Rank Adaptation)
- Description: LoRA updates only a low-rank decomposition of the model's weight matrices. Instead of fine-tuning entire layers, it trains smaller matrices that approximate task-specific changes.
- Advantages: Extremely parameter-efficient and ideal for hardware-constrained environments.
- Example: Adapting a model for a niche language like Basque while keeping it performant in English.
Prompt-Tuning
- Description: Prompt-tuning optimizes soft prompts (learnable embeddings) rather than the model weights. These prompts guide the model to perform specific tasks without altering its parameters.
- Advantages: Minimal computational cost; works well for zero-shot or few-shot learning.
- Example: Adding prompts to steer a general-purpose model into answering medical questions accurately.
P-Tuning
- Description: A variation of prompt-tuning that uses trainable embeddings at the input layer to encode task-specific information.
- Advantages: Improves performance in scenarios with limited task data.
- Example: Tuning an LLM for sentiment analysis on a small movie reviews dataset.
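To ground the LoRA and prompt-tuning variants, here is a hedged sketch using Hugging Face's peft library; the base model, target modules, and hyperparameters are illustrative assumptions, not recommendations:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, PromptTuningConfig, TaskType, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")

# LoRA: train small low-rank matrices alongside the frozen weight matrices.
lora_cfg = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                        # rank of the low-rank update
    lora_alpha=16,              # scaling factor for the update
    lora_dropout=0.05,
    target_modules=["c_attn"],  # GPT-2's attention projection layer
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # typically well under 1% of all weights

# Prompt tuning instead: learn soft prompt embeddings, freeze everything else.
# prompt_cfg = PromptTuningConfig(task_type=TaskType.CAUSAL_LM,
#                                 num_virtual_tokens=20)
# model = get_peft_model(base, prompt_cfg)

# Fine-tune `model` as usual; only the adapter/prompt parameters receive
# gradients, so the pretrained weights (and their knowledge) stay intact.
```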
Pros of PEFT
- Resource Efficiency: Fine-tunes only a small fraction of the model, reducing computation and storage costs.
- Task Modularity: Task-specific parameters can be swapped in and out without retraining the entire model.
- Scalability: Suitable for scenarios where multiple tasks or domains are involved.
Cons of PEFT
- Specialized Setup: Requires libraries like adapter-transformers, adding a layer of complexity.
- Inference Latency: Slightly increases latency due to additional computations for the adapters or soft prompts.
Building Knowledge Like a Lifelong Learner
By employing these techniques, we ensure that AI systems don't just learn like students cramming for a test but evolve into lifelong learners, capable of acquiring new skills while retaining the foundations they've built.
