Who is Harry Potter? Inside Microsoft Research’s Fine-Tuning Method for Unlearning Concepts in LLMs
Last Updated on November 5, 2023 by Editorial Team
Author(s): Jesus Rodriguez
Originally published on Towards AI.
I recently started an AI-focused educational newsletter that already has over 160,000 subscribers. TheSequence is a no-BS (meaning no hype, no news, etc.) ML-oriented newsletter that takes 5 minutes to read. The goal is to keep you up to date with machine learning projects, research papers, and concepts. Please give it a try by subscribing at thesequence.substack.com.
Large language models (LLMs) are typically trained on vast amounts of unlabeled data, which leads them to acquire knowledge of incredibly diverse subjects. The datasets used to pretrain LLMs often include copyrighted material, triggering both legal and ethical concerns for developers, users, and original content creators. Quite often, specific knowledge needs to be removed from an LLM in order to adapt it to a particular domain. While the learning in LLMs is certainly impressive, the unlearning of specific concepts remains a very nascent area of exploration. Fine-tuning methods are certainly effective for incorporating new concepts, but can they be used to unlearn specific knowledge?
In one of the most fascinating papers of this year, Microsoft Research explores an unlearning technique for LLMs. The challenge was nothing less than making Llama2–7b forget any knowledge of Harry Potter.
The Unlearning Challenge in LLMs
Recent months have witnessed heightened scrutiny of the data employed to train LLMs. The spotlight has shone on issues ranging from copyright infringement to privacy concerns, bias in content, false data, and even the presence of toxic or harmful information. It is evident that some training data poses inherent problems. But what happens when the realization dawns that certain data must be expunged from a trained LLM?
Traditionally, the AI community has found it relatively straightforward to fine-tune LLMs for incorporating new information. Yet, the act of making these machines forget previously learned data presents a formidable challenge. To draw an analogy, it’s akin to attempting to remove specific ingredients from a fully baked cake — a task that appears nearly insurmountable. While fine-tuning can introduce new flavors, removing a particular ingredient poses a considerable hurdle.
Adding to the complexity is the exorbitant cost associated with retraining LLMs. The process of training these massive models demands investments that can easily reach tens of millions of dollars or more. Given these formidable obstacles, unlearning remains one of the most intricate enigmas within the AI sphere. Doubts linger about its feasibility, with some even questioning whether achieving perfect unlearning is merely a distant dream. In the absence of concrete research on the subject, skepticism in the AI community grows.
The Method
Microsoft Research’s approach to unlearning in generative language models comprises three core components:
1. Token Identification through Reinforced Modeling: The researchers construct a specialized model designed to strengthen its knowledge of the content to be unlearned, achieved by further fine-tuning on the target data, such as the Harry Potter books. Comparing this reinforced model with the baseline identifies tokens whose probabilities have notably increased, indicating content-related tokens that should be avoided during generation.
2. Expression Replacement: To facilitate unlearning, distinctive phrases from the target data are replaced with generic equivalents (a minimal sketch of this step follows the list). The model then predicts alternative labels for these tokens, simulating a version of itself that hasn’t learned the specific target content.
3. Fine-Tuning: Armed with these alternative labels, the model undergoes fine-tuning. Essentially, whenever the model encounters a context associated with the target data, it effectively “forgets” the original content.
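To make the expression-replacement step concrete, here is a minimal sketch in Python. The dictionary entries and function name are illustrative assumptions for this article, not the paper’s actual mapping of anchored terms:

```python
import re

# Hypothetical anchored terms mapped to generic equivalents.
ANCHORED_TERMS = {
    "Harry Potter": "Jon Smith",
    "Hermione": "Anna",
    "Hogwarts": "the academy",
    "Quidditch": "basketball",
}

def to_generic(text: str) -> str:
    """Replace distinctive target-data expressions with generic ones."""
    for term, generic in ANCHORED_TERMS.items():
        text = re.sub(re.escape(term), generic, text)
    return text

print(to_generic("Harry Potter studied magic at Hogwarts."))
# -> "Jon Smith studied magic at the academy."
```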
In this scenario, Microsoft Research tackles the challenge of unlearning a subset of a generative language model’s training data. Suppose the model has been trained on a dataset X, and a subset Y (referred to as the unlearn target) needs to be forgotten. The aim is to approximate the effect of retraining the model on the dataset X \ Y, recognizing that a full retraining on X \ Y would be impractical due to its time and cost implications.
One initial notion for unlearning text might be to train the model on the text while inverting the loss function. However, empirical findings indicate that this approach does not yield promising results in this context. The limitation stems from situations where the model’s successful prediction of certain tokens is not tied to knowledge of the Harry Potter novels but rather reflects its general language comprehension. For example, penalizing the prediction of “Harry” in the sentence “Harry Potter went up to him and said, ‘Hello. My name is’” would not unlearn the books; it would instead hinder the model’s understanding of the phrase “my name is.”
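For reference, here is a minimal sketch of that naive reversed-loss idea: gradient ascent on the unlearn target. The checkpoint name, learning rate, and single-step loop are placeholders, and a real run on a 7B model would require far more careful memory management than shown:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; any causal LM works for the sketch.
name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

batch = tokenizer("Harry Potter's two best friends are Ron and Hermione.",
                  return_tensors="pt")
outputs = model(**batch, labels=batch["input_ids"])

# Invert the sign of the usual language-modeling loss: instead of pushing
# the probability of the target tokens up, push it down.
(-outputs.loss).backward()
optimizer.step()
optimizer.zero_grad()
```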
Another challenge arises when the baseline model confidently predicts tokens like “Ron” or “Hermione” in a sentence like “Harry Potter’s two best friends are.” Applying a simple reverse loss would require numerous gradient descent steps to alter the prediction. Additionally, the most likely token would merely switch to an alternative related to the Harry Potter novels.
Instead, the goal is to provide the model with plausible alternatives to tokens like “Ron” that are unrelated to the Harry Potter books but remain contextually appropriate. In essence, for every token in the text, the question becomes: What would a model unexposed to the Harry Potter books predict as the next token in this sentence? This is referred to as the generic prediction, and Microsoft’s method employs techniques such as reinforcement bootstrapping and anchored terms to obtain these generic predictions.
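As a sketch of how such generic predictions can be derived by contrasting the baseline and reinforced models, the combination rule and the alpha coefficient below are assumptions based on the paper’s description, not a verbatim reproduction of its formula:

```python
import torch

def generic_logits(baseline_logits: torch.Tensor,
                   reinforced_logits: torch.Tensor,
                   alpha: float = 1.0) -> torch.Tensor:
    """Approximate the next-token logits of a model never exposed to the target text."""
    # Wherever the reinforced model grew *more* confident than the baseline,
    # that excess confidence is taken as a signal of target-specific knowledge
    # and subtracted away (clamped at zero so unrelated tokens are untouched).
    return baseline_logits - alpha * torch.relu(reinforced_logits - baseline_logits)

# The argmax of these adjusted logits serves as the "generic" label used as
# the fine-tuning target at each token position.
```

Intuitively, tokens like “Ron” that only the reinforced model strongly favors get pushed down, while plausible generic continuations survive.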
The Results
Microsoft Research undertook a task initially considered nearly impossible: erasing the enchanting world of Harry Potter from the Llama2–7b model, originally trained by Meta. Multiple sources suggest that the model’s training encompassed the “books3” dataset, an extensive repository that includes the iconic books, among a trove of other copyrighted literary works (including those authored by a contributor to this research).
To illustrate the model’s remarkable depth of knowledge, one need only present it with a seemingly generic prompt like, “When Harry went back to school that fall,” and observe as it weaves a detailed narrative set within J.K. Rowling’s magical universe.
However, through the application of Microsoft Research’s proposed technique, a profound transformation in the model’s responses emerged. Comparing the completions generated by the original Llama2–7b model with those produced by the fine-tuned version, the original recites details of the wizarding world, while the unlearned model answers the same prompts with generic continuations that make no reference to it.
Microsoft Research’s investigation yields a crucial insight: unlearning, while presenting challenges, emerges as a feasible undertaking, as evidenced by the favorable outcomes in their experiments involving the Llama2–7b model. Nonetheless, this achievement warrants a cautious perspective. Their current evaluation methodology, reliant on prompts given to the model and the subsequent analysis of its responses, proves effective in specific contexts. However, it could potentially overlook more complex, adversarial methods for extracting retained information. It’s conceivable that unconventional techniques, like delving into token probability distributions, might inadvertently expose the model’s concealed familiarity with unlearned content.
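As an illustration of the probing concern raised above, here is a sketch of inspecting the next-token distribution of a hypothetically unlearned checkpoint; the model path is a placeholder:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "path/to/unlearned-llama2-7b"  # placeholder for an unlearned checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

batch = tok("Harry Potter's two best friends are", return_tensors="pt")
with torch.no_grad():
    logits = model(**batch).logits[0, -1]

# Even if sampled text looks clean, residual probability mass on tokens like
# "Ron" or "Hermione" would betray retained knowledge of the books.
top = torch.softmax(logits, dim=-1).topk(5)
for p, idx in zip(top.values.tolist(), top.indices.tolist()):
    print(f"{tok.decode([idx])!r}: {p:.3f}")
```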
In summary, while their technique marks a promising initial step, its adaptability to diverse content categories remains subject to thorough examination. The approach presented provides a foundational framework, yet it necessitates further research for refinement and expansion, especially in the context of broader unlearning tasks within large language models.