The Mathematics of Small Things: On Grokking and The Double Descent Phenomenon
Last Updated on July 23, 2024 by Editorial Team
Author(s): Ayo Akinkugbe
Originally published on Towards AI.
The Conundrum: To Overfit or Generalize?
So here's the thing when training a model: you are often advised never to overfit. This makes sense, because overfitting is when a model's algorithm learns its training data so well that it fails to make accurate predictions on new, unseen data. However, understanding when your model begins to overfit can be useful. The point at which a model starts to overfit is also the point at which the objective function has effectively been optimized on the training data, which is useful for knowing when to stop training.
Conversely, a model that makes accurate predictions on new, unseen data is said to generalize well. The goal of model development is generalization, not overfitting. However, there is often a tension between optimizing the objective function during training and being able to generalize on new data. Though overfitting isn't desirable, it can serve as a guide to generalization if understood and leveraged accordingly.
For context, a model is trained on training data, evaluated on a validation set, and then tested on a test dataset. At each stage, an error that measures how accurately the model predicts is recorded: the training error, validation error, and test error. The difference between the training error and the test error is often referred to as the generalization error. When it is small, the model generalizes well; when it is large, the model has most likely overfit.
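As a rough illustration of this bookkeeping, here is a minimal sketch using scikit-learn. The synthetic dataset and the deliberately overfit-prone decision tree are choices made for illustration, not a prescription; the point is simply to measure training error, test error, and the gap between them.

```python
# A minimal sketch of measuring the generalization gap.
# Dataset and model choices here are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# A deep, unconstrained tree is prone to overfitting the training set.
model = DecisionTreeClassifier(max_depth=None, random_state=0).fit(X_train, y_train)

train_error = 1 - accuracy_score(y_train, model.predict(X_train))
test_error = 1 - accuracy_score(y_test, model.predict(X_test))

# A large gap suggests overfitting; a small gap suggests good generalization.
print(f"train error: {train_error:.3f}")
print(f"test error:  {test_error:.3f}")
print(f"generalization gap: {test_error - train_error:.3f}")
```

In practice, tracking this gap on the validation set over training epochs is what lets you decide when to stop training.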
There are numerous books, papers, and techniques on how to ensure a good fit in a model: how to overcome overfitting and how to enhance generalization. That is not the subject of this article. This article explores two phenomena observed in large models (Grokking and Double Descent) concerning how they overfit and generalize, along with some speculations about these types of behavior.
Grokking
Imagine you have been trying to learn a language. Let's say you have tried everything you can for the past five years. You are bad at it. You aren't learning, not even the fundamentals. Then suddenly, one morning after five years of trying, you wake up and you are speaking the language fluently. This scenario has been observed in large neural networks and is referred to as Grokking. Grokking in machine learning refers to a model suddenly achieving a deep and thorough understanding of the data it is being trained on. The phenomenon is characterized by a sharp and unexpected improvement in performance after a relatively long period of seemingly stagnant or mediocre results. It is as if the model suddenly "gets it!"
The interesting thing about this phenomenon is that even though it has been observed, it isn't yet explainable. We don't know why large models behave this way, since it runs contrary to the classical behavior of neural models described earlier: training is usually stopped right before a model begins to overfit so that it can generalize to unseen data. Why would a model start generalizing long after it has overfit the dataset?
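To make the setup concrete, below is a toy sketch in the spirit of the original Grokking experiments, which used small algorithmically generated datasets such as modular addition. The architecture, weight decay, data split, and training length here are all assumptions made for illustration; whether and when a delayed jump in validation accuracy appears depends heavily on those choices.

```python
# Toy setup inspired by Grokking experiments on modular addition (a + b mod P).
# All hyperparameters are illustrative assumptions; any delayed jump in
# validation accuracy depends strongly on weight decay, data fraction,
# and how long training runs.
import torch
import torch.nn as nn

torch.manual_seed(0)
P = 97  # modulus for the task

# Build every (a, b) pair; use a small fraction for training, the rest for validation.
pairs = torch.tensor([(a, b) for a in range(P) for b in range(P)])
labels = (pairs[:, 0] + pairs[:, 1]) % P
perm = torch.randperm(len(pairs))
n_train = int(0.3 * len(pairs))
train_idx, val_idx = perm[:n_train], perm[n_train:]

embed = nn.Embedding(P, 128)
mlp = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, P))
params = list(embed.parameters()) + list(mlp.parameters())
opt = torch.optim.AdamW(params, lr=1e-3, weight_decay=1.0)  # strong weight decay
loss_fn = nn.CrossEntropyLoss()

def forward(idx):
    x = embed(pairs[idx])               # (n, 2, 128)
    return mlp(x.flatten(start_dim=1))  # concatenate the two token embeddings

def accuracy(idx):
    with torch.no_grad():
        return (forward(idx).argmax(dim=-1) == labels[idx]).float().mean().item()

for step in range(1, 50_001):
    opt.zero_grad()
    loss = loss_fn(forward(train_idx), labels[train_idx])
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        # Train accuracy typically saturates long before validation accuracy moves.
        print(f"step {step}: train acc {accuracy(train_idx):.3f}, "
              f"val acc {accuracy(val_idx):.3f}")
```

The characteristic signature to look for in the logs is training accuracy reaching 100% early while validation accuracy stays near chance for a long stretch before (possibly) jumping.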
Double Descent
Double Descent refers to another phenomenon observed in the training of deep learning models. It describes the relationship between model complexity and performance in large models. Unlike the traditional U-shaped curve, Double Descent has an additional descent phase that occurs beyond the point where the model fits the training data perfectly (the interpolation threshold). That is, as complexity grows, performance on new data first improves, then degrades as the model begins to overfit, and then improves again, often surpassing the earlier peak.
Simply put, Double Descent is a phenomenon where models appear to perform better, then worse, and then better again as they get bigger.
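The shape of the curve can be reproduced in miniature. The sketch below is an illustrative setup, not the canonical experiment: random ReLU features fitted with minimum-norm least squares on a synthetic noisy regression task, with all sizes and the noise level assumed for demonstration. Test error typically falls, spikes near the interpolation threshold where the number of features matches the number of training samples, and then falls again as the model keeps growing.

```python
# A minimal sketch of a model-wise double descent curve using random ReLU
# features and minimum-norm least squares. Dataset sizes, feature counts,
# and noise level are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 100, 1000, 10

# A simple noisy linear teacher generates the data.
w_true = rng.normal(size=d)
X_train, X_test = rng.normal(size=(n_train, d)), rng.normal(size=(n_test, d))
y_train = X_train @ w_true + 0.5 * rng.normal(size=n_train)
y_test = X_test @ w_true + 0.5 * rng.normal(size=n_test)

for n_features in [10, 50, 90, 100, 110, 200, 500, 2000]:
    # Random ReLU features: model "complexity" grows with n_features.
    W = rng.normal(size=(d, n_features))
    phi_train = np.maximum(X_train @ W, 0)
    phi_test = np.maximum(X_test @ W, 0)

    # lstsq returns the minimum-norm solution when the system is underdetermined.
    coef, *_ = np.linalg.lstsq(phi_train, y_train, rcond=None)
    test_mse = np.mean((phi_test @ coef - y_test) ** 2)
    print(f"{n_features:5d} features -> test MSE {test_mse:.3f}")
```

The interesting region is around 100 features, where the model has just enough capacity to interpolate the training data and test error tends to peak before descending again.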
Differences between Grokking and Double Descent
Though similar, and sometimes referred to as the same phenomenon, Grokking is distinct from Double Descent along the following criteria:
- Pattern of Model Improvement: Grokking involves a sudden improvement in model performance after a prolonged period of suboptimal performance. It's more about the learning process within a fixed model structure. Double Descent describes a non-monotonic relationship between model complexity and performance, with an initial increase, a degradation at the interpolation threshold, and then an unexpected improvement as complexity continues to increase.
- Timing: Grokking happens after extensive training, with the model suddenly improving. Double Descent occurs as model complexity is varied, showing different performance phases depending on complexity.
- Scope: Grokking focuses on the training process and the model's internalization of data patterns. Double Descent focuses on the impact of model complexity on performance, highlighting unexpected behavior beyond the traditional bias-variance tradeoff.
- Underlying Mechanism: Grokking may be related to the model finally understanding intricate data structures and patterns after extensive training. Double Descent relates to how over-parameterized models can find simpler, more generalizable solutions despite their complexity.
Even though these are different phenomena, one thing they both have in common is that they veer off from classical machine learning theory of how a model learns and generalizes. A concept that helps explain how and why models learn the way they do classically is the Manifold Hypothesis.
Manifold Hypothesis
Imagine you have a sheet of paper (a 2-dimensional surface) that you can twist, fold, and crumple. This paper exists in a 3-dimensional space (length, width, and height), but its true intrinsic dimensionality is still just 2D. When the paper is flat, it's easy to see that it's just a 2D surface. When you crumple it, it might appear more complex and seem to fill more of the 3D space, yet it still fundamentally has only two dimensions. Even crumpled, the paper does not fill the entire 3D space; it remains a constrained, lower-dimensional surface (a manifold) sitting inside that space.
The Manifold Hypothesis is a fundamental concept in machine learning that explains how and why models might learn the way they do. The hypothesis suggests that high-dimensional data (such as images, sounds, or other complex data) lies on a lower-dimensional manifold within the high-dimensional space. For example, most realistic images (faces, objects, etc.) do not randomly occupy the entire high-dimensional space but are instead concentrated in specific regions (the manifold). These regions capture the underlying structure and relationships between the data points.
This hypothesis has important implications for understanding how machine learning models, especially deep learning models, operate and generalize.
- If a machine learning model can identify and learn this lower-dimensional manifold, it can understand and generalize from the data more efficiently, since any new realistic combination of features should lie on that manifold.
- By focusing on the manifold, the model avoids the noise and irrelevant parts of the high-dimensional space, leading to better performance and generalization.
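The crumpled-paper intuition can be made concrete with a standard toy dataset. The sketch below uses scikit-learn's swiss roll, a 2D surface rolled up in 3D, and a manifold learning method (Isomap) to recover a faithful 2D representation from the 3D points; the sample size and neighborhood parameter are illustrative choices.

```python
# A small sketch of the paper-in-3D intuition: the swiss roll is a
# 2-dimensional surface "crumpled" into 3-dimensional space.
# Parameters here are illustrative assumptions.
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

X, t = make_swiss_roll(n_samples=2000, noise=0.05, random_state=0)
print("ambient dimensionality:", X.shape[1])   # 3 coordinates per point

# Unroll the surface: Isomap preserves geodesic (along-the-surface) distances.
embedding = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
print("recovered manifold coordinates:", embedding.shape)  # (2000, 2)
```

The data lives in three coordinates, but two coordinates are enough to describe it faithfully, which is exactly the claim the Manifold Hypothesis makes about realistic high-dimensional data.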
Speculations
What might the Manifold Hypothesis have to do with these two unexplainable phenomena? Below are a few speculations.
- More Time Required to Unravel the Manifold for Different Dataset Structures: In Grokking, an over-parameterized model suddenly has a Eureka moment after a long period of training. This phenomenon has mainly been observed with algorithmically generated datasets. The Manifold Hypothesis suggests that real-world data has intricate structure. What if there are degrees of intricacy? What if different data types lie on manifolds of different intricacy within the higher-dimensional space? What if this adds complexity to how the model learns the information structure, leading to phenomena like Grokking and Double Descent?
- Correspondence Principle in AI: In physics, a similar dichotomy exists between classical and quantum physics. Quantum physics is the physics of very small things, describing how atoms and electrons behave. Classical physics, by contrast, is straightforward, often deterministic and well established. The coexistence of these two subfields has been made possible through a reconciliation: when quantum numbers are large, the predictions of quantum physics match those of classical physics. This is the Correspondence Principle. Maybe Artificial Intelligence needs a correspondence principle of its own, one that connects how large models behave to the statistical laws that govern and predict how traditional, smaller models behave.
- Unexplored Laws for Patterns in Complex Data Structures: Like the laws of large numbers, maybe there are laws yet to be discovered for patterns as they pertain to language, arithmetic, and other complex real-world data structures ingested by large models.
Learning theory demarcates cleanly; lines are drawn like a linear classifier's decision boundary. But step into the real world and there are nuances. "It depends," we like to say. Many factors that seem insignificant in theory end up determining the outcome: the small, significant things we overlooked.
In a fast-approaching world where we demand that machines think and act like humans, these small, significant things need to be calculated and accounted for. This is the mathematics of really small things. These phenomena will remain strange until we discover what is hidden beneath the manifold.
Published via Towards AI