
The Role of Signal-to-Noise in Loss Convergence

Last Updated on February 12, 2026 by Editorial Team

Author(s): Austin DeWolfe

Originally published on Towards AI.

Source: Image by author

Consider the typical NLP loss curve during pre-training: that nice, beautiful line of healthy training. Why does it behave that way? And why does the curve above not?

Two factors determine the kind of curve we will get out of training: the signal-to-noise ratio of the data, and the size of the generalized solution space.

The graph above comes from training on symbolic data, which looks like this:

Source: Image by author

Early in training the problems are much easier (“What is 2+2?”), and the model generalizes these solutions very quickly. When it hits the 4-digit multiplication portion of the curriculum, training looks more like a curve we’d see in NLP, like the long convergence from around 660k to 870k steps in the first image. The signal is rich, but the solution space is much larger than anything the model has dealt with before. The loss appears asymptotic to the full generalized solution, but eventually the model reaches 99.999% accuracy on its logits during auto-regressive evaluation, with a long history of only correct answers spanning thousands of steps, and we can call that good enough. Bengio et al. (2009)

Consider this graph:

Source: Image by author

Every blue mark is a graduation to a new curriculum stage. Graduation happens when the model scores 100% accuracy during auto-regressive evaluation on a sample of 25 problems across 6 consecutive evaluation windows, each 600 steps apart. The green box is the region of 4-digit multiplication, covering about 25k steps.
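The graduation rule described above can be sketched as a small stateful check. The class name and interface here are hypothetical, invented for illustration; the constants (25 problems, 6 windows, 600 steps) come from the text:

```python
from collections import deque

class CurriculumGate:
    """Tracks rolling eval results and signals curriculum graduation.

    Graduation fires when the model scores 100% on a 25-problem
    auto-regressive eval in 6 consecutive windows, 600 steps apart.
    """

    def __init__(self, windows=6, eval_interval=600, sample_size=25):
        self.windows = windows
        self.eval_interval = eval_interval
        self.sample_size = sample_size
        self.history = deque(maxlen=windows)

    def record(self, step, accuracy):
        """Record one eval; return True the moment graduation is earned."""
        # Only evaluate on the schedule; ignore off-schedule calls.
        if step % self.eval_interval != 0:
            return False
        self.history.append(accuracy == 1.0)
        # Graduate once all of the last `windows` evals were perfect.
        return len(self.history) == self.windows and all(self.history)
```

A single imperfect window simply slides out of the deque after 6 more perfect ones, so the minimum time to graduate after a stumble is another 3,600 steps.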

In the earlier training sample we see the line C:0011100. This is a mask for carry digits: which positions of the addition carry over to the next digit place. Before this line was in the training data, the model took over 120k steps to learn the same data; after carry was added, it takes about 25k. Nanda et al. (2023)
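As a sketch of how such a carry mask could be computed (the exact serialization in the author's data isn't shown, so the left-to-right string form here is an assumption):

```python
def carry_mask(a: int, b: int, width: int) -> str:
    """Return a left-to-right mask of which digit positions of a + b
    produce a carry into the next place, padded to `width` digits."""
    mask = []
    carry = 0
    # Walk digits right-to-left, the way column addition is done.
    for i in range(width):
        da = (a // 10**i) % 10
        db = (b // 10**i) % 10
        carry = 1 if da + db + carry >= 10 else 0
        mask.append(str(carry))
    return "".join(reversed(mask))

# For example, 99900 + 100 carries in three middle positions:
print(carry_mask(99900, 100, 7))  # -> 0011100
```

Emitting this mask in the prompt hands the model the intermediate structure of the computation, which is exactly the kind of signal enrichment the article credits for the 120k-to-25k speedup.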

The solution space for 4-digit multiplication is the largest the model has seen to this point, by orders of magnitude. The solution space wasn’t changed to increase convergence speed; a single addition to the signal gave a tremendous improvement in convergence.

I have taken to considering the process of learning to be composed of 3 stages: The Weak Generalized Solution (WGS), Memorization, and The Strong Generalized Solution (SGS).

The WGS is where all training starts: some initial important insights are encoded. In some training regimes, something prevents the WGS from approaching the SGS. Based on watching gradient norms, and breaking them out to see how many weights are actually being updated, my opinion is that the movement of weights in the WGS suffers from gradient clipping on its path to the SGS. As a result, “helper weights” are recruited; this is memorization. Little additional rules pop up to solve specific failure modes. Eventually these signals become noise and have to be shed to realize the full SGS; this shedding is the process of grokking. Power et al. (2022)
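One framework-agnostic way to "break out" gradients as described is to count, per tensor, how many parameters receive a non-negligible update. This sketch takes a plain mapping of parameter name to flattened gradient values (as captured after a backward pass); the threshold is an arbitrary choice, not a value from the article:

```python
def fraction_updated(grads, threshold=1e-8):
    """Given a mapping of parameter name -> flat list of gradient values,
    report the overall fraction of weights whose gradient magnitude
    exceeds `threshold`, plus a per-tensor breakdown."""
    moved, total = 0, 0
    per_tensor = {}
    for name, g in grads.items():
        # Count gradients large enough to actually move the weight.
        n = sum(1 for v in g if abs(v) > threshold)
        per_tensor[name] = n / len(g)
        moved += n
        total += len(g)
    return moved / max(total, 1), per_tensor
```

In a PyTorch setup the input would come from iterating `model.named_parameters()` after `loss.backward()`; watching how this fraction changes under clipping is the diagnostic the paragraph describes.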

In our earlier graphs we see rapid convergence to the SGS because the signal is so strong that the model never needs helper weights; it rides the signal to the SGS. Had the model been given only the question #### x #### = ? without the scratchpad, convergence would take significantly longer and look more like the traditional NLP curve. Even just adding commas between the numbers can slow convergence by orders of magnitude.

Consider 4-digit multiplication, #,### x #,###. That is a big solution space. But it is bigger still if we also include 3-digit a’s with 4-digit b’s (a x b = c), such as ### x #,###. Does this make training slower or faster?

We are increasing the solution space, but are we getting more valuable signal for riding toward the SGS? It turns out training is faster when we increase the solution space for more signal. One likely factor is that 3-digit-a, 4-digit-b problems sit within the set of 4-digit multiplication, so while the solution space is larger from one lens, it is contained within the set the model was already approaching in the SGS anyway.

We have 3-digit a and 4-digit b, with a x b = c. What happens if our training data also includes b x a = c? Do we converge faster or slower?

Generally this leads to faster convergence, though once even more variables are involved it doesn’t (probably because scratchpad changes explode the solution space). What is the case, though, is that the resulting generalized solution is more robust. As strong as the model is on a x b = c, b x a = c is actually out-of-distribution if the model isn’t trained on it. Faced with these problems it succeeds over 90% of the time, but within the distribution of a x b = c the success rate is 99.9999%. This also means we can use b x a = c as a held-out eval, though excluding it from training isn’t worth the additional training time that costs. It can be useful for experimentation, though. Dziri et al. (2023)
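A minimal sketch of building such a commuted held-out set. The plain "a x b = c" serialization and the function name are assumptions for illustration; the digit widths match the text:

```python
import random

def make_mul_sets(n, rng_seed=0):
    """Build a training set of a x b = c strings (3-digit a, 4-digit b)
    and the commuted b x a = c strings as a held-out eval."""
    rng = random.Random(rng_seed)
    train, held_out = [], []
    for _ in range(n):
        a = rng.randint(100, 999)
        b = rng.randint(1000, 9999)
        train.append(f"{a} x {b} = {a * b}")
        # b x a is out-of-distribution unless explicitly trained on,
        # so it doubles as a "free" generalization probe.
        held_out.append(f"{b} x {a} = {a * b}")
    return train, held_out
```

Because the commuted problems share every answer with training but none of the surface forms, any gap between the two accuracies isolates how literally the model has bound operand order.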

Let’s look at top-k logits from a failure mode early on in a curriculum:

Source: Image by author

We are in the WGS region of training. The prediction is off by 1, and the logits are dominated by the off-by-one error and the correct answer. I don’t have the image of the rest of that training, but the output looked like #,##X,### where X marks the error: the model is off by 1 in the 4th digit place. It has nearly generalized the full solution in its first 600 steps of training on this 4-digit multiplication curriculum, where this eval was taken, but it will still take 25k steps to fully generalize. Other failure modes early in the WGS are the same ones we see in LLMs: doubling a number, halving a number, or adding or subtracting 1 and then doubling or halving.
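Extracting a top-k view like the one above can be sketched framework-agnostically; a real setup would pass in the model's logit vector at the mispredicted digit position, along with the token strings for each vocabulary index:

```python
import math

def top_k(logits, vocab, k=3):
    """Return the k highest-probability tokens with softmax probabilities,
    the view used to inspect a single-position failure mode."""
    m = max(logits)
    # Numerically stable softmax: shift by the max before exponentiating.
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    ranked = sorted(zip(vocab, (e / z for e in exps)),
                    key=lambda t: t[1], reverse=True)
    return ranked[:k]
```

When the top two entries are the correct digit and its off-by-one neighbor, with everything else near zero, the model is in exactly the "signal dominates" regime the paragraph describes.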


It is no wonder, then, that LLMs are so poor at math. They only learn it in regimes of high noise, surrounded by natural language. They only ever reach a WGS of math.

It also seems that when the signal is strong enough and the solution space is small enough, the model transitions from the WGS to the SGS without memorizing, or at least the memorization phase passes so quickly as to be basically nonexistent. Many portions of the curriculum pass in their minimum number of steps, 600 x 6 evals to trigger graduation, with a train loss of 1e-7; that is just floating-point noise at that point (until you zoom in and see the same scaling laws again).

If you look at the first graph you will notice it runs out to the 800k-step range. In that time the model learned addition, subtraction, multiplication, division, and all comparison operators from -1m to 1m, as well as order of operations where multiple ops are combined with parentheses.

By adjusting the stages that brought it to this capability, either by making each stage’s solution space more granular or by increasing the signal-to-noise ratio, convergence gets faster: from over 800k steps to reach -1m to 1m on all operators, to closer to 320k.

So then what is happening during NLP training? The solution space is massive. The signal-to-noise ratio is decently healthy; there are strong relationships in NLP. If we consider 4-digit multiplication to have a solution space of 10⁸, how large is NLP’s? V^L, where V is the vocabulary size and L is the context window? That is a massive number… But what about homonyms? V^L is very literal combinatorics, and it doesn’t capture the nuance of language.

Consider: river bank != river bank != river bank. I could have a flowing river with a bank at the shore, or a bank called river bank, or a geographic feature called river bank, or any other meaningful combination of ideas. Words are not ideas; LLMs model ideas while outputting words, just like people. However, they only take in word fragments, which helps to limit the solution space.

If that is the wrong combinatorics, is it something more like the factorial of the totality of concepts captured in digitized language, Ac!, where Ac is the number of all concepts? Chomsky argues the space is infinite as a result of grammar. Chomsky (1957). Both are too large, and Chomsky’s argument isn’t dealing with finite datasets; it’s more about grammar relationships allowing an infinite space, given infinite time to create it.

I believe that if the space were infinite we wouldn’t see scaling laws. Our training data is in a boundary. Our vocab is in a boundary. Our current society is within a boundary. This is not even a world model: the distribution the model learns is finite, based on the training data. We see scaling laws because the solution space is an unfathomably large number yet still bounded, so we can improve convergence by doubling model size, which lets us capture more of the WGS per backprop step with fewer helper weights.

I believe the solution space is B^L x C, where B is the effective branching factor per position, informed by the Shannon entropy of language, and C is the number of meaningful compositional combinations that emerge from token interactions. It could also be V^L x M^L, where M is the average number of meanings per token. Either way, it is bounded and large.
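To put rough magnitudes on the bound, here is a back-of-envelope comparison of the literal V^L combinatorics against an entropy-informed B^L (before the C factor). All the concrete values, including the ~3 bits per token of effective entropy, are assumptions chosen only to show the gap in scale:

```python
import math

def log10_space(base, length):
    """log10 of base**length, computed without overflow."""
    return length * math.log10(base)

# Assumed values: V = 50k vocab, L = 4096 context window,
# B = 2**H with H ~ 3 bits/token of effective entropy (an assumption).
V, L, H = 50_000, 4096, 3.0
naive = log10_space(V, L)          # literal combinatorics: V^L
effective = log10_space(2 ** H, L) # entropy-informed: B^L
print(f"V^L ~ 10^{naive:.0f}")
print(f"B^L ~ 10^{effective:.0f}")
```

Both numbers are unfathomably large yet finite, which is the point of the paragraph: bounded spaces admit scaling laws, however astronomical the bound.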

This means there exists a hypothetical model large enough to capture the totality of digitized language in its WGS. And if that is the case, such a model would rapidly approach the SGS during training.

There is then also an optimal NLP training regime that dramatically reduces training time by carefully controlling the bounds of the distribution, and by layering a portion of each data set’s semantics into the next so the overlap helps the model find the generalized solutions faster. Grokking and NLP training are the same.

Power et al. (2022). “Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets.” https://arxiv.org/abs/2201.02177

Liu et al. (2023). “A Unified View of Grokking: Induction Head Formation, Phase Transitions, and Sparsity.” https://arxiv.org/abs/2205.10343

Bengio et al. (2009). “Curriculum Learning.” https://ronan.collobert.com/pub/2009_curriculum_icml.pdf

Nanda et al. (2023). “Progress Measures for Grokking via Mechanistic Interpretability.” https://arxiv.org/abs/2301.05217

Dziri et al. (2023). “Faith and Fate: Limits of Transformers on Compositionality.” https://arxiv.org/abs/2305.18654

Humayun et al. (2024). “Deep Networks Always Grok and Here Is Why.” https://arxiv.org/abs/2402.15555

Singh et al. (2026). “Explaining Grokking in Transformers through the Lens of Inductive Bias.” https://arxiv.org/html/2602.06702v1

Thilak et al. (2022). “Towards Understanding Grokking: A Comparative Analysis of Phase Transitions.” https://arxiv.org/abs/2205.10343

Liu et al. (2022). “Towards Understanding Grokking: An Effective Theory of Representation Learning.” https://arxiv.org/abs/2205.10343

Chomsky (1957). “Syntactic Structures.”
