
The Math Behind Foundational Models
Author(s): Fabiana Clemente
Originally published on Towards AI.
How LLMs Work and What Are Their Limitations
When I was a kid, maybe around 3 or 4 years old, my mother used to sit with my older sister and help her learn how to read. One day, I joined them on the couch, quietly listening as they read a picture book together. As the pages turned, I began reciting the story out loud, word for word.
My mother was stunned. Had my youngest just learned to read, just by watching her sister? It was a proud-parent moment… until she decided to test it. She handed me the book and flipped to a random page. I kept telling the story, but it no longer matched the pictures. I hadn't learned to read. I had memorized the story, every beat of it. Impressive? Sure. But I wasn't reading; I was mimicking. Now, why does that matter?
Well, foundational models like GPT-4, Claude, or LLaMA have become the engines behind everything from AI assistants to automated coding tools and text summarizers. But despite how remarkable these models are, there's still a lot of mystery, and a fair share of misconceptions, about how these large language models (LLMs) actually work. Are they reasoning? Understanding? Or just doing something that looks smart?
Let's unpack what's going on under the hood.
It's all about probabilities
At the center of an LLM is a very simple task: given a sequence of words, predict the next one. That's it. Of course I'm oversimplifying it, but you get the gist.
Everything you see (poetry, essays, code generation) is just the output of a model trained to say, "given this string of tokens, what's the most likely next token?"
Mathematically, this is framed as maximizing the conditional probability P(w_t | w_1, w_2, ..., w_{t-1}), where each w_t is a word or, as it is more commonly described, a token.
This conditional probability is computed by a network with billions of parameters optimized through gradient descent.
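To make that concrete, here is a minimal sketch of what "predict the next token" means, using a made-up four-word vocabulary and made-up scores; a real model computes these scores with billions of learned parameters over a vocabulary of tens of thousands of tokens.

```python
import numpy as np

# Made-up vocabulary and made-up raw scores (logits) that a model might
# assign to the next token after the context "the cat sat on the".
vocab = ["mat", "dog", "moon", "roof"]
logits = np.array([4.1, 1.3, 0.2, 2.7])

# Softmax turns raw scores into a probability distribution,
# i.e. P(w_t | w_1, ..., w_{t-1}) for each candidate token.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

for token, p in zip(vocab, probs):
    print(f"P({token} | 'the cat sat on the') = {p:.3f}")

# Generation is just picking from this distribution, here greedily.
print("most likely next token:", vocab[int(np.argmax(probs))])
```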
So how does a model even represent language this way?
Transformers changed everything
In reality, for these models, words aren't really understood as words. They're numbers, or more specifically, high-dimensional vectors in a learned space.
During training, the model learns embeddings (you've probably heard about them by now). An embedding is nothing more than a data structure that captures semantic relationships. For example, the vector difference between king and queen is surprisingly similar to that between man and woman.
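Here is a toy illustration of that analogy. The 3-dimensional vectors below are hand-picked for the example, not learned; real embeddings come out of training and typically have hundreds or thousands of dimensions.

```python
import numpy as np

# Hand-picked toy vectors purely for illustration; real embeddings
# are learned during training and are much higher-dimensional.
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The classic analogy: king - man + woman should land near queen.
target = emb["king"] - emb["man"] + emb["woman"]
for word, vec in emb.items():
    print(word, round(cosine(target, vec), 3))
```

With these toy vectors, king - man + woman ends up closest to queen, which is exactly the kind of relationship learned embeddings capture at scale.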
But embeddings were just the start. The real breakthrough when it comes to text generation came in 2017 with the Transformer architecture.
Why? The introduction of a mechanism called self-attention!
Unlike past architectures such as RNNs and LSTMs, which processed text sequentially, transformers can process the entire input at once. Every token can "attend" to every other token in the sequence and weigh its importance when deciding what to say next.
It looks abstract, but the idea is simple: each token decides how much to "pay attention" to every other token, and uses that to update its own representation. Stack dozens of these layers, add some normalization, skip connections, and feedforward networks, and you get the deep transformer network behind LLMs.
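For the curious, here is a stripped-down, single-head version of that attention step in NumPy, with random matrices standing in for the learned projections; real transformers use many heads, a causal mask so tokens only attend to earlier positions, plus the normalization and feedforward layers mentioned above.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Project the input into queries, keys, and values.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    # Each token scores every other token ("how relevant is it to me?"),
    # the scores become weights, and the weights mix the value vectors.
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)   # shape: (seq_len, seq_len)
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 8, 4
X = rng.normal(size=(seq_len, d_model))  # stand-in token embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
out, attn = self_attention(X, Wq, Wk, Wv)
print(attn.round(2))  # row i shows how much token i attends to each token
```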
LLMs: a game of scale
The magic that we see happening with LLMs is possible due to something as simple as scale. Training an LLM comes down to showing it billions of sequences and minimizing the cross-entropy between the predicted tokens and the original ones.
More importantly, the loss function must direct the model's training, meaning it should ask something like: how surprised were you by the real next token? With all these elements combined, over time, with enough data and compute power, the model gets very good at being "unsurprised", meaning it becomes very good at predicting what the next token is expected to be.
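That "surprise" has a precise form: the cross-entropy loss is the negative log of the probability the model assigned to the token that actually came next. A tiny sketch with made-up probabilities:

```python
import numpy as np

# Suppose the true next token has index 2 in a tiny 4-token vocabulary,
# and these are two made-up predicted distributions for illustration.
true_index = 2
confident_right = np.array([0.05, 0.05, 0.85, 0.05])
confident_wrong = np.array([0.80, 0.10, 0.05, 0.05])

def cross_entropy(probs, target):
    # "How surprised were you by the real next token?"
    # Low probability on the true token => large loss (big surprise).
    return -np.log(probs[target])

print(cross_entropy(confident_right, true_index))  # ~0.16, barely surprised
print(cross_entropy(confident_wrong, true_index))  # ~3.00, very surprised
```

Training nudges the parameters so that, averaged over billions of examples, this loss keeps shrinking.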
With this intel, you are probably wondering: so, are LLMs incapable of true interpretation and understanding? If you are, you're correct! That's one of their core limitations: they don't understand, they mimic. And mimicry, no matter how sophisticated, can only take you so far.
LLM abilities and boundaries
When you scale up transformers and feed them enough text, they start doing more than autocomplete. They can translate, reason, write code, and even (try to!) explain jokes. These are called emergent abilities: capabilities that weren't explicitly programmed but arise from the model's structure and training.
Yet these abilities are statistical, not logical. An LLM doesn't prove a theorem; it repeats patterns it has seen or inferred. Its "reasoning" is prediction, not real deduction. That's why other phenomena like hallucinations, poor long-term planning, and bias also happen:
- Hallucinations: they happen when a model confidently generates information that isn't true, and while they're a known limitation, strategies like retrieval-augmented generation and guardrails can help reduce them.
- Poor long-term planning: shows up when a model struggles to stay consistent or follow through on multi-step tasks, mainly because it lacks true memory or goal awareness, but techniques like external memory, state tracking, and planning frameworks can help.
- Bias: refers to the model reflecting or amplifying unfair assumptions or stereotypes found in its training data, and while it's a tough challenge, it can be mitigated through strategies like careful data filtering, preprocessing, or even augmenting with balanced synthetic data.
These aren't really bugs; they are consequences of the math!
So… are we close to AGI?
By now I hope I have convinced you that we are not! But just in case, let me throw in a couple more reasons why LLMs are not, yet, the path to AGI.
In reality, an LLM's knowledge is bounded by its training data, nothing more, nothing less. Even if it memorizes facts, it lacks the capacity to ground them in the real world unless paired with tools like search or retrieval-augmented generation (RAG). The math simply doesn't allow it.
The probability distributions it models are fundamentally approximate. The cross-entropy loss doesn't guarantee correctness, only plausibility. And since token prediction is done greedily or through sampling (like top-k or nucleus sampling), the generated text isn't deterministic; it's probabilistic.
Furthermore, transformer models have fixed context windows. Despite recent advances like Transformer-XL, Longformer, or RWKV, the model still operates within a finite number of tokens. It doesn't remember what you said last week. It doesn't know what it doesn't know.
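As a rough sketch of what those decoding strategies do, here they are applied to a made-up next-token distribution (not any particular model's output):

```python
import numpy as np

rng = np.random.default_rng(42)

def top_k_sample(probs, k=3):
    """Keep only the k most likely tokens, renormalize, then sample."""
    top = np.argsort(probs)[-k:]
    p = probs[top] / probs[top].sum()
    return int(rng.choice(top, p=p))

def nucleus_sample(probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1
    keep = order[:cutoff]
    q = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=q))

# Made-up next-token distribution over a 6-token vocabulary.
probs = np.array([0.40, 0.25, 0.15, 0.10, 0.06, 0.04])
print("greedy :", int(np.argmax(probs)))              # always token 0
print("top-k  :", [top_k_sample(probs) for _ in range(5)])
print("nucleus:", [nucleus_sample(probs) for _ in range(5)])
```

Greedy decoding always picks the same token; the sampling variants trade determinism for variety, which is why the same prompt can yield different answers.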
There are some really interesting talks out there about AGI and the mathematical limitations we're currently facing. My favorite, without a doubt, is one by Yann LeCun.
Where are we headed?
So where's all this going?
Researchers are already trying to stretch the math behind LLMs in new directions. Mixture-of-Experts models route different inputs through different parts of the network, kind of like calling in the right specialist for the job. Retrieval-Augmented Generation (RAG) gives models a way to pull in fresh information instead of relying only on what they memorized during training. And fine-tuning with RLHF (Reinforcement Learning from Human Feedback) helps steer responses toward what we, as humans, actually want to hear.
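To give a feel for the RAG idea in particular, here is a toy retrieval step over a hypothetical three-document corpus, using a crude word-overlap embedding as a stand-in; real systems use a learned embedding model and a vector database, then hand the retrieved text to the LLM as extra context.

```python
import numpy as np

# Hypothetical three-document "knowledge base"; in a real system these
# would be chunked, embedded by a learned model, and stored in a vector DB.
documents = [
    "The 2024 report shows revenue grew 12 percent year over year.",
    "Transformers process all tokens in a sequence in parallel.",
    "RLHF fine-tunes a model using human preference data.",
]

# Crude stand-in embedding: a bag-of-words vector over the corpus vocabulary.
vocab = sorted({w.lower().strip(".,?") for d in documents for w in d.split()})

def embed(text):
    words = {w.lower().strip(".,?") for w in text.split()}
    return np.array([1.0 if w in words else 0.0 for w in vocab])

def retrieve(query, k=1):
    q = embed(query)
    doc_vecs = [embed(d) for d in documents]
    scores = [float(q @ v) / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-9)
              for v in doc_vecs]
    best = np.argsort(scores)[::-1][:k]
    return [documents[i] for i in best]

query = "How did revenue change in 2024?"
context = retrieve(query)[0]
# The retrieved passage is prepended to the prompt so the model grounds
# its answer in retrieved information instead of relying on memorization.
prompt = f"Context: {context}\n\nQuestion: {query}\nAnswer:"
print(prompt)
```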
All cool stuff. But at the end of the day, these are still patches on top of the same basic engine: a really powerful guessing machine. We're not at the stage of true reasoning or understanding yet. And that's okay.
Foundational models are still incredible. They're probably the most versatile tools we've ever trained. But once you understand the math, you see them more clearly for what they are: massive statistical machines, predicting one token at a time based on patterns they've seen before.
Doesn't make them any less impressive. Just means we should use them with a bit more perspective. The future of AI isn't just about making models bigger; it's about making them smarter, more reliable, and grounded in how the world actually works.
Published via Towards AI