
The Math Behind Foundational Models
Author(s): Fabiana Clemente
Originally published on Towards AI.
How LLMs Work and What Are Their Limitations
When I was a kid, maybe around 3 or 4 years old, my mother used to sit with my older sister and help her learn how to read. One day, I joined them on the couch, quietly listening as they read a picture book together. As the pages turned, I began reciting the story out loud, word for word.
My mother was stunned. Had my youngest just learned to read, just by watching her sister? It was a proud-parent moment… until she decided to test it. She handed me the book and flipped to a random page. I kept telling the story, but it no longer matched the pictures. I hadn't learned to read. I had memorized the story, every beat of it. Impressive? Sure. But I wasn't reading; I was mimicking. Now, why does that matter?
Well, foundational models like GPT-4, Claude, or LLaMA have become the engines behind everything from AI assistants to automated coding tools and text summarizers. But despite how remarkable these models are, there's still a lot of mystery, and a fair share of misconceptions, about how these large language models (LLMs) actually work. Are they reasoning? Understanding? Or just doing something that looks smart?
Let's unpack what's going on under the hood.
It's all about probabilities
At the center of an LLM is a very simple task: given a sequence of words, predict the next one. That's it. Of course I'm oversimplifying it, but you get the gist.
Everything you see (poetry, essays, code generation) is just the output of a model trained to say, "given this string of tokens, what's the most likely next token?"
Mathematically, this is framed as maximizing the conditional probability P(w_t | w_1, w_2, ..., w_{t-1}), where each w_t is a word or, as it is more commonly described, a token.
This conditional probability is computed by a network with billions of parameters optimized through gradient descent.
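To make that concrete, here is a minimal sketch of what "predict the next token" means, using a made-up four-word vocabulary and made-up scores; a real model computes these scores with billions of learned parameters over a vocabulary of tens of thousands of tokens.

```python
import numpy as np

# Made-up vocabulary and made-up raw scores (logits) that a model might
# assign to the next token after the context "the cat sat on the".
vocab = ["mat", "dog", "moon", "roof"]
logits = np.array([4.1, 1.3, 0.2, 2.7])

# Softmax turns raw scores into a probability distribution,
# i.e. P(w_t | w_1, ..., w_{t-1}) for each candidate token.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

for token, p in zip(vocab, probs):
    print(f"P({token} | 'the cat sat on the') = {p:.3f}")

# Generation is just picking from this distribution, here greedily.
print("most likely next token:", vocab[int(np.argmax(probs))])
```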
So how does a model even represent language this way?
Transformers changed everything
In reality, for these models, words aren't really understood as words. They're numbers, or more specifically, high-dimensional vectors in a learned space.
During training, the model learns embeddings (you've probably heard about them by now). An embedding is nothing more than a data structure that captures semantic relationships. For example, the vector difference between king and queen is surprisingly similar to that between man and woman.
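Here is a toy illustration of that analogy. The 3-dimensional vectors below are hand-picked for the example, not learned; real embeddings come out of training and typically have hundreds or thousands of dimensions.

```python
import numpy as np

# Hand-picked toy vectors purely for illustration; real embeddings
# are learned during training and are much higher-dimensional.
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The classic analogy: king - man + woman should land near queen.
target = emb["king"] - emb["man"] + emb["woman"]
for word, vec in emb.items():
    print(word, round(cosine(target, vec), 3))
```

With these toy vectors, king - man + woman ends up closest to queen, which is exactly the kind of relationship learned embeddings capture at scale.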
But embeddings were just the start. The real breakthrough when it comes to text generation came in 2017 with the Transformer architecture.
Why? The introduction of a mechanism called self-attention!
Unlike past architectures such as RNNs and LSTMs, which processed text sequentially, transformers can process the entire input at once. Every token can "attend" to every other token in the sequence and weigh its importance when deciding what to say next.
It looks abstract, but the idea is simple: each token decides how much to "pay attention" to every other token, and uses that to update its own representation. Stack dozens of these layers, add some normalization, skip connections, and feedforward networks, and you get the deep transformer network behind LLMs.
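For the curious, here is a stripped-down, single-head version of that attention step in NumPy, with random matrices standing in for the learned projections; real transformers use many heads, a causal mask so tokens only attend to earlier positions, plus the normalization and feedforward layers mentioned above.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Project the input into queries, keys, and values.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    # Each token scores every other token ("how relevant is it to me?"),
    # the scores become weights, and the weights mix the value vectors.
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)   # shape: (seq_len, seq_len)
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 8, 4
X = rng.normal(size=(seq_len, d_model))  # stand-in token embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
out, attn = self_attention(X, Wq, Wk, Wv)
print(attn.round(2))  # row i shows how much token i attends to each token
```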
LLMs: a game of scale
The magic that we see happening with LLMs is possible due to something as simple as scale. Training an LLM comes down to showing it billions of sequences and minimizing the cross-entropy between the predicted tokens and the original ones.
More importantly, the loss function must direct the model's training, meaning it should ask something like: how surprised were you by the real next token? With all these elements combined, over time, with enough data and compute power, the model gets very good at being "unsurprised", meaning it becomes very good at predicting what the next token is expected to be.
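That "surprise" has a precise form: the cross-entropy loss is the negative log of the probability the model assigned to the token that actually came next. A tiny sketch with made-up probabilities:

```python
import numpy as np

# Suppose the true next token has index 2 in a tiny 4-token vocabulary,
# and these are two made-up predicted distributions for illustration.
true_index = 2
confident_right = np.array([0.05, 0.05, 0.85, 0.05])
confident_wrong = np.array([0.80, 0.10, 0.05, 0.05])

def cross_entropy(probs, target):
    # "How surprised were you by the real next token?"
    # Low probability on the true token => large loss (big surprise).
    return -np.log(probs[target])

print(cross_entropy(confident_right, true_index))  # ~0.16, barely surprised
print(cross_entropy(confident_wrong, true_index))  # ~3.00, very surprised
```

Training nudges the parameters so that, averaged over billions of examples, this loss keeps shrinking.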
With this intel, you are probably wondering: so, are LLMs incapable of true interpretation and understanding? If you are, you're correct! That's one of their core limitations: they don't understand, they mimic. And mimicry, no matter how sophisticated, can only take you so far.
LLM abilities and boundaries
When you scale up transformers and feed them enough text, they start doing more than autocomplete. They can translate, reason, write code, and even (try to!) explain jokes. These are called emergent abilities: capabilities that weren't explicitly programmed but arise from the model's structure and training.
Yet these abilities are statistical, not logical. An LLM doesn't prove a theorem; it repeats patterns it has seen or inferred. Its "reasoning" is prediction, not real deduction. That's why other phenomena like hallucinations, poor long-term planning, and bias also happen:
- Hallucinations: they happen when a model confidently generates information that isn't true, and while they're a known limitation, strategies like retrieval-augmented generation and guardrails can help reduce them.
- Poor long-term planning: shows up when a model struggles to stay consistent or follow through on multi-step tasks, mainly because it lacks true memory or goal awareness, but techniques like external memory, state tracking, and planning frameworks can help.
- Bias: refers to the model reflecting or amplifying unfair assumptions or stereotypes found in its training data, and while it's a tough challenge, it can be mitigated through strategies like careful data filtering, preprocessing, or even augmenting with balanced synthetic data.
These aren't really bugs; they are consequences of the math!
So… are we close to AGI?
By now I hope I have convinced you that we are not! But just in case, let me throw in a couple more reasons why LLMs are not, yet, the path to AGI.
In reality, an LLM's knowledge is bounded by its training data, nothing more, nothing less. Even if it memorizes facts, it lacks the capacity to ground them in the real world unless paired with tools like search or retrieval-augmented generation (RAG). The math simply doesn't allow it.
The probability distributions it models are fundamentally approximate. The cross-entropy loss doesn't guarantee correctness, only plausibility. And since token prediction is done greedily or through sampling (like top-k or nucleus sampling), the generated text isn't deterministic; it's probabilistic.
Furthermore, transformer models have fixed context windows. Despite recent advances like Transformer-XL, Longformer, or RWKV, the model still operates within a finite number of tokens. It doesn't remember what you said last week. It doesn't know what it doesn't know.
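As a rough sketch of what those decoding strategies do, here they are applied to a made-up next-token distribution (not any particular model's output):

```python
import numpy as np

rng = np.random.default_rng(42)

def top_k_sample(probs, k=3):
    """Keep only the k most likely tokens, renormalize, then sample."""
    top = np.argsort(probs)[-k:]
    p = probs[top] / probs[top].sum()
    return int(rng.choice(top, p=p))

def nucleus_sample(probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1
    keep = order[:cutoff]
    q = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=q))

# Made-up next-token distribution over a 6-token vocabulary.
probs = np.array([0.40, 0.25, 0.15, 0.10, 0.06, 0.04])
print("greedy :", int(np.argmax(probs)))              # always token 0
print("top-k  :", [top_k_sample(probs) for _ in range(5)])
print("nucleus:", [nucleus_sample(probs) for _ in range(5)])
```

Greedy decoding always picks the same token; the sampling variants trade determinism for variety, which is why the same prompt can yield different answers.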
There are some really interesting talks out there about AGI and the mathematical limitations we're currently facing. My favorite, without a doubt, is one by Yann LeCun.
Where are we headed?
So where's all this going?
Researchers are already trying to stretch the math behind LLMs in new directions. Mixture-of-Experts models route different inputs through different parts of the network, kind of like calling in the right specialist for the job. Retrieval-Augmented Generation (RAG) gives models a way to pull in fresh information instead of relying only on what they memorized during training. And fine-tuning with RLHF (Reinforcement Learning from Human Feedback) helps steer responses toward what we, as humans, actually want to hear.
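To give a feel for the RAG idea in particular, here is a toy retrieval step over a hypothetical three-document corpus, using a crude word-overlap embedding as a stand-in; real systems use a learned embedding model and a vector database, then hand the retrieved text to the LLM as extra context.

```python
import numpy as np

# Hypothetical three-document "knowledge base"; in a real system these
# would be chunked, embedded by a learned model, and stored in a vector DB.
documents = [
    "The 2024 report shows revenue grew 12 percent year over year.",
    "Transformers process all tokens in a sequence in parallel.",
    "RLHF fine-tunes a model using human preference data.",
]

# Crude stand-in embedding: a bag-of-words vector over the corpus vocabulary.
vocab = sorted({w.lower().strip(".,?") for d in documents for w in d.split()})

def embed(text):
    words = {w.lower().strip(".,?") for w in text.split()}
    return np.array([1.0 if w in words else 0.0 for w in vocab])

def retrieve(query, k=1):
    q = embed(query)
    doc_vecs = [embed(d) for d in documents]
    scores = [float(q @ v) / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-9)
              for v in doc_vecs]
    best = np.argsort(scores)[::-1][:k]
    return [documents[i] for i in best]

query = "How did revenue change in 2024?"
context = retrieve(query)[0]
# The retrieved passage is prepended to the prompt so the model grounds
# its answer in retrieved information instead of relying on memorization.
prompt = f"Context: {context}\n\nQuestion: {query}\nAnswer:"
print(prompt)
```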
All cool stuff. But at the end of the day, these are still patches on top of the same basic engine: a really powerful guessing machine. We're not at the stage of true reasoning or understanding yet. And that's okay.
Foundational models are still incredible. They're probably the most versatile tools we've ever trained. But once you understand the math, you see them more clearly for what they are: massive statistical machines, predicting one token at a time based on patterns they've seen before.
Doesn't make them any less impressive. Just means we should use them with a bit more perspective. The future of AI isn't just about making models bigger; it's about making them smarter, more reliable, and grounded in how the world actually works.
Published via Towards AI