How ‘It’ Learned to Mean ‘Cat’ — A Journey Through Attention in Transformers
Author(s): Ashwin Biju Alikkal
Originally published on Towards AI.
When I first started learning how machines understand language, I was honestly a bit amused. How could something as deep and emotional as human language be reduced to just counting words or giving every word a fixed meaning?
That curiosity pulled me into the fascinating history of language models. In this blog, I’ll take you on a short time-travel journey, from the early days of Bag-of-Words and Word2Vec to the moment attention changed everything. Along the way, we’ll break down the core idea behind attention and work through a simple math example.
Let’s dive in!
This will be the structure that we will be following:
- A Brief History of Attention
- What Is Attention in Transformers?
- Mathematical Explanation of Attention
- Mathematical Example for the Above
- Why Attention Matters in Modern AI
- Conclusion
- Source
A Brief History of Attention
In the early days of language modelling, we treated text like a simple bag of words: we counted word occurrences and ignored order and context. While this bag-of-words approach helped models start “reading,” it couldn’t tell the difference between “dog bites man” and “man bites dog.” The model needed a better grasp of grammar and context.
Then came Word2Vec, a breakthrough that gave each word a dense vector representation based on its surrounding words. Suddenly, words like king and queen or Paris and France began to live near each other in vector space. But even this had a big flaw: it assigned the same meaning to a word in every context. Whether you meant a river bank or a money bank, Word2Vec would treat bank the same.
That’s when attention changed the game. Instead of giving each word a fixed meaning, attention mechanisms allowed models to dynamically focus on the most relevant parts of a sentence. From this concept, the Transformer architecture was born, and it became the foundation of the era of GPTs, BERTs, and beyond.

What Is Attention in Transformers?
Read the sentence below:
“The animal didn’t cross the street because it was too tired.”
Now ask yourself: what does “it” refer to? Your brain probably zoomed in on “the animal” and ignored the rest while figuring it out.
That’s exactly what attention does in a Transformer model. It helps the model decide which words to focus on while processing each word.
In technical terms, attention is a mechanism that allows the model to assign different levels of importance (weights) to different words in a sentence. So when trying to understand or generate a word, the model doesn’t just treat every other word equally; it learns to “pay attention” to the ones that matter most in that context. (Here, “it” refers to “the animal”.)
Instead of reading left to right or word by word like older models, Transformers use self-attention to look at all words at once and understand their relationships. That’s what makes them so powerful — and why they’ve taken over the world of language models.

(Note: There is a self-attention mechanism in the decoder as well)
There are two key aspects involved in the attention mechanism:
- Scoring Relevance: First, the model calculates how relevant each word is to the one it’s currently processing. For example, in the sentence “The cat sat on the mat,” when predicting the word “mat,” the model might find “sat” and “on” more relevant than “the”. This relevance is scored using a mathematical function involving something called queries, keys, and values (mathematics coming up in the next section! Be ready.)
- Combining Information: Once it knows what to pay attention to, the model combines those relevant words into a new, meaningful representation. It’s like making a smoothie where you blend just the right ingredients (here in this case, words) based on how important they are for the current context.
But here is where things get cool!
To allow the model to focus on different patterns in the sentence at once, this attention process is run multiple times in parallel. Each run is called an attention head. So, one head might focus on subject-verb pairs (“cat” -> “sat”), another on spatial context (“on” -> “mat”), and yet another on articles or prepositions.
All these heads work together to give the model a much more robust, more flexible understanding of the input.
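To make “multiple heads in parallel” concrete, here is a minimal NumPy sketch (my own illustration, not code from the original paper): each head gets its own projection matrices, runs the same attention computation, and the outputs are concatenated and mixed by a final projection. The random matrices simply stand in for learned parameters.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(X, num_heads, head_dim, rng):
    """Run the attention computation num_heads times in parallel and merge the results.

    X has shape (seq_len, d_model). The projection matrices here are random
    placeholders; in a trained Transformer they are learned parameters."""
    d_model = X.shape[1]
    head_outputs = []
    for _ in range(num_heads):
        # Each head gets its own Q/K/V projections, so it can focus on its own pattern.
        W_Q = rng.normal(size=(d_model, head_dim))
        W_K = rng.normal(size=(d_model, head_dim))
        W_V = rng.normal(size=(d_model, head_dim))
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        weights = softmax(Q @ K.T / np.sqrt(head_dim))  # (seq_len, seq_len) relevance scores
        head_outputs.append(weights @ V)                # (seq_len, head_dim)
    # Concatenate every head's output and mix them with one final projection.
    concat = np.concatenate(head_outputs, axis=-1)      # (seq_len, num_heads * head_dim)
    W_O = rng.normal(size=(num_heads * head_dim, d_model))
    return concat @ W_O                                 # back to (seq_len, d_model)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 512))                           # 6 tokens, 512-dim embeddings
print(multi_head_self_attention(X, num_heads=8, head_dim=64, rng=rng).shape)  # (6, 512)
```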
But what happens after the model has finished “paying attention”? It passes the updated information through a feedforward neural network. This part is like the model’s thinking and memory center. It helps the model store patterns it has seen during training and generalise them to new inputs.
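As a rough sketch of that feed-forward block, here is the position-wise design from “Attention Is All You Need”: two linear layers with a ReLU in between, applied to each token’s vector independently. The random weights below are placeholders for learned parameters.

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    # Position-wise feed-forward block: expand each token's vector,
    # apply a ReLU non-linearity, then project back to the model dimension.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 512, 2048                      # the sizes used in "Attention Is All You Need"
x = rng.normal(size=(6, d_model))              # attention output for 6 tokens
out = feed_forward(x,
                   rng.normal(size=(d_model, d_ff)), np.zeros(d_ff),
                   rng.normal(size=(d_ff, d_model)), np.zeros(d_model))
print(out.shape)                               # (6, 512)
```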
Mathematical Explanation of Attention
Okay, now that we get the intuition behind attention, let’s peek under the hood.
Every time a Transformer processes a word (or token), it tries to update that word’s vector representation by looking at all the other words in the sequence and figuring out what’s relevant.
To do this, the attention mechanism takes in:
- Vector for the current token
- Vectors for all previous tokens
The goal? To produce a new vector for the current token that incorporates helpful context from the others.
Now, before the transformer can compute attention, it needs to transform the input vectors into three different forms — Queries, Keys and Values. But how? This transformation happens using three learned projection matrices:
W_Q → Query projection matrix
W_K → Key projection matrix
W_V → Value projection matrix
Each of these matrices is trained during the model’s learning phase, and all three have the same shape: if the input vector is of size, say, 512, and we want to project it down to a smaller size, say, 64, then W_Q, W_K, and W_V are each of shape (512 x 64).
So for each input token vector X, we compute:
Q (Queries) = X * W_Q: Captures what the current token wants to “know”
K (Keys) = X * W_K: Encodes what each token has to “offer”
V (Values) = X * W_V : Holds the actual information to be pulled from tokens if they’re relevant
These projections give us new versions of the input tokens, each tuned to its role in the attention process. Now we apply the two steps introduced earlier: scoring relevance and combining information.
- Scoring Relevance: The relevance scoring step of attention multiplies the query vector of the current token with the keys (K) matrix. This produces a score stating how relevant each previous token is (in the full formula, the scores are also divided by the square root of the key dimension). Passing the scores through a softmax normalizes them so they sum up to 1.
- Combining Information: Now that we have the relevance scores, we multiply the value vector associated with each token by that token’s score. Summing the resulting vectors produces the output of this attention step. (Both steps are sketched in code right after this list.)
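Putting the two steps together, here is a minimal NumPy sketch of a single attention step; the shapes follow the 512 -> 64 example above, and the random matrices stand in for the learned W_Q, W_K, and W_V.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Single-head self-attention over token embeddings X of shape (seq_len, d_model)."""
    Q = X @ W_Q                                # what each token wants to "know"
    K = X @ W_K                                # what each token has to "offer"
    V = X @ W_V                                # the information each token carries
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # relevance of every token to every other token
    weights = softmax(scores)                  # each row now sums to 1
    return weights @ V                         # weighted blend of the value vectors

# Toy usage: 6 tokens with 512-dim embeddings, projected down to 64 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 512))
W_Q, W_K, W_V = (rng.normal(size=(512, 64)) for _ in range(3))
print(self_attention(X, W_Q, W_K, W_V).shape)  # (6, 64)
```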
Mathematical Example for the Above
Now let’s look at an example for the above. Consider the input, “Sarah fed the cat because it”.
(Note: We are using a single-head self-attention layer with size = 4 so that every number is easy to follow. Real model dimensions are much larger, often 768 or more, but the same algebra scales to models with hundreds or thousands of dimensions.)

Each token in the sentence is represented by a vector. These vectors are called token embeddings and are the numerical form in which the model can understand and operate on words.
(Please note that the vectors above are artificially small (4 dimensions) for simplicity; in real models, each word might be represented by a 300-, 768-, or even higher-dimensional vector.)

In transformers, W_Q, W_K, and W_V are learned matrices. They are optimized during training to help the model learn relationships between tokens.
In this walkthrough, however, we use small hand-picked values to keep the calculations easy and interpretable.
Now, for each token embedding x_i, we compute
q_i = x_i * W_Q
k_i = x_i * W_K
v_i = x_i * W_V

and finally we can create the matrices for Queries (Q), Keys (K) and Values (V).
Now comes the interesting part where we will compute the raw scores using the formula below:

Looking at the scores above, k_4, i.e. “cat”, gets the largest score; we hope the model will resolve “it” to “cat”.


This 4-D vector is the new representation of “it” after attention. Because the attention weight for “cat” was the largest, “cat” contributes the most, so “it” now carries cat-related information.
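If you want to reproduce the walkthrough yourself, here is a small NumPy sketch of the same computation with hypothetical 4-dimensional embeddings (illustrative values of my own, not the exact numbers used above); the only property that matters is that “cat” and “it” point in similar directions, so “cat” ends up with the largest attention weight.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

tokens = ["Sarah", "fed", "the", "cat", "because", "it"]

# Hypothetical 4-dimensional embeddings, chosen only so the arithmetic is easy to follow.
# "cat" and "it" deliberately point in similar directions, so attention should link them.
X = np.array([
    [1.0, 0.0, 0.0, 0.0],   # Sarah
    [0.0, 1.0, 0.0, 0.0],   # fed
    [0.0, 0.0, 0.2, 0.0],   # the
    [0.0, 0.0, 1.0, 1.0],   # cat
    [0.0, 0.2, 0.0, 0.0],   # because
    [0.0, 0.0, 0.8, 1.0],   # it
])

# Identity projections keep the example readable: q_i = k_i = v_i = x_i.
W_Q = W_K = W_V = np.eye(4)
Q, K, V = X @ W_Q, X @ W_K, X @ W_V

q_it = Q[-1]                           # query vector for the token "it"
scores = K @ q_it / np.sqrt(4)         # raw relevance of every token to "it"
alpha = softmax(scores)                # normalized attention weights (sum to 1)
for tok, a in zip(tokens, alpha):
    print(f"{tok:8s} {a:.2f}")         # "cat" gets the largest weight

new_it = alpha @ V                     # the new, context-aware representation of "it"
print("updated 'it' vector:", np.round(new_it, 2))
```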
Why Attention Matters in Modern AI
Since the Transformer architecture was first introduced, researchers have been constantly trying to improve it, and guess where most of their attention goes? The attention layer itself. That’s because it’s not just the heart of the model, it’s also the most computationally heavy part.
Over time, researchers came up with all sorts of new attention mechanisms to make things faster and smarter, like Local/Sparse Attention, Multi-Query and Grouped-Query Attention, and Flash Attention. Each of these was an attempt to keep the magic of attention while cutting down on the heavy computation. (I will cover these in subsequent posts.)
Conclusion
It’s wild to think that the ability of machines to focus, something so human, has completely transformed how they read, translate, write, and even create art. Attention isn’t just a technical concept; it’s the spark that made models like GPT and BERT possible.
Attention is one of the most important breakthroughs in modern AI. It gave models the ability to focus on what actually matters in a sentence, which made understanding and generating language far more accurate.
If you’re learning about transformers or working with language models, my advice is this: take time to really understand attention. It’s the core of everything — from GPT to BERT to the latest AI tools. Once you get how it works, the rest of the architecture becomes much easier to follow.
Source
A big thank you to Jay Alammar and Maarten Grootendorst for their brilliant explanations in Hands-On Large Language Models, which helped me understand and simplify these concepts.
Also, a special thanks to the legendary paper “Attention Is All You Need” by Vaswani et al., which started it all. If you haven’t read it yet, it’s worth checking out!
Published via Towards AI