How ‘It’ Learned to Mean ‘Cat’ — A Journey Through Attention in Transformers
Author(s): Ashwin Biju Alikkal
Originally published on Towards AI.
When I first started learning how machines understand language, I was honestly a bit amused. How could something as deep and emotional as human language be reduced to just counting words or giving every word a fixed meaning?
That curiosity pulled me into the fascinating history of language models. In this blog, I’ll take you on a short time-travel journey, from the early days of Bag-of-Words and Word2Vec to the moment attention changed everything. Along the way, we’ll break down the core idea behind attention and work through a simple math example.
Let’s dive in!
This will be the structure that we will be following:
- A Brief History of Attention
- What Is Attention in Transformers?
- Mathematical Explanation of Attention
- Mathematical Example for the Above
- Why Attention Matters in Modern AI
- Conclusion
- Source
A Brief History of Attention
In the early days of language modelling, we treated text like a simple bag of words: we counted word occurrences and ignored order and context. While this bag-of-words approach helped models start “reading,” it couldn’t tell the difference between “dog bites man” and “man bites dog.” The model needed a better grasp of grammar and context.
Then came Word2Vec, a breakthrough that gave each word a dense vector representation based on its surrounding words. Suddenly, words like king and queen or Paris and France began to live near each other in vector space. But even this had a big flaw: it assigned the same meaning to a word in every context. Whether you meant a river bank or a money bank, Word2Vec would treat bank the same.
That’s when attention changed the game. Instead of giving each word a fixed meaning, attention mechanisms allowed models to dynamically focus on the most relevant parts of a sentence. From this concept, the Transformer architecture was born, and it became the foundation of the era of GPTs, BERTs, and beyond.

What Is Attention in Transformers?
Read the sentence below:
“The animal didn’t cross the street because it was too tired.”
Now ask yourself: what does “it” refer to? Your brain probably zoomed in on “the animal” and ignored the rest while figuring it out.
That’s exactly what attention does in a Transformer model. It helps the model decide which words to focus on while processing each word.
In technical terms, attention is a mechanism that allows the model to assign different levels of importance (weights) to different words in a sentence. So when trying to understand or generate a word, the model doesn’t just treat every other word equally; it learns to “pay attention” to the ones that matter most in that context. (Here, “it” refers to “the animal”.)
Instead of reading left to right or word by word like older models, Transformers use self-attention to look at all words at once and understand their relationships. That’s what makes them so powerful — and why they’ve taken over the world of language models.

(Note: There is a self-attention mechanism in the decoder as well)
There are two key aspects involved in the attention mechanism:
- Scoring Relevance: First, the model calculates how relevant each word is to the one it’s currently processing. For example, in the sentence “The cat sat on the mat,” when predicting the word “mat,” the model might find “sat” and “on” more relevant than “the”. This relevance is scored using a mathematical function involving something called queries, keys, and values (mathematics coming up in the next section! Be ready.)
- Combining Information: Once it knows what to pay attention to, the model combines those relevant words into a new, meaningful representation. It’s like making a smoothie where you blend just the right ingredients (here in this case, words) based on how important they are for the current context.
But here is where things get cool!
To allow the model to focus on different patterns in the sentence at once, this attention process is run multiple times in parallel. Each run is called an attention head. So, one head might focus on subject-verb pairs (“cat” -> “sat”), another on spatial context (“on” -> “mat”), and yet another on articles or prepositions.
All these heads work together to give the model a much more robust, more flexible understanding of the input.
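To make “multiple heads in parallel” concrete, here is a minimal NumPy sketch (my own illustration, not code from the original paper): each head gets its own projection matrices, runs the same attention computation, and the outputs are concatenated and mixed by a final projection. The random matrices simply stand in for learned parameters.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(X, num_heads, head_dim, rng):
    """Run the attention computation num_heads times in parallel and merge the results.

    X has shape (seq_len, d_model). The projection matrices here are random
    placeholders; in a trained Transformer they are learned parameters."""
    d_model = X.shape[1]
    head_outputs = []
    for _ in range(num_heads):
        # Each head gets its own Q/K/V projections, so it can focus on its own pattern.
        W_Q = rng.normal(size=(d_model, head_dim))
        W_K = rng.normal(size=(d_model, head_dim))
        W_V = rng.normal(size=(d_model, head_dim))
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        weights = softmax(Q @ K.T / np.sqrt(head_dim))  # (seq_len, seq_len) relevance scores
        head_outputs.append(weights @ V)                # (seq_len, head_dim)
    # Concatenate every head's output and mix them with one final projection.
    concat = np.concatenate(head_outputs, axis=-1)      # (seq_len, num_heads * head_dim)
    W_O = rng.normal(size=(num_heads * head_dim, d_model))
    return concat @ W_O                                 # back to (seq_len, d_model)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 512))                           # 6 tokens, 512-dim embeddings
print(multi_head_self_attention(X, num_heads=8, head_dim=64, rng=rng).shape)  # (6, 512)
```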
But what happens after the model has finished “paying attention”? It passes the updated information through a feedforward neural network. This part is like the model’s thinking and memory center. It helps the model store patterns it has seen during training and generalise them to new inputs.
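As a rough sketch of that feed-forward block, here is the position-wise design from “Attention Is All You Need”: two linear layers with a ReLU in between, applied to each token’s vector independently. The random weights below are placeholders for learned parameters.

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    # Position-wise feed-forward block: expand each token's vector,
    # apply a ReLU non-linearity, then project back to the model dimension.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 512, 2048                      # the sizes used in "Attention Is All You Need"
x = rng.normal(size=(6, d_model))              # attention output for 6 tokens
out = feed_forward(x,
                   rng.normal(size=(d_model, d_ff)), np.zeros(d_ff),
                   rng.normal(size=(d_ff, d_model)), np.zeros(d_model))
print(out.shape)                               # (6, 512)
```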
Mathematical Explanation of Attention
Okay, now that we get the intuition behind attention, let’s peek under the hood.
Every time a Transformer processes a word (or token), it tries to update that word’s vector representation by looking at all the other words in the sequence and figuring out what’s relevant.
To do this, the attention mechanism takes in:
- Vector for the current token
- Vectors for all previous tokens
The goal? To produce a new vector for the current token that incorporates helpful context from the others.
Now, before the transformer can compute attention, it needs to transform the input vectors into three different forms — Queries, Keys and Values. But how? This transformation happens using three learned projection matrices:
W_Q → Query projection matrix
W_K → Key projection matrix
W_V → Value projection matrix
Each of these matrices is trained during the model’s learning phase, and all three have the same shape: if the input vector is of size, say, 512, and we want to project it down to a smaller size, say, 64, then W_Q, W_K, and W_V are each of shape (512 x 64).
So for each input token vector X, we compute:
Q (Queries) = X * W_Q: Captures what the current token wants to “know”
K (Keys) = X * W_K: Encodes what each token has to “offer”
V (Values) = X * W_V : Holds the actual information to be pulled from tokens if they’re relevant
These projections give us new versions of the input tokens, each tuned to its role in the attention process. Now we apply the two steps introduced earlier: scoring relevance and combining information.
- Scoring Relevance: The relevance scoring step of attention multiplies the query vector of the current token with the keys (K) matrix. This produces a score stating how relevant each previous token is (in the full formula, the scores are also divided by the square root of the key dimension). Passing the scores through a softmax normalizes them so they sum up to 1.
- Combining Information: Now that we have the relevance scores, we multiply the value vector associated with each token by that token’s score. Summing the resulting vectors produces the output of this attention step. (Both steps are sketched in code right after this list.)
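Putting the two steps together, here is a minimal NumPy sketch of a single attention step; the shapes follow the 512 -> 64 example above, and the random matrices stand in for the learned W_Q, W_K, and W_V.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Single-head self-attention over token embeddings X of shape (seq_len, d_model)."""
    Q = X @ W_Q                                # what each token wants to "know"
    K = X @ W_K                                # what each token has to "offer"
    V = X @ W_V                                # the information each token carries
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # relevance of every token to every other token
    weights = softmax(scores)                  # each row now sums to 1
    return weights @ V                         # weighted blend of the value vectors

# Toy usage: 6 tokens with 512-dim embeddings, projected down to 64 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 512))
W_Q, W_K, W_V = (rng.normal(size=(512, 64)) for _ in range(3))
print(self_attention(X, W_Q, W_K, W_V).shape)  # (6, 64)
```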
Mathematical Example for the Above
Now let’s look at an example for the above. Consider the input, “Sarah fed the cat because it”.
(Note: We are using a single-head self-attention layer with size = 4 so that every number is easy to follow. Real model dimensions are much larger, often 768 or more, but the same algebra scales to models with hundreds or thousands of dimensions.)

Each token in the sentence is represented by a vector. These vectors are called token embeddings and are the numerical form in which the model can understand and operate on words.
(Please note that the vectors above are artificially small (4 dimensions) for simplicity; in real models, each word might be represented by a 300-, 768-, or even higher-dimensional vector.)

In transformers, W_Q, W_K, and W_V are learned matrices. They are optimized during training to help the model learn relationships between tokens.
In this walkthrough, however, we use small hand-picked values to keep the calculations easy and interpretable.
Now, for each token embedding x_i, we compute
q_i = x_i * W_Q
k_i = x_i * W_K
v_i = x_i * W_V

and finally we can create the matrices for Queries (Q), Keys (K) and Values (V).
Now comes the interesting part where we will compute the raw scores using the formula below:

Looking at the scores above, k_4, i.e. “cat”, gets the largest score; we hope the model will resolve “it” to “cat”.


This 4-D vector is the new representation of “it” after attention. Because the attention weight for “cat” was the largest, “cat” contributes the most, so “it” now carries cat-related information.
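If you want to reproduce the walkthrough yourself, here is a small NumPy sketch of the same computation with hypothetical 4-dimensional embeddings (illustrative values of my own, not the exact numbers used above); the only property that matters is that “cat” and “it” point in similar directions, so “cat” ends up with the largest attention weight.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

tokens = ["Sarah", "fed", "the", "cat", "because", "it"]

# Hypothetical 4-dimensional embeddings, chosen only so the arithmetic is easy to follow.
# "cat" and "it" deliberately point in similar directions, so attention should link them.
X = np.array([
    [1.0, 0.0, 0.0, 0.0],   # Sarah
    [0.0, 1.0, 0.0, 0.0],   # fed
    [0.0, 0.0, 0.2, 0.0],   # the
    [0.0, 0.0, 1.0, 1.0],   # cat
    [0.0, 0.2, 0.0, 0.0],   # because
    [0.0, 0.0, 0.8, 1.0],   # it
])

# Identity projections keep the example readable: q_i = k_i = v_i = x_i.
W_Q = W_K = W_V = np.eye(4)
Q, K, V = X @ W_Q, X @ W_K, X @ W_V

q_it = Q[-1]                           # query vector for the token "it"
scores = K @ q_it / np.sqrt(4)         # raw relevance of every token to "it"
alpha = softmax(scores)                # normalized attention weights (sum to 1)
for tok, a in zip(tokens, alpha):
    print(f"{tok:8s} {a:.2f}")         # "cat" gets the largest weight

new_it = alpha @ V                     # the new, context-aware representation of "it"
print("updated 'it' vector:", np.round(new_it, 2))
```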
Why Attention Matters in Modern AI
Since the Transformer architecture was first introduced, researchers have been constantly trying to improve it, and guess where most of their attention goes? The attention layer itself. That’s because it’s not just the heart of the model, it’s also the most computationally heavy part.
Over time, researchers came up with all sorts of new attention mechanisms to make things faster and smarter, like Local/Sparse Attention, Multi-Query and Grouped-Query Attention, and Flash Attention. Each of these was an attempt to keep the magic of attention while cutting down on the heavy computation. (I will cover these in subsequent posts.)
Conclusion
It’s wild to think that the ability of machines to focus, something so human, has completely transformed how they read, translate, write, and even create art. Attention isn’t just a technical concept; it’s the spark that made models like GPT and BERT possible.
Attention is one of the most important breakthroughs in modern AI. It gave models the ability to focus on what actually matters in a sentence, which made understanding and generating language far more accurate.
If you’re learning about transformers or working with language models, my advice is this: take time to really understand attention. It’s the core of everything — from GPT to BERT to the latest AI tools. Once you get how it works, the rest of the architecture becomes much easier to follow.
Source
A big thank you to Jay Alammar and Maarten Grootendorst for their brilliant explanations in Hands-On Large Language Models, which helped me understand and simplify these concepts.
Also, a special thanks to the legendary paper “Attention Is All You Need” by Vaswani et al., which started it all. If you haven’t read it yet, it’s worth checking out!
Published via Towards AI