
How ‘It’ Learned to Mean ‘Cat’ — A Journey Through Attention in Transformers

Author(s): Ashwin Biju Alikkal

Originally published on Towards AI.


When I first started learning how machines understand language, I was honestly a bit amused. How could something as deep and emotional as human language be reduced to just counting words or giving every word a fixed meaning?

That curiosity pulled me into the fascinating history of language models. In this blog, I’ll take you on a short time-travel journey: from the early days of Bag-of-Words and Word2Vec to the moment attention changed everything. Along the way, we’ll break down the core idea behind attention and work through a simple math example.

Let’s dive in!

This will be the structure that we will be following:

  • A Brief History of Attention
  • What Is Attention in Transformers?
  • Mathematical Explanation of Attention
  • Mathematical Example for the Above
  • Why Attention Matters in Modern AI
  • Conclusion
  • Source

A Brief History of Attention

In the early days of language modelling, we treated text like a simple bag of words: we counted word occurrences and ignored order and context. While this bag-of-words approach helped models start “reading,” it couldn’t tell the difference between “dog bites man” and “man bites dog.” We needed models with more understanding of grammar and context.
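To make that limitation concrete, here is a minimal sketch in plain Python (my own illustration of the idea): both sentences yield identical word counts, so a bag-of-words model literally cannot tell them apart.

from collections import Counter

# Bag-of-words: count word occurrences, ignore order and context.
def bag_of_words(sentence):
    return Counter(sentence.lower().split())

print(bag_of_words("dog bites man"))   # Counter({'dog': 1, 'bites': 1, 'man': 1})
print(bag_of_words("man bites dog"))   # Counter({'man': 1, 'bites': 1, 'dog': 1})

# The two representations are equal, so word order is invisible to the model.
print(bag_of_words("dog bites man") == bag_of_words("man bites dog"))  # True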

Then came Word2Vec, a breakthrough that gave each word a dense vector representation based on its surrounding words. Suddenly, words like king and queen or Paris and France began to live near each other in vector space. But even this had a big flaw: it assigned the same meaning to a word in every context. Whether you meant a river bank or a money bank, Word2Vec would treat bank the same.
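A toy illustration of that flaw (the vectors below are made up; real Word2Vec vectors have hundreds of dimensions): a static embedding table maps each word to exactly one vector, so “bank” looks identical in every context.

# A static embedding table: one fixed vector per word, whatever the context.
# (Three made-up dimensions purely for illustration.)
embeddings = {
    "river": [0.19, -0.50, 0.91],
    "money": [0.75,  0.33, -0.12],
    "bank":  [0.21, -0.47, 0.88],
}

# "river bank" and "money bank" both look up the same entry.
print(embeddings["bank"])  # identical vector in both sentences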

That’s when attention changed the game. Instead of giving each word a fixed meaning, attention mechanisms allowed models to dynamically focus on the most relevant parts of a sentence. And from this concept the Transformer architecture was born, becoming the foundation of the era of GPT, BERT, and beyond.

A peek into the history of Language AI (Source: https://www.oreilly.com/library/view/hands-on-large-language/9781098150952/ch01.html)

What Is Attention in Transformers?

Read the sentence below:

“The animal didn’t cross the street because it was too tired.”

Now ask yourself: what does “it” refer to? Your brain probably zoomed in on “the animal” and ignored the rest while figuring it out.

That’s exactly what attention does in a Transformer model. It helps the model decide which words to focus on while processing each word.

In technical terms, attention is a mechanism that allows the model to assign different levels of importance (weights) to different words in a sentence. So when trying to understand or generate a word, the model doesn’t just treat every other word equally; it learns to “pay attention” to the ones that matter most in that context. (Here, for instance, “it” refers to “the animal”.)

Instead of reading left to right or word by word like older models, Transformers use self-attention to look at all words at once and understand their relationships. That’s what makes them so powerful — and why they’ve taken over the world of language models.

Transformer translates “I love llamas” to Dutch using self-attention and encoder-decoder blocks (Source: https://www.oreilly.com/library/view/hands-on-large-language/9781098150952/ch01.html)


(Note: There is a self-attention mechanism in the decoder as well)

There are two key aspects involved in the attention mechanism:

  1. Scoring Relevance: First, the model calculates how relevant each word is to the one it’s currently processing. For example, in the sentence “The cat sat on the mat,” when predicting the word “mat,” the model might find “sat” and “on” more relevant than “the”. This relevance is scored using a mathematical function involving something called queries, keys, and values (mathematics coming up in the next section, so be ready!).
  2. Combining Information: Once it knows what to pay attention to, the model combines those relevant words into a new, meaningful representation. It’s like making a smoothie: you blend just the right ingredients (in this case, words) based on how important they are for the current context.

But here is where things get cool!

To allow the model to focus on different patterns in the sentence at once, this attention process is run multiple times in parallel. Each run is called an attention head. So, one head might focus on subject-verb pairs (“cat” → “sat”), another on spatial context (“on” → “mat”), and yet another on articles or prepositions.

All these heads work together to give the model a much more robust, more flexible understanding of the input.
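Here is a rough numpy sketch of the multi-head idea (shapes only, with random matrices standing in for learned ones; the Q, K, V projections it uses are explained in the next section, and real implementations add a learned output projection):

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d_model, n_heads = 6, 8, 2
d_head = d_model // n_heads             # each head works on a smaller slice
X = np.random.randn(seq_len, d_model)   # one vector per token

heads = []
for _ in range(n_heads):
    # Each head has its own learned projections (random stand-ins here),
    # so each head is free to learn a different pattern.
    W_Q, W_K, W_V = (np.random.randn(d_model, d_head) for _ in range(3))
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    weights = softmax(Q @ K.T / np.sqrt(d_head))
    heads.append(weights @ V)

# Concatenating the heads restores the full model dimension.
output = np.concatenate(heads, axis=-1)  # shape: (seq_len, d_model)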

But what happens after the model has finished “paying attention”? It passes the updated information through a feedforward neural network. This part is like the model’s thinking and memory center. It helps the model store patterns it has seen during training and generalise them to new inputs.
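As a minimal sketch, that feedforward block is typically two linear layers with a nonlinearity in between, applied to each token’s vector independently (the sizes below are the ones from the original Transformer paper; the random weights stand in for learned ones):

import numpy as np

d_model, d_ff = 512, 2048   # sizes used in "Attention Is All You Need"
W1, b1 = np.random.randn(d_model, d_ff) * 0.02, np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model) * 0.02, np.zeros(d_model)

def feed_forward(x):
    # Expand to a wider space, apply ReLU, project back down.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

token = np.random.randn(d_model)   # a token's vector after attention
updated = feed_forward(token)      # same shape, refined by learned patterns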

Mathematical Explanation of Attention

Okay, now that we get the intuition behind attention, let’s peek under the hood.

Every time a Transformer processes a word (or token), it tries to update that word’s vector representation by looking at all the other words in the sequence and figuring out what’s relevant.

To do this, the attention mechanism takes in:

  • Vector for the current token
  • Vectors for all previous tokens

The goal? To produce a new vector for the current token that incorporates helpful context from the others.

Now, before the transformer can compute attention, it needs to transform the input vectors into three different forms — Queries, Keys and Values. But how? This transformation happens using three learned projection matrices:

W_Q → Query projection matrix

W_K → Key projection matrix

W_V → Value projection matrix

Each of these matrices is learned during the model’s training phase, and all three have the same shape: if the input vector has size, say, 512, and we want to project it down to a smaller size, say 64, then W_Q, W_K, and W_V each have shape (512 x 64).

So for each input token vector X, we compute:

Q (Queries) = X * W_Q: Captures what the current token wants to “know”

K (Keys) = X * W_K: Encodes what each token has to “offer”

V (Values) = X * W_V: Holds the actual information to be pulled from tokens if they’re relevant
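In numpy, with the shapes from the text (512-dimensional inputs projected down to 64), the projections look like this (random matrices stand in for the learned ones):

import numpy as np

seq_len, d_model, d_k = 6, 512, 64
X = np.random.randn(seq_len, d_model)   # one vector per input token

# Learned during training; random here only to show the shapes.
W_Q = np.random.randn(d_model, d_k)
W_K = np.random.randn(d_model, d_k)
W_V = np.random.randn(d_model, d_k)

Q = X @ W_Q   # what each token wants to know      (6 x 64)
K = X @ W_K   # what each token has to offer       (6 x 64)
V = X @ W_V   # the information tokens can provide (6 x 64)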

These projections give us new versions of the input tokens, each tuned to its role in the attention process. From here, attention proceeds in the two steps introduced earlier: scoring relevance and combining information.

  1. Scoring Relevance: The relevance-scoring step is conducted by multiplying the query vector of the current token with the keys (K) matrix. This produces a score stating how relevant each previous token is. Passing these scores through a softmax normalizes them so they sum up to 1.
  2. Combining Information: Now that we have the relevance scores, we multiply the value vector associated with each token by that token’s score. Summing up those resulting vectors produces the output of this attention step.
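Continuing the numpy sketch above, the two steps together form a minimal single-head attention function:

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Step 1 (scoring relevance): dot every query with every key, scale by
    # sqrt(d_k), and softmax so each row of weights sums to 1.
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    # Step 2 (combining information): weighted sum of the value vectors.
    return weights @ V

new_representations = attention(Q, K, V)  # one updated vector per token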

Mathematical Example for the Above

Now let’s work through a concrete example. Consider the input, “Sarah fed the cat because it”.

(Note: We are using a single-head self-attention layer with model dimension 4 so that every number is easy to follow. Real models typically use much larger dimensions, such as 768, but the same algebra scales to hundreds or thousands of dimensions.)

Token embeddings

Each token in the sentence is represented by a vector. These vectors are called token embeddings and are the numerical form in which the model can understand and operate on words.

(Please note that the vectors in this example are artificially small (4-dimensional) for simplicity; in real models, each word might be represented by a 300- or 768-dimensional vector.)

In transformers, W_Q, W_K, and W_V are learned matrices. They are optimized during training to help the model learn relationships between tokens.

In this setup, we use simple, specific values so the calculations stay easy and interpretable.

Now, for each token embedding x_i,

q_i = x_i * W_Q

k_i = x_i * W_K

v_i = x_i * W_V

and finally we can create the matrices for Queries (Q), Keys (K) and Values (V).

Now comes the interesting part, where we compute the raw scores. The standard scaled dot-product formula scores the current token’s query q against each key k_j, then softmaxes the results into attention weights:

score_j = (q · k_j) / sqrt(d_k),   alpha = softmax(score)

Working through these scores, k_4, i.e. “cat”, gets the largest score. We hope the model will resolve “it” to “cat”.

Now this 4-D vector is the new representation of “it” after attention. Because alpha_4 was largest, “cat” contributes most, so “it” now carries “cat”-related information.
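The article’s original figures with the concrete numbers are not reproduced here, so the sketch below uses made-up embeddings and identity projection matrices purely to run the same computation end to end; with these illustrative values, “cat” indeed receives the largest weight for the query “it”:

import numpy as np

tokens = ["Sarah", "fed", "the", "cat", "because", "it"]

# Made-up 4-dimensional embeddings, chosen only for easy hand arithmetic.
X = np.array([
    [1.0, 0.0, 1.0, 0.0],   # Sarah
    [0.0, 1.0, 0.0, 1.0],   # fed
    [0.5, 0.5, 0.0, 0.0],   # the
    [1.0, 1.0, 1.0, 1.0],   # cat
    [0.0, 0.5, 0.5, 0.0],   # because
    [1.0, 0.5, 1.0, 0.5],   # it
])

d = 4
W_Q = W_K = W_V = np.eye(d)   # identity projections keep the numbers simple
Q, K, V = X @ W_Q, X @ W_K, X @ W_V

q_it = Q[-1]                              # query vector for the current token "it"
scores = K @ q_it / np.sqrt(d)            # relevance of every token to "it"
alpha = np.exp(scores - scores.max())
alpha = alpha / alpha.sum()               # softmax -> attention weights

for tok, a in zip(tokens, alpha):
    print(f"{tok:8s} {a:.3f}")            # "cat" gets the largest weight

new_it = alpha @ V                        # the new 4-D representation of "it"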

Why Attention Matters in Modern AI

Since the Transformer architecture was first introduced, researchers have been constantly trying to improve it, and guess where most of their attention goes? The attention layer itself. That’s because it’s not just the heart of the model; it’s also the most computationally heavy part.

Over time, researchers came up with all sorts of new attention mechanisms to make things faster and smarter: Local/Sparse Attention, Multi-Query and Grouped-Query Attention, and FlashAttention. Each of these was an attempt to keep the magic of attention while cutting down on the heavy computation. (I’ll cover these in subsequent posts.)

Conclusion

It’s wild to think that the ability of machines to focus, something so human, has completely transformed how they read, translate, write, and even create art. Attention isn’t just a technical concept; it’s the spark that made models like GPT and BERT possible.

Attention is one of the most important breakthroughs in modern AI. It gave models the ability to focus on what actually matters in a sentence, which made understanding and generating language far more accurate.

If you’re learning about transformers or working with language models, my advice is this: take time to really understand attention. It’s the core of everything — from GPT to BERT to the latest AI tools. Once you get how it works, the rest of the architecture becomes much easier to follow.

Source

A big thank you to Jay Alammar and Maarten Grootendorst for their brilliant explanations in Hands-On Large Language Models, which helped me understand and simplify these concepts.

Also, a special thanks to the legendary paper “Attention Is All You Need” by Vaswani et al., which started it all. If you haven’t read it yet, it’s worth checking out!
