
How Words Learn to Pay Attention: Transformers Part 1
Last Updated on January 15, 2025 by Editorial Team
Author(s): Anushka Sonawane
Originally published on Towards AI.
It’s not about how fast you go, but about how smart you can be while going fast.
If you’ve ever wondered how Google Translate or Siri understands you, or how ChatGPT writes sentences that sound human, well, guess what? It’s all because of Transformers. Now, I know this might sound technical, but trust me, by the end of this blog, you’ll have a clear picture of how this model works and why it’s become such a big deal in the world of AI.
At its core, a Transformer is a type of deep learning model designed to handle sequential data — basically, any data that comes in a sequence, like sentences in a paragraph, words in a sentence, or even pixels in an image.
Think of it like a smart assistant. If you tell it something, it tries to understand your sentence, breaks it down, and then gives you an answer that makes sense. But instead of reading one word at a time like humans do, a transformer can look at all the words in a sentence at once and figure out how they all relate to each other.
Life Before Transformers — RNNs and LSTMs
What were RNNs?
Imagine you are playing a message-passing game. You stand in a line with your friends, and the first person whispers a message to the next, and so on until it reaches the last person. Sounds fun, right?
Now here’s the catch — the longer the chain, the more garbled the message becomes. RNNs (Recurrent Neural Networks) are like that chain. They pass information step by step, word by word, and sometimes… they just forget what was said earlier. That’s a problem when the sentences are long.
They face a couple of significant challenges:
➤ Vanishing and Exploding Gradients: During training, RNNs can encounter issues where the learning updates become either too small (vanishing) or too large (exploding). This makes it hard for the network to learn long-term patterns.
➤ Sequential Processing Bottleneck: RNNs handle one piece of information at a time, making them slow, especially with large datasets. This sequential nature also limits their ability to utilize modern computing hardware efficiently (see the sketch after this list).
➤ Short-Term Memory: Just like in the game, where the original message gets lost over time, RNNs struggle to retain information from earlier parts of a long sequence, making them less effective for understanding context that spans over longer durations.
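To make the chain analogy concrete, here is a minimal sketch of a vanilla RNN loop in Python/NumPy. The sizes and random weights are toy values invented for illustration, not a real trained model:

```python
import numpy as np

hidden_size, embed_size = 4, 4
W_xh = np.random.randn(hidden_size, embed_size) * 0.1   # input-to-hidden weights
W_hh = np.random.randn(hidden_size, hidden_size) * 0.1  # hidden-to-hidden weights

sentence = [np.random.randn(embed_size) for _ in range(10)]  # 10 toy word vectors

h = np.zeros(hidden_size)  # the "message" being whispered down the chain
for x in sentence:
    # Each step depends on the previous hidden state, so the loop cannot be
    # parallelized, and repeatedly multiplying by W_hh is what makes gradients
    # vanish or explode over long sequences.
    h = np.tanh(W_xh @ x + W_hh @ h)
```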
How did LSTMs Help?
Now, let’s say we improve the message-passing game. Instead of whispering everything, you only pass what’s important — like handing notes instead of repeating full sentences. This is what LSTMs (Long Short-Term Memory networks) did.
They introduced something called the cell state, which decides:
- What to remember.
- What to forget.
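To get a feel for those decisions, here is a rough sketch of a single LSTM step in NumPy. The weight names and sizes are hypothetical, and biases are omitted to keep it short:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n = 4  # toy hidden/cell size
W_f, W_i, W_o, W_c = (np.random.randn(n, 2 * n) * 0.1 for _ in range(4))

def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([x, h_prev])
    f = sigmoid(W_f @ z)  # forget gate: what to erase from the cell state
    i = sigmoid(W_i @ z)  # input gate: what new information to write
    o = sigmoid(W_o @ z)  # output gate: what to reveal as the new hidden state
    c = f * c_prev + i * np.tanh(W_c @ z)  # updated cell state (the "notes")
    h = o * np.tanh(c)
    return h, c

h, c = lstm_step(np.random.randn(n), np.zeros(n), np.zeros(n))
```

Even so, LSTMs came with their own set of challenges: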
➤ Complexity and Interpretability: The intricate design of LSTMs, with multiple components deciding what to keep or discard, makes them harder to understand and interpret.
➤ Computational Intensity: Managing these notepads and decisions requires significant computational resources, leading to longer training times and increased demand for processing power.
➤ Difficulty with Hierarchical Structures: LSTMs may struggle to effectively model complex language structures, limiting their ability to generalize in tasks involving intricate hierarchies.
➤ Scalability Issues: The step-by-step processing approach of LSTMs limits their scalability on modern hardware, making it challenging to train large models efficiently.
Imagine you’re reading a sentence like “The cat chased the mouse and the dog barked loudly at the cat.”
In older models like RNNs (Recurrent Neural Networks) or LSTMs (Long Short-Term Memory), words are processed one by one, in order, like reading a book from left to right.
But here’s the issue:
As the model processes each word, it gradually forgets the earlier ones. It’s like reading a long story and forgetting the beginning the further you go. So, by the time it reaches the end of the sentence, it has already lost important details from the start (like the first “cat” being related to the second “cat”).
How Self-Attention Solves It:
Self-attention fixes this by looking at all words at once, no matter their position in the sentence. So, even if words are far apart, like the two “cats”, they can still connect to each other. Every word can pay attention to the others, remembering what’s important, even from far away.
What is Attention?
Imagine you’re reading a sentence, like “The cat sat on the mat.” Your brain automatically understands that “sat” refers to the cat and “on the mat” describes where the cat sat. You are paying “attention” to different words and their relationships to understand the sentence’s meaning. Transformers work similarly!
➤ Step 1: Initial Word Embeddings
Computers don’t understand words like we do. They need numbers. So, every word in the sentence is turned into a vector (a fancy word for a list of numbers).
Example sentence: “The cat chased the mouse and the dog barked loudly at the cat.”
Every word (“cat”, “chased”, “dog”, “barked”, and so on) gets its own vector. These vectors are the computer’s version of the words. But here’s the thing: initially, they have no sense of context. The word “dog” is just floating around in its own world, clueless about the “cat” or the “barked.”
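For instance, a toy vocabulary might map each word to a 3-dimensional vector. Real models learn vectors with hundreds of dimensions; these numbers are made up purely for illustration:

```python
import numpy as np

# Toy 3-dimensional embeddings, invented for this example.
embeddings = {
    "cat":    np.array([0.9, 0.1, 0.3]),
    "dog":    np.array([0.8, 0.2, 0.4]),
    "barked": np.array([0.1, 0.9, 0.5]),
    "mouse":  np.array([0.2, 0.1, 0.2]),
}
```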
➤ Step 2: Calculating the Weights (Dot Product)
Now, let’s teach the computer to focus on what’s important. This is done by comparing each word with every other word in the sentence.
For the word “dog,” we ask:
- “How related is ‘dog’ to ‘cat’?”
- “How related is ‘dog’ to ‘mouse’?”
- And so on for all the words.
This comparison is done using a simple mathematical operation called the dot product (don’t worry about the math details). Think of it like giving scores for how well two words match.
Example scores for “dog”:
- With “cat” → High score (because dog and cat are natural rivals).
- With “barked” → Medium score (barking is something a dog does).
- With “mouse” → Low score (dogs don’t care much about mice).
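With the toy vectors from Step 1, the scoring looks like this. The exact numbers only matter in that they reproduce the high/medium/low pattern above:

```python
import numpy as np

# The same toy vectors from the Step 1 sketch.
dog    = np.array([0.8, 0.2, 0.4])
cat    = np.array([0.9, 0.1, 0.3])
barked = np.array([0.1, 0.9, 0.5])
mouse  = np.array([0.2, 0.1, 0.2])

# A dot product multiplies matching dimensions and adds them up:
# vectors pointing in similar directions get higher scores.
print(np.dot(dog, cat))     # 0.86 -> high score
print(np.dot(dog, barked))  # 0.46 -> medium score
print(np.dot(dog, mouse))   # 0.26 -> low score
```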
➤ Step 3: Normalizing the Weights
Now, the computer normalizes these scores so they add up to 1. Think of it as splitting 100% of your attention across the words in the sentence. For example:
- “cat” gets 0.5 (50% attention).
- “barked” gets 0.3 (30% attention).
- “mouse” gets 0.2 (20% attention).
To balance the scores, each raw weight is divided by the sum of all the weights for that word:
Wij (normalized) = Wij / (Wi1 + Wi2 + … + Win)
Here, Wij becomes the normalized weight.
For example:
If W11 = 5, W12 = 3, W13 = 4:
Sum = 5 + 3 + 4 = 12
Normalized weights: W11 = 5/12 ≈ 0.42, W12 = 3/12 = 0.25, W13 = 4/12 ≈ 0.33
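Here is that same arithmetic in NumPy. Note that real transformers normalize with softmax (exponentiate, then divide by the sum), which behaves similarly but sharpens the differences between scores:

```python
import numpy as np

scores = np.array([5.0, 3.0, 4.0])  # the raw weights from the example above

# Plain normalization: divide each score by the total.
weights = scores / scores.sum()
print(weights)  # [0.4167 0.25   0.3333] -> roughly 0.42, 0.25, 0.33

# Softmax, as used in actual transformers: exponentiate first, then normalize.
softmax = np.exp(scores) / np.exp(scores).sum()
print(softmax)  # approximately [0.665, 0.090, 0.245], sharper than plain division
```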
➤ Step 4: Reweighing Word Embeddings
The goal of this step is to calculate a context-aware embedding for each word.
The embeddings need to capture relationships between words in the sequence. By combining embeddings of related words (weighted by attention), the model creates a representation of a word enriched with contextual information.
For example:
- In the sentence “The cat chased the mouse”, the embedding for “cat” should reflect its relationship with “chased” and “mouse.”
- Without this step, the embeddings would remain isolated and lack meaningful connections.
Using normalized weights Wij, we compute the final embedding Y1 for V1:
Y1 = W11·V1 + W12·V2 + W13·V3
Substituting the normalized weights:
Y1 = 0.42·V1 + 0.25·V2 + 0.33·V3
This is repeated for all words so that each word gets some context from every other word in the sentence.
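Continuing the worked example, here is the weighted sum in NumPy, with V1, V2, V3 as toy 3-dimensional vectors chosen only for illustration:

```python
import numpy as np

V = np.array([
    [0.9, 0.1, 0.3],  # V1
    [0.8, 0.2, 0.4],  # V2
    [0.1, 0.9, 0.5],  # V3
])
W = np.array([0.42, 0.25, 0.33])  # the normalized weights W11, W12, W13

# Y1 = W11*V1 + W12*V2 + W13*V3: a context-aware embedding for word 1.
Y1 = W @ V
print(Y1)  # [0.611 0.389 0.391]
```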
This step-by-step process forms the basis of self-attention, where every word in a sentence can understand its relationship with all other words, creating embeddings enriched with context. This ability to capture global dependencies is what makes self-attention a powerful tool in transformer models for tasks like translation, summarization, and language modeling.
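Putting the four steps together, a bare-bones self-attention sketch might look like the function below. (Real transformers first project the inputs into separate query, key, and value matrices; that refinement, along with multi-head attention, is coming in the next part.)

```python
import numpy as np

def simple_self_attention(X):
    """Self-attention stripped to its core: dot-product scores,
    softmax normalization, then a weighted sum of the word vectors."""
    scores = X @ X.T                               # Step 2: all pairwise dot products
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # Step 3: each row sums to 1
    return weights @ X                             # Step 4: context-aware embeddings

X = np.random.randn(5, 8)     # 5 toy words, each an 8-dimensional embedding
Y = simple_self_attention(X)  # same shape as X, but each row now carries context
```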
Next up, we’re diving into Multi-Head Attention and breaking down the different stages of Transformers — get ready to see how these models work their magic!
If you’d like to follow along with more insights or discuss any of these topics further, feel free to connect with me. Looking forward to chatting and sharing more ideas!
Wait, There’s More!
If you enjoyed this, you’ll love my other blogs! 🎯
Everything You Need to Know About Chunking for RAG
How Chunking Makes Data Easier to Handle and Retrieve
AI Agents, Assemble(Part 1)! The Future of Problem-Solving with AutoGen
Getting to Know AI Agents: How They Work, Why They’re Useful, and What They Can Do for You