How Words Learn to Pay Attention: Transformers Part 1
Last Updated on January 15, 2025 by Editorial Team
Author(s): Anushka Sonawane
Originally published on Towards AI.
It's not about how fast you go, but about how smart you can be while going fast.
If you've ever wondered how Google Translate or Siri understands you, or how ChatGPT creates sentences that sound like a human, well, guess what? It's all because of Transformers. Now, I know this might sound technical, but trust me, by the end of this blog, you'll have a clear picture of how this model works and why it's become such a big deal in the world of AI.
At its core, a Transformer is a type of deep learning model designed to handle sequential data: basically, any data that comes in a sequence, like sentences in a paragraph, words in a sentence, or even pixels in an image.
Think of it like a smart assistant. If you tell it something, it tries to understand your sentence, breaks it down, and then gives you an answer that makes sense. But instead of reading one word at a time like humans do, a transformer can look at all the words in a sentence at once and figure out how they all relate to each other.
Life Before Transformers: RNNs and LSTMs
What were RNNs?
Imagine you are playing a message-passing game. You stand in a line with your friends, and the first person whispers a message to the next, and so on until it reaches the last person. Sounds fun, right?
Now here's the catch: the longer the chain, the more garbled the message becomes. RNNs (Recurrent Neural Networks) are like that chain. They pass information step by step, word by word, and sometimes… they just forget what was said earlier. That's a problem when the sentences are long.
They face a couple of significant challenges:
➤ Vanishing and Exploding Gradients: During training, RNNs can encounter issues where the learning updates become either too small (vanishing) or too large (exploding). This makes it hard for the network to learn long-term patterns (the numeric sketch after this list shows the vanishing case).
➤ Sequential Processing Bottleneck: RNNs handle one piece of information at a time, making them slow, especially with large datasets. This sequential nature also limits their ability to utilize modern computing hardware efficiently.
➤ Short-Term Memory: Just like in the game, where the original message gets lost over time, RNNs struggle to retain information from earlier parts of a long sequence, making them less effective for understanding context that spans over longer durations.
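To see why gradients vanish, here is a toy Python sketch (the 0.5 is a made-up recurrent weight, not from any real model). During backpropagation through time, the gradient is multiplied by roughly the same factor at every step, so a factor below 1 shrinks it exponentially:

```python
# Toy illustration of vanishing gradients (not a real RNN):
# the gradient signal is multiplied by the recurrent weight once per step.
recurrent_weight = 0.5   # hypothetical weight magnitude below 1
gradient = 1.0

for step in range(1, 21):
    gradient *= recurrent_weight
    if step % 5 == 0:
        print(f"after {step:2d} steps: gradient = {gradient:.10f}")

# After 20 steps the gradient is about 0.00000095: the first word's
# influence has all but vanished. A weight above 1 would explode instead.
```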
How did LSTMs Help?
Now, let's say we improve the message-passing game. Instead of whispering everything, you only pass what's important, like handing notes instead of repeating full sentences. This is what LSTMs (Long Short-Term Memory networks) did.
They introduced something called the cell state, managed by gates that decide:
What to remember.
What to forget.
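Here is a minimal numpy sketch of a single LSTM step, assuming random placeholder weights (a real network learns them). The forget and input gates decide what to erase from and write into the cell state:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

hidden = 4                       # size of the hidden and cell states
rng = np.random.default_rng(0)   # placeholder weights, not trained
W = {g: rng.normal(size=(hidden, 2 * hidden)) for g in "fico"}

def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([x, h_prev])
    f = sigmoid(W["f"] @ z)          # forget gate: what to erase
    i = sigmoid(W["i"] @ z)          # input gate: what to write
    c_tilde = np.tanh(W["c"] @ z)    # candidate content to write
    o = sigmoid(W["o"] @ z)          # output gate: what to reveal
    c = f * c_prev + i * c_tilde     # cell state: the "note" passed along
    h = o * np.tanh(c)               # what the next step actually sees
    return h, c

h, c = lstm_step(np.ones(hidden), np.zeros(hidden), np.zeros(hidden))
```

Even with this clever gating, though, LSTMs brought problems of their own: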
➤ Complexity and Interpretability: The intricate design of LSTMs, with multiple components deciding what to keep or discard, makes them harder to understand and interpret.
➤ Computational Intensity: Managing these notepads and decisions requires significant computational resources, leading to longer training times and increased demand for processing power.
➤ Difficulty with Hierarchical Structures: LSTMs may struggle to effectively model complex language structures, limiting their ability to generalize in tasks involving intricate hierarchies.
➤ Scalability Issues: The step-by-step processing approach of LSTMs limits their scalability on modern hardware, making it challenging to train large models efficiently.
Imagine you're reading a sentence like "The cat chased the mouse and the dog barked loudly at the cat."
In older models like RNNs (Recurrent Neural Networks) or LSTMs (Long Short-Term Memory), words are processed one by one, in order, like reading a book from left to right.
But hereβs the issue:
As the model processes each word, it forgets earlier words. It's like reading a long story and forgetting the beginning the more you move forward. So, when it reaches the end of the sentence, it's already forgotten important details from the start (like the first "cat" being related to the second "cat").
How Self-Attention Solves It:
Self-attention fixes this by looking at all words at once, no matter their position in the sentence. So, even if words are far apart, like the two "cats", they can still connect to each other. Every word can pay attention to the others, remembering what's important, even from far away.
What is Attention?
Imagine you're reading a sentence, like "The cat sat on the mat." Your brain automatically understands that "sat" refers to the cat and "on the mat" describes where the cat sat. You are paying "attention" to different words and their relationships to understand the sentence's meaning. Transformers work similarly!
➤ Step 1: Initial Word Embeddings
Computers don't understand words like we do. They need numbers. So, every word in the sentence is turned into a vector (a fancy word for a list of numbers).
Example sentence: "The cat chased the mouse and the dog barked loudly at the cat."
Every word in it gets its own vector, and these vectors are like the computer's version of the words. But here's the thing: initially, these vectors have no sense of context. The word "dog" is just floating around in its own world, clueless about the "cat" or the "barked."
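To make the next steps concrete, here is a tiny Python sketch with made-up 4-dimensional vectors for four of the words. (Real models learn embeddings with hundreds of dimensions; these numbers are purely illustrative and chosen so the scores below match the story.)

```python
import numpy as np

# Hypothetical embeddings: each word is just a small list of numbers.
embeddings = {
    "cat":    np.array([0.9, 0.1, 0.3, 0.0]),
    "dog":    np.array([0.8, 0.2, 0.1, 0.1]),
    "mouse":  np.array([0.2, 0.0, 0.9, 0.1]),
    "barked": np.array([0.3, 0.9, 0.0, 0.3]),
}
```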
➤ Step 2: Calculating the Weights (Dot Product)
Now, let's teach the computer to focus on what's important. This is done by comparing each word with every other word in the sentence.
For the word "dog," we ask:
- "How related is 'dog' to 'cat'?"
- "How related is 'dog' to 'mouse'?"
- And so on for all the words.
This comparison is done using a simple mathematical operation called the dot product: multiply the matching numbers in the two vectors and add the results up. Think of it like giving scores for how well two words match (we compute these scores in the sketch after the examples below).
Example scores for "dog":
- With "cat" → High score (because dog and cat are natural rivals).
- With "barked" → Medium score (barking is something a dog does).
- With "mouse" → Low score (dogs don't care much about mice).
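Using the toy embeddings from Step 1, the raw scores come straight out of the dot product:

```python
# Score "dog" against every word: multiply matching entries, then sum.
query = embeddings["dog"]
scores = {word: float(query @ vec) for word, vec in embeddings.items()}
print(scores)
# {'cat': 0.77, 'dog': 0.7, 'mouse': 0.26, 'barked': 0.45}
```

As in the story above, "cat" scores highest, "barked" sits in the middle, and "mouse" trails. (Note that in self-attention a word also scores itself.)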
➤ Step 3: Normalizing the Weights
Now, the computer normalizes these scores so they add up to 1. This is like deciding how to split 100% of your attention across the words. For example:
- "cat" gets 0.5 (50% attention).
- "barked" gets 0.3 (30% attention).
- "mouse" gets 0.2 (20% attention).
The simplest way to do this is to divide each raw score by their total so everything sums to 1 and stays balanced (real Transformers use a softmax here, but the idea is the same).
Here, Wij denotes the normalized weight that word i gives to word j.
For example:
If W11 = 5, W12 = 3, W13 = 4:
Sum = 5 + 3 + 4 = 12
Normalized weights: W11 = 5/12 ≈ 0.42, W12 = 3/12 = 0.25, W13 = 4/12 ≈ 0.33
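The same arithmetic in Python, with the numbers straight from the worked example:

```python
# Divide each raw score by the total so the weights sum to 1.
raw = {"W11": 5, "W12": 3, "W13": 4}
total = sum(raw.values())                        # 5 + 3 + 4 = 12
weights = {k: v / total for k, v in raw.items()}
print(weights)  # {'W11': 0.4166..., 'W12': 0.25, 'W13': 0.3333...}
assert abs(sum(weights.values()) - 1.0) < 1e-9   # they really sum to 1
```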
➤ Step 4: Reweighting Word Embeddings
The goal of this step is to calculate a context-aware embedding for each word.
The embeddings need to capture relationships between words in the sequence. By combining embeddings of related words (weighted by attention), the model creates a representation of a word enriched with contextual information.
For example:
- In the sentence "The cat chased the mouse", the embedding for "cat" should reflect its relationship with "chased" and "mouse."
- Without this step, the embeddings would remain isolated and lack meaningful connections.
Using the normalized weights Wij, we compute the final embedding Y1 for V1:
Y1 = W11·V1 + W12·V2 + W13·V3
Substituting normalized weights:
Y1 = 0.42·V1 + 0.25·V2 + 0.33·V3
This is repeated for all words so that each word gets some context from every other word in the sentence.
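As a sketch with three made-up word vectors (any small numbers would do), the reweighting is a single weighted sum in numpy:

```python
import numpy as np

# Hypothetical embeddings for three words, plus the weights from Step 3.
V1 = np.array([1.0, 0.0, 2.0])
V2 = np.array([0.0, 1.0, 1.0])
V3 = np.array([2.0, 1.0, 0.0])

W11, W12, W13 = 0.42, 0.25, 0.33
Y1 = W11 * V1 + W12 * V2 + W13 * V3   # context-aware embedding for word 1
print(Y1)                             # [1.08 0.58 1.09]
```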
This step-by-step process forms the basis of self-attention, where every word in a sentence can understand its relationship with all other words, creating embeddings enriched with context. This ability to capture global dependencies is what makes self-attention a powerful tool in transformer models for tasks like translation, summarization, and language modeling.
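Putting all four steps together, a bare-bones self-attention layer fits in a few lines. This is a simplified sketch: real Transformers first project the inputs into separate query, key, and value matrices and scale the scores before the softmax, but the core loop of score, normalize, reweight is exactly what we walked through:

```python
import numpy as np

def simple_self_attention(X):
    """Minimal single-head self-attention with no learned projections."""
    scores = X @ X.T                               # Step 2: all dot products
    weights = np.exp(scores)                       # Step 3: softmax-normalize
    weights /= weights.sum(axis=1, keepdims=True)  #         rows sum to 1
    return weights @ X                             # Step 4: reweight embeddings

# Rows are the toy word vectors from earlier: cat, dog, mouse, barked.
X = np.array([
    [0.9, 0.1, 0.3, 0.0],
    [0.8, 0.2, 0.1, 0.1],
    [0.2, 0.0, 0.9, 0.1],
    [0.3, 0.9, 0.0, 0.3],
])
Y = simple_self_attention(X)   # each row is now a context-aware embedding
```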
Next up, we're diving into Multi-Head Attention and breaking down the different stages of Transformers. Get ready to see how these models work their magic!
If you'd like to follow along with more insights or discuss any of these topics further, feel free to connect with me:
Looking forward to chatting and sharing more ideas!
Wait, Thereβs More!
If you enjoyed this, you'll love my other blogs! 🎯
Everything You Need to Know About Chunking for RAG
How Chunking Makes Data Easier to Handle and Retrieve
pub.towardsai.net
AI Agents, Assemble(Part 1)! The Future of Problem-Solving with AutoGen
Getting to Know AI Agents: How They Work, Why Theyβre Useful, and What They Can Do for You
pub.towardsai.net
Published via Towards AI