Evolution of Transformers Pt 2: Sequence Modelling (Transformers)
Last Updated on September 14, 2025 by Editorial Team
Author(s): Apoorv Jain
Originally published on Towards AI.
In the previous blog of this series, we explored the early revolutionary idea of Recurrent Neural Networks (RNNs) for sequence modelling. We discussed their core intuition, the advantages they offered, and the key limitations, particularly the challenges of maintaining gradient flow across long sequences often required in sequence modelling tasks.
In this article, we turn our attention to the simple yet powerful ideas that made transformers unique and highly scalable.

We discussed the idea of a “mental summary” that keeps track of the context of previous words in a sentence. However, human eyes can do more than just maintain such a summary: they can easily scan multiple previous words and directly infer which ones are most relevant for understanding the current word. The next approach we discuss is inspired by this idea.
Self Attention
It is a very powerful technique for identifying the relevance of a token with respect to all other tokens. It allows the model to learn complex relationships between the different tokens while maintaining the simplicity and efficiency of the model.

Scenario(A):
The mouse froze for a moment, then bolted across the floor in panic, its tiny body trembling as the cat lunged toward it. Startled by the sudden movement and sensing danger, the mouse darted frantically, desperate to escape the predator’s looming presence.
How do you know what “it” refers to? Your brain instantly connects “it” to “mouse” by looking at context cues such as “panic” and “trembling”.
Instead of using a single vector representation for each token, we derive three vectors per token, each with a different purpose (Q, K, V).
- Query (Q): What am I searching for?
The query vector of the word “it” says, “I need to know who I am referring to.”
- Key (K): What can I offer?
The key vector of the word “mouse” says, “I am a noun, an animal, a potential subject of fear.”
- Value (V): The actual offering.
The value vector of the word “mouse” contains its rich semantic meaning.

The dot product measures the similarity between two vectors. In the context of attention, the dot product between the Query vector and Key vector signifies the match between the requirement (query) and the offering (key) of other tokens. To avoid excessively large values as the dimensionality of these vectors (dk) increases, the dot product is divided by √dk. The scaled scores are then passed through a softmax layer, which converts them into normalized weights. Finally, these weights are used to compute a weighted average of the actual offerings (values) of the tokens, producing the next representation for the token.
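The computation above can be sketched in a few lines of NumPy. This is a toy example where random vectors stand in for the learned projections; in a real transformer, Q, K, and V come from learned linear layers applied to the token embeddings.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # query-key similarity, scaled by sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax -> normalized weights per token
    return weights @ V, weights                     # weighted average of the value vectors

# Toy example: 3 tokens, d_k = 4
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)       # (3, 4): one new contextual representation per token
print(w.sum(axis=-1))  # each row of attention weights sums to 1
```

Each row of `w` is the attention distribution of one token over all tokens, and `out` is the resulting contextualized representation.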
Evolving Features Across Layers
The attention scores that are computed by the learnable Q, K and V in a transformer reflect the different types of features the model has learned during training, and these become richer and more abstract as information passes through multiple layers.
In the earlier layers, attention may capture low-level patterns such as syntactic relationships or positional dependencies, while deeper layers gradually focus on more complex semantic structures, contextual understanding, and task-specific representations. Essentially, the progression of attention across layers allows the model to refine its understanding of input sequences, moving from surface-level associations to higher-level, meaningful patterns that contribute to better predictions.
Visualising Self-Attention
We used the BertViz library to analyse this sentence. We selected a BERT encoder to tokenize the sentence and compute contextual embeddings of its tokens, and visualised the attention scores across different layers and multiple heads of the transformer model.
- Earlier Layers
Fig 2[A] — subject-verb relationship, showcased by a strong connection between “cat” (subject) and “chased” (verb).
Fig 2[B] — object-verb relationship, showcased by a connection between “mouse” (object) and “chased” (verb).
These low-level relationships were found in Layer 0, across different attention heads.
- Deeper Layers
Fig 2[C] — understanding the discourse of who got scared.
Fig 2[D] — coreference resolution: finding the subject that “it” is referring to.
These features are complex and require deeper understanding, and are thus found in the deeper layers 9 and 10.


Alternate Scenario(B):
The mouse scurried frantically across the floor while the cat chased after it, not out of hunger but fear, its puffed tail and jerky movements betraying its startled instinct.
In Scenario A, the cat saw a frightened mouse and started chasing it, while in Scenario B, the cat itself panicked and began chasing the mouse.
Fig 2[C] and Fig 2[D] capture this ambiguity beautifully: in the attention scores, you can see dual connections from “scared” and “it” to both “mouse” and “cat”.

Transformers
The Transformer architecture revolutionized sequence modeling by introducing a structure built entirely on attention mechanisms, moving away from the traditional use of recurrent or convolutional neural networks. Instead of processing sequence data one time step at a time, the Transformer enables every token to directly attend to all others in the input via self-attention.

Key components
- Input Embedding Layer: Converts input tokens into high-dimensional vector representations that capture semantic information.
- Positional Encoding: Adds position-based information to token embeddings to preserve sequence order while training in parallel. This is a consequence of removing recurrent connections across time steps.
- Encoder Block:
Multi-Head Self-Attention: Each token attends to all others, capturing contextual relationships within the input sequence. This is done simultaneously by multiple heads, each having its own Q, K, and V matrices.
Feed-Forward Network (MLP Layer): A position-wise dense network refines each token’s representation after attention.
Layer Normalisation and Residual Connections: Stabilise training, enable effective gradient flow, and improve numerical stability. A residual connection is similar to a student revisiting earlier material so it is not forgotten.

- Decoder Block:
Masked Multi-Head Self-Attention: Prevents attending to future tokens during generation for autoregressive modeling.
Encoder-Decoder (Cross) Attention: Allows the decoder to focus on encoded input representations when generating output.
Feed-Forward Network: As in the encoder, refines token-level representations.
Layer Normalization and Residual Connections: As above, for stabilization and convergence.
- Output Linear Layer + Softmax: Generates a probability distribution over the next token.
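Two of the components above can be sketched in a few lines of NumPy: the sinusoidal positional encoding from the original paper (assuming an even d_model), and the causal mask used in the decoder's masked self-attention. The uniform raw scores below are for illustration only; real scores come from QKᵀ/√dk.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal encodings: PE[pos, 2i] = sin(pos / 10000^(2i/d_model)),
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model)). Added element-wise to
    token embeddings to restore order information."""
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]      # (1, d_model/2)
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def causal_mask(seq_len):
    """Upper-triangular -inf mask: token i may attend only to tokens <= i."""
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

def masked_softmax(scores):
    """Softmax over masked scores; -inf entries receive zero weight."""
    scores = scores + causal_mask(scores.shape[0])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return w / w.sum(axis=-1, keepdims=True)

pe = positional_encoding(seq_len=50, d_model=16)
w = masked_softmax(np.zeros((4, 4)))   # uniform raw scores for illustration
print(pe.shape)        # (50, 16)
print(np.round(w, 2))  # row i spreads its weight only over tokens 0..i
```

The masked rows show why generation is autoregressive: the first token can only attend to itself, while the last token attends to the whole prefix.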
Was the Transformer the first to introduce attention?
No. It was, however, the first to rely solely on self-attention, without any recurrent connections across time steps, for the machine translation task.

- The original attention mechanism was introduced by Bahdanau et al. in 2014 to improve encoder-decoder RNN models for tasks like machine translation, enabling models to focus on relevant parts of the input sequence.
- It was considered more of an enhancement technique for RNN-based architectures, which relied on recurrent and convolutional layers for learning relationships between tokens.
- The Transformer architecture (introduced in “Attention Is All You Need”, Vaswani et al., 2017) was the first model to entirely replace recurrence and convolution with attention, specifically self-attention.
- That said, removing recurrent connections is not without consequences. There has been ongoing discussion about whether the lack of recurrence limits the reasoning abilities of transformers. A recent paper on the Hierarchical Reasoning Model, for instance, proposed reintroducing recurrent connections inspired by the human brain and achieved strong reasoning performance with significantly fewer parameters (27M).
Training Objective

To train the transformer architecture, we need a task that can instill world knowledge into it. One widely known task is next-token prediction, in which the model is trained to output a probability distribution over the next token, which is compared against the actual next token using the cross-entropy loss function.
This loss is used to propagate gradients through the layers, adjusting the weights in much the same way as in a standard neural network, though training is less complex than for an RNN.
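The next-token loss can be sketched with a toy vocabulary and hypothetical logits (the names and numbers below are made up for illustration):

```python
import numpy as np

def cross_entropy(logits, target_id):
    """Cross-entropy between the model's predicted distribution (softmax of
    the logits) and the one-hot distribution of the actual next token."""
    logits = logits - logits.max()                   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[target_id]

vocab = ["the", "cat", "chased", "mouse"]            # toy 4-word vocabulary
logits = np.array([0.1, 0.2, 3.0, 0.5])              # model strongly predicts "chased"
loss_good = cross_entropy(logits, target_id=2)       # true next token is "chased"
loss_bad = cross_entropy(logits, target_id=0)        # true next token is "the"
print(loss_good < loss_bad)                          # True: lower loss when the model is right
```

Minimising this loss over billions of tokens is what pushes the model to place high probability on plausible continuations.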

What is particularly fascinating is how an LLM can generate thousands of logical sentences from just a single prompt, producing them one at a time, yet still maintaining fluency, coherence, and logical consistency. At first glance, this might seem to require explicit planning in advance.
In fact, beyond next-word prediction, there is also another pretraining objective that has recently gained a lot of attention for structured tasks. We will explore this in the upcoming blogs.
Why Transformers Succeeded
- No bottleneck problem: The attention mechanism reduces the burden on the “mental summary” by allowing the model to directly track different dependencies. Moreover, gradients can flow directly to the relevant tokens without passing through unnecessary intermediate tokens. This solves the long-range dependency problem.
- Parallel Training: Using solely attention removed the linear dependency; processing the nth token does not require the output of the (n-1)th token. This makes training far more scalable and parallelizable, as the computations can be efficiently expressed as matrix operations on GPUs.
- Transfer Learning: Because the model retains domain knowledge after pretraining on next-word prediction, it can be fine-tuned on a downstream task with a small number of samples.
- Scalability: What differentiated Transformers was the scalability factor, where you can keep increasing the number of parameters to get an incremental increase in performance.
Ongoing Challenges

- Sequential Inference: Even though the model can be trained in parallel on GPUs, there is still a linear dependency at inference time: you need the (n-1)th token's output to compute the nth token. This limitation makes inference slow.
- Error accumulation: This comes from the fact that LLMs cannot backtrack. Once a token is produced, it cannot be revised or replaced, meaning that any mistake propagates forward. As a result, the margin for error is extremely narrow, and inaccuracies at the token level can be costly for the entire sequence.
- Limited diversity: By default, generated text is determined by greedy decoding, which produces fixed and less diverse outputs. Temperature sampling adjusts the probability distribution of output tokens, which mitigates the diversity problem, but only to a limited extent.
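Temperature sampling can be sketched as follows (a toy example with hypothetical logits; a real decoder applies this at every generation step):

```python
import numpy as np

def sample_with_temperature(logits, temperature, rng):
    """Divide logits by the temperature before softmax: T < 1 sharpens the
    distribution (closer to greedy decoding), T > 1 flattens it (more diverse)."""
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return rng.choice(len(logits), p=probs), probs

logits = np.array([2.0, 1.0, 0.5, 0.1])     # hypothetical next-token logits
rng = np.random.default_rng(0)
_, sharp = sample_with_temperature(logits, temperature=0.2, rng=rng)
_, flat = sample_with_temperature(logits, temperature=2.0, rng=rng)
print(np.round(sharp, 3))  # nearly all probability mass on the top token
print(np.round(flat, 3))   # mass spread more evenly across tokens
```

Even at high temperature the distribution is still anchored to the model's logits, which is why sampling only mitigates, rather than solves, the diversity problem.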
References:
- Attention Is All You Need: https://arxiv.org/abs/1706.03762
- Neural Machine Translation by Jointly Learning to Align and Translate: https://arxiv.org/abs/1409.0473
- The Illustrated Transformer: https://jalammar.github.io/illustrated-transformer/
- StatQuest: https://www.youtube.com/watch?v=zxQyTK8quyY&t=1550s
- 3Blue1Brown: https://www.youtube.com/watch?v=eMlx5fFNoYc
- BertViz: https://github.com/jessevig/bertviz
- Hierarchical Reasoning Model: https://arxiv.org/abs/2506.21734
Published via Towards AI