Evolution of Transformers Pt 2: Sequence Modelling (Transformers)
Last Updated on September 14, 2025 by Editorial Team
Author(s): Apoorv Jain
Originally published on Towards AI.
In the previous blog of this series, we explored the early revolutionary idea of Recurrent Neural Networks (RNNs) for sequence modelling. We discussed their core intuition, the advantages they offered, and the key limitations, particularly the challenges of maintaining gradient flow across long sequences often required in sequence modelling tasks.
In this article, we turn our attention to the simple yet powerful ideas that made transformers unique and highly scalable.

We discussed the idea of a “mental summary” that keeps track of the context of previous words in a sentence. However, human eyes can do more than just maintain such a summary: they can easily scan multiple previous words and directly infer which ones are most relevant for understanding the current word. The next approach we discuss is inspired by this idea.
Self Attention
It is a very powerful technique for identifying the relevance of a token with respect to all other tokens. It allows the model to learn complex relationships between the different tokens while maintaining the simplicity and efficiency of the model.

Scenario(A):
The mouse froze for a moment, then bolted across the floor in panic, its tiny body trembling as the cat lunged toward it. Startled by the sudden movement and sensing danger, the mouse darted frantically, desperate to escape the predator’s looming presence.
How do you know what “it” refers to? Your brain instantly connects “it” to “mouse” by looking at context cues such as “panic” and “trembling”.
Instead of using a single vector representation for each token, we derive three vectors per token, each with a different purpose (Q, K, V).
- Query (Q): What am I searching for?
The query vector of the word “it” says, “I need to know who I am referring to.”
- Key (K): What can I offer?
The key vector of the word “mouse” says, “I am a noun, an animal, a potential subject of fear.”
- Value (V): The actual offering.
The value vector of the word “mouse” contains its rich semantic meaning.

The dot product measures the similarity between two vectors. In the context of attention, the dot product between the Query vector and Key vector signifies the match between the requirement (query) and the offering (key) of other tokens. To avoid excessively large values as the dimensionality of these vectors (dk) increases, the dot product is divided by √dk. The scaled scores are then passed through a softmax layer, which converts them into normalized weights. Finally, these weights are used to compute a weighted average of the actual offerings (values) of the tokens, producing the next representation for the token.
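The computation above can be sketched in a few lines of NumPy. This is a toy example where random vectors stand in for the learned projections; in a real transformer, Q, K, and V come from learned linear layers applied to the token embeddings.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # query-key similarity, scaled by sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax -> normalized weights per token
    return weights @ V, weights                     # weighted average of the value vectors

# Toy example: 3 tokens, d_k = 4
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)       # (3, 4): one new contextual representation per token
print(w.sum(axis=-1))  # each row of attention weights sums to 1
```

Each row of `w` is the attention distribution of one token over all tokens, and `out` is the resulting contextualized representation.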
Evolving Features Across Layers
The attention scores that are computed by the learnable Q, K and V in a transformer reflect the different types of features the model has learned during training, and these become richer and more abstract as information passes through multiple layers.
In the earlier layers, attention may capture low-level patterns such as syntactic relationships or positional dependencies, while deeper layers gradually focus on more complex semantic structures, contextual understanding, and task-specific representations. Essentially, the progression of attention across layers allows the model to refine its understanding of input sequences, moving from surface-level associations to higher-level, meaningful patterns that contribute to better predictions.
Visualising Self-Attention
We used the BertViz library to analyse this sentence. We selected a BERT encoder to tokenize the sentence and compute contextual embeddings of its tokens, and visualised the attention scores across different layers and multiple heads of the transformer model.
- Earlier Layers
Fig 2[A] — subject-verb relationship, showcased by a strong connection between “cat” (subject) and “chased” (verb).
Fig 2[B] — object-verb relationship, showcased by a connection between “mouse” (object) and “chased” (verb).
These low-level relationships were found in Layer 0, across different attention heads.
- Deeper Layers
Fig 2[C] — understanding the discourse of who got scared.
Fig 2[D] — coreference resolution: finding the subject that “it” is referring to.
These features are complex and require deeper understanding, and are thus found in the deeper layers 9 and 10.


Alternate Scenario(B):
The mouse scurried frantically across the floor while the cat chased after it, not out of hunger but fear, its puffed tail and jerky movements betraying its startled instinct.
In Scenario A, the cat saw a frightened mouse and started chasing it, while in Scenario B, the cat itself panicked and began chasing the mouse.
Fig 2[C] and Fig 2[D] capture this ambiguity beautifully: in the attention scores, you can see dual connections from “scared” and “it” to both “mouse” and “cat”.

Transformers
The Transformer architecture revolutionized sequence modeling by introducing a structure built entirely on attention mechanisms, moving away from the traditional use of recurrent or convolutional neural networks. Instead of processing sequence data one time step at a time, the Transformer enables every token to directly attend to all others in the input via self-attention.

Key components
- Input Embedding Layer: Converts input tokens into high-dimensional vector representations that capture semantic information.
- Positional Encoding: Adds position-based information to token embeddings to preserve sequence order while training in parallel. This is a consequence of removing recurrent connections across time steps.
- Encoder Block:
Multi-Head Self-Attention: Each token attends to all others, capturing contextual relationships within the input sequence. This is done simultaneously by multiple heads, each having its own Q, K, and V matrices.
Feed-Forward Network (MLP Layer): A position-wise dense network refines each token’s representation after attention.
Layer Normalisation and Residual Connections: Stabilise training, enable effective gradient flow, and improve numerical stability. A residual connection is similar to a student revisiting earlier material so it is not forgotten.

- Decoder Block:
Masked Multi-Head Self-Attention: Prevents attending to future tokens during generation for autoregressive modeling.
Encoder-Decoder (Cross) Attention: Allows the decoder to focus on encoded input representations when generating output.
Feed-Forward Network: As in the encoder, refines token-level representations.
Layer Normalization and Residual Connections: As above, for stabilization and convergence.
- Output Linear Layer + Softmax: Generates a probability distribution over the next token.
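Two of the components above can be sketched in a few lines of NumPy: the sinusoidal positional encoding from the original paper (assuming an even d_model), and the causal mask used in the decoder's masked self-attention. The uniform raw scores below are for illustration only; real scores come from QKᵀ/√dk.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal encodings: PE[pos, 2i] = sin(pos / 10000^(2i/d_model)),
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model)). Added element-wise to
    token embeddings to restore order information."""
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]      # (1, d_model/2)
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def causal_mask(seq_len):
    """Upper-triangular -inf mask: token i may attend only to tokens <= i."""
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

def masked_softmax(scores):
    """Softmax over masked scores; -inf entries receive zero weight."""
    scores = scores + causal_mask(scores.shape[0])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return w / w.sum(axis=-1, keepdims=True)

pe = positional_encoding(seq_len=50, d_model=16)
w = masked_softmax(np.zeros((4, 4)))   # uniform raw scores for illustration
print(pe.shape)        # (50, 16)
print(np.round(w, 2))  # row i spreads its weight only over tokens 0..i
```

The masked rows show why generation is autoregressive: the first token can only attend to itself, while the last token attends to the whole prefix.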
Was the Transformer the first to introduce attention?
No. It was, however, the first to rely solely on self-attention, without any recurrent connections across time steps, for the machine translation task.

- The original attention mechanism was introduced by Bahdanau et al. in 2014 to improve encoder-decoder RNN models for tasks like machine translation, enabling models to focus on relevant parts of the input sequence.
- It was considered more of an enhancement technique for RNN-based architectures, which relied on recurrent and convolutional layers for learning relationships between tokens.
- The Transformer architecture (introduced in “Attention Is All You Need”, Vaswani et al., 2017) was the first model to entirely replace recurrence and convolution with attention, specifically self-attention.
- That said, removing recurrent connections is not without consequences. There has been ongoing discussion about whether the lack of recurrence limits the reasoning abilities of transformers. A recent paper on the Hierarchical Reasoning Model, for instance, proposed reintroducing recurrent connections inspired by the human brain and achieved strong reasoning performance with significantly fewer parameters (27M).
Training Objective

To train the transformer architecture, we need a task that can instill world knowledge into it. One widely known task is next-token prediction, in which the model is trained to output a probability distribution over the next token, which is compared against the actual next token using the cross-entropy loss function.
This loss is used to propagate gradients through the layers, adjusting the weights in much the same way as in a standard neural network, though training is less complex than for an RNN.
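The next-token loss can be sketched with a toy vocabulary and hypothetical logits (the names and numbers below are made up for illustration):

```python
import numpy as np

def cross_entropy(logits, target_id):
    """Cross-entropy between the model's predicted distribution (softmax of
    the logits) and the one-hot distribution of the actual next token."""
    logits = logits - logits.max()                   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[target_id]

vocab = ["the", "cat", "chased", "mouse"]            # toy 4-word vocabulary
logits = np.array([0.1, 0.2, 3.0, 0.5])              # model strongly predicts "chased"
loss_good = cross_entropy(logits, target_id=2)       # true next token is "chased"
loss_bad = cross_entropy(logits, target_id=0)        # true next token is "the"
print(loss_good < loss_bad)                          # True: lower loss when the model is right
```

Minimising this loss over billions of tokens is what pushes the model to place high probability on plausible continuations.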

What is particularly fascinating is how an LLM can generate thousands of logical sentences from just a single prompt, producing them one at a time, yet still maintaining fluency, coherence, and logical consistency. At first glance, this might seem to require explicit planning in advance.
In fact, beyond next-word prediction, there is also another pretraining objective that has recently gained a lot of attention for structured tasks. We will explore this in the upcoming blogs.
Why Transformers Succeeded
- No bottleneck problem: The attention mechanism reduces the burden on the “mental summary” by allowing the model to directly track different dependencies. Moreover, gradients can flow directly to the relevant tokens without passing through unnecessary intermediate tokens. This solves the long-range dependency problem.
- Parallel Training: Using solely attention removed the linear dependency; processing the nth token does not require the output of the (n-1)th token. This makes training far more scalable and parallelizable, as the computations can be efficiently expressed as matrix operations on GPUs.
- Transfer Learning: Because the model retains domain knowledge after pretraining on next-word prediction, it can be fine-tuned on a downstream task with a small number of samples.
- Scalability: What differentiated Transformers was the scalability factor, where you can keep increasing the number of parameters to get an incremental increase in performance.
Ongoing Challenges

- Sequential Inference: Even though the model can be trained in parallel on GPUs, there is still a linear dependency at inference time: you need the (n-1)th token's output to compute the nth token. This limitation makes inference slow.
- Error accumulation: This comes from the fact that LLMs cannot backtrack. Once a token is produced, it cannot be revised or replaced, meaning that any mistake propagates forward. As a result, the margin for error is extremely narrow, and inaccuracies at the token level can be costly for the entire sequence.
- Limited diversity: By default, generated text is determined by greedy decoding, which produces fixed and less diverse outputs. Temperature sampling adjusts the probability distribution of output tokens, which mitigates the diversity problem, but only to a limited extent.
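Temperature sampling can be sketched as follows (a toy example with hypothetical logits; a real decoder applies this at every generation step):

```python
import numpy as np

def sample_with_temperature(logits, temperature, rng):
    """Divide logits by the temperature before softmax: T < 1 sharpens the
    distribution (closer to greedy decoding), T > 1 flattens it (more diverse)."""
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return rng.choice(len(logits), p=probs), probs

logits = np.array([2.0, 1.0, 0.5, 0.1])     # hypothetical next-token logits
rng = np.random.default_rng(0)
_, sharp = sample_with_temperature(logits, temperature=0.2, rng=rng)
_, flat = sample_with_temperature(logits, temperature=2.0, rng=rng)
print(np.round(sharp, 3))  # nearly all probability mass on the top token
print(np.round(flat, 3))   # mass spread more evenly across tokens
```

Even at high temperature the distribution is still anchored to the model's logits, which is why sampling only mitigates, rather than solves, the diversity problem.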
References:
- Attention Is All You Need: https://arxiv.org/abs/1706.03762
- Neural Machine Translation by Jointly Learning to Align and Translate: https://arxiv.org/abs/1409.0473
- The Illustrated Transformer: https://jalammar.github.io/illustrated-transformer/
- StatQuest: https://www.youtube.com/watch?v=zxQyTK8quyY&t=1550s
- 3Blue1Brown: https://www.youtube.com/watch?v=eMlx5fFNoYc
- BertViz: https://github.com/jessevig/bertviz
- Hierarchical Reasoning Model: https://arxiv.org/abs/2506.21734
Published via Towards AI