
Transformers in AI: The Attention Timeline, From the 1990s to Present

Last Updated on June 3, 2024 by Editorial Team

Author(s): Thiongo John W

Originally published on Towards AI.

Photo by Arseny Togulev on Unsplash

What we call the transformer architecture today took more than three decades to evolve into its present state. What follows is an exploration of those three decades of evolution.

What are Transformers in AI?

Transformers are a kind of architecture used in artificial intelligence; specifically, they are a type of neural network. They are adept at taking a sequence of data (like text or speech) and transforming it into another sequence (like a translation into another language or a summary of the information).

Transformers use attention. Attention allows the model to focus on specific parts of the sequence it is analyzing, to understand how those parts relate to each other, and to weigh their influence on the overall meaning. This is particularly useful for language, where word order and the relationships between words are crucial.
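
To make the idea concrete, here is a minimal NumPy sketch of scaled dot-product self-attention, the specific flavor of attention used by the transformer paper discussed later in this timeline. The token vectors and dimensions are purely illustrative.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal single-head attention: every query attends to every key."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax -> attention weights
    return weights @ V                                 # weighted sum of value vectors

# A toy "sequence" of 4 tokens, each represented by an 8-dimensional vector.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)            # self-attention: Q, K, V all come from x
print(out.shape)                                       # (4, 8): one context-aware vector per token
```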

Before we go into the history of transformers, here’s a breakdown of what transformers can do so far:

Understanding and generating human language: This makes them great for tasks like machine translation, writing different kinds of creative content, and answering some of your questions in a comprehensive way.

Analyzing other kinds of sequences: Transformers aren’t limited to just language. They can be used to analyze DNA or protein sequences, helping researchers understand diseases and develop new medicines.

Finding patterns and trends: By looking at sequences of data, transformers can identify trends and anomalies, which can be useful for fraud detection or targeted recommendations.

Transformers are powerful tools that have revolutionized many areas of artificial intelligence.

Let’s dive deep into the history of transformers.

The Prehistoric Era: From 1990 to 2013

It’s important to understand the concept of a neural network before you can delve into the history and evolution of deep learning architecture.

A neural network is a model inspired by the structure and fundamental function of a biological neural network. At its core, AI development is a reconstruction of the complex mechanism of neural networks in the human brain.

Below is a diagram of an artificial neural network showing the interconnection of nodes, a simplification of how neurons typically connect in a brain. Each circle is a neuron, and the arrows show the connections from input to output.

A deep neural network has at least two hidden layers.

Image Source: Wikipedia Creative Commons License
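
As a minimal sketch of such a network, assuming made-up layer sizes and random weights purely for illustration, a forward pass through two hidden layers looks like this:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

rng = np.random.default_rng(0)
x = rng.normal(size=3)                     # 3 input neurons

# Two hidden layers (what makes the network "deep") and one output neuron.
W1, b1 = rng.normal(size=(5, 3)), np.zeros(5)
W2, b2 = rng.normal(size=(4, 5)), np.zeros(4)
W3, b3 = rng.normal(size=(1, 4)), np.zeros(1)

h1 = relu(W1 @ x + b1)                     # first hidden layer
h2 = relu(W2 @ h1 + b2)                    # second hidden layer
y = W3 @ h2 + b3                           # output neuron
print(y)
```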

As an animal illustration of how a neural network works: the input is a hungry baby’s cry, the hidden part is the complex interconnection of nerves (the architecture), and the output is milk released from the mother’s mammary glands in response to that cry. It is a natural, deep-network-like example involving hormonal, digestive, and nervous interactions.

Back to the prehistoric era of transformers.

In 1990, in his paper titled ‘Finding Structure in Time’, Jeffrey Elman begins by putting it that:

“Time underlies many interesting human behaviors. Thus, the question of how to represent time in connectionist models is very important.”(1)

A connectionist model belongs to a class of theories hypothesizing that knowledge is encoded in the brain through the connections among representations rather than in the representations themselves (American Psychological Association). That is, it’s not about the events themselves but rather the connections between events.

This suggests that knowledge is distributed rather than localized.

In this case, it means that knowledge is retrieved from the spread of activation between connected neurons, as shown in the artificial neural network above. The theory suggests that connected neurons tend to fire together.

Jeffrey Elman continues that:

“The current report develops a proposal along these lines first described by Jordan (1986) which involves the use of recurrent links in order to provide networks with a dynamic memory. In this approach, hidden unit patterns are fed back to themselves: the internal representations which develop thus reflect task demands in the context of prior internal states.”(1).

Elman meant a Recurrent Neural Network (RNN) when he wrote of ‘…patterns fed back to themselves…’. An RNN is a neural network with recurrent (feedback) connections, which allow the output of some nodes to affect the subsequent input to those same nodes. Feedforward neural networks, by contrast, pass information in only one direction.

Using the RNN, the Elman network encoded each word in the training set as a vector, in a process known as word embedding, with the resulting vectors held in what we would now call a vector store. A word embedding is a representation of a word used in Natural Language Processing (NLP).
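
To make the recurrence concrete, here is a minimal NumPy sketch of an Elman-style hidden-state update, in which each step combines the current word vector with the previous hidden state (“patterns fed back to themselves”). The sizes and weights are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 8, 16
W_x = rng.normal(scale=0.1, size=(hidden_dim, input_dim))   # input-to-hidden weights
W_h = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # hidden-to-hidden (recurrent) weights
b = np.zeros(hidden_dim)

sequence = rng.normal(size=(5, input_dim))   # 5 word vectors in a toy sentence
h = np.zeros(hidden_dim)                     # the "memory" starts empty
for x_t in sequence:
    # The new state depends on the current word AND the previous state.
    h = np.tanh(W_x @ x_t + W_h @ h + b)
print(h.shape)                               # (16,): a running summary of the sequence so far
```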

A vector store, or vector database, stores vectors (fixed-length lists of numbers) along with other data items. Typically, a vector database uses Approximate Nearest Neighbor (ANN) algorithms to implement searches within the database and retrieve the closest matching records.

Vectors are representations of data in high-dimensional space.

Using language modeling and feature learning, words and phrases can be represented and mapped to vectors of real numbers. Take, for example, the words ‘cat’, ‘dog’, ‘car’, and ‘lorry’. It is reasonable to expect that ‘dog’ and ‘cat’ would have more closely matching vectors than ‘dog’ and ‘car’; the same goes for ‘car’ and ‘lorry’.
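
As a toy illustration of that intuition, the sketch below compares hand-made, purely illustrative 3-dimensional vectors for those words using cosine similarity, the kind of closeness measure a vector database relies on.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hand-made 3-D "embeddings" purely for illustration: the first two dimensions
# loosely encode "animal-ness" and the third "vehicle-ness".
vectors = {
    "cat":   np.array([0.9, 0.8, 0.1]),
    "dog":   np.array([0.8, 0.9, 0.1]),
    "car":   np.array([0.1, 0.1, 0.9]),
    "lorry": np.array([0.1, 0.2, 0.8]),
}

print(cosine_similarity(vectors["cat"], vectors["dog"]))    # high: related words
print(cosine_similarity(vectors["dog"], vectors["car"]))    # low: unrelated words
print(cosine_similarity(vectors["car"], vectors["lorry"]))  # high again
```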

Remember, vectors have both magnitude and direction and can therefore be visualized in, say, 3-dimensional space (word vectors in practice usually have many more dimensions).

One major shortcoming of the Elman network was that when words with the same spelling but different meanings came up, the model was not able to differentiate between the meanings.

In 1992, Jürgen Schmidhuber introduced the Fast Weight Controller, a novel concept for neural networks(2). This system could learn to answer questions by dynamically adjusting the focus of another neural network.

It achieved this by manipulating connections (called attention weights) based on key and value vectors labeled “FROM” and “TO”.

Interestingly, researchers later discovered that the Fast Weight Controller is mathematically equivalent to a simpler version of the Transformer. This connection highlights the foresight of Schmidhuber’s work.

Around the same time (in 1993), the term “learning internal spotlights of attention” was introduced by Schmidhuber in his paper titled ‘Reducing the Ratio Between Learning Complexity and Number of Time-Varying Variables in Fully Recurrent Nets’(2). This concept captures the essence of how these networks focus on specific parts of the data to make better sense of it.

Also in 1993, IBM researchers made significant progress in machine translation using alignment models. These models likely employed techniques similar to those being explored in neural attention research, paving the way for future advancements in automated language translation.

The year 1997 was pivotal in Transformer evolution because it was the year that Hochreiter and Schmidhuber proposed the Long Short-Term Memory (LSTM) RNN. The authors put it that:

“Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow. We briefly review Hochreiter’s (1991) analysis of this problem, then address it by introducing a novel, efficient, gradient-based method called long short-term memory (LSTM).”(3).

Image Source: Guillaume Chevalier (2018)

LSTMs are a special type of recurrent neural network (RNN) designed to overcome a major limitation of regular RNNs. Traditional RNNs struggle to learn from long sequences of data because information fades over time. LSTMs address this by introducing a memory cell that can retain important information for extended periods.

This improved memory makes LSTMs more powerful than other sequence learning methods for tasks like processing and predicting data over time. Examples include handwriting recognition, speech recognition, machine translation, and even robot control.

The core of an LSTM unit is a cell with three special gates: forget, input, and output. These gates control how information flows through the network. The forget gate decides what past information to discard, the input gate selects new information to remember, and the output gate determines what information is most relevant for the current task.

The selective gating mechanism allows LSTMs to learn long-term relationships within data and make accurate predictions even for complex sequences.
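
To make the gating concrete, here is a minimal NumPy sketch of a single LSTM time step; the parameter shapes and random values are illustrative rather than a faithful production implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U, b stack the parameters of all four gates."""
    z = W @ x_t + U @ h_prev + b
    f, i, o, g = np.split(z, 4)
    f = sigmoid(f)                    # forget gate: what past memory to discard
    i = sigmoid(i)                    # input gate: what new information to store
    o = sigmoid(o)                    # output gate: what to expose right now
    c = f * c_prev + i * np.tanh(g)   # updated memory cell
    h = o * np.tanh(c)                # updated hidden state
    return h, c

# Illustrative sizes and random parameters, just to show the data flow.
rng = np.random.default_rng(0)
n_in, n_hid = 8, 16
W = rng.normal(scale=0.1, size=(4 * n_hid, n_in))
U = rng.normal(scale=0.1, size=(4 * n_hid, n_hid))
b = np.zeros(4 * n_hid)
h, c = np.zeros(n_hid), np.zeros(n_hid)
for x_t in rng.normal(size=(5, n_in)):    # a toy 5-step sequence
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h.shape, c.shape)
```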

Another milestone in the evolution of the transformer architecture came in 2001, when a corpus of one billion words was used for the first time, compared to earlier training corpora of one million words or less.

In their 2001 paper titled ‘Scaling to Very Very Large Corpora for Natural Language Disambiguation’, Banko and Brill put it that:

“The amount of readily available on-line text has reached hundreds of billions of words and continues to grow. Yet for most core natural language tasks, algorithms continue to be optimized, tested and compared after training on corpora consisting of only one million words or less.”(4).

Word Sense Disambiguation (WSD) can be explained like this:

In everyday language, we understand that words can have multiple meanings depending on the context. This is called ambiguity. Humans naturally figure out the intended meaning (that is, they perform WSD) without much trouble. However, for computers, it’s a complex problem that affects many areas of natural language processing (NLP), like search engines and chatbots.

Natural language reflects how our brains work, and computers haven’t quite mastered replicating that. WSD is an ongoing challenge in NLP and machine learning.

Researchers have explored various techniques for WSD. Some rely on dictionaries to understand word meanings based on context. Others use machine learning, where a computer program is trained on examples where words are already labeled with their intended meanings. There are also methods that analyze how words are used together to infer their meaning. So far, supervised learning approaches (training with labeled examples) have been the most successful.

2014–2016: Simple Attention Mechanism

In 2014, the seq2seq model was proposed by Sutskever, Vinyals, and Le in their paper titled ‘Sequence to Sequence Learning with Neural Networks’. Here, the authors put it that:

“Deep Neural Networks (DNNs) are powerful models that have achieved excellent performance on difficult learning tasks. Although DNNs work well whenever large labeled training sets are available, they cannot be used to map sequences to sequences. …we present a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure. Our method uses a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of a fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector.”(5).

Seq2seq refers to a group of machine learning techniques used for natural language processing (NLP). These techniques are like special tools that can take one sequence of data (like words in a sentence) and convert it into another sequence (like a translation in another language, a caption describing an image, or a summary of a text). In the case of Sutskever et al., the model was used for English-to-French translation.

Seq2seq introduced the use of an encoder, an LSTM that takes in a sequence of tokens and turns it into vectors. For example, take the sentence ‘This is a cat.’ The sentence is tokenized into ‘this’, ‘is’, ‘a’, ‘cat’, and each of the tokens is then given a vector value. The decoder, on the other hand, is another LSTM that converts the vector values back into a sequence of tokens.
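
As a minimal sketch of that first step, the code below tokenizes the example sentence and looks up an illustrative, randomly initialized embedding vector for each token, which is the kind of input an encoder LSTM would then consume.

```python
import numpy as np

# Toy tokenization and embedding lookup for the sentence used above.
# The vocabulary, vector size, and values are all illustrative.
sentence = "this is a cat"
vocab = {"this": 0, "is": 1, "a": 2, "cat": 3}

tokens = sentence.split()                             # ['this', 'is', 'a', 'cat']
token_ids = [vocab[t] for t in tokens]                # [0, 1, 2, 3]

rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), 8))    # one 8-d vector per vocabulary entry
token_vectors = embedding_table[token_ids]            # what the encoder would consume

print(tokens, token_ids, token_vectors.shape)         # (4, 8)
```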

While it’s difficult to pinpoint an exact date for when tokenization began specifically for LLMs (Large Language Models), the 2014 proposal represents the most evolved use of tokenization up to that point.

Earlier examples of tokenization equivalents include:

1. Early Neural Network Research: Research on neural networks, the foundation of LLMs, began in the 1940s but significantly accelerated in the 1980s. Since tokenization is a fundamental step in preparing text data for neural networks, it’s likely researchers were using it during this period.

2. Fast Weight Controller (1992): This early neural network architecture introduced by Jürgen Schmidhuber in 1992 operated on key and value vectors. These vectors can be seen as a form of tokenization, even if not in the exact way we use it today.

3. Early Applications in Machine Translation (1993): The use of alignment models for machine translation in 1993 suggests techniques similar to tokenization were being explored for natural language processing (NLP) tasks.

While we can’t pinpoint a specific year, the evidence suggests tokenization for LLMs likely emerged alongside the development of early neural networks and NLP techniques in the late 1980s and early 1990s.

The year 2014 marked a turning point in Neural Machine Translation (NMT) with the exploration of gating mechanisms. Researchers found that a simplified version of gated recurrent units (GRUs) within a 130-million-parameter seq2seq model yielded positive results.

Interestingly, studies showed that GRUs offered similar performance to the more complex gated Long Short-Term Memory (LSTM) networks.

In their paper titled ‘On the Properties of Neural Machine Translation: Encoder–Decoder Approaches’, Cho et al. state that:

“…the neural machine translation performs relatively well on short sentences without unknown words, but its performance degrades rapidly as the length of the sentence and the number of unknown words increase. Furthermore, we find that the proposed gated recursive convolutional network learns the grammatical structure of a sentence automatically.” (6).

Building on this progress, Bahdanau et al. in 2014 further improved seq2seq models by introducing an “additive” attention mechanism between two LSTM networks. This approach allowed the model to focus on specific parts of the source sentence when generating the target translation.
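
For intuition, here is a minimal NumPy sketch of the additive (Bahdanau-style) scoring idea: the decoder state is compared against each encoder state through a small learned function, and the resulting weights form a context vector. All names, sizes, and values here are illustrative.

```python
import numpy as np

def additive_attention(decoder_state, encoder_states, W_q, W_k, v):
    """Bahdanau-style scoring: score(s, h) = v . tanh(W_q s + W_k h)."""
    scores = np.array([v @ np.tanh(W_q @ decoder_state + W_k @ h)
                       for h in encoder_states])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # softmax over source positions
    context = weights @ encoder_states           # weighted sum of encoder states
    return weights, context

rng = np.random.default_rng(0)
d = 8
encoder_states = rng.normal(size=(6, d))         # 6 source-word representations
decoder_state = rng.normal(size=d)               # the decoder's current state
W_q, W_k, v = rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=d)
weights, context = additive_attention(decoder_state, encoder_states, W_q, W_k, v)
print(weights.round(2), context.shape)           # which source words the decoder attends to
```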

In 2015, Luong et al. delved deeper into attention mechanisms, evaluating the relative performance of global (considering the entire source sentence) and local (focusing on a window around each word) approaches.

In their paper, the authors state that:

“With local attention, we achieve a significant gain of 5.0 BLEU points over non-attentional systems which already incorporate known techniques such as dropout. Our ensemble model using different attention architectures has established a new state-of-the-art result in the WMT ’15 English to German translation task with 25.9 BLEU points, an improvement of 1.0 BLEU points over the existing best system backed by NMT and an n-gram re-ranker.”(7).

BLEU stands for Bilingual Evaluation Understudy. It’s a widely used metric in the field of AI, specifically for evaluating the quality of machine translation.

Here’s a breakdown of how BLEU works:

Compares translations: It compares a machine-generated translation of a text to one or more high-quality human translations of the same source text.

Focuses on n-grams: It analyzes how much these translations match up on a level of short phrases (n-grams) like 2-grams (pairs of words) or 3-grams (triplets of words).

Higher score means better: A BLEU score ranges from 0 to 1 (in practice it is often reported on a 0–100 scale, as in the “BLEU points” quoted above), with a higher score indicating a closer match between the machine translation and the human references. A score of 1 represents perfect equivalence. A simplified scoring sketch follows below.
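
Here is a deliberately simplified, illustrative BLEU-like score, using clipped n-gram precision up to bigrams and a brevity penalty; real BLEU implementations typically use up to 4-grams and can handle multiple reference translations.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def simple_bleu(candidate, reference, max_n=2):
    """A simplified BLEU: clipped n-gram precision plus a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = Counter(ngrams(cand, n)), Counter(ngrams(ref, n))
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        log_precisions.append(math.log(max(overlap, 1e-9) / max(len(ngrams(cand, n)), 1)))
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(log_precisions) / max_n)

print(simple_bleu("the cat sat on the mat", "the cat sat on the mat"))  # ~1.0: perfect match
print(simple_bleu("the cat is on a mat", "the cat sat on the mat"))     # lower: partial overlap
```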

Image Source: Luong et al., 2015

Their findings revealed that a combined, “mixed attention” architecture yielded superior translations compared to Bahdanau’s model. Additionally, local attention offered a benefit in terms of faster translation times.

Image Source: Luong et al., 2015

The above illustrations show that there is a challenge with using global attention for translation. It requires the model to consider every single word in the source sentence for each word it generates in the target language. This gets computationally expensive, especially for translating long texts like paragraphs or documents.

According to Luong et al., the solution is a local attention mechanism. Instead of looking at everything, it focuses only on a relevant selection of words from the source sentence for each word it translates. This makes the process more efficient and handles longer texts better.

By 2016, Google Translate had seen a significant leap forward: it transitioned from the older statistical machine translation method to a newer, neural network-based approach.

This new approach combined a seq2seq model with LSTMs and the “additive” attention mechanism. Remarkably, Google Translate achieved a higher level of performance in just nine months using this method, surpassing the capabilities of the previous statistical approach that had taken ten years to develop.

The shift marked a significant step towards the powerful neural machine translation systems we see today.

2017: The Beginning of Transformers

The paper titled ‘Attention Is All You Need’ is a sensational landmark in deep learning architecture. The paper, authored by eight Google scientists, starts by stating that:

“The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.”(8).

In 2017, the scientists tested the new machine translation model on two different language pairs (English-to-German and English-to-French). The model not only delivered superior translation quality but also offered significant advantages in terms of efficiency.

Image Source: ‘Attention Is All You Need’

Here’s the breakdown of the results of the 2017 ‘Attention Is All You Need’ project:

Better Translations: The models achieved higher BLEU scores compared to existing top performers, even those that combine multiple models. On the English-to-German task, the model surpassed the best results by over 2 BLEU points.

Faster Training: Another major benefit was the training speed. The transformer model reached a BLEU score of 41.0 on the English-to-French task after training for just 3.5 days on eight GPUs. This was a fraction of the training time required by the best models reported until then.

Efficiency Boost: By being more parallelizable, the models were able to leverage the power of multiple processors for faster training and potentially even better results.

In essence, the transformer models offered a significant leap forward in machine translation, that is, superior quality, faster training, and improved efficiency.
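
To ground this in code, here is a minimal encoder-only sketch built from PyTorch’s off-the-shelf transformer layers. The dimensions and data are illustrative, and it omits the decoder half of the original encoder-decoder model.

```python
import torch
import torch.nn as nn

# Minimal encoder-only transformer stack; sizes are illustrative.
d_model, n_heads, n_layers, seq_len = 64, 4, 2, 10

layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                   dim_feedforward=128, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

tokens = torch.randn(1, seq_len, d_model)   # a batch of one 10-token sequence of embeddings
out = encoder(tokens)                       # self-attention runs over all tokens in parallel
print(out.shape)                            # torch.Size([1, 10, 64])
```

Because every token attends to every other token in a single layer (rather than information flowing step by step as in an RNN), this kind of stack parallelizes well across GPUs, which is where the training-speed advantage comes from.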

2018–2020: Explosion of Transformers in NLP and Other Fields

Natural Language Processing (NLP) experienced a significant shift in 2018 with the introduction of two key models: ELMo and BERT.

ELMo (2018): This model represented a leap forward from traditional methods like “bag of words” and word2vec. It employed a bi-directional LSTM, a type of neural network, to process entire sentences before assigning embedding vectors (numerical representations) to each word. This allowed ELMo to capture deeper contextual meaning within sentences.

BERT (2018): Building on ELMo’s success, BERT came along later in 2018. It utilized an encoder-only transformer, a more powerful neural network architecture, and achieved even better results than ELMo, especially when dealing with very large datasets (over 1 billion words).
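
As a minimal usage sketch, assuming the Hugging Face transformers library and the public bert-base-uncased checkpoint are available, this is roughly how BERT’s contextual embeddings can be obtained today:

```python
# Requires the Hugging Face `transformers` library (pip install transformers torch).
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# BERT is encoder-only: every token's output vector is contextualized by the
# whole sentence, so the same word gets different vectors in different contexts.
inputs = tokenizer("I deposited cash at the bank", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)   # (1, num_tokens, 768) for bert-base
```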

The trend of transformers surpassing older architectures such as recurrent neural networks (RNNs) and convolutional neural networks (CNNs) continued in 2020:

Vision Transformer (ViT, 2020): This model demonstrated that transformers could outperform convolutional neural networks (CNNs), the long-dominant architecture for computer vision tasks like image recognition.

Speech Processing Transformer (2020): Similarly, a transformer-based model with additional convolutions proved superior to RNNs typically used for speech processing tasks.

Finally, in 2020, researchers addressed convergence issues with the original transformer architecture. Xiong et al., in a paper titled ‘On Layer Normalization in the Transformer Architecture’, proposed a solution called the pre-LN Transformer, where normalization layers are placed before the multi-headed attention mechanism (instead of after).

The authors start by stating that:

“The Transformer is widely used in natural language processing tasks. To train a Transformer however, one usually needs a carefully designed learning rate warm-up stage, which is shown to be crucial to the final performance but will slow down the optimization and bring more hyper-parameter tunings.”

In conclusion, the authors state that:

“…our theory also shows that if the layer normalization is put inside the residual blocks (recently proposed as Pre-LN Transformer), the gradients are well-behaved at initialization. This motivates us to remove the warm-up stage for the training of Pre-LN Transformers. We show in our experiments that Pre-LN Transformers without the warm-up stage can reach comparable results with baselines while requiring significantly less training time and hyper-parameter tuning on a wide range of applications.”(9).

This modification improved the transformer’s training stability.
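
As a rough sketch of the difference, using PyTorch building blocks with illustrative sizes, the two orderings of layer normalization look like this:

```python
import torch
import torch.nn as nn

# `attn` and `ffn` stand in for the multi-head attention and feed-forward
# sub-layers of one transformer block; shapes are illustrative.
d_model = 64
norm1, norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
ffn = nn.Sequential(nn.Linear(d_model, 128), nn.ReLU(), nn.Linear(128, d_model))

def post_ln_block(x):                               # original (2017) ordering
    x = norm1(x + attn(x, x, x)[0])                 # normalize AFTER the residual add
    return norm2(x + ffn(x))

def pre_ln_block(x):                                # Xiong et al. (2020) ordering
    x = x + attn(norm1(x), norm1(x), norm1(x))[0]   # normalize BEFORE the sub-layer
    return x + ffn(norm2(x))

x = torch.randn(1, 10, d_model)
print(post_ln_block(x).shape, pre_ln_block(x).shape)
```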

These advancements highlight the dominance of transformers in NLP tasks, paving the way for further innovation in language understanding and processing.

2021 to Present: The Start of the Generative Era, GPT-X, DALL·E

The years 2021 to 2023 witnessed a significant leap forward in the development of large language models (LLMs).

One prominent example was the emergence of massive, unidirectional (autoregressive) transformers like GPT-3, boasting parameters exceeding 100 billion. These models, along with others from OpenAI’s GPT series, pushed the boundaries of what LLMs could achieve.

Fast Forward to Today: A World of LLM Applications Powered by Transformers

The present-day is a time of abundance in the LLM landscape. We’ve seen a surge in the development of not only colossal models but also a diverse range of applications utilizing their capabilities. Here are some key highlights:

ChatGPT: This interactive chatbot leverages LLM technology to engage in natural conversation, making it a powerful tool for communication and interaction.

GPT-4: The latest iterations of the GPT series push the envelope even further, offering groundbreaking advancements in LLM capabilities.

Large Language Models Beyond Borders: Models like Gemini and LaMDA from Google demonstrate the global reach of LLM development, fostering innovation across various regions.

Open-Source LLMs: The rise of open-source LLMs encourages collaboration and democratizes access to this powerful technology. This fosters a more inclusive environment for LLM research and development.

Beyond Language: The impact of transformers extends beyond just text. Whisper exemplifies how they can be applied to speech recognition, while the Robotics Transformer demonstrates their potential in controlling robots.

AI-Powered Creation: Stable Diffusion showcases how transformer-based text understanding can drive image generation, opening doors for new forms of creative expression.

Multimodal LLMs: Models like Sora represent the future of LLMs, capable of processing and generating information across different modalities, like text and video.

If history is anything to go by, there is no doubt that progress is inexorable! This is just a glimpse into the ever-expanding world of LLMs. As research and development continue, we can expect even more exciting applications and advancements in the years to come.

Who knows what the future holds?

References

  1. Elman, Jeffrey. ‘Finding Structure in Time’. March 1990.
  2. Schmidhuber, J. ‘Reducing the Ratio Between Learning Complexity and Number of Time-Varying Variables in Fully Recurrent Nets’. 1993.
  3. Hochreiter, Sepp & Schmidhuber, Jürgen. ‘Long Short-Term Memory’. Neural Computation. December 1997.
  4. Banko, M. & Brill, E. ‘Scaling to Very Very Large Corpora for Natural Language Disambiguation’. July 2001.
  5. Sutskever, Ilya, Vinyals, Oriol & Le, Quoc V. ‘Sequence to Sequence Learning with Neural Networks’. 2014.
  6. Cho et al. ‘On the Properties of Neural Machine Translation: Encoder–Decoder Approaches’. 2014.
  7. Luong et al. ‘Effective Approaches to Attention-based Neural Machine Translation’. September 2015.
  8. Vaswani et al. ‘Attention Is All You Need’. Google, 2017.
  9. Xiong et al. ‘On Layer Normalization in the Transformer Architecture’. February 2020.
  10. AWS. ‘What are Transformers in AI’. 2024.


Published via Towards AI
