

Why BERT is Not GPT

Last Updated on June 13, 2024 by Editorial Team

Author(s): Thiongo John W

Originally published on Towards AI.

Photo by david clarke on Unsplash

The most recent breakthroughs in language models have come from using neural network architectures to represent text. Few would dispute that large language models have evolved very rapidly since 2018.

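The modern story starts with Word2Vec in 2013, which built on earlier N-gram ideas and was the state of the art in language modelling at the time. RNNs and LSTMs, although invented much earlier (LSTMs date to 1997), came into wide use for language tasks around 2014. These were followed by the breakthrough of the Attention Mechanism.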

It was the Attention Mechanism breakthrough that gave birth to Large Pre-Trained Models and Transformers.

Both BERT and GPT are based on the Transformer architecture. This piece compares and contrasts the two models.

The story starts with word embedding.

What is Word Embedding?

Word embedding is a technique in natural language processing (NLP) where words are represented as vectors in a continuous vector space. These vectors capture semantic meanings, allowing words with similar meanings to have similar representations.

For example, in a word embedding model, the words “king” and “queen” would have vectors that are close to each other, reflecting their related meanings. In the same way, the words “car” and “truck” are likely to have vectors very close to each other, as are “cat” and “dog”.

However, you would not expect “car” and “dog” to have very close vectors.
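To make "close vectors" concrete, here is a minimal sketch that measures closeness with cosine similarity, a common choice for comparing embeddings. The three-dimensional vectors are hypothetical and hand-picked for illustration; real embeddings are learned from data and typically have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings, NOT taken from any trained model.
toy_embeddings = {
    "car":   [0.9, 0.1, 0.0],
    "truck": [0.8, 0.2, 0.1],
    "cat":   [0.1, 0.9, 0.2],
    "dog":   [0.2, 0.8, 0.3],
}

for first, second in [("car", "truck"), ("cat", "dog"), ("car", "dog")]:
    score = cosine_similarity(toy_embeddings[first], toy_embeddings[second])
    print(f"{first:>5} vs {second:<5} similarity = {score:.3f}")
```

With these toy values, the related pairs score close to 1, while "car" and "dog" score much lower.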

A famous example of word embedding is Word2Vec.

Image by: Mahajan, Patil, and Sankar. 2013

Word2Vec is a shallow neural network model that learns word vectors by training on context windows of words (short runs of neighbouring words, much like n-grams). There are two main approaches:

Continuous Bag of Words (CBOW): Predicts a target word based on its surrounding context. For example, in the sentence “the cat sat on the mat,” given the context words “the,” “cat,” “on,” and “the,” CBOW predicts the missing middle word “sat.”

Skip-gram: Predicts the surrounding words given a target word. For example, given the word “sat,” Skip-gram predicts the context words “the,” “cat,” “on,” and “the.”

Both methods help to capture semantic relationships, with similar words having similar vector representations. This facilitates various NLP tasks by providing meaningful word embeddings.
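The difference between the two approaches is easiest to see in the training examples each one builds from the same sentence. The sketch below only generates those examples and does not train a network. The function names and the window size of 2 are assumptions made for illustration.

```python
def context_window(tokens, index, window):
    """Words within `window` positions of tokens[index], excluding the target itself."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return left + right

def cbow_examples(tokens, window=2):
    """(context words, target word) pairs: CBOW predicts the target from its context."""
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]

def skipgram_examples(tokens, window=2):
    """(target word, context word) pairs: Skip-gram predicts each context word from the target."""
    return [(tokens[i], context_word)
            for i in range(len(tokens))
            for context_word in context_window(tokens, i, window)]

sentence = "the cat sat on the mat".split()

print("CBOW example for 'sat':", cbow_examples(sentence)[2])
print("Skip-gram pairs for 'sat':",
      [pair for pair in skipgram_examples(sentence) if pair[0] == "sat"])
```

CBOW produces one example per position (many inputs, one output), whereas Skip-gram produces one example per target-context pair.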

Word2Vec uses context from large corpora to learn word associations. This approach enables various NLP tasks, such as sentiment analysis and machine translation, by providing a rich representation of words based on their usage patterns.

Image by: Mahajan, Patil, and Sankar. 2013

Word2Vec itself was introduced by Mikolov et al. at Google in 2013. The character n-gram variant shown in the images above is described by Mahajan, Patil, and Sankar in their paper titled ‘Word2Vec Using Character N-Grams’.

Recurrent Neural Networks (RNNs) are a type of neural network designed for sequential data. They process inputs one step at a time, maintaining a hidden state that captures information about previous inputs, which makes them suitable for tasks like time series prediction and natural language processing. The idea behind RNNs is sometimes traced as far back as 1925, when the Ising model was used to simulate magnetic interactions in a way that is analogous to an RNN’s state transitions.
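The role of the hidden state can be illustrated with a minimal sketch of a single-unit RNN. The weights `w_x`, `w_h`, and `b` are hypothetical fixed values rather than learned parameters, and a real RNN would use vectors and matrices instead of scalars.

```python
import math

def rnn_step(x, h_prev, w_x, w_h, b):
    """One RNN step: the new hidden state mixes the current input with the previous state."""
    return math.tanh(w_x * x + w_h * h_prev + b)

def run_rnn(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Process a sequence one element at a time and return every hidden state."""
    h = 0.0  # initial hidden state
    states = []
    for x in inputs:
        h = rnn_step(x, h, w_x, w_h, b)
        states.append(h)
    return states

# Only the first input is non-zero, yet later states are still non-zero:
# the hidden state carries a fading memory of the earlier input.
states = run_rnn([1.0, 0.0, 0.0])
print([round(s, 4) for s in states])
```

The steadily shrinking values hint at why plain RNNs struggle to keep information over long sequences.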

Long Short-Term Memory (LSTM) networks are a specialized type of RNN designed to overcome the limitations of standard RNNs, particularly the vanishing gradient problem.

Image by: Hochreiter and Schmidhuber. 1997

LSTMs use gates (input, output, and forget gates) to regulate the flow of information, enabling them to maintain long-term dependencies and remember important information over long sequences. LSTMs were invented by Hochreiter and Schmidhuber in 1997, and presented in their paper titled ‘Long Short-Term Memory’.

Here is an implementation of the cell architecture shown above for LSTM:

Image by: Hochreiter and Schmidhuber. 1997
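The gating described above can also be sketched in code. This is a minimal single-unit version of the commonly used LSTM formulation with input, output, and forget gates. It is not necessarily identical to the cell in the figure, and the parameter names and values in `params` are hypothetical, chosen by hand rather than learned.

```python
import math

def sigmoid(z):
    """Squash a number into (0, 1); used for the gates."""
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One time step of a single-unit LSTM cell.

    x, h_prev and c_prev are scalars; p is a dict of (hypothetical) weights.
    """
    forget_gate = sigmoid(p["w_f"] * x + p["u_f"] * h_prev + p["b_f"])
    input_gate = sigmoid(p["w_i"] * x + p["u_i"] * h_prev + p["b_i"])
    output_gate = sigmoid(p["w_o"] * x + p["u_o"] * h_prev + p["b_o"])
    candidate = math.tanh(p["w_g"] * x + p["u_g"] * h_prev + p["b_g"])
    # Cell state: keep part of the old memory, add part of the new candidate.
    c = forget_gate * c_prev + input_gate * candidate
    # Hidden state: expose a gated view of the cell state.
    h = output_gate * math.tanh(c)
    return h, c

# Hypothetical parameters for illustration only.
params = {name: 0.5 for name in
          ("w_f", "u_f", "w_i", "u_i", "w_o", "u_o", "w_g", "u_g")}
params.update({"b_f": 1.0, "b_i": 0.0, "b_o": 0.0, "b_g": 0.0})

h, c = 0.0, 0.0
for x in [1.0, -1.0, 0.5]:
    h, c = lstm_step(x, h, c, params)
    print(f"x={x:+.1f}  h={h:+.4f}  c={c:+.4f}")
```

The separate cell state `c`, updated additively through the forget and input gates, is what lets an LSTM carry information across many steps.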

Comparison of Word2Vec, RNNs, and LSTMs

Purpose: Word2Vec is primarily a word embedding technique, generating dense vector representations for words based on their context. RNNs and LSTMs, on the other hand, are used for modeling and predicting sequences.

Architecture: Word2Vec employs shallow, two-layer neural networks, while RNNs and LSTMs have more complex, deep architectures designed to handle sequential data. (The more hidden layers an architecture has, the deeper the network.)

Output: Word2Vec outputs fixed-size vectors for words. RNNs and LSTMs output sequences of vectors, suitable for tasks requiring context understanding over time, like language modeling and translation.

Memory Handling: LSTMs, unlike standard RNNs and Word2Vec, can effectively manage long-term dependencies due to their gating mechanisms, making them more powerful for complex sequence tasks.

Before transformers, Word2Vec was the go-to choice for creating word embeddings, while RNNs and LSTMs excelled in tasks involving sequential data and long-term dependencies.

What is the Attention Mechanism?

The attention mechanism is a key component in neural networks, particularly in transformers and large pre-trained language models. It allows the model to focus on specific parts of the input sequence when generating output. It assigns different weights to different words or tokens in the input, enabling the model to prioritize important information and handle long-range dependencies more effectively.

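Attention was first used alongside recurrent networks for machine translation. The Transformer, an architecture built entirely on attention, was introduced in the 2017 paper “Attention Is All You Need” by Ashish Vaswani et al.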


Tokenization, the step that splits text into the tokens that attention operates on, is a very important part of this pipeline.

Attention Mechanism Relation to Transformers

Transformers use self-attention mechanisms to process input sequences in parallel rather than sequentially, as done in RNNs. This allows transformers to capture contextual relationships between all tokens in a sequence simultaneously, improving the handling of long-term dependencies and reducing training time.

The self-attention mechanism helps in identifying the relevance of each token to every other token within the input sequence, enhancing the model’s ability to understand the context.
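The idea of scoring every token against every other token can be sketched as scaled dot-product attention. This is a simplified illustration, not a full transformer layer: it uses the token vectors directly as queries, keys, and values, whereas a real model first applies learned projection matrices and uses multiple heads. The toy token vectors are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(queries, keys, values):
    """Scaled dot-product attention where every token attends to every token."""
    d_k = len(keys[0])
    all_weights, outputs = [], []
    for q in queries:
        # Relevance of every token to the current one.
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Output is the weighted average of the value vectors.
        output = [sum(w * v[d] for w, v in zip(weights, values))
                  for d in range(len(values[0]))]
        all_weights.append(weights)
        outputs.append(output)
    return outputs, all_weights

# Three hypothetical 2-dimensional token vectors.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row shows how much one token attends to each token in the sequence. The rows do not depend on one another, which is why the computation can run in parallel.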

Attention Mechanism Relation to Large Pre-Trained Language Models

Large pre-trained language models, such as BERT and GPT, are built on transformer architectures and leverage attention mechanisms to learn contextual embeddings from vast amounts of text data.

These models utilize multiple layers of self-attention to capture intricate patterns and dependencies within the data, enabling them to perform a wide range of NLP tasks with high accuracy after fine-tuning on specific tasks.

The attention mechanism is fundamental to the success of transformers and large pre-trained language models, allowing them to efficiently handle complex language understanding and generation tasks.

This focus on understanding context is similar to the way YData Fabric, a data quality platform designed for data science teams, emphasizes the importance of clean and well-structured data for building high-performing AI models. Just as attention mechanisms help language models understand the nuances of language, good data quality is essential for AI models to learn accurate and generalizable patterns from the data they are trained on.

So, What Are BERT and GPT?

First things first, both of these models are based on the transformer architecture. Both models are Large Pre-Trained Language Models.


BERT stands for Bidirectional Encoder Representations from Transformers. It is a pre-trained language model developed by Google and introduced in October 2018. It is based on the transformer architecture. (If you’ve read this far, you know what I did there.)

The abstract of the paper by Devlin, Chang, Lee, and Toutanova, titled “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, reads:

“We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.” (Devlin, Chang, Lee & Toutanova, 2018).

Image by: Devlin, Chang, Lee & Toutanova, 2018

Two things to note: (1) BERT is bidirectional, that is, it conditions on both the left and the right context of each token simultaneously. (2) Question answering and language inference are among the major tasks it is fine-tuned for.

Some domain-specific variants of BERT include ClinicalBERT and BioBERT.


GPT stands for Generative Pre-trained Transformer. It refers to a family of large language models (LLMs) created by OpenAI, known for their ability to generate human-like text. GPT models can create new text content, like poems, code, scripts, musical pieces, and more. They are pre-trained and use transformer models in their core architecture.

Again, you see what I did there?

In their paper titled “Improving Language Understanding by Generative Pre-Training”, which introduced GPT, Radford, Narasimhan, Salimans, and Sutskever state in the abstract:

“Natural language understanding comprises a wide range of diverse tasks such as textual entailment, question answering, semantic similarity assessment, and document classification. Although large unlabeled text corpora are abundant, labeled data for learning these specific tasks is scarce, making it challenging for discriminatively trained models to perform adequately. We demonstrate that large gains on these tasks can be realized by generative pre-training of a language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task.” (Radford, Narasimhan, Salimans, & Sutskever, 2018).

Image by: Radford, Narasimhan, Salimans, & Sutskever, 2018

Two things to note: (1) GPT is primarily generative. (2) GPT is unidirectional: each token is predicted using only the tokens to its left.
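The bidirectional versus unidirectional distinction comes down to which positions each token is allowed to attend to. The sketch below builds the two attention masks as plain lists. The function names are hypothetical, and real implementations apply such masks to the attention scores inside each transformer layer.

```python
def bidirectional_mask(n):
    """BERT-style: every position may attend to every other position."""
    return [[1] * n for _ in range(n)]

def causal_mask(n):
    """GPT-style: position i may attend only to positions 0..i (itself and the left)."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def visible_tokens(tokens, mask, position):
    """Tokens that the token at `position` is allowed to attend to."""
    return [tok for tok, allowed in zip(tokens, mask[position]) if allowed]

tokens = ["the", "cat", "sat", "down"]
n = len(tokens)

print("BERT-style view from 'cat':", visible_tokens(tokens, bidirectional_mask(n), 1))
print("GPT-style view from 'cat': ", visible_tokens(tokens, causal_mask(n), 1))
```

Seeing the whole sentence suits understanding tasks, while seeing only the left context is what allows a model to generate text one token at a time.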

There have been several iterations of GPT, with the GPT-4 family being the latest and most advanced at the time of writing.

Major Differences Between BERT and GPT

First, let’s note the principal similarities between BERT and GPT:

  1. Both are based on the Transformer architecture.
  2. Both are pre-trained on a large corpus of text.
  3. Both can be fine-tuned for various downstream tasks.

The differences are:

Image by: H2Oai

And all that is why BERT is not GPT.


Published via Towards AI
