
The Comparison between the Encoder and the Decoder

Last Updated on May 14, 2025 by Editorial Team

Author(s): tangbasky

Originally published on Towards AI.

This article primarily discusses the advantages and disadvantages of large language models based on encoder and decoder architectures. Both architectures are built upon the Transformer model. The encoder-decoder combination was originally designed for translation tasks, where the encoder encodes the input and the decoder decodes the output. The general structure is illustrated in Figure 1 below.

Figure 1: The encoder and decoder in translation task. Image from [3].
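As a rough sketch of this encoder-decoder wiring (not the exact translation model in the figure), PyTorch's built-in nn.Transformer ties an encoder stack and a decoder stack together; the dimensions and random tensors below are purely illustrative:

```python
import torch
import torch.nn as nn

# A minimal encoder-decoder model; sizes match the defaults of the original Transformer.
model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6)

src = torch.rand(10, 32, 512)  # (source length, batch, d_model): sentence to translate
tgt = torch.rand(20, 32, 512)  # (target length, batch, d_model): translation so far

# Causal mask so each target position attends only to earlier target positions.
tgt_mask = model.generate_square_subsequent_mask(20)

out = model(src, tgt, tgt_mask=tgt_mask)
print(out.shape)  # torch.Size([20, 32, 512])
```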

Since both the encoder and decoder modules use the Transformer structure, the overall model architecture is shown in Figure 2 below.

Figure 2: The framework of Transformer. Image from [1].

At the time, the Transformer architecture's success let it dominate the leaderboards of various public datasets, marking the transition of natural language models from the LSTM era to the Transformer era. However, few papers back then studied the encoder and decoder architectures separately. Below, we introduce typical encoder and decoder models.

  • BERT

In 2018, BERT emerged and quickly came to dominate a wide range of NLP tasks. It introduced three significant innovations:

  1. Scaling the Transformer: making it larger and deeper.
  2. Encoder-only design: using only the encoder stack.
  3. Masked Language Modeling (MLM): a cloze-style pretraining task (a minimal masking sketch follows the note below).

Note: BERT also included Next Sentence Prediction (NSP), but subsequent studies found it had minimal impact on model performance.
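Here is a minimal sketch of the masking step behind MLM: roughly 15% of token ids are replaced by a mask id, and labels are kept only at those positions. The token ids are illustrative, and BERT's full 80/10/10 replacement rule is reduced to a comment:

```python
import random

MASK_ID = 103        # [MASK] token id in the original BERT vocabulary
IGNORE_INDEX = -100  # positions with this label are skipped by the loss

def mask_tokens(token_ids, mask_prob=0.15):
    """Randomly mask ~15% of tokens; labels are set only at the masked positions."""
    inputs, labels = list(token_ids), [IGNORE_INDEX] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:
            labels[i] = tok      # the model must recover the original token here
            inputs[i] = MASK_ID  # (real BERT keeps 10% unchanged, swaps 10% for random tokens)
    return inputs, labels

token_ids = [2023, 2003, 1037, 7099, 6251]  # illustrative ids for a short sentence
print(mask_tokens(token_ids))
```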

Notably, BERT excelled primarily in discriminative tasks (e.g., classification) but underperformed in generative tasks. Before GPT-3, no model, including decoder-only models like GPT-1 and GPT-2, achieved strong performance in generative tasks.

  • GPT

After BERT, numerous "X-BERT" variants emerged, yet none made significant breakthroughs in generation until GPT-3. The GPT series popularized decoder-only architectures in NLP, gradually displacing BERT-style models from the spotlight. GPT-1 and GPT-2 had only limited success, and model size turned out to be the critical factor: GPT-3 pulled decisively ahead of BERT-style models in generative tasks while remaining competitive in discriminative tasks.

  • T5

Are there models that combine both encoder and decoder? The answer is yes: T5 (Text-to-Text Transfer Transformer) uses an encoder-decoder architecture. It matches BERT's performance in discriminative tasks but may underperform decoder-only models of similar size in generative tasks.
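Concretely, T5 casts every task as text in, text out by prepending a task prefix, so classification and generation share one interface. The examples below are adapted from the task prefixes used in the T5 paper (inputs and targets shortened):

```python
# T5 frames both discriminative and generative tasks as plain text pairs.
examples = [
    # classification (CoLA acceptability): the model literally generates the label word
    ("cola sentence: The course is jumping well.", "unacceptable"),
    # translation: same model, same text-to-text interface
    ("translate English to German: That is good.", "Das ist gut."),
    # summarization
    ("summarize: state authorities dispatched emergency crews tuesday ...",
     "state authorities dispatched emergency crews to survey damage ..."),
]
for source, target in examples:
    print(f"{source!r} -> {target!r}")
```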

Analysis of Model Architectures
We now analyze three architecture types (encoder-only, encoder-decoder, and decoder-only) to understand how they relate and which tasks each suits.

Figure 3: Schematics of the Transformer architecture variants we consider. Image from [4].

From Figure 3 above, it is evident that in the encoder-decoder structure the encoder employs bidirectional self-attention (each token attends to every token in the sequence). The language model, by contrast, is decoder-only: attention only flows from the current token to the tokens preceding it, i.e., unidirectional self-attention. The prefix LM is a clever middle ground: it splits a sequence into two parts, a prefix and a target. The prefix undergoes bidirectional self-attention, while the target part only undergoes unidirectional self-attention. The attention patterns of these three structures are summarized in Figure 4 below.

Figure 4: Attention patterns in a causal decoder, non-causal decoder, and encoder-decoder architecture. Image from [2].
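The three attention patterns in Figure 4 can be written down as simple boolean masks (1 means "may attend"). A small NumPy sketch, with the sequence length and prefix split chosen arbitrarily:

```python
import numpy as np

n = 6            # total sequence length (illustrative)
prefix_len = 3   # length of the prefix/input segment for the prefix LM

# Encoder (and the encoder side of an encoder-decoder): fully bidirectional.
bidirectional = np.ones((n, n), dtype=int)

# Causal decoder: token i attends only to tokens 0..i.
causal = np.tril(np.ones((n, n), dtype=int))

# Prefix LM (non-causal decoder): bidirectional inside the prefix,
# causal over the target part.
prefix_lm = np.tril(np.ones((n, n), dtype=int))
prefix_lm[:prefix_len, :prefix_len] = 1   # prefix tokens see each other both ways

print(bidirectional, causal, prefix_lm, sep="\n\n")
```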

So why did the encoder-only structure, once seemingly invincible, lose out to the decoder-only structure? And why does even the encoder-decoder structure still trail the decoder-only structure? I will analyze this from the following aspects.

The Problem of Rank

Let's first recall what the rank of a matrix is. Per Wikipedia, "the rank of a matrix is the maximal number of linearly independent rows or columns." What role does rank play in model computation? It shows up in the attention mechanism: each token computes a weight for every other token, and these weights form a matrix. What does this matrix represent? Take a look at the following two tables.

Table 1: A high-rank attention weight matrix. Image from the author.
Table 2: A low-rank attention weight matrix. Image from the author.

From the tables above, we can observe that with high-rank attention weights, each token has distinct attention weights for the other tokens. In contrast, low-rank attention weights are nearly identical across tokens. High rank is desirable because each token retains unique information, whereas low rank homogenizes tokens, erasing their distinguishing features and preventing the model from learning token-specific characteristics.

The key conclusion is that the bidirectional attention mechanism in encoders tends to produce low-rank matrices, while the unidirectional attention in decoders preserves full rank. For detailed proofs, refer to the paper "Low-Rank Bottleneck in Multi-head Attention Models". The core issue is an inherent limitation of standard multi-head attention: when the head dimension (d) is smaller than the sequence length (n), a "low-rank bottleneck" occurs, reducing the model's expressive power. Mathematically, multiplying an n×d matrix by a d×n matrix followed by a softmax operation results in a low-rank matrix if n ≫ d.
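We can check this bottleneck numerically: the attention logits are the product of an n×d matrix and a d×n matrix, so their rank can never exceed the head dimension d. A small NumPy sketch with arbitrarily chosen sizes:

```python
import numpy as np

n, d = 64, 8   # sequence length much larger than the head dimension (illustrative)
rng = np.random.default_rng(0)

Q = rng.standard_normal((n, d))   # queries
K = rng.standard_normal((n, d))   # keys

logits = Q @ K.T / np.sqrt(d)     # n x n attention logits

# The logits are the product of an (n x d) and a (d x n) matrix,
# so their rank is at most d -- the "low-rank bottleneck".
print(np.linalg.matrix_rank(logits))   # 8, far below n = 64
```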

However, does a low-rank encoder necessarily underperform a high-rank decoder? Not necessarily. High rank indicates greater expressive potential, but effectively harnessing this potential is critical. Without proper optimization, a high-rank decoder might still underperform a low-rank encoder.

Differences in Pretraining Tasks

Encoder-only models like BERT use Masked Language Modeling (MLM), where ~15% of tokens in a sequence are randomly masked, and the model predicts these masked tokens from context. Decoder-only models, in contrast, use autoregressive language modeling, predicting the next token given previous ones. Consequently:

  • MLM encourages global context understanding, making encoder-only models stronger at discriminative tasks (e.g., classification).
  • Autoregressive training fosters sequential reasoning, making decoder-only models better suited for generative tasks (the two objectives are contrasted in the sketch below).
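To make the contrast concrete, here is how training pairs are built from the same toy token sequence under each objective; the ids are made up, and the masked positions are hand-picked rather than randomly sampled:

```python
tokens = [11, 42, 7, 93, 5, 68]   # one training sequence (illustrative ids)
IGNORE = -100                     # label value the loss function skips
MASK = 0                          # illustrative [MASK] id

# MLM (encoder-only): corrupt a few positions, predict only those positions.
masked_positions = {1, 4}         # hand-picked here; ~15% sampled at random in practice
mlm_inputs = [MASK if i in masked_positions else t for i, t in enumerate(tokens)]
mlm_labels = [t if i in masked_positions else IGNORE for i, t in enumerate(tokens)]

# Autoregressive LM (decoder-only): predict every next token from its prefix.
lm_inputs = tokens[:-1]           # the model sees tokens[0..i] ...
lm_labels = tokens[1:]            # ... and must predict tokens[i+1]

print(mlm_inputs, mlm_labels)     # loss on 2 positions, full bidirectional context
print(lm_inputs, lm_labels)       # loss on every position, left-to-right context only
```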

Why do decoder-only models excel at discriminative tasks?
This is due to prompt engineering and in-context learning. By framing tasks as prompts, decoder-only models can generate outputs that mimic discriminative behaviors, effectively repurposing generative capabilities for classification tasks.
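For example, sentiment classification can be reframed as choosing the more likely continuation of a prompt. A minimal sketch, where continuation_logprob is a hypothetical stand-in for whatever API returns a continuation's log-probability under the model:

```python
def continuation_logprob(prompt: str, continuation: str) -> float:
    """Hypothetical helper: the log-probability the language model
    assigns to `continuation` given `prompt` (stands in for a real model call)."""
    raise NotImplementedError

def classify_sentiment(review: str) -> str:
    # Frame the discriminative task as text the model would naturally continue.
    prompt = f"Review: {review}\nSentiment:"
    labels = [" positive", " negative"]
    # Pick whichever label the model considers the more likely continuation.
    return max(labels, key=lambda label: continuation_logprob(prompt, label)).strip()
```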

Model Scale

Early decoder-only models underperformed encoders, which seems contradictory given the previous points. The key factor here is model scale:

  • During BERT's era, models typically had a few hundred million parameters (BERT-large has roughly 340M).
  • GPT-3, in contrast, scaled to 175B parameters; a back-of-the-envelope comparison follows below.
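As a quick back-of-the-envelope comparison (parameter counts only, taking roughly 340M for BERT-large):

```python
bert_large_params = 340e6  # approximate parameter count of BERT-large
gpt3_params = 175e9        # parameter count of GPT-3

# GPT-3 is roughly 500x larger than the largest widely used BERT model.
print(f"GPT-3 / BERT-large: {gpt3_params / bert_large_params:.0f}x")  # ~515x
```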

Google (developers of BERT) focused on encoder architectures, while OpenAI pursued decoders. While Google likely experimented with large encoders, empirical evidence shows that:

  • Encoders achieve good performance quickly with small models but plateau in scalability.
  • Decoders require substantial scale to outperform encoders but ultimately achieve superior generalization.

Experimental Evidence

In "What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization?", the authors compared causal decoder-only, non-causal decoder-only (prefix LM), and encoder-decoder architectures using roughly 5B-parameter models pretrained on 170B tokens. Key findings:

  1. Decoder-only + generative pretraining excels at zero-shot generalization for generative tasks.
  2. Encoder-decoder + MLM + multitask finetuning performs best for zero-shot MLM tasks but struggles with open-ended question answering.

Conclusion

From what we've seen so far, there isn't a clear superiority between encoder models and decoder models; they simply serve different tasks. Encoder models are better suited for discriminative tasks, while decoder models excel at generative tasks. If a decoder model is to perform discriminative tasks, two conditions must be met: (1) the model must be sufficiently large in scale, otherwise its capabilities cannot be fully activated; and (2) a well-designed prompt must be provided to explicitly guide the model in performing the target task.

Additionally, there are encoder-decoder models. According to some experiments reported online, encoder-decoder models don't seem to underperform decoder-only models in quality. However, they do not parallelize as well as decoder-only models and are therefore far less efficient, so they have largely fallen out of favor in industrial applications.
