
A Comparison Between Encoder and Decoder Architectures
Last Updated on May 14, 2025 by Editorial Team
Author(s): tangbasky
Originally published on Towards AI.
This article discusses the advantages and disadvantages of large language models built on encoder and decoder architectures. Both architectures derive from the Transformer model. The encoder-decoder architecture was originally designed for translation tasks, where the encoder encodes the input sequence and the decoder generates the output sequence. The general structure is illustrated in Figure 1 below.

Since both the encoder and the decoder modules are built from Transformer blocks, the overall model structure is shown in Figure 2 below.

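To make this layout concrete, here is a minimal sketch of a translation-style encoder-decoder built with PyTorch's nn.Transformer. The vocabulary sizes, dimensions, and toy batch are illustrative assumptions, not values from the article.

```python
import torch
import torch.nn as nn

d_model, src_vocab, tgt_vocab = 512, 1000, 1000   # illustrative sizes

src_embed = nn.Embedding(src_vocab, d_model)      # encoder input embedding
tgt_embed = nn.Embedding(tgt_vocab, d_model)      # decoder input embedding
transformer = nn.Transformer(d_model=d_model, batch_first=True)
lm_head = nn.Linear(d_model, tgt_vocab)           # decoder states -> output logits

src = torch.randint(0, src_vocab, (2, 10))        # toy source batch (e.g., sentences to translate)
tgt = torch.randint(0, tgt_vocab, (2, 8))         # toy (shifted) target batch

# Causal mask so each decoder position only attends to earlier target positions.
tgt_mask = transformer.generate_square_subsequent_mask(tgt.size(1))

decoder_out = transformer(src_embed(src), tgt_embed(tgt), tgt_mask=tgt_mask)
logits = lm_head(decoder_out)                     # (batch, tgt_len, tgt_vocab)
print(logits.shape)
```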
The Transformer architecture quickly came to dominate the leaderboards of various public datasets, marking the transition of natural language models from the LSTM era to the Transformer era. At the time, however, few papers studied the encoder and decoder architectures separately. Below, we introduce typical encoder and decoder models.
- BERT
In 2018, BERT emerged and quickly dominated various NLP tasks. It made three significant innovations:
- Scaling the Transformer: Making it larger and deeper.
- Encoder-Only Design: Using only the encoder stack of the Transformer.
- Masked Language Modeling (MLM): A cloze-style pretraining task.
Note: BERT also included Next Sentence Prediction (NSP), but subsequent studies found it had minimal impact on model performance.
Notably, BERT excelled primarily in discriminative tasks (e.g., classification) but underperformed in generative tasks. Before GPT-3, no model, including decoder-only models such as GPT-1 and GPT-2, achieved strong performance in generative tasks.
- GPT
After BERT, numerous "X-BERT" models emerged, yet none made significant breakthroughs in generation until GPT-3. The GPT series popularized decoder-only architectures in NLP, gradually displacing BERT-style models from the spotlight. GPT-1 and GPT-2 had limited success, and model size turned out to be the critical factor: GPT-3 achieved a decisive advantage over BERT in generative tasks while remaining competitive in discriminative ones.
- T5
Are there models that combine both encoder and decoder? The answer is yes: T5 (Text-to-Text Transfer Transformer) uses an encoder-decoder architecture. It matches BERT's performance in discriminative tasks but may underperform decoder-only models of similar size in generative tasks.
Analysis of Model Architectures
We now analyze the three architecture types (encoder-only, encoder-decoder, and decoder-only) to understand their relationships and the tasks each is suited for.

From Figure 3 above, it is evident that in the encoder-decoder structure, the encoder employs bidirectional self-attention: each token attends to every token in the sequence. The (causal) language model, in contrast, has a decoder-only structure in which attention only flows from the current token to the tokens preceding it, i.e., unidirectional self-attention. The prefix LM is a clever middle ground: it splits a sequence into two parts, a prefix and a target. The prefix undergoes bidirectional self-attention, while the target only undergoes unidirectional self-attention. The attention patterns of these three structures are shown in Figure 4 below, followed by a small mask-construction sketch.

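A small NumPy sketch of the three attention patterns just described: fully bidirectional (encoder), causal (decoder-only LM), and prefix LM. The sequence length and prefix length are arbitrary illustrative choices.

```python
import numpy as np

n, prefix_len = 6, 3   # sequence length, length of the prefix part

# Encoder / fully bidirectional: every token attends to every token.
bidirectional = np.ones((n, n), dtype=bool)

# Decoder-only / causal LM: token i attends only to tokens 0..i.
causal = np.tril(np.ones((n, n), dtype=bool))

# Prefix LM: the prefix is fully visible (bidirectional within the prefix,
# and to all target tokens), while target tokens stay causal among themselves.
prefix_lm = np.tril(np.ones((n, n), dtype=bool))
prefix_lm[:, :prefix_len] = True   # every token may attend to the whole prefix

print(bidirectional.astype(int), causal.astype(int), prefix_lm.astype(int), sep="\n\n")
```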
So why did the encoder-only structure, once seemingly invincible, lose out to the decoder-only structure? Even the encoder-decoder structure falls short of decoder-only models. We analyze this from the following aspects.
The Problem of Rank
Let's first explain what the rank of a matrix is. According to Wikipedia, "the rank of a matrix is the number of vectors in a maximal linearly independent set of its row vectors or column vectors." What role does rank play in model computation? It comes up in the attention mechanism: during attention, each token computes a weight for every other token, and these weights form a matrix. What does this matrix represent? Take a look at the two tables below.


From the tables above, we can observe that with high-rank attention weights, each token has distinct attention weights for the other tokens, whereas with low-rank attention weights the rows become identical across tokens. High rank is desirable because it indicates that each token retains unique information; low rank homogenizes tokens, erasing their distinguishing features and preventing the model from learning token-specific characteristics.
The key conclusion is that the bidirectional attention mechanism in encoders tends to produce low-rank matrices, while the unidirectional attention in decoders preserves full rank. For detailed proofs, refer to the paper "Low-Rank Bottleneck in Multi-head Attention Models". The core issue is an inherent limitation of standard multi-head attention: when the per-head dimension d is smaller than the sequence length n, a "low-rank bottleneck" occurs, reducing the model's expressive power. Mathematically, multiplying an n×d matrix by a d×n matrix yields a matrix of rank at most d, so the attention scores that feed the softmax are low-rank whenever n ≫ d.
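As a quick numerical illustration of this bottleneck, the sketch below (with illustrative shapes) checks the rank of the pre-softmax score matrix QKᵀ when n ≫ d.

```python
import numpy as np

n, d = 128, 8                          # sequence length >> per-head dimension
rng = np.random.default_rng(0)
Q = rng.standard_normal((n, d))        # per-head queries
K = rng.standard_normal((n, d))        # per-head keys

scores = Q @ K.T / np.sqrt(d)          # (n, n) pre-softmax attention scores

# The product of an n x d and a d x n matrix has rank at most d,
# so the n x n score matrix is severely rank-limited when n >> d.
print(np.linalg.matrix_rank(scores))   # prints 8, far below n = 128
```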
However, does a low-rank encoder necessarily underperform a high-rank decoder? Not necessarily. High rank indicates greater expressive potential, but effectively harnessing this potential is critical. Without proper optimization, a high-rank decoder might still underperform a low-rank encoder.
Differences in Pretraining Tasks
Encoder-only models like BERT use Masked Language Modeling (MLM): roughly 15% of the tokens in a sequence are randomly masked, and the model predicts the masked tokens from the surrounding context. Decoder-only models, in contrast, use autoregressive language modeling, predicting the next token given the previous ones (a minimal sketch of both objectives follows the list below). Consequently:
- MLM encourages global context understanding, making encoder-only models stronger at discriminative tasks (e.g., classification).
- Autoregressive training fosters sequential reasoning, making decoder-only models better suited for generative tasks.
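Here is a minimal sketch contrasting the two objectives on a toy token-ID sequence; the IDs, the mask token, and the -100 ignore label are illustrative conventions, not values from the article.

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = np.array([11, 42, 7, 93, 5, 28, 64, 19])   # a toy token-ID sequence
MASK_ID = 103                                       # placeholder mask token ID

# --- Masked language modeling (encoder-style, BERT) ---
mask = rng.random(tokens.shape) < 0.15              # mask ~15% of positions
mlm_inputs = np.where(mask, MASK_ID, tokens)        # corrupted input sequence
mlm_targets = np.where(mask, tokens, -100)          # predict only the masked slots

# --- Autoregressive language modeling (decoder-style, GPT) ---
lm_inputs = tokens[:-1]                             # tokens 0..n-2
lm_targets = tokens[1:]                             # predict token i+1 from tokens 0..i

print(mlm_inputs, mlm_targets)
print(lm_inputs, lm_targets)
```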
Why do decoder-only models excel at discriminative tasks?
This is largely due to prompt engineering and in-context learning. By framing a task as a prompt, a decoder-only model can generate outputs that mimic discriminative behavior, effectively repurposing its generative capabilities for classification, as sketched below.
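A sketch of this idea, assuming a hypothetical sequence_log_prob helper that returns the language model's log-probability of a piece of text (any LM scoring utility could play this role):

```python
def sequence_log_prob(text: str) -> float:
    """Hypothetical stand-in: return the LM log-probability of `text`."""
    raise NotImplementedError("plug in your language model's scoring here")

def classify(review: str, labels=("positive", "negative")) -> str:
    """Frame sentiment classification as prompt-completion scoring."""
    prompt = f"Review: {review}\nSentiment:"
    # Score each candidate label as a continuation of the prompt; keep the best.
    return max(labels, key=lambda label: sequence_log_prob(f"{prompt} {label}"))

# Example (with a real LM behind sequence_log_prob):
# classify("The movie was a delight from start to finish.")  # -> "positive"
```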
Model Scale
Early decoder-only models underperformed encoders, which seems contradictory given the previous points. The key factor here is model scale:
- During BERT's era, models typically had a few hundred million parameters (BERT-large has about 340M).
- GPT-3, in contrast, scaled to 175B parameters.
Google (the developer of BERT) focused on encoder architectures, while OpenAI pursued decoders. Although Google likely experimented with large encoders, empirical evidence shows that:
- Encoders achieve good performance quickly with small models but plateau in scalability.
- Decoders require substantial scale to outperform encoders but ultimately achieve superior generalization.
Experimental Evidence
In "What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization?", the authors compared encoder-only, encoder-decoder, and decoder-only architectures using 50B parameter models pretrained on 170B tokens. Key findings:
- Decoder-only + generative pretraining excels at zero-shot generalization for generative tasks.
- Encoder-decoder + MLM + multitask finetuning performs best on zero-shot MLM-style tasks but struggles with answering open-ended questions.
Conclusion
From what we've seen so far, there isn't a clear winner between encoder models and decoder models; they simply serve different tasks. Encoder models are better suited for discriminative tasks, while decoder models excel at generative tasks. If a decoder model is to perform discriminative tasks, two conditions must be met: (1) the model must be sufficiently large, otherwise its capabilities cannot be fully activated; and (2) a well-designed prompt must be provided to explicitly guide the model toward the target task.
Additionally, there are encoder-decoder models. According to some online experiments, encoder-decoder models don't seem to underperform decoder-only models. However, encoder-decoder models lack the same parallelization capabilities, making them far less efficient than decoder-only models. As a result, they have largely fallen out of favor in industrial applications.
References
- [1] Ultra-short-term PV power prediction based on Informer with multi-head probability sparse self-attentiveness mechanism
- [2] What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization?
- [3] Imgcook 3.0 Series: Semantic Analysis of Fields
- [4] Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer