
A Comparison Between Encoder and Decoder Architectures
Last Updated on May 14, 2025 by Editorial Team
Author(s): tangbasky
Originally published on Towards AI.
This article discusses the advantages and disadvantages of large language models built on encoder and decoder architectures. Both architectures derive from the Transformer model. The encoder-decoder architecture was originally designed for translation tasks, where the encoder encodes the input sequence and the decoder generates the output sequence. The general structure is illustrated in Figure 1 below.

Since both the encoder and the decoder modules are built from Transformer blocks, the overall model structure is shown in Figure 2 below.

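To make this layout concrete, here is a minimal sketch of a translation-style encoder-decoder built with PyTorch's nn.Transformer. The vocabulary sizes, dimensions, and toy batch are illustrative assumptions, not values from the article.

```python
import torch
import torch.nn as nn

d_model, src_vocab, tgt_vocab = 512, 1000, 1000   # illustrative sizes

src_embed = nn.Embedding(src_vocab, d_model)      # encoder input embedding
tgt_embed = nn.Embedding(tgt_vocab, d_model)      # decoder input embedding
transformer = nn.Transformer(d_model=d_model, batch_first=True)
lm_head = nn.Linear(d_model, tgt_vocab)           # decoder states -> output logits

src = torch.randint(0, src_vocab, (2, 10))        # toy source batch (e.g., sentences to translate)
tgt = torch.randint(0, tgt_vocab, (2, 8))         # toy (shifted) target batch

# Causal mask so each decoder position only attends to earlier target positions.
tgt_mask = transformer.generate_square_subsequent_mask(tgt.size(1))

decoder_out = transformer(src_embed(src), tgt_embed(tgt), tgt_mask=tgt_mask)
logits = lm_head(decoder_out)                     # (batch, tgt_len, tgt_vocab)
print(logits.shape)
```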
The Transformer architecture quickly came to dominate the leaderboards of various public datasets, marking the transition of natural language models from the LSTM era to the Transformer era. At the time, however, few papers studied the encoder and decoder architectures separately. Below, we introduce typical encoder and decoder models.
- BERT
In 2018, BERT emerged and quickly dominated various NLP tasks. It made three significant innovations:
- Scaling the Transformer: Making it larger and deeper.
- Encoder-Only Design: Using only the encoder stack of the Transformer.
- Masked Language Modeling (MLM): A cloze-style pretraining task.
Note: BERT also included Next Sentence Prediction (NSP), but subsequent studies found it had minimal impact on model performance.
Notably, BERT excelled primarily in discriminative tasks (e.g., classification) but underperformed in generative tasks. Before GPT-3, no model, including decoder-only models such as GPT-1 and GPT-2, achieved strong performance in generative tasks.
- GPT
After BERT, numerous "X-BERT" models emerged, yet none made significant breakthroughs in generation until GPT-3. The GPT series popularized decoder-only architectures in NLP, gradually displacing BERT-style models from the spotlight. GPT-1 and GPT-2 had limited success, and model size turned out to be the critical factor: GPT-3 achieved a decisive advantage over BERT in generative tasks while remaining competitive in discriminative ones.
- T5
Are there models that combine both encoder and decoder? The answer is yes: T5 (Text-to-Text Transfer Transformer) uses an encoder-decoder architecture. It matches BERT's performance in discriminative tasks but may underperform decoder-only models of similar size in generative tasks.
Analysis of Model Architectures
We now analyze the three architecture types (encoder-only, encoder-decoder, and decoder-only) to understand their relationships and the tasks each is suited for.

From Figure 3 above, it is evident that in the encoder-decoder structure, the encoder employs bidirectional self-attention: each token attends to every token in the sequence. The (causal) language model, in contrast, has a decoder-only structure in which attention only flows from the current token to the tokens preceding it, i.e., unidirectional self-attention. The prefix LM is a clever middle ground: it splits a sequence into two parts, a prefix and a target. The prefix undergoes bidirectional self-attention, while the target only undergoes unidirectional self-attention. The attention patterns of these three structures are shown in Figure 4 below, followed by a small mask-construction sketch.

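A small NumPy sketch of the three attention patterns just described: fully bidirectional (encoder), causal (decoder-only LM), and prefix LM. The sequence length and prefix length are arbitrary illustrative choices.

```python
import numpy as np

n, prefix_len = 6, 3   # sequence length, length of the prefix part

# Encoder / fully bidirectional: every token attends to every token.
bidirectional = np.ones((n, n), dtype=bool)

# Decoder-only / causal LM: token i attends only to tokens 0..i.
causal = np.tril(np.ones((n, n), dtype=bool))

# Prefix LM: the prefix is fully visible (bidirectional within the prefix,
# and to all target tokens), while target tokens stay causal among themselves.
prefix_lm = np.tril(np.ones((n, n), dtype=bool))
prefix_lm[:, :prefix_len] = True   # every token may attend to the whole prefix

print(bidirectional.astype(int), causal.astype(int), prefix_lm.astype(int), sep="\n\n")
```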
So why did the encoder-only structure, once seemingly invincible, lose out to the decoder-only structure? Even the encoder-decoder structure falls short of decoder-only models. We analyze this from the following aspects.
The Problem of Rank
Let's first explain what the rank of a matrix is. According to Wikipedia, "the rank of a matrix is the number of vectors in a maximal linearly independent set of its row vectors or column vectors." What role does rank play in model computation? It comes up in the attention mechanism: during attention, each token computes a weight for every other token, and these weights form a matrix. What does this matrix represent? Take a look at the two tables below.


From the tables above, we can observe that with high-rank attention weights, each token has distinct attention weights for the other tokens, whereas with low-rank attention weights the rows become identical across tokens. High rank is desirable because it indicates that each token retains unique information; low rank homogenizes tokens, erasing their distinguishing features and preventing the model from learning token-specific characteristics.
The key conclusion is that the bidirectional attention mechanism in encoders tends to produce low-rank matrices, while the unidirectional attention in decoders preserves full rank. For detailed proofs, refer to the paper "Low-Rank Bottleneck in Multi-head Attention Models". The core issue is an inherent limitation of standard multi-head attention: when the per-head dimension d is smaller than the sequence length n, a "low-rank bottleneck" occurs, reducing the model's expressive power. Mathematically, multiplying an n×d matrix by a d×n matrix yields a matrix of rank at most d, so the attention scores that feed the softmax are low-rank whenever n ≫ d.
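As a quick numerical illustration of this bottleneck, the sketch below (with illustrative shapes) checks the rank of the pre-softmax score matrix QKᵀ when n ≫ d.

```python
import numpy as np

n, d = 128, 8                          # sequence length >> per-head dimension
rng = np.random.default_rng(0)
Q = rng.standard_normal((n, d))        # per-head queries
K = rng.standard_normal((n, d))        # per-head keys

scores = Q @ K.T / np.sqrt(d)          # (n, n) pre-softmax attention scores

# The product of an n x d and a d x n matrix has rank at most d,
# so the n x n score matrix is severely rank-limited when n >> d.
print(np.linalg.matrix_rank(scores))   # prints 8, far below n = 128
```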
However, does a low-rank encoder necessarily underperform a high-rank decoder? Not necessarily. High rank indicates greater expressive potential, but effectively harnessing this potential is critical. Without proper optimization, a high-rank decoder might still underperform a low-rank encoder.
Differences in Pretraining Tasks
Encoder-only models like BERT use Masked Language Modeling (MLM): roughly 15% of the tokens in a sequence are randomly masked, and the model predicts the masked tokens from the surrounding context. Decoder-only models, in contrast, use autoregressive language modeling, predicting the next token given the previous ones (a minimal sketch of both objectives follows the list below). Consequently:
- MLM encourages global context understanding, making encoder-only models stronger at discriminative tasks (e.g., classification).
- Autoregressive training fosters sequential reasoning, making decoder-only models better suited for generative tasks.
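Here is a minimal sketch contrasting the two objectives on a toy token-ID sequence; the IDs, the mask token, and the -100 ignore label are illustrative conventions, not values from the article.

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = np.array([11, 42, 7, 93, 5, 28, 64, 19])   # a toy token-ID sequence
MASK_ID = 103                                       # placeholder mask token ID

# --- Masked language modeling (encoder-style, BERT) ---
mask = rng.random(tokens.shape) < 0.15              # mask ~15% of positions
mlm_inputs = np.where(mask, MASK_ID, tokens)        # corrupted input sequence
mlm_targets = np.where(mask, tokens, -100)          # predict only the masked slots

# --- Autoregressive language modeling (decoder-style, GPT) ---
lm_inputs = tokens[:-1]                             # tokens 0..n-2
lm_targets = tokens[1:]                             # predict token i+1 from tokens 0..i

print(mlm_inputs, mlm_targets)
print(lm_inputs, lm_targets)
```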
Why do decoder-only models excel at discriminative tasks?
This is largely due to prompt engineering and in-context learning. By framing a task as a prompt, a decoder-only model can generate outputs that mimic discriminative behavior, effectively repurposing its generative capabilities for classification, as sketched below.
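A sketch of this idea, assuming a hypothetical sequence_log_prob helper that returns the language model's log-probability of a piece of text (any LM scoring utility could play this role):

```python
def sequence_log_prob(text: str) -> float:
    """Hypothetical stand-in: return the LM log-probability of `text`."""
    raise NotImplementedError("plug in your language model's scoring here")

def classify(review: str, labels=("positive", "negative")) -> str:
    """Frame sentiment classification as prompt-completion scoring."""
    prompt = f"Review: {review}\nSentiment:"
    # Score each candidate label as a continuation of the prompt; keep the best.
    return max(labels, key=lambda label: sequence_log_prob(f"{prompt} {label}"))

# Example (with a real LM behind sequence_log_prob):
# classify("The movie was a delight from start to finish.")  # -> "positive"
```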
Model Scale
Early decoder-only models underperformed encoders, which seems contradictory given the previous points. The key factor here is model scale:
- During BERT's era, models typically had a few hundred million parameters (BERT-large has about 340M).
- GPT-3, in contrast, scaled to 175B parameters.
Google (the developer of BERT) focused on encoder architectures, while OpenAI pursued decoders. Although Google likely experimented with large encoders, empirical evidence shows that:
- Encoders achieve good performance quickly with small models but plateau in scalability.
- Decoders require substantial scale to outperform encoders but ultimately achieve superior generalization.
Experimental Evidence
In "What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization?", the authors compared encoder-only, encoder-decoder, and decoder-only architectures using 50B parameter models pretrained on 170B tokens. Key findings:
- Decoder-only + generative pretraining excels at zero-shot generalization for generative tasks.
- Encoder-decoder + MLM + multitask finetuning performs best on zero-shot MLM-style tasks but struggles with answering open-ended questions.
Conclusion
From what we've seen so far, there isn't a clear winner between encoder models and decoder models; they simply serve different tasks. Encoder models are better suited for discriminative tasks, while decoder models excel at generative tasks. If a decoder model is to perform discriminative tasks, two conditions must be met: (1) the model must be sufficiently large, otherwise its capabilities cannot be fully activated; and (2) a well-designed prompt must be provided to explicitly guide the model toward the target task.
Additionally, there are encoder-decoder models. According to some online experiments, encoder-decoder models don't seem to underperform decoder-only models. However, encoder-decoder models lack the same parallelization capabilities, making them far less efficient than decoder-only models. As a result, they have largely fallen out of favor in industrial applications.
References
- [1] Ultra-short-term PV power prediction based on Informer with multi-head probability sparse self-attentiveness mechanism
- [2] What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization?
- [3] Imgcook 3.0 Series: Semantic Analysis of Fields
- [4] Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer