

Natural Language Processing

ELECTRA: Pre-Training Text Encoders as Discriminators rather than Generators

Author(s): Edward Ma


What is the difference between ELECTRA and BERT?

Photo by Edward Ma on Unsplash

BERT (Devlin et al., 2018) has recently become the baseline for NLP tasks. Many newer models build on the BERT architecture, such as RoBERTa (Liu et al., 2019) and ALBERT (Lan et al., 2019). ELECTRA (Clark et al., 2020) aims to reduce computation time and resources while maintaining high-quality performance. The trick is to introduce a generator for Masked Language Model (MLM) prediction and to forward the generator's output to a discriminator.

MLM is one of the training objectives in BERT (Devlin et al., 2018). However, it has been criticized for the mismatch between the pre-training phase and the fine-tuning phase: [MASK] tokens appear during pre-training but never during fine-tuning. In short, MLM replaces some tokens with [MASK], and the model predicts the real word in order to learn word representations. ELECTRA (Clark et al., 2020), on the other hand, consists of two models, a generator and a discriminator. The masked input is sent to the generator, which produces alternative inputs for the discriminator (i.e. the ELECTRA model). After the training phase, the generator is thrown away and only the discriminator is kept for fine-tuning and inference.

Clark et al. named this method replaced token detection. The following sections cover how ELECTRA (Clark et al., 2020) works.

Input Data

Overview of ELECTRA training process (Clark et al., 2020)

As mentioned before, there are two models in the training phase. Instead of feeding masked tokens (e.g. [MASK]) to the target model (i.e. the discriminator, ELECTRA), a small MLM, the generator, is trained to predict the masked tokens. The generator's output, which no longer contains any [MASK] token, becomes the input of the discriminator.
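The masking step that feeds the generator can be sketched in a few lines of Python. This is only an illustration: the function name `mask_tokens`, the example sentence, the fixed seed, and the default masking rate of 15% are assumptions made here, not code from the ELECTRA authors.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace a random subset of tokens with [MASK].

    Returns the masked sequence and the sorted list of masked positions.
    mask_prob and seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    # Always mask at least one token so the generator has something to predict
    n_mask = max(1, round(len(tokens) * mask_prob))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    return masked, positions

# Hypothetical example sentence
tokens = ["the", "chef", "cooked", "the", "meal"]
masked, positions = mask_tokens(tokens, mask_prob=0.4)
print(masked, positions)
```

The generator only sees `masked`. The list `positions` is kept so that its predictions can later be written back into the sequence.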

It is possible that the generator predicts the original token (i.e. “the” in the above figure). Such cases are tracked so that the true label can be generated for the discriminator. Taking the above figure as an example, only “ate” is labeled “replaced”, while the rest of the tokens (including “the”) are labeled “original”.
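The labeling rule described above can be sketched as follows. The helper `build_discriminator_example` and the hard-coded generator predictions are hypothetical and stand in for a real generator. The point is that a label depends on whether the token differs from the original, not on whether the position was masked.

```python
def build_discriminator_example(original, masked_positions, generator_predictions):
    """Write generator predictions into the masked positions and label each token.

    A token is "replaced" only if it differs from the original token;
    a correct generator prediction therefore counts as "original".
    """
    corrupted = list(original)
    for pos, pred in zip(masked_positions, generator_predictions):
        corrupted[pos] = pred
    labels = ["replaced" if c != o else "original"
              for c, o in zip(corrupted, original)]
    return corrupted, labels

# Hypothetical sentence and predictions: positions 0 and 2 were masked,
# and the generator predicted "the" (correct) and "ate" (incorrect)
original = ["the", "chef", "cooked", "the", "meal"]
corrupted, labels = build_discriminator_example(original, [0, 2], ["the", "ate"])
print(corrupted)  # ['the', 'chef', 'ate', 'the', 'meal']
print(labels)     # ['original', 'original', 'replaced', 'original', 'original']
```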

You can think of the generator as a small masked language model (e.g. a small BERT). Its objective is to generate training data for the discriminator while learning word representations (aka token embeddings). The idea of the generator is similar to the data augmentation approach for NLP in nlpaug.

Model Setup

To improve the efficiency of pre-training, Clark et al. found that sharing all weights between the generator and the discriminator may not be the best approach. Instead, they share only the token and positional embeddings across the two models. The following figure shows that the replaced token detection approach outperforms the masked language model.
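A toy sketch of what sharing only the token and positional embeddings means in practice is shown below. The classes `SharedEmbeddings` and `Encoder` are hypothetical stand-ins written in plain Python. They model only the wiring (both models point at the same embedding tables while keeping their own layers), not a real transformer.

```python
import random

class SharedEmbeddings:
    """Token and positional embedding tables held in one object."""
    def __init__(self, vocab_size, max_len, dim, seed=0):
        rng = random.Random(seed)
        self.token = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                      for _ in range(vocab_size)]
        self.position = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                         for _ in range(max_len)]

    def embed(self, token_ids):
        # Input embedding = token embedding + positional embedding
        return [[t + p for t, p in zip(self.token[tid], self.position[i])]
                for i, tid in enumerate(token_ids)]

class Encoder:
    """Stand-in for an encoder; only the embedding wiring is modeled."""
    def __init__(self, embeddings, num_layers):
        self.embeddings = embeddings  # shared reference, not a copy
        self.num_layers = num_layers  # each model keeps its own layers

    def embed(self, token_ids):
        return self.embeddings.embed(token_ids)

# Sizes below are arbitrary illustration values
shared = SharedEmbeddings(vocab_size=10, max_len=8, dim=4)
generator = Encoder(shared, num_layers=2)
discriminator = Encoder(shared, num_layers=6)

# Simulate an update to one token embedding made through the generator
generator.embeddings.token[3][0] += 1.0

# The discriminator sees the same updated embedding
print(discriminator.embed([3]) == generator.embed([3]))
```

Because both models hold a reference to the same tables, whatever the generator learns about token embeddings is immediately available to the discriminator.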

Performance comparison between replaced token detection and masked language model (Clark et al., 2020)

Secondly, a smaller generator gives better results. A small generator not only leads to a better result but also reduces the overall training time.

Performance of different generator size and discriminator size (Clark et al., 2020)

Tuning Hyperparameters

Clark et al. put considerable effort into tuning hyperparameters, including the model’s hidden size, learning rate, and batch size. Here are the best hyperparameters for different sizes of ELECTRA models.

Pre-training hyperparameters (Clark et al., 2020)

Take Away

  • Generative Adversarial Network (GAN): The approach is similar to a GAN, which generates fake data to fool or attack models. However, the generator used in training ELECTRA is different. First of all, a correct token produced by the generator is considered “real” instead of “fake”. Also, the generator is trained with maximum likelihood rather than to fool the discriminator.
  • The major challenge of adopting BERT in production is resource allocation. 1 GB of memory is almost the minimum requirement for running a BERT model in production. We can expect more and more new NLP models to focus on reducing model size and inference time.
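To make the contrast with a GAN concrete, here is a minimal sketch of the two training losses, assuming the model probabilities are already available. The function names are hypothetical, and the default discriminator weight of 50 is an assumption based on the value reported in the paper. The generator loss does not depend on the discriminator's output, so the generator is never rewarded for fooling it.

```python
import math

def generator_mlm_loss(true_token_probs):
    """Maximum-likelihood (MLM) loss: mean negative log-probability that the
    generator assigns to the true token at each masked position."""
    return -sum(math.log(p) for p in true_token_probs) / len(true_token_probs)

def discriminator_loss(replaced_probs, labels):
    """Binary cross-entropy over every token; label 1 = replaced, 0 = original.
    Probabilities are assumed to lie strictly between 0 and 1."""
    total = 0.0
    for p, y in zip(replaced_probs, labels):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

def combined_loss(true_token_probs, replaced_probs, labels, disc_weight=50.0):
    """Sum of both losses; disc_weight is an assumed weighting factor."""
    return (generator_mlm_loss(true_token_probs)
            + disc_weight * discriminator_loss(replaced_probs, labels))

# Hypothetical probabilities for a five-token sentence with two masked positions
gen_probs = [0.9, 0.2]                  # generator's probability of the true token
disc_probs = [0.1, 0.2, 0.8, 0.1, 0.3]  # discriminator's P(replaced) per token
labels = [0, 0, 1, 0, 0]
print(combined_loss(gen_probs, disc_probs, labels))
```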

About Me

I am a Data Scientist in the Bay Area, focusing on state-of-the-art work in Data Science and Artificial Intelligence, especially NLP and platform-related topics. Feel free to connect with me on LinkedIn or follow me on Medium or Github.



ELECTRA: Pre-Training Text Encoders as Discriminators rather than Generators was originally published in Towards AI — Multidisciplinary Science Journal on Medium, where people are continuing the conversation by highlighting and responding to this story.

Published via Towards AI
