

ELECTRA: Pre-Training Text Encoders as Discriminators rather than Generators

Last Updated on October 1, 2020 by Editorial Team

Author(s): Edward Ma

Natural Language Processing

What is the difference between ELECTRA and BERT?

Photo by Edward Ma on Unsplash

BERT (Devlin et al., 2018) has become the baseline for many NLP tasks, and a number of newer models build on its architecture, such as RoBERTa (Liu et al., 2019) and ALBERT (Lan et al., 2019). Clark et al. released ELECTRA (Clark et al., 2020), which aims to reduce computation time and resources while maintaining high-quality performance. The trick is to introduce a generator for Masked Language Model (MLM) prediction and forward the generator's output to the discriminator.

MLM is one of the training objectives in BERT (Devlin et al., 2018). However, it has been criticized for the misalignment between the pre-training phase and the fine-tuning phase: during pre-training, MLM replaces tokens with [MASK] and the model predicts the real word in order to learn word representations, but the [MASK] token never appears during fine-tuning. ELECTRA (Clark et al., 2020), on the other hand, contains two models, a generator and a discriminator. The masked tokens are sent to the generator, which generates alternative inputs for the discriminator (i.e., the ELECTRA model). After the training phase, the generator is thrown away and only the discriminator is kept for fine-tuning and inference.

Clark et al. named this method replaced token detection. In the following sections, we will cover how ELECTRA (Clark et al., 2020) works.

Input Data

Overview of the ELECTRA training process (Clark et al., 2020)

As mentioned before, there are two models in the training phase. Instead of feeding masked tokens (e.g., [MASK]) to the target model (i.e., the discriminator/ELECTRA), a small MLM, the generator, is trained to predict the masked tokens. The output of the generator, which no longer contains any masked tokens, becomes the input of the discriminator.

It is possible that the generator predicts the same token as the original (e.g., "the" in the figure above). This is taken into account when generating the true labels for the discriminator: a position is labeled "replaced" only if the sampled token differs from the original. Taking the figure above as an example, only "ate" is marked as "replaced", while the rest (including "the") are labeled "original".
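
To make the labeling concrete, here is a minimal sketch in plain Python (not the authors' code) of how the discriminator's targets could be derived from the original tokens and the generator's samples; the positions and sampled words simply mirror the figure above.

```python
# Minimal sketch: deriving replaced-token-detection labels from generator samples.
original = ["the", "chef", "cooked", "the", "meal"]
sampled = {0: "the", 2: "ate"}  # tokens sampled by the generator at the masked positions

corrupted, labels = [], []
for i, token in enumerate(original):
    new_token = sampled.get(i, token)  # unmasked positions keep the original token
    corrupted.append(new_token)
    # A position counts as "replaced" only if the sampled token differs from the original.
    labels.append("replaced" if new_token != token else "original")

print(corrupted)  # ['the', 'chef', 'ate', 'the', 'meal']
print(labels)     # ['original', 'original', 'replaced', 'original', 'original']
```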

You can think of the generator as a small masked language model (e.g., a small BERT). Its objective is to generate training data for the discriminator while also learning word representations (a.k.a. token embeddings). In fact, the idea behind the generator is similar to the data-augmentation approach for NLP in nlpaug.
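
For comparison, below is a hedged usage sketch of nlpaug's contextual word-embedding augmenter, which substitutes tokens with a masked language model in much the same spirit; the model name and parameters are only illustrative, so check the nlpaug documentation for the exact API in your version.

```python
# Illustrative nlpaug usage: a small masked LM proposes substitute tokens,
# much like ELECTRA's generator produces replacements for the discriminator.
import nlpaug.augmenter.word as naw

aug = naw.ContextualWordEmbsAug(model_path="bert-base-uncased", action="substitute")
print(aug.augment("the chef cooked the meal"))
```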

Model Setup

To improve the efficiency of pre-training, Clark et al. found that sharing all weights between the generator and the discriminator may not be the best approach. Instead, they share only the token and positional embeddings across the two models. The following figure shows that the replaced token detection approach outperforms the masked language model.

Performance comparison between replaced token detection and masked language model (Clark et al., 2020)
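
A minimal PyTorch-style sketch of this idea (hypothetical module names, not the reference implementation): only the token and positional embedding tables are shared, while each model keeps its own, differently sized, encoder body.

```python
import torch.nn as nn

vocab_size, max_len, emb_size = 30522, 512, 128
token_emb = nn.Embedding(vocab_size, emb_size)   # token embeddings shared by both models
position_emb = nn.Embedding(max_len, emb_size)   # positional embeddings shared by both models

class Encoder(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        self.token_emb = token_emb        # reuse the shared embedding tables
        self.position_emb = position_emb
        self.proj = nn.Linear(emb_size, hidden_size)  # each model keeps its own body

generator = Encoder(hidden_size=64)        # kept small and discarded after pre-training
discriminator = Encoder(hidden_size=256)   # the ELECTRA model kept for fine-tuning
```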

Secondly, a smaller generator provides better results. A small generator not only leads to a better result but also reduces the overall training time.

Performance of different generator and discriminator sizes (Clark et al., 2020)

Tuning Hyperparameters

Clark et al. also did a lot of hyperparameter tuning, covering the model's hidden size, learning rate, and batch size. Here are the best hyperparameters for the different sizes of ELECTRA models.

Pre-training hyperparameters (Clark et al., 2020)
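
For readers who want to experiment, here is a hedged sketch of how such model hyperparameters map onto Hugging Face's ElectraConfig; the values shown are illustrative small-model defaults rather than the exact numbers from the paper's table.

```python
from transformers import ElectraConfig, ElectraForPreTraining

config = ElectraConfig(
    embedding_size=128,      # shared token/positional embedding size
    hidden_size=256,         # discriminator hidden size
    num_hidden_layers=12,
    num_attention_heads=4,
    intermediate_size=1024,
)
discriminator = ElectraForPreTraining(config)
# Learning rate and batch size are set separately, in the optimizer and data loader.
```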

Takeaways

  • Generative Adversarial Network (GAN): The approach is similar to a GAN, which generates fake data to fool or attack models (to understand more about adversarial attacks, you may check out here and here). However, the generator used to train ELECTRA is different. First of all, a correct token produced by the generator is considered "real" rather than "fake". Also, the generator is trained with maximum likelihood rather than to fool the discriminator.
  • The major challenge of adopting BERT in production is resource allocation: around 1 GB of memory is almost the minimum requirement for serving a BERT model in production. We can foresee more and more new NLP models focusing on reducing model size and inference time, and ELECTRA's small discriminator is a step in that direction (see the sketch below).
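
A hedged sketch of that direction, assuming the Hugging Face transformers library: loading the publicly released small ELECTRA discriminator for a downstream classification task (the two-label setup here is purely illustrative).

```python
import torch
from transformers import ElectraTokenizerFast, ElectraForSequenceClassification

name = "google/electra-small-discriminator"   # the small released discriminator
tokenizer = ElectraTokenizerFast.from_pretrained(name)
model = ElectraForSequenceClassification.from_pretrained(name, num_labels=2)

inputs = tokenizer("ELECTRA keeps only the discriminator for fine-tuning.",
                   return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.shape)  # torch.Size([1, 2])
```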

About Me

I am a Data Scientist in the Bay Area, focusing on state-of-the-art work in Data Science and Artificial Intelligence, especially in NLP and platform-related topics. Feel free to connect with me on LinkedIn or follow me on Medium or Github.


Reference

Clark, K., Luong, M.-T., Le, Q. V., & Manning, C. D. (2020). ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. ICLR 2020.

Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov, V. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach.

Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., & Soricut, R. (2019). ALBERT: A Lite BERT for Self-supervised Learning of Language Representations.

Published via Towards AI
