The Ever-evolving Pre-training Tasks for Language Models

Last Updated on December 28, 2022 by Editorial Team

Author(s): Harshit Sharma


Self-Supervised Learning (SSL) is the backbone of transformer-based pre-trained language models, and this paradigm involves solving pre-training tasks (PT) that help in modeling natural language. This article puts all the popular pre-training tasks at a glance.

Loss function in SSL
The loss function here is simply the weighted sum of the losses of the individual pre-training tasks that the model is trained on.

Taking BERT as an example, the loss would be the weighted sum of the MLM (Masked Language Modelling) and NSP (Next Sentence Prediction) losses.
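
In symbols, with λ_j denoting the weight given to the j-th pre-training task T_j, the overall objective can be written as

    \mathcal{L}_{SSL} = \sum_{j} \lambda_j \, \mathcal{L}_{T_j}

so for BERT, which weights both tasks equally, this is simply \mathcal{L} = \mathcal{L}_{MLM} + \mathcal{L}_{NSP}.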

Over the years, many pre-training tasks have emerged to solve specific problems. We will review 10 of the most interesting and popular ones, along with their corresponding loss functions:

  1. Causal Language Modelling (CLM)
  2. Masked Language Modelling (MLM)
  3. Replaced Token Detection (RTD)
  4. Shuffled Token Detection (STD)
  5. Random Token Substitution (RTS)
  6. Swapped Language Modelling (SLM)
  7. Translation Language Modelling (TLM)
  8. Alternate Language Modelling (ALM)
  9. Span Boundary Objective (SBO)
  10. Next Sentence Prediction (NSP)

(The loss functions for each task and much of the content are borrowed from AMMUS: A Survey of Transformer-based Pretrained Models in Natural Language Processing.)

1. Causal Language Modelling (CLM)
  • It's simply a unidirectional language model that predicts the next word given the context.
  • It was used as a pre-training task in GPT-1.
  • The loss for CLM is defined as:
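
In standard notation, for a token sequence x = (x_1, …, x_{|x|}), the model maximizes the likelihood of each token given its left context, giving

    \mathcal{L}_{CLM} = -\frac{1}{|x|} \sum_{i=1}^{|x|} \log P(x_i \mid x_{<i})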

2. Masked Language Modelling (MLM)
  • An improvement over Causal Language Modelling (CLM): while CLM only takes unidirectional context into consideration when predicting text, MLM uses bi-directional context.
  • It was first used as a pre-training task in BERT, with the loss shown below.
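
With M denoting the set of masked positions (typically 15% of the tokens) and the masked-out input written as x_{\setminus M}, the loss is the negative log-likelihood of the original tokens at the masked positions:

    \mathcal{L}_{MLM} = -\frac{1}{|M|} \sum_{i \in M} \log P(x_i \mid x_{\setminus M})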

3. Replaced Token Detection (RTD)
  • Instead of masking tokens with [MASK], RTD replaces a token with a different token (using a generator model) and trains the model to classify whether the given tokens are actual or replaced tokens (using a discriminator model).
  • It addresses the following two drawbacks of MLM:

Drawback 1:
The [MASK] token appears during pre-training but not during fine-tuning, which creates a mismatch between the two scenarios.
RTD overcomes this since it doesn't use any masking.

Drawback 2:
In MLM, the training signal comes from only 15% of the tokens, since the loss is computed just over the masked positions; in RTD, the signal comes from all the tokens, since each of them is classified as "replaced" or "original".

  • RTD was used in ELECTRA as a pre-training task. The ELECTRA architecture is shown below:
ELECTRA Architecture
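
Writing x for the original sequence, x̃ for the corrupted one, and D(x̃, i) for the discriminator's estimate of the probability that position i still holds the original token, the discriminator's loss is a per-token binary cross-entropy (in ELECTRA it is optimized jointly with the generator's MLM loss):

    \mathcal{L}_{RTD} = -\frac{1}{|x|} \sum_{i=1}^{|x|} \Big[ \mathbb{1}(\tilde{x}_i = x_i) \log D(\tilde{x}, i) + \mathbb{1}(\tilde{x}_i \neq x_i) \log\big(1 - D(\tilde{x}, i)\big) \Big]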

4. Shuffled Token Detection (STD)
  • Similar to RTD, but the tokens here are classified as shuffled or not, instead of replaced or not (shown below).
Illustration of STD (from paper)
  • Achieves a sample-efficiency gain over MLM similar to that of RTD.
  • The loss is defined as:
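
It has the same per-token binary cross-entropy form as the RTD loss, with the labels now indicating whether a position was shuffled; writing x̃ for the shuffled sequence and D(x̃, i) for the predicted probability that position i is unshuffled:

    \mathcal{L}_{STD} = -\frac{1}{|x|} \sum_{i=1}^{|x|} \Big[ \mathbb{1}(\tilde{x}_i = x_i) \log D(\tilde{x}, i) + \mathbb{1}(\tilde{x}_i \neq x_i) \log\big(1 - D(\tilde{x}, i)\big) \Big]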

5. Random Token Substitution (RTS)
  • RTD uses a generator to corrupt the sentence, which is computationally expensive.
    RTS bypasses this complexity by simply substituting 15% of the tokens with random tokens from the vocabulary, while achieving accuracy similar to MLM's, as shown here.
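
A minimal sketch of the RTS corruption step, assuming a toy list-of-strings vocabulary (no generator is needed; replacements are sampled uniformly from the vocabulary, and every position gets an original/replaced label, as in RTD):

    import random

    def make_rts_example(tokens, vocab, replace_prob=0.15):
        """Randomly substitute ~15% of the tokens and label every position (1 = replaced)."""
        corrupted, labels = [], []
        for tok in tokens:
            if random.random() < replace_prob:
                corrupted.append(random.choice([v for v in vocab if v != tok]))  # uniform random substitute
                labels.append(1)
            else:
                corrupted.append(tok)
                labels.append(0)
        return corrupted, labels

    tokens = "the chef cooked the meal".split()
    vocab = ["the", "chef", "ate", "meal", "cooked", "car", "blue"]
    print(make_rts_example(tokens, vocab))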

6. Swapped Language Modelling (SLM)
  • SLM corrupts the sequence by replacing 15% of the tokens with random tokens.
  • It's similar to MLM in that it tries to predict the corrupted tokens, but instead of using [MASK], random tokens are used for masking.
  • It's similar to RTS in that it uses random tokens for corrupting, but unlike RTS, it's not as sample efficient, since only 15% of the tokens provide the training signal.
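
With C denoting the set of corrupted positions and x̃ the corrupted sequence, the loss mirrors the MLM loss, just computed over the randomly substituted positions:

    \mathcal{L}_{SLM} = -\frac{1}{|C|} \sum_{i \in C} \log P(x_i \mid \tilde{x})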

7. Translation Language Modelling (TLM)
  • TLM is also known as cross-lingual MLM, wherein the input is a pair of parallel sentences (sentences from two different languages) with tokens masked as in MLM.
  • It was used as a pre-training task in XLM, a cross-lingual model that learns cross-lingual mappings.
Illustration of TLM (from paper)
  • The TLM loss is similar to the MLM loss:
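
With (x, y) a parallel sentence pair, z = [x; y] their concatenation, and M the set of masked positions in z, the loss is the usual masked-token negative log-likelihood, now conditioned on context from both languages:

    \mathcal{L}_{TLM} = -\frac{1}{|M|} \sum_{i \in M} \log P(z_i \mid z_{\setminus M})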

8. Alternate Language Modelling (ALM)
  • It's a task for learning a cross-lingual language model, just like TLM, except that the parallel sentences are code-switched, as shown below:
Illustration of ALM: Step 1: tokens from x are replaced by tokens from y; Step 2: the obtained sample is then masked similarly to MLM (image from paper)

While code-switching, some phrases of x are substituted with their counterparts from y, and the sample thus obtained is used to train the model.

  • The masking strategy is similar to MLM.
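
A toy sketch of the two steps, assuming the phrase alignments between the parallel sentences are already known (real implementations derive them from an alignment model; the sentences and the alignment format here are purely illustrative):

    import random

    # Toy parallel pair with hypothetical pre-computed phrase alignments:
    # (start, end) span in x  ->  (start, end) span in y
    x = "the cat sat on the mat".split()
    y = "le chat s'est assis sur le tapis".split()
    alignments = {(0, 2): (0, 2), (3, 6): (4, 7)}  # "the cat" <-> "le chat", "on the mat" <-> "sur le tapis"

    def code_switch(x, y, alignments, swap_prob=0.5):
        """Step 1: replace some phrases of x with their aligned phrases from y."""
        out, i = [], 0
        while i < len(x):
            span = next(((s, e) for (s, e) in alignments if s == i), None)
            if span is not None and random.random() < swap_prob:
                ys, ye = alignments[span]
                out.extend(y[ys:ye])  # swap in the aligned phrase from y
                i = span[1]
            else:
                out.append(x[i])
                i += 1
        return out

    def mask_tokens(tokens, mask_prob=0.15):
        """Step 2: mask ~15% of the tokens of the code-switched sample, as in MLM."""
        return [tok if random.random() > mask_prob else "[MASK]" for tok in tokens]

    print(mask_tokens(code_switch(x, y, alignments)))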

9. Span Boundary Objective (SBO)
  • Involves masking a contiguous span of tokens in a sentence and then having the model predict the masked tokens based on the output representations of the boundary tokens.
Step 1: tokens x5 till x8 are masked; Step 2: the output representations of the boundary tokens (x4 and x9) are used to predict the tokens from x5 till x8 (image from paper)
  • It was used as a pre-training task in SpanBERT.
  • The loss is defined as:
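
Writing (x_s, …, x_e) for the masked span, h_{s-1} and h_{e+1} for the encoder outputs at its boundary tokens, and p_{i-s+1} for a relative position embedding, each span token is predicted from a representation y_i built only from the boundaries (in SpanBERT this is added to the regular MLM loss):

    y_i = f(h_{s-1}, h_{e+1}, p_{i-s+1}), \qquad \mathcal{L}_{SBO} = -\sum_{i=s}^{e} \log P(x_i \mid y_i)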

10. Next Sentence Prediction (NSP)
  • It's a sentence-level task that helps the model learn relationships between sentences.
  • It's a binary classification task that involves identifying whether two sentences are consecutive, using the output representation of the [CLS] token.
  • The training is done with 50% positive samples and 50% negative samples, where in the negative samples the second sentence is not consecutive to the first.
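
A minimal sketch of how such training pairs can be built from a corpus of documents, each document being a list of sentences (names here are illustrative); the label is 1 when sentence B actually follows sentence A:

    import random

    def make_nsp_pair(docs):
        """Build one NSP example: 50% consecutive sentences (label 1), 50% random pairs (label 0)."""
        doc = random.choice([d for d in docs if len(d) >= 2])
        i = random.randrange(len(doc) - 1)
        sent_a = doc[i]
        if random.random() < 0.5:
            return sent_a, doc[i + 1], 1             # positive: the actual next sentence
        other = random.choice(docs)                  # negative: a sentence from a random document
        return sent_a, random.choice(other), 0       # (may rarely coincide with the true next sentence in this toy sketch)

    docs = [
        ["The cat sat on the mat.", "It fell asleep quickly."],
        ["Stocks rallied on Monday.", "Analysts were surprised."],
    ]
    print(make_nsp_pair(docs))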

There are many other interesting tasks summarized in AMMUS! Kudos to the authors, and please give it a read if you find this interesting.

