Understanding BERT

Natural Language Processing

Last Updated on November 2, 2020 by Editorial Team

Author(s): Shweta Baranwal

Source: Photo by Min An on Pexels

BERT (Bidirectional Encoder Representations from Transformers) is a research paper published by Google AI Language. Unlike previous NLP architectures, BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on 11 NLP tasks.

BERT has an advantage over other standard language models because it applies deep bidirectional training: the attention layers consider both the left and the right context of every token. Other language models, such as OpenAI GPT, are unidirectional, so every token can only attend to the previous tokens in the attention layers. Such restrictions are suboptimal for sentence-level tasks (paraphrasing) or token-level tasks (named entity recognition, question answering), where it is crucial to incorporate context from both directions.

What is BERT?

Earlier approaches to word representation, such as GloVe, produce a fixed embedding for each word. For example, the word "right" gets the same embedding irrespective of its context in the sentence: does it mean "correct" or "the right direction"? Then came ELMo (a bi-directional LSTM), which tried to solve this problem by using both the left and right context to generate embeddings, but it simply concatenated the left-to-right and right-to-left information, so the representation could not take advantage of both contexts simultaneously. Then BERT, with its attention layers, outperformed all the previous models.

Differences in pre-training model architectures. BERT uses a bidirectional Transformer. OpenAI GPT uses a left-to-right Transformer. ELMo uses the concatenation of independently trained left-to-right and right-to-left LSTMs to generate features for downstream tasks. Among the three, only BERT representations are jointly conditioned on both left and right contexts in all layers.

The basic architecture of the Transformer, a popular attention model, has two major components: an encoder and a decoder. The encoder reads and processes the input sequence, and the decoder takes the processed input from the encoder and re-processes it to perform the prediction task. (To learn more about the Transformer, refer to the Attention Is All You Need paper.) Since we are interested here in building a language model, only the encoder part is needed. BERT uses this Transformer encoder architecture to apply bi-directional self-attention to the input sequence: it reads the entire sentence in one go, and the attention layers learn the context of a word from all of its left and right surrounding words.
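Using the HuggingFace Transformers library referenced later in this article, this encoder-only behaviour can be seen in a few lines. This is a minimal sketch, not part of the original article; the checkpoint name and the printed shape are only illustrative.

```python
import torch
from transformers import BertTokenizer, BertModel

# Load the pre-trained encoder-only BERT model and its tokenizer.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

# The same surface word "right" appears in two different contexts.
sentences = ["turn right at the corner", "your answer is right"]
inputs = tokenizer(sentences, padding=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape (batch, sequence_length, hidden_size):
# one context-dependent vector per token, built from both left and right context.
print(outputs.last_hidden_state.shape)
```

Because every row of last_hidden_state is a contextual vector, the two occurrences of "right" end up with different representations.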

Pre-training BERT:

BERT is pre-trained on an unlabeled dataset and is therefore unsupervised in nature. There are two pre-training tasks in BERT:

  1. Masked Language Model (MLM)

a) The model masks 15% of the tokens at random with the [MASK] token and then predicts those masked tokens at the output layer. The loss is based only on the predictions for the masked tokens, not on all tokens.

b) During fine-tuning, the [MASK] token does not appear, creating a mismatch between pre-training and fine-tuning. To mitigate this, if the i-th token is chosen for masking during pre-training, it is replaced with:

80% of the time, the [MASK] token: My dog is hairy → My dog is [MASK]

10% of the time, a random word from the corpus: My dog is hairy → My dog is apple

10% of the time, unchanged: My dog is hairy → My dog is hairy

Image source: http://jalammar.github.io/illustrated-bert/
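The 80/10/10 replacement rule above can be written down directly. The following is an illustrative sketch, not the official pre-training code, operating on a list of token strings:

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Illustrative MLM masking: pick ~15% of positions, then apply the 80/10/10 rule."""
    masked = list(tokens)
    labels = [None] * len(tokens)            # only masked positions get a label
    for i, token in enumerate(tokens):
        if random.random() > mask_prob:
            continue
        labels[i] = token                     # loss is computed only on these positions
        roll = random.random()
        if roll < 0.8:
            masked[i] = "[MASK]"              # 80%: replace with the [MASK] token
        elif roll < 0.9:
            masked[i] = random.choice(vocab)  # 10%: replace with a random word
        # remaining 10%: leave the token unchanged
    return masked, labels

tokens = "my dog is hairy".split()
print(mask_tokens(tokens, vocab=["apple", "car", "blue"]))
```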

  2. Next Sentence Prediction

a) In this pre-training task, given two sentences A and B, the model is trained on a binarized output indicating whether the sentences are related or not.

b) When choosing sentences A and B for the pre-training examples, 50% of the time B is the actual next sentence that follows A (label: IsNext), and 50% of the time it is a random sentence from the corpus (label: NotNext).

Image source: http://jalammar.github.io/illustrated-bert/
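A small sketch of how such sentence pairs could be sampled from a corpus of ordered sentences; the toy corpus and helper function below are assumptions made only for illustration:

```python
import random

def make_nsp_example(corpus):
    """corpus: a list of documents, each a list of consecutive sentences."""
    doc = random.choice(corpus)
    idx = random.randrange(len(doc) - 1)
    sentence_a = doc[idx]
    if random.random() < 0.5:
        sentence_b, label = doc[idx + 1], "IsNext"               # the real next sentence
    else:
        other_doc = random.choice(corpus)
        sentence_b, label = random.choice(other_doc), "NotNext"  # a random sentence
    return sentence_a, sentence_b, label

corpus = [
    ["the man went to the store", "he bought a gallon of milk"],
    ["penguins are flightless birds", "they live in the southern hemisphere"],
]
print(make_nsp_example(corpus))
```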

The training loss is the sum of the mean masked LM likelihood and the mean next sentence prediction likelihood.
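In the HuggingFace implementation (see the links later in this article), both pre-training heads are exposed through BertForPreTraining; when masked-LM labels and a next-sentence label are supplied, the returned loss combines the two objectives. A minimal sketch with made-up labels:

```python
import torch
from transformers import BertTokenizer, BertForPreTraining

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForPreTraining.from_pretrained("bert-base-uncased")

# Encode a sentence pair (sentence A, sentence B).
inputs = tokenizer("my dog is cute", "he likes playing", return_tensors="pt")

# MLM labels: here we simply label every position (real pre-training only labels
# the masked positions). next_sentence_label: 0 = IsNext, 1 = NotNext.
mlm_labels = inputs["input_ids"].clone()
nsp_label = torch.tensor([0])

outputs = model(**inputs, labels=mlm_labels, next_sentence_label=nsp_label)
print(outputs.loss)  # masked-LM loss plus next-sentence-prediction loss
```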

In prior NLP work, only sentence embeddings were transferred to downstream tasks, whereas BERT transfers all of its pre-trained parameters to initialize models for different downstream tasks. The pre-trained BERT models are made available by Google and can be used directly for fine-tuning on downstream tasks.

https://huggingface.co/transformers/pretrained_models.html

BERT Architecture:

BERT's model architecture is a multi-layer bi-directional Transformer encoder based on Google's Attention Is All You Need paper. It comes in two model forms:

BERT BASE: fewer Transformer blocks and a smaller hidden size, with the same overall model size as OpenAI GPT. [12 Transformer blocks, 12 attention heads, hidden size 768]

BERT LARGE: a much larger network with twice as many Transformer blocks as BERT BASE; it achieves state-of-the-art results on NLP tasks. [24 Transformer blocks, 16 attention heads, hidden size 1024]
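These sizes can be checked against the published configurations. A quick sketch using the HuggingFace checkpoint names 'bert-base-uncased' and 'bert-large-uncased' (the latter is my assumption, as the article only names the base checkpoint):

```python
from transformers import BertConfig

for name in ["bert-base-uncased", "bert-large-uncased"]:
    cfg = BertConfig.from_pretrained(name)
    print(name, cfg.num_hidden_layers, cfg.num_attention_heads, cfg.hidden_size)
# Expected: 12 layers / 12 heads / hidden size 768 for BASE,
#           24 layers / 16 heads / hidden size 1024 for LARGE.
```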

During fine-tuning, the parameters of these layers (Transformer blocks, attention heads, hidden layers), along with the additional layers of the downstream task, are fine-tuned end-to-end.

BERT Input Representations:

  1. The first token of every sequence is always a special classification token [CLS]. The final hidden state corresponding to this token is used for the classification task.
  2. The two sentences are separated using the [SEP] token.
  3. In the case of a sentence pair, a segment embedding is added, which indicates whether the token belongs to sentence A or sentence B.
  4. For a given token, its input representation is constructed by adding the corresponding token, segment, and position embeddings.

BERT input representation. The input embeddings are the sum of the token embeddings, the segmentation embeddings, and the position embeddings.
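A sketch of how the three embeddings are added together. It reaches into the internals of the HuggingFace BertModel, so the attribute names of the embeddings sub-module are implementation details that may differ across library versions:

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# Encode a sentence pair: [CLS] sentence A [SEP] sentence B [SEP]
enc = tokenizer("my dog is cute", "he likes playing", return_tensors="pt")
input_ids, segment_ids = enc["input_ids"], enc["token_type_ids"]
position_ids = torch.arange(input_ids.size(1)).unsqueeze(0)

emb = model.embeddings
summed = (emb.word_embeddings(input_ids)            # token embeddings
          + emb.token_type_embeddings(segment_ids)  # segment (sentence A/B) embeddings
          + emb.position_embeddings(position_ids))  # position embeddings
print(summed.shape)  # (1, sequence_length, 768) before LayerNorm and dropout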

Fine-tuning BERT:

Fine-tuning BERT is simple and straightforward. The model is modified according to the task at hand: for each task, we simply plug the task-specific inputs and outputs into BERT and fine-tune all the parameters end-to-end.

At the input, sentence A and sentence B from pre-training are analogous to

  1. sentence pairs in paraphrasing
  2. hypothesis-premise pairs in entailment
  3. question-passage pairs in question answering
  4. a degenerate text-∅ pair in text classification or sequence tagging.

At the output, the token representations are fed into an output layer for token-level tasks, such as sequence tagging or question answering, and the [CLS] representation is fed into an output layer for classification tasks, such as sentiment analysis.

Image source: http://jalammar.github.io/illustrated-bert/
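As an illustration of the classification case, BertForSequenceClassification in HuggingFace Transformers puts a classifier head on top of the [CLS] representation. The sketch below shows a single toy fine-tuning step; the labels, learning rate, and batch are placeholders, not a full training recipe:

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# A toy sentiment batch; real fine-tuning would iterate over a labeled dataset.
batch = tokenizer(["a great movie", "a terrible movie"],
                  padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])

model.train()
outputs = model(**batch, labels=labels)  # the [CLS] vector feeds the classifier head
outputs.loss.backward()                  # all BERT parameters are updated end-to-end
optimizer.step()
optimizer.zero_grad()
```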

HuggingFace also provides a framework for fine-tuning task-specific models.

https://huggingface.co/transformers/model_doc/bert.html#bertforpretraining

Model classes for masked LM, next-sentence prediction, sequence classification, multiple choice, etc. are readily available, along with pre-trained parameters for BERT. These are simple and fun to implement.
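For instance, the masked-LM head can be tried out in a few lines. This is a sketch; the predicted word in the final comment is what one would typically expect, not a guaranteed output:

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Find the position of the [MASK] token and take the highest-scoring vocabulary entry.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
predicted_id = logits[0, mask_pos].argmax(dim=-1)
print(tokenizer.decode(predicted_id))  # typically something like "paris"
```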

BERT Tokenizer:

https://huggingface.co/transformers/model_doc/bert.html#berttokenizer

In the BERT input representations, we saw that three types of embeddings are needed (token, segment, and position). The Transformers package by HuggingFace constructs the inputs for each of these requirements (encode_plus). Both a pre-trained tokenizer and a tokenizer built from a given vocab file can be used.

BERT tokenizer from pre-trained 'bert-base-uncased'
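The embedded snippet from the original post is not reproduced here; the following is a minimal sketch of loading the pre-trained tokenizer and calling encode_plus on a sentence pair:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# encode_plus builds everything the model needs for a sentence pair:
# input_ids (for the token embedding lookup), token_type_ids (segment A/B),
# and attention_mask, with [CLS] and [SEP] inserted automatically.
encoded = tokenizer.encode_plus("my dog is hairy", "he likes playing",
                                add_special_tokens=True)

print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
print(encoded["token_type_ids"])
```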

The BERT tokenizer uses the WordPiece model for tokenization. It breaks words into sub-words to increase the coverage of the vocabulary.

Text: Jet makers feud over seat width with big orders at stake

Wordpieces: _J et _makers _fe ud _over _seat _width _with _big _orders _at _stake

In the above example, the word "Jet" is broken into two wordpieces "_J" and "et", and the word "feud" is broken into two wordpieces "_fe" and "ud". The other words remain as single wordpieces. "_" is a special character added to mark the beginning of a word.
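Note that the pre-trained HuggingFace BERT tokenizer marks sub-word continuations with a "##" prefix rather than the "_" word-boundary marker used in the example above. A quick check (the exact split depends on the vocabulary):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Sub-word continuations are marked with "##" in this vocabulary; a rarer word
# such as "tokenization" typically comes out as something like ['token', '##ization'].
print(tokenizer.tokenize("Jet makers feud over seat width with big orders at stake"))
```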

Full code: ShwetaBaranwal/BERT

Understanding BERT was originally published in Towards AI on Medium.
