
Which NLP Task Does NOT Benefit From Pre-trained Language Models?

Last Updated on August 29, 2022 by Editorial Team

Author(s): Nate Bush

Pre-trained, general-purpose language representation models have such a long history and such a massive impact that we take for granted that they are a completely necessary foundation for all NLP tasks. Two separate step-function innovations pushed the accuracy of all NLP tasks forward: (1) static word embeddings like Word2Vec and GloVe and, more recently, (2) contextual neural language models like ELMo, BERT, and, most recently, BLOOM. Inserting a pre-trained neural language model at the beginning of a modeling workflow is almost guaranteed to increase performance, but there is at least one situation where it does not.
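To make the contrast concrete, here is a minimal sketch, assuming the Hugging Face transformers library and PyTorch (neither is mentioned in this article), of the key property contextual models add: the same word gets a different vector in each context, which a static embedding table like Word2Vec or GloVe cannot provide.

```python
# Minimal sketch: contextual vs. static word representations.
# `transformers` and `torch` are illustrative assumptions, not prescribed by the article.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def word_vector(sentence: str, word: str) -> torch.Tensor:
    """Return BERT's contextual vector for `word` inside `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]       # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)]

# The same surface form gets different vectors in different contexts;
# a static embedding would assign "bank" a single vector in both sentences.
v_river = word_vector("he sat on the bank of the river", "bank")
v_money = word_vector("she deposited cash at the bank", "bank")
print(torch.cosine_similarity(v_river, v_money, dim=0))     # noticeably below 1.0
```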

Sidebar: why the Sesame Street theme?! ELMo, BERT, ERNIE, and friends are all named after Sesame Street characters, a long-running naming joke in the NLP community.

Named Entity Recognition (NER)

Look no further than the original BERT paper, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” for a detailed analysis (in Section 5) of how pre-trained BERT embeddings improve NER performance. The BERT diagram below shows the typical machine learning workflow for exploiting a pre-trained language model for general NLP tasks.

Source: https://arxiv.org/pdf/1810.04805.pdf – Overall pre-training and fine-tuning procedures for BERT
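As a rough illustration of that workflow, here is a hedged sketch of token classification (NER) on top of a pre-trained BERT checkpoint. The Hugging Face transformers library, the label set, and the example sentence are my own assumptions for illustration; neither the paper nor this article prescribes this code.

```python
# A minimal sketch of the pre-train-then-fine-tune workflow, applied to NER.
# The library, labels, and sentence are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=len(labels)
)  # pre-trained encoder + a randomly initialized token-classification head

inputs = tokenizer("Sylvain works at Hugging Face in Brooklyn", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                 # (1, seq_len, num_labels)
predictions = logits.argmax(dim=-1)[0]

# In practice, the whole network (BERT weights included) is fine-tuned on a labeled
# NER corpus such as CoNLL-2003 before these predictions mean anything.
for token, pred in zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]), predictions):
    print(token, labels[pred])
```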

The paper also shows significant improvements on Question Answering (QA), evaluated against SQuAD, and on a hodgepodge of natural language understanding (NLU) tasks called GLUE.

Entity Disambiguation (ED)

The global ED task has also achieved new state-of-the-art results across multiple datasets using BERT. See the related work section of “Global Entity Disambiguation with BERT” for a rundown of various workflows that apply BERT as a preprocessing step for ED.

Extractive Summarization (ES)

A simple variant of BERT that once again achieves state-of-the-art performance, this time on several ES datasets, can be found in “Fine-tune BERT for Extractive Summarization”.

Sentiment Analysis (SA)

Sentiment analysis is equally graced by BERT language models, as shown in the recent paper “BERT for Sentiment Analysis: Pre-trained and Fine-Tuned Alternatives”.
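For completeness, a minimal sketch of the same recipe applied to sentiment analysis: a pre-trained encoder with a sequence-classification head, fine-tuned on sentiment data. The transformers pipeline and checkpoint name below are assumptions made for illustration, not the setup used in the paper above.

```python
# A hedged sketch of sentiment analysis with a pre-trained, already fine-tuned checkpoint.
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)
print(classifier("Pre-trained language models make this task almost too easy."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```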

I could keep going… but I won’t. The glory of pre-trained language models is obvious. We only need to stand on the shoulders of the giants who spent countless hours preparing massive corpora and deploying expensive GPUs to pre-train these models for us. These models aren’t a silver bullet, though.

The main natural language task that has failed to show consistent performance improvements from Sesame Street and friends is Neural Machine Translation (NMT).

NMT Usually Doesn’t Benefit From Pre-trained Language Models

It’s difficult to find papers that discuss why it doesn’t work, and it’s easy to imagine why. Writing papers about what doesn’t work is not very popular… and unlikely to gain recognition or be frequently cited. Ah shoot – so why am I writing this article again?

I found one paper that covers this topic, “When and Why are Pre-trained Word Embeddings Useful for Neural Machine Translation?”, and it was an interesting read. They break NMT down into two categories of tasks:

  1. NMT for low-resource languages
  2. NMT for high-resource languages

What they mean by a low- or high-resource language refers to the size of the parallel corpus that can be obtained for it. For the world’s most popular languages, it is easy to find large open-source parallel corpora online. The largest such repository is OPUS, the Open Parallel Corpus, which is an amazing resource for any machine learning engineer looking to train NMT models.

Source: OPUS – high-resource parallel corpus between English (en) and Chinese (zh)

The image above shows that the open parallel corpus between English and Chinese has 103 million parallel sentences or 172K parallel documents. But what if you wanted to train an NMT model to translate Farsi to Chinese? In that case, you only have 6 million parallel sentences from 517 documents to work with.

Source: OPUS – low-resource parallel corpus between Farsi (fa) and Chinese (zh)
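If you want to get a feel for these corpus sizes yourself, here is a hedged sketch: the Hugging Face datasets hub mirrors several OPUS corpora, so you can load one and count sentence pairs. The dataset name and language pair below are illustrative assumptions, not the exact corpora shown in the screenshots above.

```python
# A rough way to inspect the size of a parallel corpus for a given language pair.
# The dataset name and config are assumptions chosen for illustration.
from datasets import load_dataset

books = load_dataset("opus_books", "en-fr", split="train")
print(len(books))                 # number of parallel sentence pairs
print(books[0]["translation"])    # {'en': '...', 'fr': '...'}
```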

As you might expect, low-resource languages benefit from pre-trained language models and achieve better performance when the embeddings are fine-tuned while back-propagating errors through the NMT network. Surprisingly, however, for high-resource languages, using pre-trained language models as a pre-processing step before NMT model training does NOT yield performance gains.
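A minimal sketch of the setup being compared, assuming a PyTorch implementation with toy dimensions: the NMT encoder’s embedding layer is initialized from pre-trained word vectors and left trainable, so it is fine-tuned as errors back-propagate through the translation network.

```python
# Sketch: initialize the NMT encoder's embeddings from pre-trained vectors, then fine-tune.
# PyTorch, the toy shapes, and the random stand-in weights are assumptions for illustration.
import torch
import torch.nn as nn

vocab_size, emb_dim = 32000, 300
pretrained_vectors = torch.randn(vocab_size, emb_dim)   # stand-in for Word2Vec/GloVe weights

# freeze=False keeps the embeddings trainable alongside the rest of the NMT network
src_embedding = nn.Embedding.from_pretrained(pretrained_vectors, freeze=False)

encoder = nn.GRU(emb_dim, 512, batch_first=True)
tokens = torch.randint(0, vocab_size, (8, 20))          # a toy batch of source sentences
hidden_states, _ = encoder(src_embedding(tokens))
print(hidden_states.shape)                              # torch.Size([8, 20, 512])
```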

It’s critical to point out that language models only make sense to use for machine translation if they are trained on both the source and target languages (for example, Chinese and English in the first example). These are commonly referred to as multilingual embedding models or language-agnostic embeddings. They have the interesting property that words with the same meaning in different languages end up with similar vector representations in the embedding space.

Source: Google AI Blog
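Here is a hedged sketch of that language-agnostic property: translations of the same sentence land close together in the shared embedding space. The sentence-transformers library, the multilingual checkpoint, and the example sentences are assumptions chosen only for illustration.

```python
# Sketch: a multilingual (language-agnostic) embedding model places translations nearby.
# Library, checkpoint, and sentences are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
embeddings = model.encode([
    "The cat sleeps on the sofa.",    # English
    "Le chat dort sur le canapé.",    # French translation of the same sentence
    "Stock prices fell sharply today.",
])
print(util.cos_sim(embeddings[0], embeddings[1]))  # high similarity (translations)
print(util.cos_sim(embeddings[0], embeddings[2]))  # much lower (unrelated meaning)
```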

But how are multilingual language models trained? It turns out they are trained on exactly the same data as NMT models: a massive parallel corpus between the source and target languages. So, is there a fundamental shortcoming of language models that prevents them from being effective for this NLP task? No. Language models use the same data as NMT models, and both are built from the same powerhouse building block: the transformer.

To review: language models and NMT models are trained on the same data, using very similar fundamental architectures. When you consider the similarities, the language models aren’t really bringing anything new to the table, so it shouldn’t be surprising that BERT, ELMo, ERNIE, and our other Sesame Street friends aren’t appearing in NMT papers touting huge breakthroughs in model performance.
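To underline the “same building block” point, a minimal sketch, assuming PyTorch and toy dimensions: an encoder-only stack (the shape of a BERT-style language model) and an encoder-decoder stack (the shape of a standard NMT model) are assembled from the same transformer layers.

```python
# Sketch: both model families are built from the same transformer layers.
# PyTorch and the toy dimensions are assumptions for illustration.
import torch.nn as nn

d_model, n_heads = 512, 8

# Encoder-only stack: the shape of a BERT-style masked language model
lm_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), num_layers=6
)

# Encoder-decoder stack: the shape of a standard NMT model
nmt_model = nn.Transformer(
    d_model=d_model, nhead=n_heads,
    num_encoder_layers=6, num_decoder_layers=6, batch_first=True,
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(f"LM encoder params: {count(lm_encoder):,}  NMT params: {count(nmt_model):,}")
```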

A skeptical reader will likely be able to poke holes in this explanation. There are certainly devisable use cases where training a language model on a large parallel corpus and then training a BERT + NMT workflow on a much smaller corpus would intuitively yield performance gains. But I think it’s unlikely that a serious deep learning engineer would attempt to build an NMT model without all available data at their disposal… outside of purely academic curiosities.

I tip-toed over some hairy details, so I recommend reading the original paper if you’re interested!

I hope you enjoyed this short exploration into the intuition behind what makes NLP algorithms successful. Please like, share, and follow for more deep learning knowledge.

