Increasing Robustness and Equity in NLP for Various English Dialects
Author(s): Eera Bhatt
Originally published on Towards AI.
Natural language processing (NLP) is a popular subfield of machine learning that enables computers to interpret and use human language to perform useful tasks. To do this, we train models on text and speech data so they can learn patterns in the language and make predictions.
But let's be honest for a second. There are still times when I don't understand what people are saying in conversation, even when we're both speaking English. Since the United States is full of immigrant families, it's very common to run into this dialect barrier.
So when "English" means something a bit different to each of us, how can we expect machines to understand it?
Solution. Recently, a group of researchers from Stanford University and Georgia Tech released Multi-VALUE, a set of resources meant to improve equity in NLP across English dialects. Thanks to their work, NLP-based tools like translators and transcription services can be made more accurate for a diverse range of speakers.
Machines face the dialect barrier just like humans do. For instance, the researchers noted performance drops of up to 19% in a question-answering system for Singapore English. Drops like these hit low-resource languages the hardest.
Low-resource languages aren't nearly as well represented online as languages like English and Spanish. Because the technology has seen so little of them, it's hard for their speakers to be well supported by it.
Multi-VALUE. As a solution to this issue, the researchers developed Multi-VALUE. They drew on a catalog built from decades of linguistic research that documents grammatical features of many English dialects. Specifically, they looked at the grammatical role of each word in a sentence so they could rearrange it to match different English variants.
For example, the Standard American English (SAE) sentence "Sofia was praised by her teacher" becomes "Sofia give her teacher praise" in Colloquial Singapore English (CSE). Notice how the meaning seems to change completely, and how easy it would be to misunderstand either sentence.
Instead of focusing only on the vocabulary in the data, the researchers focused on each sentence's grammatical structure, breaking each one down into smaller pieces. For example, "he doesn't have a cat" contains the negation "n't", which attaches to the auxiliary verb "does."
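To make that concrete, here is a toy sketch of this kind of grammatical analysis. It is not the actual Multi-VALUE code; it simply uses spaCy (my own choice of parser, not something the paper prescribes) to print the grammatical role of each word, which is the kind of structure a dialect rewrite rule would key on.

```python
# Toy sketch, not the actual Multi-VALUE implementation: use a dependency
# parser (spaCy assumed here) to expose the grammatical role of each word,
# the kind of structure that grammar-based rewrite rules operate on.
import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline

def show_roles(sentence: str) -> None:
    """Print each token with its dependency label and the word it attaches to."""
    for token in nlp(sentence):
        print(f"{token.text:>8}  {token.dep_:<6}  head={token.head.text}")

show_roles("He doesn't have a cat.")
# The negation token ("n't") surfaces with a "neg" dependency label attached
# to the verb phrase, making it easy for a rule to find and rewrite.
```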
By analyzing grammar this meticulously, the researchers built a linguistic framework that performs almost equally well across several different dialects of English.
Because of their work, people who speak low-resource languages can interact with NLP-based devices without a huge communication gap. This makes the technology much fairer to people who don't use Standard American English.
Limitations. At the same time, dialects don't stay the same over time, and the researchers thoughtfully acknowledge this. Many specific features of an English variant can change, even over just 5 to 10 years!
I know what you're thinking. And you're right, this shouldn't be news to us. Today, we know these dialect changes are real because of words like "highkey" and "goat" (greatest of all time, not the animal I used to visit at the farm).
So if the various dialects of English are always changing, how do we go about training NLP models?
To address this limitation, we could train a new model on new data every single time a dialect changes. But doing so is inefficient and wastes an unnecessary amount of time and compute.
Adapters. Instead, Multi-VALUE lets people train one single model and augment it as time goes by. Rather than retraining the entire model for each dialect change, developers can train adapters: small modules that plug into the model and can be swapped in to handle a given dialect.
Adapters are commonly used for parameter-efficient and modular transfer learning. Parameter-efficient fine-tuning (PEFT) means we adjust only a small number of parameters in a pre-trained model while keeping the rest of it frozen. This keeps fine-tuning practical even for large language models (LLMs).
Modular transfer learning means we already have a pre-trained model, so we reuse it as the starting point for a slightly different machine learning task.
Since adapters sit at the intersection of these two, they're known for being easy and efficient to use. For Multi-VALUE in particular, the dialect adapters are called Task-Agnostic Dialect Adapters (TADAs).
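As a rough sketch of what parameter-efficient fine-tuning with adapters can look like in practice, here is how one might add and train a single adapter with the AdapterHub adapters library from [3]. The model checkpoint and adapter name below are illustrative placeholders, not anything used in the TADA work itself.

```python
# Minimal PEFT sketch with the AdapterHub "adapters" library [3].
# The checkpoint and adapter name are illustrative placeholders.
from adapters import AutoAdapterModel

model = AutoAdapterModel.from_pretrained("bert-base-uncased")

# Add a small bottleneck adapter; the pre-trained backbone stays untouched.
model.add_adapter("sae_task_adapter")

# Freeze the backbone and mark only the adapter's weights as trainable,
# so fine-tuning updates a tiny fraction of the model's parameters.
model.train_adapter("sae_task_adapter")
```

Training then proceeds as usual, but only the adapter weights change, which is what makes the approach so cheap to repeat when a dialect shifts.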
Robustness with TADAs. At test time, the TADA modules are stacked with task-specific adapters trained on Standard American English, which improves dialect performance without any further training of the full model. This helps the NLP model generalize much better across various English dialects.
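For the stacking step in particular, a hedged sketch in the spirit of the TADA paper [1] might look like the following, again using the adapters library's composition API with placeholder adapter names.

```python
# Sketch of stacking a task-agnostic dialect adapter with an SAE task adapter,
# in the spirit of TADA [1]. Adapter names are placeholders.
from adapters import AutoAdapterModel
from adapters.composition import Stack

model = AutoAdapterModel.from_pretrained("bert-base-uncased")
model.add_adapter("tada_dialect")  # dialect adapter, trained once, task-agnostic
model.add_adapter("sae_task")      # task adapter, trained only on SAE data

# Activate both at once: inputs flow through the dialect adapter and then the
# SAE task adapter, so the task adapter never has to be retrained per dialect.
model.active_adapters = Stack("tada_dialect", "sae_task")
```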
It's great to see the authors establish more equity in NLP for those who speak low-resource languages. Let's see where this research goes in the future!
Further Reading:
[1] Held, W., Ziems, C. and Yang, D. (2023) "TADA: Task-Agnostic Dialect Adapters for English". arXiv.
[2] Kannan, P. (no date) "Addressing Equity in Natural Language Processing of English Dialects". Stanford HAI. Available at: https://hai.stanford.edu/news/addressing-equity-natural-language-processing-english-dialects
[3] Poth, C. et al. (2023) "Adapters: A Unified Library for Parameter-Efficient and Modular Transfer Learning". Available at: https://aclanthology.org/2023.emnlp-demo.13.pdf
Published via Towards AI