Increasing Robustness and Equity in NLP for Various English Dialects

Author(s): Eera Bhatt

Originally published on Towards AI.

Natural language processing (NLP) is a popular subfield of machine learning that enables computers to interpret and use human language to achieve certain tasks. To do this, we have to train the computer on text and speech data for it to learn patterns from the language and make predictions.

But let’s be honest for a second. There are still times when I don’t understand what people are saying during conversations, even when we’re both speaking English. Since the United States is full of immigrant families, it’s very common to face this great dialect barrier.

So when “English” means something a bit different to each of us, how can we expect machines to understand it?


Solution. Recently, a group of researchers from Stanford University and Georgia Tech released Multi-VALUE, a set of resources meant to improve equity in NLP across English dialects. Because of their hard work, NLP-based tools like translators and transcription services can be made more accurate across a diverse range of speakers.

Machines face the dialect barrier just like humans do. For instance, the researchers noted performance drops of up to 19% for a question-answering system on Singapore English. Most of these performance drops affect low-resource dialects.

Low-resource languages and dialects aren’t as widely present online as high-resource languages like English and Spanish are. Since technology has seen so little of them, it’s hard for it to support the people who speak them.

Multi-VALUE. To address this issue, these researchers developed Multi-VALUE. They drew on a catalog that compiles decades of linguistic research on the features of different English dialects. Specifically, they looked at the grammatical role of each word in a sentence so they could rewrite the sentence to align with different English variants.

For example, in Standard American English (SAE), the sentence “Sofia was praised by her teacher” translates to “Sofia give her teacher praise” in Colloquial Singapore English (CSE). To someone who only knows SAE, the meaning seems to flip: it now reads as if Sofia is the one giving the praise. See how easy it is to misunderstand either sentence?

Instead of just focusing on vocabulary in the data, these researchers focused on each sentence’s grammatical structure, breaking each one down into smaller chunks. For example, in “he doesn’t have a cat,” the negation “n’t” is attached to the auxiliary verb “does.”
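To make the idea concrete, here is a minimal Python sketch of a grammar-aware rewrite rule. It is not Multi-VALUE’s actual code: the `Token` class, the hand-assigned role labels, and the function names are all hypothetical. The rule itself (using “don’t” where SAE uses “doesn’t,” a feature documented in several English varieties) is picked only as an illustration.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Token:
    text: str
    role: str  # hand-assigned grammatical role (hypothetical label set)


def negated_auxiliary(tokens: List[Token]) -> Optional[int]:
    """Return the index of an auxiliary verb directly followed by a negation."""
    for i in range(1, len(tokens)):
        if tokens[i].role == "NEG" and tokens[i - 1].role == "AUX":
            return i - 1
    return None


def apply_invariant_dont(tokens: List[Token]) -> List[Token]:
    """Illustrative rule: rewrite a negated "does" as "do"."""
    idx = negated_auxiliary(tokens)
    if idx is None or tokens[idx].text.lower() != "does":
        return list(tokens)  # the rule does not apply, so nothing changes
    rewritten = list(tokens)
    rewritten[idx] = replace(tokens[idx], text="do")
    return rewritten


def detokenize(tokens: List[Token]) -> str:
    """Join tokens into a sentence, gluing "n't" onto the previous word."""
    words: List[str] = []
    for token in tokens:
        if token.role == "NEG" and token.text == "n't" and words:
            words[-1] += token.text
        else:
            words.append(token.text)
    return " ".join(words)


# "he doesn't have a cat", annotated by hand for this example.
SENTENCE = [
    Token("he", "SUBJ"),
    Token("does", "AUX"),
    Token("n't", "NEG"),
    Token("have", "VERB"),
    Token("a", "DET"),
    Token("cat", "OBJ"),
]

print(detokenize(SENTENCE))                        # he doesn't have a cat
print(detokenize(apply_invariant_dont(SENTENCE)))  # he don't have a cat
```

The point of the sketch is that the rule looks at grammatical roles (an auxiliary followed by a negation) and not at a fixed string, so it leaves sentences without that structure untouched.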

By analyzing grammar this meticulously, these researchers built a framework that helps NLP models perform almost equally well across several different dialects of English.

Because of their work, people who speak low-resource dialects can interact with NLP-based devices without a huge communication gap. This makes the technology much fairer to people who don’t use Standard American English.

Limitations. Dialects don’t stay fixed, though, and the researchers thoughtfully acknowledge this. Many specific features of each English variant can change, even over just 5 to 10 years!

I know what you’re thinking. And you’re right, this shouldn’t be news to us. Today, we know these dialect changes are real because of words like “highkey” and “goat” (greatest of all time, not the animal I used to visit at the farm).

So if the various dialects of English are always changing, how do we go about training NLP models?

To address this limitation, we could train a new model on new data every single time a dialect goes through some change. But doing so is very inefficient and takes far more time than it should.

Adapters. Instead, Multi-VALUE lets people train a single model and extend it as time goes by. Rather than retraining the entire model for each dialect change, developers can train adapters. These are small modules added to the model that can be swapped in to adapt it to a given dialect.

Adapters are commonly used for parameter-efficient and modular transfer learning. Parameter-efficient fine-tuning (PEFT) means we train only a small number of parameters while keeping the rest of the pre-trained model unchanged. This makes adapting large language models (LLMs) much cheaper than full fine-tuning, often with comparable performance.
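A quick back-of-the-envelope calculation shows why this is cheap. The sketch below assumes a common bottleneck adapter design (a down-projection followed by an up-projection, each with a bias); the article does not say which design the researchers use. The hidden size, bottleneck size, layer count, and base model size are illustrative values chosen for this example, not figures from the research.

```python
def bottleneck_adapter_params(hidden_size: int, bottleneck_size: int) -> int:
    """Count weights and biases in one down-projection plus one up-projection."""
    down = hidden_size * bottleneck_size + bottleneck_size
    up = bottleneck_size * hidden_size + hidden_size
    return down + up


# Assumed sizes, for illustration only.
HIDDEN_SIZE = 768
BOTTLENECK_SIZE = 48
NUM_LAYERS = 12                  # one adapter per layer (assumption)
BASE_MODEL_PARAMS = 110_000_000  # frozen pre-trained model (assumption)

per_layer = bottleneck_adapter_params(HIDDEN_SIZE, BOTTLENECK_SIZE)
adapter_total = per_layer * NUM_LAYERS
trainable_share = adapter_total / BASE_MODEL_PARAMS

print(f"adapter parameters per layer: {per_layer:,}")
print(f"adapter parameters in total:  {adapter_total:,}")
print(f"share of the frozen model:    {trainable_share:.2%}")
```

Under these assumed sizes, the adapters add fewer than one million trainable parameters, which is under 1% of the frozen model.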

Transfer learning means that we already have a pre-trained model, so we use it as a starting point for a slightly different machine learning task. It is modular when the added pieces, like adapters, can be trained separately and then combined or swapped.

Since adapters are at the intersection of these two, they’re known for being pretty easy and efficient to use. For Multi-VALUE in particular, the dialect adapters are called Task-Agnostic Dialect Adapters (TADAs).

Robustness with TADAs. At test time, the TADA modules are stacked with task-specific adapters trained on Standard American English, which improves performance on dialects without needing to train the model further. This helps the NLP model generalize much better to input in various English dialects.
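Stacking can be pictured as running one small module after another on the model’s hidden state. The toy sketch below is only an illustration under stated assumptions: it uses a generic bottleneck adapter with a residual connection, tiny hand-picked weights, and assumes the dialect adapter runs before the task adapter. The class and function names are hypothetical and do not come from the TADA code.

```python
from typing import Callable, List

Vector = List[float]
Matrix = List[List[float]]


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class BottleneckAdapter:
    """Generic adapter: down-project, ReLU, up-project, add back the input."""

    def __init__(self, down: Matrix, up: Matrix) -> None:
        self.down = down  # shape: bottleneck x hidden
        self.up = up      # shape: hidden x bottleneck

    def __call__(self, hidden: Vector) -> Vector:
        squeezed = [max(0.0, x) for x in matvec(self.down, hidden)]
        update = matvec(self.up, squeezed)
        return [h + u for h, u in zip(hidden, update)]


def stack(*adapters: Callable[[Vector], Vector]) -> Callable[[Vector], Vector]:
    """Compose adapters so each one runs on the previous one's output."""
    def run(hidden: Vector) -> Vector:
        for adapter in adapters:
            hidden = adapter(hidden)
        return hidden
    return run


# Toy weights standing in for separately trained modules.
dialect_adapter = BottleneckAdapter(down=[[1.0, 0.0]], up=[[0.0], [0.5]])
task_adapter = BottleneckAdapter(down=[[0.0, 1.0]], up=[[1.0], [0.0]])

hidden_state = [2.0, 1.0]  # pretend output of a frozen model layer

task_only = task_adapter(hidden_state)
with_dialect = stack(dialect_adapter, task_adapter)(hidden_state)

print(task_only)     # [3.0, 1.0]
print(with_dialect)  # [4.0, 2.0]
```

Neither adapter’s weights change when the two are stacked. The task adapter is reused as-is, and only its input differs because the dialect adapter ran first.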

It’s great to see the authors establish more equity in NLP for those who speak low-resource dialects. Let’s see where this research goes in the future!

Further Reading:

[1] Held, W., Ziems, C. and Yang, D. (2023) ‘TADA: Task-Agnostic Dialect Adapters for English’. arXiv.

[2] Kannan, P. (no date) Addressing Equity in Natural Language Processing of English Dialects, Stanford HAI. Available at: https://hai.stanford.edu/news/addressing-equity-natural-language-processing-english-dialects

[3] Poth, C. et al. (no date) Adapters: A unified library for parameter-efficient and … Available at: https://aclanthology.org/2023.emnlp-demo.13.pdf
