Accented Speech Recognition: The Inclusive Realm of Automatic Speech Recognition Systems

Last Updated on July 15, 2023 by Editorial Team

Author(s): Toluwani Aremu

Originally published on Towards AI.

Photo by Jonathan Borba on Unsplash

“Hey Google!”

“Hey Siri!”

“Hey Cortana!”

“Bixby!!”

“ALEXAAAA!!!”

Yes, I was that frustrated trying to see if Google was the problem or if I was!

In the summer of 2021, I fondly recall my valiant attempt to summon the powers of Google search to find an article for me using voice commands (actually a movie, but I am trying to appear intelligent). Alas, as someone with a knightly command of the English language, also with such a “delightful” accent (not American, not Bri’ish), I found myself in a whimsical predicament!

The speech-to-text feature, in its infinite wisdom, seemed to delight in “misrecognizing” my carefully articulated words! To add more depth to my frustration, my close friend, who also had an accent (not as beautiful as mine), effortlessly commanded the speech-to-text feature, leaving me to start investigating the cosmic forces at play! Pure agony!!!

Is Google the problem? Or am I?

Armed with curiosity, I embarked on a lofty quest to test other well-known ASR systems. With hope in my eyes and “deora ar mo chroi”, I swiftly conducted experiments just to prove I wasn’t the problem.

The mighty Google Home Assistant recognized my commands at least 80% of the time, with the Windows Cortana not too far behind. But then there was Bixby, a true paragon of recognition and obedience, recognizing my commands almost all of the time. Alexa joined Bixby in improving my injured self-esteem, but dear, oh dear, Siri decided to dance to her own tune.

Here’s the confusing part! That Google Home Assistant totally outshone its dear sibling (Search) was perplexing. Larry, riddle me this! Why did the chicken decide to go “Home” instead of “Searching” for food?

My experiments did not end there! In fact, I joined forces with two brave and curious musketeers to unveil the accent-ridden black hole of Automatic Speech Recognition (ASR) systems. So, brace yourself as I embark on a riveting journey to elucidate the very essence of ASR systems before delving further into my tale.

WHAT ARE ASR SYSTEMS?

Speech recognition technology, also known as automatic speech recognition (ASR), is a vital component of speech AI, the family of technologies aimed at facilitating human-computer interaction through voice. ASR converts spoken language (audio signals) into written text, serving purposes such as command input. Modern systems aim to process different dialects and accents accurately, and they are widely used in user-facing contexts like virtual agents, live captioning, and clinical note-taking. Developers in the speech AI field may also refer to ASR as speech-to-text (STT) or voice recognition.

ASR has experienced significant growth and adoption, with popular platforms like TikTok, Instagram, Spotify, and Zoom incorporating ASR technology. There are two primary approaches to ASR: the traditional hybrid approach and the end-to-end deep learning approach. The hybrid approach chains separately built acoustic, pronunciation, and language models, while the end-to-end deep learning approach trains a single neural network to handle the entire ASR process. Ongoing advancements in ASR technology contribute to continuous improvements in system accuracy.

ASR systems are trained on extensive datasets comprising audio recordings and corresponding transcripts. The accuracy of ASR can be influenced by factors such as audio recording quality, speaker accents, and background noise. ASR finds applications in various fields, including transcription, captioning, and dictation. The future of ASR looks promising as the technology continues to evolve, driven by ongoing enhancements and innovations.
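
The accuracy mentioned above is most commonly reported as word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the system's output into the reference transcript, divided by the number of reference words. Here is a minimal sketch; the function name and both transcripts are made up for illustration and are not from my experiments.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcripts, invented for illustration only.
reference = "find me an article about speech recognition"
hypothesis = "find me an artichoke about peach recognition"
print(word_error_rate(reference, hypothesis))
```

Computing this separately for each accent group, rather than once over a whole test set, is what makes an accent gap visible.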

ASR systems consist of several components and techniques working together to convert speech to text accurately. Feature extraction converts raw speech signals into relevant acoustic features. Acoustic modeling captures the relationship between those features and phonetic units. Pronunciation modeling maps words to the sequences of phonetic units that realize them, and language modeling estimates how likely a word sequence is, independent of the audio. Decoding finds the most likely word sequence given the acoustic and language models. Training and adaptation optimize model parameters using labeled speech data, and post-processing refines the output. More recent end-to-end ASR systems use deep learning to map acoustic features directly to word sequences. ASR systems continue to advance, driven by deep learning, larger training data, and computational resources, enabling the many applications that rely on speech recognition technology.
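
To make the decoding step concrete, here is a toy sketch of how a decoder might combine the two models: each candidate transcript gets an acoustic score (how well it matches the audio) and a language-model score (how plausible it is as text), and the decoder keeps the candidate with the best combined log score. The candidates, the probabilities, and the `lm_weight` parameter are all assumptions invented for this example; a real decoder searches a vastly larger space.

```python
import math

# Hypothetical acoustic scores: log P(audio | words). Invented numbers.
acoustic_log_likelihood = {
    "recognize speech": math.log(0.30),
    "wreck a nice beach": math.log(0.35),
}
# Hypothetical language-model scores: log P(words). Invented numbers.
language_model_log_prob = {
    "recognize speech": math.log(0.020),
    "wreck a nice beach": math.log(0.001),
}


def decode(acoustic, language_model, lm_weight=1.0):
    """Return the candidate with the highest combined log score."""
    return max(
        acoustic,
        key=lambda words: acoustic[words] + lm_weight * language_model[words],
    )


# The acoustic model alone slightly prefers the wrong transcript...
print(decode(acoustic_log_likelihood, language_model_log_prob, lm_weight=0.0))
# ...but the language model tips the balance toward the plausible one.
print(decode(acoustic_log_likelihood, language_model_log_prob))
```

The same arithmetic hints at why accents hurt: if the acoustic score for the correct words drops far enough, no language model can rescue the transcript.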

HOW IS ACCENT A PROBLEM IN ASR?

English is the most universally adopted language in the world. As a result, different parts of the world have their own styles of communicating in it. In fact, you can find multiple styles within the same part of the world, perhaps due to differences in dentition or vocal perception. Some styles are classified as English-based creoles, which are languages that grew out of English under the influence of other languages. Other styles are simply English with a distinctive and noticeable regional accent.

Due to this variance in style, accents present challenges for ASR systems as they can cause the misrecognition of words. Let’s take a look at the word “SCHEDULE”.

Speaker A from the US pronounces it as “Skeh-dool”

Speaker B from the UK pronounces it as “Shed-yool”

A non-native speaker C would rather pronounce every syllable, i.e., “Skeh-doo-leh”

All speakers are saying the same word, but with their respective accents!

Here is where accent becomes a problem for ASR systems. Speakers A and B are using standard pronunciations for “SCHEDULE,” while Speaker C isn’t. ASR systems are typically trained on datasets of native/standard speakers, so they may not accurately recognize words pronounced differently in other accents.

Accents introduce acoustic variability, with different pronunciation patterns and speech rhythms, making it challenging for ASR systems to transcribe accurately. The lack of diverse training data covering all accents can lead to poorer performance when encountering specific accents in real-world scenarios. Accented speech may feature unfamiliar or non-standard pronunciations not adequately represented in pronunciation models, resulting in misinterpretation or incorrect recognition. Accents also impact language models, affecting word usage, syntax, and vocabulary choice, causing lower accuracy in transcribing accented speech.
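
The “SCHEDULE” example can be turned into a toy sketch of a pronunciation lexicon with a coverage gap. The dictionary, the `recognize` function, and the informal spellings are illustrative assumptions, not real phonetic transcriptions or any actual system's lexicon.

```python
# Toy pronunciation lexicon that has only ever seen Speaker A's variant.
# Spellings are informal and illustrative, not real phonetic transcriptions.
lexicon = {
    "schedule": {"skeh-dool"},
}


def recognize(pronunciation, lexicon):
    """Return the word whose known pronunciations include the input, else None."""
    for word, variants in lexicon.items():
        if pronunciation in variants:
            return word
    return None


print(recognize("skeh-dool", lexicon))     # Speaker A is covered
print(recognize("skeh-doo-leh", lexicon))  # Speaker C is not

# Adding Speaker C's variant, as more diverse data would, closes the gap.
lexicon["schedule"].add("skeh-doo-leh")
print(recognize("skeh-doo-leh", lexicon))
```

Real systems match sounds probabilistically rather than by exact lookup, but the failure mode is the same: a pronunciation the system never saw is a pronunciation it struggles to map back to the right word.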

Photo by Sebastian Scholz (Nuki) on Unsplash

ADDRESSING THE ACCENTED SPEECH RECOGNITION PROBLEM

Back to my story! In the fall semester, after my spirited encounter with Google search, in pursuit of understanding the enigmatic challenge of accent recognition, I joined forces with two esteemed colleagues who shared my noble ambitions. Together, we embarked on a daring investigation, seeking to unravel and solve the mysteries that plagued the recognition of accented commands.

With fervor and determination, we turned to disentanglement, a technique that involves breaking down the features in an embedding (latent) space into narrowly defined variables and encoding them as separate dimensions. Like skilled alchemists, we sought to distill clarity from the chaos, decipher the secret language of accents and empower our ASR systems with the ability to comprehend their unique nuances.

Basically, we thought that if we could find a way to disentangle the content (i.e., commands) of an accented speaker S and match those representations with the style representations (i.e., voice) of a native speaker T, we would be able to provide ASR systems with commands in native English y, translated from an accented input x. The brilliance of our hypothesis gleamed with hopeful promise. We imagined being known as the pioneers in this domain. Yann, Bengio, and Hinton ain’t got nothing on us…

…Boy, were we wrong!

CUDA chuckled at our audacity, revealing the folly of our assumptions. Our grand experiment, so full of promise, crumbled like a house of cards, leaving us humbled by the sheer complexity of the task at hand. While our experiments (based on the VQ-VAE) produced better results than the baseline experiments, the improvement was only marginal!!!

Our native voices were repeating the exact words spoken by our accented speakers, using the speakers’ accents. What we wanted was to clear out the accent. It was during our cross-examinations of other model-centric and data-centric methods with respect to ours that we discovered that speech was made up of many components other than the content and style that we had thought of.
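
For readers unfamiliar with the VQ-VAE mentioned above: its defining step is vector quantization, where each encoder output is snapped to the nearest entry in a learned codebook, producing a sequence of discrete codes. The hope in content/style setups is that those codes keep mostly the content while the voice is supplied separately. The sketch below shows only that nearest-entry lookup; it is not our actual model, and the codebook and frame values are made up.

```python
def quantize(vector, codebook):
    """Return (index, entry) of the codebook entry closest to `vector`."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(
        range(len(codebook)),
        key=lambda i: squared_distance(vector, codebook[i]),
    )
    return index, codebook[index]


# Hypothetical 2-D codebook; real models learn many higher-dimensional entries.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
# Hypothetical encoder outputs for three consecutive audio frames.
frames = [(0.1, -0.2), (0.9, 0.8), (0.2, 0.7)]

codes = [quantize(frame, codebook)[0] for frame in frames]
print(codes)  # one discrete code per frame
```

Nothing in this lookup forces accent information out of the codes, which is consistent with what we observed: the accent came along for the ride.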

So, what are the best approaches to solve this problem?

Data, Data, More Data: According to Lingohut and EBC TEFL, there are 160 recognized English dialects in the world. However, if there are 1.5 billion English speakers in the world, and only an estimated 500 million people are native speakers, then I would claim that there are at least 2,000 ways and styles of speaking the language (considering other specific factors). Current benchmark datasets cover only a small number of accents, and most of those accent classes lack enough data. Data is a major factor when it comes to the success of AI models. For a big improvement in accented speech recognition, an improvement in data diversity is one of the best ways forward.
Data, again, but high-quality: While improving diversity in data collection, another approach is to ensure the high quality of the collected data. One second of clear, standard-pitched recording is better than eight seconds of noisy, low-pitched recording.
Enhancing Acoustic Modeling Techniques: Acoustic modeling is a crucial component of ASR systems that captures the relationship between speech audio signals and corresponding phonetic units. To address accent-related challenges, advancements in acoustic modeling techniques can be employed. This may involve developing more robust models, such as deep neural networks (DNNs) or convolutional neural networks (CNNs), that are capable of handling the acoustic variability introduced by accents. By improving the modeling of accent-specific acoustic features, the ASR system can better distinguish and recognize speech accurately. Personally, I think one embedding space for binding every accented version of the same words would improve robustness in this domain. To borrow the words of Meta’s Rohit et al., “One acoustic embedding space to catch all accents!”… or “In brightest day, in darkest nights, no accent shall escape my sight!”
Adapting Language Models: Language modeling plays a significant role in ASR systems by estimating the likelihood of word sequences. Adapting language models to accommodate accent-specific variations can help improve the accuracy of transcriptions. This involves refining language models to incorporate accent-specific vocabulary, syntax, and word usage. By training language models on diverse accent data and incorporating accent-specific linguistic patterns, the ASR system can better adapt to different accents and produce more accurate transcriptions.
Transfer Learning and Multi-Accent Training: Advancements in transfer learning techniques can also contribute to addressing accent-related challenges. Transfer learning involves pretraining a model on a large dataset and fine-tuning it on a smaller dataset specific to a particular accent. By leveraging knowledge learned from a broader set of data, the ASR system can improve its performance on accents with limited training data. Additionally, multi-accent training focuses on training ASR systems on a combination of diverse accent datasets, allowing the model to learn and generalize across different accents more effectively.
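
As one small, concrete piece of the data and multi-accent points above, here is a sketch of accent-balanced sampling: each utterance is weighted inversely to how common its accent is, so that rare accents are drawn as often as common ones during training. The function name, the accent labels, and the corpus sizes are all hypothetical, and this is just one simple way to rebalance, not a recipe from any particular system.

```python
from collections import Counter


def accent_sampling_weights(accent_labels):
    """Weight each utterance by 1 / (size of its accent group), normalized
    to sum to 1, so every accent group carries the same total weight."""
    counts = Counter(accent_labels)
    raw = [1.0 / counts[label] for label in accent_labels]
    total = sum(raw)
    return [weight / total for weight in raw]


# Hypothetical, deliberately imbalanced corpus: one accent label per utterance.
labels = ["accent_a"] * 6 + ["accent_b"] * 3 + ["accent_c"] * 1
weights = accent_sampling_weights(labels)

# Total sampling weight per accent group after rebalancing.
per_accent = Counter()
for label, weight in zip(labels, weights):
    per_accent[label] += weight
print(dict(per_accent))
```

These weights could be handed to something like `random.choices(utterances, weights=weights, k=batch_size)` when drawing training batches. Reweighting only stretches the data you already have, though; a single recording of an accent is still a single recording.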

I am not yet a domain expert in this area, but I believe the approaches above would go a long way toward making the deployment and use of ASR systems inclusive. Recent breakthroughs in different aspects of AI, i.e., deep learning, large-scale training data, and computational resources, suggest that it won’t take too long for ASR to be perfect, or at least close to it.

It is 2023 now. The speech recognition feature on Google search has stopped frustrating me!

Either Google has improved, or I have improved, or we both improved! No more toxicity between us…

But if it decides to change again! Just saying!

