Towards AI Can Help your Team Adopt AI: Corporate Training, Consulting, and Talent Solutions.


Accented Speech Recognition: The Inclusive Realm of Automatic Speech Recognition Systems
Latest   Machine Learning

Accented Speech Recognition: The Inclusive Realm of Automatic Speech Recognition Systems

Last Updated on July 15, 2023 by Editorial Team

Author(s): Toluwani Aremu

Originally published on Towards AI.

Photo by Jonathan Borba on Unsplash

“Hey Google!”

“Hey Siri!”

“Hey Cortana!”



Yes, I was that frustrated trying to see if Google was the problem or if I was!

In the summer of 2021, I fondly recall my valiant attempt to summon the powers of Google search to find an article for me using voice commands (actually a movie, but I am trying to appear intelligent). Alas, as someone with a knightly command of the English language, also with such a “delightful” accent (not American, not Bri’ish), I found myself in a whimsical predicament!

The speech-to-text feature, in its infinite wisdom, seemed to delight in “misrecognizing” my carefully articulated words! To add more depth to my frustration, my close friend, who also had an accent (not as beautiful as mine), effortlessly commanded the speech-to-text feature, leaving me to start investigating the cosmic forces at play! Pure agony!!!

Is Google the problem? Or am I?

Armed with curiosity, I embarked on a lofty quest to test other well-known ASR systems. With hope in my eyes and “deora ar mo chroi”, I swiftly conducted experiments just to prove I wasn’t the problem.

The mighty Google Home Assistant recognized my commands at least 80% of the time, with the Windows Cortana not too far behind. But then there was Bixby, a true paragon of recognition and obedience, recognizing my commands almost all of the time. Alexa joined Bixby in improving my injured self-esteem, but dear, oh dear, Siri decided to dance to her own tune.

Here’s the confusing part! That Google Home Assistant totally outshone its dear sibling (Search), was perplexing. Larry, riddle me this! Why did the chicken decide to go “Home” instead of “Searching” for food?

My experiments did not end there! In fact, I joined forces with two brave and curious musketeers to unveil the accent-ridden black hole of Automatic Speech Recognition (ASR) systems. So, brace yourself as I embark on a riveting journey to elucidate the very essence of ASR systems before delving further into my tale.

Photo by Ivan Bandura on Unsplash


Speech recognition technology, also known as automatic speech recognition (ASR), is a vital component of speech AI. ASR enables the conversion of spoken language (audio signals) into written text, serving various purposes such as command input. It has advanced capabilities to accurately process different language dialects and accents, finding extensive applications in user-facing contexts like virtual agents, live captioning, and clinical note-taking. Developers in the speech AI field may use alternative terms like speech-to-text (STT) or voice recognition to refer to ASR. ASR plays a crucial role within speech AI, which encompasses technologies aimed at facilitating human-computer interaction through voice communication.

ASR has experienced significant growth and adoption, with popular platforms like TikTok, Instagram, Spotify, and Zoom incorporating ASR technology. There are two primary approaches to ASR: the traditional hybrid approach and the end-to-end Deep Learning approach. The hybrid approach combines statistical and rule-based methods, while the end-to-end Deep Learning approach employs a single neural network to handle the entire ASR process. Ongoing advancements in ASR technology contribute to continuous improvements in system accuracy.

ASR systems are trained on extensive datasets comprising audio recordings and corresponding transcripts. The accuracy of ASR can be influenced by factors such as audio recording quality, speaker accents, and background noise. ASR finds applications in various fields, including transcription, captioning, and dictation. The future of ASR looks promising as the technology continues to evolve, driven by ongoing enhancements and innovations.

ASR systems consist of several components and techniques working together to accomplish accurate speech-to-text conversion. These components include acoustic modeling, language modeling, pronunciation modeling, feature extraction, decoding, training and adaptation, post-processing, and the emerging end-to-end ASR approach. Acoustic modeling captures the relationship between speech audio signals and phonetic units, while language modeling predicts likely word sequences given acoustic observations. Pronunciation modeling maps phonetic units to their corresponding pronunciations. Feature extraction converts raw speech signals into relevant acoustic features, and decoding finds the most likely word sequence given the acoustic and language models. Training and adaptation optimize model parameters using labeled speech data, and post-processing refines the output. The recent development of end-to-end ASR systems directly maps acoustic features to word sequences using deep learning approaches. ASR systems continue to advance, driven by deep learning, training data, and computational resources, enabling various applications relying on speech recognition technology.

Photo by Dan Farrell on Unsplash


English is the most universally adopted language in the world. As a result, different parts of the world have their own styles of communicating with this language. In fact, you can find multiple styles of language within the same part of the world, perhaps due to differences in dentition or vocal perception. Some styles are classified as English-based creoles, which are distorted forms of English that have been influenced by other languages. Other styles are simply pure English with a distinctive and noticeable regional accent.

Due to this variance in style, accents present challenges for ASR systems as they can cause the misrecognition of words. Let’s take a look at the word “SCHEDULE”.

Speaker A from the US pronounces it as “Skeh-dool”

Speaker B from the UK pronounces it as “Sher-dool”

A non-native speaker C would rather pronounce every syllable, i.e., “Skeh-doo-leh”

All speakers are saying the same word, but with their respective accents!

Here is where accent becomes a problem for ASR Systems. Speakers A and B are using standard pronunciations for “SCHEDULE,” while Speaker C isn’t. ASR systems are typically trained on datasets of native/standard speakers, which may not accurately recognize words pronounced differently in other accents.

Accents introduce acoustic variability, with different pronunciation patterns and speech rhythms, making it challenging for ASR systems to transcribe accurately. The lack of diverse training data covering all accents can lead to poorer performance when encountering specific accents in real-world scenarios. Accented speech may feature unfamiliar or non-standard pronunciations not adequately represented in pronunciation models, resulting in misinterpretation or incorrect recognition. Accents also impact language models, affecting word usage, syntax, and vocabulary choice, causing lower accuracy in transcribing accented speech.

Photo by Sebastian Scholz (Nuki) on Unsplash


Back to my story! In the fall semester, after my spirited encounter with Google search, in pursuit of understanding the enigmatic challenge of accent recognition, I joined forces with two esteemed colleagues who shared my noble ambitions. Together, we embarked on a daring investigation, seeking to unravel and solve the mysteries that plagued the recognition of accented commands.

With fervor and determination, we turned to the concept of disentanglement, a concept/technique which involves breaking down features in an embedding/latent space into narrowly defined variables and encoding them as separate dimensions. Like skilled alchemists, we sought to distill clarity from the chaos, decipher the secret language of accents and empower our ASR systems with the ability to comprehend their unique nuances.

Basically, we thought that if we could find a way to disentangle the content (i.e., commands) of an accented speaker S and match those representations with the style representations (i.e., voice) of a native speaker T, we would be able to provide ASR systems with commands in native English y, translated from an accented input x. The brilliance of our hypothesis gleamed with hopeful promise. We imagined being known as the pioneers in this domain. Yann, Bengio, and Hinton ain’t got nothing on us…

…Boy, were we wrong!

CUDA chuckled at our audacity, revealing the folly of our assumptions. Our grand experiment, so full of promise, crumbled like a house of cards, leaving us humbled by the sheer complexity of the task at hand. While our experiments (based on the VQ-VAE) produced better results than the baseline experiments, it was only marginal!!!

Our native voices were repeating the exact words spoken by our accented speakers, using the speakers’ accents. What we wanted was to clear out the accent. It was during our cross-examinations of other model-centric and data-centric methods with respect to ours that we discovered that speech was made up of many components other than the content and style that we had thought of.

So, what are the best approaches to solve this problem?

  1. Data, Data, More Data: According to Lingohut and EBC TEFL, there are 160 recognized English dialects in the world. However, if there are 1.5 billion English speakers in the world, and only an estimated 500 million people are native speakers, then I would claim that there is at least 2000 ways and styles of speaking the language (considering other specific factors). Current benchmark datasets only cover a small number of accents, of which the majority of the accent classes lack enough data. Data is a major factor when it comes to the success of AI models. For a big improvement in accented speech recognition, an improvement in data diversity is one of the best ways forwards.
  2. Data, again, high-quality ones: While improving the diversity in data collection, another approach is to ensure the high quality of the collected data. One second of clear standard-pitched recording is better than eight seconds of noisy low-pitched recording.
  3. Enhancing Acoustic Modeling Techniques: Acoustic modeling is a crucial component of ASR systems that captures the relationship between speech audio signals and corresponding phonetic units. To address accent-related challenges, advancements in acoustic modeling techniques can be employed. This may involve developing more robust models, such as deep neural networks (DNNs) or convolutional neural networks (CNNs), that are capable of handling the acoustic variability introduced by accents. By improving the modeling of accent-specific acoustic features, the ASR system can better distinguish and recognize speech accurately. Personally, I think one embedding space for binding every accented version of the same words would improve robustness in this domain. Copying Meta’s Rohit et al words, “One acoustic embedding space to catch all accents!”… or “In brightest day, in darkest nights, no accent shall escape my sight!”
  4. Adapting Language Models: Language modeling plays a significant role in ASR systems by estimating the likelihood of word sequences. Adapting language models to accommodate accent-specific variations can help improve the accuracy of transcriptions. This involves refining language models to incorporate accent-specific vocabulary, syntax, and word usage. By training language models on diverse accent data and incorporating accent-specific linguistic patterns, the ASR system can better adapt to different accents and produce more accurate transcriptions.
  5. Transfer Learning and Multi-Accent Training: Advancements in transfer learning techniques can also contribute to addressing accent-related challenges. Transfer learning involves pretraining a model on a large dataset and fine-tuning it on a smaller dataset specific to a particular accent. By leveraging knowledge learned from a broader set of data, the ASR system can improve its performance on accents with limited training data. Additionally, multi-accent training focuses on training ASR systems on a combination of diverse accent datasets, allowing the model to learn and generalize across different accents more effectively.

I am not yet a domain expert in this area, but the approaches above would guarantee inclusivity when it comes to the deployment and use of ASR systems. Currently, the recent breakthroughs in different aspects of AI, i.e., deep learning, large-scale training data, and computational resources, show that it won’t take too long for ASR to be perfect, or at least close to.

Photo by Nejc Soklič on Unsplash

It is 2023 now. The speech recognition feature on Google search has stopped frustrating me!

Either Google has improved, or I have improved, or we both improved! No more toxicity between us…

But if it decides to change again! Just saying!


  1. What is Automatic Speech Recognition (ASR)? (
  2. What is Automatic Speech Recognition? U+007C NVIDIA Technical Blog
  3. What is ASR? An Overview of Automatic Speech Recognition (
  4. So How Many English Accents Are There In The World? The Number May Surprise You — LingoHut Blog
  5. How many different English accents are there? (
  6. Disentangled Representation Learning Definition U+007C DeepAI
  7. SCHEDULE U+007C Pronunciation in English (
  8. How Many People Speak English, And Where Is It Spoken? (
  9. [2305.05665] ImageBind: One Embedding Space To Bind Them All (

Join thousands of data leaders on the AI newsletter. Join over 80,000 subscribers and keep up to date with the latest developments in AI. From research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming a sponsor.

Published via Towards AI

Feedback ↓