Whisper Variants Comparison: What Are Their Features And How To Implement Them?
Author(s): Yuki Shizuya
Originally published on Towards AI.
Recently, I have been researching automatic speech recognition (ASR) to transcribe speech data. When it comes to open-source ASR models, Whisper [1], developed by OpenAI, might be the best choice thanks to its highly accurate transcription. However, there are many variants of Whisper, so I want to compare their features. In this blog, I will quickly recap Whisper, introduce the variants, and show how to implement them in Python. I will cover vanilla Whisper, Faster-Whisper, WhisperX, Distil-Whisper, and Whisper-Medusa.
Table of Contents
- What is Whisper?
- Whisper variants: Faster-Whisper, WhisperX, Distil-Whisper, and Whisper-Medusa
- Python implementation of Whisper variants: Comparing the results on real-world audio data
1. What is Whisper?
Whisper [1] is an automatic speech recognition (ASR) model developed by OpenAI. It is trained on 680,000 hours of multilingual and multi-task supervised data, covering transcription, translation, voice activity detection, alignment, and language identification. Before Whisper arrived, no model had been trained on such a massive amount of data in a supervised way. Regarding architecture, Whisper adopts an Encoder-Decoder Transformer for scalability. The architecture illustration is shown below.
Firstly, Whisper converts audio data into a log-mel spectrogram. A log-mel spectrogram is a visual representation of the spectrum of signal frequencies on the mel scale, which is commonly used in speech processing and machine learning tasks. For further information, you can check this blog [2]. Whisper feeds the log-mel spectrogram through a few 1-D convolution layers and positional encoding, then processes the data in much the same way as a Transformer for natural language processing. Whisper leverages the byte-level BPE tokenizer used by GPT-2 so that it can work in multilingual settings. Thanks to multi-task learning, Whisper can also perform transcription, timestamp detection, and translation.
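If you want to inspect this input representation yourself, the openai-whisper package ships helpers for loading audio and computing the log-mel spectrogram. A minimal sketch, assuming a local file named audio.mp3:
import whisper
# Load the waveform, pad/trim it to 30 seconds, and compute the log-mel spectrogram
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio)
print(mel.shape)  # e.g. (80, 3000): 80 mel bins over a 30-second window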
Official Whisper comes in six model sizes offering tradeoffs between speed and accuracy; only the four smaller sizes also have English-only versions.
Just recently (October 2024), OpenAI released a new version, "turbo," which has almost the same capability as the large model but offers a significant speed-up (8 times!) obtained by fine-tuning a pruned large model. All Whisper models are compatible with the Hugging Face Transformers library.
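For example, the turbo checkpoint can also be loaded through the Transformers pipeline. A minimal sketch, assuming the openai/whisper-large-v3-turbo model id published on the Hugging Face Hub:
import torch
from transformers import pipeline
device = "cuda:0" if torch.cuda.is_available() else "cpu"
pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3-turbo",  # model id on the Hugging Face Hub
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
    device=device,
)
result = pipe("audio.mp3", return_timestamps=True)
print(result["text"])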
Now we have quickly recapped Whisper. It is based on the Encoder-Decoder Transformer architecture and performs outstandingly, even compared with commercial models. In the next section, we will discuss the Whisper variants.
2. Whisper variants: Faster-Whisper, WhisperX, Distil-Whisper, and Whisper-Medusa
In this section, we will go through the Whisper variants and their features. I focus on Python and PyTorch implementations. Although Whisper.cpp and Whisper JAX are popular variants, I will not examine them. Whisper-streaming is also a popular variant for real-time inference, but it needs a high-end GPU, so I will not discuss it either. We will check Faster-Whisper, WhisperX, Distil-Whisper, and Whisper-Medusa.
Faster-Whisper
Faster-Whisper is a reimplementation of Whisper using CTranslate2, a C++ and Python library for efficient inference with Transformer models; the architecture itself is unchanged. According to the official repository, Faster-Whisper is up to ~4 times faster than the original implementation at the same accuracy while using less memory. Briefly, CTranslate2 applies many optimization techniques, such as weight quantization, layer fusion, and batch reordering. We can choose the compute type, such as float16 or int8, according to our hardware; for instance, with int8 we can run Whisper even on a CPU (see the sketch below).
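A minimal sketch of int8 inference on a CPU-only machine, with the model size chosen arbitrarily for illustration:
from faster_whisper import WhisperModel
# int8 quantization allows inference without a GPU
model = WhisperModel("small", device="cpu", compute_type="int8")
segments, info = model.transcribe("audio.mp3")
for segment in segments:
    print(segment.text)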
WhisperX (2023/03)
WhisperX [3] is an efficient speech transcription system that integrates Faster-Whisper. Although vanilla Whisper is trained on multiple tasks, including timestamp prediction, it tends to be inaccurate for word-level timestamps. Moreover, due to its sequential inference, vanilla Whisper takes considerable computation time on long-form audio inputs. To overcome these weak points, WhisperX introduces three additional stages: (1) voice activity detection (VAD), (2) cutting & merging the VAD results, and (3) forced alignment with an external phoneme model to provide accurate word-level timestamps. The architecture illustration is shown below.
Firstly, WhisperX processes the input audio through the VAD layer. As its name suggests, VAD detects voice segments; WhisperX uses the segmentation model from the pyannote-audio library for this. Next, WhisperX cuts and merges the detected voice segments, which allows it to run batched inference over the cut results. Finally, WhisperX applies forced alignment to obtain accurate word-level timestamps. Let's check a concrete example, shown below.
WhisperX leverages Whisper for the transcription and the phoneme model for phoneme-level transcription. The phoneme model detects a timestamp for each phoneme; thus, if we assign each word in the Whisper transcript the timestamp of its nearest phoneme, we get a more accurate timestamp for each word.
Even though WhisperX adds three processes on top of vanilla Whisper, it can transcribe longer audio efficiently thanks to batched inference. The following table shows the performance comparison. You can see that WhisperX keeps the WER low while increasing the inference speed.
Distil-Whisper (2023/11)
Distil-Whisper [4] was developed by Hugging Face in 2023. It compresses the Whisper Large model using knowledge distillation, leveraging common distillation techniques such as pseudo-labeling from the Whisper Large model and a Kullback-Leibler divergence loss to train the smaller model. The architecture illustration is shown below.
The architecture mirrors vanilla Whisper, but with fewer layers. For the dataset, the authors collected 21,170 hours of publicly available data from the Internet to train Distil-Whisper. Distil-Whisper is 5.8 times faster than the Whisper Large model with 51% fewer parameters, while performing within a 1% word error rate (WER) of it on out-of-distribution data. The following table shows the performance comparison.
As you can see, Distil-Whisper keeps the word error rate as low as vanilla Whisper while decreasing the latency.
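To make the distillation objective a bit more concrete, here is a minimal, simplified sketch of such a loss: a cross-entropy term on the teacher's pseudo-labels plus a KL-divergence term between the student and teacher distributions. The function name, weighting, and temperature are illustrative assumptions, not the exact training recipe.
import torch.nn.functional as F
def distillation_loss(student_logits, teacher_logits, pseudo_labels, alpha=0.8, temperature=2.0):
    # Cross-entropy against the teacher-generated pseudo-labels
    # logits: (batch, seq_len, vocab), labels: (batch, seq_len)
    ce = F.cross_entropy(student_logits.transpose(1, 2), pseudo_labels)
    # KL divergence between softened student and teacher distributions
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return alpha * ce + (1 - alpha) * kl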
Whisper-Medusa (2024/09)
Whisper-Medusa [5] is a variant that uses Medusa [6] to increase Whisper's inference speed. Medusa is an efficient LLM inference method that adds extra decoding heads to predict multiple subsequent tokens in parallel, as illustrated below.
In the left part of the figure, Medusa has three additional heads to predict subsequent tokens. If the original model outputs token y1, the three additional heads predict tokens y2, y3, and y4 in parallel. Medusa can increase the number of tokens predicted per step by adding heads, reducing the overall inference time. Note that the required VRAM increases because of the additional heads.
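Conceptually, each extra head is a small feed-forward module sitting on top of the decoder's last hidden state. Below is a toy sketch of the idea, not the actual Whisper-Medusa code; the layer structure and head count are assumptions for illustration.
import torch.nn as nn
class MedusaHeads(nn.Module):
    # Toy sketch: each head predicts one additional future token in parallel
    def __init__(self, hidden_size, vocab_size, num_heads=3):
        super().__init__()
        self.heads = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_size, hidden_size),
                nn.SiLU(),
                nn.Linear(hidden_size, vocab_size),
            )
            for _ in range(num_heads)
        ])
    def forward(self, last_hidden_state):
        # While the original LM head predicts the next token,
        # head k proposes logits for the token k steps further ahead
        return [head(last_hidden_state) for head in self.heads]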
Whisper-Medusa applies the Medusa idea to Whisper, as shown in the right part of the figure. Since Whisper's sequential decoding is a bottleneck for inference speed, Medusa's parallel heads help speed up inference. The comparison results between Whisper-Medusa and vanilla Whisper are shown below.
On several language datasets, Whisper-Medusa records a lower word error rate (WER) than vanilla Whisper while also speeding up inference by 1.5 times on average.
In this section, we checked the Whisper variants and their features. The next section explores how to implement them in Python and checks their capability on real-world audio.
3. Python implementation of Whisper variants: Comparing the results on real-world audio data
In this section, we will learn how to implement Whisper and its variants in Python. For real-world audio data, I use the audio from this YouTube video, downloaded manually. The video is around 14 minutes long. I will attach the code for converting the mp4 file into an mp3 file below.
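A minimal sketch using moviepy (assuming the moviepy 1.x import path and local file names video.mp4 and audio.mp3):
from moviepy.editor import VideoFileClip  # moviepy 1.x import path
# Extract the audio track from the downloaded video and save it as mp3
clip = VideoFileClip("video.mp4")
clip.audio.write_audiofile("audio.mp3")
clip.close()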
Environment setup
Due to library incompatibilities, I created two environments: one for Whisper, Faster-Whisper, WhisperX, and Distil-Whisper, and another for Whisper-Medusa.
For the former, I used a conda environment with Python 3.10 on Ubuntu 20.04 with CUDA 12.0 and 16 GB of VRAM.
conda create -n audioenv python=3.10 -y
conda activate audioenv
Next, we install the following libraries via conda and pip. After the installation, you need to downgrade numpy to 1.26.3.
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
pip install python-dotenv moviepy openai-whisper accelerate datasets[audio]
pip install numpy==1.26.3
Next, we need to install the whisperX repository. However, whisperX is no longer maintained very actively, so we use a fork called BetterWhisperX.
git clone https://github.com/federicotorrielli/BetterWhisperX.git
cd BetterWhisperX
pip install -e .
That completes the first environment setup.
For the Whisper-Medusa environment, I used a conda environment with Python 3.11, also on Ubuntu 20.04 with CUDA 12.0, but with 24 GB of VRAM.
conda create -n medusa python=3.11 -y
conda activate medusa
You need to install the following libraries via pip.
pip install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 --index-url https://download.pytorch.org/whl/cu118
pip install wandb
git clone https://github.com/aiola-lab/whisper-medusa.git
cd whisper-medusa
pip install -e .
All preparation is done. Now, let's check the Whisper variants' capabilities!
How to implement Whisper variants in Python
1. Whisper turbo
We use the latest version of Whisper, turbo. Thanks to the official repository, we can implement vanilla Whisper with only a few lines of code.
import whisper
model = whisper.load_model("turbo")
result = model.transcribe("audio.mp3")
Whisper itself can only handle audio within 30 seconds, but the transcribe method reads the entire file and processes the audio with a sliding 30-second window, so we don't need to worry about how to feed the data.
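The result is a plain dictionary, so the full text and segment-level timestamps can be read directly; for example:
# Full transcript plus per-segment timestamps
print(result["text"][:200])
for segment in result["segments"][:3]:
    print(f"[{segment['start']:.1f}s -> {segment['end']:.1f}s] {segment['text']}")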
2. Faster-Whisper
We use the Whisper turbo backbone for Faster-Whisper. Faster-Whisper has its own repository, and we can implement it as follows.
from faster_whisper import WhisperModel
model_size = "deepdml/faster-whisper-large-v3-turbo-ct2"
# Run on GPU with FP16
model = WhisperModel(model_size_or_path=model_size, device="cuda", compute_type="float16")
segments, info = model.transcribe('audio.mp3', beam_size=5)
beam_size controls the beam search used during decoding. Since Faster-Whisper has the same capability as vanilla Whisper, it also processes long-form audio with a sliding window.
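Note that segments is a lazy generator: the actual decoding runs while we iterate over it. For example:
print(f"Detected language: {info.language} (probability {info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:.1f}s -> {segment.end:.1f}s] {segment.text}")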
3. WhisperX
We use the Whisper turbo backbone for WhisperX. Since WhisperX utilizes Faster-Whisper as a backbone, some parts of the code are shared.
import whisperx
model_size = "deepdml/faster-whisper-large-v3-turbo-ct2"
# Transcribe with original whisper (batched)
model = whisperx.load_model(model_size, 'cuda', compute_type="float16")
model_a, metadata = whisperx.load_align_model(language_code='en', device='cuda')
# inference
audio = whisperx.load_audio('audio.mp3')
whisper_result = model.transcribe(audio, batch_size=16)
result = whisperx.align(whisper_result["segments"], model_a, metadata, audio, 'cuda', return_char_alignments=False)
WhisperX integrates Faster-Whisper and adds layers that perform VAD and forced alignment. We can also process long-form audio longer than 30 seconds thanks to the cut & merge step.
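After alignment, each segment carries word-level timestamps; for example:
for segment in result["segments"][:2]:
    for word in segment["words"]:
        # Some tokens (e.g., numerals) may not receive a timestamp
        print(word.get("start"), word.get("end"), word["word"])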
4. Distil-Whisper
We will use the distilled version of the large-v3 model because a distilled turbo version has not been released yet. Distil-Whisper is compatible with the Hugging Face Transformers library, so we can easily implement it.
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "distil-whisper/distil-large-v3"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)
pipe = pipeline(
"automatic-speech-recognition",
model=model,
tokenizer=processor.tokenizer,
feature_extractor=processor.feature_extractor,
max_new_tokens=128,
torch_dtype=torch_dtype,
device=device,
return_timestamps=True
)
result = pipe('audio.mp3')
The pipeline class automatically processes long-form audio using a sliding window. Note that this method only outputs relative timestamps.
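With return_timestamps=True, the pipeline output contains both the full text and per-chunk (relative) timestamps; for example:
print(result["text"][:200])
for chunk in result["chunks"][:3]:
    start, end = chunk["timestamp"]
    print(f"[{start} -> {end}] {chunk['text']}")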
5. Whisper-Medusa
We use the large model as the Whisper backbone. Following the official implementation, we can implement it as follows:
import torch
import torchaudio
from whisper_medusa import WhisperMedusaModel
from transformers import WhisperProcessor
SAMPLING_RATE = 16000
language = "en"
regulation_factor = 1.01  # exponential decay length penalty factor
regulation_start = 140    # token index at which the penalty starts
device = 'cuda'
model_name = "aiola/whisper-medusa-linear-libri"
model = WhisperMedusaModel.from_pretrained(model_name)
processor = WhisperProcessor.from_pretrained(model_name)
model = model.to(device)
audio_path = "audio.mp3"  # path to the input audio file
input_speech, sr = torchaudio.load(audio_path)
if input_speech.shape[0] > 1: # If stereo, average the channels
input_speech = input_speech.mean(dim=0, keepdim=True)
if sr != SAMPLING_RATE:
input_speech = torchaudio.transforms.Resample(sr, SAMPLING_RATE)(input_speech)
exponential_decay_length_penalty = (regulation_start, regulation_factor)
input_features = processor(input_speech.squeeze(), return_tensors="pt", sampling_rate=SAMPLING_RATE).input_features
input_features = input_features.to(device)
model_output = model.generate(
input_features,
language=language,
exponential_decay_length_penalty=exponential_decay_length_penalty,
)
predict_ids = model_output[0]
pred = processor.decode(predict_ids, skip_special_tokens=True)
Unfortunately, Whisper-Medusa currently doesn't support long-form audio transcription, so we can only use it for audio up to 30 seconds. When I checked the quality of a 30-second transcription, it was not as good as the other variants, so I exclude it from the comparison among the other Whisper variants.
Performance comparison among Whisper Variants
As I mentioned before, I used an audio file of around 14 minutes as input. The following table compares the results of each model.
In summary,
- Whisper turbo sometimes repeats the same sentences and produces hallucinations.
- Faster-Whisper's transcription is quite good, and its computation speed is the best.
- WhisperX's transcription is the best, and it records very accurate timestamps.
- Distil-Whisper's transcription is quite good. However, it only records relative timestamps.
If you can tolerate subtle mistranscriptions and don't care about timestamps, you should use Faster-Whisper. Meanwhile, if you want accurate timestamps and transcriptions, you should use WhisperX.
WhisperX and Faster-Whisper probably get better results than vanilla Whisper because Faster-Whisper uses beam search for better decoding and WhisperX applies forced alignment; hence, they have a chance to fix mistranscriptions in postprocessing.
In this blog, we have learned about the Whisper variants' architectures and their implementation in Python. Many researchers apply various optimization techniques to reduce inference time for real-world applications. Based on my investigation, Faster-Whisper and WhisperX keep Whisper's capability while succeeding in decreasing the inference time. Here is the code that I used in this experiment.
References
[1] Alec Radford, Jong Wook Kim, et al., Robust Speech Recognition via Large-Scale Weak Supervision, arXiv
[2] Leland Roberts, Understanding the Mel Spectrogram, Analytics Vidhya
[3] Max Bain, Jaesung Huh, Tengda Han, Andrew Zisserman, WhisperX: Time-Accurate Speech Transcription of Long-Form Audio, arXiv
[4] Sanchit Gandhi, Patrick von Platen, Alexander M. Rush, Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling, arXiv
[5] Yael Segal-Feldman, Aviv Shamsian, Aviv Navon, Gill Hetz, Joseph Keshet, Whisper in Medusa's Ear: Multi-head Efficient Decoding for Transformer-based ASR, arXiv
[6] Tianle Cai, Yuhong Li, et al., Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads, arXiv
Published via Towards AI