Speeding up Whisper
Introduction
Recent years have witnessed remarkable advancements in artificial intelligence, propelling rapid growth in automatic speech recognition (ASR) technologies. OpenAI's Whisper model quickly gained prominence soon after its release due to its open licensing, competitive performance against proprietary models, and strong generalization capabilities. Despite being in the field for over two years, Whisper models continue to be highly relevant and remain the go-to workhorse for many large-scale ASR systems deployed worldwide.
Our main contribution is a batched implementation on top of Faster-Whisper, achieving a 12.5x speed increase over OpenAI's original Whisper and over a 3x speed-up compared to the Faster-Whisper model. We chose Faster-Whisper specifically for its proven ability to maintain transcript quality, and we add further quality improvements that provide better consistency across runs, more reliable language detection, and finer control over multilingual transcription. Additionally, we optimize feature extraction by parallelizing the short-time Fourier transform. These advancements not only accelerate performance but also ensure high-quality output, making our system well suited to large-scale deployments worldwide. Such enhancements lead to better GPU and compute utilization, resulting in substantial cost savings.
Why Whisper remains a good commercial ASR workhorse system
The Whisper ASR model has significantly advanced speech recognition by using a substantial amount of training data and innovative features such as multitask tokenizers and end-to-end punctuated transcription. Despite the emergence of numerous models inspired by Whisper's use of vast training data, such as NVIDIA's open-source Canary ASR (under a CC-BY-NC license) and AssemblyAI's proprietary Universal-1 model, which lead platforms such as the Open ASR Leaderboard, Whisper remains a preferred production system. Its robust performance in real-world scenarios and permissive licensing make it exceptionally reliable.
Essential features of Whisper models include:
- Generalization in Challenging Environments
- The model excels in noisy or music-heavy environments that are often absent from controlled evaluation datasets such as the LibriSpeech corpus, making it suitable for diverse and challenging applications.
- Multitask and Multilingual Capabilities
- As a single-model solution, it manages language identification for 99 languages and offers transcription and translation-to-English, crucial for memory-limited but language-diverse environments.
- Punctuated Output
- It provides punctuated and capitalized text directly, which enhances usability beyond typical ASR systems without needing extra processing.
Prior work on speeding up Whisper
Faster-Whisper was the first project to demonstrate significant speed improvements over OpenAI's implementation without compromising transcript quality. The optimized model is available in the CTranslate2 format for both GPU and CPU usage, supporting various data types. The CTranslate2 model format offers advanced optimizations such as layer fusion, padding removal, batch reordering, in-place operations, and caching mechanisms, resulting in fast inference.
Hugging Face's implementation supports parallel processing of the audio, offering a notable speedup. It uses a uniform chunk length for batching, which may not align with meaningful audio breaks. Hence, Hugging Face's implementation uses the longest common sequence around chunk boundaries as a workaround to limit the drop in transcription quality.
Faster-Whisper initially achieved a 2-4x speed increase over the original OpenAI model without the need for batching, maintaining performance using the same inference script as the original OpenAI version.
Batch Processing for Speed
We increase the speed of the ASR model in the faster-whisper package by using batching based on voice activity detection (VAD) and improving the speed of the feature extraction stage before feeding the input to the model.
Semantic Batching by Voice Activity Detection
We use Voice Activity Detection (VAD) to aggregate segments of less than 30-second durations, inspired by Whisper-X. Unlike the Whisper-X implementation, our approach is designed as a generator, akin to the Faster-Whisper implementation, and supports more user-defined arguments. We also select a broader window around each voiced region (100 ms of silence on either side) to protect the audio chunk from windowing effects. We follow the min-cut and merging operations from Whisper-X, optimizing chunk alignment to adhere as closely as possible to the 30-second boundary (Figure 1) while respecting voicing/phrase boundaries.

This methodology yields up to 64 times real-time speed, and further speed gains can be achieved on larger hardware configurations, depending upon batch size and length of the audio, offering significant advantages for processing very long audio sequences.
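The chunk-merging logic described above can be sketched as follows. This is a simplified illustration rather than the exact Faster-Whisper code: the names speech_segments, max_chunk_s, and pad_s are hypothetical, the min-cut step for splitting voiced regions longer than 30 seconds is omitted, and the VAD (e.g. Silero) is assumed to have already produced voiced regions as (start, end) pairs in seconds.

def merge_vad_segments(speech_segments, max_chunk_s=30.0, pad_s=0.1):
    """Greedily merge VAD-detected voiced regions into chunks of at most
    max_chunk_s seconds, keeping phrase boundaries intact.

    speech_segments: list of (start, end) tuples in seconds, sorted by start.
    Returns a list of (chunk_start, chunk_end) tuples.
    """
    chunks = []
    cur_start, cur_end = None, None
    for start, end in speech_segments:
        # widen the voiced region slightly to avoid windowing artifacts at the edges
        start, end = max(0.0, start - pad_s), end + pad_s
        if cur_start is None:
            cur_start, cur_end = start, end
        elif end - cur_start <= max_chunk_s:
            # the segment still fits into the current chunk: extend it
            cur_end = end
        else:
            # adding this segment would exceed the 30-second budget:
            # close the chunk at a speech boundary and start a new one
            chunks.append((cur_start, cur_end))
            cur_start, cur_end = start, end
    if cur_start is not None:
        chunks.append((cur_start, cur_end))
    return chunks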
Improving Feature Extraction Speed by Parallel STFT
Feature extraction speed is improved using the Kaldi-based mel feature extraction available in the torchaudio library. Unlike the torch Short-Time Fourier Transform (STFT) employed in the Faster-Whisper package, Kaldi-based extraction computes the STFT in parallel. This enhancement is applied to both the batched and sequential versions of our improved implementation, resulting in speeds of up to 104 times real time.
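As a rough illustration of what this looks like, the snippet below computes 80-bin log-mel features with torchaudio's Kaldi-compatible front end. The parameter values are assumptions chosen to approximate Whisper's 25 ms / 10 ms framing; the output is not numerically identical to Whisper's reference mel spectrogram.

import torchaudio
import torchaudio.compliance.kaldi as kaldi

# load a waveform; assuming a mono 16 kHz recording, shape (1, num_samples)
waveform, sample_rate = torchaudio.load("audio.mp3")

# Kaldi-style log-mel filterbank features; frames are processed together
# rather than in a sequential loop
features = kaldi.fbank(
    waveform,
    num_mel_bins=80,            # Whisper uses 80 mel bins
    frame_length=25.0,          # ms, ~400 samples at 16 kHz
    frame_shift=10.0,           # ms, ~160 samples at 16 kHz
    sample_frequency=sample_rate,
    window_type="hanning",
    dither=0.0,
)
# features: (num_frames, 80) tensor of log-mel energies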
Together, these improvements yield a final speed-up of 10 times over OpenAI Whisper, delivered through a generator-based approach that returns transcript chunks as they become available. This significantly expedites subsequent processing of the transcripts, such as feeding them to Large Language Models (LLMs) or other downstream tasks.
Invocation
It is straightforward to use the faster-whisper version with batching:
from faster_whisper import WhisperModel, BatchedInferencePipeline
#load faster-whisper model in the usual way
model = WhisperModel("medium", device="cuda", compute_type="float16")
#apply batched pipeline
batched_model = BatchedInferencePipeline(model=model)
#predict using the batched_model
segments, info = batched_model.transcribe("audio.mp3", batch_size=16)
for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
Quality Improvements
We improve the quality of the baseline Faster-Whisper model outputs by employing the following ideas, each discussed in turn below:
- Consistency across runs:
- Reducing hallucinations:
- Improving language detection:
- Multilingual support (Code-switching):
The ASR model should produce consistent results across different runs. We observed that Faster-Whisper results varied across runs even without sampling (temperature 0). Manually setting the seed for the CTranslate2 model improved the consistency of transcription across different runs.
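A minimal sketch of this fix, assuming the ctranslate2 Python package is installed; the seed value itself is arbitrary:

import ctranslate2

# fix the random seed used by CTranslate2 so that decoding is reproducible across runs
ctranslate2.set_random_seed(42)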
ASR models may occasionally produce erroneous outputs characterized by unstructured or repeated phrases, a phenomenon referred to as hallucination. Whisper occasionally hallucinates when processing noisy audio, which has prompted additional checks in the inference pipeline, such as evaluating compression ratios in conjunction with silence probabilities. We advocate a more stringent approach that uses a high compression ratio alone as the criterion for removing or skipping parts of utterances.
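The idea can be sketched as a simple zlib-based check on the decoded text; the threshold below is an illustrative assumption, not necessarily the exact value used in the package.

import zlib

def compression_ratio(text: str) -> float:
    """Ratio of raw to compressed length; highly repetitive text compresses well."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))

def looks_like_hallucination(text: str, threshold: float = 2.4) -> bool:
    # repeated phrases inflate the compression ratio, so a high ratio flags a suspect segment
    return compression_ratio(text) > threshold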
The Whisper model analyzes the initial 30 seconds of the audio input to determine the language of the whole audio. This is not very reliable, as the initial segment may not represent the predominant language due to music, noise, or the audio itself being multilingual. Incorrect identification of the major language in the audio severely affects speech recognition. To mitigate this issue, we propose an approach that evaluates multiple voiced segments sampled from diverse locations within the audio. When multiple languages are detected with similar frequency, we resolve ties based on the detected probabilities. The existing multi-segment language detection in Faster-Whisper only looks at continuous segments and does not resolve ties.
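A simplified sketch of the voting scheme, assuming per-segment detection results are already available as (language, probability) pairs; the function and variable names are illustrative, not the actual implementation.

from collections import Counter, defaultdict

def vote_language(segment_detections):
    """segment_detections: list of (language, probability) pairs, one per sampled
    voiced segment. Returns the majority language, breaking ties by mean probability."""
    counts = Counter(lang for lang, _ in segment_detections)
    probs = defaultdict(list)
    for lang, prob in segment_detections:
        probs[lang].append(prob)
    best_count = max(counts.values())
    tied = [lang for lang, c in counts.items() if c == best_count]
    # resolve ties between equally frequent languages by their average detection probability
    return max(tied, key=lambda lang: sum(probs[lang]) / len(probs[lang]))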
Whisper shows some promise in handling code-switching audio, where the language changes during the speech. Some attempts have been made in prior work to incorporate the language code as a prompt for code-switching data, showing that fine-tuning with code-switching data improves multilingual accuracy. However, there is still a lack of control over the code-switching behavior.
To address this challenge, we propose a solution that performs language detection for every 30-second audio chunk and dynamically directs the data flow toward transcription or translation tasks as required. This approach enforces strict tokenization, ensuring that characters are emitted solely in the detected language. Note that when a language switch occurs within a 30-second chunk, that chunk is transcribed in the detected language and the switch takes effect from the succeeding segment of the transcription. We offer two output options in the code-switching framework: (a) Hybrid Output, in which the transcription reflects the hybrid nature of the input, encompassing multiple languages if present, and (b) English Output, in which the output is always constrained to English, which is beneficial for applications such as subtitling.
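The per-chunk routing can be sketched as follows; detect_language, transcribe, and translate are hypothetical callables standing in for the actual pipeline internals.

def process_chunks(chunks, detect_language, transcribe, translate, output_mode="hybrid"):
    """Route each ~30-second chunk to transcription or translation based on its
    detected language. output_mode: "hybrid" keeps the original languages;
    "english" always translates to English, e.g. for subtitling."""
    results = []
    for chunk in chunks:
        language = detect_language(chunk)        # per-chunk language identification
        if output_mode == "english" and language != "en":
            text = translate(chunk)              # Whisper's translate-to-English task
        else:
            # strict tokenization: emit text only in the detected language
            text = transcribe(chunk, language=language)
        results.append((language, text))
    return results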
Multi-segment language detection
from faster_whisper import WhisperModel
model = WhisperModel("medium", device="cuda", compute_type="float16")
language_info = model.detect_language_multi_segment("audio.mp3")
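# language_info is expected to contain the detected language and its confidence;
# the exact fields may vary between versions of the package.
print(language_info)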
Benchmarks and Experiments
Speed of Retrieval
We first discuss the experiments related to the speed of retrieval, with a focus on the Whisper medium model as a representative example, although similar observations can be made on other model versions as well.
The Real Time Factor (RTF) metric, commonly used in open-source benchmarks such as the Open ASR Leaderboard evaluation, measures the speed of offline ASR systems by comparing total processing time to audio duration. We use 1/RTF in our benchmarking experiments, as it is more interpretable once the system runs faster than real time, which is a common requirement for offline ASR.
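For example, if a one-hour recording is transcribed in two minutes, RTF = 120/3600 ≈ 0.033, so 1/RTF ≈ 30 and the system runs at roughly 30 times real time.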
The benchmark datasets used in open-source evaluations typically involve short audio recordings, usually under 10 seconds, which do not reflect conditions found in typical production environments with much longer audio data. They fail to accurately capture real-world audio scenarios, which means models excelling in these settings might not perform as well in practical applications. For example, the following table summarizes the average length and standard deviation of the individual samples in four different datasets used in open ASR evaluations.
Dataset | Voxpopuli | TEDLIUM | Earnings22 | AMI | Average |
---|---|---|---|---|---|
Mean duration (in secs) | 10 | 8 | 7 | 2 | 7 |
Std. deviation (in secs) | 8 | 4 | 5 | 3 | 5 |
To better assess the speed of ASR systems under real-world conditions, we've introduced a different open-source dataset that reflects more complex use cases with long-form audio. We've selected the recently released YouTube-Commons dataset, extracting an initial subset of 38 hours.
This subset includes 94 YouTube audio samples with a mean duration of 24 minutes and a notable standard deviation of approximately 25 minutes. Audio durations in this subset range from 2 minutes to 2 hours and 8 minutes, providing a wide variation that closely simulates actual usage scenarios. This variability is crucial for evaluating the effectiveness and scalability of ASR systems in realistic environments.
For our experiments, we measure the inference times of the Whisper medium model running on a GPU, specifically a GeForce RTX 2080 Ti with 12GB of memory, and on a CPU with 16 threads. We standardize batch sizes across HuggingFace Whisper and batched Faster-Whisper to ensure fair comparisons.
System | Speed on GPU (× real time) | Speed on CPU (× real time) |
---|---|---|
OpenAI Whisper | 8.2x | 4.5x |
faster-whisper | 20.1x | 5.6x |
HF Whisper(batched) | 59.3x | 8.4x |
Batched Faster-Whisper | 104x | 14.6x |
The batched version of Whisper is remarkably efficient, averaging 12.5x faster than OpenAI's implementation. Notably, for longer files, such as those nearing 3 hours in duration, the speed enhancement reaches up to 380x real time. Additionally, this high performance does not require specialized hardware such as flash attention-2 or speculative decoding, broadening its applicability.
Transcription Quality
Word Error Rate (WER) is the go-to metric for evaluating the quality of ASR systems, capturing the various types of errors that can occur in an ASR output. The Open ASR Eval suite, consisting of 9 test datasets, has recently been used for comparing different ASR models. These datasets are better than testing ASR systems on a single dataset such as LibriSpeech. However, they still do not reflect real-world scenarios and do not represent long-form audio.
Since transcriptions were available for the YouTube-Commons subset, we compared WER on this new subset. We used the EnglishTextNormalizer from Open ASR Eval to normalize the transcripts before the WER comparison.
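For reference, a minimal version of this evaluation can be sketched with jiwer and the equivalent EnglishTextNormalizer shipped with the openai-whisper package (we used the Open ASR Eval normalizer); the transcript lists below are placeholders.

import jiwer
from whisper.normalizers import EnglishTextNormalizer

normalizer = EnglishTextNormalizer()

# placeholder transcripts; in practice these come from the dataset and the ASR output
reference_texts = ["Hello, world! This is a test."]
predicted_texts = ["hello world this is a test"]

refs = [normalizer(t) for t in reference_texts]
hyps = [normalizer(t) for t in predicted_texts]

print("WER: %.1f%%" % (100 * jiwer.wer(refs, hyps)))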
System | WER (%) |
---|---|
OpenAI Whisper | 15.1 |
faster-whisper | 14.6 |
HF Whisper(batched) | 16.8 |
Batched Faster-Whisper | 13.1 |
Note that the transcriptions in this dataset are either human-annotated or automatically produced and can contain various types of errors. Since these results alone are not sufficient to judge the quality of the implementation, we also use an internal dataset with a smaller test set (84 minutes) and verified ground truth. This test set contains 9 audio files of various types, ranging from 3 to 13 minutes in duration. The table below summarizes the average speed and WER of the models on the internal test set:
System | WER (%) | Speed (× real time) |
---|---|---|
OpenAI Whisper | 6.8 | 9.1 |
faster-whisper | 6.1 | 17.4 |
HF Whisper(batched) | 8.2 | 42.8 |
Batched Faster-Whisper | 6.5 | 86.6 |
We see that faster-whisper in general has a better WER than the HF implementation. Note that the WER of the batched version is slightly worse than that of the sequential Faster-Whisper on the internal test set. This is expected, as there is no context passing between the 30-second segments in the batched version. If the audio is very noisy, context passing can also have an adverse effect, causing hallucination loops. For the YouTube-Commons subset, batching does indeed seem to help reduce the WER.
The evaluation dataset is publicly available in the Hugging Face dataset repo mobiuslabsgmbh/youtube-commons-asr-eval.
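For instance, the subset can be pulled with the datasets library; the split name used below is an assumption and may differ from the repo's actual configuration.

from datasets import load_dataset

# split name is an assumption; inspect the repo for the exact configuration
ds = load_dataset("mobiuslabsgmbh/youtube-commons-asr-eval", split="test")
print(ds)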
Conclusion
In this blog post, we briefly discussed the benefits of using Whisper models and showed speed improvements to Faster-Whisper based on batching and faster feature extraction. We also provided some insights into the quality improvements achieved for Faster-Whisper. Since open-source datasets fail to represent the duration and complexity of real-world data, we used a subset of the YouTube-Commons dataset for the evaluations, in addition to an internal test set. Batching via VAD and faster feature extraction together improve the speed by 12.5x on average compared to the OpenAI implementation. We provide a Google Colab notebook to replicate these results and have added the evaluation dataset to the mobiuslabsgmbh/youtube-commons-asr-eval repo on Hugging Face.
In the next blog post, we will focus on speed improvements for PyTorch-based implementations of Whisper models via torch.compile and propose an HQQ+-based approach to make them even faster. Stay tuned…
Citation
@misc{sebastian2024whisper1,
title = {Speeding up Whisper},
url = {https://mobiusml.github.io/batched_whisper_blog/},
author = {Jilt Sebastian and Appu Shaji},
month = {May},
year = {2024}
}
Please feel free to contact us.