Speeding up Whisper
Introduction
Recent years have witnessed remarkable advancements in artificial intelligence, propelling rapid growth in automatic speech recognition (ASR) technologies. OpenAI's Whisper model quickly gained prominence soon after its release due to its open licensing, competitive performance against proprietary models, and strong generalization capabilities. Despite being in the field for over two years, Whisper models continue to be highly relevant and remain the go-to workhorse for many large-scale ASR systems deployed worldwide.
Our main contribution is a batched implementation on top of Faster-Whisper, achieving a 12.5x speed increase over OpenAI's original Whisper and over a 3x speed-up compared to the Faster-Whisper model. We chose Faster-Whisper specifically for its proven ability to maintain transcript quality, and we add further quality improvements that provide better consistency across runs, more reliable language detection, and finer control over multilingual transcription. Additionally, we optimize feature extraction by parallelizing the short-time Fourier transform. These advancements not only accelerate performance but also ensure high-quality output, making our system well suited to large-scale deployments worldwide. Such enhancements lead to better GPU and compute utilization, resulting in substantial cost savings.
Why Whisper remains a good commercial ASR workhorse system
The Whisper ASR model has significantly advanced speech recognition by using a substantial amount of training data and innovative features such as multitask tokenizers and end-to-end punctuated transcription. Despite the emergence of numerous models inspired by Whisper's use of vast training data, such as NVIDIA's open-source Canary ASR (under a CC-BY-NC license) and AssemblyAI's proprietary Universal-1 model, which lead platforms such as the Open ASR Leaderboard, Whisper remains a preferred production system. Its robust performance in real-world scenarios and permissive licensing make it exceptionally reliable.
Essential features of Whisper models include:
- Generalization in Challenging Environments
- The model excels in noisy or music-heavy environments that are often absent from controlled evaluation datasets such as the LibriSpeech corpus, making it suitable for diverse and challenging applications.
- Multitask and Multilingual Capabilities
- As a single-model solution, it manages language identification for 99 languages and offers transcription and translation-to-English, crucial for memory-limited but language-diverse environments.
- Punctuated Output
- It provides punctuated and capitalized text directly, which enhances usability beyond typical ASR systems without needing extra processing.
Prior work on speeding up Whisper
Faster-Whisper was the first project to demonstrate significant speed improvements over OpenAI's implementation without compromising transcript quality. The optimized model is available in the CTranslate2 format for both GPU and CPU usage, supporting various data types. The CTranslate2 model format offers advanced optimizations such as layer fusion, padding removal, batch reordering, in-place operations, and caching mechanisms, resulting in fast inference.
Hugging Face's implementation supports parallel processing of the audio, offering a notable speedup. It uses a uniform chunk length for batching, which may not align with meaningful audio breaks. Hence, Hugging Face's implementation uses the longest common sequence around chunk boundaries as a workaround to limit the drop in transcription quality.
Faster-Whisper initially achieved a 2-4x speed increase over the original OpenAI model without the need for batching, maintaining performance using the same inference script as the original OpenAI version.
Batch Processing for Speed
We increase the speed of the ASR model in the faster-whisper package by using batching based on voice activity detection (VAD) and improving the speed of the feature extraction stage before feeding the input to the model.
Semantic Batching by Voice Activity Detection
We use Voice Activity Detection (VAD) to aggregate segments of less than 30-second durations, inspired by Whisper-X. Unlike the Whisper-X implementation, our approach is designed as a generator, akin to the Faster-Whisper implementation, and supports more user-defined arguments. We also select a broader window around each voiced region (100 ms of silence on either side) to protect the audio chunk from windowing effects. We follow the min-cut and merging operations from Whisper-X, optimizing chunk alignment to adhere as closely as possible to the 30-second boundary (Figure 1) while respecting voicing/phrase boundaries.

This methodology yields up to 64 times real-time speed, and further speed gains can be achieved on larger hardware configurations, depending upon batch size and length of the audio, offering significant advantages for processing very long audio sequences.
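The chunk-merging logic described above can be sketched as follows. This is a simplified illustration rather than the exact Faster-Whisper code: the names speech_segments, max_chunk_s, and pad_s are hypothetical, the min-cut step for splitting voiced regions longer than 30 seconds is omitted, and the VAD (e.g. Silero) is assumed to have already produced voiced regions as (start, end) pairs in seconds.

def merge_vad_segments(speech_segments, max_chunk_s=30.0, pad_s=0.1):
    """Greedily merge VAD-detected voiced regions into chunks of at most
    max_chunk_s seconds, keeping phrase boundaries intact.

    speech_segments: list of (start, end) tuples in seconds, sorted by start.
    Returns a list of (chunk_start, chunk_end) tuples.
    """
    chunks = []
    cur_start, cur_end = None, None
    for start, end in speech_segments:
        # widen the voiced region slightly to avoid windowing artifacts at the edges
        start, end = max(0.0, start - pad_s), end + pad_s
        if cur_start is None:
            cur_start, cur_end = start, end
        elif end - cur_start <= max_chunk_s:
            # the segment still fits into the current chunk: extend it
            cur_end = end
        else:
            # adding this segment would exceed the 30-second budget:
            # close the chunk at a speech boundary and start a new one
            chunks.append((cur_start, cur_end))
            cur_start, cur_end = start, end
    if cur_start is not None:
        chunks.append((cur_start, cur_end))
    return chunks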
Improving Feature Extraction Speed by Parallel STFT
Feature extraction speed is improved using the Kaldi-based mel feature extraction available in the torchaudio library. Unlike the torch Short-Time Fourier Transform (STFT) employed in the Faster-Whisper package, Kaldi-based extraction computes the STFT in parallel. This enhancement is applied to both the batched and sequential versions of our improved implementation, resulting in speeds of up to 104 times real time.
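As a rough illustration of what this looks like, the snippet below computes 80-bin log-mel features with torchaudio's Kaldi-compatible front end. The parameter values are assumptions chosen to approximate Whisper's 25 ms / 10 ms framing; the output is not numerically identical to Whisper's reference mel spectrogram.

import torchaudio
import torchaudio.compliance.kaldi as kaldi

# load a waveform; assuming a mono 16 kHz recording, shape (1, num_samples)
waveform, sample_rate = torchaudio.load("audio.mp3")

# Kaldi-style log-mel filterbank features; frames are processed together
# rather than in a sequential loop
features = kaldi.fbank(
    waveform,
    num_mel_bins=80,            # Whisper uses 80 mel bins
    frame_length=25.0,          # ms, ~400 samples at 16 kHz
    frame_shift=10.0,           # ms, ~160 samples at 16 kHz
    sample_frequency=sample_rate,
    window_type="hanning",
    dither=0.0,
)
# features: (num_frames, 80) tensor of log-mel energies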
Together, these improvements yield a final speed-up of 10 times over OpenAI Whisper, delivered through a generator-based approach that returns transcript chunks as they become available. This significantly expedites subsequent processing of the transcripts, such as feeding them to Large Language Models (LLMs) or other downstream tasks.
Invocation
It is straightforward to use the faster-whisper version with batching:
from faster_whisper import WhisperModel, BatchedInferencePipeline
#load faster-whisper model in the usual way
model = WhisperModel("medium", device="cuda", compute_type="float16")
#apply batched pipeline
batched_model = BatchedInferencePipeline(model=model)
#predict using the batched_model
segments, info = batched_model.transcribe("audio.mp3", batch_size=16)
for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
Quality Improvements
We improve the quality of the baseline Faster-Whisper model outputs by employing the following ideas, each discussed in turn below:
- Consistency across runs:
- Reducing hallucinations:
- Improving language detection:
- Multilingual support (Code-switching):
The ASR model should produce consistent results across different runs. We observed that Faster-Whisper results varied across runs even without sampling (temperature 0). Manually setting the seed for the CTranslate2 model improved the consistency of transcription across different runs.
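A minimal sketch of this fix, assuming the ctranslate2 Python package is installed; the seed value itself is arbitrary:

import ctranslate2

# fix the random seed used by CTranslate2 so that decoding is reproducible across runs
ctranslate2.set_random_seed(42)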
ASR models may occasionally produce erroneous outputs characterized by unstructured or repeated phrases, a phenomenon referred to as hallucination. Whisper occasionally hallucinates when processing noisy audio, which has prompted additional checks in the inference pipeline, such as evaluating compression ratios in conjunction with silence probabilities. We advocate a more stringent approach that uses a high compression ratio alone as the criterion for removing or skipping parts of utterances.
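The idea can be sketched as a simple zlib-based check on the decoded text; the threshold below is an illustrative assumption, not necessarily the exact value used in the package.

import zlib

def compression_ratio(text: str) -> float:
    """Ratio of raw to compressed length; highly repetitive text compresses well."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))

def looks_like_hallucination(text: str, threshold: float = 2.4) -> bool:
    # repeated phrases inflate the compression ratio, so a high ratio flags a suspect segment
    return compression_ratio(text) > threshold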
The Whisper model analyzes the initial 30 seconds of the audio input to determine the language of the whole audio. This is not very reliable, as the initial segment may not represent the predominant language due to music, noise, or the audio itself being multilingual. Incorrect identification of the major language in the audio severely affects speech recognition. To mitigate this issue, we propose an approach that evaluates multiple voiced segments sampled from diverse locations within the audio. When multiple languages are detected with similar frequency, we resolve ties based on the detected probabilities. The existing multi-segment language detection in Faster-Whisper only looks at continuous segments and does not resolve ties.
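A simplified sketch of the voting scheme, assuming per-segment detection results are already available as (language, probability) pairs; the function and variable names are illustrative, not the actual implementation.

from collections import Counter, defaultdict

def vote_language(segment_detections):
    """segment_detections: list of (language, probability) pairs, one per sampled
    voiced segment. Returns the majority language, breaking ties by mean probability."""
    counts = Counter(lang for lang, _ in segment_detections)
    probs = defaultdict(list)
    for lang, prob in segment_detections:
        probs[lang].append(prob)
    best_count = max(counts.values())
    tied = [lang for lang, c in counts.items() if c == best_count]
    # resolve ties between equally frequent languages by their average detection probability
    return max(tied, key=lambda lang: sum(probs[lang]) / len(probs[lang]))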
Whisper shows some promise in handling code-switching audio, where the language changes during the speech. Some attempts have been made in prior work to incorporate the language code as a prompt for code-switching data, showing that fine-tuning with code-switching data improves multilingual accuracy. However, there is still a lack of control over the code-switching behavior.
To address this challenge, we propose a solution that performs language detection for every 30-second audio chunk and dynamically directs the data flow toward transcription or translation tasks as required. This approach enforces strict tokenization, ensuring that characters are emitted solely in the detected language. Note that when a language switch occurs within a 30-second chunk, that chunk is transcribed in the detected language and the switch takes effect from the succeeding segment of the transcription. We offer two output options in the code-switching framework: (a) Hybrid Output, in which the transcription reflects the hybrid nature of the input, encompassing multiple languages if present, and (b) English Output, in which the output is always constrained to English, which is beneficial for applications such as subtitling.
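The per-chunk routing can be sketched as follows; detect_language, transcribe, and translate are hypothetical callables standing in for the actual pipeline internals.

def process_chunks(chunks, detect_language, transcribe, translate, output_mode="hybrid"):
    """Route each ~30-second chunk to transcription or translation based on its
    detected language. output_mode: "hybrid" keeps the original languages;
    "english" always translates to English, e.g. for subtitling."""
    results = []
    for chunk in chunks:
        language = detect_language(chunk)        # per-chunk language identification
        if output_mode == "english" and language != "en":
            text = translate(chunk)              # Whisper's translate-to-English task
        else:
            # strict tokenization: emit text only in the detected language
            text = transcribe(chunk, language=language)
        results.append((language, text))
    return results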
Multi-segment language detection
from faster_whisper import WhisperModel
model = WhisperModel("medium", device="cuda", compute_type="float16")
language_info = model.detect_language_multi_segment("audio.mp3")
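# language_info is expected to contain the detected language and its confidence;
# the exact fields may vary between versions of the package.
print(language_info)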
Benchmarks and Experiments
Speed of Retrieval
We first discuss the experiments related to the speed of retrieval, with a focus on the Whisper medium model as a representative example, although similar observations can be made on other model versions as well.
The Real Time Factor (RTF) metric, commonly used in open-source benchmarks such as the Open ASR Leaderboard evaluation, measures the speed of offline ASR systems by comparing total processing time to audio duration. We use 1/RTF in our benchmarking experiments, as it is more interpretable once the system runs faster than real time, which is a common requirement for offline ASR.
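For example, if a one-hour recording is transcribed in two minutes, RTF = 120/3600 ≈ 0.033, so 1/RTF ≈ 30 and the system runs at roughly 30 times real time.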
The benchmark datasets used in open-source evaluations typically involve short audio recordings, usually under 10 seconds, which do not reflect conditions found in typical production environments with much longer audio data. They fail to accurately capture real-world audio scenarios, which means models excelling in these settings might not perform as well in practical applications. For example, the following table summarizes the average length and standard deviation of the individual samples in four different datasets used in open ASR evaluations.
Dataset | Voxpopuli | TEDLIUM | Earnings22 | AMI | Average |
---|---|---|---|---|---|
Mean duration (in secs) | 10 | 8 | 7 | 2 | 7 |
Std. deviation (in secs) | 8 | 4 | 5 | 3 | 5 |
To better assess the speed of ASR systems under real-world conditions, we've introduced a different open-source dataset that reflects more complex use cases with long-form audio. We've selected the recently released YouTube-Commons dataset, extracting an initial subset of 38 hours.
This subset includes 94 YouTube audio samples with a mean duration of 24 minutes and a notable standard deviation of approximately 25 minutes. Audio durations in this subset range from 2 minutes to 2 hours and 8 minutes, providing a wide variation that closely simulates actual usage scenarios. This variability is crucial for evaluating the effectiveness and scalability of ASR systems in realistic environments.
For our experiments, we measure the inference times of the Whisper medium model running on a GPU, specifically a GeForce RTX 2080 Ti with 12GB of memory, and on a CPU with 16 threads. We standardize batch sizes across HuggingFace Whisper and batched Faster-Whisper to ensure fair comparisons.
System | Speed on GPU (× real time) | Speed on CPU (× real time) |
---|---|---|
OpenAI Whisper | 8.2x | 4.5x |
faster-whisper | 20.1x | 5.6x |
HF Whisper(batched) | 59.3x | 8.4x |
Batched Faster-Whisper | 104x | 14.6x |
The batched version of Whisper is remarkably efficient, averaging 12.5x faster than OpenAI's implementation. Notably, for longer files, such as those nearing 3 hours in duration, the speed enhancement reaches up to 380x real time. Additionally, this high performance does not require specialized hardware such as flash attention-2 or speculative decoding, broadening its applicability.
Transcription Quality
Word Error Rate (WER) is the go-to metric for evaluating the quality of ASR systems, capturing the various types of errors that can occur in an ASR output. The Open ASR Eval suite, consisting of 9 test datasets, has recently been used for comparing different ASR models. These datasets are better than testing ASR systems on a single dataset such as LibriSpeech. However, they still do not reflect real-world scenarios and do not represent long-form audio.
Since transcriptions were available for the YouTube-Commons subset, we compared WER on this new subset. We used the EnglishTextNormalizer from Open ASR Eval to normalize the transcripts before the WER comparison.
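For reference, a minimal version of this evaluation can be sketched with jiwer and the equivalent EnglishTextNormalizer shipped with the openai-whisper package (we used the Open ASR Eval normalizer); the transcript lists below are placeholders.

import jiwer
from whisper.normalizers import EnglishTextNormalizer

normalizer = EnglishTextNormalizer()

# placeholder transcripts; in practice these come from the dataset and the ASR output
reference_texts = ["Hello, world! This is a test."]
predicted_texts = ["hello world this is a test"]

refs = [normalizer(t) for t in reference_texts]
hyps = [normalizer(t) for t in predicted_texts]

print("WER: %.1f%%" % (100 * jiwer.wer(refs, hyps)))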
System | WER (%) |
---|---|
OpenAI Whisper | 15.1 |
faster-whisper | 14.6 |
HF Whisper(batched) | 16.8 |
Batched Faster-Whisper | 13.1 |
Note that the transcriptions in this dataset are either human-annotated or automatically produced and can contain various types of errors. Since these results alone are not sufficient to judge the quality of the implementation, we also use an internal dataset with a smaller test set (84 minutes) and verified ground truth. This test set contains 9 audio files of various types, ranging from 3 to 13 minutes in duration. The table below summarizes the average speed and WER of the models on the internal test set:
System | WER (%) | Speed (× real time) |
---|---|---|
OpenAI Whisper | 6.8 | 9.1 |
faster-whisper | 6.1 | 17.4 |
HF Whisper(batched) | 8.2 | 42.8 |
Batched Faster-Whisper | 6.5 | 86.6 |
We see that faster-whisper in general has a better WER than the HF implementation. Note that the WER of the batched version is slightly worse than that of the sequential Faster-Whisper on the internal test set. This is expected, as there is no context passing between the 30-second segments in the batched version. If the audio is very noisy, context passing can also have an adverse effect, causing hallucination loops. For the YouTube-Commons subset, batching does indeed seem to help reduce the WER.
The evaluation dataset is publicly available in the Hugging Face dataset repo mobiuslabsgmbh/youtube-commons-asr-eval.
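For instance, the subset can be pulled with the datasets library; the split name used below is an assumption and may differ from the repo's actual configuration.

from datasets import load_dataset

# split name is an assumption; inspect the repo for the exact configuration
ds = load_dataset("mobiuslabsgmbh/youtube-commons-asr-eval", split="test")
print(ds)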
Conclusion
In this blog post, we briefly discussed the benefits of using Whisper models and showed speed improvements to Faster-Whisper based on batching and faster feature extraction. We also provided some insights into the quality improvements achieved for Faster-Whisper. Since open-source datasets fail to represent the duration and complexity of real-world data, we used a subset of the YouTube-Commons dataset for the evaluations, in addition to an internal test set. Batching via VAD and faster feature extraction together improve the speed by 12.5x on average compared to the OpenAI implementation. We provide a Google Colab notebook to replicate these results and have added the evaluation dataset to the mobiuslabsgmbh/youtube-commons-asr-eval repo on Hugging Face.
In the next blog post, we will focus on speed improvements for PyTorch-based implementations of Whisper models via torch.compile and propose an HQQ+-based approach to make them even faster. Stay tuned…
Citation
@misc{sebastian2024whisper1,
title = {Speeding up Whisper},
url = {https://mobiusml.github.io/batched_whisper_blog/},
author = {Jilt Sebastian and Appu Shaji},
month = {May},
year = {2024}
}
Please feel free to contact us.