Automatic Speech Recognition (ASR) Models
WhisperDeployment allows you to transcribe or translate audio with Whisper models. The deployment is based on the faster-whisper library.
Tip
To use the Whisper deployment, install the required libraries with pip install faster-whisper or include the extra dependencies with pip install aana[asr].
WhisperConfig is used to configure the Whisper deployment.
aana.deployments.whisper_deployment.WhisperConfig
Attributes:
- model_size (WhisperModelSize | str) – The Whisper model size. Defaults to WhisperModelSize.TURBO.
- compute_type (WhisperComputeType) – The compute type. Defaults to WhisperComputeType.FLOAT16.
Example Configurations
As an example, let's see how to configure the Whisper deployment for the Whisper Medium model.
Whisper Medium
from aana.deployments.whisper_deployment import WhisperDeployment, WhisperConfig, WhisperModelSize, WhisperComputeType
WhisperDeployment.options(
    num_replicas=1,
    max_ongoing_requests=1000,
    ray_actor_options={"num_gpus": 0.25},
    user_config=WhisperConfig(
        model_size=WhisperModelSize.MEDIUM,
        compute_type=WhisperComputeType.FLOAT16,
    ).model_dump(mode="json"),
)
model_size is one of the Whisper model sizes available in the faster-whisper library, or a model from the HuggingFace Hub in CTranslate2 format. compute_type is the data type used for the model.
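Since model_size also accepts a plain string, you can point the deployment at a CTranslate2 Whisper model hosted on the HuggingFace Hub. A minimal sketch, using the Systran/faster-distil-whisper-large-v3 repository as an example model id (any CTranslate2-format Whisper repository should work the same way):
from aana.deployments.whisper_deployment import WhisperComputeType, WhisperConfig, WhisperDeployment

# Hypothetical example: load a CTranslate2 Whisper model from the HuggingFace Hub
# by passing its repository id as a string instead of a WhisperModelSize value.
WhisperDeployment.options(
    num_replicas=1,
    max_ongoing_requests=1000,
    ray_actor_options={"num_gpus": 0.25},
    user_config=WhisperConfig(
        model_size="Systran/faster-distil-whisper-large-v3",  # example model id
        compute_type=WhisperComputeType.FLOAT16,
    ).model_dump(mode="json"),
)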
Here are some other possible configurations for the Whisper deployment:
Whisper Turbo on GPU
from aana.deployments.whisper_deployment import WhisperDeployment, WhisperConfig, WhisperModelSize, WhisperComputeType
# for CPU do not specify num_gpus and use FLOAT32 compute type
WhisperDeployment.options(
    num_replicas=1,
    max_ongoing_requests=1000,
    ray_actor_options={"num_gpus": 0.25},
    user_config=WhisperConfig(
        model_size=WhisperModelSize.TURBO,
        compute_type=WhisperComputeType.FLOAT16,
    ).model_dump(mode="json"),
)
Whisper Tiny on CPU
from aana.deployments.whisper_deployment import WhisperDeployment, WhisperConfig, WhisperModelSize, WhisperComputeType
# for CPU do not specify num_gpus and use FLOAT32 compute type
WhisperDeployment.options(
    num_replicas=1,
    user_config=WhisperConfig(
        model_size=WhisperModelSize.TINY,
        compute_type=WhisperComputeType.FLOAT32,
    ).model_dump(mode="json"),
)
Available Transcription Methods in Aana SDK
Below are the different transcription methods available in the Aana SDK:
- transcribe method: Returns the complete transcription output at once after processing the entire audio (usage sketch below).
- transcribe_stream method: Yields transcription segments one by one as they become available (usage sketch below).
- transcribe_in_chunks method: Performs batched inference, returning one batch of segments at a time; it is up to 4x faster than the sequential methods (usage sketch below).
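The following sketch shows how the three methods might be called. The transcribe call matches the diarized ASR example later on this page; the transcribe_stream and transcribe_in_chunks calls are assumed to be async generators, and the aana.core.models.whisper import path is also an assumption, so check the WhisperDeployment reference for the exact signatures.
# Minimal sketch of the three transcription methods. `asr_handle` is an
# AanaDeploymentHandle for WhisperDeployment and `audio` is the audio object
# to transcribe; both are assumed to exist in the surrounding endpoint code.
from aana.core.models.whisper import BatchedWhisperParams, WhisperParams  # assumed import path


async def transcribe_examples(asr_handle, audio):
    # transcribe: a single awaited call that returns the full transcription
    # once the entire audio has been processed.
    transcription = await asr_handle.transcribe(
        audio=audio, params=WhisperParams()
    )

    # transcribe_stream: segments are yielded one by one as they become
    # available (assumed to be an async generator).
    async for item in asr_handle.transcribe_stream(
        audio=audio, params=WhisperParams()
    ):
        print(item["segments"])

    # transcribe_in_chunks: batched inference that yields one batch of
    # segments per iteration (assumed to take BatchedWhisperParams).
    async for batch in asr_handle.transcribe_in_chunks(
        audio=audio, params=BatchedWhisperParams()
    ):
        print(batch["segments"])

    return transcription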
Differences Between WhisperParams and BatchedWhisperParams
Both WhisperParams and BatchedWhisperParams are used to configure the Whisper speech-to-text model, for sequential and batched inference respectively.
- Common parameters: Both classes share attributes such as language, beam_size, best_of, and temperature.
- Key differences: WhisperParams includes additional attributes such as word_timestamps and vad_filter, which enable word-level timestamp extraction and voice activity detection (VAD) filtering.
Refer to the respective class documentation for detailed attributes and usage.
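As a concrete, hedged illustration, the two parameter classes might be instantiated as follows. The aana.core.models.whisper import path is an assumption; the fields shown are the ones listed above.
from aana.core.models.whisper import BatchedWhisperParams, WhisperParams  # assumed import path

# Sequential inference: supports the extra word_timestamps / vad_filter options.
whisper_params = WhisperParams(
    language="en",
    temperature=0.0,
    word_timestamps=True,  # required later for diarized ASR post-processing
    vad_filter=True,
)

# Batched inference: shares the common decoding options (language, beam_size,
# best_of, temperature) but not word_timestamps or vad_filter.
batched_params = BatchedWhisperParams(
    language="en",
    temperature=0.0,
)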
Diarized ASR
Diarized transcription can be generated by combining WhisperDeployment and PyannoteSpeakerDiarizationDeployment and merging their timelines in a post-processing step with PostProcessingForDiarizedAsr.
An example configuration for the PyannoteSpeakerDiarization model is available in the Speaker Diarization model hub.
You can simply define the model deployments and the endpoint to transcribe the video with diarization. The code snippet below shows how to combine the outputs from the ASR and diarization deployments; a sketch of the endpoint wiring around this snippet follows after it:
from aana.processors.speaker import PostProcessingForDiarizedAsr
from aana.core.models.base import pydantic_to_dict
# diarized transcript requires word_timestamps from ASR
whisper_params.word_timestamps = True
# asr_handle is an AanaDeploymentHandle for WhisperDeployment
transcription = await self.asr_handle.transcribe(
    audio=audio, params=whisper_params
)
# diar_handle is an AanaDeploymentHandle for PyannoteSpeakerDiarizationDeployment
diarized_output = await self.diar_handle.diarize(
    audio=audio, params=diar_params
)
updated_segments = PostProcessingForDiarizedAsr.process(
    diarized_segments=diarized_output["segments"],
    transcription_segments=transcription["segments"],
)
# updated_segments will have speaker information as well:
# [AsrSegment(text=' Hello. Hello.',
# time_interval=TimeInterval(start=6.38, end=7.84),
# confidence=0.8329984157521475,
# no_speech_confidence=0.012033582665026188,
# words=[AsrWord(word=' Hello.', speaker='SPEAKER_01', time_interval=TimeInterval(start=6.38, end=7.0), alignment_confidence=0.6853185296058655),
# AsrWord(word=' Hello.', speaker='SPEAKER_01', time_interval=TimeInterval(start=7.5, end=7.84), alignment_confidence=0.7124693989753723)],
# speaker='SPEAKER_01'),
#
# AsrSegment(text=" Oh, hello. I didn't know you were there.",
# time_interval=TimeInterval(start=8.3, end=9.68),
# confidence=0.8329984157521475,
# no_speech_confidence=0.012033582665026188,
# words=[AsrWord(word=' Oh,', speaker='SPEAKER_02', time_interval=TimeInterval(start=8.3, end=8.48), alignment_confidence=0.8500092029571533),
# AsrWord(word=' hello.', speaker='SPEAKER_02', time_interval=TimeInterval(start=8.5, end=8.76), alignment_confidence=0.9408962726593018), ...],
# speaker='SPEAKER_02'),
# ...
# ]
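For context, here is a hedged sketch of an endpoint that could wrap the snippet above, showing where self.asr_handle and self.diar_handle come from. The endpoint class name, the deployment names passed to AanaDeploymentHandle.create, and the simplified run signature are illustrative assumptions; consult the Aana SDK documentation on endpoints and deployment handles for the exact API.
from aana.api.api_generation import Endpoint
from aana.core.models.whisper import WhisperParams  # assumed import path
from aana.deployments.aana_deployment_handle import AanaDeploymentHandle
from aana.processors.speaker import PostProcessingForDiarizedAsr


class DiarizedTranscriptionEndpoint(Endpoint):
    """Return speaker-attributed transcription segments (illustrative sketch)."""

    async def initialize(self):
        # The names must match the names the deployments were registered under
        # in the AanaSDK app; "asr_deployment" and "diarization_deployment" are
        # placeholders here.
        self.asr_handle = await AanaDeploymentHandle.create("asr_deployment")
        self.diar_handle = await AanaDeploymentHandle.create("diarization_deployment")
        await super().initialize()

    async def run(self, audio, diar_params) -> dict:
        # Same combination logic as the snippet above: transcribe with word
        # timestamps, diarize, then merge the two timelines.
        whisper_params = WhisperParams(word_timestamps=True)
        transcription = await self.asr_handle.transcribe(
            audio=audio, params=whisper_params
        )
        diarized_output = await self.diar_handle.diarize(
            audio=audio, params=diar_params
        )
        updated_segments = PostProcessingForDiarizedAsr.process(
            diarized_segments=diarized_output["segments"],
            transcription_segments=transcription["segments"],
        )
        return {"segments": updated_segments}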