Automatic Speech Recognition (ASR) Models

WhisperDeployment allows you to transcribe or translate audio with Whisper models. The deployment is based on the faster-whisper library.

Tip

To use the Whisper deployment, install the required libraries with pip install faster-whisper or install Aana with the ASR extra using pip install aana[asr].

WhisperConfig is used to configure the Whisper deployment.

aana.deployments.whisper_deployment.WhisperConfig

Attributes:

  • model_size (WhisperModelSize | str) –

    The Whisper model size. Defaults to WhisperModelSize.TURBO.

  • compute_type (WhisperComputeType) –

    The compute type. Defaults to WhisperComputeType.FLOAT16.

Example Configurations

As an example, let's see how to configure the Whisper deployment for the Whisper Medium model.

Whisper Medium

from aana.deployments.whisper_deployment import WhisperDeployment, WhisperConfig, WhisperModelSize, WhisperComputeType

WhisperDeployment.options(
    num_replicas=1,
    max_ongoing_requests=1000,
    ray_actor_options={"num_gpus": 0.25},
    user_config=WhisperConfig(
        model_size=WhisperModelSize.MEDIUM,
        compute_type=WhisperComputeType.FLOAT16,
    ).model_dump(mode="json"),
)

The model size is one of the Whisper model sizes available in the faster-whisper library or a HuggingFace Hub model in CTranslate2 format. compute_type is the data type used by the model during inference.
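
For example, a CTranslate2-converted Whisper model from the HuggingFace Hub can be passed as a plain string. A minimal sketch (the model ID below is only an illustration; any CTranslate2 Whisper model works):

from aana.deployments.whisper_deployment import WhisperDeployment, WhisperConfig, WhisperComputeType

# Pass a HuggingFace Hub model ID (CTranslate2 format) instead of a WhisperModelSize value.
WhisperDeployment.options(
    num_replicas=1,
    max_ongoing_requests=1000,
    ray_actor_options={"num_gpus": 0.25},
    user_config=WhisperConfig(
        model_size="Systran/faster-whisper-medium",  # illustrative model ID
        compute_type=WhisperComputeType.FLOAT16,
    ).model_dump(mode="json"),
)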

Here are some other possible configurations for the Whisper deployment:

Whisper Turbo on GPU
from aana.deployments.whisper_deployment import WhisperDeployment, WhisperConfig, WhisperModelSize, WhisperComputeType

# for CPU do not specify num_gpus and use FLOAT32 compute type
WhisperDeployment.options(
    num_replicas=1,
    max_ongoing_requests=1000,
    ray_actor_options={"num_gpus": 0.25},
    user_config=WhisperConfig(
        model_size=WhisperModelSize.TURBO,
        compute_type=WhisperComputeType.FLOAT16,
    ).model_dump(mode="json"),
)
Whisper Tiny on CPU
from aana.deployments.whisper_deployment import WhisperDeployment, WhisperConfig, WhisperModelSize, WhisperComputeType

# for CPU do not specify num_gpus and use FLOAT32 compute type
WhisperDeployment.options(
    num_replicas=1,
    user_config=WhisperConfig(
        model_size=WhisperModelSize.TINY,
        compute_type=WhisperComputeType.FLOAT32,
    ).model_dump(mode="json"),
)

Available Transcription Methods in Aana SDK

Below are the different transcription methods available in the Aana SDK (a consolidated sketch showing how to obtain the deployment handle follows the list):

  1. transcribe Method

    • Description: This method is used to get the complete transcription output at once after processing the entire audio.
    • Usage Example:
      transcription = await self.asr_handle.transcribe(audio=audio, params=whisper_params)
      # Further processing...
      
  2. transcribe_stream Method

    • Description: This method streams the transcription segment by segment, yielding segments as they become available.
    • Usage Example:
      stream = handle.transcribe_stream(audio=audio, params=whisper_params)
      async for chunk in stream:
          # Further processing...
      
  3. transcribe_in_chunks Method

    • Description: This method performs batched inference, returning one batch of segments at a time. It is up to 4x faster than sequential methods.
    • Usage Example:
      batched_stream = handle.transcribe_in_chunks(audio=audio, params=batched_whisper_params)
      async for chunk in batched_stream:
          # Further processing...
      

Differences Between WhisperParams and BatchedWhisperParams

Both WhisperParams and BatchedWhisperParams are used to configure the Whisper speech-to-text model, for sequential and batched inference respectively.

  • Common Parameters: Both classes share common attributes such as language, beam_size, best_of, and temperature.

  • Key Differences: WhisperParams includes additional attributes such as word_timestamps and vad_filter, which provide word-level timestamp extraction and voice activity detection filtering.

Refer to the respective class documentation for detailed attributes and usage.
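
As a quick illustration, both parameter classes can be constructed as shown below (assuming they are importable from aana.core.models.whisper; see the class documentation for the authoritative attribute list):

from aana.core.models.whisper import BatchedWhisperParams, WhisperParams

# Sequential inference: word-level timestamps and VAD filtering are available.
whisper_params = WhisperParams(
    language="en",
    beam_size=5,
    word_timestamps=True,
    vad_filter=True,
)

# Batched inference: only the shared decoding options apply.
batched_whisper_params = BatchedWhisperParams(
    language="en",
    beam_size=5,
    temperature=0.0,
)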

Diarized ASR

Diarized transcription can be generated by combining WhisperDeployment and PyannoteSpeakerDiarizationDeployment and merging their timelines in post-processing with PostProcessingForDiarizedAsr.

An example configuration for the PyannoteSpeakerDiarization model is available at the Speaker Diarization model hub.
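
For reference, a diarization deployment definition might look like the sketch below. The module path, config class, and model name here are assumptions that follow the Whisper deployment pattern; consult the Speaker Diarization model hub page for the exact configuration:

from aana.deployments.pyannote_speaker_diarization_deployment import (
    PyannoteSpeakerDiarizationConfig,
    PyannoteSpeakerDiarizationDeployment,
)

# Config class, field names, and model are assumptions; see the model hub page.
diarization_deployment = PyannoteSpeakerDiarizationDeployment.options(
    num_replicas=1,
    ray_actor_options={"num_gpus": 0.05},
    user_config=PyannoteSpeakerDiarizationConfig(
        model_name="pyannote/speaker-diarization-3.1",
    ).model_dump(mode="json"),
)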

You can simply define the model deployments and the endpoint to transcribe a video with diarization. The code snippet below shows how to combine the outputs from the ASR and diarization deployments:

from aana.processors.speaker import PostProcessingForDiarizedAsr
from aana.core.models.base import pydantic_to_dict


# diarized transcript requires word_timestamps from ASR
whisper_params.word_timestamps = True

# asr_handle is an AanaDeploymentHandle for WhisperDeployment
transcription = await self.asr_handle.transcribe(
    audio=audio, params=whisper_params
)

# diar_handle is an AanaDeploymentHandle for PyannoteSpeakerDiarizationDeployment
diarized_output = await self.diar_handle.diarize(
    audio=audio, params=diar_params
)

updated_segments = PostProcessingForDiarizedAsr.process(
    diarized_segments=diarized_output["segments"],
    transcription_segments=transcription["segments"],
)

# updated_segments will have speaker information as well:

# [AsrSegment(text=' Hello. Hello.', 
#            time_interval=TimeInterval(start=6.38, end=7.84), 
#            confidence=0.8329984157521475, 
#            no_speech_confidence=0.012033582665026188, 
#            words=[AsrWord(word=' Hello.', speaker='SPEAKER_01', time_interval=TimeInterval(start=6.38, end=7.0), alignment_confidence=0.6853185296058655), 
#                   AsrWord(word=' Hello.', speaker='SPEAKER_01', time_interval=TimeInterval(start=7.5, end=7.84), alignment_confidence=0.7124693989753723)], 
#           speaker='SPEAKER_01'), 
#
# AsrSegment(text=" Oh, hello. I didn't know you were there.", 
#            time_interval=TimeInterval(start=8.3, end=9.68), 
#            confidence=0.8329984157521475, 
#            no_speech_confidence=0.012033582665026188, 
#            words=[AsrWord(word=' Oh,', speaker='SPEAKER_02', time_interval=TimeInterval(start=8.3, end=8.48), alignment_confidence=0.8500092029571533), 
#                   AsrWord(word=' hello.', speaker='SPEAKER_02', time_interval=TimeInterval(start=8.5, end=8.76), alignment_confidence=0.9408962726593018), ...], 
#            speaker='SPEAKER_02'), 
# ...
# ]

An example notebook on diarized transcription is available at notebooks/diarized_transcription_example.ipynb.