Speaker Recognition

Speaker Diarization (SD) Models

PyannoteSpeakerDiarizationDeployment allows you to diarize audio, that is, to segment it by speaker, with pyannote models. The deployment is based on the pyannote.audio library.

Tip

To use the Pyannote Speaker Diarization deployment, install the required libraries with pip install pyannote-audio, or include the extra dependencies with pip install aana[asr].

PyannoteSpeakerDiarizationConfig is used to configure the Speaker Diarization deployment.

aana.deployments.pyannote_speaker_diarization_deployment.PyannoteSpeakerDiarizationConfig

Attributes:

- model_id (str) – Name of the speaker diarization pipeline.
- sample_rate (int) – The sample rate of the audio in Hz. Defaults to 16000.

Accessing Gated Models

The pyannote speaker diarization models are gated on Hugging Face, so you must be granted access before they can be downloaded. To use these models:

Request Access:
Visit the pyannote Speaker Diarization 3.1 model page and the pyannote Speaker Segmentation 3.0 model page on Hugging Face. Log in, fill out the forms, and request access.
Approval:
- If automatic, access is granted immediately.
- If manual, wait for the model authors to approve your request.
Set Up the SDK:
After approval, add your Hugging Face access token to your .env file by setting the HF_TOKEN variable:
```
HF_TOKEN=your_huggingface_access_token
```
To get your Hugging Face access token, visit the Settings - Tokens page of your Hugging Face account.
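
Before starting a deployment, it can help to confirm that the token is actually visible. The following is a minimal sketch using only the Python standard library; the helper names `read_env_value` and `find_hf_token` are hypothetical and not part of the SDK, and the parser assumes a simple KEY=VALUE file rather than the full .env syntax.

```python
import os
from pathlib import Path
from typing import Optional


def read_env_value(env_path: str, key: str) -> Optional[str]:
    """Return the value of `key` from a simple KEY=VALUE file, or None."""
    path = Path(env_path)
    if not path.is_file():
        return None
    for raw_line in path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        if name.strip() == key:
            # Drop optional surrounding quotes; treat an empty value as missing.
            return value.strip().strip("'\"") or None
    return None


def find_hf_token(env_path: str = ".env") -> Optional[str]:
    """Prefer the process environment, then fall back to the .env file."""
    return os.environ.get("HF_TOKEN") or read_env_value(env_path, "HF_TOKEN")


if __name__ == "__main__":
    if find_hf_token():
        print("HF_TOKEN found")
    else:
        print("HF_TOKEN is missing; gated models cannot be downloaded")
```

This check only reports whether a token is present. It does not verify that the token is valid or that access to the gated models has been approved.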

Example Configurations

As an example, let's see how to configure the Pyannote Speaker Diarization deployment for the Speaker Diarization-3.1 model.

Speaker diarization-3.1

from aana.deployments.pyannote_speaker_diarization_deployment import PyannoteSpeakerDiarizationDeployment, PyannoteSpeakerDiarizationConfig

PyannoteSpeakerDiarizationDeployment.options(
    num_replicas=1,
    max_ongoing_requests=1000,
    ray_actor_options={"num_gpus": 0.05},
    user_config=PyannoteSpeakerDiarizationConfig(
        model_id="pyannote/speaker-diarization-3.1",
        sample_rate=16000,
    ).model_dump(mode="json"),
)

Diarized ASR

Speaker diarization output can be combined with ASR output to generate a transcription with speaker information. Further details and a code snippet are available in the ASR model hub.
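
To illustrate the general idea, here is a minimal sketch that labels each transcript segment with the speaker whose turn overlaps it the most in time. It is not the SDK's post-processing: the `SpeakerTurn` and `TranscriptSegment` types, the `assign_speakers` function, and the sample timings are all hypothetical, and maximum-overlap matching is just one simple strategy.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SpeakerTurn:
    start: float  # seconds
    end: float  # seconds
    speaker: str


@dataclass
class TranscriptSegment:
    start: float  # seconds
    end: float  # seconds
    text: str


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(
    segments: List[TranscriptSegment],
    turns: List[SpeakerTurn],
    unknown: str = "UNKNOWN",
) -> List[Tuple[str, str]]:
    """Pair each segment's text with the speaker whose turn overlaps it most."""
    labeled = []
    for seg in segments:
        best_speaker, best_overlap = unknown, 0.0
        for turn in turns:
            shared = overlap(seg.start, seg.end, turn.start, turn.end)
            if shared > best_overlap:
                best_speaker, best_overlap = turn.speaker, shared
        labeled.append((best_speaker, seg.text))
    return labeled


# Hypothetical diarization and ASR outputs for a short clip.
turns = [
    SpeakerTurn(0.0, 4.0, "SPEAKER_00"),
    SpeakerTurn(4.0, 9.0, "SPEAKER_01"),
]
segments = [
    TranscriptSegment(0.5, 3.5, "Hello there."),
    TranscriptSegment(3.8, 8.0, "Hi, how are you?"),
    TranscriptSegment(10.0, 11.0, "Bye."),
]

for speaker, text in assign_speakers(segments, turns):
    print(f"{speaker}: {text}")
```

Segments that overlap no speaker turn keep the placeholder label, which makes gaps in the diarization output easy to spot.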