Speaker Diarization

Note: The ModelScope pipeline supports inference and fine-tuning for all models in the model zoo. Here we take the SOND diarization model (together with the xvector_sv speaker-verification model) as an example to demonstrate the usage.

Inference with pipeline

Quick start

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# initialize pipeline
inference_diar_pipeline = pipeline(
    mode="sond_demo",
    num_workers=0,
    task=Tasks.speaker_diarization,
    diar_model_config="sond.yaml",
    model='damo/speech_diarization_sond-zh-cn-alimeeting-16k-n16k4-pytorch',
    model_revision="v1.0.5",
    sv_model="damo/speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch",
    sv_model_revision="v1.2.2",
)

# input: a list of audio files in which the first item is the recording to diarize
# (detect speakers in), and the following wav files are used to extract speaker embeddings.
audio_list = [
    "https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_data/speaker_diarization/record.wav",
    "https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_data/speaker_diarization/spk1.wav",
    "https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_data/speaker_diarization/spk2.wav",
    "https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_data/speaker_diarization/spk3.wav",
    "https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_data/speaker_diarization/spk4.wav",
]

results = inference_diar_pipeline(audio_in=audio_list)
print(results)

API-reference

Define pipeline

  • task: Tasks.speaker_diarization

  • model: model name in the model zoo, or a model path on local disk

  • ngpu: 1 (Default), decoding on GPU. If ngpu=0, decoding on CPU

  • output_dir: None (Default), the path where results are written, if set

  • batch_size: 1 (Default), batch size when decoding

  • smooth_size: 83 (Default), the window size used for smoothing the outputs

  • dur_threshold: 10 (Default), segments shorter than this threshold (100 ms with the default value of 10) will be dropped

  • out_format: vad (Default), the output format, choices ["vad", "rttm"].

    • vad format: spk1: [1.0, 3.0], [5.0, 8.0]

    • rttm format: "SPEAKER test1 0 1.00 2.00 spk1" and "SPEAKER test1 0 5.00 3.00 spk1"
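
The following sketch combines the options above into a single pipeline definition. The model names and revisions follow the quick-start example; the remaining values (output directory, thresholds, output format) are illustrative, not required defaults.

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# illustrative configuration; adjust values to your setup
inference_diar_pipeline = pipeline(
    mode="sond_demo",
    task=Tasks.speaker_diarization,
    diar_model_config="sond.yaml",
    model='damo/speech_diarization_sond-zh-cn-alimeeting-16k-n16k4-pytorch',
    model_revision="v1.0.5",
    sv_model="damo/speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch",
    sv_model_revision="v1.2.2",
    ngpu=0,                       # decode on CPU; set ngpu=1 to decode on GPU
    output_dir="./diar_results",  # write results to this directory
    batch_size=1,
    smooth_size=83,
    dur_threshold=10,
    out_format="rttm",            # "vad" (default) or "rttm"
)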

Infer pipeline

  • audio_in: the input to process, which could be:

    • list of URLs, e.g., waveform files hosted on a website

    • list of local file paths, e.g., path/to/a.wav

    • ("wav.scp,speech,sound", "profile.scp,profile,kaldi_ark"): a script file of waveform files and another script file of speaker profiles (extracted with the model)

      wav.scp
      test1 path/to/enroll1.wav
      test2 path/to/enroll2.wav
      
      profile.scp
      test1 path/to/profile.ark:11
      test2 path/to/profile.ark:234
      

      The profile.ark file stores speaker embeddings in a Kaldi-style archive. Please refer to README.md for more details.
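
      As a sketch (assuming wav.scp and profile.scp sit in the working directory and the pipeline from the quick start is reused), the two script files are passed as a tuple:

      # reuse the pipeline built in the quick start; file locations are placeholders
      results = inference_diar_pipeline(
          audio_in=("wav.scp,speech,sound", "profile.scp,profile,kaldi_ark"))
      print(results)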

Inference with your data

For a single input, we recommend the "list of local file paths" mode (see the sketch below). For multiple inputs, we recommend the last mode with pre-organized wav.scp and profile.scp files.
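
For example, a single-recording call with local files might look like the following; the file paths are placeholders and inference_diar_pipeline is the pipeline built in the quick start.

audio_list = [
    "path/to/record.wav",  # the recording to diarize
    "path/to/spk1.wav",    # enrollment audio for speaker 1
    "path/to/spk2.wav",    # enrollment audio for speaker 2
]
results = inference_diar_pipeline(audio_in=audio_list)
print(results)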

Inference with multiple threads on CPU

We recommend the last mode: split wav.scp and profile.scp into several parts, then run inference on each part separately (one process per split, as sketched below). Please refer to README.md for a similar process.
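
A rough sketch of this idea, assuming the split files are named wav.0.scp/profile.0.scp, wav.1.scp/profile.1.scp, and so on; the naming, split count, and output directories are illustrative.

from multiprocessing import Process

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

def run_part(idx):
    # each worker builds its own CPU pipeline and decodes one split
    p = pipeline(
        mode="sond_demo",
        task=Tasks.speaker_diarization,
        model='damo/speech_diarization_sond-zh-cn-alimeeting-16k-n16k4-pytorch',
        sv_model="damo/speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch",
        ngpu=0,                             # CPU decoding
        output_dir=f"./results/part{idx}",  # one output directory per split
    )
    p(audio_in=(f"wav.{idx}.scp,speech,sound",
                f"profile.{idx}.scp,profile,kaldi_ark"))

if __name__ == "__main__":
    procs = [Process(target=run_part, args=(i,)) for i in range(4)]
    for proc in procs:
        proc.start()
    for proc in procs:
        proc.join()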

Inference with multiple GPUs

Similar to the CPU case, split the inputs and set ngpu=1 for inference on GPU. In addition, use CUDA_VISIBLE_DEVICES (e.g., CUDA_VISIBLE_DEVICES=0) to pin each process to a specific GPU device, as sketched below. Please refer to README.md for a similar process.
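
The sketch below applies the same split-and-run pattern on multiple GPUs; each worker pins one device through CUDA_VISIBLE_DEVICES before constructing its pipeline. The split file names, device ids, and output directories are illustrative.

import os
from multiprocessing import Process

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

def run_on_gpu(gpu_id, idx):
    # restrict this worker to a single GPU before the pipeline is created
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    p = pipeline(
        mode="sond_demo",
        task=Tasks.speaker_diarization,
        model='damo/speech_diarization_sond-zh-cn-alimeeting-16k-n16k4-pytorch',
        sv_model="damo/speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch",
        ngpu=1,                             # decode on the single visible GPU
        output_dir=f"./results/gpu{gpu_id}",
    )
    p(audio_in=(f"wav.{idx}.scp,speech,sound",
                f"profile.{idx}.scp,profile,kaldi_ark"))

if __name__ == "__main__":
    # e.g. two splits decoded on GPU 0 and GPU 1
    procs = [Process(target=run_on_gpu, args=(g, g)) for g in (0, 1)]
    for proc in procs:
        proc.start()
    for proc in procs:
        proc.join()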