Speaker Diarization

Note: The ModelScope pipeline supports inference and fine-tuning for all models in the model zoo. Here we take the SOND diarization model (together with the xvector_sv speaker-verification model) as an example to demonstrate the usage.

Inference with pipeline

Quick start

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# initialize pipeline
inference_diar_pipeline = pipeline(
    mode="sond_demo",
    num_workers=0,
    task=Tasks.speaker_diarization,
    diar_model_config="sond.yaml",
    model='damo/speech_diarization_sond-zh-cn-alimeeting-16k-n16k4-pytorch',
    model_revision="v1.0.5",
    sv_model="damo/speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch",
    sv_model_revision="v1.2.2",
)

# input: a list of audio files in which the first item is the recording to diarize
# (detect speakers in), and the following wav files are used to extract speaker embeddings.
audio_list = [
    "https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_data/speaker_diarization/record.wav",
    "https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_data/speaker_diarization/spk1.wav",
    "https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_data/speaker_diarization/spk2.wav",
    "https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_data/speaker_diarization/spk3.wav",
    "https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_data/speaker_diarization/spk4.wav",
]

results = inference_diar_pipeline(audio_in=audio_list)
print(results)

API-reference

Define pipeline

  • task: Tasks.speaker_diarization

  • model: model name in the model zoo, or a model path on local disk

  • ngpu: 1 (Default), decoding on GPU. If ngpu=0, decoding on CPU

  • output_dir: None (Default), the path where results are written, if set

  • batch_size: 1 (Default), batch size when decoding

  • smooth_size: 83 (Default), the window size used for smoothing the outputs

  • dur_threshold: 10 (Default), segments shorter than this threshold (100 ms with the default value of 10) will be dropped

  • out_format: vad (Default), the output format, choices ["vad", "rttm"].

    • vad format: spk1: [1.0, 3.0], [5.0, 8.0]

    • rttm format: "SPEAKER test1 0 1.00 2.00 spk1" and "SPEAKER test1 0 5.00 3.00 spk1"
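
The following sketch combines the options above into a single pipeline definition. The model names and revisions follow the quick-start example; the remaining values (output directory, thresholds, output format) are illustrative, not required defaults.

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# illustrative configuration; adjust values to your setup
inference_diar_pipeline = pipeline(
    mode="sond_demo",
    task=Tasks.speaker_diarization,
    diar_model_config="sond.yaml",
    model='damo/speech_diarization_sond-zh-cn-alimeeting-16k-n16k4-pytorch',
    model_revision="v1.0.5",
    sv_model="damo/speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch",
    sv_model_revision="v1.2.2",
    ngpu=0,                       # decode on CPU; set ngpu=1 to decode on GPU
    output_dir="./diar_results",  # write results to this directory
    batch_size=1,
    smooth_size=83,
    dur_threshold=10,
    out_format="rttm",            # "vad" (default) or "rttm"
)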

Infer pipeline

  • audio_in: the input to process, which could be:

    • list of URLs, e.g., waveform files hosted on a website

    • list of local file paths, e.g., path/to/a.wav

    • ("wav.scp,speech,sound", "profile.scp,profile,kaldi_ark"): a script file of waveform files and another script file of speaker profiles (extracted with the model)

      wav.scp
      test1 path/to/enroll1.wav
      test2 path/to/enroll2.wav
      
      profile.scp
      test1 path/to/profile.ark:11
      test2 path/to/profile.ark:234
      

      The profile.ark file stores speaker embeddings in a Kaldi-style archive. Please refer to README.md for more details.
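
      As a sketch (assuming wav.scp and profile.scp sit in the working directory and the pipeline from the quick start is reused), the two script files are passed as a tuple:

      # reuse the pipeline built in the quick start; file locations are placeholders
      results = inference_diar_pipeline(
          audio_in=("wav.scp,speech,sound", "profile.scp,profile,kaldi_ark"))
      print(results)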

Inference with your data

For a single input, we recommend the "list of local file paths" mode (see the sketch below). For multiple inputs, we recommend the last mode with pre-organized wav.scp and profile.scp files.
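
For example, a single-recording call with local files might look like the following; the file paths are placeholders and inference_diar_pipeline is the pipeline built in the quick start.

audio_list = [
    "path/to/record.wav",  # the recording to diarize
    "path/to/spk1.wav",    # enrollment audio for speaker 1
    "path/to/spk2.wav",    # enrollment audio for speaker 2
]
results = inference_diar_pipeline(audio_in=audio_list)
print(results)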

Inference with multiple threads on CPU

We recommend the last mode: split wav.scp and profile.scp into several parts, then run inference on each part separately (one process per split, as sketched below). Please refer to README.md for a similar process.
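
A rough sketch of this idea, assuming the split files are named wav.0.scp/profile.0.scp, wav.1.scp/profile.1.scp, and so on; the naming, split count, and output directories are illustrative.

from multiprocessing import Process

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

def run_part(idx):
    # each worker builds its own CPU pipeline and decodes one split
    p = pipeline(
        mode="sond_demo",
        task=Tasks.speaker_diarization,
        model='damo/speech_diarization_sond-zh-cn-alimeeting-16k-n16k4-pytorch',
        sv_model="damo/speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch",
        ngpu=0,                             # CPU decoding
        output_dir=f"./results/part{idx}",  # one output directory per split
    )
    p(audio_in=(f"wav.{idx}.scp,speech,sound",
                f"profile.{idx}.scp,profile,kaldi_ark"))

if __name__ == "__main__":
    procs = [Process(target=run_part, args=(i,)) for i in range(4)]
    for proc in procs:
        proc.start()
    for proc in procs:
        proc.join()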

Inference with multiple GPUs

Similar to the CPU case, split the inputs and set ngpu=1 for inference on GPU. In addition, use CUDA_VISIBLE_DEVICES (e.g., CUDA_VISIBLE_DEVICES=0) to pin each process to a specific GPU device, as sketched below. Please refer to README.md for a similar process.
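
The sketch below applies the same split-and-run pattern on multiple GPUs; each worker pins one device through CUDA_VISIBLE_DEVICES before constructing its pipeline. The split file names, device ids, and output directories are illustrative.

import os
from multiprocessing import Process

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

def run_on_gpu(gpu_id, idx):
    # restrict this worker to a single GPU before the pipeline is created
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    p = pipeline(
        mode="sond_demo",
        task=Tasks.speaker_diarization,
        model='damo/speech_diarization_sond-zh-cn-alimeeting-16k-n16k4-pytorch',
        sv_model="damo/speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch",
        ngpu=1,                             # decode on the single visible GPU
        output_dir=f"./results/gpu{gpu_id}",
    )
    p(audio_in=(f"wav.{idx}.scp,speech,sound",
                f"profile.{idx}.scp,profile,kaldi_ark"))

if __name__ == "__main__":
    # e.g. two splits decoded on GPU 0 and GPU 1
    procs = [Process(target=run_on_gpu, args=(g, g)) for g in (0, 1)]
    for proc in procs:
        proc.start()
    for proc in procs:
        proc.join()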