Speaker Diarization
Note: The ModelScope pipeline supports inference and fine-tuning for all the models in the model zoo. Here we take the SOND diarization model, together with the xvector_sv model for speaker embeddings, as an example to demonstrate the usage.
Inference with pipeline
Quick start
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# initialize the diarization pipeline
inference_diar_pipeline = pipeline(
    mode="sond_demo",
    num_workers=0,
    task=Tasks.speaker_diarization,
    diar_model_config="sond.yaml",
    model='damo/speech_diarization_sond-zh-cn-alimeeting-16k-n16k4-pytorch',
    model_revision="v1.0.5",
    sv_model="damo/speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch",
    sv_model_revision="v1.2.2",
)

# input: a list of audio in which the first item is the speech recording to be diarized,
# and the following wav files are used to extract speaker embeddings (speaker profiles).
audio_list = [
    "https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_data/speaker_diarization/record.wav",
    "https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_data/speaker_diarization/spk1.wav",
    "https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_data/speaker_diarization/spk2.wav",
    "https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_data/speaker_diarization/spk3.wav",
    "https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_data/speaker_diarization/spk4.wav",
]

results = inference_diar_pipeline(audio_in=audio_list)
print(results)
API-reference
Define pipeline
task: Tasks.speaker_diarization
model: model name in the model zoo, or a model path on local disk
ngpu: 1 (Default), decoding on GPU. If ngpu=0, decoding on CPU
output_dir: None (Default), the output path of results if set
batch_size: 1 (Default), batch size when decoding
smooth_size: 83 (Default), the window size to perform smoothing
dur_threshold: 10 (Default), segments shorter than 100 ms will be dropped
out_format: vad (Default), the output format, choices ["vad", "rttm"]
vad format: spk1: [1.0, 3.0], [5.0, 8.0]
rttm format: "SPEAKER test1 0 1.00 2.00 <NA> <NA> spk1 <NA>" and "SPEAKER test1 0 5.00 3.00 <NA> <NA> spk1 <NA>"
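As a minimal sketch of how these options can be set (assuming they are simply passed as keyword arguments when the pipeline is defined, with the same model IDs as in the quick start above; the output directory is a placeholder), a CPU decoding run that writes RTTM output might look like:

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# sketch: same models as the quick start, with the decoding options above overridden
inference_diar_pipeline = pipeline(
    mode="sond_demo",
    num_workers=0,
    task=Tasks.speaker_diarization,
    diar_model_config="sond.yaml",
    model='damo/speech_diarization_sond-zh-cn-alimeeting-16k-n16k4-pytorch',
    model_revision="v1.0.5",
    sv_model="damo/speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch",
    sv_model_revision="v1.2.2",
    ngpu=0,                  # decode on CPU
    output_dir="./results",  # write results here ("./results" is a placeholder)
    batch_size=1,
    smooth_size=83,          # smoothing window size
    dur_threshold=10,        # drop segments shorter than 100 ms
    out_format="rttm",       # emit RTTM lines instead of the default vad format
)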
Infer pipeline for speaker diarization
audio_in: the input to process, which could be:
list of url: e.g. waveform files hosted on a website
list of local file path: e.g. path/to/a.wav
("wav.scp,speech,sound", "profile.scp,profile,kaldi_ark"): a script file of waveform files and another script file of speaker profiles (extracted with the model)
wav.scp:
test1 path/to/enroll1.wav
test2 path/to/enroll2.wav
profile.scp:
test1 path/to/profile.ark:11
test2 path/to/profile.ark:234
The profile.ark file contains speaker embeddings in a kaldi-like style. Please refer to README.md for more details.
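A sketch of this mode, assuming the pipeline defined in the quick start and that wav.scp and profile.scp have already been prepared in the current directory (both paths are placeholders):

# sketch: pass the two script files as a tuple, using the format described above
scp_input = ("wav.scp,speech,sound", "profile.scp,profile,kaldi_ark")
results = inference_diar_pipeline(audio_in=scp_input)
print(results)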
Inference with your data
For a single input, we recommend the "list of local file path" mode for inference. For multiple inputs, we recommend the last mode with pre-organized wav.scp and profile.scp files.
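For example, a single-input sketch with local files (the paths are placeholders; the pipeline is the one defined in the quick start):

# sketch: local-file-path mode; the first item is the recording to be diarized,
# the remaining items are enrollment wavs used to build the speaker profiles
audio_list = [
    "path/to/record.wav",
    "path/to/spk1.wav",
    "path/to/spk2.wav",
]
results = inference_diar_pipeline(audio_in=audio_list)
print(results)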