Speaker Verification

Note: The ModelScope pipeline supports inference and fine-tuning for all the models in the model zoo. Here we take the xvector_sv model as an example to demonstrate the usage.

Inference with pipeline

Quick start

Speaker verification

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

inference_sv_pipline = pipeline(
    task=Tasks.speaker_verification,
    model='damo/speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch'
)

# The same speaker
rec_result = inference_sv_pipline(audio_in=(
    'https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/sv_example_enroll.wav',
    'https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/sv_example_same.wav'))
print("Similarity", rec_result["scores"])

# Different speakers
rec_result = inference_sv_pipline(audio_in=(
    'https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/sv_example_enroll.wav',
    'https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/sv_example_different.wav'))
print("Similarity", rec_result["scores"])

Speaker embedding extraction

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# Define extraction pipeline
inference_sv_pipline = pipeline(
    task=Tasks.speaker_verification,
    model='damo/speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch'
)
# Extract speaker embedding
rec_result = inference_sv_pipline(
    audio_in='https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/sv_example_enroll.wav')
speaker_embedding = rec_result["spk_embedding"]
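
If you extract embeddings for two utterances, you can also score them yourself. Below is a minimal sketch, assuming spk_embedding behaves like a 1-D numeric array (convert a torch tensor with .numpy() first if needed); the helper cosine_score is our illustration, not part of the ModelScope API.

import numpy as np

def cosine_score(a, b):
    # Flatten to 1-D and compute cosine similarity in [-1, 1];
    # higher values mean the utterances are more likely the same speaker.
    a = np.ravel(np.asarray(a))
    b = np.ravel(np.asarray(b))
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

emb_enroll = inference_sv_pipline(
    audio_in='https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/sv_example_enroll.wav')["spk_embedding"]
emb_test = inference_sv_pipline(
    audio_in='https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/sv_example_same.wav')["spk_embedding"]
print("Cosine similarity:", cosine_score(emb_enroll, emb_test))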

For the full demo code, please refer to infer.py.

API reference

Define pipeline

  • task: Tasks.speaker_verification

  • model: model name in the model zoo, or a model path on local disk

  • ngpu: 1 (Default), decoding on GPU. If ngpu=0, decoding on CPU

  • output_dir: None (Default), the directory to save output results, if set

  • batch_size: 1 (Default), batch size when decoding

  • sv_threshold: 0.9465 (Default), the similarity threshold for deciding whether two utterances belong to the same speaker; it should lie in (0, 1) (see the sketch after this list)
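
Putting these options together, here is a minimal sketch; the final decision step is our illustration of how sv_threshold can be applied, under the assumption that the pipeline returns a single similarity score in rec_result["scores"]:

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

inference_sv_pipline = pipeline(
    task=Tasks.speaker_verification,
    model='damo/speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch',
    ngpu=0,                   # decode on CPU
    output_dir='./outputs',   # save embeddings/scores here
    batch_size=1,
    sv_threshold=0.9465,
)

rec_result = inference_sv_pipline(audio_in=('path/to/a.wav', 'path/to/b.wav'))
score = rec_result["scores"]  # assumption: a single similarity score in (0, 1)
print("same speaker" if score >= 0.9465 else "different speakers")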

Infer pipeline for speaker embedding extraction

  • audio_in: the input to process, which could be:

    • url (str): e.g.: https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/sv_example_enroll.wav

    • local_path: e.g.: path/to/a.wav

    • wav.scp: e.g.: path/to/wav1.scp

      wav.scp
      test1 path/to/enroll1.wav
      test2 path/to/enroll2.wav
      
    • bytes: e.g.: raw bytes data from a microphone (see the sketch after this list)

    • fbank1.scp,speech,kaldi_ark: e.g.: 80-dimensional fbank features extracted with the Kaldi toolkit.
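
For the bytes input, here is a minimal sketch that reads a local 16 kHz WAV file as a raw byte buffer, standing in for microphone capture; whether the pipeline expects a full WAV container or raw PCM is an assumption to verify against your model card.

# Read a local 16 kHz wav file as raw bytes (stands in for microphone data).
with open('path/to/a.wav', 'rb') as f:
    audio_bytes = f.read()

rec_result = inference_sv_pipline(audio_in=audio_bytes)
speaker_embedding = rec_result["spk_embedding"]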

Infer pipeline for speaker verification

  • audio_in: the input to process, which could be:

    • Tuple(url1, url2): e.g.: (https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/sv_example_enroll.wav, https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/sv_example_different.wav)

    • Tuple(local_path1, local_path2): e.g.: (path/to/a.wav, path/to/b.wav)

    • Tuple(wav1.scp, wav2.scp): e.g.: (path/to/wav1.scp, path/to/wav2.scp) (see the sketch after this list)

      wav1.scp
      test1 path/to/enroll1.wav
      test2 path/to/enroll2.wav
      
      wav2.scp
      test1 path/to/same1.wav
      test2 path/to/diff2.wav
      
    • Tuple(bytes, bytes): e.g.: raw bytes data from a microphone

    • Tuple("fbank1.scp,speech,kaldi_ark", "fbank2.scp,speech,kaldi_ark"): e.g.: 80-dimensional fbank features extracted with the Kaldi toolkit.
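
For batch verification from two scp files, a minimal sketch follows; we assume entries are paired by key (test1 in wav1.scp against test1 in wav2.scp) and that output_dir was set when defining the pipeline, so the per-pair scores are written there.

# Each scp line is "<key> <path>"; pairs are matched across the two files.
rec_result = inference_sv_pipline(
    audio_in=('path/to/wav1.scp', 'path/to/wav2.scp'))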

Inference with your data

Organize your own data as wav.scp or fbank.scp files to extract speaker embeddings or to perform speaker verification. In this case, output_dir should be set so that all the embeddings or scores are saved. A sketch of preparing such a file is shown below.
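
A minimal sketch that builds a wav.scp from a directory of wav files and extracts embeddings for all of them; the directory layout, key naming, and output location are our assumptions:

import glob
import os

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# Build wav.scp: one "<key> <path>" line per utterance.
with open('wav.scp', 'w') as f:
    for path in sorted(glob.glob('my_data/*.wav')):
        key = os.path.splitext(os.path.basename(path))[0]
        f.write(f'{key} {path}\n')

inference_sv_pipline = pipeline(
    task=Tasks.speaker_verification,
    model='damo/speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch',
    output_dir='./outputs',  # required so the embeddings are saved
)
rec_result = inference_sv_pipline(audio_in='wav.scp')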

Inference with multiple threads on CPU

You can run inference with multiple threads on CPU by following these steps:

  1. Set ngpu=0 while defining the pipeline in infer.py.

  2. Split wav.scp into several files, e.g. 4 splits:

split -l $((`wc -l < wav.scp`/4+1)) --numeric-suffixes wav.scp splits/wav.scp.

  3. Start extracting embeddings; the trailing & runs the splits in parallel:

for wav_scp in `ls splits/wav.scp.*`; do
  python infer.py ${wav_scp} outputs/$(basename ${wav_scp}) &
done
wait

  4. The embeddings will be saved in outputs/*.

Inference with multiple GPUs

This is similar to inference on CPU; the differences are as follows:

Step 1. Set ngpu=1 while defining the pipeline in infer.py.

Step 3. Specify the GPU device with CUDA_VISIBLE_DEVICES:

  for wav_scp in `ls splits/wav.scp.*`; do
    CUDA_VISIBLE_DEVICES=1 python infer.py ${wav_scp} outputs/$(basename ${wav_scp})
  done
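
To spread the splits over several GPUs rather than pinning them all to one device, the loop can also be driven from Python. A minimal sketch, assuming infer.py accepts <wav_scp> <output_dir> arguments as above and that num_gpus devices are visible (both are assumptions about your setup):

import glob
import os
import subprocess

num_gpus = 4  # assumption: 4 visible GPUs
procs = []
for i, wav_scp in enumerate(sorted(glob.glob('splits/wav.scp.*'))):
    # Round-robin the splits over the GPUs, one process per split.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(i % num_gpus))
    out_dir = os.path.join('outputs', os.path.basename(wav_scp))
    procs.append(subprocess.Popen(['python', 'infer.py', wav_scp, out_dir], env=env))
for p in procs:
    p.wait()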