Speaker Verification

Note: The ModelScope pipeline supports inference and fine-tuning for all the models in the model zoo. Here we take the xvector_sv model as an example to demonstrate the usage.

Inference with pipeline

Quick start

Speaker verification

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

inference_sv_pipline = pipeline(
    task=Tasks.speaker_verification,
    model='damo/speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch'
)

# The same speaker
rec_result = inference_sv_pipline(audio_in=(
    'https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/sv_example_enroll.wav',
    'https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/sv_example_same.wav'))
print("Similarity", rec_result["scores"])

# Different speakers
rec_result = inference_sv_pipline(audio_in=(
    'https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/sv_example_enroll.wav',
    'https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/sv_example_different.wav'))
print("Similarity", rec_result["scores"])

Speaker embedding extraction

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# Define extraction pipeline
inference_sv_pipline = pipeline(
    task=Tasks.speaker_verification,
    model='damo/speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch'
)
# Extract speaker embedding
rec_result = inference_sv_pipline(
    audio_in='https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/sv_example_enroll.wav')
speaker_embedding = rec_result["spk_embedding"]
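
If you extract embeddings for two utterances, you can also score them yourself. Below is a minimal sketch, assuming spk_embedding behaves like a 1-D numeric array (convert a torch tensor with .numpy() first if needed); the helper cosine_score is our illustration, not part of the ModelScope API.

import numpy as np

def cosine_score(a, b):
    # Flatten to 1-D and compute cosine similarity in [-1, 1];
    # higher values mean the utterances are more likely the same speaker.
    a = np.ravel(np.asarray(a))
    b = np.ravel(np.asarray(b))
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

emb_enroll = inference_sv_pipline(
    audio_in='https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/sv_example_enroll.wav')["spk_embedding"]
emb_test = inference_sv_pipline(
    audio_in='https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/sv_example_same.wav')["spk_embedding"]
print("Cosine similarity:", cosine_score(emb_enroll, emb_test))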

For the full demo code, please refer to infer.py.

API reference

Define pipeline

  • task: Tasks.speaker_verification

  • model: model name in the model zoo, or a model path on local disk

  • ngpu: 1 (Default), decoding on GPU. If ngpu=0, decoding on CPU

  • output_dir: None (Default), the directory to save output results, if set

  • batch_size: 1 (Default), batch size when decoding

  • sv_threshold: 0.9465 (Default), the similarity threshold for deciding whether two utterances belong to the same speaker; it should lie in (0, 1) (see the sketch after this list)
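
Putting these options together, here is a minimal sketch; the final decision step is our illustration of how sv_threshold can be applied, under the assumption that the pipeline returns a single similarity score in rec_result["scores"]:

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

inference_sv_pipline = pipeline(
    task=Tasks.speaker_verification,
    model='damo/speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch',
    ngpu=0,                   # decode on CPU
    output_dir='./outputs',   # save embeddings/scores here
    batch_size=1,
    sv_threshold=0.9465,
)

rec_result = inference_sv_pipline(audio_in=('path/to/a.wav', 'path/to/b.wav'))
score = rec_result["scores"]  # assumption: a single similarity score in (0, 1)
print("same speaker" if score >= 0.9465 else "different speakers")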

Infer pipeline for speaker embedding extraction

  • audio_in: the input to process, which could be:

    • url (str): e.g.: https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/sv_example_enroll.wav

    • local_path: e.g.: path/to/a.wav

    • wav.scp: e.g.: path/to/wav1.scp

      wav.scp
      test1 path/to/enroll1.wav
      test2 path/to/enroll2.wav
      
    • bytes: e.g.: raw bytes data from a microphone (see the sketch after this list)

    • fbank1.scp,speech,kaldi_ark: e.g.: 80-dimensional fbank features extracted with the Kaldi toolkit.
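
For the bytes input, here is a minimal sketch that reads a local 16 kHz WAV file as a raw byte buffer, standing in for microphone capture; whether the pipeline expects a full WAV container or raw PCM is an assumption to verify against your model card.

# Read a local 16 kHz wav file as raw bytes (stands in for microphone data).
with open('path/to/a.wav', 'rb') as f:
    audio_bytes = f.read()

rec_result = inference_sv_pipline(audio_in=audio_bytes)
speaker_embedding = rec_result["spk_embedding"]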

Infer pipeline for speaker verification

  • audio_in: the input to process, which could be:

    • Tuple(url1, url2): e.g.: (https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/sv_example_enroll.wav, https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/sv_example_different.wav)

    • Tuple(local_path1, local_path2): e.g.: (path/to/a.wav, path/to/b.wav)

    • Tuple(wav1.scp, wav2.scp): e.g.: (path/to/wav1.scp, path/to/wav2.scp) (see the sketch after this list)

      wav1.scp
      test1 path/to/enroll1.wav
      test2 path/to/enroll2.wav
      
      wav2.scp
      test1 path/to/same1.wav
      test2 path/to/diff2.wav
      
    • Tuple(bytes, bytes): e.g.: raw bytes data from a microphone

    • Tuple("fbank1.scp,speech,kaldi_ark", "fbank2.scp,speech,kaldi_ark"): e.g.: 80-dimensional fbank features extracted with the Kaldi toolkit.
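
For batch verification from two scp files, a minimal sketch follows; we assume entries are paired by key (test1 in wav1.scp against test1 in wav2.scp) and that output_dir was set when defining the pipeline, so the per-pair scores are written there.

# Each scp line is "<key> <path>"; pairs are matched across the two files.
rec_result = inference_sv_pipline(
    audio_in=('path/to/wav1.scp', 'path/to/wav2.scp'))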

Inference with your data

Organize your own data as wav.scp or fbank.scp files to extract speaker embeddings or to perform speaker verification. In this case, output_dir should be set so that all the embeddings or scores are saved. A sketch of preparing such a file is shown below.
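
A minimal sketch that builds a wav.scp from a directory of wav files and extracts embeddings for all of them; the directory layout, key naming, and output location are our assumptions:

import glob
import os

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# Build wav.scp: one "<key> <path>" line per utterance.
with open('wav.scp', 'w') as f:
    for path in sorted(glob.glob('my_data/*.wav')):
        key = os.path.splitext(os.path.basename(path))[0]
        f.write(f'{key} {path}\n')

inference_sv_pipline = pipeline(
    task=Tasks.speaker_verification,
    model='damo/speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch',
    output_dir='./outputs',  # required so the embeddings are saved
)
rec_result = inference_sv_pipline(audio_in='wav.scp')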

Inference with multiple threads on CPU

You can run inference with multiple threads on CPU by following these steps:

  1. Set ngpu=0 while defining the pipeline in infer.py.

  2. Split wav.scp into several files, e.g. 4 splits:

split -l $((`wc -l < wav.scp`/4+1)) --numeric-suffixes wav.scp splits/wav.scp.

  3. Start extracting embeddings; the trailing & runs the splits in parallel:

for wav_scp in `ls splits/wav.scp.*`; do
  python infer.py ${wav_scp} outputs/$(basename ${wav_scp}) &
done
wait

  4. The embeddings will be saved in outputs/*.

Inference with multiple GPUs

This is similar to inference on CPU; the differences are as follows:

Step 1. Set ngpu=1 while defining the pipeline in infer.py.

Step 3. Specify the GPU device with CUDA_VISIBLE_DEVICES:

  for wav_scp in `ls splits/wav.scp.*`; do
    CUDA_VISIBLE_DEVICES=1 python infer.py ${wav_scp} outputs/$(basename ${wav_scp})
  done
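
To spread the splits over several GPUs rather than pinning them all to one device, the loop can also be driven from Python. A minimal sketch, assuming infer.py accepts <wav_scp> <output_dir> arguments as above and that num_gpus devices are visible (both are assumptions about your setup):

import glob
import os
import subprocess

num_gpus = 4  # assumption: 4 visible GPUs
procs = []
for i, wav_scp in enumerate(sorted(glob.glob('splits/wav.scp.*'))):
    # Round-robin the splits over the GPUs, one process per split.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(i % num_gpus))
    out_dir = os.path.join('outputs', os.path.basename(wav_scp))
    procs.append(subprocess.Popen(['python', 'infer.py', wav_scp, out_dir], env=env))
for p in procs:
    p.wait()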