Text-Independent Speaker Verification
This paper proposes a powerful deep speaker recognition network that can be trained end-to-end, combining a ‘thin-ResNet’ trunk architecture with a dictionary-based NetVLAD or GhostVLAD layer that aggregates features across time.
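For intuition, a minimal NetVLAD-style aggregation layer might look as follows in PyTorch; the cluster count, feature dimension, and normalization details are assumptions for this sketch, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NetVLAD(nn.Module):
    """Dictionary-based aggregation of frame-level features over time (sketch)."""
    def __init__(self, num_clusters=8, dim=512):
        super().__init__()
        self.assignment = nn.Linear(dim, num_clusters)   # soft cluster assignment
        self.centroids = nn.Parameter(torch.randn(num_clusters, dim))

    def forward(self, x):                                # x: (batch, time, dim)
        a = F.softmax(self.assignment(x), dim=-1)        # (B, T, K)
        residuals = x.unsqueeze(2) - self.centroids      # (B, T, K, D)
        vlad = (a.unsqueeze(-1) * residuals).sum(dim=1)  # aggregate across time
        vlad = F.normalize(vlad, dim=-1)                 # intra-normalization
        return F.normalize(vlad.flatten(1), dim=-1)      # (B, K*D) utterance embedding

emb = NetVLAD()(torch.randn(4, 100, 512))                # 100 frames -> one embedding
```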
An adaptive feature-learning approach uses 3D CNNs for direct speaker-model creation: in both the development and enrollment phases, an identical number of spoken utterances per speaker is fed to the network to represent each speaker's utterances and build the speaker model.
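A minimal sketch of the 3D-CNN idea, assuming a fixed number of utterances per speaker is stacked into one input volume (utterances x frequency x time); all layer sizes here are illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn

class Speaker3DCNN(nn.Module):
    def __init__(self, num_speakers=100, emb_dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),          # collapse utterance/freq/time axes
        )
        self.embedding = nn.Linear(32, emb_dim)
        self.classifier = nn.Linear(emb_dim, num_speakers)

    def forward(self, x):   # x: (batch, 1, utterances, freq_bins, frames)
        h = self.features(x).flatten(1)
        emb = self.embedding(h)               # speaker model / embedding (enrollment)
        return emb, self.classifier(emb)      # speaker logits (development training)

model = Speaker3DCNN()
emb, logits = model(torch.randn(2, 1, 20, 40, 80))  # e.g., 20 utterances per speaker
```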
This study proposes an end-to-end system comprising two deep neural networks, a front-end that extracts utterance-level speaker embeddings and a back-end classifier, which achieves state-of-the-art performance among systems trained without data augmentation.
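A rough sketch of the two-network structure: a front-end mapping an utterance to a fixed embedding and a back-end scoring a trial pair, trainable end-to-end. Both networks and their dimensions are assumptions for illustration, not the paper's design.

```python
import torch
import torch.nn as nn

frontend = nn.Sequential(                 # utterance-level embedding extractor
    nn.Conv1d(40, 256, kernel_size=5, padding=2), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(),
    nn.Linear(256, 128),
)
backend = nn.Sequential(                  # same/different-speaker classifier
    nn.Linear(2 * 128, 64), nn.ReLU(),
    nn.Linear(64, 1),                     # logit: same speaker?
)

enroll = frontend(torch.randn(1, 40, 200))   # 40 filterbank channels, 200 frames
test = frontend(torch.randn(1, 40, 150))
score = backend(torch.cat([enroll, test], dim=1))  # gradients flow through both nets
```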
Experiments showed that NIFS can significantly improve the performance of Vector Quantization (VQ), Gaussian Mixture Model-Universal Background Model (GMM-UBM), and i-vector-based speaker verification systems over their baselines in unknown noisy environments at different SNRs.
Deep multi-metric learning is applied to text-independent speaker verification, introducing three losses, i.e., triplet loss, n-pair loss, and angular loss, which work cooperatively to train a feature-extraction network equipped with residual connections and squeeze-and-excitation attention.
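A minimal sketch of combining several metric-learning terms on the same embeddings, in the cooperative spirit described above. The loss weights, the use of PyTorch's built-in triplet loss, the simplified one-negative n-pair term, and the angle-based stand-in for the angular loss are all assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def multi_metric_loss(anchor, positive, negative, w=(1.0, 1.0, 1.0)):
    # triplet term: pull anchor-positive together, push anchor-negative apart
    triplet = F.triplet_margin_loss(anchor, positive, negative, margin=0.3)
    # simplified n-pair-style term: softmax over one positive vs. one negative
    logits = torch.stack([(anchor * positive).sum(-1),
                          (anchor * negative).sum(-1)], dim=1)
    n_pair = F.cross_entropy(logits, torch.zeros(len(anchor), dtype=torch.long))
    # simplified angle-based term standing in for the angular loss
    ang = (1 - F.cosine_similarity(anchor, positive)).mean()
    return w[0] * triplet + w[1] * n_pair + w[2] * ang

a, p, n = (torch.randn(16, 128) for _ in range(3))
loss = multi_metric_loss(a, p, n)
```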
A Masked Proxy (MP) loss that directly incorporates both proxy-based and pair-based relationships is proposed to leverage the hardness of speaker pairs, achieving state-of-the-art Equal Error Rate (EER).
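A rough sketch of the general idea of mixing proxy-based and pair-based relationships: embeddings are scored against per-class proxies, and where a real in-batch positive exists, it supplies the positive relationship in place of ("masking") the proxy. The masking rule, temperature, and softmax form here are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def masked_proxy_loss(emb, labels, proxies, tau=0.1):
    emb = F.normalize(emb, dim=-1)                  # (B, D) speaker embeddings
    proxies = F.normalize(proxies, dim=-1)          # (C, D) one proxy per speaker
    sim_proxy = emb @ proxies.T / tau               # proxy-based similarities
    sim_pair = emb @ emb.T / tau                    # pair-based similarities
    sim_pair.fill_diagonal_(float('-inf'))          # ignore self-similarity
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    has_pos = (same & ~torch.eye(len(emb), dtype=torch.bool)).any(1)
    # positive term: an in-batch positive where one exists, else the own-class proxy
    pos_pair = sim_pair.masked_fill(~same, float('-inf')).max(1).values
    pos = torch.where(has_pos, pos_pair, sim_proxy[torch.arange(len(emb)), labels])
    denom = torch.logsumexp(torch.cat([sim_proxy, sim_pair], dim=1), dim=1)
    return (denom - pos).mean()

loss = masked_proxy_loss(torch.randn(8, 128), torch.randint(0, 10, (8,)),
                         torch.randn(10, 128))
```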
A simple contrastive learning approach (SimCLR) is examined alongside a momentum contrastive (MoCo) learning framework, in which the MoCo speaker embedding system maintains a queue holding a large set of negative examples.
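A minimal sketch of the MoCo-style queue of negatives for speaker embeddings; the queue size, temperature, momentum value, and embedding dimension are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn.functional as F

queue = F.normalize(torch.randn(4096, 128), dim=1)   # queue of negative embeddings

def moco_step(q, k, queue, tau=0.07):
    # q: query-encoder embeddings; k: momentum (key) encoder embeddings.
    # The key encoder itself is updated as k_param = 0.999*k_param + 0.001*q_param.
    q, k = F.normalize(q, dim=1), F.normalize(k, dim=1)
    pos = (q * k).sum(1, keepdim=True)               # each query vs. its own key
    neg = q @ queue.T                                # each query vs. all queued negatives
    logits = torch.cat([pos, neg], dim=1) / tau
    labels = torch.zeros(len(q), dtype=torch.long)   # the positive sits at index 0
    loss = F.cross_entropy(logits, labels)
    queue = torch.cat([k.detach(), queue])[: len(queue)]  # enqueue new keys, drop oldest
    return loss, queue

loss, queue = moco_step(torch.randn(8, 128), torch.randn(8, 128), queue)
```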
A novel multi-scale waveform encoder is proposed that uses three convolution branches with different time scales to compute speech features directly from the waveform, outperforming existing raw-waveform-based speaker embeddings on speaker verification by a large margin.
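A minimal sketch of the multi-scale idea: three 1-D convolution branches with different kernel sizes (time scales) applied to the raw waveform and concatenated. The kernel sizes, stride, and channel counts are assumptions for illustration.

```python
import torch
import torch.nn as nn

class MultiScaleEncoder(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        # short, medium, and long receptive fields over the raw waveform
        self.branches = nn.ModuleList([
            nn.Conv1d(1, channels, kernel_size=k, stride=160, padding=k // 2)
            for k in (51, 251, 501)
        ])

    def forward(self, wav):                 # wav: (batch, 1, samples)
        feats = [torch.relu(b(wav)) for b in self.branches]
        t = min(f.shape[-1] for f in feats) # align branch output lengths
        return torch.cat([f[..., :t] for f in feats], dim=1)  # (B, 3*channels, T)

enc = MultiScaleEncoder()
feats = enc(torch.randn(2, 1, 16000))       # 1 s of 16 kHz audio
```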