3260 papers • 126 benchmarks • 313 datasets
Speaker identification is the task of determining which speaker, from a known set of speakers, produced a given speech sample.
These leaderboards are used to track progress in Speaker Identification.
Use these libraries to find Speaker Identification models and implementations.
This paper proposes SincNet, a novel CNN architecture whose first convolutional layer is constrained to learn parametrized sinc functions implementing band-pass filters, encouraging it to discover more meaningful filters.
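As a rough illustration of the idea, here is a minimal PyTorch sketch of sinc-parametrized band-pass filters; the filter count, cutoffs, and kernel size below are illustrative, not the paper's exact configuration:

```python
import torch

def sinc_bandpass_filters(low_hz, band_hz, kernel_size, sample_rate=16000):
    """Build band-pass filters from two learnable cutoffs, as in SincNet.

    low_hz, band_hz: 1-D tensors of per-filter low cutoff and bandwidth (Hz).
    Returns a (num_filters, kernel_size) tensor of time-domain filters.
    """
    high_hz = low_hz + band_hz
    # symmetric time axis in seconds (kernel_size assumed odd)
    n = torch.arange(-(kernel_size // 2), kernel_size // 2 + 1, dtype=torch.float32)
    t = n / sample_rate
    window = torch.hamming_window(kernel_size)

    def lowpass(f):
        # ideal low-pass impulse response with cutoff f:
        # 2f * sinc(2ft), using torch.sinc(x) = sin(pi x) / (pi x)
        return 2 * f.unsqueeze(1) * torch.sinc(2 * f.unsqueeze(1) * t)

    # difference of two low-pass filters = band-pass filter
    return (lowpass(high_hz) - lowpass(low_hz)) * window  # Hamming-windowed

# usage: 80 filters with 251 taps, applied to a raw waveform via conv1d
low = torch.linspace(30.0, 7000.0, 80)
band = torch.full((80,), 100.0)
kernels = sinc_bandpass_filters(low, band, kernel_size=251)
wave = torch.randn(1, 1, 16000)  # one second of dummy audio
out = torch.nn.functional.conv1d(wave, kernels.unsqueeze(1))
```

In training, `low_hz` and `band_hz` would be the layer's only learnable parameters, which is what keeps the first layer interpretable and compact.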
Results suggest that adapting a model trained on Mandarin can improve accuracy for English speaker recognition, and that Deep Speaker outperforms a DNN-based i-vector baseline.
This work addresses the problem of segment-level general audio SSL and proposes a new transformer-based teacher-student SSL model, named ATST, which achieves new state-of-the-art results on almost all of the downstream tasks.
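Teacher-student SSL setups of this kind typically keep the teacher as an exponential moving average of the student; a minimal sketch of that common update follows (the decay value is illustrative, not necessarily ATST's setting):

```python
import torch
import torch.nn as nn

@torch.no_grad()
def ema_update(teacher: nn.Module, student: nn.Module, decay: float = 0.999):
    """Keep teacher weights as an exponential moving average of the student,
    a common pattern in transformer-based teacher-student SSL."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(decay).add_(s_param, alpha=1 - decay)

# usage: the teacher starts as a copy of the student and is never backpropagated
student = nn.Linear(128, 128)
teacher = nn.Linear(128, 128)
teacher.load_state_dict(student.state_dict())
ema_update(teacher, student)
```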
Additive Margin MobileNet1D (AM-MobileNet1D), a portable model for speaker identification on mobile devices, is proposed; it takes only 11.6 megabytes of disk storage against 91.2 for the SincNet and AM-SincNet architectures, making it roughly seven times faster with eight times fewer parameters.
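The "additive margin" part refers to an additive-margin softmax loss on cosine similarities; a minimal PyTorch sketch, with margin and scale values chosen for illustration:

```python
import torch
import torch.nn.functional as F

def am_softmax_loss(embeddings, labels, weight, margin=0.2, scale=30.0):
    """Additive-margin softmax: cosine similarities, with the margin
    subtracted from each sample's true-class logit before cross-entropy."""
    # cosine similarity between L2-normalized embeddings and class weights
    cos = F.normalize(embeddings, dim=1) @ F.normalize(weight, dim=1).t()
    # subtract the margin only on the target class
    one_hot = F.one_hot(labels, num_classes=weight.size(0)).float()
    logits = scale * (cos - margin * one_hot)
    return F.cross_entropy(logits, labels)

# usage: 512-dim embeddings, 1000 speakers (shapes are illustrative)
emb = torch.randn(8, 512)
w = torch.randn(1000, 512)          # one weight row per speaker class
y = torch.randint(0, 1000, (8,))
loss = am_softmax_loss(emb, y, w)
```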
Results demonstrate that the CNN architectures derived by the proposed approach significantly outperform current speaker recognition systems based on VGG-M, ResNet-18, and ResNet-34 backbones, while enjoying lower model complexity.
This work proposes Audio ALBERT, a lite version of a self-supervised speech representation model, and applies the lightweight representation extractor to two downstream tasks, speaker classification and phoneme classification, showing performance comparable with massive pre-trained networks while using 91% fewer parameters.
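ALBERT-style "lite" models save most of their parameters by sharing one transformer layer across depth; a minimal sketch of that idea (dimensions and depth are illustrative, not Audio ALBERT's exact configuration):

```python
import torch
import torch.nn as nn

class SharedLayerEncoder(nn.Module):
    """ALBERT-style parameter sharing: one transformer layer is reused at
    every depth step, so 12 passes cost the parameters of a single layer."""
    def __init__(self, d_model=768, nhead=12, num_passes=12):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.num_passes = num_passes

    def forward(self, x):
        for _ in range(self.num_passes):
            x = self.layer(x)   # same weights applied at every depth
        return x

enc = SharedLayerEncoder()
frames = torch.randn(2, 100, 768)   # (batch, time, feature)
out = enc(frames)
```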
This paper applies multi-task learning to the current SSL framework for speaker representation learning, and proposes an utterance mixing strategy for data augmentation, in which additional overlapped utterances are created in an unsupervised manner and incorporated during training.
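A minimal sketch of one plausible utterance-mixing augmentation on waveform batches; the overlap ratio and gain range are illustrative assumptions, not the paper's exact recipe:

```python
import torch

def mix_utterances(batch, max_ratio=0.5, gain_db_range=(-5.0, 5.0)):
    """Overlap each utterance with a random segment of another utterance
    from the same batch, creating 'overlapped' examples without labels."""
    mixed = batch.clone()
    perm = torch.randperm(batch.size(0))        # pick interfering utterances
    for i, j in enumerate(perm.tolist()):
        if i == j:
            continue
        length = int(batch.size(1) * torch.rand(1).item() * max_ratio)
        if length == 0:
            continue
        start = torch.randint(0, batch.size(1) - length + 1, (1,)).item()
        gain_db = torch.empty(1).uniform_(*gain_db_range).item()
        gain = 10.0 ** (gain_db / 20.0)         # random relative level
        mixed[i, start:start + length] += gain * batch[j, :length]
    return mixed

waves = torch.randn(4, 16000)   # batch of 1-second utterances
augmented = mix_utterances(waves)
```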
Extensive evaluations show the superiority of the proposed SpeechT5 framework on a wide variety of spoken language processing tasks, including automatic speech recognition, speech synthesis, speech translation, voice conversion, speech enhancement, and speaker identification.
The Audio-MAE, a simple extension of image-based Masked Autoencoders to self-supervised representation learning from audio spectrograms, sets new state-of-the-art performance on six audio and speech classification tasks, outperforming other recent models that use external supervised pre-training.
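The core MAE ingredient is masking most spectrogram patches and encoding only the visible ones; a minimal sketch of the masking step (patch size and mask ratio below are illustrative):

```python
import torch

def random_mask_patches(spec, patch=16, mask_ratio=0.8):
    """Split a spectrogram into non-overlapping patches and keep a random
    subset; an MAE-style encoder would see only the visible patches."""
    # spec: (freq, time), both assumed divisible by `patch` for simplicity
    f, t = spec.shape
    patches = spec.reshape(f // patch, patch, t // patch, patch)
    patches = patches.permute(0, 2, 1, 3).reshape(-1, patch * patch)
    n = patches.size(0)
    keep = max(1, int(n * (1 - mask_ratio)))
    idx = torch.randperm(n)[:keep]          # indices of visible patches
    return patches[idx], idx                # feed these to the encoder

spec = torch.randn(128, 1024)   # mel bins x frames
visible, idx = random_mask_patches(spec)
```

A decoder would then reconstruct the masked patches from the encoded visible ones, which is what makes the pre-training self-supervised.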
This work learns representations that capture speaker identities by maximizing the mutual information between the encoded representations of chunks of speech randomly sampled from the same sentence.
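One common way to maximize mutual information between representations of chunks from the same utterance is an InfoNCE-style contrastive bound; the sketch below illustrates that idea and is not necessarily the paper's exact estimator:

```python
import torch
import torch.nn.functional as F

def chunk_contrastive_loss(anchor, positive, temperature=0.1):
    """InfoNCE-style lower bound on mutual information: two chunks cut from
    the same utterance are positives; chunks from other utterances in the
    batch act as negatives."""
    a = F.normalize(anchor, dim=1)
    p = F.normalize(positive, dim=1)
    logits = a @ p.t() / temperature        # (B, B) similarity matrix
    targets = torch.arange(a.size(0))       # diagonal entries are positives
    return F.cross_entropy(logits, targets)

# usage: encode two random chunks of each utterance with the same encoder;
# shapes and dimensions here are illustrative
enc_a = torch.randn(16, 256)   # embeddings of chunk 1 per utterance
enc_b = torch.randn(16, 256)   # embeddings of chunk 2 per utterance
loss = chunk_contrastive_loss(enc_a, enc_b)
```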