[1] WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing
[2] Text-Free Prosody-Aware Generative Spoken Language Modeling
[3] Injecting Text in Self-Supervised Speech Pretraining
[4] w2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training
[5] Improving Speech Translation by Understanding and Learning from the Auxiliary Text Translation Task
[6] BEiT: BERT Pre-Training of Image Transformers
[7] HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units
[8] SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding
[9] Lightweight Adapter Tuning for Multilingual Speech Translation
[10] Unsupervised Speech Recognition
[11] SpeechNet: A Universal Modularized Model for Speech Processing Tasks
[12] Learning Shared Semantic Space for Speech-to-Text Translation
[13] SUPERB: Speech processing Universal PERformance Benchmark
[14] End-to-end Speech Translation via Cross-modal Progressive Training
[15] Speech-Language Pre-Training for End-to-End Spoken Language Understanding
[16] Fused Acoustic and Text Encoding for Multimodal Bilingual Pretraining and Speech Translation
[17] On Generative Spoken Language Modeling from Raw Audio
[18] UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data
[19] Deep Learning Based Assessment of Synthetic Speech Naturalness
[20] ST-BERT: Cross-Modal Language Model Pre-Training for End-to-End Spoken Language Understanding
[21] A General Multi-Task Learning Framework to Leverage Text Data for Speech to Text Tasks
[22] HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis
[23] Fairseq S2T: Fast Speech-to-Text Modeling with Fairseq
[24] ICASSP 2021 Deep Noise Suppression Challenge
[25] Data Augmentation and Loss Normalization for Deep Noise Suppression
[26] Pretraining Techniques for Sequence-to-Sequence Voice Conversion
[27] wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations
[28] A Further Study of Unsupervised Pretraining for Transformer Based Speech Recognition
[29] Many-to-Many Voice Transformer Network
[30] ESPnet-ST: All-in-One Speech Translation Toolkit
[32] End-to-end ASR: from Supervised to Semi-Supervised Learning with Modern Architectures
[33] Effectiveness of self-supervised pre-training for speech recognition
[34] BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
[35] SpeechBERT: An Audio-and-Text Jointly Learned Language Model for End-to-End Spoken Question Answering
[36] Parallel WaveGAN: A Fast Waveform Generation Model Based on Generative Adversarial Networks with Multi-Resolution Spectrogram
[37] Delving into VoxCeleb: Environment Invariant Speaker Recognition
[38] Speech-XLNet: Unsupervised Acoustic Model Pretraining for Self-Attention Networks
[39] Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
[40] Generative Pre-Training for Speech with Autoregressive Predictive Coding
[41] Pre-Trained Text Embeddings for Enhanced Text-to-Speech Synthesis
[42] RoBERTa: A Robustly Optimized BERT Pretraining Approach
[43] WHAM!: Extending Speech Separation to Noisy Environments
[44] XLNet: Generalized Autoregressive Pretraining for Language Understanding
[45] MuST-C: a Multilingual Speech Translation Corpus
[46] Unified Language Model Pre-training for Natural Language Understanding and Generation
[47] LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech
[48] fairseq: A Fast, Extensible Toolkit for Sequence Modeling
[49] Cross-lingual Language Model Pretraining
[50] Neural Speech Synthesis with Transformer Network
[51] Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates
[52] X-Vectors: Robust DNN Embeddings for Speaker Recognition
[53] ESPnet: End-to-End Speech Processing Toolkit
[54] Speech2Vec: A Sequence-to-Sequence Framework for Learning Word Embeddings from Speech
[55] Self-Attention with Relative Position Representations
[56] Deep Contextualized Word Representations
[57] Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
[58] Neural Discrete Representation Learning
[59] Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention
[60] Hybrid CTC/Attention Architecture for End-to-End Speech Recognition
[61] VoxCeleb: A Large-Scale Speaker Identification Dataset
[62] Attention Is All You Need
[63] Language Modeling with Gated Convolutional Networks
[64] An Algorithm for Predicting the Intelligibility of Speech Masked by Modulated Noise Maskers
[66] Deep Residual Learning for Image Recognition
[67] Librispeech: An ASR Corpus Based on Public Domain Audio Books
[68] On Using Monolingual Corpora in Neural Machine Translation
[69] Adam: A Method for Stochastic Optimization
[70] Bleu: a Method for Automatic Evaluation of Machine Translation
[71] Perceptual Evaluation of Speech Quality (PESQ): A New Method for Speech Quality Assessment of Telephone Networks and Codecs
[72] Multilingual Speech Translation from Efficient Finetuning of Pretrained Models
[73] SemFace: Pre-training Encoder and Decoder with a Semantic Interface for Neural Machine Translation
[75] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
[76] Language Models are Unsupervised Multitask Learners
[77] Joint CTC/Attention Decoding for End-to-End Speech Recognition
[79] The CMU Arctic Speech Databases