[1] WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing
[2] Text-Free Prosody-Aware Generative Spoken Language Modeling
[3] Injecting Text in Self-Supervised Speech Pretraining
[4] w2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training
[5] Improving Speech Translation by Understanding and Learning from the Auxiliary Text Translation Task
[6] BEiT: BERT Pre-Training of Image Transformers
[7] HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units
[8] SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding
[9] Lightweight Adapter Tuning for Multilingual Speech Translation
[10] Unsupervised Speech Recognition
[11] SpeechNet: A Universal Modularized Model for Speech Processing Tasks
[12] Learning Shared Semantic Space for Speech-to-Text Translation
[13] SUPERB: Speech processing Universal PERformance Benchmark
[14] End-to-end Speech Translation via Cross-modal Progressive Training
[15] Speech-Language Pre-Training for End-to-End Spoken Language Understanding
[16] Fused Acoustic and Text Encoding for Multimodal Bilingual Pretraining and Speech Translation
[17] On Generative Spoken Language Modeling from Raw Audio
[18] UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data
[19] Deep Learning Based Assessment of Synthetic Speech Naturalness
[20] ST-BERT: Cross-Modal Language Model Pre-Training for End-to-End Spoken Language Understanding
[21] A General Multi-Task Learning Framework to Leverage Text Data for Speech to Text Tasks
[22] HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis
[23] Fairseq S2T: Fast Speech-to-Text Modeling with Fairseq
[24] ICASSP 2021 Deep Noise Suppression Challenge
[25] Data Augmentation and Loss Normalization for Deep Noise Suppression
[26] Pretraining Techniques for Sequence-to-Sequence Voice Conversion
[27] wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations
[28] A Further Study of Unsupervised Pretraining for Transformer Based Speech Recognition
[29] Many-to-Many Voice Transformer Network
[30] ESPnet-ST: All-in-One Speech Translation Toolkit
[32] End-to-end ASR: from Supervised to Semi-Supervised Learning with Modern Architectures
[33] Effectiveness of self-supervised pre-training for speech recognition
[34] BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
[35] SpeechBERT: An Audio-and-Text Jointly Learned Language Model for End-to-End Spoken Question Answering
[36] Parallel WaveGAN: A Fast Waveform Generation Model Based on Generative Adversarial Networks with Multi-Resolution Spectrogram
[37] Delving into VoxCeleb: Environment Invariant Speaker Recognition
[38] Speech-XLNet: Unsupervised Acoustic Model Pretraining for Self-Attention Networks
[39] Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
[40] Generative Pre-Training for Speech with Autoregressive Predictive Coding
[41] Pre-Trained Text Embeddings for Enhanced Text-to-Speech Synthesis
[42] RoBERTa: A Robustly Optimized BERT Pretraining Approach
[43] WHAM!: Extending Speech Separation to Noisy Environments
[44] XLNet: Generalized Autoregressive Pretraining for Language Understanding
[45] MuST-C: a Multilingual Speech Translation Corpus
[46] Unified Language Model Pre-training for Natural Language Understanding and Generation
[47] LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech
[48] fairseq: A Fast, Extensible Toolkit for Sequence Modeling
[49] Cross-lingual Language Model Pretraining
[50] Neural Speech Synthesis with Transformer Network
[51] Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates
[52] X-Vectors: Robust DNN Embeddings for Speaker Recognition
[53] ESPnet: End-to-End Speech Processing Toolkit
[54] Speech2Vec: A Sequence-to-Sequence Framework for Learning Word Embeddings from Speech
[55] Self-Attention with Relative Position Representations
[56] Deep Contextualized Word Representations
[57] Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
[58] Neural Discrete Representation Learning
[59] Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention
[60] Hybrid CTC/Attention Architecture for End-to-End Speech Recognition
[61] VoxCeleb: A Large-Scale Speaker Identification Dataset
[62] Attention Is All You Need
[63] Language Modeling with Gated Convolutional Networks
[64] An Algorithm for Predicting the Intelligibility of Speech Masked by Modulated Noise Maskers
[66] Deep Residual Learning for Image Recognition
[67] Librispeech: An ASR Corpus Based on Public Domain Audio Books
[68] On Using Monolingual Corpora in Neural Machine Translation
[69] Adam: A Method for Stochastic Optimization
[70] Bleu: a Method for Automatic Evaluation of Machine Translation
[71] Perceptual Evaluation of Speech Quality (PESQ): A New Method for Speech Quality Assessment of Telephone Networks and Codecs
[72] Multilingual Speech Translation from Efficient Finetuning of Pretrained Models
[73] SemFace: Pre-training Encoder and Decoder with a Semantic Interface for Neural Machine Translation
[75] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
[76] Language Models are Unsupervised Multitask Learners
[77] Joint CTC/Attention Decoding for End-to-End Speech Recognition
[79] The CMU Arctic Speech Databases