VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation (2021-01-02)

TL;DR

VoxPopuli is a large-scale multilingual corpus providing 400K hours of unlabeled speech data in 23 languages. It is the largest open dataset to date for unsupervised representation learning and semi-supervised learning.

Abstract

We introduce VoxPopuli, a large-scale multilingual corpus providing 400K hours of unlabeled speech data in 23 languages. It is the largest open data to date for unsupervised representation learning as well as semi-supervised learning. VoxPopuli also contains 1.8K hours of transcribed speeches in 15 languages and their aligned oral interpretations into 15 target languages totaling 17.3K hours. We provide speech recognition (ASR) baselines and validate the versatility of VoxPopuli unlabeled data in semi-supervised ASR and speech-to-text translation under challenging out-of-domain settings. The corpus is available at https://github.com/facebookresearch/voxpopuli.

Authors

Ann Lee

Daniel Haziza

Mary Williamson
