SpeechBrain: A General-Purpose Speech Toolkit

Published in

arXiv.org (2021)

TL;DR

This paper describes the core architecture of SpeechBrain, which is designed to support several tasks of common interest and lets users naturally conceive, compare, and share novel speech processing pipelines.

Abstract

SpeechBrain is an open-source and all-in-one speech toolkit. It is designed to facilitate the research and development of neural speech processing technologies by being simple, flexible, user-friendly, and well-documented. This paper describes the core architecture designed to support several tasks of common interest, allowing users to naturally conceive, compare and share novel speech processing pipelines. SpeechBrain achieves competitive or state-of-the-art performance in a wide range of speech benchmarks. It also provides training recipes, pretrained models, and inference scripts for popular speech datasets, as well as tutorials which allow anyone with basic Python proficiency to familiarize themselves with speech technologies.

Authors

Yoshua Bengio

69 papers

M. Ravanelli

17 papers

Samuele Cornell

6 papers

References (113 items)

SUPERB: Speech processing Universal PERformance Benchmark

Integration of Pre-Trained Networks with Continuous Token Interface for End-to-End Spoken Language Understanding

MetricGAN+: An Improved Version of MetricGAN for Speech Enhancement

Citrinet: Closing the Gap between Non-Autoregressive and Autoregressive End-to-End Models for Automatic Speech Recognition

Timers and Such: A Practical Benchmark for Spoken Language Understanding with Numbers

Authors

Jianyuan Zhong

4 papers

Cem Subakan

3 papers

Loren Lugosch

4 papers

Titouan Parcollet

5 papers

Ju-Chieh Chou

2 papers

Peter William VanHarn Plantinga

1 paper

Nauman Dawalatabad

1 paper

Sung-Lin Yeh

1 paper

Chien-Feng Liao

2 papers

Elena Rastorgueva

1 paper

François Grondin

1 paper

William Aris

1 paper

References

ECAPA-TDNN Embeddings for Speaker Diarization

VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation

Bayesian HMM clustering of x-vector sequences (VBx) in speaker diarization: theory, implementation and analysis on standard tasks

ESPnet-SE: End-To-End Speech Enhancement and Separation Toolkit Designed for ASR Integration

SLURP: A Spoken Language Understanding Resource Package

Attention Is All You Need In Speech Separation

Rethinking Evaluation in ASR: Are Our Models Robust Enough?

WER we are and WER we think we are

Transformers: State-of-the-Art Natural Language Processing

Meta-Learning With Latent Space Clustering in Generative Adversarial Network for Speaker Diarization

Real Time Speech Enhancement in the Waveform Domain

wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

LibriMix: An Open-Source Dataset for Generalizable Speech Separation

GEV Beamforming Supported by DOA-Based Masks Generated on Pairs of Microphones

The INTERSPEECH 2020 Deep Noise Suppression Challenge: Datasets, Subjective Testing Framework, and Challenge Results

ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification

Asteroid: the PyTorch-based audio source separation toolkit for researchers

ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context

fastai: A Layered API for Deep Learning

Libri-Light: A Benchmark for ASR with Limited or No Supervision

Common Voice: A Massively-Multilingual Speech Corpus

PyTorch: An Imperative Style, High-Performance Deep Learning Library

Pyannote.Audio: Neural Building Blocks for Speaker Diarization

WHAMR!: Noisy and Reverberant Single-Channel Speech Separation

QuartzNet: Deep Automatic Speech Recognition with 1D Time-Channel Separable Convolutions

BUT System Description to VoxCeleb Speaker Recognition Challenge 2019

Dual-Path RNN: Efficient Long Sequence Modeling for Time-Domain Single-Channel Speech Separation

Espresso: A Fast End-to-End Neural Speech Recognition Toolkit

NeMo: a toolkit for building AI applications using Neural Modules

A Comparative Study on Transformer vs RNN in Speech Applications

SciPy 1.0: fundamental algorithms for scientific computing in Python

WHAM!: Extending Speech Separation to Noisy Environments

SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition

Speech Model Pre-training for End-to-End Spoken Language Understanding

Jasper: An End-to-End Convolutional Neural Acoustic Model

fairseq: A Fast, Extensible Toolkit for Sequence Modeling

Lingvo: a Modular and Scalable Framework for Sequence-to-Sequence Modeling

Lightweight and Optimized Sound Source Localization and Tracking Methods for Open and Closed Microphone Array Configurations

The PyTorch-Kaldi Speech Recognition Toolkit

From Audio to Semantics: Approaches to End-to-End Spoken Language Understanding

Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation

SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing

A Comparison of Techniques for Language Model Integration in Encoder-Decoder Speech Recognition

VoxCeleb2: Deep Speaker Recognition

Quaternion Recurrent Neural Networks

TED-LIUM 3: twice as much data and corpus repartition for experiments on speaker adaptation

RETURNN as a Generic Flexible Neural Toolkit with Application to Translation and Speech Recognition

X-Vectors: Robust DNN Embeddings for Speaker Recognition

ESPnet: End-to-End Speech Processing Toolkit

Spectral Feature Mapping with MIMIC Loss for Robust Speech Recognition

Light Gated Recurrent Units for Speech Recognition

Towards End-to-end Spoken Language Understanding

Hybrid CTC/Attention Architecture for End-to-End Speech Recognition

AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline

Noisy speech database for training speech enhancement algorithms and TTS models

End-to-End Training of Acoustic Models for Large Vocabulary Continuous Speech Recognition with TensorFlow

VoxCeleb: A Large-Scale Speaker Identification Dataset

An Empirical Study of Mini-Batch Creation Strategies for Neural Machine Translation

Attention is All you Need

Deep Complex Networks

Convolutional Sequence to Sequence Learning

Batch-normalized joint training for DNN-based distant speech recognition

Improved MVDR Beamforming Using Single-Channel Mask Prediction Networks

A Joint Training Framework for Robust Automatic Speech Recognition

An extensible speaker identification sidekit in Python

Neural network based spectral mask estimation for acoustic beamforming

Speech enhancement and recognition using multi-task learning of long short-term memory recurrent neural networks

Deep clustering: Discriminative embeddings for segmentation and separation

Joint training of front-end and back-end deep neural networks for robust speech recognition

Librispeech: An ASR corpus based on public domain audio books

Deep Speech: Scaling up end-to-end speech recognition

Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling

Neural Machine Translation by Jointly Learning to Align and Translate

The voice bank corpus: Design, collection and data analysis of a large regional accent speech database

PLDA for speaker verification with utterances of arbitrary duration

The Diverse Environments Multi-channel Acoustic Noise Database (DEMAND): A database of multichannel environmental noise recordings

Sequence Transduction with Recurrent Neural Networks

Scikit-learn: Machine Learning in Python

A tutorial on spectral clustering

Design fragments make using frameworks easier

Performance measurement in blind audio source separation

Pocketsphinx: A Free, Real-Time Continuous Speech Recognition System for Hand-Held Devices

The AMI Meeting Corpus: A Pre-announcement

Julius - an open source real-time large vocabulary recognition engine

Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs

Long Short-Term Memory

Acoustic event localization using a crosspower-spectrum phase based technique

DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus CD-ROM (TIMIT)

Phoneme recognition using time-delay neural networks

Multiple emitter location and signal Parameter estimation

The generalized correlation method for estimation of time delay

PyTorch Lightning

Generative Adversarial Nets

Deep learning with Keras

Robust speech recognition with speech enhanced deep neural networks

The Kaldi Speech Recognition Toolkit

A Modified SRP-PHAT Functional for Robust Real-Time Sound Source Localization With Scalable Spatial Sampling

Analysis of i-vector Length Normalization in Speaker Recognition Systems

RASR - The RWTH Aachen University Open Source Speech Recognition Toolkit

New Insights Into the MVDR Beamformer in Room Acoustics

Evaluation of Objective Quality Measures for Speech Enhancement

Weighted finite-state transducers in speech recognition

Speech processing

The development of the time-delay neural network architecture for speech recognition

Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks

TensorFlow: A System for Large-Scale Machine Learning

Connectionist Temporal Classification: Labelling Unsegmented Sequences with Recurrent Neural Networks

Field of Study

Engineering, Computer Science

Journal Information

Name

ArXiv

Volume

abs/2106.04624

Venue Information

Name

arXiv.org

Type

URL

https://arxiv.org

Alternate Names

ArXiv