Deep Speech: Scaling up end-to-end speech recognition

Published in

arXiv.org (2014)

TL;DR

Deep Speech, a state-of-the-art speech recognition system developed using end-to-end deep learning, outperforms previously published results on the widely studied Switchboard Hub5'00, achieving 16.0% error on the full test set.

Abstract

We present a state-of-the-art speech recognition system developed using end-to-end deep learning. Our architecture is significantly simpler than traditional speech systems, which rely on laboriously engineered processing pipelines; these traditional systems also tend to perform poorly when used in noisy environments. In contrast, our system does not need hand-designed components to model background noise, reverberation, or speaker variation, but instead directly learns a function that is robust to such effects. We do not need a phoneme dictionary, nor even the concept of a "phoneme." Key to our approach is a well-optimized RNN training system that uses multiple GPUs, as well as a set of novel data synthesis techniques that allow us to efficiently obtain a large amount of varied data for training. Our system, called Deep Speech, outperforms previously published results on the widely studied Switchboard Hub5'00, achieving 16.0% error on the full test set. Deep Speech also handles challenging noisy environments better than widely used, state-of-the-art commercial speech systems.

Authors

Awni Y. Hannun

7 papers

Shubho Sengupta

5 papers

A. Ng

22 papers

Bryan Catanzaro

12 papers

Adam Coates

7 papers

Erich Elsen

10 papers

S. Satheesh

3 papers

References (49 items)

Deep Speech 2: End-to-End Speech Recognition in English and Mandarin

cuDNN: Efficient Primitives for Deep Learning

Going deeper with convolutions

Sequence to Sequence Learning with Neural Networks

First-Pass Large Vocabulary Continuous Speech Recognition using Bi-Directional Recurrent DNNs

Increasing Deep Neural Network Acoustic Model Size for Large Vocabulary Continuous Speech Recognition

Towards End-To-End Speech Recognition with Recurrent Neural Networks

Joint training of convolutional and non-convolutional neural networks

Improvements to Deep Convolutional Neural Networks for LVCSR

Sequence-discriminative training of deep neural networks

Scalable Modified Kneser-Ney Language Model Estimation

Deep learning with COTS HPC systems

On the importance of initialization and momentum in deep learning

Deep convolutional neural networks for LVCSR

ImageNet classification with deep convolutional neural networks

Large Scale Distributed Deep Networks

Deep Neural Networks for Acoustic Modeling in Speech Recognition

Improving neural networks by preventing co-adaptation of feature detectors

Multi-column deep neural networks for image classification

Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition

Building high-level features using large scale unsupervised learning

Feature engineering in Context-Dependent Deep Neural Networks for conversational speech transcription

An Analysis of Single-Layer Networks in Unsupervised Feature Learning

Text Detection and Character Recognition in Scene Images with Unsupervised Feature Learning

Deep Sparse Rectifier Neural Networks

Rectified Linear Units Improve Restricted Boltzmann Machines

Unsupervised feature learning for audio classification using convolutional deep belief networks

Large-scale deep unsupervised learning using graphics processors

A Fast Data Collection and Augmentation Procedure for Object Recognition

Shift-invariant sparse coding for audio classification

Learning methods for generic object recognition with invariance to pose and lighting

The Fisher Corpus: a Resource for the Next Generations of Speech-to-Text

Size matters: an empirical study of neural network training for large vocabulary continuous speech recognition

Bidirectional recurrent neural networks

Connectionist Speech Recognition: A Hybrid Approach

Backpropagation Applied to Handwritten Zip Code Recognition

Rectifier Nonlinearities Improve Neural Network Acoustic Models

Scalable Minimum Bayes Risk Training of Deep Neural Network Acoustic Models Using Distributed Hessian-free Optimization

Acoustic Modeling Using Deep Belief Networks

Advances in Neural Information Processing Systems 25

The Kaldi Speech Recognition Toolkit

Decoding

Connectionist probability estimators in HMM speech recognition

The Lombard reflex and its role on human listeners and automatic speech recognizers

2d users guide

Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks

Flexible, High Performance Convolutional Neural Networks for Image Classification (Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence)

Connectionist Temporal Classification: Labelling Unsegmented Sequences with Recurrent Neural Networks

Speaker adaptation is critical to the success of current ASR systems

Field of Study

Computer Science

Journal Information

Name

ArXiv

Volume

abs/1412.5567

Venue Information

Name

arXiv.org

Type

URL

https://arxiv.org

Alternate Names

ArXiv
