3260 papers • 126 benchmarks • 313 datasets
Automatic Speech Recognition (ASR) is the task of converting spoken language into written text, often in real time, allowing people to interact with computers, mobile devices, and other technology using their voice. The goal of Automatic Speech Recognition is to transcribe speech accurately despite variations in accent, pronunciation, and speaking style, as well as background noise and other factors that can degrade speech quality.
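A minimal sketch of what an ASR system does in practice, assuming the Hugging Face transformers pipeline API and the public facebook/wav2vec2-base-960h checkpoint; the audio file name is a hypothetical placeholder.

```python
# Minimal ASR sketch: transcribe an audio file with a pretrained model.
from transformers import pipeline

# "facebook/wav2vec2-base-960h" is a public checkpoint; any ASR model can be swapped in.
asr = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h")

result = asr("example_utterance.wav")  # hypothetical 16 kHz mono recording
print(result["text"])                  # the recognized transcript
```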
These leaderboards are used to track progress in Automatic Speech Recognition.
Use these libraries to find Automatic Speech Recognition models and implementations.
This work presents SpecAugment, a simple data augmentation method for speech recognition that is applied directly to the feature inputs of a neural network (i.e., filter bank coefficients) and achieves state-of-the-art performance on the LibriSpeech 960h and Switchboard 300h tasks, outperforming all prior work.
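As a rough illustration of the idea, the sketch below applies SpecAugment-style frequency and time masking to a batch of filter bank features using torchaudio's masking transforms; the feature shape and mask sizes are illustrative assumptions rather than the paper's exact augmentation policy.

```python
# Sketch: SpecAugment-style masking applied directly to filter bank features.
import torch
import torchaudio.transforms as T

fbank = torch.randn(1, 80, 300)          # (batch, mel bins, frames): stand-in features

freq_mask = T.FrequencyMasking(freq_mask_param=27)   # mask up to 27 consecutive mel bins
time_mask = T.TimeMasking(time_mask_param=100)       # mask up to 100 consecutive frames

augmented = time_mask(freq_mask(fbank))  # masked copy used for training
```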
The machine learning architecture of the Snips Voice Platform is presented, a software solution to perform Spoken Language Understanding on microprocessors typical of IoT devices that is fast and accurate while enforcing privacy by design, as no personal user data is ever collected.
A simple baseline that utilizes probabilities from softmax distributions is presented, showing the effectiveness of this baseline across computer vision, natural language processing, and automatic speech recognition tasks, and it is shown that the baseline can sometimes be surpassed.
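A minimal sketch of that baseline, assuming a generic classifier's logits: the maximum softmax probability serves as a confidence score, and low-confidence inputs are flagged as likely errors or out-of-distribution examples. The tensor shapes and threshold are illustrative.

```python
# Sketch: maximum-softmax-probability confidence baseline.
import torch
import torch.nn.functional as F

logits = torch.randn(4, 10)                    # (batch, classes): stand-in model outputs
confidence, prediction = F.softmax(logits, dim=-1).max(dim=-1)

threshold = 0.5                                # hypothetical operating point
flag_for_review = confidence < threshold       # low confidence -> likely error / OOD
```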
This work proposes the convolution-augmented transformer for speech recognition, named Conformer, which significantly outperforms previous Transformer- and CNN-based models, achieving state-of-the-art accuracies.
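A heavily simplified sketch of a Conformer-style block follows: self-attention interleaved with a depthwise convolution between two half-step feed-forward layers. Relative positional encoding, GLU gating, and other details of the published architecture are omitted, and all dimensions are illustrative.

```python
# Sketch: convolution-augmented transformer ("Conformer"-style) block.
import torch
import torch.nn as nn

class ConformerBlockSketch(nn.Module):
    def __init__(self, dim=256, heads=4, kernel_size=31):
        super().__init__()
        self.ffn1 = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                  nn.SiLU(), nn.Linear(4 * dim, dim))
        self.attn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv_norm = nn.LayerNorm(dim)
        self.depthwise = nn.Conv1d(dim, dim, kernel_size,
                                   padding=kernel_size // 2, groups=dim)
        self.ffn2 = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                  nn.SiLU(), nn.Linear(4 * dim, dim))
        self.out_norm = nn.LayerNorm(dim)

    def forward(self, x):                       # x: (batch, time, dim)
        x = x + 0.5 * self.ffn1(x)              # first (half-step) feed-forward
        h = self.attn_norm(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]     # self-attention
        h = self.conv_norm(x).transpose(1, 2)   # (batch, dim, time) for Conv1d
        x = x + self.depthwise(h).transpose(1, 2)              # convolution module
        x = x + 0.5 * self.ffn2(x)              # second (half-step) feed-forward
        return self.out_norm(x)

features = torch.randn(2, 300, 256)             # stand-in encoder features
out = ConformerBlockSketch()(features)
```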
This paper proposes a simple scaling method that scales the widths of ContextNet to achieve a good trade-off between computation and accuracy, and demonstrates that on the widely used LibriSpeech benchmark ContextNet achieves word error rates of 2.1%/4.6%.
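The scaling idea can be sketched as a single multiplier applied to every layer's channel count; the base widths and scaling factors below are hypothetical, not ContextNet's published configuration.

```python
# Sketch: width scaling by a single factor alpha to trade computation for accuracy.
base_channels = [256, 256, 512, 512, 640]        # hypothetical per-block widths

def scale_widths(channels, alpha):
    """Return channel counts scaled by alpha (rounded to integers)."""
    return [int(round(c * alpha)) for c in channels]

print(scale_widths(base_channels, 0.5))   # smaller, cheaper model
print(scale_widths(base_channels, 2.0))   # larger, more accurate model
```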
This work learns to listen and write characters with a joint Connectionist Temporal Classification (CTC) and attention-based encoder-decoder network and beats out traditional hybrid ASR systems on spontaneous Japanese and Chinese speech.
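A minimal sketch of a joint CTC/attention objective of this kind: the training loss is a weighted sum of a CTC loss on the encoder outputs and a cross-entropy loss on the attention decoder outputs. All tensors and the interpolation weight are illustrative stand-ins.

```python
# Sketch: joint CTC + attention training objective.
import torch
import torch.nn.functional as F

lam = 0.3                                            # CTC weight (hypothetical)

log_probs = torch.randn(50, 2, 30).log_softmax(-1)   # (time, batch, vocab): CTC head
targets = torch.randint(1, 30, (2, 12))              # (batch, label length)
input_lengths = torch.full((2,), 50)
target_lengths = torch.full((2,), 12)
ctc_loss = F.ctc_loss(log_probs, targets, input_lengths, target_lengths, blank=0)

decoder_logits = torch.randn(2, 12, 30)              # (batch, label length, vocab): attention decoder
att_loss = F.cross_entropy(decoder_logits.reshape(-1, 30), targets.reshape(-1))

loss = lam * ctc_loss + (1 - lam) * att_loss         # joint objective
```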
This paper presents the Eesen framework which drastically simplifies the existing pipeline to build state-of-the-art ASR systems and achieves comparable word error rates (WERs), while at the same time speeding up decoding significantly.
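Eesen decodes CTC-trained acoustic models with weighted finite-state transducers; as a rough stand-in for the simplified pipeline, the sketch below shows only greedy CTC decoding (collapse repeats, drop blanks) on hypothetical frame-level outputs.

```python
# Sketch: greedy CTC decoding of frame-level acoustic model outputs.
import torch

def greedy_ctc_decode(log_probs, blank=0):
    """log_probs: (time, vocab) frame-level log-probabilities."""
    best = log_probs.argmax(dim=-1).tolist()     # most likely symbol per frame
    out, prev = [], blank
    for s in best:
        if s != prev and s != blank:             # collapse repeats, drop blanks
            out.append(s)
        prev = s
    return out

frames = torch.randn(100, 30).log_softmax(-1)    # stand-in acoustic model output
print(greedy_ctc_decode(frames))
```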
Three deep neural network architectures are adapted to energy disaggregation and it is found that all three neural nets achieve better F1 scores than either combinatorial optimisation or factorial hidden Markov models and that the neural net algorithms generalise well to an unseen house.
A variety of structural and optimization improvements to the Listen, Attend, and Spell model are explored, which significantly improve performance, and a multi-head attention architecture is introduced, which offers improvements over the commonly used single-head attention.
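A minimal sketch of the single-head versus multi-head change in an attention-based encoder-decoder of this style, using PyTorch's built-in attention module; the dimensions and tensors are illustrative.

```python
# Sketch: replacing single-head with multi-head attention over encoder states.
import torch
import torch.nn as nn

encoder_states = torch.randn(2, 120, 256)        # (batch, frames, dim): "listener" outputs
decoder_state = torch.randn(2, 1, 256)           # current "speller" query

single_head = nn.MultiheadAttention(256, num_heads=1, batch_first=True)
multi_head = nn.MultiheadAttention(256, num_heads=4, batch_first=True)

ctx_single, _ = single_head(decoder_state, encoder_states, encoder_states)
ctx_multi, _ = multi_head(decoder_state, encoder_states, encoder_states)
# With 4 heads, each head attends over a separate 64-dim slice of the encoder states.
```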