3260 papers • 126 benchmarks • 313 datasets
In speech processing, keyword spotting deals with the identification of keywords in utterances. (Image credit: Simon Grest)
These leaderboards are used to track progress in Keyword Spotting
Use these libraries to find Keyword Spotting models and implementations
An audio dataset of spoken words designed to help train and evaluate keyword spotting systems is described, along with a suggested methodology for reproducible and comparable accuracy metrics for this task.
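A minimal sketch of how one might load this dataset for a keyword spotting experiment, assuming the torchaudio package; the download path and the log-Mel front end are illustrative choices, not part of the paper:

```python
import torch
import torchaudio

# Download the Speech Commands dataset to ./data (several GB on first run).
dataset = torchaudio.datasets.SPEECHCOMMANDS(root="./data", download=True)

waveform, sample_rate, label, speaker_id, utterance_number = dataset[0]
print(label, sample_rate, waveform.shape)  # e.g. a 1-second, 16 kHz clip

# A common front end for KWS models: log-Mel filterbank features.
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate, n_fft=400, hop_length=160, n_mels=40
)
features = torch.log(mel(waveform) + 1e-6)  # shape: (1, 40, frames)
```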
It is shown that neural network architectures can be optimized to fit within the memory and compute constraints of microcontrollers without sacrificing accuracy, and the depthwise separable convolutional neural network (DS-CNN) is explored and compared against other neural network architectures.
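A minimal sketch of the building block behind this kind of model: a depthwise convolution followed by a pointwise (1x1) convolution. Channel counts and input shape below are illustrative, not the published configuration.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1):
        super().__init__()
        # Depthwise: one filter per input channel (groups=in_ch).
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size, stride=stride,
                                   padding=kernel_size // 2, groups=in_ch, bias=False)
        self.bn1 = nn.BatchNorm2d(in_ch)
        # Pointwise: 1x1 convolution mixes information across channels.
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.bn1(self.depthwise(x)))
        return self.relu(self.bn2(self.pointwise(x)))

# Example: a 40x101 log-Mel feature map (1 input channel) -> 64 channels.
x = torch.randn(1, 1, 40, 101)
block = DepthwiseSeparableConv(1, 64)
print(block(x).shape)  # torch.Size([1, 64, 40, 101])
```

Splitting the convolution this way cuts parameters and multiply-accumulate operations sharply relative to a standard convolution, which is what makes the architecture attractive for microcontrollers.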
The Keyword Transformer (KWT), a fully self-attentional architecture that exceeds state-of-the-art performance across multiple tasks without any pre-training or additional data, is introduced.
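A minimal sketch of the core idea (not the released KWT code): treat spectrogram time frames as tokens, prepend a learnable class token, and classify with a standard Transformer encoder. All dimensions below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TinyKeywordTransformer(nn.Module):
    def __init__(self, n_mfcc=40, n_frames=98, d_model=192, n_heads=3,
                 n_layers=4, n_classes=12):
        super().__init__()
        self.proj = nn.Linear(n_mfcc, d_model)             # frame -> token embedding
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))
        self.pos = nn.Parameter(torch.zeros(1, n_frames + 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x):                  # x: (batch, n_frames, n_mfcc)
        tok = self.proj(x)
        cls = self.cls.expand(x.size(0), -1, -1)
        tok = torch.cat([cls, tok], dim=1) + self.pos
        out = self.encoder(tok)
        return self.head(out[:, 0])        # classify from the class token

model = TinyKeywordTransformer()
print(model(torch.randn(2, 98, 40)).shape)  # torch.Size([2, 12])
```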
A self-supervised speech pre-training method called TERA, which stands for Transformer Encoder Representations from Alteration, is introduced, and it is shown that the proposed method is transferable to downstream datasets not used in pre-training.
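A minimal sketch of alteration-based self-supervised pre-training in this spirit (not the TERA code): corrupt a block of spectrogram frames and train an encoder to reconstruct the clean frames. The toy encoder, shapes, and corrupted region are assumptions for illustration.

```python
import torch
import torch.nn as nn

spec = torch.randn(8, 100, 80)            # (batch, frames, mel bins)
corrupted = spec.clone()
corrupted[:, 40:55, :] = 0.0              # time alteration: zero out a block of frames

# A toy frame-wise encoder standing in for the Transformer encoder.
encoder = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 80))
reconstruction = encoder(corrupted)

# L1 reconstruction loss on the altered region only.
loss = (reconstruction[:, 40:55, :] - spec[:, 40:55, :]).abs().mean()
loss.backward()
```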
A TM-based keyword spotting pipeline is explored to demonstrate low complexity and a faster rate of convergence compared to NNs, to investigate scalability with an increasing number of keywords, and to explore the potential for enabling low-power on-chip KWS.
Honk, an open-source PyTorch reimplementation of convolutional neural networks for keyword spotting that are included as examples in TensorFlow, is described and provides a starting point for future work on the keyword spotting task.
This paper collects and annotates 2036 archival document images from different locations and time periods and proposes a new evaluation scheme based on baselines, which requires no binarization and can handle skewed as well as rotated text lines.
This work explores the application of deep residual learning and dilated convolutions to the keyword spotting task, using the recently-released Google Speech Commands Dataset as a benchmark and establishes an open-source state-of-the-art reference to support the development of future speech-based interfaces.
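A minimal sketch of a residual block with dilated convolutions in the spirit of these models; channel count and dilation rate are illustrative, not the exact published configuration.

```python
import torch
import torch.nn as nn

class DilatedResBlock(nn.Module):
    def __init__(self, channels=45, dilation=2):
        super().__init__()
        # padding = dilation keeps the spatial size unchanged for a 3x3 kernel.
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=dilation,
                               dilation=dilation, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=dilation,
                               dilation=dilation, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        y = self.relu(self.bn1(self.conv1(x)))
        y = self.bn2(self.conv2(y))
        return self.relu(x + y)            # identity shortcut around the dilated convs

x = torch.randn(1, 45, 40, 101)            # (batch, channels, mel bins, frames)
print(DilatedResBlock()(x).shape)          # torch.Size([1, 45, 40, 101])
```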
A model is proposed, inspired by the recent success of dilated convolutions in sequence modeling applications, that allows deeper architectures to be trained in resource-constrained configurations and applies a custom target labeling that back-propagates loss from specific frames of interest, yielding higher accuracy while requiring only detection of the end of the keyword.
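A minimal sketch of back-propagating loss only from frames of interest, e.g. the frames where a keyword ends; the shapes, toy mask, and labeling scheme below are assumptions for illustration, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

batch, frames, n_classes = 4, 100, 2
logits = torch.randn(batch, frames, n_classes, requires_grad=True)
targets = torch.randint(0, n_classes, (batch, frames))

# Mask selecting only the end-of-keyword frames (here: a toy region).
frame_mask = torch.zeros(batch, frames, dtype=torch.bool)
frame_mask[:, 60:65] = True  # pretend the keyword ends around frame 60

# Per-frame cross-entropy, kept unreduced so it can be masked.
loss_per_frame = F.cross_entropy(
    logits.reshape(-1, n_classes), targets.reshape(-1), reduction="none"
).reshape(batch, frames)

# Only the masked frames contribute gradient.
loss = (loss_per_frame * frame_mask).sum() / frame_mask.sum().clamp(min=1)
loss.backward()
```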
The Audio Spectrogram Transformer is introduced, the first convolution-free, purely attention-based model for audio classification, which achieves new state-of-the-art results on various audio classification benchmarks.