3260 papers • 126 benchmarks • 313 datasets
Distant speech recognition is the task of automatically transcribing speech captured by one or more microphones placed far from the speaker, where reverberation and background noise substantially degrade the signal.
These leaderboards are used to track progress in Distant Speech Recognition
Use these libraries to find Distant Speech Recognition models and implementations
Experiments conducted on several datasets and tasks show that PyTorch-Kaldi can be used effectively to develop modern, state-of-the-art speech recognizers.
A novel deep recurrent neural network architecture, the residual LSTM, is introduced; it separates the spatial shortcut path from the temporal one by using output layers, which helps avoid a conflict between spatial- and temporal-domain gradient flows.
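A minimal PyTorch sketch of the idea, assuming a simplified reading of the residual LSTM in which the depth-wise shortcut is added after a projection of the LSTM output (layer sizes and projection placement are illustrative, not the paper's exact architecture):

```python
# Simplified residual LSTM layer: the skip connection gives depth-wise
# ("spatial") gradients a path that bypasses the temporal recurrence.
import torch
import torch.nn as nn

class ResidualLSTMLayer(nn.Module):
    def __init__(self, hidden_size: int):
        super().__init__()
        self.lstm = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        self.proj = nn.Linear(hidden_size, hidden_size)

    def forward(self, x):
        out, _ = self.lstm(x)        # temporal path: ordinary recurrence
        return self.proj(out) + x    # spatial path: shortcut from layer input

# Stacking layers lets gradients skip through depth.
net = nn.Sequential(*[ResidualLSTMLayer(256) for _ in range(3)])
y = net(torch.randn(4, 50, 256))     # (batch, time, features)
```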
A new, freely available corpus for German distant speech recognition is presented, and speaker-independent word error rate (WER) results are reported for two open-source speech recognizers trained on this corpus.
A first set of baseline results, obtained using different techniques including deep neural networks (DNNs) and aligned with the international state of the art, is reported.
This paper revisits this classical approach in the context of modern DNN-HMM systems and proposes the adoption of three methods, namely asymmetric context windowing, close-talk-based supervision, and close-talk-based pre-training, which are shown to yield a significant advantage.
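Of the three, asymmetric context windowing is the simplest to illustrate: the acoustic model's input stacks an unequal number of past and future frames around the current frame. A minimal sketch, assuming a window biased toward past frames; the left/right sizes and feature dimension are illustrative, not the paper's values:

```python
# Asymmetric context windowing: stack `left` past and `right` future frames
# (here left > right) around each frame before feeding a DNN-HMM model.
import numpy as np

def asymmetric_context(feats: np.ndarray, left: int = 10, right: int = 2):
    """feats: (num_frames, dim) -> (num_frames, (left + 1 + right) * dim)."""
    num_frames = feats.shape[0]
    padded = np.pad(feats, ((left, right), (0, 0)), mode="edge")
    windows = [padded[i:i + num_frames] for i in range(left + 1 + right)]
    return np.concatenate(windows, axis=1)

feats = np.random.randn(100, 40)   # stand-in for 40-dim filterbank features
ctx = asymmetric_context(feats)    # shape (100, 13 * 40)
```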
It is shown that a quaternion long short-term memory network (QLSTM), trained on the concatenated multi-channel speech signals, outperforms an equivalent real-valued LSTM on two different multi-channel distant speech recognition tasks.
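The distinctive ingredient of a quaternion network is the Hamilton product, which replaces the ordinary matrix product and ties four input components (e.g., four microphone channels) together through shared weights. A hedged sketch of a quaternion linear layer, the building block from which a QLSTM's gates can be assembled; the initialization scale and feature layout are illustrative assumptions:

```python
# Quaternion linear layer: inputs are laid out as [r | i | j | k], and the
# Hamilton product couples all four components through the same four matrices.
import torch
import torch.nn as nn

class QuaternionLinear(nn.Module):
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        def w():  # one real matrix per quaternion component
            return nn.Parameter(torch.randn(in_features, out_features) * 0.05)
        self.r, self.i, self.j, self.k = w(), w(), w(), w()

    def forward(self, x):
        xr, xi, xj, xk = torch.chunk(x, 4, dim=-1)
        yr = xr @ self.r - xi @ self.i - xj @ self.j - xk @ self.k
        yi = xr @ self.i + xi @ self.r + xj @ self.k - xk @ self.j
        yj = xr @ self.j - xi @ self.k + xj @ self.r + xk @ self.i
        yk = xr @ self.k + xi @ self.j - xj @ self.i + xk @ self.r
        return torch.cat([yr, yi, yj, yk], dim=-1)

# E.g. 4 microphone channels -> one quaternion per feature dimension.
y = QuaternionLinear(40, 64)(torch.randn(8, 4 * 40))   # -> (8, 4 * 64)
```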
This paper proposes SincNet, a novel convolutional neural network that encourages the first layer to discover more meaningful filters by exploiting parametrized sinc functions, and shows that the proposed architecture converges faster, performs better, and is more interpretable than a standard CNN.
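SincNet's first layer is compact enough to sketch directly: each convolution kernel is a band-pass filter obtained as the difference of two windowed sinc low-pass filters, so only the two cutoff frequencies per filter are learned rather than every tap. The kernel size, sampling rate, and cutoffs below are illustrative, not the paper's exact configuration:

```python
# Band-pass kernel g[n] = 2*f2*sinc(2*f2*t) - 2*f1*sinc(2*f1*t), windowed.
import numpy as np

def sinc_kernel(f1: float, f2: float, kernel_size: int = 251, fs: int = 16000):
    """Difference of two low-pass sinc filters = band-pass over [f1, f2]."""
    t = (np.arange(kernel_size) - (kernel_size - 1) / 2) / fs   # seconds
    def lowpass(fc):
        # np.sinc is the normalized sinc: sin(pi * x) / (pi * x)
        return 2 * fc * np.sinc(2 * fc * t)
    return (lowpass(f2) - lowpass(f1)) * np.hamming(kernel_size)

band = sinc_kernel(f1=300.0, f2=3400.0)   # e.g. a telephone-band filter
```

In an actual SincNet layer, f1 and f2 would be learnable parameters updated by backpropagation, with the rest of the kernel derived from them.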
This work proposes MicRank, a learning-to-rank framework in which a neural network is trained to rank the available channels directly using recognition performance on the training set; the method is agnostic to array geometry and to the type of recognition back-end.
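A hedged sketch of the learning-to-rank idea: a small network scores each channel, and the scores are fitted to a recognition-derived relevance signal (e.g., negative WER per channel) with a listwise softmax loss. The scorer architecture, features, and loss below are assumptions for illustration, not the exact MicRank recipe:

```python
# Listwise channel ranking: fit channel scores to WER-derived relevance.
import torch
import torch.nn as nn
import torch.nn.functional as F

scorer = nn.Sequential(nn.Linear(40, 64), nn.ReLU(), nn.Linear(64, 1))

def listwise_loss(channel_feats, relevance):
    """channel_feats: (n_channels, dim); relevance: (n_channels,), e.g. -WER."""
    scores = scorer(channel_feats).squeeze(-1)
    return F.kl_div(F.log_softmax(scores, dim=0),
                    F.softmax(relevance, dim=0), reduction="sum")

feats = torch.randn(6, 40)                                  # 6 microphones
rel = -torch.tensor([0.32, 0.25, 0.41, 0.28, 0.30, 0.27])   # -WER per channel
listwise_loss(feats, rel).backward()
best = scorer(feats).squeeze(-1).argmax()   # channel selected at test time
```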
Experiments show that the proposed improved self-supervised method can learn transferable, robust, and problem-agnostic features that carry relevant information from the speech signal, such as speaker identity, phonemes, and even higher-level features such as emotional cues.
A new impulse response (IR) dataset called MeshRIR is introduced; it consists of IRs measured at positions obtained by finely discretizing a spatial region, and it is suitable for evaluating sound field analysis and synthesis methods.
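For context, the typical downstream use of such an IR dataset in distant speech recognition is data contamination: convolving dry close-talk speech with a measured IR to simulate the signal at a distant microphone position. A minimal sketch, with a toy synthetic IR standing in for a MeshRIR measurement:

```python
# Simulate a distant-microphone recording: dry speech convolved with an IR.
import numpy as np
from scipy.signal import fftconvolve

def reverberate(dry: np.ndarray, ir: np.ndarray) -> np.ndarray:
    wet = fftconvolve(dry, ir)[: len(dry)]
    return wet / (np.max(np.abs(wet)) + 1e-9)   # peak-normalize

dry = np.random.randn(16000)   # stand-in for 1 s of dry speech at 16 kHz
ir = np.random.randn(4000) * np.exp(-np.linspace(0.0, 8.0, 4000))  # toy IR
wet = reverberate(dry, ir)
```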