3260 papers • 126 benchmarks • 313 datasets
Audio Classification is a machine learning task that involves identifying and tagging audio signals into different classes or categories. The goal of audio classification is to enable machines to automatically recognize and distinguish between different types of audio, such as music, speech, and environmental sounds.
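For readers who want to try the task directly, below is a minimal inference sketch using the Hugging Face `transformers` audio-classification pipeline; the checkpoint name and the `dog_bark.wav` path are placeholders rather than recommendations.

```python
from transformers import pipeline

# Any audio-classification checkpoint from the Hub can be substituted here.
classifier = pipeline(
    "audio-classification",
    model="MIT/ast-finetuned-audioset-10-10-0.4593",  # example AudioSet-finetuned model
)

# The pipeline returns the top labels with confidence scores.
for prediction in classifier("dog_bark.wav"):  # placeholder file path
    print(f"{prediction['label']}: {prediction['score']:.3f}")
```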
These leaderboards are used to track progress in Audio Classification
Use these libraries to find Audio Classification models and implementations
This paper introduces the Perceiver - a model that builds upon Transformers and hence makes few architectural assumptions about the relationship between its inputs, but that also scales to hundreds of thousands of inputs, like ConvNets.
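A rough sketch (not the authors' implementation) of the Perceiver's central mechanism: a small learned latent array cross-attends to an arbitrarily large input array, so the attention cost grows linearly with the number of inputs; all sizes below are illustrative.

```python
import torch
import torch.nn as nn

class PerceiverBlock(nn.Module):
    def __init__(self, dim=256, num_latents=64, heads=4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU())

    def forward(self, inputs):                     # inputs: (batch, M, dim), M can be huge
        b = inputs.shape[0]
        z = self.latents.expand(b, -1, -1)         # (batch, num_latents, dim)
        z, _ = self.cross_attn(z, inputs, inputs)  # latents query the raw inputs
        z, _ = self.self_attn(z, z, z)             # cheap self-attention among latents only
        return self.ff(z).mean(dim=1)              # pooled representation for classification

x = torch.randn(2, 10_000, 256)                    # e.g. 10k input elements per clip
print(PerceiverBlock()(x).shape)                   # torch.Size([2, 256])
```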
This work uses various CNN architectures to classify the soundtracks of a dataset of 70M training videos with 30,871 video-level labels, and investigates varying the size of both the training set and the label vocabulary, finding that analogs of the CNNs used in image classification perform well on the authors' audio classification task, and that larger training and label sets help up to a point.
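A minimal sketch of the general recipe this entry describes: compute a log-mel spectrogram and apply an image-style CNN to predict clip-level labels. The layer widths and the 527-way output (AudioSet's label count) are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torchaudio

class SpectrogramCNN(nn.Module):
    def __init__(self, n_mels=64, n_classes=527):
        super().__init__()
        # Waveform -> log-mel spectrogram "image".
        self.frontend = nn.Sequential(
            torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=n_mels),
            torchaudio.transforms.AmplitudeToDB(),
        )
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.head = nn.Linear(64, n_classes)

    def forward(self, waveform):                       # waveform: (batch, samples)
        spec = self.frontend(waveform).unsqueeze(1)    # (batch, 1, n_mels, frames)
        feats = self.conv(spec).mean(dim=[2, 3])       # global average pooling
        return torch.sigmoid(self.head(feats))         # multi-label clip probabilities

model = SpectrogramCNN()
print(model(torch.randn(2, 16000)).shape)              # torch.Size([2, 527])
```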
This paper proposes pretrained audio neural networks (PANNs) trained on the large-scale AudioSet dataset, and investigates the performance and computational complexity of PANNs modeled by a variety of convolutional neural networks.
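For orientation, the repeated building block of the CNN14-style PANNs described here consists of two 3x3 convolution + batch-norm + ReLU layers followed by pooling; the sketch below is an illustrative re-implementation with arbitrary channel widths, not the released code.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(),
        )

    def forward(self, x):
        # Pooling halves the time/frequency resolution after each block.
        return nn.functional.avg_pool2d(self.block(x), kernel_size=2)

stack = nn.Sequential(ConvBlock(1, 64), ConvBlock(64, 128), ConvBlock(128, 256))
print(stack(torch.randn(2, 1, 64, 400)).shape)   # torch.Size([2, 256, 8, 50])
```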
A multi-attention model consisting of multiple attention modules applied to intermediate neural-network layers, which achieves a state-of-the-art mean average precision (mAP) of 0.360, outperforming both the single-attention model and the Google baseline system.
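One such attention module can be sketched as per-frame class probabilities combined under learned per-frame attention weights, so informative frames dominate the clip-level prediction; sizes below are examples only.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    def __init__(self, feat_dim=512, n_classes=527):
        super().__init__()
        self.cla = nn.Linear(feat_dim, n_classes)     # per-frame class scores
        self.att = nn.Linear(feat_dim, n_classes)     # per-frame attention scores

    def forward(self, frames):                        # frames: (batch, time, feat_dim)
        cla = torch.sigmoid(self.cla(frames))         # (batch, time, n_classes)
        att = torch.softmax(self.att(frames), dim=1)  # attention normalized over time
        return (att * cla).sum(dim=1)                 # clip-level probabilities

pool = AttentionPooling()
print(pool(torch.randn(2, 100, 512)).shape)           # torch.Size([2, 527])
```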
This paper proposes a simple and effective remedy, SGDP and AdamP: get rid of the radial component, or the norm-increasing direction, at each optimizer step, which alters the effective step sizes without changing the effective update directions, thus enjoying the original convergence properties of GD optimizers.
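The core operation can be written in a few lines: project the optimizer's update so that its component along the weight vector (the radial, norm-increasing direction) is removed. This is a conceptual sketch, not the official SGDP/AdamP code.

```python
import torch

def remove_radial_component(weight: torch.Tensor, update: torch.Tensor, eps: float = 1e-12):
    """Return `update` with its component along `weight` projected out."""
    w, u = weight.flatten(), update.flatten()
    radial = torch.dot(w, u) / (torch.dot(w, w) + eps) * w
    return (u - radial).view_as(update)

w = torch.randn(4, 3)
u = torch.randn(4, 3)
u_tangent = remove_radial_component(w, u)
# The projected update is orthogonal to w, so to first order it leaves ||w|| unchanged.
print(torch.dot(w.flatten(), u_tangent.flatten()))   # ~0
```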
This work introduces a new principled, lightweight, fully learnable architecture that can be used as a drop-in replacement of mel-filterbanks, and outperforms the current state-of-the-art learnable frontend on Audioset, with orders of magnitude fewer parameters.
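A deliberately simplified stand-in for such a learnable frontend (the actual model uses Gabor filters, learned Gaussian pooling, and learnable compression): every stage that a fixed mel-filterbank pipeline performs is replaced with a trainable module.

```python
import torch
import torch.nn as nn

class LearnableFrontend(nn.Module):
    def __init__(self, n_filters=40, kernel_size=401, stride=160):
        super().__init__()
        # Trainable filterbank applied directly to the raw waveform.
        self.filters = nn.Conv1d(1, n_filters, kernel_size, stride=stride,
                                 padding=kernel_size // 2)
        # Trainable per-channel offset replacing the fixed log(.) compression.
        self.log_offset = nn.Parameter(torch.full((n_filters, 1), 1e-3))

    def forward(self, waveform):                      # waveform: (batch, samples)
        x = self.filters(waveform.unsqueeze(1))       # (batch, n_filters, frames)
        return torch.log(x.abs() + self.log_offset.abs())

frontend = LearnableFrontend()
print(frontend(torch.randn(2, 16000)).shape)          # torch.Size([2, 40, 100])
```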
This work addresses the problem of segment-level general audio SSL and proposes a new transformer-based teacher-student SSL model, named ATST, which achieves new state-of-the-art results on almost all of the downstream tasks.
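A generic sketch of the teacher-student setup mentioned here: the student is trained to match the teacher's output on a differently augmented view, while the teacher tracks the student through an exponential moving average (EMA). The encoders below are toy stand-ins, not the ATST architecture.

```python
import copy
import torch
import torch.nn as nn

student = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 256))
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)                          # teacher receives no gradients

def ema_update(teacher, student, decay=0.999):
    with torch.no_grad():
        for t, s in zip(teacher.parameters(), student.parameters()):
            t.mul_(decay).add_(s, alpha=1.0 - decay)

view_a, view_b = torch.randn(8, 128), torch.randn(8, 128)   # two augmented views of a clip
loss = nn.functional.mse_loss(student(view_a), teacher(view_b))
loss.backward()
# ... optimizer.step() on the student, then refresh the teacher:
ema_update(teacher, student)
```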
The Audio-MAE, a simple extension of image-based Masked Autoencoders to self-supervised representation learning from audio spectrograms, sets new state-of-the-art performance on six audio and speech classification tasks, outperforming other recent models that use external supervised pre-training.
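The masking step of such a masked autoencoder can be sketched as follows: split the spectrogram into patches and keep only a small random subset for the encoder, leaving the decoder to reconstruct the rest. Patch size and keep ratio are illustrative.

```python
import torch

def random_mask_patches(spec, patch=16, keep_ratio=0.2):
    """spec: (batch, mels, frames) with both dims divisible by `patch`."""
    b, m, f = spec.shape
    patches = spec.reshape(b, m // patch, patch, f // patch, patch)
    patches = patches.permute(0, 1, 3, 2, 4).reshape(b, -1, patch * patch)  # (b, n_patches, patch*patch)
    n_keep = int(patches.shape[1] * keep_ratio)
    idx = torch.rand(b, patches.shape[1]).argsort(dim=1)[:, :n_keep]        # random subset per clip
    visible = torch.gather(patches, 1, idx.unsqueeze(-1).expand(-1, -1, patch * patch))
    return visible, idx                               # the encoder sees only `visible`

spec = torch.randn(2, 128, 1024)                      # e.g. 128 mel bins x 1024 frames
visible, idx = random_mask_patches(spec)
print(visible.shape)                                   # torch.Size([2, 102, 256])
```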
This work proposes a model that enhances this feature extraction process for the case of sequential data, by feeding patches of the data into a recurrent neural network and using the outputs or hidden states of the recurrent units to compute the extracted features.
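A small sketch of that idea: slice the input sequence into patches along time, feed the patches to a recurrent network, and use its hidden states as the extracted features; dimensions are examples only.

```python
import torch
import torch.nn as nn

class RecurrentPatchEncoder(nn.Module):
    def __init__(self, patch_dim, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(patch_dim, hidden, batch_first=True)

    def forward(self, x, patch_len=10):               # x: (batch, time, channels)
        b, t, c = x.shape
        # Drop the trailing remainder and flatten each patch of `patch_len` frames.
        patches = x[:, : t - t % patch_len].reshape(b, -1, patch_len * c)
        outputs, _ = self.rnn(patches)                # one hidden state per patch
        return outputs                                # features for a downstream classifier

enc = RecurrentPatchEncoder(patch_dim=10 * 64)
print(enc(torch.randn(2, 105, 64)).shape)              # torch.Size([2, 10, 128])
```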
This work proposes LanguageBind, which takes language as the binding modality across different modalities because language is well explored and contains rich semantics; it freezes the language encoder acquired by vision-language (VL) pretraining and then trains encoders for the other modalities with contrastive learning.
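A sketch of the training objective described here: a frozen language encoder provides anchor embeddings, and a trainable encoder for another modality (audio in this example) is aligned to its paired text with a symmetric contrastive (InfoNCE) loss. Both encoders below are toy stand-ins, not the actual pretrained models.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

text_encoder = nn.Linear(300, 256)                     # stand-in for the frozen VL-pretrained encoder
for p in text_encoder.parameters():
    p.requires_grad_(False)
audio_encoder = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 256))

def contrastive_loss(audio_feats, text_feats, temperature=0.07):
    a = F.normalize(audio_encoder(audio_feats), dim=-1)
    t = F.normalize(text_encoder(text_feats), dim=-1)
    logits = a @ t.T / temperature                     # (batch, batch) similarity matrix
    targets = torch.arange(len(a))                     # the i-th audio matches the i-th caption
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

loss = contrastive_loss(torch.randn(8, 128), torch.randn(8, 300))
loss.backward()                                        # gradients flow only into audio_encoder
```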