3260 papers • 126 benchmarks • 313 datasets
Audio Classification is a machine learning task that involves identifying and tagging audio signals into different classes or categories. The goal of audio classification is to enable machines to automatically recognize and distinguish between different types of audio, such as music, speech, and environmental sounds.
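For readers who want to try the task directly, below is a minimal inference sketch using the Hugging Face `transformers` audio-classification pipeline; the checkpoint name and the `dog_bark.wav` path are placeholders rather than recommendations.

```python
from transformers import pipeline

# Any audio-classification checkpoint from the Hub can be substituted here.
classifier = pipeline(
    "audio-classification",
    model="MIT/ast-finetuned-audioset-10-10-0.4593",  # example AudioSet-finetuned model
)

# The pipeline returns the top labels with confidence scores.
for prediction in classifier("dog_bark.wav"):  # placeholder file path
    print(f"{prediction['label']}: {prediction['score']:.3f}")
```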
These leaderboards are used to track progress in Audio Classification
Use these libraries to find Audio Classification models and implementations
This paper introduces the Perceiver - a model that builds upon Transformers and hence makes few architectural assumptions about the relationship between its inputs, but that also scales to hundreds of thousands of inputs, like ConvNets.
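A rough sketch (not the authors' implementation) of the Perceiver's central mechanism: a small learned latent array cross-attends to an arbitrarily large input array, so the attention cost grows linearly with the number of inputs; all sizes below are illustrative.

```python
import torch
import torch.nn as nn

class PerceiverBlock(nn.Module):
    def __init__(self, dim=256, num_latents=64, heads=4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU())

    def forward(self, inputs):                     # inputs: (batch, M, dim), M can be huge
        b = inputs.shape[0]
        z = self.latents.expand(b, -1, -1)         # (batch, num_latents, dim)
        z, _ = self.cross_attn(z, inputs, inputs)  # latents query the raw inputs
        z, _ = self.self_attn(z, z, z)             # cheap self-attention among latents only
        return self.ff(z).mean(dim=1)              # pooled representation for classification

x = torch.randn(2, 10_000, 256)                    # e.g. 10k input elements per clip
print(PerceiverBlock()(x).shape)                   # torch.Size([2, 256])
```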
This work uses various CNN architectures to classify the soundtracks of a dataset of 70M training videos with 30,871 video-level labels, and investigates varying the size of both the training set and the label vocabulary, finding that analogs of the CNNs used in image classification perform well on the authors' audio classification task, and that larger training and label sets help up to a point.
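A minimal sketch of the general recipe this entry describes: compute a log-mel spectrogram and apply an image-style CNN to predict clip-level labels. The layer widths and the 527-way output (AudioSet's label count) are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torchaudio

class SpectrogramCNN(nn.Module):
    def __init__(self, n_mels=64, n_classes=527):
        super().__init__()
        # Waveform -> log-mel spectrogram "image".
        self.frontend = nn.Sequential(
            torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=n_mels),
            torchaudio.transforms.AmplitudeToDB(),
        )
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.head = nn.Linear(64, n_classes)

    def forward(self, waveform):                       # waveform: (batch, samples)
        spec = self.frontend(waveform).unsqueeze(1)    # (batch, 1, n_mels, frames)
        feats = self.conv(spec).mean(dim=[2, 3])       # global average pooling
        return torch.sigmoid(self.head(feats))         # multi-label clip probabilities

model = SpectrogramCNN()
print(model(torch.randn(2, 16000)).shape)              # torch.Size([2, 527])
```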
This paper proposes pretrained audio neural networks (PANNs) trained on the large-scale AudioSet dataset, and investigates the performance and computational complexity of PANNs modeled by a variety of convolutional neural networks.
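For orientation, the repeated building block of the CNN14-style PANNs described here consists of two 3x3 convolution + batch-norm + ReLU layers followed by pooling; the sketch below is an illustrative re-implementation with arbitrary channel widths, not the released code.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(),
        )

    def forward(self, x):
        # Pooling halves the time/frequency resolution after each block.
        return nn.functional.avg_pool2d(self.block(x), kernel_size=2)

stack = nn.Sequential(ConvBlock(1, 64), ConvBlock(64, 128), ConvBlock(128, 256))
print(stack(torch.randn(2, 1, 64, 400)).shape)   # torch.Size([2, 256, 8, 50])
```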
A multi-attention model consisting of multiple attention modules applied to intermediate neural-network layers, which achieves a state-of-the-art mean average precision (mAP) of 0.360, outperforming both the single-attention model and the Google baseline system.
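One such attention module can be sketched as per-frame class probabilities combined under learned per-frame attention weights, so informative frames dominate the clip-level prediction; sizes below are examples only.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    def __init__(self, feat_dim=512, n_classes=527):
        super().__init__()
        self.cla = nn.Linear(feat_dim, n_classes)     # per-frame class scores
        self.att = nn.Linear(feat_dim, n_classes)     # per-frame attention scores

    def forward(self, frames):                        # frames: (batch, time, feat_dim)
        cla = torch.sigmoid(self.cla(frames))         # (batch, time, n_classes)
        att = torch.softmax(self.att(frames), dim=1)  # attention normalized over time
        return (att * cla).sum(dim=1)                 # clip-level probabilities

pool = AttentionPooling()
print(pool(torch.randn(2, 100, 512)).shape)           # torch.Size([2, 527])
```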
This paper proposes a simple and effective remedy, SGDP and AdamP: get rid of the radial component, or the norm-increasing direction, at each optimizer step, which alters the effective step sizes without changing the effective update directions, thus enjoying the original convergence properties of GD optimizers.
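The core operation can be written in a few lines: project the optimizer's update so that its component along the weight vector (the radial, norm-increasing direction) is removed. This is a conceptual sketch, not the official SGDP/AdamP code.

```python
import torch

def remove_radial_component(weight: torch.Tensor, update: torch.Tensor, eps: float = 1e-12):
    """Return `update` with its component along `weight` projected out."""
    w, u = weight.flatten(), update.flatten()
    radial = torch.dot(w, u) / (torch.dot(w, w) + eps) * w
    return (u - radial).view_as(update)

w = torch.randn(4, 3)
u = torch.randn(4, 3)
u_tangent = remove_radial_component(w, u)
# The projected update is orthogonal to w, so to first order it leaves ||w|| unchanged.
print(torch.dot(w.flatten(), u_tangent.flatten()))   # ~0
```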
This work introduces a new principled, lightweight, fully learnable architecture that can be used as a drop-in replacement of mel-filterbanks, and outperforms the current state-of-the-art learnable frontend on Audioset, with orders of magnitude fewer parameters.
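A deliberately simplified stand-in for such a learnable frontend (the actual model uses Gabor filters, learned Gaussian pooling, and learnable compression): every stage that a fixed mel-filterbank pipeline performs is replaced with a trainable module.

```python
import torch
import torch.nn as nn

class LearnableFrontend(nn.Module):
    def __init__(self, n_filters=40, kernel_size=401, stride=160):
        super().__init__()
        # Trainable filterbank applied directly to the raw waveform.
        self.filters = nn.Conv1d(1, n_filters, kernel_size, stride=stride,
                                 padding=kernel_size // 2)
        # Trainable per-channel offset replacing the fixed log(.) compression.
        self.log_offset = nn.Parameter(torch.full((n_filters, 1), 1e-3))

    def forward(self, waveform):                      # waveform: (batch, samples)
        x = self.filters(waveform.unsqueeze(1))       # (batch, n_filters, frames)
        return torch.log(x.abs() + self.log_offset.abs())

frontend = LearnableFrontend()
print(frontend(torch.randn(2, 16000)).shape)          # torch.Size([2, 40, 100])
```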
This work addresses the problem of segment-level general audio SSL and proposes a new transformer-based teacher-student SSL model, named ATST, which achieves new state-of-the-art results on almost all of the downstream tasks.
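A generic sketch of the teacher-student setup mentioned here: the student is trained to match the teacher's output on a differently augmented view, while the teacher tracks the student through an exponential moving average (EMA). The encoders below are toy stand-ins, not the ATST architecture.

```python
import copy
import torch
import torch.nn as nn

student = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 256))
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)                          # teacher receives no gradients

def ema_update(teacher, student, decay=0.999):
    with torch.no_grad():
        for t, s in zip(teacher.parameters(), student.parameters()):
            t.mul_(decay).add_(s, alpha=1.0 - decay)

view_a, view_b = torch.randn(8, 128), torch.randn(8, 128)   # two augmented views of a clip
loss = nn.functional.mse_loss(student(view_a), teacher(view_b))
loss.backward()
# ... optimizer.step() on the student, then refresh the teacher:
ema_update(teacher, student)
```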
The Audio-MAE, a simple extension of image-based Masked Autoencoders to self-supervised representation learning from audio spectrograms, sets new state-of-the-art performance on six audio and speech classification tasks, outperforming other recent models that use external supervised pre-training.
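The masking step of such a masked autoencoder can be sketched as follows: split the spectrogram into patches and keep only a small random subset for the encoder, leaving the decoder to reconstruct the rest. Patch size and keep ratio are illustrative.

```python
import torch

def random_mask_patches(spec, patch=16, keep_ratio=0.2):
    """spec: (batch, mels, frames) with both dims divisible by `patch`."""
    b, m, f = spec.shape
    patches = spec.reshape(b, m // patch, patch, f // patch, patch)
    patches = patches.permute(0, 1, 3, 2, 4).reshape(b, -1, patch * patch)  # (b, n_patches, patch*patch)
    n_keep = int(patches.shape[1] * keep_ratio)
    idx = torch.rand(b, patches.shape[1]).argsort(dim=1)[:, :n_keep]        # random subset per clip
    visible = torch.gather(patches, 1, idx.unsqueeze(-1).expand(-1, -1, patch * patch))
    return visible, idx                               # the encoder sees only `visible`

spec = torch.randn(2, 128, 1024)                      # e.g. 128 mel bins x 1024 frames
visible, idx = random_mask_patches(spec)
print(visible.shape)                                   # torch.Size([2, 102, 256])
```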
This work proposes a model that enhances this feature extraction process for the case of sequential data, by feeding patches of the data into a recurrent neural network and using the outputs or hidden states of the recurrent units to compute the extracted features.
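A small sketch of that idea: slice the input sequence into patches along time, feed the patches to a recurrent network, and use its hidden states as the extracted features; dimensions are examples only.

```python
import torch
import torch.nn as nn

class RecurrentPatchEncoder(nn.Module):
    def __init__(self, patch_dim, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(patch_dim, hidden, batch_first=True)

    def forward(self, x, patch_len=10):               # x: (batch, time, channels)
        b, t, c = x.shape
        # Drop the trailing remainder and flatten each patch of `patch_len` frames.
        patches = x[:, : t - t % patch_len].reshape(b, -1, patch_len * c)
        outputs, _ = self.rnn(patches)                # one hidden state per patch
        return outputs                                # features for a downstream classifier

enc = RecurrentPatchEncoder(patch_dim=10 * 64)
print(enc(torch.randn(2, 105, 64)).shape)              # torch.Size([2, 10, 128])
```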
This work proposes LanguageBind, which takes language as the binding modality across different modalities because language is well explored and contains rich semantics; it freezes the language encoder acquired by vision-language (VL) pretraining and then trains encoders for the other modalities with contrastive learning.
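A sketch of the training objective described here: a frozen language encoder provides anchor embeddings, and a trainable encoder for another modality (audio in this example) is aligned to its paired text with a symmetric contrastive (InfoNCE) loss. Both encoders below are toy stand-ins, not the actual pretrained models.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

text_encoder = nn.Linear(300, 256)                     # stand-in for the frozen VL-pretrained encoder
for p in text_encoder.parameters():
    p.requires_grad_(False)
audio_encoder = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 256))

def contrastive_loss(audio_feats, text_feats, temperature=0.07):
    a = F.normalize(audio_encoder(audio_feats), dim=-1)
    t = F.normalize(text_encoder(text_feats), dim=-1)
    logits = a @ t.T / temperature                     # (batch, batch) similarity matrix
    targets = torch.arange(len(a))                     # the i-th audio matches the i-th caption
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

loss = contrastive_loss(torch.randn(8, 128), torch.randn(8, 300))
loss.backward()                                        # gradients flow only into audio_encoder
```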