3260 papers • 126 benchmarks • 313 datasets
Audio Captioning is the task of describing audio content in natural language. The general approach is to use an audio encoder (e.g. PANN, CAV-MAE) to encode the audio and a decoder (e.g. a transformer) to generate the text. Caption quality is commonly judged with machine translation metrics (BLEU, METEOR, ROUGE) and image captioning metrics (SPICE, CIDEr), although these are not well suited to audio descriptions; metrics based on pretrained language models, such as Sentence-BERT, have also been explored.
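As a rough illustration of this encoder-decoder recipe, the PyTorch sketch below pairs a small CNN over log-mel spectrograms (standing in for a pretrained encoder such as PANN) with a transformer decoder trained by teacher forcing. All module choices, dimensions, and the dummy data are illustrative assumptions, not any particular published model.

```python
# Minimal sketch of the generic audio-captioning pipeline: an audio encoder that
# produces a sequence of frame embeddings, and a transformer decoder that
# generates the caption token by token. Everything here is illustrative.
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    """Stand-in for a pretrained audio encoder: a small CNN over log-mel spectrograms."""
    def __init__(self, n_mels=64, d_model=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.proj = nn.Linear(64 * (n_mels // 4), d_model)

    def forward(self, mel):                      # mel: (batch, 1, time, n_mels)
        x = self.conv(mel)                       # (batch, 64, time/4, n_mels/4)
        x = x.permute(0, 2, 1, 3).flatten(2)     # (batch, time/4, 64 * n_mels/4)
        return self.proj(x)                      # (batch, time/4, d_model)

class CaptionDecoder(nn.Module):
    """Transformer decoder that attends over the encoded audio frames."""
    def __init__(self, vocab_size, d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, audio_memory):     # tokens: (batch, seq_len)
        seq_len = tokens.size(1)
        # Causal mask so each position only attends to earlier caption tokens.
        mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        out = self.decoder(self.embed(tokens), audio_memory, tgt_mask=mask)
        return self.lm_head(out)                 # (batch, seq_len, vocab_size)

# Training step: teacher-forced cross-entropy against the reference caption.
encoder, decoder = AudioEncoder(), CaptionDecoder(vocab_size=5000)
mel = torch.randn(2, 1, 400, 64)                 # dummy log-mel batch
tokens = torch.randint(0, 5000, (2, 20))         # dummy caption token ids
logits = decoder(tokens[:, :-1], encoder(mel))
loss = nn.functional.cross_entropy(logits.reshape(-1, 5000), tokens[:, 1:].reshape(-1))
```

At inference time the decoder would instead be run autoregressively (greedy or beam search), feeding back its own predictions.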
These leaderboards are used to track progress in Audio Captioning
Use these libraries to find Audio Captioning models and implementations
Clotho, a dataset for audio captioning consisting of 4981 audio samples of 15 to 30 seconds duration and 24,905 captions of eight to 20 words in length, is presented, together with a baseline method to provide initial results.
This work introduces WavCaps, the first large-scale weakly-labelled audio captioning dataset, and proposes a three-stage processing pipeline for filtering noisy data and generating high-quality captions, where ChatGPT, a large language model, is leveraged to filter and transform raw descriptions automatically.
In CL4AC, self-supervision signals derived from the original audio-text paired data are used to exploit the correspondences between audio and text by contrasting samples, which can improve the quality of the latent representation and the alignment between audio and text, even when trained with limited data.
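For context, a generic contrastive objective over paired audio and caption embeddings looks like the sketch below; this is a standard InfoNCE-style loss written as an illustration, not CL4AC's exact formulation, and the temperature value is an assumption.

```python
# Illustrative contrastive audio-text loss: matched clip/caption embeddings are
# pulled together, mismatched pairs within the batch are pushed apart.
import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """audio_emb, text_emb: (batch, dim) embeddings of paired clips and captions."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = audio_emb @ text_emb.t() / temperature   # (batch, batch) similarity matrix
    targets = torch.arange(audio_emb.size(0))         # diagonal entries are the true pairs
    # Symmetric cross-entropy: audio-to-text and text-to-audio directions.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```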
Qwen-Audio is a multi-task training framework that achieves impressive performance across diverse benchmark tasks without requiring any task-specific fine-tuning, surpassing its counterparts.
Captioning has attracted much attention in image and video understanding, while only a small amount of work examines audio captioning. This paper contributes a Mandarin-annotated dataset for audio captioning within a car scene. A sentence-level loss is proposed, used in tandem with a GRU encoder-decoder model, to generate captions with higher semantic similarity to human annotations. We evaluate the model on the newly proposed Car dataset, a previously published Mandarin Hospital dataset, and the Joint dataset, indicating its generalization capability across different scenes. An improvement in all metrics can be observed, including classical natural language generation (NLG) metrics, sentence richness, and human evaluation ratings. However, although detailed audio captions can now be generated automatically, human annotations still outperform model captions in many respects.
This work presents a sequence-to-sequence approach that explicitly takes advantage of the difference in length between the audio and caption sequences by applying temporal sub-sampling to the audio input sequence.
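A minimal sketch of temporal sub-sampling is shown below: simple mean pooling over fixed-size groups of encoder frames, with an arbitrary factor. It illustrates the idea of shortening the audio sequence relative to the caption, not the paper's exact mechanism.

```python
# Illustrative temporal sub-sampling of a frame-level feature sequence.
import torch

def subsample(features, factor=4):
    """features: (batch, time, dim) encoder outputs; returns (batch, time // factor, dim)."""
    batch, time, dim = features.shape
    time = (time // factor) * factor                  # drop frames that don't fill a group
    grouped = features[:, :time].reshape(batch, time // factor, factor, dim)
    return grouped.mean(dim=2)                        # average each group of `factor` frames

frames = torch.randn(2, 1000, 256)                    # ~1000 audio frames vs. a ~20-word caption
print(subsample(frames).shape)                        # torch.Size([2, 250, 256])
```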
This paper proposes two methods to mitigate the class imbalance problem in an autoencoder setting for audio captioning, and defines a multi-label side task based on clip-level content word detection by training a separate decoder.
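To illustrate the side-task idea, the sketch below attaches a clip-level multi-label head (a plain linear classifier here, rather than the separate decoder used in the paper) that predicts which content words appear in the reference caption and is trained with binary cross-entropy alongside the captioning loss; the vocabulary size, pooling, and dummy labels are assumptions for the example.

```python
# Sketch of a clip-level multi-label content-word detection side task.
import torch
import torch.nn as nn

class ContentWordHead(nn.Module):
    def __init__(self, d_model=256, n_content_words=1000):
        super().__init__()
        self.classifier = nn.Linear(d_model, n_content_words)

    def forward(self, audio_memory):                   # (batch, time, d_model)
        pooled = audio_memory.mean(dim=1)              # clip-level pooling over time
        return self.classifier(pooled)                 # (batch, n_content_words) logits

head = ContentWordHead()
audio_memory = torch.randn(2, 100, 256)                # encoder outputs for 2 clips
word_targets = torch.randint(0, 2, (2, 1000)).float()  # multi-hot content-word labels
side_loss = nn.functional.binary_cross_entropy_with_logits(head(audio_memory), word_targets)
# side_loss would be added (with a weighting factor) to the captioning loss.
```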
This work presents the first music audio captioning model, MusCaps, consisting of an encoder-decoder with temporal attention, which represents a shift away from classification-based music description and combines tasks requiring both auditory and linguistic understanding to bridge the semantic gap in music information retrieval.