3260 papers • 126 benchmarks • 313 datasets
Audio Captioning is the task of describing audio content in natural language. The general approach is to use an audio encoder (e.g. PANN, CAV-MAE) to encode the audio and a decoder (e.g. a transformer) to generate the text. Caption quality is commonly judged with machine translation metrics (BLEU, METEOR, ROUGE) and image captioning metrics (SPICE, CIDEr), although these are not well suited to audio descriptions; metrics based on pretrained language models, such as Sentence-BERT, have also been explored.
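As a rough illustration of this encoder-decoder recipe, the PyTorch sketch below pairs a small CNN over log-mel spectrograms (standing in for a pretrained encoder such as PANN) with a transformer decoder trained by teacher forcing. All module choices, dimensions, and the dummy data are illustrative assumptions, not any particular published model.

```python
# Minimal sketch of the generic audio-captioning pipeline: an audio encoder that
# produces a sequence of frame embeddings, and a transformer decoder that
# generates the caption token by token. Everything here is illustrative.
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    """Stand-in for a pretrained audio encoder: a small CNN over log-mel spectrograms."""
    def __init__(self, n_mels=64, d_model=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.proj = nn.Linear(64 * (n_mels // 4), d_model)

    def forward(self, mel):                      # mel: (batch, 1, time, n_mels)
        x = self.conv(mel)                       # (batch, 64, time/4, n_mels/4)
        x = x.permute(0, 2, 1, 3).flatten(2)     # (batch, time/4, 64 * n_mels/4)
        return self.proj(x)                      # (batch, time/4, d_model)

class CaptionDecoder(nn.Module):
    """Transformer decoder that attends over the encoded audio frames."""
    def __init__(self, vocab_size, d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, audio_memory):     # tokens: (batch, seq_len)
        seq_len = tokens.size(1)
        # Causal mask so each position only attends to earlier caption tokens.
        mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        out = self.decoder(self.embed(tokens), audio_memory, tgt_mask=mask)
        return self.lm_head(out)                 # (batch, seq_len, vocab_size)

# Training step: teacher-forced cross-entropy against the reference caption.
encoder, decoder = AudioEncoder(), CaptionDecoder(vocab_size=5000)
mel = torch.randn(2, 1, 400, 64)                 # dummy log-mel batch
tokens = torch.randint(0, 5000, (2, 20))         # dummy caption token ids
logits = decoder(tokens[:, :-1], encoder(mel))
loss = nn.functional.cross_entropy(logits.reshape(-1, 5000), tokens[:, 1:].reshape(-1))
```

At inference time the decoder would instead be run autoregressively (greedy or beam search), feeding back its own predictions.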
These leaderboards are used to track progress in Audio Captioning
Use these libraries to find Audio Captioning models and implementations
Clotho, a dataset for audio captioning consisting of 4981 audio samples of 15 to 30 seconds duration and 24,905 captions of eight to 20 words in length, is presented, together with a baseline method to provide initial results.
This work introduces WavCaps, the first large-scale weakly-labelled audio captioning dataset, and proposes a three-stage processing pipeline for filtering noisy data and generating high-quality captions, where ChatGPT, a large language model, is leveraged to filter and transform raw descriptions automatically.
In CL4AC, self-supervision signals derived from the original audio-text paired data are used to exploit the correspondences between audio and text by contrasting samples, which can improve the quality of the latent representation and the alignment between audio and text, even when trained with limited data.
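For context, a generic contrastive objective over paired audio and caption embeddings looks like the sketch below; this is a standard InfoNCE-style loss written as an illustration, not CL4AC's exact formulation, and the temperature value is an assumption.

```python
# Illustrative contrastive audio-text loss: matched clip/caption embeddings are
# pulled together, mismatched pairs within the batch are pushed apart.
import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """audio_emb, text_emb: (batch, dim) embeddings of paired clips and captions."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = audio_emb @ text_emb.t() / temperature   # (batch, batch) similarity matrix
    targets = torch.arange(audio_emb.size(0))         # diagonal entries are the true pairs
    # Symmetric cross-entropy: audio-to-text and text-to-audio directions.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```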
Qwen-Audio is a multi-task training framework that achieves impressive performance across diverse benchmark tasks without requiring any task-specific fine-tuning, surpassing its counterparts.
Captioning has attracted much attention in image and video understanding, while only a small amount of work examines audio captioning. This paper contributes a Mandarin-annotated dataset for audio captioning within a car scene. A sentence-level loss is proposed, used in tandem with a GRU encoder-decoder model, to generate captions with higher semantic similarity to human annotations. We evaluate the model on the newly proposed Car dataset, a previously published Mandarin Hospital dataset, and the Joint dataset, indicating its generalization capability across different scenes. An improvement in all metrics can be observed, including classical natural language generation (NLG) metrics, sentence richness, and human evaluation ratings. However, although detailed audio captions can now be generated automatically, human annotations still outperform model captions in many respects.
This work presents a sequence-to-sequence approach that explicitly takes advantage of the difference in length between the audio and caption sequences by applying temporal sub-sampling to the audio input sequence.
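A minimal sketch of temporal sub-sampling is shown below: simple mean pooling over fixed-size groups of encoder frames, with an arbitrary factor. It illustrates the idea of shortening the audio sequence relative to the caption, not the paper's exact mechanism.

```python
# Illustrative temporal sub-sampling of a frame-level feature sequence.
import torch

def subsample(features, factor=4):
    """features: (batch, time, dim) encoder outputs; returns (batch, time // factor, dim)."""
    batch, time, dim = features.shape
    time = (time // factor) * factor                  # drop frames that don't fill a group
    grouped = features[:, :time].reshape(batch, time // factor, factor, dim)
    return grouped.mean(dim=2)                        # average each group of `factor` frames

frames = torch.randn(2, 1000, 256)                    # ~1000 audio frames vs. a ~20-word caption
print(subsample(frames).shape)                        # torch.Size([2, 250, 256])
```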
This paper proposes two methods to mitigate the class imbalance problem in an autoencoder setting for audio captioning, and defines a multi-label side task based on clip-level content word detection by training a separate decoder.
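To illustrate the side-task idea, the sketch below attaches a clip-level multi-label head (a plain linear classifier here, rather than the separate decoder used in the paper) that predicts which content words appear in the reference caption and is trained with binary cross-entropy alongside the captioning loss; the vocabulary size, pooling, and dummy labels are assumptions for the example.

```python
# Sketch of a clip-level multi-label content-word detection side task.
import torch
import torch.nn as nn

class ContentWordHead(nn.Module):
    def __init__(self, d_model=256, n_content_words=1000):
        super().__init__()
        self.classifier = nn.Linear(d_model, n_content_words)

    def forward(self, audio_memory):                   # (batch, time, d_model)
        pooled = audio_memory.mean(dim=1)              # clip-level pooling over time
        return self.classifier(pooled)                 # (batch, n_content_words) logits

head = ContentWordHead()
audio_memory = torch.randn(2, 100, 256)                # encoder outputs for 2 clips
word_targets = torch.randint(0, 2, (2, 1000)).float()  # multi-hot content-word labels
side_loss = nn.functional.binary_cross_entropy_with_logits(head(audio_memory), word_targets)
# side_loss would be added (with a weighting factor) to the captioning loss.
```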
This work presents the first music audio captioning model, MusCaps, consisting of an encoder-decoder with temporal attention, which represents a shift away from classification-based music description and combines tasks requiring both auditory and linguistic understanding to bridge the semantic gap in music information retrieval.