Audio Captioning is the task of describing the content of an audio clip in natural language. The typical approach is an encoder-decoder architecture: an audio encoder (e.g., PANNs, CAV-MAE) encodes the audio into a sequence of embeddings, and a decoder (e.g., a transformer) generates the caption text. Caption quality is commonly judged with machine translation metrics (BLEU, METEOR, ROUGE) and image captioning metrics (SPICE, CIDEr), but these are not well suited to audio captions; more recent work instead uses metrics based on pretrained language models such as Sentence-BERT.
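The sketch below illustrates the encoder-decoder pattern in PyTorch. It is a minimal, hypothetical model, not a specific published system: a small CNN stands in for a pretrained audio encoder such as PANNs or CAV-MAE, and a transformer decoder generates caption tokens; all layer sizes are illustrative.

```python
# Minimal encoder-decoder audio captioner (hypothetical, for illustration).
# A CNN encodes a log-mel spectrogram into a sequence of embeddings; a
# transformer decoder attends to them and predicts caption tokens.
import torch
import torch.nn as nn

class AudioCaptioner(nn.Module):
    def __init__(self, vocab_size=10000, d_model=256):
        super().__init__()
        # Audio encoder: a small CNN standing in for a pretrained encoder
        # such as PANNs or CAV-MAE (assumption: sizes are illustrative).
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, d_model, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.token_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, mel, tokens):
        # mel: (batch, 1, n_mels, time); tokens: (batch, seq_len) token ids
        feats = self.encoder(mel)                       # (B, d_model, M', T')
        memory = feats.flatten(2).transpose(1, 2)       # (B, M'*T', d_model)
        tgt = self.token_emb(tokens)                    # (B, seq_len, d_model)
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        out = self.decoder(tgt, memory, tgt_mask=mask)  # causal decoding
        return self.lm_head(out)                        # (B, seq_len, vocab)

model = AudioCaptioner()
logits = model(torch.randn(2, 1, 64, 400), torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 10000])
```

In practice the encoder is pretrained on large audio datasets and either frozen or fine-tuned, and the decoder is trained with teacher forcing on paired audio-caption data.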
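The following sketch shows the idea behind a Sentence-BERT-based metric: embed the candidate and reference captions and score their semantic similarity by cosine similarity. It assumes the sentence-transformers package and the all-MiniLM-L6-v2 checkpoint; full metrics in the literature (e.g., FENSE) add further components such as a fluency-error penalty.

```python
# Minimal sketch of a Sentence-BERT-style caption metric (illustrative only).
# Embeds captions with a sentence encoder and scores cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

candidate = "a dog barks while cars pass by"
references = [
    "a dog is barking near a busy road",
    "traffic noise with a barking dog",
]

cand_emb = model.encode(candidate, convert_to_tensor=True)
ref_embs = model.encode(references, convert_to_tensor=True)

# Score against each reference and take the max, a common choice when a
# clip has multiple reference captions.
scores = util.cos_sim(cand_emb, ref_embs)   # shape (1, num_references)
print(f"SBERT similarity: {scores.max().item():.3f}")
```

Unlike n-gram metrics such as BLEU or CIDEr, this rewards paraphrases that describe the same sounds in different words, which is why embedding-based metrics correlate better with human judgments of audio captions.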