Zero-shot audio captioning aims to automatically generate descriptive textual captions for audio content without any task-specific training. Audio captioning is commonly concerned with ambient sounds or sounds produced by a human performing an action.
These leaderboards are used to track progress in Zero-Shot Audio Captioning.
Use these libraries to find Zero-Shot Audio Captioning models and implementations.
This work proposes ZerAuCap, a framework for summarising general audio signals in a text caption without requiring task-specific training. It achieves state-of-the-art zero-shot audio captioning results on the AudioCaps and Clotho datasets.
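Zero-shot approaches of this kind typically rely on a pretrained audio-text embedding model (such as CLAP) to score candidate captions against the audio, rather than training a captioner end to end. A minimal sketch of that ranking step, using toy NumPy vectors in place of real audio and text embeddings (the embeddings and captions below are illustrative assumptions, not outputs of any specific model):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_captions(audio_emb, caption_embs, captions):
    """Rank candidate captions by audio-text similarity (best first)."""
    scores = [cosine(audio_emb, e) for e in caption_embs]
    order = np.argsort(scores)[::-1]
    return [(captions[i], scores[i]) for i in order]

# Toy 3-d embeddings standing in for real audio/text encoder outputs.
audio = np.array([1.0, 0.0, 0.2])  # pretend this encodes a dog-bark clip
captions = ["a dog barks", "rain falls", "a car engine idles"]
embs = [
    np.array([0.9, 0.1, 0.1]),  # close to the audio embedding
    np.array([0.0, 1.0, 0.0]),
    np.array([0.1, 0.2, 0.9]),
]

best, score = rank_captions(audio, embs, captions)[0]
print(best)  # → a dog barks
```

In a real system the candidate captions would come from a language model and the embeddings from the audio and text towers of a pretrained contrastive model; the ranking logic stays the same.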