Audio-visual zero-shot learning aims to recognize unseen categories based on paired audio-visual sequences.
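The following is a minimal PyTorch sketch of this setup, assuming precomputed per-clip audio and visual features and per-class text embeddings; `AVProjector`, the feature dimensions, and the fusion-by-averaging choice are illustrative assumptions, not a reference implementation of any particular method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AVProjector(nn.Module):
    """Hypothetical module mapping audio and visual features into a
    shared space where each class is represented by a text embedding."""
    def __init__(self, audio_dim=128, video_dim=512, embed_dim=300):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, embed_dim)
        self.video_proj = nn.Linear(video_dim, embed_dim)

    def forward(self, audio, video):
        # Fuse modalities by averaging their L2-normalised projections.
        a = F.normalize(self.audio_proj(audio), dim=-1)
        v = F.normalize(self.video_proj(video), dim=-1)
        return F.normalize(a + v, dim=-1)

def gzsl_predict(model, audio, video, class_embeds):
    """GZSL inference: `class_embeds` stacks seen AND unseen class
    embeddings, so unseen categories compete with the seen ones."""
    av = model(audio, video)                         # (B, D)
    sims = av @ F.normalize(class_embeds, dim=-1).T  # (B, C) cosine sims
    return sims.argmax(dim=-1)

# Toy call: 4 clips, 10 seen + 5 unseen classes, random features.
preds = gzsl_predict(AVProjector(), torch.randn(4, 128),
                     torch.randn(4, 512), torch.randn(15, 300))
```

In the generalised setting, the prediction is taken over the union of seen and unseen classes, which is what makes the task harder than standard zero-shot evaluation over unseen classes alone.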
These leaderboards are used to track progress in GZSL video classification.
This paper introduces a (generalised) zero-shot learning benchmark on three audio-visual datasets of varying size and difficulty (VGGSound, UCF, and ActivityNet), ensuring that the unseen test classes do not appear in the dataset used for the supervised training of the backbone deep models.
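As a hedged sketch of what this protocol implies in practice, the helper below checks that the unseen classes are disjoint from the backbone's pre-training labels; the function name and the class lists are placeholders that would come from the benchmark metadata.

```python
# Sanity check for a GZSL split: unseen test classes must be novel to
# the backbone, i.e. absent from its supervised pre-training labels.
def check_gzsl_split(backbone_classes, seen_classes, unseen_classes):
    leaked = set(unseen_classes) & set(backbone_classes)
    assert not leaked, f"Unseen classes seen by the backbone: {leaked}"
    assert not set(seen_classes) & set(unseen_classes), "Seen/unseen overlap"

check_gzsl_split(
    backbone_classes=["dog barking", "playing piano"],  # backbone labels
    seen_classes=["dog barking", "playing piano"],
    unseen_classes=["skateboarding"],                   # novel to the backbone
)
```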
This paper proposes to leverage the knowledge contained in large language models to generate descriptive sentences that capture the distinguishing audio-visual features of event classes, which helps the model better understand unseen categories. It further proposes a knowledge-aware adaptive margin loss to help distinguish similar events.
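Below is a hedged sketch of what such a loss can look like, assuming the class-to-class similarities are derived from embeddings of the LLM-generated descriptions; the exact formulation here (additive margins on negative logits, a fixed temperature `scale`) is an illustrative assumption rather than the paper's definitive method.

```python
import torch
import torch.nn.functional as F

def adaptive_margin_loss(av_embeds, class_embeds, labels,
                         base_margin=0.2, scale=10.0):
    """One plausible form of a knowledge-aware adaptive margin loss:
    the margin added to each negative class grows with that class's
    semantic similarity to the ground-truth class, pushing easily
    confused events further apart."""
    av = F.normalize(av_embeds, dim=-1)
    cls = F.normalize(class_embeds, dim=-1)
    logits = av @ cls.T                            # (B, C) cosine sims

    # Class-to-class similarity, e.g. from embeddings of the
    # LLM-generated descriptions; detached so margins act as constants.
    class_sim = (cls @ cls.T).detach()             # (C, C)
    margins = base_margin * class_sim[labels]      # (B, C)
    margins.scatter_(1, labels.unsqueeze(1), 0.0)  # no margin on true class

    return F.cross_entropy(scale * (logits + margins), labels)
```

Scaling each margin by description similarity is what encodes the LLM knowledge into the decision boundaries: classes the descriptions mark as similar must be separated by a larger gap than clearly distinct ones.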