Audio-visual zero-shot learning aims to recognize classes that were unseen during training, based on paired audio-visual sequences.
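As a concrete illustration, the sketch below shows the common embedding-matching recipe for this task: fuse per-clip audio and visual features into a joint space and assign each clip to the nearest class-label (text) embedding. The module names, feature dimensions, and the concatenation-based fusion are illustrative assumptions, not any particular published method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AVZeroShotClassifier(nn.Module):
    """Fuse audio and visual clip features into a joint space and match
    them against text embeddings of (unseen) class names."""

    def __init__(self, audio_dim=128, visual_dim=512, embed_dim=300):
        super().__init__()
        # Learned projection from concatenated A/V features into the
        # class-embedding space (dimensions are illustrative).
        self.proj = nn.Linear(audio_dim + visual_dim, embed_dim)

    def forward(self, audio_feat, visual_feat, class_embeds):
        fused = torch.cat([audio_feat, visual_feat], dim=-1)  # (B, Da+Dv)
        z = F.normalize(self.proj(fused), dim=-1)             # (B, D)
        w = F.normalize(class_embeds, dim=-1)                 # (C, D)
        return z @ w.t()                                      # cosine logits

# Toy usage: 4 clips, 10 unseen classes, random stand-in features.
model = AVZeroShotClassifier()
logits = model(torch.randn(4, 128), torch.randn(4, 512), torch.randn(10, 300))
preds = logits.argmax(dim=-1)  # index of the nearest class embedding
```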
The proposed framework, which ingests temporal features, is shown to yield state-of-the-art performance on the UCF-GZSL, VGGSound-GZSL, and ActivityNet-GZSL benchmarks for (generalised) zero-shot learning.
This paper first proposes to exploit the knowledge contained in large language models, generating numerous descriptive sentences that capture the distinguishing audio-visual features of event classes and thereby improving the understanding of unseen categories. It further proposes a knowledge-aware adaptive margin loss that helps to separate similar events.
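To make the margin idea concrete, here is a hedged sketch of an adaptive margin loss in PyTorch: the margin applied against each negative class grows with the similarity between that class's (LLM-generated) description embedding and the true class's, so that easily confused events must be separated by a larger gap. The function name, the additive-margin formulation, and the similarity scaling are illustrative assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def adaptive_margin_loss(logits, targets, desc_embeds, base_margin=0.2):
    """Cross-entropy with a per-class-pair additive margin that grows
    with the semantic similarity of the class descriptions.

    logits:      (B, C) sample-to-class similarity scores
    targets:     (B,)   ground-truth class indices
    desc_embeds: (C, D) embeddings of the class description sentences
    """
    w = F.normalize(desc_embeds, dim=-1)
    class_sim = (w @ w.t()).clamp(min=0.0)      # (C, C) class similarity
    margins = base_margin * class_sim[targets]  # (B, C) per-pair margins
    # No margin against the true class itself.
    mask = F.one_hot(targets, num_classes=logits.size(1)).bool()
    margins = margins.masked_fill(mask, 0.0)
    # Inflating negative logits by their margin forces the true class to
    # win by a similarity-dependent gap.
    return F.cross_entropy(logits + margins, targets)

# Toy usage: 4 samples, 10 classes, 300-d description embeddings.
loss = adaptive_margin_loss(torch.randn(4, 10), torch.randint(0, 10, (4,)),
                            torch.randn(10, 300))
```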