Zero-Shot Action Recognition is the task of classifying videos into action categories that were not seen during training, typically by aligning visual features with semantic class representations such as attributes or text embeddings.
These leaderboards are used to track progress in Zero-Shot Action Recognition.
Use these libraries to find Zero-Shot Action Recognition models and implementations.
This paper revisits the role of the linear classifier, replacing it with knowledge from a pre-trained model, and uses a well-pretrained language model to generate good semantic targets for efficient transfer learning.
This paper proposes to use the visual space as the embedding space instead of embedding into a semantic space or an intermediate space, and argues that in this space, the subsequent nearest neighbour search would suffer much less from the hubness problem and thus become more effective.
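The nearest-neighbour step then reduces to a cosine search against class prototypes that have been projected into the visual feature space. Below is a minimal sketch of that search, assuming the projection has already been applied; the function names and random embeddings are illustrative stand-ins, not the paper's code.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Unit-normalize so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def nearest_class_in_visual_space(visual_feat, class_protos):
    """Classify one visual feature by its nearest class prototype.

    visual_feat: (d,) feature of a test video/image.
    class_protos: (C, d) semantic class vectors already projected
        into the *visual* space (the paper's key choice: searching
        here suffers less from hubness than a semantic space).
    """
    sims = l2_normalize(class_protos) @ l2_normalize(visual_feat)
    return int(np.argmax(sims))

# Toy usage with random stand-ins for real embeddings.
rng = np.random.default_rng(0)
protos = rng.normal(size=(5, 512))   # 5 unseen classes
feat = rng.normal(size=512)          # one test sample
print(nearest_class_in_visual_space(feat, protos))
```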
A novel framework called BIKE utilizes a cross-modal bridge to explore bidirectional knowledge and introduces a Temporal Concept Spotting mechanism that uses Text-to-Video expertise to capture temporal saliency in a parameter-free manner, leading to enhanced video representations.
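The temporal-saliency idea can be sketched as text-conditioned pooling: frames more similar to the category text embedding receive larger weights, with no learned parameters. A rough sketch assuming precomputed CLIP-like frame and text features; the temperature value and variable names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def saliency_weighted_video_feature(frame_feats, text_feat, temperature=0.07):
    """Pool frame features using text-conditioned temporal saliency.

    frame_feats: (T, d) per-frame visual features.
    text_feat:   (d,)   embedding of the category text.
    Frames more similar to the text get larger pooling weights;
    no extra parameters are introduced.
    """
    frame_feats = F.normalize(frame_feats, dim=-1)
    text_feat = F.normalize(text_feat, dim=-1)
    sims = frame_feats @ text_feat                   # (T,) frame-text similarity
    weights = (sims / temperature).softmax(dim=0)    # temporal saliency
    return (weights[:, None] * frame_feats).sum(dim=0)  # (d,) video feature
```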
This work proposes LanguageBind, taking language as the bind across different modalities because the language modality is well-explored and contains rich semantics; it freezes the language encoder acquired by VL pretraining, then trains encoders for the other modalities with contrastive learning.
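The training recipe amounts to a standard CLIP-style symmetric contrastive (InfoNCE) objective with the language tower frozen. A minimal sketch under that assumption; `text_encoder` is a placeholder module name, and the exact loss details may differ from the paper.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(modality_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of (modality, text) pairs."""
    modality_emb = F.normalize(modality_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = modality_emb @ text_emb.t() / temperature   # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matched pairs sit on the diagonal; penalize both directions.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Freezing the pretrained language tower (placeholder module name):
# for p in text_encoder.parameters():
#     p.requires_grad = False
```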
This paper proposes a series of models that significantly improve the efficiency and effectiveness of CLIP training, and incorporates new techniques for representation learning, optimization, and augmentation, enabling EVA-CLIP to achieve superior performance compared to previous CLIP models with the same number of parameters but significantly smaller training costs.
This project shows that compelling classification performance can be achieved on fine-grained categories even without labeled training data, and establishes a substantially improved state-of-the-art on the Animals with Attributes and Caltech-UCSD Birds datasets.
This work proposes to view attribute-based image classification as a label-embedding problem: each class is embedded in the space of attribute vectors, and introduces a function that measures the compatibility between an image and a label embedding.
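The compatibility function in this line of work is typically the bilinear form F(x, y) = θ(x)ᵀ W φ(y), where θ(x) is the image feature, φ(y) the class attribute embedding, and W a learned matrix. A toy sketch with random stand-ins for the real features:

```python
import numpy as np

def compatibility(theta_x, W, phi_Y):
    """Bilinear compatibility between one image and all labels.

    theta_x: (d,)    image feature theta(x).
    W:       (d, a)  learned compatibility matrix.
    phi_Y:   (C, a)  per-class attribute embeddings phi(y).
    Returns (C,) scores F(x, y) = theta(x)^T W phi(y).
    """
    return phi_Y @ (W.T @ theta_x)

rng = np.random.default_rng(0)
theta_x = rng.normal(size=64)
W = rng.normal(size=(64, 16))
phi_Y = rng.normal(size=(10, 16))      # 10 classes, 16 attributes
print(int(np.argmax(compatibility(theta_x, W, phi_Y))))  # predicted class
```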
An instantiation of the new paradigm, ActionCLIP, not only has superior and flexible zero-shot/few-shot transfer ability but also reaches top performance on general action recognition tasks, achieving 83.8% top-1 accuracy on Kinetics-400 with a ViT-B/16 backbone.
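At inference time, the video-text matching paradigm reduces to ranking label prompts by similarity to the video embedding. A minimal sketch with a hypothetical CLIP-like interface; `encode_video`, `encode_text`, and the prompt template are illustrative assumptions, not the paper's API.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_action(video, action_names, encode_video, encode_text):
    """Rank action labels by similarity to the video embedding."""
    prompts = [f"a video of a person {a}" for a in action_names]
    v = F.normalize(encode_video(video), dim=-1)   # (d,)  video embedding
    t = F.normalize(encode_text(prompts), dim=-1)  # (C, d) label embeddings
    return action_names[int(torch.argmax(t @ v))]
```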
This work enables fine-grained video-text interactions while maintaining high efficiency for retrieval via a novel pretext task, dubbed Multiple Choice Questions (MCQ), where a parametric module, BridgeFormer, is trained to answer the "questions" constructed from the text features by resorting to the video features.
This paper presents a fine-tuning strategy to refine these large-scale pretrained image-text models for zero-shot video understanding tasks and shows that by carefully adapting these models they obtain considerable improvements on two zero-shot Action Recognition tasks and three Text-to-Video Retrieval tasks.