Zero-Shot Action Recognition is the task of classifying videos into action categories that were not seen during training, typically by aligning visual features with semantic class representations such as attributes or text embeddings.
These leaderboards are used to track progress in Zero-Shot Action Recognition.
Use these libraries to find Zero-Shot Action Recognition models and implementations.
This paper revisits the role of the linear classifier, replacing it with knowledge from a pre-trained model, and uses a well-pretrained language model to generate good semantic targets for efficient transfer learning.
This paper proposes to use the visual space as the embedding space instead of embedding into a semantic space or an intermediate space, and argues that in this space, the subsequent nearest neighbour search would suffer much less from the hubness problem and thus become more effective.
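The nearest-neighbour step then reduces to a cosine search against class prototypes that have been projected into the visual feature space. Below is a minimal sketch of that search, assuming the projection has already been applied; the function names and random embeddings are illustrative stand-ins, not the paper's code.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Unit-normalize so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def nearest_class_in_visual_space(visual_feat, class_protos):
    """Classify one visual feature by its nearest class prototype.

    visual_feat: (d,) feature of a test video/image.
    class_protos: (C, d) semantic class vectors already projected
        into the *visual* space (the paper's key choice: searching
        here suffers less from hubness than a semantic space).
    """
    sims = l2_normalize(class_protos) @ l2_normalize(visual_feat)
    return int(np.argmax(sims))

# Toy usage with random stand-ins for real embeddings.
rng = np.random.default_rng(0)
protos = rng.normal(size=(5, 512))   # 5 unseen classes
feat = rng.normal(size=512)          # one test sample
print(nearest_class_in_visual_space(feat, protos))
```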
A novel framework called BIKE utilizes a cross-modal bridge to explore bidirectional knowledge and introduces a Temporal Concept Spotting mechanism that uses Text-to-Video expertise to capture temporal saliency in a parameter-free manner, leading to enhanced video representations.
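The temporal-saliency idea can be sketched as text-conditioned pooling: frames more similar to the category text embedding receive larger weights, with no learned parameters. A rough sketch assuming precomputed CLIP-like frame and text features; the temperature value and variable names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def saliency_weighted_video_feature(frame_feats, text_feat, temperature=0.07):
    """Pool frame features using text-conditioned temporal saliency.

    frame_feats: (T, d) per-frame visual features.
    text_feat:   (d,)   embedding of the category text.
    Frames more similar to the text get larger pooling weights;
    no extra parameters are introduced.
    """
    frame_feats = F.normalize(frame_feats, dim=-1)
    text_feat = F.normalize(text_feat, dim=-1)
    sims = frame_feats @ text_feat                   # (T,) frame-text similarity
    weights = (sims / temperature).softmax(dim=0)    # temporal saliency
    return (weights[:, None] * frame_feats).sum(dim=0)  # (d,) video feature
```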
This work proposes LanguageBind, taking language as the bind across different modalities because the language modality is well-explored and contains rich semantics; it freezes the language encoder acquired by VL pretraining, then trains encoders for the other modalities with contrastive learning.
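The training recipe amounts to a standard CLIP-style symmetric contrastive (InfoNCE) objective with the language tower frozen. A minimal sketch under that assumption; `text_encoder` is a placeholder module name, and the exact loss details may differ from the paper.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(modality_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of (modality, text) pairs."""
    modality_emb = F.normalize(modality_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = modality_emb @ text_emb.t() / temperature   # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matched pairs sit on the diagonal; penalize both directions.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Freezing the pretrained language tower (placeholder module name):
# for p in text_encoder.parameters():
#     p.requires_grad = False
```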
This paper proposes a series of models that significantly improve the efficiency and effectiveness of CLIP training, and incorporates new techniques for representation learning, optimization, and augmentation, enabling EVA-CLIP to achieve superior performance compared to previous CLIP models with the same number of parameters but significantly smaller training costs.
This project shows that compelling classification performance can be achieved on fine-grained categories even without labeled training data, and establishes a substantially improved state-of-the-art on the Animals with Attributes and Caltech-UCSD Birds datasets.
This work proposes to view attribute-based image classification as a label-embedding problem: each class is embedded in the space of attribute vectors, and introduces a function that measures the compatibility between an image and a label embedding.
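The compatibility function in this line of work is typically the bilinear form F(x, y) = θ(x)ᵀ W φ(y), where θ(x) is the image feature, φ(y) the class attribute embedding, and W a learned matrix. A toy sketch with random stand-ins for the real features:

```python
import numpy as np

def compatibility(theta_x, W, phi_Y):
    """Bilinear compatibility between one image and all labels.

    theta_x: (d,)    image feature theta(x).
    W:       (d, a)  learned compatibility matrix.
    phi_Y:   (C, a)  per-class attribute embeddings phi(y).
    Returns (C,) scores F(x, y) = theta(x)^T W phi(y).
    """
    return phi_Y @ (W.T @ theta_x)

rng = np.random.default_rng(0)
theta_x = rng.normal(size=64)
W = rng.normal(size=(64, 16))
phi_Y = rng.normal(size=(10, 16))      # 10 classes, 16 attributes
print(int(np.argmax(compatibility(theta_x, W, phi_Y))))  # predicted class
```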
An instantiation of the new paradigm, ActionCLIP, not only has superior and flexible zero-shot/few-shot transfer ability but also reaches top performance on general action recognition tasks, achieving 83.8% top-1 accuracy on Kinetics-400 with a ViT-B/16 backbone.
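At inference time, the video-text matching paradigm reduces to ranking label prompts by similarity to the video embedding. A minimal sketch with a hypothetical CLIP-like interface; `encode_video`, `encode_text`, and the prompt template are illustrative assumptions, not the paper's API.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_action(video, action_names, encode_video, encode_text):
    """Rank action labels by similarity to the video embedding."""
    prompts = [f"a video of a person {a}" for a in action_names]
    v = F.normalize(encode_video(video), dim=-1)   # (d,)  video embedding
    t = F.normalize(encode_text(prompts), dim=-1)  # (C, d) label embeddings
    return action_names[int(torch.argmax(t @ v))]
```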
This work enables fine-grained video-text interactions while maintaining high efficiency for retrieval via a novel pretext task, dubbed Multiple Choice Questions (MCQ), where a parametric module, BridgeFormer, is trained to answer the "questions" constructed from the text features by resorting to the video features.
This paper presents a fine-tuning strategy to refine these large-scale pretrained image-text models for zero-shot video understanding tasks and shows that by carefully adapting these models they obtain considerable improvements on two zero-shot Action Recognition tasks and three Text-to-Video Retrieval tasks.