FitCLIP: Refining Large-Scale Pretrained Image-Text Models for Zero-Shot Video Understanding Tasks (2022-03-24)

TL;DR

This paper presents a fine-tuning strategy that refines large-scale pretrained image-text models for zero-shot video understanding tasks. Carefully adapting these models yields considerable improvements on two zero-shot Action Recognition tasks and three zero-shot Text-to-video Retrieval tasks.

Abstract

Large-scale pretrained image-text models have shown incredible zero-shot performance in a handful of tasks, including video ones such as action recognition and text-to-video retrieval. However, these models have not been adapted to video, mainly because they do not account for the time dimension but also because video frames are different from typical images (e.g., they contain motion blur and are less sharp). In this paper, we present a fine-tuning strategy to refine these large-scale pretrained image-text models for zero-shot video understanding tasks. We show that by carefully adapting these models we obtain considerable improvements on two zero-shot Action Recognition tasks and three zero-shot Text-to-video Retrieval tasks. The code is available at https://github.com/bryant1410/fitclip

Authors

Fabian Caba Heilbron

Santiago Castro
