CLIP4Clip transfers the knowledge of the pretrained CLIP model to video-language retrieval in an end-to-end manner and achieves SOTA results on several video-text retrieval datasets, including MSR-VTT, MSVD, LSMDC, ActivityNet, and DiDeMo.
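The core idea is to reuse CLIP's image and text encoders for video-text matching. A minimal sketch of that idea is shown below, assuming the openai/CLIP package (`pip install git+https://github.com/openai/CLIP.git`); it scores a video against captions by mean-pooling per-frame CLIP embeddings, one of the parameter-free similarity variants discussed in the paper. This is an illustrative sketch, not the authors' implementation.

```python
# Sketch: CLIP-based video-text similarity via mean-pooled frame embeddings.
# Assumes PyTorch and the openai/CLIP package are installed; frames are PIL images.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def video_text_similarity(frames, captions):
    """frames: list of PIL.Image sampled from one video; captions: list of str."""
    with torch.no_grad():
        # Encode each sampled frame with CLIP's image encoder.
        frame_batch = torch.stack([preprocess(f) for f in frames]).to(device)
        frame_emb = model.encode_image(frame_batch)              # (num_frames, d)
        frame_emb = frame_emb / frame_emb.norm(dim=-1, keepdim=True)

        # Parameter-free aggregation: mean-pool frames into one video embedding.
        video_emb = frame_emb.mean(dim=0, keepdim=True)          # (1, d)
        video_emb = video_emb / video_emb.norm(dim=-1, keepdim=True)

        # Encode the candidate captions with CLIP's text encoder.
        text_tokens = clip.tokenize(captions).to(device)
        text_emb = model.encode_text(text_tokens)                # (num_captions, d)
        text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

        # Cosine similarity between the video and every caption.
        return video_emb @ text_emb.T                            # (1, num_captions)
```

For retrieval, this similarity is computed between every video and every caption in the gallery and the candidates are ranked by score; the paper additionally studies learned (sequential and tight) similarity calculators on top of the frame features.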
Authors
Huaishao Luo, Lei Ji, Ming Zhong, Yang Chen, Wen Lei, Nan Duan, Tianrui Li