CLIP4Clip transfers the knowledge of the pretrained CLIP model to video-language retrieval in an end-to-end manner and achieves SOTA results on several video-text retrieval datasets, including MSR-VTT, MSVD, LSMDC, ActivityNet, and DiDeMo.
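The core idea is to reuse CLIP's image and text encoders for video-text matching. A minimal sketch of that idea is shown below, assuming the openai/CLIP package (`pip install git+https://github.com/openai/CLIP.git`); it scores a video against captions by mean-pooling per-frame CLIP embeddings, one of the parameter-free similarity variants discussed in the paper. This is an illustrative sketch, not the authors' implementation.

```python
# Sketch: CLIP-based video-text similarity via mean-pooled frame embeddings.
# Assumes PyTorch and the openai/CLIP package are installed; frames are PIL images.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def video_text_similarity(frames, captions):
    """frames: list of PIL.Image sampled from one video; captions: list of str."""
    with torch.no_grad():
        # Encode each sampled frame with CLIP's image encoder.
        frame_batch = torch.stack([preprocess(f) for f in frames]).to(device)
        frame_emb = model.encode_image(frame_batch)              # (num_frames, d)
        frame_emb = frame_emb / frame_emb.norm(dim=-1, keepdim=True)

        # Parameter-free aggregation: mean-pool frames into one video embedding.
        video_emb = frame_emb.mean(dim=0, keepdim=True)          # (1, d)
        video_emb = video_emb / video_emb.norm(dim=-1, keepdim=True)

        # Encode the candidate captions with CLIP's text encoder.
        text_tokens = clip.tokenize(captions).to(device)
        text_emb = model.encode_text(text_tokens)                # (num_captions, d)
        text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

        # Cosine similarity between the video and every caption.
        return video_emb @ text_emb.T                            # (1, num_captions)
```

For retrieval, this similarity is computed between every video and every caption in the gallery and the candidates are ranked by score; the paper additionally studies learned (sequential and tight) similarity calculators on top of the frame features.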
Authors
Huaishao Luo, Lei Ji, Ming Zhong, Yang Chen, Wen Lei, Nan Duan, Tianrui Li