3260 papers • 126 benchmarks • 313 datasets
Video-text retrieval requires understanding of both video and language together, which makes it different from the purely visual video retrieval task.
(Image credit: Papersgraph)
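At its core, text-to-video retrieval ranks candidate videos by the similarity between a query embedding and video embeddings in a shared space. Below is a minimal sketch of that ranking step; `text_encoder` and `video_encoder` are hypothetical modules standing in for any of the models listed on this page.

```python
import torch
import torch.nn.functional as F

def rank_videos(text_encoder, video_encoder, query, videos):
    """Return video indices sorted from best to worst match for the query.
    Assumes the encoders map a caption / a list of clips into a shared
    d-dimensional embedding space (illustrative names, not a specific API)."""
    q = F.normalize(text_encoder(query), dim=-1)      # (1, d) query embedding
    v = F.normalize(video_encoder(videos), dim=-1)    # (N, d) gallery embeddings
    sims = q @ v.T                                    # (1, N) cosine similarities
    return sims.argsort(dim=-1, descending=True).squeeze(0)
```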
These leaderboards are used to track progress in video retrieval.
Use these libraries to find video retrieval models and implementations.
No subtasks available.
An end-to-end trainable model that is designed to take advantage of both large-scale image and video captioning datasets and yields state-of-the-art results on standard downstream video-retrieval benchmarks including MSR-VTT, MSVD, DiDeMo and LSMDC.
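One practical idea behind mixing image and video captioning data is to treat an image as a single-frame video so both data sources flow through one encoder. A minimal, illustrative sketch of that reshaping (names are assumptions, not the paper's code):

```python
import torch

def as_video(batch: torch.Tensor) -> torch.Tensor:
    """(B, C, H, W) image batch -> (B, T=1, C, H, W) single-frame videos;
    real video batches (already 5-D) pass through unchanged."""
    return batch.unsqueeze(1) if batch.dim() == 4 else batch
```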
A CLIP4Clip model transfers the knowledge of the CLIP model to video-language retrieval in an end-to-end manner and achieves SOTA results on various video-text retrieval datasets, including MSR-VTT, MSVD, LSMDC, ActivityNet, and DiDeMo.
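The simplest way to reuse a CLIP-style image encoder for video-text retrieval is to encode frames independently and pool them into a clip-level embedding; parameter-free mean pooling is one of the similarity calculators studied in this line of work. The sketch below assumes generic `clip_image_encoder` / `clip_text_encoder` modules rather than a specific library API.

```python
import torch
import torch.nn.functional as F

def video_text_similarity(clip_image_encoder, clip_text_encoder, frames, text_tokens):
    """Mean-pool per-frame embeddings into a video embedding, then score it
    against the text embedding with cosine similarity.
    frames: (B, T, C, H, W); text_tokens: tokenized captions for the batch."""
    b, t = frames.shape[:2]
    frame_emb = clip_image_encoder(frames.flatten(0, 1))            # (B*T, d)
    video_emb = frame_emb.view(b, t, -1).mean(dim=1)                # (B, d), parameter-free pooling
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(clip_text_encoder(text_tokens), dim=-1)  # (B, d)
    return video_emb @ text_emb.T                                   # (B, B) similarity matrix
```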
This work proposes LanguageBind, which takes language as the bind across different modalities because the language modality is well explored and contains rich semantics; it freezes the language encoder acquired by VL pretraining and then trains encoders for other modalities with contrastive learning.
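The training recipe described above amounts to a symmetric contrastive (InfoNCE-style) objective with the language tower kept frozen. A minimal sketch of one such step, assuming hypothetical `language_encoder` / `modality_encoder` modules and a standard optimizer:

```python
import torch
import torch.nn.functional as F

def contrastive_step(language_encoder, modality_encoder, optimizer,
                     texts, modality_batch, temperature=0.07):
    """One training step: the language encoder stays frozen; only the
    modality encoder (video, audio, depth, ...) receives gradients."""
    with torch.no_grad():                                  # frozen language tower
        t = F.normalize(language_encoder(texts), dim=-1)   # (B, d)
    m = F.normalize(modality_encoder(modality_batch), dim=-1)  # (B, d)
    logits = m @ t.T / temperature                         # (B, B)
    labels = torch.arange(logits.size(0), device=logits.device)
    loss = (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2         # symmetric InfoNCE
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```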
The results show that the proposed CAMoE and DSL are highly effective, and each is capable of achieving state-of-the-art (SOTA) results individually on various benchmarks such as MSR-VTT, MSVD, and LSMDC.
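The dual softmax idea re-weights a text-video similarity matrix by a prior computed with a softmax along the opposite axis, so the two retrieval directions inform each other. The sketch below shows a common post-processing form of this operation; the exact placement (in the training loss versus as inference re-ranking) and the temperature follow the paper, not this snippet.

```python
import torch.nn.functional as F

def dual_softmax_rerank(sim, temperature=100.0):
    """sim: (num_texts, num_videos) similarity matrix (rows = texts).
    Re-weight each score by a per-video softmax over texts, then use the
    revised matrix for text-to-video ranking."""
    prior = F.softmax(sim * temperature, dim=0)  # distribution over texts per video
    return sim * prior                           # revised text-to-video scores
```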
The recently released Ego4D dataset is exploited to pioneer Egocentric VLP along three directions, and a novel pretraining objective is proposed, dubbed EgoNCE, which adapts video-text contrastive learning to the egocentric domain by mining egocentric-aware positive and negative samples.
UniAdapter is proposed, which unifies unimodal and multimodal adapters for parameter-efficient cross-modal adaptation of pre-trained vision-language models; in most cases, UniAdapter not only outperforms the state of the art but even beats the full fine-tuning strategy.
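Adapter-based tuning of this kind keeps the pre-trained backbone frozen and trains only small bottleneck modules inserted into the network. The sketch below is a generic bottleneck adapter, not UniAdapter's exact design (which additionally shares parameters across modalities); the class name and bottleneck size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Down-project, nonlinearity, up-project, residual connection."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The residual keeps the frozen backbone's output as the default path.
        return x + self.up(self.act(self.down(x)))

# Parameter-efficient tuning: optimize only adapter parameters, e.g.
# params = [p for n, p in model.named_parameters() if "adapter" in n]
```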
This work enables fine-grained video-text interactions while maintaining high efficiency for retrieval via a novel pretext task, dubbed Multiple Choice Questions (MCQ), in which a parametric module, BridgeFormer, is trained to answer "questions" constructed from the text features by resorting to the video features.
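The mechanism described above boils down to text-derived "question" features cross-attending to video features to produce "answer" features. The block below is an illustrative cross-attention module in that spirit; the real BridgeFormer architecture and training objective differ, and all names here are assumptions.

```python
import torch
import torch.nn as nn

class QuestionAnsweringBridge(nn.Module):
    """Question tokens (from text) attend over video tokens to form answers."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, question_tokens: torch.Tensor, video_tokens: torch.Tensor):
        # question_tokens: (B, Lq, d); video_tokens: (B, Lv, d)
        answers, _ = self.cross_attn(question_tokens, video_tokens, video_tokens)
        return answers  # in MCQ-style training, compared against the erased text phrase's features
```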
A Hierarchical Graph Reasoning (HGR) model is proposed, which decomposes video-text matching into global-to-local levels and generates hierarchical textual embeddings via attention-based graph reasoning.
mPLUG is a new vision-language foundation model for both cross-modal understanding and generation that achieves state-of-the-art results on a wide range of vision-language downstream tasks, including image captioning, image-text retrieval, visual grounding and visual question answering.