3260 papers • 126 benchmarks • 313 datasets
Given a natural language query, find the most relevant video from a large set of candidate videos.
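The standard recipe for this task is a dual encoder: embed the query and every candidate video in a shared space and rank candidates by similarity. Below is a minimal sketch of that ranking step; the 512-dimensional embeddings and the `rank_videos` helper are illustrative assumptions, not any particular model's API.

```python
# Minimal sketch of the ranking step in dual-encoder text-to-video retrieval.
# The 512-dimensional vectors and the `rank_videos` helper are illustrative
# assumptions; any model that maps text and video into a shared space works.
import numpy as np

def rank_videos(query_emb: np.ndarray, video_embs: np.ndarray) -> np.ndarray:
    """Return candidate indices sorted by cosine similarity to the query (best first)."""
    # Both inputs are assumed L2-normalized, so a dot product is cosine similarity.
    scores = video_embs @ query_emb        # (num_videos,)
    return np.argsort(-scores)

# Example with random stand-in embeddings.
rng = np.random.default_rng(0)
query = rng.normal(size=512)
query /= np.linalg.norm(query)
videos = rng.normal(size=(1000, 512))
videos /= np.linalg.norm(videos, axis=1, keepdims=True)
print(rank_videos(query, videos)[:10])     # indices of the ten best-matching videos
```

In practice the candidate-video embeddings are computed offline and only the query is encoded at search time, which is what makes retrieval over a large candidate set tractable.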
These leaderboards are used to track progress in Text-to-Video Retrieval
Use these libraries to find Text-to-Video Retrieval models and implementations
An end-to-end trainable model that is designed to take advantage of both large-scale image and video captioning datasets and yields state-of-the-art results on standard downstream video-retrieval benchmarks including MSR-VTT, MSVD, DiDeMo and LSMDC.
This work proposes a new learning approach, MIL-NCE, capable of addressing misalignments inherent in narrated videos, and outperforms all published self-supervised approaches for these tasks as well as several fully supervised baselines.
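As a rough illustration of the idea, the sketch below gives one reading of a MIL-NCE-style objective: each clip is paired with a bag of candidate positive narrations (e.g. temporally nearby captions), and their scores are summed inside the softmax so the model is not forced to align with a single, possibly mismatched caption. Tensor shapes, names, and the boolean `pos_mask` convention are assumptions, not the authors' code.

```python
# Sketch of a MIL-NCE-style objective (an interpretation, not the authors' code).
import torch

def mil_nce_loss(video_emb: torch.Tensor, text_emb: torch.Tensor, pos_mask: torch.Tensor) -> torch.Tensor:
    """
    video_emb: (B, D) clip embeddings
    text_emb:  (N, D) narration embeddings (several candidates per clip, N >= B)
    pos_mask:  (B, N) boolean, True where a narration is a candidate positive for that clip
    """
    sim = video_emb @ text_emb.t()                   # (B, N) similarity logits
    exp_sim = sim.exp()                              # (a log-sum-exp form would be more stable)
    pos = (exp_sim * pos_mask.float()).sum(dim=1)    # sum over the bag of candidate positives
    denom = exp_sim.sum(dim=1)                       # positives + negatives
    return -(pos / denom).log().mean()
```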
A CLIP4Clip model transfers the knowledge of the CLIP model to video-language retrieval in an end-to-end manner and achieves SOTA results on various video-text retrieval datasets, including MSR-VTT, MSVD, LSMDC, ActivityNet, and DiDeMo.
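For intuition, here is a minimal sketch of the simplest (mean-pooling) variant of that idea, written against the public openai/CLIP package: encode each sampled frame with the CLIP image encoder, average the frame embeddings into a video embedding, and score it against the CLIP text embedding. Frame extraction is left abstract, and this is an illustrative reimplementation rather than the CLIP4Clip authors' code.

```python
# Mean-pooling sketch using the public openai/CLIP package
# (pip install git+https://github.com/openai/CLIP.git).
# `frames` is assumed to be a list of PIL images sampled from the video.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def video_text_similarity(frames, query: str) -> float:
    with torch.no_grad():
        images = torch.stack([preprocess(f) for f in frames]).to(device)
        frame_emb = model.encode_image(images)                    # (T, D) one embedding per frame
        video_emb = frame_emb.mean(dim=0, keepdim=True)           # mean-pool frames into a video embedding
        text_emb = model.encode_text(clip.tokenize([query]).to(device))
        video_emb = video_emb / video_emb.norm(dim=-1, keepdim=True)
        text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
        return (video_emb @ text_emb.t()).item()                  # cosine similarity
```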
It is demonstrated that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 or CrossTask.
A new state of the art on the text-to-video retrieval task on the MSR-VTT and LSMDC benchmarks, where the model outperforms all previous solutions by a large margin, is achieved using a single model and without fine-tuning.
The convolution-free VATT outperforms state-of-the-art ConvNet-based architectures in the downstream tasks and sets a new record on waveform-based audio event recognition by achieving an mAP of 39.4% on AudioSet without any supervised pre-training.
This work enables fine-grained video-text interactions while maintaining high efficiency for retrieval via a novel pretext task, dubbed Multiple Choice Questions (MCQ), in which a parametric module, BridgeFormer, is trained to answer the “questions” constructed from the text features by resorting to the video features.
This paper presents a fine-tuning strategy to refine these large-scale pretrained image-text models for zero-shot video understanding tasks and shows that by carefully adapting these models they obtain considerable improvements on two zero-shot Action Recognition tasks and three Text-to-Video Retrieval tasks.
This work shows the surprising result that, with large-scale pre-training and a proper frame ensemble strategy at inference time, a single-frame trained model that does not consider temporal information can achieve better performance than existing methods that use multiple frames for training.
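A frame ensemble at inference can be as simple as scoring the query against several uniformly sampled frames and aggregating the per-frame similarities; the sketch below uses mean aggregation, which is an assumption (other reductions such as max are possible).

```python
# Sketch of a frame-ensemble strategy at inference time: score the query
# against sampled frames and aggregate the per-frame similarities.
# Mean aggregation is an assumption; other reductions (e.g. max) are possible.
import numpy as np

def ensemble_score(frame_embs: np.ndarray, query_emb: np.ndarray) -> float:
    """frame_embs: (T, D) L2-normalized frame embeddings; query_emb: (D,) L2-normalized."""
    per_frame = frame_embs @ query_emb     # (T,) per-frame cosine similarities
    return float(per_frame.mean())         # aggregate over frames
```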