3260 papers • 126 benchmarks • 313 datasets
Given a natural language query, find the most relevant video from a large set of candidate videos.
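The standard recipe for this task is a dual encoder: embed the query and every candidate video in a shared space and rank candidates by similarity. Below is a minimal sketch of that ranking step; the 512-dimensional embeddings and the `rank_videos` helper are illustrative assumptions, not any particular model's API.

```python
# Minimal sketch of the ranking step in dual-encoder text-to-video retrieval.
# The 512-dimensional vectors and the `rank_videos` helper are illustrative
# assumptions; any model that maps text and video into a shared space works.
import numpy as np

def rank_videos(query_emb: np.ndarray, video_embs: np.ndarray) -> np.ndarray:
    """Return candidate indices sorted by cosine similarity to the query (best first)."""
    # Both inputs are assumed L2-normalized, so a dot product is cosine similarity.
    scores = video_embs @ query_emb        # (num_videos,)
    return np.argsort(-scores)

# Example with random stand-in embeddings.
rng = np.random.default_rng(0)
query = rng.normal(size=512)
query /= np.linalg.norm(query)
videos = rng.normal(size=(1000, 512))
videos /= np.linalg.norm(videos, axis=1, keepdims=True)
print(rank_videos(query, videos)[:10])     # indices of the ten best-matching videos
```

In practice the candidate-video embeddings are computed offline and only the query is encoded at search time, which is what makes retrieval over a large candidate set tractable.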
These leaderboards are used to track progress in Text-to-Video Retrieval
Use these libraries to find Text-to-Video Retrieval models and implementations
An end-to-end trainable model that is designed to take advantage of both large-scale image and video captioning datasets and yields state-of-the-art results on standard downstream video-retrieval benchmarks including MSR-VTT, MSVD, DiDeMo and LSMDC.
This work proposes a new learning approach, MIL-NCE, capable of addressing misalignments inherent in narrated videos, and outperforms all published self-supervised approaches for these tasks as well as several fully supervised baselines.
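As a rough illustration of the idea, the sketch below gives one reading of a MIL-NCE-style objective: each clip is paired with a bag of candidate positive narrations (e.g. temporally nearby captions), and their scores are summed inside the softmax so the model is not forced to align with a single, possibly mismatched caption. Tensor shapes, names, and the boolean `pos_mask` convention are assumptions, not the authors' code.

```python
# Sketch of a MIL-NCE-style objective (an interpretation, not the authors' code).
import torch

def mil_nce_loss(video_emb: torch.Tensor, text_emb: torch.Tensor, pos_mask: torch.Tensor) -> torch.Tensor:
    """
    video_emb: (B, D) clip embeddings
    text_emb:  (N, D) narration embeddings (several candidates per clip, N >= B)
    pos_mask:  (B, N) boolean, True where a narration is a candidate positive for that clip
    """
    sim = video_emb @ text_emb.t()                   # (B, N) similarity logits
    exp_sim = sim.exp()                              # (a log-sum-exp form would be more stable)
    pos = (exp_sim * pos_mask.float()).sum(dim=1)    # sum over the bag of candidate positives
    denom = exp_sim.sum(dim=1)                       # positives + negatives
    return -(pos / denom).log().mean()
```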
A CLIP4Clip model transfers the knowledge of the CLIP model to video-language retrieval in an end-to-end manner and achieves SOTA results on various video-text retrieval datasets, including MSR-VTT, MSVD, LSMDC, ActivityNet, and DiDeMo.
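For intuition, here is a minimal sketch of the simplest (mean-pooling) variant of that idea, written against the public openai/CLIP package: encode each sampled frame with the CLIP image encoder, average the frame embeddings into a video embedding, and score it against the CLIP text embedding. Frame extraction is left abstract, and this is an illustrative reimplementation rather than the CLIP4Clip authors' code.

```python
# Mean-pooling sketch using the public openai/CLIP package
# (pip install git+https://github.com/openai/CLIP.git).
# `frames` is assumed to be a list of PIL images sampled from the video.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def video_text_similarity(frames, query: str) -> float:
    with torch.no_grad():
        images = torch.stack([preprocess(f) for f in frames]).to(device)
        frame_emb = model.encode_image(images)                    # (T, D) one embedding per frame
        video_emb = frame_emb.mean(dim=0, keepdim=True)           # mean-pool frames into a video embedding
        text_emb = model.encode_text(clip.tokenize([query]).to(device))
        video_emb = video_emb / video_emb.norm(dim=-1, keepdim=True)
        text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
        return (video_emb @ text_emb.t()).item()                  # cosine similarity
```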
It is demonstrated that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 or CrossTask.
A new state of the art on the text-to-video retrieval task on the MSR-VTT and LSMDC benchmarks, where the model outperforms all previous solutions by a large margin, is achieved using a single model and without fine-tuning.
The convolution-free VATT outperforms state-of-the-art ConvNet-based architectures in the downstream tasks and sets a new record on waveform-based audio event recognition by achieving an mAP of 39.4% on AudioSet without any supervised pre-training.
This work enables fine-grained video-text interactions while maintaining high efficiency for retrieval via a novel pretext task, dubbed Multiple Choice Questions (MCQ), in which a parametric module, BridgeFormer, is trained to answer the “questions” constructed from the text features by resorting to the video features.
This paper presents a fine-tuning strategy to refine these large-scale pretrained image-text models for zero-shot video understanding tasks and shows that by carefully adapting these models they obtain considerable improvements on two zero-shot Action Recognition tasks and three Text-to-Video Retrieval tasks.
This work shows the surprising result that, with large-scale pre-training and a proper frame ensemble strategy at inference time, a single-frame trained model that does not consider temporal information can achieve better performance than existing methods that use multiple frames for training.
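A frame ensemble at inference can be as simple as scoring the query against several uniformly sampled frames and aggregating the per-frame similarities; the sketch below uses mean aggregation, which is an assumption (other reductions such as max are possible).

```python
# Sketch of a frame-ensemble strategy at inference time: score the query
# against sampled frames and aggregate the per-frame similarities.
# Mean aggregation is an assumption; other reductions (e.g. max) are possible.
import numpy as np

def ensemble_score(frame_embs: np.ndarray, query_emb: np.ndarray) -> float:
    """frame_embs: (T, D) L2-normalized frame embeddings; query_emb: (D,) L2-normalized."""
    per_frame = frame_embs @ query_emb     # (T,) per-frame cosine similarities
    return float(per_frame.mean())         # aggregate over frames
```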