3260 papers • 126 benchmarks • 313 datasets
Video-text retrieval requires understanding of both video and language together, which makes it different from the purely visual video retrieval task.
(Image credit: Papersgraph)
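At its core, text-to-video retrieval ranks candidate videos by the similarity between a query embedding and video embeddings in a shared space. Below is a minimal sketch of that ranking step; `text_encoder` and `video_encoder` are hypothetical modules standing in for any of the models listed on this page.

```python
import torch
import torch.nn.functional as F

def rank_videos(text_encoder, video_encoder, query, videos):
    """Return video indices sorted from best to worst match for the query.
    Assumes the encoders map a caption / a list of clips into a shared
    d-dimensional embedding space (illustrative names, not a specific API)."""
    q = F.normalize(text_encoder(query), dim=-1)      # (1, d) query embedding
    v = F.normalize(video_encoder(videos), dim=-1)    # (N, d) gallery embeddings
    sims = q @ v.T                                    # (1, N) cosine similarities
    return sims.argsort(dim=-1, descending=True).squeeze(0)
```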
These leaderboards are used to track progress in video retrieval.
Use these libraries to find video retrieval models and implementations.
No subtasks available.
An end-to-end trainable model that is designed to take advantage of both large-scale image and video captioning datasets and yields state-of-the-art results on standard downstream video-retrieval benchmarks including MSR-VTT, MSVD, DiDeMo and LSMDC.
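One practical idea behind mixing image and video captioning data is to treat an image as a single-frame video so both data sources flow through one encoder. A minimal, illustrative sketch of that reshaping (names are assumptions, not the paper's code):

```python
import torch

def as_video(batch: torch.Tensor) -> torch.Tensor:
    """(B, C, H, W) image batch -> (B, T=1, C, H, W) single-frame videos;
    real video batches (already 5-D) pass through unchanged."""
    return batch.unsqueeze(1) if batch.dim() == 4 else batch
```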
A CLIP4Clip model transfers the knowledge of the CLIP model to video-language retrieval in an end-to-end manner and achieves SOTA results on various video-text retrieval datasets, including MSR-VTT, MSVD, LSMDC, ActivityNet, and DiDeMo.
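The simplest way to reuse a CLIP-style image encoder for video-text retrieval is to encode frames independently and pool them into a clip-level embedding; parameter-free mean pooling is one of the similarity calculators studied in this line of work. The sketch below assumes generic `clip_image_encoder` / `clip_text_encoder` modules rather than a specific library API.

```python
import torch
import torch.nn.functional as F

def video_text_similarity(clip_image_encoder, clip_text_encoder, frames, text_tokens):
    """Mean-pool per-frame embeddings into a video embedding, then score it
    against the text embedding with cosine similarity.
    frames: (B, T, C, H, W); text_tokens: tokenized captions for the batch."""
    b, t = frames.shape[:2]
    frame_emb = clip_image_encoder(frames.flatten(0, 1))            # (B*T, d)
    video_emb = frame_emb.view(b, t, -1).mean(dim=1)                # (B, d), parameter-free pooling
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(clip_text_encoder(text_tokens), dim=-1)  # (B, d)
    return video_emb @ text_emb.T                                   # (B, B) similarity matrix
```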
This work proposes LanguageBind, which takes language as the bind across different modalities because the language modality is well explored and contains rich semantics; it freezes the language encoder acquired by VL pretraining and then trains encoders for other modalities with contrastive learning.
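The training recipe described above amounts to a symmetric contrastive (InfoNCE-style) objective with the language tower kept frozen. A minimal sketch of one such step, assuming hypothetical `language_encoder` / `modality_encoder` modules and a standard optimizer:

```python
import torch
import torch.nn.functional as F

def contrastive_step(language_encoder, modality_encoder, optimizer,
                     texts, modality_batch, temperature=0.07):
    """One training step: the language encoder stays frozen; only the
    modality encoder (video, audio, depth, ...) receives gradients."""
    with torch.no_grad():                                  # frozen language tower
        t = F.normalize(language_encoder(texts), dim=-1)   # (B, d)
    m = F.normalize(modality_encoder(modality_batch), dim=-1)  # (B, d)
    logits = m @ t.T / temperature                         # (B, B)
    labels = torch.arange(logits.size(0), device=logits.device)
    loss = (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2         # symmetric InfoNCE
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```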
The results show that the proposed CAMoE and DSL are highly effective, and each is capable of achieving state-of-the-art (SOTA) results individually on various benchmarks such as MSR-VTT, MSVD, and LSMDC.
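The dual softmax idea re-weights a text-video similarity matrix by a prior computed with a softmax along the opposite axis, so the two retrieval directions inform each other. The sketch below shows a common post-processing form of this operation; the exact placement (in the training loss versus as inference re-ranking) and the temperature follow the paper, not this snippet.

```python
import torch.nn.functional as F

def dual_softmax_rerank(sim, temperature=100.0):
    """sim: (num_texts, num_videos) similarity matrix (rows = texts).
    Re-weight each score by a per-video softmax over texts, then use the
    revised matrix for text-to-video ranking."""
    prior = F.softmax(sim * temperature, dim=0)  # distribution over texts per video
    return sim * prior                           # revised text-to-video scores
```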
The recently released Ego4D dataset is exploited to pioneer Egocentric VLP along three directions, and a novel pretraining objective is proposed, dubbed EgoNCE, which adapts video-text contrastive learning to the egocentric domain by mining egocentric-aware positive and negative samples.
UniAdapter is proposed, which unifies unimodal and multimodal adapters for parameter-efficient cross-modal adaptation of pre-trained vision-language models; in most cases, UniAdapter not only outperforms the state of the art but even beats the full fine-tuning strategy.
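Adapter-based tuning of this kind keeps the pre-trained backbone frozen and trains only small bottleneck modules inserted into the network. The sketch below is a generic bottleneck adapter, not UniAdapter's exact design (which additionally shares parameters across modalities); the class name and bottleneck size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Down-project, nonlinearity, up-project, residual connection."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The residual keeps the frozen backbone's output as the default path.
        return x + self.up(self.act(self.down(x)))

# Parameter-efficient tuning: optimize only adapter parameters, e.g.
# params = [p for n, p in model.named_parameters() if "adapter" in n]
```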
This work enables fine-grained video-text interactions while maintaining high efficiency for retrieval via a novel pretext task, dubbed Multiple Choice Questions (MCQ), in which a parametric module, BridgeFormer, is trained to answer "questions" constructed from the text features by resorting to the video features.
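The mechanism described above boils down to text-derived "question" features cross-attending to video features to produce "answer" features. The block below is an illustrative cross-attention module in that spirit; the real BridgeFormer architecture and training objective differ, and all names here are assumptions.

```python
import torch
import torch.nn as nn

class QuestionAnsweringBridge(nn.Module):
    """Question tokens (from text) attend over video tokens to form answers."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, question_tokens: torch.Tensor, video_tokens: torch.Tensor):
        # question_tokens: (B, Lq, d); video_tokens: (B, Lv, d)
        answers, _ = self.cross_attn(question_tokens, video_tokens, video_tokens)
        return answers  # in MCQ-style training, compared against the erased text phrase's features
```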
A Hierarchical Graph Reasoning (HGR) model is proposed, which decomposes video-text matching into global-to-local levels and generates hierarchical textual embeddings via attention-based graph reasoning.
mPLUG is a new vision-language foundation model for both cross-modal understanding and generation that achieves state-of-the-art results on a wide range of vision-language downstream tasks, including image captioning, image-text retrieval, visual grounding and visual question answering.