Retrieve long videos given their full subtitles.
It is demonstrated that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 or CrossTask.
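For intuition, retrieval in such a joint text-video embedding space reduces to nearest-neighbor search by cosine similarity. A minimal sketch, assuming PyTorch; the function name and embedding shapes are illustrative placeholders, not the paper's implementation:

```python
import torch
import torch.nn.functional as F

def retrieve_videos(text_query_emb, video_embs, top_k=5):
    """Rank candidate videos by cosine similarity to a text query (sketch).

    text_query_emb: (D,) embedding of the query text/subtitles.
    video_embs:     (N, D) embeddings of N candidate videos, both sides
                    produced by a jointly trained text-video model.
    """
    q = F.normalize(text_query_emb, dim=-1)
    v = F.normalize(video_embs, dim=-1)
    scores = v @ q                       # (N,) cosine similarities
    return torch.topk(scores, k=top_k)   # top-k scores and video indices
```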
This work proposes a new learning approach, MIL-NCE, capable of addressing the misalignments inherent in narrated videos; it outperforms all published self-supervised approaches for these tasks, as well as several fully supervised baselines.
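A minimal sketch of an MIL-NCE-style objective, where each clip is paired with a bag of candidate narrations rather than a single caption. PyTorch is assumed; the shapes, temperature value, and function name are illustrative, not the authors' reference implementation:

```python
import torch
import torch.nn.functional as F

def mil_nce_loss(video_emb, text_emb, temperature=0.07):
    """MIL-NCE-style loss sketch.

    video_emb: (B, D) video clip embeddings.
    text_emb:  (B, K, D) a bag of K candidate narrations per clip
               (e.g. temporally neighboring captions), all treated
               as potential positives.
    """
    B, K, D = text_emb.shape
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Similarity of every video to every candidate caption: (B, B*K).
    sim = video_emb @ text_emb.reshape(B * K, D).T / temperature
    sim = sim.reshape(B, B, K)  # (video, source clip, candidate)
    # Numerator: log-sum-exp over the bag of positives from the matching clip.
    pos = torch.logsumexp(sim[torch.arange(B), torch.arange(B)], dim=-1)
    # Denominator: log-sum-exp over all candidates from all clips in the batch.
    total = torch.logsumexp(sim.reshape(B, -1), dim=-1)
    return (total - pos).mean()  # -log(sum_pos / sum_all), averaged
```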
VideoCLIP is presented, a contrastive approach to pre-training a unified model for zero-shot video and text understanding without using any labels on downstream tasks; it achieves state-of-the-art performance, surpassing prior work and in some cases even outperforming supervised approaches.
A self-supervised training framework is proposed that learns a common multimodal embedding space, grouping semantically similar instances and enabling retrieval of samples across all modalities, even from unseen datasets and different domains.
This paper proposes TempCLR, a contrastive learning framework that explicitly compares the full video with the full paragraph, using dynamic time warping to compute the minimum cumulative cost over sentence-clip pairs as the sequence-level distance.
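A minimal sketch of the dynamic time warping recursion used as a sequence-level distance, here in its textbook form over a sentence-clip cost matrix. PyTorch is assumed; the cost construction and function name are illustrative and may differ from TempCLR's exact formulation:

```python
import torch

def dtw_distance(cost):
    """Dynamic time warping over a sentence-clip cost matrix (sketch).

    cost: (N, M) pairwise costs, e.g. 1 - cosine similarity between
          N sentence embeddings and M clip embeddings.
    Returns the minimum cumulative alignment cost, used as a
    sequence-level distance between a paragraph and a video.
    """
    N, M = cost.shape
    acc = torch.full((N + 1, M + 1), float("inf"))
    acc[0, 0] = 0.0
    for i in range(1, N + 1):
        for j in range(1, M + 1):
            # Standard DTW transitions: match, skip a clip, skip a sentence.
            acc[i, j] = cost[i - 1, j - 1] + torch.min(
                torch.stack([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]])
            )
    return acc[N, M]
```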
Noise Robust Temporal Optimal traNsport (Norton) is proposed, which addresses multi-granularity noisy correspondence (MNC) in a unified optimal transport (OT) framework and employs OT-based video-paragraph and clip-caption contrastive losses to capture long-term dependencies.
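For intuition on the OT component, a minimal Sinkhorn sketch that turns a clip-caption similarity matrix into a soft alignment plan. PyTorch is assumed; the uniform marginals, hyperparameters, and function name are assumptions, not Norton's exact procedure:

```python
import torch

def sinkhorn_plan(sim, eps=0.05, n_iters=50):
    """Entropic optimal transport via Sinkhorn iterations (sketch).

    sim: (N, M) similarity matrix between N clips and M captions.
    Returns a soft transport plan whose entries weight clip-caption
    pairs, so noisy or unmatched pairs receive little mass.
    """
    N, M = sim.shape
    K = torch.exp(sim / eps)            # Gibbs kernel from similarities
    a = torch.full((N,), 1.0 / N)       # uniform marginal over clips
    b = torch.full((M,), 1.0 / M)       # uniform marginal over captions
    u, v = torch.ones(N), torch.ones(M)
    for _ in range(n_iters):
        u = a / (K @ v)                 # scale rows to match marginal a
        v = b / (K.T @ u)               # scale columns to match marginal b
    return u[:, None] * K * v[None, :]  # transport plan of shape (N, M)
```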