Retrieve long videos given their full subtitles.
It is demonstrated that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 or CrossTask.
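For intuition, retrieval in such a joint text-video embedding space reduces to nearest-neighbor search by cosine similarity. A minimal sketch, assuming PyTorch; the function name and embedding shapes are illustrative placeholders, not the paper's implementation:

```python
import torch
import torch.nn.functional as F

def retrieve_videos(text_query_emb, video_embs, top_k=5):
    """Rank candidate videos by cosine similarity to a text query (sketch).

    text_query_emb: (D,) embedding of the query text/subtitles.
    video_embs:     (N, D) embeddings of N candidate videos, both sides
                    produced by a jointly trained text-video model.
    """
    q = F.normalize(text_query_emb, dim=-1)
    v = F.normalize(video_embs, dim=-1)
    scores = v @ q                       # (N,) cosine similarities
    return torch.topk(scores, k=top_k)   # top-k scores and video indices
```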
This work proposes a new learning approach, MIL-NCE, capable of addressing the misalignments inherent in narrated videos; it outperforms all published self-supervised approaches for these tasks, as well as several fully supervised baselines.
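A minimal sketch of an MIL-NCE-style objective, where each clip is paired with a bag of candidate narrations rather than a single caption. PyTorch is assumed; the shapes, temperature value, and function name are illustrative, not the authors' reference implementation:

```python
import torch
import torch.nn.functional as F

def mil_nce_loss(video_emb, text_emb, temperature=0.07):
    """MIL-NCE-style loss sketch.

    video_emb: (B, D) video clip embeddings.
    text_emb:  (B, K, D) a bag of K candidate narrations per clip
               (e.g. temporally neighboring captions), all treated
               as potential positives.
    """
    B, K, D = text_emb.shape
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Similarity of every video to every candidate caption: (B, B*K).
    sim = video_emb @ text_emb.reshape(B * K, D).T / temperature
    sim = sim.reshape(B, B, K)  # (video, source clip, candidate)
    # Numerator: log-sum-exp over the bag of positives from the matching clip.
    pos = torch.logsumexp(sim[torch.arange(B), torch.arange(B)], dim=-1)
    # Denominator: log-sum-exp over all candidates from all clips in the batch.
    total = torch.logsumexp(sim.reshape(B, -1), dim=-1)
    return (total - pos).mean()  # -log(sum_pos / sum_all), averaged
```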
VideoCLIP is presented, a contrastive approach to pre-training a unified model for zero-shot video and text understanding without using any labels on downstream tasks; it achieves state-of-the-art performance, surpassing prior work and in some cases even outperforming supervised approaches.
A self-supervised training framework is proposed that learns a common multimodal embedding space, grouping semantically similar instances and enabling retrieval of samples across all modalities, even from unseen datasets and different domains.
This paper proposes TempCLR, a contrastive learning framework that explicitly compares the full video with the full paragraph, using dynamic time warping to compute the minimum cumulative cost over sentence-clip pairs as the sequence-level distance.
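A minimal sketch of the dynamic time warping recursion used as a sequence-level distance, here in its textbook form over a sentence-clip cost matrix. PyTorch is assumed; the cost construction and function name are illustrative and may differ from TempCLR's exact formulation:

```python
import torch

def dtw_distance(cost):
    """Dynamic time warping over a sentence-clip cost matrix (sketch).

    cost: (N, M) pairwise costs, e.g. 1 - cosine similarity between
          N sentence embeddings and M clip embeddings.
    Returns the minimum cumulative alignment cost, used as a
    sequence-level distance between a paragraph and a video.
    """
    N, M = cost.shape
    acc = torch.full((N + 1, M + 1), float("inf"))
    acc[0, 0] = 0.0
    for i in range(1, N + 1):
        for j in range(1, M + 1):
            # Standard DTW transitions: match, skip a clip, skip a sentence.
            acc[i, j] = cost[i - 1, j - 1] + torch.min(
                torch.stack([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]])
            )
    return acc[N, M]
```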
Noise Robust Temporal Optimal traNsport (Norton) is proposed, which addresses multi-granularity noisy correspondence (MNC) in a unified optimal transport (OT) framework and employs OT-based video-paragraph and clip-caption contrastive losses to capture long-term dependencies.
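For intuition on the OT component, a minimal Sinkhorn sketch that turns a clip-caption similarity matrix into a soft alignment plan. PyTorch is assumed; the uniform marginals, hyperparameters, and function name are assumptions, not Norton's exact procedure:

```python
import torch

def sinkhorn_plan(sim, eps=0.05, n_iters=50):
    """Entropic optimal transport via Sinkhorn iterations (sketch).

    sim: (N, M) similarity matrix between N clips and M captions.
    Returns a soft transport plan whose entries weight clip-caption
    pairs, so noisy or unmatched pairs receive little mass.
    """
    N, M = sim.shape
    K = torch.exp(sim / eps)            # Gibbs kernel from similarities
    a = torch.full((N,), 1.0 / N)       # uniform marginal over clips
    b = torch.full((M,), 1.0 / M)       # uniform marginal over captions
    u, v = torch.ones(N), torch.ones(M)
    for _ in range(n_iters):
        u = a / (K @ v)                 # scale rows to match marginal a
        v = b / (K.T @ u)               # scale columns to match marginal b
    return u[:, None] * K * v[None, :]  # transport plan of shape (N, M)
```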