The Ad-hoc Video Search task completed a three-year cycle (2016-2018) whose goal was to model the end-user search use case: a user searching, with textual sentence queries, for segments of video containing persons, objects, activities, locations, etc., and combinations thereof. While the Internet Archive (IACC.3) dataset was used from 2016 to 2018, starting in 2019 a new data collection based on Vimeo Creative Commons (V3C) will be adopted to support the task for at least three more years. Given the test collection (V3C1 or IACC.3), the master shot boundary reference, and a set of Ad-hoc queries (approx. 30) released by NIST, systems return, for each query, a list of at most 1000 shot IDs from the test collection ranked by their likelihood of containing the target of the query.
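The ranked-retrieval protocol above can be sketched as follows. This is a minimal illustration, not any official TRECVID evaluation code: `rank_shots` and the random embeddings are hypothetical stand-ins for a real system's query and shot representations.

```python
import numpy as np

def rank_shots(query_vec, shot_vecs, shot_ids, k=1000):
    """Rank shots by cosine similarity to a query vector; return the top-k shot IDs."""
    q = query_vec / np.linalg.norm(query_vec)
    s = shot_vecs / np.linalg.norm(shot_vecs, axis=1, keepdims=True)
    scores = s @ q                      # cosine similarity per shot
    order = np.argsort(-scores)[:k]     # best-scoring shots first
    return [shot_ids[i] for i in order]

# Toy example: 5 shots with 4-dimensional embeddings (random, for illustration only).
rng = np.random.default_rng(0)
shot_vecs = rng.normal(size=(5, 4))
shot_ids = ["shot_%d" % i for i in range(5)]
top = rank_shots(rng.normal(size=4), shot_vecs, shot_ids, k=3)
print(top)  # three shot IDs, best match first
```

In a real submission the list would be capped at 1000 IDs per query, exactly as the task definition requires.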
This paper takes a concept-free approach, proposing a dual deep encoding network that encodes videos and queries into powerful dense representations of their own and establishes a new state-of-the-art for zero-example video retrieval.
With W2VV++, a super version of Word2VisualVec previously developed for visual-to-text matching, a new baseline for ad-hoc video search is established, which outperforms the state-of-the-art.
This paper attacks the challenging problem of video retrieval by text. In such a retrieval paradigm, an end user searches for unlabeled videos by ad-hoc queries described exclusively in the form of a natural-language sentence, with no visual example provided. Given videos as sequences of frames and queries as sequences of words, an effective sequence-to-sequence cross-modal matching is crucial. To that end, the two modalities need to be first encoded into real-valued vectors and then projected into a common space. In this paper we achieve this by proposing a dual deep encoding network that encodes videos and queries into powerful dense representations of their own. Our novelty is two-fold. First, different from prior art that resorts to a specific single-level encoder, the proposed network performs multi-level encoding that represents the rich content of both modalities in a coarse-to-fine fashion. Second, different from a conventional common space learning algorithm which is either concept based or latent space based, we introduce hybrid space learning which combines the high performance of the latent space and the good interpretability of the concept space. Dual encoding is conceptually simple, practically effective and end-to-end trained with hybrid space learning. Extensive experiments on four challenging video datasets show the viability of the new method. Code and data are available at https://github.com/danieljf24/hybrid_space.
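The common-space matching idea described in the abstract can be sketched in a few lines. This is a deliberately simplified stand-in, assuming mean pooling in place of the paper's multi-level encoders and random (untrained) projection matrices; it only illustrates how two modalities are mapped into one space and compared.

```python
import numpy as np

def encode_video(frame_feats, W_v):
    """Toy video encoder: mean-pool per-frame features, project into the common space."""
    return np.mean(frame_feats, axis=0) @ W_v

def encode_query(word_vecs, W_t):
    """Toy query encoder: mean-pool word embeddings, project into the common space."""
    return np.mean(word_vecs, axis=0) @ W_t

def similarity(v, t):
    """Cosine similarity between a video and a query in the common space."""
    return float(v @ t / (np.linalg.norm(v) * np.linalg.norm(t)))

rng = np.random.default_rng(1)
W_v = rng.normal(size=(512, 128))    # video-side projection (learned during training in practice)
W_t = rng.normal(size=(300, 128))    # text-side projection (learned during training in practice)
frames = rng.normal(size=(20, 512))  # 20 frames x 512-d visual features
words = rng.normal(size=(7, 300))    # 7 words x 300-d word embeddings
s = similarity(encode_video(frames, W_v), encode_query(words, W_t))
print(s)  # a value in [-1, 1]
```

The actual dual encoding network replaces the mean pooling here with multi-level (coarse-to-fine) encoders and learns the projections end-to-end with hybrid space learning.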
As extensive experiments on four benchmarks show, SEA surpasses the state-of-the-art and is extremely easy to implement, making SEA an appealing solution for AVS and promising for continuously advancing the task by harvesting new sentence encoders.
LAFF is proposed, a new baseline for text-to-video retrieval that performs feature fusion at both early and late stages and at both video and text ends, making it a powerful method for exploiting diverse (off-the-shelf) features.
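The fusion idea behind LAFF can be illustrated with a simple attention-weighted combination of feature vectors. This is a hedged sketch, not LAFF's actual implementation: the softmax weighting and the random inputs are stand-ins, assuming the diverse features have already been projected to a shared dimensionality.

```python
import numpy as np

def attention_fusion(feats, logits):
    """Fuse multiple feature vectors with softmax attention weights
    (a simplified stand-in for lightweight attentional feature fusion)."""
    a = np.exp(logits - np.max(logits))  # numerically stable softmax
    a = a / a.sum()
    return sum(ai * f for ai, f in zip(a, feats))

rng = np.random.default_rng(2)
feats = [rng.normal(size=128) for _ in range(3)]  # three off-the-shelf features, same dim
logits = rng.normal(size=3)                       # attention logits (learned in practice)
fused = attention_fusion(feats, logits)
print(fused.shape)  # (128,)
```

In LAFF this kind of fusion is applied at both early and late stages and on both the video and text sides.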
A new encoder-decoder network that learns an interpretable cross-modal representation is proposed for ad-hoc video search, outperforming several state-of-the-art retrieval models by a statistically significant margin.