3260 papers • 126 benchmarks • 313 datasets
(Image credit: Papersgraph)
These leaderboards are used to track progress in language-based-temporal-localization-10
Use these libraries to find language-based-temporal-localization-10 models and implementations
The proposed ACL encodes semantic concepts from verb-object pairs in language queries and uses activity classifiers' prediction scores to encode visual concepts; the authors show that ACL significantly outperforms state-of-the-art methods under the widely used metric.
This work proposes "Multi-faceted VideoMoment Localizer" (MML), an extension of the MAC model that introduces visual object evidence via object segmentation masks and video understanding features via video captioning; MML outperforms the MAC baseline and improves language modelling in the sentence embedding.
A Hierarchical Deep Residual Reasoning (HDRR) model is proposed, which decomposes the video and the sentence into multi-level representations with different semantics to achieve finer-grained temporal moment localization in untrimmed videos.
TubeDETR, a transformer-based architecture inspired by the recent success of such models for text-conditioned object detection, is proposed. It includes an efficient video and text encoder that models spatial multi-modal interactions over sparsely sampled frames, and a space-time decoder that jointly performs spatio-temporal localization.
This paper proposes a novel training framework in which grounding models use shuffled videos to address the temporal bias problem without losing grounding accuracy, and introduces two auxiliary tasks, cross-modal matching and temporal order discrimination, to promote the grounding model training.
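The summaries above refer to localization accuracy under a "widely used metric" without naming it. As a minimal sketch, assuming that metric is recall at a temporal intersection-over-union (IoU) threshold (a common choice for this task, but an assumption here, not something the listed papers confirm in this text), it could be computed as follows. The names `temporal_iou` and `recall_at_1` and the default threshold of 0.5 are hypothetical and chosen for illustration only.

```python
from typing import List, Tuple

# A moment is a (start, end) pair in seconds. This representation is an assumption.
Segment = Tuple[float, float]


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection-over-union of two time intervals."""
    # Overlap is zero when the intervals are disjoint.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(preds: List[Segment], gts: List[Segment],
                threshold: float = 0.5) -> float:
    """Fraction of queries whose top-1 prediction reaches the IoU threshold."""
    if not preds:
        return 0.0
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(preds)
```

For example, a predicted moment of (0, 10) seconds against a ground-truth moment of (5, 15) seconds overlaps for 5 seconds out of a 15-second union, so it would not count as a hit at a threshold of 0.5.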