
Language-Based Temporal Localization

3260 papers • 126 benchmarks • 313 datasets


Benchmarks

These leaderboards are used to track progress in Language-Based Temporal Localization.

Dataset: VidChapters-7M

Libraries

Use these libraries to find Language-Based Temporal Localization models and implementations.

Datasets

VidChapters-7M

Subtasks

Corpus Video Moment Retrieval

Most implemented papers

MAC: Mining Activity Concepts for Language-Based Temporal Localization

J. Gao, R. Nevatia, Kan Chen, Runzhou Ge•Tue Nov 20 2018

The proposed ACL model encodes semantic concepts from verb-object pairs in language queries and uses the prediction scores of activity classifiers to encode visual concepts. Experiments show that ACL significantly outperforms the state of the art under the widely used metric.
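The summary does not name "the widely used metric". In language-based temporal localization, results are commonly reported as Recall@n at a temporal IoU threshold (R@n, IoU=m): the fraction of queries for which at least one of the top-n predicted segments overlaps the ground-truth segment with temporal IoU of at least m. Assuming that is the metric meant, a minimal sketch follows; the helper names `temporal_iou` and `recall_at_n` are hypothetical and not taken from the paper's code.

```python
from typing import Sequence, Tuple

Segment = Tuple[float, float]  # (start, end), e.g. in seconds


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection over union of two 1-D time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_n(
    ranked_preds: Sequence[Sequence[Segment]],
    gts: Sequence[Segment],
    n: int,
    iou_thresh: float,
) -> float:
    """Fraction of queries whose top-n ranked predictions contain
    at least one segment with tIoU >= iou_thresh."""
    if not gts:
        return 0.0
    hits = 0
    for preds, gt in zip(ranked_preds, gts):
        if any(temporal_iou(p, gt) >= iou_thresh for p in preds[:n]):
            hits += 1
    return hits / len(gts)


# Two queries, each with predictions ranked best-first.
preds = [[(0, 10), (20, 30)], [(5, 8)]]
gts = [(2, 12), (20, 25)]
print(recall_at_n(preds, gts, n=1, iou_thresh=0.5))  # → 0.5
```

Only the first query's top prediction overlaps its ground truth (tIoU = 8/12 ≈ 0.67), so R@1 at IoU=0.5 is 0.5, while R@1 at IoU=0.7 would be 0.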


Video Moment Localization using Object Evidence and Reverse Captioning

Madhawa Vidanapathirana, Supriya Pandhre, Sonia Raychaudhuri, Anjali Khurana•Wed Jun 17 2020

This work proposes the "Multi-faceted Video Moment Localizer" (MML), which extends the MAC model by introducing visual object evidence via object segmentation masks and video understanding features via video captioning. MML outperforms the MAC baseline and improves language modelling in the sentence embedding.


Hierarchical Deep Residual Reasoning for Temporal Moment Localization

Ziyang Ma, Liqiang Nie, Xianjing Han, Xuemeng Song, Yiran Cui•Sat Oct 30 2021

This paper proposes a Hierarchical Deep Residual Reasoning (HDRR) model, which decomposes the video and sentence into multi-level representations with different semantics to achieve finer-grained temporal moment localization in untrimmed videos.


TubeDETR: Spatio-Temporal Video Grounding with Transformers

Antoine Miech, I. Laptev, C. Schmid, Josef Sivic, Antoine Yang•Tue Mar 29 2022

This paper proposes TubeDETR, a transformer-based architecture inspired by the recent success of such models for text-conditioned object detection. It includes an efficient video and text encoder that models spatial multi-modal interactions over sparsely sampled frames, and a space-time decoder that jointly performs spatio-temporal localization.
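The output of spatio-temporal video grounding is a tube: a time span plus one bounding box per frame. A metric commonly used to score such tubes is vIoU, which sums the per-frame box IoU over the frames where the predicted and ground-truth tubes overlap in time and divides by the number of frames in their temporal union. The sketch below assumes that definition and represents each tube as a hypothetical frame-index-to-box dictionary; it is an illustration, not TubeDETR's evaluation code.

```python
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def tube_viou(pred: Dict[int, Box], gt: Dict[int, Box]) -> float:
    """vIoU: box IoU summed over temporally shared frames,
    normalized by the number of frames in the temporal union."""
    frames_union = set(pred) | set(gt)
    if not frames_union:
        return 0.0
    frames_inter = set(pred) & set(gt)
    total = sum(box_iou(pred[t], gt[t]) for t in frames_inter)
    return total / len(frames_union)


# Predicted tube covers frames 0-1, ground truth covers frames 1-2.
pred_tube = {0: (0, 0, 10, 10), 1: (0, 0, 10, 10)}
gt_tube = {1: (0, 0, 10, 10), 2: (0, 0, 10, 10)}
print(tube_viou(pred_tube, gt_tube))  # one perfect frame out of three
```

Because the denominator is the temporal union, a tube with perfect boxes is still penalized for predicting the wrong time span, which is why spatial and temporal localization have to be solved jointly.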


Can Shuffling Video Benefit Temporal Bias Problem: A Novel Training Framework for Temporal Grounding

Haifeng Sun, Jiachang Hao, Pengfei Ren, Jingyu Wang, Q. Qi, J. Liao•Thu Jul 28 2022

This paper proposes a novel training framework in which grounding models are trained on shuffled videos to address the temporal bias problem without losing grounding accuracy. It also introduces two auxiliary tasks, cross-modal matching and temporal order discrimination, to promote grounding model training.
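To make the shuffled-video idea concrete, here is a hypothetical sketch of one way to build inputs for a temporal order discrimination task: split a video's clip features into contiguous segments, permute the segments, and label a sample 1 if its segment order is the original one and 0 otherwise. The segment-based shuffling scheme and the names `shuffle_clips` and `order_label` are assumptions for illustration, not the paper's exact procedure.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def shuffle_clips(
    clips: Sequence[T], num_segments: int, rng: random.Random
) -> Tuple[List[T], List[int]]:
    """Split clips into num_segments contiguous chunks, permute the
    chunk order, and return (shuffled clips, permutation used)."""
    n = len(clips)
    # Chunk boundaries spread as evenly as possible over the video.
    bounds = [round(i * n / num_segments) for i in range(num_segments + 1)]
    chunks = [list(clips[bounds[i]:bounds[i + 1]]) for i in range(num_segments)]
    order = list(range(num_segments))
    rng.shuffle(order)
    shuffled = [clip for idx in order for clip in chunks[idx]]
    return shuffled, order


def order_label(order: Sequence[int]) -> int:
    """1 if segments are in their original order, else 0 (shuffled)."""
    return int(list(order) == sorted(order))


rng = random.Random(0)
clips = list(range(10))  # stand-ins for per-clip feature vectors
shuffled, order = shuffle_clips(clips, num_segments=5, rng=rng)
print(order, shuffled, order_label(order))
```

Shuffling whole segments rather than individual clips keeps short-range motion intact inside each segment while destroying the global temporal layout, so a discriminator has to reason about order across segments rather than about local continuity.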
