Spatio-temporal video grounding is a joint computer vision and natural language processing (NLP) task that links a textual description to a specific spatio-temporal region of a video. Given an untrimmed video and a natural-language query, the goal is to predict both when the described event occurs (a temporal segment) and where the referred object or person appears within each frame of that segment (a sequence of bounding boxes, often called a spatio-temporal tube). This task is essential for applications such as video summarization, content-based video retrieval, and video captioning.
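The input/output contract of the task can be made concrete with a short sketch. The names below (SpatioTemporalTube, ground_query) are hypothetical and only illustrate the interface; they do not correspond to any particular benchmark's annotation format.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

import numpy as np

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixel coordinates


@dataclass
class SpatioTemporalTube:
    """One grounding result: a temporal segment plus one box per frame inside it."""
    start_frame: int
    end_frame: int                 # inclusive
    boxes: Dict[int, Box]          # frame index -> box of the referred object/person


def ground_query(frames: List[np.ndarray], query: str) -> SpatioTemporalTube:
    """Stub for a grounding model: given RGB frames and a textual query,
    return the spatio-temporal tube of the described target."""
    raise NotImplementedError
```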
A novel Spatio-Temporal Graph Reasoning Network (STGRN) is proposed for this task. It builds a spatio-temporal region graph to capture region relationships together with temporal object dynamics, combining implicit and explicit spatial subgraphs within each frame and a temporal dynamic subgraph across frames.
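As a rough illustration of a spatio-temporal region graph (not STGRN's exact formulation), the sketch below connects region proposals within each frame by spatial edges and regions in adjacent frames by temporal edges; the function name and the fully connected edge pattern are assumptions for illustration only.

```python
import numpy as np


def build_region_graph(num_frames: int, regions_per_frame: int) -> np.ndarray:
    """Return an adjacency matrix over all (frame, region) nodes."""
    n = num_frames * regions_per_frame
    adj = np.zeros((n, n), dtype=np.float32)

    def node(t: int, r: int) -> int:
        return t * regions_per_frame + r

    for t in range(num_frames):
        for i in range(regions_per_frame):
            # Spatial subgraph: connect regions within the same frame.
            for j in range(regions_per_frame):
                if i != j:
                    adj[node(t, i), node(t, j)] = 1.0
            # Temporal subgraph: link regions to the next frame so object
            # dynamics can propagate across time.
            if t + 1 < num_frames:
                for j in range(regions_per_frame):
                    adj[node(t, i), node(t + 1, j)] = 1.0
                    adj[node(t + 1, j), node(t, i)] = 1.0
    return adj
```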
This work introduces a novel task – Human-centric Spatio-Temporal Video Grounding (HC-STVG) – which aims to localize a spatio-temporal tube of the target person from an untrimmed video based on a given textual description.
TubeDETR is proposed, a transformer-based architecture inspired by the recent success of such models for text-conditioned object detection. It combines an efficient video-and-text encoder that models spatial multi-modal interactions over sparsely sampled frames with a space-time decoder that jointly performs spatio-temporal localization.
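A minimal sketch of this kind of design is given below; it loosely mirrors the summary (features of sparsely sampled frames, an encoded text query, and a decoder predicting per-frame boxes plus start/end logits), but the layer choices, dimensions, and names are illustrative assumptions rather than the released TubeDETR model.

```python
import torch
import torch.nn as nn


class SpaceTimeGrounder(nn.Module):
    def __init__(self, d_model: int = 256, vocab_size: int = 1000):
        super().__init__()
        self.frame_enc = nn.Linear(2048, d_model)      # per-frame visual features
        self.text_enc = nn.Embedding(vocab_size, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.box_head = nn.Linear(d_model, 4)          # per-frame box (cx, cy, w, h)
        self.time_head = nn.Linear(d_model, 2)         # per-frame start/end logits

    def forward(self, frame_feats: torch.Tensor, token_ids: torch.Tensor):
        # frame_feats: (B, T, 2048) features of sparsely sampled frames
        # token_ids:   (B, L) query token indices
        memory = self.text_enc(token_ids)              # (B, L, d) text memory
        queries = self.frame_enc(frame_feats)          # (B, T, d) one query per frame
        h = self.decoder(queries, memory)              # joint space-time decoding
        return self.box_head(h).sigmoid(), self.time_head(h)


# Usage example with random inputs (1 clip, 8 sampled frames, 12 query tokens).
model = SpaceTimeGrounder()
boxes, time_logits = model(torch.randn(1, 8, 2048), torch.randint(0, 1000, (1, 12)))
```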
A novel multi-modal template is introduced as the global objective for this task, which explicitly constrains the grounding region and associates the predictions across all video frames, and an encoder-decoder architecture is proposed for effective global context modeling.
PG-Video-LLaVA is proposed, the first large multimodal model (LMM) with pixel-level grounding capability; it integrates audio cues by transcribing them into text to enrich video-context understanding and delivers promising gains on video-based conversation and grounding tasks.
A novel framework, context-guided STVG (CG-STVG), is proposed; it mines discriminative instance context for objects in videos and applies it as supplementary guidance for more accurate target localization.