Spatio-temporal video grounding is a joint computer vision and natural language processing (NLP) task that links a textual description to a specific spatio-temporal region of a video. Given an untrimmed video and a natural-language query, the goal is to predict both when the described event occurs (a temporal segment) and where the referred object or person appears within each frame of that segment (a sequence of bounding boxes, often called a spatio-temporal tube). This task is essential for applications such as video summarization, content-based video retrieval, and video captioning.
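The input/output contract of the task can be made concrete with a short sketch. The names below (SpatioTemporalTube, ground_query) are hypothetical and only illustrate the interface; they do not correspond to any particular benchmark's annotation format.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

import numpy as np

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixel coordinates


@dataclass
class SpatioTemporalTube:
    """One grounding result: a temporal segment plus one box per frame inside it."""
    start_frame: int
    end_frame: int                 # inclusive
    boxes: Dict[int, Box]          # frame index -> box of the referred object/person


def ground_query(frames: List[np.ndarray], query: str) -> SpatioTemporalTube:
    """Stub for a grounding model: given RGB frames and a textual query,
    return the spatio-temporal tube of the described target."""
    raise NotImplementedError
```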
A novel Spatio-Temporal Graph Reasoning Network (STGRN) is proposed for this task. It builds a spatio-temporal region graph to capture region relationships together with temporal object dynamics, combining implicit and explicit spatial subgraphs within each frame and a temporal dynamic subgraph across frames.
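As a rough illustration of a spatio-temporal region graph (not STGRN's exact formulation), the sketch below connects region proposals within each frame by spatial edges and regions in adjacent frames by temporal edges; the function name and the fully connected edge pattern are assumptions for illustration only.

```python
import numpy as np


def build_region_graph(num_frames: int, regions_per_frame: int) -> np.ndarray:
    """Return an adjacency matrix over all (frame, region) nodes."""
    n = num_frames * regions_per_frame
    adj = np.zeros((n, n), dtype=np.float32)

    def node(t: int, r: int) -> int:
        return t * regions_per_frame + r

    for t in range(num_frames):
        for i in range(regions_per_frame):
            # Spatial subgraph: connect regions within the same frame.
            for j in range(regions_per_frame):
                if i != j:
                    adj[node(t, i), node(t, j)] = 1.0
            # Temporal subgraph: link regions to the next frame so object
            # dynamics can propagate across time.
            if t + 1 < num_frames:
                for j in range(regions_per_frame):
                    adj[node(t, i), node(t + 1, j)] = 1.0
                    adj[node(t + 1, j), node(t, i)] = 1.0
    return adj
```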
This work introduces a novel task – Human-centric Spatio-Temporal Video Grounding (HC-STVG) – which aims to localize a spatio-temporal tube of the target person from an untrimmed video based on a given textual description.
TubeDETR is proposed, a transformer-based architecture inspired by the recent success of such models for text-conditioned object detection. It combines an efficient video-and-text encoder that models spatial multi-modal interactions over sparsely sampled frames with a space-time decoder that jointly performs spatio-temporal localization.
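A minimal sketch of this kind of design is given below; it loosely mirrors the summary (features of sparsely sampled frames, an encoded text query, and a decoder predicting per-frame boxes plus start/end logits), but the layer choices, dimensions, and names are illustrative assumptions rather than the released TubeDETR model.

```python
import torch
import torch.nn as nn


class SpaceTimeGrounder(nn.Module):
    def __init__(self, d_model: int = 256, vocab_size: int = 1000):
        super().__init__()
        self.frame_enc = nn.Linear(2048, d_model)      # per-frame visual features
        self.text_enc = nn.Embedding(vocab_size, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.box_head = nn.Linear(d_model, 4)          # per-frame box (cx, cy, w, h)
        self.time_head = nn.Linear(d_model, 2)         # per-frame start/end logits

    def forward(self, frame_feats: torch.Tensor, token_ids: torch.Tensor):
        # frame_feats: (B, T, 2048) features of sparsely sampled frames
        # token_ids:   (B, L) query token indices
        memory = self.text_enc(token_ids)              # (B, L, d) text memory
        queries = self.frame_enc(frame_feats)          # (B, T, d) one query per frame
        h = self.decoder(queries, memory)              # joint space-time decoding
        return self.box_head(h).sigmoid(), self.time_head(h)


# Usage example with random inputs (1 clip, 8 sampled frames, 12 query tokens).
model = SpaceTimeGrounder()
boxes, time_logits = model(torch.randn(1, 8, 2048), torch.randint(0, 1000, (1, 12)))
```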
A novel multi-modal template is introduced as the global objective for this task, which explicitly constrains the grounding region and associates the predictions across all video frames, and an encoder-decoder architecture is proposed for effective global context modeling.
PG-Video-LLaVA is proposed, the first large multimodal model (LMM) with pixel-level grounding capability; it integrates audio cues by transcribing them into text to enrich video-context understanding and delivers promising gains on video-based conversation and grounding tasks.
A novel framework, context-guided STVG (CG-STVG), is proposed; it mines discriminative instance context for objects in videos and applies it as supplementary guidance for more accurate target localization.