3260 papers • 126 benchmarks • 313 datasets
Video grounding is the task of linking natural language descriptions to specific video segments. Given a video and a description, such as a sentence or a caption, the model must identify the segment of the video that corresponds to the description. This can involve localizing the objects or actions mentioned in the description within the video, or associating a specific time interval with the description.
(Image credit: Papersgraph)
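To make the task concrete, the toy sketch below scores every candidate window of a video against a query embedding and returns the best-matching interval. The random features, window sizes, and helper names are illustrative assumptions, not any published model's implementation.

```python
import numpy as np

# Toy sketch of temporal grounding: score each candidate segment against
# a query embedding and return the best-matching interval. Real systems
# use learned video/language encoders; random vectors stand in here.
rng = np.random.default_rng(0)
num_frames, dim = 120, 256
frame_feats = rng.normal(size=(num_frames, dim))  # one feature per frame
query_feat = rng.normal(size=dim)                 # encoded language query

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

best = max(
    ((s, e, cosine(frame_feats[s:e].mean(axis=0), query_feat))
     for s in range(num_frames)
     for e in range(s + 8, min(s + 64, num_frames))),  # candidate windows
    key=lambda t: t[2],
)
print(f"predicted segment: frames {best[0]}-{best[1]}, score {best[2]:.3f}")
```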
These leaderboards are used to track progress in Video Grounding.
Use these libraries to find Video Grounding models and implementations.
A Mutual Matching Network (MMN) is presented to directly model the similarity between language queries and video moments in a joint embedding space, suggesting that metric learning is still a promising method for temporal grounding via capturing the essential cross-modal correlation.
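As a hedged illustration of that metric-learning view, the sketch below embeds a batch of matched moment-query pairs into a joint space and computes an InfoNCE-style contrastive loss over the similarity matrix; the temperature and shapes are assumed, and this is not the authors' code.

```python
import numpy as np

# Metric learning for moment-query matching in a joint embedding space.
# A batch of B matched (moment, query) pairs; the diagonal of the
# similarity matrix holds the positive pairs.
rng = np.random.default_rng(1)
B, dim = 4, 128
moments = rng.normal(size=(B, dim))
queries = rng.normal(size=(B, dim))

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

sim = l2norm(moments) @ l2norm(queries).T  # B x B cosine similarities
tau = 0.07                                 # temperature (assumed value)
logits = sim / tau
# InfoNCE-style loss: maximize each positive against in-batch negatives.
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
loss = -np.mean(np.diag(log_probs))
print(f"contrastive matching loss: {loss:.3f}")
```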
This work scales both data and model size for InternVideo2, a model that outperforms others on various video-related captioning, dialogue, and long video understanding benchmarks, highlighting its ability to reason over and comprehend long temporal contexts.
A novel Spatio-Temporal Graph Reasoning Network (STGRN) is proposed for this task; it builds a spatio-temporal region graph to capture region relationships with temporal object dynamics, involving implicit and explicit spatial subgraphs in each frame and a temporal dynamic subgraph across frames.
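The sketch below runs one message-passing step over such a spatio-temporal region graph under assumed shapes and uniform edge weights: each region aggregates its same-frame (spatial) neighbors and same-region (temporal) neighbors. It is a simplification of the STGRN idea, not its implementation.

```python
import numpy as np

# One synchronous update over a spatio-temporal region graph:
# spatial edges connect regions within a frame, temporal edges connect
# the same region across adjacent frames. Weights are uniform here.
rng = np.random.default_rng(2)
T, R, dim = 5, 3, 64                  # frames, regions per frame, feature size
feats = rng.normal(size=(T, R, dim))  # region features per frame

updated = feats.copy()
for t in range(T):
    for r in range(R):
        spatial = feats[t].mean(axis=0)                        # same-frame regions
        temporal = feats[max(t - 1, 0):t + 2, r].mean(axis=0)  # adjacent frames
        updated[t, r] = feats[t, r] + 0.5 * spatial + 0.5 * temporal
print(updated.shape)  # (5, 3, 64): region features after one update
```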
A reinforcement learning based framework, improved by multi-task learning, is proposed that achieves state-of-the-art performance on the ActivityNet'18 DenseCaption dataset and the Charades-STA dataset while observing only 10 or fewer clips per video.
This work introduces a novel task – Human-centric Spatio-Temporal Video Grounding (HC-STVG) – which aims to localize a spatio-temporal tube of the target person from an untrimmed video based on a given textual description.
This paper presents a novel approach to the audio-visual video parsing (AVVP) task that demarcates events from a video separately for audio and visual modalities and outperforms the state-of-the-art Hybrid Attention Network (HAN) on all five metrics proposed for AVVP.
A novel paradigm from the perspective of causal inference, interventional video grounding (IVG), is proposed; it leverages backdoor adjustment to deconfound the selection bias, based on a structured causal model (SCM) and the do-calculus P(Y|do(X)).
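The backdoor adjustment underlying this is the standard identity P(Y|do(X)) = Σ_z P(Y|X,z)P(z). The miniature example below evaluates it for a discretized confounder; the probability values are made up purely to show the deconfounding computation.

```python
import numpy as np

# Backdoor adjustment in miniature: P(Y|do(X)) = sum_z P(Y|X,z) P(z).
# Z is a discretized confounder; the numbers are illustrative only.
p_z = np.array([0.6, 0.4])           # prior over confounder values z
p_y_given_xz = np.array([0.9, 0.3])  # P(Y=1 | X=x, Z=z) for each z

p_y_do_x = float(p_y_given_xz @ p_z)  # marginalize out the confounder
print(f"P(Y=1 | do(X=x)) = {p_y_do_x:.2f}")  # 0.9*0.6 + 0.3*0.4 = 0.66
```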
A novel Video-Language Graph Matching Network (VLG-Net) is designed to enable the mutual exchange of information across the modalities, achieving superior performance over state-of-the-art grounding methods on three widely used datasets for temporal localization of moments in videos with language queries.
A novel dense regression network (DRN) is designed to regress the distances from each frame within the ground truth to the starting (ending) frame of the video segment described by the query, improving video grounding accuracy.
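As a small sketch of the target construction this implies, with invented frame indices and boundaries, each frame inside the ground-truth segment regresses its distances to the two boundaries:

```python
import numpy as np

# DRN-style dense regression targets: every frame inside the ground-truth
# segment predicts its distances to the start and end boundaries.
start, end = 12, 30                 # ground-truth segment (invented)
frames = np.arange(start, end + 1)  # frames inside the ground truth
targets = np.stack([frames - start,         # distance to the starting frame
                    end - frames], axis=1)  # distance to the ending frame
print(targets[:3])  # frame 12 -> [0, 18], frame 13 -> [1, 17], ...
```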