3260 papers • 126 benchmarks • 313 datasets
Video grounding is the task of linking natural language descriptions to specific video segments. Given a video and a description, such as a sentence or a caption, the model must identify the segment of the video that corresponds to the description. This can involve localizing the objects or actions mentioned in the description within the video, or associating a specific time interval with the description.
(Image credit: Papersgraph)
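To make the task concrete, the toy sketch below scores every candidate window of a video against a query embedding and returns the best-matching interval. The random features, window sizes, and helper names are illustrative assumptions, not any published model's implementation.

```python
import numpy as np

# Toy sketch of temporal grounding: score each candidate segment against
# a query embedding and return the best-matching interval. Real systems
# use learned video/language encoders; random vectors stand in here.
rng = np.random.default_rng(0)
num_frames, dim = 120, 256
frame_feats = rng.normal(size=(num_frames, dim))  # one feature per frame
query_feat = rng.normal(size=dim)                 # encoded language query

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

best = max(
    ((s, e, cosine(frame_feats[s:e].mean(axis=0), query_feat))
     for s in range(num_frames)
     for e in range(s + 8, min(s + 64, num_frames))),  # candidate windows
    key=lambda t: t[2],
)
print(f"predicted segment: frames {best[0]}-{best[1]}, score {best[2]:.3f}")
```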
These leaderboards are used to track progress in Video Grounding.
Use these libraries to find Video Grounding models and implementations.
A Mutual Matching Network (MMN) is presented to directly model the similarity between language queries and video moments in a joint embedding space, suggesting that metric learning is still a promising method for temporal grounding via capturing the essential cross-modal correlation.
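As a hedged illustration of that metric-learning view, the sketch below embeds a batch of matched moment-query pairs into a joint space and computes an InfoNCE-style contrastive loss over the similarity matrix; the temperature and shapes are assumed, and this is not the authors' code.

```python
import numpy as np

# Metric learning for moment-query matching in a joint embedding space.
# A batch of B matched (moment, query) pairs; the diagonal of the
# similarity matrix holds the positive pairs.
rng = np.random.default_rng(1)
B, dim = 4, 128
moments = rng.normal(size=(B, dim))
queries = rng.normal(size=(B, dim))

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

sim = l2norm(moments) @ l2norm(queries).T  # B x B cosine similarities
tau = 0.07                                 # temperature (assumed value)
logits = sim / tau
# InfoNCE-style loss: maximize each positive against in-batch negatives.
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
loss = -np.mean(np.diag(log_probs))
print(f"contrastive matching loss: {loss:.3f}")
```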
This work scales both data and model size for InternVideo2, a model that outperforms others on various video-related captioning, dialogue, and long video understanding benchmarks, highlighting its ability to reason over and comprehend long temporal contexts.
A novel Spatio-Temporal Graph Reasoning Network (STGRN) is proposed for this task; it builds a spatio-temporal region graph to capture region relationships with temporal object dynamics, involving implicit and explicit spatial subgraphs in each frame and a temporal dynamic subgraph across frames.
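The sketch below runs one message-passing step over such a spatio-temporal region graph under assumed shapes and uniform edge weights: each region aggregates its same-frame (spatial) neighbors and same-region (temporal) neighbors. It is a simplification of the STGRN idea, not its implementation.

```python
import numpy as np

# One synchronous update over a spatio-temporal region graph:
# spatial edges connect regions within a frame, temporal edges connect
# the same region across adjacent frames. Weights are uniform here.
rng = np.random.default_rng(2)
T, R, dim = 5, 3, 64                  # frames, regions per frame, feature size
feats = rng.normal(size=(T, R, dim))  # region features per frame

updated = feats.copy()
for t in range(T):
    for r in range(R):
        spatial = feats[t].mean(axis=0)                        # same-frame regions
        temporal = feats[max(t - 1, 0):t + 2, r].mean(axis=0)  # adjacent frames
        updated[t, r] = feats[t, r] + 0.5 * spatial + 0.5 * temporal
print(updated.shape)  # (5, 3, 64): region features after one update
```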
A reinforcement learning based framework, improved by multi-task learning, is proposed that achieves state-of-the-art performance on the ActivityNet'18 DenseCaption dataset and the Charades-STA dataset while observing only 10 or fewer clips per video.
This work introduces a novel task – Human-centric Spatio-Temporal Video Grounding (HC-STVG) – which aims to localize a spatio-temporal tube of the target person from an untrimmed video based on a given textual description.
This paper presents a novel approach to the audio-visual video parsing (AVVP) task that demarcates events from a video separately for audio and visual modalities and outperforms the state-of-the-art Hybrid Attention Network (HAN) on all five metrics proposed for AVVP.
A novel paradigm from the perspective of causal inference, interventional video grounding (IVG), is proposed; it leverages backdoor adjustment to deconfound the selection bias, based on a structured causal model (SCM) and the do-calculus P(Y|do(X)).
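The backdoor adjustment underlying this is the standard identity P(Y|do(X)) = Σ_z P(Y|X,z)P(z). The miniature example below evaluates it for a discretized confounder; the probability values are made up purely to show the deconfounding computation.

```python
import numpy as np

# Backdoor adjustment in miniature: P(Y|do(X)) = sum_z P(Y|X,z) P(z).
# Z is a discretized confounder; the numbers are illustrative only.
p_z = np.array([0.6, 0.4])           # prior over confounder values z
p_y_given_xz = np.array([0.9, 0.3])  # P(Y=1 | X=x, Z=z) for each z

p_y_do_x = float(p_y_given_xz @ p_z)  # marginalize out the confounder
print(f"P(Y=1 | do(X=x)) = {p_y_do_x:.2f}")  # 0.9*0.6 + 0.3*0.4 = 0.66
```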
A novel Video-Language Graph Matching Network (VLG-Net) is designed to enable the mutual exchange of information across the modalities, achieving superior performance over state-of-the-art grounding methods on three widely used datasets for temporal localization of moments in videos with language queries.
A novel dense regression network (DRN) is designed to regress the distances from each frame within the ground truth to the starting (ending) frame of the video segment described by the query, improving video grounding accuracy.
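As a small sketch of the target construction this implies, with invented frame indices and boundaries, each frame inside the ground-truth segment regresses its distances to the two boundaries:

```python
import numpy as np

# DRN-style dense regression targets: every frame inside the ground-truth
# segment predicts its distances to the start and end boundaries.
start, end = 12, 30                 # ground-truth segment (invented)
frames = np.arange(start, end + 1)  # frames inside the ground truth
targets = np.stack([frames - start,         # distance to the starting frame
                    end - frames], axis=1)  # distance to the ending frame
print(targets[:3])  # frame 12 -> [0, 18], frame 13 -> [1, 17], ...
```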