Spatio-temporal video grounding is a joint computer vision and natural language processing (NLP) task: given a textual query, the goal is to localize the described object or event in a video both spatially (a region or bounding box in each frame) and temporally (the span of frames in which it occurs). This capability underpins applications such as video summarization, content-based video retrieval, and video captioning.
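To make the task's input/output structure concrete, here is a minimal sketch of what a grounding result might look like. The `SpatioTemporalGrounding` class, the field names, and the example query are all hypothetical, not drawn from any specific benchmark; the temporal IoU computation, however, reflects a standard evaluation metric for the temporal part of the task.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# A bounding box as (x1, y1, x2, y2) in pixel coordinates.
Box = Tuple[int, int, int, int]

@dataclass
class SpatioTemporalGrounding:
    """Hypothetical container for a grounding prediction:
    a temporal segment plus one bounding box per frame in that segment."""
    query: str
    start_frame: int
    end_frame: int          # inclusive
    boxes: Dict[int, Box]   # frame index -> box for that frame

    def duration(self) -> int:
        """Number of frames in the predicted temporal segment."""
        return self.end_frame - self.start_frame + 1

    def temporal_iou(self, gt_start: int, gt_end: int) -> float:
        """Temporal IoU against a ground-truth segment (a common metric)."""
        inter = max(0, min(self.end_frame, gt_end)
                       - max(self.start_frame, gt_start) + 1)
        union = self.duration() + (gt_end - gt_start + 1) - inter
        return inter / union

# Toy prediction: a query grounded to frames 10-13, with a box per frame.
pred = SpatioTemporalGrounding(
    query="the dog catches the frisbee",
    start_frame=10,
    end_frame=13,
    boxes={10: (40, 60, 120, 160), 11: (44, 62, 124, 162),
           12: (48, 64, 128, 164), 13: (52, 66, 132, 166)},
)
print(pred.duration())          # → 4
print(pred.temporal_iou(11, 14))  # → 0.6 (overlap 3 frames, union 5)
```

A full system would produce such a structure from a video and a free-form query; evaluation typically thresholds the temporal IoU and additionally scores the per-frame boxes against ground-truth boxes with spatial IoU.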
(Image credit: Papersgraph)
These leaderboards are used to track progress in spatio-temporal video grounding.
Use these libraries to find spatio-temporal video grounding models and implementations.
No datasets available.
No subtasks available.
Adding a benchmark result helps the community track progress.