Person-centric visual grounding is the task of linking people named in a caption to people pictured in the corresponding image. Introduced in "Who's Waldo? Linking People Across Text and Images" (Cui et al., ICCV 2021).
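Conceptually, the task can be framed as a matching problem between name mentions in the caption and detected person boxes in the image. The sketch below is a minimal, hypothetical illustration of that linking step, assuming pre-computed embeddings for mentions and boxes; it is not the method of Cui et al., and the function and variable names are invented for illustration.

```python
# Hypothetical sketch: link name mentions to person boxes by maximising
# cosine similarity with a one-to-one (Hungarian) assignment.
import numpy as np
from scipy.optimize import linear_sum_assignment


def ground_people(name_embs: np.ndarray, box_embs: np.ndarray) -> list[tuple[int, int]]:
    """Return (mention index, box index) pairs linking each name mention
    to at most one detected person box."""
    # L2-normalise so the dot product equals cosine similarity.
    names = name_embs / np.linalg.norm(name_embs, axis=1, keepdims=True)
    boxes = box_embs / np.linalg.norm(box_embs, axis=1, keepdims=True)
    sim = names @ boxes.T                      # (num_mentions, num_boxes)
    rows, cols = linear_sum_assignment(-sim)   # maximise total similarity
    return list(zip(rows.tolist(), cols.tolist()))


# Toy example: 2 name mentions, 3 detected person boxes, 4-d embeddings.
rng = np.random.default_rng(0)
links = ground_people(rng.normal(size=(2, 4)), rng.normal(size=(3, 4)))
print(links)  # e.g. [(0, 2), (1, 0)]: mention i is grounded to box j
```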
These leaderboards are used to track progress in Person-centric Visual Grounding.
Use these libraries to find Person-centric Visual Grounding models and implementations.
No subtasks available.
TubeDETR is a transformer-based architecture, inspired by the recent success of transformer models for text-conditioned object detection, that combines an efficient video and text encoder modelling spatial multi-modal interactions over sparsely sampled frames with a space-time decoder that jointly performs spatio-temporal localization. A hedged sketch of this encoder-decoder layout follows below.
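To make that description concrete, here is a minimal, hypothetical PyTorch sketch of such an encoder-decoder layout. The feature dimensions, the one-query-per-frame design, and all module names are assumptions made for illustration, not the authors' implementation.

```python
# Hypothetical TubeDETR-style sketch: frame and text features are fused by a
# transformer encoder; a space-time decoder turns one query per sampled frame
# into a box plus start/end logits for the spatio-temporal tube.
import torch
import torch.nn as nn


class TubeDETRSketch(nn.Module):
    def __init__(self, d_model: int = 256, num_frames: int = 8):
        super().__init__()
        self.frame_proj = nn.Linear(2048, d_model)   # e.g. pooled CNN feature per frame (assumed)
        self.text_proj = nn.Linear(768, d_model)     # e.g. token features from a text encoder (assumed)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)
        self.queries = nn.Embedding(num_frames, d_model)  # one query per sampled frame
        self.box_head = nn.Linear(d_model, 4)    # (cx, cy, w, h) per frame
        self.time_head = nn.Linear(d_model, 2)   # start / end logits per frame

    def forward(self, frame_feats, text_feats):
        # frame_feats: (B, T, 2048), text_feats: (B, L, 768)
        memory = torch.cat([self.frame_proj(frame_feats),
                            self.text_proj(text_feats)], dim=1)
        memory = self.encoder(memory)             # joint video-text memory
        q = self.queries.weight.unsqueeze(0).expand(frame_feats.size(0), -1, -1)
        hs = self.decoder(q, memory)              # space-time decoding
        return self.box_head(hs).sigmoid(), self.time_head(hs)


# Toy forward pass on random features: 8 sampled frames, 12 caption tokens.
model = TubeDETRSketch()
boxes, time_logits = model(torch.randn(1, 8, 2048), torch.randn(1, 12, 768))
print(boxes.shape, time_logits.shape)  # (1, 8, 4) and (1, 8, 2)
```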
Adding a benchmark result helps the community track progress.