3260 papers • 126 benchmarks • 313 datasets
Visual Grounding (VG) aims to locate the most relevant object or region in an image based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG: identifying the main focus of the query, understanding the content of the image, and localizing the target object.
(Image credit: Papersgraph)
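To make the task's input/output contract concrete, here is a minimal sketch of a grounding interface in Python. All names (`GroundingModel`, `locate`, `BoundingBox`) are hypothetical placeholders for illustration, not the API of any particular library.

```python
# A minimal, hypothetical sketch of a visual grounding interface.
# None of these names correspond to a specific library; they only
# illustrate the task's contract: (image, query) -> region.
from dataclasses import dataclass


@dataclass
class BoundingBox:
    """Axis-aligned box in pixel coordinates."""
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height


class GroundingModel:
    """Placeholder for any visual grounding model."""

    def locate(self, image, query: str) -> BoundingBox:
        """Return the image region most relevant to the query.

        A typical model must (1) work out the focus of the query,
        (2) encode the image, and (3) align the two modalities to
        select a region -- the three challenges listed above.
        """
        raise NotImplementedError


# Hypothetical usage:
# model = SomeConcreteGroundingModel()
# box = model.locate(image, "the red mug to the left of the laptop")
```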
These leaderboards are used to track progress in Visual Grounding
Use these libraries to find Visual Grounding models and implementations
Adding a benchmark result helps the community track progress.