3260 papers • 126 benchmarks • 313 datasets
Visual Grounding (VG) aims to locate the most relevant object or region in an image based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. VG poses three main challenges: identifying the main focus of the query, understanding the image, and localizing the target object.
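One common way to address the localization challenge is to embed the query and a set of candidate regions into a shared space and select the region most similar to the query. The sketch below illustrates that idea with hand-made toy embeddings and plain cosine similarity; the `ground` function, the embedding values, and the box format are all illustrative assumptions, not any particular model's API.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(a * a for a in v))
    return dot / (nu * nv)

def ground(query_emb, regions):
    # regions: list of (box, region_emb) pairs, box as (x1, y1, x2, y2).
    # Return the box whose embedding is most similar to the query.
    return max(regions, key=lambda r: cosine(query_emb, r[1]))[0]

# Toy example: the query vector points in roughly the same
# direction as the second region's vector, so that box wins.
query = [0.9, 0.1, 0.0]
regions = [
    ((0, 0, 50, 50),    [0.0, 1.0, 0.0]),
    ((60, 10, 120, 80), [1.0, 0.2, 0.0]),
]
print(ground(query, regions))  # → (60, 10, 120, 80)
```

In a real system the embeddings would come from a vision-language model and the candidate boxes from a region-proposal stage, but the selection step reduces to this nearest-neighbor lookup.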
These leaderboards are used to track progress in Visual Grounding
Use these libraries to find Visual Grounding models and implementations