Given a textual goal and multiple images representing candidate events, a model must choose one image which constitutes a reasonable step towards the given goal. A model should correctly recognize not only the specific action illustrated in an image (e.g., “turning on the oven”), but also the intent of the action (“baking fish”).
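The selection step described above can be sketched as a multiple-choice ranking problem: score each candidate image against the goal and pick the highest-scoring one. The snippet below is only an illustration under stated assumptions. It assumes the goal text and each image have already been mapped into a shared embedding space by some encoder (not shown), and it uses cosine similarity as a stand-in scoring function. The names `cosine`, `choose_step`, and `accuracy`, and the toy vectors, are hypothetical and not taken from any particular model or library.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def choose_step(goal_vec: Sequence[float],
                candidate_vecs: Sequence[Sequence[float]]) -> int:
    """Return the index of the candidate image embedding most similar to the goal."""
    if not candidate_vecs:
        raise ValueError("at least one candidate image is required")
    scores = [cosine(goal_vec, c) for c in candidate_vecs]
    return max(range(len(scores)), key=scores.__getitem__)


def accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of examples where the chosen index matches the labelled index."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Toy embeddings (made up for illustration): one goal, three candidate images.
goal = [1.0, 0.0, 1.0]
candidates = [
    [0.0, 1.0, 0.0],   # unrelated action
    [0.9, 0.1, 0.8],   # plausible step towards the goal
    [-1.0, 0.0, 0.0],  # opposing action
]
picked = choose_step(goal, candidates)
```

In this toy setup the second candidate is selected because its vector points in nearly the same direction as the goal. A real system would need an encoder that captures the intent behind an action, not only its visual content, which is the difficulty the task description highlights.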