Open Vocabulary Image Classification Datasets
Introduced in Unconstrained Open Vocabulary Image Classification: Zero-Shot Transfer from Text to Image via CLIP Inversion (2024)
Due to the free-form nature of the open vocabulary image classification task, special annotations are required for image sets used for evaluation purposes. Three such image datasets are presented here.
It is not in general possible to exhaustively annotate ground truth classification labels for open vocabulary image sets, as this would require annotations for every possible correct object noun in the English language for every visible entity in every part of every image. It is possible, however, to annotate the thousands of predictions that have been made across the image sets by the open vocabulary models trained thus far.

All three image datasets presented here have been individually annotated by both human and multimodal LLM annotators for the object nouns that were predicted by trained models. The annotations specify whether each classification is correct, close, or incorrect, and, for the human annotations, whether it relates to a primary or secondary element of the image. It is customary to use the suffixes -H and -L to clearly specify which annotations are being referred to at any time, e.g. Wiki-H is the Wiki dataset with the corresponding human annotations. All three datasets together contain a total of 17.4K human and 112K LLM class annotations.
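To make the annotation scheme concrete, the following is a minimal Python sketch of how such prediction-level annotations could be represented and scored. The record structure, field names, and scoring function are illustrative assumptions, not the actual NOVIC data format (refer to the NOVIC code for that).

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ClassAnnotation:
    """One annotated model prediction (hypothetical structure, not the NOVIC format)."""
    noun: str                # predicted object noun, e.g. "bicycle"
    verdict: str             # one of "correct", "close", "incorrect"
    primary: Optional[bool]  # primary vs. secondary image element (human -H annotations only)

def prediction_score(annots, count_close=False):
    """Fraction of annotated predictions judged correct.

    If count_close is True, predictions judged "close" are also accepted.
    """
    if not annots:
        return 0.0
    accepted = {"correct", "close"} if count_close else {"correct"}
    return sum(a.verdict in accepted for a in annots) / len(annots)

# Example: three annotated predictions for some image set
annots = [
    ClassAnnotation("bicycle", "correct", primary=True),
    ClassAnnotation("moped", "close", primary=True),
    ClassAnnotation("lamp", "incorrect", primary=None),  # e.g. an LLM (-L) annotation
]
print(prediction_score(annots))                    # strict: only "correct" counts
print(prediction_score(annots, count_close=True))  # lenient: "close" also counts
```

Keeping the correct/close/incorrect verdict per prediction, rather than a single ground-truth label per image, is what makes free-form predictions from different models comparable on the same images.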
The data is directly available at the following links:
Refer to the NOVIC code for an example of how the datasets can be used, as well as tools for updating the class annotations for newer model predictions.