Open Vocabulary Image Classification Datasets
Introduced in Unconstrained Open Vocabulary Image Classification: Zero-Shot Transfer from Text to Image via CLIP Inversion (2024)
Due to the free-form nature of the open vocabulary image classification task, special annotations are required for image sets used for evaluation purposes. Three such image datasets are presented here.
It is not in general possible to exhaustively annotate ground truth classification labels for open vocabulary image sets, as this would require annotations for every possible correct object noun in the English language for every visible entity in every part of every image. It is possible, however, to annotate the thousands of predictions that have been made across the image sets by the open vocabulary models trained thus far.

All three image datasets presented here have been individually annotated by both human and multimodal LLM annotators for the object nouns that were predicted by trained models. The annotations specify whether each classification is correct, close, or incorrect, and, for the human annotations, whether it relates to a primary or secondary element of the image. It is customary to use the suffixes -H and -L to clearly specify which annotations are being referred to at any time, e.g. Wiki-H is the Wiki dataset with the corresponding human annotations. All three datasets together contain a total of 17.4K human and 112K LLM class annotations.
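To make the annotation scheme concrete, the following is a minimal Python sketch of how such prediction-level annotations could be represented and scored. The record structure, field names, and scoring function are illustrative assumptions, not the actual NOVIC data format (refer to the NOVIC code for that).

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ClassAnnotation:
    """One annotated model prediction (hypothetical structure, not the NOVIC format)."""
    noun: str                # predicted object noun, e.g. "bicycle"
    verdict: str             # one of "correct", "close", "incorrect"
    primary: Optional[bool]  # primary vs. secondary image element (human -H annotations only)

def prediction_score(annots, count_close=False):
    """Fraction of annotated predictions judged correct.

    If count_close is True, predictions judged "close" are also accepted.
    """
    if not annots:
        return 0.0
    accepted = {"correct", "close"} if count_close else {"correct"}
    return sum(a.verdict in accepted for a in annots) / len(annots)

# Example: three annotated predictions for some image set
annots = [
    ClassAnnotation("bicycle", "correct", primary=True),
    ClassAnnotation("moped", "close", primary=True),
    ClassAnnotation("lamp", "incorrect", primary=None),  # e.g. an LLM (-L) annotation
]
print(prediction_score(annots))                    # strict: only "correct" counts
print(prediction_score(annots, count_close=True))  # lenient: "close" also counts
```

Keeping the correct/close/incorrect verdict per prediction, rather than a single ground-truth label per image, is what makes free-form predictions from different models comparable on the same images.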
The data is directly available at the following links:
Refer to the NOVIC code for an example of how the datasets can be used, as well as tools for updating the class annotations for newer model predictions.