We investigate how well CLIP understands texture in natural images described by natural language. To this end, we analyze CLIP's ability to: (1) perform zero-shot learning on various texture and material classification datasets; (2) represent compositional properties of texture, such as red dots or yellow stripes, on the Describable Texture in Detail (DTD²) dataset; and (3) aid fine-grained categorization of birds in photographs described by the color and texture of their body parts.
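The zero-shot protocol in (1) works by turning each class name into a natural-language prompt, embedding both prompts and images in CLIP's joint space, and assigning each image to the class whose prompt embedding is most similar. A minimal sketch of that mechanism follows; random unit vectors stand in for CLIP's actual image and text encoders, and the prompt template, class list, and embedding dimension are illustrative assumptions, not the paper's exact setup.

```python
# Sketch of CLIP-style zero-shot texture classification.
# Placeholder embeddings are used instead of real CLIP encoders.
import numpy as np

def normalize(x):
    """Project rows onto the unit sphere, as CLIP does before matching."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
texture_classes = ["banded", "dotted", "striped", "zigzagged"]  # hypothetical subset

# Text side: one prompt per class, embedded (here: random placeholders).
prompts = [f"a photo of a {c} texture" for c in texture_classes]
text_emb = normalize(rng.normal(size=(len(prompts), 512)))

# Image side: a small batch of image embeddings (placeholders).
image_emb = normalize(rng.normal(size=(3, 512)))

# Zero-shot prediction: scaled cosine similarity, softmax over classes.
logits = 100.0 * image_emb @ text_emb.T   # CLIP scales by a learned temperature
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
pred = [texture_classes[i] for i in logits.argmax(axis=-1)]
```

With real CLIP encoders, only the two `normalize(rng.normal(...))` lines would change: they would be replaced by the model's encoded prompts and images, while the similarity-and-argmax step stays the same.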