CLIP-Art: Contrastive Pre-training for Fine-Grained Art Classification (2021-06-01)

TL;DR

This work uses CLIP (Contrastive Language-Image Pre-Training) to train a neural network on pairs of art images and text. The model learns directly from raw image descriptions or, where available, curated labels, and it retains zero-shot capability.

Abstract

Existing computer vision research on artwork struggles with recognizing fine-grained artwork attributes and with the lack of curated, annotated datasets, which are costly to create. In this work, we use CLIP (Contrastive Language-Image Pre-Training) [12] to train a neural network on a variety of art image and text pairs, so that it learns directly from raw image descriptions or, where available, curated labels. The model's zero-shot capability allows it to predict the most relevant natural language description for a given image without being directly optimized for the task. Our approach aims to solve two challenges: instance retrieval and fine-grained artwork attribute recognition. We use the iMet Dataset [20], which we consider the largest annotated artwork dataset. Our code and models will be available at https://github.com/KeremTurgutlu/clip_art
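The zero-shot prediction step mentioned in the abstract can be illustrated with a minimal sketch. It assumes image and text embeddings have already been produced by some CLIP-style encoders (not shown here); the function names, the toy 3-dimensional embeddings, the example captions, and the temperature value of 0.07 are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def zero_shot_predict(image_emb, text_embs, labels, temperature=0.07):
    """Pick the caption whose embedding is most similar to the image embedding.

    image_emb: array of shape (d,)
    text_embs: array of shape (n_labels, d), one row per candidate caption
    temperature: assumed softmax temperature (hypothetical default)
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_embs)
    logits = (txt @ img) / temperature      # scaled cosine similarities
    logits = logits - logits.max()          # subtract max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return labels[int(np.argmax(probs))], probs

# Toy example with hand-made embeddings (hypothetical values).
labels = ["a painting of a landscape", "a marble sculpture", "a ceramic vase"]
text_embs = np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0],
                      [0.0, 0.0, 1.0]])
image_emb = np.array([0.1, 0.9, 0.2])

best, probs = zero_shot_predict(image_emb, text_embs, labels)
print(best)
```

Because no classifier head is trained, the set of candidate captions can be changed at inference time, which is what makes the prediction zero-shot.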

Authors

Marcos V. Conde

Kerem Turgutlu

References

Exploring Vision Transformers for Fine-grained Classification

Barlow Twins: Self-Supervised Learning via Redundancy Reduction

Learning Transferable Visual Models From Natural Language Supervision

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Big Self-Supervised Models are Strong Semi-Supervised Learners

Multi-branch and Multi-scale Attention Learning for Fine-Grained Visual Categorization

A Simple Framework for Contrastive Learning of Visual Representations

Self-Training With Noisy Student Improves ImageNet Classification

On the Variance of the Adaptive Learning Rate and Beyond

Lookahead Optimizer: k steps forward, 1 step back

The iMet Collection 2019 Challenge Dataset

Destruction and Construction Learning for Fine-Grained Image Recognition

EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks

Looking for the Devil in the Details: Learning Trilinear Attention Sampling Network for Fine-Grained Image Recognition

See Better Before Looking Closer: Weakly Supervised Data Augmentation Network for Fine-Grained Visual Classification

Learning to Navigate for Fine-grained Classification

Representation Learning with Contrastive Predictive Coding

Squeeze-and-Excitation Networks

Fine-Grained Visual-Textual Representation Learning

Fine-Grained Recognition as HSnet Search for Informative Image Parts

Deep Residual Learning for Image Recognition

Going deeper with convolutions

Very Deep Convolutional Networks for Large-Scale Image Recognition

Fine-Grained Visual Classification of Aircraft

ImageNet classification with deep convolutional neural networks

The Caltech-UCSD Birds-200-2011 Dataset

ImageNet: A large-scale hierarchical image database

Novel Dataset for Fine-Grained Image Categorization: Stanford Dogs

iMet and Kaggle

For each query image, we rank 20,000 validation candidates by cosine similarity.
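The ranking step described above can be sketched as follows. This is a minimal illustration under stated assumptions: embeddings are taken as given, the 64-dimensional random candidates stand in for real validation embeddings, and the function name `rank_candidates` and the `top_k` parameter are hypothetical.

```python
import numpy as np

def rank_candidates(query_emb, candidate_embs, top_k=5):
    """Return indices and scores of the top_k candidates by cosine similarity.

    query_emb: array of shape (d,)
    candidate_embs: array of shape (n_candidates, d)
    """
    q = query_emb / np.linalg.norm(query_emb)
    c = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    sims = c @ q                        # cosine similarity to every candidate
    order = np.argsort(-sims)[:top_k]   # indices sorted by descending similarity
    return order, sims[order]

# Synthetic stand-in for 20,000 validation embeddings (hypothetical data).
rng = np.random.default_rng(0)
candidates = rng.normal(size=(20000, 64))

# Build a query that is a slightly perturbed copy of candidate 123.
query = candidates[123] + 0.01 * rng.normal(size=64)

top_idx, top_sims = rank_candidates(query, candidates, top_k=5)
print(top_idx, top_sims)
```

Normalizing both sides once reduces the ranking to a single matrix-vector product, which stays cheap at this candidate count.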
