The goal of the multi-label classification task is to predict the set of labels present in an image. As an extension of zero-shot learning (ZSL), multi-label zero-shot learning (ML-ZSL) aims to identify multiple seen and unseen labels in an image.
These leaderboards are used to track progress in Multi-Label Zero-Shot Learning.
Use these libraries to find Multi-Label Zero-Shot Learning models and implementations.
This work proposes to view attribute-based image classification as a label-embedding problem: each class is embedded in the space of attribute vectors, and a compatibility function is introduced that measures how well an image matches a label embedding.
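As a minimal sketch of that idea, the snippet below scores an image against every class via a bilinear compatibility F(x, y) = θ(x)ᵀ W φ(y); the feature dimensions and random tensors are illustrative assumptions, not the paper's exact setup.

```python
import torch

img_dim, attr_dim = 512, 85                 # assumed feature sizes
W = torch.randn(img_dim, attr_dim, requires_grad=True)  # learnable bilinear map

def compatibility(theta_x, phi_Y):
    """Score one image embedding against every class attribute vector.
    theta_x: (img_dim,) image feature; phi_Y: (num_classes, attr_dim)."""
    return phi_Y @ (W.t() @ theta_x)        # (num_classes,) scores

theta_x = torch.randn(img_dim)              # e.g. a CNN image feature
phi_Y = torch.randn(10, attr_dim)           # attribute vectors for 10 classes
pred = compatibility(theta_x, phi_Y).argmax().item()  # most compatible class
```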
This work proposes a novel deep learning architecture for multi-label zero-shot learning (ML-ZSL) that predicts multiple unseen class labels for each input instance, together with a framework that incorporates knowledge graphs to describe the relationships between labels.
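A rough sketch of the knowledge-graph ingredient, assuming a row-normalized label-relation matrix and a simple recurrent propagation rule; both are stand-ins rather than the paper's actual graph or update.

```python
import torch

num_labels, d = 6, 32
A = torch.rand(num_labels, num_labels)      # assumed label-relation adjacency
A = A / A.sum(dim=1, keepdim=True)          # row-normalize
U = torch.randn(d, d)                       # stand-in propagation weights

states = torch.randn(num_labels, d)         # initial per-label states
for _ in range(3):                          # a few propagation steps
    states = torch.tanh(A @ states @ U)     # each label absorbs its neighbors

logits = states @ torch.randn(d)            # illustrative per-label logits
```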
This work investigates zero-shot learning in the music domain and organizes two setups of side information: human-labeled attribute information based on the Free Music Archive and OpenMIC-2018 datasets, and general word semantic information from the Million Song Dataset and Last.fm tag annotations.
In this work, we develop a shared multi-attention model for multi-label zero-shot learning. We argue that designing an attention mechanism for recognizing multiple seen and unseen labels in an image is a non-trivial task, as there is no training signal to localize unseen labels, and an image contains only a few present labels that need attention out of thousands of possible labels. Therefore, instead of generating attentions for unseen labels, which have unknown behaviors and could focus on irrelevant regions due to the lack of any training sample, we let the unseen labels select among a set of shared attentions that are trained to be label-agnostic and to focus only on relevant/foreground regions through our novel loss. Finally, we learn a compatibility function to distinguish labels based on the selected attention. We further propose a novel loss function with three components that guide the attention to focus on diverse and relevant image regions while utilizing all attention features. Through extensive experiments, we show that our method improves the state of the art by 2.9% and 1.4% F1 score on the NUS-WIDE and the large-scale Open Images datasets, respectively.
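A minimal sketch of the shared-attention scoring described above, assuming a small set of shared attention queries over CNN region features; the dimensions and the softmax attention form are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn.functional as F

regions = torch.randn(49, 512)            # 7x7 region features from a CNN
label_emb = torch.randn(1000, 512)        # word embeddings for all labels
queries = torch.randn(10, 512)            # 10 shared, label-agnostic queries

attn = F.softmax(queries @ regions.t(), dim=1)   # (10, 49) attention maps
attended = attn @ regions                        # (10, 512) shared features

# Compatibility of every label with every shared attention feature;
# each label (seen or unseen) selects the attention that suits it best.
scores = (label_emb @ attended.t()).max(dim=1).values  # (1000,) label scores
```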
We study the problem of multi-label zero-shot recognition in which labels take the form of human-object interactions (combinations of actions on objects), each image may contain multiple interactions, and some interactions have no training images. We propose a novel compositional learning framework that decouples interaction labels into separate action and object scores incorporating the spatial compatibility between the two components. We combine these scores to efficiently recognize seen and unseen interactions. However, learning action-object spatial relations, in principle, requires bounding-box annotations, which are costly to gather. Moreover, it is not clear how to generalize spatial relations to unseen interactions. We address these challenges by developing a cross-attention mechanism that localizes objects from action locations and vice versa by predicting the displacements between them, referred to as relational directions. During training, we estimate the relational directions as those maximizing the scores of ground-truth interactions, which guides predictions toward compatible action-object regions. Through extensive experiments, we show the effectiveness of our framework, improving the state of the art by 2.6% mAP score and 5.8% recall score on the HICO and Visual Genome datasets, respectively.
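A hedged sketch of the compositional scoring idea, assuming additive fusion of per-action, per-object, and spatial-compatibility terms; the fusion rule and vocabulary sizes are assumptions, not the paper's exact formulation.

```python
import torch

num_actions, num_objects = 117, 80          # e.g. a HICO-style vocabulary
action_scores = torch.randn(num_actions)    # per-action logits for an image
object_scores = torch.randn(num_objects)    # per-object logits
spatial_compat = torch.randn(num_actions, num_objects)  # assumed spatial term

# Every (action, object) pair gets a composed score, so unseen combinations
# of seen actions and seen objects can be recognized without training images.
interaction = action_scores[:, None] + object_scores[None, :] + spatial_compat
a, o = divmod(interaction.flatten().argmax().item(), num_objects)
```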
This work is the first to tackle the problem of multi-label feature synthesis in the (generalized) zero-shot setting with a cross-level fusion-based generative approach, which outperforms the state of the art on three zero-shot benchmarks: NUS-WIDE, Open Images, and MS COCO.
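A minimal sketch of generative feature synthesis for unseen labels, with a stand-in conditional generator rather than the paper's cross-level fusion model; all layer sizes are assumptions.

```python
import torch
import torch.nn as nn

# Stand-in conditional generator: label embedding + noise -> visual feature.
gen = nn.Sequential(nn.Linear(300 + 64, 1024), nn.ReLU(),
                    nn.Linear(1024, 2048))

unseen_emb = torch.randn(20, 300)           # embeddings of 20 unseen labels
noise = torch.randn(20, 64)
fake_feats = gen(torch.cat([unseen_emb, noise], dim=1))  # (20, 2048)
# fake_feats can then supervise an ordinary classifier for unseen labels.
```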
This study introduces end-to-end model training for multi-label zero-shot learning that supports the semantic diversity of images and labels, and proposes an embedding matrix whose principal embedding vectors are trained using a tailored loss function.
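One possible reading of the principal-embedding idea, sketched below with a few projection heads per image and max-fusion over heads; both choices are assumptions rather than the paper's construction.

```python
import torch

img_feat = torch.randn(2048)                    # global image feature
proj = torch.randn(4, 300, 2048)                # 4 principal embedding heads
principal = proj @ img_feat                     # (4, 300) principal vectors

label_emb = torch.randn(1000, 300)              # label word embeddings
scores = (label_emb @ principal.t()).max(dim=1).values  # best head per label
```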
This work proposes an alternative approach to region-based, discriminability-preserving multi-label zero-shot classification that maintains spatial resolution to preserve region-level characteristics and uses a bi-level attention module (BiAM) to enrich the features by incorporating both region and scene context information.
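A rough sketch of bi-level feature enrichment, combining region-level self-attention with pooled scene context before per-label scoring; the dimensions and the additive fusion are assumptions, not BiAM's exact design.

```python
import torch
import torch.nn.functional as F

regions = torch.randn(196, 512)                 # 14x14 grid of region features
attn = F.softmax(regions @ regions.t() / 512 ** 0.5, dim=1)
region_ctx = attn @ regions                     # region-contextualized features
scene_ctx = regions.mean(dim=0, keepdim=True)   # pooled scene context

enriched = regions + region_ctx + scene_ctx     # fuse both context levels
label_emb = torch.randn(1000, 512)
scores = (label_emb @ enriched.t()).max(dim=1).values   # per-label scores
```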
Adding a benchmark result helps the community track progress.