Open-Vocabulary Attribute Detection (OVAD) is a task that aims to detect and recognize an open set of objects and their associated attributes in an image. The objects and attributes are defined by text queries during inference, without prior knowledge of the tested classes during training.
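The matching step described above can be sketched in a few lines: region features from a detector and embeddings of free-form text queries live in a shared space, and each region is labeled by its most similar query. This is a minimal toy sketch, not any specific model's implementation; the random vectors stand in for embeddings a real vision-language model would produce.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(x):
    # L2-normalize rows so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Hypothetical stand-ins: 4 detected region features, 3 text queries
# (e.g. "red car", "striped shirt") embedded by a text encoder.
region_feats = normalize(rng.standard_normal((4, 512)))
query_feats = normalize(rng.standard_normal((3, 512)))

# Cosine similarity between every region and every text query.
scores = region_feats @ query_feats.T   # shape (4, 3)
labels = scores.argmax(axis=1)          # best-matching query per region
print(labels.shape)  # (4,)
```

Because the class set is defined only by the text queries supplied at inference time, new categories can be added without retraining the detector.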
It is demonstrated that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.
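The caption-matching pre-training objective described above is a symmetric contrastive (InfoNCE) loss over a batch of image-text pairs, with matching pairs on the diagonal of the similarity matrix. The sketch below illustrates that loss with random stand-in embeddings and an assumed temperature of 0.07; it is not the actual CLIP implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
B, D = 8, 64  # batch size, embedding dimension (illustrative values)

# Stand-ins for encoder outputs; L2-normalized as in contrastive setups.
img = rng.standard_normal((B, D))
txt = rng.standard_normal((B, D))
img /= np.linalg.norm(img, axis=1, keepdims=True)
txt /= np.linalg.norm(txt, axis=1, keepdims=True)

temperature = 0.07
logits = img @ txt.T / temperature  # (B, B) image-to-text similarities

def cross_entropy(logits, targets):
    # Row-wise softmax cross-entropy against integer class targets.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

# The i-th image matches the i-th caption, so targets are the diagonal.
targets = np.arange(B)
loss = 0.5 * (cross_entropy(logits, targets) + cross_entropy(logits.T, targets))
print(round(float(loss), 3))
```

Averaging the image-to-text and text-to-image terms makes the objective symmetric, so both encoders are pulled toward the same shared embedding space.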
BLIP-2 achieves state-of-the-art performance on various vision-language tasks despite having significantly fewer trainable parameters than existing methods, and demonstrates emerging capabilities of zero-shot image-to-text generation that can follow natural language instructions.
BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones, and demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner.
ALBEF introduces a contrastive loss to ALign the image and text representations BEfore Fusing them through cross-modal attention, which enables more grounded vision-and-language representation learning, and proposes momentum distillation, a self-training method that learns from pseudo-targets produced by a momentum model.
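The momentum model mentioned above is maintained as an exponential moving average (EMA) of the online encoder's weights, so its pseudo-targets evolve smoothly during training. A minimal sketch of that update, with illustrative parameter shapes and an assumed momentum coefficient of 0.995:

```python
import numpy as np

def ema_update(momentum_params, online_params, m=0.995):
    # Each momentum parameter takes a small step toward the
    # corresponding online parameter; m close to 1 means slow drift.
    return [m * p_m + (1.0 - m) * p_o
            for p_m, p_o in zip(momentum_params, online_params)]

# Toy parameter lists standing in for full encoder weights.
online = [np.ones((2, 2)), np.zeros(3)]
momentum = [np.zeros((2, 2)), np.ones(3)]

momentum = ema_update(momentum, online)
print(round(momentum[0][0, 0], 3))  # 0.005
```

Because the momentum weights lag behind the online weights, the pseudo-targets they produce are more stable than the online model's own predictions, which helps the model learn from noisy web supervision.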
This work investigates scaling laws for contrastive language-image pre-training (CLIP) with the public LAION dataset and the open-source OpenCLIP repository, and finds that the training distribution plays a key role in scaling laws, as the OpenAI and OpenCLIP models exhibit different scaling behavior.
This paper proposes a new method to train object detectors using bounding box annotations for a limited set of object categories, as well as image-caption pairs that cover a larger variety of objects at a significantly lower cost.
This work proposes an open-vocabulary object detection method that, based on image-caption pairs, learns to detect novel object classes along with a given set of known classes and shows that a simple language model fits better than a large contextualized language model for detecting novel objects.
The Open-Vocabulary Attribute Detection (OVAD) task and the corresponding OVAD benchmark are introduced to probe object-level attribute information learned by vision-language models, and a first baseline method for open-vocabulary attribute detection is provided.
This work proposes to address the gap between object and image-centric representations in the OVD setting by performing object-centric alignment of the language embeddings from the CLIP model using a pseudo-labeling process that provides high-quality object proposals and helps expand the vocabulary during training.
Experimental results show that X-VLM effectively leverages the learned multi-grained alignments on many downstream vision-language tasks and consistently outperforms state-of-the-art methods.