Open-Vocabulary Attribute Detection (OVAD) is a task that aims to detect and recognize an open set of objects and their associated attributes in an image. The objects and attributes are defined by text queries during inference, without prior knowledge of the tested classes during training.
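The matching step described above can be sketched in a few lines: region features from a detector and embeddings of free-form text queries live in a shared space, and each region is labeled by its most similar query. This is a minimal toy sketch, not any specific model's implementation; the random vectors stand in for embeddings a real vision-language model would produce.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(x):
    # L2-normalize rows so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Hypothetical stand-ins: 4 detected region features, 3 text queries
# (e.g. "red car", "striped shirt") embedded by a text encoder.
region_feats = normalize(rng.standard_normal((4, 512)))
query_feats = normalize(rng.standard_normal((3, 512)))

# Cosine similarity between every region and every text query.
scores = region_feats @ query_feats.T   # shape (4, 3)
labels = scores.argmax(axis=1)          # best-matching query per region
print(labels.shape)  # (4,)
```

Because the class set is defined only by the text queries supplied at inference time, new categories can be added without retraining the detector.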
It is demonstrated that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.
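The caption-matching pre-training objective described above is a symmetric contrastive (InfoNCE) loss over a batch of image-text pairs, with matching pairs on the diagonal of the similarity matrix. The sketch below illustrates that loss with random stand-in embeddings and an assumed temperature of 0.07; it is not the actual CLIP implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
B, D = 8, 64  # batch size, embedding dimension (illustrative values)

# Stand-ins for encoder outputs; L2-normalized as in contrastive setups.
img = rng.standard_normal((B, D))
txt = rng.standard_normal((B, D))
img /= np.linalg.norm(img, axis=1, keepdims=True)
txt /= np.linalg.norm(txt, axis=1, keepdims=True)

temperature = 0.07
logits = img @ txt.T / temperature  # (B, B) image-to-text similarities

def cross_entropy(logits, targets):
    # Row-wise softmax cross-entropy against integer class targets.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

# The i-th image matches the i-th caption, so targets are the diagonal.
targets = np.arange(B)
loss = 0.5 * (cross_entropy(logits, targets) + cross_entropy(logits.T, targets))
print(round(float(loss), 3))
```

Averaging the image-to-text and text-to-image terms makes the objective symmetric, so both encoders are pulled toward the same shared embedding space.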
BLIP-2 achieves state-of-the-art performance on various vision-language tasks despite having significantly fewer trainable parameters than existing methods, and demonstrates emerging capabilities of zero-shot image-to-text generation that can follow natural language instructions.
BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones, and demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner.
ALBEF introduces a contrastive loss to ALign the image and text representations BEfore Fusing them through cross-modal attention, which enables more grounded vision-and-language representation learning, and proposes momentum distillation, a self-training method that learns from pseudo-targets produced by a momentum model.
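The momentum model mentioned above is maintained as an exponential moving average (EMA) of the online encoder's weights, so its pseudo-targets evolve smoothly during training. A minimal sketch of that update, with illustrative parameter shapes and an assumed momentum coefficient of 0.995:

```python
import numpy as np

def ema_update(momentum_params, online_params, m=0.995):
    # Each momentum parameter takes a small step toward the
    # corresponding online parameter; m close to 1 means slow drift.
    return [m * p_m + (1.0 - m) * p_o
            for p_m, p_o in zip(momentum_params, online_params)]

# Toy parameter lists standing in for full encoder weights.
online = [np.ones((2, 2)), np.zeros(3)]
momentum = [np.zeros((2, 2)), np.ones(3)]

momentum = ema_update(momentum, online)
print(round(momentum[0][0, 0], 3))  # 0.005
```

Because the momentum weights lag behind the online weights, the pseudo-targets they produce are more stable than the online model's own predictions, which helps the model learn from noisy web supervision.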
This work investigates scaling laws for contrastive language-image pre-training (CLIP) with the public LAION dataset and the open-source OpenCLIP repository, and finds that the training distribution plays a key role in scaling laws, as the OpenAI and OpenCLIP models exhibit different scaling behavior.
This paper proposes a new method to train object detectors using bounding box annotations for a limited set of object categories, as well as image-caption pairs that cover a larger variety of objects at a significantly lower cost.
This work proposes an open-vocabulary object detection method that, based on image-caption pairs, learns to detect novel object classes along with a given set of known classes and shows that a simple language model fits better than a large contextualized language model for detecting novel objects.
The Open-Vocabulary Attribute Detection (OVAD) task and the corresponding OVAD benchmark are introduced to probe object-level attribute information learned by vision-language models, and a first baseline method for open-vocabulary attribute detection is provided.
This work proposes to address the gap between object and image-centric representations in the OVD setting by performing object-centric alignment of the language embeddings from the CLIP model using a pseudo-labeling process that provides high-quality object proposals and helps expand the vocabulary during training.
Experimental results show that X-VLM effectively leverages the learned multi-grained alignments on many downstream vision-language tasks and consistently outperforms state-of-the-art methods.