Open-vocabulary detection (OVD) aims to generalize beyond the limited number of base classes labeled during the training phase. The goal is to detect novel classes defined by an unbounded (open) vocabulary at inference.
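The core inference step shared by most OVD methods is matching region embeddings against text embeddings of an arbitrary, user-supplied vocabulary. A minimal NumPy sketch of this scoring (function name and temperature value are illustrative assumptions, not any specific paper's API):

```python
import numpy as np

def classify_regions(region_embs, text_embs, temperature=0.01):
    """Score detected regions against an open vocabulary of class-text embeddings.

    region_embs: (R, D) embeddings of detected boxes.
    text_embs:   (C, D) embeddings of the class-name prompts; C can change at
                 inference time, which is what makes the vocabulary "open".
    """
    r = region_embs / np.linalg.norm(region_embs, axis=1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = (r @ t.T) / temperature             # scaled cosine similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    return probs                                 # (R, C) per-region class probabilities
```

Because the class set enters only through `text_embs`, novel classes are added by encoding new prompts, with no retraining of the detector.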
This paper proposes a strong recipe for transferring image-text models to open-vocabulary object detection, using a standard Vision Transformer architecture with minimal modifications, contrastive image-text pre-training, and end-to-end detection fine-tuning.
This paper presents the OWLv2 model and the OWL-ST self-training recipe, which surpass previous state-of-the-art open-vocabulary detectors already at comparable training scales and unlock Web-scale training for open-world localization, similar to what has been seen for image classification and language modelling.
YOLO-World is introduced, an innovative approach that enhances YOLO with open-vocabulary detection capabilities through vision-language modeling and pre-training on large-scale datasets and proposes a new Re-parameterizable Vision-Language Path Aggregation Network (RepVL-PAN) and region-text contrastive loss to facilitate the interaction between visual and linguistic information.
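The region-text contrastive loss mentioned here pairs each region with a phrase and trains both directions symmetrically. A hedged NumPy sketch of such a symmetric InfoNCE objective (a generic formulation, not YOLO-World's exact implementation; function name and temperature are assumptions):

```python
import numpy as np

def region_text_contrastive(region_embs, text_embs, temperature=0.1):
    """Symmetric InfoNCE over matched (region_i, text_i) pairs: each region is
    pulled toward its paired phrase and pushed away from the other phrases in
    the batch, and vice versa for the text side."""
    r = region_embs / np.linalg.norm(region_embs, axis=1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = (r @ t.T) / temperature
    n = len(logits)

    def xent(l):  # cross-entropy with the diagonal as the positive pair
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[np.arange(n), np.arange(n)].mean()

    return 0.5 * (xent(logits) + xent(logits.T))
```

Correctly paired embeddings yield a lower loss than mismatched ones, which is the signal that aligns the visual and linguistic spaces.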
Recently, zero-shot and few-shot learning via Contrastive Vision-Language Pre-training (CLIP) has shown inspirational performance on 2D visual recognition, which learns to match images with their corresponding texts in open-vocabulary settings. However, it remains underexplored whether CLIP, pre-trained on large-scale image-text pairs in 2D, can be generalized to 3D recognition. In this paper, we show that such a setting is feasible by proposing PointCLIP, which conducts alignment between CLIP-encoded point clouds and 3D category texts. Specifically, we encode a point cloud by projecting it onto multi-view depth maps and aggregate the view-wise zero-shot predictions in an end-to-end manner, which achieves efficient knowledge transfer from 2D to 3D. We further design an inter-view adapter to better extract the global feature and adaptively fuse the 3D few-shot knowledge into CLIP pre-trained in 2D. By just fine-tuning the adapter under few-shot settings, the performance of PointCLIP can be largely improved. In addition, we observe a knowledge-complementarity property between PointCLIP and classical 3D-supervised networks. Via a simple ensemble during inference, PointCLIP contributes favorable performance enhancement over state-of-the-art 3D networks. Therefore, PointCLIP is a promising alternative for effective 3D point cloud understanding in low-data regimes with marginal resource cost. We conduct thorough experiments on ModelNet10, ModelNet40, and ScanObjectNN to demonstrate the effectiveness of PointCLIP. Code is available at https://github.com/ZrrSkywalker/PointCLIP.
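The projection-and-aggregation idea in PointCLIP can be sketched in a few lines of NumPy: render orthographic depth maps from several rotated views and average the per-view predictions. This is a simplified illustration (the actual paper uses CLIP as the image encoder; here `encode_fn` is a stand-in, and function names and the view count are assumptions):

```python
import numpy as np

def depth_map(points, resolution=32):
    """Orthographic projection of a point cloud (coords in [-1, 1]) onto the
    xy plane, keeping the nearest surface at each pixel."""
    xy = ((points[:, :2] + 1.0) / 2.0 * (resolution - 1)).round().astype(int)
    xy = np.clip(xy, 0, resolution - 1)
    depth = np.zeros((resolution, resolution))
    for (x, y), z in zip(xy, points[:, 2]):
        depth[y, x] = max(depth[y, x], z + 1.0)  # shift z so background stays 0
    return depth

def multi_view_logits(points, encode_fn, num_views=4):
    """Average the per-view zero-shot logits, mimicking PointCLIP's view-wise
    aggregation; encode_fn plays the role of CLIP on a rendered depth map."""
    outs = []
    for k in range(num_views):
        a = 2.0 * np.pi * k / num_views          # rotate about the y axis
        rot = np.array([[np.cos(a), 0.0, np.sin(a)],
                        [0.0, 1.0, 0.0],
                        [-np.sin(a), 0.0, np.cos(a)]])
        outs.append(encode_fn(depth_map(points @ rot.T)))
    return np.mean(outs, axis=0)
```

The inter-view adapter described in the abstract would replace the plain average with a learned fusion of the per-view features.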
This paper unifies CLIP and GPT into a single 3D open-world learner, named PointCLIP V2, which fully unleashes their potential for zero-shot 3D classification, segmentation, and detection, demonstrating generalization ability for unified 3D open-world learning.
This work designs an online proposal-mining strategy to refine the inherited vision-semantic knowledge from coarse to fine, allowing for proposal-level, detection-oriented feature alignment, and introduces a class-wise backdoor adjustment that reinforces predictions on novel categories to improve overall OVD performance.
This work distills the knowledge from a pretrained open-vocabulary image classification model (teacher) into a two-stage detector (student): the teacher model encodes category texts and image regions of object proposals, and the student detector is trained so that the region embeddings of its detected boxes are aligned with the text and image embeddings inferred by the teacher.
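The two objectives described here (this is the ViLD method) can be sketched as a distillation term plus a text-alignment term. A minimal NumPy illustration, assuming an L1 distillation loss and a softmax cross-entropy over scaled cosine similarities (the function name and temperature are assumptions; the paper's exact losses and weighting may differ):

```python
import numpy as np

def distill_and_align_losses(student_embs, teacher_embs, text_embs, labels,
                             temperature=0.01):
    """Two-part objective: (1) pull student region embeddings toward the
    teacher's image embeddings of the same proposals, and (2) classify each
    region against the class-text embeddings with cross-entropy."""
    # (1) distillation: elementwise L1 between student and teacher embeddings
    distill = np.abs(student_embs - teacher_embs).mean()

    # (2) alignment: cross-entropy over cosine similarity with class texts
    s = student_embs / np.linalg.norm(student_embs, axis=1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = (s @ t.T) / temperature
    logits -= logits.max(axis=1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    ce = -logp[np.arange(len(labels)), labels].mean()
    return distill, ce
```

Because the alignment term only touches text embeddings, the same trained student can later be queried with text for classes never seen in the detection labels.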
This paper revisits Copy-Paste at scale with the power of newly emerged zero-shot recognition models and text2image models, and demonstrates for the first time that using a text2image model to generate images, or a zero-shot recognition model to filter noisily crawled images, for different object categories is a feasible way to make Copy-Paste truly scalable.
This work proposes a novel open-vocabulary detector based on DETR -- hence the name OV-DETR -- which, once trained, can detect any object given its class name or an exemplar image, and achieves non-trivial improvements over the current state of the art.
We present Region-aware Open-vocabulary Vision Transformers (RO-ViT) - a contrastive image-text pretraining recipe to bridge the gap between image-level pretraining and open-vocabulary object detection. At the pretraining phase, we propose to randomly crop and resize regions of positional embeddings instead of using the whole image positional embeddings. This better matches the use of positional embeddings at region-level in the detection finetuning phase. In addition, we replace the common softmax cross entropy loss in contrastive learning with focal loss to better learn the informative yet difficult examples. Finally, we leverage recent advances in novel object proposals to improve open-vocabulary detection finetuning. We evaluate our full model on the LVIS and COCO open-vocabulary detection benchmarks and zero-shot transfer. RO-ViT achieves a state-of-the-art 32.1 APr on LVIS, surpassing the best existing approach by +5.8 points in addition to competitive zero-shot transfer detection. Surprisingly, RO-ViT improves the image-level representation as well and achieves the state of the art on 9 out of 12 metrics on COCO and Flickr image-text retrieval benchmarks, outperforming competitive approaches with larger models.
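RO-ViT's central trick, cropping and resizing regions of the positional embeddings during pretraining, can be illustrated with a small NumPy sketch. This is a loose approximation (the function name is an assumption, and a nearest-neighbor resize stands in for whatever interpolation the actual implementation uses):

```python
import numpy as np

def cropped_pos_embed(pos_embed, rng):
    """Randomly crop a square region of the 2D positional-embedding grid and
    resize it back to the full grid, so pretraining sees region-level rather
    than whole-image positions (a sketch of RO-ViT's augmentation)."""
    g = pos_embed.shape[0]                  # pos_embed: (grid, grid, dim)
    size = int(rng.integers(g // 2, g + 1)) # crop side length
    y0 = int(rng.integers(0, g - size + 1))
    x0 = int(rng.integers(0, g - size + 1))
    crop = pos_embed[y0:y0 + size, x0:x0 + size]
    idx = np.arange(g) * size // g          # nearest-neighbor resize indices
    return crop[np.ix_(idx, idx)]           # back to (grid, grid, dim)
```

The intuition is that detection fine-tuning later interpolates positional embeddings to arbitrary box-sized regions, so exposing the contrastive pretraining to cropped positional grids narrows that train/finetune mismatch.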