Open-vocabulary detection (OVD) aims to generalize beyond the limited number of base classes labeled during the training phase. The goal is to detect novel classes defined by an unbounded (open) vocabulary at inference.
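The core inference step shared by most OVD methods is matching region embeddings against text embeddings of an arbitrary, user-supplied vocabulary. A minimal NumPy sketch of this scoring (function name and temperature value are illustrative assumptions, not any specific paper's API):

```python
import numpy as np

def classify_regions(region_embs, text_embs, temperature=0.01):
    """Score detected regions against an open vocabulary of class-text embeddings.

    region_embs: (R, D) embeddings of detected boxes.
    text_embs:   (C, D) embeddings of the class-name prompts; C can change at
                 inference time, which is what makes the vocabulary "open".
    """
    r = region_embs / np.linalg.norm(region_embs, axis=1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = (r @ t.T) / temperature             # scaled cosine similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    return probs                                 # (R, C) per-region class probabilities
```

Because the class set enters only through `text_embs`, novel classes are added by encoding new prompts, with no retraining of the detector.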
This paper proposes a strong recipe for transferring image-text models to open-vocabulary object detection, using a standard Vision Transformer architecture with minimal modifications, contrastive image-text pre-training, and end-to-end detection fine-tuning.
This paper presents the OWLv2 model and the OWL-ST self-training recipe, which surpass previous state-of-the-art open-vocabulary detectors already at comparable training scales and unlock Web-scale training for open-world localization, similar to what has been seen for image classification and language modelling.
YOLO-World is introduced, an innovative approach that enhances YOLO with open-vocabulary detection capabilities through vision-language modeling and pre-training on large-scale datasets and proposes a new Re-parameterizable Vision-Language Path Aggregation Network (RepVL-PAN) and region-text contrastive loss to facilitate the interaction between visual and linguistic information.
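The region-text contrastive loss mentioned here pairs each region with a phrase and trains both directions symmetrically. A hedged NumPy sketch of such a symmetric InfoNCE objective (a generic formulation, not YOLO-World's exact implementation; function name and temperature are assumptions):

```python
import numpy as np

def region_text_contrastive(region_embs, text_embs, temperature=0.1):
    """Symmetric InfoNCE over matched (region_i, text_i) pairs: each region is
    pulled toward its paired phrase and pushed away from the other phrases in
    the batch, and vice versa for the text side."""
    r = region_embs / np.linalg.norm(region_embs, axis=1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = (r @ t.T) / temperature
    n = len(logits)

    def xent(l):  # cross-entropy with the diagonal as the positive pair
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[np.arange(n), np.arange(n)].mean()

    return 0.5 * (xent(logits) + xent(logits.T))
```

Correctly paired embeddings yield a lower loss than mismatched ones, which is the signal that aligns the visual and linguistic spaces.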
Recently, zero-shot and few-shot learning via Contrastive Vision-Language Pre-training (CLIP) has shown inspirational performance on 2D visual recognition, which learns to match images with their corresponding texts in open-vocabulary settings. However, it remains underexplored whether CLIP, pre-trained on large-scale image-text pairs in 2D, can be generalized to 3D recognition. In this paper, we show that such a setting is feasible by proposing PointCLIP, which conducts alignment between CLIP-encoded point clouds and 3D category texts. Specifically, we encode a point cloud by projecting it onto multi-view depth maps and aggregate the view-wise zero-shot predictions in an end-to-end manner, which achieves efficient knowledge transfer from 2D to 3D. We further design an inter-view adapter to better extract the global feature and adaptively fuse the 3D few-shot knowledge into CLIP pre-trained in 2D. By just fine-tuning the adapter under few-shot settings, the performance of PointCLIP can be largely improved. In addition, we observe a knowledge-complementarity property between PointCLIP and classical 3D-supervised networks. Via a simple ensemble during inference, PointCLIP contributes favorable performance enhancement over state-of-the-art 3D networks. Therefore, PointCLIP is a promising alternative for effective 3D point cloud understanding in low-data regimes with marginal resource cost. We conduct thorough experiments on ModelNet10, ModelNet40, and ScanObjectNN to demonstrate the effectiveness of PointCLIP. Code is available at https://github.com/ZrrSkywalker/PointCLIP.
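The projection-and-aggregation idea in PointCLIP can be sketched in a few lines of NumPy: render orthographic depth maps from several rotated views and average the per-view predictions. This is a simplified illustration (the actual paper uses CLIP as the image encoder; here `encode_fn` is a stand-in, and function names and the view count are assumptions):

```python
import numpy as np

def depth_map(points, resolution=32):
    """Orthographic projection of a point cloud (coords in [-1, 1]) onto the
    xy plane, keeping the nearest surface at each pixel."""
    xy = ((points[:, :2] + 1.0) / 2.0 * (resolution - 1)).round().astype(int)
    xy = np.clip(xy, 0, resolution - 1)
    depth = np.zeros((resolution, resolution))
    for (x, y), z in zip(xy, points[:, 2]):
        depth[y, x] = max(depth[y, x], z + 1.0)  # shift z so background stays 0
    return depth

def multi_view_logits(points, encode_fn, num_views=4):
    """Average the per-view zero-shot logits, mimicking PointCLIP's view-wise
    aggregation; encode_fn plays the role of CLIP on a rendered depth map."""
    outs = []
    for k in range(num_views):
        a = 2.0 * np.pi * k / num_views          # rotate about the y axis
        rot = np.array([[np.cos(a), 0.0, np.sin(a)],
                        [0.0, 1.0, 0.0],
                        [-np.sin(a), 0.0, np.cos(a)]])
        outs.append(encode_fn(depth_map(points @ rot.T)))
    return np.mean(outs, axis=0)
```

The inter-view adapter described in the abstract would replace the plain average with a learned fusion of the per-view features.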
This paper unifies CLIP and GPT into a single 3D open-world learner, named PointCLIP V2, which fully unleashes their potential for zero-shot 3D classification, segmentation, and detection, demonstrating generalization ability for unified 3D open-world learning.
This work designs an online proposal-mining strategy to refine the inherited vision-semantic knowledge from coarse to fine, allowing for proposal-level, detection-oriented feature alignment, and introduces a class-wise backdoor adjustment that reinforces predictions on novel categories to improve overall OVD performance.
This work distills the knowledge from a pretrained open-vocabulary image classification model (teacher) into a two-stage detector (student): the teacher model encodes category texts and image regions of object proposals, and the student detector is trained so that the region embeddings of its detected boxes are aligned with the text and image embeddings inferred by the teacher.
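The two objectives described here (this is the ViLD method) can be sketched as a distillation term plus a text-alignment term. A minimal NumPy illustration, assuming an L1 distillation loss and a softmax cross-entropy over scaled cosine similarities (the function name and temperature are assumptions; the paper's exact losses and weighting may differ):

```python
import numpy as np

def distill_and_align_losses(student_embs, teacher_embs, text_embs, labels,
                             temperature=0.01):
    """Two-part objective: (1) pull student region embeddings toward the
    teacher's image embeddings of the same proposals, and (2) classify each
    region against the class-text embeddings with cross-entropy."""
    # (1) distillation: elementwise L1 between student and teacher embeddings
    distill = np.abs(student_embs - teacher_embs).mean()

    # (2) alignment: cross-entropy over cosine similarity with class texts
    s = student_embs / np.linalg.norm(student_embs, axis=1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = (s @ t.T) / temperature
    logits -= logits.max(axis=1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    ce = -logp[np.arange(len(labels)), labels].mean()
    return distill, ce
```

Because the alignment term only touches text embeddings, the same trained student can later be queried with text for classes never seen in the detection labels.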
This paper revisits Copy-Paste at scale with the power of newly emerged zero-shot recognition models and text2image models, and demonstrates for the first time that using a text2image model to generate images, or a zero-shot recognition model to filter noisily crawled images, for different object categories is a feasible way to make Copy-Paste truly scalable.
This work proposes a novel open-vocabulary detector based on DETR -- hence the name OV-DETR -- which, once trained, can detect any object given its class name or an exemplar image, and achieves non-trivial improvements over the current state of the art.
We present Region-aware Open-vocabulary Vision Transformers (RO-ViT) - a contrastive image-text pretraining recipe to bridge the gap between image-level pretraining and open-vocabulary object detection. At the pretraining phase, we propose to randomly crop and resize regions of positional embeddings instead of using the whole image positional embeddings. This better matches the use of positional embeddings at region-level in the detection finetuning phase. In addition, we replace the common softmax cross entropy loss in contrastive learning with focal loss to better learn the informative yet difficult examples. Finally, we leverage recent advances in novel object proposals to improve open-vocabulary detection finetuning. We evaluate our full model on the LVIS and COCO open-vocabulary detection benchmarks and zero-shot transfer. RO-ViT achieves a state-of-the-art 32.1 APr on LVIS, surpassing the best existing approach by +5.8 points in addition to competitive zero-shot transfer detection. Surprisingly, RO-ViT improves the image-level representation as well and achieves the state of the art on 9 out of 12 metrics on COCO and Flickr image-text retrieval benchmarks, outperforming competitive approaches with larger models.
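RO-ViT's central trick, cropping and resizing regions of the positional embeddings during pretraining, can be illustrated with a small NumPy sketch. This is a loose approximation (the function name is an assumption, and a nearest-neighbor resize stands in for whatever interpolation the actual implementation uses):

```python
import numpy as np

def cropped_pos_embed(pos_embed, rng):
    """Randomly crop a square region of the 2D positional-embedding grid and
    resize it back to the full grid, so pretraining sees region-level rather
    than whole-image positions (a sketch of RO-ViT's augmentation)."""
    g = pos_embed.shape[0]                  # pos_embed: (grid, grid, dim)
    size = int(rng.integers(g // 2, g + 1)) # crop side length
    y0 = int(rng.integers(0, g - size + 1))
    x0 = int(rng.integers(0, g - size + 1))
    crop = pos_embed[y0:y0 + size, x0:x0 + size]
    idx = np.arange(g) * size // g          # nearest-neighbor resize indices
    return crop[np.ix_(idx, idx)]           # back to (grid, grid, dim)
```

The intuition is that detection fine-tuning later interpolates positional embeddings to arbitrary box-sized regions, so exposing the contrastive pretraining to cropped positional grids narrows that train/finetune mismatch.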