3260 papers • 126 benchmarks • 313 datasets
Zero-shot object detection (ZSD) is the task of object detection where no visual training data is available for some of the target object classes. (Image credit: Zero-Shot Object Detection: Learning to Simultaneously Recognize and Localize Novel Concepts)
These leaderboards are used to track progress in zero-shot object detection.
Use these libraries to find zero-shot object detection models and implementations.
An open-set object detector, called Grounding DINO, is presented by marrying the Transformer-based detector DINO with grounded pre-training; it can detect arbitrary objects given human inputs such as category names or referring expressions, and performs remarkably well across all three settings.
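Grounding DINO can be run through the Hugging Face transformers integration; below is a minimal text-prompted inference sketch, assuming the IDEA-Research/grounding-dino-tiny checkpoint and a local image file (street.jpg is a placeholder path):

```python
import torch
from PIL import Image
from transformers import AutoProcessor, GroundingDinoForObjectDetection

model_id = "IDEA-Research/grounding-dino-tiny"
processor = AutoProcessor.from_pretrained(model_id)
model = GroundingDinoForObjectDetection.from_pretrained(model_id)

image = Image.open("street.jpg")  # placeholder path
# Category names are passed as a lower-cased, period-separated prompt.
text = "a person. a bicycle. a traffic light."

inputs = processor(images=image, text=text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

results = processor.post_process_grounded_object_detection(
    outputs,
    inputs.input_ids,
    box_threshold=0.4,
    text_threshold=0.3,
    target_sizes=[image.size[::-1]],  # (height, width)
)
print(results[0]["boxes"], results[0]["labels"])
```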
This work builds ELEVATER (Evaluation of Language-augmented Visual Task-level Transfer), the first benchmark and toolkit for evaluating (pre-trained) language-augmented visual models.
A classification-free Object Localization Network (OLN) is proposed, which estimates the objectness of each region purely by how well the location and shape of the region overlap with any ground-truth object.
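The objectness targets in this scheme come from localization quality rather than class labels. A minimal sketch of that idea, using plain IoU against the best-matching ground-truth box (OLN itself uses centerness and IoU heads; this is a simplified illustration):

```python
import torch

def pairwise_iou(boxes1: torch.Tensor, boxes2: torch.Tensor) -> torch.Tensor:
    """IoU between two sets of boxes in (x1, y1, x2, y2) format."""
    area1 = (boxes1[:, 2] - boxes1[:, 0]) * (boxes1[:, 3] - boxes1[:, 1])
    area2 = (boxes2[:, 2] - boxes2[:, 0]) * (boxes2[:, 3] - boxes2[:, 1])
    lt = torch.max(boxes1[:, None, :2], boxes2[None, :, :2])  # intersection top-left
    rb = torch.min(boxes1[:, None, 2:], boxes2[None, :, 2:])  # intersection bottom-right
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    return inter / (area1[:, None] + area2[None, :] - inter)

def objectness_targets(proposals: torch.Tensor, gt_boxes: torch.Tensor) -> torch.Tensor:
    # Classification-free objectness: each proposal is scored by how well
    # it overlaps *any* ground-truth object, regardless of its class.
    return pairwise_iou(proposals, gt_boxes).max(dim=1).values
```

The predicted objectness is then regressed against these targets (e.g., with an L1 loss) instead of being trained as a foreground/background classifier, which is what lets the network generalize to novel objects.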
This work distills the knowledge from a pretrained open-vocabulary image classification model (teacher) into a two-stage detector (student): the teacher model encodes category texts and image regions of object proposals, and the student detector is trained so that the region embeddings of its detected boxes align with the text and image embeddings inferred by the teacher.
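A minimal sketch of this teacher-student alignment, with a text-classification term and a distillation term; tensor shapes, the L1 choice, and the temperature value are illustrative assumptions, not the paper's exact recipe:

```python
import torch
import torch.nn.functional as F

def teacher_student_losses(region_embeds, teacher_image_embeds, text_embeds, labels, tau=0.01):
    """Illustrative distillation-style losses (simplified sketch).

    region_embeds:        (N, D) student embeddings of proposal boxes
    teacher_image_embeds: (N, D) teacher (e.g., CLIP image encoder) embeddings
                          of the same cropped proposals
    text_embeds:          (C, D) teacher text embeddings of category names
    labels:               (N,)   class indices for the proposals
    """
    region_embeds = F.normalize(region_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # Text head: classify regions by similarity to category-text embeddings.
    logits = region_embeds @ text_embeds.t() / tau
    text_loss = F.cross_entropy(logits, labels)

    # Distillation head: pull student region embeddings toward the
    # teacher's image embeddings of the same regions.
    distill_loss = F.l1_loss(region_embeds, F.normalize(teacher_image_embeds, dim=-1))

    return text_loss, distill_loss
```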
A novel loss function called 'Polarity loss' is proposed that promotes correct visual-semantic alignment for improved zero-shot object detection and refines the noisy semantic embeddings via metric learning on a 'Semantic vocabulary' of related concepts, establishing a better synergy between the visual and semantic domains.
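A simplified illustration of the polarity idea: a focal-style term is re-weighted by a monotonic penalty on the gap between each class score and the positive-class score, so negatives that approach the positive prediction are up-weighted. This is a sketch of the concept only, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def polarity_style_loss(logits, targets, gamma=2.0, beta=5.0):
    """Illustrative polarity-style loss (simplified, not the paper's exact form).

    logits:  (N, C) per-class prediction logits
    targets: (N, C) binary (multi-hot) ground-truth labels
    """
    p = logits.sigmoid()
    p_t = p * targets + (1 - p) * (1 - targets)  # prob of the true outcome
    focal = F.binary_cross_entropy_with_logits(
        logits, targets, reduction="none"
    ) * (1 - p_t).pow(gamma)

    # Score of the ground-truth (positive) class for each sample.
    p_pos = (p * targets).max(dim=1, keepdim=True).values
    # Monotonic penalty sigmoid(beta * (p - p_pos)): negatives scoring close
    # to or above the positive class are up-weighted, easy negatives are
    # suppressed, widening the positive-negative margin.
    penalty = torch.sigmoid(beta * (p - p_pos))
    return (focal * penalty).mean()
```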
A grounded language-image pre-training (GLIP) model is presented for learning object-level, language-aware, and semantic-rich visual representations; it unifies object detection and phrase grounding for pre-training and can leverage massive image-text pairs by generating grounding boxes in a self-training fashion.
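In this formulation the classifier's fixed weight matrix is replaced by alignment scores between region features and prompt-token features. A minimal sketch (shapes, names, and the temperature are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def word_region_alignment(region_feats, token_feats, tau=0.07):
    """Illustrative word-region alignment scores (simplified grounding-style head).

    region_feats: (R, D) visual features of candidate regions
    token_feats:  (T, D) language features of the prompt tokens, e.g. for
                  the prompt "person. bicycle. traffic light."
    Returns an (R, T) score matrix used in place of fixed-class logits,
    so the label space is defined entirely by the text prompt.
    """
    region_feats = F.normalize(region_feats, dim=-1)
    token_feats = F.normalize(token_feats, dim=-1)
    return region_feats @ token_feats.t() / tau
```

During training these scores are supervised with ground-truth word-region correspondences, which is what lets detection data and grounding data share a single objective.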
The OWLv2 model and the OWL-ST self-training recipe are presented; they surpass previous state-of-the-art open-vocabulary detectors already at comparable training scales and unlock Web-scale training for open-world localization, similar to what has been seen for image classification and language modelling.
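Checkpoints trained with this recipe can be queried with free-form text at inference time; a minimal sketch, assuming the transformers Owlv2 integration and the google/owlv2-base-patch16-ensemble checkpoint (street.jpg is a placeholder path):

```python
import torch
from PIL import Image
from transformers import Owlv2Processor, Owlv2ForObjectDetection

model_id = "google/owlv2-base-patch16-ensemble"
processor = Owlv2Processor.from_pretrained(model_id)
model = Owlv2ForObjectDetection.from_pretrained(model_id)

image = Image.open("street.jpg")  # placeholder path
texts = [["a photo of a person", "a photo of a bicycle"]]  # one query list per image

inputs = processor(text=texts, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
results = processor.post_process_object_detection(
    outputs=outputs, threshold=0.2, target_sizes=target_sizes
)
print(results[0]["boxes"], results[0]["scores"], results[0]["labels"])
```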
YOLO-World is introduced, an innovative approach that enhances YOLO with open-vocabulary detection capabilities through vision-language modeling and pre-training on large-scale datasets. It proposes a new Re-parameterizable Vision-Language Path Aggregation Network (RepVL-PAN) and a region-text contrastive loss to facilitate the interaction between visual and linguistic information.
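A minimal sketch of prompt-driven detection with the ultralytics package, assuming the yolov8s-world.pt checkpoint (the image path is a placeholder):

```python
from ultralytics import YOLO

# Load a pretrained YOLO-World checkpoint.
model = YOLO("yolov8s-world.pt")

# Define the open vocabulary at run time; the prompt embeddings can later
# be re-parameterized into the network for prompt-free deployment.
model.set_classes(["person", "bus", "backpack"])

results = model.predict("street.jpg")  # placeholder path
results[0].show()
```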
The integration of domain-specific indices (NCGI) and prompt optimization techniques provides an effective solution for plant phenotyping, highlighting the potential of weakly supervised models in agricultural computer vision, where extensive manual annotation is impractical.
A novel approach to zero-shot object detection (ZSD), where no visual training data is available for some of the target object classes, is presented, in which a convex combination of embeddings is used in conjunction with a detection framework.
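A minimal sketch of the convex-combination idea: the detector's seen-class scores weight the seen-class word embeddings, and the resulting semantic vector is matched against unseen-class embeddings. Shapes and function names are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def semantic_projection(seen_probs: torch.Tensor, seen_word_vecs: torch.Tensor) -> torch.Tensor:
    """Project detections into semantic space as a convex combination of
    seen-class word embeddings, weighted by the detector's class scores.

    seen_probs:     (N, S) softmax scores over the seen classes
    seen_word_vecs: (S, D) word embeddings of the seen classes
    """
    return seen_probs @ seen_word_vecs  # (N, D)

def classify_unseen(pred_embeds: torch.Tensor, unseen_word_vecs: torch.Tensor) -> torch.Tensor:
    # Assign each detection to the nearest unseen-class embedding
    # by cosine similarity.
    sims = F.normalize(pred_embeds, dim=-1) @ F.normalize(unseen_word_vecs, dim=-1).t()
    return sims.argmax(dim=-1)
```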