Described Object Detection (DOD) aims to detect every instance in an image that matches a flexible language description. It is a superset of Open-Vocabulary Object Detection (OVD) and Referring Expression Comprehension (REC): it generalizes OVD's category names to free-form language expressions, and it lifts REC's restriction of grounding only a single, presupposed object, since a description may match zero, one, or many instances. Works related to DOD are tracked in the awesome-DOD list on GitHub; a minimal inference sketch is given below.
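As a concrete illustration of the DOD interface (an image plus a free-form description in, all matching boxes out), here is a hedged sketch using Grounding DINO through the Hugging Face `transformers` library; the checkpoint name, image file, and description are assumptions, and any detector that accepts open-ended text prompts could be substituted.

```python
# Minimal DOD-style inference sketch (illustrative, not a reference implementation):
# one image plus a free-form description in, all matching boxes out.
import torch
from PIL import Image
from transformers import AutoProcessor, GroundingDinoForObjectDetection

checkpoint = "IDEA-Research/grounding-dino-tiny"   # assumed checkpoint
processor = AutoProcessor.from_pretrained(checkpoint)
model = GroundingDinoForObjectDetection.from_pretrained(checkpoint)

image = Image.open("street.jpg")                   # hypothetical input image
description = "a person riding a bicycle."         # flexible expression, not a fixed category name

inputs = processor(images=image, text=description, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Unlike REC, a description may match zero, one, or many instances; keep every box
# whose region-text alignment clears the (version-dependent) default threshold.
results = processor.post_process_grounded_object_detection(
    outputs, inputs.input_ids, target_sizes=[image.size[::-1]]
)[0]
for box, score in zip(results["boxes"], results["scores"]):
    print([round(v, 1) for v in box.tolist()], round(score.item(), 3))
```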
A grounded language-image pre-training model that learns object-level, language-aware, and semantically rich visual representations; it unifies object detection and phrase grounding during pre-training and can leverage massive image-text pairs by generating grounding boxes in a self-training fashion.
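A conceptual sketch of this unification (toy tensors, not the paper's actual code or API): per-class detection logits are replaced by alignment scores between region features and the token embeddings of a text prompt, so any phrase can act as a "class" and detection and phrase grounding share one formulation. All shapes below are assumptions.

```python
# Conceptual region-word alignment sketch for grounded pre-training (not an official API).
import torch

num_regions, num_tokens, dim = 900, 16, 256   # assumed toy sizes
region_feats = torch.randn(num_regions, dim)  # image-side features after cross-modal fusion
token_feats = torch.randn(num_tokens, dim)    # text-side features after cross-modal fusion

alignment = region_feats @ token_feats.T      # (regions, tokens) grounding scores
scores = alignment.sigmoid().amax(dim=1)      # best-matching prompt token per region
keep = scores > 0.5                           # regions grounded to some phrase in the prompt
print(int(keep.sum()), "regions matched the prompt")
```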
This paper proposes a strong recipe for transferring image-text models to open-vocabulary object detection, using a standard Vision Transformer architecture with minimal modifications, contrastive image-text pre-training, and end-to-end detection fine-tuning.
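This recipe corresponds to the OWL-ViT line of work; a minimal usage sketch with its Hugging Face `transformers` implementation follows. The checkpoint name, image file, query phrases, and threshold are assumptions and would need tuning in practice.

```python
# Open-vocabulary detection sketch with OWL-ViT: text queries stand in for a fixed label set.
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

image = Image.open("kitchen.jpg")                        # hypothetical input image
queries = [["a photo of a mug", "a photo of a kettle"]]  # open-vocabulary text queries

inputs = processor(text=queries, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Rescale boxes to the original image size and filter by confidence.
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(
    outputs=outputs, threshold=0.1, target_sizes=target_sizes
)[0]
for box, score, label in zip(results["boxes"], results["scores"], results["labels"]):
    print(queries[0][label], round(score.item(), 3), [round(v, 1) for v in box.tolist()])
```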
This work presents a universal instance perception model of the next generation, termed UNINEXT, which reformulates diverse instance perception tasks into a unified object discovery and retrieval paradigm and can flexibly perceive different types of objects by simply changing the input prompts.
MM-Grounding-DINO is presented, an open-source, comprehensive, and user-friendly baseline, which is built with the MMDetection toolbox, and outperforms the Grounding-DINO-Tiny baseline.
This work proposes CORA, a DETR-style framework that adapts CLIP for Open-vocabulary detection by Region prompting and Anchor pre-matching, which mitigates the whole-to-region distribution gap by prompting the region features of the CLIP-based region classifier.
SPHINX, a versatile multi-modal large language model (MLLM) with a joint mixing of model weights, tuning tasks, and visual embeddings is presented, and an efficient strategy aiming to better capture fine-grained appearances of high-resolution images is proposed.
This work presents FIBER (Fusion-In-the-Backbone-based transformER), a new vision-language model architecture that can seamlessly handle both image-level and region-level tasks, providing consistent performance improvements over strong baselines across all tasks and often outperforming methods trained on orders of magnitude more data.
A baseline is proposed that substantially improves on existing REC methods by reconstructing the training data and introducing a binary classification sub-task, outperforming prior approaches.