Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection (2023-03-09)

TL;DR

Grounding DINO is an open-set object detector that combines the Transformer-based detector DINO with grounded pre-training. It detects arbitrary objects specified by human inputs such as category names or referring expressions, and performs well on benchmarks including COCO, LVIS, ODinW, and RefCOCO/+/g.

Abstract

In this paper, we present an open-set object detector, called Grounding DINO, by marrying the Transformer-based detector DINO with grounded pre-training, which can detect arbitrary objects with human inputs such as category names or referring expressions. The key to open-set object detection is introducing language to a closed-set detector for open-set concept generalization. To effectively fuse the language and vision modalities, we conceptually divide a closed-set detector into three phases and propose a tight fusion solution, which includes a feature enhancer, a language-guided query selection, and a cross-modality decoder for cross-modality fusion. While previous works mainly evaluate open-set object detection on novel categories, we propose to also perform evaluations on referring expression comprehension for objects specified with attributes. Grounding DINO performs remarkably well on all three settings, including benchmarks on COCO, LVIS, ODinW, and RefCOCO/+/g. Grounding DINO achieves 52.5 AP on the COCO detection zero-shot transfer benchmark, i.e., without any training data from COCO. It sets a new record on the ODinW zero-shot benchmark with a mean of 26.1 AP. Code will be available at https://github.com/IDEA-Research/GroundingDINO.
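To make the language-guided query selection step more concrete, the following is a minimal sketch rather than the released implementation. The abstract only names the component, so the selection rule used here is an assumption: each image token is scored by its maximum dot-product similarity over all text tokens, and the top-scoring tokens are kept as decoder queries. The names `dot`, `select_queries`, `image_feats`, `text_feats`, and `num_queries` are hypothetical and chosen for illustration.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def select_queries(
    image_feats: List[List[float]],
    text_feats: List[List[float]],
    num_queries: int,
) -> List[int]:
    """Return indices of the image tokens most similar to any text token.

    Assumed rule (not stated in the abstract): score each image token by its
    maximum dot-product similarity over all text tokens, then keep the top-k.
    """
    scores = [max(dot(img, txt) for txt in text_feats) for img in image_feats]
    # Sort token indices by descending score; ties keep their original order.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:num_queries]


if __name__ == "__main__":
    # Toy 2-D features: four image tokens and two text tokens.
    image_feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]]
    text_feats = [[1.0, 0.0], [0.8, 0.2]]
    print(select_queries(image_feats, text_feats, num_queries=2))
```

In this toy input, the image tokens that align best with the text features are the ones selected, which is the intuition behind letting the language input decide where the decoder looks. A practical version would operate on batched tensors rather than Python lists.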

Authors

Hang Su

Jianwei Yang

Chun-yue Li
