Visual In-Context Prompting (2023-11-22)

TL;DR

This paper builds on an encoder-decoder architecture and develops a versatile prompt encoder that supports a variety of prompts, such as strokes, boxes, and points, and that can take an arbitrary number of reference image segments as context.

Abstract

In-context prompting in large language models (LLMs) has become a prevalent approach to improve zero-shot capabilities, but this idea is less explored in the vision domain. Existing visual prompting methods focus on referring segmentation, which segments the most relevant object, and fall short of addressing many generic vision tasks such as open-set segmentation and detection. In this paper, we introduce a universal visual in-context prompting framework for both tasks, as shown in Fig. 1. In particular, we build on top of an encoder-decoder architecture and develop a versatile prompt encoder to support a variety of prompts such as strokes, boxes, and points. We further enhance it to take an arbitrary number of reference image segments as the context. Our extensive explorations show that the proposed visual in-context prompting elicits extraordinary referring and generic segmentation capabilities to refer and detect, yielding performance competitive with closed-set methods on in-domain datasets and showing promising results on many open-set segmentation datasets. By joint training on COCO and SA-1B, DINOv achieves 57.7 PQ on COCO and 23.2 PQ on ADE20K. Code will be available at https://github.com/UX-Decoder/DINOv
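
To make the idea of a prompt encoder over multiple reference segments more concrete, here is a minimal, dependency-free sketch. It is not the DINOv implementation, which the abstract does not describe in detail. It assumes that each prompt (a box, a point, or a stroke already rasterized by the caller) is turned into a binary mask, that a prompt embedding is the mean of the image features under that mask, and that the context is the plain average of those embeddings across references. All function names (`box_to_mask`, `point_to_mask`, `masked_average`, `encode_context`, `best_match`) are hypothetical and introduced only for illustration.

```python
from math import sqrt


def box_to_mask(box, height, width):
    """Rasterize an (x0, y0, x1, y1) box, upper bounds exclusive, into a 0/1 mask."""
    x0, y0, x1, y1 = box
    return [[1 if (x0 <= x < x1 and y0 <= y < y1) else 0 for x in range(width)]
            for y in range(height)]


def point_to_mask(point, height, width):
    """Treat a single (x, y) point prompt as a one-pixel box."""
    x, y = point
    return box_to_mask((x, y, x + 1, y + 1), height, width)


def masked_average(features, mask):
    """Mean feature vector over masked pixels.

    features: H x W x C nested lists; mask: H x W nested lists of 0/1.
    """
    channels = len(features[0][0])
    total = [0.0] * channels
    count = 0
    for feature_row, mask_row in zip(features, mask):
        for vector, selected in zip(feature_row, mask_row):
            if selected:
                count += 1
                for c in range(channels):
                    total[c] += vector[c]
    if count == 0:
        raise ValueError("prompt mask selects no pixels")
    return [t / count for t in total]


def encode_context(references):
    """Average the prompt embeddings of any number of (features, mask) pairs.

    Assumption: simple averaging stands in for whatever aggregation the
    actual model uses.
    """
    embeddings = [masked_average(f, m) for f, m in references]
    channels = len(embeddings[0])
    return [sum(e[c] for e in embeddings) / len(embeddings) for c in range(channels)]


def cosine(a, b):
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_match(context, candidates):
    """Index of the candidate embedding most similar to the context embedding."""
    scores = [cosine(context, c) for c in candidates]
    return max(range(len(scores)), key=scores.__getitem__)
```

In this sketch, a box prompt on one reference image and a point prompt on another are both reduced to masks, pooled into embeddings, and averaged by `encode_context`. `best_match` then picks the candidate region embedding in a target image that is closest to that context. Because every prompt type is reduced to a mask first, the same pooling path can serve strokes, boxes, and points alike.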

Authors

Jianwei Yang

Chunyuan Li

Xueyan Zou
