X-Paste: Revisiting Scalable Copy-Paste for Instance Segmentation using CLIP and StableDiffusion (2022-12-07)

TL;DR

This paper revisits Copy-Paste at scale with the power of newly emerged zero-shot recognition models and text2image models, and demonstrates for the first time that using a text2image model to generate images, or a zero-shot recognition model to filter noisily crawled images, for different object categories is a feasible way to make Copy-Paste truly scalable.

Abstract

Copy-Paste is a simple and effective data augmentation strategy for instance segmentation. By randomly pasting object instances onto new background images, it creates new training data for free and significantly boosts segmentation performance, especially for rare object categories. Although more diverse, higher-quality object instances yield larger performance gains from Copy-Paste, previous works take object instances either from human-annotated instance segmentation datasets or from renderings of 3D object models, and both approaches are too expensive to scale up to obtain good diversity. In this paper, we revisit Copy-Paste at scale with the power of newly emerged zero-shot recognition models (e.g., CLIP) and text2image models (e.g., StableDiffusion). We demonstrate for the first time that using a text2image model to generate images, or a zero-shot recognition model to filter noisily crawled images, for different object categories is a feasible way to make Copy-Paste truly scalable. To make this possible, we design a data acquisition and processing framework, dubbed "X-Paste", upon which a systematic study is conducted. On the LVIS dataset, X-Paste provides impressive improvements over the strong baseline CenterNet2 with Swin-L as the backbone. Specifically, it achieves gains of +2.6 box AP and +2.1 mask AP on all classes, and even larger gains of +6.8 box AP and +6.5 mask AP on long-tail classes. Our code and models are available at https://github.com/yoctta/XPaste.
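
As an illustration of the basic Copy-Paste operation the abstract describes (pasting a masked object instance onto a new background), the following is a minimal sketch in plain Python. It is not the X-Paste implementation: the function names `paste_instance` and `random_paste`, the nested-list image representation, and the rule that a pasted object clears the masks of the instances it covers are all assumptions made for this example. Scale jittering, blending, and the image generation and filtering stages are left out.

```python
import random


def paste_instance(background, existing_masks, instance, instance_mask, top, left):
    """Paste `instance` onto a copy of `background` wherever `instance_mask` is 1.

    Images are lists of rows of pixel values; masks are lists of rows of 0/1.
    Returns (new_image, updated_existing_masks, new_instance_mask).
    Hypothetical helper for illustration only.
    """
    height, width = len(background), len(background[0])
    image = [row[:] for row in background]
    masks = [[row[:] for row in mask] for mask in existing_masks]
    new_mask = [[0] * width for _ in range(height)]
    for i, mask_row in enumerate(instance_mask):
        for j, keep in enumerate(mask_row):
            y, x = top + i, left + j
            if keep and 0 <= y < height and 0 <= x < width:
                image[y][x] = instance[i][j]
                new_mask[y][x] = 1
                # Assumption: the pasted object occludes earlier instances.
                for mask in masks:
                    mask[y][x] = 0
    return image, masks, new_mask


def random_paste(background, existing_masks, instance, instance_mask, rng):
    """Paste at a uniformly random location; assumes the instance fits."""
    height, width = len(background), len(background[0])
    top = rng.randrange(height - len(instance) + 1)
    left = rng.randrange(width - len(instance[0]) + 1)
    return paste_instance(background, existing_masks, instance, instance_mask, top, left)
```

In a training pipeline, a call such as `random_paste(bg, masks, obj, obj_mask, random.Random(0))` would be repeated for several sampled instances per background, and each returned mask would be added to the image's annotations together with the pasted object's category label.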
