[1] ReMaX: Relaxing for Better Training on Efficient Panoptic Segmentation
[2] DaTaSeg: Taming a Universal Multi-Dataset Multi-Task Segmentation Model
[3] CLUSTSEG: Clustering for Universal Segmentation
[4] FreeSeg: Unified, Universal and Open-Vocabulary Image Segmentation
[5] CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation
[6] Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models
[7] Side Adapter Network for Open-Vocabulary Semantic Segmentation
[8] Generalized Decoding for Pixel, Image, and Language
[9] ZegCLIP: Towards Adapting CLIP for Zero-shot Semantic Segmentation
[10] OneFormer: One Transformer to Rule Universal Image Segmentation
[11] Open-vocabulary Semantic Segmentation with Frozen Vision-Language Models
[12] LAION-5B: An open large-scale dataset for training next generation image-text models
[13] Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP
[14] F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models
[15] Open-Vocabulary Universal Image Segmentation with MaskCLIP
[16] Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation
[17] CMT-DeepLab: Clustering Mask Transformers for Panoptic Segmentation
[18] Vision Transformer Adapter for Dense Predictions
[19] CoCa: Contrastive Captioners are Image-Text Foundation Models
[20] Flamingo: a Visual Language Model for Few-Shot Learning
[21] GroupViT: Semantic Segmentation Emerges from Text Supervision
[22] A ConvNet for the 2020s
[23] Language-driven Semantic Segmentation
[24] Scaling Open-Vocabulary Image Segmentation with Image-Level Labels
[25] High-Resolution Image Synthesis with Latent Diffusion Models
[26] Decoupling Zero-Shot Semantic Segmentation
[27] Masked-attention Mask Transformer for Universal Image Segmentation
[28] Extract Free Dense Labels from CLIP
[29] Florence: A New Foundation Model for Computer Vision
[30] Multi-Scale High-Resolution Vision Transformer for Semantic Segmentation
[31] Per-Pixel Classification is Not All You Need for Semantic Segmentation
[32] VinVL: Revisiting Visual Representations in Vision-Language Models
[33] SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers
[34] Segmenter: Transformer for Semantic Segmentation
[35] Open-vocabulary Object Detection via Vision and Language Knowledge Distillation
[36] Learning Transferable Visual Models From Natural Language Supervision
[37] Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
[38] Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers
[39] ViP-DeepLab: Learning Visual Perception with Depth-aware Video Panoptic Segmentation
[40] MaX-DeepLab: End-to-End Panoptic Segmentation with Mask Transformers
[41] Scaling Wide Residual Networks for Panoptic Segmentation
[42] Open-Vocabulary Object Detection Using Captions
[43] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
[44] Deformable DETR: Deformable Transformers for End-to-End Object Detection
[45] Contrastive Learning for Weakly Supervised Phrase Grounding
[46] DetectoRS: Detecting Objects with Recursive Feature Pyramid and Switchable Atrous Convolution
[47] Language Models are Few-Shot Learners
[48] End-to-End Object Detection with Transformers
[49] SOLOv2: Dynamic and Fast Instance Segmentation
[50] Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation
[51] Conditional Convolutions for Instance Segmentation
[52] Unifying Training and Inference for Panoptic Segmentation
[53] Connecting Vision and Language with Localized Narratives
[54] Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
[55] Panoptic-DeepLab: A Simple, Strong, and Fast Baseline for Bottom-Up Panoptic Segmentation
[56] UNITER: UNiversal Image-TExt Representation Learning
[57] Object-Contextual Representations for Semantic Segmentation
[58] LXMERT: Learning Cross-Modality Encoder Representations from Transformers
[59] RoBERTa: A Robustly Optimized BERT Pretraining Approach
[60] Zero-Shot Semantic Segmentation
[61] Semantic Projection Network for Zero- and Few-Label Semantic Segmentation
[62] YOLACT: Real-Time Instance Segmentation
[63] An End-To-End Network for Panoptic Segmentation
[64] Hybrid Task Cascade for Instance Segmentation
[65] UPSNet: A Unified Panoptic Segmentation Network
[66] Panoptic Feature Pyramid Networks
[67] Dual Attention Network for Scene Segmentation
[68] Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning
[69] Path Aggregation Network for Instance Segmentation
[70] Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation
[72] MaskLab: Instance Segmentation by Refining Object Detection with Semantic and Direction Features
[73] Cascade R-CNN: Delving Into High Quality Object Detection
[74] Decoupled Weight Decay Regularization
[75] The Mapillary Vistas Dataset for Semantic Understanding of Street Scenes
[76] Scene Parsing through ADE20K Dataset
[77] Rethinking Atrous Convolution for Semantic Image Segmentation
[78] Attention is All you Need
[80] COCO-Stuff: Thing and Stuff Classes in Context
[81] InstanceCut: From Edges to Instances with MultiCut
[82] DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs
[83] The Cityscapes Dataset for Semantic Urban Scene Understanding
[84] Deep Residual Learning for Image Recognition
[85] U-Net: Convolutional Networks for Biomedical Image Segmentation
[86] Microsoft COCO Captions: Data Collection and Evaluation Server
[87] Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs
[88] Adam: A Method for Stochastic Optimization
[89] Fully Convolutional Networks for Semantic Segmentation
[90] ImageNet Large Scale Visual Recognition Challenge
[91] Simultaneous Detection and Segmentation
[92] The Role of Context for Object Detection and Semantic Segmentation in the Wild
[93] Microsoft COCO: Common Objects in Context
[94] Multiscale Conditional Random Fields for Image Labeling
[95] Least Squares Quantization in PCM
[96] The Hungarian Method for the Assignment Problem
[97] k-means Mask Transformer
[98] A Simple Baseline for Zero-shot Semantic Segmentation with Pre-trained Vision-language Model
[100] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
[101] YFCC100M: The New Data in Multimedia Research
[102] Gradient-Based Learning Applied to Document Recognition
[103] The PASCAL Visual Object Classes (VOC) Challenge
[104] The PASCAL Visual Object Classes (VOC) Challenge