Masked-attention Mask Transformer for Universal Image Segmentation (2021-12-02T00:00:00.000000Z)

TL;DR

Mask2Former is a new architecture capable of addressing any image segmentation task (panoptic, instance, or semantic), and it sets a new state of the art on all three.

Abstract

Image segmentation groups pixels with different semantics, e.g., category or instance membership. Each choice of semantics defines a task. While only the semantics of each task differ, current research focuses on designing specialized architectures for each task. We present Masked-attention Mask Transformer (Mask2Former), a new architecture capable of addressing any image segmentation task (panoptic, instance or semantic). Its key components include masked attention, which extracts localized features by constraining cross-attention within predicted mask regions. In addition to reducing the research effort by at least three times, it outperforms the best specialized architectures by a significant margin on four popular datasets. Most notably, Mask2Former sets a new state-of-the-art for panoptic segmentation (57.8 PQ on COCO), instance segmentation (50.1 AP on COCO) and semantic segmentation (57.7 mIoU on ADE20K).
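The idea of constraining cross-attention to predicted mask regions can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `masked_cross_attention`, the single-head formulation, the scaled dot-product logits, and the fallback to full attention when a query's mask is empty are all assumptions made for illustration.

```python
import numpy as np

def masked_cross_attention(queries, keys, values, mask):
    """Single-head cross-attention restricted to a per-query mask (illustrative).

    queries: (Q, d) query embeddings
    keys, values: (N, d) flattened image features
    mask: (Q, N) boolean, True where a query may attend
    """
    d = queries.shape[-1]
    logits = queries @ keys.T / np.sqrt(d)

    # Assumption: a query whose mask is empty attends everywhere,
    # which avoids a softmax over all -inf entries.
    empty = ~mask.any(axis=1, keepdims=True)
    allowed = mask | empty

    # Locations outside the mask get -inf, so their softmax weight is zero.
    logits = np.where(allowed, logits, -np.inf)
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ values, weights
```

In a decoder built this way, the mask for each query would presumably come from that query's own mask prediction at an earlier stage, so each query aggregates features only from the region it currently claims.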

Authors

A. Schwing

Ishan Misra

Alexander Kirillov
