PVT v2: Improved baselines with Pyramid Vision Transformer (2021-06-25)

TL;DR

This work improves the original Pyramid Vision Transformer (PVT v1) with three designs: a linear-complexity attention layer, an overlapping patch embedding, and a convolutional feed-forward network. Together, these reduce the computational complexity of PVT v1 to linear and yield significant improvements on fundamental vision tasks.

Abstract

Transformers have recently led to encouraging progress in computer vision. In this work, we present new baselines that improve the original Pyramid Vision Transformer (PVT v1) with three designs: (i) a linear-complexity attention layer, (ii) an overlapping patch embedding, and (iii) a convolutional feed-forward network. With these modifications, PVT v2 reduces the computational complexity of PVT v1 to linear and provides significant improvements on fundamental vision tasks such as classification, detection, and segmentation. In particular, PVT v2 achieves comparable or better performance than recent work such as the Swin Transformer. We hope this work will facilitate state-of-the-art transformer research in computer vision. Code is available at https://github.com/whai362/PVT.
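The abstract does not spell out how the attention layer becomes linear. As a rough illustration only, the sketch below assumes that keys and values are pooled to a fixed 7 x 7 grid before attention, so that the number of query-key pairs grows linearly with the number of tokens instead of quadratically. The function names and the pooled size are assumptions made for this example, not details taken from the text above.

```python
def attention_pairs_full(h, w):
    """Query-key pairs for global self-attention over an h x w token grid."""
    n = h * w
    return n * n


def attention_pairs_pooled(h, w, pooled=7):
    """Query-key pairs when keys/values are pooled to a fixed pooled x pooled grid.

    The pooled size of 7 is an assumption for illustration.
    """
    return (h * w) * (pooled * pooled)


if __name__ == "__main__":
    # Doubling the grid side quadruples the token count.
    for side in (56, 112, 224):
        full = attention_pairs_full(side, side)
        pooled = attention_pairs_pooled(side, side)
        print(f"{side}x{side} tokens: full={full}, pooled={pooled}")
```

Under these assumptions, quadrupling the token count multiplies the full-attention cost by 16 but the pooled-attention cost by only 4, which is the sense in which the complexity is linear in the number of tokens.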

Authors

Ding Liang

Wenhai Wang

Enze Xie

References (41 items)

Polyp-PVT: Polyp Segmentation with Pyramid Vision Transformers

Twins: Revisiting the Design of Spatial Attention in Vision Transformers

Co-Scale Conv-Attentional Image Transformers

LeViT: a Vision Transformer in ConvNet’s Clothing for Faster Inference

CvT: Introducing Convolutions to Vision Transformers

CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification

Transformer in Transformer

Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions

Conditional Positional Encodings for Vision Transformers

Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet

Training data-efficient image transformers & distillation through attention

Sparse R-CNN: End-to-End Object Detection with Learnable Proposals

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Generalized Focal Loss: Learning Qualified and Distributed Bounding Boxes for Dense Object Detection

Designing Network Design Spaces

How Much Position Information Do Convolutional Neural Networks Encode?

Bridging the Gap Between Anchor-Based and Anchor-Free Detection via Adaptive Training Sample Selection

MMDetection: Open MMLab Detection Toolbox and Benchmark

Panoptic Feature Pyramid Networks

Cascade R-CNN: Delving Into High Quality Object Detection

Decoupled Weight Decay Regularization

mixup: Beyond Empirical Risk Minimization

Random Erasing Data Augmentation

Focal Loss for Dense Object Detection

Scene Parsing through ADE20K Dataset

Attention is All you Need

MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

Mask R-CNN

Aggregated Residual Transformations for Deep Neural Networks

SGDR: Stochastic Gradient Descent with Warm Restarts

Gaussian Error Linear Units (GELUs)

DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs

Deep Residual Learning for Image Recognition

Rethinking the Inception Architecture for Computer Vision

Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification

Going deeper with convolutions

ImageNet Large Scale Visual Recognition Challenge

Microsoft COCO: Common Objects in Context

Understanding the difficulty of training deep feedforward neural networks

ImageNet: A large-scale hierarchical image database

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
