High-Resolution Representations for Labeling Pixels and Regions (2019-04-09T00:00:00.000000Z)

TL;DR

A simple modification augments the high-resolution representation by aggregating the (upsampled) representations from all the parallel convolutions, rather than using only the representation from the high-resolution convolution. This yields stronger representations, as evidenced by improved results in semantic segmentation, facial landmark detection, and object detection.

Abstract

High-resolution representation learning plays an essential role in many vision problems, e.g., pose estimation and semantic segmentation. The high-resolution network (HRNet) (Sun et al., 2019), recently developed for human pose estimation, maintains high-resolution representations through the whole process by connecting high-to-low resolution convolutions in parallel, and produces strong high-resolution representations by repeatedly conducting fusions across the parallel convolutions. In this paper, we conduct a further study on high-resolution representations by introducing a simple yet effective modification and applying it to a wide range of vision tasks. We augment the high-resolution representation by aggregating the (upsampled) representations from all the parallel convolutions, rather than using only the representation from the high-resolution convolution as done in (Sun et al., 2019). This simple modification leads to stronger representations, evidenced by superior results. We show top results in semantic segmentation on Cityscapes, LIP, and PASCAL Context, and in facial landmark detection on AFLW, COFW, 300W, and WFLW. In addition, we build a multi-level representation from the high-resolution representation and apply it to the Faster R-CNN object detection framework and its extended frameworks. The proposed approach achieves superior results to existing single-model networks on COCO object detection. The code and models are publicly available at this https URL.
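The aggregation described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the nearest-neighbor upsampling, the channel-wise concatenation, the average-pooling pyramid, and the branch widths and sizes are all assumptions chosen for illustration, since the abstract does not specify these details.

```python
import numpy as np


def upsample_nearest(x, factor):
    """Upsample a (C, H, W) feature map by an integer factor (nearest neighbor).

    Assumption: the abstract only says "upsampled"; nearest neighbor is used
    here purely to keep the sketch short.
    """
    return x.repeat(factor, axis=1).repeat(factor, axis=2)


def aggregate_branches(branches):
    """Bring every branch to the highest resolution and concatenate channels.

    branches: list of (C_i, H_i, W_i) arrays, highest resolution first, whose
    spatial sizes evenly divide the size of the first branch.
    """
    target_h, target_w = branches[0].shape[1:]
    upsampled = []
    for b in branches:
        factor = target_h // b.shape[1]
        # Guard against branch sizes that are not an integer fraction of the target.
        assert b.shape[1] * factor == target_h and b.shape[2] * factor == target_w
        upsampled.append(upsample_nearest(b, factor))
    return np.concatenate(upsampled, axis=0)


def average_pool(x, factor):
    """Downsample a (C, H, W) map by averaging non-overlapping factor x factor blocks."""
    c, h, w = x.shape
    return x.reshape(c, h // factor, factor, w // factor, factor).mean(axis=(2, 4))


def multi_level(x, num_levels):
    """Build a multi-level pyramid from one high-resolution map (assumed pooling scheme)."""
    return [x if i == 0 else average_pool(x, 2 ** i) for i in range(num_levels)]


# Four hypothetical parallel branches: channels double as resolution halves.
rng = np.random.default_rng(0)
branches = [
    rng.standard_normal((32 * 2 ** i, 64 // 2 ** i, 64 // 2 ** i)) for i in range(4)
]

fused = aggregate_branches(branches)   # all branches contribute, not just the first
pyramid = multi_level(fused, 4)        # input for a detector that expects several levels

print(fused.shape)                     # (480, 64, 64)
print([p.shape for p in pyramid])
```

Using only `branches[0]` would correspond to keeping just the output of the high-resolution convolution; the concatenation keeps the information carried by the lower-resolution branches as well, at the cost of a wider feature map (here 32 + 64 + 128 + 256 = 480 channels).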

Authors

Ke Sun

Bin Xiao

Dong Liu

References (138 items)

Deep High-Resolution Representation Learning for Human Pose Estimation

Stacked Dense U-Nets with Dual Transformers for Robust Face Alignment

Rethinking ImageNet Pre-Training

M2Det: A Single-Shot Object Detector based on Multi-Level Feature Pyramid Network

Sequential Context Encoding for Duplicate Removal

Devil in the Details: Towards Accurate Single and Multiple Human Parsing

A Deeply-Initialized Coarse-to-fine Ensemble of Regression Trees for Face Alignment

Context Refinement for Object Detection

Mutual Learning to Adapt for Joint Human Parsing and Pose Estimation

PSANet: Point-wise Spatial Attention Network for Scene Parsing

DetNet: Design Backbone for Object Detection

Parallel Feature Pyramid Network for Object Detection

Deep Feature Pyramid Reconfiguration for Object Detection

Quantized Densely Connected U-Nets for Efficient Landmark Localization

CornerNet: Detecting Objects as Paired Keypoints

BiSeNet: Bilateral Segmentation Network for Real-time Semantic Segmentation

Unified Perceptual Parsing for Scene Understanding

Macro-Micro Adversarial Network for Human Parsing

UNet++: A Nested U-Net Architecture for Medical Image Segmentation

Direct Shape Regression Networks for End-to-End Face Alignment

Supervision-by-Registration: An Unsupervised Approach to Improve the Precision of Facial Landmark Detectors

Context Contrasted Feature and Gated Multi-scale Aggregation for Scene Segmentation

Scale-Transferrable Object Detection

Look at Boundary: A Boundary-Aware Face Alignment Algorithm

Pyramid Attention Network for Semantic Segmentation

PAD-Net: Multi-tasks Guided Prediction-and-Distillation Network for Simultaneous Depth Estimation and Scene Parsing

SNIPER: Efficient Multi-Scale Training

R-FCN++: Towards Accurate Region-Based Fully Convolutional Networks for Object Detection

Learning a Discriminative Feature Network for Semantic Segmentation

ExFuse: Enhancing Feature Fusion for Semantic Segmentation

Look into Person: Joint Body Parsing & Pose Estimation Network and a New Benchmark

Multi-scale Location-Aware Kernel Representation for Object Detection

Adaptive Affinity Fields for Semantic Segmentation

Context Encoding for Semantic Segmentation

Dynamic-Structured Semantic Propagation Network

Style Aggregated Network for Facial Landmark Detection

Path Aggregation Network for Instance Segmentation

Disentangling 3D Pose in a Dendritic CNN for Unconstrained 2D Face Alignment

Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation

Deep Regionlets for Object Detection

Cascade R-CNN: Delving Into High Quality Object Detection

Single-Shot Object Detection with Enriched Semantics

Relation Networks for Object Detection

An Analysis of Scale Invariance in Object Detection - SNIP

Receptive Field Block Net for Accurate and Fast Object Detection

MegDet: A Large Mini-Batch Object Detector

Single-Shot Refinement Neural Network for Object Detection

Wing Loss for Robust Facial Landmark Localisation with Convolutional Neural Networks

Improving Object Localization with Fitness NMS and Bounded IoU Loss

Interleaved Group Convolutions

Scale-Adaptive Convolutions for Scene Parsing

Joint Multi-View Face Alignment in the Wild

Stacked Deconvolutional Network for Semantic Segmentation

CoupleNet: Coupling Global Structure with Local Parts for Object Detection

Focal Loss for Dense Object Detection

Residual Conv-Deconv Grid Network for Semantic Segmentation

A Deep Regression Architecture with Two-Stage Re-initialization for High Performance Facial Landmark Detection

The Devil is in the Decoder: Classification, Regression and GANs

Leveraging Intra and Inter-Dataset Variations for Robust Face Alignment

Self-Supervised Neural Aggregation Networks for Human Parsing

Stacked Hourglass Network for Robust Facial Landmark Localisation

Gated Feedback Refinement Network for Dense Image Labeling

Rethinking Atrous Convolution for Semantic Image Segmentation

Deep Alignment Network: A Convolutional Neural Network for Robust Face Alignment

Dilated Residual Networks

Recurrent Scene Parsing with Perspective Understanding in the Loop

Adversarial PoseNet: A Structure-Aware Convolutional Network for Human Pose Estimation

Not All Pixels Are Equal: Difficulty-Aware Semantic Segmentation via Deep Layer Cascade

DeNet: Scalable Real-Time Object Detection with Directed Sparse Sampling

How Far are We from Solving the 2D & 3D Face Alignment Problem? (and a Dataset of 230,000 3D Facial Landmarks)

Mask R-CNN

Look into Person: Self-Supervised Structure-Sensitive Learning and a New Benchmark for Human Parsing

Large Kernel Matters — Improve Semantic Segmentation by Global Convolutional Network

Binarized Convolutional Landmark Localizers for Human Pose Estimation and Face Alignment with Limited Resources

Understanding Convolution for Semantic Segmentation

Feature Pyramid Networks for Object Detection