[1] DaViT: Dual Attention Vision Transformers
[3] On Causally Disentangled Representations
[4] Swin Transformer V2: Scaling Up Capacity and Resolution
[5] Are Transformers More Robust Than CNNs?
[6] Adversarial Robustness Comparison of Vision Transformer and MLP-Mixer to CNNs
[7] Do Vision Transformers See Like Convolutional Neural Networks?
[8] ConvNets vs. Transformers: Whose Visual Representations are More Transferable?
[9] BEiT: BERT Pre-Training of Image Transformers
[10] Reveal of Vision Transformers Robustness against Adversarial Attacks
[11] Intriguing Properties of Vision Transformers
[12] Vision Transformers are Robust Learners
[13] Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions
[14] Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet
[15] Training data-efficient image transformers & distillation through attention
[16] Learning Contextual Causality from Time-consecutive Images
[17] MorphGAN: One-Shot Face Synthesis GAN for Detecting Recognition Bias
[18] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
[19] ThreeDWorld: A Platform for Interactive Multi-Modal Physical Simulation
[20] Gender Slopes: Counterfactual Fairness for Computer Vision Models by Attribute Manipulation
[21] Compositional Convolutional Neural Networks: A Deep Architecture With Innate Robustness to Partial Occlusion
[22] Natural Adversarial Examples
[23] Image Counterfactual Sensitivity Analysis for Detecting Unintended Bias
[24] On the Transfer of Inductive Bias from Simulation to the Real World: a New Disentanglement Dataset
[25] Combining Compositional Models and Deep Networks For Robust Object Classification under Occlusion
[26] EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
[27] Robustness of Object Recognition under Extreme Occlusion in Humans and Computational Models
[28] Meta-Sim: Learning to Generate Synthetic Datasets
[29] Counterfactual Visual Explanations
[30] Synthetic Examples Improve Generalization for Rare Classes
[31] Counterfactual Sensitivity and Robustness
[32] Benchmarking Neural Network Robustness to Common Corruptions and Perturbations
[33] Moment Matching for Multi-Source Domain Adaptation
[34] Strike (With) a Pose: Neural Networks Are Easily Fooled by Strange Poses of Familiar Objects
[36] ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness
[37] Learning dexterous in-hand manipulation
[38] VisDA: A Synthetic-to-Real Benchmark for Visual Domain Adaptation
[39] Synthesizing Programs for Images using Reinforced Adversarial Learning
[40] Learning to Adapt Structured Output Space for Semantic Segmentation
[41] Path-Specific Counterfactual Fairness
[42] Disentangling by Factorising
[43] Empirically Analyzing the Effect of Dataset Biases on Deep Face Recognition Systems
[44] CyCADA: Cycle-Consistent Adversarial Domain Adaptation
[45] CARLA: An Open Urban Driving Simulator
[46] Adversarial Variational Optimization of Non-Differentiable Simulators
[47] No More Discrimination: Cross City Adaptation of Road Scene Segmenters
[48] Counterfactual Fairness
[49] Adversarial Discriminative Domain Adaptation
[50] CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning
[51] Densely Connected Convolutional Networks
[52] Playing for Data: Ground Truth from Computer Games
[53] The SYNTHIA Dataset: A Large Collection of Synthetic Images for Semantic Segmentation of Urban Scenes
[54] Virtual Worlds as Proxy for Multi-Object Tracking Analysis
[55] Deep Residual Learning for Image Recognition
[56] A Large Dataset to Train Convolutional Networks for Disparity, Optical Flow, and Scene Flow Estimation
[57] Domain-Adversarial Training of Neural Networks
[58] Very Deep Convolutional Networks for Large-Scale Image Recognition
[59] ImageNet Classification with Deep Convolutional Neural Networks
[60] 3D Object Detection and Viewpoint Estimation with a Deformable 3D Cuboid Model
[61] ImageNet: A Large-Scale Hierarchical Image Database
[63] Learning Methods for Generic Object Recognition with Invariance to Pose and Lighting
[65] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
[66] ObjectNet: A Large-Scale Bias-Controlled Dataset for Pushing the Limits of Object Recognition Models
[67] NVLabs. Falcor3D and Isaac3D disentanglement datasets
[68] dSprites: Disentanglement Testing Sprites Dataset
[69] Establishing Good Benchmarks and Baselines for Face Recognition
[70] Gradient-Based Learning Applied to Document Recognition
Checklist (excerpt): Did you discuss any potential negative societal impacts of your work? [Yes] Extended discussion in supp. material.