1
Causally motivated shortcut removal using auxiliary labels
2
Formalizing Trust in Artificial Intelligence: Prerequisites, Causes and Goals of Human Trust in AI
3
Measuring and Reducing Gendered Correlations in Pre-trained Models
4
Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension
5
Guidelines for clinical trial protocols for interventions involving artificial intelligence: the SPIRIT-AI extension
6
The myth of generalisability in clinical research and machine learning in health care
7
On Robustness and Transferability of Convolutional Neural Networks
8
Measuring Robustness to Natural Distribution Shifts in Image Classification
9
Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data
10
The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution Generalization
11
Hyperparameter Ensembles for Robustness and Uncertainty Quantification
12
Estimating the effects of non-pharmaceutical interventions on COVID-19 in Europe
14
Beyond Accuracy: Behavioral Testing of NLP Models with CheckList
15
How Can We Accelerate Progress Towards Human-like Linguistic Generalization?
16
Skin Color in Dermatology Textbooks: An Updated Evaluation and Analysis
17
A Human-Centered Evaluation of a Deep Learning System Deployed in Clinics for the Detection of Diabetic Retinopathy
18
StereoSet: Measuring stereotypical bias in pretrained language models
19
Quantifying Gender Bias in Different Corpora
20
Shortcut learning in deep neural networks
21
The Curse of Performance Instability in Analysis Datasets: Consequences, Source, and Suggestions
22
Understanding and Mitigating the Tradeoff Between Robustness and Accuracy
23
Bayesian Deep Learning and a Probabilistic Perspective of Generalization
24
Understand It in 5 Minutes!? Skimming Famous Papers: Jacob Devlin et al.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
25
Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping
26
Doctor XAI: an ontology-based approach to black-box sequential data classification explanations
27
Big Transfer (BiT): General Visual Representation Learning
28
Large Scale Learning of General Visual Representations for Transfer
29
Linear Mode Connectivity and the Lottery Ticket Hypothesis
30
Deep double descent: where bigger models and more data hurt
31
Causality for Machine Learning
32
BERTs of a feather do not generalize together: Large variability in generalization across models with similar test set performance
33
Key challenges for delivering clinical impact with artificial intelligence
34
Dissecting racial bias in an algorithm used to manage the health of populations
35
Hidden stratification causes clinically meaningful failures in machine learning for medical imaging
36
Learning the Difference that Makes a Difference with Counterfactually-Augmented Data
37
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
38
Deep Ensembles: A Loss Landscape Perspective
39
Predictive Multiplicity in Classification
40
A deep learning system for differential diagnosis of skin diseases
41
Artificial intelligence to predict AKI: is it a breakthrough?
42
The Generalization Error of Random Features Regression: Precise Asymptotics and the Double Descent Curve
43
Association Between Surgical Skin Markings in Dermoscopic Images and Diagnostic Performance of a Deep Learning Convolutional Neural Network for Melanoma Recognition
44
A study in Rashomon curves and volumes: A new perspective on generalization and model simplicity in machine learning
45
Feature Robustness in Non-stationary Health Records: Caveats to Deployable Model Performance in Common Clinical Machine Learning Tasks
46
Developing Deep Learning Continuous Risk Models for Early Adverse Event Prediction in Electronic Health Records: an AKI Case Study
47
RoBERTa: A Robustly Optimized BERT Pretraining Approach
48
Analysis of polygenic risk score usage and performance in diverse human populations
49
Natural Adversarial Examples
50
Invariant Risk Minimization
51
A Clinically Applicable Approach to Continuous Prediction of Future Acute Kidney Injury
52
A Fourier Perspective on Model Robustness in Computer Vision
53
XLNet: Generalized Autoregressive Pretraining for Language Understanding
54
Analyzing the role of model uncertainty for electronic health records
55
Can You Trust Your Model's Uncertainty? Evaluating Predictive Uncertainty Under Dataset Shift
56
High-Frequency Component Helps Explain the Generalization of Convolutional Neural Networks
57
Adversarial Examples Are Not Bugs, They Are Features
58
HellaSwag: Can a Machine Really Finish Your Sentence?
59
Learning Robust Global Representations by Penalizing Local Predictive Power
60
Clinical use of current polygenic risk scores may exacerbate health disparities
61
Benchmarking Neural Network Robustness to Common Corruptions and Perturbations
62
Guidelines and recommendations for ensuring Good Epidemiological Practice (GEP): a guideline developed by the German Society for Epidemiology
63
Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference
64
Bias in Bios: A Case Study of Semantic Representation Bias in a High-Stakes Setting
65
Using Electronic Health Records to Identify Adverse Drug Events in Ambulatory Care: A Systematic Review
66
Reconciling modern machine learning and the bias-variance trade-off
67
On Lazy Training in Differentiable Programming
68
Predicting diabetes-related hospitalizations based on electronic health records
69
Machine Learning and Health Care Disparities in Dermatology
70
ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness
71
Counterfactual Fairness in Text Classification through Robustness
72
SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference
73
Reduced signal for polygenic adaptation of height in UK Biobank
74
Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations
75
Stress Test Evaluation for Natural Language Inference
76
Gender Bias in Coreference Resolution
77
Gender Bias in Coreference Resolution: Evaluation and Debiasing Methods
78
Averaging Weights Leads to Wider Optima and Better Generalization
79
Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs
80
A standardized framework for representation of ancestry data in genomics studies, with application to the NHGRI-EBI GWAS Catalog
81
Deep Contextualized Word Representations
82
Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification
83
Universal Language Model Fine-tuning for Text Classification
84
All Models are Wrong, but Many are Useful: Learning a Variable's Importance by Studying an Entire Class of Prediction Models Simultaneously
85
Development and Validation of a Deep Learning System for Diabetic Retinopathy and Related Eye Diseases Using Retinal Images From Multiethnic Populations With Diabetes
86
Grader variability and the importance of reference standards for evaluating machine learning models for diabetic retinopathy
87
The Consciousness Prior
88
Simple Recurrent Units for Highly Parallelizable Recurrence
89
SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation
90
Domain Adaptation by Using Causal Inference to Predict Invariant Conditional Distributions
91
Revisiting Unreasonable Effectiveness of Data in Deep Learning Era
92
Invariant Causal Prediction for Nonlinear Models
93
Machine Learning: An Applied Econometric Approach
94
A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference
95
Counterfactual Fairness
96
Right for the Right Reasons: Training Differentiable Models by Constraining their Explanations
97
Beyond prediction: Using big data for policy problems
98
Dermatologist-level classification of skin cancer with deep neural networks
99
Development and Validation of a Deep Learning Algorithm for Detection of Diabetic Retinopathy in Retinal Fundus Photographs
100
Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles
101
GRAM: Graph-based Attention Model for Healthcare Representation Learning
102
Entropy-SGD: biasing gradient descent into wide valleys
103
Capacity and Trainability in Recurrent Neural Networks
104
Genomics is failing on diversity
105
Semantics derived automatically from language corpora contain human-like biases
106
Human demographic history impacts genetic risk prediction across diverse populations
107
Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings
108
Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning
109
Deep Residual Learning for Image Recognition
110
A large annotated corpus for learning natural language inference
111
Prediction Policy Problems
112
UK Biobank: An Open Access Resource for Identifying the Causes of a Wide Range of Complex Diseases of Middle and Old Age
113
Causal inference by using invariant prediction: identification and confidence intervals
114
Efficient Estimation of Word Representations in Vector Space
115
Large-scale association analysis identifies new risk loci for coronary artery disease
116
KDIGO Clinical Practice Guidelines for Acute Kidney Injury
117
Unachievable Region in Precision-Recall Space and Its Effect on Empirical Evaluation
118
Improving disease prediction using ICD-9 ontological features
119
Next generation disparities in human genomics: concerns and remedies
120
Common polygenic variation contributes to risk of schizophrenia and bipolar disorder
121
ImageNet: A large-scale hierarchical image database
122
Eigenvectors of some large sample covariance matrix ensembles
123
Linkage disequilibrium — understanding the evolutionary past and mapping the medical future
124
Random Features for Large-Scale Kernel Machines
125
Prediction of individual genetic risk to disease from genome-wide association studies
126
PLINK: a tool set for whole-genome association and population-based linkage analyses
127
Principal components analysis corrects for stratification in genome-wide association studies
128
OntoNotes: The 90% Solution
129
Statistical Modeling: The Two Cultures (with comments and a rejoinder by the author)
130
Long Short-Term Memory
133
ObjectNet: A large-scale bias-controlled dataset for pushing the limits of object recognition models
135
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
136
Acute kidney injury: prevention, detection and management
137
Neural architecture search: A survey
141
Protocol available at Protocol Exchange, version 1, July 2019. doi: 10.21203/RS
143
Correction: Efficacy of Commercial Weight-Loss Programs
145
Intelligible Models for HealthCare
146
Priors for Infinite Networks
147
The use of misclassification costs to learn rule-based decision support models for cost-effective hospital admission strategies