Strong empirical evidence that one machine-learning algorithm A outperforms another one B ideally calls for multiple trials that optimize the learning pipeline over sources of variation such as data sampling, data augmentation, parameter initialization, and hyperparameter choices. This is prohibitively expensive, and corners are cut to reach conclusions. We model the whole benchmarking process, revealing that variance due to data sampling, parameter initialization, and hyperparameter choice markedly impacts the results. We analyze the predominant comparison methods used today in light of this variance. We show the counter-intuitive result that adding more sources of variation to an imperfect estimator brings it closer to the ideal estimator, at 51 times lower compute cost. Building on these results, we study the error rate of detecting improvements on five different deep-learning tasks/architectures. This study leads us to propose recommendations for performance comparisons.
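As a rough illustration of the counter-intuitive result, here is a minimal numerical sketch (not the paper's code, and with made-up noise magnitudes): each source of variation is modeled as Gaussian noise on the measured score. A benchmark that fixes seeds turns that noise into a bias that no number of repeated trials can average away, while randomizing every source lets the mean over k trials converge toward the true expected performance.

```python
# Hypothetical simulation: compare two estimators of a pipeline's expected
# performance. Both average k trials, but one fixes the seeds controlling
# data sampling / initialization / hyperparameter choice, while the other
# randomizes all of them on every trial.
import numpy as np

rng = np.random.default_rng(0)
true_perf = 0.90                                     # assumed expected accuracy
sigma = {"data": 0.02, "init": 0.01, "hpo": 0.015}   # assumed per-source std-devs

def trial(randomize, fixed_offsets):
    """One training run; each variance source contributes fresh noise if
    randomized, otherwise a fixed (arbitrary but constant) offset."""
    score = true_perf
    for src, s in sigma.items():
        score += rng.normal(0, s) if randomize else fixed_offsets[src]
    return score

def estimate(k, randomize):
    # A fixed-seed benchmark reuses the same arbitrary offsets in all k trials.
    fixed = {src: rng.normal(0, s) for src, s in sigma.items()}
    return np.mean([trial(randomize, fixed) for _ in range(k)])

k, reps = 10, 2000
mse_fixed = np.mean([(estimate(k, False) - true_perf) ** 2 for _ in range(reps)])
mse_rand = np.mean([(estimate(k, True) - true_perf) ** 2 for _ in range(reps)])
print(f"MSE, fixed sources:      {mse_fixed:.6f}")
print(f"MSE, randomized sources: {mse_rand:.6f}")   # roughly k times smaller
```

Under these assumptions the fixed-seed estimator's error stays near the sum of the source variances no matter how many trials are run, whereas randomizing all sources shrinks it by a factor of about k, which is the intuition behind getting a better estimate at much lower compute cost.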
Pierre Delaunay, Mirko Bronzi, Assya Trofimov, B. Nichyporuk, Justin Szeto, Naz Sepah, Edward Raff, Kanika Madan, Vikram S. Voleti, Samira Ebrahimi Kahou, Vincent Michalski, Dmitriy Serdyuk, T. Arbel, C. Pal