[1] Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
[2] Designing Effective Sparse Expert Models
[3] PaLM: Scaling Language Modeling with Pathways
[4] Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer
[5] Unified Scaling Laws for Routed Language Models
[6] Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model
[7] LaMDA: Language Models for Dialog Applications
[8] Efficient Large Scale Language Modeling with Mixtures of Experts
[9] GLaM: Efficient Scaling of Language Models with Mixture-of-Experts
[10] Improving language models by retrieving from trillions of tokens
[11] Scaling Language Models: Methods, Analysis & Insights from Training Gopher
[12] Ethical and social risks of harm from Language Models
[13] Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers
[14] Challenges in Detoxifying Language Models
[15] TruthfulQA: Measuring How Models Mimic Human Falsehoods
[16] Detoxifying Language Models Risks Marginalizing Minority Voices
[17] On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜
[18] Scaling Laws for Transfer
[19] Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
[20] The Pile: An 800GB Dataset of Diverse Text for Language Modeling
[21] The Depth-to-Width Interplay in Self-Attention
[22] Distilling Knowledge from Reader to Retriever for Question Answering
[23] RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models
[24] Measuring Massive Multitask Language Understanding
[25] Language Models are Few-Shot Learners
[26] Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
[27] REALM: Retrieval-Augmented Language Model Pre-Training
[28] Scaling Laws for Neural Language Models
[29] PIQA: Reasoning about Physical Commonsense in Natural Language
[30] Compressive Transformers for Long-Range Sequence Modelling
[31] Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
[32] ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
[33] Natural Questions: A Benchmark for Question Answering Research
[35] Which Algorithmic Choices Matter at Which Batch Sizes? Insights From a Noisy Quadratic Model
[36] BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions
[37] HellaSwag: Can a Machine Really Finish Your Sentence?
[38] Approximation rates for neural networks with general activation functions
[39] An Empirical Model of Large-Batch Training
[40] Measuring the Effects of Data Parallelism on Neural Network Training
[41] Model Cards for Model Reporting
[42] SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
[43] Gender Bias in Coreference Resolution
[44] Decoupled Weight Decay Regularization
[45] Attention Is All You Need
[46] TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension
[47] In-datacenter performance analysis of a tensor processing unit
[48] RACE: Large-scale ReAding Comprehension Dataset From Examinations
[49] Pointer Sentinel Mixture Models
[50] The LAMBADA dataset: Word prediction requiring a broad discourse context
[51] Adam: A Method for Stochastic Optimization
[52] Convex Optimization: Algorithms and Complexity
[53] Updating Quasi-Newton Matrices With Limited Storage
[54] A Stochastic Approximation Method
[55] Updates and lessons from AI forecasting
[56] Jurassic-1: Technical details and evaluation
[59] SocialIQA: Commonsense reasoning about social interactions
[60] JAX: composable transformations of Python+NumPy programs
[61] Common sense understanding on HellaSwag
[62] On robust estimation of the location parameter
Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes] The claims in the abstract describe the work clearly.
Did you describe the limitations of your work? [Yes] We address the limitations of our work.
Did you discuss any potential negative societal impacts of your work? [Yes] We include a discussion both in a model card and in Appendix I.
Have you read the ethics review guidelines and ensured that your paper conforms to them?
Did you include complete proofs of all theoretical results?
Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] We provide all training details and hyperparameters.
Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)?
If you are using existing assets (e.g., code, data, models) or curating/releasing new assets: (a) If your work uses existing assets, did you cite the creators?
Did you mention the license of the assets? [Yes] We use the same data as Rae et al. (2021), which is a proprietary dataset; we also show results on an open-source dataset, C4.
Did you include any new assets either in the supplemental material or as a URL?
Did you discuss whether and how consent was obtained from people whose data you're using/curating?
Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [Yes] We include a model card which covers this information.
If you used crowdsourcing or conducted research with human subjects: (a) Did you include the full text of instructions given to participants and screenshots, if applicable?
Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable?
Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation?
Motivation: We chose evaluations from Rae et al. (2021) to allow us to most directly compare to Gopher.