Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling (2023-04-03)

TL;DR

A suite of 16 LLMs all trained on public data seen in the exact same order and ranging in size from 70M to 12B parameters is introduced, demonstrating that this highly controlled setup can be used to yield novel insights toward LLMs and their training dynamics.

Abstract

How do large language models (LLMs) develop and evolve over the course of training? How do these patterns change as models scale? To answer these questions, we introduce Pythia, a suite of 16 LLMs all trained on public data seen in the exact same order and ranging in size from 70M to 12B parameters. We provide public access to 154 checkpoints for each one of the 16 models, alongside tools to download and reconstruct their exact training dataloaders for further study. We intend Pythia to facilitate research in many areas, and we present several case studies including novel results in memorization, term frequency effects on few-shot performance, and reducing gender bias. We demonstrate that this highly controlled setup can be used to yield novel insights toward LLMs and their training dynamics. Trained models, analysis code, training code, and training data can be found at https://github.com/EleutherAI/pythia.
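As a rough illustration of what "154 checkpoints for each one of the 16 models" can look like in practice, the sketch below enumerates one checkpoint schedule that adds up to 154 and turns it into revision names. The schedule (step 0, ten log-spaced steps from 1 to 512, then every 1000 steps up to 143000), the `step<N>` revision naming, and the `EleutherAI/pythia-70m` model id are assumptions not stated in the abstract; the helper names are hypothetical. Check the linked repository before relying on any of them.

```python
def pythia_checkpoint_steps():
    """Return an ASSUMED checkpoint schedule that totals 154 steps."""
    initial = [0]                                 # untrained model
    log_spaced = [2 ** i for i in range(10)]      # 1, 2, 4, ..., 512
    linear = list(range(1000, 143001, 1000))      # 1000, 2000, ..., 143000
    return initial + log_spaced + linear


def load_checkpoint(model_id, step):
    """Hypothetical helper: load one checkpoint via Hugging Face transformers.

    Assumes checkpoints are published as Hub revisions named 'step<N>'.
    The import is local so the rest of the sketch runs without transformers.
    """
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(model_id, revision=f"step{step}")


steps = pythia_checkpoint_steps()
revisions = [f"step{s}" for s in steps]
print(len(revisions), revisions[0], revisions[-1])
```

Calling `load_checkpoint("EleutherAI/pythia-70m", 1000)` would download weights from the Hub, so it is left out of the runnable part of the sketch.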

Authors

Edward Raff


Aviya Skowron


Stella Biderman


