1
Solving Quantitative Reasoning Problems with Language Models
2
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
3
Designing Effective Sparse Expert Models
4
Scaling Up Models and Data with t5x and seqio
5
Training Compute-Optimal Large Language Models
6
Pathways: Asynchronous Distributed Dataflow for ML
7
Self-Consistency Improves Chain of Thought Reasoning in Language Models
8
Training language models to follow instructions with human feedback
9
Quantifying Memorization Across Neural Language Models
10
Repairing the Cracked Foundation: A Survey of Obstacles in Evaluation Practices for Generated Text
11
Deduplicating Training Data Mitigates Privacy Risks in Language Models
12
Competition-level code generation with AlphaCode
13
Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model
14
Chain of Thought Prompting Elicits Reasoning in Large Language Models
15
Reasoning Like Program Executors
16
LaMDA: Language Models for Dialog Applications
17
GLaM: Efficient Scaling of Language Models with Mixture-of-Experts
18
Improving language models by retrieving from trillions of tokens
19
Scaling Language Models: Methods, Analysis & Insights from Training Gopher
20
Ethical and social risks of harm from Language Models
21
Show Your Work: Scratchpads for Intermediate Computation with Language Models
22
AI and the Everything in the Whole Wide World Benchmark
23
Training Verifiers to Solve Math Word Problems
24
Multitask Prompted Training Enables Zero-Shot Task Generalization
25
Learning Compact Metrics for MT
26
Challenges in Detoxifying Language Models
27
Finetuned Language Models Are Zero-Shot Learners
28
MWPToolkit: An Open-Source Framework for Deep Learning-Based Math Word Problem Solvers
29
Harms of Gender Exclusivity and Challenges in Non-Binary Representation in Language Technologies
30
Program Synthesis with Large Language Models
31
On the Opportunities and Risks of Foundation Models
32
Deduplicating Training Data Makes Language Models Better
33
Evaluating Large Language Models Trained on Code
34
Break-It-Fix-It: Unsupervised Learning for Program Repair
35
Measuring and Improving BERT’s Mathematical Abilities by Predicting the Order of Reasoning
36
ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models
37
GSPMD: General and Scalable Parallelization for ML Computation Graphs
38
Societal Biases in Language Generation: Progress and Challenges
39
PanGu-α: Large-scale Autoregressive Pretrained Chinese Language Models with Auto-parallel Computation
40
Carbon Emissions and Large Neural Network Training
41
RoFormer: Enhanced Transformer with Rotary Position Embedding
42
Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus
43
ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning
44
Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM
45
Are NLP Models really able to Solve Simple Math Word Problems?
46
Designing Disaggregated Evaluations of AI Systems: Choices, Considerations, and Tradeoffs
47
On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜
48
Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm
49
The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics
50
Re-imagining Algorithmic Fairness in India and Beyond
51
ZeRO-Offload: Democratizing Billion-Scale Model Training
52
Persistent Anti-Muslim Bias in Large Language Models
53
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
54
Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies
55
HateCheck: Functional Tests for Hate Speech Detection Models
56
mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer
57
Beyond English-Centric Multilingual Machine Translation
58
Complete Multilingual Neural Machine Translation
59
WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization
60
Pre-training Multilingual Neural Machine Translation by Leveraging Alignment Information
61
Rethinking Attention with Performers
62
RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models
63
Measuring Massive Multitask Language Understanding
64
Big Bird: Transformers for Longer Sequences
65
DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters
66
You Autocomplete Me: Poisoning Vulnerabilities in Neural Code Completion
67
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
68
A domain-specific supercomputer for training deep neural networks
69
Memory-Efficient Pipeline-Parallel DNN Training
70
Scalable Cross Lingual Pivots to Model Pronoun Gender for Translation
71
Language Models are Few-Shot Learners
72
Language (Technology) is Power: A Critical Survey of “Bias” in NLP
73
Graph-based, Self-Supervised Program Repair from Diagnostic Feedback
74
Social Biases in NLP Models as Barriers for Persons with Disabilities
75
MLSUM: The Multilingual Summarization Corpus
76
Shortcut learning in deep neural networks
77
Efficient Content-Based Sparse Attention with Routing Transformers
78
TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages
79
Understand It in 5 Minutes!? Skimming Famous Papers: Jacob Devlin et al.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
80
GLU Variants Improve Transformer
81
REALM: Retrieval-Augmented Language Model Pre-Training
82
Towards a Human-like Open-Domain Chatbot
83
Scaling Laws for Neural Language Models
84
Reformer: The Efficient Transformer
85
PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization
86
Measurement and Fairness
87
PIQA: Reasoning about Physical Commonsense in Natural Language
88
Microsoft Research Asia’s Systems for WMT19
89
Fast Transformer Decoding: One Write-Head is All You Need
90
Semantic Noise Matters for Neural Natural Language Generation
91
Adversarial NLI: A New Benchmark for Natural Language Understanding
92
Toward Gender-Inclusive Coreference Resolution
93
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
94
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
95
Neural Generation for Czech: Data and Baselines
96
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
97
Evaluating the Cross-Lingual Effectiveness of Massively Multilingual Neural Machine Translation
98
On Measuring and Mitigating Biased Inferences of Word Embeddings
99
Natural Questions: A Benchmark for Question Answering Research
100
Quantifying Social Biases in Contextual Word Representations
101
Neural Machine Translation for English–Kazakh with Morphological Segmentation and Synthetic Data
103
The Risk of Racial Bias in Hate Speech Detection
104
Tagged Back-Translation
105
SPoC: Search-based Pseudocode to Code
106
Evaluating Gender Bias in Machine Translation
107
MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms
108
MASS: Masked Sequence to Sequence Pre-training for Language Generation
109
SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems
110
HellaSwag: Can a Machine Really Finish Your Sentence?
111
Generating Long Sequences with Sparse Transformers
112
DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs
113
The State of Sparsity in Deep Neural Networks
114
Measuring and Mitigating Unintended Bias in Text Classification
115
The adverse effects of code duplication in machine learning models of code
116
An Empirical Model of Large-Batch Training
117
GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
118
Mesh-TensorFlow: Deep Learning for Supercomputers
119
Model Cards for Model Reporting
120
Don’t Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization
121
CoQA: A Conversational Question Answering Challenge
122
SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
123
Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering
124
Know What You Don’t Know: Unanswerable Questions for SQuAD
125
Gender Bias in Coreference Resolution
126
Adafactor: Adaptive Learning Rates with Sublinear Memory Cost
127
Datasheets for datasets
128
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
129
Don't Decay the Learning Rate, Increase the Batch Size
130
DéjàVu: a map of code duplicates on GitHub
131
A Survey of Machine Learning for Big Code and Naturalness
132
Creating Training Corpora for NLG Micro-Planners
133
The E2E Dataset: New Challenges For End-to-End Generation
134
Attention is All you Need
135
Program Induction by Rationale Generation: Learning to Solve and Explain Algebraic Word Problems
136
TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension
137
RACE: Large-scale ReAding Comprehension Dataset From Examinations
138
DeepFix: Fixing Common C Language Errors by Deep Learning
139
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
140
The LAMBADA dataset: Word prediction requiring a broad discourse context
141
MAWPS: A Math Word Problem Repository
142
A Corpus and Cloze Evaluation for Deeper Understanding of Commonsense Stories
143
Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google's Datacenter Network
144
An analysis of patch plausibility and correctness for generate-and-validate patch generation systems
145
Adam: A Method for Stochastic Optimization
146
Semantic Parsing on Freebase from Question-Answer Pairs
147
Understanding the exploding gradient problem
148
The Winograd Schema Challenge
149
ROUGE: A Package for Automatic Evaluation of Summaries
150
Sustainability at Google: Carbon neutral since 2007, carbon free by 2030. 2022
151
Structure-to-Text Generation with Self-Training, Acceptability Classifiers and Context-Conditioning for the GEM Shared Task
152
What do Bias Measures Measure?
153
Stereotyping Norwegian Salmon: An Inventory of Pitfalls in Fairness Benchmark Datasets
154
An Empirical Cybersecurity Evaluation of GitHub Copilot's Code Contributions
155
GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. https://github.com/kingoflolz/mesh-transformer-jax
156
Jurassic-1: Technical details and evaluation
157
The 2020 Bilingual, Bi-Directional WebNLG+ Shared Task: Overview and Evaluation Results (WebNLG+ 2020)
158
PyTorch Distributed: Experiences on Accelerating Data Parallel Training
160
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
161
CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge
162
Multi-Agent Dual Learning
163
The NiuTrans Machine Translation Systems for WMT19
164
JAX: Composable transformations of Python+NumPy programs, 2018
165
Improving Language Understanding by Generative Pre-Training
166
Understanding back-translation at scale
167
NLTK: The Natural Language Toolkit
168
The Invisible Whiteness of Being: Whiteness, White Supremacy, White Privilege, and Racism.
169
Unsupervised Translation of Programming Languages
173
Google Cloud infoType detector