1. BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
2. What Language Model to Train if You Have One Million GPU Hours?
3. LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
4. Rationale-Augmented Ensembles in Language Models
5. Emergent Abilities of Large Language Models
6. Memory-Based Model Editing at Scale
7. Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
8. CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers
9. DeepStruct: Pretraining of Language Models for Structure Prediction
10. OPT: Open Pre-trained Transformer Language Models
11. GPT-NeoX-20B: An Open-Source Autoregressive Language Model
12. PaLM: Scaling Language Modeling with Pathways
13. Training Compute-Optimal Large Language Models
14. Compression of Generative Pre-trained Language Models via Quantization
15. DeepNet: Scaling Transformers to 1,000 Layers
16. PromptSource: An Integrated Development Environment and Repository for Natural Language Prompts
17. Chain of Thought Prompting Elicits Reasoning in Large Language Models
18. Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model
19. LaMDA: Language Models for Dialog Applications
20. Counterfactual Memorization in Neural Language Models
21. ERNIE 3.0 Titan: Exploring Larger-scale Knowledge Enhanced Pre-training for Language Understanding and Generation
22. Efficient Large Scale Language Modeling with Mixtures of Experts
23. Ethical and social risks of harm from Language Models
24. Scaling Language Models: Methods, Analysis & Insights from Training Gopher
25. Do Language Models Have Beliefs? Methods for Detecting, Updating, and Visualizing Model Beliefs
26. ExT5: Towards Extreme Multi-Task Scaling for Transfer Learning
27. NormFormer: Improved Transformer Pretraining with Extra Normalization
28. Multitask Prompted Training Enables Zero-Shot Task Generalization
29. Learning Compact Metrics for MT
30. Yuan 1.0: Large-Scale Pre-trained Language Model in Zero-Shot and Few-Shot Learning
31. 8-bit Optimizers via Block-wise Quantization
32. TruthfulQA: Measuring How Models Mimic Human Falsehoods
33. Finetuned Language Models Are Zero-Shot Learners
34. Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation
35. Harms of Gender Exclusivity and Challenges in Non-Binary Representation in Language Technologies
36. On the Opportunities and Risks of Foundation Models
37. Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing
38. FewCLUE: A Chinese Few-shot Learning Evaluation Benchmark
39. Towards Understanding and Mitigating Social Biases in Language Models
40. CogView: Mastering Text-to-Image Generation via Transformers
41. Societal Biases in Language Generation: Progress and Challenges
42. PanGu-α: Large-scale Autoregressive Pretrained Chinese Language Models with Auto-parallel Computation
43. Carbon Emissions and Large Neural Network Training
44. RoFormer: Enhanced Transformer with Rotary Position Embedding
45. The Power of Scale for Parameter-Efficient Prompt Tuning
46. Editing Factual Knowledge in Language Models
47. An Empirical Study of Training Self-Supervised Vision Transformers
48. Detecting Hate Speech with GPT-3
50. GLM: General Language Model Pretraining with Autoregressive Blank Infilling
51. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜
52. Self-Diagnosis and Self-Debiasing: A Proposal for Reducing Corpus-Based Bias in NLP
53. Zero-Shot Text-to-Image Generation
54. The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics
55. Exploring Text-transformers in AAAI 2021 Shared Task: COVID-19 Fake News Detection in English
56. Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies
57. The Pile: An 800GB Dataset of Diverse Text for Language Modeling
58. RealFormer: Transformer Likes Residual Attention
59. Modifying Memories in Transformer Models
60. Large Scale Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training
61. CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models
62. RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models
63. Measuring Massive Multitask Language Understanding
64. DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters
65. Memory-Efficient Pipeline-Parallel DNN Training
66. Self-Supervised Learning: Generative or Contrastive
67. Language Models are Few-Shot Learners
68. MLSUM: The Multilingual Summarization Corpus
69. StereoSet: Measuring stereotypical bias in pretrained language models
70. CLUE: A Chinese Language Understanding Evaluation Benchmark
71. Pre-trained models for natural language processing: A survey
72. MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers
73. Understand in 5 Minutes!? Skimming Famous Papers: Jacob Devlin et al.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
74. On Layer Normalization in the Transformer Architecture
75. How Much Knowledge Can You Pack into the Parameters of a Language Model?
76. Measurement and Fairness
77. PIQA: Reasoning about Physical Commonsense in Natural Language
78. Semantic Noise Matters for Neural Natural Language Generation
79. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
80. Quantifying the Carbon Emissions of Machine Learning
81. Q8BERT: Quantized 8Bit BERT
82. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
83. Reducing Transformer Depth on Demand with Structured Dropout
84. TinyBERT: Distilling BERT for Natural Language Understanding
85. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
86. Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT
87. Entity, Relation, and Event Extraction with Contextualized Span Representations
88. “Going on a vacation” takes longer than “Going for a walk”: A Study of Temporal Commonsense Understanding
89. Natural Questions: A Benchmark for Question Answering Research
91. Energy and Policy Considerations for Deep Learning in NLP
92. GLTR: Statistical Detection and Visualization of Generated Text
93. Unified Language Model Pre-training for Natural Language Understanding and Generation
94. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems
95. Are Sixteen Heads Really Better than One?
96. Parameter-Efficient Transfer Learning for NLP
97. Transformer-XL: Attentive Language Models beyond a Fixed-Length Context
98. GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
99. Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering
100. T-REx: A Large Scale Alignment of Natural Language with Knowledge Base Triples
101. Gender Bias in Coreference Resolution
102. Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
103. Generating Wikipedia by Summarizing Long Sequences
104. Decoupled Weight Decay Regularization
105. Mixed Precision Training
106. Position-aware Attention and Supervised Data Improve Slot Filling
107. Zero-Shot Learning—A Comprehensive Evaluation of the Good, the Bad and the Ugly
108. Attention is All you Need
109. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension
111. Gaussian Error Linear Units (GELUs)
112. The LAMBADA dataset: Word prediction requiring a broad discourse context
113. Semantic Parsing on Freebase from Question-Answer Pairs
114. Towards Robust Linguistic Analysis using OntoNotes
115. The Winograd Schema Challenge
116. Modeling Relations and Their Mentions without Labeled Text
117. Introduction to the CoNLL-2005 Shared Task: Semantic Role Labeling
118. ROUGE: A Package for Automatic Evaluation of Summaries
119. A Linear Programming Formulation for Global Inference in Natural Language Tasks
120. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition
121. The GENIA corpus: an annotated research abstract corpus in molecular biology domain
122. A bridging model for parallel computation
123. P-Tuning: Prompt Tuning Can Be Comparable to Fine-tuning Across Scales and Tasks
124. Accelerated inference for large transformer models using NVIDIA Triton Inference Server
126. Jurassic-1: Technical details and evaluation
127. Prefix-Tuning: Optimizing Continuous Prompts for Generation
128. WuDaoCorpora: A super large-scale Chinese corpora for pre-training language models
129. Stereotyping Norwegian Salmon: An Inventory of Pitfalls in Fairness Benchmark Datasets
130. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. https://github.com/kingoflolz/mesh-transformer-jax
133. Ethos: an online hate speech detection dataset
134. The 2020 Bilingual, Bi-Directional WebNLG+ Shared Task: Overview and Evaluation Results (WebNLG+ 2020)
137. Language Models are Unsupervised Multitask Learners
139. CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge
141. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
142. Improving language understanding with unsupervised learning
143. From TreeBank to PropBank