[1] Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
[2] Designing Effective Sparse Expert Models
[3] PaLM: Scaling Language Modeling with Pathways
[4] Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer
[5] Unified Scaling Laws for Routed Language Models
[6] Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model
[7] LaMDA: Language Models for Dialog Applications
[8] Efficient Large Scale Language Modeling with Mixtures of Experts
[9] GLaM: Efficient Scaling of Language Models with Mixture-of-Experts
[10] Improving language models by retrieving from trillions of tokens
[11] Scaling Language Models: Methods, Analysis & Insights from Training Gopher
[12] Ethical and social risks of harm from Language Models
[13] Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers
[14] Challenges in Detoxifying Language Models
[15] TruthfulQA: Measuring How Models Mimic Human Falsehoods
[16] Detoxifying Language Models Risks Marginalizing Minority Voices
[17] On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜
[18] Scaling Laws for Transfer
[19] Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
[20] The Pile: An 800GB Dataset of Diverse Text for Language Modeling
[21] The Depth-to-Width Interplay in Self-Attention
[22] Distilling Knowledge from Reader to Retriever for Question Answering
[23] RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models
[24] Measuring Massive Multitask Language Understanding
[25] Language Models are Few-Shot Learners
[26] Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
[27] REALM: Retrieval-Augmented Language Model Pre-Training
[28] Scaling Laws for Neural Language Models
[29] PIQA: Reasoning about Physical Commonsense in Natural Language
[30] Compressive Transformers for Long-Range Sequence Modelling
[31] Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
[32] ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
[33] Natural Questions: A Benchmark for Question Answering Research
[35] Which Algorithmic Choices Matter at Which Batch Sizes? Insights From a Noisy Quadratic Model
[36] BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions
[37] HellaSwag: Can a Machine Really Finish Your Sentence?
[38] Approximation rates for neural networks with general activation functions
[39] An Empirical Model of Large-Batch Training
[40] Measuring the Effects of Data Parallelism on Neural Network Training
[41] Model Cards for Model Reporting
[42] SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
[43] Gender Bias in Coreference Resolution
[44] Decoupled Weight Decay Regularization
[45] Attention Is All You Need
[46] TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension
[47] In-datacenter performance analysis of a tensor processing unit
[48] RACE: Large-scale ReAding Comprehension Dataset From Examinations
[49] Pointer Sentinel Mixture Models
[50] The LAMBADA dataset: Word prediction requiring a broad discourse context
[51] Adam: A Method for Stochastic Optimization
[52] Convex Optimization: Algorithms and Complexity
[53] Updating Quasi-Newton Matrices With Limited Storage
[54] A Stochastic Approximation Method
[55] Updates and lessons from AI forecasting
[56] Jurassic-1: Technical details and evaluation
[59] SocialIQA: Commonsense reasoning about social interactions
[60] JAX: composable transformations of Python+NumPy programs
[61] Common sense understanding on HellaSwag
[62] On robust estimation of the location parameter
Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes] The claims in the abstract describe the work clearly.
Did you describe the limitations of your work? [Yes] We address the limitations of our work.
Did you discuss any potential negative societal impacts of your work? [Yes] We include a discussion both in a model card and in Appendix I.
Have you read the ethics review guidelines and ensured that your paper conforms to them?
Did you include complete proofs of all theoretical results?
Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] We provide all training details and hyperparameters.
Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)?
If you are using existing assets (e.g., code, data, models) or curating/releasing new assets: (a) If your work uses existing assets, did you cite the creators?
Did you mention the license of the assets? [Yes] We use the same data as Rae et al. (2021), which is a proprietary dataset; we also show results on an open-source dataset, C4.
Did you include any new assets either in the supplemental material or as a URL?
Did you discuss whether and how consent was obtained from people whose data you're using/curating?
Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [Yes] We include a model card which covers this information.
If you used crowdsourcing or conducted research with human subjects: (a) Did you include the full text of instructions given to participants and screenshots, if applicable?
Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable?
Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation?
Motivation: We chose evaluations from Rae et al. (2021) to allow us to most directly compare to Gopher.