[1] On The Impact of Machine Learning Randomness on Group Fairness
[2] Evaluating the Social Impact of Generative AI Systems in Systems and Society
[3] The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only
[4] The Curse of Recursion: Training on Generated Data Makes Models Forget
[5] The False Promise of Imitating Proprietary LLMs
[6] GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
[7] PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel
[8] OpenAssistant Conversations - Democratizing Large Language Model Alignment
[9] AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models
[10] Self-Refine: Iterative Refinement with Self-Feedback
[11] ChatGPT outperforms crowd workers for text-annotation tasks
[12] LLaMA: Open and Efficient Foundation Language Models
[13] Towards Safer Generative Language Models: A Survey on Safety Risks, Evaluations, and Improvements
[14] Pretraining Language Models with Human Preferences
[15] The Capacity for Moral Self-Correction in Large Language Models
[16] Augmented Language Models: a Survey
[17] Is ChatGPT better than Human Annotators? Potential and Limitations of ChatGPT in Explaining Implicit Hate Speech
[18] Toolformer: Language Models Can Teach Themselves to Use Tools
[19] The Flan Collection: Designing Data and Methods for Effective Instruction Tuning
[20] An Empirical Study of Metrics to Measure Representational Harms in Pre-Trained Language Models
[21] Self-Instruct: Aligning Language Models with Self-Generated Instructions
[22] Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor
[23] Constitutional AI: Harmlessness from AI Feedback
[24] Galactica: A Large Language Model for Science
[25] Efficiently Scaling Transformer Inference
[26] BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
[27] Large Language Models Are Human-Level Prompt Engineers
[28] Scaling Instruction-Finetuned Language Models
[29] Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them
[30] Language Generation Models Can Cause Harm: So What Can We Do About It? An Actionable Survey
[31] Re-contextualizing Fairness in NLP: The Case of India
[32] Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
[33] Fewer Errors, but More Stereotypes? The Effect of Model Size on Gender Bias
[34] ACT: Designing Sustainable Computer Systems with an Architectural Carbon Modeling Tool
[35] Measuring the Carbon Intensity of AI in Cloud Instances
[36] “I’m sorry to hear that”: Finding New Biases in Language Models with a Holistic Descriptor Dataset
[37] OPT: Open Pre-trained Transformer Language Models
[38] Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
[39] Based on billions of words on the internet, people = men
[40] Training Compute-Optimal Large Language Models
[41] ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection
[42] Training language models to follow instructions with human feedback
[43] Repairing the Cracked Foundation: A Survey of Obstacles in Evaluation Practices for Generated Text
[44] Chain of Thought Prompting Elicits Reasoning in Large Language Models
[45] SCROLLS: Standardized CompaRison Over Long Language Sequences
[46] WebGPT: Browser-assisted question-answering with human feedback
[47] GLaM: Efficient Scaling of Language Models with Mixture-of-Experts
[48] Ethical and social risks of harm from Language Models
[49] Scaling Language Models: Methods, Analysis & Insights from Training Gopher
[50] A General Language Assistant as a Laboratory for Alignment
[51] Sustainable AI: Environmental Implications, Challenges and Opportunities
[52] Training Verifiers to Solve Math Word Problems
[53] Understanding Dataset Difficulty with V-Usable Information
[54] TruthfulQA: Measuring How Models Mimic Human Falsehoods
[55] Hi, my name is Martha: Using names to measure and mitigate bias in generative dialogue models
[56] Finetuned Language Models Are Zero-Shot Learners
[57] Program Synthesis with Large Language Models
[58] Deduplicating Training Data Makes Language Models Better
[59] Evaluating Large Language Models Trained on Code
[60] Anticipating Safety Issues in E2E Conversational AI: Framework and Tooling
[61] All That’s ‘Human’ Is Not Gold: Evaluating Human Evaluation of Generated Text
[62] Carbon Emissions and Large Neural Network Training
[63] RoFormer: Enhanced Transformer with Rotary Position Embedding
[64] Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus
[65] On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜
[66] BOLD: Dataset and Metrics for Measuring Biases in Open-Ended Language Generation
[67] Chasing Carbon: The Elusive Environmental Footprint of Computing
[68] Recipes for Safety in Open-domain Chatbots
[69] Measuring Massive Multitask Language Understanding
[70] Learning to summarize from human feedback
[71] Open-Domain Conversational Agents: Current Progress, Open Problems, and Future Directions
[72] ColdGANs: Taming Language GANs with Cautious Sampling Strategies
[73] DeBERTa: Decoding-enhanced BERT with Disentangled Attention
[74] Language Models are Few-Shot Learners
[75] Residual Energy-Based Models for Text Generation
[76] Discriminative Adversarial Search for Abstractive Summarization
[77] GLU Variants Improve Transformer
[78] Scaling Laws for Neural Language Models
[79] PIQA: Reasoning about Physical Commonsense in Natural Language
[80] Fast Transformer Decoding: One Write-Head is All You Need
[81] The Impact of Artificial Intelligence on the Labor Market
[82] Grandmaster level in StarCraft II using multi-agent reinforcement learning
[83] Growing Up Together: Structured Exploration for Large Action Spaces
[84] Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
[85] Toward Understanding Catastrophic Forgetting in Continual Learning
[86] Natural Questions: A Benchmark for Question Answering Research
[87] RoBERTa: A Robustly Optimized BERT Pretraining Approach
[89] Defending Against Neural Fake News
[90] BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions
[91] HellaSwag: Can a Machine Really Finish Your Sentence?
[92] The Curious Case of Neural Text Degeneration
[93] Model Cards for Model Reporting
[94] QuAC: Question Answering in Context
[95] SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
[96] Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering
[97] Is Automation Labor-Displacing? Productivity Growth, Employment, and the Labor Share
[98] Know What You Don’t Know: Unanswerable Questions for SQuAD
[99] Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
[100] Artificial Intelligence, Automation and Work
[101] Decoupled Weight Decay Regularization
[102] Proximal Policy Optimization Algorithms
[103] Deep Reinforcement Learning from Human Preferences
[104] TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension
[105] Overcoming catastrophic forgetting in neural networks
[106] Enriching Word Vectors with Subword Information
[107] Neural Machine Translation of Rare Words with Subword Units
[108] Handbook of Inter-Rater Reliability: The Definitive Guide to Measuring the Extent of Agreement Among Raters
[109] VADER: A Parsimonious Rule-Based Model for Sentiment Analysis of Social Media Text
[110] Computing inter-rater reliability and its variance in the presence of high agreement
[111] Falcon-40B: An open large language model with state-of-the-art performance
112
Exploring AI Ethics of ChatGPT: A Diagnostic Analysis
113
Stanford alpaca: An instruction-following llama model
114
Huggingface h4 stack exchange preference dataset. 2023
115
Palm 2 technical report, 2023
116
Introducing mpt-7b: A new standard for open-source
117
Introducing the ai research supercluster -meta's cutting-edge ai supercomputer for ai research
118
Effect of scale on catastrophic forgetting in neural networks
119
Guiding the Release of Safer E2E Conversational AI through Value Sensitive Design
120
Fiedel. Palm: Scaling language modeling with pathways, 2022
121
Stereotyping Norwegian Salmon: An Inventory of Pitfalls in Fairness Benchmark Datasets
122
A framework for few-shot language model evaluation
123
Measuringmathematical problem solvingwith themath dataset
124
Socialiqa: Commonsense reasoning about social interactions
125
CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge
126
Root mean square layer normalization, 2019
127
Attention is all you need, 2017
128
In Advances in Neural Information Processing Systems
129
GPT-4 technical report. CoRR, abs/2303.08774
[131] Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality