[1] On The Impact of Machine Learning Randomness on Group Fairness
[2] Evaluating the Social Impact of Generative AI Systems in Systems and Society
[3] The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only
[4] The Curse of Recursion: Training on Generated Data Makes Models Forget
[5] The False Promise of Imitating Proprietary LLMs
[6] GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
[7] PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel
[8] OpenAssistant Conversations - Democratizing Large Language Model Alignment
[9] AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models
[10] Self-Refine: Iterative Refinement with Self-Feedback
[11] ChatGPT outperforms crowd workers for text-annotation tasks
[12] LLaMA: Open and Efficient Foundation Language Models
[13] Towards Safer Generative Language Models: A Survey on Safety Risks, Evaluations, and Improvements
[14] Pretraining Language Models with Human Preferences
[15] The Capacity for Moral Self-Correction in Large Language Models
[16] Augmented Language Models: a Survey
[17] Is ChatGPT better than Human Annotators? Potential and Limitations of ChatGPT in Explaining Implicit Hate Speech
[18] Toolformer: Language Models Can Teach Themselves to Use Tools
[19] The Flan Collection: Designing Data and Methods for Effective Instruction Tuning
[20] An Empirical Study of Metrics to Measure Representational Harms in Pre-Trained Language Models
[21] Self-Instruct: Aligning Language Models with Self-Generated Instructions
[22] Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor
[23] Constitutional AI: Harmlessness from AI Feedback
[24] Galactica: A Large Language Model for Science
[25] Efficiently Scaling Transformer Inference
[26] BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
[27] Large Language Models Are Human-Level Prompt Engineers
[28] Scaling Instruction-Finetuned Language Models
[29] Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them
[30] Language Generation Models Can Cause Harm: So What Can We Do About It? An Actionable Survey
[31] Re-contextualizing Fairness in NLP: The Case of India
[32] Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
[33] Fewer Errors, but More Stereotypes? The Effect of Model Size on Gender Bias
[34] ACT: Designing Sustainable Computer Systems with an Architectural Carbon Modeling Tool
[35] Measuring the Carbon Intensity of AI in Cloud Instances
[36] “I’m sorry to hear that”: Finding New Biases in Language Models with a Holistic Descriptor Dataset
[37] OPT: Open Pre-trained Transformer Language Models
[38] Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
[39] Based on billions of words on the internet, people = men
[40] Training Compute-Optimal Large Language Models
[41] ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection
[42] Training language models to follow instructions with human feedback
[43] Repairing the Cracked Foundation: A Survey of Obstacles in Evaluation Practices for Generated Text
[44] Chain of Thought Prompting Elicits Reasoning in Large Language Models
[45] SCROLLS: Standardized CompaRison Over Long Language Sequences
[46] WebGPT: Browser-assisted question-answering with human feedback
[47] GLaM: Efficient Scaling of Language Models with Mixture-of-Experts
[48] Ethical and social risks of harm from Language Models
[49] Scaling Language Models: Methods, Analysis & Insights from Training Gopher
[50] A General Language Assistant as a Laboratory for Alignment
[51] Sustainable AI: Environmental Implications, Challenges and Opportunities
[52] Training Verifiers to Solve Math Word Problems
[53] Understanding Dataset Difficulty with V-Usable Information
[54] TruthfulQA: Measuring How Models Mimic Human Falsehoods
[55] Hi, my name is Martha: Using names to measure and mitigate bias in generative dialogue models
[56] Finetuned Language Models Are Zero-Shot Learners
[57] Program Synthesis with Large Language Models
[58] Deduplicating Training Data Makes Language Models Better
[59] Evaluating Large Language Models Trained on Code
[60] Anticipating Safety Issues in E2E Conversational AI: Framework and Tooling
[61] All That’s ‘Human’ Is Not Gold: Evaluating Human Evaluation of Generated Text
[62] Carbon Emissions and Large Neural Network Training
[63] RoFormer: Enhanced Transformer with Rotary Position Embedding
[64] Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus
[65] On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜
[66] BOLD: Dataset and Metrics for Measuring Biases in Open-Ended Language Generation
[67] Chasing Carbon: The Elusive Environmental Footprint of Computing
[68] Recipes for Safety in Open-domain Chatbots
[69] Measuring Massive Multitask Language Understanding
[70] Learning to summarize from human feedback
[71] Open-Domain Conversational Agents: Current Progress, Open Problems, and Future Directions
[72] ColdGANs: Taming Language GANs with Cautious Sampling Strategies
[73] DeBERTa: Decoding-enhanced BERT with Disentangled Attention
[74] Language Models are Few-Shot Learners
[75] Residual Energy-Based Models for Text Generation
[76] Discriminative Adversarial Search for Abstractive Summarization
[77] GLU Variants Improve Transformer
[78] Scaling Laws for Neural Language Models
[79] PIQA: Reasoning about Physical Commonsense in Natural Language
[80] Fast Transformer Decoding: One Write-Head is All You Need
[81] The Impact of Artificial Intelligence on the Labor Market
[82] Grandmaster level in StarCraft II using multi-agent reinforcement learning
[83] Growing Up Together: Structured Exploration for Large Action Spaces
[84] Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
[85] Toward Understanding Catastrophic Forgetting in Continual Learning
[86] Natural Questions: A Benchmark for Question Answering Research
[87] RoBERTa: A Robustly Optimized BERT Pretraining Approach
[89] Defending Against Neural Fake News
[90] BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions
[91] HellaSwag: Can a Machine Really Finish Your Sentence?
[92] The Curious Case of Neural Text Degeneration
[93] Model Cards for Model Reporting
[94] QuAC: Question Answering in Context
[95] SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
[96] Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering
[97] Is Automation Labor-Displacing? Productivity Growth, Employment, and the Labor Share
[98] Know What You Don’t Know: Unanswerable Questions for SQuAD
[99] Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
[100] Artificial Intelligence, Automation and Work
[101] Decoupled Weight Decay Regularization
[102] Proximal Policy Optimization Algorithms
[103] Deep Reinforcement Learning from Human Preferences
[104] TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension
[105] Overcoming catastrophic forgetting in neural networks
[106] Enriching Word Vectors with Subword Information
[107] Neural Machine Translation of Rare Words with Subword Units
[108] Handbook of Inter-Rater Reliability: The Definitive Guide to Measuring the Extent of Agreement Among Raters
[109] VADER: A Parsimonious Rule-Based Model for Sentiment Analysis of Social Media Text
[110] Computing inter-rater reliability and its variance in the presence of high agreement
[111] Falcon-40B: An open large language model with state-of-the-art performance
112
Exploring AI Ethics of ChatGPT: A Diagnostic Analysis
113
Stanford alpaca: An instruction-following llama model
114
Huggingface h4 stack exchange preference dataset. 2023
115
Palm 2 technical report, 2023
116
Introducing mpt-7b: A new standard for open-source
117
Introducing the ai research supercluster -meta's cutting-edge ai supercomputer for ai research
118
Effect of scale on catastrophic forgetting in neural networks
119
Guiding the Release of Safer E2E Conversational AI through Value Sensitive Design
120
Fiedel. Palm: Scaling language modeling with pathways, 2022
121
Stereotyping Norwegian Salmon: An Inventory of Pitfalls in Fairness Benchmark Datasets
122
A framework for few-shot language model evaluation
123
Measuringmathematical problem solvingwith themath dataset
124
Socialiqa: Commonsense reasoning about social interactions
125
CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge
126
Root mean square layer normalization, 2019
127
Attention is all you need, 2017
128
In Advances in Neural Information Processing Systems
129
GPT-4 technical report. CoRR, abs/2303.08774
[131] Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality