Llemma: An Open Language Model For Mathematics (2023-10-16)

TL;DR

Llemma is a large language model for mathematics that, on the MATH benchmark, outperforms all known open base models and, on an equi-parameter basis, the unreleased Minerva model suite. It is also capable of tool use and formal theorem proving without any further finetuning.

Abstract

We present Llemma, a large language model for mathematics. We continue pretraining Code Llama on the Proof-Pile-2, a mixture of scientific papers, web data containing mathematics, and mathematical code, yielding Llemma. On the MATH benchmark, Llemma outperforms all known open base models, as well as the unreleased Minerva model suite on an equi-parameter basis. Moreover, Llemma is capable of tool use and formal theorem proving without any further finetuning. We openly release all artifacts, including 7-billion- and 34-billion-parameter models, the Proof-Pile-2, and code to replicate our experiments.

Authors

Stella Biderman

S. Welleck

Hailey Schoelkopf
