[2] Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
[3] OPT: Open Pre-trained Transformer Language Models
[4] PaLM: Scaling Language Modeling with Pathways
[5] Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
[6] STaR: Bootstrapping Reasoning With Reasoning
[7] Self-Consistency Improves Chain of Thought Reasoning in Language Models
[8] Training language models to follow instructions with human feedback
[9] Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?
[10] Chain of Thought Prompting Elicits Reasoning in Large Language Models
[11] Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model
[12] LaMDA: Language Models for Dialog Applications
[13] Scaling Language Models: Methods, Analysis & Insights from Training Gopher
[14] Show Your Work: Scratchpads for Intermediate Computation with Language Models
[15] Training Verifiers to Solve Math Word Problems
[16] Multitask Prompted Training Enables Zero-Shot Task Generalization
[17] Do Prompt-Based Models Really Understand the Meaning of Their Prompts?
[18] Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing
[19] Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity
[20] GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow
[21] Are NLP Models really able to Solve Simple Math Word Problems?
[22] Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm
[23] What Makes Good In-Context Examples for GPT-3?
[24] Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies
[25] Making Pre-trained Language Models Better Few-shot Learners
[26] The Pile: An 800GB Dataset of Diverse Text for Language Modeling
[27] Transformers: State-of-the-Art Natural Language Processing
[28] It’s Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners
[29] Language Models are Few-Shot Learners
[30] Unsupervised Commonsense Question Answering with Self-Talk
[31] Understand It in 5 Minutes!? Skim-Reading Famous Papers: Jacob Devlin et al.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
[32] PyTorch: An Imperative Style, High-Performance Deep Learning Library
[33] Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
[34] Explain Yourself! Leveraging Language Models for Commonsense Reasoning
[35] Attention is All you Need
[36] Program Induction by Rationale Generation: Learning to Solve and Explain Algebraic Word Problems
[37] Pointer Sentinel Mixture Models
[38] Solving General Arithmetic Word Problems
[39] MAWPS: A Math Word Problem Repository
[40] Parsing Algebraic Word Problems into Equations
[41] Learning to Solve Arithmetic Word Problems with Verb Categorization
[42] A measure of intelligence
[43] The structure of human intelligence: It is verbal, perceptual, and image rotation (VPR), not fluid and crystallized
[44] GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. https://github.com/kingoflolz/mesh-transformer-jax
[45] AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts
[46] CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge
[47] Language Models are Unsupervised Multitask Learners
[48] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
[49] The Cattell-Horn-Carroll Theory of Cognitive Abilities: Past, Present, and Future
[50] Heuristics and Biases: Individual Differences in Reasoning: Implications for the Rationality Debate?