Challenging BIG… (2022-10-17T00:00:00.000000Z)

TL;DR

This work finds that applying chain-of-thought (CoT) prompting to BBH tasks enables PaLM to surpass the average human-rater performance on 10 of the 23 tasks, and Codex to surpass it on 17 of the23 tasks.

Abstract

BIG-Bench (Srivastava et al., 2022) is a diverse evaluation suite that focuses on tasks believed to be beyond the capabilities of current language models. Language models have already made good progress on this benchmark, with the best model in the BIG-Bench paper outperforming average reported human-rater results on 65% of the BIG-Bench tasks via few-shot prompting. But on what tasks do language models fall short of average human-rater performance, and are those tasks actually unsolvable by current language models? In this work, we focus on a suite of 23 challenging BIG-Bench tasks which we call BIG-Bench Hard (BBH). These are the task for which prior language model evaluations did not outperform the average human-rater. We find that applying chain-of-thought (CoT) prompting to BBH tasks enables PaLM to surpass the average human-rater performance on 10 of the 23 tasks, and Codex (code-davinci-002) to surpass the average human-rater performance on 17 of the 23 tasks. Since many tasks in BBH require multi-step reasoning, few-shot prompting without CoT, as done in the BIG-Bench evaluations (Srivastava et al., 2022), substantially underestimates the best performance and capabilities of language models, which is better captured via CoT prompting. As further analysis, we explore the interaction between CoT and model scale on BBH, finding that CoT enables emergent task performance on several BBH tasks with otherwise flat scaling curves.

Authors

Abstract

References67 items

Large Language Models are few(1)-shot Table Reasoners

Language Models are Multilingual Chain-of-Thought Reasoners

Binding Language Models in Symbolic Languages

Compositional Semantic Parsing with Large Language Models

Text and Patterns: For Effective Chain of Thought, It Takes Two to Tango

Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling?

Language Models (Mostly) Know What They Know

Emergent Abilities of Large Language Models

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

Large Language Models are Zero-Shot Reasoners

Prompt-and-Rerank: A Method for Zero-Shot and Few-Shot Arbitrary Textual Style Transfer with Small Language Models

Least-to-Most Prompting Enables Complex Reasoning in Large Language Models

Selection-Inference: Exploiting Large Language Models for Interpretable Logical Reasoning

AmbiPun: Generating Humorous Puns with Ambiguous Context

PaLM: Scaling Language Modeling with Pathways

Can language models learn from explanations in context?

Training Compute-Optimal Large Language Models

Self-Consistency Improves Chain of Thought Reasoning in Language Models

Training language models to follow instructions with human feedback

Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?

Predictability and Surprise in Large Generative Models

Competition-level code generation with AlphaCode

Chain of Thought Prompting Elicits Reasoning in Large Language Models

Reframing Human-AI Collaboration for Generating Free-Text Explanations

Scaling Language Models: Methods, Analysis & Insights from Training Gopher

Show Your Work: Scratchpads for Intermediate Computation with Language Models

Few-Shot Self-Rationalization with Natural Language Prompts

An Explanation of In-context Learning as Implicit Bayesian Inference

MetaICL: Learning to Learn In Context

Multitask Prompted Training Enables Zero-Shot Task Generalization

Language Models are Few-shot Multilingual Learners

Can Language Models Encode Perceptual Structure Without Grounding? A Case Study in Color

A Recipe for Arbitrary Text Style Transfer with Large Language Models

Finetuned Language Models Are Zero-Shot Learners

Do Prompt-Based Models Really Understand the Meaning of Their Prompts?

Program Synthesis with Large Language Models

Evaluating Large Language Models Trained on Code

True Few-Shot Learning with Language Models

Cross-Task Generalization via Natural Language Crowdsourcing Instructions

The Power of Scale for Parameter-Efficient Prompt Tuning

Calibrate Before Use: Improving Few-Shot Performance of Language Models

When Can Models Learn From Explanations? A Formal Framework for Understanding the Roles of Explanation Data

Scaling Laws for Transfer

Towards Interpretable Natural Language Understanding with Explanations as Latent Variables

Language Models are Few-Shot Learners

Scaling Laws for Neural Language Models

Exploiting Cloze-Questions for Few-Shot Text Classification and Natural Language Inference

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics

On the Advance of Making Language Models Better Reasoners

Mapping Language Models to Grounded Conceptual Spaces

On the Machine Learning of Ethical Judgments from Natural Language

Natural Language Inference with a Human Touch: Using Human Explanations to Guide Model Attention

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Amongst all the options, the only movie similar to these ones seems to be The Princess Bride (1987)

crowdworkers) or research with human participants?

Did you report the full text of instructions given to participants, including e.g., screenshots, disclaimers of any risks to participants or annotators

for preprocessing, for normalization, or for evaluation), did you report the implementation, model, and parameter settings used (e.g., NLTK, Spacy, ROUGE, etc

error bars around results, summary statistics from sets of experiments), and is it transparent whether you are reporting the max, mean, etc

crowdsourcing platform, students) and paid participants, and discuss if such payment is adequate given the participants' demographic

Kristian tells the truth

The PaLM models

Was the data collection protocol approved (or determined exempt) by an ethics review board? No response

Did you report the basic demographic and geographic characteristics of the annotator population that is the source of the data? No response

Fidel says Maybelle lies

CoT Prompt for Snarks Determine which of two sentences is sarcastic

tasks: cause and effect, word unscrambling, movie dialog same or different, moral permissibility, fake text, discourse marker prediction, checkmate in one, mnist ascii, ascii word

TL;DR

Abstract

Authors

TL;DR

Abstract

Authors

References67 items

Large Language Models are few(1)-shot Table Reasoners

Language Models are Multilingual Chain-of-Thought Reasoners

Binding Language Models in Symbolic Languages

Compositional Semantic Parsing with Large Language Models

Text and Patterns: For Effective Chain of Thought, It Takes Two to Tango

Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling?

Language Models (Mostly) Know What They Know

Emergent Abilities of Large Language Models

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

Large Language Models are Zero-Shot Reasoners

Prompt-and-Rerank: A Method for Zero-Shot and Few-Shot Arbitrary Textual Style Transfer with Small Language Models

Least-to-Most Prompting Enables Complex Reasoning in Large Language Models

Selection-Inference: Exploiting Large Language Models for Interpretable Logical Reasoning

AmbiPun: Generating Humorous Puns with Ambiguous Context

PaLM: Scaling Language Modeling with Pathways

Can language models learn from explanations in context?

Training Compute-Optimal Large Language Models

Self-Consistency Improves Chain of Thought Reasoning in Language Models

Training language models to follow instructions with human feedback

Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?

Predictability and Surprise in Large Generative Models

Competition-level code generation with AlphaCode

Chain of Thought Prompting Elicits Reasoning in Large Language Models

Reframing Human-AI Collaboration for Generating Free-Text Explanations

Scaling Language Models: Methods, Analysis & Insights from Training Gopher

Show Your Work: Scratchpads for Intermediate Computation with Language Models

Few-Shot Self-Rationalization with Natural Language Prompts

An Explanation of In-context Learning as Implicit Bayesian Inference

MetaICL: Learning to Learn In Context

Multitask Prompted Training Enables Zero-Shot Task Generalization

Language Models are Few-shot Multilingual Learners

Can Language Models Encode Perceptual Structure Without Grounding? A Case Study in Color

A Recipe for Arbitrary Text Style Transfer with Large Language Models

Finetuned Language Models Are Zero-Shot Learners

Do Prompt-Based Models Really Understand the Meaning of Their Prompts?

Program Synthesis with Large Language Models

Evaluating Large Language Models Trained on Code

True Few-Shot Learning with Language Models

Cross-Task Generalization via Natural Language Crowdsourcing Instructions

The Power of Scale for Parameter-Efficient Prompt Tuning

Calibrate Before Use: Improving Few-Shot Performance of Language Models

When Can Models Learn From Explanations? A Formal Framework for Understanding the Roles of Explanation Data

Scaling Laws for Transfer

Towards Interpretable Natural Language Understanding with Explanations as Latent Variables

Language Models are Few-Shot Learners

Scaling Laws for Neural Language Models

Exploiting Cloze-Questions for Few-Shot Text Classification and Natural Language Inference

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics

On the Advance of Making Language Models Better Reasoners

Mapping Language Models to Grounded Conceptual Spaces

On the Machine Learning of Ethical Judgments from Natural Language

Natural Language Inference with a Human Touch: Using Human Explanations to Guide Model Attention

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Amongst all the options, the only movie similar to these ones seems to be The Princess Bride (1987)

crowdworkers) or research with human participants?

Did you report the full text of instructions given to participants, including e.g., screenshots, disclaimers of any risks to participants or annotators

for preprocessing, for normalization, or for evaluation), did you report the implementation, model, and parameter settings used (e.g., NLTK, Spacy, ROUGE, etc

error bars around results, summary statistics from sets of experiments), and is it transparent whether you are reporting the max, mean, etc

crowdsourcing platform, students) and paid participants, and discuss if such payment is adequate given the participants' demographic

Kristian tells the truth

The PaLM models

Was the data collection protocol approved (or determined exempt) by an ethics review board? No response

Did you report the basic demographic and geographic characteristics of the annotator population that is the source of the data? No response

Fidel says Maybelle lies

CoT Prompt for Snarks Determine which of two sentences is sarcastic

tasks: cause and effect, word unscrambling, movie dialog same or different, moral permissibility, fake text, discourse marker prediction, checkmate in one, mnist ascii, ascii word

Field of Study

Venue Information

Name

Type

URL

Alternate Names