SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems (2019-05-02)

TL;DR

SuperGLUE is a new benchmark styled after GLUE, comprising a new set of more difficult language understanding tasks, a software toolkit, and a public leaderboard.

Abstract

In the last year, new models and methods for pretraining and transfer learning have driven striking performance improvements across a range of language understanding tasks. The GLUE benchmark, introduced a little over one year ago, offers a single-number metric that summarizes progress on a diverse set of such tasks, but performance on the benchmark has recently surpassed the level of non-expert humans, suggesting limited headroom for further research. In this paper we present SuperGLUE, a new benchmark styled after GLUE with a new set of more difficult language understanding tasks, a software toolkit, and a public leaderboard. SuperGLUE is available at this http URL.
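The "single-number metric" mentioned above can be illustrated with a small sketch. It assumes the overall score is an unweighted macro-average: metrics are first averaged within each task, then across tasks. The task names and scores below are hypothetical placeholders, not results from the paper, and `overall_score` is an illustrative helper rather than part of any released toolkit.

```python
from statistics import mean


def overall_score(task_metrics):
    """Macro-average: mean of metrics within each task, then mean across tasks."""
    per_task = [mean(list(metrics.values())) for metrics in task_metrics.values()]
    return mean(per_task)


# Hypothetical per-task scores (assumed for illustration only).
example_scores = {
    "task_a": {"accuracy": 80.0},
    "task_b": {"f1": 70.0, "accuracy": 90.0},  # two metrics, averaged to 80.0
    "task_c": {"accuracy": 50.0},
}

score = overall_score(example_scores)
print(f"overall score: {score:.1f}")
```

Averaging within a task first keeps a task that reports two metrics from counting twice in the overall number.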

Authors

Alex Wang

Amanpreet Singh

Julian Michael
