1. The Human Evaluation Datasheet 1.0: A Template for Recording Details of Human Evaluation Experiments in NLP
2. DynaSent: A Dynamic Benchmark for Sentiment Analysis
3. Disentangling the Properties of Human Evaluation Methods: A Classification System to Support Comparability, Meta-Evaluation and Reproducibility Testing
4. GLGE: A New General Language Generation Evaluation Benchmark
5. Interpretable Multi-dataset Evaluation for Named Entity Recognition
6. GO FIGURE: A Meta Evaluation of Factuality in Summarization
7. mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer
8. WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization
9. Participatory Research for Low-resourced Machine Translation: A Case Study in African Languages
10. Utility Is in the Eye of the User: A Critique of NLP Leaderboard Design
11. KILT: A Benchmark for Knowledge Intensive Language Tasks
12. Towards a Decomposable Metric for Explainable Evaluation of Text Generation from AMR
13. SummEval: Re-evaluating Summarization Evaluation
14. Bringing the People Back In: Contesting Benchmark Machine Learning Datasets
15. DART: Open-Domain Structured Data Record to Text Generation
16. MultiWOZ 2.2: A Dialogue Dataset with Additional Annotation Corrections and State Tracking Baselines
17. Evaluation of Text Generation: A Survey
18. Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics
19. Beyond Leaderboards: A Survey of Methods for Revealing Weaknesses in Natural Language Inference Data and Models
20. Beyond Accuracy: Behavioral Testing of NLP Models with CheckList
21. FEQA: A Question Answering Evaluation Framework for Faithfulness Assessment in Abstractive Summarization
22. How Can We Accelerate Progress Towards Human-like Linguistic Generalization?
23. On Faithfulness and Factuality in Abstractive Summarization
24. Neural CRF Model for Sentence Alignment in Text Simplification
25. ASSET: A Dataset for Tuning and Evaluation of Sentence Simplification Models with Multiple Rewriting Transformations
26. XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning
27. MLSUM: The Multilingual Summarization Corpus
28. Few-Shot Natural Language Generation by Rewriting Templates
29. ToTTo: A Controlled Table-To-Text Generation Dataset
30. AmbigQA: Answering Ambiguous Open-domain Questions
31. A Human Evaluation of AMR-to-English Generation Systems
32. BLEU Might Be Guilty but References Are Not Innocent
33. BLEURT: Learning Robust Metrics for Text Generation
34. Asking and Answering Questions to Evaluate the Factual Consistency of Summaries
35. XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation
36. The State and Fate of Linguistic Diversity and Inclusion in the NLP World
37. XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization
38. Bangla Natural Language Image to Text (BNLIT)
39. CommonGen: A Constrained Text Generation Challenge for Generative Commonsense Reasoning
40. Multilingual Denoising Pre-training for Neural Machine Translation
41. PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization
42. Computational Linguistics: 16th International Conference of the Pacific Association for Computational Linguistics, PACLING 2019, Hanoi, Vietnam, October 11–13, 2019, Revised Selected Papers
43. Should All Cross-Lingual Embeddings Speak English?
44. DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation
45. Mimic and Rephrase: Reflective Listening in Open-Ended Dialogue
46. Semantic Noise Matters for Neural Natural Language Generation
47. Comprehensive Multi-Dataset Evaluation of Reading Comprehension
48. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
49. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
50. MLQA: Evaluating Cross-lingual Extractive Question Answering
51. Neural Generation for Czech: Data and Baselines
52. Towards Scalable Multi-domain Conversational Agents: The Schema-Guided Dialogue Dataset
53. Answers Unite! Unsupervised Metrics for Reinforced Summarization Models
54. Results of the WMT19 Metrics Shared Task: Segment-Level and Strong MT Systems Pose Big Challenges
55. RoBERTa: A Robustly Optimized BERT Pretraining Approach
56. ELI5: Long Form Question Answering
57. What Should I Ask? Using Conversationally Informative Rewards for Goal-oriented Visual Dialog
58. BIGPATENT: A Large-Scale Dataset for Abstractive and Coherent Summarization
59. Data-to-text Generation with Entity Modeling
60. Handling Divergent Reference Texts when Evaluating Table-to-Text Generation
61. Question Answering as an Automatic Evaluation Metric for News Article Summarization
62. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems
63. Unsupervised Data Augmentation for Consistency Training
64. BERTScore: Evaluating Text Generation with BERT
65. Unifying Human and Statistical Evaluation for Natural Language Generation
66. Evaluating the State-of-the-Art of End-to-End Natural Language Generation: The E2E NLG Challenge
67. Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science
68. Content Selection in Deep Learning Models of Summarization
69. Model Cards for Model Reporting
70. Results of the WMT18 Metrics Shared Task: Both Characters and Embeddings Achieve Good Performance
71. Wizard of Wikipedia: Knowledge-Powered Conversational Agents
72. A Structured Review of the Validity of BLEU
73. Don't Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization
74. CoQA: A Conversational Question Answering Challenge
75. Measuring the Diversity of Automatic Image Descriptions
76. The First Multilingual Surface Realisation Shared Task (SR'18): Overview and Evaluation Results
77. The Natural Language Decathlon: Multitask Learning as Question Answering
78. Hierarchical Neural Story Generation
79. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
80. A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents
81. Datasheets for Datasets
82. Dear Sir or Madam, May I Introduce the GYAFC Dataset: Corpus, Benchmarks and Metrics for Formality Style Transfer
83. Generating Wikipedia by Summarizing Long Sequences
84. Personalizing Dialogue Agents: I have a dog, do you have pets too?
85. The NarrativeQA Reading Comprehension Challenge
86. Visual Question Generation as Dual Task of Visual Question Answering
87. The WebNLG Challenge: Generating Text from RDF Data
88. Results of the WMT17 Metrics Shared Task
89. Challenges in Data-to-Document Generation
90. The E2E Dataset: New Challenges for End-to-End Generation
91. Analysing Data-to-Text Generation Benchmarks
92. Learning to Ask: Neural Question Generation for Reading Comprehension
93. Survey of the State of the Art in Natural Language Generation: Core Tasks, Applications and Evaluation
94. A Context-aware Natural Language Generator for Dialogue Systems
95. Results of the WMT16 Metrics Shared Task
96. Optimizing Statistical Machine Translation for Text Simplification
97. Sequence-to-Sequence Generation for Spoken Dialogue via Deep Syntax Trees and Strings
98. Neural Text Generation from Structured Data with Application to the Biography Domain
99. Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond
101. A Diversity-Promoting Objective Function for Neural Conversation Models
102. Personalized Machine Translation: Predicting Translational Preferences
103. Results of the WMT15 Metrics Shared Task
104. The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems
105. LCSTS: A Large Scale Chinese Short Text Summarization Dataset
106. Teaching Machines to Read and Comprehend
107. Chinese Poetry Generation with Recurrent Neural Networks
108. Neural Machine Translation by Jointly Learning to Align and Translate
109. The First Surface Realisation Shared Task: Overview and Evaluation Results
110. Heavy rain events in the Western Mediterranean: an atmospheric pattern classification
111. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments
112. ROUGE: A Package for Automatic Evaluation of Summaries
113. BLEU: A Method for Automatic Evaluation of Machine Translation
114. Building Natural-Language Generation Systems
115. The Plane with Parallel Coordinates
119. GENIE: A Leaderboard for Human-in-the-Loop Evaluation of Text Generation
120. SAFEval: Summarization Asks for Fact-based Evaluation
121. Twenty Years of Confusion in Human Evaluation: NLG Needs Evaluation Sheets and Standardised Definitions
122. Results of the WMT20 Metrics Shared Task
123. The Third Multilingual Surface Realisation Shared Task (SR'20): Overview and Evaluation Results
124. Findings of the Fourth Workshop on Neural Generation and Translation
125. The 2020 Bilingual, Bi-Directional WebNLG+ Shared Task: Overview and Evaluation Results (WebNLG+ 2020)
126. STORIUM: A Dataset and Evaluation Platform for Machine-in-the-Loop Story Generation
128. NUBIA: NeUral Based Interchangeability Assessor for Text Generation
129. Schema-Guided Dialogue
130. A Case Study in African Languages
131. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
132. How to Compare Summarizers without Target Length? Pitfalls, Solutions and Re-Examination of the Neural Summarization Literature
133. Proceedings of the 2nd Workshop on Multilingual Surface Realisation (MSR 2019)
134. Mimic and Rephrase: Reflective Listening in Open-Ended Dialogue
135. The #BenderRule: On Naming the Languages We Study and Why It Matters
136. Survey of the State of the Art in Natural Language Generation
139. A Context-aware Natural Language Generation Dataset for Dialogue Systems
141. Alex Context NLG (Dušek and Jurčíček)
142. Automatic Analysis of Syntactic Complexity in Second Language Writing
143. A Mathematical Theory of Communication
144. How Complex Is That Sentence? A Proposed Revision of the Rosenberg and Abbeduto D-Level Scale
145. Bringing the People Back In: Contesting Benchmark Machine Learning Datasets
146. Evaluating the State of the Art
147. Studies in Language Behavior: A Program of Research
148. Linguistic Issues in Language Technology (LiLT): On Achieving and Evaluating Language-Independence in NLP
149. A Dataset and Evaluation Metrics for Abstractive Compression of Sentences and Short Paragraphs
Criteria Selection Survey

As part of our selection process, we queried all GEM members about the utility of tasks and selection criteria. The questions below were included in the survey.

For each suggested task:
• High-level task, e.g., data-to-text, or summarization
• Communicative goal, e.g., provide specific information, or entertainment, or accomplish a task
• Entity tracking/generation, referring expression generation, surface realization, content selection
• Languages, e.g., en-US, es-MX
• Input modality, e.g., text, graph, table, images
• Examples in dataset
• Test split, e.g., i.i.d., or non-overlap dimension
• # References per example
• Data quality / potential issues, e.g., noisy, clean, biased, code-mixing (different languages/writing systems)
• Evaluation strategies (in original paper / papers that use dataset)

Statements about selection criteria:
• Diversity of tasks is more important than focus on an NLG task (by including multiple datasets for the same task)
• If we include an NLG task (e.g., simplification or data2text), we need multiple datasets for that task
• We should exclude tasks that require encoding anything but text (e.g., images or graphs)
• We should exclude tasks that are the focus of a shared task in 2021
• We should include a set of tasks with no clear evaluation strategy
• We should prefer tasks with test sets with multiple references
• We should include noisy and clean datasets
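The per-task survey questions amount to a small record schema. A minimal sketch in Python follows; the field names are paraphrased from the survey questions above for illustration and are not an official GEM schema:

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative sketch only: field names paraphrase the survey questions,
# not an official GEM data structure.
@dataclass
class TaskSuggestion:
    high_level_task: str                 # e.g., "data-to-text" or "summarization"
    communicative_goal: str              # e.g., "provide specific information"
    languages: List[str]                 # e.g., ["en-US", "es-MX"]
    input_modality: str                  # e.g., "text", "graph", "table", "images"
    num_examples: int                    # examples in dataset
    test_split: str                      # e.g., "i.i.d." or "non-overlap"
    references_per_example: int          # number of references per example
    quality_issues: List[str] = field(default_factory=list)       # e.g., ["noisy"]
    evaluation_strategies: List[str] = field(default_factory=list)

# Example record for a hypothetical table-to-text suggestion.
suggestion = TaskSuggestion(
    high_level_task="data-to-text",
    communicative_goal="provide specific information",
    languages=["en-US"],
    input_modality="table",
    num_examples=120000,
    test_split="non-overlap",
    references_per_example=3,
    quality_issues=["clean"],
)
```

Keeping issue fields as lists with empty defaults mirrors the survey's open-ended questions, where a task may list zero or more quality concerns and evaluation strategies.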