1. The Human Evaluation Datasheet 1.0: A Template for Recording Details of Human Evaluation Experiments in NLP
2. DynaSent: A Dynamic Benchmark for Sentiment Analysis
3. Disentangling the Properties of Human Evaluation Methods: A Classification System to Support Comparability, Meta-Evaluation and Reproducibility Testing
4. GLGE: A New General Language Generation Evaluation Benchmark
5. Interpretable Multi-dataset Evaluation for Named Entity Recognition
6. GO FIGURE: A Meta Evaluation of Factuality in Summarization
7. mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer
8. WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization
9. Participatory Research for Low-resourced Machine Translation: A Case Study in African Languages
10. Utility Is in the Eye of the User: A Critique of NLP Leaderboard Design
11. KILT: A Benchmark for Knowledge Intensive Language Tasks
12. Towards a Decomposable Metric for Explainable Evaluation of Text Generation from AMR
13. SummEval: Re-evaluating Summarization Evaluation
14. Bringing the People Back In: Contesting Benchmark Machine Learning Datasets
15. DART: Open-Domain Structured Data Record to Text Generation
16. MultiWOZ 2.2: A Dialogue Dataset with Additional Annotation Corrections and State Tracking Baselines
17. Evaluation of Text Generation: A Survey
18. Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics
19. Beyond Leaderboards: A Survey of Methods for Revealing Weaknesses in Natural Language Inference Data and Models
20. Beyond Accuracy: Behavioral Testing of NLP Models with CheckList
21. FEQA: A Question Answering Evaluation Framework for Faithfulness Assessment in Abstractive Summarization
22. How Can We Accelerate Progress Towards Human-like Linguistic Generalization?
23. On Faithfulness and Factuality in Abstractive Summarization
24. Neural CRF Model for Sentence Alignment in Text Simplification
25. ASSET: A Dataset for Tuning and Evaluation of Sentence Simplification Models with Multiple Rewriting Transformations
26. XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning
27. MLSUM: The Multilingual Summarization Corpus
28. Few-Shot Natural Language Generation by Rewriting Templates
29. ToTTo: A Controlled Table-To-Text Generation Dataset
30. AmbigQA: Answering Ambiguous Open-domain Questions
31. A Human Evaluation of AMR-to-English Generation Systems
32. BLEU Might Be Guilty but References Are Not Innocent
33. BLEURT: Learning Robust Metrics for Text Generation
34. Asking and Answering Questions to Evaluate the Factual Consistency of Summaries
35. XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation
36. The State and Fate of Linguistic Diversity and Inclusion in the NLP World
37. XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization
38. Bangla Natural Language Image to Text (BNLIT)
39. CommonGen: A Constrained Text Generation Challenge for Generative Commonsense Reasoning
40. Multilingual Denoising Pre-training for Neural Machine Translation
41. PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization
42. Computational Linguistics: 16th International Conference of the Pacific Association for Computational Linguistics, PACLING 2019, Hanoi, Vietnam, October 11–13, 2019, Revised Selected Papers
43. Should All Cross-Lingual Embeddings Speak English?
44. DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation
45. Mimic and Rephrase: Reflective Listening in Open-Ended Dialogue
46. Semantic Noise Matters for Neural Natural Language Generation
47. Comprehensive Multi-Dataset Evaluation of Reading Comprehension
48. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
49. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
50. MLQA: Evaluating Cross-lingual Extractive Question Answering
51. Neural Generation for Czech: Data and Baselines
52. Towards Scalable Multi-domain Conversational Agents: The Schema-Guided Dialogue Dataset
53. Answers Unite! Unsupervised Metrics for Reinforced Summarization Models
54. Results of the WMT19 Metrics Shared Task: Segment-Level and Strong MT Systems Pose Big Challenges
55. RoBERTa: A Robustly Optimized BERT Pretraining Approach
56. ELI5: Long Form Question Answering
57. What Should I Ask? Using Conversationally Informative Rewards for Goal-oriented Visual Dialog
58. BIGPATENT: A Large-Scale Dataset for Abstractive and Coherent Summarization
59. Data-to-text Generation with Entity Modeling
60. Handling Divergent Reference Texts when Evaluating Table-to-Text Generation
61. Question Answering as an Automatic Evaluation Metric for News Article Summarization
62. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems
63. Unsupervised Data Augmentation for Consistency Training
64. BERTScore: Evaluating Text Generation with BERT
65. Unifying Human and Statistical Evaluation for Natural Language Generation
66. Evaluating the State-of-the-Art of End-to-End Natural Language Generation: The E2E NLG Challenge
67. Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science
68. Content Selection in Deep Learning Models of Summarization
69. Model Cards for Model Reporting
70. Results of the WMT18 Metrics Shared Task: Both Characters and Embeddings Achieve Good Performance
71. Wizard of Wikipedia: Knowledge-Powered Conversational Agents
72. A Structured Review of the Validity of BLEU
73. Don't Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization
74. CoQA: A Conversational Question Answering Challenge
75. Measuring the Diversity of Automatic Image Descriptions
76. The First Multilingual Surface Realisation Shared Task (SR'18): Overview and Evaluation Results
77. The Natural Language Decathlon: Multitask Learning as Question Answering
78. Hierarchical Neural Story Generation
79. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
80. A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents
81. Datasheets for Datasets
82. Dear Sir or Madam, May I Introduce the GYAFC Dataset: Corpus, Benchmarks and Metrics for Formality Style Transfer
83. Generating Wikipedia by Summarizing Long Sequences
84. Personalizing Dialogue Agents: I have a dog, do you have pets too?
85. The NarrativeQA Reading Comprehension Challenge
86. Visual Question Generation as Dual Task of Visual Question Answering
87. The WebNLG Challenge: Generating Text from RDF Data
88. Results of the WMT17 Metrics Shared Task
89. Challenges in Data-to-Document Generation
90. The E2E Dataset: New Challenges for End-to-End Generation
91. Analysing Data-to-Text Generation Benchmarks
92. Learning to Ask: Neural Question Generation for Reading Comprehension
93. Survey of the State of the Art in Natural Language Generation: Core Tasks, Applications and Evaluation
94. A Context-aware Natural Language Generator for Dialogue Systems
95. Results of the WMT16 Metrics Shared Task
96. Optimizing Statistical Machine Translation for Text Simplification
97. Sequence-to-Sequence Generation for Spoken Dialogue via Deep Syntax Trees and Strings
98. Neural Text Generation from Structured Data with Application to the Biography Domain
99. Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond
101. A Diversity-Promoting Objective Function for Neural Conversation Models
102. Personalized Machine Translation: Predicting Translational Preferences
103. Results of the WMT15 Metrics Shared Task
104. The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems
105. LCSTS: A Large Scale Chinese Short Text Summarization Dataset
106. Teaching Machines to Read and Comprehend
107. Chinese Poetry Generation with Recurrent Neural Networks
108. Neural Machine Translation by Jointly Learning to Align and Translate
109. The First Surface Realisation Shared Task: Overview and Evaluation Results
110. Heavy rain events in the Western Mediterranean: an atmospheric pattern classification
111. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments
112. ROUGE: A Package for Automatic Evaluation of Summaries
113. BLEU: A Method for Automatic Evaluation of Machine Translation
114. Building Natural-Language Generation Systems
115. The Plane with Parallel Coordinates
119. GENIE: A Leaderboard for Human-in-the-Loop Evaluation of Text Generation
120. SAFEval: Summarization Asks for Fact-based Evaluation
121. Twenty Years of Confusion in Human Evaluation: NLG Needs Evaluation Sheets and Standardised Definitions
122. Results of the WMT20 Metrics Shared Task
123. The Third Multilingual Surface Realisation Shared Task (SR'20): Overview and Evaluation Results
124. Findings of the Fourth Workshop on Neural Generation and Translation
125. The 2020 Bilingual, Bi-Directional WebNLG+ Shared Task: Overview and Evaluation Results (WebNLG+ 2020)
126. STORIUM: A Dataset and Evaluation Platform for Machine-in-the-Loop Story Generation
128. NUBIA: NeUral Based Interchangeability Assessor for Text Generation
129. Schema-Guided Dialogue
130. A Case Study in African Languages
131. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
132. How to Compare Summarizers without Target Length? Pitfalls, Solutions and Re-Examination of the Neural Summarization Literature
133. Proceedings of the 2nd Workshop on Multilingual Surface Realisation (MSR 2019)
134. Mimic and Rephrase: Reflective Listening in Open-Ended Dialogue
135. The #BenderRule: On Naming the Languages We Study and Why It Matters
136. Survey of the State of the Art in Natural Language Generation
139. A Context-aware Natural Language Generation Dataset for Dialogue Systems
141. Alex Context NLG (Dušek and Jurčíček)
142. Automatic Analysis of Syntactic Complexity in Second Language Writing
143. A Mathematical Theory of Communication
144. How Complex Is That Sentence? A Proposed Revision of the Rosenberg and Abbeduto D-Level Scale
145. Bringing the People Back In: Contesting Benchmark Machine Learning Datasets
146. Evaluating the State of the Art
147. Studies in Language Behavior: A Program of Research
148. Linguistic Issues in Language Technology (LiLT): On Achieving and Evaluating Language-Independence in NLP
149. A Dataset and Evaluation Metrics for Abstractive Compression of Sentences and Short Paragraphs
Criteria Selection Survey

As part of our selection process, we queried all GEM members about the utility of tasks and selection criteria. The questions below were included in the survey.

For each suggested task:
• High-level task, e.g., data-to-text, or summarization
• Communicative goal, e.g., provide specific information, or entertainment, or accomplish a task
• Entity tracking/generation, referring expression generation, surface realization, content selection
• Languages, e.g., en-US, es-MX
• Input modality, e.g., text, graph, table, images
• Examples in dataset
• Test split, e.g., i.i.d., or non-overlap dimension
• # References per example
• Data quality / potential issues, e.g., noisy, clean, biased, code-mixing (different languages/writing systems)
• Evaluation strategies (in original paper / papers that use dataset)

Statements about selection criteria:
• Diversity of tasks is more important than focus on an NLG task (by including multiple datasets for the same task)
• If we include an NLG task (e.g., simplification or data2text), we need multiple datasets for that task
• We should exclude tasks that require encoding anything but text (e.g., images or graphs)
• We should exclude tasks that are the focus of a shared task in 2021
• We should include a set of tasks with no clear evaluation strategy
• We should prefer tasks with test sets with multiple references
• We should include noisy and clean datasets
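The per-task survey questions amount to a small record schema. A minimal sketch in Python follows; the field names are paraphrased from the survey questions above for illustration and are not an official GEM schema:

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative sketch only: field names paraphrase the survey questions,
# not an official GEM data structure.
@dataclass
class TaskSuggestion:
    high_level_task: str                 # e.g., "data-to-text" or "summarization"
    communicative_goal: str              # e.g., "provide specific information"
    languages: List[str]                 # e.g., ["en-US", "es-MX"]
    input_modality: str                  # e.g., "text", "graph", "table", "images"
    num_examples: int                    # examples in dataset
    test_split: str                      # e.g., "i.i.d." or "non-overlap"
    references_per_example: int          # number of references per example
    quality_issues: List[str] = field(default_factory=list)       # e.g., ["noisy"]
    evaluation_strategies: List[str] = field(default_factory=list)

# Example record for a hypothetical table-to-text suggestion.
suggestion = TaskSuggestion(
    high_level_task="data-to-text",
    communicative_goal="provide specific information",
    languages=["en-US"],
    input_modality="table",
    num_examples=120000,
    test_split="non-overlap",
    references_per_example=3,
    quality_issues=["clean"],
)
```

Keeping issue fields as lists with empty defaults mirrors the survey's open-ended questions, where a task may list zero or more quality concerns and evaluation strategies.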