We present the Natural Questions corpus, a question answering data set. Questions consist of real, anonymized, aggregated queries issued to the Google search engine. An annotator is presented with a question along with a Wikipedia page from the top 5 search results, and annotates a long answer (typically a paragraph) and a short answer (one or more entities) if present on the page, or marks null if no long/short answer is present. The public release consists of 307,373 training examples with single annotations; 7,830 examples with 5-way annotations for development data; and a further 7,842 5-way annotated examples sequestered as test data. We present experiments validating the quality of the data. We also describe an analysis of 25-way annotations on 302 examples, giving insights into human variability on the annotation task. We introduce robust metrics for evaluating question answering systems; demonstrate high human upper bounds on these metrics; and establish baseline results using competitive methods drawn from related literature.
Authors: Jakob Uszkoreit, Chris Alberti, T. Kwiatkowski, J. Palomaki, Olivia Redfield, Michael Collins, Ankur P. Parikh, D. Epstein, I. Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Ming-Wei Chang, Slav Petrov
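As a concrete illustration of the annotation scheme described in the abstract, the sketch below models an example as a question paired with per-annotator long and short answer spans, and applies a 2-of-5 aggregation rule of the kind underlying the corpus's robust metrics. This is a minimal sketch only: the class and field names, the span representation, and the `has_long_answer` helper are illustrative assumptions, not the official release schema or evaluation script.

    from dataclasses import dataclass
    from typing import List, Optional, Tuple


    @dataclass
    class Annotation:
        """One annotator's judgment for a (question, Wikipedia page) pair."""
        long_answer: Optional[Tuple[int, int]]   # token span of the long answer, or None (null)
        short_answers: List[Tuple[int, int]]     # zero or more short-answer spans


    @dataclass
    class NQExample:
        """Illustrative container: single annotation for training;
        5-way annotations for development and test examples."""
        question: str
        page_title: str
        annotations: List[Annotation]


    def has_long_answer(example: NQExample, threshold: int = 2) -> bool:
        """Treat a 5-way annotated example as answerable when at least
        `threshold` annotators marked a non-null long answer.

        Assumption: the 2-of-5 default mirrors the style of aggregation
        used for the paper's robust metrics; this helper is illustrative,
        not the official scoring code.
        """
        non_null = sum(1 for a in example.annotations if a.long_answer is not None)
        return non_null >= threshold


    # Usage: a dev-style example with 5-way annotations, 3 of them non-null.
    example = NQExample(
        question="who founded the natural questions corpus",  # made-up query
        page_title="Question answering",
        annotations=[
            Annotation(long_answer=(10, 85), short_answers=[(42, 45)]),
            Annotation(long_answer=(10, 85), short_answers=[]),
            Annotation(long_answer=None, short_answers=[]),
            Annotation(long_answer=(12, 85), short_answers=[(42, 45)]),
            Annotation(long_answer=None, short_answers=[]),
        ],
    )
    print(has_long_answer(example))  # True: 3 of 5 annotators gave a long answer

Aggregating over multiple annotators in this way is what makes the metrics robust to the human variability the 25-way annotation analysis documents: a system is not penalized for disagreeing with a single idiosyncratic annotation.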