Maybe not. We identify and analyse errors in the popular Massive Multitask Language Understanding (MMLU) benchmark. Even though MMLU is widely adopted, our analysis demonstrates numerous ground-truth errors that obscure the true capabilities of LLMs. For example, we find that 57% of the analysed questions in the Virology subset contain errors. To address this issue, we introduce a comprehensive framework for identifying dataset errors using a novel error annotation protocol, and we create MMLU-Redux, a subset of 5,700 manually re-annotated questions spanning all 57 MMLU subjects. We estimate that 6.49% of MMLU questions contain errors. Using MMLU-Redux, we demonstrate significant discrepancies with the originally reported model performance metrics. Our results strongly advocate for revising MMLU's error-ridden questions to enhance its future utility and reliability as a benchmark. MMLU-Redux is available at https://huggingface.co/datasets/edinburgh-dawg/mmlu-redux-2.0.
Joshua Ong Jun Leang, Giwon Hong, Alessio Devoto, Alberto Carlo Maria Mancino, Xuanli He, Yu Zhao, Xiaotang Du, Mohammad Reza Ghasemi Madani, Claire Barale, R. McHardy, Joshua Harris, Emile van Krieken, Pasquale Minervini