OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems (2024-02-21)

TL;DR

This work presents OlympiadBench, an Olympiad-level bilingual multimodal scientific benchmark, featuring 8,476 problems from Olympiad-level mathematics and physics competitions, including the Chinese college entrance exam, and implements a comprehensive assessment methodology to accurately evaluate model responses.

Abstract

Recent advancements have seen Large Language Models (LLMs) and Large Multimodal Models (LMMs) surpass general human capabilities in various tasks and approach the proficiency of human experts across multiple domains. As traditional benchmarks become less challenging for these models, new rigorous challenges are essential to gauge their advanced abilities. In this work, we present OlympiadBench, an Olympiad-level bilingual multimodal scientific benchmark, featuring 8,476 problems from Olympiad-level mathematics and physics competitions, including the Chinese college entrance exam. Each problem is annotated at expert level with step-by-step reasoning. To evaluate top-tier models on OlympiadBench, we implement a comprehensive assessment methodology that accurately judges model responses. Notably, the best-performing model, GPT-4V, attains an average score of 17.97% on OlympiadBench, with a mere 10.74% in physics, highlighting the benchmark's rigor and the intricacy of physical reasoning. Our analysis of GPT-4V points out prevalent issues with hallucinations, knowledge omissions, and logical fallacies. We hope that our challenging benchmark can serve as a valuable resource for future AGI research. The data and evaluation code are available at https://github.com/OpenBMB/OlympiadBench.

Error analysis

Incorrect Judging: Calls for future work on automatically deciding the required precision of an answer, or on automatically judging the equivalence of expressions such as $a\sqrt{b}$ and $\sqrt{a^{2}b}$ with $a \ge 0$.
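The two judging questions raised here (how much precision to require, and when two differently written expressions count as the same answer) can be illustrated with a small numerical sketch. This is not the benchmark's evaluation code; the function names, the sampling ranges, and the tolerances below are all assumptions chosen for illustration. It checks equivalence by comparing both expressions on random inputs, which shows why the condition a ≥ 0 matters.

```python
import math
import random

def numerically_equivalent(f, g, sample, trials=200, rel_tol=1e-9, seed=0):
    """Return True if f and g agree within tolerance on `trials` random inputs.

    This is a heuristic check, not a proof of symbolic equivalence.
    """
    rng = random.Random(seed)
    for _ in range(trials):
        args = sample(rng)
        if not math.isclose(f(*args), g(*args), rel_tol=rel_tol, abs_tol=1e-12):
            return False
    return True

def lhs(a, b):
    # a * sqrt(b)
    return a * math.sqrt(b)

def rhs(a, b):
    # sqrt(a^2 * b)
    return math.sqrt(a * a * b)

def nonneg(rng):
    # Sample with a >= 0 and b >= 0 (the condition stated in the text).
    return (rng.uniform(0.0, 10.0), rng.uniform(0.0, 10.0))

def any_sign(rng):
    # Sample with a allowed to be negative; the two expressions then differ.
    return (rng.uniform(-10.0, 10.0), rng.uniform(0.0, 10.0))

def answer_matches(predicted, reference, rel_tol=1e-2):
    """Accept a numeric answer within a relative tolerance (assumed 1% here)."""
    return math.isclose(predicted, reference, rel_tol=rel_tol)

if __name__ == "__main__":
    print(numerically_equivalent(lhs, rhs, nonneg))
    print(numerically_equivalent(lhs, rhs, any_sign))
    print(answer_matches(3.14, math.pi))
```

The fixed `rel_tol` in `answer_matches` is exactly the kind of hand-set precision the note describes as an open problem: a judge would ideally infer the tolerance from the problem statement rather than hard-code it.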

Models may give correct answers through a flawed process. This is mainly observed for problems with a simple answer, such as a variable taking the value 0.

The model succeeds in giving a correct overall idea but fails in the calculation (such as solving quadratic equations with extra negative signs), which leads to a wrong answer.

Value Calculation Error: GPT-4V sometimes makes simple calculation mistakes, such as outputting b 2 + 7 = b +72.

Inappropriate Response: Some problems trigger responses flagged as inappropriate, which the API refuses to return.

Logical Reasoning / Induction Error / Conceptual Confusion: GPT-4V sometimes makes faulty reasoning or induction steps, and sometimes confuses concepts.

Introducing unnecessary variables or concepts

Expression Calculation Error: Similar to a value calculation error, but it occurs when transforming one expression into another.

Even when a simple solution exists, GPT-4V may choose a more complex method to solve the problem.

Unfinished Answering: GPT-4V sometimes claims that the question's settings conflict, or its output degenerates after some tokens.

Insufficient Classification Discussions
