1
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
2
CMMU: A Benchmark for Chinese Multi-modal Multi-type Question Understanding and Reasoning
3
Solving olympiad geometry without human demonstrations
4
MMMU: A Massive Multi-Discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
5
AI for Mathematics: A Cognitive Science Perspective
6
MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts
7
SciEval: A Multi-Level Large Language Model Evaluation Benchmark for Scientific Research
8
Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification
9
CMATH: Can Your Language Model Pass Chinese Elementary School Math Test?
10
Evaluating Language Models for Mathematics through Interactions
11
TheoremQA: A Theorem-driven Question Answering dataset
12
C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models
13
Sparks of Artificial General Intelligence: Early experiments with GPT-4
14
A Brief Report on LawGPT 1.0: A Virtual Legal Assistant Based on GPT-3
15
Mathematical Capabilities of ChatGPT
16
Transfer Knowledge from Natural Language to Electrocardiography: Can We Detect Cardiovascular Disease Through Language Models?
17
A Survey of Deep Learning for Mathematical Reasoning
18
Large Language Models Meet NL2Code: A Survey
19
UniGeo: Unifying Geometry Logical Reasoning via Reformulating Mathematical Expression
20
Draft, Sketch, and Prove: Guiding Formal Theorem Provers with Informal Proofs
21
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering
22
Solving Quantitative Reasoning Problems with Language Models
23
Training Verifiers to Solve Math Word Problems
24
MiniF2F: a cross-system benchmark for formal Olympiad-level mathematics
25
GeoQA: A Geometric Question Answering Benchmark Towards Multimodal Numerical Reasoning
26
Inter-GPS: Interpretable Geometry Problem Solving with Formal Language and Symbolic Reasoning
27
NaturalProofs: Mathematical Theorem Proving in Natural Language
28
Measuring Mathematical Problem Solving With the MATH Dataset
29
Measuring Massive Multitask Language Understanding
30
Scaling Laws for Neural Language Models
31
MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms
32
Deep Neural Solver for Math Word Problems
33
Crowdsourcing Multiple Choice Science Questions
34
Program Induction by Rationale Generation: Learning to Solve and Explain Algebraic Word Problems
35
MAWPS: A Math Word Problem Repository
36
Natural Language Input for a Computer Problem Solving System
37
Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities
38
Nous-Hermes-2-Yi-34B model card
39
An Augmented Benchmark Dataset for Geometric Question Answering through Dual Parallel Text Encoding
41
of the Association for Computational Linguistics
42
OpenAI. 2023b. GPT-4V(ision) system card
43
Incorrect Judging: Calls for future work on automatically deciding the required precision of the answer, or on automatically judging equivalent expressions such as $a\sqrt{b}$ and $\sqrt{a^2 b}$ with $a \geq 0$
44
Models may give correct answers through a flawed process. This is mainly observed for problems with a simple answer, such as a variable taking the value 0
45
Mathematics Competitions
46
2023. Visual instruction tuning
48
A Dataset Details A.1 Data Sources
49
success in giving the correct overall idea, but failure in calculation (such as solving quadratic equations with extra negative signs), which leads to a wrong answer
50
Value Calculation Error: GPT-4V sometimes makes simple calculation mistakes, such as outputting b 2 + 7 = b +72 ; these mistakes appear
51
Inappropriate Response: Some problems trigger inappropriate responses, which the API refuses to return
52
that are crucial for the advancement models
53
Logical Reasoning / Induction Error / Conceptual Confusion: GPT-4V sometimes makes errors in reasoning or induction, and sometimes confuses concepts
54
Introducing Unnecessary Variables or Concepts
56
Expression Calculation Error: Similar to a value calculation error, but occurs when transforming one expression into another
58
Even when a simple solution exists, GPT-4V may choose a more complex method to solve the problem
60
Unfinished Answering: Sometimes GPT-4V claims that the conditions of the question conflict, or its output degenerates after some tokens
61
Insufficient Classification Discussions
62
2022. Emergent abilities of large language models
64
2023. Have LLMs advanced enough? A challenging problem solving benchmark for large language models