This work introduces MLE-bench, a benchmark for measuring how well AI agents perform at machine learning engineering, and investigates various forms of resource scaling for AI agents and the impact of contamination from pre-training.
We introduce MLE-bench, a benchmark for measuring how well AI agents perform at machine learning engineering. To this end, we curate 75 ML engineering-related competitions from Kaggle, creating a diverse set of challenging tasks that test real-world ML engineering skills such as training models, preparing datasets, and running experiments. We establish human baselines for each competition using Kaggle's publicly available leaderboards. We use open-source agent scaffolds to evaluate several frontier language models on our benchmark, finding that the best-performing setup--OpenAI's o1-preview with AIDE scaffolding--achieves at least the level of a Kaggle bronze medal in 16.9% of competitions. In addition to our main results, we investigate various forms of resource scaling for AI agents and the impact of contamination from pre-training. We open-source our benchmark code (github.com/openai/mle-bench/) to facilitate future research in understanding the ML engineering capabilities of AI agents.
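As an illustration of the grading described above, the sketch below shows how an agent's submission score might be compared against a competition's leaderboard medal thresholds to decide whether it counts toward the "at least bronze" rate. This is a minimal, hypothetical sketch: the class and function names are illustrative, not the actual mle-bench API, and it assumes a higher-is-better metric (some Kaggle competitions use lower-is-better metrics, which would flip the comparisons).

```python
# Hypothetical sketch (not the actual mle-bench API): map an agent's submission
# score to a Kaggle medal tier using per-competition leaderboard thresholds.
from dataclasses import dataclass
from typing import Optional


@dataclass
class LeaderboardThresholds:
    """Minimum scores required for each medal tier (assumes higher is better)."""
    gold: float
    silver: float
    bronze: float


def medal_for_score(score: float, thresholds: LeaderboardThresholds) -> Optional[str]:
    """Return the medal tier the score would earn, or None if below bronze."""
    if score >= thresholds.gold:
        return "gold"
    if score >= thresholds.silver:
        return "silver"
    if score >= thresholds.bronze:
        return "bronze"
    return None


# Example: a score of 0.91 on a competition whose bronze cutoff is 0.90 earns
# "bronze" and would count toward the "at least bronze" rate reported above.
print(medal_for_score(0.91, LeaderboardThresholds(gold=0.95, silver=0.93, bronze=0.90)))
```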
James Aung
Dane Sherburn
Evan Mays
Giulio Starace
Kevin Liu
Leon Maksin
Tejal Patwardhan
Lilian Weng
Aleksander Mądry