OpenAI has unveiled MLE-bench, a new benchmark designed to measure how well AI agents perform at machine learning engineering tasks, in a blog post and an accompanying paper.
MLE-bench curates "75 Machine Learning engineering-related competitions from Kaggle, creating a diverse set of challenging tasks that test real-world ML engineering skills such as training models, preparing datasets, and running experiments." The benchmark establishes human baselines for each competition using Kaggle's publicly available leaderboards.
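Because the human baselines come from public leaderboards, one rough way to picture the grading step is as a percentile check against the human entries. The sketch below is purely illustrative and is not taken from the MLE-bench codebase; the function name, the simplified "bronze = top 40% of entries" cutoff, and the toy scores are all assumptions.

```python
# Illustrative sketch only: not the MLE-bench implementation.
# Assumes a simplified rule where bronze means finishing in the top 40%
# of leaderboard entries, and that a higher score is better.

def meets_bronze(agent_score: float, leaderboard_scores: list[float]) -> bool:
    """Return True if the agent's score would place in the top 40% of entries."""
    ranked = sorted(leaderboard_scores, reverse=True)      # best score first
    position = sum(s > agent_score for s in ranked) + 1    # 1-indexed rank
    bronze_cutoff = max(1, int(0.4 * len(ranked)))         # top-40% rank cutoff
    return position <= bronze_cutoff

# Toy example: an agent scoring 0.93 against a small public leaderboard.
print(meets_bronze(0.93, [0.95, 0.91, 0.88, 0.85, 0.80, 0.72]))  # True (rank 2 of 6)
```

In the benchmark itself, medal thresholds vary with competition size and metric direction, so this percentile check should be read only as an intuition for how leaderboard-based grading can work.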
To evaluate AI performance, the researchers used open-source agent scaffolds to test several frontier language models on the benchmark. The best-performing setup — OpenAI's o1-preview with AIDE scaffolding — achieved at least the level of a Kaggle bronze medal in 16.9% of competitions.
In addition to the main results, the study investigates various forms of resource scaling for AI agents and the impact of contamination from pre-training. These analyses provide insight into the factors affecting AI performance on machine learning engineering tasks.
To facilitate future research in understanding the ML engineering capabilities of AI agents, OpenAI has open-sourced the benchmark code.