OpenAI introduced MLE-bench, a new benchmark that uses Kaggle competitions to evaluate how well AI agents perform on real-world machine learning engineering tasks.
Kaggle competitions are online challenges where data scientists compete for prizes and recognition by solving complex problems with machine learning. In the study, the AI agents often succeeded in applying standard techniques but struggled with tasks requiring adaptability or creative problem-solving.
https://arxiv.org/pdf/2410.07095.pdf?_bhlid=600138081016fd6d499a4fb5f233931a7ead258e
The best-performing setup, OpenAI’s o1-preview model with AIDE scaffolding, achieved at least a bronze medal in 16.9% of competitions.