openai/mle-bench — FindAgent


MLE-bench is a benchmark for measuring how well AI agents perform at machine learning engineering

Stars: 1.6k · Forks: 249 · Watchers: 25 · Issues: 9

Language: Python · License: NOASSERTION · Created: 2024/10/8 · Updated: today


README


MLE-bench

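This is the code repository for the paper "MLE-Bench: Evaluating Machine Learning Agents on Machine Learning Engineering". We release the code used to construct the dataset, the evaluation logic, and the agents we evaluated on this benchmark.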

Leaderboard

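Update (2026-04-24): To keep submissions fair and comparable, we are working on an improved process and are not accepting new leaderboard submissions in the meantime. We will share updates on this process in the future.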

Agent	LLM used	Low (Lite) (%)	Medium (%)	High (%)	All (%)	Runtime (hours)	Date	Source code open	Grading report
Famou-Agent 2.0	Gemini-3-Pro-Preview	80.3 ± 1.52	64.04 ± 2.32	42.22 ± 2.22	64.44 ± 1.18	24	2026-02-23	X	✓
AIBuildAI	Claude-Opus-4.6	77.27 ± 0.00	61.40 ± 0.88	46.67 ± 0.00	63.11 ± 0.44	24	2026-03-06	X	✓
CAIR MARS+	Gemini-3-Pro-Preview	78.79 ± 1.52	60.53 ± 1.52	44.44 ± 2.22	62.67 ± 0.77	24	2026-02-17	X	✓
MLEvolve	Gemini-3-Pro-Preview	80.30 ± 1.52	57.89 ± 1.52	42.22 ± 2.22	61.33 ± 1.33	12	2026-02-14	✓	✓
PiEvolve(Fractal AI Research)	Gemini-3-Pro-Preview[^4]	80.30 ± 1.52[^3]	58.77 ± 0.88[^3]	40.0 ± 0.00[^3]	61.33 ± 0.77[^3]	24	2026-01-05	X	✓
Famou-Agent 2.0	Gemini-2.5-Pro	75.76 ± 1.52	57.89 ± 1.52	40.00 ± 0.00	59.56 ± 0.89	24	2025-12-27	X	✓
ML-Master 2.0	Deepseek-V3.2-Speciale	75.76 ± 1.51	50.88 ± 3.51	42.22 ± 2.22	56.44 ± 2.47	24	2025-12-16	X	✓
CAIR MARS	Gemini-3-Pro-Preview	74.24 ± 1.52	52.63 ± 3.04	37.78 ± 2.22	56.0 ± 1.54	24	2026-01-25	X	✓
PiEvolve(Fractal AI Research)	Gemini-3-Pro-Preview[^4]	74.24 ± 3.03[^3]	45.61 ± 0.88[^3]	35.55 ± 2.22[^3]	52.0 ± 0.77[^3]	12	2026-01-05	X	✓
Leeroo	Gemini-3-Pro-Preview[^4]	68.18 ± 2.62[^3]	44.74 ± 1.52[^3]	40.00 ± 0.00[^3]	50.67 ± 1.33[^3]	24	2025-12-07	✓	✓
Thesis	gpt-5-codex	65.15 ± 1.52	45.61 ± 7.18	31.11 ± 2.22	48.44 ± 3.64	24	2025-11-10	X	✓
CAIR MLE-STAR-Pro-1.5	Gemini-2.5-Pro	68.18 ± 2.62	34.21 ± 1.52	33.33 ± 0.00	44.00 ± 1.33	24	2025-11-25	X	✓
Famou-Agent	Gemini-2.5-Pro	62.12 ± 1.52	36.84 ± 1.52	33.33 ± 0.00	43.56 ± 0.89	24	2025-10-10	X	✓
Operand ensemble	gpt-5 (low verbosity/effort)[^2]	63.64 ± 0.00	33.33 ± 0.88[^3]	20.00 ± 0.00[^3]	39.56 ± 0.44[^3]	24	2025-10-06	X	✓
CAIR MLE-STAR-Pro-1.0	Gemini-2.5-Pro	66.67 ± 1.52	25.44 ± 0.88	31.11 ± 2.22	38.67 ± 0.77	12	2025-11-03	X	✓
InternAgent	deepseek-r1	62.12 ± 3.03	26.32 ± 2.63	24.44 ± 2.22	36.44 ± 1.18	12	2025-09-12	X	✓
[R&D-Agent](https://github.com/microsoft/