[MRG] 🧪 Add MLE-Bench connection setup and usage by YuanmingLeee · Pull Request #303 · MLSysOps/MLE-agent

YuanmingLeee · 2025-07-10T17:56:04Z

This pull request introduces an experimental CLI and supporting infrastructure that integrate the MLE-Bench benchmarking tool into the project. It updates the documentation, adds a CLI and installation scripts, and changes the configuration to support the integration.

Closes #301

Below are the most important changes grouped by theme:

New CLI and Integration with MLE-Bench

Added exp/cli.py: A simple experimental CLI script to bridge MLE-agent and MLE-Bench, allowing users to install, prepare datasets, run benchmarks, and grade submissions. It proxies commands to the external mlebench tool.
Added exp/init.py: A script that clones the MLE-Bench repository, pulls its Git LFS files, and copies them into the installed mlebench package's folder.
Added exp/mlebench_api.py: A rewritten API wrapper over mle-bench functions for downloading datasets and grading submissions.
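As a rough illustration of the proxying described above, the sketch below forwards a subcommand and its arguments to the external mlebench executable. It uses only the standard library (argparse, subprocess), which is not necessarily what exp/cli.py uses. The names build_proxy_command and main, and the exact set of forwarded subcommands, are assumptions made for this example.

```python
import argparse
import shutil
import subprocess
import sys

# Subcommands forwarded unchanged to the external `mlebench` tool.
# Assumed set, taken from the commands shown in this PR description.
PROXIED = ("prepare", "grade", "grade-sample")


def build_proxy_command(subcommand, extra_args):
    """Return the argv list that would be handed to `mlebench`."""
    if subcommand not in PROXIED:
        raise ValueError(f"unsupported subcommand: {subcommand!r}")
    return ["mlebench", subcommand, *extra_args]


def main(argv=None):
    """Hypothetical entry point: parse, then delegate to `mlebench`."""
    parser = argparse.ArgumentParser(prog="mle-exp")
    parser.add_argument("subcommand", choices=PROXIED)
    # Everything after the subcommand is passed through untouched.
    parser.add_argument("extra_args", nargs=argparse.REMAINDER)
    args = parser.parse_args(argv)

    if shutil.which("mlebench") is None:
        print("mlebench not found; install the bench extra first", file=sys.stderr)
        return 1
    return subprocess.call(build_proxy_command(args.subcommand, args.extra_args))


# Show the command that would be run, without executing it.
print(build_proxy_command("prepare", ["--lite"]))
```

Keeping the command construction in its own function makes the forwarding logic checkable without mlebench installed.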

Documentation Updates

Updated exp/README.md: Added detailed instructions for installing and using MLE-Bench, including commands for preparing datasets and grading submissions.

Dependency and Configuration Changes

Updated pyproject.toml:
- Added pip and setuptools as dependencies to ensure compatibility.
- Introduced an optional dependency group bench for installing mlebench directly from its GitHub repository.
- Added a new [tool.uv] section to override dependencies, specifically skipping tensorflow-io-gcs-filesystem on Windows.

What has been done to verify that this works as intended?

The functions below were tested on Windows 11 (CPython 3.11) and Ubuntu 22.04 under WSL (CPython 3.11).

Install and init

GIT_LFS_SKIP_SMUDGE=1 pip install -e ".[bench]"

mle-exp init

Expected output:

[2025-07-13 14:22:09,325] [exp.init:103] - INFO - Upgrading existing experiments/ directory …
[2025-07-13 14:22:09,325] [exp.init:116] - INFO - Removing old experiments/ folder …
[2025-07-13 14:22:09,327] [exp.init:123] - INFO - Cloning https://github.com/openai/mle-bench.git → C:\Users\xxx\AppData\Local\Temp\tmpbmnd8ovn\mle-bench
[2025-07-13 14:22:16,016] [exp.init:140] - INFO - Pulling Git LFS objects for experiments/, mlebench/competitions/ …
[2025-07-13 14:22:16,656] [exp.init:158] - INFO - Copying C:\Users\li_yu\AppData\Local\Temp\tmpbmnd8ovn\mle-bench\experiments → C:\Users\xxx\MLE-agent\.venv\Lib\site-packages\experiments
[2025-07-13 14:22:16,759] [exp.init:160] - INFO - Copying C:\Users\xxx\AppData\Local\Temp\tmpbmnd8ovn\mle-bench\mlebench\competitions → C:\Users\xxx\MLE-agent\.venv\Lib\site-packages\mlebench\competitions
[2025-07-13 14:22:21,718] [exp.init:166] - INFO - Done! Sample file copied: experiments\competition_categories.csv
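The log lines above suggest three steps: clone the repository into a temporary directory, pull the Git LFS objects for two directories, and copy those directories into site-packages after removing any old copy. A minimal sketch of that flow follows. The helper names plan_git_commands and copy_tree are hypothetical, and the exact git arguments used by exp/init.py are not shown in the PR description, so the ones here are assumptions.

```python
import shutil
import tempfile
from pathlib import Path

REPO_URL = "https://github.com/openai/mle-bench.git"
LFS_DIRS = ("experiments/", "mlebench/competitions/")


def plan_git_commands(clone_dir):
    """Return the git invocations as argv lists (nothing is executed here)."""
    return [
        ["git", "clone", REPO_URL, str(clone_dir)],
        # Fetch LFS objects only for the directories that are copied later.
        ["git", "-C", str(clone_dir), "lfs", "pull", "--include=" + ",".join(LFS_DIRS)],
    ]


def copy_tree(src, dst):
    """Replace dst with a fresh copy of src (remove the old folder, then copy)."""
    src, dst = Path(src), Path(dst)
    if dst.exists():
        shutil.rmtree(dst)
    shutil.copytree(src, dst)
    return dst


# Print the planned commands for a placeholder clone directory.
for cmd in plan_git_commands(Path("mle-bench")):
    print(" ".join(cmd))
```

Installing with GIT_LFS_SKIP_SMUDGE=1 and pulling LFS objects selectively afterwards avoids downloading every large file in the repository.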

Prepare datasets

mle-exp prepare --lite

Grading one submission

# mle-exp grade-sample <PATH_TO_SUBMISSION> <competition-id>
mle-exp grade-sample submission.csv tabular-playground-series-may-2022

Expected output:

Competition report:
 {
    "competition_id": "tabular-playground-series-may-2022",
    "score": 0.93787,
    "gold_threshold": 0.99823,
    "silver_threshold": 0.99822,
    "bronze_threshold": 0.99818,
    "median_threshold": 0.972675,
    "any_medal": false,
    "gold_medal": false,
    "silver_medal": false,
    "bronze_medal": false,
    "above_median": false,
    "submission_exists": true,
    "valid_submission": true,
    "is_lower_better": false,
    "created_at": "2025-07-13T13:54:09.706242",
    "submission_path": "submission.csv"
}
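To make the boolean fields in the report easier to read, here is a small sketch of how medal flags can be derived from a score and the four thresholds. It is not mle-bench's implementation. The Thresholds class and medal_flags function are hypothetical, and treating a score exactly equal to a threshold as meeting it is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    gold: float
    silver: float
    bronze: float
    median: float


def medal_flags(score, t, is_lower_better=False):
    """Derive medal booleans; only the highest medal reached is set."""

    def meets(threshold):
        # Assumption: a tie with the threshold counts as meeting it.
        return score <= threshold if is_lower_better else score >= threshold

    gold = meets(t.gold)
    silver = meets(t.silver) and not gold
    bronze = meets(t.bronze) and not (gold or silver)
    return {
        "gold_medal": gold,
        "silver_medal": silver,
        "bronze_medal": bronze,
        "any_medal": gold or silver or bronze,
        "above_median": meets(t.median),
    }


# Values taken from the competition report above.
thresholds = Thresholds(gold=0.99823, silver=0.99822, bronze=0.99818, median=0.972675)
print(medal_flags(0.93787, thresholds))
```

With the sample score of 0.93787, which is below the median threshold, every flag comes out false, matching the report.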

Grading multiple submissions

mle-exp grade --submission test_multi_submit.jsonl --output-dir .

where the multi-submission JSONL file (test_multi_submit.jsonl) contains:

{"competition_id": "tabular-playground-series-may-2022", "submission_path": "C:\\Users\\li_yu\\PycharmProjects\\mle-bench-solver\\submission.csv"}

Expected output:

Grading submissions: 100%|█████████| 1/1 [00:01<00:00,  1.06s/submission]
[2025-07-13 14:03:07,138] [grade.py:40] {
    "total_runs": 1,
    "total_runs_with_submissions": 1,
    "total_valid_submissions": 1,
    "total_medals": 0,
    "total_gold_medals": 0,
    "total_silver_medals": 0,
    "total_bronze_medals": 0,
    "total_above_median": 0
}

The report is also saved to a file named <timestamp>_grading_report.json in the output directory.
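The summary counters can be read as simple tallies over the per-competition reports. The sketch below shows that reading. The summarize function is hypothetical, and the mapping from report fields to counters is inferred from the field names rather than taken from mle-bench's code.

```python
def summarize(reports):
    """Tally per-competition reports into summary counters (inferred mapping)."""

    def count(key):
        return sum(1 for report in reports if report.get(key))

    return {
        "total_runs": len(reports),
        "total_runs_with_submissions": count("submission_exists"),
        "total_valid_submissions": count("valid_submission"),
        "total_medals": count("any_medal"),
        "total_gold_medals": count("gold_medal"),
        "total_silver_medals": count("silver_medal"),
        "total_bronze_medals": count("bronze_medal"),
        "total_above_median": count("above_median"),
    }


# A subset of the fields from the single-submission report shown earlier.
example_report = {
    "competition_id": "tabular-playground-series-may-2022",
    "any_medal": False,
    "gold_medal": False,
    "silver_medal": False,
    "bronze_medal": False,
    "above_median": False,
    "submission_exists": True,
    "valid_submission": True,
}
print(summarize([example_report]))
```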

Why is this the best possible solution? Were any other approaches considered?

N.A.

How does this change affect users? Describe intentional changes to behavior and behavior that could have accidentally been affected by code changes. In other words, what are the regression risks?

Users are not affected unless they install the optional bench extra.

Do we need any specific form for testing your changes? If so, please attach one.

See above

Does this change require updates to documentation? If so, please file an issue here and include the link below.

Before submitting this PR, please make sure you have:

- confirmed all checks still pass OR confirmed the CI build passes.
- verified that any code or assets from external sources are properly credited in comments and/or in the credit file.

YuanmingLeee · 2025-07-10T17:56:35Z

@syangx38

Introduce documentation for MLE-Bench, including installation steps, dataset preparation commands, and grading examples. This enhances usability by providing clear guidance for users setting up and interacting with the benchmark.

…ands

- Update CLI commands to use `mle-exp` for consistency with package naming.
- Improve logging by ensuring no duplicate handlers and standardizing formatting.
- Refactor Git LFS initialization to validate files and handle multiple directories that were previously missed.
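The duplicate-handler fix mentioned in this commit message can be sketched as follows. This is an illustration, not the code from the PR: get_logger is a hypothetical helper, and the format string is reconstructed from the log lines in the PR description.

```python
import logging

# Reconstructed from log lines such as:
# [2025-07-13 14:22:09,325] [exp.init:103] - INFO - ...
LOG_FORMAT = "[%(asctime)s] [%(name)s:%(lineno)d] - %(levelname)s - %(message)s"


def get_logger(name):
    """Return a logger that has exactly one stream handler attached."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # attach only on the first call
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter(LOG_FORMAT))
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    # Do not also emit through the root logger, which would print twice.
    logger.propagate = False
    return logger


log = get_logger("exp.init")
log.info("Cloning repository")
```

Calling get_logger repeatedly with the same name returns the same logger without stacking handlers, so each message is printed once.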

…ities
- Fix CLI and API utilities bug: type Path not working for click.Path.
- Improve error handling by adding detailed traceback info and JSONL validation.
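As an illustration of what JSONL validation for the multi-submission file might involve, the sketch below checks each line for valid JSON and for the two keys used in the example file from the PR description. The function name parse_submissions_jsonl and the error messages are hypothetical, and the Windows path is a placeholder.

```python
import json

# Keys used by the example multi-submission file in the PR description.
REQUIRED_KEYS = ("competition_id", "submission_path")


def parse_submissions_jsonl(text):
    """Parse JSONL text, raising ValueError with the offending line number."""
    entries = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            entry = json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {lineno}: invalid JSON ({exc.msg})") from exc
        if not isinstance(entry, dict):
            raise ValueError(f"line {lineno}: expected a JSON object")
        missing = [key for key in REQUIRED_KEYS if key not in entry]
        if missing:
            raise ValueError(f"line {lineno}: missing keys {missing}")
        entries.append(entry)
    return entries


# json.dumps doubles the backslashes of a Windows path automatically.
line = json.dumps({
    "competition_id": "tabular-playground-series-may-2022",
    "submission_path": r"C:\Users\xxx\submission.csv",  # placeholder path
})
print(line)
print(parse_submissions_jsonl(line))
```

Generating the file with json.dumps rather than by hand avoids the most common mistake, a single backslash in a Windows path, which is not valid inside a JSON string.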

HuaizhengZhang

LGTM

YuanmingLeee assigned HuaizhengZhang Jul 10, 2025

dosubot Bot added the size:L (This PR changes 100-499 lines, ignoring generated files.) label Jul 10, 2025

dosubot Bot added the documentation (Improvements or additions to documentation) and enhancement (New feature or request) labels Jul 10, 2025

YuanmingLeee added 4 commits July 12, 2025 18:57

🏗️ [build] Implement MLE-Exp CLI with init, prepare, and grading comm…

03554f3

🔨 [cli] Fix click type bug and enhance error handling in grading util…

e3e8f29

YuanmingLeee force-pushed the exp-mlebench branch from 3ee112e to e3e8f29 Compare July 13, 2025 06:11

dosubot Bot added the size:XL (This PR changes 500-999 lines, ignoring generated files.) label and removed the size:L (This PR changes 100-499 lines, ignoring generated files.) label Jul 13, 2025

YuanmingLeee changed the title ~~[WIP] 🧪 Add MLE-Bench connection setup and usage~~ [MRG] 🧪 Add MLE-Bench connection setup and usage Jul 13, 2025

Merge branch 'main' into exp-mlebench

c2c2fef

HuaizhengZhang reviewed Jul 13, 2025

Comment thread exp/README.md Outdated

📝 Update exp README

037c2ca

HuaizhengZhang approved these changes Jul 14, 2025

HuaizhengZhang merged commit 52452f1 into main Jul 14, 2025
3 checks passed

YuanmingLeee deleted the exp-mlebench branch July 14, 2025 06:07

HuaizhengZhang merged 6 commits into main from exp-mlebench
