
[MRG] 🧪 Add MLE-Bench connection setup and usage #303

Merged: HuaizhengZhang merged 6 commits into main from exp-mlebench on Jul 14, 2025
Conversation

@YuanmingLeee
Contributor

@YuanmingLeee YuanmingLeee commented Jul 10, 2025

This pull request introduces a new experimental CLI and related infrastructure to integrate the MLE-Bench benchmarking tool into the project. It includes documentation updates, new CLI and installation scripts, and configuration changes to support the integration.

Closes #301

Below are the most important changes grouped by theme:

New CLI and Integration with MLE-Bench

  • Added exp/cli.py: A simple experimental CLI script to bridge MLE-agent and MLE-Bench, allowing users to install, prepare datasets, run benchmarks, and grade submissions. It proxies commands to the external mlebench tool.
  • Added exp/init.py: A script that clones the MLE-Bench repository, pulls its Git LFS files, and copies them into the installed mlebench package's folder.
  • Added exp/mlebench_api.py: A wrapper API over mle-bench functions for downloading datasets and grading submissions.
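
The proxying idea described for exp/cli.py can be sketched as follows. This is a minimal illustration, not the code from this PR: it uses only the standard library, the function names are hypothetical, and it assumes the external tool is invoked as `mlebench` with subcommands that mirror the `mle-exp` commands shown later in this description.

```python
import subprocess
from typing import List, Sequence

# Assumption: the external executable is called "mlebench" and accepts the
# same subcommand names that mle-exp exposes in this PR description.
PROXIED_SUBCOMMANDS = ("prepare", "grade", "grade-sample")


def build_proxy_command(subcommand: str, extra_args: Sequence[str]) -> List[str]:
    """Return the argv list that forwards a subcommand to the external tool."""
    if subcommand not in PROXIED_SUBCOMMANDS:
        raise ValueError(f"unsupported subcommand: {subcommand}")
    return ["mlebench", subcommand, *extra_args]


def run_proxy(argv: Sequence[str]) -> int:
    """Forward argv (e.g. ["prepare", "--lite"]) and return the exit code."""
    if not argv:
        raise ValueError("expected a subcommand")
    command = build_proxy_command(argv[0], list(argv[1:]))
    # Arguments are passed as a list, so no shell quoting is involved.
    return subprocess.run(command).returncode
```

Building the argument list separately from running it keeps the forwarding logic easy to check without the external tool installed.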

Documentation Updates

  • Updated exp/README.md: Added detailed instructions for installing and using MLE-Bench, including commands for preparing datasets and grading submissions.

Dependency and Configuration Changes

  • Updated pyproject.toml:
    • Added pip and setuptools as dependencies to ensure compatibility.
    • Introduced an optional dependency group bench for installing mlebench directly from its GitHub repository.
    • Added a new [tool.uv] section to override dependencies, specifically skipping tensorflow-io-gcs-filesystem on Windows.
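
Because mlebench is installed only through the optional bench group, code that depends on it may want to check for the package before use. The sketch below shows one way to do that with the standard library. The helper names are hypothetical and are not taken from this PR.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found in this environment."""
    return importlib.util.find_spec(module_name) is not None


def require_bench() -> None:
    """Raise a descriptive error when the optional benchmark package is missing."""
    if not is_installed("mlebench"):
        raise RuntimeError(
            "mlebench is not installed; install the optional extra with "
            "`pip install -e .[bench]`"
        )
```

A guard like this keeps the base installation usable for people who never install the extra.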

What has been done to verify that this works as intended?

The functions below were tested on Windows 11 (CPython 3.11) and Ubuntu 22.04 under WSL (CPython 3.11).

Install and init

GIT_LFS_SKIP_SMUDGE=1 pip install -e .[bench]

mle-exp init

Expected output:

[2025-07-13 14:22:09,325] [exp.init:103] - INFO - Upgrading existing experiments/ directory …
[2025-07-13 14:22:09,325] [exp.init:116] - INFO - Removing old experiments/ folder …
[2025-07-13 14:22:09,327] [exp.init:123] - INFO - Cloning https://github.com/openai/mle-bench.git → C:\Users\xxx\AppData\Local\Temp\tmpbmnd8ovn\mle-bench
[2025-07-13 14:22:16,016] [exp.init:140] - INFO - Pulling Git LFS objects for experiments/, mlebench/competitions/ …
[2025-07-13 14:22:16,656] [exp.init:158] - INFO - Copying C:\Users\li_yu\AppData\Local\Temp\tmpbmnd8ovn\mle-bench\experiments → C:\Users\xxx\MLE-agent\.venv\Lib\site-packages\experiments
[2025-07-13 14:22:16,759] [exp.init:160] - INFO - Copying C:\Users\xxx\AppData\Local\Temp\tmpbmnd8ovn\mle-bench\mlebench\competitions → C:\Users\xxx\MLE-agent\.venv\Lib\site-packages\mlebench\competitions
[2025-07-13 14:22:21,718] [exp.init:166] - INFO - Done! Sample file copied: experiments\competition_categories.csv
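
The log above suggests a sequence of removing the old copy and then copying selected directories from a temporary checkout into site-packages. A minimal sketch of that copy step follows. The function name and signature are hypothetical, and the clone and Git LFS pull are deliberately left out.

```python
import shutil
from pathlib import Path
from typing import Iterable


def replace_subdirs(src_root: Path, dest_root: Path, subdirs: Iterable[str]) -> None:
    """Copy each listed subdirectory of src_root into dest_root, replacing old copies."""
    for rel in subdirs:
        src = src_root / rel
        dest = dest_root / rel
        if not src.is_dir():
            raise FileNotFoundError(f"missing directory in checkout: {src}")
        if dest.exists():
            shutil.rmtree(dest)  # drop the stale copy before upgrading
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copytree(src, dest)
```

Under these assumptions, a call such as `replace_subdirs(checkout, site_packages, ["experiments", "mlebench/competitions"])` would mirror the two copy steps seen in the log.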

Prepare datasets

mle-exp prepare --lite

Grading one submission

# mle-exp grade-sample <PATH_TO_SUBMISSION> <competition-id>
mle-exp grade-sample submission.csv tabular-playground-series-may-2022

Expected output:

Competition report:
 {
    "competition_id": "tabular-playground-series-may-2022",
    "score": 0.93787,
    "gold_threshold": 0.99823,
    "silver_threshold": 0.99822,
    "bronze_threshold": 0.99818,
    "median_threshold": 0.972675,
    "any_medal": false,
    "gold_medal": false,
    "silver_medal": false,
    "bronze_medal": false,
    "above_median": false,
    "submission_exists": true,
    "valid_submission": true,
    "is_lower_better": false,
    "created_at": "2025-07-13T13:54:09.706242",
    "submission_path": "submission.csv"
}
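
To read the report above, it helps to see how the medal flags could follow from the score and thresholds. The sketch below is an assumption about the comparison rules, not code taken from mle-bench: a medal is awarded when the score reaches a threshold, only the best medal is set, and the median comparison is strict.

```python
def medal_flags(score: float, gold: float, silver: float, bronze: float,
                median: float, is_lower_better: bool) -> dict:
    """Derive medal booleans from a score and the competition thresholds."""

    def reaches(threshold: float) -> bool:
        # Assumed rule: ties count as reaching the threshold.
        return score <= threshold if is_lower_better else score >= threshold

    gold_medal = reaches(gold)
    silver_medal = not gold_medal and reaches(silver)
    bronze_medal = not (gold_medal or silver_medal) and reaches(bronze)
    above_median = score < median if is_lower_better else score > median
    return {
        "any_medal": gold_medal or silver_medal or bronze_medal,
        "gold_medal": gold_medal,
        "silver_medal": silver_medal,
        "bronze_medal": bronze_medal,
        "above_median": above_median,
    }
```

With the values from the report (score 0.93787, bronze threshold 0.99818, median 0.972675, higher is better), every flag comes out false, which matches the output shown.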

Grading multiple submissions

mle-exp grade --submission test_multi_submit.jsonl --output-dir .

where the multi-submission JSONL file test_multi_submit.jsonl contains:

{"competition_id": "tabular-playground-series-may-2022", "submission_path": "C:\\Users\\li_yu\\PycharmProjects\\mle-bench-solver\\submission.csv"}
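
A file like this can be checked before grading. The following is a minimal validation sketch, assuming each non-empty line must be a JSON object with the two keys shown above. The function name is hypothetical, and the JSONL validation added in this PR may differ.

```python
import json
from pathlib import Path
from typing import Dict, List

# The two keys that appear in the example line of this PR description.
REQUIRED_KEYS = ("competition_id", "submission_path")


def load_submissions(jsonl_path: Path) -> List[Dict[str, str]]:
    """Parse a JSONL file and verify that every entry has the required keys."""
    entries = []
    with open(jsonl_path, encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                entry = json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{jsonl_path}:{line_no}: invalid JSON ({exc.msg})") from exc
            if not isinstance(entry, dict):
                raise ValueError(f"{jsonl_path}:{line_no}: expected a JSON object")
            missing = [key for key in REQUIRED_KEYS if key not in entry]
            if missing:
                raise ValueError(f"{jsonl_path}:{line_no}: missing keys {missing}")
            entries.append(entry)
    return entries
```

Reporting the line number in each error makes it easier to locate a malformed entry in a long submissions file.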

Expected output:

Grading submissions: 100%|█████████| 1/1 [00:01<00:00,  1.06s/submission]
[2025-07-13 14:03:07,138] [grade.py:40] {
    "total_runs": 1,
    "total_runs_with_submissions": 1,
    "total_valid_submissions": 1,
    "total_medals": 0,
    "total_gold_medals": 0,
    "total_silver_medals": 0,
    "total_bronze_medals": 0,
    "total_above_median": 0
}

The report is also saved to a file named <timestamp>_grading_report.json.
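
The summary counts can be understood as simple tallies over the per-competition reports. The sketch below assumes each count is the number of reports whose corresponding boolean field is true. The field names come from the outputs shown above, but the aggregation itself is an illustration rather than the mle-bench implementation.

```python
from typing import Dict, List


def summarize(reports: List[dict]) -> Dict[str, int]:
    """Tally per-competition report flags into summary counts."""

    def count(field: str) -> int:
        return sum(1 for report in reports if report.get(field))

    return {
        "total_runs": len(reports),
        "total_runs_with_submissions": count("submission_exists"),
        "total_valid_submissions": count("valid_submission"),
        "total_medals": count("any_medal"),
        "total_gold_medals": count("gold_medal"),
        "total_silver_medals": count("silver_medal"),
        "total_bronze_medals": count("bronze_medal"),
        "total_above_median": count("above_median"),
    }
```

Applied to the single report from the grade-sample example, this produces the same counts as the summary shown above.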

Why is this the best possible solution? Were any other approaches considered?

N.A.

How does this change affect users? Describe intentional changes to behavior and behavior that could have accidentally been affected by code changes. In other words, what are the regression risks?

Users are not affected unless they install the optional bench extra.

Do we need any specific form for testing your changes? If so, please attach one.

See above

Does this change require updates to documentation? If so, please file an issue here and include the link below.

Before submitting this PR, please make sure you have:

  • confirmed that all checks still pass OR confirmed that the CI build passes.
  • verified that any code or assets from external sources are properly credited in comments and/or in
    the credit file.

@dosubot dosubot Bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Jul 10, 2025
@YuanmingLeee
Contributor Author

@syangx38

@dosubot dosubot Bot added documentation Improvements or additions to documentation enhancement New feature or request labels Jul 10, 2025
Introduce documentation for MLE-Bench, including installation steps,
dataset preparation commands, and grading examples.

This enhances usability by providing clear guidance for users setting
up and interacting with the benchmark.
- Update CLI commands to use `mle-exp` for consistency with package naming.
- Improve logging by ensuring no duplicate handlers and standardizing formatting.
- Refactor Git LFS initialization to validate files and handle multiple directories that were previously missed.
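
The logging change mentioned in this commit message (no duplicate handlers, one standard format) could look roughly like the sketch below. The helper name is hypothetical, and the format string is inferred from the log lines in the PR description rather than copied from the code.

```python
import logging

# Inferred from lines such as:
# [2025-07-13 14:22:09,325] [exp.init:103] - INFO - Upgrading existing ...
LOG_FORMAT = "[%(asctime)s] [%(name)s:%(lineno)d] - %(levelname)s - %(message)s"


def get_logger(name: str) -> logging.Logger:
    """Return a logger with exactly one stream handler in the standard format."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # attach a handler only on the first call
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter(LOG_FORMAT))
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid a second copy via the root logger
    return logger
```

Calling the helper repeatedly with the same name returns the same logger and never stacks handlers, so each message is printed once.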
…ities

- Fix CLI and API utilities bug: type Path not working for click.Path.
- Improve error handling by adding detailed traceback info and JSONL validation.
@dosubot dosubot Bot added size:XL This PR changes 500-999 lines, ignoring generated files. and removed size:L This PR changes 100-499 lines, ignoring generated files. labels Jul 13, 2025
@YuanmingLeee YuanmingLeee changed the title [WIP] 🧪 Add MLE-Bench connection setup and usage [MRG] 🧪 Add MLE-Bench connection setup and usage Jul 13, 2025
Comment thread exp/README.md Outdated
Contributor

@HuaizhengZhang HuaizhengZhang left a comment

LGTM

@HuaizhengZhang HuaizhengZhang merged commit 52452f1 into main Jul 14, 2025
3 checks passed
@YuanmingLeee YuanmingLeee deleted the exp-mlebench branch July 14, 2025 06:07
Labels

documentation Improvements or additions to documentation enhancement New feature or request size:XL This PR changes 500-999 lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

[Exp] Add MLE Bench Script

2 participants