An executable benchmark for evaluating AI agents on agentic Modelica workflows.
This repository contains the public protocol, schema, scoring notes, and demo tasks for the Modelica Agent Workflow Benchmark v0.1 Preview.
The benchmark evaluates whether an AI agent can work through a Modelica engineering task end to end: inspect a faulty model, edit the Modelica source, run OpenModelica feedback, and submit a final model that passes executable validation.
The official v0.1 evaluation set contains 132 hidden Modelica workflow tasks. The hidden set is not published to reduce benchmark contamination. Public files in this repository document the task format and provide a small demo split for local smoke testing.
Each task is a JSON file with:
workflow_goal: what the agent should accomplish;model_name: the top-level Modelica model to repair;initial_model: the complete faulty Modelica source;verification: OpenModelica check/simulation settings;acceptance: high-level acceptance rules.
The canonical demo task files live in benchmark/samples/*.json. For easier reading, the same initial models are mirrored as .mo files in benchmark/samples/modelica/.
A valid submission is a complete final Modelica model that passes OpenModelica checkModel and reaches an accepted simulation status under the task settings.
A clean OpenModelica simulation pass is accepted. A non-fatal warning status is also accepted only when OpenModelica produces a non-empty result file and reports that the simulation finished successfully.
Warnings remain FAIL when they correspond to fatal solver errors, missing result files, failed initialization, division by zero, integrator failure, or any simulation output that does not successfully complete.
The public demo split is intentionally small and self-contained. All four public demo tasks are standalone Modelica models with no external library dependency.
The hidden official set is broader. It contains standalone tasks, Modelica Standard Library based tasks, and tasks derived from public Modelica libraries such as AixLib, Buildings, IBPSA, OpenIPSL, and TRANSFORM. The full hidden task contents and construction metadata are not released.
| split | visibility | dependency types | purpose |
|---|---|---|---|
public_demo |
public | standalone only | format inspection and tooling smoke tests |
hidden_official_v0.1 |
private | standalone, MSL, public Modelica libraries | official aggregate evaluation |
Difficulty is assigned by empirical agent performance and workflow complexity, not by source-code length alone.
| bucket | intended meaning |
|---|---|
| easy | Localized repair; useful for checking that an agent can parse the task format, edit Modelica, and complete the OMC loop. |
| medium | Requires more Modelica context, cross-equation consistency, or nontrivial parameter/interface reasoning. |
| hard | Requires deeper workflow search, larger model context, library interaction, simulation-stage debugging, or robust finalization behavior. |
The public demo split includes two easy and two medium tasks. Hard tasks are kept in the hidden official set to preserve evaluation value.
v0.1 Hidden-Set Snapshot
All agents were evaluated on the same 132-task hidden set under the same benchmark and wall-clock conditions.
| Agent | Total | easy | medium | hard |
|---|---|---|---|---|
| GateForge | 130/132 | 21/21 | 56/56 | 53/55 |
| Claude Code | 123/132 | 21/21 | 55/56 | 47/55 |
| OpenCode | 120/132 | 21/21 | 50/56 | 49/55 |
| Agent | reported tokens* | wall time |
|---|---|---|
| GateForge | ~39.7M | ~14,658s |
| Claude Code | ~15.9M | ~35,191s |
| OpenCode | ~66.1M | ~20,843s |
*Reported tokens are runner-reported estimates; GateForge records provider usage directly, while other runners may omit local context management, compression, retries, or tool-output handling costs.
Official evaluation runs each task in a fresh agent session and isolated workspace. Agents must not carry conversation history, scratchpads, repaired candidates, task observations, or tool state from one task to another. Read-only infrastructure caches, such as container images and Modelica library caches, may be reused when they do not expose task content.
| task | difficulty | dependency | model | task JSON | readable model |
|---|---|---|---|---|---|
demo_001 |
easy | standalone | ThermalZone_v0 |
benchmark/samples/demo_001.json |
benchmark/samples/modelica/demo_001_initial.mo |
demo_002 |
easy | standalone | ExciterAVR_v0 |
benchmark/samples/demo_002.json |
benchmark/samples/modelica/demo_002_initial.mo |
demo_003 |
medium | standalone | SyncMachineSimplified_v0 |
benchmark/samples/demo_003.json |
benchmark/samples/modelica/demo_003_initial.mo |
demo_004 |
medium | standalone | HydroTurbineGov_v0 |
benchmark/samples/demo_004.json |
benchmark/samples/modelica/demo_004_initial.mo |
These demo tasks are not intended to be a leaderboard. They exist to document the task format and let users validate their tooling.
benchmark/
benchmark_card.md
schema.json
scoring.md
submission.md
samples/
demo_001.json
demo_002.json
demo_003.json
demo_004.json
modelica/
demo_001_initial.mo
demo_002_initial.mo
demo_003_initial.mo
demo_004_initial.mo
scripts/
validate_sample.py
score_submission.py
Validate public sample files:
python3 scripts/validate_sample.py benchmark/samples/demo_001.json
python3 scripts/validate_sample.py benchmark/samples/*.jsonValidate a submission JSON against a task JSON:
python3 scripts/score_submission.py benchmark/samples/demo_001.json path/to/submission.jsonThe scoring script checks schema-level requirements only. Official scoring runs OpenModelica checkModel and simulation using the task verification settings and warning policy.
The benchmark supports three public submission interfaces:
prediction_jsonl: submit precomputed final Modelica models;agent_command: run an agent command in one fresh workspace per task;agent_docker_image: run a containerized agent with only the current task workspace mounted.
See benchmark/submission.md for the full submission spec.
The public demo split can be run locally for format checks, tooling smoke tests, and submission preparation. It is not an official leaderboard split.
For the v0.1 Preview, the 132-task hidden official set is evaluated through maintainer-run official evaluation. Participants may provide a prediction JSONL file, an agent command, or a Docker image following benchmark/submission.md. The maintainers run the controlled evaluator and return a public-safe aggregate report.
The hidden tasks, hidden-set keys, and evaluator backend are not distributed. This protects the benchmark from leakage and training-data contamination.
Hidden Evaluation Policy
The official 132-task set is kept private. Aggregate results may be published, but hidden task contents and construction metadata are not released. This protects the benchmark from rapid training-data contamination.
Code, scripts, schemas, and tooling are licensed under the Apache License 2.0. Benchmark task data and task content are governed by DATA_LICENSE.md and may not be used for model training, fine-tuning, distillation, dataset augmentation, or benchmark memorization without prior written permission.
The public demo tasks are provided for benchmark format inspection and local smoke testing. They are not intended for training or benchmark memorization.