AI Agent for Physical Systems Modeling
Benchmark snapshot as of May 21, 2026.
All agents use the same foundation model family and are evaluated under the same benchmark and wall-clock conditions.
GateForge solves more tasks than both state-of-the-art coding-agent baselines. All three agents solve every easy task; the gap opens on medium and hard Modelica workflows.
| Agent | Total | Easy | Medium | Hard |
|---|---|---|---|---|
| GateForge | 130/132 | 21/21 | 56/56 | 53/55 |
| Claude Code | 123/132 | 21/21 | 55/56 | 47/55 |
| OpenCode | 120/132 | 21/21 | 50/56 | 49/55 |
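As a quick check on the table above, the per-tier counts can be summed and converted to pass rates. This is a minimal sketch: the names `RESULTS`, `pass_rate`, and `summarize` are illustrative and are not part of GateForge or the benchmark harness.

```python
# Pass counts copied from the table above: (passed, total) per difficulty tier.
RESULTS = {
    "GateForge": {"easy": (21, 21), "medium": (56, 56), "hard": (53, 55)},
    "Claude Code": {"easy": (21, 21), "medium": (55, 56), "hard": (47, 55)},
    "OpenCode": {"easy": (21, 21), "medium": (50, 56), "hard": (49, 55)},
}


def pass_rate(passed, total):
    """Return the pass rate as a percentage."""
    return 100.0 * passed / total


def summarize(results):
    """Sum the tiers per agent and compute overall and per-tier pass rates."""
    summary = {}
    for agent, tiers in results.items():
        passed = sum(p for p, _ in tiers.values())
        total = sum(t for _, t in tiers.values())
        summary[agent] = {
            "total": (passed, total),
            "rate": round(pass_rate(passed, total), 1),
            "by_tier": {
                tier: round(pass_rate(p, t), 1) for tier, (p, t) in tiers.items()
            },
        }
    return summary


if __name__ == "__main__":
    for agent, row in summarize(RESULTS).items():
        passed, total = row["total"]
        print(f"{agent}: {passed}/{total} ({row['rate']}%), tiers: {row['by_tier']}")
```

The tier counts sum to the reported totals (130, 123, and 120 out of 132), which correspond to overall pass rates of roughly 98.5%, 93.2%, and 90.9%.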
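GateForge beats both baselines on wall time and success rate. Compared with OpenCode, it finishes faster and uses fewer reported tokens. Compared with Claude Code, it finishes faster and solves more tasks, although Claude Code reports fewer tokens.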
| Agent | Reported tokens* | Wall time |
|---|---|---|
| GateForge | ~39.7M | ~14,658s |
| Claude Code | ~15.9M | ~35,191s |
| OpenCode | ~66.1M | ~20,843s |
*Reported tokens are runner-reported estimates; GateForge records provider usage directly, while other runners may omit local context management, compression, retries, or tool-output handling costs.
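The relative differences behind the cost comparison can be worked out from the table. The sketch below assumes the approximate figures as printed; `COSTS`, `percent_change`, and `compare` are illustrative names, not part of any runner.

```python
# Figures copied from the table above; both columns are approximate ("~").
# Tokens are in millions, wall time is in seconds.
COSTS = {
    "GateForge": {"tokens_m": 39.7, "wall_s": 14658},
    "Claude Code": {"tokens_m": 15.9, "wall_s": 35191},
    "OpenCode": {"tokens_m": 66.1, "wall_s": 20843},
}
TASKS = 132  # benchmark size, from the results table


def percent_change(value, baseline):
    """Signed percent difference of value relative to baseline."""
    return 100.0 * (value - baseline) / baseline


def compare(agent, baseline, costs=COSTS):
    """Percent change in tokens and wall time of agent versus baseline."""
    a, b = costs[agent], costs[baseline]
    return {
        "tokens_pct": round(percent_change(a["tokens_m"], b["tokens_m"]), 1),
        "wall_pct": round(percent_change(a["wall_s"], b["wall_s"]), 1),
    }


if __name__ == "__main__":
    for baseline in ("Claude Code", "OpenCode"):
        print("GateForge vs", baseline, compare("GateForge", baseline))
    for agent, cost in COSTS.items():
        per_task = round(cost["wall_s"] / TASKS, 1)
        print(agent, "mean wall time per task:", per_task, "s")
```

On these figures, GateForge's wall time is roughly 58% lower than Claude Code's and roughly 30% lower than OpenCode's. Its reported tokens are roughly 40% lower than OpenCode's but about 2.5 times Claude Code's, and the token comparison is subject to the reporting caveat in the footnote.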
Without prior written permission, no content on this site may be used for AI model training, fine-tuning, evaluation, or dataset construction.
See LEGAL_NOTICE.md, CONTENT_AUTHORIZATION_POLICY.md, and robots.txt.