15 changes: 15 additions & 0 deletions .gitattributes
Original file line number Diff line number Diff line change
@@ -0,0 +1,15 @@
# Force LF line endings for shell scripts and config text files so they
# work correctly under bash/WSL even when the repo is cloned on Windows
# with core.autocrlf=true. CRLF in *.env files breaks `source`, and CRLF
# in *.sh files breaks the bash interpreter ("$'\r': command not found").
*.sh text eol=lf
*.bash text eol=lf
*.env text eol=lf
*.yaml text eol=lf
*.yml text eol=lf
*.toml text eol=lf

# Windows batch scripts must keep CRLF so cmd.exe parses them reliably;
# PowerShell scripts (*.ps1) are checked out with CRLF as well.
*.bat text eol=crlf
*.cmd text eol=crlf
*.ps1 text eol=crlf
11 changes: 10 additions & 1 deletion .secrets.baseline
@@ -3,7 +3,7 @@
"files": "^.secrets.baseline$",
"lines": null
},
"generated_at": "2026-05-10T11:36:08Z",
"generated_at": "2026-05-28T12:15:43Z",
"plugins_used": [
{
"name": "AWSKeyDetector"
@@ -306,6 +306,15 @@
"type": "Hex High Entropy String",
"verified_result": null
}
],
"scripts/model_profiles.bat": [
{
"hashed_secret": "af89b35ce32cfc9eaf4c102325da47616e6eff93",
"is_verified": false,
"line_number": 18,
"type": "Base64 High Entropy String",
"verified_result": null
}
]
},
"version": "0.13.1+ibm.64.dss",
64 changes: 64 additions & 0 deletions README.md
@@ -39,6 +39,19 @@ git clone https://github.com/cuga-project/cuga-eval.git
cd cuga-eval
```

> **Windows users:** every `.sh` script in this repo has a sibling `.bat`. You don't need
> WSL or Git Bash for the simple wrappers (`setup_cuga.bat`, `run_app.bat`, `run_registry.bat`,
> `viz.bat`, `model_profiles.bat`, the per-benchmark `analyze.bat`, etc.) — they run on
> stock `cmd.exe`. The heavier scripts (eval/compare/clean and the `m3_pad_to_cap_verify`
> helper) delegate to bash via Git Bash or WSL because they use POSIX-only features. See
> [Running on Windows](#running-on-windows) below.
>
> If you're using WSL and cloned with Windows git (default `core.autocrlf=true`),
> the `*.sh` and `*.env` files can end up with CRLF line endings. CRLF breaks bash
> under WSL (`$'\r': command not found`) and breaks `source` on env files. Run
> `fix_line_endings.bat` once (double-click it in Explorer, or run it from `cmd.exe`
> or PowerShell) before running any setup scripts under WSL.
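
For reference, the fix amounts to rewriting CRLF as LF in the affected files. The Python sketch below only illustrates that idea. It is not the contents of `fix_line_endings.bat`, and the names `normalize_to_lf` and `fix_tree` are hypothetical.

```python
from pathlib import Path


def normalize_to_lf(path: Path) -> bool:
    """Rewrite `path` with LF line endings; return True if the file changed."""
    data = path.read_bytes()
    fixed = data.replace(b"\r\n", b"\n")
    if fixed == data:
        return False
    path.write_bytes(fixed)
    return True


def fix_tree(root: Path, suffixes=(".sh", ".env")) -> list[Path]:
    """Normalize every file under `root` whose name ends with one of `suffixes`.

    Matching on the file name (not Path.suffix) also catches a bare `.env`.
    """
    changed = []
    for p in sorted(root.rglob("*")):
        if p.is_file() and p.name.endswith(suffixes) and normalize_to_lf(p):
            changed.append(p)
    return changed
```

Calling `fix_tree(Path("."))` from the repository root would return the list of files it rewrote. Batch files are left alone because they are expected to keep CRLF.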

### 2. Run setup script
```bash
# Clone CUGA agent and set up the base environment
@@ -176,6 +189,57 @@ cd benchmarks/m3 && ./eval.sh
cd benchmarks/appworld && ./eval.sh
```

### Running on Windows

Every script has a `.bat` sibling. Same flags, same semantics; just substitute the
extension and use `\` instead of `/`:

```bat
:: Top-level dispatcher (these scripts delegate to bash — see note below)
scripts\eval.bat --benchmark bpo
scripts\eval.bat --benchmark m3 --model-profile gpt-oss
scripts\compare.bat --benchmark bpo --runs 3

:: Setup (pure cmd.exe — no bash required)
setup_cuga.bat
setup_m3.bat --verify
setup_appworld.bat

:: Per-benchmark, from the benchmark dir
cd benchmarks\bpo && eval.bat
cd benchmarks\m3 && run_registry.bat

:: Run from PowerShell the same way — pwsh launches .bat via cmd.exe
.\setup_cuga.bat
.\scripts\eval.bat --benchmark bpo
```

The `.bat` files fall into two groups:

- **Pure `cmd.exe` ports** — setup scripts, env loaders, registry runners, app
launchers, model profiles, the analyze and viz thin-wrappers. Work on a vanilla
Windows install with `cmd.exe` or PowerShell. No bash needed.
- **Bash-delegate shims** — the heavy eval/compare/clean scripts and
`m3_pad_to_cap_verify`. These use POSIX features (signal traps, `lsof`, `pkill`,
process substitution, sourceable function libraries, embedded `python3` here-docs)
that don't have clean `cmd.exe` equivalents, so each shim calls
[`benchmarks\helpers\_delegate_to_bash.bat`](benchmarks/helpers/_delegate_to_bash.bat),
which finds a `bash` in this order: Git Bash (well-known install paths) →
`bash` on `PATH` → WSL. Install [Git for Windows](https://git-scm.com/download/win)
(provides Git Bash) or run `wsl --install` if neither is present.
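
The lookup order used by the delegate helper can be sketched in Python as follows. This is an illustration under assumptions, not the helper itself: `find_bash` and its return shape are hypothetical, and only the three install paths are taken from `_delegate_to_bash.bat`.

```python
import os
import shutil

# Install locations copied from _delegate_to_bash.bat.
GIT_BASH_CANDIDATES = (
    r"%ProgramFiles%\Git\bin\bash.exe",
    r"%ProgramFiles(x86)%\Git\bin\bash.exe",
    r"%LocalAppData%\Programs\Git\bin\bash.exe",
)


def find_bash(candidates=GIT_BASH_CANDIDATES, exists=os.path.exists, which=shutil.which):
    """Return (kind, path) for the first interpreter found, or None.

    `exists` and `which` are injectable so the order can be tested anywhere.
    """
    # 1. Git Bash in a well-known install path (%VAR% expands on Windows only).
    for raw in candidates:
        path = os.path.expandvars(raw)
        if exists(path):
            return ("git-bash", path)
    # 2. Any bash on PATH (for example MSYS2 or Cygwin).
    on_path = which("bash")
    if on_path:
        return ("path", on_path)
    # 3. WSL as the last resort; the script path then needs wslpath translation.
    wsl = which("wsl")
    if wsl:
        return ("wsl", wsl)
    return None
```

On a machine with none of the three available, `find_bash()` returns `None`, which corresponds to the helper's "No bash interpreter found" error path.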

A smoke test for the `.bat` scripts ships at `scripts/test_bat_scripts.ps1`. It
runs on any platform with PowerShell 7+:

```bash
pwsh scripts/test_bat_scripts.ps1
```

It validates that every `.sh` has a `.bat` sibling, that each `.bat` is well-formed,
and that the delegate shims point to existing `.sh` files. Long-term, this whole
layer will move to Python (one entrypoint instead of two parallel script trees) —
tracked in [issue #88](../../issues/88).
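
Two of those checks are easy to picture in Python. The sketch below is not the PowerShell test. The function names are hypothetical, and the regular expression assumes the shim shape used by the per-benchmark delegates (`"%_THIS%\<name>.sh"` passed to `_delegate_to_bash.bat`), so shims written differently would not be detected.

```python
import re
from pathlib import Path

# Assumed shim shape: ..._delegate_to_bash.bat" "%_THIS%\<name>.sh"
DELEGATE_RE = re.compile(r'_delegate_to_bash\.bat"\s+"%_THIS%\\([^"]+\.sh)"', re.IGNORECASE)


def missing_bat_siblings(root: Path) -> list[Path]:
    """Return every .sh under `root` that has no .bat file next to it."""
    return sorted(sh for sh in root.rglob("*.sh") if not sh.with_suffix(".bat").is_file())


def dangling_delegates(root: Path) -> list[Path]:
    """Return every .bat shim whose delegated .sh target does not exist."""
    dangling = []
    for bat in sorted(root.rglob("*.bat")):
        text = bat.read_text(encoding="utf-8", errors="replace")
        for rel in DELEGATE_RE.findall(text):
            # The target is resolved relative to the shim's own directory.
            target = bat.parent / rel.replace("\\", "/")
            if not target.is_file():
                dangling.append(bat)
    return dangling
```

Both functions return an empty list when the tree is consistent, so a caller can fail the check whenever either list is non-empty.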

### Model profiles

Available profiles: `gpt-oss`, `gpt4o`, `gpt4.1`, `opus4.5`
7 changes: 7 additions & 0 deletions benchmarks/appworld/compare.bat
@@ -0,0 +1,7 @@
@echo off
REM Windows equivalent of benchmarks/appworld/compare.sh — delegates to bash.
setlocal
set "_THIS=%~dp0"
if "%_THIS:~-1%"=="\" set "_THIS=%_THIS:~0,-1%"
call "%_THIS%\..\helpers\_delegate_to_bash.bat" "%_THIS%\compare.sh" %*
exit /b %errorlevel%
8 changes: 8 additions & 0 deletions benchmarks/appworld/eval.bat
@@ -0,0 +1,8 @@
@echo off
REM Windows equivalent of benchmarks/appworld/eval.sh — delegates to bash
REM (traps, kill -0, lsof, process substitution, find with -mindepth).
setlocal
set "_THIS=%~dp0"
if "%_THIS:~-1%"=="\" set "_THIS=%_THIS:~0,-1%"
call "%_THIS%\..\helpers\_delegate_to_bash.bat" "%_THIS%\eval.sh" %*
exit /b %errorlevel%
19 changes: 19 additions & 0 deletions benchmarks/appworld/run_app.bat
Original file line number Diff line number Diff line change
@@ -0,0 +1,19 @@
@echo off
REM Windows equivalent of benchmarks/appworld/run_app.sh
REM Loads env and starts AppWorld.

setlocal
set "SCRIPT_DIR=%~dp0"
if "%SCRIPT_DIR:~-1%"=="\" set "SCRIPT_DIR=%SCRIPT_DIR:~0,-1%"
pushd "%SCRIPT_DIR%\..\.." >nul
set "PROJECT_ROOT=%CD%"
popd >nul

echo Loading AppWorld configuration...
call "%PROJECT_ROOT%\benchmarks\helpers\load_env.bat" "appworld"

echo.
echo Starting AppWorld...
cd /d "%PROJECT_ROOT%"
uv run cuga start appworld
exit /b %errorlevel%
13 changes: 13 additions & 0 deletions benchmarks/appworld/run_eval.bat
@@ -0,0 +1,13 @@
@echo off
REM Windows equivalent of benchmarks/appworld/run_eval.sh
REM Loads AppWorld env and runs cuga-eval.

setlocal
set "SCRIPT_DIR=%~dp0"
if "%SCRIPT_DIR:~-1%"=="\" set "SCRIPT_DIR=%SCRIPT_DIR:~0,-1%"
pushd "%SCRIPT_DIR%\..\.." >nul
set "PROJECT_ROOT=%CD%"
popd >nul
call "%PROJECT_ROOT%\benchmarks\helpers\load_env.bat" "appworld"
cuga-eval appworld %*
exit /b %errorlevel%
8 changes: 8 additions & 0 deletions benchmarks/appworld/run_registry.bat
@@ -0,0 +1,8 @@
@echo off
REM Windows equivalent of benchmarks/appworld/run_registry.sh
REM Delegates to the generic helper.
setlocal
set "SCRIPT_DIR=%~dp0"
if "%SCRIPT_DIR:~-1%"=="\" set "SCRIPT_DIR=%SCRIPT_DIR:~0,-1%"
call "%SCRIPT_DIR%\..\helpers\run_registry.bat" "appworld"
exit /b %errorlevel%
7 changes: 7 additions & 0 deletions benchmarks/bpo/compare.bat
@@ -0,0 +1,7 @@
@echo off
REM Windows equivalent of benchmarks/bpo/compare.sh — delegates to bash.
setlocal
set "_THIS=%~dp0"
if "%_THIS:~-1%"=="\" set "_THIS=%_THIS:~0,-1%"
call "%_THIS%\..\helpers\_delegate_to_bash.bat" "%_THIS%\compare.sh" %*
exit /b %errorlevel%
7 changes: 7 additions & 0 deletions benchmarks/bpo/eval.bat
@@ -0,0 +1,7 @@
@echo off
REM Windows equivalent of benchmarks/bpo/eval.sh — delegates to bash.
setlocal
set "_THIS=%~dp0"
if "%_THIS:~-1%"=="\" set "_THIS=%_THIS:~0,-1%"
call "%_THIS%\..\helpers\_delegate_to_bash.bat" "%_THIS%\eval.sh" %*
exit /b %errorlevel%
19 changes: 19 additions & 0 deletions benchmarks/bpo/run_app.bat
@@ -0,0 +1,19 @@
@echo off
REM Windows equivalent of benchmarks/bpo/run_app.sh
REM Loads env and runs the BPO FastAPI app on port 8095.

setlocal
set "SCRIPT_DIR=%~dp0"
if "%SCRIPT_DIR:~-1%"=="\" set "SCRIPT_DIR=%SCRIPT_DIR:~0,-1%"
pushd "%SCRIPT_DIR%\..\.." >nul
set "PROJECT_ROOT=%CD%"
popd >nul

echo Loading BPO configuration...
call "%PROJECT_ROOT%\benchmarks\helpers\load_env.bat" "bpo"

echo.
echo Starting BPO FastAPI app on port 8095...
cd /d "%PROJECT_ROOT%"
uv run uvicorn benchmarks.bpo.main:app --reload --port 8095
exit /b %errorlevel%
7 changes: 7 additions & 0 deletions benchmarks/bpo/run_registry.bat
@@ -0,0 +1,7 @@
@echo off
REM Windows equivalent of benchmarks/bpo/run_registry.sh
setlocal
set "SCRIPT_DIR=%~dp0"
if "%SCRIPT_DIR:~-1%"=="\" set "SCRIPT_DIR=%SCRIPT_DIR:~0,-1%"
call "%SCRIPT_DIR%\..\helpers\run_registry.bat" "bpo"
exit /b %errorlevel%
60 changes: 60 additions & 0 deletions benchmarks/helpers/_delegate_to_bash.bat
@@ -0,0 +1,60 @@
@echo off
REM Shared helper: invokes a .sh script via Git Bash or WSL, forwarding all args.
REM
REM Usage (from another .bat):
REM call "<path-to-helpers>\_delegate_to_bash.bat" "<absolute-or-relative-path-to-script.sh>" %*
REM
REM Rationale: many of the .sh scripts in this repo use POSIX-only features
REM (process substitution, traps, lsof, pkill, comm, find -mindepth, mktemp,
REM heredocs, etc.) that don't have clean cmd.exe equivalents. Rather than
REM ship subtly-broken cmd.exe ports, we delegate to a real bash. A native
REM Python port is tracked in the follow-up issue.

setlocal enabledelayedexpansion

if "%~1"=="" (
echo [ERROR] _delegate_to_bash.bat called without a script path
exit /b 2
)
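set "_SCRIPT=%~1"

if not exist "%_SCRIPT%" (
echo [ERROR] Script not found: !_SCRIPT!
exit /b 2
)

REM `shift` does not change %*, so forwarding %* would hand the script its own
REM path as the first argument. Collect the remaining arguments by hand instead.
REM cmd.exe splits unquoted arguments on "=", "," and ";", so callers should
REM write `--flag value` or quote the whole argument.
set "_ARGS="
:collect_args
shift
if "%~1"=="" goto args_done
set "_ARGS=%_ARGS% %1"
goto collect_args
:args_done

REM Try Git Bash in well-known install locations
for %%G in (
"%ProgramFiles%\Git\bin\bash.exe"
"%ProgramFiles(x86)%\Git\bin\bash.exe"
"%LocalAppData%\Programs\Git\bin\bash.exe"
) do (
if exist %%G (
%%G "%_SCRIPT%" !_ARGS!
exit /b !errorlevel!
)
)

REM Then any bash on PATH (e.g. msys2, cygwin)
where bash >nul 2>&1
if not errorlevel 1 (
bash "%_SCRIPT%" !_ARGS!
exit /b !errorlevel!
)

REM Finally WSL
where wsl >nul 2>&1
if not errorlevel 1 (
for /f "delims=" %%P in ('wsl wslpath -u "%_SCRIPT%" 2^>nul') do set "_WSL_SCRIPT=%%P"
if not "!_WSL_SCRIPT!"=="" (
wsl bash "!_WSL_SCRIPT!" !_ARGS!
exit /b !errorlevel!
)
)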

echo [ERROR] No bash interpreter found on this system.
echo This script requires bash. Install one of:
echo - Git for Windows ^(provides Git Bash^): https://git-scm.com/download/win
echo - WSL ^(Windows Subsystem for Linux^): wsl --install
exit /b 1
24 changes: 24 additions & 0 deletions benchmarks/helpers/common.bat
@@ -0,0 +1,24 @@
@echo off
REM Placeholder for benchmarks/helpers/common.sh.
REM
REM common.sh is a bash function library (port_in_use, wait_for_server,
REM parse_common_args, cleanup_pids, etc.) that gets sourced by other .sh
REM scripts. There's no equivalent of `source` for function definitions in
REM cmd.exe, so a direct port is not feasible.
REM
REM In practice, this file is never called directly: the heavy .bat files
REM in this repo (eval.bat, compare.bat, etc.) delegate to bash via
REM _delegate_to_bash.bat, and bash sources common.sh itself.
REM
REM If you ARE invoking this file directly, you probably want one of:
REM - call _delegate_to_bash.bat ".\common.sh" ^<args^> (runs common.sh under bash)
REM - Use Git Bash or WSL to source it the normal way
REM
REM See the follow-up issue for the Python migration that removes this gap.

if "%~1"=="" (
echo common.bat is a placeholder. See comment block in this file.
exit /b 0
)
echo [WARN] common.bat does not implement %~1 in cmd.exe. Use bash to source common.sh.
exit /b 1
55 changes: 55 additions & 0 deletions benchmarks/helpers/load_env.bat
@@ -0,0 +1,55 @@
@echo off
REM Windows equivalent of load_env.sh
REM
REM Usage: call load_env.bat [benchmark_name]
REM
REM Sourcing semantics: this script writes a temporary .bat snippet of `set`
REM commands and calls it, so env vars persist into the caller's scope when
REM invoked via `call`.

setlocal enabledelayedexpansion

set "BENCHMARK_NAME=%~1"

set "HELPERS_DIR=%~dp0"
if "%HELPERS_DIR:~-1%"=="\" set "HELPERS_DIR=%HELPERS_DIR:~0,-1%"
pushd "%HELPERS_DIR%\..\.." >nul
set "PROJECT_ROOT=%CD%"
popd >nul
set "CONFIG_DIR=%PROJECT_ROOT%\config"

REM Temp file holds the set-commands we'll call from the caller's scope
set "_SETS=%TEMP%\cuga_loadenv_%RANDOM%_%RANDOM%.bat"
echo @echo off> "%_SETS%"

call :emit_env_file "%PROJECT_ROOT%\.env" ".env (secrets)"
call :emit_env_file "%CONFIG_DIR%\global.env" "global.env"
if not "%BENCHMARK_NAME%"=="" (
call :emit_env_file "%PROJECT_ROOT%\benchmarks\%BENCHMARK_NAME%\config\%BENCHMARK_NAME%.env" "%BENCHMARK_NAME%.env"
)

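REM Default LOGURU_LEVEL handling. The checks are emitted into the snippet so they
REM run after the env-file values are set. Testing LOGURU_LEVEL here would only see
REM the caller's environment and let the default override a value from a file.
echo if not defined LOGURU_LEVEL set "LOGURU_LEVEL=WARNING">> "%_SETS%"
echo if /i "%%VERBOSE%%"=="true" set "LOGURU_LEVEL=DEBUG">> "%_SETS%"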

REM Single-line endlocal so %_SETS% is expanded at parse time (before endlocal runs)
endlocal & call "%_SETS%" & del "%_SETS%" 2>nul
exit /b 0

:emit_env_file
set "_FILE=%~1"
set "_LABEL=%~2"
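REM The echo below uses delayed expansion and escaped parentheses: an unescaped
REM closing parenthesis there, or one inside a label such as ".env (secrets)",
REM would end the if-block early and leave the exit /b 0 outside it.
if not exist "%_FILE%" (
if not "!_LABEL!"=="" echo ^(skipping missing !_LABEL!^)
exit /b 0
)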
echo [ok] Loading %_LABEL%
for /f "usebackq tokens=* eol=#" %%L in ("%_FILE%") do (
set "_line=%%L"
if not "!_line!"=="" (
for /f "tokens=1,* delims==" %%A in ("!_line!") do (
echo set "%%A=%%B">> "%_SETS%"
)
)
)
exit /b 0
28 changes: 28 additions & 0 deletions benchmarks/helpers/run_registry.bat
@@ -0,0 +1,28 @@
@echo off
REM Windows equivalent of run_registry.sh
REM Loads env (global + benchmark-specific) and starts the registry server.
REM Usage: run_registry.bat ^<benchmark_name^>

setlocal

set "BENCHMARK_NAME=%~1"
if "%BENCHMARK_NAME%"=="" (
echo Usage: %~nx0 ^<benchmark_name^>
echo Example: %~nx0 m3
exit /b 1
)

set "SCRIPT_DIR=%~dp0"
if "%SCRIPT_DIR:~-1%"=="\" set "SCRIPT_DIR=%SCRIPT_DIR:~0,-1%"
pushd "%SCRIPT_DIR%\..\.." >nul
set "PROJECT_ROOT=%CD%"
popd >nul

echo Loading %BENCHMARK_NAME% evaluation configuration...
call "%SCRIPT_DIR%\load_env.bat" "%BENCHMARK_NAME%"

echo.
echo Starting registry server...
cd /d "%PROJECT_ROOT%"
uv run registry
exit /b %errorlevel%