fix(m3): bundle on interrupt and save partial results on crash by sami-marreed · Pull Request #4 · cuga-project/cuga-eval

sami-marreed · 2026-05-26T09:36:05Z

Summary

Two related crash-resilience issues in the M3 evaluation flow:

(internal #91) — eval/compare bundles were only created when EVAL_EXIT==0. Any Ctrl-C or crash skipped the bundle entirely.
(internal #92) — when eval_m3.py was interrupted or crashed at the outer level, save_evaluation_results was never called, so bundling had no result JSON.

Internal issue numbers were not migrated to this repo. Closest open issues: #11, #12.

What changed

M3 interrupt / partial save (core)

benchmarks/m3/eval.sh, compare.sh: bundle from cleanup trap; RUN_START_TS for correct result file; idempotent BUNDLE_DONE.
benchmarks/m3/eval_m3.py: partial save on KeyboardInterrupt / outer exceptions (m3_config_partial_* prefixes).
benchmarks/m3/tests/test_partial_save_on_interrupt.py: regression tests.
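
The partial-save behaviour listed above can be illustrated with a minimal sketch. This is not the code in eval_m3.py: `save_results`, `run_eval`, `demo_interrupt` and the `_results.json` suffix are hypothetical stand-ins, and only the `m3_config_partial` prefix and the handler shape come from this PR.

```python
import asyncio
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List, Optional


def save_results(results: List[Dict[str, Any]], out_dir: Path, prefix: str) -> Path:
    """Hypothetical stand-in for save_evaluation_results: dump results as JSON."""
    path = out_dir / f"{prefix}_results.json"
    path.write_text(json.dumps(results, indent=2))
    return path


def run_eval(tasks: List[str], out_dir: Path, interrupt_at: Optional[int] = None) -> Path:
    # Hoisted above the try-block so the handlers below can still see it.
    all_results: List[Dict[str, Any]] = []
    try:
        for i, task in enumerate(tasks):
            if i == interrupt_at:
                raise KeyboardInterrupt  # simulated Ctrl-C mid-run
            all_results.append({"task": task, "success": True})
    except (KeyboardInterrupt, asyncio.CancelledError):
        save_results(all_results, out_dir, "m3_config_partial")
        raise  # the caller still sees the interrupt
    except Exception:
        save_results(all_results, out_dir, "m3_config_partial")
        raise
    return save_results(all_results, out_dir, "m3_config")


def demo_interrupt() -> List[Dict[str, Any]]:
    """Interrupt before the third task and read back what was salvaged."""
    out_dir = Path(tempfile.mkdtemp())
    try:
        run_eval(["t1", "t2", "t3"], out_dir, interrupt_at=2)
    except KeyboardInterrupt:
        pass
    return json.loads((out_dir / "m3_config_partial_results.json").read_text())
```

Re-raising after the save keeps the exit status meaningful for the calling shell script, which is what lets the cleanup trap bundle the partial file.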

Merge from `main` + smoke harness (follow-up)

Merged latest main.
just test-smoke-e2e: one AppWorld SDK task, one ReAct task, one M3 hockey (m3_task_2, --max-samples 1); validates bundle report.md metrics (tokens, steps, duration, etc.; cost may be --).
scripts/smoke_benchmarks.sh, benchmarks/helpers/validate_bundle_report.py.
AppWorld SDK: populate steps in results for report tables; _fmt shows 0 instead of -- for numeric zeros.

Other

pyproject.toml: pytest --import-mode=importlib (fixes test collection with multiple tests/ packages).

Verification

just lint
just test-regression (270 passed on branch HEAD)
just security
just test-smoke-e2e (live; needs AppWorld + M3 containers + API keys)

Test plan

Ctrl-C mid M3 eval → bundle + partial m3_config_partial_*.json created
Success path still produces exactly one bundle
Optional: just test-smoke-e2e on a machine with full benchmark runtime

Summary by CodeRabbit

New Features
- Added live smoke testing capability with prerequisite validation
- Added bundle report validation to detect missing metrics
- Improved resilience with automatic partial result saving on interrupts or crashes
Bug Fixes
- Fixed typos in benchmark data file names
Documentation
- Updated contributing guide and command templates with smoke testing instructions
- Added documentation on partial result recovery and bundle resilience
Tests
- Added smoke testing and bundle validation test coverage
- Added regression tests for partial result saving

Two related issues:

- #91: an interrupt (Ctrl-C) or crash skipped the bundle entirely, so long-running runs lost all logs/trajectories/Langfuse traces even though they existed on disk.
- #92: when eval_m3.py was interrupted or crashed at the outer level, `save_evaluation_results` was never called, so the next bundling step had no result JSON to package. Sequential and batched modes already handled per-task failures gracefully; only the outer KeyboardInterrupt / unhandled-exception path was missing.

Changes:

benchmarks/m3/eval.sh, benchmarks/m3/compare.sh:
- Extract bundle code into create_bundle / create_compare_bundle.
- Call from cleanup trap (EXIT INT TERM ERR) so a bundle is produced on success, crash, or Ctrl-C. Idempotent via BUNDLE_DONE.
- eval.sh records RUN_START_TS so the bundle picks the result file from *this* run, not a stale earlier one.
- Bundle subcommands wrapped with `|| echo ... best-effort` so a report-generation hiccup can't suppress the bundle itself.

benchmarks/m3/eval_m3.py:
- Hoist all_results to before the eval try-block; use .extend() in batched mode so an interrupt during gather doesn't drop captured results.
- Add except (KeyboardInterrupt, asyncio.CancelledError) and except Exception handlers that call save_evaluation_results with prefix m3_config_partial (or m3_config_no_gt_partial) before re-raising.

benchmarks/m3/tests/test_partial_save_on_interrupt.py:
- New regression tests asserting save_evaluation_results writes a valid JSON with the m3_config_partial / m3_config_no_gt_partial prefixes that the interrupt path now emits.

benchmarks/m3/tests/conftest.py:
- Add project root to sys.path so the m3 tests are runnable in isolation (matches the bpo conftest).

pyproject.toml:
- Switch pytest to --import-mode=importlib. Per-benchmark tests/ packages were colliding on the `tests.*` namespace, breaking collection in `just ci` (introduced by PR #87). importlib mode imports each test file independently. Unblocks `just ci`.

Closes #91 #92

- Add scripts/smoke_benchmarks.sh and report.md validator
- Populate steps in AppWorld SDK results; show 0 in reports instead of --
- Add just test-smoke-e2e and document in CONTRIBUTING.md

haroldship · 2026-06-03T19:17:58Z

Update

Merged latest main and pushed (a3773d9).
Added e2e smoke harness: just test-smoke-e2e (AppWorld SDK + ReAct + M3 hockey sample) with report.md validation.
Local just ci: lint ✓, 270 tests ✓, security ✓.

Live smoke (just test-smoke-e2e) pending — requires AppWorld services, M3 capability containers, and API keys.

haroldship · 2026-06-03T19:26:07Z

Live smoke: just test-smoke-e2e completed successfully (AppWorld SDK 82e2fac_1, AppWorld ReAct, M3 hockey m3_task_2 ×1). All three bundle report.md files passed validation.

Follow-up fix: 9133c39 — validator invoked via python -m benchmarks.helpers.validate_bundle_report.

haroldship · 2026-06-03T19:32:15Z

Smoke correction: The earlier test-smoke-e2e run validated AppWorld SDK + ReAct bundles from this run, but M3 did not produce a new bundle (registry port 8001 conflict, then 0 results). The script incorrectly passed by validating an older compare bundle — fixed in 9133c39 follow-up (use run start timestamp + free port 8001 before M3).

M3 smoke is still blocked locally by the cuga mode config path: the mode value in the env file (accurate) carries a trailing inline # comment. AppWorld portions of smoke are green.

- Pick report.md by mtime since RUN_START_TS, not lexicographic tail
- Free port 8001 before M3 so registry can start after AppWorld
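
The report selection described in this commit can be sketched as below. The actual harness is a bash script; this Python version only illustrates the idea, and `newest_report_since`, `demo_selection` and the `<bundle>/report.md` layout are assumptions made for the example.

```python
import os
import tempfile
import time
from pathlib import Path
from typing import Optional


def newest_report_since(bundle_root: Path, run_start_ts: float) -> Optional[Path]:
    """Return the newest report.md modified at or after run_start_ts, if any."""
    candidates = [
        p for p in bundle_root.glob("*/report.md")
        if p.stat().st_mtime >= run_start_ts
    ]
    # default=None means "this run produced no report", never a stale one.
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)


def demo_selection() -> tuple:
    """A stale bundle sorts last by name; only the fresh one should be picked."""
    root = Path(tempfile.mkdtemp())
    start = time.time()
    for name, offset in (("zz_old", -100.0), ("aa_new", 1.0)):
        bundle = root / name
        bundle.mkdir()
        report = bundle / "report.md"
        report.write_text("# report\n")
        os.utime(report, (start + offset, start + offset))
    picked = newest_report_since(root, start)
    nothing = newest_report_since(root, start + 1000.0)
    return (picked.parent.name if picked else None, nothing)
```

A lexicographic tail would pick `zz_old` here, which is the false pass reported in the smoke correction above.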

haroldship · 2026-06-03T19:39:47Z

Pushed b1816fa (smoke bundle timestamp fix) and $(git rev-parse --short HEAD) (strip inline # comments in load_env.sh).

M3 smoke retry: eval produced a bundle but the hockey task hit API/Langfuse auth errors and zero metrics in report.md — would fail validation until keys/containers are up. AppWorld SDK + ReAct smoke remain green from the earlier run.

Inline `#` notes in m3.env were appended to values (e.g. CUGA_MODE), breaking bundle report generation. Document optional just test-smoke-e2e in create-pr.
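
A rough sketch of the inline-comment stripping follows. The real fix lives in load_env.sh (bash); `parse_env_lines` and the sample values in the note below are hypothetical and only show the rule of dropping an unquoted `#` that follows whitespace.

```python
from typing import Dict


def parse_env_lines(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, dropping full-line and unquoted inline # comments."""
    env: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        value = value.strip()
        if value[:1] in ("'", '"'):
            # Quoted value: keep everything up to the matching closing quote.
            quote = value[0]
            end = value.find(quote, 1)
            value = value[1:end] if end != -1 else value[1:]
        else:
            # Unquoted value: a '#' at the start or after whitespace opens a comment.
            for i, ch in enumerate(value):
                if ch == "#" and (i == 0 or value[i - 1].isspace()):
                    value = value[:i]
                    break
            value = value.rstrip()
        env[key.strip()] = value
    return env
```

With this rule, `CUGA_MODE=accurate  # note` yields `accurate`, while a `#` inside a URL fragment or a quoted string is left alone.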

haroldship · 2026-06-03T19:45:31Z

Status: All CI checks green on latest d172bee. No open review threads.

Scope reminder: Core change is M3 interrupt bundling + partial save; follow-up commits add live smoke harness (just test-smoke-e2e) and load_env inline-comment fix. M3 portion of smoke is manual-only (not CI).

Ready for review when you are.

…resilience # Conflicts: # benchmarks/helpers/load_env.sh # benchmarks/m3/compare.sh # benchmarks/m3/eval.sh # benchmarks/m3/eval_m3.py

coderabbitai · 2026-06-07T14:09:41Z

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d8c92b4d-7410-4eec-8dfc-05e455152dea

📥 Commits

Reviewing files that changed from the base of the PR and between 193fef2 and 80f1549.

📒 Files selected for processing (3)

benchmarks/appworld/eval_appworld_sdk.py
benchmarks/helpers/validate_bundle_report.py
scripts/smoke_benchmarks.sh

📝 Walkthrough

This PR implements smoke-testing validation infrastructure for the M3 and AppWorld benchmarks, hardens evaluation resilience via partial-result salvage on interrupt, improves step counting across evaluators, and refactors bundle creation into idempotent cleanup-trap handlers. Step counting is fixed to use invoke-derived metrics with fallbacks; bundle reports are validated via a new CLI utility; and M3 evaluation now persists partial results when interrupted, with centralized idempotent bundle creation in both eval and compare scripts.

Changes

Smoke Testing and Evaluation Resilience

Layer / File(s)	Summary
Step count capture and fallback logic `benchmarks/appworld/eval_appworld_sdk.py`, `benchmarks/helpers/sdk_eval_helpers.py`, `benchmarks/helpers/compare_report.py`	AppWorld evaluator now persists `agent.invoke()` result and derives step counts via ReAct extraction or fallback to tracker/tool_calls metrics. Tracker callbacks pass computed agent_steps to finish_task. Format utility treats zero as a valid value instead of a missing marker.
Bundle report validation utility `benchmarks/helpers/validate_bundle_report.py`, `benchmarks/helpers/tests/test_validate_bundle_report.py`	New validator parses report.md Per-Task and summary sections, enforces required metrics (Tokens, LLM Calls, Cache Tokens, Duration, Steps), validates summary totals, and provides a CLI entrypoint with comprehensive tests covering valid and invalid report shapes.
End-to-end smoke test harness `justfile`, `scripts/smoke_benchmarks.sh`	Smoke execution wired through a new just task and bash script that orchestrates three benchmarks (AppWorld SDK, AppWorld ReAct, M3 hockey), selects reports by run-start timestamp, frees port 8001, validates each report via CLI, and supports AppWorld task ID override.
M3 evaluation partial-result salvage on interrupt `benchmarks/m3/eval_m3.py`, `benchmarks/m3/tests/conftest.py`, `benchmarks/m3/tests/test_partial_save_on_interrupt.py`	Config-mode evaluation moves result storage outside try block to preserve state across interruption/exception handlers. Exception handlers now save partial results with `m3_config_partial` or `m3_config_no_gt_partial` prefixes before re-raising. Test suite enables isolated m3 execution; regression tests verify partial-save file format and prefixes.
Idempotent bundle creation and cleanup-trap wiring `benchmarks/m3/eval.sh`, `benchmarks/m3/compare.sh`	Both eval and compare scripts centralize bundle creation into idempotent `create_bundle()` and `create_compare_bundle()` functions guarded by BUNDLE_DONE flags. Bundle creation is invoked from success paths and cleanup traps (EXIT/INT/TERM/ERR) to salvage partial artifacts on interrupt. Compare script filters out `m3_config_partial` files during aggregation. Report generation is best-effort and does not abort.
Documentation, configuration comments, and multiturn filename corrections `.bob/commands/cuga-create-pr.md`, `.claude/commands/cuga-create-pr.md`, `.cursor/commands/cuga-create-pr.md`, `CONTRIBUTING.md`, `README.md`, `pyproject.toml`, `benchmarks/m3/eval_m3_multiturn.py`, `scripts/create_eval_bundle.py`	AI-assistant command templates and CONTRIBUTING guide document optional smoke-testing prerequisites. README documents M3 resilience behavior on interrupts and crashes. pyproject.toml clarifies pytest importlib mode rationale to prevent per-benchmark test package collisions. Multiturn evaluator and bundle-script references are corrected from misspelled `olympics_mutliturn.json` to `olympics_multiturn.json`.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related issues

[Feature]: Evaluation bundle must include full Vakra judge scores in all artifacts (cuga + react) #56: This PR directly implements bundle assembly, partial-result salvage on interrupt, and report validation tooling that addresses the bundle-creation and resilience behavior described in the issue.
fix(m3): make failure banner reachable by selectively suppressing ERR around evaluator invocations #55: Changes to benchmarks/m3/eval.sh cleanup trap and bundling refactoring are directly related to interrupt/exception handling and bundle creation reorganization.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 37.93% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title 'fix(m3): bundle on interrupt and save partial results on crash' accurately describes the main objectives: addressing M3 evaluation resilience by creating bundles on interrupts and saving partial results on crashes.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

coderabbitai

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

benchmarks/m3/compare.sh (1)
382-393: ⚠️ Potential issue | 🟠 Major

INT/TERM cleanup must terminate the script after bundling/cleanup.

`trap compare_cleanup EXIT INT TERM` runs the cleanup, but the handler then returns to the script. If Ctrl-C arrives while `bash "$SCRIPT_DIR/eval.sh" ...` is running inside the per-run `if ...; then ... else ... fi`, `set -e` won't stop the script because the failure is inside the `if`, so the script can keep going with later runs. Split the traps so `compare_cleanup` runs on EXIT, while INT/TERM run the cleanup and then exit immediately (e.g., `exit 130` for INT and `exit 143` for TERM). Don't rely on `$?` inside the INT/TERM handler for the exit code.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benchmarks/m3/compare.sh` around lines 382 - 393, The compare_cleanup
function is currently trapped for EXIT, INT, and TERM but returns to the script
on SIGINT/SIGTERM allowing the main loop to continue; change the trap setup so
compare_cleanup remains on EXIT only, and add separate INT and TERM handlers
that call compare_cleanup (and create_compare_bundle if needed) then immediately
exit with the conventional codes (exit 130 for INT, exit 143 for TERM) without
relying on $?; update the trap invocation(s) near the existing "trap
compare_cleanup EXIT INT TERM" line and reference compare_cleanup and
create_compare_bundle when implementing the new INT/TERM handlers.

🧹 Nitpick comments (3)

benchmarks/helpers/validate_bundle_report.py (1)
68-74: ⚡ Quick win

Summary validation only flags empty values, not missing lines.

The regex search on line 69 only matches when the "Label:" pattern exists in the text. If the summary section or a specific label is completely missing, no error is raised. This may be intentional (opt-in validation), but if the smoke test expects these metrics to always be present, the validator should flag missing lines as well.
Optional: require summary lines to exist
     for label in ("Total Tokens", "Total LLM Calls", "Total Duration"):
         m = re.search(rf"\*\*{re.escape(label)}\*\*:\s*(.+)", text)
-        if m:
+        if not m:
+            errors.append(f"{path}: summary {label} is missing")
+        else:
             val = m.group(1).strip()
             if not val or val == "--":
                 errors.append(f"{path}: summary {label} is empty")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benchmarks/helpers/validate_bundle_report.py` around lines 68 - 74, The
current loop only appends an error when a matched label exists but has an empty
value; update the validation to also append an error when the label line is
entirely missing: inside the loop over labels ("Total Tokens", "Total LLM
Calls", "Total Duration") check
re.search(rf"\*\*{re.escape(label)}\*\*:\s*(.+)", text) and if m is None append
errors.append(f"{path}: summary {label} is missing"); if m exists keep the
existing val check (if not val or val == "--" append the same error). Ensure you
reference the same local names (label, m, val, errors) when implementing the
change.
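
A standalone sketch of the stricter summary check suggested in the diff above. `check_summary` is a hypothetical name, not the validator's actual function. One deliberate difference, which is a choice of this sketch: it matches `[ \t]*(.*)` instead of `\s*(.+)`, so an empty value is not silently filled from the following line.

```python
import re
from typing import List

SUMMARY_LABELS = ("Total Tokens", "Total LLM Calls", "Total Duration")


def check_summary(text: str, path: str = "report.md") -> List[str]:
    """Report summary labels that are missing entirely or have an empty value."""
    errors: List[str] = []
    for label in SUMMARY_LABELS:
        # Stay on the label's own line: spaces/tabs only, then the value.
        m = re.search(rf"\*\*{re.escape(label)}\*\*:[ \t]*(.*)", text)
        if not m:
            errors.append(f"{path}: summary {label} is missing")
            continue
        val = m.group(1).strip()
        if not val or val == "--":
            errors.append(f"{path}: summary {label} is empty")
    return errors
```

Whether a missing line should fail the smoke run is the judgment call raised in the comment; the sketch simply makes both cases visible.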
scripts/validate_bundle_report.py (1)
11-26: ⚡ Quick win

Duplicate main() implementation.

This main() function duplicates the logic in benchmarks/helpers/validate_bundle_report.py:main(). Both files define argparse, file existence checks, and error printing.

Consider removing this script and invoking the module directly via python -m benchmarks.helpers.validate_bundle_report (which is what smoke_benchmarks.sh line 50 already does). The CLI wrapper adds no extra functionality.
Optional: remove duplicate script

Since smoke_benchmarks.sh line 50 already uses uv run python -m benchmarks.helpers.validate_bundle_report, this wrapper script is unused. Consider deleting scripts/validate_bundle_report.py to reduce duplication.

Alternatively, if the wrapper is needed for other use cases, refactor to avoid code duplication:
#!/usr/bin/env python3
"""CLI wrapper for bundle report validation."""

from benchmarks.helpers.validate_bundle_report import main

if __name__ == "__main__":
    raise SystemExit(main())
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/validate_bundle_report.py` around lines 11 - 26, The main() in this
script duplicates the CLI in the benchmarks.helpers.validate_bundle_report
module (same argparse, file checks and error printing); remove the duplicate by
either deleting this script or replacing its implementation to import and
delegate to benchmarks.helpers.validate_bundle_report.main (i.e., remove local
argparse/validate_report_md logic and in the script implement a minimal CLI stub
that calls the module's main and exits with its return code, invoking main()
only under if __name__ == "__main__").
benchmarks/appworld/eval_appworld_sdk.py (1)
318-324: 💤 Low value

Consider recording explicit zero step counts.

The if agent_steps: guard on line 323 omits result["steps"] when the count is zero. If zero steps is a valid and meaningful state (e.g., immediate failure without tool calls), it should be recorded explicitly. If zero is truly "missing" data, the current behavior is correct.
Optional: record zero explicitly if meaningful
-    if agent_steps:
+    if agent_steps is not None:
         result["steps"] = agent_steps
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benchmarks/appworld/eval_appworld_sdk.py` around lines 318 - 324, The current
truthy check "if agent_steps:" drops zero counts; change the guard so zero is
recorded by testing for None instead of truthiness — i.e., after computing
agent_steps (via invoke_result_holder/_react_steps_from_invoke_result or
fallback len(tracker.steps) or len(tool_calls)), set result["steps"]=agent_steps
when agent_steps is not None (or otherwise explicitly handle zero as a valid
value) so that 0 is stored but None still means "missing".
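
The distinction between "zero" and "missing" can be made concrete with a small sketch. `record_steps` and `fmt_metric` are hypothetical helpers, not the functions in eval_appworld_sdk.py or compare_report.py; they only show the `is not None` test that both the steps guard and the report formatter rely on.

```python
from typing import Any, Dict, Optional


def record_steps(result: Dict[str, Any], agent_steps: Optional[int]) -> Dict[str, Any]:
    """Store the step count whenever it is known, including an explicit zero."""
    if agent_steps is not None:  # a truthiness test would drop 0 here
        result["steps"] = agent_steps
    return result


def fmt_metric(value: Optional[float]) -> str:
    """Render '--' only for missing values, never for a numeric zero."""
    return "--" if value is None else str(value)
```

If zero steps really does mean "no data" for a given evaluator, the truthy guard is the right choice and this sketch does not apply.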

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@benchmarks/m3/compare.sh`:
- Around line 200-230: The JSON builder currently folds all entries from
CONFIG_RESULT_VALS into JSON_PARTS, including partial run artifacts; modify the
loop that iterates over files (the for f in $files loop that builds file_list)
to skip any filenames that start with "m3_config_partial_" or
"m3_config_no_gt_partial_" (or otherwise match a configured partial-file
pattern) so they are not appended to file_list, and ensure JSON_PARTS only
receives non-empty file_list before adding ("${config}":[${file_list}]);
alternatively, tag partial files separately rather than adding them to
JSON_INPUT by filtering them out in this same place.

In `@benchmarks/m3/eval_m3.py`:
- Around line 2418-2422: The hoisted all_results list is not updated during
in-flight work so a KeyboardInterrupt/Exception handler can save only an
empty/incomplete file; modify run_config_mode, evaluate_tasks_in_batches, and
evaluate_single_task to accept a shared accumulator or checkpoint callback
(e.g., pass a parameter named checkpoint_callback or shared_results) and call it
to flush completed domain/batch/task results into the outer all_results before
each await/long-running loop step; alternatively, after each completed
domain/batch in those helpers, append results to the outer all_results via the
provided callback so the outer interrupt handlers will persist progress mid-run.

In `@benchmarks/m3/eval.sh`:
- Around line 375-380: The evaluator invocation(s) (the `uv run python -m
benchmarks.m3.eval_m3*` calls) are unguarded under set -e so failures trigger
the trap before EVAL_EXIT and the failure banner; modify the eval branch to
capture the evaluator exit code by temporarily disabling set -e or wrapping the
call in an if/then (e.g., use `set +e`/`set -e` around the call or `if ! uv run
...; then EVAL_EXIT=$?; echo ...; fi`) so that EVAL_EXIT is set and the else
block can run; ensure this change preserves calling `create_bundle` in the
success path and leaves `trap cleanup` behavior intact.

In `@CONTRIBUTING.md`:
- Around line 48-51: Update the comment describing the "just ci" target so it no
longer implies it runs the live smoke job; specifically change the line that
currently reads "just ci              # all of the above" to explicitly state
that "just ci" runs the automated tests (test-regression, security) but does not
run the live smoke target "just test-smoke-e2e" (the smoke job is
optional/manual and requires API keys/containers).
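
For the compare.sh item above, the partial-file filter could look roughly like this. The real aggregation is done in bash; `build_compare_input` and the example paths in the note are made up for illustration, and only the two prefixes come from the PR.

```python
import json
import os
from typing import Dict, List

PARTIAL_PREFIXES = ("m3_config_partial_", "m3_config_no_gt_partial_")


def build_compare_input(config_files: Dict[str, List[str]]) -> str:
    """Return a JSON object mapping each config to its non-partial result files."""
    filtered: Dict[str, List[str]] = {}
    for config, files in config_files.items():
        kept = [
            f for f in files
            if not os.path.basename(f).startswith(PARTIAL_PREFIXES)
        ]
        if kept:  # skip configs whose only artifacts were partial
            filtered[config] = kept
    return json.dumps(filtered, sort_keys=True)
```

Given `out/m3_config_1.json` and `out/m3_config_partial_2.json` for one config, only the first path reaches the comparison input.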

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 95ae6aa0-5d28-4472-aafd-0e9abb86b7a5

📥 Commits

Reviewing files that changed from the base of the PR and between 99a7326 and faa1e51.

📒 Files selected for processing (18)

.bob/commands/cuga-create-pr.md
.claude/commands/cuga-create-pr.md
.cursor/commands/cuga-create-pr.md
CONTRIBUTING.md
benchmarks/appworld/eval_appworld_sdk.py
benchmarks/helpers/compare_report.py
benchmarks/helpers/sdk_eval_helpers.py
benchmarks/helpers/tests/test_validate_bundle_report.py
benchmarks/helpers/validate_bundle_report.py
benchmarks/m3/compare.sh
benchmarks/m3/eval.sh
benchmarks/m3/eval_m3.py
benchmarks/m3/tests/conftest.py
benchmarks/m3/tests/test_partial_save_on_interrupt.py
justfile
pyproject.toml
scripts/smoke_benchmarks.sh
scripts/validate_bundle_report.py

coderabbitai · 2026-06-07T14:18:47Z

+    # Hoisted so the KeyboardInterrupt / Exception handlers below can save
+    # whatever was collected if the eval is interrupted (#91, #92). In
+    # sequential mode results are appended as tasks complete; in batched
+    # mode evaluate_tasks_in_batches replaces the list with its return.
+    all_results: List[Dict[str, Any]] = []


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Hoisting all_results here still drops in-flight progress on interrupt.

all_results is only updated after run_config_mode(..., defer_save=True), evaluate_tasks_in_batches(...), or evaluate_single_task(...) returns. If Ctrl-C lands mid-capability pass, mid-batch, or during a later domain of the current service, the new handlers will still save an empty/incomplete partial file even though inner locals already hold completed results. Thread a shared accumulator/checkpoint callback into those helpers, or flush completed domain/batch results into the outer list before the next await.

Also applies to: 2633-2639, 2722-2751

Valid observation, and accurately scoped as a heavy lift. evaluate_single_task does iterate multiple domains internally before returning, so an interrupt mid-task would still lose that task's in-flight domain progress — the hoisted all_results only captures completed tasks/batches, not partial progress within one.

That said, I'd treat this as a follow-up rather than blocking this PR: the change here moves the failure mode from 'lose everything collected across the entire run' (#91/#92) to 'lose at most the currently in-flight task/batch' — a meaningful reliability improvement on its own, and a large reduction in blast radius. Threading a checkpoint/accumulator callback through evaluate_single_task/evaluate_tasks_in_batches to get domain-level granularity is a reasonable next step; happy to file a tracking issue for it if that's useful.

@haroldship, that's a fair framing — the blast radius reduction from "lose the entire run" to "lose at most the in-flight task/batch" is a real and useful improvement on its own, and domain-level checkpointing is clearly a follow-up scope.

Happy to open a tracking issue for the checkpoint/accumulator callback work if you'd like — just say the word and I'll file it against this repo with the relevant context linked back to this PR.

@CodeRabbit please open the tracking issue

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
gh issue create \
  --repo cuga-project/cuga-eval \
  --assignee haroldship \
  --title "feat(m3): domain-level checkpoint callback for partial saves on interrupt" \
  --body "## Background

PR `#4` ([fix/issue-91-92-eval-resilience](https://github.com/cuga-project/cuga-eval/pull/4)) hoists \`all_results\` before the main \`try\` block in \`run_config_mode\` and saves partial results in the \`KeyboardInterrupt\`/\`Exception\` handlers. This reduces the blast radius from *\"lose the entire run\"* to *\"lose at most the in-flight task/batch\"*.

However, \`all_results\` is only updated **after** \`evaluate_single_task\` or \`evaluate_tasks_in_batches\` returns in full. If a Ctrl-C or exception lands mid-task (e.g. while a later domain within the same task is still running), the outer handlers persist an empty or incomplete file — the completed domains' results exist only in inner locals.

**Discussed in:** https://github.com/cuga-project/cuga-eval/pull/4#discussion_r3369504325

## Goal

Achieve **domain-level** checkpoint granularity so that any domain that completes before an interrupt is preserved, regardless of whether its parent task finished.

## Proposed approach

Thread a \`checkpoint_callback\` (or shared mutable \`shared_results\` list) parameter into \`evaluate_single_task\` and \`evaluate_tasks_in_batches\`. After each domain (or batch) result is appended in the inner helpers, invoke the callback to flush those results into the outer \`all_results\` before the next \`await\`. The outer interrupt handlers then persist whatever has accumulated.

Rough sketch:

\`\`\`python
# run_config_mode
all_results: List[Dict[str, Any]] = []

def _checkpoint(results):
    all_results.extend(results)

# pass _checkpoint into evaluate_single_task / evaluate_tasks_in_batches
\`\`\`

Within \`evaluate_single_task\`, call \`checkpoint_callback(evaluator.results)\` after each domain loop iteration completes successfully.

## Scope

- \`benchmarks/m3/eval_m3.py\`: \`run_config_mode\`, \`evaluate_single_task\`, \`evaluate_tasks_in_batches\`
- May also want to consider the same pattern for \`compare.sh\` sequential domain loops if applicable
- Add or extend regression tests in \`benchmarks/m3/tests/test_partial_save_on_interrupt.py\`
"

Length of output: 4256

@haroldship, the tracking issue is live: #54 — feat(m3): domain-level checkpoint callback for partial saves on interrupt

It's assigned to you and includes the full context, the proposed checkpoint_callback approach, and the files in scope (run_config_mode, evaluate_single_task, evaluate_tasks_in_batches, plus a note about extending the regression tests in test_partial_save_on_interrupt.py).
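
To make the proposed follow-up concrete, here is a minimal runnable sketch of a checkpoint callback. It is an illustration under stated assumptions, not the planned implementation: the function bodies, the `fail_on` parameter, the `saved` list and the domain names are hypothetical, and the crash is simulated with a `RuntimeError` so only the `except Exception` path is exercised.

```python
import asyncio
from typing import Any, Callable, Dict, List

Result = Dict[str, Any]
Checkpoint = Callable[[List[Result]], None]


async def evaluate_single_task(
    task: str, domains: List[str], checkpoint: Checkpoint, fail_on: str = ""
) -> None:
    """Evaluate each domain and flush its result as soon as it completes."""
    for domain in domains:
        if domain == fail_on:
            raise RuntimeError(f"simulated crash in {task}/{domain}")
        await asyncio.sleep(0)  # stands in for the real evaluation work
        checkpoint([{"task": task, "domain": domain}])


async def run_config_mode(domains: List[str], fail_on: str, saved: List[Result]) -> None:
    all_results: List[Result] = []  # hoisted, as in this PR
    try:
        await evaluate_single_task("m3_task_2", domains, all_results.extend, fail_on)
    except Exception:
        saved.extend(all_results)  # stand-in for the partial save
        raise


def demo_checkpoint() -> List[Result]:
    """Crash in the second domain; the first domain's result should survive."""
    saved: List[Result] = []
    try:
        asyncio.run(run_config_mode(["hockey", "olympics"], "olympics", saved))
    except RuntimeError:
        pass
    return saved
```

Without the callback, `all_results` would still be empty when the handler runs, which is the gap the tracking issue describes.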

🧬 Code Graph Analysis Results

benchmarks/m3/m3_vakra_score.py (lines 421-460)

def patch_tracker_scores(results: List[Dict[str, Any]], tracker: Any) -> int:
    """Sync Vakra-corrected scores back into the tracker's trajectory files.

    The tracker writes results.json/results.csv at ``finish_task`` time with
    the pre-Vakra keyword score. Vakra rescoring mutates
    ``results[i]["match_rate"]`` and ``results[i]["success"]`` in place, but
    those updated values never reach the trajectory bundle — so
    trajectories/results.json ends up at score=0 while report.md shows the
    correct score (issue `#71`).

    Call this immediately after ``score_results_async`` returns. It walks the
    rescored results and uses ``tracker.update_task`` to overwrite the stored
    ``score`` and the ``eval`` JSON blob with the Vakra verdict.

    Returns the number of tracker entries patched (useful for logging/tests).
    """
    if tracker is None or not getattr(tracker, "experiment_folder", None):
        return 0
    tracker_tasks = getattr(tracker, "tasks", None)
    if not isinstance(tracker_tasks, dict):
        return 0
    patched = 0
    for r in results:
        task_id = _result_uuid(r)
        if not task_id or task_id not in tracker_tasks:
            continue
        score = float(r.get("match_rate", 0.0))
        new_eval = json.dumps(
            {
                "task_name": r.get("task_name", task_id),
                "difficulty": r.get("difficulty", "unknown"),
                "success": bool(r.get("success", False)),
                "match_rate": score,
                "found_keywords": r.get("found_keywords", []),
                "missing_keywords": r.get("missing_keywords", []),
            }
        )
        if tracker.update_task(task_id=task_id, score=score, eval=new_eval):
            patched += 1
    return patched

🧠 Learnings used

Learnt from: haroldship Repo: cuga-project/cuga-eval PR: 4 File: benchmarks/m3/eval.sh:375-380 Timestamp: 2026-06-07T17:29:50.043Z Learning: In `benchmarks/m3/eval.sh`, the `ERR` trap fires on evaluator failure regardless of `set -e`/`set +e` state in bash. Because `cleanup()` unconditionally ends with `exit $exit_code`, the script never reaches `EVAL_EXIT=$?` or the `else` branch after the evaluator invocation. The "✗ M3 evaluation failed (exit code: ...)" banner is unreachable — this is a known cosmetic issue deferred to a follow-up. A real fix would require selectively suppressing ERR around the `uv run` invocation (`trap '' ERR; uv run ...; EVAL_EXIT=$?; trap cleanup ERR`), which has implications for the crash-salvage path and needs careful testing. The functional salvage behavior (create_bundle via ERR/EXIT trap) is already correct.

coderabbitai · 2026-06-07T14:18:47Z

+    # Create reproducibility bundle (idempotent — cleanup trap also calls
+    # this on interrupt/crash, see #91, #92).
+    create_bundle
 else
    echo -e "${RED:-}✗ M3 evaluation failed (exit code: $EVAL_EXIT)${NC:-}"
+    # cleanup trap will call create_bundle to salvage what we have.


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect the relevant region around the reported lines and the evaluator invocation.
sed -n '330,420p' benchmarks/m3/eval.sh | nl -ba

Repository: cuga-project/cuga-eval

Length of output: 109

🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="benchmarks/m3/eval.sh"

echo "=== Matches for key identifiers ==="
rg -n "EVAL_EXIT|uv run|M3 evaluation failed|create_bundle|cleanup trap|trap|EVAL" "$FILE" || true

echo
echo "=== Context around reported lines (330-420) ==="
awk 'NR>=330 && NR<=420 {printf "%d:%s\n", NR, $0}' "$FILE" | sed -n '1,200p'

Repository: cuga-project/cuga-eval

Length of output: 4843

🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="benchmarks/m3/eval.sh"

echo "=== File header (first 120 lines) ==="
awk 'NR>=1 && NR<=140 {print NR ":" $0}' "$FILE"

echo
echo "=== cleanup/create_bundle area (lines 110-260) ==="
awk 'NR>=110 && NR<=260 {print NR ":" $0}' "$FILE"

echo
echo "=== Any set +e / errexit toggling and handling near eval call (lines 340-390) ==="
rg -n "set \\+e|set -e|errexit|trap cleanup|cleanup\\(" "$FILE" || true

Repository: cuga-project/cuga-eval

Length of output: 12712

Guard the evaluator call so the failure banner runs under set -e.

With set -e and unguarded uv run python -m benchmarks.m3.eval_m3* ... calls in the evaluation branches, a non-zero exit triggers trap cleanup ... ERR/EXIT and aborts the script before reaching EVAL_EXIT=$? and the else block that prints ✗ M3 evaluation failed (exit code: $EVAL_EXIT). Wrap the evaluator invocation(s) to capture the exit code (e.g., set +e temporarily or if ! uv run ...; then ...; fi).

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@benchmarks/m3/eval.sh` around lines 375 - 380, The evaluator invocation(s) (the `uv run python -m benchmarks.m3.eval_m3*` calls) are unguarded under set -e so failures trigger the trap before EVAL_EXIT and the failure banner; modify the eval branch to capture the evaluator exit code by temporarily disabling set -e or wrapping the call in an if/then (e.g., use `set +e`/`set -e` around the call or `if ! uv run ...; then EVAL_EXIT=$?; echo ...; fi`) so that EVAL_EXIT is set and the else block can run; ensure this change preserves calling `create_bundle` in the success path and leaves `trap cleanup` behavior intact.

I looked at this closely, and it's technically correct but the fix isn't as simple as guarding the exit code — and the underlying salvage behavior (the actual point of this PR) is unaffected.

Verified empirically: the ERR trap fires on the evaluator's non-zero exit regardless of set -e/set +e state (tested with a minimal repro — set +e; false still triggers an ERR-trapped cleanup). And cleanup() here unconditionally ends with exit $exit_code, so it force-exits the moment it runs. That means wrapping the uv run call in set +e/set -e is a no-op — cleanup still fires immediately on failure and exits before EVAL_EXIT=$? is ever reached. I tried exactly that and reverted it once the test confirmed it changes nothing.

So yes, the "✗ M3 evaluation failed (exit code: ...)" banner in the else branch is genuinely unreachable — but that's cosmetic only. The actual salvage path (create_bundle running and the correct exit code propagating to the caller) already works correctly via the ERR/EXIT trap, which is the entire point of #91/#92.

A real fix would mean suppressing ERR specifically around the uv run invocation (trap '' ERR; uv run ...; EVAL_EXIT=$?; trap cleanup ERR) so the script falls through to the explicit branch instead of the trap firing immediately — which would change when the salvage bundle gets created (deferred to the script's natural exit) and needs careful testing against the crash-salvage path. Given the banner is purely informational and the functional behavior is already correct, I'd rather not touch trap semantics in this PR for a cosmetic message. Happy to file a follow-up if you'd like it tracked.

@haroldship, thanks for the thorough investigation — that's a subtlety I missed. You're right that trap ... ERR fires even under set +e, and since cleanup() calls exit $exit_code unconditionally, the script never reaches EVAL_EXIT=$? on the failure path regardless of how the uv run call is guarded. My suggested fix would indeed have been a no-op.

The dead else branch is cosmetic and the actual salvage behavior (the whole point of this PR) is already working correctly via the trap. Deferring this to a follow-up makes sense.

Would you like me to open a GitHub issue to track the follow-up (making the failure banner reachable by selectively suppressing ERR around the evaluator invocations)?

✏️ Learnings added

Learnt from: haroldship Repo: cuga-project/cuga-eval PR: 4 File: benchmarks/m3/eval.sh:375-380 Timestamp: 2026-06-07T17:29:50.043Z Learning: In `benchmarks/m3/eval.sh`, the `ERR` trap fires on evaluator failure regardless of `set -e`/`set +e` state in bash. Because `cleanup()` unconditionally ends with `exit $exit_code`, the script never reaches `EVAL_EXIT=$?` or the `else` branch after the evaluator invocation. The "✗ M3 evaluation failed (exit code: ...)" banner is unreachable — this is a known cosmetic issue deferred to a follow-up. A real fix would require selectively suppressing ERR around the `uv run` invocation (`trap '' ERR; uv run ...; EVAL_EXIT=$?; trap cleanup ERR`), which has implications for the crash-salvage path and needs careful testing. The functional salvage behavior (create_bundle via ERR/EXIT trap) is already correct.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

@CodeRabbit please open the tracking issue

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
gh issue create \
  --repo cuga-project/cuga-eval \
  --title 'fix(m3): make failure banner reachable by selectively suppressing ERR around evaluator invocations' \
  --body "## Background

In \`benchmarks/m3/eval.sh\`, the \`ERR\` trap fires on evaluator failure regardless of \`set -e\`/\`set +e\` state in bash. Because \`cleanup()\` unconditionally ends with \`exit \$exit_code\`, the script never reaches \`EVAL_EXIT=\$?\` or the \`else\` branch after the evaluator invocation. As a result, the \"✗ M3 evaluation failed (exit code: ...)\" banner is unreachable.

This was identified during review of PR `#4` (comment: https://github.com/cuga-project/cuga-eval/pull/4#discussion_r3369504327).

## Functional impact

None — the crash-salvage behavior (\`create_bundle\` via \`ERR\`/\`EXIT\` trap) is already correct. This is a cosmetic issue only.

## Proposed fix

Selectively suppress \`ERR\` around each \`uv run\` evaluator invocation so the script falls through to the explicit success/failure branches instead of immediately trapping:

\`\`\`bash
trap '' ERR
uv run python -m benchmarks.m3.eval_m3 ...
EVAL_EXIT=\$?
trap cleanup ERR
\`\`\`

This change needs careful testing against the crash-salvage path to ensure the bundle is still created correctly on early interrupt/exception.

## Requested by

`@haroldship`" \
  --assignee haroldship

Length of output: 1456

@haroldship, done — tracking issue opened: #55 (assigned to you).

haroldship · 2026-06-07T15:35:48Z

Sanity test results (PR #3 recipe + AppWorld/BPO)

Ran the same small m3_task_2/hockey recipe used to validate PR #3, plus small AppWorld and BPO sanity runs, against this branch (post-merge with main, commit faa1e51).

M3 eval (`cuga`, `--max-samples 2`)

caffeinate -i bash benchmarks/m3/eval.sh \
  --m3-data benchmarks/m3/data/small_train.zip \
  --capability m3_task_2 --domain hockey --max-samples 2

Completed cleanly — exit 0, no exceptions, bundle created (evaluation_bundles/20260607_141940_default).
Pass rate 0/2 (0%) — both misses are model-judging outcomes (wrong answer / groundedness on hockey task 2), not harness errors. 409K tokens, 23 LLM calls, 90.3s.

M3 compare (`--compare-agents --runs 2`)

caffeinate -i bash benchmarks/m3/compare.sh --compare-agents --runs 2 \
  --m3-data benchmarks/m3/data/small_train.zip \
  --capability m3_task_2 --domain hockey --max-samples 2

All 4 runs completed — 0 failed out of 4, total 5m29s, comparison bundle created (evaluation_bundles/20260607_142705_compare_gpt-oss_cuga_react).

Configuration                   Runs  Pass Rate  pass@2  pass^2  maj@2  Cons   Tokens     LLM   Time
cuga  — policies (GPT-OSS-120B)    2     50.0%   50.0%   50.0%  50.0%  1.00  167,838.0  13.0  22.4s
react — policies (GPT-OSS-120B)    2     25.0%   50.0%    0.0%   0.0%    --   63,086.5   5.5   4.9s

AppWorld (2 tasks)

caffeinate -i bash benchmarks/appworld/eval.sh --task 82e2fac_1 5238afc_1

Completed cleanly — exit 0, bundle created (evaluation_bundles/20260607_151534_default).

Worth flagging separately (pre-existing, unrelated to this PR — it doesn't touch any AppWorld code): both tasks hit the same agent-side bug (TypeError: 'NoneType' object is not subscriptable in cuga's format_subtask) and were marked failed by the dataset-level evaluator (Completed: 0/2, pass percentages 0.0 and 71.4). Yet the bundle's final "Evaluation Report" (Langfuse-trace based) only lists task 82e2fac_1 and reports a misleading "Pass Rate 1/1 (100%)" — silently dropping the failed task 5238afc_1 from the report. I'd suggest filing this as a separate issue against the AppWorld harness/report generator; happy to do so if useful.

BPO (2 tasks)

caffeinate -i bash benchmarks/bpo/eval.sh --task 1 2

Completed cleanly — exit 0, 2/2 (100%) pass rate, bundle created (evaluation_bundles/20260607_152441_default). 26.9K tokens, 5 LLM calls, 13.8s total.

All four harnesses run end-to-end without crashes after the merge with main. No regressions observed from the conflict resolution (load_env.sh, eval.sh, eval_m3.py, compare.sh).

haroldship · 2026-06-07T15:36:54Z

Manual test plan for this PR (interrupt / crash bundling)

This exercises the core fix: bundles and partial results should now be produced even when an M3 eval/compare run is interrupted (Ctrl-C) or crashes mid-flight, instead of being silently lost (issues #91/#92).

1. Ctrl-C mid-`eval.sh` run → bundle + partial results

caffeinate -i bash benchmarks/m3/eval.sh \
  --m3-data benchmarks/m3/data/small_train.zip \
  --capability m3_task_2 --domain hockey --max-samples 2

Let it run past the first task's completion (watch for tool_call/Vakra scoring lines for task 1 in the console log — the first full result write only happens at the very end of a normal run, so there's no earlier "saved" message to watch for), then press Ctrl-C.
Expect:
- Cleaning up... / Creating reproducibility bundle... printed by the cleanup trap (benchmarks/m3/eval.sh:207).
- A console log line 📁 Partial results (N task-results) saved to: .../m3_config_partial_*.json (or m3_config_no_gt_partial_*.json with --no-ground-truth), written by the KeyboardInterrupt/CancelledError handler in eval_m3.py:2733.
- A matching benchmarks/m3/results/m3_config_partial_*.json file appears on disk.
- A new bundle directory under benchmarks/m3/evaluation_bundles/ containing that partial result file, trajectories-so-far, and console/registry logs.
- Exit code reflects the interrupt (130), not 0.

2. Ctrl-C before any result is saved → no spurious bundle

Re-run the same command and press Ctrl-C immediately (before the first task finishes / before any results/m3_*.json is written with mtime >= RUN_START_TS).
Expect: create_bundle (eval.sh:127) finds no fresh result file and prints No result file from this run was found — skipping bundle. — no empty/bogus bundle directory is created.
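
The freshness rule this check relies on (only result files with mtime >= RUN_START_TS count as belonging to the current run) can be illustrated with a small Python sketch. This is not the logic in eval.sh, which is a shell script; newest_result_since and its parameters are hypothetical names used only for illustration.

```python
from pathlib import Path
from typing import Optional


def newest_result_since(
    results_dir: Path, run_start_ts: float, pattern: str = "m3_*.json"
) -> Optional[Path]:
    """Return the newest result file written at or after run_start_ts.

    Returns None when no file qualifies, which is the case where the
    bundle step should be skipped rather than packaging a stale result.
    """
    fresh = [
        p
        for p in results_dir.glob(pattern)
        if p.is_file() and p.stat().st_mtime >= run_start_ts
    ]
    # max() with default=None avoids a ValueError on an empty list.
    return max(fresh, key=lambda p: p.stat().st_mtime, default=None)
```

Comparing against the run's start timestamp, rather than taking the lexicographically last file, keeps a result left over from an earlier run from being bundled by mistake.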

3. Crash mid-run (simulate #92) → partial save still happens

Same recipe, but force a crash instead of Ctrl-C, e.g. kill the registry server mid-run (kill <registry_pid> from another terminal) so eval_m3.py raises an unhandled exception.
Expect: the generic except Exception as eval_err handler (eval_m3.py:2741) still writes m3_config_partial_*.json before re-raising, and the cleanup trap (triggered via ERR/EXIT) still creates a bundle from it.
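
A minimal sketch of the save-then-re-raise pattern that checks 1 and 3 exercise. It is not the code in eval_m3.py: run_with_partial_save, run_tasks and the timestamp suffix are hypothetical, and only the m3_config_partial prefix and the "write partial results, then propagate the error" behaviour are taken from the description above.

```python
import asyncio
import json
import time
from pathlib import Path
from typing import Any, Callable, Dict, List


def run_with_partial_save(
    run_tasks: Callable[[List[Dict[str, Any]]], None],
    results_dir: Path,
    prefix: str = "m3_config_partial",
) -> List[Dict[str, Any]]:
    """Run run_tasks, persisting whatever results exist if it is cut short.

    run_tasks (hypothetical) appends one dict per finished task to the list
    it is given, so the list always reflects the work completed so far.
    """
    results: List[Dict[str, Any]] = []

    def save_partial() -> Path:
        results_dir.mkdir(parents=True, exist_ok=True)
        path = results_dir / f"{prefix}_{int(time.time())}.json"
        path.write_text(json.dumps(results, indent=2))
        return path

    try:
        run_tasks(results)
    except (KeyboardInterrupt, asyncio.CancelledError):
        # Ctrl-C or task cancellation: neither is a subclass of Exception
        # on Python 3.8+, so they need their own clause.
        save_partial()
        raise
    except Exception:
        # Any other crash: save what we have, then let the error propagate
        # so the caller still sees a non-zero exit.
        save_partial()
        raise
    return results
```

Re-raising after the save matters for the shell wrapper: the process still exits non-zero, so the cleanup trap runs and can bundle the partial file.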

4. Success path → exactly one bundle, no duplication

caffeinate -i bash benchmarks/m3/eval.sh \
  --m3-data benchmarks/m3/data/small_train.zip \
  --capability m3_task_2 --domain hockey --max-samples 1

Let it run to completion normally.
Expect: Creating reproducibility bundle... is printed exactly once (the BUNDLE_DONE guard at eval.sh:120/128 makes create_bundle idempotent — the success-path call at eval.sh:377 and the trap's call at eval.sh:215 must not double-bundle). Confirm only one new directory appears under evaluation_bundles/.

5. Same checks for `compare.sh`

Repeat steps 1 and 4, this time running compare.sh --compare-agents --runs 2 ... instead of eval.sh. Ctrl-C during run 1 of N should still leave a comparison bundle with whatever runs completed; a full run should still produce exactly one comparison bundle.

6. Regression tests (automated, but worth re-running locally)

just test-regression          # includes benchmarks/m3/tests/test_partial_save_on_interrupt.py

7. Optional — full smoke harness (if you have live AppWorld + M3 + API keys)

just test-smoke-e2e

…es from comparisons The multiturn data file was misspelled "olympics_mutliturn.json" since the initial release, with eval.sh, compare.sh, eval_m3_multiturn.py, and create_eval_bundle.py all consistently referencing the same misspelling. Renamed the file and fixed all references for consistency. Also filter m3_config_partial_*/m3_config_no_gt_partial_* result files out of compare.sh's comparison set — interrupted/crashed runs were being folded into aggregate pass-rate and token totals alongside complete runs, skewing the comparison report.

agent_steps == 0 was falsy, so a task that genuinely completed in zero steps was recorded as "steps not measured" instead of 0.
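
The truthiness pitfall described here can be shown in a few lines. The function names below are hypothetical and only illustrate the pattern; they are not the helpers in eval_appworld_sdk.py or the report generator.

```python
from typing import Optional


def fmt_buggy(value: Optional[int]) -> str:
    # Truthiness check: 0 is falsy, so a measured zero renders as missing.
    return str(value) if value else "--"


def fmt_fixed(value: Optional[int]) -> str:
    # Explicit None check: only a truly absent value renders as missing.
    return "--" if value is None else str(value)


print(fmt_buggy(0), fmt_fixed(0), fmt_fixed(None))
```

The fix is to test `is None` wherever "not measured" and "measured as zero" must stay distinguishable.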

Use a shared _EMPTY_MARKERS set so "--", em-dash, and "-" placeholders are all recognized consistently as empty in both per-task and summary checks, and match the table header strictly on "Task" rather than any tuple member.
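
As a rough illustration of the shared-marker idea, assuming cells are plain strings taken from a report.md table row: EMPTY_MARKERS and is_empty_cell are hypothetical names, and the real _EMPTY_MARKERS set in validate_bundle_report.py may contain different entries.

```python
# Placeholders treated as "no value": double hyphen, single hyphen, em dash.
EMPTY_MARKERS = frozenset({"--", "-", "\u2014"})


def is_empty_cell(cell: str) -> bool:
    """Return True when a table cell holds only a placeholder marker."""
    return cell.strip() in EMPTY_MARKERS
```

Keeping one set and one predicate means the per-task check and the summary check cannot drift apart on which placeholders count as empty.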

CONTRIBUTING.md's "just ci # all of the above" was misleading — the recipe runs lint + test-regression + security only, not the full suite (smoke is optional/manual). README.md gains a section explaining how M3's eval.sh/compare.sh salvage a best-effort bundle and partial result file when a run is interrupted or crashes, and how compare.sh excludes those partials from aggregate totals.

… saves Companion to the file rename in 5e9cb6c — updates the references in eval.sh, compare.sh, eval_m3_multiturn.py, and create_eval_bundle.py to the corrected filename, and filters m3_config_partial_*/m3_config_no_gt_partial_* result files out of compare.sh's comparison set so an interrupted/crashed run's partial results don't skew aggregate pass-rate and token totals.
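
The exclusion rule for partial results can be expressed compactly in Python. This is only an illustration of the filtering semantics (compare.sh itself is a shell script); complete_results and PARTIAL_PATTERNS are hypothetical names, while the two filename patterns come from the commit message.

```python
from fnmatch import fnmatch
from pathlib import Path
from typing import List

# Result files written by interrupted or crashed runs.
PARTIAL_PATTERNS = ("m3_config_partial_*", "m3_config_no_gt_partial_*")


def complete_results(results_dir: Path) -> List[Path]:
    """List m3_config_*.json files, skipping partial-run results."""
    return sorted(
        p
        for p in results_dir.glob("m3_config_*.json")
        if p.is_file()
        and not any(fnmatch(p.name, pat) for pat in PARTIAL_PATTERNS)
    )
```

Note that a complete no-ground-truth result such as m3_config_no_gt_2.json still passes the filter; only names carrying the partial infix are dropped.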

coderabbitai

🧹 Nitpick comments (1)

benchmarks/m3/compare.sh (1)

420-435: 💤 Low value

Consider refactoring to avoid ls | grep pattern.

Shellcheck flags the ls | grep pattern as potentially fragile for non-alphanumeric filenames. Since the filenames here are controlled by eval_m3.py (not user input), the current implementation is safe. However, a more robust alternative would use a for loop with pattern matching:

_list_results_for_agent() {
    local agent="$1"
    if [[ "$agent" == "cuga" ]]; then
        for f in "$RESULTS_DIR"/m3_config_*.json; do
            [[ -f "$f" ]] || continue
            case "$(basename "$f")" in
                m3_config_partial_*|m3_config_no_gt_partial_*)
                    continue
                    ;;
            esac
            echo "$f"
        done | sort
    else
        # react: existing logic
        ...
    fi
}

This eliminates the ls | grep pattern and is more maintainable.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benchmarks/m3/compare.sh` around lines 420 - 435, The ls|grep pipeline used
to list results when checking "$RESULTS_DIR"/m3_config_*.json (for agent "cuga")
is fragile; replace it with a glob-based loop that iterates over
"$RESULTS_DIR"/m3_config_*.json, skips non-files (use [[ -f "$f" ]] ||
continue), filter out basenames matching m3_config_partial_* and
m3_config_no_gt_partial_* via a case or conditional, echo each accepted path,
and pipe the output to sort; update the branch that checks [[ "$agent" == "cuga"
]] to use this loop (referencing RESULTS_DIR and the m3_config_* patterns) so
you avoid ls | grep while preserving the same filtering semantics.

Source: Linters/SAST tools

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@benchmarks/m3/compare.sh`:
- Around line 420-435: The ls|grep pipeline used to list results when checking
"$RESULTS_DIR"/m3_config_*.json (for agent "cuga") is fragile; replace it with a
glob-based loop that iterates over "$RESULTS_DIR"/m3_config_*.json, skips
non-files (use [[ -f "$f" ]] || continue), filter out basenames matching
m3_config_partial_* and m3_config_no_gt_partial_* via a case or conditional,
echo each accepted path, and pipe the output to sort; update the branch that
checks [[ "$agent" == "cuga" ]] to use this loop (referencing RESULTS_DIR and
the m3_config_* patterns) so you avoid ls | grep while preserving the same
filtering semantics.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 95ca68e2-87c2-4507-a426-d9697c4002d6

📥 Commits

Reviewing files that changed from the base of the PR and between faa1e51 and e531eb3.

📒 Files selected for processing (9)

CONTRIBUTING.md
README.md
benchmarks/appworld/eval_appworld_sdk.py
benchmarks/helpers/validate_bundle_report.py
benchmarks/m3/compare.sh
benchmarks/m3/data/olympics_multiturn.json
benchmarks/m3/eval.sh
benchmarks/m3/eval_m3_multiturn.py
scripts/create_eval_bundle.py

✅ Files skipped from review due to trivial changes (2)

README.md
CONTRIBUTING.md

🚧 Files skipped from review as they are similar to previous changes (3)

benchmarks/helpers/validate_bundle_report.py
benchmarks/m3/eval.sh
benchmarks/appworld/eval_appworld_sdk.py

…resilience

Sergey-Zeltyn · 2026-06-08T13:57:21Z

Dead code: scripts/validate_bundle_report.py
scripts/validate_bundle_report.py duplicates benchmarks/helpers/validate_bundle_report.py:main() almost line-for-line, but scripts/smoke_benchmarks.sh:48 invokes python -m benchmarks.helpers.validate_bundle_report (per the 9133c39 follow-up referenced in the PR comments). The scripts/ copy looks like a leftover from before the module-based invocation was adopted — please delete it or wire the smoke script to call it.

Sergey-Zeltyn · 2026-06-08T14:02:07Z

Deferred import inside complete_and_eval
benchmarks/appworld/eval_appworld_sdk.py:316 does `from benchmarks.helpers.sdk_eval_helpers import _react_steps_from_invoke_result` inside the function. If there's a circular-import reason, please add a one-line comment; otherwise lift it to the module imports.

Sergey-Zeltyn · 2026-06-08T14:04:12Z

_parse_table_header returns [] on an empty line
benchmarks/helpers/validate_bundle_report.py:11-15:

cells = [c.strip() for c in line.strip().strip("|").split("|")]
if not cells or cells[0] == "Task":
    return cells
return None

The branch returns an empty list rather than None when cells is empty. Caller checks if cols and cols[0] == "Task" so this is harmless today, but it inverts the contract suggested by the type hint (list[str] | None). Either return None on empty, or simplify to return cells if cells and cells[0] == "Task" else None.
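
A sketch of the simplified form suggested above, written as a standalone function (parse_table_header here is a stand-in for the private helper, not the merged code). One detail worth knowing: str.split with an explicit separator always returns at least one element, so for an empty line cells is [''] rather than []; the simplified expression returns None in either case.

```python
from typing import List, Optional


def parse_table_header(line: str) -> Optional[List[str]]:
    """Return the header cells if the row starts with "Task", else None."""
    cells = [c.strip() for c in line.strip().strip("|").split("|")]
    return cells if cells and cells[0] == "Task" else None
```

With this shape the return value matches the list[str] | None annotation on every path, so callers can test the result with a plain `is not None`.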

Sergey-Zeltyn · 2026-06-08T14:05:20Z

free_port race in the smoke runner
scripts/smoke_benchmarks.sh:32-37 sends a default SIGTERM and sleeps a fixed 2 s. A stuck registry process may need a follow-up kill -9 (or a poll loop until the port is free) — otherwise the next M3 eval may still find 8001 occupied. Given this is exactly the failure that prompted the 9133c39 correction, it's worth hardening.
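
The hardening suggested here (poll until the port is really free, escalate to SIGKILL if SIGTERM is not enough) could look roughly like the Python sketch below. It is an illustration for POSIX systems, not the shell code in smoke_benchmarks.sh; port_is_free, free_port and the pid parameter are hypothetical, and how the real script finds the process holding the port is not shown in this thread.

```python
import os
import signal
import socket
import time


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if nothing accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(0.2)
        return s.connect_ex((host, port)) != 0


def free_port(port: int, pid: int, timeout: float = 10.0, poll: float = 0.2) -> bool:
    """Ask pid to exit, then poll the port; escalate to SIGKILL if needed."""
    for sig in (signal.SIGTERM, signal.SIGKILL):
        if port_is_free(port):
            return True
        try:
            os.kill(pid, sig)
        except ProcessLookupError:
            pass  # process already gone; keep polling the port
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            if port_is_free(port):
                return True
            time.sleep(poll)
    return port_is_free(port)
```

Polling the port itself, rather than sleeping a fixed interval, ties the wait to the condition the next M3 eval actually depends on. The connect-based probe only detects listening sockets, which is sufficient for a registry server but would miss a port that is bound without listening.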

Sergey-Zeltyn · 2026-06-08T14:06:55Z

Mass rename: any stale mutliturn references left?
The diff covers benchmarks/m3/eval.sh, benchmarks/m3/compare.sh, benchmarks/m3/eval_m3_multiturn.py, and scripts/create_eval_bundle.py. Worth a final rg -l mutliturn to confirm nothing in docs/bundles/tests still points at the misspelled filename, especially the M3_MULTITURN_DATA_FILE env-var convention which now defaults to the new name (any external caller exporting the old value gets a silent file-not-found).
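
For environments without rg, the same check can be approximated in Python. The sketch below is illustrative only: files_mentioning and stale_env_value are hypothetical helpers, and unlike rg this version does not honour .gitignore, so it also walks directories such as .git.

```python
from pathlib import Path
from typing import List, Mapping


def files_mentioning(root: Path, needle: str) -> List[Path]:
    """Return files under root whose contents contain needle."""
    hits: List[Path] = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8", errors="ignore")
        except OSError:
            continue  # unreadable file; skip rather than abort the scan
        if needle in text:
            hits.append(path)
    return hits


def stale_env_value(env: Mapping[str, str], var: str, needle: str) -> bool:
    """Return True if env[var] is set and still contains needle."""
    return needle in env.get(var, "")
```

For example, stale_env_value(os.environ, "M3_MULTITURN_DATA_FILE", "mutliturn") would flag a caller that still exports the misspelled filename, assuming os has been imported.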

Sergey-Zeltyn

See comments above.
All comments are minor, correct whatever you see fit

- Remove scripts/validate_bundle_report.py: dead-code duplicate of benchmarks/helpers/validate_bundle_report.py:main(), which is the module smoke_benchmarks.sh actually invokes.
- eval_appworld_sdk.py: lift the deferred _react_steps_from_invoke_result import to module level (no circular-import reason found).
- validate_bundle_report._parse_table_header: return None rather than [] on an empty header line, matching its list[str] | None contract.
- smoke_benchmarks.sh free_port: poll until the port is actually free after SIGTERM, falling back to SIGKILL; this closes the race that caused the earlier 9133c39 M3 port-8001 failure.

haroldship changed the title ~~fix(m3): bundle on interrupt and save partial results on crash (#91, #92)~~ fix(m3): bundle on interrupt and save partial results on crash May 26, 2026

haroldship added 3 commits May 28, 2026 15:14

Merge branch 'main' into fix/issue-91-92-eval-resilience

dba0164

chore: merge main into fix/issue-91-92-eval-resilience

fd9115c

chore: add e2e smoke benchmarks and bundle report validation

a3773d9

- Add scripts/smoke_benchmarks.sh and report.md validator
- Populate steps in AppWorld SDK results; show 0 in reports instead of --
- Add just test-smoke-e2e and document in CONTRIBUTING.md

fix: run bundle report validator as module in smoke script

9133c39

fix: smoke script only validates bundles from the current run

b1816fa

- Pick report.md by mtime since RUN_START_TS, not lexicographic tail
- Free port 8001 before M3 so registry can start after AppWorld

fix: strip inline comments when loading benchmark env files

d172bee

Inline `#` notes in m3.env were appended to values (e.g. CUGA_MODE), breaking bundle report generation. Document optional just test-smoke-e2e in create-pr.
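
To illustrate the inline-comment problem in isolation, here is a small Python sketch of env-line parsing. The actual fix lives in the shell helper load_env.sh, so this is only an analogy; parse_env_line is a hypothetical function, and the sample value "balanced" in the usage note is made up for the example.

```python
import re
from typing import Optional, Tuple

# An inline comment starts at a "#" preceded by whitespace.
_INLINE_COMMENT = re.compile(r"\s+#.*$")


def parse_env_line(line: str) -> Optional[Tuple[str, str]]:
    """Parse KEY=VALUE, dropping trailing "# ..." notes on unquoted values."""
    stripped = line.strip()
    if not stripped or stripped.startswith("#") or "=" not in stripped:
        return None
    key, _, value = stripped.partition("=")
    key, value = key.strip(), value.strip()
    if value[:1] in ("'", '"'):
        # Quoted value: keep everything up to the matching quote verbatim.
        closing = value.find(value[0], 1)
        if closing != -1:
            return key, value[1:closing]
    return key, _INLINE_COMMENT.sub("", value)
```

With this rule, parse_env_line("CUGA_MODE=balanced  # note") yields ("CUGA_MODE", "balanced"), while a "#" with no whitespace before it, as in a URL fragment, is left untouched.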

haroldship mentioned this pull request Jun 4, 2026

fix(m3): repair harness bugs that artificially zeroed CUGA M3 pass rate #3

Merged

6 tasks

Merge remote-tracking branch 'origin/main' into fix/issue-91-92-eval-…

faa1e51

…resilience

# Conflicts:
#   benchmarks/helpers/load_env.sh
#   benchmarks/m3/compare.sh
#   benchmarks/m3/eval.sh
#   benchmarks/m3/eval_m3.py

coderabbitai Bot reviewed Jun 7, 2026

haroldship requested a review from Sergey-Zeltyn June 7, 2026 20:00

This was referenced Jun 8, 2026

feat(m3): domain-level checkpoint callback for partial saves on interrupt #54

Open

fix(m3): make failure banner reachable by selectively suppressing ERR around evaluator invocations #55

Open

haroldship added 5 commits June 8, 2026 08:29

fix(appworld): treat zero step counts as measured, not missing

da830dc

agent_steps == 0 was falsy, so a task that genuinely completed in zero steps was recorded as "steps not measured" instead of 0.

coderabbitai Bot reviewed Jun 8, 2026

Merge remote-tracking branch 'origin/main' into fix/issue-91-92-eval-…

193fef2

…resilience

haroldship mentioned this pull request Jun 8, 2026

[Feature]: Evaluation bundle must include full Vakra judge scores in all artifacts (cuga + react) #56

Open

8 tasks

Sergey-Zeltyn reviewed Jun 8, 2026

haroldship merged commit 142c29a into main Jun 8, 2026
3 of 4 checks passed

haroldship deleted the fix/issue-91-92-eval-resilience branch June 8, 2026 19:57
