fix(m3): bundle on interrupt and save partial results on crash#4
Conversation
Two related issues:
interrupt (Ctrl-C) or crash skipped the bundle entirely, so long-running
runs lost all logs/trajectories/Langfuse traces even though they
existed on disk.
crashed, `save_evaluation_results` was never called, so the next
bundling step had no result JSON to package. Sequential and batched
modes already handled per-task failures gracefully; only the outer
KeyboardInterrupt / unhandled-exception path was missing.
Changes:
benchmarks/m3/eval.sh, benchmarks/m3/compare.sh:
- Extract bundle code into create_bundle / create_compare_bundle.
- Call from cleanup trap (EXIT INT TERM ERR) so a bundle is produced
on success, crash, or Ctrl-C. Idempotent via BUNDLE_DONE.
- eval.sh records RUN_START_TS so the bundle picks the result file
from *this* run, not a stale earlier one.
- Bundle subcommands wrapped with `|| echo ... best-effort` so a
report-generation hiccup can't suppress the bundle itself.
benchmarks/m3/eval_m3.py:
- Hoist all_results to before the eval try-block; use .extend() in
batched mode so an interrupt during gather doesn't drop captured
results.
- Add except (KeyboardInterrupt, asyncio.CancelledError) and except
Exception handlers that call save_evaluation_results with prefix
m3_config_partial (or m3_config_no_gt_partial) before re-raising.
benchmarks/m3/tests/test_partial_save_on_interrupt.py:
- New regression tests asserting save_evaluation_results writes a
valid JSON with the m3_config_partial / m3_config_no_gt_partial
prefixes that the interrupt path now emits.
benchmarks/m3/tests/conftest.py:
- Add project root to sys.path so the m3 tests are runnable in
isolation (matches the bpo conftest).
pyproject.toml:
- Switch pytest to --import-mode=importlib. Per-benchmark tests/
packages were colliding on the `tests.*` namespace, breaking
collection in `just ci` (introduced by PR #87). importlib mode
imports each test file independently. Unblocks `just ci`.
Closes #91 #92
- Add scripts/smoke_benchmarks.sh and report.md validator - Populate steps in AppWorld SDK results; show 0 in reports instead of -- - Add just test-smoke-e2e and document in CONTRIBUTING.md
Update
Live smoke ( |
|
Live smoke: Follow-up fix: |
|
Smoke correction: The earlier M3 smoke still blocked locally by cuga mode config path ( |
- Pick report.md by mtime since RUN_START_TS, not lexicographic tail - Free port 8001 before M3 so registry can start after AppWorld
|
Pushed M3 smoke retry: eval produced a bundle but the hockey task hit API/Langfuse auth errors and zero metrics in |
Inline `#` notes in m3.env were appended to values (e.g. CUGA_MODE), breaking bundle report generation. Document optional just test-smoke-e2e in create-pr.
|
Status: All CI checks green on latest Scope reminder: Core change is M3 interrupt bundling + partial save; follow-up commits add live smoke harness ( Ready for review when you are. |
…resilience # Conflicts: # benchmarks/helpers/load_env.sh # benchmarks/m3/compare.sh # benchmarks/m3/eval.sh # benchmarks/m3/eval_m3.py
|
Caution Review failedThe pull request is closed. ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
📝 WalkthroughWalkthroughThis PR implements smoke-testing validation infrastructure for the M3 and AppWorld benchmarks, hardens evaluation resilience via partial-result salvage on interrupt, improves step counting across evaluators, and refactors bundle creation into idempotent cleanup-trap handlers. Step counting is fixed to use invoke-derived metrics with fallbacks; bundle reports are validated via a new CLI utility; and M3 evaluation now persists partial results when interrupted, with centralized idempotent bundle creation in both eval and compare scripts. ChangesSmoke Testing and Evaluation Resilience
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related issues
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
benchmarks/m3/compare.sh (1)
382-393:⚠️ Potential issue | 🟠 MajorINT/TERM cleanup must terminate the script after bundling/cleanup.
trap compare_cleanup EXIT INT TERMruns the cleanup but the handler returns; during Ctrl-C whilebash "$SCRIPT_DIR/eval.sh" ...is in the per-runif ...; then ... else ... fi,set -ewon’t stop because the failure is inside theif, so the script can keep going with later runs. Split traps socompare_cleanupruns onEXIT, whileINT/TERMrun the cleanup and thenexitimmediately (e.g.,exit 130for INT andexit 143for TERM). Don’t rely on$?inside the INT/TERM handler for the exit code.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@benchmarks/m3/compare.sh` around lines 382 - 393, The compare_cleanup function is currently trapped for EXIT, INT, and TERM but returns to the script on SIGINT/SIGTERM allowing the main loop to continue; change the trap setup so compare_cleanup remains on EXIT only, and add separate INT and TERM handlers that call compare_cleanup (and create_compare_bundle if needed) then immediately exit with the conventional codes (exit 130 for INT, exit 143 for TERM) without relying on $?; update the trap invocation(s) near the existing "trap compare_cleanup EXIT INT TERM" line and reference compare_cleanup and create_compare_bundle when implementing the new INT/TERM handlers.
🧹 Nitpick comments (3)
benchmarks/helpers/validate_bundle_report.py (1)
68-74: ⚡ Quick winSummary validation only flags empty values, not missing lines.
The regex search on line 69 only matches when the "Label:" pattern exists in the text. If the summary section or a specific label is completely missing, no error is raised. This may be intentional (opt-in validation), but if the smoke test expects these metrics to always be present, the validator should flag missing lines as well.
Optional: require summary lines to exist
for label in ("Total Tokens", "Total LLM Calls", "Total Duration"): m = re.search(rf"\*\*{re.escape(label)}\*\*:\s*(.+)", text) - if m: + if not m: + errors.append(f"{path}: summary {label} is missing") + else: val = m.group(1).strip() if not val or val == "--": errors.append(f"{path}: summary {label} is empty")🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@benchmarks/helpers/validate_bundle_report.py` around lines 68 - 74, The current loop only appends an error when a matched label exists but has an empty value; update the validation to also append an error when the label line is entirely missing: inside the loop over labels ("Total Tokens", "Total LLM Calls", "Total Duration") check re.search(rf"\*\*{re.escape(label)}\*\*:\s*(.+)", text) and if m is None append errors.append(f"{path}: summary {label} is missing"); if m exists keep the existing val check (if not val or val == "--" append the same error). Ensure you reference the same local names (label, m, val, errors) when implementing the change.scripts/validate_bundle_report.py (1)
11-26: ⚡ Quick winDuplicate main() implementation.
This
main()function duplicates the logic inbenchmarks/helpers/validate_bundle_report.py:main(). Both files define argparse, file existence checks, and error printing.Consider removing this script and invoking the module directly via
python -m benchmarks.helpers.validate_bundle_report(which is whatsmoke_benchmarks.shline 50 already does). The CLI wrapper adds no extra functionality.Optional: remove duplicate script
Since
smoke_benchmarks.shline 50 already usesuv run python -m benchmarks.helpers.validate_bundle_report, this wrapper script is unused. Consider deletingscripts/validate_bundle_report.pyto reduce duplication.Alternatively, if the wrapper is needed for other use cases, refactor to avoid code duplication:
#!/usr/bin/env python3 """CLI wrapper for bundle report validation.""" from benchmarks.helpers.validate_bundle_report import main if __name__ == "__main__": raise SystemExit(main())🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/validate_bundle_report.py` around lines 11 - 26, The main() in this script duplicates the CLI in the benchmarks.helpers.validate_bundle_report module (same argparse, file checks and error printing); remove the duplicate by either deleting this script or replacing its implementation to import and delegate to benchmarks.helpers.validate_bundle_report.main (i.e., remove local argparse/validate_report_md logic and in the script implement a minimal CLI stub that calls the module's main and exits with its return code, invoking main() only under if __name__ == "__main__").benchmarks/appworld/eval_appworld_sdk.py (1)
318-324: 💤 Low valueConsider recording explicit zero step counts.
The
if agent_steps:guard on line 323 omitsresult["steps"]when the count is zero. If zero steps is a valid and meaningful state (e.g., immediate failure without tool calls), it should be recorded explicitly. If zero is truly "missing" data, the current behavior is correct.Optional: record zero explicitly if meaningful
- if agent_steps: + if agent_steps is not None: result["steps"] = agent_steps🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@benchmarks/appworld/eval_appworld_sdk.py` around lines 318 - 324, The current truthy check "if agent_steps:" drops zero counts; change the guard so zero is recorded by testing for None instead of truthiness — i.e., after computing agent_steps (via invoke_result_holder/_react_steps_from_invoke_result or fallback len(tracker.steps) or len(tool_calls)), set result["steps"]=agent_steps when agent_steps is not None (or otherwise explicitly handle zero as a valid value) so that 0 is stored but None still means "missing".
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@benchmarks/m3/compare.sh`:
- Around line 200-230: The JSON builder currently folds all entries from
CONFIG_RESULT_VALS into JSON_PARTS, including partial run artifacts; modify the
loop that iterates over files (the for f in $files loop that builds file_list)
to skip any filenames that start with "m3_config_partial_" or
"m3_config_no_gt_partial_" (or otherwise match a configured partial-file
pattern) so they are not appended to file_list, and ensure JSON_PARTS only
receives non-empty file_list before adding ("${config}":[${file_list}]);
alternatively, tag partial files separately rather than adding them to
JSON_INPUT by filtering them out in this same place.
In `@benchmarks/m3/eval_m3.py`:
- Around line 2418-2422: The hoisted all_results list is not updated during
in-flight work so a KeyboardInterrupt/Exception handler can save only an
empty/incomplete file; modify run_config_mode, evaluate_tasks_in_batches, and
evaluate_single_task to accept a shared accumulator or checkpoint callback
(e.g., pass a parameter named checkpoint_callback or shared_results) and call it
to flush completed domain/batch/task results into the outer all_results before
each await/long-running loop step; alternatively, after each completed
domain/batch in those helpers, append results to the outer all_results via the
provided callback so the outer interrupt handlers will persist progress mid-run.
In `@benchmarks/m3/eval.sh`:
- Around line 375-380: The evaluator invocation(s) (the `uv run python -m
benchmarks.m3.eval_m3*` calls) are unguarded under set -e so failures trigger
the trap before EVAL_EXIT and the failure banner; modify the eval branch to
capture the evaluator exit code by temporarily disabling set -e or wrapping the
call in an if/then (e.g., use `set +e`/`set -e` around the call or `if ! uv run
...; then EVAL_EXIT=$?; echo ...; fi`) so that EVAL_EXIT is set and the else
block can run; ensure this change preserves calling `create_bundle` in the
success path and leaves `trap cleanup` behavior intact.
In `@CONTRIBUTING.md`:
- Around line 48-51: Update the comment describing the "just ci" target so it no
longer implies it runs the live smoke job; specifically change the line that
currently reads "just ci # all of the above" to explicitly state
that "just ci" runs the automated tests (test-regression, security) but does not
run the live smoke target "just test-smoke-e2e" (the smoke job is
optional/manual and requires API keys/containers).
---
Outside diff comments:
In `@benchmarks/m3/compare.sh`:
- Around line 382-393: The compare_cleanup function is currently trapped for
EXIT, INT, and TERM but returns to the script on SIGINT/SIGTERM allowing the
main loop to continue; change the trap setup so compare_cleanup remains on EXIT
only, and add separate INT and TERM handlers that call compare_cleanup (and
create_compare_bundle if needed) then immediately exit with the conventional
codes (exit 130 for INT, exit 143 for TERM) without relying on $?; update the
trap invocation(s) near the existing "trap compare_cleanup EXIT INT TERM" line
and reference compare_cleanup and create_compare_bundle when implementing the
new INT/TERM handlers.
---
Nitpick comments:
In `@benchmarks/appworld/eval_appworld_sdk.py`:
- Around line 318-324: The current truthy check "if agent_steps:" drops zero
counts; change the guard so zero is recorded by testing for None instead of
truthiness — i.e., after computing agent_steps (via
invoke_result_holder/_react_steps_from_invoke_result or fallback
len(tracker.steps) or len(tool_calls)), set result["steps"]=agent_steps when
agent_steps is not None (or otherwise explicitly handle zero as a valid value)
so that 0 is stored but None still means "missing".
In `@benchmarks/helpers/validate_bundle_report.py`:
- Around line 68-74: The current loop only appends an error when a matched label
exists but has an empty value; update the validation to also append an error
when the label line is entirely missing: inside the loop over labels ("Total
Tokens", "Total LLM Calls", "Total Duration") check
re.search(rf"\*\*{re.escape(label)}\*\*:\s*(.+)", text) and if m is None append
errors.append(f"{path}: summary {label} is missing"); if m exists keep the
existing val check (if not val or val == "--" append the same error). Ensure you
reference the same local names (label, m, val, errors) when implementing the
change.
In `@scripts/validate_bundle_report.py`:
- Around line 11-26: The main() in this script duplicates the CLI in the
benchmarks.helpers.validate_bundle_report module (same argparse, file checks and
error printing); remove the duplicate by either deleting this script or
replacing its implementation to import and delegate to
benchmarks.helpers.validate_bundle_report.main (i.e., remove local
argparse/validate_report_md logic and in the script implement a minimal CLI stub
that calls the module's main and exits with its return code, invoking main()
only under if __name__ == "__main__").
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 95ae6aa0-5d28-4472-aafd-0e9abb86b7a5
📒 Files selected for processing (18)
.bob/commands/cuga-create-pr.md.claude/commands/cuga-create-pr.md.cursor/commands/cuga-create-pr.mdCONTRIBUTING.mdbenchmarks/appworld/eval_appworld_sdk.pybenchmarks/helpers/compare_report.pybenchmarks/helpers/sdk_eval_helpers.pybenchmarks/helpers/tests/test_validate_bundle_report.pybenchmarks/helpers/validate_bundle_report.pybenchmarks/m3/compare.shbenchmarks/m3/eval.shbenchmarks/m3/eval_m3.pybenchmarks/m3/tests/conftest.pybenchmarks/m3/tests/test_partial_save_on_interrupt.pyjustfilepyproject.tomlscripts/smoke_benchmarks.shscripts/validate_bundle_report.py
| # Hoisted so the KeyboardInterrupt / Exception handlers below can save | ||
| # whatever was collected if the eval is interrupted (#91, #92). In | ||
| # sequential mode results are appended as tasks complete; in batched | ||
| # mode evaluate_tasks_in_batches replaces the list with its return. | ||
| all_results: List[Dict[str, Any]] = [] |
There was a problem hiding this comment.
Hoisting all_results here still drops in-flight progress on interrupt.
all_results is only updated after run_config_mode(..., defer_save=True), evaluate_tasks_in_batches(...), or evaluate_single_task(...) returns. If Ctrl-C lands mid-capability pass, mid-batch, or during a later domain of the current service, the new handlers will still save an empty/incomplete partial file even though inner locals already hold completed results. Thread a shared accumulator/checkpoint callback into those helpers, or flush completed domain/batch results into the outer list before the next await.
Also applies to: 2633-2639, 2722-2751
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@benchmarks/m3/eval_m3.py` around lines 2418 - 2422, The hoisted all_results
list is not updated during in-flight work so a KeyboardInterrupt/Exception
handler can save only an empty/incomplete file; modify run_config_mode,
evaluate_tasks_in_batches, and evaluate_single_task to accept a shared
accumulator or checkpoint callback (e.g., pass a parameter named
checkpoint_callback or shared_results) and call it to flush completed
domain/batch/task results into the outer all_results before each
await/long-running loop step; alternatively, after each completed domain/batch
in those helpers, append results to the outer all_results via the provided
callback so the outer interrupt handlers will persist progress mid-run.
There was a problem hiding this comment.
Valid observation, and accurately scoped as a heavy lift. evaluate_single_task does iterate multiple domains internally before returning, so an interrupt mid-task would still lose that task's in-flight domain progress — the hoisted all_results only captures completed tasks/batches, not partial progress within one.
That said, I'd treat this as a follow-up rather than blocking this PR: the change here moves the failure mode from 'lose everything collected across the entire run' (#91/#92) to 'lose at most the currently in-flight task/batch' — a meaningful reliability improvement on its own, and a large reduction in blast radius. Threading a checkpoint/accumulator callback through evaluate_single_task/evaluate_tasks_in_batches to get domain-level granularity is a reasonable next step; happy to file a tracking issue for it if that's useful.
There was a problem hiding this comment.
@haroldship, that's a fair framing — the blast radius reduction from "lose the entire run" to "lose at most the in-flight task/batch" is a real and useful improvement on its own, and domain-level checkpointing is clearly a follow-up scope.
Happy to open a tracking issue for the checkpoint/accumulator callback work if you'd like — just say the word and I'll file it against this repo with the relevant context linked back to this PR.
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
gh issue create \
--repo cuga-project/cuga-eval \
--assignee haroldship \
--title "feat(m3): domain-level checkpoint callback for partial saves on interrupt" \
--body "## Background
PR `#4` ([fix/issue-91-92-eval-resilience](https://github.com/cuga-project/cuga-eval/pull/4)) hoists \`all_results\` before the main \`try\` block in \`run_config_mode\` and saves partial results in the \`KeyboardInterrupt\`/\`Exception\` handlers. This reduces the blast radius from *\"lose the entire run\"* to *\"lose at most the in-flight task/batch\"*.
However, \`all_results\` is only updated **after** \`evaluate_single_task\` or \`evaluate_tasks_in_batches\` returns in full. If a Ctrl-C or exception lands mid-task (e.g. while a later domain within the same task is still running), the outer handlers persist an empty or incomplete file — the completed domains' results exist only in inner locals.
**Discussed in:** https://github.com/cuga-project/cuga-eval/pull/4#discussion_r3369504325
## Goal
Achieve **domain-level** checkpoint granularity so that any domain that completes before an interrupt is preserved, regardless of whether its parent task finished.
## Proposed approach
Thread a \`checkpoint_callback\` (or shared mutable \`shared_results\` list) parameter into \`evaluate_single_task\` and \`evaluate_tasks_in_batches\`. After each domain (or batch) result is appended in the inner helpers, invoke the callback to flush those results into the outer \`all_results\` before the next \`await\`. The outer interrupt handlers then persist whatever has accumulated.
Rough sketch:
\`\`\`python
# run_config_mode
all_results: List[Dict[str, Any]] = []
def _checkpoint(results):
all_results.extend(results)
# pass _checkpoint into evaluate_single_task / evaluate_tasks_in_batches
\`\`\`
Within \`evaluate_single_task\`, call \`checkpoint_callback(evaluator.results)\` after each domain loop iteration completes successfully.
## Scope
- \`benchmarks/m3/eval_m3.py\`: \`run_config_mode\`, \`evaluate_single_task\`, \`evaluate_tasks_in_batches\`
- May also want to consider the same pattern for \`compare.sh\` sequential domain loops if applicable
- Add or extend regression tests in \`benchmarks/m3/tests/test_partial_save_on_interrupt.py\`
"Length of output: 4256
@haroldship, the tracking issue is live: #54 — feat(m3): domain-level checkpoint callback for partial saves on interrupt
It's assigned to you and includes the full context, the proposed checkpoint_callback approach, and the files in scope (run_config_mode, evaluate_single_task, evaluate_tasks_in_batches, plus a note about extending the regression tests in test_partial_save_on_interrupt.py).
🧬 Code Graph Analysis Results
benchmarks/m3/m3_vakra_score.py (lines 421-460)
def patch_tracker_scores(results: List[Dict[str, Any]], tracker: Any) -> int:
"""Sync Vakra-corrected scores back into the tracker's trajectory files.
The tracker writes results.json/results.csv at ``finish_task`` time with the
pre-Vakra keyword score. Vakra rescoring mutates ``results[i]["match_rate"]``
and ``results[i]["success"]`` in place, but those updated values never reach
the trajectory bundle — so trajectories/results.json ends up at score=0
while report.md shows the correct score (issue `#71`).
Call this immediately after ``score_results_async`` returns. It walks the
rescored results and uses ``tracker.update_task`` to overwrite the stored
``score`` and the ``eval`` JSON blob with the Vakra verdict.
Returns the number of tracker entries patched (useful for logging/tests).
"""
if tracker is None or not getattr(tracker, "experiment_folder", None):
return 0
tracker_tasks = getattr(tracker, "tasks", None)
if not isinstance(tracker_tasks, dict):
return 0
patched = 0
for r in results:
task_id = _result_uuid(r)
if not task_id or task_id not in tracker_tasks:
continue
score = float(r.get("match_rate", 0.0))
new_eval = json.dumps(
{
"task_name": r.get("task_name", task_id),
"difficulty": r.get("difficulty", "unknown"),
"success": bool(r.get("success", False)),
"match_rate": score,
"found_keywords": r.get("found_keywords", []),
"missing_keywords": r.get("missing_keywords", []),
}
)
if tracker.update_task(task_id=task_id, score=score, eval=new_eval):
patched += 1
return patched🧠 Learnings used
Learnt from: haroldship
Repo: cuga-project/cuga-eval PR: 4
File: benchmarks/m3/eval.sh:375-380
Timestamp: 2026-06-07T17:29:50.043Z
Learning: In `benchmarks/m3/eval.sh`, the `ERR` trap fires on evaluator failure regardless of `set -e`/`set +e` state in bash. Because `cleanup()` unconditionally ends with `exit $exit_code`, the script never reaches `EVAL_EXIT=$?` or the `else` branch after the evaluator invocation. The "✗ M3 evaluation failed (exit code: ...)" banner is unreachable — this is a known cosmetic issue deferred to a follow-up. A real fix would require selectively suppressing ERR around the `uv run` invocation (`trap '' ERR; uv run ...; EVAL_EXIT=$?; trap cleanup ERR`), which has implications for the crash-salvage path and needs careful testing. The functional salvage behavior (create_bundle via ERR/EXIT trap) is already correct.
| # Create reproducibility bundle (idempotent — cleanup trap also calls | ||
| # this on interrupt/crash, see #91, #92). | ||
| create_bundle | ||
| else | ||
| echo -e "${RED:-}✗ M3 evaluation failed (exit code: $EVAL_EXIT)${NC:-}" | ||
| # cleanup trap will call create_bundle to salvage what we have. |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect the relevant region around the reported lines and the evaluator invocation.
sed -n '330,420p' benchmarks/m3/eval.sh | nl -baRepository: cuga-project/cuga-eval
Length of output: 109
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="benchmarks/m3/eval.sh"
echo "=== Matches for key identifiers ==="
rg -n "EVAL_EXIT|uv run|M3 evaluation failed|create_bundle|cleanup trap|trap|EVAL" "$FILE" || true
echo
echo "=== Context around reported lines (330-420) ==="
awk 'NR>=330 && NR<=420 {printf "%d:%s\n", NR, $0}' "$FILE" | sed -n '1,200p'Repository: cuga-project/cuga-eval
Length of output: 4843
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="benchmarks/m3/eval.sh"
echo "=== File header (first 120 lines) ==="
awk 'NR>=1 && NR<=140 {print NR ":" $0}' "$FILE"
echo
echo "=== cleanup/create_bundle area (lines 110-260) ==="
awk 'NR>=110 && NR<=260 {print NR ":" $0}' "$FILE"
echo
echo "=== Any set +e / errexit toggling and handling near eval call (lines 340-390) ==="
rg -n "set \\+e|set -e|errexit|trap cleanup|cleanup\\(" "$FILE" || trueRepository: cuga-project/cuga-eval
Length of output: 12712
Guard the evaluator call so the failure banner runs under set -e.
With set -e and unguarded uv run python -m benchmarks.m3.eval_m3* ... calls in the evaluation branches, a non-zero exit triggers trap cleanup ... ERR/EXIT and aborts the script before reaching EVAL_EXIT=$? and the else block that prints ✗ M3 evaluation failed (exit code: $EVAL_EXIT). Wrap the evaluator invocation(s) to capture the exit code (e.g., set +e temporarily or if ! uv run ...; then ...; fi).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@benchmarks/m3/eval.sh` around lines 375 - 380, The evaluator invocation(s)
(the `uv run python -m benchmarks.m3.eval_m3*` calls) are unguarded under set -e
so failures trigger the trap before EVAL_EXIT and the failure banner; modify the
eval branch to capture the evaluator exit code by temporarily disabling set -e
or wrapping the call in an if/then (e.g., use `set +e`/`set -e` around the call
or `if ! uv run ...; then EVAL_EXIT=$?; echo ...; fi`) so that EVAL_EXIT is set
and the else block can run; ensure this change preserves calling `create_bundle`
in the success path and leaves `trap cleanup` behavior intact.
There was a problem hiding this comment.
I looked at this closely, and it's technically correct but the fix isn't as simple as guarding the exit code — and the underlying salvage behavior (the actual point of this PR) is unaffected.
Verified empirically: the ERR trap fires on the evaluator's non-zero exit regardless of set -e/set +e state (tested with a minimal repro — set +e; false still triggers an ERR-trapped cleanup). And cleanup() here unconditionally ends with exit $exit_code, so it force-exits the moment it runs. That means wrapping the uv run call in set +e/set -e is a no-op — cleanup still fires immediately on failure and exits before EVAL_EXIT=$? is ever reached. I tried exactly that and reverted it once the test confirmed it changes nothing.
So yes, the "✗ M3 evaluation failed (exit code: ...)" banner in the else branch is genuinely unreachable — but that's cosmetic only. The actual salvage path (create_bundle running and the correct exit code propagating to the caller) already works correctly via the ERR/EXIT trap, which is the entire point of #91/#92.
A real fix would mean suppressing ERR specifically around the uv run invocation (trap '' ERR; uv run ...; EVAL_EXIT=$?; trap cleanup ERR) so the script falls through to the explicit branch instead of the trap firing immediately — which would change when the salvage bundle gets created (deferred to the script's natural exit) and needs careful testing against the crash-salvage path. Given the banner is purely informational and the functional behavior is already correct, I'd rather not touch trap semantics in this PR for a cosmetic message. Happy to file a follow-up if you'd like it tracked.
There was a problem hiding this comment.
@haroldship, thanks for the thorough investigation — that's a subtlety I missed. You're right that trap ... ERR fires even under set +e, and since cleanup() calls exit $exit_code unconditionally, the script never reaches EVAL_EXIT=$? on the failure path regardless of how the uv run call is guarded. My suggested fix would indeed have been a no-op.
The dead else branch is cosmetic and the actual salvage behavior (the whole point of this PR) is already working correctly via the trap. Deferring this to a follow-up makes sense.
Would you like me to open a GitHub issue to track the follow-up (making the failure banner reachable by selectively suppressing ERR around the evaluator invocations)?
✏️ Learnings added
Learnt from: haroldship
Repo: cuga-project/cuga-eval PR: 4
File: benchmarks/m3/eval.sh:375-380
Timestamp: 2026-06-07T17:29:50.043Z
Learning: In `benchmarks/m3/eval.sh`, the `ERR` trap fires on evaluator failure regardless of `set -e`/`set +e` state in bash. Because `cleanup()` unconditionally ends with `exit $exit_code`, the script never reaches `EVAL_EXIT=$?` or the `else` branch after the evaluator invocation. The "✗ M3 evaluation failed (exit code: ...)" banner is unreachable — this is a known cosmetic issue deferred to a follow-up. A real fix would require selectively suppressing ERR around the `uv run` invocation (`trap '' ERR; uv run ...; EVAL_EXIT=$?; trap cleanup ERR`), which has implications for the crash-salvage path and needs careful testing. The functional salvage behavior (create_bundle via ERR/EXIT trap) is already correct.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
gh issue create \
--repo cuga-project/cuga-eval \
--title 'fix(m3): make failure banner reachable by selectively suppressing ERR around evaluator invocations' \
--body "## Background
In \`benchmarks/m3/eval.sh\`, the \`ERR\` trap fires on evaluator failure regardless of \`set -e\`/\`set +e\` state in bash. Because \`cleanup()\` unconditionally ends with \`exit \$exit_code\`, the script never reaches \`EVAL_EXIT=\$?\` or the \`else\` branch after the evaluator invocation. As a result, the \"✗ M3 evaluation failed (exit code: ...)\" banner is unreachable.
This was identified during review of PR `#4` (comment: https://github.com/cuga-project/cuga-eval/pull/4#discussion_r3369504327).
## Functional impact
None — the crash-salvage behavior (\`create_bundle\` via \`ERR\`/\`EXIT\` trap) is already correct. This is a cosmetic issue only.
## Proposed fix
Selectively suppress \`ERR\` around each \`uv run\` evaluator invocation so the script falls through to the explicit success/failure branches instead of immediately trapping:
\`\`\`bash
trap '' ERR
uv run python -m benchmarks.m3.eval_m3 ...
EVAL_EXIT=\$?
trap cleanup ERR
\`\`\`
This change needs careful testing against the crash-salvage path to ensure the bundle is still created correctly on early interrupt/exception.
## Requested by
`@haroldship`" \
--assignee haroldshipLength of output: 1456
@haroldship, done — tracking issue opened: #55 (assigned to you).
🧠 Learnings used
Learnt from: haroldship
Repo: cuga-project/cuga-eval PR: 4
File: benchmarks/m3/eval.sh:375-380
Timestamp: 2026-06-07T17:29:50.043Z
Learning: In `benchmarks/m3/eval.sh`, the `ERR` trap fires on evaluator failure regardless of `set -e`/`set +e` state in bash. Because `cleanup()` unconditionally ends with `exit $exit_code`, the script never reaches `EVAL_EXIT=$?` or the `else` branch after the evaluator invocation. The "✗ M3 evaluation failed (exit code: ...)" banner is unreachable — this is a known cosmetic issue deferred to a follow-up. A real fix would require selectively suppressing ERR around the `uv run` invocation (`trap '' ERR; uv run ...; EVAL_EXIT=$?; trap cleanup ERR`), which has implications for the crash-salvage path and needs careful testing. The functional salvage behavior (create_bundle via ERR/EXIT trap) is already correct.
Sanity test results (PR #3 recipe + AppWorld/BPO)Ran the same small M3 eval (
|
Manual test plan for this PR (interrupt / crash bundling)This exercises the core fix: bundles and partial results should now be produced even when an M3 eval/compare run is interrupted (Ctrl-C) or crashes mid-flight, instead of being silently lost (issues #91/#92). 1. Ctrl-C mid-
|
…es from comparisons The multiturn data file was misspelled "olympics_mutliturn.json" since the initial release, with eval.sh, compare.sh, eval_m3_multiturn.py, and create_eval_bundle.py all consistently referencing the same misspelling. Renamed the file and fixed all references for consistency. Also filter m3_config_partial_*/m3_config_no_gt_partial_* result files out of compare.sh's comparison set — interrupted/crashed runs were being folded into aggregate pass-rate and token totals alongside complete runs, skewing the comparison report.
agent_steps == 0 was falsy, so a task that genuinely completed in zero steps was recorded as "steps not measured" instead of 0.
Use a shared _EMPTY_MARKERS set so "--", em-dash, and "-" placeholders are all recognized consistently as empty in both per-task and summary checks, and match the table header strictly on "Task" rather than any tuple member.
CONTRIBUTING.md's "just ci # all of the above" was misleading — the recipe runs lint + test-regression + security only, not the full suite (smoke is optional/manual). README.md gains a section explaining how M3's eval.sh/compare.sh salvage a best-effort bundle and partial result file when a run is interrupted or crashes, and how compare.sh excludes those partials from aggregate totals.
… saves Companion to the file rename in 5e9cb6c — updates the references in eval.sh, compare.sh, eval_m3_multiturn.py, and create_eval_bundle.py to the corrected filename, and filters m3_config_partial_*/m3_config_no_gt_partial_* result files out of compare.sh's comparison set so an interrupted/crashed run's partial results don't skew aggregate pass-rate and token totals.
There was a problem hiding this comment.
🧹 Nitpick comments (1)
benchmarks/m3/compare.sh (1)
420-435: 💤 Low valueConsider refactoring to avoid
ls | greppattern.Shellcheck flags the
ls | greppattern as potentially fragile for non-alphanumeric filenames. Since the filenames here are controlled byeval_m3.py(not user input), the current implementation is safe. However, a more robust alternative would use a for loop with pattern matching:_list_results_for_agent() { local agent="$1" if [[ "$agent" == "cuga" ]]; then for f in "$RESULTS_DIR"/m3_config_*.json; do [[ -f "$f" ]] || continue case "$(basename "$f")" in m3_config_partial_*|m3_config_no_gt_partial_*) continue ;; esac echo "$f" done | sort else # react: existing logic ... fi }This eliminates the
ls | greppattern and is more maintainable.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@benchmarks/m3/compare.sh` around lines 420 - 435, The ls|grep pipeline used to list results when checking "$RESULTS_DIR"/m3_config_*.json (for agent "cuga") is fragile; replace it with a glob-based loop that iterates over "$RESULTS_DIR"/m3_config_*.json, skips non-files (use [[ -f "$f" ]] || continue), filter out basenames matching m3_config_partial_* and m3_config_no_gt_partial_* via a case or conditional, echo each accepted path, and pipe the output to sort; update the branch that checks [[ "$agent" == "cuga" ]] to use this loop (referencing RESULTS_DIR and the m3_config_* patterns) so you avoid ls | grep while preserving the same filtering semantics.Source: Linters/SAST tools
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@benchmarks/m3/compare.sh`:
- Around line 420-435: The ls|grep pipeline used to list results when checking
"$RESULTS_DIR"/m3_config_*.json (for agent "cuga") is fragile; replace it with a
glob-based loop that iterates over "$RESULTS_DIR"/m3_config_*.json, skips
non-files (use [[ -f "$f" ]] || continue), filter out basenames matching
m3_config_partial_* and m3_config_no_gt_partial_* via a case or conditional,
echo each accepted path, and pipe the output to sort; update the branch that
checks [[ "$agent" == "cuga" ]] to use this loop (referencing RESULTS_DIR and
the m3_config_* patterns) so you avoid ls | grep while preserving the same
filtering semantics.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 95ca68e2-87c2-4507-a426-d9697c4002d6
📒 Files selected for processing (9)
CONTRIBUTING.mdREADME.mdbenchmarks/appworld/eval_appworld_sdk.pybenchmarks/helpers/validate_bundle_report.pybenchmarks/m3/compare.shbenchmarks/m3/data/olympics_multiturn.jsonbenchmarks/m3/eval.shbenchmarks/m3/eval_m3_multiturn.pyscripts/create_eval_bundle.py
✅ Files skipped from review due to trivial changes (2)
- README.md
- CONTRIBUTING.md
🚧 Files skipped from review as they are similar to previous changes (3)
- benchmarks/helpers/validate_bundle_report.py
- benchmarks/m3/eval.sh
- benchmarks/appworld/eval_appworld_sdk.py
|
Dead code: scripts/validate_bundle_report.py |
|
Deferred import inside complete_and_eval |
|
_parse_table_header returns [] on an empty line The branch returns an empty list rather than None when cells is empty. Caller checks |
|
free_port race in the smoke runner |
|
Mass rename: any stale mutliturn references left? |
Sergey-Zeltyn
left a comment
There was a problem hiding this comment.
See comments above.
All comments are minor, correct whatever you see fit
- Remove scripts/validate_bundle_report.py: dead-code duplicate of benchmarks/helpers/validate_bundle_report.py:main(), which is the module smoke_benchmarks.sh actually invokes. - eval_appworld_sdk.py: lift the deferred _react_steps_from_invoke_result import to module level (no circular-import reason found). - validate_bundle_report._parse_table_header: return None rather than [] on an empty header line, matching its list[str] | None contract. - smoke_benchmarks.sh free_port: poll until the port is actually free after SIGTERM, falling back to SIGKILL — closes the race that caused the earlier 9133c39 M3 port-8001 failure.
Summary
Two related crash-resilience issues in the M3 evaluation flow:
EVAL_EXIT==0. Any Ctrl-C or crash skipped the bundle entirely.eval_m3.pywas interrupted or crashed at the outer level,save_evaluation_resultswas never called, so bundling had no result JSON.What changed
M3 interrupt / partial save (core)
benchmarks/m3/eval.sh,compare.sh: bundle from cleanup trap;RUN_START_TSfor correct result file; idempotentBUNDLE_DONE.benchmarks/m3/eval_m3.py: partial save onKeyboardInterrupt/ outer exceptions (m3_config_partial_*prefixes).benchmarks/m3/tests/test_partial_save_on_interrupt.py: regression tests.Merge from
main+ smoke harness (follow-up)main.just test-smoke-e2e: one AppWorld SDK task, one ReAct task, one M3 hockey (m3_task_2,--max-samples 1); validates bundlereport.mdmetrics (tokens, steps, duration, etc.; cost may be--).scripts/smoke_benchmarks.sh,benchmarks/helpers/validate_bundle_report.py.stepsin results for report tables;_fmtshows0instead of--for numeric zeros.Other
pyproject.toml: pytest--import-mode=importlib(fixes test collection with multipletests/packages).Verification
just lintjust test-regression(270 passed on branch HEAD)just securityjust test-smoke-e2e(live; needs AppWorld + M3 containers + API keys)Test plan
m3_config_partial_*.jsoncreatedjust test-smoke-e2eon a machine with full benchmark runtimeSummary by CodeRabbit
New Features
Bug Fixes
Documentation
Tests