
fix(m3): bundle on interrupt and save partial results on crash #4

Merged: haroldship merged 15 commits into main from fix/issue-91-92-eval-resilience on Jun 8, 2026

Conversation

@sami-marreed commented May 26, 2026

Contributor

Summary

Two related crash-resilience issues in the M3 evaluation flow:

  • (internal #91) — eval/compare bundles were only created when EVAL_EXIT==0. Any Ctrl-C or crash skipped the bundle entirely.
  • (internal #92) — when eval_m3.py was interrupted or crashed at the outer level, save_evaluation_results was never called, so bundling had no result JSON.

Internal issue numbers were not migrated to this repo. Closest open issues: #11, #12.

What changed

M3 interrupt / partial save (core)

  • benchmarks/m3/eval.sh, compare.sh: bundle from cleanup trap; RUN_START_TS for correct result file; idempotent BUNDLE_DONE.
  • benchmarks/m3/eval_m3.py: partial save on KeyboardInterrupt / outer exceptions (m3_config_partial_* prefixes).
  • benchmarks/m3/tests/test_partial_save_on_interrupt.py: regression tests.

Merge from main + smoke harness (follow-up)

  • Merged latest main.
  • just test-smoke-e2e: one AppWorld SDK task, one ReAct task, one M3 hockey (m3_task_2, --max-samples 1); validates bundle report.md metrics (tokens, steps, duration, etc.; cost may be --).
  • scripts/smoke_benchmarks.sh, benchmarks/helpers/validate_bundle_report.py.
  • AppWorld SDK: populate steps in results for report tables; _fmt shows 0 instead of -- for numeric zeros.

Other

  • pyproject.toml: pytest --import-mode=importlib (fixes test collection with multiple tests/ packages).

Verification

  • just lint
  • just test-regression (270 passed on branch HEAD)
  • just security
  • just test-smoke-e2e (live; needs AppWorld + M3 containers + API keys)

Test plan

  • Ctrl-C mid M3 eval → bundle + partial m3_config_partial_*.json created
  • Success path still produces exactly one bundle
  • Optional: just test-smoke-e2e on a machine with full benchmark runtime

Summary by CodeRabbit

  • New Features

    • Added live smoke testing capability with prerequisite validation
    • Added bundle report validation to detect missing metrics
    • Improved resilience with automatic partial result saving on interrupts or crashes
  • Bug Fixes

    • Fixed typos in benchmark data file names
  • Documentation

    • Updated contributing guide and command templates with smoke testing instructions
    • Added documentation on partial result recovery and bundle resilience
  • Tests

    • Added smoke testing and bundle validation test coverage
    • Added regression tests for partial result saving

Two related issues:

#91: eval/compare bundles were only created when EVAL_EXIT==0, so any
interrupt (Ctrl-C) or crash skipped the bundle entirely, and long-running
runs lost all logs/trajectories/Langfuse traces even though they
existed on disk.

#92: when eval_m3.py was interrupted or crashed at the outer level,
`save_evaluation_results` was never called, so the next
bundling step had no result JSON to package. Sequential and batched
modes already handled per-task failures gracefully; only the outer
KeyboardInterrupt / unhandled-exception path was missing.

Changes:

benchmarks/m3/eval.sh, benchmarks/m3/compare.sh:
  - Extract bundle code into create_bundle / create_compare_bundle.
  - Call from cleanup trap (EXIT INT TERM ERR) so a bundle is produced
    on success, crash, or Ctrl-C. Idempotent via BUNDLE_DONE.
  - eval.sh records RUN_START_TS so the bundle picks the result file
    from *this* run, not a stale earlier one.
  - Bundle subcommands wrapped with `|| echo ... best-effort` so a
    report-generation hiccup can't suppress the bundle itself.

benchmarks/m3/eval_m3.py:
  - Hoist all_results to before the eval try-block; use .extend() in
    batched mode so an interrupt during gather doesn't drop captured
    results.
  - Add except (KeyboardInterrupt, asyncio.CancelledError) and except
    Exception handlers that call save_evaluation_results with prefix
    m3_config_partial (or m3_config_no_gt_partial) before re-raising.

benchmarks/m3/tests/test_partial_save_on_interrupt.py:
  - New regression tests asserting save_evaluation_results writes a
    valid JSON with the m3_config_partial / m3_config_no_gt_partial
    prefixes that the interrupt path now emits.

benchmarks/m3/tests/conftest.py:
  - Add project root to sys.path so the m3 tests are runnable in
    isolation (matches the bpo conftest).

pyproject.toml:
  - Switch pytest to --import-mode=importlib. Per-benchmark tests/
    packages were colliding on the `tests.*` namespace, breaking
    collection in `just ci` (introduced by PR #87). importlib mode
    imports each test file independently. Unblocks `just ci`.
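
The eval_m3.py change listed above (hoist the results list, save on interrupt, then re-raise) can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the PR: `save_results`, `run_eval`, the `interrupt_at` parameter, and the `_demo.json` file name are hypothetical stand-ins; only the `m3_config_partial` prefix and the exception types come from the description.

```python
import asyncio
import json
from pathlib import Path
from typing import Any, Dict, List


def save_results(results: List[Dict[str, Any]], out_dir: Path, prefix: str) -> Path:
    """Hypothetical stand-in for save_evaluation_results: write results as JSON."""
    path = out_dir / f"{prefix}_demo.json"
    path.write_text(json.dumps(results, indent=2))
    return path


def run_eval(tasks: List[str], out_dir: Path, interrupt_at: str = "") -> Path:
    # Hoisted above the try block so the handlers can see what was collected.
    all_results: List[Dict[str, Any]] = []
    try:
        for task in tasks:
            if task == interrupt_at:
                raise KeyboardInterrupt  # simulate Ctrl-C mid-run
            all_results.append({"task": task, "success": True})
    except (KeyboardInterrupt, asyncio.CancelledError):
        # Persist whatever finished, then re-raise so the caller sees the interrupt.
        save_results(all_results, out_dir, "m3_config_partial")
        raise
    except Exception:
        save_results(all_results, out_dir, "m3_config_partial")
        raise
    return save_results(all_results, out_dir, "m3_config")
```

Re-raising after the save keeps the process exit status meaningful, which matters when a shell trap decides what to bundle based on how the evaluator ended.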

Closes #91 #92
@haroldship changed the title from "fix(m3): bundle on interrupt and save partial results on crash (#91, #92)" to "fix(m3): bundle on interrupt and save partial results on crash" on May 26, 2026
- Add scripts/smoke_benchmarks.sh and report.md validator
- Populate steps in AppWorld SDK results; show 0 in reports instead of --
- Add just test-smoke-e2e and document in CONTRIBUTING.md
@haroldship

Collaborator

Update

  • Merged latest main and pushed (a3773d9).
  • Added e2e smoke harness: just test-smoke-e2e (AppWorld SDK + ReAct + M3 hockey sample) with report.md validation.
  • Local just ci: lint ✓, 270 tests ✓, security ✓.

Live smoke (just test-smoke-e2e) pending — requires AppWorld services, M3 capability containers, and API keys.

@haroldship

Collaborator

Live smoke: just test-smoke-e2e completed successfully (AppWorld SDK 82e2fac_1, AppWorld ReAct, M3 hockey m3_task_2 ×1). All three bundle report.md files passed validation.

Follow-up fix: 9133c39 — validator invoked via python -m benchmarks.helpers.validate_bundle_report.

@haroldship

Collaborator

Smoke correction: The earlier test-smoke-e2e run validated AppWorld SDK + ReAct bundles from this run, but M3 did not produce a new bundle (registry port 8001 conflict, then 0 results). The script incorrectly passed by validating an older compare bundle — fixed in 9133c39 follow-up (use run start timestamp + free port 8001 before M3).
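
The selection rule behind this fix (accept only a report written during the current run) can be sketched in Python. The smoke harness itself is a bash script, so this is only an illustration of the logic; `newest_report_since` and the directory layout are hypothetical.

```python
from pathlib import Path
from typing import Optional


def newest_report_since(bundle_root: Path, run_start_ts: float) -> Optional[Path]:
    """Return the most recently modified report.md written at or after run_start_ts.

    Hypothetical helper: returning None lets the caller fail the smoke run
    instead of silently validating a bundle left over from an earlier run.
    """
    candidates = [
        p for p in bundle_root.rglob("report.md")
        if p.stat().st_mtime >= run_start_ts
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

Sorting bundle directories by name and taking the last one cannot distinguish a fresh bundle from a stale one, whereas the timestamp filter makes "no new bundle" an explicit outcome.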

M3 smoke is still blocked locally by the cuga mode config path: the inline # comment after "accurate" in the env file is read as part of the value. AppWorld portions of smoke are green.

- Pick report.md by mtime since RUN_START_TS, not lexicographic tail
- Free port 8001 before M3 so registry can start after AppWorld
@haroldship

Collaborator

Pushed b1816fa (smoke bundle timestamp fix) and $(git rev-parse --short HEAD) (strip inline # comments in load_env.sh).

M3 smoke retry: eval produced a bundle but the hockey task hit API/Langfuse auth errors and zero metrics in report.md — would fail validation until keys/containers are up. AppWorld SDK + ReAct smoke remain green from the earlier run.

Inline `#` notes in m3.env were appended to values (e.g. CUGA_MODE), breaking
bundle report generation. Document optional just test-smoke-e2e in create-pr.
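
The inline-comment problem described in this commit can be illustrated with a small Python sketch. load_env.sh is a shell script, so this is not its implementation; `strip_inline_comment` and the sample values are hypothetical, and the quoting rule is a simplifying assumption.

```python
def strip_inline_comment(value: str) -> str:
    """Drop a trailing '# note' from an unquoted env value (illustrative only)."""
    value = value.strip()
    # Leave quoted values alone; a '#' inside quotes is part of the value.
    if value[:1] in ("'", '"'):
        return value
    # Treat '#' as a comment only when it starts the value or follows whitespace.
    for i, ch in enumerate(value):
        if ch == "#" and (i == 0 or value[i - 1].isspace()):
            return value[:i].rstrip()
    return value


print(strip_inline_comment("accurate  # hypothetical note"))
```

Without such a step, a value like the one in the example would reach the application with the note still attached.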
@haroldship

Collaborator

Status: All CI checks green on latest d172bee. No open review threads.

Scope reminder: Core change is M3 interrupt bundling + partial save; follow-up commits add live smoke harness (just test-smoke-e2e) and load_env inline-comment fix. M3 portion of smoke is manual-only (not CI).

Ready for review when you are.

…resilience

# Conflicts:
#	benchmarks/helpers/load_env.sh
#	benchmarks/m3/compare.sh
#	benchmarks/m3/eval.sh
#	benchmarks/m3/eval_m3.py
@coderabbitai

coderabbitai Bot commented Jun 7, 2026


Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d8c92b4d-7410-4eec-8dfc-05e455152dea

📥 Commits

Reviewing files that changed from the base of the PR and between 193fef2 and 80f1549.

📒 Files selected for processing (3)
  • benchmarks/appworld/eval_appworld_sdk.py
  • benchmarks/helpers/validate_bundle_report.py
  • scripts/smoke_benchmarks.sh

📝 Walkthrough


This PR implements smoke-testing validation infrastructure for the M3 and AppWorld benchmarks, hardens evaluation resilience via partial-result salvage on interrupt, improves step counting across evaluators, and refactors bundle creation into idempotent cleanup-trap handlers. Step counting is fixed to use invoke-derived metrics with fallbacks; bundle reports are validated via a new CLI utility; and M3 evaluation now persists partial results when interrupted, with centralized idempotent bundle creation in both eval and compare scripts.

Changes

Smoke Testing and Evaluation Resilience

| Layer | File(s) | Summary |
| --- | --- | --- |
| Step count capture and fallback logic | benchmarks/appworld/eval_appworld_sdk.py, benchmarks/helpers/sdk_eval_helpers.py, benchmarks/helpers/compare_report.py | AppWorld evaluator now persists agent.invoke() result and derives step counts via ReAct extraction or fallback to tracker/tool_calls metrics. Tracker callbacks pass computed agent_steps to finish_task. Format utility treats zero as a valid value instead of a missing marker. |
| Bundle report validation utility | benchmarks/helpers/validate_bundle_report.py, benchmarks/helpers/tests/test_validate_bundle_report.py | New validator parses report.md Per-Task and summary sections, enforces required metrics (Tokens, LLM Calls, Cache Tokens, Duration, Steps), validates summary totals, and provides a CLI entrypoint with comprehensive tests covering valid and invalid report shapes. |
| End-to-end smoke test harness | justfile, scripts/smoke_benchmarks.sh | Smoke execution wired through a new just task and bash script that orchestrates three benchmarks (AppWorld SDK, AppWorld ReAct, M3 hockey), selects reports by run-start timestamp, frees port 8001, validates each report via CLI, and supports AppWorld task ID override. |
| M3 evaluation partial-result salvage on interrupt | benchmarks/m3/eval_m3.py, benchmarks/m3/tests/conftest.py, benchmarks/m3/tests/test_partial_save_on_interrupt.py | Config-mode evaluation moves result storage outside try block to preserve state across interruption/exception handlers. Exception handlers now save partial results with m3_config_partial or m3_config_no_gt_partial prefixes before re-raising. Test suite enables isolated m3 execution; regression tests verify partial-save file format and prefixes. |
| Idempotent bundle creation and cleanup-trap wiring | benchmarks/m3/eval.sh, benchmarks/m3/compare.sh | Both eval and compare scripts centralize bundle creation into idempotent create_bundle() and create_compare_bundle() functions guarded by BUNDLE_DONE flags. Bundle creation is invoked from success paths and cleanup traps (EXIT/INT/TERM/ERR) to salvage partial artifacts on interrupt. Compare script filters out m3_config_*partial* files during aggregation. Report generation is best-effort and does not abort. |
| Documentation, configuration comments, and multiturn filename corrections | .bob/commands/cuga-create-pr.md, .claude/commands/cuga-create-pr.md, .cursor/commands/cuga-create-pr.md, CONTRIBUTING.md, README.md, pyproject.toml, benchmarks/m3/eval_m3_multiturn.py, scripts/create_eval_bundle.py | AI-assistant command templates and CONTRIBUTING guide document optional smoke-testing prerequisites. README documents M3 resilience behavior on interrupts and crashes. pyproject.toml clarifies pytest importlib mode rationale to prevent per-benchmark test package collisions. Multiturn evaluator and bundle-script references are corrected from misspelled olympics_mutliturn.json to olympics_multiturn.json. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related issues

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 37.93% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title 'fix(m3): bundle on interrupt and save partial results on crash' accurately describes the main objectives: addressing M3 evaluation resilience by creating bundles on interrupts and saving partial results on crashes. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
benchmarks/m3/compare.sh (1)

382-393: ⚠️ Potential issue | 🟠 Major

INT/TERM cleanup must terminate the script after bundling/cleanup.

trap compare_cleanup EXIT INT TERM runs the cleanup, but the handler then returns control to the script. If Ctrl-C arrives while bash "$SCRIPT_DIR/eval.sh" ... is running inside the per-run if ...; then ... else ... fi, set -e won’t stop the script because the failure is inside the if, so the script can keep going with later runs. Split the traps so that compare_cleanup runs on EXIT, while INT and TERM run the cleanup and then exit immediately (e.g., exit 130 for INT and exit 143 for TERM). Don’t rely on $? inside the INT/TERM handler for the exit code.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benchmarks/m3/compare.sh` around lines 382 - 393, The compare_cleanup
function is currently trapped for EXIT, INT, and TERM but returns to the script
on SIGINT/SIGTERM allowing the main loop to continue; change the trap setup so
compare_cleanup remains on EXIT only, and add separate INT and TERM handlers
that call compare_cleanup (and create_compare_bundle if needed) then immediately
exit with the conventional codes (exit 130 for INT, exit 143 for TERM) without
relying on $?; update the trap invocation(s) near the existing "trap
compare_cleanup EXIT INT TERM" line and reference compare_cleanup and
create_compare_bundle when implementing the new INT/TERM handlers.
🧹 Nitpick comments (3)
benchmarks/helpers/validate_bundle_report.py (1)

68-74: ⚡ Quick win

Summary validation only flags empty values, not missing lines.

The regex search on line 69 only matches when the "Label:" pattern exists in the text. If the summary section or a specific label is completely missing, no error is raised. This may be intentional (opt-in validation), but if the smoke test expects these metrics to always be present, the validator should flag missing lines as well.

Optional: require summary lines to exist
     for label in ("Total Tokens", "Total LLM Calls", "Total Duration"):
         m = re.search(rf"\*\*{re.escape(label)}\*\*:\s*(.+)", text)
-        if m:
+        if not m:
+            errors.append(f"{path}: summary {label} is missing")
+        else:
             val = m.group(1).strip()
             if not val or val == "--":
                 errors.append(f"{path}: summary {label} is empty")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benchmarks/helpers/validate_bundle_report.py` around lines 68 - 74, The
current loop only appends an error when a matched label exists but has an empty
value; update the validation to also append an error when the label line is
entirely missing: inside the loop over labels ("Total Tokens", "Total LLM
Calls", "Total Duration") check
re.search(rf"\*\*{re.escape(label)}\*\*:\s*(.+)", text) and if m is None append
errors.append(f"{path}: summary {label} is missing"); if m exists keep the
existing val check (if not val or val == "--" append the same error). Ensure you
reference the same local names (label, m, val, errors) when implementing the
change.
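
A self-contained version of the suggested check, written as a sketch rather than the validator's actual code: `check_summary` and `SUMMARY_LABELS` are hypothetical names, while the regex, the label list, and the error messages follow the diff above.

```python
import re
from typing import List

SUMMARY_LABELS = ("Total Tokens", "Total LLM Calls", "Total Duration")


def check_summary(text: str, path: str = "report.md") -> List[str]:
    """Report summary labels that are missing entirely or have an empty value."""
    errors: List[str] = []
    for label in SUMMARY_LABELS:
        m = re.search(rf"\*\*{re.escape(label)}\*\*:\s*(.+)", text)
        if not m:
            # The original loop skipped this case, so an absent line passed silently.
            errors.append(f"{path}: summary {label} is missing")
            continue
        val = m.group(1).strip()
        if not val or val == "--":
            errors.append(f"{path}: summary {label} is empty")
    return errors


sample = "**Total Tokens**: 1200\n**Total LLM Calls**: --\n"
for err in check_summary(sample):
    print(err)
```

Whether a missing line should fail validation depends on whether every report is expected to carry a summary section; the sketch assumes it is.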
scripts/validate_bundle_report.py (1)

11-26: ⚡ Quick win

Duplicate main() implementation.

This main() function duplicates the logic in benchmarks/helpers/validate_bundle_report.py:main(). Both files define argparse, file existence checks, and error printing.

Consider removing this script and invoking the module directly via python -m benchmarks.helpers.validate_bundle_report (which is what smoke_benchmarks.sh line 50 already does). The CLI wrapper adds no extra functionality.

Optional: remove duplicate script

Since smoke_benchmarks.sh line 50 already uses uv run python -m benchmarks.helpers.validate_bundle_report, this wrapper script is unused. Consider deleting scripts/validate_bundle_report.py to reduce duplication.

Alternatively, if the wrapper is needed for other use cases, refactor to avoid code duplication:

#!/usr/bin/env python3
"""CLI wrapper for bundle report validation."""

from benchmarks.helpers.validate_bundle_report import main

if __name__ == "__main__":
    raise SystemExit(main())
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/validate_bundle_report.py` around lines 11 - 26, The main() in this
script duplicates the CLI in the benchmarks.helpers.validate_bundle_report
module (same argparse, file checks and error printing); remove the duplicate by
either deleting this script or replacing its implementation to import and
delegate to benchmarks.helpers.validate_bundle_report.main (i.e., remove local
argparse/validate_report_md logic and in the script implement a minimal CLI stub
that calls the module's main and exits with its return code, invoking main()
only under if __name__ == "__main__").
benchmarks/appworld/eval_appworld_sdk.py (1)

318-324: 💤 Low value

Consider recording explicit zero step counts.

The if agent_steps: guard on line 323 omits result["steps"] when the count is zero. If zero steps is a valid and meaningful state (e.g., immediate failure without tool calls), it should be recorded explicitly. If zero is truly "missing" data, the current behavior is correct.

Optional: record zero explicitly if meaningful
-    if agent_steps:
+    if agent_steps is not None:
         result["steps"] = agent_steps
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benchmarks/appworld/eval_appworld_sdk.py` around lines 318 - 324, The current
truthy check "if agent_steps:" drops zero counts; change the guard so zero is
recorded by testing for None instead of truthiness — i.e., after computing
agent_steps (via invoke_result_holder/_react_steps_from_invoke_result or
fallback len(tracker.steps) or len(tool_calls)), set result["steps"]=agent_steps
when agent_steps is not None (or otherwise explicitly handle zero as a valid
value) so that 0 is stored but None still means "missing".
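
The difference between the two guards can be shown with a minimal sketch. `record_steps` is a hypothetical helper, not a function from eval_appworld_sdk.py, and it assumes the step count arrives as either an integer or `None`.

```python
from typing import Any, Dict, Optional


def record_steps(result: Dict[str, Any], agent_steps: Optional[int]) -> Dict[str, Any]:
    """Store the step count, treating 0 as data and None as missing."""
    # A truthiness test (`if agent_steps:`) would skip 0 as well as None.
    if agent_steps is not None:
        result["steps"] = agent_steps
    return result


print(record_steps({}, 0))
print(record_steps({}, None))
```

This only matters if the upstream computation can actually produce `None`; if it always yields an integer, the guard can be dropped altogether.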
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@benchmarks/m3/compare.sh`:
- Around line 200-230: The JSON builder currently folds all entries from
CONFIG_RESULT_VALS into JSON_PARTS, including partial run artifacts; modify the
loop that iterates over files (the for f in $files loop that builds file_list)
to skip any filenames that start with "m3_config_partial_" or
"m3_config_no_gt_partial_" (or otherwise match a configured partial-file
pattern) so they are not appended to file_list, and ensure JSON_PARTS only
receives non-empty file_list before adding ("${config}":[${file_list}]);
alternatively, tag partial files separately rather than adding them to
JSON_INPUT by filtering them out in this same place.

In `@benchmarks/m3/eval_m3.py`:
- Around line 2418-2422: The hoisted all_results list is not updated during
in-flight work so a KeyboardInterrupt/Exception handler can save only an
empty/incomplete file; modify run_config_mode, evaluate_tasks_in_batches, and
evaluate_single_task to accept a shared accumulator or checkpoint callback
(e.g., pass a parameter named checkpoint_callback or shared_results) and call it
to flush completed domain/batch/task results into the outer all_results before
each await/long-running loop step; alternatively, after each completed
domain/batch in those helpers, append results to the outer all_results via the
provided callback so the outer interrupt handlers will persist progress mid-run.

In `@benchmarks/m3/eval.sh`:
- Around line 375-380: The evaluator invocation(s) (the `uv run python -m
benchmarks.m3.eval_m3*` calls) are unguarded under set -e so failures trigger
the trap before EVAL_EXIT and the failure banner; modify the eval branch to
capture the evaluator exit code by temporarily disabling set -e or wrapping the
call in an if/then (e.g., use `set +e`/`set -e` around the call or `if ! uv run
...; then EVAL_EXIT=$?; echo ...; fi`) so that EVAL_EXIT is set and the else
block can run; ensure this change preserves calling `create_bundle` in the
success path and leaves `trap cleanup` behavior intact.

In `@CONTRIBUTING.md`:
- Around line 48-51: Update the comment describing the "just ci" target so it no
longer implies it runs the live smoke job; specifically change the line that
currently reads "just ci              # all of the above" to explicitly state
that "just ci" runs the automated tests (test-regression, security) but does not
run the live smoke target "just test-smoke-e2e" (the smoke job is
optional/manual and requires API keys/containers).


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 95ae6aa0-5d28-4472-aafd-0e9abb86b7a5

📥 Commits

Reviewing files that changed from the base of the PR and between 99a7326 and faa1e51.

📒 Files selected for processing (18)
  • .bob/commands/cuga-create-pr.md
  • .claude/commands/cuga-create-pr.md
  • .cursor/commands/cuga-create-pr.md
  • CONTRIBUTING.md
  • benchmarks/appworld/eval_appworld_sdk.py
  • benchmarks/helpers/compare_report.py
  • benchmarks/helpers/sdk_eval_helpers.py
  • benchmarks/helpers/tests/test_validate_bundle_report.py
  • benchmarks/helpers/validate_bundle_report.py
  • benchmarks/m3/compare.sh
  • benchmarks/m3/eval.sh
  • benchmarks/m3/eval_m3.py
  • benchmarks/m3/tests/conftest.py
  • benchmarks/m3/tests/test_partial_save_on_interrupt.py
  • justfile
  • pyproject.toml
  • scripts/smoke_benchmarks.sh
  • scripts/validate_bundle_report.py

Comment thread benchmarks/m3/compare.sh
Comment thread benchmarks/m3/eval_m3.py
Comment on lines +2418 to +2422
# Hoisted so the KeyboardInterrupt / Exception handlers below can save
# whatever was collected if the eval is interrupted (#91, #92). In
# sequential mode results are appended as tasks complete; in batched
# mode evaluate_tasks_in_batches replaces the list with its return.
all_results: List[Dict[str, Any]] = []

@coderabbitai coderabbitai Bot Jun 7, 2026


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Hoisting all_results here still drops in-flight progress on interrupt.

all_results is only updated after run_config_mode(..., defer_save=True), evaluate_tasks_in_batches(...), or evaluate_single_task(...) returns. If Ctrl-C lands mid-capability pass, mid-batch, or during a later domain of the current service, the new handlers will still save an empty/incomplete partial file even though inner locals already hold completed results. Thread a shared accumulator/checkpoint callback into those helpers, or flush completed domain/batch results into the outer list before the next await.

Also applies to: 2633-2639, 2722-2751

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benchmarks/m3/eval_m3.py` around lines 2418 - 2422, The hoisted all_results
list is not updated during in-flight work so a KeyboardInterrupt/Exception
handler can save only an empty/incomplete file; modify run_config_mode,
evaluate_tasks_in_batches, and evaluate_single_task to accept a shared
accumulator or checkpoint callback (e.g., pass a parameter named
checkpoint_callback or shared_results) and call it to flush completed
domain/batch/task results into the outer all_results before each
await/long-running loop step; alternatively, after each completed domain/batch
in those helpers, append results to the outer all_results via the provided
callback so the outer interrupt handlers will persist progress mid-run.
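
A rough, runnable sketch of the checkpoint-callback idea, assuming a simplified `evaluate_single_task` that walks a list of domains. The signature, the `fail_on` parameter, and the task and domain names are hypothetical; the real helper's interface may differ.

```python
import asyncio
from typing import Any, Callable, Dict, List

Result = Dict[str, Any]


async def evaluate_single_task(
    task_id: str,
    domains: List[str],
    checkpoint: Callable[[List[Result]], None],
    fail_on: str = "",
) -> None:
    """Hypothetical stand-in: evaluate one task domain by domain."""
    for domain in domains:
        await asyncio.sleep(0)  # placeholder for the real agent call
        if domain == fail_on:
            raise RuntimeError(f"simulated crash in {domain}")
        # Flush each finished domain to the caller before the next await.
        checkpoint([{"task": task_id, "domain": domain}])


async def main() -> None:
    all_results: List[Result] = []  # owned by the outer handler
    try:
        await evaluate_single_task(
            "task-1", ["domain_a", "domain_b", "domain_c"],
            checkpoint=all_results.extend, fail_on="domain_c",
        )
    except RuntimeError:
        # The outer handler would save all_results here before re-raising.
        print(f"salvaged {len(all_results)} domain results")


asyncio.run(main())
```

Passing `all_results.extend` as the callback keeps ownership of the list with the outer function, so the inner helper needs no knowledge of how or where partial results are saved.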

Collaborator


Valid observation, and accurately scoped as a heavy lift. evaluate_single_task does iterate multiple domains internally before returning, so an interrupt mid-task would still lose that task's in-flight domain progress — the hoisted all_results only captures completed tasks/batches, not partial progress within one.

That said, I'd treat this as a follow-up rather than blocking this PR: the change here moves the failure mode from 'lose everything collected across the entire run' (#91/#92) to 'lose at most the currently in-flight task/batch' — a meaningful reliability improvement on its own, and a large reduction in blast radius. Threading a checkpoint/accumulator callback through evaluate_single_task/evaluate_tasks_in_batches to get domain-level granularity is a reasonable next step; happy to file a tracking issue for it if that's useful.


@haroldship, that's a fair framing — the blast radius reduction from "lose the entire run" to "lose at most the in-flight task/batch" is a real and useful improvement on its own, and domain-level checkpointing is clearly a follow-up scope.

Happy to open a tracking issue for the checkpoint/accumulator callback work if you'd like — just say the word and I'll file it against this repo with the relevant context linked back to this PR.

Collaborator


@CodeRabbit please open the tracking issue


🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
gh issue create \
  --repo cuga-project/cuga-eval \
  --assignee haroldship \
  --title "feat(m3): domain-level checkpoint callback for partial saves on interrupt" \
  --body "## Background

PR `#4` ([fix/issue-91-92-eval-resilience](https://github.com/cuga-project/cuga-eval/pull/4)) hoists \`all_results\` before the main \`try\` block in \`run_config_mode\` and saves partial results in the \`KeyboardInterrupt\`/\`Exception\` handlers. This reduces the blast radius from *\"lose the entire run\"* to *\"lose at most the in-flight task/batch\"*.

However, \`all_results\` is only updated **after** \`evaluate_single_task\` or \`evaluate_tasks_in_batches\` returns in full. If a Ctrl-C or exception lands mid-task (e.g. while a later domain within the same task is still running), the outer handlers persist an empty or incomplete file — the completed domains' results exist only in inner locals.

**Discussed in:** https://github.com/cuga-project/cuga-eval/pull/4#discussion_r3369504325

## Goal

Achieve **domain-level** checkpoint granularity so that any domain that completes before an interrupt is preserved, regardless of whether its parent task finished.

## Proposed approach

Thread a \`checkpoint_callback\` (or shared mutable \`shared_results\` list) parameter into \`evaluate_single_task\` and \`evaluate_tasks_in_batches\`. After each domain (or batch) result is appended in the inner helpers, invoke the callback to flush those results into the outer \`all_results\` before the next \`await\`. The outer interrupt handlers then persist whatever has accumulated.

Rough sketch:

\`\`\`python
# run_config_mode
all_results: List[Dict[str, Any]] = []

def _checkpoint(results):
    all_results.extend(results)

# pass _checkpoint into evaluate_single_task / evaluate_tasks_in_batches
\`\`\`

Within \`evaluate_single_task\`, call \`checkpoint_callback(evaluator.results)\` after each domain loop iteration completes successfully.

## Scope

- \`benchmarks/m3/eval_m3.py\`: \`run_config_mode\`, \`evaluate_single_task\`, \`evaluate_tasks_in_batches\`
- May also want to consider the same pattern for \`compare.sh\` sequential domain loops if applicable
- Add or extend regression tests in \`benchmarks/m3/tests/test_partial_save_on_interrupt.py\`
"

Length of output: 4256



@haroldship, the tracking issue is live: #54, "feat(m3): domain-level checkpoint callback for partial saves on interrupt".

It's assigned to you and includes the full context, the proposed checkpoint_callback approach, and the files in scope (run_config_mode, evaluate_single_task, evaluate_tasks_in_batches, plus a note about extending the regression tests in test_partial_save_on_interrupt.py).
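The checkpoint idea in the issue body can be sketched roughly as follows. This is a simplified, synchronous illustration, not the code in `eval_m3.py`: the real helpers are async, and the `run_domain` parameter, the `Result` alias, and the `interrupted` flag are hypothetical names introduced only for this example. One detail worth noting: if the callback extends the outer list, passing a cumulative results list after every domain would duplicate earlier entries, so this sketch flushes only the newly finished result.

```python
from typing import Any, Callable, Dict, List, Tuple

Result = Dict[str, Any]  # hypothetical alias for one domain's result


def evaluate_single_task(
    domains: List[str],
    run_domain: Callable[[str], Result],
    checkpoint_callback: Callable[[List[Result]], None],
) -> List[Result]:
    """Run each domain in turn, flushing every finished result to the caller."""
    results: List[Result] = []
    for domain in domains:
        result = run_domain(domain)  # may raise mid-task
        results.append(result)
        # Pass only the new result: extending with a cumulative list
        # would duplicate earlier domains in the outer accumulator.
        checkpoint_callback([result])
    return results


def run_config_mode(
    domains: List[str], run_domain: Callable[[str], Result]
) -> Tuple[List[Result], bool]:
    all_results: List[Result] = []  # hoisted so the handler below can see it
    interrupted = False
    try:
        evaluate_single_task(domains, run_domain, all_results.extend)
    except KeyboardInterrupt:
        # A real handler would persist all_results here and then re-raise.
        interrupted = True
    return all_results, interrupted
```

With this shape, an interrupt during the third domain still leaves the first two domain results in `all_results` for the outer handler to save.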


🧬 Code Graph Analysis Results

benchmarks/m3/m3_vakra_score.py (lines 421-460)

def patch_tracker_scores(results: List[Dict[str, Any]], tracker: Any) -> int:
    """Sync Vakra-corrected scores back into the tracker's trajectory files.

    The tracker writes results.json/results.csv at ``finish_task`` time with the
    pre-Vakra keyword score. Vakra rescoring mutates ``results[i]["match_rate"]``
    and ``results[i]["success"]`` in place, but those updated values never reach
    the trajectory bundle — so trajectories/results.json ends up at score=0
    while report.md shows the correct score (issue `#71`).

    Call this immediately after ``score_results_async`` returns. It walks the
    rescored results and uses ``tracker.update_task`` to overwrite the stored
    ``score`` and the ``eval`` JSON blob with the Vakra verdict.

    Returns the number of tracker entries patched (useful for logging/tests).
    """
    if tracker is None or not getattr(tracker, "experiment_folder", None):
        return 0
    tracker_tasks = getattr(tracker, "tasks", None)
    if not isinstance(tracker_tasks, dict):
        return 0

    patched = 0
    for r in results:
        task_id = _result_uuid(r)
        if not task_id or task_id not in tracker_tasks:
            continue
        score = float(r.get("match_rate", 0.0))
        new_eval = json.dumps(
            {
                "task_name": r.get("task_name", task_id),
                "difficulty": r.get("difficulty", "unknown"),
                "success": bool(r.get("success", False)),
                "match_rate": score,
                "found_keywords": r.get("found_keywords", []),
                "missing_keywords": r.get("missing_keywords", []),
            }
        )
        if tracker.update_task(task_id=task_id, score=score, eval=new_eval):
            patched += 1
    return patched

🧠 Learnings used
Learnt from: haroldship
Repo: cuga-project/cuga-eval PR: 4
File: benchmarks/m3/eval.sh:375-380
Timestamp: 2026-06-07T17:29:50.043Z
Learning: In `benchmarks/m3/eval.sh`, the `ERR` trap fires on evaluator failure regardless of `set -e`/`set +e` state in bash. Because `cleanup()` unconditionally ends with `exit $exit_code`, the script never reaches `EVAL_EXIT=$?` or the `else` branch after the evaluator invocation. The "✗ M3 evaluation failed (exit code: ...)" banner is unreachable — this is a known cosmetic issue deferred to a follow-up. A real fix would require selectively suppressing ERR around the `uv run` invocation (`trap '' ERR; uv run ...; EVAL_EXIT=$?; trap cleanup ERR`), which has implications for the crash-salvage path and needs careful testing. The functional salvage behavior (create_bundle via ERR/EXIT trap) is already correct.

Comment thread benchmarks/m3/eval.sh
Comment on lines +375 to +380
# Create reproducibility bundle (idempotent — cleanup trap also calls
# this on interrupt/crash, see #91, #92).
create_bundle
else
echo -e "${RED:-}✗ M3 evaluation failed (exit code: $EVAL_EXIT)${NC:-}"
# cleanup trap will call create_bundle to salvage what we have.

@coderabbitai coderabbitai Bot Jun 7, 2026

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect the relevant region around the reported lines and the evaluator invocation.
sed -n '330,420p' benchmarks/m3/eval.sh | nl -ba

Repository: cuga-project/cuga-eval

Length of output: 109


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="benchmarks/m3/eval.sh"

echo "=== Matches for key identifiers ==="
rg -n "EVAL_EXIT|uv run|M3 evaluation failed|create_bundle|cleanup trap|trap|EVAL" "$FILE" || true

echo
echo "=== Context around reported lines (330-420) ==="
awk 'NR>=330 && NR<=420 {printf "%d:%s\n", NR, $0}' "$FILE" | sed -n '1,200p'

Repository: cuga-project/cuga-eval

Length of output: 4843


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="benchmarks/m3/eval.sh"

echo "=== File header (first 120 lines) ==="
awk 'NR>=1 && NR<=140 {print NR ":" $0}' "$FILE"

echo
echo "=== cleanup/create_bundle area (lines 110-260) ==="
awk 'NR>=110 && NR<=260 {print NR ":" $0}' "$FILE"

echo
echo "=== Any set +e / errexit toggling and handling near eval call (lines 340-390) ==="
rg -n "set \\+e|set -e|errexit|trap cleanup|cleanup\\(" "$FILE" || true

Repository: cuga-project/cuga-eval

Length of output: 12712


Guard the evaluator call so the failure banner runs under set -e.

With set -e and unguarded uv run python -m benchmarks.m3.eval_m3* ... calls in the evaluation branches, a non-zero exit triggers trap cleanup ... ERR/EXIT and aborts the script before reaching EVAL_EXIT=$? and the else block that prints ✗ M3 evaluation failed (exit code: $EVAL_EXIT). Wrap the evaluator invocation(s) to capture the exit code (e.g., set +e temporarily or if ! uv run ...; then ...; fi).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benchmarks/m3/eval.sh` around lines 375 - 380, The evaluator invocation(s)
(the `uv run python -m benchmarks.m3.eval_m3*` calls) are unguarded under set -e
so failures trigger the trap before EVAL_EXIT and the failure banner; modify the
eval branch to capture the evaluator exit code by temporarily disabling set -e
or wrapping the call in an if/then (e.g., use `set +e`/`set -e` around the call
or `if ! uv run ...; then EVAL_EXIT=$?; echo ...; fi`) so that EVAL_EXIT is set
and the else block can run; ensure this change preserves calling `create_bundle`
in the success path and leaves `trap cleanup` behavior intact.

Collaborator

I looked at this closely, and it's technically correct but the fix isn't as simple as guarding the exit code — and the underlying salvage behavior (the actual point of this PR) is unaffected.

Verified empirically: the ERR trap fires on the evaluator's non-zero exit regardless of set -e/set +e state (tested with a minimal repro — set +e; false still triggers an ERR-trapped cleanup). And cleanup() here unconditionally ends with exit $exit_code, so it force-exits the moment it runs. That means wrapping the uv run call in set +e/set -e is a no-op — cleanup still fires immediately on failure and exits before EVAL_EXIT=$? is ever reached. I tried exactly that and reverted it once the test confirmed it changes nothing.

So yes, the "✗ M3 evaluation failed (exit code: ...)" banner in the else branch is genuinely unreachable — but that's cosmetic only. The actual salvage path (create_bundle running and the correct exit code propagating to the caller) already works correctly via the ERR/EXIT trap, which is the entire point of #91/#92.

A real fix would mean suppressing ERR specifically around the uv run invocation (trap '' ERR; uv run ...; EVAL_EXIT=$?; trap cleanup ERR) so the script falls through to the explicit branch instead of the trap firing immediately — which would change when the salvage bundle gets created (deferred to the script's natural exit) and needs careful testing against the crash-salvage path. Given the banner is purely informational and the functional behavior is already correct, I'd rather not touch trap semantics in this PR for a cosmetic message. Happy to file a follow-up if you'd like it tracked.
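For anyone who wants to reproduce the trap behavior described above, a small harness along these lines should do it. It is only a sketch: it assumes `bash` is available on `PATH`, the trap body `echo trapped; exit 7` is a stand-in for the real `cleanup()` function, and `run_bash`, `TRAP_FIRES`, and `TRAP_SUPPRESSED` are names made up for this example.

```python
import subprocess

# Case 1: ERR trap installed, errexit explicitly off. The failing command
# still runs the trap, and the trap's own exit ends the script.
TRAP_FIRES = """
set +e
trap 'echo trapped; exit 7' ERR
false
echo unreachable
"""

# Case 2: ERR is ignored only around the failing command, so the script
# falls through and can capture the exit status before re-arming the trap.
TRAP_SUPPRESSED = """
set +e
trap 'echo trapped; exit 7' ERR
trap '' ERR
false
code=$?
trap 'echo trapped; exit 7' ERR
echo "code=$code"
"""


def run_bash(script: str):
    """Run a script with bash and return (stdout, exit code)."""
    proc = subprocess.run(
        ["bash", "-c", script], capture_output=True, text=True
    )
    return proc.stdout, proc.returncode


if __name__ == "__main__":
    print(run_bash(TRAP_FIRES))
    print(run_bash(TRAP_SUPPRESSED))
```

The first case models the current script (the trap exits before any code after the failing command runs); the second models the deferred `trap '' ERR` approach.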

@haroldship, thanks for the thorough investigation — that's a subtlety I missed. You're right that trap ... ERR fires even under set +e, and since cleanup() calls exit $exit_code unconditionally, the script never reaches EVAL_EXIT=$? on the failure path regardless of how the uv run call is guarded. My suggested fix would indeed have been a no-op.

The dead else branch is cosmetic and the actual salvage behavior (the whole point of this PR) is already working correctly via the trap. Deferring this to a follow-up makes sense.

Would you like me to open a GitHub issue to track the follow-up (making the failure banner reachable by selectively suppressing ERR around the evaluator invocations)?


✏️ Learnings added
Learnt from: haroldship
Repo: cuga-project/cuga-eval PR: 4
File: benchmarks/m3/eval.sh:375-380
Timestamp: 2026-06-07T17:29:50.043Z
Learning: In `benchmarks/m3/eval.sh`, the `ERR` trap fires on evaluator failure regardless of `set -e`/`set +e` state in bash. Because `cleanup()` unconditionally ends with `exit $exit_code`, the script never reaches `EVAL_EXIT=$?` or the `else` branch after the evaluator invocation. The "✗ M3 evaluation failed (exit code: ...)" banner is unreachable — this is a known cosmetic issue deferred to a follow-up. A real fix would require selectively suppressing ERR around the `uv run` invocation (`trap '' ERR; uv run ...; EVAL_EXIT=$?; trap cleanup ERR`), which has implications for the crash-salvage path and needs careful testing. The functional salvage behavior (create_bundle via ERR/EXIT trap) is already correct.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

Collaborator

@CodeRabbit please open the tracking issue

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
gh issue create \
  --repo cuga-project/cuga-eval \
  --title 'fix(m3): make failure banner reachable by selectively suppressing ERR around evaluator invocations' \
  --body "## Background

In \`benchmarks/m3/eval.sh\`, the \`ERR\` trap fires on evaluator failure regardless of \`set -e\`/\`set +e\` state in bash. Because \`cleanup()\` unconditionally ends with \`exit \$exit_code\`, the script never reaches \`EVAL_EXIT=\$?\` or the \`else\` branch after the evaluator invocation. As a result, the \"✗ M3 evaluation failed (exit code: ...)\" banner is unreachable.

This was identified during review of PR `#4` (comment: https://github.com/cuga-project/cuga-eval/pull/4#discussion_r3369504327).

## Functional impact

None — the crash-salvage behavior (\`create_bundle\` via \`ERR\`/\`EXIT\` trap) is already correct. This is a cosmetic issue only.

## Proposed fix

Selectively suppress \`ERR\` around each \`uv run\` evaluator invocation so the script falls through to the explicit success/failure branches instead of immediately trapping:

\`\`\`bash
trap '' ERR
uv run python -m benchmarks.m3.eval_m3 ...
EVAL_EXIT=\$?
trap cleanup ERR
\`\`\`

This change needs careful testing against the crash-salvage path to ensure the bundle is still created correctly on early interrupt/exception.

## Requested by

`@haroldship`" \
  --assignee haroldship

Length of output: 1456



@haroldship, done — tracking issue opened: #55 (assigned to you).



Comment thread CONTRIBUTING.md Outdated
@haroldship

Collaborator

Sanity test results (PR #3 recipe + AppWorld/BPO)

Ran the same small m3_task_2/hockey recipe used to validate PR #3, plus small AppWorld and BPO sanity runs, against this branch (post-merge with main, commit faa1e51).

M3 eval (cuga, --max-samples 2)

caffeinate -i bash benchmarks/m3/eval.sh \
  --m3-data benchmarks/m3/data/small_train.zip \
  --capability m3_task_2 --domain hockey --max-samples 2

Completed cleanly — exit 0, no exceptions, bundle created (evaluation_bundles/20260607_141940_default).
Pass rate 0/2 (0%) — both misses are model-judging outcomes (wrong answer / groundedness on hockey task 2), not harness errors. 409K tokens, 23 LLM calls, 90.3s.

M3 compare (--compare-agents --runs 2)

caffeinate -i bash benchmarks/m3/compare.sh --compare-agents --runs 2 \
  --m3-data benchmarks/m3/data/small_train.zip \
  --capability m3_task_2 --domain hockey --max-samples 2

All 4 runs completed — 0 failed out of 4, total 5m29s, comparison bundle created (evaluation_bundles/20260607_142705_compare_gpt-oss_cuga_react).

Configuration                   Runs  Pass Rate  pass@2  pass^2  maj@2  Cons   Tokens     LLM   Time
cuga  — policies (GPT-OSS-120B)    2     50.0%   50.0%   50.0%  50.0%  1.00  167,838.0  13.0  22.4s
react — policies (GPT-OSS-120B)    2     25.0%   50.0%    0.0%   0.0%    --   63,086.5   5.5   4.9s

AppWorld (2 tasks)

caffeinate -i bash benchmarks/appworld/eval.sh --task 82e2fac_1 5238afc_1

Completed cleanly — exit 0, bundle created (evaluation_bundles/20260607_151534_default).

Worth flagging separately (pre-existing, unrelated to this PR — it doesn't touch any AppWorld code): both tasks hit the same agent-side bug (TypeError: 'NoneType' object is not subscriptable in cuga's format_subtask) and were marked failed by the dataset-level evaluator (Completed: 0/2, pass percentages 0.0 and 71.4). Yet the bundle's final "Evaluation Report" (Langfuse-trace based) only lists task 82e2fac_1 and reports a misleading "Pass Rate 1/1 (100%)" — silently dropping the failed task 5238afc_1 from the report. I'd suggest filing this as a separate issue against the AppWorld harness/report generator; happy to do so if useful.

BPO (2 tasks)

caffeinate -i bash benchmarks/bpo/eval.sh --task 1 2

Completed cleanly — exit 0, 2/2 (100%) pass rate, bundle created (evaluation_bundles/20260607_152441_default). 26.9K tokens, 5 LLM calls, 13.8s total.


All four harnesses run end-to-end without crashes after the merge with main. No regressions observed from the conflict resolution (load_env.sh, eval.sh, eval_m3.py, compare.sh).

@haroldship

haroldship commented Jun 7, 2026

Collaborator

Manual test plan for this PR (interrupt / crash bundling)

This exercises the core fix: bundles and partial results should now be produced even when an M3 eval/compare run is interrupted (Ctrl-C) or crashes mid-flight, instead of being silently lost (issues #91/#92).

1. Ctrl-C mid-eval.sh run → bundle + partial results

caffeinate -i bash benchmarks/m3/eval.sh \
  --m3-data benchmarks/m3/data/small_train.zip \
  --capability m3_task_2 --domain hockey --max-samples 2
  • Let it run past the first task's completion (watch for tool_call/Vakra scoring lines for task 1 in the console log — the first full result write only happens at the very end of a normal run, so there's no earlier "saved" message to watch for), then press Ctrl-C.
  • Expect:
    • Cleaning up... / Creating reproducibility bundle... printed by the cleanup trap (benchmarks/m3/eval.sh:207).
    • A console log line 📁 Partial results (N task-results) saved to: .../m3_config_partial_*.json (or m3_config_no_gt_partial_*.json with --no-ground-truth), written by the KeyboardInterrupt/CancelledError handler in eval_m3.py:2733.
    • A matching benchmarks/m3/results/m3_config_partial_*.json file appears on disk.
    • A new bundle directory under benchmarks/m3/evaluation_bundles/ containing that partial result file, trajectories-so-far, and console/registry logs.
    • Exit code reflects the interrupt (130), not 0.

2. Ctrl-C before any result is saved → no spurious bundle

  • Re-run the same command and press Ctrl-C immediately (before the first task finishes / before any results/m3_*.json is written with mtime >= RUN_START_TS).
  • Expect: create_bundle (eval.sh:127) finds no fresh result file and prints No result file from this run was found — skipping bundle. — no empty/bogus bundle directory is created.

3. Crash mid-run (simulate #92) → partial save still happens

  • Same recipe, but force a crash instead of Ctrl-C, e.g. kill the registry server mid-run (kill <registry_pid> from another terminal) so eval_m3.py raises an unhandled exception.
  • Expect: the generic except Exception as eval_err handler (eval_m3.py:2741) still writes m3_config_partial_*.json before re-raising, and the cleanup trap (triggered via ERR/EXIT) still creates a bundle from it.
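For readers checking steps 1 and 3, the handler behavior being tested looks roughly like the sketch below. It is a simplified stand-in, not the code in `eval_m3.py`: the function name `run_with_partial_save`, its parameters, and the non-partial filename `m3_config_<stamp>.json` are assumptions for illustration; only the `m3_config_partial_` prefix and the "save, then re-raise" behavior come from this PR.

```python
import json
from pathlib import Path
from typing import Any, Callable, Dict, Iterable, List


def run_with_partial_save(
    task_ids: Iterable[str],
    run_task: Callable[[str], Dict[str, Any]],
    results_dir: str,
    stamp: str,
) -> Path:
    """Run tasks; on interrupt or crash, save what finished and re-raise."""
    all_results: List[Dict[str, Any]] = []  # defined before the try block
    out_dir = Path(results_dir)
    try:
        for task_id in task_ids:
            all_results.append(run_task(task_id))
    except (KeyboardInterrupt, Exception):
        # Partial file uses a distinct prefix so later tooling can tell
        # it apart from a complete run.
        partial_path = out_dir / f"m3_config_partial_{stamp}.json"
        partial_path.write_text(json.dumps(all_results, indent=2))
        print(
            f"Partial results ({len(all_results)} task-results) "
            f"saved to: {partial_path}"
        )
        raise  # keep the original failure visible to the shell trap
    final_path = out_dir / f"m3_config_{stamp}.json"
    final_path.write_text(json.dumps(all_results, indent=2))
    return final_path
```

Re-raising matters here: the shell-side cleanup trap only runs its salvage path if the Python process still exits non-zero.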

4. Success path → exactly one bundle, no duplication

caffeinate -i bash benchmarks/m3/eval.sh \
  --m3-data benchmarks/m3/data/small_train.zip \
  --capability m3_task_2 --domain hockey --max-samples 1
  • Let it run to completion normally.
  • Expect: Creating reproducibility bundle... is printed exactly once (the BUNDLE_DONE guard at eval.sh:120/128 makes create_bundle idempotent — the success-path call at eval.sh:377 and the trap's call at eval.sh:215 must not double-bundle). Confirm only one new directory appears under evaluation_bundles/.

5. Same checks for compare.sh

6. Regression tests (automated, but worth re-running locally)

just test-regression          # includes benchmarks/m3/tests/test_partial_save_on_interrupt.py

7. Optional — full smoke harness (if you have live AppWorld + M3 + API keys)

just test-smoke-e2e

…es from comparisons

The multiturn data file was misspelled "olympics_mutliturn.json" since the
initial release, with eval.sh, compare.sh, eval_m3_multiturn.py, and
create_eval_bundle.py all consistently referencing the same misspelling.
Renamed the file and fixed all references for consistency.

Also filter m3_config_partial_*/m3_config_no_gt_partial_* result files out of
compare.sh's comparison set — interrupted/crashed runs were being folded into
aggregate pass-rate and token totals alongside complete runs, skewing the
comparison report.

agent_steps == 0 was falsy, so a task that genuinely completed in zero
steps was recorded as "steps not measured" instead of 0.

Use a shared _EMPTY_MARKERS set so "--", em-dash, and "-" placeholders are
all recognized consistently as empty in both per-task and summary checks,
and match the table header strictly on "Task" rather than any tuple member.
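
The two fixes described in these commit messages (zero treated as missing, and inconsistent placeholder handling) can be illustrated with a small sketch. The names `fmt_buggy`, `fmt_fixed`, `is_empty_cell`, and `EMPTY_MARKERS` are hypothetical stand-ins for the real `_fmt` and `_EMPTY_MARKERS`, whose exact bodies are not shown in this thread.

```python
from typing import Optional, Union

Number = Union[int, float]

# Placeholders that should all count as "no value": "--", an em-dash, "-".
EMPTY_MARKERS = {"--", "\u2014", "-"}


def fmt_buggy(value: Optional[Number]) -> str:
    # Truthiness check: 0 and 0.0 are falsy, so they render as "--".
    return str(value) if value else "--"


def fmt_fixed(value: Optional[Number]) -> str:
    # Only a genuinely missing value renders as the placeholder.
    return "--" if value is None else str(value)


def is_empty_cell(cell: str) -> bool:
    """True when a report cell holds one of the placeholder markers."""
    return cell.strip() in EMPTY_MARKERS


if __name__ == "__main__":
    print(fmt_buggy(0), fmt_fixed(0), fmt_fixed(None))
```

The key distinction is `is None` versus truthiness; any numeric field where zero is a legitimate measurement needs the former.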

CONTRIBUTING.md's "just ci # all of the above" was misleading — the recipe
runs lint + test-regression + security only, not the full suite (smoke is
optional/manual).

README.md gains a section explaining how M3's eval.sh/compare.sh salvage a
best-effort bundle and partial result file when a run is interrupted or
crashes, and how compare.sh excludes those partials from aggregate totals.
… saves

Companion to the file rename in 5e9cb6c — updates the references in
eval.sh, compare.sh, eval_m3_multiturn.py, and create_eval_bundle.py to the
corrected filename, and filters m3_config_partial_*/m3_config_no_gt_partial_*
result files out of compare.sh's comparison set so an interrupted/crashed
run's partial results don't skew aggregate pass-rate and token totals.

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
benchmarks/m3/compare.sh (1)

420-435: 💤 Low value

Consider refactoring to avoid ls | grep pattern.

Shellcheck flags the ls | grep pattern as potentially fragile for non-alphanumeric filenames. Since the filenames here are controlled by eval_m3.py (not user input), the current implementation is safe. However, a more robust alternative would use a for loop with pattern matching:

_list_results_for_agent() {
    local agent="$1"
    if [[ "$agent" == "cuga" ]]; then
        for f in "$RESULTS_DIR"/m3_config_*.json; do
            [[ -f "$f" ]] || continue
            case "$(basename "$f")" in
                m3_config_partial_*|m3_config_no_gt_partial_*)
                    continue
                    ;;
            esac
            echo "$f"
        done | sort
    else
        # react: existing logic
        ...
    fi
}

This eliminates the ls | grep pattern and is more maintainable.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benchmarks/m3/compare.sh` around lines 420 - 435, The ls|grep pipeline used
to list results when checking "$RESULTS_DIR"/m3_config_*.json (for agent "cuga")
is fragile; replace it with a glob-based loop that iterates over
"$RESULTS_DIR"/m3_config_*.json, skips non-files (use [[ -f "$f" ]] ||
continue), filter out basenames matching m3_config_partial_* and
m3_config_no_gt_partial_* via a case or conditional, echo each accepted path,
and pipe the output to sort; update the branch that checks [[ "$agent" == "cuga"
]] to use this loop (referencing RESULTS_DIR and the m3_config_* patterns) so
you avoid ls | grep while preserving the same filtering semantics.

Source: Linters/SAST tools

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@benchmarks/m3/compare.sh`:
- Around line 420-435: The ls|grep pipeline used to list results when checking
"$RESULTS_DIR"/m3_config_*.json (for agent "cuga") is fragile; replace it with a
glob-based loop that iterates over "$RESULTS_DIR"/m3_config_*.json, skips
non-files (use [[ -f "$f" ]] || continue), filter out basenames matching
m3_config_partial_* and m3_config_no_gt_partial_* via a case or conditional,
echo each accepted path, and pipe the output to sort; update the branch that
checks [[ "$agent" == "cuga" ]] to use this loop (referencing RESULTS_DIR and
the m3_config_* patterns) so you avoid ls | grep while preserving the same
filtering semantics.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 95ca68e2-87c2-4507-a426-d9697c4002d6

📥 Commits

Reviewing files that changed from the base of the PR and between faa1e51 and e531eb3.

📒 Files selected for processing (9)
  • CONTRIBUTING.md
  • README.md
  • benchmarks/appworld/eval_appworld_sdk.py
  • benchmarks/helpers/validate_bundle_report.py
  • benchmarks/m3/compare.sh
  • benchmarks/m3/data/olympics_multiturn.json
  • benchmarks/m3/eval.sh
  • benchmarks/m3/eval_m3_multiturn.py
  • scripts/create_eval_bundle.py
✅ Files skipped from review due to trivial changes (2)
  • README.md
  • CONTRIBUTING.md
🚧 Files skipped from review as they are similar to previous changes (3)
  • benchmarks/helpers/validate_bundle_report.py
  • benchmarks/m3/eval.sh
  • benchmarks/appworld/eval_appworld_sdk.py

@Sergey-Zeltyn

Collaborator

Dead code: scripts/validate_bundle_report.py
scripts/validate_bundle_report.py duplicates benchmarks/helpers/validate_bundle_report.py:main() almost line-for-line, but scripts/smoke_benchmarks.sh:48 invokes python -m benchmarks.helpers.validate_bundle_report (per the 9133c39 follow-up referenced in the PR comments). The scripts/ copy looks like a leftover from before the module-based invocation was adopted — please delete it or wire the smoke script to call it.

@Sergey-Zeltyn

Collaborator

Deferred import inside complete_and_eval
benchmarks/appworld/eval_appworld_sdk.py:316 does `from benchmarks.helpers.sdk_eval_helpers import _react_steps_from_invoke_result` inside the function. If there's a circular-import reason, please add a one-line comment; otherwise lift it to the module imports.

@Sergey-Zeltyn

Collaborator

_parse_table_header returns [] on an empty line
benchmarks/helpers/validate_bundle_report.py:11-15:

cells = [c.strip() for c in line.strip().strip("|").split("|")]
if not cells or cells[0] == "Task":
    return cells
return None

The branch returns an empty list rather than None when cells is empty. Caller checks if cols and cols[0] == "Task" so this is harmless today, but it inverts the contract suggested by the type hint (list[str] | None). Either return None on empty, or simplify to return cells if cells and cells[0] == "Task" else None.
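
A standalone version of the simplified form might look like this. It is a sketch written under the assumption that the function takes a single report line; the name `parse_table_header` (without the leading underscore) is used here only to mark it as an illustration rather than the project's helper.

```python
from typing import List, Optional


def parse_table_header(line: str) -> Optional[List[str]]:
    """Return the column names if `line` is the task-table header, else None."""
    cells = [c.strip() for c in line.strip().strip("|").split("|")]
    # Single condition: a header must exist and start with "Task".
    return cells if cells and cells[0] == "Task" else None


if __name__ == "__main__":
    print(parse_table_header("| Task | Steps | Tokens |"))
    print(parse_table_header(""))
```

As a side note, `str.split` with an explicit separator returns at least one element even for an empty string, so the `cells` list may never actually be empty in practice; the single-condition form sidesteps the question and matches the `list[str] | None` hint either way.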

@Sergey-Zeltyn

Collaborator

free_port race in the smoke runner
scripts/smoke_benchmarks.sh:32-37 sends a default SIGTERM and sleeps a fixed 2 s. A stuck registry process may need a follow-up kill -9 (or a poll loop until the port is free) — otherwise the next M3 eval may still find 8001 occupied. Given this is exactly the failure that prompted the 9133c39 correction, it's worth hardening.
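
The poll-until-free idea can be sketched as below. The real `free_port` lives in a bash script and also sends signals; this Python sketch covers only the polling half, and the names `port_is_free` and `wait_for_port_free`, the loopback host, and the timeout values are all assumptions for illustration. Here "free" means a TCP connect to the port is refused, i.e. nothing is listening.

```python
import errno
import socket
import time


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """True when a connection attempt is refused (no listener on the port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns an errno value instead of raising on failure.
        return sock.connect_ex((host, port)) == errno.ECONNREFUSED


def wait_for_port_free(
    port: int, timeout: float = 5.0, interval: float = 0.1
) -> bool:
    """Poll until the port is free or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if port_is_free(port):
            return True
        time.sleep(interval)
    return port_is_free(port)
```

A caller would send SIGTERM, call `wait_for_port_free`, and escalate to SIGKILL only if it returns `False`, rather than sleeping a fixed interval and hoping the process has gone.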

@Sergey-Zeltyn

Collaborator

Mass rename: any stale mutliturn references left?
The diff covers benchmarks/m3/eval.sh, benchmarks/m3/compare.sh, benchmarks/m3/eval_m3_multiturn.py, and scripts/create_eval_bundle.py. Worth a final rg -l mutliturn to confirm nothing in docs/bundles/tests still points at the misspelled filename, especially the M3_MULTITURN_DATA_FILE env-var convention which now defaults to the new name (any external caller exporting the old value gets a silent file-not-found).

@Sergey-Zeltyn Sergey-Zeltyn left a comment

Collaborator

See comments above.
All comments are minor; correct whatever you see fit.

- Remove scripts/validate_bundle_report.py: dead-code duplicate of
  benchmarks/helpers/validate_bundle_report.py:main(), which is the
  module smoke_benchmarks.sh actually invokes.
- eval_appworld_sdk.py: lift the deferred _react_steps_from_invoke_result
  import to module level (no circular-import reason found).
- validate_bundle_report._parse_table_header: return None rather than []
  on an empty header line, matching its list[str] | None contract.
- smoke_benchmarks.sh free_port: poll until the port is actually free
  after SIGTERM, falling back to SIGKILL — closes the race that caused
  the earlier 9133c39 M3 port-8001 failure.
@haroldship haroldship merged commit 142c29a into main Jun 8, 2026
3 of 4 checks passed
@haroldship haroldship deleted the fix/issue-91-92-eval-resilience branch June 8, 2026 19:57

3 participants