Score reward against targeted tests — FAIL_TO_PASS/PASS_TO_PASS (#2b)#3

Open
flo7up wants to merge 1 commit into feat/reward-candidate-patch from feat/reward-test-scoping

flo7up (Owner) commented Jun 13, 2026

Reward was whole-suite: a candidate passed only if every test in the repo
passed, so unrelated failing or flaky tests masked the real signal. This PR
derives the SWE-bench-style targeted sets (FAIL_TO_PASS and PASS_TO_PASS) and
scores against exactly those.

- forge/test_report.py (new): pure, stdlib-only helpers for pytest detection,
  command wrapping that emits a JUnit report between stdout sentinels
  (preserving the exit code; no volume mount needed), JUnit parsing,
  before/after delta derivation (fail_to_pass = failed before and passing
  after, pass_to_pass = passing in both runs), and targeted scoring.
- docker_runner.run_tests_in_docker: optional exec_command override so the
  wrapped pytest command runs while the recorded command stays clean.
- verify_task: wrap pytest runs, parse before/after reports, derive
  fail_to_pass/pass_to_pass (only when both runs yield a parseable report and the
  test environment is healthy); stored on VerificationResult.
- package_task: persist the sets into the pack's task.json under an "eval" block
  (read robustly as plain JSON).
- reward.py template: inline the same parser/scorer (kept in sync with
  test_report and asserted by an exec-based parity test). When a scoping spec is
  present and the command is pytest, score 1.0 iff all fail_to_pass pass and no
  pass_to_pass regresses; otherwise fall back to whole-suite exit code.
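
The wrap-and-parse step described for forge/test_report.py could look roughly
like the sketch below. It is an illustration only, not the code in this PR: the
sentinel strings, the report path, and the function names
(wrap_pytest_command, extract_report, parse_junit) are all hypothetical. The
only external behavior relied on is pytest's --junitxml option and the standard
JUnit XML layout (testcase elements with failure, error, or skipped children).

```python
import shlex
import xml.etree.ElementTree as ET
from typing import Dict, Optional

# Hypothetical sentinels; the PR does not state the real marker strings.
BEGIN = "<<<JUNIT_BEGIN>>>"
END = "<<<JUNIT_END>>>"


def wrap_pytest_command(command: str, report_path: str = "/tmp/junit.xml") -> str:
    """Run pytest, print the JUnit XML between sentinels, keep pytest's exit code."""
    quoted = shlex.quote(report_path)
    return (
        f"{command} --junitxml={quoted}; rc=$?; "
        f"echo '{BEGIN}'; cat {quoted} 2>/dev/null; echo; echo '{END}'; "
        f"exit $rc"
    )


def extract_report(stdout: str) -> Optional[str]:
    """Return the text between the sentinels, or None if either one is missing."""
    start = stdout.find(BEGIN)
    if start == -1:
        return None
    end = stdout.find(END, start + len(BEGIN))
    if end == -1:
        return None
    return stdout[start + len(BEGIN):end].strip() or None


def parse_junit(xml_text: str) -> Dict[str, str]:
    """Map 'classname::name' to 'passed', 'failed', or 'skipped'."""
    outcomes: Dict[str, str] = {}
    # Parse bytes so an XML encoding declaration in the report is accepted.
    root = ET.fromstring(xml_text.encode("utf-8"))
    for case in root.iter("testcase"):
        test_id = f"{case.get('classname', '')}::{case.get('name', '')}"
        if case.find("failure") is not None or case.find("error") is not None:
            outcomes[test_id] = "failed"
        elif case.find("skipped") is not None:
            outcomes[test_id] = "skipped"
        else:
            outcomes[test_id] = "passed"
    return outcomes
```

Because the report travels through stdout, the caller only needs the captured
output of the container run, which is consistent with the "no volume mount
needed" note above.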

Demonstrated: a fix whose targeted tests pass scores 1.0 even when an unrelated
repo test fails (whole-suite would give 0.0), while a pass_to_pass regression
still scores 0.0. 98 tests pass.
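
The demonstrated behavior can be reproduced with a small sketch of the delta
derivation and the scoring rule. This is not the PR's implementation: the names
derive_targets and score, the shape of the spec dictionary, and the test ids
are hypothetical. It also assumes that a test absent from the "before" report
is left out of both sets, and that an empty or missing spec falls back to the
whole-suite exit code.

```python
from typing import Dict, List, Optional, Tuple

Outcomes = Dict[str, str]  # test id -> "passed" | "failed" | "skipped"


def derive_targets(before: Outcomes, after: Outcomes) -> Tuple[List[str], List[str]]:
    """fail_to_pass: failed before, passed after. pass_to_pass: passed in both."""
    fail_to_pass = sorted(
        t for t, o in after.items() if o == "passed" and before.get(t) == "failed"
    )
    pass_to_pass = sorted(
        t for t, o in after.items() if o == "passed" and before.get(t) == "passed"
    )
    return fail_to_pass, pass_to_pass


def score(
    outcomes: Optional[Outcomes],
    spec: Optional[Dict[str, List[str]]],
    exit_code: int,
    is_pytest: bool = True,
) -> float:
    """Targeted score when a spec and a parsed report exist, else whole-suite."""
    targeted: List[str] = []
    if spec is not None:
        targeted = spec.get("fail_to_pass", []) + spec.get("pass_to_pass", [])
    if not is_pytest or outcomes is None or not targeted:
        # Fallback: whole-suite exit code, as before this change.
        return 1.0 if exit_code == 0 else 0.0
    return 1.0 if all(outcomes.get(t) == "passed" for t in targeted) else 0.0


# Hypothetical test ids, chosen only for illustration.
before = {
    "tests.test_core::test_fix": "failed",
    "tests.test_core::test_keep": "passed",
    "tests.test_misc::test_unrelated": "failed",
}
after_fix = {**before, "tests.test_core::test_fix": "passed"}

f2p, p2p = derive_targets(before, after_fix)
spec = {"fail_to_pass": f2p, "pass_to_pass": p2p}

# Candidate fixes the target; the unrelated test still fails (exit code 1).
candidate = dict(after_fix)
# Same candidate, but a previously passing targeted test now fails.
regressed = {**candidate, "tests.test_core::test_keep": "failed"}

print(score(candidate, spec, exit_code=1))  # targeted scoring
print(score(regressed, spec, exit_code=1))  # pass_to_pass regression
print(score(candidate, None, exit_code=1))  # no spec: whole-suite fallback
```

Under these assumptions the first call yields 1.0 and the other two yield 0.0,
matching the contrast described above between targeted and whole-suite scoring.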

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>