Score reward against targeted tests: FAIL_TO_PASS/PASS_TO_PASS (#2b) #3
Open
flo7up wants to merge 1 commit into
Reward was whole-suite: a candidate passed only if every test in the repo passed, so unrelated failing or flaky tests masked the real signal. This derives the SWE-bench-style targeted sets and scores against exactly those.

- forge/test_report.py (new): pure, stdlib-only helpers for pytest detection, command wrapping to emit a JUnit report between stdout sentinels (preserving the exit code; no volume mount needed), JUnit parsing, before/after delta derivation (fail_to_pass = failed before and passed after, pass_to_pass = passed in both runs), and targeted scoring.
- docker_runner.run_tests_in_docker: optional exec_command override so the wrapped pytest command runs while the recorded command stays clean.
- verify_task: wrap pytest runs, parse before/after reports, and derive fail_to_pass/pass_to_pass (only when both runs yield a parseable report and the test environment is healthy); the sets are stored on VerificationResult.
- package_task: persist the sets into the pack's task.json under an "eval" block (read robustly as plain JSON).
- reward.py template: inline the same parser/scorer (kept in sync with test_report and asserted by an exec-based parity test). When a scoping spec is present and the command is pytest, score 1.0 iff all fail_to_pass tests pass and no pass_to_pass test regresses; otherwise fall back to the whole-suite exit code.

Demonstrated: a fix whose targeted tests pass scores 1.0 even when an unrelated repo test fails (whole-suite would give 0.0), while a pass_to_pass regression still scores 0.0.

98 tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
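For reviewers, here is a rough, stdlib-only sketch of the delta derivation and targeted scoring described above. It is not the code in forge/test_report.py: the function names (`parse_junit`, `derive_sets`, `targeted_score`), the `classname::name` test-id format, and the sample reports are all hypothetical, and the handling of tests missing from the "before" run (they are left out of both sets) is an assumption.

```python
import xml.etree.ElementTree as ET


def parse_junit(xml_text):
    """Map each test id to 'passed', 'failed', or 'skipped' (hypothetical helper)."""
    outcomes = {}
    for case in ET.fromstring(xml_text).iter("testcase"):
        # Assumed id format: "<classname>::<name>".
        test_id = f"{case.get('classname', '')}::{case.get('name', '')}"
        if case.find("failure") is not None or case.find("error") is not None:
            outcomes[test_id] = "failed"
        elif case.find("skipped") is not None:
            outcomes[test_id] = "skipped"
        else:
            outcomes[test_id] = "passed"
    return outcomes


def derive_sets(before, after):
    """fail_to_pass: failed before, passed after. pass_to_pass: passed in both runs."""
    passed_after = [t for t, status in after.items() if status == "passed"]
    return {
        "fail_to_pass": sorted(t for t in passed_after if before.get(t) == "failed"),
        "pass_to_pass": sorted(t for t in passed_after if before.get(t) == "passed"),
    }


def targeted_score(spec, outcomes):
    """1.0 iff every targeted test passed; tests outside the spec are ignored."""
    targeted = spec["fail_to_pass"] + spec["pass_to_pass"]
    return 1.0 if all(outcomes.get(t) == "passed" for t in targeted) else 0.0


# Made-up reports: one broken test, one healthy test, one unrelated failure.
BEFORE = """<testsuite>
<testcase classname="tests.test_a" name="test_fix"><failure message="boom"/></testcase>
<testcase classname="tests.test_a" name="test_keep"/>
<testcase classname="tests.test_b" name="test_unrelated"><failure message="flaky"/></testcase>
</testsuite>"""

AFTER = """<testsuite>
<testcase classname="tests.test_a" name="test_fix"/>
<testcase classname="tests.test_a" name="test_keep"/>
<testcase classname="tests.test_b" name="test_unrelated"><failure message="flaky"/></testcase>
</testsuite>"""

spec = derive_sets(parse_junit(BEFORE), parse_junit(AFTER))

# Candidate that fixes the target; the unrelated test still fails.
candidate_ok = parse_junit(AFTER)
# Candidate that fixes the target but breaks a pass_to_pass test.
candidate_regressed = dict(candidate_ok, **{"tests.test_a::test_keep": "failed"})

score_ok = targeted_score(spec, candidate_ok)
score_regressed = targeted_score(spec, candidate_regressed)
```

In this sketch the unrelated failure in `tests.test_b` never enters either set, so it cannot affect the score, whereas a regression in a pass_to_pass test does.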