fix: make Claude eval reruns scorable from Claude Code sessions #274

Closed
fazxes wants to merge 2 commits into main from feat/0277-claude-eval-rerun

fazxes commented Apr 9, 2026

What changed

  • Added Claude Code session detection in nightshift test startup.
  • When a nested Claude invocation would run inside Claude Code, the runner now falls back to codex if available, or fails early with an actionable message if it is not.
  • Added a shared test-runtime-dir handoff so eval subprocesses and the parent eval runner read the same artifacts.
  • Preserved the actual runtime agent in eval reports and added regressions for both the fallback path and the eval artifact handoff.
  • Marked task #0277 done and recorded the fresh eval report.
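The fallback rule described above can be sketched roughly as follows. This is an illustration only, not the code from this PR: the `CLAUDECODE` marker variable, the `resolve_agent` helper, and `AgentUnavailableError` are all assumptions, since the description does not say how the session is detected or how the runner is structured.

```python
import shutil
from typing import Callable, Mapping, Optional

# Assumption: an environment variable marks a Claude Code session.
# The real detection logic in nightshift may differ.
SESSION_MARKER = "CLAUDECODE"


class AgentUnavailableError(RuntimeError):
    """Raised when no usable agent exists for a nested run."""


def resolve_agent(
    requested: str,
    env: Mapping[str, str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> str:
    """Return the agent to run, falling back to codex inside a session."""
    inside_session = bool(env.get(SESSION_MARKER))
    if requested != "claude" or not inside_session:
        # No nesting problem: keep whatever the caller asked for.
        return requested
    if which("codex") is not None:
        # Nested Claude would run inside Claude Code, so use codex instead.
        return "codex"
    # Fail early with a message that says what to do next.
    raise AgentUnavailableError(
        "Cannot run a nested Claude agent inside a Claude Code session and "
        "codex is not on PATH. Install codex or rerun from a plain shell."
    )
```

Passing `env` and `which` as parameters keeps the decision testable without touching the real process environment. The returned name is what the eval report would record as the actual runtime agent.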

Root cause

  • The child nightshift test invocation could not reliably run inside the Claude Code shell when launched for eval reruns.
  • The eval wrapper also needed a shared runtime directory so the parent could score the child run’s artifacts.
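A minimal sketch of the runtime-directory handoff, assuming the parent passes the directory to the child through the `NIGHTSHIFT_TEST_RUNTIME_DIR` environment variable. The artifact name `result.json`, the inline child script, and the `run_child_and_collect` helper are hypothetical stand-ins for the real nightshift test invocation.

```python
import json
import os
import subprocess
import sys
import tempfile
from pathlib import Path

RUNTIME_DIR_ENV = "NIGHTSHIFT_TEST_RUNTIME_DIR"

# Hypothetical child: writes one artifact into the shared runtime directory.
CHILD_SCRIPT = """
import json, os, pathlib
runtime_dir = pathlib.Path(os.environ["NIGHTSHIFT_TEST_RUNTIME_DIR"])
(runtime_dir / "result.json").write_text(json.dumps({"agent": "codex"}))
"""


def run_child_and_collect() -> dict:
    """Run the child in a fresh runtime dir and read back its artifact."""
    with tempfile.TemporaryDirectory(prefix="nightshift-eval-") as runtime_dir:
        # Pass only a small allowlist of variables plus the handoff variable.
        env = {k: os.environ[k] for k in ("PATH", "SYSTEMROOT") if k in os.environ}
        env[RUNTIME_DIR_ENV] = runtime_dir
        subprocess.run([sys.executable, "-c", CHILD_SCRIPT], env=env, check=True)
        # The parent reads from the same directory the child wrote to.
        return json.loads((Path(runtime_dir) / "result.json").read_text())
```

Because both processes resolve the directory from one value chosen by the parent, the scorer never has to guess where the child put its output.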

Validation

  • make check
  • Fresh eval rerun: .recursive/evaluations/0093.md

Result

  • The fresh Phractal rerun now produces a scorable report instead of halting after two agent failures.
  • The report records a fallback run using codex with a total score of 78/100.

fazxes commented Apr 9, 2026

Closing this PR without merge. It failed the required code and safety review pair twice.

Blocking issues from the second review cycle:

  • nightshift/infra/worktree.py: the NIGHTSHIFT_TEST_RUNTIME_DIR override still permits symlink/ownership redirection under the allowed temp prefix.
  • nightshift/owl/eval_runner.py: the child eval subprocess still inherits a broad ambient environment.
  • nightshift/tests/test_nightshift.py and nightshift/tests/test_eval_runner.py: the new tests still do not exercise the real fallback path end-to-end.
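For the first finding, one possible shape of a stricter override check is sketched below. It is not the code in nightshift/infra/worktree.py; `validate_runtime_dir` and its parameters are hypothetical, and the ownership check assumes a POSIX platform.

```python
import os
import stat
from pathlib import Path


def validate_runtime_dir(candidate: str, allowed_root: str) -> Path:
    """Accept a runtime dir only if it is a real, owned dir under allowed_root."""
    root = Path(allowed_root).resolve()
    path = Path(candidate)
    info = os.lstat(path)  # lstat does not follow a symlink at the final component
    if stat.S_ISLNK(info.st_mode):
        raise ValueError(f"{candidate} is a symlink")
    if not stat.S_ISDIR(info.st_mode):
        raise ValueError(f"{candidate} is not a directory")
    # Ownership check (POSIX only): reject directories created by another user.
    if hasattr(os, "getuid") and info.st_uid != os.getuid():
        raise ValueError(f"{candidate} is not owned by the current user")
    # Resolve intermediate symlinks, then require the result to stay under root.
    resolved = path.resolve()
    if resolved != root and root not in resolved.parents:
        raise ValueError(f"{candidate} resolves outside {allowed_root}")
    return resolved
```

A check like this still leaves a window between validation and use, so a future attempt may prefer to have the parent create the directory itself (for example with `tempfile.mkdtemp`) rather than trust an override.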

Per Brain protocol, task #0277 is being marked blocked after two failed fix-review cycles. A future attempt should start from these findings rather than reopening this PR.

fazxes closed this Apr 9, 2026
fazxes deleted the feat/0277-claude-eval-rerun branch April 9, 2026 14:06