fix: make Claude eval reruns scorable from Claude Code sessions by fazxes · Pull Request #274 · Recusive/Nightshift

fazxes · 2026-04-09T13:14:09Z

What changed

Added Claude Code session detection in nightshift test startup.
When a nested Claude invocation would run inside Claude Code, the runner now falls back to codex if available, or fails early with an actionable message if it is not.
Added a shared test-runtime-dir handoff so eval subprocesses and the parent eval runner read the same artifacts.
Preserved the actual runtime agent in eval reports and added regressions for both the fallback path and the eval artifact handoff.
Marked task #0277 done and recorded the fresh eval report.
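
The fallback decision described above can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: the function and exception names are hypothetical, and it assumes a Claude Code session can be recognized by a CLAUDECODE=1 environment variable, which the PR text does not confirm.

```python
import os
import shutil
from typing import Callable, Mapping, Optional


class AgentUnavailableError(RuntimeError):
    """Raised when no usable agent can be selected for a nested run."""


def inside_claude_code(env: Mapping[str, str]) -> bool:
    # Assumption: the enclosing session exposes a CLAUDECODE=1 marker variable.
    return env.get("CLAUDECODE") == "1"


def resolve_agent(
    requested: str,
    env: Optional[Mapping[str, str]] = None,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> str:
    """Return the agent to actually run, falling back from claude to codex."""
    env = os.environ if env is None else env
    # Only a nested claude run inside a Claude Code session needs a fallback.
    if requested != "claude" or not inside_claude_code(env):
        return requested
    if which("codex") is not None:
        return "codex"
    # Fail early with a message that says what to do next.
    raise AgentUnavailableError(
        "Cannot run a nested claude agent inside a Claude Code session and "
        "codex was not found on PATH. Install codex or rerun from a plain shell."
    )
```

Passing env and which as parameters keeps the decision testable without touching the real environment, and the returned value is what a report would record as the actual runtime agent.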

Root cause

The child nightshift test invocation could not reliably run inside the Claude Code shell when launched for eval reruns.
The eval wrapper also needed a shared runtime directory so the parent could score the child run’s artifacts.
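
The runtime-directory handoff can be pictured with the sketch below. It is an illustration under assumptions, not the eval runner from this PR: the child script, the result.json artifact name, and its contents are invented for the example. Only the NIGHTSHIFT_TEST_RUNTIME_DIR variable name comes from the review notes on this PR.

```python
import json
import os
import subprocess
import sys
import tempfile
from pathlib import Path

RUNTIME_DIR_VAR = "NIGHTSHIFT_TEST_RUNTIME_DIR"

# Stand-in for the child test run: it writes one artifact (hypothetical name
# and contents) into the directory named by the shared environment variable.
CHILD_SCRIPT = """
import json, os, pathlib
runtime_dir = pathlib.Path(os.environ["NIGHTSHIFT_TEST_RUNTIME_DIR"])
(runtime_dir / "result.json").write_text(json.dumps({"agent": "codex", "ok": True}))
"""


def run_child_and_collect() -> dict:
    """Run the child with a shared runtime dir and read back its artifact."""
    with tempfile.TemporaryDirectory(prefix="nightshift-eval-") as runtime_dir:
        # The child receives only PATH plus the handoff variable, so parent and
        # child agree on one directory without sharing the whole environment.
        child_env = {
            "PATH": os.environ.get("PATH", ""),
            RUNTIME_DIR_VAR: runtime_dir,
        }
        subprocess.run([sys.executable, "-c", CHILD_SCRIPT], env=child_env, check=True)
        # The parent scores from the same directory the child wrote to.
        return json.loads((Path(runtime_dir) / "result.json").read_text())
```

Because the parent creates the directory and hands its path down, there is no guessing about where the child placed its output when the parent comes to score it.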

Validation

make check
Fresh eval rerun: .recursive/evaluations/0093.md

Result

The fresh Phractal rerun now produces a scorable report instead of halting after two agent failures.
The report records a fallback run using codex with a total score of 78/100.

fazxes · 2026-04-09T13:37:06Z

Closing this PR without merging. It failed the required pair of code and safety reviews twice.

Blocking issues from the second review cycle:

nightshift/infra/worktree.py: the NIGHTSHIFT_TEST_RUNTIME_DIR override still permits symlink/ownership redirection under the allowed temp prefix.
nightshift/owl/eval_runner.py: the child eval subprocess still inherits a broad ambient environment.
nightshift/tests/test_nightshift.py and nightshift/tests/test_eval_runner.py: the new tests still do not exercise the real fallback path end-to-end.
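
For a future attempt, the first two findings could be approached along the lines of the sketch below. It is a hedged illustration only: validate_runtime_dir, minimal_child_env, and the ENV_ALLOWLIST contents are hypothetical and do not reflect the code in nightshift/infra/worktree.py or nightshift/owl/eval_runner.py. It assumes a POSIX system, and it does not close the window between the check and later use of the directory.

```python
import os
import stat
from pathlib import Path
from typing import Dict, Mapping

RUNTIME_DIR_VAR = "NIGHTSHIFT_TEST_RUNTIME_DIR"
ENV_ALLOWLIST = ("PATH", "HOME", "LANG")  # hypothetical allowlist


def validate_runtime_dir(candidate: str, allowed_root: str) -> Path:
    """Reject overrides that are symlinks, foreign-owned, or outside the root."""
    path = Path(candidate)
    info = os.lstat(path)  # lstat inspects the entry itself, not a link target
    if stat.S_ISLNK(info.st_mode) or not stat.S_ISDIR(info.st_mode):
        raise ValueError(f"{candidate} must be a real directory, not a symlink")
    if info.st_uid != os.getuid():
        raise ValueError(f"{candidate} is not owned by the current user")
    if info.st_mode & 0o022:
        raise ValueError(f"{candidate} is writable by group or others")
    # Resolve intermediate symlinks, then require containment under the root.
    root = Path(allowed_root).resolve()
    resolved = path.resolve()
    if root not in resolved.parents:
        raise ValueError(f"{candidate} resolves outside {allowed_root}")
    return resolved


def minimal_child_env(parent: Mapping[str, str], runtime_dir: Path) -> Dict[str, str]:
    """Build the child environment from an allowlist instead of inheriting all."""
    env = {key: parent[key] for key in ENV_ALLOWLIST if key in parent}
    env[RUNTIME_DIR_VAR] = str(runtime_dir)
    return env
```

Checking the resolved path rather than a string prefix is the point of the first function: a path that merely starts with the allowed temp prefix can still lead elsewhere through a symlink.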

Per Brain protocol, task #0277 is being marked blocked after two failed fix-review cycles. A future attempt should start from these findings rather than reopening this PR.

fazxes added 2 commits April 9, 2026 09:13

fix: make claude eval reruns scorable

04196df

fix: harden claude eval fallback

9addd55

fazxes closed this Apr 9, 2026

fazxes deleted the feat/0277-claude-eval-rerun branch April 9, 2026 14:06

fazxes wants to merge 2 commits into main from feat/0277-claude-eval-rerun