This repository was archived by the owner on May 14, 2026. It is now read-only.

514-1515: insulate runs from local branch changes #92

Merged
oatsandsugar merged 1 commit into main from claude/sleepy-wu-29ada2
May 6, 2026

@oatsandsugar

Closes 514-1515.

Problem

Olivia hit a confusing failure: she switched branches and rebased while a dec-bench run was in flight, and the result was affected. Once a run starts, it shouldn't depend on the user's local branch state.

Root cause: the two-phase orchestration from 514-1419 / 514-1425 reads scenarios/<id>/assertions/ and packages/eval-core/src fresh from the host after the agent phase exits, then docker cp's them into the container. The agent phase is often 10+ minutes long, so a rebase mid-run silently shifts what the evaluator sees.

Scenario prompts, init scripts, and supervisord config are already pinned by the image hash, so they're not affected.

Solution

Snapshot the two host-read paths into a tempfile::TempDir at the top of every dec-bench run invocation. The runner reads from the snapshot for the rest of the run.

  • Snapshot from the working tree, not HEAD — preserves the "tweak an assertion and re-run without committing" loop. The trade-off is surfaced loudly (see below).
  • One snapshot per invocation, shared across matrix cells via Arc<RunSnapshot>.
  • tempfile::TempDir cleans up on Drop when execute() returns.
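
The lifecycle described above can be sketched with the standard library alone. This is a minimal illustration, not the PR's code: the hand-rolled `Drop` impl stands in for `tempfile::TempDir`, and the `create`, `path`, `root`, and `read_in_cells` names, as well as the directory naming scheme, are assumptions made for the example.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::{SystemTime, UNIX_EPOCH};

static NEXT_ID: AtomicU64 = AtomicU64::new(0);

/// Owns a host-side copy of the captured paths and removes it when dropped.
/// Stand-in for a struct holding a `tempfile::TempDir` (assumption).
pub struct RunSnapshot {
    root: PathBuf,
}

impl RunSnapshot {
    /// Copy each entry of `paths` (relative to `repo_root`) from the working
    /// tree into a fresh directory under the system temp dir.
    pub fn create(repo_root: &Path, paths: &[&str]) -> io::Result<Self> {
        let nanos = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .map(|d| d.as_nanos())
            .unwrap_or(0);
        let name = format!(
            "dec-bench-snapshot-{}-{}-{}",
            std::process::id(),
            nanos,
            NEXT_ID.fetch_add(1, Ordering::Relaxed)
        );
        let root = std::env::temp_dir().join(name);
        fs::create_dir_all(&root)?;
        // Build the owner first so a failed copy still cleans up via Drop.
        let snapshot = RunSnapshot { root };
        for rel in paths {
            copy_dir(&repo_root.join(rel), &snapshot.root.join(rel))?;
        }
        Ok(snapshot)
    }

    /// Location of a captured path inside the snapshot.
    pub fn path(&self, rel: &str) -> PathBuf {
        self.root.join(rel)
    }

    pub fn root(&self) -> &Path {
        &self.root
    }
}

impl Drop for RunSnapshot {
    fn drop(&mut self) {
        // Best-effort cleanup; errors are ignored, as in most Drop impls.
        let _ = fs::remove_dir_all(&self.root);
    }
}

/// Recursively copy `src` into `dst`, creating directories as needed.
fn copy_dir(src: &Path, dst: &Path) -> io::Result<()> {
    fs::create_dir_all(dst)?;
    for entry in fs::read_dir(src)? {
        let entry = entry?;
        let target = dst.join(entry.file_name());
        if entry.file_type()?.is_dir() {
            copy_dir(&entry.path(), &target)?;
        } else {
            fs::copy(entry.path(), &target)?;
        }
    }
    Ok(())
}

/// Each matrix cell holds a cheap `Arc` clone; the directory is removed only
/// after the last clone is dropped.
pub fn read_in_cells(
    snapshot: &Arc<RunSnapshot>,
    rel_file: &str,
    cells: usize,
) -> Vec<io::Result<String>> {
    let handles: Vec<_> = (0..cells)
        .map(|_| {
            let snap = Arc::clone(snapshot);
            let rel = rel_file.to_string();
            thread::spawn(move || fs::read_to_string(snap.path(&rel)))
        })
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().expect("cell thread panicked"))
        .collect()
}
```

Because the copy is taken from the working tree rather than from HEAD, uncommitted edits end up in the snapshot, which is the behaviour the first bullet asks for.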

514-1419 / 514-1425 invariant preserved. The snapshot lives in a host-side tempdir; its contents reach the container only via the existing post-agent docker cp calls. The snapshot path is never bind-mounted, never baked into an image, never exposed via env vars. The agent phase still runs against the same sandboxed filesystem as before. Documented in the module-level comment on apps/cli/src/commands/snapshot.rs.

User experience

Every run now prints this at the top, before any container work:

Snapshot: /tmp/dec-bench-snapshot-XXXX
  Branch: my-feature (sha=abc1234, dirty=YES — testing uncommitted edits)
  Captured: scenarios/foo-bar-csv-ingest/assertions, packages/eval-core/src

A dirty tree triggers the loud dirty=YES — testing uncommitted edits line so it can't go unnoticed.
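
As a rough sketch of how that summary could be assembled: the `GitState` struct and `format_summary` function are hypothetical names, the 7-character short SHA is inferred from the sample output, and the wording for a clean tree (`dirty=no`) is an assumption, since the PR only shows the dirty case.

```rust
use std::path::Path;

/// Hypothetical view of the git state captured at snapshot time.
pub struct GitState {
    pub branch: String,
    pub sha: String, // full commit SHA
    pub dirty: bool,
}

/// Render the three-line summary printed before any container work.
pub fn format_summary(root: &Path, git: &GitState, captured: &[&str]) -> String {
    // The console shows a short SHA; the full SHA is left for the manifest.
    let short_sha: String = git.sha.chars().take(7).collect();
    let dirty = if git.dirty {
        // Matches the loud wording shown in the sample output.
        "YES \u{2014} testing uncommitted edits"
    } else {
        "no" // assumed wording for a clean tree
    };
    format!(
        "Snapshot: {}\n  Branch: {} (sha={}, dirty={})\n  Captured: {}",
        root.display(),
        git.branch,
        short_sha,
        dirty,
        captured.join(", ")
    )
}
```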

Each per-run results dir also gets a <run_id>.snapshot.json manifest (full SHA, branch, dirty, captured paths, tempdir root) so audit tooling can later prove which tree state produced a given result.
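
For readers who want a feel for the manifest, here is a std-only sketch. The JSON key names (`sha`, `branch`, `dirty`, `captured`, `snapshot_root`) and the `SnapshotManifest` / `manifest_path` names are assumptions for illustration; the PR lists the fields but not their exact spelling, and the real implementation may well use a serialization crate instead of hand-written JSON.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical manifest shape; field names are illustrative.
pub struct SnapshotManifest {
    pub sha: String, // full SHA
    pub branch: String,
    pub dirty: bool,
    pub captured: Vec<String>,
    pub snapshot_root: String,
}

/// Escape a string for use inside a JSON string literal.
fn escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

impl SnapshotManifest {
    /// Serialize to a compact JSON object.
    pub fn to_json(&self) -> String {
        let paths: Vec<String> = self
            .captured
            .iter()
            .map(|p| format!("\"{}\"", escape(p)))
            .collect();
        format!(
            "{{\"sha\":\"{}\",\"branch\":\"{}\",\"dirty\":{},\"captured\":[{}],\"snapshot_root\":\"{}\"}}",
            escape(&self.sha),
            escape(&self.branch),
            self.dirty,
            paths.join(","),
            escape(&self.snapshot_root)
        )
    }
}

/// `<run_id>.snapshot.json` inside the per-run results directory.
pub fn manifest_path(results_dir: &Path, run_id: &str) -> PathBuf {
    results_dir.join(format!("{}.snapshot.json", run_id))
}
```

Writing the result is then a single `std::fs::write(manifest_path(dir, run_id), manifest.to_json())` call per results directory.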

What to look for in review

  1. The freeze is real. Unit test create_snapshot_freezes_assertions_against_later_changes mutates the live tree after snapshotting and asserts the snapshot still has the original content.
  2. Sandbox invariant. apps/cli/src/commands/snapshot.rs module doc spells out: snapshot path must never be added to a container bind, baked into an image, or exposed via env. Reaches the container only via the existing post-agent docker cp.
  3. Snapshot timing. Created after --skip-existing / --limit / --dry-run filtering, so we don't copy files for runs that won't happen. Dry-runs don't snapshot at all.
  4. Lifecycle. Arc<RunSnapshot> is shared across matrix cells; the tempdir drops when execute() returns. Smoke-tested: a clean run leaves nothing behind; a Ctrl-C'd run leaves the tempdir on disk (expected: Drop doesn't run when the process is killed by a signal).
  5. Out of scope. build still reads scenario files live; harness bind mounts (~/.atlas etc.) unchanged; assertions stay out of the image (preserves 514-1419 / 514-1425).
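
The timing rule in item 3 can be illustrated with a small planning sketch. Everything here is hypothetical: the `Cell` and `RunOptions` types, the function names, the order in which skip-existing and limit are applied, and the choice to skip the snapshot when nothing is left to run are assumptions, not details taken from the PR.

```rust
/// Hypothetical matrix cell: one scenario/agent/model combination.
#[derive(Debug, Clone, PartialEq)]
pub struct Cell {
    pub id: String,
    pub has_result: bool,
}

/// Hypothetical subset of the run flags that affect planning.
pub struct RunOptions {
    pub skip_existing: bool,
    pub limit: Option<usize>,
    pub dry_run: bool,
}

/// Apply skip-existing first, then the limit (assumed order).
pub fn plan_cells(cells: &[Cell], opts: &RunOptions) -> Vec<Cell> {
    let remaining = cells
        .iter()
        .filter(|c| !(opts.skip_existing && c.has_result))
        .cloned();
    match opts.limit {
        Some(n) => remaining.take(n).collect(),
        None => remaining.collect(),
    }
}

/// Snapshot only when at least one cell will actually run.
/// Dry runs never snapshot.
pub fn needs_snapshot(planned: &[Cell], opts: &RunOptions) -> bool {
    !opts.dry_run && !planned.is_empty()
}
```

Deciding after filtering means the copy cost is paid only when at least one cell will actually execute.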

Test plan

  • cargo build — clean, no warnings.
  • cargo test — 64 unit + 11 e2e + 2 integration pass, including 4 new snapshot tests.
  • End-to-end smoke run (dec-bench run --scenario foo-bar-csv-ingest --agent cursor --model composer-2 --version v0.0.0-doesnotexist):
    • Snapshot summary printed with correct branch, sha, dirty=YES
    • Inline build path still works
    • Both phases ran; all 9 result files written including snapshot.json
    • Manifest content verified by hand
    • Tempdir cleaned up post-run

Not tested live:

  • A real >10-min matrix run with a mid-run branch switch. The unit test covers the same insulation property synthetically; happy to do a live run as part of rollout if you want extra evidence.

Follow-ups (separate tickets, not blocking)

  • Cut a CLI release. This is a runtime change, and the install script serves from the latest GitHub release.
  • Decide whether build should also snapshot. Lower priority; image content-hash already pins prompts/init.

🤖 Generated with Claude Code

@linear

linear Bot commented Apr 30, 2026

@vercel

vercel Bot commented Apr 30, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project: agent-evals-web
Deployment: Ready
Actions: Preview, Comment
Updated (UTC): May 5, 2026 10:19pm

@github-actions

github-actions Bot commented Apr 30, 2026

Preview CLI build for 0.2.2-preview.pr.92.11d0161 is ready.

curl -fsSL https://raw.githubusercontent.com/514-labs/agent-evals/11d0161d1ea43c468c80e544aa19485b400a1fad/install.sh | DEC_BENCH_INSTALL_VERSION=preview-pr-92-11d0161 sh

Release: https://github.com/514-labs/agent-evals/releases/tag/preview-pr-92-11d0161

Version policy:

  • crate version: 0.2.2
  • release tag: v0.2.2
  • preview tag: preview-pr-92-11d0161
  • default image suffix: v0.2.2

Once a `dec-bench run` invocation starts, mid-run branch switches and
rebases must not change what the evaluator sees. The two-phase
orchestration in run.rs reads `scenarios/<id>/assertions/` and
`packages/eval-core/src` from the host *after* the agent phase exits, so
those host reads shifted under in-flight runs when users rebased.

Snapshot both paths into a `tempfile::TempDir` at the top of each
invocation and point the runner at the snapshot. One snapshot per
invocation, shared across matrix cells via Arc, cleaned up on Drop.
At run start we print a summary (branch, sha, dirty=YES if uncommitted)
and write a `<run_id>.snapshot.json` manifest into each results dir.

Snapshot is host-only — its contents reach the container via the
existing post-agent `docker cp`, preserving the 514-1419 / 514-1425
sandboxing of assertion files from the agent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@oatsandsugar force-pushed the claude/sleepy-wu-29ada2 branch from 9f2a66f to 3999f8a on May 5, 2026 22:17
@oatsandsugar merged commit a6748e5 into main on May 6, 2026 (17 checks passed)
@oatsandsugar deleted the claude/sleepy-wu-29ada2 branch on May 6, 2026 18:01