This repository was archived by the owner on May 14, 2026. It is now read-only.
514-1515: insulate runs from local branch changes #92
Merged
Preview CLI build:
`DEC_BENCH_INSTALL_VERSION=preview-pr-92-11d0161 curl -fsSL https://raw.githubusercontent.com/514-labs/agent-evals/11d0161d1ea43c468c80e544aa19485b400a1fad/install.sh | sh`
Release: https://github.com/514-labs/agent-evals/releases/tag/preview-pr-92-11d0161
Version policy:
Once a `dec-bench run` invocation starts, mid-run branch switches and rebases must not change what the evaluator sees. The two-phase orchestration in `run.rs` reads `scenarios/<id>/assertions/` and `packages/eval-core/src` from the host *after* the agent phase exits, so those host reads shifted under in-flight runs when users rebased.

Snapshot both paths into a `tempfile::TempDir` at the top of each invocation and point the runner at the snapshot. One snapshot per invocation, shared across matrix cells via `Arc`, cleaned up on `Drop`.

At run start we print a summary (branch, sha, dirty=YES if uncommitted) and write a `<run_id>.snapshot.json` manifest into each results dir.

Snapshot is host-only: its contents reach the container via the existing post-agent `docker cp`, preserving the 514-1419 / 514-1425 sandboxing of assertion files from the agent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
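The snapshot lifecycle described in this commit message (copy once, share via `Arc`, clean up on `Drop`) can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: the `RunSnapshot` fields, `create`, `path`, and `copy_dir` are hypothetical, and the sketch builds its own directory under the system temp dir with the standard library instead of using `tempfile::TempDir`.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

// Per-process counter so each snapshot gets its own directory name.
// (`tempfile::TempDir` handles uniqueness properly; this is a stand-in.)
static NEXT_ID: AtomicU64 = AtomicU64::new(0);

/// Hypothetical stand-in for the PR's snapshot type: owns a private copy of
/// the host paths and deletes it when the last owner is dropped.
struct RunSnapshot {
    root: PathBuf,
}

impl RunSnapshot {
    /// Copy each `(name, source_dir)` pair into a fresh directory.
    fn create(sources: &[(&str, &Path)]) -> io::Result<Self> {
        let id = NEXT_ID.fetch_add(1, Ordering::Relaxed);
        let root = std::env::temp_dir()
            .join(format!("run-snapshot-{}-{}", std::process::id(), id));
        fs::create_dir_all(&root)?;
        for (name, src) in sources {
            copy_dir(src, &root.join(name))?;
        }
        Ok(Self { root })
    }

    /// Location of a captured directory inside the snapshot.
    fn path(&self, name: &str) -> PathBuf {
        self.root.join(name)
    }
}

impl Drop for RunSnapshot {
    fn drop(&mut self) {
        // Best-effort cleanup; this never runs if the process dies on a signal.
        let _ = fs::remove_dir_all(&self.root);
    }
}

/// Recursively copy `src` into `dst`, creating directories as needed.
fn copy_dir(src: &Path, dst: &Path) -> io::Result<()> {
    fs::create_dir_all(dst)?;
    for entry in fs::read_dir(src)? {
        let entry = entry?;
        let target = dst.join(entry.file_name());
        if entry.file_type()?.is_dir() {
            copy_dir(&entry.path(), &target)?;
        } else {
            fs::copy(entry.path(), &target)?;
        }
    }
    Ok(())
}

fn main() -> io::Result<()> {
    // Throwaway "live tree" standing in for scenarios/<id>/assertions/.
    let live = std::env::temp_dir().join(format!("live-tree-{}", std::process::id()));
    fs::create_dir_all(&live)?;
    fs::write(live.join("check.txt"), "original")?;

    let snapshot = Arc::new(RunSnapshot::create(&[("assertions", live.as_path())])?);
    let root = snapshot.root.clone();

    // Simulate a mid-run rebase changing the live file.
    fs::write(live.join("check.txt"), "rebased")?;

    // Two "matrix cells" share the same snapshot through the Arc.
    let handles: Vec<_> = (0..2)
        .map(|_| {
            let snap = Arc::clone(&snapshot);
            std::thread::spawn(move || {
                fs::read_to_string(snap.path("assertions").join("check.txt"))
            })
        })
        .collect();
    for handle in handles {
        assert_eq!(handle.join().unwrap()?, "original");
    }

    drop(snapshot); // last owner gone, so Drop removes the directory
    assert!(!root.exists());
    fs::remove_dir_all(&live)?;
    Ok(())
}
```

Because cleanup hangs off `Drop`, an interrupted process leaves the directory behind, which matches the Ctrl-C behaviour noted in the review checklist.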
Closes 514-1515.
Problem
Olivia hit a confusing failure: she switched branches and rebased while a `dec-bench run` was in flight, and the result was affected. Once a run starts, it shouldn't depend on the user's local branch state.

Root cause: the two-phase orchestration from 514-1419 / 514-1425 reads `scenarios/<id>/assertions/` and `packages/eval-core/src` fresh from the host after the agent phase exits, then `docker cp`'s them into the container. The agent phase is often 10+ minutes long, so a rebase mid-run silently shifts what the evaluator sees.

Scenario prompts, init scripts, and supervisord config are already pinned by the image hash, so they're not affected.
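To make the root cause concrete, here is a minimal standalone illustration of the late-read hazard. It is not taken from `run.rs`; `load_assertion`, `two_phase_demo`, and the file names are hypothetical. A file read after the long phase reflects whatever the working tree holds at that moment, while a copy pinned up front does not.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical "evaluate" step: returns the assertion text it was pointed at.
fn load_assertion(path: &Path) -> io::Result<String> {
    fs::read_to_string(path)
}

/// Runs a fake two-phase flow inside `work` and returns
/// (what a late live read sees, what a read of the pinned copy sees).
fn two_phase_demo(work: &Path) -> io::Result<(String, String)> {
    fs::create_dir_all(work)?;
    let live = work.join("assertion.txt");
    let pinned = work.join("assertion.pinned.txt");

    // State of the tree when the run starts.
    fs::write(&live, "expect rows == 10")?;
    // Pin a copy before the long phase begins.
    fs::copy(&live, &pinned)?;

    // The long agent phase would run here. Meanwhile the user rebases,
    // which rewrites the live file underneath the run.
    fs::write(&live, "expect rows == 99")?;

    Ok((load_assertion(&live)?, load_assertion(&pinned)?))
}

fn main() -> io::Result<()> {
    let work = std::env::temp_dir().join(format!("two-phase-demo-{}", std::process::id()));
    let (live_seen, pinned_seen) = two_phase_demo(&work)?;
    println!("live read:   {live_seen}");
    println!("pinned read: {pinned_seen}");
    fs::remove_dir_all(&work)?;
    Ok(())
}
```

The live read returns the post-rebase text and the pinned read returns the text from run start.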
Solution
Snapshot the two host-read paths into a `tempfile::TempDir` at the top of every `dec-bench run` invocation. The runner reads from the snapshot for the rest of the run.

- The snapshot captures the working tree, not `HEAD`, which preserves the "tweak an assertion and re-run without committing" loop. The trade-off is surfaced loudly (see below).
- One snapshot per invocation, shared across matrix cells via `Arc<RunSnapshot>`.
- `tempfile::TempDir` cleans up on `Drop` when `execute()` returns.

514-1419 / 514-1425 invariant preserved. The snapshot lives in a host-side tempdir; its contents reach the container only via the existing post-agent `docker cp` calls. The snapshot path is never bind-mounted, never baked into an image, never exposed via env vars. The agent phase still runs against the same sandboxed filesystem as before. Documented in the module-level comment on `apps/cli/src/commands/snapshot.rs`.

User experience
Every run now prints this at the top, before any container work:
A dirty tree triggers the loud `dirty=YES — testing uncommitted edits` line so it can't go unnoticed.

Each per-run results dir also gets a `<run_id>.snapshot.json` manifest (full SHA, branch, dirty, captured paths, tempdir root) so audit tooling can later prove which tree state produced a given result.

What to look for in review
- `create_snapshot_freezes_assertions_against_later_changes` mutates the live tree after snapshotting and asserts the snapshot still has the original content.
- The `apps/cli/src/commands/snapshot.rs` module doc spells out: the snapshot path must never be added to a container `bind`, baked into an image, or exposed via env. It reaches the container only via the existing post-agent `docker cp`.
- The snapshot is taken after `--skip-existing` / `--limit` / `--dry-run` filtering, so we don't copy files for runs that won't happen. Dry-runs don't snapshot at all.
- `Arc<RunSnapshot>` is shared across matrix cells; the tempdir drops when `execute()` returns. Smoke-tested: a clean run leaves no leftover; a Ctrl-C'd run does (expected, since `Drop` doesn't run on signal).
- `build` still reads scenario files live; harness bind mounts (`~/.atlas` etc.) unchanged; assertions stay out of the image (preserves 514-1419 / 514-1425).

Test plan
- `cargo build`: clean, no warnings.
- `cargo test`: 64 unit + 11 e2e + 2 integration pass, including 4 new snapshot tests.
- Smoke-tested (`dec-bench run --scenario foo-bar-csv-ingest --agent cursor --model composer-2 --version v0.0.0-doesnotexist`):
  - `dirty=YES`
  - `snapshot.json`

Not tested live:
Follow-ups (separate tickets, not blocking)
- `build` should also snapshot. Lower priority; image content-hash already pins prompts/init.

🤖 Generated with Claude Code