Skip to content

fix(engine/runtime): enforce run ownership locking and scrub inherited stage contract env#56

Open
vadimcomanescu wants to merge 3 commits intodanshapiro:mainfrom
vadimcomanescu:upstream/fix/runtime-run-ownership-locking
Open

fix(engine/runtime): enforce run ownership locking and scrub inherited stage contract env#56
vadimcomanescu wants to merge 3 commits intodanshapiro:mainfrom
vadimcomanescu:upstream/fix/runtime-run-ownership-locking

Conversation

@vadimcomanescu
Copy link

Summary

Hardens runtime safety by enforcing run ownership locks and preventing inherited stage contract env leakage.

Problem

  • Concurrent/resumed processes targeting the same run root can race on run artifacts.
  • Inherited stage contract env keys can leak into child execution contexts and create status contract confusion.

Why This Change

Run ownership must be explicit and recoverable under stale lock conditions, and per-stage contract env should not leak across boundaries.

What Changed

  • Added run ownership lock acquisition/release with stale owner reclaim logic.
  • Guarded lock release so foreign/mismatched owners are not removed.
  • Scrubbed inherited stage contract env keys from base node env.
  • Added regression tests for lock reclaim and env scrubbing behavior.

User-Visible Behavior

  • Safer run/resume ownership semantics under concurrent or stale process conditions.
  • Cleaner stage tool environments with reduced contract-env cross-contamination.

Risk

  • Locking logic is sensitive to PID/start-time detection.
  • Mitigation: table-driven tests cover live owner reject, stale reclaim, unreadable lock retries, and foreign release protection.

Validation

  • gofmt and go vet ./...
  • go test ./internal/attractor/engine -run 'TestAcquireRunOwnership_|TestReleaseRunOwnership_|TestBuildBaseNodeEnv_StripsKilroyContractEnvKeys|TestInputManifestContract_' -count=1

(cherry picked from commit a66b4f39f10f6c2ac53d44598c2544148356c19b)
(cherry picked from commit 05947fb09f9f187f0d3f080cd132c7d510b3cc83)
Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8bd275275f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +115 to +116
if stale {
if removeErr := os.Remove(lockPath); removeErr != nil && !errors.Is(removeErr, os.ErrNotExist) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Stop deleting unreadable lock files on short age threshold

Deleting run.lock.json as soon as it has been unreadable for 250ms can break mutual exclusion: if the current owner creates the file and is delayed before writing JSON (e.g., scheduler pause, slow filesystem, process stop/continue), a second process will remove the still-valid lock and acquire ownership, so both runs can execute concurrently against the same logs_root and race on checkpoints/artifacts.

Useful? React with 👍 / 👎.

@vadimcomanescu
Copy link
Author

Addressed Codex review finding in commit c194cfd.

  • Removed unreadable-lock stale-age reclaim logic.
  • Unreadable run.lock.json now retries and resolves as ownership conflict instead of being deleted.
  • Updated regression tests to assert unreadable lock preservation and conflict behavior.

Validation run: go test ./internal/attractor/engine -run "TestAcquireRunOwnership|TestRun_OwnershipConflict" -count=1.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant