Skip to content

Prevent duplicate cron execution and fix OpenCode validation#393

Closed
tbrandenburg wants to merge 4 commits intomainfrom
archon/thread-1e97102b
Closed

Prevent duplicate cron execution and fix OpenCode validation#393
tbrandenburg wants to merge 4 commits intomainfrom
archon/thread-1e97102b

Conversation

@tbrandenburg
Copy link
Copy Markdown
Owner

Summary

  • prevent duplicate cron execution across backend processes by adding PID ownership to backend cron startup and shutdown
  • fail backend startup fast when another live cron owner already exists, while replacing stale ownership automatically
  • restore OpenCode agent response parsing and harden CLI integration checks uncovered during validation

Changes

  • add cron PID ownership helpers, liveness checks, and cleanup paths in packages/pybackend/cron_service.py
  • add unit coverage for live-owner, stale-owner, cleanup, and partial-startup cron cases
  • restore parsed response_parts and combined_response handling in packages/pybackend/agent_cli.py
  • tighten OpenCode integration and parsing regression coverage, and apply required backend formatting cleanup
  • update .opencode/package-lock.json with the validation-time plugin refresh generated by OpenCode tooling

Validation

  • make qa-quick ✅ (381 passed, 1 skipped; existing sqlite ResourceWarning warnings remain)
  • cd packages/frontend && npx tsc --noEmit
  • make lint
  • cd packages/frontend && npx prettier --check src vite.config.ts vitest.config.ts eslint.config.js package.json tsconfig.json tsconfig.node.json index.html && cd ../pybackend && uv run ruff format --check *.py tests/integration/test_opencode_integration.py
  • cd packages/frontend && npm test && cd ../pybackend && uv run pytest -c pytest.cov.ini ✅ (504 passed, 2 skipped; existing integration mark and sqlite warnings remain)
  • make build
  • manual repro not run

Fixes #367

Tom Brandenburg added 2 commits April 28, 2026 00:21
Duplicate backend processes could each start their own cron scheduler against the same .made workspace, causing workflows and scheduled tasks to run twice silently.

Changes:
- Added cron PID ownership checks and stale-owner recovery before scheduler startup
- Released owned PID files on shutdown and after partial startup failures
- Added unit coverage for live-owner refusal, stale-owner replacement, and PID cleanup

Fixes #367
@vercel
Copy link
Copy Markdown

vercel Bot commented Apr 27, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
made Ready Ready Preview, Comment Apr 27, 2026 11:00pm

@tbrandenburg
Copy link
Copy Markdown
Owner Author

Comprehensive PR Review

PR: #393
Reviewed by: 5 specialized agents
Date: 2026-04-28T00:53:32+02:00


Summary

This consolidation is based on the artifacts present in this run. Only code-review-findings.md was available under the review artifact directory; error-handling, test-coverage, comment-quality, and docs-impact findings artifacts were not present. From the available review, the PR is directionally sound, but there is one merge-blocking cron ownership race and one medium-severity scheduler rollback gap.

Verdict: REQUEST_CHANGES

Severity Count
🔴 CRITICAL 0
🟠 HIGH 1
🟡 MEDIUM 1
🟢 LOW 0

🟠 High Issues

Cron ownership claim is still race-prone across concurrent startups

📍 packages/pybackend/cron_service.py:150

_claim_cron_ownership() uses a read-check-write flow on cron.pid without atomic acquisition. Two backend processes starting at the same time can both pass the check and both start schedulers, which leaves the duplicate-cron bug still open.

View recommended fix
def _claim_cron_ownership() -> tuple[Path, int]:
    pid_file = _get_cron_pid_file_path()
    current_pid = os.getpid()

    while True:
        try:
            fd = os.open(pid_file, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
        except FileExistsError:
            owner_pid = _read_cron_owner_pid(pid_file)
            if owner_pid is not None and owner_pid != current_pid:
                if _is_process_running(owner_pid):
                    raise RuntimeError(
                        f"Cron clock already owned by live process pid={owner_pid}"
                    )
            try:
                pid_file.unlink()
            except FileNotFoundError:
                pass
            continue

        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(f"{current_pid}\n")
        return pid_file, current_pid

🟡 Medium Issues

Startup rollback leaks a running scheduler if failure happens after scheduler.start()

📍 packages/pybackend/cron_service.py:547

If timeout monitor registration fails after scheduler startup, the error path releases the PID file but does not shut down the scheduler or clear _scheduler.

Options: Fix now | Create issue | Skip

View details

Recommended approach: register the timeout monitor before scheduler.start() and assign _scheduler only after startup fully succeeds. This is a low-effort change and avoids leaving a live in-memory scheduler behind after a reported startup failure.


✅ What's Good

  • The PR stays within the intended backend-local scope.
  • Focused tests were added for cron ownership and OpenCode parsing behavior.
  • The OpenCode parsing regression appears addressed with direct test coverage.

📋 Workflow Gap

The expected specialized review artifacts for error-handling, test-coverage, comment-quality, and docs-impact were not present in this run’s review directory, so their counts are effectively zero for this synthesis rather than explicit no-finding reviews.


Next Steps

  1. Fix the cron ownership race before merge.
  2. Prefer fixing the scheduler rollback gap in this PR as well.
  3. Backfill the missing specialized review artifacts if the workflow expected them.

Reviewed by Archon comprehensive-pr-review workflow
Artifacts: /home/tom/.archon/workspaces/tbrandenburg/made/artifacts/runs/0885119d8ac7b54e1f442a03b5b074a4/review/

@tbrandenburg
Copy link
Copy Markdown
Owner Author

Fix Report: PR #393

Date: 2026-04-28T00:56:24+02:00
Status: COMPLETE
Branch: archon/thread-1e97102b
Commit: 418fc8bfdadd644900fb00a61b8e7d111afe93cd
Philosophy: Aggressive fix — lean towards fixing everything


Summary

The review surfaced two cron startup defects in packages/pybackend/cron_service.py, and both were fixed in this PR branch. Cron ownership is now claimed with an atomic create-and-retry flow, and scheduler startup now remains fail-fast until the timeout monitor is registered. Two regression tests were added to lock both behaviors in place. No findings were skipped or blocked.


Fixes Applied

Severity Finding Location What Was Done
HIGH Cron ownership claim is still race-prone across concurrent startups packages/pybackend/cron_service.py:150 Reworked _claim_cron_ownership() to use `os.open(..., O_CREAT
MEDIUM Startup rollback leaks a running scheduler if failure happens after scheduler.start() packages/pybackend/cron_service.py:558 Moved timeout monitor registration ahead of scheduler.start() and deferred _scheduler assignment until startup fully succeeds.

Tests Added

File Test Cases
packages/pybackend/tests/unit/test_cron_service.py test_claim_cron_ownership_retries_after_stale_pid_race; test_start_cron_clock_releases_pid_when_timeout_monitor_setup_fails

Docs Updated

(none)


Skipped Findings

(none)


Blocked (Could Not Fix)

(none)


Suggested Follow-up Issues

(none)


Validation

Check Status
Type check npm --workspace packages/frontend exec tsc --noEmit
Lint make qa-quick
Tests ✅ 383 passed, 1 skipped via make qa-quick; 26 passed in targeted tests/unit/test_cron_service.py

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: Multiple backend instances cause duplicate cron job execution

1 participant