
fix(scheduler): seed recurring_jobs via migration + correct worker healthcheck#388

Merged
remyluslosius merged 3 commits into main from fix/seed-recurring-jobs on Apr 14, 2026

Conversation

@remyluslosius
Contributor

Fixes #383.

Summary

  • Adds Alembic migration 054_seed_recurring_jobs that inserts the 9 baseline recurring schedules into the recurring_jobs table on every deploy, idempotently
  • Overrides the worker container healthcheck in docker-compose.yml (was inheriting the backend Dockerfile's curl localhost:8000, which hits nothing in the worker container and reports unhealthy forever)

Why

The PostgreSQL job queue's scheduler polls recurring_jobs every 10s and enqueues due entries. On a fresh deploy, the table was empty and stayed empty — no scheduled jobs, no host monitoring, no compliance scans. A silent failure, with no errors logged.
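For illustration only, the poll step amounts to roughly the following (function and column names here are assumptions, not the project's actual API):

```python
from datetime import datetime, timezone

POLL_INTERVAL_S = 10  # the 10s poll cadence described above


def poll_once(fetch_due_jobs, enqueue):
    """One scheduler tick: enqueue every recurring job whose next run is due.

    fetch_due_jobs(now) is assumed to return rows from recurring_jobs with
    next_run_at <= now; enqueue(job) pushes a job onto the queue table.
    With recurring_jobs empty, the loop body never executes -- hence the
    silent failure: nothing errors, nothing is due.
    """
    now = datetime.now(timezone.utc)
    for job in fetch_due_jobs(now):
        enqueue(job)
```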

The app/services/job_queue/seed_schedule.py module exists and works, but nothing invoked it: not the worker entrypoint, not docker-compose, not a FastAPI startup hook, not a migration.

Discovered in production 2026-04-13 when worker had been up 5 hours with zero jobs dequeued, last host liveness ping 5 hours stale, last scan 5.5 hours overdue.

Why a migration (vs. entrypoint hook or FastAPI startup event)

  • Migration: runs once, on schema upgrade, naturally fits Alembic's existing DB provisioning flow. No redundant round-trips on every worker restart. Downgrade path well-defined.
  • Entrypoint hook (rejected): runs on every worker container start. Wastes DB round-trips. Couples schedule availability to worker lifecycle rather than DB lifecycle.
  • FastAPI startup event (rejected): wrong layer — couples API startup to scheduler state. If seed fails, does the API refuse to serve?

Idempotency

ON CONFLICT (name) DO NOTHING means the migration is safe to re-run against a DB where someone manually invoked python -m app.services.job_queue.seed_schedule. Validated against the production DB here, which had 8 rows from yesterday's manual seed plus one missing (retention policies, added later); the migration correctly inserted only the missing row.
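As a sketch of the idempotent insert, the statement can be built with SQLAlchemy's PostgreSQL dialect (the table below is a minimal stand-in; the real recurring_jobs columns may differ):

```python
from sqlalchemy import Column, Integer, MetaData, String, Table
from sqlalchemy.dialects import postgresql

metadata = MetaData()
# Minimal stand-in for recurring_jobs; only the unique `name` column matters here.
recurring_jobs = Table(
    "recurring_jobs",
    metadata,
    Column("name", String, primary_key=True),
    Column("interval_seconds", Integer),
)


def seed_stmt(rows):
    """Build INSERT ... ON CONFLICT (name) DO NOTHING for the given rows."""
    stmt = postgresql.insert(recurring_jobs).values(rows)
    return stmt.on_conflict_do_nothing(index_elements=["name"])


sql = str(
    seed_stmt([{"name": "dispatch_host_checks", "interval_seconds": 30}])
    .compile(dialect=postgresql.dialect())
)
```

Inside the migration, the statement would run via `op.execute()` in `upgrade()`, with a `downgrade()` that deletes only the named rows.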

Future schedule changes

If a new recurring schedule is added to SCHEDULE in app/services/job_queue/seed_schedule.py, add a follow-up migration (055_add_<name>_schedule.py) rather than editing this one. Keeps the migration history honest about what was seeded when.

Worker healthcheck

Replaces curl localhost:8000/health with a SQLAlchemy SELECT 1 against the configured DB URL. Rationale: the worker's only hard dependency is DB connectivity — without it, it can't dequeue or enqueue anything. A "worker is alive" probe that doesn't touch the thing the worker needs is not a healthcheck.
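A minimal sketch of such a probe, returning an exit-code-style result (the `DATABASE_URL` env var name and sqlite fallback are assumptions for illustration):

```python
import os
import sys

from sqlalchemy import create_engine, text


def db_healthcheck() -> int:
    """Return 0 iff SELECT 1 succeeds against the configured DB, else 1."""
    # Assumed env var; in-memory sqlite fallback so the sketch runs anywhere.
    url = os.environ.get("DATABASE_URL", "sqlite://")
    try:
        engine = create_engine(url, pool_pre_ping=True)
        with engine.connect() as conn:
            conn.execute(text("SELECT 1"))
        return 0
    except Exception:
        return 1


if __name__ == "__main__":
    sys.exit(db_healthcheck())
```

docker-compose would then point the worker service's healthcheck at this script, overriding the curl probe inherited from the backend Dockerfile.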

Test plan

  • Applied migration SQL directly to running DB; RETURNING name reported only the one missing row inserted, 8 conflicts silently skipped
  • black --check passes on the migration file
  • Schedulers verified running: dispatch_host_checks + dispatch_compliance_scans executing every 30s/2min, check_host_connectivity fanout working across 7 hosts, run_scheduled_kensa_scan firing for overdue hosts
  • CI pipeline passes
  • Fresh-deploy smoke test (spin up clean DB, run alembic upgrade head, verify SELECT count(*) FROM recurring_jobs returns 9)

remyluslosius and others added 3 commits April 13, 2026 22:00
…althcheck

Fixes #383.

The adaptive schedulers (host monitoring, compliance scanning) were silently
dormant on fresh deploys because recurring_jobs was never populated. The
seed_schedule module existed but was never invoked by any startup path.

Migration 054 inserts the 9 baseline schedules with ON CONFLICT (name)
DO NOTHING so it is idempotent against manual invocations of the seed
script and safe to re-run. Downgrade removes only these 9 named rows,
leaving operator-added schedules untouched.

Also overrides the worker container healthcheck, which inherited the
backend Dockerfile's curl-localhost-8000 probe and reported unhealthy
forever. The new probe verifies DB connectivity via SQLAlchemy, which
is the actual precondition for worker function.
@remyluslosius remyluslosius merged commit 7286c82 into main Apr 14, 2026
26 checks passed
@remyluslosius remyluslosius deleted the fix/seed-recurring-jobs branch April 14, 2026 10:08


Development

Successfully merging this pull request may close these issues.

Bug: recurring_jobs table never seeded — adaptive schedulers silently dormant on fresh deploy
