feat: resilient background job retry & monitoring (fixes #130) #641
Open
DrGalio wants to merge 1 commit into rohitdash08:main from
Implements a production-grade background job execution system with:

Core Engine (services/jobs.py):
- Exponential backoff retries (30s → 2min → 8min)
- Dead-letter queue for permanently failed jobs
- @register_handler decorator for pluggable job types
- process_due_jobs() batch processor with configurable limits
- retry_dead_letter_job() for manual reprocessing
- get_job_stats() for monitoring aggregation

Job Model (models.py):
- BackgroundJob model with full lifecycle tracking
- JobStatus enum: PENDING, RUNNING, SUCCESS, FAILED, DEAD_LETTER
- Tracks: attempt count, max_retries, next_run_at, last_error, result

Reminders Integration (routes/reminders.py):
- run_due endpoint now enqueues jobs instead of fire-and-forget
- send_reminder raises on failure for proper retry handling
- Handler registered via @register_handler decorator

Admin Monitoring API (routes/jobs.py):
- GET /admin/jobs/stats: aggregated stats by status/type
- GET /admin/jobs: paginated list with status/type filters
- GET /admin/jobs/:id: full job details
- POST /admin/jobs/:id/retry: manual dead-letter retry
- POST /admin/jobs/process: manual batch processing trigger
- DELETE /admin/jobs/:id: remove job record
- All endpoints require admin role

Observability (observability.py):
- finmind_job_events_total Prometheus counter (event/job_type/status)
- track_job_event() helper function

Database:
- background_jobs table with indexes (schema.sql)
- Migration file for existing deployments (001_background_jobs.sql)
- Auto-migration via _ensure_schema_compatibility()

OpenAPI (openapi.yaml):
- Full documentation for all 6 admin endpoints
- BackgroundJob schema definition
- Jobs tag added

Tests (tests/test_jobs.py):
- 20 tests covering service layer and admin API
- Backoff calculation, enqueue, execute, retry, dead-letter
- Admin auth enforcement (403 for non-admin, 401 for unauth)
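For reviewers who want the retry lifecycle in one place, here is a minimal in-memory sketch of how the pieces listed in the commit message could fit together. It is not the code in services/jobs.py. The `Job` dataclass, `execute_job`, `backoff_delay`, and the constants are hypothetical names. The base delay of 30 s and the multiplier of 4 are assumptions chosen only because they reproduce the listed 30s → 2min → 8min schedule. Treating `max_retries=3` as "three retries after the first attempt" is also an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Any, Callable, Dict, Optional

BASE_DELAY_SECONDS = 30  # assumed base; matches the first listed delay
BACKOFF_FACTOR = 4       # assumed multiplier; 30s -> 2min -> 8min


def backoff_delay(attempt: int) -> timedelta:
    """Delay before the retry that follows failed attempt `attempt` (1-based)."""
    return timedelta(seconds=BASE_DELAY_SECONDS * BACKOFF_FACTOR ** (attempt - 1))


class JobStatus(Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    SUCCESS = "SUCCESS"
    FAILED = "FAILED"
    DEAD_LETTER = "DEAD_LETTER"


@dataclass
class Job:
    """Hypothetical in-memory stand-in for the BackgroundJob model."""
    job_type: str
    payload: Dict[str, Any]
    max_retries: int = 3
    attempt: int = 0
    status: JobStatus = JobStatus.PENDING
    next_run_at: Optional[datetime] = None
    last_error: Optional[str] = None
    result: Any = None


# Registry mapping a job type name to the callable that processes it.
_HANDLERS: Dict[str, Callable[[Dict[str, Any]], Any]] = {}


def register_handler(job_type: str):
    """Decorator that registers a function as the handler for `job_type`."""
    def decorator(func: Callable[[Dict[str, Any]], Any]):
        _HANDLERS[job_type] = func
        return func
    return decorator


def execute_job(job: Job, now: datetime) -> Job:
    """Run one attempt; on failure schedule a retry or dead-letter the job."""
    job.status = JobStatus.RUNNING
    job.attempt += 1
    try:
        # A missing handler raises KeyError and is treated like any failure.
        job.result = _HANDLERS[job.job_type](job.payload)
        job.status = JobStatus.SUCCESS
    except Exception as exc:
        job.last_error = str(exc)
        if job.attempt > job.max_retries:
            job.status = JobStatus.DEAD_LETTER  # retries exhausted
        else:
            job.status = JobStatus.FAILED
            job.next_run_at = now + backoff_delay(job.attempt)
    return job
```

Because the handler signals failure by raising, the same path covers transient errors (for example, a reminder send that fails) and programming errors. Whether the real implementation distinguishes the two is not stated in this PR.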
Summary
Implements a production-grade, resilient background job execution system with exponential backoff retries, a dead-letter queue, and admin monitoring, addressing Issue #130 ($250 bounty).
What's Built
Core Engine (services/jobs.py)
- @register_handler decorator: Pluggable job type system; just register a handler and enqueue jobs
- process_due_jobs(): Batch processor with configurable limits
- retry_dead_letter_job(): Manual reprocessing of dead-lettered jobs
- get_job_stats(): Aggregated monitoring by status and job type

Job Model (models.py)
- BackgroundJob model with full lifecycle tracking
- JobStatus enum: PENDING → RUNNING → SUCCESS | FAILED | DEAD_LETTER

Reminders Integration (routes/reminders.py)
- /reminders/run now enqueues jobs instead of fire-and-forget sends

Admin Monitoring API (routes/jobs.py)
- GET /admin/jobs/stats
- GET /admin/jobs with ?status= and ?job_type= filters
- GET /admin/jobs/:id
- POST /admin/jobs/:id/retry
- POST /admin/jobs/process
- DELETE /admin/jobs/:id

All endpoints require admin role (403 for non-admin, 401 for unauthenticated).
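The admin-only rule and the stats aggregation can be illustrated without a web framework. The sketch below is a hedged illustration, not the code in routes/jobs.py. The function name `handle_stats_request`, the `role` field on the user, and the response shape (`total`, `by_status`, `by_type`) are all assumptions made for the example.

```python
from collections import Counter
from typing import Any, Dict, List, Mapping, Optional, Tuple


def handle_stats_request(
    user: Optional[Mapping[str, str]],
    jobs: List[Mapping[str, str]],
) -> Tuple[int, Dict[str, Any]]:
    """Return (HTTP status code, JSON-like body) for a stats request."""
    # No authenticated user at all: 401 Unauthorized.
    if user is None:
        return 401, {"error": "authentication required"}
    # Authenticated, but lacking the admin role: 403 Forbidden.
    if user.get("role") != "admin":
        return 403, {"error": "admin role required"}

    # Aggregate job counts by status and by job type.
    by_status = Counter(job["status"] for job in jobs)
    by_type = Counter(job["job_type"] for job in jobs)
    return 200, {
        "total": len(jobs),
        "by_status": dict(by_status),
        "by_type": dict(by_type),
    }
```

Checking authentication before authorization keeps the two failure modes distinct, which is what lets tests assert 401 and 403 separately.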
Observability
- finmind_job_events_total (labels: event, job_type, status)

Database
- background_jobs table with proper indexes (status+next_run_at, job_type, created_at)
- migrations/001_background_jobs.sql
- _ensure_schema_compatibility(): zero-downtime deploy

OpenAPI
- BackgroundJob schema definition
- Jobs tag

Tests (tests/test_jobs.py)
20 tests covering:
- get_job_stats()

Acceptance Criteria
Before → After
Before:
After:
Files Changed
12 files changed, 1211 insertions(+), 9 deletions(-)