Contributor

jumski commented Jan 12, 2026

Add automatic requeue for stalled tasks via cron job

This PR adds automatic detection and requeuing of tasks that have stalled because of worker crashes or other failures. Key features:

  • Added a requeue_stalled_tasks() function that identifies tasks stuck in 'started' status beyond their timeout window
  • Tasks can be requeued up to 3 times before being marked as failed
  • Added tracking columns to the step_tasks table: requeued_count and last_requeued_at
  • Implemented a configurable cron job via setup_requeue_stalled_tasks_cron() that runs every 15 seconds by default
  • Added a comprehensive test suite covering basic requeuing, max requeue limits, and multi-flow scenarios
  • Increased the default visibility timeout in edge-worker from 2 to 5 seconds for better reliability

This enhancement improves system resilience by ensuring tasks don't remain stuck when workers crash unexpectedly, addressing issue #586.
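
For readers who want the requeue rule spelled out, here is a minimal in-memory sketch of the decision described in the list above. It is not the PR's code: the actual logic lives in the SQL function requeue_stalled_tasks(), which is not shown in this conversation. The StepTask shape, the timeoutSeconds field, the status names other than 'started' and failed, and the choice to clear startedAt on requeue are all assumptions made for illustration; only the limit of 3 requeues and the two tracking columns come from the description.

```typescript
// Hypothetical model of a step_tasks row; field names are illustrative only.
type TaskStatus = "queued" | "started" | "completed" | "failed";

interface StepTask {
  id: number;
  status: TaskStatus;
  startedAt: Date | null;
  timeoutSeconds: number; // assumed per-task timeout window
  requeuedCount: number; // mirrors the requeued_count column
  lastRequeuedAt: Date | null; // mirrors the last_requeued_at column
}

// The PR description states a task can be requeued up to 3 times.
const MAX_REQUEUES = 3;

// Scans tasks once, the way a single cron tick might, and mutates them in place.
function requeueStalledTasks(
  tasks: StepTask[],
  now: Date
): { requeued: number; failed: number } {
  let requeued = 0;
  let failed = 0;

  for (const task of tasks) {
    // Only tasks currently marked 'started' can be considered stalled.
    if (task.status !== "started" || task.startedAt === null) continue;

    // A task is stalled once it has run longer than its timeout window.
    const elapsedMs = now.getTime() - task.startedAt.getTime();
    if (elapsedMs <= task.timeoutSeconds * 1000) continue;

    if (task.requeuedCount >= MAX_REQUEUES) {
      // Requeue budget exhausted: mark the task as failed.
      task.status = "failed";
      failed += 1;
    } else {
      // Put the task back in the queue and record the requeue.
      task.status = "queued";
      task.startedAt = null; // assumption: a requeued task has no start time
      task.requeuedCount += 1;
      task.lastRequeuedAt = now;
      requeued += 1;
    }
  }

  return { requeued, failed };
}
```

In this sketch a task that stalls for the fourth time is failed rather than requeued, which is one reading of "up to 3 times"; the SQL in the PR is the authority on the exact boundary.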

changeset-bot bot commented Jan 12, 2026

🦋 Changeset detected

Latest commit: b3bc1e9

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 5 packages:

| Name | Type |
| --- | --- |
| @pgflow/core | Patch |
| @pgflow/edge-worker | Patch |
| pgflow | Patch |
| @pgflow/client | Patch |
| @pgflow/dsl | Patch |

Contributor Author

jumski commented Jan 12, 2026

This stack of pull requests is managed by Graphite.

nx-cloud bot commented Jan 12, 2026

CI Pipeline Execution for commit b3bc1e9

| Command | Status | Duration |
| --- | --- | --- |
| nx affected -t lint typecheck test --parallel -... | ❌ Failed | 1m 55s |
| nx run edge-worker:test:integration | ✅ Succeeded | 5m 17s |
| nx run client:e2e | ✅ Succeeded | 2m 57s |
| nx run core:pgtap | ✅ Succeeded | 1m 50s |
| nx run cli:e2e | ✅ Succeeded | 6s |
| nx run edge-worker:e2e | ✅ Succeeded | 50s |

☁️ Nx Cloud last updated this comment at 2026-01-12 10:03:06 UTC

… logic

- Introduced requeued_count and last_requeued_at columns to step_tasks table
- Developed requeue_stalled_tasks function to requeue or fail stalled tasks based on max requeues
- Created setup_requeue_stalled_tasks_cron function to schedule automatic requeue checks
- Updated migration scripts to include new columns and functions
- Added comprehensive tests for requeue behavior, max requeue limit, and cron setup

jumski force-pushed the 01-12-pgf-aav_implement_requeue_for_stalled_tasks branch from 7dabec6 to b3bc1e9 on January 12, 2026 09:54