@jumski jumski commented Jan 12, 2026

Add automatic requeue for stalled tasks via cron job

This PR implements a system to automatically detect and requeue tasks that have stalled due to worker crashes or other issues. Key features:

  • Added a requeue_stalled_tasks() function that identifies tasks stuck in 'started' status beyond their timeout window
  • Tasks can be requeued up to 3 times before being marked as failed
  • Added tracking columns to step_tasks table: requeued_count and last_requeued_at
  • Implemented a configurable cron job via setup_requeue_stalled_tasks_cron() that runs every 15 seconds by default
  • Added comprehensive test suite covering basic requeuing, max requeue limits, and multi-flow scenarios
  • Increased default visibility timeout in edge-worker from 2 to 5 seconds for better reliability

This enhancement improves system resilience by ensuring tasks don't remain stuck when workers crash unexpectedly, addressing issue #586.
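
For readers skimming the description, the requeue-or-fail decision can be sketched roughly as below. This is an in-memory TypeScript illustration, not the SQL in this PR: the `StepTask` shape, the `timeoutSeconds` field, the `isStalled` helper, and the transition back to a `queued` status are assumptions made for the example. Only the 'started' status check, the limit of 3 requeues, and the two tracking fields come from the description above.

```typescript
// Hypothetical in-memory model of a row in step_tasks. Only requeuedCount and
// lastRequeuedAt correspond to columns named in the PR description; the other
// fields are assumptions for this sketch.
type TaskStatus = "queued" | "started" | "completed" | "failed";

interface StepTask {
  id: number;
  status: TaskStatus;
  startedAt: Date | null;
  timeoutSeconds: number; // assumed per-task timeout window
  requeuedCount: number;
  lastRequeuedAt: Date | null;
}

// Limit stated in the PR description.
const MAX_REQUEUES = 3;

// A task counts as stalled when it is still 'started' after its timeout window.
function isStalled(task: StepTask, now: Date): boolean {
  if (task.status !== "started" || task.startedAt === null) return false;
  const elapsedMs = now.getTime() - task.startedAt.getTime();
  return elapsedMs > task.timeoutSeconds * 1000;
}

// Requeues stalled tasks that are under the limit and fails the rest.
// Mutates the tasks in place and returns the number of tasks requeued.
function requeueStalledTasks(tasks: StepTask[], now: Date): number {
  let requeued = 0;
  for (const task of tasks) {
    if (!isStalled(task, now)) continue;
    if (task.requeuedCount >= MAX_REQUEUES) {
      // Already requeued the maximum number of times: give up on the task.
      task.status = "failed";
      continue;
    }
    task.status = "queued"; // assumed target status for a requeued task
    task.startedAt = null;
    task.requeuedCount += 1;
    task.lastRequeuedAt = now;
    requeued += 1;
  }
  return requeued;
}
```

In this sketch a periodic caller (the cron job in the PR, a timer in a test) would invoke `requeueStalledTasks` on each tick; keeping the counter on the task itself is what lets a later sweep tell a first stall from a fourth.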

changeset-bot bot commented Jan 12, 2026

🦋 Changeset detected

Latest commit: c5d97c5

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 5 packages
| Name | Type |
| --- | --- |
| @pgflow/core | Patch |
| @pgflow/edge-worker | Patch |
| pgflow | Patch |
| @pgflow/client | Patch |
| @pgflow/dsl | Patch |

jumski commented Jan 12, 2026

This stack of pull requests is managed by Graphite. Learn more about stacking.

nx-cloud bot commented Jan 12, 2026

View your CI Pipeline Execution ↗ for commit c5d97c5

| Command | Status | Duration | Result |
| --- | --- | --- | --- |
| nx run edge-worker:test:integration | ✅ Succeeded | 3m 46s | View ↗ |
| nx run client:e2e | ✅ Succeeded | 1m 10s | View ↗ |
| nx run core:pgtap | ✅ Succeeded | 1m 35s | View ↗ |
| nx run edge-worker:e2e | ✅ Succeeded | 51s | View ↗ |
| nx affected -t verify-exports --base=origin/mai... | ✅ Succeeded | 3s | View ↗ |
| nx affected -t build --configuration=production... | ✅ Succeeded | 3s | View ↗ |
| nx affected -t lint typecheck test --parallel -... | ✅ Succeeded | 41s | View ↗ |
| nx run cli:e2e | ✅ Succeeded | 6s | View ↗ |

☁️ Nx Cloud last updated this comment at 2026-01-20 22:07:54 UTC

@jumski jumski force-pushed the 01-12-pgf-aav_implement_requeue_for_stalled_tasks branch 4 times, most recently from 367abd2 to 3653295 Compare January 19, 2026 07:25
@jumski jumski force-pushed the 01-12-pgf-aav_implement_requeue_for_stalled_tasks branch from 3653295 to 8f162cb Compare January 19, 2026 07:47
@jumski jumski mentioned this pull request Jan 19, 2026
@jumski jumski force-pushed the 01-12-pgf-aav_implement_requeue_for_stalled_tasks branch from 8f162cb to fadb95d Compare January 20, 2026 21:40
@jumski jumski force-pushed the 01-12-pgf-aav_implement_requeue_for_stalled_tasks branch from fadb95d to d49909b Compare January 20, 2026 21:48
… logic

- Introduced requeued_count and last_requeued_at columns to step_tasks table
- Developed requeue_stalled_tasks function to requeue or fail stalled tasks based on max requeues
- Created setup_requeue_stalled_tasks_cron function to schedule automatic requeue checks
- Updated migration scripts to include new columns and functions
- Added comprehensive tests for requeue behavior, max requeue limit, and cron setup
@jumski jumski force-pushed the 01-12-pgf-aav_implement_requeue_for_stalled_tasks branch from d49909b to c5d97c5 Compare January 20, 2026 22:01