feat: add automatic requeue for stalled tasks via cron job #591
base: main
🦋 Changeset detected. Latest commit: c5d97c5. The changes in this PR will be included in the next version bump. This PR includes changesets to release 5 packages.
… logic
- Introduced requeued_count and last_requeued_at columns to step_tasks table
- Developed requeue_stalled_tasks function to requeue or fail stalled tasks based on max requeues
- Created setup_requeue_stalled_tasks_cron function to schedule automatic requeue checks
- Updated migration scripts to include new columns and functions
- Added comprehensive tests for requeue behavior, max requeue limit, and cron setup

Add automatic requeue for stalled tasks via cron job
This PR implements a system to automatically detect and requeue tasks that have stalled due to worker crashes or other issues. Key features:
- A requeue_stalled_tasks() function that identifies tasks stuck in 'started' status beyond their timeout window
- Two new columns on the step_tasks table: requeued_count and last_requeued_at
- A setup_requeue_stalled_tasks_cron() function that schedules the check, which runs every 15 seconds by default

This enhancement improves system resilience by ensuring tasks don't remain stuck when workers crash unexpectedly, addressing issue #586.
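
The requeue-or-fail decision described in this PR can be illustrated with a small in-memory sketch. This is not the PR's implementation, which lives in database functions and migrations. The field names started_at and timeout_seconds, the 'queued' and 'failed' status values, and the default max_requeues of 3 are assumptions made for illustration. Only requeued_count, last_requeued_at, and the 'started' status come from the description above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


@dataclass
class StepTask:
    # 'started' comes from the PR description; other status values are assumed.
    status: str
    started_at: Optional[datetime]  # hypothetical field name
    timeout_seconds: int            # hypothetical field name
    requeued_count: int = 0
    last_requeued_at: Optional[datetime] = None


def requeue_stalled_tasks(
    tasks: List[StepTask], now: datetime, max_requeues: int = 3
) -> Tuple[int, int]:
    """Requeue or fail stalled tasks; return (requeued, failed) counts."""
    requeued = 0
    failed = 0
    for task in tasks:
        # Only tasks currently marked 'started' can be stalled.
        if task.status != "started" or task.started_at is None:
            continue
        deadline = task.started_at + timedelta(seconds=task.timeout_seconds)
        if now <= deadline:
            continue  # still inside its timeout window
        if task.requeued_count < max_requeues:
            # Put the task back in the queue and record the requeue.
            task.status = "queued"
            task.started_at = None
            task.requeued_count += 1
            task.last_requeued_at = now
            requeued += 1
        else:
            # Requeue budget exhausted: fail instead of looping forever.
            task.status = "failed"
            failed += 1
    return requeued, failed
```

In this sketch a scheduler would call requeue_stalled_tasks on a fixed interval, which mirrors the role of the cron job; capping requeues keeps a task that repeatedly crashes its worker from being retried indefinitely.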