[WIP] feat: add support for native retry policies #4634
dejanzele wants to merge 4 commits into armadaproject:master
Signed-off-by: Jason Parraga <jparraga+gh@stackav.com>
Signed-off-by: Jason Parraga <jparraga+gh@stackav.com>
Signed-off-by: Jason Parraga <jparraga+gh@stackav.com>
Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
One thing we ran into recently is that there can also be collisions on service names and ingress names. This would only affect folks opting into those features, so it can probably be handled in a follow-up pull request.
Another issue that we see daily, and that seems to completely disrupt scheduling, is that there is no concept of a gang generation. This specifically seems to happen after the scheduler has preempted a gang. Later, when the scheduler runs its logical schedule-and-preempt pass, it still sees old pods from the original gang that were preempted but remain on the nodes, and we then get error messages. I believe it misinterprets the real pods left on the nodes from the previous gang generation as belonging to the logical schedule/preempt step of the regular scheduling algorithm, and then blows up.
This PR continues the work started by @Sovietaced. Huge shoutout and big thanks for doing the initial work!
What type of PR is this?
Feature
What this PR does / why we need it
Adds native preemption retry support to Armada, allowing preempted jobs to be automatically requeued instead of failing permanently. This improves cluster utilization and user experience by giving preemptible workloads a chance to complete even when temporarily displaced by higher-priority jobs.
When a job is preempted by a higher-priority job, it currently fails permanently, and users must manually resubmit it.
Jobs can now opt in to automatic retry after preemption. When a preemptible job that has opted in is preempted, it is requeued instead of failing permanently.
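A minimal sketch of the requeue decision, not the code in this PR. The type `PreemptionRetryPolicy` and the function `ShouldRequeue` are hypothetical names introduced for illustration. Their fields mirror the `preemptionRetryEnabled` and `preemptionMaxRetryCount` settings, and the sketch assumes a job is requeued only while its number of earlier preemption retries is below the configured maximum.

```go
package main

import "fmt"

// PreemptionRetryPolicy is a hypothetical type for illustration only.
// Its fields mirror preemptionRetryEnabled and preemptionMaxRetryCount;
// the real Armada types and names may differ.
type PreemptionRetryPolicy struct {
	Enabled       bool
	MaxRetryCount uint
}

// ShouldRequeue reports whether a preempted job should be requeued.
// Assumed semantics: the job must have opted in, and the number of
// retries already consumed must be below the configured maximum.
func ShouldRequeue(p PreemptionRetryPolicy, retriesSoFar uint) bool {
	if !p.Enabled {
		return false
	}
	return retriesSoFar < p.MaxRetryCount
}

func main() {
	policy := PreemptionRetryPolicy{Enabled: true, MaxRetryCount: 2}
	// Walk through successive preemptions of the same job.
	for retries := uint(0); retries <= 2; retries++ {
		fmt.Printf("retries so far=%d requeue=%t\n", retries, ShouldRequeue(policy, retries))
	}
}
```

Under these assumptions, a job that has not opted in is never requeued, and a job that has used up its retries fails permanently on its next preemption, as it does today.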
Server configuration:
Scheduler configuration:
Example job which sets custom retry rules:
All jobs in a gang must have consistent preemption retry configuration:
preemptionRetryEnabled value
preemptionMaxRetryCount value
Which issue(s) this PR fixes
Fixes #4340
Special notes for your reviewer