
[WIP] feat: add support for native retry policies #4634

Open

dejanzele wants to merge 4 commits into armadaproject:master from dejanzele:preemption-retries

Conversation

@dejanzele
Member

@dejanzele dejanzele commented Jan 27, 2026

This work continues the initial effort by @Sovietaced. Huge shoutout and big thanks for doing the groundwork!

What type of PR is this?

Feature

What this PR does / why we need it

Adds native preemption retry support to Armada, allowing preempted jobs to be automatically requeued instead of failing permanently. This improves cluster utilization and user experience by giving preemptible workloads a chance to complete even when temporarily displaced by higher-priority jobs.

When a job is preempted by a higher-priority job, it currently fails permanently. Users must manually resubmit preempted jobs, which is:

  • Tedious for batch workloads with many jobs
  • Error-prone (users may not notice preemptions)
  • Wasteful of already-completed work in long-running jobs

Jobs can now opt in to automatic retry after preemption. When a preemptible job is preempted (a sketch of the decision logic follows this list):

  1. The run is marked as preempted (terminal state for the run)
  2. If retries are enabled and retry count is not exhausted, the job is requeued
  3. The job gets a new run when resources become available
  4. After exhausting retries, the job fails permanently
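
For illustration only, here is a minimal Go sketch of the requeue-or-fail decision described above. The type and function names (jobState, shouldRequeue) are hypothetical and do not come from this PR:

package main

import "fmt"

// jobState is a simplified, hypothetical view of the state the scheduler
// needs for the preemption retry decision described above.
type jobState struct {
    RetriesEnabled bool // from the job's preemptionRetryEnabled annotation
    MaxRetryCount  int  // from the annotation, or the scheduler default
    RetryCount     int  // preemption retries already consumed
}

// shouldRequeue returns true if a preempted job should be requeued for a
// new run instead of failing permanently (steps 2-4 above).
func shouldRequeue(j jobState) bool {
    return j.RetriesEnabled && j.RetryCount < j.MaxRetryCount
}

func main() {
    j := jobState{RetriesEnabled: true, MaxRetryCount: 3, RetryCount: 2}
    if shouldRequeue(j) {
        fmt.Println("requeue job for a new run")
    } else {
        fmt.Println("fail job permanently")
    }
}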

Server configuration:

submission:
  preemptionRetry:
    # Enable/disable the feature globally. When false, jobs with
    # preemption retry annotations are rejected at submission.
    enabled: true
    # Maximum retry count users can request via annotations.
    # If omitted, no upper bound is enforced.
    maxRetryCount: 10
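
As a sketch of how these server settings might be enforced at submission time (hypothetical Go code with an illustrative submissionConfig type mirroring the YAML above; not the PR's actual implementation):

package main

import (
    "errors"
    "fmt"
)

// submissionConfig mirrors the server-side settings shown above.
type submissionConfig struct {
    Enabled       bool
    MaxRetryCount *int // nil means no upper bound is enforced
}

// validateRetryRequest applies the documented rules: reject preemption
// retry annotations when the feature is disabled, and cap the requested
// retry count at MaxRetryCount when one is configured.
func validateRetryRequest(cfg submissionConfig, requested int) error {
    if !cfg.Enabled {
        return errors.New("preemption retry annotations are rejected: feature disabled")
    }
    if cfg.MaxRetryCount != nil && requested > *cfg.MaxRetryCount {
        return fmt.Errorf("requested retry count %d exceeds maximum %d", requested, *cfg.MaxRetryCount)
    }
    return nil
}

func main() {
    max := 10
    cfg := submissionConfig{Enabled: true, MaxRetryCount: &max}
    fmt.Println(validateRetryRequest(cfg, 3))  // <nil>
    fmt.Println(validateRetryRequest(cfg, 12)) // error: exceeds maximum
}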

Scheduler configuration:

scheduling:
  preemptionRetry:
    # Enable/disable preemption retry processing.
    enabled: true
    # Default retry count for jobs without annotation.
    # If omitted, defaults to 0 (no retries unless job specifies).
    defaultRetryCount: 3
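
Per the comments above, the annotation takes precedence and defaultRetryCount applies only to jobs without one. A hypothetical Go sketch of that resolution (not the PR's actual code):

package main

import (
    "fmt"
    "strconv"
)

// effectiveRetryCount returns the job's preemptionMaxRetryCount annotation
// if present and valid, otherwise the scheduler's defaultRetryCount
// (which itself defaults to 0 when omitted).
func effectiveRetryCount(annotations map[string]string, defaultRetryCount int) int {
    if v, ok := annotations["armadaproject.io/preemptionMaxRetryCount"]; ok {
        if n, err := strconv.Atoi(v); err == nil {
            return n
        }
    }
    return defaultRetryCount
}

func main() {
    withAnnotation := map[string]string{"armadaproject.io/preemptionMaxRetryCount": "5"}
    fmt.Println(effectiveRetryCount(withAnnotation, 3)) // 5: annotation wins
    fmt.Println(effectiveRetryCount(nil, 3))            // 3: scheduler default
}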

Example job which sets custom retry rules:

queue: default
jobSetId: my-batch-job
jobs:
  - priority: 0
    namespace: default
    annotations:
      armadaproject.io/preemptionRetryEnabled: "true"
      armadaproject.io/preemptionMaxRetryCount: "3"
    podSpec:
      priorityClassName: preemptible
      containers:
        - name: worker
          image: my-image:latest
          resources:
            requests:
              cpu: "4"
              memory: "8Gi"
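
A job file in this format can then be submitted with the Armada CLI, for example armadactl submit job.yaml (the file name is illustrative).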

All jobs in a gang must have consistent preemption retry configuration (see the sketch after this list):

  • Same preemptionRetryEnabled value
  • Same preemptionMaxRetryCount value
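
A minimal sketch of what that consistency check could look like (hypothetical Go, not the PR's actual validation code):

package main

import "fmt"

// gangRetryConfig holds the two annotation values that must match across a gang.
type gangRetryConfig struct {
    RetryEnabled  string // armadaproject.io/preemptionRetryEnabled
    MaxRetryCount string // armadaproject.io/preemptionMaxRetryCount
}

// validateGangRetryConfig enforces the rule above: every job in the gang
// must carry identical preemption retry settings.
func validateGangRetryConfig(gang []gangRetryConfig) error {
    for i := 1; i < len(gang); i++ {
        if gang[i] != gang[0] {
            return fmt.Errorf("job %d has inconsistent preemption retry configuration", i)
        }
    }
    return nil
}

func main() {
    gang := []gangRetryConfig{
        {RetryEnabled: "true", MaxRetryCount: "3"},
        {RetryEnabled: "true", MaxRetryCount: "5"},
    }
    fmt.Println(validateGangRetryConfig(gang)) // job 1 has inconsistent preemption retry configuration
}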

Which issue(s) this PR fixes

Fixes #4340

Special notes for your reviewer

@dejanzele dejanzele force-pushed the preemption-retries branch 2 times, most recently from 157a83f to 126aa9c on January 27, 2026 at 14:54
jparraga-stackav and others added 4 commits January 27, 2026 14:55
Signed-off-by: Jason Parraga <jparraga+gh@stackav.com>
Signed-off-by: Jason Parraga <jparraga+gh@stackav.com>
Signed-off-by: Jason Parraga <jparraga+gh@stackav.com>
Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
@dejanzele dejanzele changed the title feat: add support for native preemption retries feat: add support for native retry policies Jan 28, 2026
@Sovietaced
Contributor

One thing we ran into recently is that there can also be collisions on service names and ingress names. This would only affect folks opting into those features, so it can probably be done in a follow-up pull request.

@dejanzele dejanzele changed the title feat: add support for native retry policies [WIP] feat: add support for native retry policies Jan 30, 2026
@Sovietaced
Contributor

Another issue we see daily, one that seems to completely disrupt scheduling, is that there is no concept of a gang generation. This specifically seems to happen when the scheduler has performed a preemption on a gang.

Later, the scheduler performs its logical schedule-and-preempt pass, but it sees old pods from the original gang schedule, ones that were preempted, still on the nodes, and then we get error messages like:

scheduler.go:202 scheduling cycle failure error="gang runner-1328d9002fde4882bcb-n3-0-n3-0-dn0-0 was partially evicted: 2 out of 3 jobs evicted" cycleNumber=1030057

I believe it misinterprets real pods still on the nodes from the previous gang generation as belonging to the logical schedule/preempt pass it performs as part of the regular scheduling algorithm, and then blows up.

