
[WIP] feat: add support for native retry policies #4634

Open

dejanzele wants to merge 4 commits into armadaproject:master from dejanzele:preemption-retries

Conversation

@dejanzele
Member

@dejanzele dejanzele commented Jan 27, 2026

This work continues the initial effort by @Sovietaced. Huge shoutout and big thanks for doing the groundwork!

What type of PR is this?

Feature

What this PR does / why we need it

Adds native preemption retry support to Armada, allowing preempted jobs to be automatically requeued instead of failing permanently. This improves cluster utilization and user experience by giving preemptible workloads a chance to complete even when temporarily displaced by higher-priority jobs.

When a job is preempted by a higher-priority job, it currently fails permanently. Users must manually resubmit preempted jobs, which is:

  • Tedious for batch workloads with many jobs
  • Error-prone (users may not notice preemptions)
  • Wasteful of already-completed work in long-running jobs

Jobs can now opt in to automatic retry after preemption. When a preemptible job is preempted (a sketch of the decision logic follows this list):

  1. The run is marked as preempted (terminal state for the run)
  2. If retries are enabled and retry count is not exhausted, the job is requeued
  3. The job gets a new run when resources become available
  4. After exhausting retries, the job fails permanently
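
For illustration only, here is a minimal Go sketch of the requeue-or-fail decision described above. The type and function names (jobState, shouldRequeue) are hypothetical and do not come from this PR:

package main

import "fmt"

// jobState is a simplified, hypothetical view of the state the scheduler
// needs for the preemption retry decision described above.
type jobState struct {
    RetriesEnabled bool // from the job's preemptionRetryEnabled annotation
    MaxRetryCount  int  // from the annotation, or the scheduler default
    RetryCount     int  // preemption retries already consumed
}

// shouldRequeue returns true if a preempted job should be requeued for a
// new run instead of failing permanently (steps 2-4 above).
func shouldRequeue(j jobState) bool {
    return j.RetriesEnabled && j.RetryCount < j.MaxRetryCount
}

func main() {
    j := jobState{RetriesEnabled: true, MaxRetryCount: 3, RetryCount: 2}
    if shouldRequeue(j) {
        fmt.Println("requeue job for a new run")
    } else {
        fmt.Println("fail job permanently")
    }
}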

Server configuration:

submission:
  preemptionRetry:
    # Enable/disable the feature globally. When false, jobs with
    # preemption retry annotations are rejected at submission.
    enabled: true
    # Maximum retry count users can request via annotations.
    # If omitted, no upper bound is enforced.
    maxRetryCount: 10
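
As a sketch of how these server settings might be enforced at submission time (hypothetical Go code with an illustrative submissionConfig type mirroring the YAML above; not the PR's actual implementation):

package main

import (
    "errors"
    "fmt"
)

// submissionConfig mirrors the server-side settings shown above.
type submissionConfig struct {
    Enabled       bool
    MaxRetryCount *int // nil means no upper bound is enforced
}

// validateRetryRequest applies the documented rules: reject preemption
// retry annotations when the feature is disabled, and cap the requested
// retry count at MaxRetryCount when one is configured.
func validateRetryRequest(cfg submissionConfig, requested int) error {
    if !cfg.Enabled {
        return errors.New("preemption retry annotations are rejected: feature disabled")
    }
    if cfg.MaxRetryCount != nil && requested > *cfg.MaxRetryCount {
        return fmt.Errorf("requested retry count %d exceeds maximum %d", requested, *cfg.MaxRetryCount)
    }
    return nil
}

func main() {
    max := 10
    cfg := submissionConfig{Enabled: true, MaxRetryCount: &max}
    fmt.Println(validateRetryRequest(cfg, 3))  // <nil>
    fmt.Println(validateRetryRequest(cfg, 12)) // error: exceeds maximum
}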

Scheduler configuration:

scheduling:
  preemptionRetry:
    # Enable/disable preemption retry processing.
    enabled: true
    # Default retry count for jobs without annotation.
    # If omitted, defaults to 0 (no retries unless job specifies).
    defaultRetryCount: 3
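
Per the comments above, the annotation takes precedence and defaultRetryCount applies only to jobs without one. A hypothetical Go sketch of that resolution (not the PR's actual code):

package main

import (
    "fmt"
    "strconv"
)

// effectiveRetryCount returns the job's preemptionMaxRetryCount annotation
// if present and valid, otherwise the scheduler's defaultRetryCount
// (which itself defaults to 0 when omitted).
func effectiveRetryCount(annotations map[string]string, defaultRetryCount int) int {
    if v, ok := annotations["armadaproject.io/preemptionMaxRetryCount"]; ok {
        if n, err := strconv.Atoi(v); err == nil {
            return n
        }
    }
    return defaultRetryCount
}

func main() {
    withAnnotation := map[string]string{"armadaproject.io/preemptionMaxRetryCount": "5"}
    fmt.Println(effectiveRetryCount(withAnnotation, 3)) // 5: annotation wins
    fmt.Println(effectiveRetryCount(nil, 3))            // 3: scheduler default
}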

Example job which sets custom retry rules:

queue: default
jobSetId: my-batch-job
jobs:
  - priority: 0
    namespace: default
    annotations:
      armadaproject.io/preemptionRetryEnabled: "true"
      armadaproject.io/preemptionMaxRetryCount: "3"
    podSpec:
      priorityClassName: preemptible
      containers:
        - name: worker
          image: my-image:latest
          resources:
            requests:
              cpu: "4"
              memory: "8Gi"
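
A job file in this format can then be submitted with the Armada CLI, for example armadactl submit job.yaml (the file name is illustrative).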

All jobs in a gang must have consistent preemption retry configuration (see the sketch after this list):

  • Same preemptionRetryEnabled value
  • Same preemptionMaxRetryCount value
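
A minimal sketch of what that consistency check could look like (hypothetical Go, not the PR's actual validation code):

package main

import "fmt"

// gangRetryConfig holds the two annotation values that must match across a gang.
type gangRetryConfig struct {
    RetryEnabled  string // armadaproject.io/preemptionRetryEnabled
    MaxRetryCount string // armadaproject.io/preemptionMaxRetryCount
}

// validateGangRetryConfig enforces the rule above: every job in the gang
// must carry identical preemption retry settings.
func validateGangRetryConfig(gang []gangRetryConfig) error {
    for i := 1; i < len(gang); i++ {
        if gang[i] != gang[0] {
            return fmt.Errorf("job %d has inconsistent preemption retry configuration", i)
        }
    }
    return nil
}

func main() {
    gang := []gangRetryConfig{
        {RetryEnabled: "true", MaxRetryCount: "3"},
        {RetryEnabled: "true", MaxRetryCount: "5"},
    }
    fmt.Println(validateGangRetryConfig(gang)) // job 1 has inconsistent preemption retry configuration
}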

Which issue(s) this PR fixes

Fixes #4340

Special notes for your reviewer

@dejanzele dejanzele force-pushed the preemption-retries branch 2 times, most recently from 157a83f to 126aa9c on January 27, 2026 at 14:54
jparraga-stackav and others added 4 commits January 27, 2026 14:55
Signed-off-by: Jason Parraga <jparraga+gh@stackav.com>
Signed-off-by: Jason Parraga <jparraga+gh@stackav.com>
Signed-off-by: Jason Parraga <jparraga+gh@stackav.com>
Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
@dejanzele dejanzele changed the title feat: add support for native preemption retries feat: add support for native retry policies Jan 28, 2026
@Sovietaced
Contributor

One thing we ran into recently is that there can also be collisions on service names and ingress names. This would only affect folks opting into those features, so it can probably be done in a follow-up pull request.

@dejanzele dejanzele changed the title feat: add support for native retry policies [WIP] feat: add support for native retry policies Jan 30, 2026
@Sovietaced
Contributor

Another issue we see daily, one that seems to completely disrupt scheduling, is that there is no concept of a gang generation. This specifically seems to happen when the scheduler has performed a preemption on a gang.

Later, the scheduler performs its logical schedule-and-preempt pass, but it sees old pods from the original gang schedule, ones that were preempted, still on the nodes, and then we get error messages like:

scheduler.go:202 scheduling cycle failure error="gang runner-1328d9002fde4882bcb-n3-0-n3-0-dn0-0 was partially evicted: 2 out of 3 jobs evicted" cycleNumber=1030057

I believe it misinterprets real pods still on the nodes from the previous gang generation as belonging to the logical schedule/preempt pass it performs as part of the regular scheduling algorithm, and then blows up.

