
[WIP] feat(api): KEP-3015: Workload Aware Scheduling for TrainJob#3219

Open
andreyvelich wants to merge 7 commits into kubeflow:master from andreyvelich:was-kep

Conversation

@andreyvelich
Member

Related: #3015
Ref issues: kubernetes/enhancements#4671, kubernetes/kubernetes#132192.

This is the initial KEP to support WAS in Kubeflow Trainer and TrainJob. We need to discuss whether we want to allow users to set the Workload spec in TrainJob as well.

The implementation should start after Kubernetes v1.36.

Project to track WAS in Kubernetes: Workload-aware & Topology-aware Workstream (view)

/assign @tenzen-y @astefanutti @akshaychitneni @robert-bell @kubeflow/kubeflow-trainer-team

/cc @kannon92 @helayoty @wojtek-t @klueska @mm4tt @vsoch @dom4ha @VassilisVassiliadis

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Copilot AI review requested due to automatic review settings February 17, 2026 19:08
@google-oss-prow google-oss-prow bot requested a review from kannon92 February 17, 2026 19:08
@google-oss-prow

@andreyvelich: GitHub didn't allow me to request PR reviews from the following users: mm4tt, vsoch, dom4ha, VassilisVassiliadis, helayoty, wojtek-t, klueska.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to the PR description quoted above:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from andreyvelich. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

Copilot AI left a comment


Pull request overview

Adds an initial KEP/proposal document for integrating Kubernetes Workload API–based workload-aware (gang) scheduling into Kubeflow Trainer’s TrainJob via a new PodGroupPolicy plugin, as groundwork for future implementation.

Changes:

  • Introduces a new KEP (3015) describing Workload/PodGroup orchestration for TrainJob scheduling.
  • Documents intended API surface (podGroupPolicy.workload) and controller/plugin responsibilities across build/enforcement/watch phases.
  • Outlines lifecycle, RBAC, feature gates, and a test plan for the proposed integration.
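For illustration, the documented `podGroupPolicy.workload` surface could be sketched as a runtime manifest. Everything below is illustrative: the runtime name is made up, and the fields of the new `workload` source are still an open question in the KEP.

```yaml
apiVersion: trainer.kubeflow.org/v1alpha1
kind: ClusterTrainingRuntime
metadata:
  name: torch-distributed-was   # hypothetical example name
spec:
  podGroupPolicy:
    # New PodGroupPolicy plugin proposed by KEP-3015: back the TrainJob's
    # gang-scheduling with the native Kubernetes Workload/PodGroup API
    # instead of a coscheduling or Volcano PodGroup. The exact fields of
    # WorkloadPodGroupPolicySource are still under discussion.
    workload: {}
```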

@coveralls

coveralls commented Feb 17, 2026

Pull Request Test Coverage Report for Build 22145283120

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.


  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 55.998%

Totals (Coverage Status):

  • Change from base Build 22081023611: 0.0%
  • Covered Lines: 1391
  • Relevant Lines: 2484

💛 - Coveralls

@vsoch
Contributor

vsoch commented Feb 18, 2026

I'm a little caught up on this discussion:

kubernetes-sigs/jobset#1068 (comment)

If the Workload is intended to be integrated with JobSet, are there ever cases of creating TrainJob without a backing JobSet? If not, should we not expose Workload (and gang scheduling) through JobSet instead of adding separately via PodGroupPolicy? What happens when we have PodGroupPolicy plus underlying JobSet support to create the Workload? Or is this intended as a temporary patch to expose it while it is being worked on for JobSet?

@kannon92
Contributor

I'm a little caught up on this discussion:

kubernetes-sigs/jobset#1068 (comment)

If the Workload is intended to be integrated with JobSet, are there ever cases of creating TrainJob without a backing JobSet? If not, should we not expose Workload (and gang scheduling) through JobSet instead of adding separately via PodGroupPolicy? What happens when we have PodGroupPolicy plus underlying JobSet support to create the Workload? Or is this intended as a temporary patch to expose it while it is being worked on for JobSet?

It's a hard discussion to follow.

The general stance now is that the API that a user creates (sometimes known as the "true workload") should probably manage the PodGroups / Workloads.

For Trainer, we could consider delegating this to JobSet for JobSet-based TrainingRuntimes, but that doesn't solve the problem for Flux or whatever other backend we have.

What happens when we have PodGroupPolicy plus underlying JobSet support to create the Workload? Or is this intended as a temporary patch to expose it while it is being worked on for JobSet?

The main outcome would be that if there exists a workloadRef or PodGroupRef on the Pod template, we are to assume that WAS scheduling is "handled". But we haven't updated the KEP for JobSet. At least, that is what we were discussing across all the features.
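That detection idea could look roughly like the sketch below, reusing the pod-level `schedulingGroup` reference shown elsewhere in this PR. The field names and the injected value are assumptions, since the final "handled" signal has not been settled:

```yaml
# Pod template fragment. If a PodGroup reference is already present
# (e.g. injected by a future JobSet integration), the TrainJob controller
# would treat WAS as "handled" and skip creating its own Workload/PodGroup.
spec:
  schedulingGroup:
    podGroupName: my-job-trainer-abc123   # hypothetical, set by JobSet
```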

@vsoch
Contributor

vsoch commented Feb 18, 2026

The FluxPolicy implementation is just tweaking a JobSet, so I think if support was added to that underlying abstraction it would be inherited by Flux. I want to make sure we are careful to (essentially) not implement this layer twice, but rather just once, at the most appropriate level (maybe JobSet?).

andreyvelich and others added 3 commits February 18, 2026 14:45
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
@andreyvelich
Member Author

Yes, @kannon92 is right – we’ve had several discussions about this over the past few months.

In general, higher-level controllers should manage Workload and PodGroup objects, since they have the most complete understanding of the AI application they are orchestrating.

For now, I suggest that we add validation in the TrainJob controller to ensure users do not configure the Workload API directly in JobSet or Job objects.

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

@mm4tt mm4tt left a comment


Huge thanks for working on this integration. Having working examples of WAS being adopted by real-workload controllers gives us the confidence we need to push WAS to Beta in 1.37. Great stuff!


```go
// Workload plugin using native Kubernetes Workload API for gang-scheduling.
// Requires Kubernetes v1.35+ with GenericWorkload feature gates enabled.
Workload *WorkloadPodGroupPolicySource `json:"workload,omitempty"`
```
Copy link

@mm4tt mm4tt Feb 20, 2026


As discussed, this approach differs from the Job KEP (kubernetes/enhancements#5871), where we deliberately avoided Job API changes and simply enabled gang-scheduling by default in specific cases.

That said, introducing an explicit opt-in API here makes sense for now, as it naturally fits into the existing structure (e.g., gang-scheduling via coscheduling or Volcano).

One heads-up for the future: in 1.37, we will be working on a common API on the controller side for consuming WAS features like gang-scheduling. We want to avoid a fragmented ecosystem where every controller exposes its own API for WAS. Because of this, the API proposed here will likely need to be adjusted once we agree on the final shape of the common "WAS-controller API".


While I agree with what you wrote, I don't think this is a concern here.

The change to this API is not an opt-in for gang-scheduling or an opt-in for the creation of the Workload API.
This is rather a decision about which scheduling provider should be used.

@andreyvelich - please correct me if I'm wrong, but the way I understand it is that:

  1. The inherent nature of TrainJob is that it should be gang-scheduled.
  2. Up until now, there were two mechanisms to achieve it - either via the coscheduling plugin or via Volcano.
  3. In order to achieve the required semantics, users have to set it (in a way that reflects their cluster setup) - based on that value, the corresponding objects were created/modified.
  4. With this KEP, we're not changing that nor introducing any new API. We're just saying that there is a new way of achieving the inherent gang-scheduling requirement. That underneath translates to the creation of a Workload/PodGroup object, but that's an implementation detail.

At some point in the future, it would be great if we could make the Workload a default mode (because it will work out-of-the-box in any k8s cluster) and not require this field at all, but that's just a usability improvement.
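In other words, the new field slots into the existing `podGroupPolicy` union as one more provider choice. A hedged sketch (the `coscheduling` shape follows the current Kubeflow Trainer API; the `workload` shape is still TBD):

```yaml
# Today: gang-scheduling delegated to the coscheduling scheduler plugin.
podGroupPolicy:
  coscheduling:
    scheduleTimeoutSeconds: 120
---
# With this KEP: the same inherent gang requirement, fulfilled by creating
# native Workload/PodGroup objects behind the scenes.
podGroupPolicy:
  workload: {}
```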



That's a fair point - it's a choice of scheduling provider. However, under the hood, it effectively acts as an opt-in for the Workload API.

If we establish a common opt-in/opt-out mechanism for WAS integration in the future (1.37), we wouldn't want TrainJob using a custom field while other controllers use the standard approach. Even if the final solution is exactly as you suggested (making Workload the default and dropping this field), it's still an API adjustment we'll need to make before Beta.


However, under the hood, it effectively acts as an opt-in for the Workload API.

My mental model is that it's actually not an opt-in for the Workload API.
For me, it's "define my provider".

I will throw in a parallel example here. Let's imagine Karpenter vs. CustomComputeClass. Underneath, many functionalities can be achieved with either, but the API representation is very different.
If you have an API concept where you just state "karpenter" or "ccc" and the necessary objects are created for you behind the scenes - that's actually pretty compelling.

In other words - you want certain specific functionalities, but these may be backed by different implementations/components in different environments. And you need to tell your controller which one it should talk to to achieve what you need.

So for me it's actually not an opt-in for Workload API.

Member Author


One heads-up for the future: in 1.37, we will be working on a common API on the controller side for consuming WAS features like gang-scheduling

This is pretty interesting, do we have any work in progress doc to explore this?

please correct me if I'm wrong, but the way I understand it is that:

@wojtek-t Yes, your understanding is correct (users can also get gang-scheduling for TrainJob with Kueue or KAI Scheduler).
As I mentioned in the future plans, we might want to re-use this API for other WAS features like Topology-Aware Scheduling / DRA, so we should check whether this API makes sense for future extensibility.


@wojtek-t wojtek-t left a comment


Thanks a lot for helping us to validate the Workload/PodGroup design!

```yaml
ownerReferences:
- apiVersion: trainer.kubeflow.org/v1alpha1
  kind: TrainJob
  name: my-job
```


In the Job integration KEP, we made this an explicit controller: true ownerRef.
Same for the PodGroup.

I think it would be highly beneficial to unify that.

```yaml
spec:
  schedulingGroup:
    podGroupName: my-job-trainer-<hash>
```


Similarly, we add an additional ownerRef from Pods to the PodGroup.

That is useful to ensure that Pods will get deleted if the PodGroup gets deleted...
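Assuming the convention from the Job integration KEP carries over, the two ownership links discussed in this thread would look roughly as follows (the PodGroup apiVersion and all names are placeholders):

```yaml
# On the Workload/PodGroup: a controlling ownerRef back to the TrainJob.
ownerReferences:
- apiVersion: trainer.kubeflow.org/v1alpha1
  kind: TrainJob
  name: my-job
  controller: true
---
# On each Pod: an additional, non-controlling ownerRef to the PodGroup,
# so that Pods are garbage-collected if the PodGroup is deleted.
ownerReferences:
- apiVersion: scheduling.k8s.io/v1alpha1   # placeholder group/version
  kind: PodGroup
  name: my-job-trainer-abc123              # placeholder name
```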

Member Author

@andreyvelich andreyvelich Feb 23, 2026


Good callout. I've added a similar diagram with the OwnerReference relationship, like the one @helayoty added to the Job KEP: https://github.com/kubernetes/enhancements/blob/master/keps/sig-apps/5547-integrate-workload-with-job/README.md#ownerreferences-relationship



1. **Scheduling**: The kube-scheduler uses the Workload Scheduling Cycle to process entire PodGroups
atomically, ensuring all Pods in a gang are scheduled together.

1. **Suspension**: When the TrainJob is suspended, Workload and PodGroup resources are preserved.


I think that eventually we should actually delete the PodGroup on suspension - a PodGroup represents a runtime unit, and when the TrainJob is suspended we don't really have any runtime instance.

The question is what to do with the Workload - given it's effectively a policy definition, I would say that we should keep it. But I don't have a very strong opinion yet. I think that @kannon92 was suggesting removing that too.

I don't think that this comment is strictly an Alpha requirement, but I think we should rethink the pattern before Beta.

Contributor


The API has changed so much in a month...

When MinCount was in the policy definition (workload), we had to clean it up on suspend. I have gotten requests that a suspended workload should be mutable. We have use cases where users want to right-size a workload to match the actual usage (probably difficult, but there are some internal services trying to guess how many GPUs you actually need for an LLM versus how much you requested).

I also want the ability to have an LLM (or user) potentially modify a suspended workload so the workload could be admitted sooner (say you request 8 nodes with 64 GPUs, but maybe your workload can be admitted sooner with 4 Pods and 32 GPUs). So they should be able to modify the parallelism and the requests accordingly.

I have to check again with the Workload API (with TAS / DRA / etc.) about what information we would need to modify in the Workload API if a workload is suspended.

Member Author


So they should be able to modify the parallelism and the requests accordingly.

Yes, we also discussed similar use-cases with @VassilisVassiliadis: #2599 (comment),
where an external service (e.g. an LLM) can decide the optimal number of GPUs for the TrainJob.
That will require first submitting the TrainJob in a suspended state, triggering the classifier model, setting the correct container resources, and then un-suspending the TrainJob.

I think that eventually we should actually delete the PodGroup on suspension - a PodGroup represents a runtime unit, and when the TrainJob is suspended we don't really have any runtime instance.

That makes sense, but this decision will depend on the mutability of Workload fields (e.g. minCount).
If we decide to keep the Workload API immutable, we might need to re-create it once resources are changed during TrainJob suspension.


I'm pretty convinced that minCount should be a mutable field (not only for suspended workloads).
But regardless, that's only about the Workload.

But I think no matter what, we should always delete PodGroup(s) if we suspend the workload.

Member Author


That makes sense!

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>