
[WIP] feat(api): KEP-3015: Workload Aware Scheduling for TrainJob#3219

Open
andreyvelich wants to merge 7 commits into kubeflow:master from andreyvelich:was-kep

Conversation

@andreyvelich
Member

Related: #3015
Ref issues: kubernetes/enhancements#4671, kubernetes/kubernetes#132192.

This is the initial KEP to support WAS in Kubeflow Trainer and TrainJob. We need to discuss whether we want to allow users to set the Workload spec in TrainJob as well.

The implementation should start after Kubernetes v1.36.

Project to track WAS in Kubernetes: Workload-aware & Topology-aware Workstream (view)

/assign @tenzen-y @astefanutti @akshaychitneni @robert-bell @kubeflow/kubeflow-trainer-team

/cc @kannon92 @helayoty @wojtek-t @klueska @mm4tt @vsoch @dom4ha @VassilisVassiliadis

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Copilot AI review requested due to automatic review settings February 17, 2026 19:08
@google-oss-prow google-oss-prow bot requested a review from kannon92 February 17, 2026 19:08
@google-oss-prow

@andreyvelich: GitHub didn't allow me to request PR reviews from the following users: mm4tt, vsoch, dom4ha, VassilisVassiliadis, helayoty, wojtek-t, klueska.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to the PR description quoted above:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from andreyvelich. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

Copilot AI left a comment


Pull request overview

Adds an initial KEP/proposal document for integrating Kubernetes Workload API–based workload-aware (gang) scheduling into Kubeflow Trainer’s TrainJob via a new PodGroupPolicy plugin, as groundwork for future implementation.

Changes:

  • Introduces a new KEP (3015) describing Workload/PodGroup orchestration for TrainJob scheduling.
  • Documents intended API surface (podGroupPolicy.workload) and controller/plugin responsibilities across build/enforcement/watch phases.
  • Outlines lifecycle, RBAC, feature gates, and a test plan for the proposed integration.
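For illustration, the documented `podGroupPolicy.workload` surface could be sketched as a runtime manifest. Everything below is illustrative: the runtime name is made up, and the fields of the new `workload` source are still an open question in the KEP.

```yaml
apiVersion: trainer.kubeflow.org/v1alpha1
kind: ClusterTrainingRuntime
metadata:
  name: torch-distributed-was   # hypothetical example name
spec:
  podGroupPolicy:
    # New PodGroupPolicy plugin proposed by KEP-3015: back the TrainJob's
    # gang-scheduling with the native Kubernetes Workload/PodGroup API
    # instead of a coscheduling or Volcano PodGroup. The exact fields of
    # WorkloadPodGroupPolicySource are still under discussion.
    workload: {}
```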

@coveralls

coveralls commented Feb 17, 2026

Pull Request Test Coverage Report for Build 22145283120

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.


  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 55.998%

Totals (Coverage Status):

  • Change from base Build 22081023611: 0.0%
  • Covered Lines: 1391
  • Relevant Lines: 2484

💛 - Coveralls

@vsoch
Contributor

vsoch commented Feb 18, 2026

I'm a little caught up on this discussion:

kubernetes-sigs/jobset#1068 (comment)

If the Workload is intended to be integrated with JobSet, are there ever cases of creating TrainJob without a backing JobSet? If not, should we not expose Workload (and gang scheduling) through JobSet instead of adding separately via PodGroupPolicy? What happens when we have PodGroupPolicy plus underlying JobSet support to create the Workload? Or is this intended as a temporary patch to expose it while it is being worked on for JobSet?

@kannon92
Contributor

I'm a little caught up on this discussion:

kubernetes-sigs/jobset#1068 (comment)

If the Workload is intended to be integrated with JobSet, are there ever cases of creating TrainJob without a backing JobSet? If not, should we not expose Workload (and gang scheduling) through JobSet instead of adding separately via PodGroupPolicy? What happens when we have PodGroupPolicy plus underlying JobSet support to create the Workload? Or is this intended as a temporary patch to expose it while it is being worked on for JobSet?

It's a hard discussion to follow.

The general stance now is that the API that a user creates (sometimes known as the "true workload") should probably manage the PodGroups / Workloads.

For Trainer, we could consider delegating this to JobSet for JobSet-based TrainingRuntimes, but that doesn't solve the problem for Flux or whatever other backend we have.

What happens when we have PodGroupPolicy plus underlying JobSet support to create the Workload? Or is this intended as a temporary patch to expose it while it is being worked on for JobSet?

The main outcome would be that if there exists a workloadRef or PodGroupRef on the Pod template, we are to assume that WAS scheduling is "handled". But we haven't updated the KEP for JobSet. At least, that is what we were discussing across all the features.
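That detection idea could look roughly like the sketch below, reusing the pod-level `schedulingGroup` reference shown elsewhere in this PR. The field names and the injected value are assumptions, since the final "handled" signal has not been settled:

```yaml
# Pod template fragment. If a PodGroup reference is already present
# (e.g. injected by a future JobSet integration), the TrainJob controller
# would treat WAS as "handled" and skip creating its own Workload/PodGroup.
spec:
  schedulingGroup:
    podGroupName: my-job-trainer-abc123   # hypothetical, set by JobSet
```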

@vsoch
Contributor

vsoch commented Feb 18, 2026

The FluxPolicy implementation is just tweaking a JobSet, so I think if support was added to that underlying abstraction it would be inherited by Flux. I want to make sure we are careful to (essentially) not implement this layer twice, but rather just once, at the most appropriate level (maybe JobSet?).

andreyvelich and others added 3 commits February 18, 2026 14:45
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
@andreyvelich
Member Author

Yes, @kannon92 is right – we’ve had several discussions about this over the past few months.

In general, higher-level controllers should manage Workload and PodGroup objects, since they have the most complete understanding of the AI application they are orchestrating.

For now, I suggest that we add validation in the TrainJob controller to ensure users do not configure the Workload API directly in JobSet or Job objects.

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

@mm4tt mm4tt left a comment


Huge thanks for working on this integration. Having working examples of WAS being adopted by real-workload controllers gives us the confidence we need to push WAS to Beta in 1.37. Great stuff!


```go
// Workload plugin using native Kubernetes Workload API for gang-scheduling.
// Requires Kubernetes v1.35+ with GenericWorkload feature gates enabled.
Workload *WorkloadPodGroupPolicySource `json:"workload,omitempty"`
```
Copy link

@mm4tt mm4tt Feb 20, 2026


As discussed, this approach differs from the Job KEP (kubernetes/enhancements#5871), where we deliberately avoided Job API changes and simply enabled gang-scheduling by default in specific cases.

That said, introducing an explicit opt-in API here makes sense for now, as it naturally fits into the existing structure (e.g., gang-scheduling via coscheduling or Volcano).

One heads-up for the future: in 1.37, we will be working on a common API on the controller side for consuming WAS features like gang-scheduling. We want to avoid a fragmented ecosystem where every controller exposes its own API for WAS. Because of this, the API proposed here will likely need to be adjusted once we agree on the final shape of the common "WAS-controller API".


While I agree with what you wrote, I don't think this is a concern here.

The change to this API is not an opt-in for gang-scheduling or an opt-in for the creation of the Workload API.
This is rather a decision about which scheduling provider should be used.

@andreyvelich - please correct me if I'm wrong, but the way I understand it is that:

  1. The inherent nature of TrainJob is that it should be gang-scheduled.
  2. Up until now, there were two mechanisms to achieve it - either via the coscheduling plugin or via Volcano.
  3. In order to achieve the required semantics, users have to set it (in a way that reflects their cluster setup) - based on that value, the corresponding objects were created/modified.
  4. With this KEP, we're not changing that nor introducing any new API. We're just saying that there is a new way of achieving the inherent gang-scheduling requirement. That underneath translates to the creation of a Workload/PodGroup object, but that's an implementation detail.

At some point in the future, it would be great if we could make the Workload a default mode (because it will work out-of-the-box in any k8s cluster) and not require this field at all, but that's just a usability improvement.
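In other words, the new field slots into the existing `podGroupPolicy` union as one more provider choice. A hedged sketch (the `coscheduling` shape follows the current Kubeflow Trainer API; the `workload` shape is still TBD):

```yaml
# Today: gang-scheduling delegated to the coscheduling scheduler plugin.
podGroupPolicy:
  coscheduling:
    scheduleTimeoutSeconds: 120
---
# With this KEP: the same inherent gang requirement, fulfilled by creating
# native Workload/PodGroup objects behind the scenes.
podGroupPolicy:
  workload: {}
```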



That's a fair point - it's a choice of scheduling provider. However, under the hood, it effectively acts as an opt-in for the Workload API.

If we establish a common opt-in/opt-out mechanism for WAS integration in the future (1.37), we wouldn't want TrainJob using a custom field while other controllers use the standard approach. Even if the final solution is exactly as you suggested (making Workload the default and dropping this field), it's still an API adjustment we'll need to make before Beta.


However, under the hood, it effectively acts as an opt-in for the Workload API.

My mental model is that it's actually not an opt-in for the Workload API.
For me, it's "define my provider".

I will throw in a parallel example here. Let's imagine Karpenter vs. CustomComputeClass. Underneath, many functionalities can be achieved with either, but the API representation is very different.
If you have an API concept where you just state "karpenter" or "ccc" and the necessary objects are created for you behind the scenes - that's actually pretty compelling.

In other words - you want certain specific functionalities, but these may be backed by different implementations/components in different environments. And you need to tell your controller which one it should talk to to achieve what you need.

So for me it's actually not an opt-in for Workload API.

Member Author


One heads-up for the future: in 1.37, we will be working on a common API on the controller side for consuming WAS features like gang-scheduling

This is pretty interesting, do we have any work in progress doc to explore this?

please correct me if I'm wrong, but the way I understand it is that:

@wojtek-t Yes, your understanding is correct (users can also get gang-scheduling for TrainJob with Kueue or KAI Scheduler).
As I mentioned in the future plans, we might want to re-use this API for other WAS features like Topology-Aware Scheduling / DRA, so we should check whether this API makes sense for future extensibility.


@wojtek-t wojtek-t left a comment


Thanks a lot for helping us to validate the Workload/PodGroup design!

```yaml
ownerReferences:
- apiVersion: trainer.kubeflow.org/v1alpha1
  kind: TrainJob
  name: my-job
```


In the Job integration KEP, we made this an explicit controller: true ownerRef.
Same for the PodGroup.

I think it would be highly beneficial to unify that.

```yaml
spec:
  schedulingGroup:
    podGroupName: my-job-trainer-<hash>
```


Similarly, we add an additional ownerRef from Pods to the PodGroup.

That is useful to ensure that Pods will get deleted if the PodGroup gets deleted...
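Assuming the convention from the Job integration KEP carries over, the two ownership links discussed in this thread would look roughly as follows (the PodGroup apiVersion and all names are placeholders):

```yaml
# On the Workload/PodGroup: a controlling ownerRef back to the TrainJob.
ownerReferences:
- apiVersion: trainer.kubeflow.org/v1alpha1
  kind: TrainJob
  name: my-job
  controller: true
---
# On each Pod: an additional, non-controlling ownerRef to the PodGroup,
# so that Pods are garbage-collected if the PodGroup is deleted.
ownerReferences:
- apiVersion: scheduling.k8s.io/v1alpha1   # placeholder group/version
  kind: PodGroup
  name: my-job-trainer-abc123              # placeholder name
```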

Member Author

@andreyvelich andreyvelich Feb 23, 2026


Good callout. I've added a similar diagram with the OwnerReference relationship, like the one @helayoty added to the Job KEP: https://github.com/kubernetes/enhancements/blob/master/keps/sig-apps/5547-integrate-workload-with-job/README.md#ownerreferences-relationship



1. **Scheduling**: The kube-scheduler uses the Workload Scheduling Cycle to process entire PodGroups
atomically, ensuring all Pods in a gang are scheduled together.

1. **Suspension**: When the TrainJob is suspended, Workload and PodGroup resources are preserved.


I think that eventually we should actually delete the PodGroup on suspension - a PodGroup represents a runtime unit, and when the TrainJob is suspended we don't really have any runtime instance.

The question is what to do with the Workload - given it's effectively a policy definition, I would say that we should keep it. But I don't have a very strong opinion yet. I think that @kannon92 was suggesting removing that too.

I don't think that this comment is strictly an Alpha requirement, but I think we should rethink the pattern before Beta.

Contributor


The API has changed so much in a month...

When MinCount was in the policy definition (workload), we had to clean it up on suspend. I have gotten requests that a suspended workload should be mutable. We have use cases where users want to right-size a workload to match the actual usage (probably difficult, but there are some internal services trying to guess how many GPUs you actually need for an LLM versus how much you requested).

I also want the ability to have an LLM (or user) potentially modify a suspended workload so the workload could be admitted sooner (say you request 8 nodes with 64 GPUs, but maybe your workload can be admitted sooner with 4 Pods and 32 GPUs). So they should be able to modify the parallelism and the requests accordingly.

I have to check again with the Workload API (with TAS / DRA / etc.) about what information we would need to modify in the Workload API if a workload is suspended.

Member Author


So they should be able to modify the parallelism and the requests accordingly.

Yes, we also discussed similar use-cases with @VassilisVassiliadis: #2599 (comment),
where an external service (e.g. an LLM) can decide the optimal number of GPUs for the TrainJob.
That will require first submitting the TrainJob in a suspended state, triggering the classifier model, setting the correct container resources, and then un-suspending the TrainJob.

I think that eventually we should actually delete the PodGroup on suspension - a PodGroup represents a runtime unit, and when the TrainJob is suspended we don't really have any runtime instance.

That makes sense, but this decision will depend on the mutability of Workload fields (e.g. minCount).
If we decide to keep the Workload API immutable, we might need to re-create it once resources are changed during TrainJob suspension.


I'm pretty convinced that minCount should be a mutable field (not only for suspended workloads).
But regardless, that's only about the Workload.

But I think no matter what, we should always delete PodGroup(s) if we suspend the workload.

Member Author


That makes sense!

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>