[WIP] feat(api): KEP-3015: Workload Aware Scheduling for TrainJob #3219
andreyvelich wants to merge 7 commits into kubeflow:master
Conversation
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
@andreyvelich: GitHub didn't allow me to request PR reviews from the following users: mm4tt, vsoch, dom4ha, VassilisVassiliadis, helayoty, wojtek-t, klueska. Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
Pull request overview
Adds an initial KEP/proposal document for integrating Kubernetes Workload API–based workload-aware (gang) scheduling into Kubeflow Trainer’s TrainJob via a new PodGroupPolicy plugin, as groundwork for future implementation.
Changes:
- Introduces a new KEP (3015) describing Workload/PodGroup orchestration for TrainJob scheduling.
- Documents intended API surface (podGroupPolicy.workload) and controller/plugin responsibilities across build/enforcement/watch phases.
- Outlines lifecycle, RBAC, feature gates, and a test plan for the proposed integration.
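For illustration, the proposed field would be consumed from a runtime spec roughly like this (a sketch: the placement mirrors the existing coscheduling policy; the runtime name and the empty workload source are hypothetical and not final):

```yaml
apiVersion: trainer.kubeflow.org/v1alpha1
kind: ClusterTrainingRuntime
metadata:
  name: torch-distributed-gang   # hypothetical runtime name
spec:
  mlPolicy:
    numNodes: 4
  podGroupPolicy:
    workload: {}   # proposed: gang-schedule via the native Workload API
```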
Pull Request Test Coverage Report for Build 22145283120
Warning: This coverage report may be inaccurate. This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
💛 - Coveralls
I'm a little caught up on this discussion: kubernetes-sigs/jobset#1068 (comment). If the Workload is intended to be integrated with JobSet, are there ever cases of creating a TrainJob without a backing JobSet? If not, should we not expose Workload (and gang scheduling) through JobSet instead of adding it separately via PodGroupPolicy? What happens when we have PodGroupPolicy plus underlying JobSet support to create the Workload? Or is this intended as a temporary patch to expose it while it is being worked on for JobSet?
It's a hard discussion to follow. The general stance now is that the API that a user creates (sometimes known as the true workload) should probably manage the PodGroups / Workloads. For Trainer we could maybe consider delegating this to JobSet for JobSet-based TrainingRuntimes, but that doesn't solve the problem for Flux or whatever other backend we have.
The main outcome would be that if there exists a workloadRef or PodGroupRef on the pod template, we are to assume that WAS scheduling is "handled". But we haven't updated the KEP for JobSet. That was at least what we were discussing across all the features.
The FluxPolicy implementation is just tweaking a JobSet, so I think if support was added to that underlying abstraction it would be inherited by Flux. I want to make sure we are careful to (essentially) not implement this layer twice, but rather just once at the most appropriate level (maybe JobSet?)
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Yes, @kannon92 is right – we’ve had several discussions about this over the past few months. In general, higher-level controllers should manage Workload and PodGroup objects, since they have the most complete understanding of the AI application they are orchestrating. For example:
For now, I suggest that we add validation in the TrainJob controller to ensure users do not configure the Workload API directly in JobSet or Job objects.
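A minimal sketch of what that validation might look like (hypothetical helper, not the actual webhook code: the real implementation would inspect the typed JobSet/Job specs rather than field paths, and the field names are assumptions based on the snippets in this PR):

```go
package main

import (
	"fmt"
	"strings"
)

// validateNoDirectWorkloadRefs rejects runtime pod templates that reference
// the Workload API directly, since the TrainJob controller should own
// Workload/PodGroup creation. Illustrative only.
func validateNoDirectWorkloadRefs(podTemplateFields []string) error {
	for _, field := range podTemplateFields {
		if strings.HasPrefix(field, "spec.schedulingGroup") ||
			strings.HasPrefix(field, "spec.workloadRef") {
			return fmt.Errorf("%s must not be set directly; use podGroupPolicy.workload instead", field)
		}
	}
	return nil
}

func main() {
	fmt.Println(validateNoDirectWorkloadRefs([]string{"spec.containers"}))
	fmt.Println(validateNoDirectWorkloadRefs([]string{"spec.schedulingGroup.podGroupName"}))
}
```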
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
mm4tt left a comment
Huge thanks for working on this integration. Having working examples of WAS being adopted by real-workload controllers gives us the confidence we need to push WAS to Beta in 1.37. Great stuff!
```go
// Workload plugin using native Kubernetes Workload API for gang-scheduling.
// Requires Kubernetes v1.35+ with GenericWorkload feature gates enabled.
Workload *WorkloadPodGroupPolicySource `json:"workload,omitempty"`
```
As discussed, this approach differs from the Job KEP (kubernetes/enhancements#5871), where we deliberately avoided Job API changes and simply enabled gang-scheduling by default in specific cases.
That said, introducing an explicit opt-in API here makes sense for now, as it naturally fits into the existing structure (e.g., gang-scheduling via coscheduling or Volcano).
One heads-up for the future: in 1.37, we will be working on a common API on the controller side for consuming WAS features like gang-scheduling. We want to avoid a fragmented ecosystem where every controller exposes its own API for WAS. Because of this, the API proposed here will likely need to be adjusted once we agree on the final shape of the common "WAS-controller API".
While I agree with what you wrote, I don't think this is a concern here.
The change to this API is not an opt-in for gang-scheduling or an opt-in for creation of the Workload API.
This is rather a decision about which scheduling provider should be used.
@andreyvelich - please correct me if I'm wrong, but the way I understand it is that:
- The inherent nature of TrainJob is that it should be gang-scheduled.
- Up until now, there were two mechanisms to achieve it: either via the coscheduling plugin or via Volcano.
- In order to achieve the required semantics, users have to set it (in a way that reflects their cluster setup); based on that value, the corresponding objects were created/modified.
- With this KEP, we're not changing that nor introducing any new API. We're just saying that there is a new way of achieving the inherent gang-scheduling requirement. Underneath, that translates to the creation of Workload/PodGroup objects, but that's an implementation detail.
At some point in the future, it would be great if we could make the Workload a default mode (because it will work out-of-the-box in any k8s cluster) and not require this field at all, but that's just a usability improvement.
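The "provider selector" mental model could be sketched like this (illustrative only; the field names are condensed from the API snippets in this PR, and the created-objects strings are simplifications):

```go
package main

import "fmt"

// podGroupPolicy condenses the one-of provider choice: exactly one
// gang-scheduling backend is selected, and the controller creates
// different objects behind the scenes for the same inherent gang
// requirement. Sketch, not the real API type.
type podGroupPolicy struct {
	Coscheduling bool // scheduler-plugins coscheduling
	Volcano      bool // Volcano PodGroup
	Workload     bool // native Kubernetes Workload API (this KEP)
}

func objectsCreatedFor(p podGroupPolicy) string {
	switch {
	case p.Workload:
		return "Workload + PodGroup (kube-scheduler Workload Scheduling Cycle)"
	case p.Volcano:
		return "Volcano PodGroup"
	case p.Coscheduling:
		return "scheduler-plugins PodGroup"
	default:
		return "none"
	}
}

func main() {
	fmt.Println(objectsCreatedFor(podGroupPolicy{Workload: true}))
}
```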
That's a fair point - it's a choice of scheduling provider. However, under the hood, it effectively acts as an opt-in for the Workload API.
If we establish a common opt-in/opt-out mechanism for WAS integration in the future (1.37), we wouldn't want TrainJob using a custom field while other controllers use the standard approach. Even if the final solution is exactly as you suggested (making Workload the default and dropping this field), it's still an API adjustment we'll need to make before Beta.
However, under the hood, it effectively acts as an opt-in for the Workload API.
My mental model is that it's actually not opt-in for Workload API.
For me it's "define my provider".
I will throw a parallel example here. Let's imagine Karpenter vs CustomComputeClass. Underneath, for many functionalities you can achieve the same thing, but the API representation is very different.
If you have an API concept where you just state "karpenter" or "ccc" and the necessary objects are created for you behind the scenes, that's actually pretty compelling.
In other words: you want certain specific functionalities, but these may be backed by different implementations/components in different environments. And you need to tell your controller which one it should talk to to achieve what you need.
So for me it's actually not an opt-in for Workload API.
One heads-up for the future: in 1.37, we will be working on a common API on the controller side for consuming WAS features like gang-scheduling
This is pretty interesting, do we have any work in progress doc to explore this?
please correct me if I'm wrong, but the way I understand it is that:
@wojtek-t Yes, your understanding is correct (users can also get gang-scheduling for TrainJob with Kueue or KAI Scheduler).
As I mentioned in the future plans, we might want to re-use this API for other WAS features like Topology-Aware Scheduling / DRA, so we should see whether this API makes sense for future extendability.
wojtek-t left a comment
Thanks a lot for helping us to validate the Workload/PodGroup design!
```yaml
ownerReferences:
- apiVersion: trainer.kubeflow.org/v1alpha1
  kind: TrainJob
  name: my-job
```
In the Job integration KEP - we made this explicitly controller: true ownerRef.
Same for PodGroup.
I think it would be highly beneficial to unify that.
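Unifying with the Job KEP would then mean emitting something like the following (a sketch; `blockOwnerDeletion` is an assumption carried over from the usual controller-ownerRef pattern):

```yaml
ownerReferences:
- apiVersion: trainer.kubeflow.org/v1alpha1
  kind: TrainJob
  name: my-job
  controller: true
  blockOwnerDeletion: true
```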
```yaml
spec:
  schedulingGroup:
    podGroupName: my-job-trainer-<hash>
```
Similarly - we add additional ownerRef to PodGroup from pods.
That is useful to ensure that Pods will get deleted if the PodGroup gets deleted...
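A trainer Pod would then carry the extra non-controller ownerRef alongside its controller ref, e.g. (sketch; the PodGroup apiVersion is an assumption, and object names are illustrative):

```yaml
ownerReferences:
- apiVersion: batch/v1
  kind: Job
  name: my-job-trainer
  controller: true
- apiVersion: scheduling.k8s.io/v1alpha1   # assumed group/version for PodGroup
  kind: PodGroup
  name: my-job-trainer-<hash>
```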
Good callout, I've added a similar diagram with the OwnerReference relationship, as @helayoty added to the Job KEP: https://github.com/kubernetes/enhancements/blob/master/keps/sig-apps/5547-integrate-workload-with-job/README.md#ownerreferences-relationship
1. **Scheduling**: The kube-scheduler uses the Workload Scheduling Cycle to process entire PodGroups atomically, ensuring all Pods in a gang are scheduled together.

1. **Suspension**: When the TrainJob is suspended, Workload and PodGroup resources are preserved.
I think that eventually we should actually delete the PodGroup on suspension: a PodGroup represents a runtime unit, and when the TrainJob is suspended we don't really have any runtime instance.
The question is what to do with the Workload. Given it's effectively a policy definition, I would say we should keep it, but I don't have a very strong opinion yet. I think @kannon92 was suggesting removing that too.
I don't think this comment is strictly an Alpha requirement, but I think we should rethink the pattern before Beta.
The API has changed so much in a month...
When MinCount was in the policy definition (workload), we had to clean it up on suspend. I have gotten requests that a suspended workload should be mutable. We have use cases where users want to right-size a workload to match actual usage (probably difficult, but there are some internal services trying to guess how many GPUs you actually need for an LLM versus how many you requested).
I also want the ability for an LLM (or user) to potentially modify a suspended workload so the workload could be admitted sooner (say you request 8 nodes with 64 GPUs, but maybe your workload can be admitted sooner with 4 pods and 32 GPUs). So they should be able to modify the parallelism and the requests accordingly.
I have to check again with the Workload API (with TAS / DRA / etc.) about what information we would need to modify in the Workload API if a workload is suspended.
So they should be able to modify the parallelism and the requests accordingly.
Yes, we also discussed similar use-cases with @VassilisVassiliadis: #2599 (comment).
An external service (e.g. an LLM) can decide the optimal number of GPUs for a TrainJob.
That will require firstly submitting the TrainJob in a suspended state, triggering the classifier model, setting the correct container resources, and then un-suspending the TrainJob.
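That workflow could look roughly like this (a sketch; the field names follow the TrainJob API, while the object name, runtime name, and resource values are illustrative):

```yaml
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: llm-finetune        # hypothetical name
spec:
  suspend: true             # submitted suspended; an external service right-sizes it
  runtimeRef:
    name: torch-distributed
  trainer:
    numNodes: 8             # patched down (e.g. to 4) before un-suspending
```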
I think that eventually we should actually delete PodGroup on suspension - PodGroup is representing a runtime unit and when TrainJob is suspended we don't really have any runtime instance.
That makes sense, but this decision will depend on the mutability of Workload fields (e.g. minCount).
If we decide to keep the Workload API immutable, we might need to re-create it once resources are changed during TrainJob suspension.
I'm pretty convinced that minCount should be a mutable field (not only for suspended workloads).
But regardless of that, it only concerns the Workload.
No matter what, I think we should always delete the PodGroup(s) when we suspend the workload.
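The suspension semantics discussed here could be condensed into a sketch like this (a hypothetical helper, not controller code: it assumes the keep-Workload/delete-PodGroup split above, which is still under discussion):

```go
package main

import "fmt"

// objectsToKeepOnSuspend models the proposal above: on suspend, PodGroups
// are runtime units and get deleted together with the Pods, while policy
// objects such as the Workload are kept so fields like minCount remain
// mutable while suspended. Sketch only.
func objectsToKeepOnSuspend(owned []string) []string {
	kept := make([]string, 0, len(owned))
	for _, kind := range owned {
		if kind == "PodGroup" {
			continue // delete: no runtime instance exists while suspended
		}
		kept = append(kept, kind) // keep policy objects such as Workload
	}
	return kept
}

func main() {
	fmt.Println(objectsToKeepOnSuspend([]string{"Workload", "PodGroup"}))
}
```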
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Related: #3015
Ref issues: kubernetes/enhancements#4671, kubernetes/kubernetes#132192.
This is an initial KEP to support WAS in Kubeflow Trainer and TrainJob. We need to discuss whether we want to allow users to set the Workload spec in TrainJob as well.
The implementation should be started after Kubernetes v1.36.
Project to track WAS in Kubernetes: Workload-aware & Topology-aware Workstream
/assign @tenzen-y @astefanutti @akshaychitneni @robert-bell @kubeflow/kubeflow-trainer-team
/cc @kannon92 @helayoty @wojtek-t @klueska @mm4tt @vsoch @dom4ha @VassilisVassiliadis