feat(runtimes): add Pending and Running status conditions for TrainJob #3019
RohitYandigeri wants to merge 2 commits into kubeflow:master
Conversation
feat(trainjob): add Pending and Running status conditions
Thank you for this contribution @RohitYandigeri!
As we discussed before, the Job Active condition doesn't indicate that the training process is running. Sometimes, PyTorch can get stuck in the initialization or synchronization phase, and the Kubernetes batch Job doesn't detect that.
As part of this KEP: #2905, we are working on exposing training progress in the TrainJob status. That will help us detect whether the actual model training process is happening.
cc @robert-bell @kubeflow/kubeflow-trainer-team
Let's hold this PR for now and discuss the details in #2905.
/hold
I will close this PR since we will implement a solution for detecting TrainJob status as part of this KEP: #2905
@andreyvelich: Closed this PR.
Fixes #2713
This PR improves TrainJob observability by adding intermediate status
conditions that reflect the lifecycle of the underlying jobs.
What's added
- `Pending` condition when a TrainJob is created but the underlying jobs have not started
- `Running` condition when the underlying jobs are actively executing
- `kubectl get trainjob` now reflects these states consistently

Notes
- … with Kubernetes conventions
- … manifests if requested
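To illustrate the lifecycle the PR describes, here is a minimal Go sketch of upserting `Pending` and `Running` conditions. It assumes a simplified local `Condition` type standing in for `metav1.Condition` from k8s.io/apimachinery; the condition types, reasons, and messages are illustrative, not the PR's actual values.

```go
package main

import (
	"fmt"
	"time"
)

// Condition is a simplified stand-in for metav1.Condition
// (hypothetical, for illustration only).
type Condition struct {
	Type               string
	Status             string
	Reason             string
	Message            string
	LastTransitionTime time.Time
}

// setCondition upserts a condition by Type, refreshing the
// transition time only when the status actually changes.
func setCondition(conds []Condition, c Condition) []Condition {
	for i, existing := range conds {
		if existing.Type == c.Type {
			if existing.Status != c.Status {
				c.LastTransitionTime = time.Now()
			} else {
				c.LastTransitionTime = existing.LastTransitionTime
			}
			conds[i] = c
			return conds
		}
	}
	c.LastTransitionTime = time.Now()
	return append(conds, c)
}

func main() {
	var conds []Condition

	// TrainJob created; underlying jobs have not started yet.
	conds = setCondition(conds, Condition{
		Type: "Pending", Status: "True",
		Reason: "JobsNotStarted", Message: "underlying jobs have not started",
	})

	// Underlying jobs become active: flip Pending off, set Running.
	conds = setCondition(conds, Condition{
		Type: "Pending", Status: "False",
		Reason: "JobsStarted", Message: "underlying jobs started",
	})
	conds = setCondition(conds, Condition{
		Type: "Running", Status: "True",
		Reason: "JobsActive", Message: "underlying jobs are actively executing",
	})

	for _, c := range conds {
		fmt.Printf("%s=%s (%s)\n", c.Type, c.Status, c.Reason)
	}
}
```

In the real controller this upsert is what `meta.SetStatusCondition` from k8s.io/apimachinery provides; the sketch only shows the state transitions a reconciler would drive.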
Why
This aligns TrainJob with standard Kubernetes resource status patterns and
provides a better experience for users monitoring the training lifecycle.
cc @kubeflow/kubeflow-trainer-team