feat: Add TrainJob progression tracking with real-time status updates #2820
abhijeet-dhumal wants to merge 3 commits into kubeflow:master
[APPROVALNOTIFIER] This PR is NOT APPROVED. Approval is still needed from an approver for the changed files.
Force-pushed from a64fa0d to 76dc772.
Pull Request Test Coverage Report for Build 17559363064 (Coveralls)
Force-pushed from 76dc772 to 9c3b465.
Signed-off-by: Abhijeet Dhumal <abdhumal@redhat.com>
Force-pushed from 9c3b465 to 7b5fe99.
@andreyvelich @astefanutti @kannon92 May I request your review here?
- Added RHOAI-specific manifests for OpenShift AI deployment
- Added Dockerfile.odh for ODH-specific container builds
- Includes training runtimes for CUDA 2.4.1, 2.5.1 and ROCm 2.4.1, 2.5.1
- Added monitoring, RBAC, and configuration patches for RHOAI
This seems to be a pretty large change. Should there be a KEP update or a design doc summarizing this? Forgive me if I missed it.
Yeah, we should prepare a KEP as part of #2779.
@kannon92 you're right. As @andreyvelich mentioned, we discussed this during the community call, and we are working on a draft of the KEP that we hope to share soon.
Closing this PR: the proposal has been created in a separate PR (mentioned below) and is now available for review, so that we can investigate the ideal implementation approach there.
This PR implements real-time progression tracking for TrainJobs, enabling users to monitor training progress directly through the Kubernetes API without requiring additional RBAC permissions for training pods.
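To illustrate what "monitoring through the Kubernetes API" could look like from a client's side, here is a minimal sketch. It assumes the TrainJob custom resource lives under `trainer.kubeflow.org/v1alpha1` with plural `trainjobs`; the `status.progress.currentStep` and `status.progress.totalSteps` field names are hypothetical placeholders, not the schema from this PR, and the helper names `trainjob_path` and `read_progress` are made up for illustration.

```python
from typing import Any, Optional

# Assumed group/version/plural for the TrainJob custom resource.
GROUP = "trainer.kubeflow.org"
VERSION = "v1alpha1"
PLURAL = "trainjobs"


def trainjob_path(namespace: str, name: str, subresource: str = "") -> str:
    """Build the REST path of a namespaced TrainJob (optionally a subresource)."""
    path = f"/apis/{GROUP}/{VERSION}/namespaces/{namespace}/{PLURAL}/{name}"
    return f"{path}/{subresource}" if subresource else path


def read_progress(trainjob: dict[str, Any]) -> Optional[float]:
    """Return the completed fraction (0.0 to 1.0), or None if not reported.

    HYPOTHETICAL schema: `status.progress.currentStep` and
    `status.progress.totalSteps` are illustrative names only.
    """
    status = trainjob.get("status") or {}
    progress = status.get("progress") or {}
    current = progress.get("currentStep")
    total = progress.get("totalSteps")
    if not isinstance(current, int) or not isinstance(total, int) or total <= 0:
        return None
    # Clamp in case the reported step overshoots the declared total.
    return min(current / total, 1.0)
```

A client with read access to the TrainJob could GET `trainjob_path("default", "demo")` from the API server and pass the decoded JSON object to `read_progress`; under these assumptions, only the caller's own read permission on the TrainJob is involved.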
What this PR does / why we need it:
Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #2779
Summary and test results
Usage:
Valid API request to get training progress:
Checklist: