feat(docs): proposal for adding TTLSecondsAfterFinished and ActiveDeadlineSeconds fields to TrainJob CRD #3068

XploY04 wants to merge 8 commits into kubeflow:master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
🎉 Welcome to the Kubeflow Trainer! 🎉 Thanks for opening your first PR! We're happy to have you as part of our community 🚀 Here's what happens next:
Join the community:
Feel free to ask questions in the comments if you need any help or clarification!
Pull request overview
This PR introduces a comprehensive design proposal (KEP-style document) for adding TTL-based automatic cleanup and runtime deadline enforcement to the TrainJob CRD. The proposal addresses resource management issues by enabling automatic deletion of finished jobs and preventing runaway training workloads.
Key Changes
- Proposes adding `TTLSecondsAfterFinished` field for automatic deletion of completed TrainJobs
- Proposes adding `ActiveDeadlineSeconds` field to enforce maximum runtime limits
- Includes detailed implementation plan, test strategy, production readiness considerations, and upgrade/downgrade procedures
Hey @andreyvelich
Pull Request Test Coverage Report for Build 22064803787

💛 - Coveralls
andreyvelich
left a comment
Sorry for the late reply @XploY04!
Overall, looks great, I left a few thoughts.
> - Expose `TTLSecondsAfterFinished` in the SDK (this is platform admin controlled)
> - Automatically migrate existing TrainJobs to use new defaults
> - Provide per-namespace TTL overrides
If you could add some user stories, that would help explain why we want to add ActiveDeadlineSeconds to TrainJob and TTLSecondsAfterFinished to Runtime.
Ref: https://github.com/kubeflow/trainer/tree/master/docs/proposals/2442-jax-runtime-trainer-v2#user-stories
> ```go
> // +kubebuilder:validation:Minimum=0
> TTLSecondsAfterFinished *int32 `json:"ttlSecondsAfterFinished,omitempty"`
>
> // ActiveDeadlineSeconds specifies the default maximum runtime for TrainJobs
> // using this runtime. Individual TrainJobs can override this value by setting
> // their own ActiveDeadlineSeconds.
> // +optional
> // +kubebuilder:validation:Minimum=1
> ActiveDeadlineSeconds *int64 `json:"activeDeadlineSeconds,omitempty"`
> ```
@XploY04 I would suggest removing activeDeadlineSeconds from the Runtime spec initially and telling users to configure the timeout in trainJob.spec directly.
Once we get feedback that users want to configure the timeout in Runtime for all TrainJobs, we can extend it easily.
> Add new condition reason in `pkg/apis/trainer/v1alpha1/trainjob_types.go`:
>
> ```go
> const (
>     // TrainJobDeadlineExceededReason is used when ActiveDeadlineSeconds is exceeded
>     TrainJobDeadlineExceededReason string = "DeadlineExceeded"
> )
> ```
This should be set for the Failed condition in TrainJob, right?
Like in Job: https://kubernetes.io/docs/concepts/workloads/controllers/job/#job-termination-and-cleanup
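For illustration, a condition mirroring the batch/v1 Job behavior linked above might look like the following sketch. The helper name and message text are assumptions, not the proposal's final wording; only the `reason` value `DeadlineExceeded` comes from the quoted diff.

```python
from datetime import datetime, timezone

def deadline_exceeded_condition(now=None):
    """Build a Failed condition with reason DeadlineExceeded, following
    standard Kubernetes condition conventions (type/status/reason/message)."""
    now = now or datetime.now(timezone.utc)
    return {
        "type": "Failed",
        "status": "True",
        "reason": "DeadlineExceeded",
        "message": "TrainJob was active longer than specified deadline",
        "lastTransitionTime": now.isoformat(),
    }
```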
> | Field | TrainJob Value | Runtime Value | Effective Value |
> |-------|----------------|---------------|-----------------|
> | `ActiveDeadlineSeconds` | Set | Set | **TrainJob value** (override) |
> | `ActiveDeadlineSeconds` | Set | Unset | TrainJob value |
> | `ActiveDeadlineSeconds` | Unset | Set | Runtime value (default) |
> | `ActiveDeadlineSeconds` | Unset | Unset | No deadline enforced |
> | `TTLSecondsAfterFinished` | N/A | Set | Runtime value |
> | `TTLSecondsAfterFinished` | N/A | Unset | No TTL cleanup |
You don’t need to include this table. Simply state that values defined in TrainJob take precedence over those specified in Runtime.
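As a sketch of that precedence rule (the helper name is hypothetical, not part of the proposal):

```python
def effective_deadline(trainjob_value, runtime_value):
    """TrainJob value takes precedence over the Runtime default;
    None for both means no deadline is enforced."""
    return trainjob_value if trainjob_value is not None else runtime_value
```

For example, `effective_deadline(3600, 28800)` resolves to the TrainJob's 3600, while `effective_deadline(None, 28800)` falls back to the Runtime default.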
> ```yaml
> # Uses runtime defaults: 8-hour deadline, 24-hour TTL
> ```
>
> **TrainJob Overriding Deadline (Data Scientist):**
Could you also add a simple example with the Kubeflow SDK and the train() API where AI practitioners can set a timeout:

```python
TrainerClient().train(
    trainer=CustomTrainer(
        func=get_torch_dist,
        num_nodes=3,
    ),
    initializer=Initializer(
        model=HuggingFaceDatasetInitializer(storage_uri="hf://qwen3.2-instruct")
    ),
    timeout=500,
)
```

cc @kubeflow/kubeflow-sdk-team
Sure, I will add this, and I can also take it up after the implementation is completed here.
> ### Implementation Overview
>
> **Controller Changes** (`pkg/controller/trainjob_controller.go`):
@tenzen-y @XploY04 Do we need to implement any of this functionality in the runtime framework?
As of now we use Info and PodSets to merge parameters: https://github.com/kubeflow/trainer/blob/master/pkg/runtime/runtime.go#L36
I don't think any of this functionality is needed in the runtime framework, because:
- TTL: must be handled at the TrainJob level because we need to delete the TrainJob object itself. Setting TTL on the JobSet would only delete the JobSet, leaving orphaned TrainJobs in etcd.
- Deadline: we can add the deadline as a secondary enforcement in the runtime framework, but the controller needs to set the `Failed` condition with `Reason: DeadlineExceeded` on the TrainJob, which can't be achieved from Job-level activeDeadlineSeconds alone.

So, I would not recommend passing either to the runtime framework for now.
Sounds good, we can define the logic in the TrainJob controller directly.
> ```go
> // +optional
> // +kubebuilder:validation:Minimum=1
> // +kubebuilder:validation:XValidation:rule="self == oldSelf",message="field is immutable"
> ActiveDeadlineSeconds *int64 `json:"activeDeadlineSeconds,omitempty"`
> ```
As an alternative, we can consider using the TemplateOverrides or Overrides API to update this value in Runtime: #3199
But that will force us to have something like this:

```go
type Override struct {
	Manager string `json:"manager,omitempty"`
	// runtimeSpecOverrides defines overrides that are applied to the Runtime spec
	RuntimeSpecOverrides []RuntimeSpecOverrides `json:"runtimeSpecOverrides,omitempty"`
	// jobTemplateOverrides defines overrides that are applied to the JobTemplateSpec
	JobTemplateOverrides []JobTemplateOverride `json:"jobTemplateOverrides,omitempty"`
	// podTemplateOverrides defines overrides that are applied to the PodTemplateSpec
	PodTemplateOverrides []PodTemplateOverride `json:"podTemplateOverrides,omitempty"`
}
```

Not sure if that makes sense compared to the simple trainJob.spec.activeDeadlineSeconds.
That looks overly complicated at first glance.
True. An alternative to introducing RuntimeOverrides into the Override API would be to duplicate the relevant parameters directly in the TrainJob spec.
For example, if we decide that certain parameters should be overridable at the TrainJob level, we could define a dedicated field such as trainJob.spec.workloadSpec or trainJob.spec.podGroupPolicy.
@tenzen-y What do you think?
> 1. Controller-runtime triggers initial sync, reconciling all TrainJobs
> 2. For each TrainJob, deadlines and TTL are recalculated from:
>    - The last resume time (or `metadata.creationTimestamp` if never suspended) for deadline calculation
>    - `LastTransitionTime` of the `Complete` or `Failed` condition for TTL calculation
>    - The referenced TrainingRuntime (protected from deletion via the `ResourceInUse` finalizer)
> 3. If a deadline/TTL already expired during downtime, action is taken immediately
> 4. Otherwise, appropriate requeue times are set
>
> This design ensures no TrainJobs are "forgotten" after a controller restart.
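The deadline part of this recovery can be sketched as pure arithmetic over the persisted start timestamp; no in-memory state is needed. The function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

def remaining_deadline_seconds(start_time, active_deadline_seconds, now):
    """Seconds left until the deadline, recomputed from the persisted start
    timestamp (last resume time, or creationTimestamp if never suspended).
    A value <= 0 means the deadline expired, possibly during controller downtime."""
    expiry = start_time + timedelta(seconds=active_deadline_seconds)
    return (expiry - now).total_seconds()
```

A positive result is a natural requeue delay; a non-positive result means the controller should fail the TrainJob immediately on this sync.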
Do we know if Job has similar semantics?
cc @kannon92
The K8s Job controller has the same semantics:

- For deadlines, on every sync the controller calls `pastActiveDeadline()`, which recalculates the deadline from the persisted `job.status.startTime`:

```go
// From syncJob():
} else if jm.pastActiveDeadline(&job) {
	jobCtx.finishedCondition = jm.newFailureCondition(
		batch.JobReasonDeadlineExceeded,
		"Job was active longer than specified deadline",
	)
} else if job.Spec.ActiveDeadlineSeconds != nil && !jobSuspended(&job) {
	syncDuration := time.Duration(*job.Spec.ActiveDeadlineSeconds)*time.Second -
		jm.clock.Since(job.Status.StartTime.Time)
	jm.queue.AddAfter(key, syncDuration)
}
```

- For TTL, the `ttl-after-finished` controller re-lists all Jobs on startup and recalculates expiry from the persisted `completionTime + ttlSecondsAfterFinished`. If the TTL expired during downtime, deletion happens immediately.

Our proposal follows this exact same pattern, using persisted timestamps (`lastResumeTime`, condition `LastTransitionTime`) to recalculate on restart, with no in-memory timer state.
Let me know if any changes are required here.
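The TTL side of that pattern reduces to a delete-or-requeue decision from the persisted finish timestamp. A minimal sketch, with a hypothetical helper name:

```python
from datetime import datetime, timedelta, timezone

def ttl_action(finished_at, ttl_seconds, now):
    """Decide what to do for a finished TrainJob: delete immediately if the
    TTL already expired (e.g. during controller downtime), otherwise requeue
    a reconcile for the exact moment the TTL will expire."""
    remaining = ttl_seconds - (now - finished_at).total_seconds()
    if remaining <= 0:
        return ("delete", 0.0)
    return ("requeue", remaining)
```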
> - End-to-end TTL deletion from Runtime default
> - End-to-end deadline from Runtime default
> - TrainJob deadline overriding Runtime deadline
> - Cascade deletion of owned resources
Let's also add integration tests for suspended TrainJobs.
> This design ensures no TrainJobs are "forgotten" after a controller restart.
>
> **Validation:**
Do we need to validate that deadline and TTL are not set in JobSet and Job?
Yes, I think we should add it, because without it both levels might have different values that could cause conflicts. I will add this to the proposal.
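Such a check could be a webhook-style walk over the runtime template that rejects nested timers, so the TrainJob-level fields stay the single source of truth. This is a sketch under assumed JobSet field names (`replicatedJobs`, `template.spec`); the helper is hypothetical.

```python
def validate_no_nested_timers(jobset_spec):
    """Collect errors for replicated Jobs in a runtime template that also
    set activeDeadlineSeconds or ttlSecondsAfterFinished at the Job level."""
    errors = []
    for rj in jobset_spec.get("replicatedJobs", []):
        job_spec = rj.get("template", {}).get("spec", {})
        for field in ("activeDeadlineSeconds", "ttlSecondsAfterFinished"):
            if job_spec.get(field) is not None:
                errors.append(
                    f"replicatedJobs[{rj.get('name')}].template.spec.{field} must be unset"
                )
    return errors
```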
Hi @andreyvelich
> ```python
> initializer=Initializer(
>     model=HuggingFaceDatasetInitializer(storage_uri="hf://qwen3.2-instruct")
> ),
> timeout=28800,  # 8 hours max
> ```
timeout seems too generic; it may be useful to be more specific.

Should I change it to active_deadline_seconds?

Yeah, it looks like we agreed to have active_deadline_seconds for the Katib SDK previously: kubeflow/katib#2568 (comment)

Ok, so I will change it to active_deadline_seconds.

That looks good / more specific to me. There might be other types of timeouts in the future.
/ok-to-test

@XploY04 @astefanutti @tenzen-y Are there any other open questions to this KEP before moving forward?

@andreyvelich Nothing from my side. I have started working on the implementation already since I saw no open questions except yours; it will be done by tomorrow.

@XploY04 can you update the SDK example with
…dlineSeconds fields to TrainJob CRD Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
…SecondsAfterFinished validation, and remove proposed status fields, SDK changes, and metrics. Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
andreyvelich
left a comment
I think we should be good to move this forward.
We haven't reached consensus around this, but we can discuss it as a followup: #3068 (comment)
/lgtm
/assign @tenzen-y @astefanutti @akshaychitneni
astefanutti
left a comment
Thanks @XploY04.
/lgtm
> ```go
> // This is a platform-level policy that individual TrainJobs cannot override.
> // +optional
> // +kubebuilder:validation:Minimum=0
> TTLSecondsAfterFinished *int32 `json:"ttlSecondsAfterFinished,omitempty"`
> ```
Why not reuse the batch/v1 Job-level TTL?

Job-level ttlSecondsAfterFinished only deletes the underlying batch/v1 Job (or JobSet); the TrainJob CR itself would remain as an orphan.
> ```go
> // Once reached, all running Pods are terminated and the TrainJob status becomes
> // Failed with reason: DeadlineExceeded.
> // +optional
> // +kubebuilder:validation:Minimum=1
> ```
Doesn't this minimum validation conflict with the +optional validation?
+optional allows the field to be absent (a nil pointer), while +kubebuilder:validation:Minimum=1 only applies when a value is provided.
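Those semantics can be mimicked in a few lines: validation runs only when the value is present. The helper name is illustrative, not part of the API.

```python
def validate_active_deadline(value):
    """None passes because the field is +optional; when a value is present,
    the Minimum=1 rule applies, so the two markers do not conflict."""
    if value is None:
        return True  # field omitted: no numeric validation runs
    return value >= 1
```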
> **Cons:**
> - No centralized policy enforcement for platform admins
> - Data scientists must set TTL on every job
I don't think so. Platform admins can set TTL on the trainingRuntime Job / Pod spec.
Yes, you are right here; that con is incorrect. The actual con is that there's no way for a data scientist to set TTLSecondsAfterFinished on their TrainJob that actually cleans up the TrainJob CR itself. I will update it.
> ```yaml
> metadata:
>   name: torch-distributed-gpu
> spec:
>   ttlSecondsAfterFinished: 86400  # Auto-delete after 24 hours
> ```
As I checked #2899, the request is about cleaning up Pods / Jobs / TrainJobs.
I didn't see any request about runtime clean up.
Indeed, the runtime is just a template, not an actual workload resource.
The comment is creating confusion here, I guess; a more accurate comment would be `# Auto-delete TrainJobs using this runtime after 24 hours`. I think this brings more clarity.
What this PR does / why we need it:
Fixes #2899
PR #3258