feat(docs): proposal for adding TTLSecondsAfterFinished and ActiveDeadlineSeconds fields to TrainJob CRD #3068

XploY04 wants to merge 8 commits into kubeflow:master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
🎉 Welcome to the Kubeflow Trainer! 🎉 Thanks for opening your first PR! We're happy to have you as part of our community 🚀 Here's what happens next:
Join the community:
Feel free to ask questions in the comments if you need any help or clarification!
Pull request overview
This PR introduces a comprehensive design proposal (KEP-style document) for adding TTL-based automatic cleanup and runtime deadline enforcement to the TrainJob CRD. The proposal addresses resource management issues by enabling automatic deletion of finished jobs and preventing runaway training workloads.
Key Changes
- Proposes adding `TTLSecondsAfterFinished` field for automatic deletion of completed TrainJobs
- Proposes adding `ActiveDeadlineSeconds` field to enforce maximum runtime limits
- Includes detailed implementation plan, test strategy, production readiness considerations, and upgrade/downgrade procedures
Hey @andreyvelich
Pull Request Test Coverage Report for Build 22064803787

💛 - Coveralls
andreyvelich
left a comment
Sorry for the late reply @XploY04!
Overall, looks great, I left a few thoughts.
> - Expose `TTLSecondsAfterFinished` in the SDK (this is platform admin controlled)
> - Automatically migrate existing TrainJobs to use new defaults
> - Provide per-namespace TTL overrides
If you could add some user stories, that would help explain why we want to add ActiveDeadlineSeconds to TrainJob and TTLSecondsAfterFinished to Runtime.
Ref: https://github.com/kubeflow/trainer/tree/master/docs/proposals/2442-jax-runtime-trainer-v2#user-stories
> ```go
> // +kubebuilder:validation:Minimum=0
> TTLSecondsAfterFinished *int32 `json:"ttlSecondsAfterFinished,omitempty"`
>
> // ActiveDeadlineSeconds specifies the default maximum runtime for TrainJobs
> // using this runtime. Individual TrainJobs can override this value by setting
> // their own ActiveDeadlineSeconds.
> // +optional
> // +kubebuilder:validation:Minimum=1
> ActiveDeadlineSeconds *int64 `json:"activeDeadlineSeconds,omitempty"`
> ```
@XploY04 I would suggest removing activeDeadlineSeconds from the Runtime spec initially and telling users to configure the timeout in trainJob.spec directly.
Once we get feedback that users want to configure the timeout in Runtime for all TrainJobs, we can extend it easily.
> Add new condition reason in `pkg/apis/trainer/v1alpha1/trainjob_types.go`:
>
> ```go
> const (
>     // TrainJobDeadlineExceededReason is used when ActiveDeadlineSeconds is exceeded
>     TrainJobDeadlineExceededReason string = "DeadlineExceeded"
> )
> ```
This should be set for the Failed condition in TrainJob, right?
Like in Job: https://kubernetes.io/docs/concepts/workloads/controllers/job/#job-termination-and-cleanup
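For illustration, a condition mirroring the batch/v1 Job behavior linked above might look like the following sketch. The helper name and message text are assumptions, not the proposal's final wording; only the `reason` value `DeadlineExceeded` comes from the quoted diff.

```python
from datetime import datetime, timezone

def deadline_exceeded_condition(now=None):
    """Build a Failed condition with reason DeadlineExceeded, following
    standard Kubernetes condition conventions (type/status/reason/message)."""
    now = now or datetime.now(timezone.utc)
    return {
        "type": "Failed",
        "status": "True",
        "reason": "DeadlineExceeded",
        "message": "TrainJob was active longer than specified deadline",
        "lastTransitionTime": now.isoformat(),
    }
```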
> | Field | TrainJob Value | Runtime Value | Effective Value |
> |-------|----------------|---------------|-----------------|
> | `ActiveDeadlineSeconds` | Set | Set | **TrainJob value** (override) |
> | `ActiveDeadlineSeconds` | Set | Unset | TrainJob value |
> | `ActiveDeadlineSeconds` | Unset | Set | Runtime value (default) |
> | `ActiveDeadlineSeconds` | Unset | Unset | No deadline enforced |
> | `TTLSecondsAfterFinished` | N/A | Set | Runtime value |
> | `TTLSecondsAfterFinished` | N/A | Unset | No TTL cleanup |
You don’t need to include this table. Simply state that values defined in TrainJob take precedence over those specified in Runtime.
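As a sketch of that precedence rule (the helper name is hypothetical, not part of the proposal):

```python
def effective_deadline(trainjob_value, runtime_value):
    """TrainJob value takes precedence over the Runtime default;
    None for both means no deadline is enforced."""
    return trainjob_value if trainjob_value is not None else runtime_value
```

For example, `effective_deadline(3600, 28800)` resolves to the TrainJob's 3600, while `effective_deadline(None, 28800)` falls back to the Runtime default.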
> ```yaml
> # Uses runtime defaults: 8-hour deadline, 24-hour TTL
> ```
>
> **TrainJob Overriding Deadline (Data Scientist):**
Could you also add a simple example with the Kubeflow SDK and the train() API where AI practitioners can set a timeout:

```python
TrainerClient().train(
    trainer=CustomTrainer(
        func=get_torch_dist,
        num_nodes=3,
    ),
    initializer=Initializer(
        model=HuggingFaceDatasetInitializer(storage_uri="hf://qwen3.2-instruct")
    ),
    timeout=500,
)
```

cc @kubeflow/kubeflow-sdk-team
Sure, I will add this, and I can also take it up after the implementation is completed here.
> ### Implementation Overview
>
> **Controller Changes** (`pkg/controller/trainjob_controller.go`):
@tenzen-y @XploY04 Do we need to implement any of this functionality in the runtime framework?
As of now we use Info and PodSets to merge parameters: https://github.com/kubeflow/trainer/blob/master/pkg/runtime/runtime.go#L36
I don't think any of this functionality is needed in the runtime framework, because:
- TTL: must be handled at the TrainJob level because we need to delete the TrainJob object itself. Setting TTL on the JobSet would only delete the JobSet, leaving orphaned TrainJobs in etcd.
- Deadline: we can add the deadline as a secondary enforcement in the runtime framework, but the controller needs to set the `Failed` condition with `Reason: DeadlineExceeded` on the TrainJob, which can't be achieved from Job-level activeDeadlineSeconds alone.

So, I would not recommend passing either to the runtime framework for now.
Sounds good, we can define the logic in the TrainJob controller directly.
> ```go
> // +optional
> // +kubebuilder:validation:Minimum=1
> // +kubebuilder:validation:XValidation:rule="self == oldSelf",message="field is immutable"
> ActiveDeadlineSeconds *int64 `json:"activeDeadlineSeconds,omitempty"`
> ```
As an alternative, we can consider using the TemplateOverrides or Overrides API to update this value in Runtime: #3199
But that will force us to have something like this:

```go
type Override struct {
	Manager string `json:"manager,omitempty"`
	// runtimeSpecOverrides defines overrides that are applied to the Runtime spec
	RuntimeSpecOverrides []RuntimeSpecOverrides `json:"runtimeSpecOverrides,omitempty"`
	// jobTemplateOverrides defines overrides that are applied to the JobTemplateSpec
	JobTemplateOverrides []JobTemplateOverride `json:"jobTemplateOverrides,omitempty"`
	// podTemplateOverrides defines overrides that are applied to the PodTemplateSpec
	PodTemplateOverrides []PodTemplateOverride `json:"podTemplateOverrides,omitempty"`
}
```

Not sure if that makes sense compared to the simple trainJob.spec.activeDeadlineSeconds.
That looks overly complicated at first glance.
True. An alternative to introducing RuntimeOverrides into the Override API would be to duplicate the relevant parameters directly in the TrainJob spec.
For example, if we decide that certain parameters should be overridable at the TrainJob level, we could define a dedicated field such as trainJob.spec.workloadSpec or trainJob.spec.podGroupPolicy.
@tenzen-y What do you think?
> 1. Controller-runtime triggers initial sync, reconciling all TrainJobs
> 2. For each TrainJob, deadlines and TTL are recalculated from:
>    - The last resume time (or `metadata.creationTimestamp` if never suspended) for deadline calculation
>    - `LastTransitionTime` of the `Complete` or `Failed` condition for TTL calculation
>    - The referenced TrainingRuntime (protected from deletion via the `ResourceInUse` finalizer)
> 3. If a deadline/TTL already expired during downtime, action is taken immediately
> 4. Otherwise, appropriate requeue times are set
>
> This design ensures no TrainJobs are "forgotten" after a controller restart.
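The deadline part of this recovery can be sketched as pure arithmetic over the persisted start timestamp; no in-memory state is needed. The function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

def remaining_deadline_seconds(start_time, active_deadline_seconds, now):
    """Seconds left until the deadline, recomputed from the persisted start
    timestamp (last resume time, or creationTimestamp if never suspended).
    A value <= 0 means the deadline expired, possibly during controller downtime."""
    expiry = start_time + timedelta(seconds=active_deadline_seconds)
    return (expiry - now).total_seconds()
```

A positive result is a natural requeue delay; a non-positive result means the controller should fail the TrainJob immediately on this sync.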
Do we know if Job has similar semantics?
cc @kannon92
The K8s Job controller has the same semantics:

- For deadlines, on every sync the controller calls `pastActiveDeadline()`, which recalculates the deadline from the persisted `job.status.startTime`:

```go
// From syncJob():
} else if jm.pastActiveDeadline(&job) {
	jobCtx.finishedCondition = jm.newFailureCondition(
		batch.JobReasonDeadlineExceeded,
		"Job was active longer than specified deadline",
	)
} else if job.Spec.ActiveDeadlineSeconds != nil && !jobSuspended(&job) {
	syncDuration := time.Duration(*job.Spec.ActiveDeadlineSeconds)*time.Second -
		jm.clock.Since(job.Status.StartTime.Time)
	jm.queue.AddAfter(key, syncDuration)
}
```

- For TTL, the `ttl-after-finished` controller re-lists all Jobs on startup and recalculates expiry from the persisted `completionTime + ttlSecondsAfterFinished`. If the TTL expired during downtime, deletion happens immediately.

Our proposal follows this exact same pattern, using persisted timestamps (`lastResumeTime`, condition `LastTransitionTime`) to recalculate on restart, with no in-memory timer state.
Let me know if any changes are required here.
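The TTL side of that pattern reduces to a delete-or-requeue decision from the persisted finish timestamp. A minimal sketch, with a hypothetical helper name:

```python
from datetime import datetime, timedelta, timezone

def ttl_action(finished_at, ttl_seconds, now):
    """Decide what to do for a finished TrainJob: delete immediately if the
    TTL already expired (e.g. during controller downtime), otherwise requeue
    a reconcile for the exact moment the TTL will expire."""
    remaining = ttl_seconds - (now - finished_at).total_seconds()
    if remaining <= 0:
        return ("delete", 0.0)
    return ("requeue", remaining)
```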
> - End-to-end TTL deletion from Runtime default
> - End-to-end deadline from Runtime default
> - TrainJob deadline overriding Runtime deadline
> - Cascade deletion of owned resources
Let's also add integration tests for suspended TrainJobs.
> This design ensures no TrainJobs are "forgotten" after a controller restart.
>
> **Validation:**
Do we need to validate that deadline and TTL are not set in JobSet and Job?
Yes, I think we should add it, because without it both levels might have different values that could cause conflicts. I will add this to the proposal.
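Such a check could be a webhook-style walk over the runtime template that rejects nested timers, so the TrainJob-level fields stay the single source of truth. This is a sketch under assumed JobSet field names (`replicatedJobs`, `template.spec`); the helper is hypothetical.

```python
def validate_no_nested_timers(jobset_spec):
    """Collect errors for replicated Jobs in a runtime template that also
    set activeDeadlineSeconds or ttlSecondsAfterFinished at the Job level."""
    errors = []
    for rj in jobset_spec.get("replicatedJobs", []):
        job_spec = rj.get("template", {}).get("spec", {})
        for field in ("activeDeadlineSeconds", "ttlSecondsAfterFinished"):
            if job_spec.get(field) is not None:
                errors.append(
                    f"replicatedJobs[{rj.get('name')}].template.spec.{field} must be unset"
                )
    return errors
```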
Hi @andreyvelich
> ```python
> initializer=Initializer(
>     model=HuggingFaceDatasetInitializer(storage_uri="hf://qwen3.2-instruct")
> ),
> timeout=28800,  # 8 hours max
> ```
timeout seems too generic; it may be useful to be more specific.

Should I change it to active_deadline_seconds?

Yeah, it looks like we agreed to have active_deadline_seconds for the Katib SDK previously: kubeflow/katib#2568 (comment)

Ok, so I will change it to active_deadline_seconds.

That looks good / more specific to me. There might be other types of timeouts in the future.
/ok-to-test

@XploY04 @astefanutti @tenzen-y Are there any other open questions to this KEP before moving forward?

@andreyvelich Nothing from my side. I have started working on the implementation already since I saw no open questions except yours; it will be done by tomorrow.

@XploY04 can you update the SDK example with
…dlineSeconds fields to TrainJob CRD Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
…SecondsAfterFinished validation, and remove proposed status fields, SDK changes, and metrics. Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
andreyvelich
left a comment
I think we should be good to move this forward.
We haven't reached consensus around this, but we can discuss it as a followup: #3068 (comment)
/lgtm
/assign @tenzen-y @astefanutti @akshaychitneni
astefanutti
left a comment
Thanks @XploY04.
/lgtm
> ```go
> // This is a platform-level policy that individual TrainJobs cannot override.
> // +optional
> // +kubebuilder:validation:Minimum=0
> TTLSecondsAfterFinished *int32 `json:"ttlSecondsAfterFinished,omitempty"`
> ```
Why not reuse the batch/v1 Job-level TTL?

Job-level ttlSecondsAfterFinished only deletes the underlying batch/v1 Job (or JobSet); the TrainJob CR itself would remain as an orphan.
> ```go
> // Once reached, all running Pods are terminated and the TrainJob status becomes
> // Failed with reason: DeadlineExceeded.
> // +optional
> // +kubebuilder:validation:Minimum=1
> ```
Doesn't this minimum validation conflict with the +optional validation?
+optional allows the field to be absent (a nil pointer), while +kubebuilder:validation:Minimum=1 only applies when a value is provided.
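Those semantics can be mimicked in a few lines: validation runs only when the value is present. The helper name is illustrative, not part of the API.

```python
def validate_active_deadline(value):
    """None passes because the field is +optional; when a value is present,
    the Minimum=1 rule applies, so the two markers do not conflict."""
    if value is None:
        return True  # field omitted: no numeric validation runs
    return value >= 1
```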
> **Cons:**
> - No centralized policy enforcement for platform admins
> - Data scientists must set TTL on every job
I don't think so. Platform admins can set TTL on the trainingRuntime Job / Pod spec.
Yes, you are right here; that con is incorrect. The actual con is that there's no way for a data scientist to set TTLSecondsAfterFinished on their TrainJob that actually cleans up the TrainJob CR itself. I will update it.
> ```yaml
> metadata:
>   name: torch-distributed-gpu
> spec:
>   ttlSecondsAfterFinished: 86400  # Auto-delete after 24 hours
> ```
As I checked #2899, the request is about cleaning up Pods / Jobs / TrainJobs.
I didn't see any request about runtime clean up.
Indeed, the runtime is just a template, not an actual workload resource.
The comment is creating confusion here, I guess; a more accurate comment would be `# Auto-delete TrainJobs using this runtime after 24 hours`. I think this brings more clarity.
What this PR does / why we need it:
Fixes #2899
PR #3258