
chore: Trainer: Specialized Trainers #308

Open

szaher wants to merge 5 commits into kubeflow:main from szaher:KEP-285

Conversation


@szaher szaher commented Feb 19, 2026

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):

Fixes #

Checklist:

  • Docs included if any changes are user facing

szaher and others added 5 commits February 11, 2026 19:17
Proposing framework-aware trainer classes (TorchTrainer,
MPITrainer, JAXTrainer, XGBoostTrainer) with automatic runtime discovery
via the trainer.kubeflow.org/framework label, and a RuntimeConfig
dataclass to separate per-job environment settings from training logic.

Issue: kubeflow#285

Signed-off-by: Saad Zaher <szaher@redhat.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Antonin Stefanutti <astefanutti@users.noreply.github.com>
Copilot AI review requested due to automatic review settings February 19, 2026 22:43
@google-oss-prow
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign kramaranya for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


Copilot AI left a comment

Pull request overview

This PR adds a comprehensive design proposal for specialized trainer abstractions and a RuntimeConfig dataclass to the Kubeflow SDK. The proposal addresses current limitations in the SDK's trainer subsystem by introducing framework-aware trainer classes that bridge the gap between the generic CustomTrainer and the highly specific BuiltinTrainer.

Changes:

  • Adds a detailed design proposal document describing a new BaseTrainer abstract interface and specialized framework trainers (TorchTrainer, MPITrainer, JAXTrainer, XGBoostTrainer)
  • Proposes a RuntimeConfig dataclass to cleanly separate runtime environment settings from training logic
  • Includes comprehensive documentation covering motivation, design details, API examples, migration strategy, test plan, and alternatives considered

3. **Deprecating `CustomTrainer` or `BuiltinTrainer`.** Both remain supported.
Specialized trainers are an additional option, not a replacement.
4. **Tier 2 trainer implementations.** This proposal defines the extension mechanism
and interface. Concrete Tier 2 implementations (HuggingFace, DeepSpeed, Unsloth,

Copilot AI Feb 19, 2026


The company name should be spelled "Hugging Face" (with a space) rather than "HuggingFace" throughout the document. This applies to references in text and comments, though the class name "HuggingFaceTrainer" would be correct as Python class names don't use spaces.

Suggested change
- and interface. Concrete Tier 2 implementations (HuggingFace, DeepSpeed, Unsloth,
+ and interface. Concrete Tier 2 implementations (Hugging Face, DeepSpeed, Unsloth,

Comment on lines +483 to +489
# Example: future HuggingFaceTrainer (NOT part of this proposal's implementation scope)

@dataclass
class TransformersTrainer(BaseTrainer):
    """Trainer for HuggingFace Transformers training.

    Wraps HuggingFace's Trainer API and maps to a PyTorch runtime.

Copilot AI Feb 19, 2026


The company name should be spelled "Hugging Face" (with a space) rather than "HuggingFace" in the comment and docstring text.

Suggested change
- # Example: future HuggingFaceTrainer (NOT part of this proposal's implementation scope)
- @dataclass
- class TransformersTrainer(BaseTrainer):
- """Trainer for HuggingFace Transformers training.
- Wraps HuggingFace's Trainer API and maps to a PyTorch runtime.
+ # Example: future Hugging Face trainer (NOT part of this proposal's implementation scope)
+ @dataclass
+ class TransformersTrainer(BaseTrainer):
+ """Trainer for Hugging Face Transformers training.
+ Wraps Hugging Face's Trainer API and maps to a PyTorch runtime.

      ┌───────┴────────┐
      │                │
 HuggingFace       DeepSpeed

Copilot AI Feb 19, 2026


The company name should be spelled "Hugging Face" (with a space) rather than "HuggingFace" in the diagram text.

Suggested change
- HuggingFace       DeepSpeed
+ Hugging Face      DeepSpeed

@krishdef7
Contributor

@szaher @andreyvelich — this proposal is really well thought out, especially the separation between BaseTrainer and framework-specific trainers along with RuntimeConfig.

I had a question regarding TorchTrainer extensibility and runtime selection:

Given that multiple torch-based runtimes may coexist (as discussed earlier in #287), how do you envision selecting the appropriate runtime for a given TorchTrainer instance?

One possible approach could be:

  • Allow an optional runtime_name in RuntimeConfig (explicit selection), and
  • Fall back to a priority-based selection among compatible runtimes (e.g., via annotations or ordering) when not specified.

This might help keep the API simple while still supporting multiple backends (e.g., TorchTune vs custom PEFT/TRL runtimes for LLM workflows).

Curious if something along these lines aligns with the intended direction.

Happy to explore this further or prototype once the design is clearer.
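The fallback described in the two bullets above could be sketched as follows. The `runtime_name` field, the `priority` attribute, and the runtime objects are all hypothetical, sketching the commenter's suggestion rather than anything the proposal specifies.

```python
from dataclasses import dataclass


@dataclass
class Runtime:
    name: str
    framework: str
    priority: int = 0  # hypothetical: e.g. parsed from a runtime annotation


def select_runtime(runtimes, framework, runtime_name=None):
    """Pick an explicitly named runtime, else the highest-priority compatible one."""
    compatible = [r for r in runtimes if r.framework == framework]
    if runtime_name is not None:
        # Explicit selection: the name must refer to a compatible runtime.
        for r in compatible:
            if r.name == runtime_name:
                return r
        raise ValueError(f"no runtime named {runtime_name!r} for framework {framework!r}")
    if not compatible:
        raise ValueError(f"no runtime supports framework {framework!r}")
    # Fallback: priority-based selection among compatible runtimes.
    return max(compatible, key=lambda r: r.priority)


runtimes = [
    Runtime("torch-distributed", "torch", priority=10),
    Runtime("torchtune", "torch", priority=5),
    Runtime("mpi-base", "mpi"),
]
```

Under this sketch, `select_runtime(runtimes, "torch")` falls back to the highest-priority torch runtime, while passing `runtime_name="torchtune"` overrides the fallback explicitly.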

Contributor

@kramaranya kramaranya left a comment


Thanks @szaher!
Looks great to me, and it should be a great improvement to the user experience in Kubeflow SDK!

/assign @andreyvelich @astefanutti @briangallagher @Fiona-Waters @MStokluska

Comment on lines +81 to +84
2. **`RuntimeConfig` dataclass** — A dedicated configuration object that cleanly separates
per-job runtime environment settings (packages, pip config, environment variables) from
training logic and scaling parameters. This replaces the current pattern where
`CustomTrainer` conflates runtime concerns with trainer concerns.

Would this require runtime/controller changes?

Comment on lines +143 to +144
3. **Deprecating `CustomTrainer` or `BuiltinTrainer`.** Both remain supported.
Specialized trainers are an additional option, not a replacement.

Is the plan to eventually deprecate those or do we want to always maintain both options?

Comment on lines +297 to +302
        if runtime.trainer.framework not in self.supported_frameworks:
            raise ValueError(
                f"{type(self).__name__} supports frameworks "
                f"{self.supported_frameworks}, but runtime '{runtime.name}' "
                f"has framework '{runtime.trainer.framework}'"
            )

We also would need to validate runtime.trainer.trainer_type too
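Extending the quoted check to cover `trainer_type` as well could look like this sketch. The attribute paths mirror the quoted snippet, but the surrounding classes and the `supported_trainer_types` values are hypothetical stand-ins.

```python
from dataclasses import dataclass


# Hypothetical stand-ins mirroring the attribute paths in the quoted snippet.
@dataclass
class RuntimeTrainer:
    framework: str
    trainer_type: str


@dataclass
class Runtime:
    name: str
    trainer: RuntimeTrainer


@dataclass
class TorchTrainer:
    supported_frameworks: tuple = ("torch",)
    supported_trainer_types: tuple = ("CustomTrainer",)  # assumed value

    def validate_runtime(self, runtime: Runtime) -> None:
        if runtime.trainer.framework not in self.supported_frameworks:
            raise ValueError(
                f"{type(self).__name__} supports frameworks "
                f"{self.supported_frameworks}, but runtime '{runtime.name}' "
                f"has framework '{runtime.trainer.framework}'"
            )
        # The additional check suggested in review: validate the trainer type too.
        if runtime.trainer.trainer_type not in self.supported_trainer_types:
            raise ValueError(
                f"{type(self).__name__} supports trainer types "
                f"{self.supported_trainer_types}, but runtime '{runtime.name}' "
                f"has trainer type '{runtime.trainer.trainer_type}'"
            )
```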

Comment on lines +362 to +368
    def get_framework_args(self) -> dict:
        args = {}
        if self.max_restarts is not None:
            args["max-restarts"] = str(self.max_restarts)
        if self.monitor_interval is not None:
            args["monitor-interval"] = str(self.monitor_interval)
        return args

where these new args go in the TrainJob spec?

@google-oss-prow
Contributor

@kramaranya: GitHub didn't allow me to assign the following users: MStokluska.

Note that only kubeflow members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

Details

In response to this:

Thanks @szaher!
Looks great to me, and it should be a great improvement to the user experience in Kubeflow SDK!

/assign @andreyvelich @astefanutti @briangallagher @Fiona-Waters @MStokluska

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@krishdef7
Contributor

+1 on the points around validation and argument placement; I had a related question while reading through this.

For the framework-specific args (e.g. max_restarts, monitor_interval), where do you envision these being materialized in the resulting TrainJob spec?

  • Do they map directly to existing fields in the underlying CRDs (e.g. TorchJob/MPIJob), or
  • Are they intended to flow through a more generic extension mechanism (e.g. annotations / plugin args)?

This also seems tied to whether RuntimeConfig and the specialized trainers remain purely SDK-layer abstractions, or if they imply corresponding changes in the controller/runtime layer.

Clarifying this mapping would help understand how far the abstraction goes (SDK-only vs API/CRD impact).
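Purely to illustrate the first bullet's SDK-only reading of the question, the dict returned by `get_framework_args()` could be rendered into command-line flags on the trainer container, leaving the CRDs untouched. Everything below is hypothetical; the proposal does not commit to this mapping, and `torchrun` is used only because it happens to accept `--max-restarts` and `--monitor-interval` flags.

```python
def render_torchrun_command(entrypoint: str, framework_args: dict) -> list[str]:
    """Hypothetical SDK-side rendering: fold framework args into torchrun flags."""
    cmd = ["torchrun"]
    for key, value in sorted(framework_args.items()):
        cmd.append(f"--{key}={value}")
    cmd.append(entrypoint)
    return cmd


# Using args shaped like the output of the quoted get_framework_args():
args = {"max-restarts": "3", "monitor-interval": "5"}
cmd = render_torchrun_command("train.py", args)
# cmd == ["torchrun", "--max-restarts=3", "--monitor-interval=5", "train.py"]
```

The alternative reading (the second bullet) would instead surface these values as fields or annotations on the TrainJob itself, which is exactly the SDK-only vs CRD-impact question raised above.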


8 participants