feat(api): BREAKING CHANGE: Remove numProcPerNode from Torch API by andreyvelich · Pull Request #3239 · kubeflow/trainer

andreyvelich · 2026-02-23T14:59:56Z

This BREAKING CHANGE removes the numProcPerNode API from the Torch MLPolicy. As discussed during the latest Trainer call, we would like to remove this API and rely on container resources to set this value: https://youtu.be/e9_g28XdpHg?t=351

We would like to keep trainJob.spec.trainer.numProcPerNode for now, since users with MPI use cases might want to set the number of slots.

cc @kubeflow/kubeflow-trainer-team @kubeflow/kubeflow-sdk-team @vsoch @akshaychitneni

google-oss-prow · 2026-02-23T15:00:04Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from andreyvelich. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

andreyvelich · 2026-02-23T15:00:06Z

/hold for review

Copilot

Pull request overview

This PR implements a BREAKING CHANGE that removes the numProcPerNode field from the TorchMLPolicySource API and changes TrainJob.Spec.Trainer.NumProcPerNode from IntOrString to *int32. This simplifies the API by relying on container resources and PyTorch's native auto-detection for determining the number of processes per node.

Changes:

- Removed numProcPerNode from TorchMLPolicySource in TrainingRuntime/ClusterTrainingRuntime
- Changed TrainJob.Spec.Trainer.NumProcPerNode from IntOrString to *int32
  - For Torch runtime: defaults to "auto" internally and can be overridden with an int value
  - For MPI runtime: behavior unchanged, continues to use int values for number of slots per node
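
The Torch behavior listed above (default to "auto", allow an integer override) could look roughly like the following. This is a minimal sketch, not the code in torch.go: the helper name `resolveTorchNumProcPerNode` is hypothetical, and it returns a plain string instead of the IntOrString value the plugin uses internally, so that the example needs only the standard library.

```go
package main

import (
	"fmt"
	"strconv"
)

// resolveTorchNumProcPerNode is a hypothetical helper for illustration only.
// A nil override stands for "field not set on the TrainJob", in which case
// the value falls back to "auto" so PyTorch can detect the process count.
func resolveTorchNumProcPerNode(override *int32) string {
	if override == nil {
		return "auto"
	}
	// An explicit int32 override is rendered as a decimal string.
	return strconv.Itoa(int(*override))
}

func main() {
	// No override set: prints "auto".
	fmt.Println(resolveTorchNumProcPerNode(nil))

	// Explicit override of 4 processes per node: prints "4".
	four := int32(4)
	fmt.Println(resolveTorchNumProcPerNode(&four))
}
```

Using a pointer rather than a plain int32 lets the sketch distinguish "not set" from any concrete value, which matches the *int32 type named in the change list.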

Reviewed changes

Copilot reviewed 38 out of 38 changed files in this pull request and generated no comments.

Show a summary per file

| File | Description |
| --- | --- |
| `pkg/apis/trainer/v1alpha1/trainingruntime_types.go` | Removed NumProcPerNode field from TorchMLPolicySource struct |
| `pkg/apis/trainer/v1alpha1/trainjob_types.go` | Changed NumProcPerNode type from IntOrString to *int32 in Trainer struct |
| `pkg/apis/trainer/v1alpha1/zz_generated.deepcopy.go` | Updated generated deepcopy code to remove IntOrString handling for TorchMLPolicySource.NumProcPerNode |
| `pkg/apis/trainer/v1alpha1/zz_generated.openapi.go` | Updated OpenAPI schema to reflect API changes |
| `pkg/runtime/framework/plugins/torch/torch.go` | Updated to default numProcPerNode to "auto" internally, convert int32 override to IntOrString for internal processing |
| `pkg/runtime/framework/plugins/torch/torchtune.go` | Updated to handle nil NumProcPerNode by defaulting to "auto" |
| `pkg/runtime/framework/plugins/mpi/mpi.go` | Simplified to directly use int32 values instead of IntOrString conversion |
| `pkg/util/testing/wrapper.go` | Updated test wrapper methods: TorchPolicy() simplified, NumProcPerNode() changed to accept int32 |
| `test/integration/webhooks/trainjob_test.go` | Removed validation tests for string values, updated tests to use int32 values |
| `test/integration/webhooks/trainingruntime_webhook_test.go` | Removed defaulting test for torch.numProcPerNode |
| `pkg/runtime/framework/plugins/torch/torch_test.go` | Removed validation tests for string values, updated all test cases to use simplified TorchPolicy() and int32 NumProcPerNode |
| `pkg/runtime/framework/plugins/mpi/mpi_test.go` | Removed validation test for string values, updated to use int32 NumProcPerNode |
| `manifests/base/runtimes/torch_distributed.yaml` | Changed `torch.numProcPerNode: auto` to `torch: {}` |
| `manifests/base/runtimes/torchtune/*` | Changed `torch.numProcPerNode: auto` to `torch: {}` in all torchtune runtime manifests |
| `manifests/base/crds/*.yaml` | Updated CRD schemas to reflect API changes |
| `charts/kubeflow-trainer/templates/runtimes/*.yaml` | Updated Helm chart runtime templates to use `torch: {}` |
| `charts/kubeflow-trainer/crds/*.yaml` | Updated Helm chart CRDs to match base manifests |
| `api/python_api/kubeflow_trainer_api/models/*` | Updated Python SDK models to reflect API changes |
| `api/openapi-spec/swagger.json` | Updated OpenAPI specification |
| `docs/proposals/2170-kubeflow-trainer-v2/README.md` | Updated proposal documentation with API changes and implementation history |
| `pkg/client/applyconfiguration/trainer/v1alpha1/*` | Updated apply configurations to reflect API changes |
| `test/integration/controller/trainjob_controller_test.go` | Updated controller tests to use simplified TorchPolicy() |
| `pkg/runtime/core/trainingruntime_test.go` | Updated core runtime tests to use simplified TorchPolicy() and int32 NumProcPerNode |
| `pkg/runtime/framework/plugins/plainml/plainml_test.go` | Updated plainml tests to use simplified TorchPolicy() |
| `charts/kubeflow-trainer/tests/runtimes/torch_distributed_test.yaml` | Updated Helm chart test to expect `torch: {}` instead of `torch.numProcPerNode: auto` |
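
For the MPI side, where numProcPerNode stays an integer that sets the number of slots per node, the direct int32 handling might be pictured as below. This is a sketch under stated assumptions, not the logic in mpi.go: `hostfileLines` and `defaultSlots` are hypothetical names, the fallback of 1 slot is an assumption that the PR does not state, and the `host slots=N` line format follows the Open MPI hostfile convention.

```go
package main

import "fmt"

// defaultSlots is an assumption made for this sketch; the PR does not say
// which value the MPI plugin uses when numProcPerNode is unset.
const defaultSlots int32 = 1

// hostfileLines is a hypothetical helper that renders one Open MPI style
// hostfile line per host, using the int32 slot count directly with no
// IntOrString conversion step.
func hostfileLines(hosts []string, slots *int32) []string {
	n := defaultSlots
	if slots != nil {
		n = *slots
	}
	lines := make([]string, 0, len(hosts))
	for _, h := range hosts {
		lines = append(lines, fmt.Sprintf("%s slots=%d", h, n))
	}
	return lines
}

func main() {
	two := int32(2)
	// The host names here are placeholders chosen for the example.
	for _, line := range hostfileLines([]string{"node-0", "node-1"}, &two) {
		fmt.Println(line)
	}
}
```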

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

Copilot AI review requested due to automatic review settings February 23, 2026 14:59

google-oss-prow bot requested review from akshaychitneni and jinchihe February 23, 2026 15:00

google-oss-prow bot added size/L do-not-merge/hold labels Feb 23, 2026

Copilot AI reviewed Feb 23, 2026

feat(api): BREAKING CHANGE: Remove numProcPerNode from Torch API

350a6d5

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

andreyvelich force-pushed the remove-num-proc branch from 1672171 to 350a6d5 February 23, 2026 16:38

andreyvelich mentioned this pull request Feb 24, 2026

feat: support for flux framework as hpc manager #3188

Open

andreyvelich wants to merge 1 commit into kubeflow:master from andreyvelich:remove-num-proc

2 participants
