feat: support for flux framework as hpc manager #3188
vsoch wants to merge 7 commits into kubeflow:master
Conversation
Pull Request Test Coverage Report for Build 21810351691 (Coveralls)
Pull request overview
This PR adds Flux Framework as a new Kubeflow Trainer runtime plugin and extends the Trainer APIs/CRDs so Flux-specific ML policy can be configured via TrainingRuntime/ClusterTrainingRuntime, along with example manifests.
Changes:
- Register a new fluxframework plugin and integrate it into the runtime framework plugin pipeline.
- Extend the MLPolicy API surface (Go types, CRDs, OpenAPI, Python client) with flux/FluxMLPolicySource (currently numProcPerNode).
- Add Flux plugin implementation + unit tests and provide example runtime + TrainJob YAMLs.
Reviewed changes
Copilot reviewed 25 out of 25 changed files in this pull request and generated 13 comments.
| File | Description |
|---|---|
| pkg/runtime/runtime.go | Extends runtime policy/container abstractions to carry Flux policy + image/command fields. |
| pkg/runtime/framework/plugins/registry.go | Registers the Flux plugin in the framework plugin registry. |
| pkg/runtime/framework/plugins/jobset/jobset.go | Propagates runtime PodSet container Command/Image into JobSet apply spec. |
| pkg/runtime/framework/plugins/flux/flux.go | New Flux plugin: enforces Flux behavior, generates ConfigMap scripts, creates CURVE Secret, adds watches. |
| pkg/runtime/framework/plugins/flux/flux_test.go | New Flux plugin tests. |
| pkg/runtime/framework/core/framework_test.go | Updates framework tests to include Flux plugin in expected plugin sets. |
| pkg/client/applyconfiguration/utils.go | Adds applyconfiguration kind mapping for FluxMLPolicySource. |
| pkg/client/applyconfiguration/trainer/v1alpha1/mlpolicysource.go | Adds Flux field + builder method to MLPolicySource apply config. |
| pkg/client/applyconfiguration/trainer/v1alpha1/mlpolicy.go | Adds Flux builder method to MLPolicy apply config. |
| pkg/client/applyconfiguration/trainer/v1alpha1/fluxmlpolicysource.go | Generated apply config for FluxMLPolicySource. |
| pkg/apis/trainer/v1alpha1/* | Adds FluxMLPolicySource to Go API types + generated deepcopy/openapi. |
| manifests/base/crds/* | CRD schema updates to include mlPolicy.flux. |
| charts/kubeflow-trainer/crds/* | Helm CRD schema updates to include mlPolicy.flux. |
| api/python_api/kubeflow_trainer_api/models/* | Python client model updates for Flux ML policy. |
| api/openapi-spec/swagger.json | OpenAPI schema updates for FluxMLPolicySource. |
| examples/flux/* | Adds example Flux runtime + TrainJob manifests (batch + interactive). |
Comments suppressed due to low confidence (1)
pkg/runtime/runtime.go:87
runtime.Container gained Image and Command fields, but toPodSetContainer never populates them from the runtime PodSpec apply configs, so these values will always be empty unless manually set later; update the conversion to copy image/command from corev1ac.ContainerApplyConfiguration.
```go
type Container struct {
	Name         string
	Image        string
	Command      []string
	Env          []corev1ac.EnvVarApplyConfiguration
	Ports        []corev1ac.ContainerPortApplyConfiguration
	VolumeMounts []corev1ac.VolumeMountApplyConfiguration
}
```
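The suggested conversion fix might look like the sketch below, using simplified local stand-ins for corev1ac.ContainerApplyConfiguration and runtime.Container (field names mirror the PR, but the types and function here are illustrative, not the actual code):

```go
package main

import "fmt"

// containerApply is a simplified stand-in for
// corev1ac.ContainerApplyConfiguration: pointer fields, as in apply configs.
type containerApply struct {
	Name    *string
	Image   *string
	Command []string
}

// Container mirrors the runtime.Container shape from the PR.
type Container struct {
	Name    string
	Image   string
	Command []string
}

// toPodSetContainer (sketch): copy Image and Command from the apply
// configuration so they are no longer left empty, per the review comment.
func toPodSetContainer(ac containerApply) Container {
	c := Container{Command: ac.Command}
	if ac.Name != nil {
		c.Name = *ac.Name
	}
	if ac.Image != nil {
		c.Image = *ac.Image
	}
	return c
}

func main() {
	name, image := "node", "ghcr.io/flux-framework/flux:latest"
	c := toPodSetContainer(containerApply{Name: &name, Image: &image, Command: []string{"flux", "start"}})
	fmt.Println(c.Name, c.Image, c.Command)
}
```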
```sh
node_spec="-n2"
node_spec="${node_spec}"
flags="${node_spec} "
```

The Flux submit flags are hard-coded (node_spec="-n2") and ignore the configured fluxPolicy.NumProcPerNode and the TrainJob node count, so most jobs will run with the wrong parallelism; derive -n/-N (or equivalent) from NumProcPerNode and trainJob.Spec.Trainer.NumNodes.
```diff
-node_spec="-n2"
-node_spec="${node_spec}"
-flags="${node_spec} "
+# Derive node and task counts from environment when available.
+# FLUX_NUM_NODES is expected to be the TrainJob node count and
+# FLUX_NUM_PROC_PER_NODE is expected to be fluxPolicy.NumProcPerNode.
+if [[ -n "${FLUX_NUM_NODES}" ]] && [[ -n "${FLUX_NUM_PROC_PER_NODE}" ]]; then
+  total_tasks=$((FLUX_NUM_NODES * FLUX_NUM_PROC_PER_NODE))
+  node_spec="-N${FLUX_NUM_NODES} -n${total_tasks}"
+elif [[ -n "${FLUX_NUM_NODES}" ]]; then
+  node_spec="-N${FLUX_NUM_NODES}"
+elif [[ -n "${FLUX_NUM_PROC_PER_NODE}" ]]; then
+  node_spec="-n${FLUX_NUM_PROC_PER_NODE}"
+else
+  # Fallback to the previous hard-coded behavior.
+  node_spec="-n2"
+fi
+flags="${node_spec}"
```
This is a bug (oversight) converting from the previous implementation. I'll update this to have tasks == numProcPerNode. The main issue here is that the number of tasks for the job may not necessarily equate to the number of processes per node, but for most Trainer examples I've seen, this is what is expected. We also are not properly supporting GPUs. For that, flux needs to have a -g flag, which corresponds to GPUs per task.
@andreyvelich how should we address that? E.g., if we have a single node with 8 GPUs and we want the entire pod to consume the entire node (and all GPUs) we would do -g 8. E.g., "two nodes, each has 8 cores and each of the 8 cores is assigned to 1 gpu":

```sh
flux run -N2 -n 8 -g 1 /opt/multi-gpu-programming-models/mpi/jacobi -niter 5000
```

Here is a more realistic (complex) example:

```sh
flux run --cores-per-task 1 --env OMP_NUM_THREADS=1 -N16 -n 128 -g 1 -o gpu-affinity=per-task kripke --arch CUDA --layout GDZ --dset 8 --zones 128,128,128 --gset 16 --groups 64 --niter 50 --legendre 8 --quad 8 --procs 4,8,4
```

In the above, we want 16 physical nodes, each with 8 GPUs, so a total of 128 GPUs across the job. That means that each "slot" or "task" gets one GPU (-g 1) and just one core (--cores-per-task). The above does not say that each physical node has 128 "numProcPerNode", and it isn't clear how I'd specify this to run with the Kubeflow Trainer right now. Let's discuss.
There are other examples in that README if needed.
Can we start with a simple assignment: -n == .trainer.numProcPerNode * trainer.numNodes if it is set in the TrainJob, otherwise -n == .flux.numProcPerNode * .numNodes?
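The proposed assignment can be sketched as a small helper; buildNodeSpec, its names, and its defaults are illustrative only, not the PR's implementation:

```go
package main

import "fmt"

// buildNodeSpec derives flux run -N/-n flags from the TrainJob node count and
// the configured numProcPerNode. Zero or negative inputs fall back to 1, the
// default discussed in this thread.
func buildNodeSpec(numNodes, numProcPerNode int) string {
	if numNodes <= 0 {
		numNodes = 1
	}
	if numProcPerNode <= 0 {
		numProcPerNode = 1
	}
	// -N is the physical node count; -n is the total task count.
	return fmt.Sprintf("-N%d -n%d", numNodes, numNodes*numProcPerNode)
}

func main() {
	fmt.Println(buildNodeSpec(2, 8)) // two nodes, 8 procs each -> -N2 -n16
}
```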
@andreyvelich another thing I'm thinking is that there are cases when we want to leave out -n and just define -N (numNodes). I think tasks here is required to minimally be 1? Is this something we can support another way?
How is this handled if it can be a string? Is there a common function?
(trainer/pkg/apis/trainer/v1alpha1/trainingruntime_types.go, lines 204 to 212 in 54eab65)
@andreyvelich I just pushed a change that better covers these cases, but I need some help understanding the possible string definitions for the Trainer numProcPerNode. With "auto" we can at least support GPU - with flux when you do -N <n> and then --exclusive that essentially says "give me all the resources (cpu) on the node." What isn't clear is how that is different from cpu, and then what "gpu" means. For Flux, to enable GPU we do need to set the -g (gpus per task) flag, which requires knowing the number on the node and of course the number requested by the user per task.
> I think tasks here is required to minimally be 1? Is this something we can support another way?

Yes, we can set numProcPerNode to 1 by default.
> How is this handled if it can be a string? Is there a common function?

We are going to refactor this as part of this PR: #3239
We are going to accept only int values.
> For Flux, to enable GPU we do need to set the -g (gpus per task) flag, which requires knowing the number on the node and of course the number requested by the user per task.

Let's not worry about auto or other values in numProcPerNode, it is specific to PyTorch.
@vsoch For now, can we just check the container resources to automatically configure the appropriate command for Flux?
For example, if GPU is requested, we set -g in the command.
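The resource-driven flag construction suggested here could be sketched as follows; buildFluxFlags and the per-task GPU split are hypothetical names/choices for illustration, not the PR's actual implementation:

```go
package main

import "fmt"

// buildFluxFlags derives -N/-n from node and per-node process counts, and
// adds -g (GPUs per task) only when the container resources request GPUs.
// gpusPerNode would come from the "nvidia.com/gpu" limit or request.
func buildFluxFlags(numNodes, numProcPerNode int, gpusPerNode int64) string {
	if numProcPerNode <= 0 {
		numProcPerNode = 1
	}
	flags := fmt.Sprintf("-N%d -n%d", numNodes, numNodes*numProcPerNode)
	if gpusPerNode > 0 {
		// One task per process: split the node's GPUs evenly across tasks,
		// never going below one GPU per task.
		gpusPerTask := gpusPerNode / int64(numProcPerNode)
		if gpusPerTask < 1 {
			gpusPerTask = 1
		}
		flags += fmt.Sprintf(" -g%d", gpusPerTask)
	}
	return flags
}

func main() {
	// Matches the kripke example above: 16 nodes, 8 procs each, 8 GPUs/node.
	fmt.Println(buildFluxFlags(16, 8, 8)) // -N16 -n128 -g1
}
```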
```sh
foundroot=$(find $viewroot -maxdepth 2 -type d -path $viewroot/lib/python3\*) > /dev/null 2>&1
pythonversion=$(basename ${foundroot})
pythonversion=${viewroot}/bin/${pythonversion}
```
`foundroot=$(find ... ) > /dev/null 2>&1` redirects the assignment output rather than silencing find, and also loses the actual path(s); capture find output properly (and pick one directory) so pythonversion/PYTHONPATH are computed correctly.
```diff
-foundroot=$(find $viewroot -maxdepth 2 -type d -path $viewroot/lib/python3\*) > /dev/null 2>&1
-pythonversion=$(basename ${foundroot})
-pythonversion=${viewroot}/bin/${pythonversion}
+foundroot=$(find "$viewroot" -maxdepth 2 -type d -path "$viewroot"/lib/python3\* 2>/dev/null | head -n 1)
+pythonversion=$(basename "${foundroot}")
+pythonversion="${viewroot}/bin/${pythonversion}"
```
```go
// buildCurveSecret generates a cluster-wide curve certificate for flux
func (f *Flux) buildCurveSecret(trainJob *trainer.TrainJob) (*corev1ac.SecretApplyConfiguration, error) {
	// Generate a deterministic Secret Key from the UID
	secretSeed := sha256.Sum256([]byte(trainJob.UID))

	// Derive the Public Key using standard X25519 (CURVE25519)
	// ZeroMQ/Flux uses X25519.
	priv, err := ecdh.X25519().NewPrivateKey(secretSeed[:])
	if err != nil {
		return nil, fmt.Errorf("failed to create curve private key: %w", err)
	}
	pub := priv.PublicKey()

	// Encode both to Z85 (40 characters each)
	z85Secret := encodeZ85(priv.Bytes())
	z85Public := encodeZ85(pub.Bytes())
```
buildCurveSecret derives the CURVE secret key deterministically from the TrainJob UID, making it predictable to anyone who can read object metadata; generate a cryptographically random key once (store it in the Secret) and re-use it on subsequent reconciles (similar to the MPI SSH secret pattern).
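One way to follow this suggestion is to generate a random X25519 key pair and Z85-encode it, as sketched below; encodeZ85 here is a standalone implementation of ZeroMQ's Z85 encoding, and the whole snippet is illustrative rather than the PR's code (persisting and reusing the Secret is left out):

```go
package main

import (
	"crypto/ecdh"
	"crypto/rand"
	"fmt"
)

// z85Chars is the ZeroMQ Z85 alphabet (ZeroMQ RFC 32).
const z85Chars = "0123456789abcdefghijklmnopqrstuvwxyz" +
	"ABCDEFGHIJKLMNOPQRSTUVWXYZ.-:+=^!/*?&<>()[]{}@%$#"

// encodeZ85 encodes data (length must be a multiple of 4) as Z85 text;
// each 4-byte big-endian chunk becomes 5 base-85 characters, so a
// 32-byte CURVE key becomes the 40-character string Flux expects.
func encodeZ85(data []byte) string {
	out := make([]byte, 0, len(data)/4*5)
	for i := 0; i < len(data); i += 4 {
		v := uint32(data[i])<<24 | uint32(data[i+1])<<16 |
			uint32(data[i+2])<<8 | uint32(data[i+3])
		var chunk [5]byte
		for j := 4; j >= 0; j-- {
			chunk[j] = z85Chars[v%85]
			v /= 85
		}
		out = append(out, chunk[:]...)
	}
	return string(out)
}

func main() {
	// Random (not UID-derived) X25519 key pair, per the review suggestion;
	// the controller would store it in the Secret once and reuse it on
	// subsequent reconciles, similar to the MPI SSH secret pattern.
	priv, err := ecdh.X25519().GenerateKey(rand.Reader)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(encodeZ85(priv.Bytes())), len(encodeZ85(priv.PublicKey().Bytes())))
}
```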
```go
err := p.(framework.EnforceMLPolicyPlugin).EnforceMLPolicy(tc.info, tc.trainJob)
if err != nil {
	t.Fatalf("EnforceMLPolicy failed: %v", err)
}

if tc.info.RuntimePolicy.FluxPolicySource != nil && tc.info.TemplateSpec.ObjApply != nil {
	js := tc.info.TemplateSpec.ObjApply.(*v1alpha2.JobSetSpecApplyConfiguration)
	for _, rj := range js.ReplicatedJobs {
		if ptr.Deref(rj.Name, "") == constants.Node {
			podSpec := rj.Template.Spec.Template.Spec
			var icNames []string
			for _, ic := range podSpec.InitContainers {
				icNames = append(icNames, ptr.Deref(ic.Name, ""))
			}
			if diff := gocmp.Diff(tc.wantInitContainers, icNames); len(diff) != 0 {
				t.Errorf("Unexpected init containers (-want, +got): %s", diff)
			}
			for _, c := range podSpec.Containers {
				if ptr.Deref(c.Name, "") == constants.Node {
					if diff := gocmp.Diff(tc.wantCommand, c.Command); len(diff) != 0 {
						t.Errorf("Unexpected command (-want, +got): %s", diff)
					}
					if ptr.Deref(c.TTY, false) != tc.wantTTY {
						t.Errorf("Expected TTY %v, got %v", tc.wantTTY, ptr.Deref(c.TTY, false))
					}
				}
			}
		}
	}
}
```
Tests gate Flux assertions on tc.info.RuntimePolicy.FluxPolicySource, but the plugin checks RuntimePolicy.MLPolicySource.Flux, so the assertions are skipped and the test can pass without verifying mutations; update tests to use MLPolicySource.Flux and remove the unused FluxPolicySource field references.
```go
	jobSetSpec.ReplicatedJobs[psIdx].Template.Spec.Template.Spec.Containers[containerIdx].Command = container.Command
}
if container.Image != "" {
	jobSetSpec.ReplicatedJobs[psIdx].Template.Spec.Template.Spec.Containers[containerIdx].Image = &container.Image
```
In JobSet Build, assigning Image = &container.Image takes the address of the range loop variable field, so multiple containers can end up sharing the same pointer value; use a stable pointer (e.g., ptr.To(container.Image)) instead.
```diff
-jobSetSpec.ReplicatedJobs[psIdx].Template.Spec.Template.Spec.Containers[containerIdx].Image = &container.Image
+jobSetSpec.ReplicatedJobs[psIdx].Template.Spec.Template.Spec.Containers[containerIdx].Image = ptr.To(container.Image)
```
@andreyvelich would you like this change? I don't see why it would be an issue. It might even be beneficial if the value changes somewhere else (and then is changed here). It could also be irrelevant.
I think it should be fine to keep it with `&`.
@astefanutti @tenzen-y Any recommendations?
```python
from __future__ import annotations
import pprint
import re  # noqa: F401
```
Import of 're' is not used.
```diff
-import re  # noqa: F401
```
```go
trainJob.Annotations[AnnotationOriginalCommand] = originalCmd
trainJob.Annotations[AnnotationViewImage] = settings["FLUX_VIEW_IMAGE"]
```
I might be missing something, but why do you place these values to the TrainJob annotations?
The settings are global (and would be shared between instances) and annotations persist between creations. It also allows programmatic understanding of the creation by other Kubernetes controllers or the user.
Users can always check the actual JobSet to see the FLUX_VIEW_IMAGE and the original command, since we wrap it.
I would suggest to avoid using annotations to passing some state between the objects unless it is really necessary. cc @tenzen-y @astefanutti
@andreyvelich I can try to remove annotations, but we need somewhere to put the original command. It will not persist with the train job after we update it. The annotation served as a place to put it.
@vsoch Any specific reason to preserve the original command?
Do users require it for observability?
We need the original command to run their workflow in the MiniCluster. It gets replaced with an execution to the entrypoint. The fallback to put them both in one place would be to prefix the original command with the entrypoint, and then receive it as whatever arguments come in to the script.
@andreyvelich I just pushed a change that gets rid of the need to set annotations. We have the command as a suffix to the flux entrypoint, which captures it via all the args.
It's unclear to me at the moment how MiniCluster will be used with TrainJob, so we can discuss it after initial implementation.

> We have the command as a suffix to the flux entrypoint, which captures it via all the args.

That sounds good, thanks.
Specifically, there is no MiniCluster here (that is a Flux Operator CRD) and we aren't using the Flux Operator here. We can definitely discuss how Flux can be used in the context of TrainJob, looking forward to it!
@vsoch If you rebase your PR, it should fix the E2Es.
Force-push: 114ae38 to 57805da
Done! Hopefully didn't bork anything - I'll watch for issues.
andreyvelich left a comment
Thanks for the updates @vsoch! I left a few small comments.
It looks great, I think we should be ready to merge it.
/assign @astefanutti @tenzen-y
In case you want to provide more feedback on the initial Flux integration
```diff
@@ -0,0 +1,434 @@
+/*
```
We also need to add:
- Integration tests: https://github.com/converged-computing/trainer/blob/57805da503042c765bf2a422146bd050067b33f9/test/integration/controller/trainjob_controller_test.go#L925
- E2E tests: https://github.com/converged-computing/trainer/blob/57805da503042c765bf2a422146bd050067b33f9/test/e2e/e2e_test.go#L108
@vsoch Please can you create a tracking issue to add these as a followup.
[APPROVALNOTIFIER] This PR is NOT APPROVED.
@andreyvelich can you give feedback on getting resources - the trainJob container does not specify them, e.g., no Resources here:

```go
var gpusPerNode int64 = 0
trainerContainer := info.FindContainerByPodSetAncestorContainerName(constants.AncestorTrainer, constants.Node)
if trainerContainer != nil {
	if val, ok := trainerContainer.Resources.Limits["nvidia.com/gpu"]; ok {
		gpusPerNode = val.Value()
	} else if val, ok := trainerContainer.Resources.Requests["nvidia.com/gpu"]; ok {
		// Fallback to requests if limits aren't set
		gpusPerNode = val.Value()
	}
}
```
Just extract them from the Runtime first. Actually, just re-use the same helper functions we run in the Torch plugin:

```go
resourcesPerNode := ptr.Deref(runtime.ExtractResourcePerNodeFromRuntime(info), corev1.ResourceRequirements{})
if jobTrainer := trainJob.Spec.Trainer; jobTrainer != nil && jobTrainer.ResourcesPerNode != nil {
	resourcesPerNode = ptr.Deref(jobTrainer.ResourcesPerNode, corev1.ResourceRequirements{})
}
gpuQ := runtime.GetNumGPUPerNode(&resourcesPerNode)
```
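The fallback order in that snippet can be illustrated with simplified stand-ins for corev1.ResourceRequirements; the resources type and numGPUPerNode helper below are sketches of the idea, not the actual runtime helpers:

```go
package main

import "fmt"

// resources is a simplified stand-in for corev1.ResourceRequirements:
// plain name -> quantity maps for limits and requests.
type resources struct {
	Limits   map[string]int64
	Requests map[string]int64
}

// numGPUPerNode mirrors the order described in the comment: start from the
// runtime-level resources, let the TrainJob's trainer resources override
// them, then read the "nvidia.com/gpu" limit (falling back to requests).
func numGPUPerNode(runtimeRes, trainJobRes *resources) int64 {
	res := runtimeRes
	if trainJobRes != nil {
		res = trainJobRes
	}
	if res == nil {
		return 0
	}
	if v, ok := res.Limits["nvidia.com/gpu"]; ok {
		return v
	}
	return res.Requests["nvidia.com/gpu"] // zero if absent
}

func main() {
	rt := &resources{Limits: map[string]int64{"nvidia.com/gpu": 8}}
	fmt.Println(numGPUPerNode(rt, nil)) // taken from the runtime: 8
}
```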
Flux supports the majority of MPI flavors/variants, and can be used to bootstrap MPI as a plugin. It adds other features for scheduling and topology that can be used for simulations and ai/ml jobs. This changeset adds the plugin implementation, including the plugin module, tests, and an example with a small README to serve as documentation for the time being. Signed-off-by: vsoch <vsoch@users.noreply.github.com>
We still need to put the original command in an annotation to retrieve later, but others can be re-derived from the environment Signed-off-by: vsoch <vsoch@users.noreply.github.com>
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
Force-push: bd83d71 to 16c4162
@andreyvelich the test is failing because of a nil pointer reference in the function you suggested. I tried all the logical fixes for:

```go
nodeResources := runtime.ExtractResourcePerNodeFromRuntime(info)
```

and it always errors with this check:

```go
rJob.Template.Labels[constants.LabelTrainJobAncestor] == constants.AncestorTrainer
```

I can check up to that level for nil, still errors. I suspect this is something weird about the apply, so going to need to ask for another set of eyes on it.
@vsoch Your unit test has incorrect PodSet object here: https://github.com/converged-computing/trainer/blob/16c4162ea90298888927705b462fa27fc26a9ebc/pkg/runtime/framework/plugins/flux/flux_test.go#L93-L96

It should be like this, and you should remove:

```go
{
	Name:     constants.Node,
	Ancestor: ptr.To(constants.AncestorTrainer),
	Count:    ptr.To[int32](1),
},
```

Check example here: https://github.com/converged-computing/trainer/blob/16c4162ea90298888927705b462fa27fc26a9ebc/pkg/runtime/framework/plugins/mpi/mpi_test.go#L103

We should refactor these unit tests to align with other plugins as part of: #3179
@andreyvelich I have my test fixed locally, but the larger build is failing with the update from the rebase: |
This is an update to #3064 to include (what comes down to) a rebase. I did a fresh clone and re-applied the changes. Please see the description and review there!