
feat: support for Flux Framework as HPC manager #3064

Closed

vsoch wants to merge 5 commits into kubeflow:master from converged-computing:plugin/flux

Conversation


@vsoch vsoch commented Jan 4, 2026

This pull request adds Flux Framework as a plugin to the Kubeflow Trainer. 🌀

Overview

Flux supports the majority of MPI flavors/variants and can be used to bootstrap MPI as a plugin. It adds other features for scheduling and topology that can be used for simulations and AI/ML jobs. This changeset adds the plugin implementation, including the plugin module, tests, and an example with a small README to serve as documentation for the example.

What this PR does / why we need it:

See https://github.com/kubeflow/trainer/tree/master/docs/proposals/2841-flux-hpc. To summarize, Flux Framework supports more MPI variants out of the box than the current MPI plugin. It brings more scheduling features, topology awareness, higher throughput, and dynamism / elasticity of the scheduler and jobs. See https://github.com/kubeflow/trainer/tree/master/docs/proposals/2841-flux-hpc#motivation. For full provenance / history, here is the initial discussion in the Kubeflow Trainer meeting.

Which issue(s) this PR fixes
Fixes #2841 (and note here, we should follow up with discussion on next steps for scoped issues)

Checklist:

  • Docs included if any changes are user facing

See kubeflow/website#4283

@andreyvelich some notes for you.

  • This is my first time developing for Kubeflow, and using ApplyConfiguration. If I made a mistake in design or process please tell me directly, and give a pointer to a correct way to go about it.
  • I am exposing the variables we talked about (network, queue policy) as envars that can be defined in the training job. I think this is a simple and reasonable approach in that it is flexible, but let me know if there is another idea for discussion.
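
As an illustration, defining these as env vars in a TrainJob might look like the sketch below. The variable names FLUX_QUEUE_POLICY and FLUX_NETWORK_INTERFACE are hypothetical placeholders for this example, not confirmed names from this PR:

```yaml
# Sketch only: the env var names below are hypothetical placeholders,
# not necessarily the ones the Flux plugin reads.
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: lammps-flux
spec:
  runtimeRef:
    name: flux-runtime
  trainer:
    env:
      - name: FLUX_QUEUE_POLICY
        value: "fcfs"
      - name: FLUX_NETWORK_INTERFACE
        value: "eth0"
```

The appeal of this approach is that no API changes are needed to add or remove tunables; the tradeoff is that env vars are not validated by the CRD schema.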

Here is the first completion of LAMMPS. When you remove the command, it turns into an interactive minicluster (fairly simple / straightforward, I think).

[screenshots of the LAMMPS run output]

Thanks in advance for the review! I won't be able to finish the PR work tonight (figuring out the linting still) but I'll pick up tomorrow after some sleep. Really excited about this.

cc @milroy

Copilot AI review requested due to automatic review settings January 4, 2026 09:11

Copilot AI left a comment


Pull request overview

This pull request adds support for Flux Framework as an HPC workload manager plugin for the Kubeflow Trainer. Flux Framework provides sophisticated resource management, supports multiple MPI variants, and enables distributed HPC workloads in Kubernetes environments.

Key Changes

  • Implements a new Flux plugin that integrates with the Kubeflow Trainer runtime framework
  • Adds automatic Flux installation via init containers and configuration management through ConfigMaps and Secrets
  • Provides support for both batch execution and interactive HPC cluster modes

Reviewed changes

Copilot reviewed 33 out of 33 changed files in this pull request and generated 14 comments.

Summary per file:

  • pkg/runtime/framework/plugins/flux/*.go - Core plugin implementation including broker configuration, curve certificate generation, hostlist management, and command extraction
  • pkg/runtime/framework/plugins/flux/*_test.go - Comprehensive test coverage for plugin functionality
  • pkg/runtime/framework/plugins/registry.go - Registers the Flux plugin in the framework
  • pkg/runtime/runtime.go - Extends RuntimePolicy to include FluxPolicySource
  • pkg/apis/trainer/v1alpha1/trainingruntime_types.go - Adds FluxMLPolicySource type definition with numProcPerNode parameter
  • pkg/apis/trainer/v1alpha1/zz_generated.* - Generated code for deepcopy, openapi specs, and API types
  • pkg/client/applyconfiguration/**/*.go - Generated apply configurations for Flux types
  • manifests/base/crds/*.yaml - Updated CRDs to include Flux policy configuration
  • charts/kubeflow-trainer/crds/*.yaml - Updated Helm chart CRDs
  • examples/flux/*.yaml - Example runtime and TrainJob configurations demonstrating LAMMPS workload
  • examples/flux/README.md - Comprehensive documentation for using the Flux plugin
  • api/python_api/**/*.py - Python API updates to support Flux policy types
  • api/openapi-spec/swagger.json - OpenAPI specification updates
  • build.sh - Development helper script (should be removed per PR description)



coveralls commented Jan 4, 2026

Pull Request Test Coverage Report for Build 21715866737

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 441 of 581 (75.9%) changed or added relevant lines in 4 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage increased (+4.8%) to 56.011%

Changes missing coverage (covered / changed or added lines, %):
  • pkg/runtime/framework/plugins/registry.go - 0 / 1 (0.0%)
  • pkg/runtime/runtime.go - 0 / 1 (0.0%)
  • pkg/runtime/framework/plugins/jobset/jobset.go - 0 / 6 (0.0%)
  • pkg/runtime/framework/plugins/flux/flux.go - 441 / 573 (76.96%)

Files with coverage reduction (new missed lines, %):
  • pkg/runtime/runtime.go - 1 (57.69%)

Totals:
  • Change from base Build 21703881784: +4.8%
  • Covered Lines: 1682
  • Relevant Lines: 3003

💛 - Coveralls


@andreyvelich andreyvelich left a comment


Thank you for this effort @vsoch!
I left my initial comments.
/assign @akshaychitneni @astefanutti @Electronic-Waste @tenzen-y
Appreciate your review too!

// Generate hostlists. The hostname (prefix) is the trainJob Name
// We need the initial jobset size, and container command
size := getJobSetSize(trainJob)
hosts := generateHostlist(trainJob.Name, size)
Member:

Check how we get the host list here:

hostFile.WriteString(fmt.Sprintf("%s slots=%d\n", e, slots))

You can extract the address of the Pods by using Endpoint from PodSet.

Contributor Author:

Can you show me what one of those addresses looks like? Generally we don't want the entire address, but a pattern (and range) that describes it. The code there appears to be generating a massive list of hosts, which won't scale nicely for a config file.
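
The pattern-and-range idea could be sketched as follows - a minimal illustration, not the PR's actual implementation; `generateHostlist` here is a hypothetical stand-in:

```go
package main

import "fmt"

// generateHostlist is a hypothetical sketch of producing a compact,
// Flux-style bracketed hostlist from a hostname prefix and cluster size,
// rather than enumerating every host in the config file.
func generateHostlist(prefix string, size int) string {
	if size <= 1 {
		return fmt.Sprintf("%s-0", prefix)
	}
	return fmt.Sprintf("%s-[0-%d]", prefix, size-1)
}

func main() {
	// A 4-node job named "trainjob" collapses to a single range.
	fmt.Println(generateHostlist("trainjob", 4)) // trainjob-[0-3]
}
```

The point of the bracketed form is that the config stays the same size whether the job has 4 nodes or 4000.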

Member:

It looks like this: trainjob-node-1-0.trainjob.
We use this in our hostfile configuration for the default MPI plugin.

Contributor Author:

Okay - sounds like the fully qualified name isn't enabled (the bit with cluster.local). We will need that for Flux.

Member:

Do you need cluster.local since it tries to reach the pods inside the cluster?
For MPI hostfile it works well. Do you want to try to use the same hostnames in Flux config?

Contributor Author:

For the broker setup, I was never able to get it to work without the full name. I don't remember the details (this was many years ago), but I am sure about that. We also can't assume that we will always be connecting to only local nodes, or even Kubernetes. For context, this is built into the bootstrap config. Here is an example:

https://github.com/flux-framework/flux-operator/blob/321a9e14d3180f9c0bae72bb8e1ab47b98069f82/examples/tests/custom-config/minicluster.yaml#L30-L36
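
For context, the bootstrap section of a Flux broker config is a TOML block along these lines - an illustrative sketch with example hostnames, port, and cert path, where the fully qualified service domain appears in default_connect:

```toml
# Illustrative sketch of a flux broker.toml bootstrap section;
# hostnames, port, and cert path here are example values.
[bootstrap]
curve_cert = "/mnt/curve/curve.cert"
default_port = 8050
default_bind = "tcp://eth0:%p"
default_connect = "tcp://%h.flux-service.default.svc.cluster.local:%p"

[[bootstrap.hosts]]
host = "flux-sample-[0-3]"
```

The `%h` and `%p` tokens expand to the host and port per broker, which is why the hostlist pattern and the fully qualified domain both end up baked into this one file.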

Contributor Author:

Looking at this again - is there any reason we cannot use the method that we currently have to generate the hosts? It's fairly simple and seems to work OK.

Member:

It's fine, we can start with that initially.
It would be nice to explore why our IdentifyPodNetwork plugin doesn't work for Flux (cc @tenzen-y).

func (j *JobSet) IdentifyPodNetwork(info *runtime.Info, trainJob *trainer.TrainJob) error {


vsoch commented Jan 19, 2026

Thank you @andreyvelich - I will get started on these changes right away.

I wanted to get a GPU example in and was testing with AWS Trainium - the CPU example worked beautifully but I couldn't get the Trainium devices to work (in any context, even with their tutorials, etc). I think we will get there (we have a great collaborator there!), but in the meantime I'm going to try a small example on Google Cloud, likely with any small GPU I can get.

I will keep you in the loop on my progress, update the PR, and we will attend the next Kubeflow meeting to discuss any details that come up.

@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from astefanutti. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@vsoch vsoch force-pushed the plugin/flux branch 3 times, most recently from 278cd0e to 8bb88a5 Compare January 19, 2026 04:32

vsoch commented Jan 25, 2026

OK - one of the changes above actually seems to have broken the entire setup, because we don't have a view and are running lammps independently on each pod. I need to revert everything and start over.


vsoch commented Jan 25, 2026

Still no go.

The lesson here is not to rebase until the end - one of the changes (maybe moving code between files, etc.) completely broke the setup, and I don't have the original version that I spent a long time on. There is no longer an init container, even when I move back to what is here. I don't know if this will be done by mid-February, @andreyvelich.

Update: I was able to restore back to (mostly) what (I think) I had, and the configmap and secret are generating again. I think there is still one bug to work through before I start this new work, but I'm relieved that it's partially back. This week is going to be busy so I'll set expectations for next weekend or after for another update.

@andreyvelich (Member)

Update: I was able to restore back to (mostly) what (I think) I had, and the configmap and secret are generating again. I think there is still one bug to work through before I start this new work, but I'm relieved that it's partially back. This week is going to be busy so I'll set expectations for next weekend or after for another update.

Sounds good, thank you for your work! Let us know if you have any additional questions for the runtime extension framework.


vsoch commented Jan 29, 2026

@andreyvelich I am not sure you saw my question here: #3064 (comment)

I cannot use PodSet for the InitContainer if it does not expose a Command and Image; it just has Env, VolumeMount, etc. I can add these, and if so, I would want to ask where to set them properly from the runtime.Container interface onto the actual JobSet (e.g., PodSet -> runtime.Container -> JobSet).

vsoch added 2 commits February 3, 2026 18:00
Flux supports the majority of MPI flavors/variants, and can be used
to bootstrap MPI as a plugin. It adds other features for scheduling
and topology that can be used for simulations and ai/ml jobs.
This changeset adds the plugin implementation, including the
plugin module, tests, and an example with a small README to
serve as documentation for the time being.

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
Signed-off-by: vsoch <vsoch@users.noreply.github.com>

vsoch commented Feb 4, 2026

@andreyvelich that's probably the best I can do to integrate PodSet - it was really challenging understanding the design of the API here. E.g., often I would make changes that would not persist, and in order to actually see the change I'd need to update the JobSet directly. I apologize it took me so long (and if there are still issues). If there are, I am not sure I can solve them alone within the time frame - just a lot going on right now.


@andreyvelich andreyvelich left a comment


Thank you @vsoch!
Overall looks good, I left a few comments.

Comment on lines +269 to +274
if len(container.Command) > 0 {
jobSetSpec.ReplicatedJobs[psIdx].Template.Spec.Template.Spec.Containers[containerIdx].Command = container.Command
}
if container.Image != "" {
jobSetSpec.ReplicatedJobs[psIdx].Template.Spec.Template.Spec.Containers[containerIdx].Image = &container.Image
}

@andreyvelich andreyvelich Feb 5, 2026


Can you explain why we need this?
We are going to always override container command from Trainer here:

if image := jobTrainer.Image; image != nil {
b.Spec.ReplicatedJobs[i].Template.Spec.Template.Spec.Containers[j].Image = image
}
if command := jobTrainer.Command; command != nil {
b.Spec.ReplicatedJobs[i].Template.Spec.Template.Spec.Containers[j].Command = command
}

I guess for the InitContainer we still don't use the PodSet internal structure and directly apply changes to the JobSet template spec for now:

jobSetSpec.ReplicatedJobs[psIdx].Template.Spec.Template.Spec.InitContainers = append(
	jobSetSpec.ReplicatedJobs[psIdx].Template.Spec.Template.Spec.InitContainers,
	*fluxInstaller,
)

Contributor Author:

If the container command or image is not set via the pod set there, I don't think we'd want to essentially erase it.

return func(o *InfoOptions) {
if mlPolicy != nil {
o.runtimePolicy.MLPolicySource = &mlPolicy.MLPolicySource
o.runtimePolicy.FluxPolicySource = mlPolicy.Flux
Member:

Why do you need this?
You can directly access Flux using: o.runtimePolicy.MLPolicySource.Flux

Suggested change
o.runtimePolicy.FluxPolicySource = mlPolicy.Flux

Contributor Author:

What about FluxPolicySource one level up?

Contributor Author:

Actually I don't see that - I don't even know how that works. 🙃

Contributor Author:

If this is wrong there must be a nested bug in here, because when I remove it:

[screenshot]

And when I add it back:

[screenshot]


@andreyvelich andreyvelich Feb 5, 2026


You don't need to touch this code, since it just constructs the InfoOptions that we use here:

runtime.WithMLPolicySource(mlPolicy),

That is used further down in the framework, when we initially construct the Info object:

RuntimePolicy: options.runtimePolicy,

I think, your bug is unrelated to this.

Member:

@vsoch Did you get a chance to check this?

Contributor Author:

Yes - I tried removing that change, and (even with tweaks to flux.go) removing it consistently prevented creation of the configmap or the view. I can't get it to work without that line, and I do not see how anything in flux.go would lead to the bug. I probably need another set of eyes here.

If anyone is able/willing to test, note that for the example, it's important to verify that lammps is run by the lead broker in index 0, and we only have brokers running for the remainder of indexes. One bug that can happen is that the flux view is not properly generated and individual instances of lammps are run, one per pod. That would show lammps output in each log, and that is wrong for Flux.

Member:

@vsoch Found an issue, you should change this line to: https://github.com/converged-computing/trainer/blob/plugin/flux/pkg/runtime/framework/plugins/flux/flux.go#L212

	if info == nil || info.RuntimePolicy.MLPolicySource == nil || info.RuntimePolicy.MLPolicySource.Flux == nil {
		return nil, nil
	}


vsoch commented Feb 5, 2026

I made good progress tonight but still have a little bit to do (and sleepy). I am going to try and make time to finish up this round tomorrow!

Signed-off-by: vsoch <vsoch@users.noreply.github.com>

vsoch commented Feb 5, 2026

@andreyvelich not perfect yet, but I think ready for another round!

Comment on lines 26 to 27
# Uncomment for local development with kind
# imagePullPolicy: Never
Member:

Suggested change
# Uncomment for local development with kind
# imagePullPolicy: Never



vsoch commented Feb 8, 2026

Taking a look! I'll do a rebase too.

Signed-off-by: vsoch <vsoch@users.noreply.github.com>

vsoch commented Feb 8, 2026

@andreyvelich what is your recommendation for resolving this merge conflict? The files below appear to be mostly auto-generated, so I'm not sure the best way to resolve conflicts (which I usually do manually).

[screenshot of the conflicting files]

@andreyvelich (Member)

@andreyvelich what is your recommendation for resolving this merge conflict?

During the rebase, try deleting the entire directory and running make generate afterwards, @vsoch.
It should re-create required files.


vsoch commented Feb 8, 2026

I think I just borked my entire local clone.


vsoch commented Feb 8, 2026

Hold tight, I have an idea.


vsoch commented Feb 8, 2026

Okay, that failed. I think my best idea is to re-create the entire PR, which I need to do manually. I need to go outside first, so I'll do it later this evening.


vsoch commented Feb 9, 2026

@andreyvelich after a few failed efforts, I decided to go through the changes carefully and recreate the pull request from the updated branch:

#3188

And I think the subtle issue with the bug we found is the difference between:

case1 := &runtime.Info{
	RuntimePolicy: runtime.RuntimePolicy{
		MLPolicySource: &trainer.MLPolicySource{
			Flux: &trainer.FluxMLPolicySource{
				NumProcPerNode: &procs,
			},
		},
	},
}

case2 := &runtime.Info{
	RuntimePolicy: runtime.RuntimePolicy{
		FluxPolicySource: &trainer.FluxMLPolicySource{
			NumProcPerNode: &procs,
		},
	},
}

Both are valid (structurally, not functionally). I don't fully understand why we have both FluxPolicySource and MLPolicySource.Flux - is that a bug?


vsoch commented Feb 23, 2026

Closing in favor of #3188

@vsoch vsoch closed this Feb 23, 2026

Development

Successfully merging this pull request may close these issues.

feat: Flux Framework as a plugin for HPC and MPI bootstrap

8 participants