Trainer: Add JAX distributed training guide to Kubeflow Trainer docs #4305
Amir380-A wants to merge 5 commits into kubeflow:master
Conversation
kevo-1 left a comment
Thanks for putting this together! It's great to see the new JAX distributed training guide. It covers the core SPMD concepts and built-in runtime perfectly.
To align this guide with the other training framework documentation (like PyTorch, DeepSpeed, and MLX), I have suggested some changes.
Everything else looks fantastic.
```python
from kubeflow.trainer import TrainerClient, TrainJob

client = TrainerClient()

job = TrainJob(
    name="jax-sdk-example",
    runtime="jax-distributed",
    num_nodes=2,
    container={
        "image": "nvcr.io/nvidia/jax:25.10-py3",
        "command": ["python", "train.py"],
    },
)

client.create_trainjob(job)
```
Suggestion: Let's update this Python SDK example to use the new CustomTrainer API wrapper rather than manually constructing the TrainJob pod spec. This makes it consistent with how we show PyTorch, DeepSpeed, and MLX Python SDK usage.
```python
from kubeflow.trainer import TrainerClient, CustomTrainer

def train_jax():
    import os
    import jax
    import jax.distributed as dist

    dist.initialize(
        num_processes=int(os.environ["JAX_NUM_PROCESSES"]),
        process_id=int(os.environ["JAX_PROCESS_ID"]),
        coordinator_address=os.environ["JAX_COORDINATOR_ADDRESS"],
    )

    print("JAX Distributed Environment")
    print("Global devices:", jax.devices())
    print("Local devices:", jax.local_devices())

job_id = TrainerClient().train(
    runtime=TrainerClient().get_runtime("jax-distributed"),
    trainer=CustomTrainer(
        func=train_jax,
        num_nodes=2,
        resources_per_node={
            "cpu": 2,
        },
    ),
)
```
Edited the code. Thank you for the review.
## Next Steps

- Check out [the MNIST JAX example](https://github.com/kaisoz/trainer/blob/ca27f54971070a1f65f2d9bf3a1b643f92736448/examples/jax/image-classification/mnist.ipynb).
Suggestion: We should link to the official kubeflow repository on master instead of the personal fork, to prevent broken links later. I also added a link to the TrainerClient SDK documentation!
- Check out [the MNIST JAX example](https://github.com/kubeflow/trainer/blob/master/examples/jax/image-classification/mnist.ipynb).
- Learn more about `TrainerClient()` APIs [in the Kubeflow SDK](https://github.com/kubeflow/sdk/blob/main/kubeflow/trainer/api/trainer_client.py).
### Get the TrainJob Results

You can use the `get_job_logs()` API to see your TrainJob logs. For JAX distributed training, logs are typically available on all nodes. You can inspect node 0:

```py
print("\n".join(TrainerClient().get_job_logs(name=job_id, step="node-0")))
```
|
andreyvelich left a comment
Sorry for the late review @Amir380-A!
I left a few comments in addition to @kevo-1.
```toml
+++
title = "JAX Guide"
description = "How to run JAX training on Kubernetes with Kubeflow Trainer"
```

Suggested change:

```toml
description = "How to run JAX on Kubernetes with Kubeflow Trainer"
```
Sorry for the late reply, I reviewed all the comments and committed all the proposed changes.
Noted and added, thank you for the review.
```toml
+++
title = "JAX Guide"
description = "How to run JAX training on Kubernetes with Kubeflow Trainer"
weight = 10
```

Let's move it under PyTorch:

```toml
weight = 15
```
TPU workloads are not supported because installing both `jax[cuda]`
and `jax[tpu]` in the same image leads to backend and plugin conflicts.
A separate TPU-specific runtime is required.
{{% /alert %}}
@kaisoz Do we have a tracking issue to support TPUs?
```python
TrainerClient().get_runtime_packages(
    runtime=TrainerClient().get_runtime("jax-distributed")
)
```
Can you show output of this command, like we did for DeepSpeed: https://deploy-preview-4305--competent-brattain-de2d6d.netlify.app/docs/components/trainer/user-guides/deepspeed/#get-deepspeed-runtime-packages
Added the output, please check.
Your training script must explicitly initialize the JAX distributed runtime before performing any JAX computation.

### Example: train.py
Can you modify this example to define the JAX script under training function, and showcase the example with calling train() API like here: https://deploy-preview-4305--competent-brattain-de2d6d.netlify.app/docs/components/trainer/user-guides/deepspeed/#deepspeed-distributed-environment
Updated the example to define the JAX logic inside a train() function and added the entrypoint call. Please let me know if you’d like any further adjustments.
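Since the body of the `train.py` example isn't reproduced in this hunk, here is a minimal sketch of the kind of script discussed above. It only shows how the Trainer-injected `JAX_*` variables (documented in the guide's environment-variable table) would be turned into `jax.distributed.initialize()` arguments; the helper name is illustrative:

```python
import os

def distributed_init_kwargs(env=None):
    """Map the environment variables Kubeflow Trainer injects into the
    keyword arguments expected by jax.distributed.initialize().
    Variable names are taken from the guide's table."""
    env = os.environ if env is None else env
    return {
        "num_processes": int(env["JAX_NUM_PROCESSES"]),
        "process_id": int(env["JAX_PROCESS_ID"]),
        "coordinator_address": env["JAX_COORDINATOR_ADDRESS"],
    }

# Inside the real train.py, the kwargs would feed the initializer:
#   import jax.distributed as dist
#   dist.initialize(**distributed_init_kwargs())
```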
## Initializing the JAX Distributed Runtime

Suggested change:

## JAX Distributed Environment
---

## Creating a TrainJob with JAX Runtime
can you refactor this section to use Kubeflow SDK to submit jobs, to be aligned with other examples: https://deploy-preview-4305--competent-brattain-de2d6d.netlify.app/docs/components/trainer/user-guides/deepspeed/#deepspeed-distributed-environment
Refactored it and merged it into one example for the Python SDK.
Kubeflow Trainer automatically injects the following environment variables into each trainer container:

| Variable | Description |
|----------|-------------|
| `JAX_NUM_PROCESSES` | Total number of JAX processes |
| `JAX_PROCESS_ID` | Global process index (0-based) |
| `JAX_COORDINATOR_ADDRESS` | Address of the coordinator (process 0) |
This can be moved to the JAX Distributed Environment section.
Moved the environment variables table to the JAX Distributed Environment section as suggested. Please let me know if the placement looks correct.
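One small addition that might complement the table: a training script can fail fast with a clear error when the injected variables are missing or inconsistent, instead of surfacing an obscure initialization failure. A sketch (variable names from the table above; the function name and messages are illustrative):

```python
import os

REQUIRED_VARS = ("JAX_NUM_PROCESSES", "JAX_PROCESS_ID", "JAX_COORDINATOR_ADDRESS")

def validate_jax_env(env=None):
    """Check the Trainer-injected variables before calling
    jax.distributed.initialize(); returns (num_processes, process_id, address)."""
    env = os.environ if env is None else env
    missing = [v for v in REQUIRED_VARS if v not in env]
    if missing:
        raise RuntimeError(f"missing Trainer-injected variables: {missing}")
    num = int(env["JAX_NUM_PROCESSES"])
    pid = int(env["JAX_PROCESS_ID"])
    if not 0 <= pid < num:
        raise RuntimeError(f"process id {pid} out of range for {num} processes")
    return num, pid, env["JAX_COORDINATOR_ADDRESS"]
```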
## Limitations

Current limitations of the JAX runtime include:

- No TPU support
- No elastic or dynamic scaling
- Homogeneous node and device configurations are assumed
- All processes must start and finish together
I don't think that is needed.
Removed the Limitations section as suggested. Thanks!
## Parallelism with JAX Primitives

Once initialized, you can use JAX SPMD primitives normally:

- `pmap` — data-parallel execution
- `pjit` — explicit global sharding
- `shard_map` — low-level SPMD control

Kubeflow Trainer does not alter JAX semantics; it only provides the distributed execution environment.
Same suggestion to move it to JAX Distributed Environment section.
Done. Please let me know if there is anything else I can edit or review!
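A side note for readers of this thread: the primitives listed in that section behave the same on a single host, so they are easy to try locally before submitting a TrainJob. A minimal `pmap` sketch (assumes `jax` is installed; it maps one slice of the input to each visible local device):

```python
import jax
import jax.numpy as jnp

# pmap replicates the function across local devices and maps it over the
# leading axis of the input, one slice per device.
n = jax.local_device_count()
out = jax.pmap(lambda x: x * 2)(jnp.arange(n))
print(out)
```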
Hi @Amir380-A, did you get a chance to review the proposed changes, or should we ask @kevo-1 to take over this work?
Updated the JAX guide to improve clarity. Edited and added examples for using the Python SDK. Adjusted weight and description for better organization. Signed-off-by: Amir Ibrahim <62997533+Amir380-A@users.noreply.github.com>
Description of Changes
This PR adds a new JAX user guide describing how to run distributed JAX
training jobs with Kubeflow Trainer.
The guide covers:
Related Issues
Closes: kubeflow/trainer#3183
Update screenshot preview

Checklist