
doc(trainer): add architecture diagrams to local execution guides#4301

Open
sh4shv4t wants to merge 2 commits into kubeflow:master from sh4shv4t:fix-trainer-docs-architecture

Conversation

@sh4shv4t

@sh4shv4t sh4shv4t commented Feb 4, 2026

Description of Changes

This PR adds detailed Architecture sections and Mermaid diagrams to the Local Execution guides for the Kubeflow Trainer. These additions visualize the workflow and component interactions for the three local backends:

  • Docker Backend: Visualizes the interaction between the SDK, Docker Daemon, and container networking.
  • Podman Backend: Details the flow for rootless container execution and process isolation.
  • Local Process Backend: Visualizes the creation of virtual environments and native process management.

These diagrams help users better understand the internal mechanics of how TrainerClient orchestrates jobs locally before scaling to a cluster.

Related Issues

Closes: #4231

Checklist

Signed-off-by: sh4shv4t <shashvat.k.singh.16@gmail.com>
@google-oss-prow google-oss-prow bot added the area/trainer AREA: Kubeflow Trainer / Kubeflow Training Operator label Feb 4, 2026
@google-oss-prow google-oss-prow bot requested a review from ChanYiLin February 4, 2026 12:00
@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign andreyvelich for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot requested a review from Jeffwan February 4, 2026 12:00
@google-oss-prow

Hi @sh4shv4t. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@github-actions

github-actions bot commented Feb 4, 2026

🚫 This command cannot be processed. Only organization members or owners can use the commands.

@sh4shv4t
Author

sh4shv4t commented Feb 4, 2026

Here are the rendered screenshots of the new architecture diagrams for verification:

image image image

Member

@Arhell Arhell left a comment


/ok-to-test

Contributor

@Fiona-Waters Fiona-Waters left a comment


This is great, thanks. Just some notes and comments.

  • The Docker diagram shows logs from node-0 only, but we can actually stream logs from all nodes.
  • Auto-remove is conditional/optional.
  • The Podman diagram should be updated to be more in line with the Docker one (with network creation and multi-node, for example).
  • The local process diagram has a few discrepancies; the flow is not quite right (create - extract - install all happen in one script). We can show the bash script generation step and make it clear that the other steps happen within a single subprocess via the generated bash script.
    I hope that makes sense. Please let me know if you need anything clarified.
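
To illustrate the first note (logs streamed from every node rather than only node-0), here is a minimal sketch. It is not the SDK's implementation: `merge_node_logs` is a hypothetical helper, and it assumes each node's logs are available as a plain iterable of lines. One thread per node feeds a shared queue, so lines from all nodes are interleaved as they arrive while each node's own order is preserved.

```python
import queue
import threading
from typing import Dict, Iterable, Iterator


def merge_node_logs(streams: Dict[str, Iterable[str]]) -> Iterator[str]:
    """Yield '[node] line' entries from every node as they arrive.

    Hypothetical helper: `streams` maps a node name to an iterable of log lines.
    """
    q: queue.Queue = queue.Queue()
    done = object()  # sentinel each worker posts when its stream ends

    def pump(node: str, lines: Iterable[str]) -> None:
        for line in lines:
            q.put(f"[{node}] {line}")
        q.put(done)

    threads = [
        threading.Thread(target=pump, args=(node, lines), daemon=True)
        for node, lines in streams.items()
    ]
    for t in threads:
        t.start()

    remaining = len(threads)
    while remaining:
        item = q.get()
        if item is done:
            remaining -= 1
        else:
            yield item
```

For example, `list(merge_node_logs({"node-0": ["a", "b"], "node-1": ["c"]}))` returns all three lines, each prefixed with its node name; the interleaving between nodes is not deterministic.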

@sh4shv4t
Author

sh4shv4t commented Feb 6, 2026

Hi @Fiona-Waters, thank you so much for the review! I apologize for the terminology mix-up and the technical discrepancies in the diagrams. I’ll take all your suggestions into account and push the updated changes shortly. Thanks for the guidance!

- docker.md: show logs streaming from all nodes, clarify conditional cleanup
- podman.md: correct architecture text (Docker→Podman), align diagram with implementation, remove incorrect workflow details
- local_process.md: update diagram to reflect bash script generation and single subprocess execution

These changes address reviewer feedback and align documentation with actual SDK implementation.

Signed-off-by: sh4shv4t <shashvat.k.singh.16@gmail.com>
@sh4shv4t sh4shv4t force-pushed the fix-trainer-docs-architecture branch from 2b04b76 to 11700f7 Compare February 6, 2026 23:13
@sh4shv4t
Author

sh4shv4t commented Feb 6, 2026

Hi @Fiona-Waters, thanks again for the detailed feedback! I have updated the diagrams and documentation to reflect the internal logic accurately this time (hopefully!). Specifically, I’ve made the following changes:

  1. Docker Backend:
    • Log Streaming: Updated the diagram to show that logs can be streamed from all nodes, not just node-0.
    • Auto-remove: Clarified that container removal is conditional/optional based on the job configuration.
  2. Podman Backend:
    • Consistency: Refactored the Podman diagram to align with the Docker version (including network creation and multi-node setup).
    • Terminology: Fixed the descriptions in podman.md to remove "Docker" references and corrected the pull_policy wording to be less misleading.
  3. Local Process Backend:
    • Flow Correction: Updated the diagram to show the correct sequence: the SDK generates a bash script, which then handles the environment extraction and installation in a single subprocess.
    • Clarity: Made it clear that these operations happen within the generated script execution rather than as separate SDK-managed steps.
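
The local process flow described in item 3 can be sketched roughly as follows. This is an illustration under stated assumptions, not the SDK's actual code: `build_job_script` and `run_in_single_subprocess` are hypothetical names, and the exact steps inside the script (venv creation, `pip install`, the training command) are guesses at what such a generated script might contain. The point is only that every step lives in one generated bash script, which is then executed by a single subprocess.

```python
import shlex
import subprocess
import tempfile
from pathlib import Path
from typing import List


def build_job_script(venv_dir: str, packages: List[str], train_cmd: str) -> str:
    """Return one bash script that sets up the environment and runs training.

    Hypothetical helper; the steps are assumptions for illustration only.
    """
    lines = [
        "set -euo pipefail",  # stop at the first failing step
        f"python3 -m venv {shlex.quote(venv_dir)}",
        f"source {shlex.quote(venv_dir + '/bin/activate')}",
    ]
    if packages:
        lines.append("pip install " + " ".join(shlex.quote(p) for p in packages))
    lines.append(train_cmd)  # the training command runs last, inside the venv
    return "\n".join(lines) + "\n"


def run_in_single_subprocess(script: str) -> subprocess.CompletedProcess:
    """Write the script to a temporary file and execute it with one bash process."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "job.sh"
        path.write_text(script)
        return subprocess.run(["bash", str(path)], capture_output=True, text=True)
```

Because the caller only launches `bash job.sh`, environment creation and dependency installation are not separate caller-managed steps; they succeed or fail as part of that one process.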

The new rendered diagrams are:
image
image
image

I believe these changes resolve the discrepancies mentioned. Please let me know if any further adjustments are needed!

Contributor

@Fiona-Waters Fiona-Waters left a comment


/lgtm thanks @sh4shv4t !
/assign @kramaranya @andreyvelich please review

@google-oss-prow

@Fiona-Waters: GitHub didn't allow me to assign the following users: please, review.

Note that only kubeflow members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

Details

In response to this:

/lgtm thanks @sh4shv4t !
/assign @kramaranya @andreyvelich please review

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


Labels

area/trainer AREA: Kubeflow Trainer / Kubeflow Training Operator ok-to-test size/M

Projects

None yet

Development

Successfully merging this pull request may close these issues.

chore(trainer): Add architecture section and diagram to Execute TrainJobs Locally section

5 participants