
feat: Run dataset and model initializers in parallel #313

Open
priyank766 wants to merge 1 commit into kubeflow:main from priyank766:feat/parallel-initializers-290

Conversation

@priyank766

What this PR does / why we need it:
Currently, the ContainerBackend (Docker/Podman) runs dataset and model initializers sequentially. For Large Language Model (LLM) fine-tuning or heavy data workloads, downloading both a massive dataset and a base model separately can add significant overhead to the job startup time.

This PR refactors the _run_initializers logic to use concurrent.futures.ThreadPoolExecutor. If both a dataset and a model are configured, they are now initialized in parallel threads. This reduces the total initialization time to the duration of the longest single download, rather than the sum of both.

Key technical changes:

  • Refactored kubeflow/trainer/backends/container/backend.py to use a thread pool for parallel initializer dispatch.
  • Added import concurrent.futures to the backend.
  • Ensured thread safety and proper error propagation (if one initializer fails, the main thread correctly identifies the failure and cleans up).
  • Verified that shared volume mounts work correctly during concurrent writes from separate containers.
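The dispatch-and-propagate pattern described above can be sketched as follows. This is a minimal illustration of the technique, not the actual `_run_initializers` implementation from `kubeflow/trainer/backends/container/backend.py`; the helper name `run_initializers_in_parallel` and the callables passed to it are hypothetical.

```python
import concurrent.futures

def run_initializers_in_parallel(initializers):
    """Run each initializer callable in its own thread.

    `initializers` maps a name ("dataset", "model") to a zero-argument
    callable. Any exception raised in a worker thread is re-raised on
    the main thread via Future.result(), so a failed initializer is
    identified there rather than being silently swallowed.
    """
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
        futures = {executor.submit(fn): name for name, fn in initializers.items()}
        for future in concurrent.futures.as_completed(futures):
            # result() re-raises the worker's exception, if any.
            future.result()

# Example usage with dummy "downloads" standing in for container runs:
results = []
run_initializers_in_parallel({
    "dataset": lambda: results.append("dataset ready"),
    "model": lambda: results.append("model ready"),
})
```

Because the `with` block waits for all submitted futures, total wall-clock time is bounded by the slowest initializer rather than the sum of both.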

Which issue(s) this PR fixes:
Fixes #290

Checklist:

  • Docs included if any changes are user facing (Internal performance improvement, no change to public API surface)
  • Unit tests pass: uv run pytest kubeflow/trainer/backends/container/backend_test.py
  • Code adheres to Ruff formatting/linting standards verified via make verify.

Verification Results:
Ran the specific backend tests to confirm the threading logic works as expected without race conditions:

uv run pytest kubeflow/trainer/backends/container/backend_test.py
# Result: 19 passed

Copilot AI review requested due to automatic review settings February 22, 2026 09:00
@google-oss-prow
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign astefanutti for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@github-actions
Contributor

🎉 Welcome to the Kubeflow SDK! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

  • If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards
  • Our team will review your PR soon! cc @kubeflow/kubeflow-sdk-team

Join the community:

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

Contributor

Copilot AI left a comment


Pull request overview

This PR refactors the ContainerBackend to run dataset and model initializers in parallel using Python's ThreadPoolExecutor, reducing total initialization time from sequential (sum of both downloads) to parallel (duration of the longest download). This is particularly beneficial for LLM fine-tuning workloads with large datasets and models.

Changes:

  • Refactored _run_initializers to use concurrent.futures.ThreadPoolExecutor with max_workers=2
  • Updated method documentation to reflect parallel execution
  • Modified logging messages to indicate queueing and parallel completion
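The `max_workers=2` pool with queueing and parallel-completion log messages might look like this sketch. The logger name and message strings are illustrative assumptions, not the backend's actual log output.

```python
import concurrent.futures
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("container-backend")

def download(name):
    # Placeholder for launching an initializer container and waiting on it.
    logger.info("Queued %s initializer", name)
    return name

# One worker per initializer type, matching the max_workers=2 described above.
with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
    futures = [executor.submit(download, n) for n in ("dataset", "model")]
    done = [f.result() for f in futures]

logger.info("Initializers completed in parallel: %s", ", ".join(done))
```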

@priyank766 priyank766 force-pushed the feat/parallel-initializers-290 branch 2 times, most recently from af5276e to be3c0a8 Compare February 22, 2026 09:07
@priyank766 priyank766 changed the title from "feat(container): Run dataset and model initializers in parallel (#290)" to "feat : Run dataset and model initializers in parallel (#290)" Feb 22, 2026
@priyank766
Author

I wanted to confirm that this issue is still open. I submitted the PR, but I had trouble with the PR title check. Can anyone help me? I didn't understand the required format for the PR title.
@astefanutti

@astefanutti
Contributor

/retitle feat: Run dataset and model initializers in parallel

@google-oss-prow google-oss-prow bot changed the title from "feat : Run dataset and model initializers in parallel (#290)" to "feat: Run dataset and model initializers in parallel" Feb 23, 2026
Contributor

@astefanutti astefanutti left a comment


/ok-to-test

@priyank766 priyank766 force-pushed the feat/parallel-initializers-290 branch from bb80231 to c05ed89 Compare February 23, 2026 16:19
@priyank766
Author

@astefanutti The E2E test failures are resolved here.



Development

Successfully merging this pull request may close these issues.

feat(container): Run dataset and model initializers in parallel

3 participants