feat: Run dataset and model initializers in parallel #313

priyank766 wants to merge 1 commit into kubeflow:main from
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: (none). The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
🎉 Welcome to the Kubeflow SDK! 🎉 Thanks for opening your first PR! We're happy to have you as part of our community 🚀 Here's what happens next:
Join the community:
Feel free to ask questions in the comments if you need any help or clarification!
Pull request overview
This PR refactors the ContainerBackend to run dataset and model initializers in parallel using Python's ThreadPoolExecutor, reducing total initialization time from sequential (sum of both downloads) to parallel (duration of the longest download). This is particularly beneficial for LLM fine-tuning workloads with large datasets and models.
Changes:
- Refactored `_run_initializers` to use `concurrent.futures.ThreadPoolExecutor` with `max_workers=2`
- Updated method documentation to reflect parallel execution
- Modified logging messages to indicate queueing and parallel completion
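The refactor described above can be sketched roughly as follows. This is a minimal, self-contained illustration of the technique, not the actual `ContainerBackend` code; the `run_initializers` helper and its signature are assumptions for the example.

```python
import concurrent.futures
import time


def run_initializers(initializers, max_workers=2):
    """Run named initializer callables in parallel threads.

    Hypothetical standalone version of the pattern the PR describes:
    `initializers` maps a name (e.g. "dataset", "model") to a
    zero-argument callable. Total wall time is the duration of the
    longest initializer, not the sum of all of them. Returns the set
    of names that completed; any initializer exception is re-raised
    by future.result().
    """
    done = set()
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Submit every configured initializer up front so they overlap.
        futures = {executor.submit(fn): name for name, fn in initializers.items()}
        for future in concurrent.futures.as_completed(futures):
            future.result()  # surfaces errors instead of swallowing them
            done.add(futures[future])
    return done
```

With two fake 0.2-second downloads, `run_initializers({"dataset": lambda: time.sleep(0.2), "model": lambda: time.sleep(0.2)})` finishes in roughly 0.2 seconds rather than 0.4, matching the "longest single download" claim in the PR description.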
Force-pushed from af5276e to be3c0a8
I wanted to ask whether this issue is still open. I submitted a PR, but I had a problem with the PR title. Can anyone help me? I didn't understand the required format for the PR title.
/retitle feat: Run dataset and model initializers in parallel
…ow#290) Signed-off-by: priyank <priyank8445@gmail.com>
Force-pushed from bb80231 to c05ed89
@astefanutti The E2E test failures are resolved here.
What this PR does / why we need it:
Currently, the ContainerBackend (Docker/Podman) runs dataset and model initializers sequentially. For Large Language Model (LLM) fine-tuning or heavy data workloads, downloading both a massive dataset and a base model separately can add significant overhead to the job startup time.
This PR refactors the _run_initializers logic to use
concurrent.futures.ThreadPoolExecutor. If both a dataset and a model are configured, they are now initialized in parallel threads. This reduces the total initialization time to the duration of the longest single download, rather than the sum of both.

Key technical changes:
- Added `import concurrent.futures` to the backend.

Which issue(s) this PR fixes:
Fixes #290
Checklist:
- Ran `uv run pytest kubeflow/trainer/backends/container/backend_test.py`
- Ran `make verify`

Verification Results:
Ran the specific backend tests to confirm the threading logic works as expected without race conditions:
uv run pytest kubeflow/trainer/backends/container/backend_test.py # Result: 19 passed
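The "no race conditions" claim can be checked with a test that proves the two initializers are actually in flight at the same time. The sketch below is a self-contained illustration, not a test from `backend_test.py`; it uses a `threading.Barrier` that only releases if both threads reach it concurrently.

```python
import concurrent.futures
import threading


def test_initializers_overlap():
    """Verify two initializers really run in parallel.

    A Barrier(2) only releases when two threads are waiting on it at
    the same time. If the executor ran the initializers sequentially,
    the first wait() would time out and raise BrokenBarrierError.
    """
    barrier = threading.Barrier(2, timeout=2)

    def initializer():
        barrier.wait()  # blocks until the other initializer arrives

    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
        futures = [executor.submit(initializer) for _ in range(2)]
        for f in futures:
            f.result()  # re-raises BrokenBarrierError on sequential runs
    return True
```

A `max_workers=1` pool would fail this test, since the second initializer can never start while the first is blocked on the barrier.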