Parallel local #337
base: develop
Conversation
…strowf into parallel_local
# Conflicts:
#   maestrowf/datastructures/core/executiongraph.py
#   maestrowf/datastructures/core/study.py
#   maestrowf/interfaces/script/localparscriptadapter.py
#   maestrowf/maestro.py
```python
LOGGER = logging.getLogger(__name__)


class LocalParallelScriptAdapter(SchedulerScriptAdapter):
```
I'm looking over this and it could be because I haven't looked at it/talked with you about it in a while, but it looks like the adapter in its current form doesn't do any substitutions. It's entirely reliant on hard-scripting of srun and other MPI calls in the steps themselves.
I think we talked about this a while back -- but this is where we could write a new adapter that takes a scheduler type as an argument and then "borrows" its parallelization methods. In that same respect, we could bridge a little further and allow that to dictate scheduling Maestro on the cluster, essentially. That would come down to a hard check somewhere, so I'd have to mock something up. For reference, here's a mock local scheduler-esque class I wrote in pyaestro: Executor class
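To make the "borrows its parallelization methods" idea concrete, here's a rough self-contained sketch. The class names, registry, and method signatures below are made up for illustration and aren't the actual maestrowf API:

```python
# Hypothetical sketch: a local-parallel adapter that borrows its launch
# prefix from a stand-in scheduler adapter. Names and signatures are
# illustrative assumptions, not maestrowf's real classes.
class _SlurmLike:
    """Stand-in for a concrete scheduler adapter (e.g. Slurm)."""
    def get_parallelize_command(self, procs, nodes=None):
        return f"srun -n {procs}" + (f" -N {nodes}" if nodes else "")

_LAUNCHERS = {"slurm": _SlurmLike}

class LocalParallelAdapterSketch:
    def __init__(self, scheduler=None):
        # Dummy instance of a concrete adapter, used only for its
        # parallelize-command logic.
        self._launcher = _LAUNCHERS[scheduler]() if scheduler else None

    def get_parallelize_command(self, procs, nodes=None):
        if self._launcher is None:
            return ""  # bare local execution, no MPI launcher
        return self._launcher.get_parallelize_command(procs, nodes)

# LocalParallelAdapterSketch("slurm").get_parallelize_command(4, nodes=1)
# -> "srun -n 4 -N 1"
```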
Right, this is essentially just a replacement for the local adapter, so it's more suited to being the base class for the other scheduler adapters. One extra thing this setup might want for running on clusters is some extension of the input language to allow specifying multiple allocations to run in parallel and splitting jobs across them in some fashion. I'm unsure whether that's a good thing to add in the initial PR, though, as it could be a lot of work to make robust/user friendly.
On second thought, that allocation splitting thing is definitely not a good plan in this pr.
> On second thought, that allocation splitting thing is definitely not a good plan in this pr.
I'm not sure I remember the specifics of this one. Can you refresh me?
That's more of a next stage, or even a next layer up in the ecosystem: the idea of orchestrating one workflow across multiple allocations at once. So it's a bit beyond the scope of the threaded adapters.
Ah -- got it.
So, digging around in these adapters again, there seem to be a few paths to consider for a launcher-token-enabled version of this adapter: make a ParallelSchedulerScriptAdapter and fork off variants of the concrete instances from that, or make that SchedulerScriptAdapter the base for everything to bring in the launcher-token replacement machinery. The latter would potentially allow the one adapter to accept a scheduler type key so it can use the get_parallelize_command from dummy instances of the concrete scheduler adapters. So the design question seems to be: have one potentially more complicated adapter, or double the total number of adapter types to enable scheduler-specific tweaks in this parallel version -- i.e. SlurmParallelAdapter, FluxParallelAdapter (maybe that one's not needed?), etc.
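For contrast with the single-adapter sketch earlier in the thread, here's a minimal sketch of the "fork off variants" path: a shared parallel base class with one thin subclass per scheduler. All class names here are hypothetical, not proposals for the actual hierarchy:

```python
# Hypothetical "one subclass per scheduler" layout for the parallel
# adapters; shared thread-pool/dispatch machinery would live in the base.
class ParallelScriptAdapterBase:
    def get_parallelize_command(self, procs, nodes=None):
        raise NotImplementedError

class SlurmParallelAdapter(ParallelScriptAdapterBase):
    def get_parallelize_command(self, procs, nodes=None):
        cmd = f"srun -n {procs}"
        return f"{cmd} -N {nodes}" if nodes else cmd

class LocalParallelAdapter(ParallelScriptAdapterBase):
    def get_parallelize_command(self, procs, nodes=None):
        return ""  # no launcher on a plain multi-core box
```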
Ok, just circling back to this after far too long. How about this for a plan: add it now as a separate adapter and let it get tested on user workflows for a bit, then in a follow-up replace it with a stable version of the pyaestro executor and also drop the standard local adapter? Replacing the local adapter is likely a good follow-on, given that local parallel runs have more in common with the scheduled adapters than with the plain local one. Launcher tokens then become a feature of all adapters, with threaded/mpirun-type launcher tokens enabled locally.
One more thought on this particular PR: do we want to enable launcher tokens on this adapter right now and hide the pain of mpirun/srun/etc., or save that for a second PR enabling more general token-replacement facilities (i.e. per-step specification of tokens, possibly even user-specified named tokens to mix and match in single steps, ...)?
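As a rough illustration of the simplest form the launcher-token replacement being weighed here could take: swap a token in the step command for the adapter's launch prefix. The `$(LAUNCHER)` token name and the substitution rule in this sketch are assumptions for illustration:

```python
import re

def expand_launcher(cmd, launch_prefix):
    # Replace a $(LAUNCHER) token with the scheduler-specific prefix
    # (e.g. "srun -n 8" or "mpirun -n 8"), or with nothing for a
    # serial local run.
    return re.sub(r"\$\(LAUNCHER\)", launch_prefix, cmd)

step_cmd = "$(LAUNCHER) ./my_sim --input params.yaml"
print(expand_launcher(step_cmd, "mpirun -n 8"))
# -> "mpirun -n 8 ./my_sim --input params.yaml"
```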
Add a parallel local adapter type for improved throughput on multi-core machines/allocations. The adapter is resource aware and can support non-uniform task/job sizes.
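A minimal sketch of what resource-aware dispatch with non-uniform task sizes could look like, assuming a simple core-count budget; the scheduling policy and all names below are illustrative, not the adapter's actual implementation:

```python
# Illustrative only: steps request differing core counts and launch
# only when enough cores are free. Not the adapter's real logic.
import os
import subprocess
import threading
from concurrent.futures import ThreadPoolExecutor

class CoreBudget:
    """Tracks free cores; blocks launches that would oversubscribe."""
    def __init__(self, total=None):
        self.free = total or os.cpu_count()
        self._cv = threading.Condition()

    def acquire(self, n):
        with self._cv:
            self._cv.wait_for(lambda: self.free >= n)
            self.free -= n

    def release(self, n):
        with self._cv:
            self.free += n
            self._cv.notify_all()

budget = CoreBudget()

def run_step(cmd, procs):
    budget.acquire(procs)  # block until `procs` cores are available
    try:
        subprocess.run(cmd, shell=True, check=False)
    finally:
        budget.release(procs)

with ThreadPoolExecutor(max_workers=4) as pool:
    for cmd, procs in [("echo big_sim", 4), ("echo small_post", 1)]:
        pool.submit(run_step, cmd, procs)
```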