Conversation

@jwhite242 (Collaborator):
Add a parallel local adapter type for improved throughput on multi-core machines/allocations. The adapter is resource aware and can support non-uniform task/job sizes.
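The description above can be sketched as a small resource-aware local pool. This is an illustrative sketch only, not Maestro's implementation: `CoreBudget`, `run_steps`, and the step tuples are hypothetical names, but the block shows the key property claimed in the PR, namely running steps concurrently while per-step core counts (which may differ between steps) never oversubscribe the machine.

```python
# Hypothetical sketch of a resource-aware local parallel runner: steps
# with non-uniform core counts run concurrently, but submission blocks
# until enough free cores exist, so the machine is never oversubscribed.
import threading
from concurrent.futures import ThreadPoolExecutor


class CoreBudget:
    """Track free cores; block a step until its core request fits."""

    def __init__(self, total_cores):
        self._free = total_cores
        self._cond = threading.Condition()

    def acquire(self, cores):
        with self._cond:
            self._cond.wait_for(lambda: self._free >= cores)
            self._free -= cores

    def release(self, cores):
        with self._cond:
            self._free += cores
            self._cond.notify_all()


def run_steps(steps, total_cores=4, max_workers=4):
    """Run (name, cores, fn) steps without exceeding total_cores."""
    budget = CoreBudget(total_cores)
    results = {}

    def run_one(name, cores, fn):
        budget.acquire(cores)
        try:
            results[name] = fn()
        finally:
            budget.release(cores)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run_one, *step) for step in steps]
        for future in futures:
            future.result()  # re-raise any step failure
    return results
```

A 1-core step and a 2-core step can run side by side while a 4-core step waits for the budget to drain, which is the "non-uniform job sizes" behavior the description promises.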

jwhite242 and others added 25 commits April 10, 2020 16:10
Review comment on this code:

    LOGGER = logging.getLogger(__name__)


    class LocalParallelScriptAdapter(SchedulerScriptAdapter):
Member:
I'm looking over this and it could be because I haven't looked at it/talked with you about it in a while, but it looks like the adapter in its current form doesn't do any substitutions. It's entirely reliant on hard-scripting of srun and other MPI calls in the steps themselves.

I think we talked about this a while back -- but this is where we could write a new adapter that takes a scheduler type as an argument and then "borrows" its parallelization methods. In that same respect, we could bridge a little further and allow that to dictate how Maestro schedules itself on the cluster, essentially. That would come down to a hard check somewhere, so I'd have to mock something up. For reference, here's a mock local scheduler-esque type class I wrote in pyaestro: Executor class

Collaborator (Author):

Right, this is essentially just the replacement for the local adapter, so it's more suited to being the base class for the other scheduler adapters. One extra thing this setup might want for running on clusters is some extension of the input language to allow specifying multiple allocations to run in parallel and splitting jobs across them in some fashion. I'm unsure whether that's a good thing to add in the initial PR, though, as it could be a lot of work to make it robust/user friendly.

Collaborator (Author):

On second thought, that allocation-splitting idea is definitely not a good plan for this PR.

Member:

> On second thought, that allocation splitting thing is definitely not a good plan in this pr.

I'm not sure I remember the specifics of this one. Can you refresh me?

Collaborator (Author):

That's more of a next stage, or even a next layer up in the ecosystem: the idea of orchestrating one workflow across multiple allocations at once. So it's a bit beyond the scope of the threaded adapters.

Member:

Ah -- got it.

Collaborator (Author):

So, digging around in these adapters again, there seem to be a few paths to consider for a launcher-token-enabled version of this adapter:

- make a ParallelSchedulerScriptAdapter and fork off variants of the concrete instances from that, or
- make that SchedulerScriptAdapter the base for everything, to bring in the launcher token replacement machinery. This latter case would potentially allow the one adapter to accept a scheduler type key so it can use get_parallelize_command from dummy instances of the concrete scheduler adapters.

So the design question seems to be: have one potentially more complicated adapter, or double the total number of adapter types to enable scheduler-specific tweaks in this parallel version -- a SlurmParallelAdapter, a FluxParallelAdapter (maybe that one's not needed?), etc.
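The first path above (forked variants off a shared parallel base) can be sketched in a few lines. The class names, the `launcher` attribute, the `$(LAUNCHER)` token, and the launcher strings are all hypothetical, chosen only to contrast with the scheduler-key design sketched earlier in the thread.

```python
# Hypothetical sketch of the "double the adapter types" path: a shared
# parallel base class plus one thin subclass per scheduler, each only
# supplying its launcher string.
class ParallelSchedulerScriptAdapter:
    launcher = ""  # overridden per concrete scheduler

    def substitute_launcher(self, cmd, procs):
        # Replace an assumed $(LAUNCHER) token in the step's command
        # with this scheduler's parallel launch invocation.
        return cmd.replace("$(LAUNCHER)", self.launcher.format(procs=procs))


class SlurmParallelAdapter(ParallelSchedulerScriptAdapter):
    launcher = "srun -n {procs}"


class FluxParallelAdapter(ParallelSchedulerScriptAdapter):
    launcher = "flux run -n {procs}"
```

The trade-off reads directly off the sketch: each subclass is trivial, but every scheduler-specific tweak costs a new class, whereas the single-adapter design keeps one class and pushes the variation into a lookup key.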

@jwhite242 (Collaborator, Author):
Ok, just circling back to this after far too long. How about this for a plan: add it now as a separate adapter, let it get tested on user workflows for a bit, and then in a follow-up replace it with a stable version of the pyaestro executor and also just drop the standard local adapter?

The local adapter replacement is likely a good follow-on, given that local parallel runs share more in common with the scheduled adapters than with the local one. Launcher tokens would then become a feature of all adapters, with threaded, mpirun-type launcher tokens enabled locally.

@jwhite242 (Collaborator, Author):

One more thought on this particular PR: do we want to enable launcher tokens on this adapter right now and hide the pain of mpirun/srun/etc., or save that for a second PR enabling more general token replacement facilities (i.e. per-step specification of tokens, possibly even different user-specified named tokens to mix and match in single steps, ...)?
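The more general facility floated here (user-named tokens mixed within a single step) could be as simple as a table-driven substitution pass. This is a hedged sketch only: the `$(NAME)` syntax, the function name, and the token names are illustrative assumptions, not Maestro's actual specification language.

```python
# Hypothetical sketch of general per-step token replacement: a step
# command may mix several user-named launcher tokens, each expanded
# from a per-step table; undefined tokens fail loudly.
import re


def expand_tokens(cmd, tokens):
    """Replace every $(NAME) occurrence in cmd using the token table."""
    def substitute(match):
        name = match.group(1)
        if name not in tokens:
            raise KeyError(f"undefined launcher token: {name}")
        return tokens[name]

    return re.sub(r"\$\((\w+)\)", substitute, cmd)
```

Under this scheme, "hide the pain of mpirun/srun" in the current PR would just mean pre-populating the table with one built-in token, while the second PR would open the table up to user-defined names.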
