Parallel local #337
base: develop
Conversation
…strowf into parallel_local
# Conflicts:
#   maestrowf/datastructures/core/executiongraph.py
#   maestrowf/datastructures/core/study.py
#   maestrowf/interfaces/script/localparscriptadapter.py
#   maestrowf/maestro.py
```python
LOGGER = logging.getLogger(__name__)


class LocalParallelScriptAdapter(SchedulerScriptAdapter):
```
I'm looking over this and it could be because I haven't looked at it/talked with you about it in a while, but it looks like the adapter in its current form doesn't do any substitutions. It's entirely reliant on hard-scripting of srun and other MPI calls in the steps themselves.
I think we talked about this a while back -- but this is where we could write a new adapter that takes a scheduler type as an argument and then "borrows" its parallelization methods. In that same respect, we could bridge a little further and allow that to dictate scheduling Maestro on the cluster, essentially. That would come down to a hard check somewhere, so I'd have to mock something up. For reference, here's a mock local scheduler-esque class I wrote in pyaestro: Executor class
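To make the "borrows its parallelization methods" idea concrete, here's a rough self-contained sketch. The class names, registry, and method signatures below are made up for illustration and aren't the actual maestrowf API:

```python
# Hypothetical sketch: a local-parallel adapter that borrows its launch
# prefix from a stand-in scheduler adapter. Names and signatures are
# illustrative assumptions, not maestrowf's real classes.
class _SlurmLike:
    """Stand-in for a concrete scheduler adapter (e.g. Slurm)."""
    def get_parallelize_command(self, procs, nodes=None):
        return f"srun -n {procs}" + (f" -N {nodes}" if nodes else "")

_LAUNCHERS = {"slurm": _SlurmLike}

class LocalParallelAdapterSketch:
    def __init__(self, scheduler=None):
        # Dummy instance of a concrete adapter, used only for its
        # parallelize-command logic.
        self._launcher = _LAUNCHERS[scheduler]() if scheduler else None

    def get_parallelize_command(self, procs, nodes=None):
        if self._launcher is None:
            return ""  # bare local execution, no MPI launcher
        return self._launcher.get_parallelize_command(procs, nodes)

# LocalParallelAdapterSketch("slurm").get_parallelize_command(4, nodes=1)
# -> "srun -n 4 -N 1"
```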
Right, this is essentially just a replacement for the local adapter, so it's more suited to being the base class for the other scheduler adapters. One extra thing this setup might want for running on clusters is some extension of the input language to allow specifying multiple allocations to run in parallel and splitting jobs across them in some fashion. I'm unsure whether that's a good thing to add in the initial PR, though, as it could be a lot of work to make robust/user friendly.
On second thought, that allocation splitting thing is definitely not a good plan in this pr.
> On second thought, that allocation splitting thing is definitely not a good plan in this pr.
I'm not sure I remember the specifics of this one. Can you refresh me?
That's more of a next stage, or even a next layer up in the ecosystem: the idea of orchestrating one workflow across multiple allocations at once. So it's a bit beyond the scope of the threaded adapters.
Ah -- got it.
So, digging around in these adapters again, there seem to be a few paths to consider for a launcher-token-enabled version of this adapter: make a ParallelSchedulerScriptAdapter and fork off variants of the concrete instances from that, or make that SchedulerScriptAdapter the base for everything to bring in the launcher-token replacement machinery. The latter would potentially allow the one adapter to accept a scheduler type key so it can use the get_parallelize_command from dummy instances of the concrete scheduler adapters. So the design question seems to be: have one potentially more complicated adapter, or double the total number of adapter types to enable scheduler-specific tweaks in this parallel version -- i.e. SlurmParallelAdapter, FluxParallelAdapter (maybe that one's not needed?), etc.
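For contrast with the single-adapter sketch earlier in the thread, here's a minimal sketch of the "fork off variants" path: a shared parallel base class with one thin subclass per scheduler. All class names here are hypothetical, not proposals for the actual hierarchy:

```python
# Hypothetical "one subclass per scheduler" layout for the parallel
# adapters; shared thread-pool/dispatch machinery would live in the base.
class ParallelScriptAdapterBase:
    def get_parallelize_command(self, procs, nodes=None):
        raise NotImplementedError

class SlurmParallelAdapter(ParallelScriptAdapterBase):
    def get_parallelize_command(self, procs, nodes=None):
        cmd = f"srun -n {procs}"
        return f"{cmd} -N {nodes}" if nodes else cmd

class LocalParallelAdapter(ParallelScriptAdapterBase):
    def get_parallelize_command(self, procs, nodes=None):
        return ""  # no launcher on a plain multi-core box
```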
Ok, just circling back to this after far too long. How about this for a plan: add it now as a separate adapter and let it get tested on user workflows for a bit, then in a follow-up replace it with a stable version of the pyaestro executor and also drop the standard local adapter? Replacing the local adapter is likely a good follow-on, given that local parallel runs have more in common with the scheduled adapters than with the plain local one. Launcher tokens then become a feature of all adapters, with threaded/mpirun-type launcher tokens enabled locally.
One more thought on this particular PR: do we want to enable launcher tokens on this adapter right now and hide the pain of mpirun/srun/etc., or save that for a second PR enabling more general token-replacement facilities (i.e. per-step specification of tokens, possibly even user-specified named tokens to mix and match in single steps, ...)?
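As a rough illustration of the simplest form the launcher-token replacement being weighed here could take: swap a token in the step command for the adapter's launch prefix. The `$(LAUNCHER)` token name and the substitution rule in this sketch are assumptions for illustration:

```python
import re

def expand_launcher(cmd, launch_prefix):
    # Replace a $(LAUNCHER) token with the scheduler-specific prefix
    # (e.g. "srun -n 8" or "mpirun -n 8"), or with nothing for a
    # serial local run.
    return re.sub(r"\$\(LAUNCHER\)", launch_prefix, cmd)

step_cmd = "$(LAUNCHER) ./my_sim --input params.yaml"
print(expand_launcher(step_cmd, "mpirun -n 8"))
# -> "mpirun -n 8 ./my_sim --input params.yaml"
```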
Add a parallel local adapter type for improved throughput on multi-core machines/allocations. The adapter is resource aware and can support non-uniform task/job sizes.
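A minimal sketch of what resource-aware dispatch with non-uniform task sizes could look like, assuming a simple core-count budget; the scheduling policy and all names below are illustrative, not the adapter's actual implementation:

```python
# Illustrative only: steps request differing core counts and launch
# only when enough cores are free. Not the adapter's real logic.
import os
import subprocess
import threading
from concurrent.futures import ThreadPoolExecutor

class CoreBudget:
    """Tracks free cores; blocks launches that would oversubscribe."""
    def __init__(self, total=None):
        self.free = total or os.cpu_count()
        self._cv = threading.Condition()

    def acquire(self, n):
        with self._cv:
            self._cv.wait_for(lambda: self.free >= n)
            self.free -= n

    def release(self, n):
        with self._cv:
            self.free += n
            self._cv.notify_all()

budget = CoreBudget()

def run_step(cmd, procs):
    budget.acquire(procs)  # block until `procs` cores are available
    try:
        subprocess.run(cmd, shell=True, check=False)
    finally:
        budget.release(procs)

with ThreadPoolExecutor(max_workers=4) as pool:
    for cmd, procs in [("echo big_sim", 4), ("echo small_post", 1)]:
        pool.submit(run_step, cmd, procs)
```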