Open
Description
LBANN/DistConv#13 made some breaking changes:
(1) ps.num_shards is now a tuple, instead of an int
[rank3]: File "/usr/WS1/mckinsey/bp_scaffold-riken/ScaFFold/ScaFFold/utils/trainer.py", line 300, in train
[rank3]: if images.size(0) < ps.num_shards:
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: TypeError: '<' not supported between instances of 'int' and 'tuple'
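One possible way to keep the comparison in trainer.py working with both the old and new API is to normalize the value to a single int first. The sketch below is only an illustration: the helper name `total_shards` is hypothetical, and treating the tuple as per-dimension shard counts whose product is the total is an assumption that should be checked against LBANN/DistConv#13.

```python
import math
from typing import Sequence, Union


def total_shards(num_shards: Union[int, Sequence[int]]) -> int:
    """Return a single shard count from an int or a tuple of ints.

    Assumption (not confirmed by the issue): a tuple holds one shard
    count per sharded dimension, so the total is their product.
    """
    if isinstance(num_shards, int):
        # Old behaviour: num_shards was already a plain int.
        return num_shards
    # New behaviour: num_shards is a tuple such as (2,) or (2, 2).
    return math.prod(num_shards)


# Hypothetical use at the failing line in trainer.py:
#     if images.size(0) < total_shards(ps.num_shards):
```

If the tuple turns out to mean something else, only the body of the helper needs to change, not the call site.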
(2) mesh_dim_names format is different
[rank1]: File "/usr/WS1/mckinsey/bp_scaffold-riken/ScaFFold/ScaFFold/worker.py", line 228, in main
[rank1]: trainer.train()
[rank1]: File "/usr/WS1/mckinsey/bp_scaffold-riken/ScaFFold/ScaFFold/utils/trainer.py", line 442, in train
[rank1]: device_mesh=ps.device_mesh["dc"],
[rank1]: ~~~~~~~~~~~~~~^^^^^^
[rank1]: File "/usr/WS1/mckinsey/bp_scaffold-riken/ScaFFold/.venvs/scaffoldvenv-tuo/lib/python3.11/site-packages/torch/distributed/device_mesh.py", line 730, in __getitem__
[rank1]: slice_mesh_dims = _mesh_resources._get_slice_mesh_dims(
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/usr/WS1/mckinsey/bp_scaffold-riken/ScaFFold/.venvs/scaffoldvenv-tuo/lib/python3.11/site-packages/torch/distributed/device_mesh.py", line 315, in _get_slice_mesh_dims
[rank1]: raise KeyError(
[rank1]: KeyError: "Invalid mesh_dim_names ('dc',) specified. Valid mesh_dim_names are ['ddp', 'dc0']."
- Also on this: shouldn't the valid name be dc2, since our current default config sets shard_dim: 2? Maybe shard_dim is not being configured properly.
- Another note: maybe shard_dim: 2 should be shard_dim: 1, as @PatrickRMiles found this is the fastest configuration.
- So all in all, this should probably be dc1.
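Until the naming question is settled, one stopgap could be to look up the DistConv dimension name from the mesh instead of hard-coding "dc". This is a rough sketch only: `dc_dim_names` is a hypothetical helper, and the assumption that DistConv dimensions are the ones whose names start with "dc" is taken from the KeyError message above, not from the DistConv source.

```python
from typing import Sequence


def dc_dim_names(mesh_dim_names: Sequence[str], prefix: str = "dc") -> list[str]:
    """Return the mesh dimension names that start with `prefix`.

    Assumption: DistConv mesh dimensions are named "dc<N>", as the
    KeyError in this issue suggests (['ddp', 'dc0']).
    """
    names = [name for name in mesh_dim_names if name.startswith(prefix)]
    if not names:
        raise KeyError(
            f"No mesh dimension starting with {prefix!r} in {list(mesh_dim_names)}"
        )
    return names


# Hypothetical use in trainer.py, assuming the mesh exposes its names
# through the mesh_dim_names attribute:
#     dc_name = dc_dim_names(ps.device_mesh.mesh_dim_names)[0]
#     device_mesh = ps.device_mesh[dc_name]
```

With the names from the traceback, `dc_dim_names(["ddp", "dc0"])` selects only "dc0", since "ddp" does not start with "dc".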