Integrate changes for latest DistConv #6

@michaelmckinsey1

LBANN/DistConv#13 made some breaking changes:

(1) ps.num_shards is now a tuple, instead of an int

[rank3]:   File "/usr/WS1/mckinsey/bp_scaffold-riken/ScaFFold/ScaFFold/utils/trainer.py", line 300, in train
[rank3]:     if images.size(0) < ps.num_shards:
[rank3]:        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: TypeError: '<' not supported between instances of 'int' and 'tuple'
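
One possible way to make the check at trainer.py line 300 tolerate both forms is sketched below. This is only a sketch under assumptions: `total_shards` and `batch_too_small` are hypothetical helper names, and treating the product of the tuple entries as the total shard count is a guess about the new DistConv semantics, not something confirmed by LBANN/DistConv#13. If only one entry of the tuple is relevant to the batch dimension, that entry should be compared instead.

```python
import math
from typing import Sequence, Union


def total_shards(num_shards: Union[int, Sequence[int]]) -> int:
    """Collapse num_shards to one int, accepting the old int or the new tuple form.

    ASSUMPTION: the total shard count is the product of the tuple entries.
    """
    if isinstance(num_shards, int):
        return num_shards
    return math.prod(num_shards)


def batch_too_small(batch_size: int, num_shards: Union[int, Sequence[int]]) -> bool:
    """Hypothetical replacement for `images.size(0) < ps.num_shards`."""
    return batch_size < total_shards(num_shards)
```

In the trainer, the failing condition could then read `if batch_too_small(images.size(0), ps.num_shards):`, assuming the product interpretation holds.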

(2) mesh_dim_names format is different: the trainer asks for 'dc', but the valid names are now 'ddp' and 'dc0'

[rank1]:   File "/usr/WS1/mckinsey/bp_scaffold-riken/ScaFFold/ScaFFold/worker.py", line 228, in main
[rank1]:     trainer.train()
[rank1]:   File "/usr/WS1/mckinsey/bp_scaffold-riken/ScaFFold/ScaFFold/utils/trainer.py", line 442, in train
[rank1]:     device_mesh=ps.device_mesh["dc"],
[rank1]:                 ~~~~~~~~~~~~~~^^^^^^
[rank1]:   File "/usr/WS1/mckinsey/bp_scaffold-riken/ScaFFold/.venvs/scaffoldvenv-tuo/lib/python3.11/site-packages/torch/distributed/device_mesh.py", line 730, in __getitem__
[rank1]:     slice_mesh_dims = _mesh_resources._get_slice_mesh_dims(
[rank1]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/usr/WS1/mckinsey/bp_scaffold-riken/ScaFFold/.venvs/scaffoldvenv-tuo/lib/python3.11/site-packages/torch/distributed/device_mesh.py", line 315, in _get_slice_mesh_dims
[rank1]:     raise KeyError(
[rank1]: KeyError: "Invalid mesh_dim_names ('dc',) specified. Valid mesh_dim_names are ['ddp', 'dc0']."
  • Also on this: shouldn't the name be dc2? Our current default config sets shard_dim: 2, so maybe this is not being configured properly. Another note: maybe shard_dim: 2 should be shard_dim: 1, as @PatrickRMiles found this is the fastest configuration. So all in all, this should probably be dc1.
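
Until the naming question is settled, the trainer could avoid hard-coding the dimension name by looking it up from the mesh. The sketch below is only illustrative: `find_dc_dim_names` is a hypothetical helper, and the assumption that DistConv dimensions are always named "dc" followed by an index is inferred from the single 'dc0' in the KeyError above, not from the DistConv source.

```python
from typing import Optional, Sequence


def find_dc_dim_names(
    mesh_dim_names: Optional[Sequence[str]], prefix: str = "dc"
) -> list[str]:
    """Return the mesh dimension names that start with `prefix`.

    ASSUMPTION: DistConv names its mesh dimensions "dc<N>" (e.g. "dc0").
    """
    if mesh_dim_names is None:
        return []
    return [name for name in mesh_dim_names if name.startswith(prefix)]


# With the names reported in the KeyError, only "dc0" matches.
dc_names = find_dc_dim_names(["ddp", "dc0"])
```

With a PyTorch `DeviceMesh`, the input would come from its `mesh_dim_names` attribute, so the lookup at trainer.py line 442 might become `ps.device_mesh[find_dc_dim_names(ps.device_mesh.mesh_dim_names)[0]]`. This also makes it easy to print which name is actually present (dc0, dc1, or dc2) when checking whether shard_dim is being applied.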
