Description
Describe the Bug
With the latest git rev 656b66c, importing the comms modules fails with TypeError: unsupported operand type(s) for |: 'type' and 'NoneType'.
With the earlier git rev a4dbc72, I was able to run PARAM comms.
I suspect the Python version on my CentOS Stream 9 system is the cause. The failing line, device: str | None = None, uses the X | Y union syntax in a type hint, which is only supported at runtime from Python 3.10 onward, and this environment runs Python 3.9.
(venv-param) [amd@hostname-1e707-b05-2 PARAMcomms]$ python --version
Python 3.9.21
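To illustrate the suspected cause, here is a minimal sketch that checks whether the running interpreter can evaluate `str | None`, next to a spelling that works on Python 3.9. The function `init_backend` is a hypothetical stand-in, not the actual signature in pytorch_backend_utils.py.

```python
import sys
from typing import Optional


def union_syntax_supported() -> bool:
    """Return True if `str | None` can be evaluated at runtime."""
    try:
        str | None  # X | Y on types: raises TypeError on Python 3.9 and older
    except TypeError:
        return False
    return True


# Hypothetical stand-in for the failing signature; Optional[str] is the
# spelling that also works on Python 3.9.
def init_backend(device: Optional[str] = None) -> str:
    return device if device is not None else "cpu"


if __name__ == "__main__":
    print(sys.version_info[:2], "union syntax supported:", union_syntax_supported())
```

On the Python 3.9.21 interpreter shown above, the check should report that the union syntax is not supported.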
Steps to Reproduce
Install steps:
cd param-656b66c/
cd train/compute/python/
pip install .
cd ../../comms/pt/
pip install .
Environment setup:
ROCM_PATH=${ROCM_PATH:-/opt/rocm}
NFS_PATH=/share2/amd-share
OMPI_INSTALL_DIR=${NFS_PATH}/ompi4-install
RCCL_INSTALL_DIR=${NFS_PATH}/rccl_develop/build/release
RCCL_TESTS_INSTALL_DIR=${NFS_PATH}/rccl-tests/build
export PATH=${OMPI_INSTALL_DIR}/bin:$PATH
export LD_LIBRARY_PATH=${RCCL_INSTALL_DIR}:${OMPI_INSTALL_DIR}/lib:$LD_LIBRARY_PATH
source /share2/PARAMcomms/venv-param/bin/activate
To run:
mpirun --allow-run-as-root -np 8 -x NCCL_DEBUG=INFO -x PYTHONPATH=/usr/bin/python3 -host hostname-1e707-b05-2:8 -map-by ppr:8:node --bind-to none --mca pml ucx --mca btl ^openib -x PATH=${PATH} -x LD_LIBRARY_PATH=${LD_LIBRARY_PATH} -x NCCL_IB_GID_INDEX=3 -x RCCL_ENABLE_INTRANET=1 -x NCCL_IB_HCA=bnxt_re0,bnxt_re1,bnxt_re2,bnxt_re3,bnxt_re4,bnxt_re5,bnxt_re6,bnxt_re7 -x NCCL_IGNORE_CPU_AFFINITY=1 /share2/PARAMcomms/param/train/comms/pt/comms.py --device rocm --master-ip hostname-1e707-b05-2 -b 1 -e 1G -n 10 -f 2 -z 0 --collective all_reduce --data-type float32
python version:
(venv-param) [amd@hostname-1e707-b05-2 PARAMcomms]$ python --version
Python 3.9.21
pip version:
(venv-param) [amd@hostname-1e707-b05-2 PARAMcomms]$ pip list
Package                  Version
------------------------ ----------------------------
apex                     1.6.0+rocm6.5.0.git004991b6
fbgemm_gpu               1.2.0
filelock                 3.18.0
fsspec                   2025.3.2
future                   1.0.0
gitdb                    4.0.12
GitPython                3.1.44
Jinja2                   3.1.6
MarkupSafe               3.0.2
mpmath                   1.3.0
networkx                 3.2.1
numpy                    2.0.2
parambench-train-comms   0.0.0
parambench-train-compute 1.0.0+git.1747955991
pillow                   11.2.1
pip                      25.1.1
pydot                    4.0.0
pyparsing                3.2.3
pytorch-triton-rocm      3.2.0+rocm6.5.0.git6da9e660
scipy                    1.13.1
setuptools               53.0.0
smmap                    5.0.2
sympy                    1.13.1
torch                    2.6.0+rocm6.5.0.gitcf65c6f2
torchaudio               2.6.0+rocm6.5.0.gitd8831425
torchvision              0.21.0+rocm6.5.0.git7af69879
typing_extensions        4.13.2
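For anyone trying to reproduce this, the version information above can also be gathered from inside the virtual environment with a short script. This is only a sketch: the helper names `installed_version` and `version_report` are made up for illustration, and the package list is taken from the pip output above.

```python
import platform
from importlib import metadata
from typing import List, Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def version_report(dist_names: List[str]) -> List[str]:
    """Build one line for the interpreter and one line per distribution."""
    lines = [f"python {platform.python_version()}"]
    for name in dist_names:
        lines.append(f"{name} {installed_version(name) or 'not installed'}")
    return lines


if __name__ == "__main__":
    for line in version_report(
        ["torch", "parambench-train-comms", "parambench-train-compute"]
    ):
        print(line)
```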
Expected Behavior
I expect the benchmark to run. With an older revision such as a4dbc72, it runs and produces the following output:
+ mpirun --allow-run-as-root -np 8 -x NCCL_DEBUG=VERSION -x PYTHONPATH=/usr/bin/python3 -host hostname-1e707-b05-2:8 -map-by ppr:8:node --bind-to none --mca pml ucx --mca btl '^openib' -x PATH=/share2/PARAMcomms/venv-param/bin:/share2/amd-share/ompi4-install/bin:/share2/PARAMcomms/venv-param/bin:/home/amd/.local/bin:/home/amd/bin:/usr/share/Modules/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin -x LD_LIBRARY_PATH=/share2/amd-share/rccl_develop/build/release:/share2/amd-share/ompi4-install/lib: -x NCCL_IB_GID_INDEX=3 -x RCCL_ENABLE_INTRANET=1 -x NCCL_IB_HCA=bnxt_re0,bnxt_re1,bnxt_re2,bnxt_re3,bnxt_re4,bnxt_re5,bnxt_re6,bnxt_re7 -x NCCL_IGNORE_CPU_AFFINITY=1 /share2/PARAMcomms/param/train/comms/pt/comms.py --device rocm --master-ip hostname-1e707-b05-2 --b 4G --e 4G --n 100 --f 2 --z 0 --collective all_reduce --data-type float32
PARAM COMM environment: {'world_size': 8, 'local_size': 8, 'global_rank': 0, 'local_rank': 0}
backend: nccl nw-stack: pytorch-dist args.data_types: ['float32'] args.b: 4G args.e: 4G args.f: 2 args.z: 0 args.master_ip: hostname-1e707-b05-2
Hello from Rank 0: [Rank 0] host hostname-1e707-b05-2, device: cuda:0, local_rank: 0 world_size: 8, master_ip: hostname-1e707-b05-2
Hello from Rank 1: [Rank 1] host hostname-1e707-b05-2, device: cuda:1, local_rank: 1 world_size: 8, master_ip: hostname-1e707-b05-2
Hello from Rank 2: [Rank 2] host hostname-1e707-b05-2, device: cuda:2, local_rank: 2 world_size: 8, master_ip: hostname-1e707-b05-2
Hello from Rank 3: [Rank 3] host hostname-1e707-b05-2, device: cuda:3, local_rank: 3 world_size: 8, master_ip: hostname-1e707-b05-2
Hello from Rank 4: [Rank 4] host hostname-1e707-b05-2, device: cuda:4, local_rank: 4 world_size: 8, master_ip: hostname-1e707-b05-2
Hello from Rank 5: [Rank 5] host hostname-1e707-b05-2, device: cuda:5, local_rank: 5 world_size: 8, master_ip: hostname-1e707-b05-2
Hello from Rank 6: [Rank 6] host hostname-1e707-b05-2, device: cuda:6, local_rank: 6 world_size: 8, master_ip: hostname-1e707-b05-2
Hello from Rank 7: [Rank 7] host hostname-1e707-b05-2, device: cuda:7, local_rank: 7 world_size: 8, master_ip: hostname-1e707-b05-2
RCCL version : 2.24.3-HEAD:2c0eecf
HIP version : 6.5.50421-a90f5536a
ROCm version : 6.5.0.0-990-de37842
Hostname : hostname-1e707-b05-2
Librccl path : /share2/PARAMcomms/venv-param/lib64/python3.9/site-packages/torch/lib/librccl.so
[Rank 0] allSizes: [4294967296] element_size: 4 local_rank: 0, num_pg 1, groupSize 8
collective=all_reduce, src_ranks=None, dst_ranks=None
COMMS-RES total-size (B) nElementsPerRank nElementsPairPerRank Latency(us):p50 p75 p95 Min Max AlgBW(GB/s) BusBW(GB/s)
COMMS-RES-all_reduce-float32 4294967296 1073741824 ...
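As a sanity check on the sizes in this log, the arithmetic can be sketched as follows: `4G` corresponds to 4294967296 bytes, and with an element size of 4 bytes (float32) that gives 1073741824 elements per rank. The helpers `parse_size` and `elements_per_rank` are hypothetical and only mirror the numbers printed above; PARAM's own argument parsing may work differently.

```python
# Binary suffixes, matching args.b "4G" -> allSizes [4294967296] in the log.
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3}


def parse_size(text: str) -> int:
    """Convert a size string such as '4G' or '1' into a byte count."""
    text = text.strip().upper()
    if text and text[-1] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text)


def elements_per_rank(total_bytes: int, element_size: int) -> int:
    """Number of elements of the given byte width that fit in total_bytes."""
    return total_bytes // element_size


if __name__ == "__main__":
    total = parse_size("4G")
    print(total, elements_per_rank(total, 4))
```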
Screenshots
No screenshots. Output of the failing run:
+ mpirun --allow-run-as-root -np 8 -x NCCL_DEBUG=INFO -x PYTHONPATH=/usr/bin/python3 -host 1e707-b05-2:8 -map-by ppr:8:node --bind-to none --mca pml ucx --mca btl '^openib' -x PATH=/share2/PARAMcomms/venv-param/bin:/share2/amd-share/ompi4-install/bin:/share2/PARAMcomms/venv-param/bin:/home/amd/.local/bin:/home/amd/bin:/usr/share/Modules/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin -x LD_LIBRARY_PATH=/share2/amd-share/rccl_develop/build/release:/share2/amd-share/ompi4-install/lib: -x NCCL_IGNORE_CPU_AFFINITY=1 /share2/PARAMcomms/param/train/comms/pt/comms.py --device rocm --backend nccl --master-ip hostname-1e707-b05-2 -b 1 -e 1G -n 10 -f 2 -z 0 --collective all_reduce --data-type float32
CollectiveArgsMixin does not exist or module not found. Default to empty class.
Traceback (most recent call last):
File "/share2/PARAMcomms/param/train/comms/pt/comms.py", line 19, in <module>
from param_bench.train.comms.pt import comms_utils
File "/share2/PARAMcomms/venv-param/lib64/python3.9/site-packages/param_bench/train/comms/pt/comms_utils.py", line 25, in <module>
from param_bench.train.comms.pt.pytorch_backend_utils import (
File "/share2/PARAMcomms/venv-param/lib64/python3.9/site-packages/param_bench/train/comms/pt/pytorch_backend_utils.py", line 392, in <module>
device: str | None = None,
TypeError: unsupported operand type(s) for |: 'type' and 'NoneType'
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[503,1],0]
Exit code: 1
--------------------------------------------------------------------------
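The traceback can be reproduced outside PARAM with a small experiment. The sketch below defines the same kind of signature twice, once as-is and once with `from __future__ import annotations` at the top of the module, which postpones evaluation of annotations. The function `f` is a made-up stand-in for the real one, and whether the maintainers would prefer this import or a different spelling of the type hint is an open question.

```python
import sys

# A function definition shaped like the line in the traceback.
PLAIN = "def f(device: str | None = None):\n    return device\n"
# The same source with postponed evaluation of annotations enabled.
POSTPONED = "from __future__ import annotations\n" + PLAIN


def defines_ok(source: str) -> bool:
    """Execute the source as a module body and report whether it defines cleanly."""
    namespace = {}
    try:
        exec(source, namespace)
    except TypeError:
        return False
    return True


if __name__ == "__main__":
    print("python:", sys.version_info[:2])
    print("plain annotations:    ", defines_ok(PLAIN))
    print("postponed annotations:", defines_ok(POSTPONED))
```

On Python 3.9 the plain variant is expected to fail with the same TypeError, while the postponed variant defines without error. Note that the future import only affects annotations; a `str | None` expression evaluated elsewhere at runtime would still fail on 3.9.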