Merge branch 'develop' into samuel/downgrade-numpy

85451f3
Open

Downgrade NumPy < 2.0 in pyt_huggingface.ubuntu.amd.Dockerfile #73

ROCm Repo Management API / Jenkins failed Feb 6, 2026 in 26m 55s

the-matrix/Matrix - arch = 'gfx908'/models/pyt_huggingface_bert-gfx908: error in 'error' step

the-matrix / Matrix - arch = 'gfx908' / Matrix - arch = 'gfx908' / models / pyt_huggingface_gpt2-gfx908 / pyt_huggingface_gpt2-gfx908 / Shell Script

Error in sh step, with arguments:

    madengine run --tags pyt_huggingface_gpt2 --live-output -o perf_gfx908.csv 2>&1 | tee madengine.run.log
    if grep -i -e '= EXCEPTION =' -e 'unrecognized arguments:' -e 'RuntimeError:' madengine.run.log 1>/dev/null; then
      echo Found error/exception during madengine command run
      exit 1
    fi

script returned exit code 1
Build log
[2026-02-06T17:17:36.314Z] + madengine run --tags pyt_huggingface_gpt2 --live-output -o perf_gfx908.csv
[2026-02-06T17:17:36.314Z] + tee madengine.run.log
[2026-02-06T17:17:36.759Z] MAD_MINIO environment variable is not set.
[2026-02-06T17:17:36.759Z] MAD_MINIO is using default values.
[2026-02-06T17:17:36.759Z] Running models on container
[2026-02-06T17:17:36.759Z] > if [ -f 'ctx_test' ]; then cat ctx_test; else echo 'None'; fi || true
[2026-02-06T17:17:36.759Z] > if [ -f "$(which apt)" ]; then echo 'HOST_UBUNTU'; elif [ -f "$(which yum)" ]; then echo 'HOST_CENTOS'; elif [ -f "$(which zypper)" ]; then echo 'HOST_SLES'; elif [ -f "$(which tdnf)" ]; then echo 'HOST_AZURE'; else echo 'Unable to detect Host OS'; fi || true
[2026-02-06T17:17:36.759Z] > cat /proc/sys/kernel/numa_balancing || true
[2026-02-06T17:17:36.759Z] Warning: numa balancing is OFF ...
[2026-02-06T17:17:36.759Z] > bash -c 'if [[ -f /usr/bin/nvidia-smi ]] && $(/usr/bin/nvidia-smi > /dev/null 2>&1); then echo "NVIDIA"; elif [[ -f /opt/rocm/bin/amd-smi ]]; then echo "AMD"; elif [[ -f /usr/local/bin/amd-smi ]]; then echo "AMD"; else echo "Unable to detect GPU vendor"; fi || true'
[2026-02-06T17:17:36.759Z] > amd-smi list --csv | tail -n +3 | wc -l
[2026-02-06T17:17:36.759Z] > /opt/rocm/bin/rocminfo |grep -o -m 1 'gfx.*'
[2026-02-06T17:17:36.759Z] > amd-smi static -g 0 | grep MARKET_NAME: | cut -d ':' -f 2
[2026-02-06T17:17:37.199Z] > hipconfig --version | cut -d'.' -f1,2
[2026-02-06T17:17:37.199Z] > /opt/rocm/bin/rocminfo |grep -o -m 1 'gfx.*'
[2026-02-06T17:17:37.199Z] > amd-smi static -g 0 | grep MARKET_NAME: | cut -d ':' -f 2
[2026-02-06T17:17:37.632Z] > cat /opt/rocm/.info/version | cut -d'-' -f1
[2026-02-06T17:17:37.632Z] > grep -r drm_render_minor /sys/devices/virtual/kfd/kfd/topology/nodes
[2026-02-06T17:17:37.632Z] > grep -r unique_id /sys/devices/virtual/kfd/kfd/topology/nodes
[2026-02-06T17:17:37.632Z] > rocm-smi --showuniqueid | grep 'Unique.*:'
[2026-02-06T17:17:37.632Z] Traceback (most recent call last):
[2026-02-06T17:17:37.632Z]   File "/home/jenkins/workspace/DLM_Public-MAD-CI_PR-73/venv/lib/python3.10/site-packages/madengine/core/context.py", line 462, in get_gpu_renderD_nodes
[2026-02-06T17:17:37.632Z]     raise KeyError(f"Unique ID '{unique_id}' from rocm-smi not found in KFD mapping")
[2026-02-06T17:17:37.632Z] KeyError: "Unique ID 'N/A' from rocm-smi not found in KFD mapping"
[2026-02-06T17:17:37.633Z] 
[2026-02-06T17:17:37.633Z] During handling of the above exception, another exception occurred:
[2026-02-06T17:17:37.633Z] 
[2026-02-06T17:17:37.633Z] Traceback (most recent call last):
[2026-02-06T17:17:37.633Z]   File "/home/jenkins/workspace/DLM_Public-MAD-CI_PR-73/venv/lib/python3.10/site-packages/madengine/core/context.py", line 465, in get_gpu_renderD_nodes
[2026-02-06T17:17:37.633Z]     raise RuntimeError(f"Failed to map unique ID from line '{line}': {e}")
[2026-02-06T17:17:37.633Z] RuntimeError: Failed to map unique ID from line 'GPU[0]		: Unique ID: N/A': "Unique ID 'N/A' from rocm-smi not found in KFD mapping"
[2026-02-06T17:17:37.633Z] 
[2026-02-06T17:17:37.633Z] The above exception was the direct cause of the following exception:
[2026-02-06T17:17:37.633Z] 
[2026-02-06T17:17:37.633Z] Traceback (most recent call last):
[2026-02-06T17:17:37.633Z]   File "/home/jenkins/workspace/DLM_Public-MAD-CI_PR-73/venv/bin/madengine", line 6, in <module>
[2026-02-06T17:17:37.633Z]     sys.exit(main())
[2026-02-06T17:17:37.633Z]   File "/home/jenkins/workspace/DLM_Public-MAD-CI_PR-73/venv/lib/python3.10/site-packages/madengine/mad.py", line 283, in main
[2026-02-06T17:17:37.633Z]     result = args.func(args)
[2026-02-06T17:17:37.633Z]   File "/home/jenkins/workspace/DLM_Public-MAD-CI_PR-73/venv/lib/python3.10/site-packages/madengine/mad.py", line 37, in run_models
[2026-02-06T17:17:37.633Z]     run_models = RunModels(args=args)
[2026-02-06T17:17:37.633Z]   File "/home/jenkins/workspace/DLM_Public-MAD-CI_PR-73/venv/lib/python3.10/site-packages/madengine/tools/run_models.py", line 157, in __init__
[2026-02-06T17:17:37.633Z]     self.context = Context(
[2026-02-06T17:17:37.633Z]   File "/home/jenkins/workspace/DLM_Public-MAD-CI_PR-73/venv/lib/python3.10/site-packages/madengine/core/context.py", line 123, in __init__
[2026-02-06T17:17:37.633Z]     self.ctx["gpu_renderDs"] = self.get_gpu_renderD_nodes()
[2026-02-06T17:17:37.633Z]   File "/home/jenkins/workspace/DLM_Public-MAD-CI_PR-73/venv/lib/python3.10/site-packages/madengine/core/context.py", line 524, in get_gpu_renderD_nodes
[2026-02-06T17:17:37.633Z]     raise RuntimeError(f"Error in get_gpu_renderD_nodes: {e}") from e
[2026-02-06T17:17:37.633Z] RuntimeError: Error in get_gpu_renderD_nodes: Failed to map unique ID from line 'GPU[0]		: Unique ID: N/A': "Unique ID 'N/A' from rocm-smi not found in KFD mapping"
[2026-02-06T17:17:37.633Z] + grep -i -e = EXCEPTION = -e unrecognized arguments: -e RuntimeError: madengine.run.log
[2026-02-06T17:17:37.633Z] + echo Found error/exception during madengine command run
[2026-02-06T17:17:37.633Z] Found error/exception during madengine command run
[2026-02-06T17:17:37.633Z] + exit 1
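
The traceback above comes down to one lookup: `rocm-smi --showuniqueid` reports the literal string `N/A` for GPU[0], and that string cannot match any `unique_id` read from the KFD topology. The sketch below reproduces that failure mode and shows one possible way to tolerate it. The function names, the regular expression, and the shape of the KFD mapping (ID string to render minor) are assumptions for illustration, not madengine's actual code.

```python
import re

# Assumed line shape, taken from the log: "GPU[0]\t\t: Unique ID: N/A"
UNIQUE_ID_LINE = re.compile(r"GPU\[(\d+)\]\s*:\s*Unique ID:\s*(\S+)")


def parse_unique_id(line):
    """Return (gpu_index, unique_id), or None if the line has another shape."""
    match = UNIQUE_ID_LINE.search(line)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


def map_render_nodes(smi_lines, kfd_unique_to_render):
    """Map GPU index -> render minor; collect GPUs whose ID cannot be mapped.

    kfd_unique_to_render is a hypothetical dict keyed by the unique ID string
    exactly as rocm-smi prints it (any hex/decimal normalisation is omitted).
    """
    mapped, skipped = {}, []
    for line in smi_lines:
        parsed = parse_unique_id(line)
        if parsed is None:
            continue
        gpu, unique_id = parsed
        if unique_id == "N/A" or unique_id not in kfd_unique_to_render:
            # Record the GPU instead of raising, so the caller decides
            # whether a missing unique ID is fatal.
            skipped.append((gpu, unique_id))
            continue
        mapped[gpu] = kfd_unique_to_render[unique_id]
    return mapped, skipped


if __name__ == "__main__":
    lines = ["GPU[0]\t\t: Unique ID: N/A"]
    print(map_render_nodes(lines, {}))
```

Returning the skipped GPUs rather than raising is only one design option; whether a GPU without a unique ID should abort the run is a policy decision for the tool.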

the-matrix / Matrix - arch = 'gfx908' / Matrix - arch = 'gfx908' / models / pyt_huggingface_gpt2-gfx908 / pyt_huggingface_gpt2-gfx908 / Error signal

Error in error step, with arguments pyt_huggingface_gpt2-gfx908 threw "hudson.AbortException: script returned exit code 1".

pyt_huggingface_gpt2-gfx908 threw "hudson.AbortException: script returned exit code 1".

the-matrix / Matrix - arch = 'gfx908' / Matrix - arch = 'gfx908' / models / pyt_huggingface_bert-gfx908 / pyt_huggingface_bert-gfx908 / Shell Script

Error in sh step, with arguments:

    madengine run --tags pyt_huggingface_bert --live-output -o perf_gfx908.csv 2>&1 | tee madengine.run.log
    if grep -i -e '= EXCEPTION =' -e 'unrecognized arguments:' -e 'RuntimeError:' madengine.run.log 1>/dev/null; then
      echo Found error/exception during madengine command run
      exit 1
    fi

script returned exit code 1
Build log
[2026-02-06T17:17:48.663Z] + madengine run --tags pyt_huggingface_bert --live-output -o perf_gfx908.csv
[2026-02-06T17:17:48.663Z] + tee madengine.run.log
[2026-02-06T17:17:49.094Z] MAD_MINIO environment variable is not set.
[2026-02-06T17:17:49.094Z] MAD_MINIO is using default values.
[2026-02-06T17:17:49.094Z] Running models on container
[2026-02-06T17:17:49.094Z] > if [ -f 'ctx_test' ]; then cat ctx_test; else echo 'None'; fi || true
[2026-02-06T17:17:49.094Z] > if [ -f "$(which apt)" ]; then echo 'HOST_UBUNTU'; elif [ -f "$(which yum)" ]; then echo 'HOST_CENTOS'; elif [ -f "$(which zypper)" ]; then echo 'HOST_SLES'; elif [ -f "$(which tdnf)" ]; then echo 'HOST_AZURE'; else echo 'Unable to detect Host OS'; fi || true
[2026-02-06T17:17:49.094Z] > cat /proc/sys/kernel/numa_balancing || true
[2026-02-06T17:17:49.094Z] Warning: numa balancing is OFF ...
[2026-02-06T17:17:49.094Z] > bash -c 'if [[ -f /usr/bin/nvidia-smi ]] && $(/usr/bin/nvidia-smi > /dev/null 2>&1); then echo "NVIDIA"; elif [[ -f /opt/rocm/bin/amd-smi ]]; then echo "AMD"; elif [[ -f /usr/local/bin/amd-smi ]]; then echo "AMD"; else echo "Unable to detect GPU vendor"; fi || true'
[2026-02-06T17:17:49.094Z] > amd-smi list --csv | tail -n +3 | wc -l
[2026-02-06T17:17:49.094Z] > /opt/rocm/bin/rocminfo |grep -o -m 1 'gfx.*'
[2026-02-06T17:17:49.094Z] > amd-smi static -g 0 | grep MARKET_NAME: | cut -d ':' -f 2
[2026-02-06T17:17:49.532Z] > hipconfig --version | cut -d'.' -f1,2
[2026-02-06T17:17:49.532Z] > /opt/rocm/bin/rocminfo |grep -o -m 1 'gfx.*'
[2026-02-06T17:17:49.532Z] > amd-smi static -g 0 | grep MARKET_NAME: | cut -d ':' -f 2
[2026-02-06T17:17:49.962Z] > cat /opt/rocm/.info/version | cut -d'-' -f1
[2026-02-06T17:17:49.962Z] > grep -r drm_render_minor /sys/devices/virtual/kfd/kfd/topology/nodes
[2026-02-06T17:17:49.962Z] > grep -r unique_id /sys/devices/virtual/kfd/kfd/topology/nodes
[2026-02-06T17:17:49.962Z] > rocm-smi --showuniqueid | grep 'Unique.*:'
[2026-02-06T17:17:49.962Z] Traceback (most recent call last):
[2026-02-06T17:17:49.962Z]   File "/home/jenkins/workspace/DLM_Public-MAD-CI_PR-73/venv/lib/python3.10/site-packages/madengine/core/context.py", line 462, in get_gpu_renderD_nodes
[2026-02-06T17:17:49.962Z]     raise KeyError(f"Unique ID '{unique_id}' from rocm-smi not found in KFD mapping")
[2026-02-06T17:17:49.962Z] KeyError: "Unique ID 'N/A' from rocm-smi not found in KFD mapping"
[2026-02-06T17:17:49.962Z] 
[2026-02-06T17:17:49.962Z] During handling of the above exception, another exception occurred:
[2026-02-06T17:17:49.962Z] 
[2026-02-06T17:17:49.962Z] Traceback (most recent call last):
[2026-02-06T17:17:49.962Z]   File "/home/jenkins/workspace/DLM_Public-MAD-CI_PR-73/venv/lib/python3.10/site-packages/madengine/core/context.py", line 465, in get_gpu_renderD_nodes
[2026-02-06T17:17:49.962Z]     raise RuntimeError(f"Failed to map unique ID from line '{line}': {e}")
[2026-02-06T17:17:49.962Z] RuntimeError: Failed to map unique ID from line 'GPU[0]		: Unique ID: N/A': "Unique ID 'N/A' from rocm-smi not found in KFD mapping"
[2026-02-06T17:17:49.962Z] 
[2026-02-06T17:17:49.962Z] The above exception was the direct cause of the following exception:
[2026-02-06T17:17:49.962Z] 
[2026-02-06T17:17:49.962Z] Traceback (most recent call last):
[2026-02-06T17:17:49.962Z]   File "/home/jenkins/workspace/DLM_Public-MAD-CI_PR-73/venv/bin/madengine", line 6, in <module>
[2026-02-06T17:17:49.963Z]     sys.exit(main())
[2026-02-06T17:17:49.963Z]   File "/home/jenkins/workspace/DLM_Public-MAD-CI_PR-73/venv/lib/python3.10/site-packages/madengine/mad.py", line 283, in main
[2026-02-06T17:17:49.963Z]     result = args.func(args)
[2026-02-06T17:17:49.963Z]   File "/home/jenkins/workspace/DLM_Public-MAD-CI_PR-73/venv/lib/python3.10/site-packages/madengine/mad.py", line 37, in run_models
[2026-02-06T17:17:49.963Z]     run_models = RunModels(args=args)
[2026-02-06T17:17:49.963Z]   File "/home/jenkins/workspace/DLM_Public-MAD-CI_PR-73/venv/lib/python3.10/site-packages/madengine/tools/run_models.py", line 157, in __init__
[2026-02-06T17:17:49.963Z]     self.context = Context(
[2026-02-06T17:17:49.963Z]   File "/home/jenkins/workspace/DLM_Public-MAD-CI_PR-73/venv/lib/python3.10/site-packages/madengine/core/context.py", line 123, in __init__
[2026-02-06T17:17:49.963Z]     self.ctx["gpu_renderDs"] = self.get_gpu_renderD_nodes()
[2026-02-06T17:17:49.963Z]   File "/home/jenkins/workspace/DLM_Public-MAD-CI_PR-73/venv/lib/python3.10/site-packages/madengine/core/context.py", line 524, in get_gpu_renderD_nodes
[2026-02-06T17:17:49.963Z]     raise RuntimeError(f"Error in get_gpu_renderD_nodes: {e}") from e
[2026-02-06T17:17:49.963Z] RuntimeError: Error in get_gpu_renderD_nodes: Failed to map unique ID from line 'GPU[0]		: Unique ID: N/A': "Unique ID 'N/A' from rocm-smi not found in KFD mapping"
[2026-02-06T17:17:50.098Z] + grep -i -e = EXCEPTION = -e unrecognized arguments: -e RuntimeError: madengine.run.log
[2026-02-06T17:17:50.098Z] + echo Found error/exception during madengine command run
[2026-02-06T17:17:50.098Z] Found error/exception during madengine command run
[2026-02-06T17:17:50.098Z] + exit 1
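
The last four log lines show what actually fails the step: `madengine` is piped into `tee`, so without `pipefail` the pipeline's exit status is that of `tee`, and the script instead greps the captured log for known error markers. A minimal Python sketch of the same check follows. The three patterns are copied from the shell step; the function name and everything else are illustrative assumptions.

```python
import re

# Patterns copied from the grep invocation in the shell step.
ERROR_PATTERNS = ("= EXCEPTION =", "unrecognized arguments:", "RuntimeError:")


def log_has_error(log_text, patterns=ERROR_PATTERNS):
    """Return True if any pattern occurs in the log, ignoring case.

    Mirrors `grep -i -e P1 -e P2 -e P3`; patterns are escaped so they
    are matched as literal text rather than as regular expressions.
    """
    combined = "|".join(re.escape(p) for p in patterns)
    return re.search(combined, log_text, flags=re.IGNORECASE) is not None


if __name__ == "__main__":
    sample = "RuntimeError: Error in get_gpu_renderD_nodes: ..."
    # A caller would exit non-zero here, as the shell step does with `exit 1`.
    print(log_has_error(sample))
```

Because the check is a plain substring match, any line containing `RuntimeError:` fails the step, including the chained exceptions in the traceback above.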

the-matrix / Matrix - arch = 'gfx908' / Matrix - arch = 'gfx908' / models / pyt_huggingface_bert-gfx908 / pyt_huggingface_bert-gfx908 / Error signal

Error in error step, with arguments pyt_huggingface_bert-gfx908 threw "hudson.AbortException: script returned exit code 1".

pyt_huggingface_bert-gfx908 threw "hudson.AbortException: script returned exit code 1".

Details

  • Declarative: Checkout SCM (30 sec)
    • resetbuild (1.5 sec)
    • the-matrix (2 min 45 sec)
      • Matrix - arch = 'gfx908' (6 ms)
        • Matrix - arch = 'gfx908' (2 min 28 sec)
          • models (1 min 56 sec)
            • pyt_huggingface_gpt2-gfx908 (8 ms)
              • pyt_huggingface_gpt2-gfx908 (29 sec)
                Error: script returned exit code 1
                Error: pyt_huggingface_gpt2-gfx908 threw "hudson.AbortException: script returned exit code 1".
            • pyt_huggingface_bert-gfx908 (45 sec)
              • pyt_huggingface_bert-gfx908 (41 sec)
                Error: script returned exit code 1
                Error: pyt_huggingface_bert-gfx908 threw "hudson.AbortException: script returned exit code 1".
      • Matrix - arch = 'MI250' (6 ms)
        • Matrix - arch = 'MI250' (34 sec)
          • models (14 sec)
      • Matrix - arch = 'MI250_CA' (8 ms)
        • Matrix - arch = 'MI250_CA' (34 sec)
          • models (14 sec)
      • Matrix - arch = 'MI250X-A1' (5 ms)
        • Matrix - arch = 'MI250X-A1' (33 sec)
          • models (14 sec)
      • Matrix - arch = 'MI300X_BANFF' (6 ms)
        • Matrix - arch = 'MI300X_BANFF' (33 sec)
          • models (14 sec)
      • Matrix - arch = 'MI300X_GT' (6 ms)
        • Matrix - arch = 'MI300X_GT' (33 sec)
          • models (14 sec)
      • Matrix - arch = 'gfx1100' (7 ms)
        • Matrix - arch = 'gfx1100' (33 sec)
          • models (14 sec)
      • Matrix - arch = 'A100' (7 ms)
        • Matrix - arch = 'A100' (33 sec)
          • models (14 sec)
      • Matrix - arch = 'H100' (8 ms)
        • Matrix - arch = 'H100' (33 sec)
          • models (14 sec)
      • Matrix - arch = 'dlmodels' (7 ms)
        • Matrix - arch = 'dlmodels' (33 sec)
          • models (14 sec)
      • Matrix - arch = 'oci-64' (50 sec)
        • Matrix - arch = 'oci-64' (33 sec)
          • models (14 sec)