161 changes: 103 additions & 58 deletions src/job-exporter/build/job-exporter.common.dockerfile
@@ -16,75 +16,120 @@
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


############################
# builder: only for compiling python wheels
############################
FROM mcr.microsoft.com/mirror/nvcr/nvidia/cuda:12.0.1-runtime-ubuntu22.04 AS builder

ARG TARGETARCH

RUN set -eux; \
apt-get update; \
DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
ca-certificates \
python3-pip \
python3-dev \
build-essential \
gcc; \
rm -rf /var/lib/apt/lists/*

WORKDIR /w

# build wheels once
COPY requirements.txt /w/requirements.txt
RUN python3 -m pip install --no-cache-dir -U pip wheel && \
python3 -m pip wheel --no-cache-dir --wheel-dir /w/wheels \
-r /w/requirements.txt \
prometheus_client psutil filelock


Comment on lines +42 to +45
Copilot AI Jan 20, 2026

The package prometheus_client is being built twice as a wheel - once from requirements.txt (which pins it to version 0.20.0) on line 42, and again without version specification on line 43. This may cause version conflicts or unnecessary duplication. Similarly, the same packages (prometheus_client, psutil, filelock) are being installed again in lines 125-127 after already being installed from requirements.txt on line 124. Consider consolidating these package specifications into requirements.txt to ensure consistent versioning and avoid redundant installations.

Suggested change
- -r /w/requirements.txt \
- prometheus_client psutil filelock
+ -r /w/requirements.txt

Copilot uses AI. Check for mistakes.
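
A quick way to see which of the extra names on the `pip wheel` command line are already covered by requirements.txt is a small shell check. This is a minimal sketch, not part of the PR: `normalize_name` and `already_pinned` are hypothetical helpers, and the parsing only handles simple one-requirement-per-line files (no `-r` includes, URLs, or line continuations).

```shell
#!/bin/sh
# Hypothetical helpers (not part of this PR): report whether a package name
# already has an entry in a pip requirements file.

# Normalise a name roughly the way pip compares project names:
# lower-case, with '_' and '.' treated the same as '-'.
normalize_name() {
    printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' | sed 's/[_.]/-/g'
}

# already_pinned REQ_FILE PKG
# Exit status 0 if PKG has an entry in REQ_FILE, 1 otherwise.
already_pinned() {
    req_file=$1
    want=$(normalize_name "$2")
    while IFS= read -r line || [ -n "$line" ]; do
        # Drop comments, then cut at the first specifier, extra, marker or space.
        name=$(printf '%s\n' "${line%%#*}" | sed 's/[][<>=!~; ].*//')
        [ -n "$name" ] || continue
        if [ "$(normalize_name "$name")" = "$want" ]; then
            return 0
        fi
    done < "$req_file"
    return 1
}
```

Called as `already_pinned requirements.txt prometheus_client` in the build context, a zero exit status would suggest the name can be dropped from the command line and left to the pinned entry.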
############################
# runtime: final image
############################
FROM mcr.microsoft.com/mirror/nvcr/nvidia/cuda:12.0.1-runtime-ubuntu22.04

ARG TARGETARCH
# Register the ROCM package repository, and install rocm-dev package
ARG ROCM_VERSION=6.2.2
ARG AMDGPU_VERSION=6.2.2
ARG DCGM_TARGET_VERSION=1:4.4.1-1

RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
autoconf \
automake \
bash \
build-essential \
cmake \
curl \
file \
g++ \
git \
gnupg \
ibverbs-utils \
kmod \
libc++-dev \
libcap-dev \
libelf1 \
libgflags-dev \
libgtest-dev \
libnuma-dev \
libtool \
numactl \
pkg-config \
python3-dev \
python3-pip \
sudo \
unzip && \
if [ "$TARGETARCH" = "amd64" ]; then \
printf "Package: *\nPin: release o=repo.radeon.com\nPin-Priority: 600" | tee /etc/apt/preferences.d/rocm-pin-600 && \
curl -sL https://repo.radeon.com/rocm/rocm.gpg.key | apt-key add - && \
echo "deb https://repo.radeon.com/rocm/apt/$ROCM_VERSION/ jammy main" | tee /etc/apt/sources.list.d/rocm.list && \
echo "deb https://repo.radeon.com/amdgpu/$AMDGPU_VERSION/ubuntu jammy main" | tee /etc/apt/sources.list.d/amdgpu.list && \
apt-get update && \
DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends rocm-dev; \
fi

COPY src/Moneo /Moneo
# --------------------------
# base + REQUIRED apt upgrade
# --------------------------
RUN set -eux; \
apt-get update; \
apt-get upgrade -y; \
DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
bash \
ca-certificates \
curl \
gnupg \
wget \
python3 \
python3-pip; \
apt-get clean; \
rm -rf /var/lib/apt/lists/* /var/cache/apt/*

# Install RDC
RUN if [ "$TARGETARCH" = "amd64" ]; then sudo bash Moneo/src/worker/install/amd.sh; fi
# --------------------------
# ROCm (runtime only)
# --------------------------
RUN set -eux; \
if [ "$TARGETARCH" = "amd64" ]; then \
printf "Package: *\nPin: release o=repo.radeon.com\nPin-Priority: 600" \
> /etc/apt/preferences.d/rocm-pin-600; \
curl -sL https://repo.radeon.com/rocm/rocm.gpg.key | apt-key add -; \
echo "deb https://repo.radeon.com/rocm/apt/$ROCM_VERSION/ jammy main" \
> /etc/apt/sources.list.d/rocm.list; \
echo "deb https://repo.radeon.com/amdgpu/$AMDGPU_VERSION/ubuntu jammy main" \
Comment on lines +80 to +83
Copilot AI Jan 20, 2026

The command apt-key add is deprecated and will be removed in a future Ubuntu release. Consider using the recommended approach of placing the GPG key in /etc/apt/trusted.gpg.d/ or /usr/share/keyrings/ and referencing it with the signed-by option in the sources.list entry. For example: curl -sL https://repo.radeon.com/rocm/rocm.gpg.key | gpg --dearmor -o /usr/share/keyrings/rocm-archive-keyring.gpg and then use [signed-by=/usr/share/keyrings/rocm-archive-keyring.gpg] in the deb line.

Suggested change
- curl -sL https://repo.radeon.com/rocm/rocm.gpg.key | apt-key add -; \
- echo "deb https://repo.radeon.com/rocm/apt/$ROCM_VERSION/ jammy main" \
- > /etc/apt/sources.list.d/rocm.list; \
- echo "deb https://repo.radeon.com/amdgpu/$AMDGPU_VERSION/ubuntu jammy main" \
+ curl -sL https://repo.radeon.com/rocm/rocm.gpg.key | gpg --dearmor -o /usr/share/keyrings/rocm-archive-keyring.gpg; \
+ echo "deb [signed-by=/usr/share/keyrings/rocm-archive-keyring.gpg] https://repo.radeon.com/rocm/apt/$ROCM_VERSION/ jammy main" \
+ > /etc/apt/sources.list.d/rocm.list; \
+ echo "deb [signed-by=/usr/share/keyrings/rocm-archive-keyring.gpg] https://repo.radeon.com/amdgpu/$AMDGPU_VERSION/ubuntu jammy main" \

Copilot uses AI. Check for mistakes.
> /etc/apt/sources.list.d/amdgpu.list; \
apt-get update; \
DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends rdc; \
rm -rf /var/lib/apt/lists/*; \
fi
Comment on lines +76 to +88
Copilot AI Jan 20, 2026

The AMD RDC installation has been significantly simplified from building from source (with a specific cherry-picked commit 660c5afaf49630781c1059ba6d30bae21743c32f from amd-staging branch) to installing the prebuilt rdc package. This changes the RDC version and removes the custom patches that were previously applied. Verify that the packaged RDC version includes the necessary functionality from the cherry-picked commit, or that the commit is no longer needed for the target ROCm version 6.2.2. This change could impact AMD GPU monitoring functionality if the custom patches were critical.

Copilot uses AI. Check for mistakes.
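
One way to start on that verification is to record which `rdc` version apt actually resolved inside the built image, so it can be compared against the release that contains the cherry-picked commit. A minimal sketch, assuming a Debian/Ubuntu image where `dpkg-query` is available; `installed_version` is a hypothetical helper, and whether a given packaged version includes that commit still has to be checked by hand against the RDC history.

```shell
#!/bin/sh
# Hypothetical helper (not part of this PR): print the installed version of a
# Debian package, or "not-installed" when dpkg has no record of it or when
# dpkg-query itself is unavailable.
installed_version() {
    pkg=$1
    if ! command -v dpkg-query >/dev/null 2>&1; then
        echo "not-installed"
        return 0
    fi
    # ${Version} is a dpkg-query format field, not a shell variable,
    # hence the single quotes.
    version=$(dpkg-query -W -f='${Version}' "$pkg" 2>/dev/null) || version=""
    if [ -n "$version" ]; then
        echo "$version"
    else
        echo "not-installed"
    fi
}
```

Running `installed_version rdc` right after the install step would put the resolved version into the build log, which makes later regressions in AMD GPU monitoring easier to trace back to a package change.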

# Install DCGM
RUN sed -i 's/systemctl --now enable nvidia-dcgm/#&/' Moneo/src/worker/install/nvidia.sh && \
sed -i 's/systemctl start nvidia-dcgm/#&/' Moneo/src/worker/install/nvidia.sh && \
sudo bash Moneo/src/worker/install/nvidia.sh
# --------------------------
# DCGM (runtime only, same layer clean)
# --------------------------
RUN set -eux; \
apt-get update; \
DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
datacenter-gpu-manager-4-cuda12=${DCGM_TARGET_VERSION} \
datacenter-gpu-manager-4-core=${DCGM_TARGET_VERSION} \
datacenter-gpu-manager-4-proprietary-cuda12=${DCGM_TARGET_VERSION}; \
apt-get clean; \
rm -rf /var/lib/apt/lists/*
Copilot AI Jan 20, 2026

The DCGM Python bindings path in nvidia_exporter.py is incompatible with DCGM 4. The new Dockerfile installs datacenter-gpu-manager-4 which places Python bindings at /usr/share/datacenter-gpu-manager-4/bindings/python3, but nvidia_exporter.py still references /usr/local/dcgm/bindings/python3. The update-dcgm.py script that previously handled this path migration has been removed, which will cause the NVIDIA exporter to fail with ImportError when trying to import dcgm_fields and DcgmReader. The sys.path.append line needs to be updated to match the new DCGM 4 installation path.

Suggested change
- rm -rf /var/lib/apt/lists/*
+ rm -rf /var/lib/apt/lists/*
+ ENV PYTHONPATH=/usr/share/datacenter-gpu-manager-4/bindings/python3:${PYTHONPATH}

Copilot uses AI. Check for mistakes.
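
To confirm which bindings directory actually exists in the built image before relying on either `PYTHONPATH` or `sys.path.append`, a small probe can help. A minimal sketch: `find_dcgm_bindings` is a hypothetical helper, the two candidate paths are the ones named in the comment above, the file name `dcgm_fields.py` is an assumption derived from the `dcgm_fields` import, and the optional root argument exists only so the function can be exercised outside the image.

```shell
#!/bin/sh
# Hypothetical helper (not part of this PR): print the first candidate
# directory that contains the DCGM Python module dcgm_fields.py
# (file name assumed from the "import dcgm_fields" statement).
# Usage: find_dcgm_bindings [ROOT]   (ROOT defaults to the real filesystem root)
find_dcgm_bindings() {
    root=${1:-}
    for dir in \
        /usr/share/datacenter-gpu-manager-4/bindings/python3 \
        /usr/local/dcgm/bindings/python3
    do
        if [ -f "$root$dir/dcgm_fields.py" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    echo "no DCGM Python bindings found" >&2
    return 1
}
```

If the probe prints the DCGM 4 path, either the `ENV PYTHONPATH` suggestion or an updated `sys.path.append` in nvidia_exporter.py should be enough; if it prints nothing, the bindings may ship in a separate package.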

ENV PATH="${PATH}:/opt/rocm/bin"
COPY build/moneo-*-exporter_entrypoint.sh ./
COPY build/update-dcgm.py .
# --------------------------
# nerdctl
# --------------------------
ENV NERDCTL_VERSION=2.2.1
RUN set -eux; \
wget -O /tmp/nerdctl.tar.gz \
https://github.com/containerd/nerdctl/releases/download/v${NERDCTL_VERSION}/nerdctl-${NERDCTL_VERSION}-linux-${TARGETARCH}.tar.gz; \
mkdir -p /tmp/nerdctl; \
tar -xzf /tmp/nerdctl.tar.gz -C /tmp/nerdctl; \
mv /tmp/nerdctl/nerdctl /usr/local/bin/nerdctl; \
rm -rf /tmp/nerdctl* /tmp/nerdctl.tar.gz
Comment on lines +109 to +112
Copilot AI Jan 20, 2026

The nerdctl binary is downloaded from GitHub without checksum verification. This could pose a security risk if the download is compromised or if the download fails partially. Consider adding checksum verification using sha256sum or downloading and verifying the checksum file provided by the nerdctl releases. For example, the nerdctl releases include SHA256SUMS files that can be used to verify the integrity of the downloaded archive.

Suggested change
- mkdir -p /tmp/nerdctl; \
- tar -xzf /tmp/nerdctl.tar.gz -C /tmp/nerdctl; \
- mv /tmp/nerdctl/nerdctl /usr/local/bin/nerdctl; \
- rm -rf /tmp/nerdctl* /tmp/nerdctl.tar.gz
+ wget -O /tmp/nerdctl-SHA256SUMS \
+ https://github.com/containerd/nerdctl/releases/download/v${NERDCTL_VERSION}/SHA256SUMS; \
+ grep " nerdctl-${NERDCTL_VERSION}-linux-${TARGETARCH}.tar.gz$" /tmp/nerdctl-SHA256SUMS | sha256sum -c -; \
+ mkdir -p /tmp/nerdctl; \
+ tar -xzf /tmp/nerdctl.tar.gz -C /tmp/nerdctl; \
+ mv /tmp/nerdctl/nerdctl /usr/local/bin/nerdctl; \
+ rm -rf /tmp/nerdctl* /tmp/nerdctl.tar.gz /tmp/nerdctl-SHA256SUMS

Copilot uses AI. Check for mistakes.
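
One detail to watch in the suggestion above: `sha256sum -c` looks up each file by the name recorded in the checksum list, relative to the current directory, while the archive here is saved as /tmp/nerdctl.tar.gz, so the check as written would likely fail to find the file. A minimal sketch of a check that pairs the published hash with the local path instead, assuming GNU coreutils `sha256sum` and a checksum list in the usual `<hash>  <name>` format; `verify_sha256` is a hypothetical helper, not something this PR defines.

```shell
#!/bin/sh
# Hypothetical helper (not part of this PR).
# verify_sha256 FILE SUMS NAME
#   FILE: local path of the downloaded archive
#   SUMS: checksum list with lines of the form "<hash>  <name>"
#   NAME: the name under which the archive appears in SUMS
# Exit status 0 only if FILE matches the hash listed for NAME.
verify_sha256() {
    file=$1
    sums=$2
    name=$3
    # Binary-mode entries are written as "*name", so accept both spellings.
    expected=$(awk -v n="$name" '$2 == n || $2 == ("*" n) { print $1; exit }' "$sums")
    if [ -z "$expected" ]; then
        echo "no checksum listed for $name" >&2
        return 1
    fi
    # Re-pair the expected hash with the local path and let sha256sum compare.
    printf '%s  %s\n' "$expected" "$file" | sha256sum -c - >/dev/null 2>&1
}
```

In the `RUN` step this could be called as `verify_sha256 /tmp/nerdctl.tar.gz /tmp/nerdctl-SHA256SUMS "nerdctl-${NERDCTL_VERSION}-linux-${TARGETARCH}.tar.gz"` before extracting; with `set -eux` already in effect, a mismatch would abort the build.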

# For the job exporter
ENV NERDCTL_VERSION=2.1.3
RUN apt-get update && apt-get install --no-install-recommends -y wget ca-certificates
RUN wget -O /tmp/nerdctl.tar.gz https://github.com/containerd/nerdctl/releases/download/v${NERDCTL_VERSION}/nerdctl-${NERDCTL_VERSION}-linux-${TARGETARCH}.tar.gz && \
mkdir -p /tmp/nerdctl && \
tar -xzvf /tmp/nerdctl.tar.gz -C /tmp/nerdctl && \
mv /tmp/nerdctl/nerdctl /usr/local/bin/nerdctl && \
mkdir -p /job_exporter && \
rm -rf /tmp/nerdctl*
# --------------------------
# python runtime deps (from wheels)
# --------------------------

COPY requirements.txt /job_exporter/
RUN pip3 install -r /job_exporter/requirements.txt
COPY --from=builder /w/wheels /wheels
COPY requirements.txt /job_exporter/requirements.txt

RUN apt update && apt upgrade -y && apt-get clean && rm -rf /var/lib/apt/lists/*
RUN python3 -m pip install --no-cache-dir -U pip && \
Copilot AI Jan 20, 2026

There is trailing whitespace at the end of this line after the backslash continuation character. This should be removed for consistency and to avoid potential issues with shell script parsing.

Suggested change
- RUN python3 -m pip install --no-cache-dir -U pip && \
+ RUN python3 -m pip install --no-cache-dir -U pip && \

Copilot uses AI. Check for mistakes.
python3 -m pip install --no-cache-dir \
--no-index --find-links=/wheels \
-r /job_exporter/requirements.txt && \
python3 -m pip install --no-cache-dir \
--no-index --find-links=/wheels \
prometheus_client psutil filelock && \
Comment on lines +121 to +127
Copilot AI Jan 20, 2026

The packages prometheus_client, psutil, and filelock are being installed redundantly. They were already installed from requirements.txt on line 124 (prometheus_client is pinned to 0.20.0 in requirements.txt). Installing them again without version specifications may cause version conflicts or simply waste build time. Consider removing these redundant installations or ensuring all package versions are consistently managed through requirements.txt.

Suggested change
- RUN python3 -m pip install --no-cache-dir -U pip && \
- python3 -m pip install --no-cache-dir \
- --no-index --find-links=/wheels \
- -r /job_exporter/requirements.txt && \
- python3 -m pip install --no-cache-dir \
- --no-index --find-links=/wheels \
- prometheus_client psutil filelock && \
+ RUN python3 -m pip install --no-cache-dir -U pip && \
+ python3 -m pip install --no-cache-dir \
+ --no-index --find-links=/wheels \
+ -r /job_exporter/requirements.txt && \

Copilot uses AI. Check for mistakes.
rm -rf /wheels

# --------------------------
# app files
# --------------------------
COPY src/Moneo /Moneo
COPY src/*.py /job_exporter/
COPY build/moneo-*-exporter_entrypoint.sh ./
1 change: 0 additions & 1 deletion src/job-exporter/build/moneo-gpu-exporter_entrypoint.sh
@@ -9,7 +9,6 @@ if lsmod | grep -qi amdgpu; then
echo "AMD Exporter Started!"
elif lsmod | grep -qi nvidia; then
echo "NVIDIA Graphics card detected."
python3 /update-dcgm.py
# Launches NVIDIA DCGM Daemon
nohup nv-hostengine &
echo "DCGM Daemon Started!"
117 changes: 0 additions & 117 deletions src/job-exporter/build/update-dcgm.py

This file was deleted.

60 changes: 0 additions & 60 deletions src/job-exporter/src/Moneo/src/worker/install/amd.sh

This file was deleted.

20 changes: 0 additions & 20 deletions src/job-exporter/src/Moneo/src/worker/install/common.sh

This file was deleted.
