Latest CycleCloud (8.8.1) and Slurm clusters seem to have initialisation issues with GPU Nodes? #468

@garymansellricardo

I have just installed CycleCloud 8.8.1 with Slurm Template 4.0.4, and I am running Slurm 25.05.2, which comes with it.

Normal nodes seem to be fine, but GPU-enabled nodes all throw initialisation errors in the CC UI and show as failed (and as pending/drained in sinfo). Each failed node is then terminated and another node is tried, repeatedly. I have tried this with the OOTB Alma 8 and Alma 9 images and with my own custom Alma 9 images, for both scheduler and execute nodes.

[Three screenshots attached]

I have also seen these errors on another GPU node:

[Screenshot attached]

ChatGPT suggests that at least some of these errors are benign, and that the newer version of CC or the software versions in the latest HPC images are just being overly "picky" in what they test. Even so, the failures still stop my GPU solving jobs from running.

Any ideas what is causing this and how to remedy it?
