I have just installed CycleCloud 8.8.1 with Slurm Template 4.0.4, and I am running Slurm 25.05.2, which comes with it.
Normal nodes seem to be fine, but GPU-enabled nodes all throw initialisation errors in the CycleCloud UI and show as failed (and as pending/drained in sinfo) before they are terminated, after which another node is tried, and the cycle repeats. I have tried this with the out-of-the-box Alma 8 and Alma 9 images and with my own custom Alma 9 images, for both scheduler and execute nodes.
I have also seen these errors on another GPU node:
ChatGPT suggests that at least some of these are benign, and that the newer version of CycleCloud, or the software versions in the latest HPC images, are just being overly "picky" in their checks. Even so, the failures still stop my GPU solving jobs from running.
Any ideas what is causing this and how to remedy it?