Failed to submit job on LUMI inside a container #26

@anni-moisala

Hi,

I am trying to run the package on LUMI. It requires Python 3.12, so I'm using a container image for that (/appl/local/containers/sif-images/lumi-pytorch-rocm-6.2.1-python-3.12-pytorch-20240918-vllm-4075b35.sif).

However, I run into issues when submitting batch jobs from inside the container.

I try to run schedule-eval with /usr/bin/ bind-mounted so that the Slurm commands are available:

singularity exec -B /usr/bin/ $SIF bash -c '$WITH_CONDA && source venv/bin/activate && oellm schedule-eval \
    --models "microsoft/DialoGPT-medium,EleutherAI/pythia-160m" \
    --tasks "hellaswag,mmlu" \
    --n_shot 5'

But I get this error:

ERROR Failed to submit job: Command '['sbatch']' returned non-zero exit status 127.
ERROR sbatch stderr: sbatch: error while loading shared libraries: libslurmfull.so: cannot open shared object file: No such file or directory

Without -B /usr/bin/, I get a squeue error instead: FileNotFoundError: [Errno 2] No such file or directory: 'squeue'
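
Exit status 127 together with the libslurmfull.so message suggests that binding /usr/bin/ makes the sbatch binary visible while the shared libraries it links against are still missing inside the container. As a rough diagnostic sketch (not a fix), the following lists the libraries the dynamic loader cannot resolve for a command. The function names parse_missing_libs and missing_libs are hypothetical helpers, and it assumes ldd and awk are available in the image.

```shell
# Read `ldd` output on stdin and print the names of libraries
# that the dynamic loader reports as "not found".
parse_missing_libs() {
    awk '/not found/ { print $1 }'
}

# Hypothetical helper: list unresolved shared libraries for a command on PATH.
# Assumes `ldd` exists in the environment where it is run.
missing_libs() {
    bin_path=$(command -v "$1") || { echo "$1: not found on PATH" >&2; return 2; }
    ldd "$bin_path" 2>/dev/null | parse_missing_libs
}
```

Running missing_libs sbatch in a shell started with singularity exec -B /usr/bin/ $SIF bash should then print libslurmfull.so plus any other unresolved libraries, which shows what else would have to be made visible inside the container.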

What can be done here? As far as I know, there are no modules on LUMI that provide Python 3.12.
