From a9d16066cbd998fa02468c776ffe3b81deb217f3 Mon Sep 17 00:00:00 2001 From: "Jason C. Nucciarone" Date: Mon, 2 Mar 2026 16:25:59 -0500 Subject: [PATCH 1/3] fix: update tutorial to use code blocks Other changes: - Fix inconsistencies in list bulletings Signed-off-by: Jason C. Nucciarone --- getting-started.md | 477 ++++++++++++++++----------------------------- 1 file changed, 166 insertions(+), 311 deletions(-) diff --git a/getting-started.md b/getting-started.md index df7901b..509f333 100644 --- a/getting-started.md +++ b/getting-started.md @@ -9,20 +9,25 @@ This tutorial takes you through multiple aspects of Charmed HPC, such as: By the end of this tutorial, you will have worked with a variety of open source projects, such as: -* Multipass -* Juju -* Charms -* Apptainer -* Ceph -* Slurm +- Multipass +- Juju +- Charms +- Apptainer +- Ceph +- Slurm -This tutorial assumes that you have had some exposure to high-performance computing concepts such as batch scheduling, but does not assume prior experience building HPC clusters. This tutorial also does not expect you to have any prior experience with the listed projects. +This tutorial assumes that you have had some exposure to high-performance computing concepts such +as job scheduling, but does not assume prior experience building HPC clusters. This tutorial +also does not expect you to have any prior experience with the listed projects above. :::{admonition} Using Charmed HPC in production :class: note -The Charmed HPC cluster built in this tutorial is for learning purposes and should not be used as the basis for a production HPC cluster. For more in-depth steps on how to deploy a fully operational Charmed HPC cluster, see [Charmed HPC's How-to guides](#howtos). + +The Charmed HPC cluster built in this tutorial is for learning purposes and should not be +used as the basis for a production HPC cluster. 
For more in-depth steps on how to deploy +a fully operational Charmed HPC cluster, see [Charmed HPC's How-to guides](#howtos). ::: ## Prerequisites @@ -35,43 +40,56 @@ To successfully complete this tutorial, you will need: ## Create a virtual machine with Multipass -First, download a copy of the cloud initialization (cloud-init) file, [charmed-hpc-tutorial-cloud-init.yml], that defines the underlying cloud infrastructure for the virtual machine. +First, download a copy of the cloud initialization (cloud-init) file, +[charmed-hpc-tutorial-cloud-init.yml], that defines the underlying cloud infrastructure +for the virtual machine. + +For this tutorial, the file includes instructions for creating and configuring your LXD +machine cloud `localhost` with the `charmed-hpc-controller` Juju controller and creating +workload and submit scripts for the example jobs. The cloud-init step will be completed +as part of the virtual machine launch and will not be something you need to set up +manually. -For this tutorial, the file includes instructions for creating and configuring your LXD machine cloud `localhost` with the `charmed-hpc-controller` Juju controller and creating workload and submit scripts for the example jobs. The cloud-init step will be completed as part of the virtual machine launch and will not be something you need to set up manually. 
You can expand the dropdown below to view the full cloud-init file before downloading onto your local system: +You can expand the dropdown below to view the full cloud-init file before +downloading onto your local system: ::::{dropdown} charmed-hpc-tutorial-cloud-init.yml -:::{literalinclude} /reuse/tutorial/charmed-hpc-tutorial-cloud-init.yml +:::{literalinclude} /reuse/tutorial/charmed-hpc-tutorial-cloud-init.yml :caption: [charmed-hpc-tutorial-cloud-init.yml] :language: yaml :linenos: ::: :::: -From the local directory holding the cloud-init file, launch a virtual machine using Multipass: +Now, from the local directory holding the cloud-init file, launch a virtual machine using Multipass: [charmed-hpc-tutorial-cloud-init.yml]: /reuse/tutorial/charmed-hpc-tutorial-cloud-init.yml -:::{terminal} -:user: ubuntu -:host: local -:copy: - -multipass launch 24.04 --name charmed-hpc-tutorial --cloud-init charmed-hpc-tutorial-cloud-init.yml --memory 16G --disk 40G --cpus 8 --timeout 1000 +:::{code-block} shell +multipass launch 24.04 \ + --name charmed-hpc-tutorial \ + --cloud-init charmed-hpc-tutorial-cloud-init.yml \ + --memory 16G --disk 40G --cpus 8 --timeout 1000 ::: -The virtual machine launch process should take five minutes or less to complete, but may take longer due to network strength. Upon completion of the launch process, check the status of cloud-init to confirm that all processes completed successfully. +The virtual machine launch process should take five minutes or less to complete, but may +take longer depending on the speed of your network. Upon completion of the launch process, +check the status of cloud-init to confirm that all processes completed successfully. 
-Enter the virtual machine:
-
-:::{terminal}
-:user: ubuntu
-:host: local
-:copy:
+First, enter the `charmed-hpc-tutorial` virtual machine:

+:::{code-block} shell
multipass shell charmed-hpc-tutorial
:::

-Then check `cloud-init status`{l=shell}:
+Then, use `cloud-init status`{l=shell} to check if the `charmed-hpc-tutorial` virtual
+machine was successfully initialized:
+
+:::{code-block} shell
+cloud-init status --long
+:::
+
+The output of `cloud-init status`{l=shell} will be similar to the following:

:::{terminal}
:user: ubuntu
@@ -89,144 +107,91 @@ errors: []
recoverable_errors: {}
:::

-If the status shows `done` and there are no errors, then you are ready to move on to deploying the cluster charms.
+If the status shows `done` and there are no errors, then you are ready to move on to
+deploying the cluster charms.

## Deploy Slurm and shared filesystem

-Next, you will deploy Slurm and the filesystem. The Slurm components of your deployment will be composed of:
-* The Slurm management daemon: `slurmctld`
-* Two Slurm compute daemons: `slurmd`, grouped in a partition named `tutorial-partition`
-* The authentication and credential kiosk daemon: `sackd` to provide the login node
-First, create the `slurm` model on your cloud `localhost`:
+Next, you will deploy Slurm and the filesystem.
The Slurm components of your deployment
+will be composed of:

-:::{terminal}
-:user: ubuntu
-:host: charmed-hpc-tutorial
-:copy:
+- The Slurm management daemon: `slurmctld`
+- Two Slurm compute daemons: `slurmd`, grouped in a partition named `tutorial-partition`
+- The authentication and credential kiosk daemon: `sackd` to provide the login node
+
+First, use `juju add-model`{l=shell} to create the `slurm` model on your cloud `localhost`:

+:::{code-block} shell
juju add-model slurm localhost
:::

-Then deploy the Slurm components:
+Then, use `juju deploy`{l=shell} to deploy `sackd`, `slurmctld`, and `slurmd`:

-:::{terminal}
-:user: ubuntu
-:host: charmed-hpc-tutorial
-:copy:
+:::{code-block} shell
+juju deploy slurmctld \
+  --base "ubuntu@24.04" \
+  --channel "edge" \
+  --constraints="virt-type=virtual-machine"

-juju deploy slurmctld --base "ubuntu@24.04" --channel "edge" --constraints="virt-type=virtual-machine"
-:::
+juju deploy slurmd tutorial-partition \
+  --num-units 2 \
+  --base "ubuntu@24.04" \
+  --channel "edge" \
+  --config default-node-state=idle \
+  --constraints="virt-type=virtual-machine"

-:::{terminal}
-:user: ubuntu
-:host: charmed-hpc-tutorial
-:copy:
-
-juju deploy slurmd tutorial-partition -n 2 --base "ubuntu@24.04" --channel "edge" --constraints="virt-type=virtual-machine"
+juju deploy sackd \
+  --base "ubuntu@24.04" \
+  --channel "edge" \
+  --constraints="virt-type=virtual-machine"
:::

-:::{terminal}
-:user: ubuntu
-:host: charmed-hpc-tutorial
-:copy:
-
-juju deploy sackd --base "ubuntu@24.04" --channel "edge" --constraints="virt-type=virtual-machine"
-:::
-
-And integrate them together:
-
-:::{terminal}
-:user: ubuntu
-:host: charmed-hpc-tutorial
-:copy:
+After that, use `juju integrate`{l=shell} to integrate the Slurm services together:

+:::{code-block} shell
juju integrate slurmctld sackd
-:::
-
-:::{terminal}
-:user: ubuntu
-:host: charmed-hpc-tutorial
-:copy:
-
juju integrate slurmctld tutorial-partition
:::

-Next, you will deploy the
filesystem pieces, which are:
+Next, use `juju deploy`{l=shell} to deploy the filesystem pieces, which are:

- `microceph`, the distributed storage system
- `ceph-fs` to expose the MicroCeph cluster as a shared filesystem using [CephFS](https://docs.ceph.com/en/reef/cephfs/)
-- `filesystem-client` to mount the filesystem, named `scratch`
-
-:::{terminal}
-:user: ubuntu
-:host: charmed-hpc-tutorial
-:copy:
-
-juju deploy microceph --channel latest/edge --constraints="virt-type=virtual-machine mem=4G root-disk=20G"
-:::
-
-:::{terminal}
-:user: ubuntu
-:host: charmed-hpc-tutorial
-:copy:
+- `filesystem-client` to mount the filesystem, named `scratch`

-juju deploy ceph-fs --channel latest/edge
-:::
+:::{code-block} shell
+juju deploy microceph \
+  --channel latest/edge \
+  --constraints="virt-type=virtual-machine mem=4G root-disk=20G"

-:::{terminal}
-:user: ubuntu
-:host: charmed-hpc-tutorial
-:copy:
+juju deploy ceph-fs \
+  --channel latest/edge

-juju deploy filesystem-client scratch --channel latest/edge --config mountpoint=/scratch
+juju deploy filesystem-client scratch \
+  --channel latest/edge \
+  --config mountpoint=/scratch
:::

-:::{terminal}
-:user: ubuntu
-:host: charmed-hpc-tutorial
-:copy:
+After that, use `juju add-storage`{l=shell} to add storage to microceph:

+:::{code-block} shell
juju add-storage microceph/0 osd-standalone=loop,2G,3
:::

-And then integrate the filesystem components together:
-
-:::{terminal}
-:user: ubuntu
-:host: charmed-hpc-tutorial
-:copy:
+Then, use `juju integrate`{l=shell} to integrate the filesystem components
+together and with Slurm:

+:::{code-block} shell
juju integrate scratch ceph-fs
-:::
-
-:::{terminal}
-:user: ubuntu
-:host: charmed-hpc-tutorial
-:copy:
-
juju integrate ceph-fs microceph
-:::
-
-:::{terminal}
-:user: ubuntu
-:host: charmed-hpc-tutorial
-:copy:
-
juju integrate scratch tutorial-partition
+juju integrate scratch sackd
:::

-:::{terminal}
-:user: ubuntu
-:host: charmed-hpc-tutorial
-:copy:
-
-juju integrate
sackd scratch
-:::

-After a few minutes, the Slurm deployment will become active. The output of the
-`juju status`{l=shell} command should be similar to the following:
+Your Charmed HPC cluster will become active within a few minutes. The output of the
+`juju status`{l=shell} command will be similar to the following:

:::{terminal}
:user: ubuntu
@@ -269,129 +234,51 @@ Machine  State    Address        Inst id        Base          AZ

-## Get compute nodes ready for jobs
-
-Now that Slurm and the filesystem have been successfully deployed, the next step is to set up the compute nodes themselves. The compute nodes must be moved from the `down` state to the `idle` state so that they can start having jobs ran on them. First, check that the compute nodes are still down, which will show something similar to:
-
-:::{terminal}
-:copy:
-
-juju exec -u sackd/0 -- sinfo
-
-PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
-tutorial-partition up infinite 2 down juju-e16200-[1-2]
-:::
-
-Then, bring up the compute nodes:
-
-:::{terminal}
-:user: ubuntu
-:host: charmed-hpc-tutorial
-:copy:
-
-juju run tutorial-partition/0 node-configured
-:::
-
-:::{terminal}
-:user: ubuntu
-:host: charmed-hpc-tutorial
-:copy:
-
-juju run tutorial-partition/1 node-configured
-:::
-
-And verify that the `STATE` is now set to `idle`, which should now show:
-
-:::{terminal}
-:user: ubuntu
-:host: charmed-hpc-tutorial
-:copy:
-
-juju exec -u sackd/0 -- sinfo
-
-PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
-tutorial-partition up infinite 2 idle juju-e16200-[1-2]
-:::
-
-

## Copy files onto cluster

-The workload files that were created during the cloud initialization step now need to be copied onto the cluster filesystem from the virtual machine filesystem.
First you will make the new example directories, then set appropriate permissions, and finally copy the files over:

-:::{terminal}
-:user: ubuntu
-:host: charmed-hpc-tutorial
-:copy:
+The workload files that were created during the cloud initialization step now need
+to be copied onto the cluster filesystem from the virtual machine filesystem.

-juju exec -u sackd/0 -- sudo mkdir /scratch/mpi_example /scratch/apptainer_example
-:::
+You will use `juju exec`{l=shell} and `juju scp`{l=shell} to make the new
+example directories, set appropriate permissions, and then copy the files over:

-:::{terminal}
-:user: ubuntu
-:host: charmed-hpc-tutorial
-:copy:
+:::{code-block} shell
+juju exec -u sackd/0 -- \
+  sudo mkdir /scratch/mpi_example /scratch/apptainer_example

-juju exec -u sackd/0 -- sudo chown $USER: /scratch/*
-:::
+juju exec -u sackd/0 -- \
+  sudo chown $USER: /scratch/*

-:::{terminal}
-:user: ubuntu
-:host: charmed-hpc-tutorial
-:copy:
+juju scp submit_hello.sh mpi_hello_world.c \
+  sackd/0:/scratch/mpi_example

-juju scp submit_hello.sh mpi_hello_world.c sackd/0:/scratch/mpi_example
+juju scp submit_apptainer_mascot.sh generate.py workload.py workload.def \
+  sackd/0:/scratch/apptainer_example
:::

-:::{terminal}
-:user: ubuntu
-:host: charmed-hpc-tutorial
-:copy:
-
-juju scp submit_apptainer_mascot.sh generate.py workload.py workload.def sackd/0:/scratch/apptainer_example
-:::

-The `/scratch` directory is mounted on the compute nodes and will be used to read and write from during the batch jobs.
+The `/scratch` directory is mounted on the compute nodes and will be used to read
+and write from during the batch jobs.

## Run a batch job

-In the following steps, you will compile a small Hello World MPI script and run it by submitting a batch job to Slurm.
+In the following steps, you will compile a small Hello World MPI script and run it
+by submitting a batch job to Slurm.
### Compile First, SSH into the login node, `sackd/0`: -:::{terminal} -:user: ubuntu -:host: charmed-hpc-tutorial -:copy: - +:::{code-block} shell juju ssh sackd/0 ::: -This will place you in your home directory `/home/ubuntu`. Next, you will need to move to the `/scratch/mpi_example` directory, install the Open MPI libraries needed for compiling, and then compile the _mpi_hello_world.c_ file by running the `mpicc` command: - -:::{terminal} -:user: ubuntu -:host: login -:copy: +This will place you in your home directory `/home/ubuntu`. Next, you will need to move +to the `/scratch/mpi_example` directory, install the Open MPI libraries needed for +compiling, and then compile the _mpi_hello_world.c_ file by running the `mpicc` command: +:::{code-block} shell cd /scratch/mpi_example -::: - -:::{terminal} -:user: ubuntu -:host: login -:copy: - sudo apt install build-essential openmpi-bin libopenmpi-dev -::: - -:::{terminal} -:user: ubuntu -:host: login -:copy: - mpicc -o mpi_hello_world mpi_hello_world.c ::: @@ -415,17 +302,15 @@ For quick referencing, the two files for the MPI Hello World example are provide [submit_hello.sh]: /reuse/tutorial/submit_hello.sh ### Submit batch job -Now, submit your batch job to the queue using `sbatch`{l=shell}: -:::{terminal} -:user: ubuntu -:host: login -:copy: +Now, submit your batch job to the queue using `sbatch`{l=shell}: +:::{code-block} shell sbatch submit_hello.sh ::: -Your job will complete after a few seconds. The generated _output.txt_ file will look similar to the following: +Your job will complete after a few seconds. The generated _output.txt_ file will look +similar to the following: :::{terminal} :user: ubuntu @@ -438,51 +323,36 @@ Hello world from processor juju-640476-1, rank 0 out of 2 processors Hello world from processor juju-640476-2, rank 1 out of 2 processors ::: -The batch job successfully spread the MPI job across two nodes that were able to report back their MPI rank to a shared output file. 
+The batch job successfully spread the MPI job across two nodes that were able to report
+back their MPI rank to a shared output file.

## Run a container job

-Next you will go through the steps to generate a random sample of Ubuntu mascot votes and plot the results. The process requires Python and a few specific libraries so you will use Apptainer to build a container job and run the job on the cluster.
+Next, you will go through the steps to generate a random sample of Ubuntu mascot votes
+and plot the results. The process requires Python and a few specific libraries, so
+you will use Apptainer to build a container image and run the job on the cluster.

### Set up Apptainer

-Apptainer must be deployed and integrated with the existing Slurm deployment using Juju and these steps need to be completed from the `charmed-hpc-tutorial` environment; to return to that environment from within `sackd/0`, use the `exit`{l=shell} command.
-
-Deploy and integrate Apptainer:
+Apptainer must be deployed and integrated with the existing Slurm deployment using Juju,
+and these steps need to be completed from the `charmed-hpc-tutorial` environment; to
+return to that environment from within `sackd/0`, use the `exit`{l=shell} command.
-:::{terminal} -:user: ubuntu -:host: charmed-hpc-tutorial -:copy: +First, use `juju deploy`{l=shell} to deploy Apptainer: -juju deploy apptainer +:::{code-block} shell +juju deploy apptainer --channel latest/edge ::: -:::{terminal} -:user: ubuntu -:host: charmed-hpc-tutorial -:copy: - -juju integrate apptainer tutorial-partition -::: - -:::{terminal} -:user: ubuntu -:host: charmed-hpc-tutorial -:copy: - -juju integrate apptainer sackd -::: - -:::{terminal} -:user: ubuntu -:host: charmed-hpc-tutorial -:copy: +Next, use `juju integrate`{l=shell} to integrate Apptainer with Slurm: +:::{code-block} shell juju integrate apptainer slurmctld +juju integrate apptainer sackd +juju integrate apptainer tutorial-partition ::: -After a few minutes, `juju status` should look similar to the following: +After a few minutes, the output `juju status` will look similar to the following: :::{terminal} :user: ubuntu @@ -495,26 +365,26 @@ Model Controller Cloud/Region Version SLA Timest slurm charmed-hpc-controller localhost/localhost 3.6.9 unsupported 17:34:46-04:00 App Version Status Scale Charm Channel Rev Exposed Message -apptainer 1.4.2 active 3 apptainer latest/stable 6 no +apptainer 1.4.2 active 3 apptainer latest/stable 6 no ceph-fs 19.2.1 active 1 ceph-fs latest/edge 196 no Unit is ready scratch active 3 filesystem-client latest/edge 20 no Integrated with `cephfs` provider microceph active 1 microceph latest/edge 161 no (workload) charm is ready -sackd 23.11.4-1.2u... active 1 sackd latest/edge 38 no +sackd 23.11.4-1.2u... active 1 sackd latest/edge 38 no slurmctld 23.11.4-1.2u... active 1 slurmctld latest/edge 120 no primary - UP -tutorial-partition 23.11.4-1.2u... active 2 slurmd latest/edge 141 no +tutorial-partition 23.11.4-1.2u... 
active 2 slurmd latest/edge 141 no Unit Workload Agent Machine Public address Ports Message ceph-fs/0* active idle 5 10.196.78.232 Unit is ready microceph/1* active idle 6 10.196.78.238 (workload) charm is ready -sackd/0* active idle 3 10.196.78.117 6818/tcp - apptainer/2 active idle 10.196.78.117 +sackd/0* active idle 3 10.196.78.117 6818/tcp + apptainer/2 active idle 10.196.78.117 scratch/2 active idle 10.196.78.117 Mounted filesystem at `/scratch` slurmctld/0* active idle 0 10.196.78.49 6817,9092/tcp primary - UP -tutorial-partition/0 active idle 1 10.196.78.244 6818/tcp - apptainer/0 active idle 10.196.78.244 +tutorial-partition/0 active idle 1 10.196.78.244 6818/tcp + apptainer/0 active idle 10.196.78.244 scratch/0* active idle 10.196.78.244 Mounted filesystem at `/scratch` -tutorial-partition/1* active idle 2 10.196.78.26 6818/tcp - apptainer/1* active idle 10.196.78.26 +tutorial-partition/1* active idle 2 10.196.78.26 6818/tcp + apptainer/1* active idle 10.196.78.26 scratch/1 active idle 10.196.78.26 Mounted filesystem at `/scratch` Machine State Address Inst id Base AZ Message @@ -529,31 +399,15 @@ Machine State Address Inst id Base AZ ### Build the container image using `apptainer` Before you can submit your container workload to your Charmed HPC cluster, -you must build the container image from the build recipe. The build recipe file _workload.def_ defines the environment and libraries that will be in the container image. - -To build the image, return to the cluster login node, move to the example directory, and call `apptainer build`: +you must build the container image from the build recipe. The build recipe file +_workload.def_ defines the environment and libraries that will be in the container image. 
-:::{terminal} -:user: ubuntu -:host: login -:copy: +To build the image, log back into to the cluster login node, change to the +example directory, and run `apptainer build`: +:::{code-block} shell juju ssh sackd/0 -::: - -:::{terminal} -:user: ubuntu -:host: login -:copy: - cd /scratch/apptainer_example -::: - -:::{terminal} -:user: ubuntu -:host: login -:copy: - apptainer build workload.sif workload.def ::: @@ -596,19 +450,19 @@ The files for the Apptainer Mascot Vote example are provided here for reference. ### Use the image to run jobs -Now that you have built the container image, you can submit a job to the cluster that uses the new _workload.sif_ image to generate one million lines in a table and then uses the resulting _favorite_lts_mascot.csv_ to build the bar plot: - -:::{terminal} -:user: ubuntu -:host: login -:copy: +Now that you have built the container image, you can submit a job to the +cluster that uses the new _workload.sif_ image to generate one million +lines in a table and then uses the resulting _favorite_lts_mascot.csv_ to +build the bar plot: +:::{code-block} shell sbatch submit_apptainer_mascot.sh ::: To view the status of the job while it is running, run `squeue`. 
-Once the job has completed, view the generated bar plot that will look similar to the following:
+Once the job has completed, view the generated bar plot that will look
+similar to the following:

:::{terminal}
:user: ubuntu
@@ -635,20 +489,21 @@ cat graph.out

In this tutorial, you:

-* Deployed and integrated Slurm and a shared filesystem
-* Launched an MPI batch job and saw cross-node communication results
-* Built a container image with Apptainer and used it to run a batch job and generate a bar plot
-
-Now that you have completed the tutorial, if you would like to completely remove the virtual machine, return to your local terminal and `multipass delete` the virtual machine as follows:
+- Deployed and integrated Slurm and a shared filesystem
+- Launched an MPI batch job and saw cross-node communication results
+- Built a container image with Apptainer and used it to run a batch job and
+  generate a bar plot

+Now that you have completed the tutorial, if you would like to completely remove the
+virtual machine, return to your local terminal and `multipass delete` the virtual
+machine as follows:

-:::{terminal}
-:user: ubuntu
-:host: local
-:copy:
-
-multipass delete -p charmed-hpc-tutorial
+:::{code-block} shell
+multipass delete --purge charmed-hpc-tutorial
:::

## Next steps

-Now that you have gotten started with Charmed HPC, check out the {ref}`explanation` section for details on important concepts and the {ref}`howtos` for how to use more of Charmed HPC's features.
+Now that you have gotten started with Charmed HPC, check out the
+{ref}`explanation` section for details on important concepts and the
+{ref}`howtos` for how to use more of Charmed HPC's features.

From e19aa7e0f5079bffcaa831affa8716cb7bbd20a3 Mon Sep 17 00:00:00 2001
From: "Jason C. Nucciarone"
Date: Mon, 2 Mar 2026 16:24:35 -0500
Subject: [PATCH 2/3] fix: update apptainer commands to use code blocks instead
 of terminal blocks

Signed-off-by: Jason C.
Nucciarone --- howto/integrate/integrate-with-apptainer.md | 42 ++++--- howto/use/use-apptainer.md | 127 ++++++++------------ 2 files changed, 75 insertions(+), 94 deletions(-) diff --git a/howto/integrate/integrate-with-apptainer.md b/howto/integrate/integrate-with-apptainer.md index f9bfbef..85c2d3e 100644 --- a/howto/integrate/integrate-with-apptainer.md +++ b/howto/integrate/integrate-with-apptainer.md @@ -20,43 +20,45 @@ for a high-level introduction to administering Apptainer. - A [deployed Slurm cluster](#howto-setup-deploy-slurm). -## Deploy and integrate Apptainer +## Deploy Apptainer -First, in the same model holding your Slurm deployment, deploy Apptainer with `juju deploy`{l=shell}: +First, use `juju deploy`{l=shell} to deploy {term}`Apptainer` in the `slurm` model on +your `charmed-hpc` machine cloud: -:::{terminal} -:copy: +:::{code-block} shell juju deploy apptainer ::: :::{include} /reuse/common/tip-determine-current-juju-model.txt ::: -Now integrate Apptainer with Slurm using `juju integrate`{l=shell}: +## Integrate Apptainer with Slurm -:::{terminal} -:copy: +Next, use `juju integrate`{l=shell} to integrate Apptainer with Slurm: + +:::{code-block} shell juju integrate apptainer sackd +juju integrate apptainer slurmd +juju integrate apptainer slurmctld ::: -:::{terminal} -:copy: +Apptainer will install itself on all the `sackd` and `slurmd` units, and will +share its configuration with the Slurm controller service, `slurmctld`. -juju integrate apptainer slurmd -::: +## Verify that Apptainer is integrated with Slurm -:::{terminal} -:copy: -juju integrate apptainer slurmctld -::: +Use `juju exec`{l=text} to submit a test job. -Apptainer will be installed on all the `sackd` and `slurmd` units, and will share its configuration -with the Slurm controller, `slurmctld`. 
+For example, to submit a test job where the runtime environment is Ubuntu 22.04, run: -## Test that Apptainer is integrated with Slurm +:::{code-block} shell +juju exec -u sackd/0 -- \ + srun --partition slurmd --container=docker://ubuntu:22.04 \ + cat /etc/os-release | grep ^VERSION +::: -Submit a test job with `juju exec`{l=text}. If Apptainer has been successfully integrated with -Slurm, the output of your test job will be similar to the following: +If Apptainer has been successfully integrated with Slurm, the output of the test job +will be similar to the following: :::{terminal} :copy: diff --git a/howto/use/use-apptainer.md b/howto/use/use-apptainer.md index 498c6de..a5d02a2 100644 --- a/howto/use/use-apptainer.md +++ b/howto/use/use-apptainer.md @@ -6,13 +6,13 @@ relatedlinks: "[Apptainer user documenation](https://apptainer.org/docs/ # Use Apptainer Apptainer can be used as a container runtime environment on Charmed HPC for running -containerized workloads. This guide provides examples of using Apptainer on Charmed HPC +containerized jobs. This guide provides examples of using Apptainer on Charmed HPC to accomplish different tasks. :::{admonition} New to Apptainer? :class: note -If you're unfamiliar with using Apptainer in your workloads, see the [Apptainer user quick start guide](https://apptainer.org/docs/user/latest/quick_start.html) +If you're unfamiliar with using Apptainer in your jobs, see the [Apptainer user quick start guide](https://apptainer.org/docs/user/latest/quick_start.html) for a high-level introduction to using Apptainer on HPC clusters. ::: @@ -29,39 +29,28 @@ to the sections below for the different ways that you can use Apptainer on Charm ## Create a container image -Apptainer can create container images on your cluster that can then be used within -your workloads. The sections below demonstrate the different ways Apptainer can create -container images on your Charmed HPC cluster. 
- -### Using a pre-existing container image from a public container registry +### Use an image from a public container registry Apptainer can pull pre-existing container images from public container registries. -For example, to pull a Valkey container image from Dockerhub and start a local -Valkey service on your cluster: -:::{terminal} -:copy: -:host: login +For example, to pull a Valkey container image from Dockerhub and start a local +Valkey service on your cluster, run: +:::{code-block} shell apptainer pull valkey.sif docker://ubuntu/valkey:7.2.10-24.04_stable -::: - -:::{terminal} -:copy: -:host: login - apptainer overlay create --size 1024 valkey.img +apptainer instance run --overlay valkey.img valkey.sif valkey ::: -:::{terminal} -:copy: -:host: login - -apptainer instance run --overlay valkey.img valkey.sif valkey +Next, use `apptainer exec`{l=text} to test your connection to the Valkey service: -INFO: instance started successfully +:::{code-block} shell +apptainer exec instance://valkey valkey-cli ping ::: +If the Valkey service is active, the output of `apptainer exec`{l=text} will be similar +to the following: + :::{terminal} :copy: :host: login @@ -71,10 +60,14 @@ apptainer exec instance://valkey valkey-cli ping PONG ::: -Use `apptainer help pull`{l=text} to see the full list of public container registries -Apptainer can pull container images from. +:::{admonition} Using images from other container registries +:class: note + +You can use `apptainer help pull`{l=text} to view the full list of public +container registries Apptainer can pull images from. +::: -### Building your own custom container image +### Build your own image :::{admonition} Before attempting to build your own container images :class: warning @@ -86,6 +79,7 @@ building container images on specific cluster resources such as login nodes. ::: Apptainer can build container images using instructions from a container definition file. 
+
For example, to build an Ubuntu 24.04 LTS-based container image with the `gfortran`
compiler pre-installed, create the container definition file _fortran-runtime.def_:

@@ -103,17 +97,13 @@ from: ubuntu:24.04

exit 0
:::

-Now use `apptainer build`{l=shell} to build the container image:
-
-:::{terminal}
-:copy:
-:host: login
+Then, use `apptainer build`{l=shell} to build your container image:

+:::{code-block} shell
apptainer build fortran-runtime.sif fortran-runtime.def
:::

-The built container image can now be used to compile and run Fortran workloads. For
-example, create a simple Fortran program the prints "Hello world!" in the file _hello.f90_:
+Next, create a simple Fortran program that prints "Hello world!" in the file _hello.f90_:

:::{code-block} fortran
:caption: hello.f90

PROGRAM hello_world
  PRINT *, "Hello world!"
END PROGRAM hello_world
:::

-Now use the built container to compile and run your Fortran program:
-
-:::{terminal}
-:copy:
-:host: login
+Now use your container with `apptainer exec`{l=text} to compile and run your Fortran program:

+:::{code-block} shell
apptainer exec fortran-runtime.sif gfortran --output hello hello.f90
+apptainer exec fortran-runtime.sif ./hello
:::

+The output of `apptainer exec`{l=text} will be similar to the following:
+
:::{terminal}
:copy:
:host: login

apptainer exec fortran-runtime.sif ./hello

Hello world!
:::

-## Provide your workload's runtime environment
+## Provide your job's runtime environment

-Apptainer can provide the runtime environment for your workloads.
-The sections below demonstrate the different ways Apptainer can provide
-your workload's runtime environment.
+### Use the `apptainer`{l=shell} command

-### Using the `apptainer`{l=shell} command directly in your workload
+The `apptainer`{l=shell} command can be called directly in scripts to run job steps
+in a container instance.
-The `apptainer`{l=shell} command can be called directly in scripts to perform operations -inside a container instance. First, declare in your batch script the partition you want your -workload to run within and call the `apptainer`{l=shell} command from directly within your script. -For example, to select `compute` as the partition your workload will run within, and run some -Python code using a containerized Python 3.13 interpreter, create the batch script _job.batch_: +For example, to run some Python code using a containerized Python 3.13 interpreter, create +the job script _job.batch_. In this job script, you will call the `apptainer`{l=shell} +command directly, and you will submit the job to the partition `compute`: :::{code-block} shell :caption: job.batch @@ -166,18 +153,19 @@ apptainer --silent exec python-3.13.sif \ python3 -c 'import sys; print(f"Hello from Python {sys.version}!")' ::: -Now submit the _job.batch_ script to Slurm with `sbatch`{l=shell}: - -:::{terminal} -:copy: -:host: login +Now use `sbatch`{l=shell} to submit your job script to Slurm: +:::{code-block} shell sbatch job.batch +::: + +Then, use `cat`{l=shell} to view the results of your job after it completes: -Submitted batch job 1 +:::{code-block} shell +cat job.out ::: -Use `cat`{l=shell} to view the results of your workload after it completes: +The output of `cat job.out`{l=shell} will be similar to the following: :::{terminal} :copy: @@ -216,35 +204,26 @@ sbatch my-job.batch ::: --> -### Using the `--container` flag with `srun`{l=shell} +### Use the `--container` flag with `srun`{l=shell} Jobs submitted to Slurm with `srun`{l=shell} can be run inside a container instance using Apptainer. -First, declare both the partition you want your workload to run within and the container image -that will be used by Apptainer to provide the runtime environment of your workload. 
-For example, to select `compute` as the partition your workload will run within, -and use an Ubuntu 22.04 LTS container image as the runtime environment: - -:::{terminal} -:copy: -:host: login -PARTITION=slurmd -::: +For example, to submit a job to the partition `compute` that will use an Ubuntu 22.04 LTS container +image as the runtime environment, run: -:::{terminal} -:copy: -:host: login - -CONTAINER=docker://ubuntu:22.04 +:::{code-block} shell +srun --partition compute --container docker://ubuntu:22.04 \ + cat /etc/os-release | grep ^VERSION ::: -Now run your workload with `srun`{l=shell}: +The output of `srun`{l=shell} will be similar to the following: :::{terminal} :copy: :host: login -srun --partition $PARTITION --container $CONTAINER cat /etc/os-release | grep ^VERSION +srun --partition compute --container docker://ubuntu:22.04 \ + cat /etc/os-release | grep ^VERSION INFO: Converting OCI blobs to SIF format INFO: Starting build... From fa50455feb58ac662535d511bf8e42f979811539 Mon Sep 17 00:00:00 2001 From: "Jason C. Nucciarone" Date: Mon, 2 Mar 2026 16:25:21 -0500 Subject: [PATCH 3/3] fix: update "cleanup slurm" to use code blocks Other changes: - Updated the content to be more inline with our current docs. Signed-off-by: Jason C. 
Nucciarone --- howto/cleanup/cleanup-slurm.md | 86 +++++++++++------------- reuse/common/tip-listing-juju-models.txt | 18 +++++ 2 files changed, 56 insertions(+), 48 deletions(-) create mode 100644 reuse/common/tip-listing-juju-models.txt diff --git a/howto/cleanup/cleanup-slurm.md b/howto/cleanup/cleanup-slurm.md index 59433c5..325e3ca 100644 --- a/howto/cleanup/cleanup-slurm.md +++ b/howto/cleanup/cleanup-slurm.md @@ -1,5 +1,10 @@ +--- +relatedlinks: "[`juju destroy-model` documentation](https://documentation.ubuntu.com/juju/3.6/reference/juju-cli/list-of-juju-cli-commands/destroy-model/)" +--- + + (howto-cleanup-slurm)= -# How to clean up Slurm deployments +# How to clean up Slurm :::{admonition} Removing all Charmed HPC resources? :class: note @@ -12,64 +17,49 @@ including Slurm, in a single step. This how-to guide shows you how to remove a [previously deployed Slurm workload manager](#howto-setup-deploy-slurm) in a Charmed HPC cluster. -## Destroying the Slurm model - -To destroy a Slurm deployment, the name of the Juju model containing the Slurm resources is required. -Run the following command to list all Juju models: - -:::{terminal} -:copy: -juju models +## Destroy Slurm -Controller: charmed-hpc-controller +:::{admonition} Data loss warning +:class: warning -Model Cloud/Region Type Status Machines Units Access Last connection -controller localhost/localhost lxd available 1 1 admin just now -slurm* localhost/localhost lxd available 6 6 admin never connected +Destroying your Slurm deployment may result in **permanent data loss**. Ensure all data you wish to +preserve has been migrated to a safe location before proceeding, or consider using the flag +`--release-storage` with `juju destroy-model`{l=shell} to release the deployment's storage +rather than destroy it. ::: -Locate the model containing Slurm resources, here the name is `slurm`. 
Run the following command,
-read the warnings, and enter the model name when prompted to destroy it and all associated storage:
-
-:::{admonition} Data loss warning
-:class: warning
+Use `juju destroy-model`{l=shell} to destroy your Slurm deployment. You will need to provide
+the name of the model your Slurm deployment is located in. For example, to destroy the Slurm
+deployment located in the `slurm` model, run:
-Destroying storage may result in **permanent data loss**. Ensure all data you wish to preserve has
-been migrated to a safe location before proceeding or consider using flag `--release-storage` to
-release the storage rather than destroy it.
+:::{code-block} shell
+juju destroy-model --no-prompt --destroy-storage slurm
 :::
-:::{terminal}
-:copy:
-juju destroy-model --destroy-storage slurm
-
-WARNING This command will destroy the "slurm" model and affect the following resources. It cannot be stopped.
-
- - 6 machines will be destroyed
- - machine list: "0 (juju-e88112-0)" "1 (juju-e88112-1)" "2 (juju-e88112-2)" "3 (juju-e88112-3)" "4 (juju-e88112-4)" "5 (juju-e88112-5)"
- - 6 applications will be removed
- - application list: "mysql" "sackd" "slurmctld" "slurmd" "slurmdbd" "slurmrestd"
- - 1 filesystem and 0 volume will be destroyed
-
-To continue, enter the name of the model to be unregistered: slurm
-Destroying model
-Waiting for model to be removed, 6 machine(s), 6 application(s), 1 filesystems(s)..........
-Waiting for model to be removed, 2 machine(s), 3 application(s), 1 filesystems(s)...
-Waiting for model to be removed, 2 machine(s), 2 application(s), 1 filesystems(s)...
-Waiting for model to be removed, 2 machine(s), 1 application(s)...
-Waiting for model to be removed, 1 machine(s).....
-Model destroyed.
+:::{include} /reuse/common/tip-listing-juju-models.txt
 :::

-## Force-destroying a stuck model
+## Forcibly destroy a stuck Slurm deployment

-If resources in the model are in an error state the `destroy-model` process may become stuck,
-indicated by repeated `Waiting for model to be removed` messages. In this case, add the `--force`
-flag to the command to remove resources while ignoring errors:
+Your model may become stuck if any of Slurm's services are in an error state during
+the model cleanup process. You can tell that your model is stuck if you see repeated
+`Waiting for model to be removed` messages printed to the terminal.

-:::{terminal}
-juju destroy-model --destroy-storage --force slurm
+If your Slurm model is stuck, add the `--force` flag to `juju destroy-model`{l=shell} to
+destroy your Slurm deployment and ignore errors:
+
+:::{code-block} shell
+juju destroy-model --no-prompt --destroy-storage --force slurm
 :::

 See the [Juju `destroy-model` documentation](https://documentation.ubuntu.com/juju/3.6/reference/juju-cli/list-of-juju-cli-commands/destroy-model/)
-for the implications of this flag and details of further available options.
\ No newline at end of file
+for the implications of using the `--force` flag and details of further
+available options.
+
+## Next steps
+
+Now that you have destroyed your Slurm deployment, you can also clean up your cloud resources:
+
+- {ref}`howto-cleanup-cloud-resources`
+
+You can also revisit {ref}`howto-setup-deploy-slurm` if you want to create a new Slurm deployment.
diff --git a/reuse/common/tip-listing-juju-models.txt b/reuse/common/tip-listing-juju-models.txt new file mode 100644 index 0000000..76877b2 --- /dev/null +++ b/reuse/common/tip-listing-juju-models.txt @@ -0,0 +1,18 @@ +::::{dropdown} Tip: List available Juju models + +`juju models`{l=shell} can be used to determine the models you have access to: + +:::{terminal} +:copy: +juju models + +Controller: charmed-hpc-controller + +Model Cloud/Region Type Status Machines Cores Units Access Last connection +controller charmed-hpc/default lxd available 1 - 1 admin just now +identity charmed-hpc/default lxd available 0 - - admin 2026-02-13 +slurm* charmed-hpc/default lxd available 7 4 8 admin 2026-02-26 +storage charmed-hpc/default lxd available 0 - - admin 2026-02-09 +::: + +::::
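After a cleanup like the one in the "cleanup slurm" patch, it can be handy to script a check that the `slurm` model is really gone. Below is a minimal sketch in Python, assuming `juju models --format json` output shaped like the trimmed sample; the `model_exists` helper and the sample data are illustrative, not part of the patched docs:

```python
import json

# Trimmed, hypothetical sample of `juju models --format json` output,
# as it might look after `juju destroy-model --no-prompt --destroy-storage slurm`.
SAMPLE = """
{
  "models": [
    {"short-name": "controller", "name": "admin/controller"},
    {"short-name": "storage", "name": "admin/storage"}
  ]
}
"""

def model_exists(models_json: str, short_name: str) -> bool:
    """Return True if a model with the given short name is still listed."""
    data = json.loads(models_json)
    return any(m.get("short-name") == short_name for m in data.get("models", []))

# The slurm model should be gone, while unrelated models remain.
print(model_exists(SAMPLE, "slurm"))    # prints False
print(model_exists(SAMPLE, "storage"))  # prints True
```

On a real controller, the same check could read live output via `subprocess.run(["juju", "models", "--format", "json"], capture_output=True)`; the inline sample just keeps the sketch self-contained.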