This repository contains a simple example Hydra project and shows how to use it with a Slurm cluster and WandB integration. This README serves as a tutorial that walks you through the individual steps.
Create a new virtual environment (using conda, mamba, or virtualenv) and install the repository as a local package (from the root of the repository):
```bash
pip install -e .
```

Have a look at the source files in the src directory and the main.py entry point. This project implements a simple 1D regression task with an MLP model using torch and should be fairly easy to understand with a fundamental machine learning background. Notice that every class takes a config dict as input. This is a common pattern in Hydra projects.
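For illustration, such a class might look roughly like the following sketch (simplified; the class name and config keys like "hidden_dim" are placeholders, not the exact code from src):

```python
import torch.nn as nn


class MLP(nn.Module):
    """Sketch of a model class that is configured entirely through a config dict."""

    def __init__(self, config: dict):
        super().__init__()
        # "hidden_dim" and "num_layers" are illustrative key names.
        hidden_dim = config["hidden_dim"]
        layers = [nn.Linear(1, hidden_dim), nn.Tanh()]
        for _ in range(config["num_layers"] - 1):
            layers += [nn.Linear(hidden_dim, hidden_dim), nn.Tanh()]
        layers.append(nn.Linear(hidden_dim, 1))  # 1D regression: one input, one output
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)
```

Hydra fills such a dict from the corresponding subconfig (here, the algorithm subconfig), so the class itself never hard-codes hyperparameters.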
Hydra is a powerful framework for configuring complex applications. It allows you to define hyperparameter configurations and compose them in a hierarchical way. This is particularly useful when you want to combine experiments from multiple submodules.
For example, you can define your algorithm and your dataset as separate modules and then combine them in a single experiment. In this repository, we have two datasets (line and sine 1D regression) and a simple MLP model as our algorithm.
The entry point of the project is the main.py file. In order to execute it, you have to provide a config file. You do this as a command line argument:
```bash
python main.py --config-name exp_1_local_sine
```

This executes the main.py script with the configuration exp_1_local_sine, which is defined in the configs directory.
Taking a look at the config file, you can see that it first loads different subconfigs:
```yaml
defaults:
  - algorithm: mlp
  - dataset: sine
  - platform: local
  - _self_
```

After that, it sets specific parameters or overrides default values:
```yaml
epochs: 1000
device: cuda
name: exp_1_local_sine
group_name: local
visualize: True
wandb: False
seed: 100
```

Importantly, in the platform: local config, we define where the experiments should be saved. If you take a look at the config itself (configs/platform/local.yaml), you will see that it starts with the line
```yaml
# @package _global_
```

This means that this config is "moved" to the global config level: its keys end up at the root level of the config instead of inside the platform key.
We need this because the hydra: key has to be at the root level of the config file.
This project is integrated with "Weights and Biases" (WandB). WandB is a great tool for tracking your experiments and visualizing your results. In order to use it, you have to create an account on the WandB website and get your API key. Then, you can set your API key as an environment variable; WandB will tell you how to do this when you run it for the first time.
If you change the config configs/exp_1_local_sine.yaml and set the key wandb: True, the results will be logged to WandB.
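Under the hood, WandB logging typically boils down to something like the following sketch (illustrative only; the project name and the dummy loss are placeholders, not taken from this repository):

```python
import math

import wandb

run = wandb.init(
    project="hydra-slurm-example",   # hypothetical project name
    group="local",                   # corresponds to group_name in the config
    job_type="exp_1_local_sine",     # corresponds to name in the config
    config={"epochs": 1000, "seed": 100},
)

for epoch in range(run.config["epochs"]):
    dummy_loss = math.exp(-epoch / 100)  # stand-in for the real training loss
    wandb.log({"train_loss": dummy_loss, "epoch": epoch})

run.finish()
```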
Try it out by running this config again! You should see a new run in your WandB dashboard. However, the train loss curve is not very informative.
You can change the y-axis to a logarithmic scale to see the loss better:
Hover over the train loss plot and click "Edit panel". Find the "Y Axis" option and tick the "Log Scale" button to the right.
Now you should see the loss curve better.
Two things are recommended to change in the WandB GUI:

- Group your runs by "Group" and "Job Type":

  Groups are larger collections of runs, while a Job Type can be seen as a subgroup of a group. In our case, a job type contains only runs with the same configuration, but we save multiple executions with different seeds in it.

- Go to "Settings" (top right) -> "Line Plots" -> Tick "Random sampling" to see the correct grouping of multiple runs.
If you now start the second experiment exp_2_local_relu, you will see that the runs are grouped by "Group" and "Job Type" and that the line plots are correctly displayed.
Compare both runs. You should observe that the ReLU activation function converges faster than Tanh (which was used by exp_1):
However, the ReLU prediction is a piecewise linear function and looks jagged if you take a look at the prediction plot:

All of this can easily be compared in the WandB GUI. Hopefully, this helps you get a better understanding of your experiments.
Hydra also provides a powerful tool for defining multiple experiments at once. Useful applications are hyperparameter grid search or rerunning your experiment with different seeds.
This is done with the MULTIRUN flag.
Take a look at configs/exp_3_local_multirun.yaml. It has this new section:
```yaml
hydra:
  mode: MULTIRUN
  sweeper:
    params:
      seed: 0, 1, 2 # starts 3 jobs sequentially, overwrites seed value
```

With this configuration, we start 3 jobs sequentially with different seeds. This is useful if you want to increase the statistical significance of your results.
Once you run this config, you should now see an aggregated line plot of these 3 runs in WandB.

Specific aggregation methods can be changed by the "Edit panel" option.
The sweeper section can also be used to define a grid search: it creates a Cartesian grid of all combinations given in the params dict.
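To make the expansion concrete, here is a small Python sketch (not part of the repository; the learning-rate override is a hypothetical example) that mimics what the sweeper does:

```python
from itertools import product

# Hypothetical sweep: 3 seeds x 2 learning rates -> 6 jobs.
params = {
    "seed": [0, 1, 2],
    "algorithm.learning_rate": [1e-3, 1e-4],
}

# The sweeper launches one job per element of the Cartesian product.
for values in product(*params.values()):
    overrides = dict(zip(params.keys(), values))
    print(overrides)
```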
If you are interested in a "list"/"tuple" search, check out my repository hydra-list-sweeper.
Now we are ready to deploy our code on a Slurm cluster. We will use the BwUni 3.0 Cluster as an example cluster, but every Slurm cluster should work similarly. As a general rule, all runs on the cluster are only executed using the MULTIRUN mode.
The first step is to log in on the cluster, clone this repository and install it in a virtual environment. Then you are ready to submit your first job.
Hydra has a nice plugin for submitting Slurm jobs without having to touch any bash scripts. We configure its parameters in the platform subconfig.
Take a look at configs/platform/bwuni_dev.yaml:
```yaml
# @package _global_

defaults:
  - override /hydra/launcher: submitit_slurm

hydra:
  mode: MULTIRUN # needed for launcher to be used
  run:
    dir: ./outputs/training/${now:%Y-%m-%d}/${name}
  sweep:
    dir: ./outputs/training/${now:%Y-%m-%d}
    subdir: ${name}/seed_${seed}
  launcher:
    # launcher/cluster specific options
    partition: "gpu_a100_short,dev_gpu_h100,dev_gpu_a100_il"
    timeout_min: 30 # in minutes, maximum time on this queue
    gres: gpu:1 # one gpu allocated
```

This configuration tells Hydra to use the submitit_slurm launcher and to submit the job to the development partitions.
Check that you have the hydra-submitit-launcher package installed:

```bash
pip install hydra-submitit-launcher
```

In the launcher section, you can specify the timeout (how long the job is allowed to run) and the number of GPUs.
On the development queues, jobs are limited to 30 minutes.
Now, you can submit the job by executing the following command on the cluster:
```bash
python main.py --config-name exp_4_bwuni_dev
```

This config is similar to the local multirun config, but it uses the bwuni_dev platform config. You should see a new job being submitted to the cluster.
Check their status using squeue. Normally, they should deploy within a few minutes. The results are again logged to WandB.
The dev queue is nice for debugging, but due to the time limit of 30 minutes, it is not practical for real experiments.
The configs/exp_5_bwuni.yaml config uses the bwuni platform, which submits the job to all possible GPU queues.
Try to run this config on the cluster. The waiting time might now be much longer. In general, a shorter timeout_min results in a higher priority.
Another useful command to get an upper bound of the waiting time is squeue --start. To find out which partitions are currently idle, check out sinfo_t_idle.
Sometimes, you work with datasets that do not fully fit in memory and must be loaded from disk. This can be problematic on a cluster, as your workspace or home directory is typically a network drive, where input/output operations are slow. To address this, the BwUni cluster provides a fast SSD for temporary storage, accessible via the environment variable $TMPDIR. This storage is physically mounted on the machine where your job is executed. The recommended workflow is as follows:
- In your submitit config, add a line to copy the data to the temporary storage $TMPDIR.
- Implement a dataset which loads the requested index from the temporary storage $TMPDIR.
- Use a dataloader with multiple workers to load the data in parallel.
In more detail: you can provide a list of bash commands in the setup section of the submitit config, which are executed before the job starts:
```yaml
hydra:
  launcher:
    setup:
      - cp ./outputs/datasets/on_disk_train.hdf5 $TMPDIR/on_disk_train.hdf5
      - cp ./outputs/datasets/on_disk_test.hdf5 $TMPDIR/on_disk_test.hdf5
```

It is also useful to change the path_to_dataset in the dataset config to $TMPDIR. This is also done in the platform config:

```yaml
dataset:
  path_to_dataset: "$TMPDIR"
```

We added both changes to the platform configs bwuni_dev_copy_dataset and bwuni_copy_dataset.
In this example, we use the HDF5 format to store the dataset. This is convenient, since you can save the dataset as one file but still access it in parallel. Check out https://docs.h5py.org/en/latest/quick.html for more information on how to use HDF5 with Python.
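As a quick primer (illustrative code, not from this repository; file name and array contents are made up), writing and reading an HDF5 file with h5py looks like this:

```python
import h5py
import numpy as np

# Write two arrays into a single HDF5 file.
x = np.linspace(-1.0, 1.0, 1000, dtype=np.float32)
y = np.sin(x)
with h5py.File("on_disk_example.hdf5", "w") as f:
    f.create_dataset("x", data=x)
    f.create_dataset("y", data=y)

# Read back only a slice; h5py loads just the requested part from disk.
with h5py.File("on_disk_example.hdf5", "r") as f:
    print(f["x"][:10], f["y"][:10])
```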
The dataset class now does not load the full data into memory during __init__. Instead, the data is accessed during the __getitem__ method from the temporary storage $TMPDIR.
The dataset class in dataset/on_disk.py shows how to do this.
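A condensed sketch of such a class (simplified, not the exact code from dataset/on_disk.py; the config keys and the HDF5 dataset names "x" and "y" are assumptions) could look like this:

```python
import os

import h5py
import torch
from torch.utils.data import Dataset


class OnDiskDataset(Dataset):
    """Reads individual samples from an HDF5 file on demand instead of caching them in memory."""

    def __init__(self, config: dict):
        # e.g. path_to_dataset: "$TMPDIR" from the platform config; "file_name" is an assumed key.
        data_dir = os.path.expandvars(config["path_to_dataset"])
        self.file_path = os.path.join(data_dir, config["file_name"])
        with h5py.File(self.file_path, "r") as f:
            self.length = len(f["x"])
        self.file = None  # opened lazily, once per dataloader worker

    def __len__(self):
        return self.length

    def __getitem__(self, index):
        # h5py handles cannot be shared across worker processes, so open the file lazily here.
        if self.file is None:
            self.file = h5py.File(self.file_path, "r")
        x = torch.tensor(self.file["x"][index])
        y = torch.tensor(self.file["y"][index])
        return x, y
```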
If you don't work with an HDF5 file but with a folder of datapoints instead, you should always copy a zipped version and then unzip the dataset using another command in the setup section.
This makes the copying process much faster and reduces the I/O operations.
Since loading from an SSD is still slower than memory access, it is advisable to have multiple workers in the dataloader to load the data in parallel.
We set the number of workers in the dataset config to 4 for the on_disk dataset. The in-memory dataset does not use separate workers, hence the default value of 0.
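For reference, a PyTorch DataLoader with multiple workers (reusing the OnDiskDataset sketch from above; batch size and file name are illustrative) is set up like this:

```python
from torch.utils.data import DataLoader

dataset = OnDiskDataset({"path_to_dataset": "$TMPDIR", "file_name": "on_disk_train.hdf5"})

# num_workers=4 spawns four worker processes that read samples from $TMPDIR in parallel.
loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=4)

for x, y in loader:
    pass  # one training step per batch would go here
```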
If your dataset is very large and you want to debug something, it might make sense to create a small debug dataset and use that; then copying the dataset does not take as long.
In general, you should always use the $TMPDIR folder if you are dealing with big datasets on the cluster. Never load them from your home directory for every step.
The BwUni cluster checks the amount of I/O operations and might kill your job if you exceed the limit. Besides, loading from the home directory is much slower and less efficient.
Note that loading from disk during __getitem__ is slower than loading it from memory (even with multiple workers, compare the epoch over time plot in wandb for that!). Therefore, if you have a dataset that fits into memory, you should go for the simpler option and load it into memory during __init__.
We recommend using the turm tool to get a TUI overview of your jobs on the cluster. It is a nice way to monitor your jobs and see their status.
You can install it via pip:

```bash
pip install turm
```

and run it by simply executing turm on the cluster. You can find more information at https://github.com/kabouzeid/turm.

This repository showed you how to use Hydra, WandB, and Slurm together. This is a powerful combination for managing your experiments and deploying them on a cluster. I hope this tutorial was helpful to you. If you have any questions, feel free to contact me (philipp.dahlinger@kit.edu).
