This guide will walk you through setting up and using the GreatLakes HPC cluster for your ROB572 class project. Whether you're running simulations, training models for underwater perception, or processing sonar/sensor data, this guide covers everything you need to get started.
- Getting Access
- VS Code Remote Access
- Setting Up a Conda Environment
- Running GPU Jobs in Interactive Mode
- Running GPU Jobs in Batch Mode
- Monitoring Jobs
- Tips and Tricks
- Command Cheat Sheet
To use the GreatLakes cluster for your ROB572 class project, you need a user login. If you don't already have one, request a login here:
- Great Lakes User Login: https://arc.umich.edu/login-request/
Once you have a login, you can submit jobs to the class account. Use the following line in all of your batch scripts:
#SBATCH --account=rob572w26_class
If you're new to HPC, check out the Great Lakes User Guide and consider attending an ARC Training Event. For any account issues, email arc-support@umich.edu.
- Always use `--account=rob572w26_class` when submitting jobs.
- Be mindful of resource usage: this is a shared class account, so avoid requesting more resources than you need and cancel jobs you're no longer using.
- Do not leave idle interactive sessions running. Other students need access too.
- VPN: If you're off campus, connect to the U-M VPN first. Download the client from https://its.umich.edu/enterprise/wifi-networks/vpn/getting-started.
- Install Remote - SSH: In VS Code, open the Extensions view (Ctrl+Shift+X), search for "Remote - SSH", and install it.
- Connect: Open the Command Palette (Ctrl+Shift+P), type `Remote-SSH: Connect to Host...`, and enter `ssh [uniqname]@greatlakes.arc-ts.umich.edu`. Enter your password and complete Duo two-factor authentication. When prompted for the OS type, select Linux.
- Open a workspace: Once connected, open a terminal in VS Code. We recommend creating a dedicated project directory with `mkdir -p ~/rob572_project`. Then go to File > Open Folder and select `rob572_project`. This keeps your project files organized and ensures VS Code extensions (like Python IntelliSense) work correctly.
Install these VS Code extensions for a smooth development experience:
- Python — Official Python extension with IntelliSense, linting, debugging, and formatting.
- Jupyter — Create, edit, and run Jupyter notebooks directly in VS Code.
Search for them in the Extensions view and click Install.
GitHub Copilot is an AI coding assistant available as a VS Code extension. As a student, you can get it free through the GitHub Student Developer Pack.
After getting access, sign in with your GitHub account by clicking the user icon in the bottom-left of VS Code.
Your home directory has limited space (~80 GB). For class projects, this is usually sufficient, but if you need more space (large datasets, multiple environments), consider using scratch storage:
/scratch/rob572w26_class_root/rob572w26_class/[uniqname]
You can create a symlink from your home directory for convenience:
mkdir -p /scratch/rob572w26_class_root/rob572w26_class/[uniqname]/conda
ln -s /scratch/rob572w26_class_root/rob572w26_class/[uniqname]/conda ~/conda
Note: Scratch storage is temporary — files are deleted after 90 days. You'll receive an email before deletion. Always back up important work (e.g., push to GitHub).
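If you want to see how the symlink pattern above behaves before running it on the cluster, here is a self-contained sketch in which `mktemp` directories stand in for the real scratch and home paths (which only exist on Great Lakes):

```shell
SCRATCH=$(mktemp -d)    # stands in for /scratch/rob572w26_class_root/rob572w26_class/[uniqname]
HOME_DIR=$(mktemp -d)   # stands in for your home directory
mkdir -p "$SCRATCH/conda"
ln -s "$SCRATCH/conda" "$HOME_DIR/conda"
# Files written through the link land on scratch, not in home:
touch "$HOME_DIR/conda/envs_live_here"
ls "$SCRATCH/conda"     # prints: envs_live_here
```

The point of the pattern: tools keep using the familiar `~/conda` path, while the actual disk usage is charged to scratch rather than your 80 GB home quota.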
Common conda commands you'll use throughout your project:
- Create a new environment: `conda create -n rob572_env python=3.10 -y`
- Activate the environment: `conda activate rob572_env`
- Deactivate the environment: `conda deactivate`
- List all environments: `conda env list`
- Export an environment (useful for sharing with teammates): `conda env export --name rob572_env > environment.yml`
- Recreate from an exported file: `conda env create -f environment.yml`
- Remove an environment: `conda env remove --name rob572_env`
In VS Code, you can select your conda environment by opening a .py file and clicking the Python version in the bottom-right corner. Choose your rob572_env environment, and VS Code will automatically use it for terminals and code execution.
Below is an example setup script for a marine robotics project that uses deep learning. Adjust package versions to match your project's requirements.
#!/bin/bash
CONDA_ENV_NAME=rob572_env
UNIQNAME=[YOUR_UNIQNAME]
# Download and install miniconda (skip if already installed)
mkdir -p ~/Downloads && cd ~/Downloads
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p ~/conda/miniconda
# Initialize conda
source ~/conda/miniconda/bin/activate
conda init
source ~/.bashrc
# Clean up installer
rm -f ~/Downloads/Miniconda3-latest-Linux-x86_64.sh
# Create environment and install packages
conda create -n ${CONDA_ENV_NAME} python=3.10 -y
conda activate ${CONDA_ENV_NAME}
# GPU support (adjust CUDA version as needed)
conda install -c "nvidia/label/cuda-11.8.0" cuda-toolkit -y
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia -y
# Common scientific/robotics packages
conda install matplotlib scikit-learn pandas scipy -y
pip install opencv-python tensorboard

Tip: Adjust the CUDA and PyTorch versions to match the requirements of the libraries you plan to use. If your project uses ROS, you may need a separate environment or Docker container; consult the instructor for guidance.
Interactive mode is ideal for debugging and short tests. For long-running experiments, use batch mode (next section).
The GreatLakes Portal offers interactive apps (Jupyter, VS Code, Basic Desktop) under the "Interactive Apps" tab. For GPU work, we recommend the Basic Desktop option, which provides a full desktop environment.
When submitting, use:
- Account: `rob572w26_class`
- Partition: `gpu` (check with the instructor for the correct partition)
- Time: Keep it short (a few hours) for debugging
Resource guidelines: A single node typically has 8 GPUs, 32 CPU cores, and 372 GB memory. To be a good neighbor, limit CPUs to ~4 per GPU and memory to ~48 GB per GPU.
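The per-GPU guideline above scales linearly, so you can derive a polite request size from the GPU count. A small sketch (the `GPUS=2` value is just an example):

```shell
# ~4 CPUs and ~48 GB of memory per requested GPU, per the guideline above
GPUS=2
CPUS=$((GPUS * 4))
MEM=$((GPUS * 48))
echo "--gres=gpu:${GPUS} --cpus-per-task=${CPUS} --mem=${MEM}G"
# prints: --gres=gpu:2 --cpus-per-task=8 --mem=96G
```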
Once your interactive session is running, find the hostname in the session details (e.g., gl1709.arc-ts.umich.edu). From a VS Code terminal connected to Great Lakes:
ssh gl1709.arc-ts.umich.edu
Then cd to your project directory and activate your conda environment manually.
Request an interactive GPU session directly from the terminal:
salloc --job-name=debug --cpus-per-task=4 --nodes=1 --mem=16G --time=4:00:00 --account=rob572w26_class --partition=gpu --gres=gpu:1
Check your job status:
squeue -u [UNIQNAME]
Connect to the allocated node:
srun --jobid=[JOBID] --pty bash
Warning: If you close the `salloc` terminal, your job will be terminated. Use `tmux` or `screen` to keep your session alive: `tmux new -s rob572`. You can detach with `Ctrl+B` then `D`, and reattach later with `tmux attach -t rob572`.
To run a Jupyter notebook on a GPU node and connect from VS Code:
- SSH into the allocated node, `cd` to your project directory, and activate your conda environment.
- Start the notebook server: `jupyter notebook --no-browser --port=8888 --ip=0.0.0.0`
- Copy the URL with the token from the terminal output.
- In VS Code, open your `.ipynb` file, click "Select Kernel" in the top-right, choose "Existing Jupyter Server", and paste the URL.
Batch mode is the recommended way to run long experiments. It queues your job and runs it when resources are available — no need to keep a terminal open.
Here's a template batch script for a ROB572 project:
#!/bin/bash
#SBATCH --job-name=rob572_train
#SBATCH --mail-user=[UNIQNAME]@umich.edu
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --cpus-per-task=4
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=16G
#SBATCH --time=8:00:00
#SBATCH --account=rob572w26_class
#SBATCH --partition=gpu
#SBATCH --output=/home/[UNIQNAME]/rob572_project/logs/%x-%j.log
#SBATCH --gres=gpu:1
# Set up environment
source /home/[UNIQNAME]/.bashrc
conda activate rob572_env
cd ~/rob572_project
# Optional: copy dataset to local SSD for faster I/O
mkdir -p /tmpssd/[UNIQNAME]
cp -r ~/rob572_project/data /tmpssd/[UNIQNAME]/
# Run your training script
python train.py \
--data_dir /tmpssd/[UNIQNAME]/data \
--output_dir ~/rob572_project/results \
--epochs 50 \
--batch_size 32 \
--lr 0.001

Submit the job:
sbatch train.sh
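When `sbatch` accepts a job, it prints a line of the form `Submitted batch job <id>`. It can be handy to capture that id for later `scancel`/`scontrol` calls. In this sketch the `sbatch` call is simulated with a fixed string, since it only works on the cluster:

```shell
# On the cluster you would use: submit_output=$(sbatch train.sh)
submit_output="Submitted batch job 12345678"   # simulated sbatch output
jobid=${submit_output##* }                     # keep the last whitespace-separated word
echo "$jobid"                                  # prints: 12345678
```

Alternatively, `sbatch --parsable train.sh` prints just the job id, with no surrounding text to strip.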
Note: Copying data to `/tmpssd` (local SSD) avoids slow network transfers and can significantly speed up data loading. This is especially helpful for projects with large datasets (e.g., sonar imagery, point clouds).
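A habit worth pairing with the `/tmpssd` copy is removing it when the job ends, so the node-local disk doesn't fill up for the next user. Below is a runnable sketch where a subshell with an EXIT trap stands in for the job; on the cluster you would put the `trap` line near the top of your batch script, pointed at your `/tmpssd/[UNIQNAME]` directory:

```shell
TMP=$(mktemp -d)                # stands in for /tmpssd/[UNIQNAME]
(
  trap 'rm -rf "$TMP"' EXIT     # runs on normal exit and on failure
  touch "$TMP/staged_data"      # the job stages and uses its data here
)
[ -d "$TMP" ] || echo "local copy cleaned up"   # prints: local copy cleaned up
```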
Make sure the logs/ directory exists before submitting:
mkdir -p ~/rob572_project/logs
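For reference, in the `--output` pattern above SLURM expands `%x` to the job name and `%j` to the job id, so each run gets its own log file. A tiny illustration with made-up values:

```shell
jobname=rob572_train   # from #SBATCH --job-name
jobid=12345678         # assigned by SLURM at submission time
logfile="${jobname}-${jobid}.log"   # what %x-%j produces
echo "$logfile"                     # prints: rob572_train-12345678.log
```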
Structure your training scripts to accept command-line arguments, making it easy to run different experiments without editing code:
import argparse

def main(args):
    # Your training/simulation logic here
    print(f"Training with lr={args.lr}, epochs={args.epochs}")

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="ROB572 Project Training")
    parser.add_argument("--data_dir", type=str, required=True, help="Path to dataset")
    parser.add_argument("--output_dir", type=str, required=True, help="Path to save results")
    parser.add_argument("--batch_size", type=int, default=32, help="Batch size")
    parser.add_argument("--lr", type=float, default=0.001, help="Learning rate")
    parser.add_argument("--epochs", type=int, default=10, help="Number of epochs")
    args = parser.parse_args()
    main(args)

To sweep over hyperparameters, use a loop that submits multiple jobs:
#!/bin/bash
ACCOUNT=rob572w26_class
LR_LIST=(0.01 0.001 0.0001)
BATCH_SIZES=(16 32 64)
for LR in "${LR_LIST[@]}"; do
for BS in "${BATCH_SIZES[@]}"; do
sbatch <<EOT
#!/bin/bash
#SBATCH --job-name=rob572_lr${LR}_bs${BS}
#SBATCH --mail-user=[UNIQNAME]@umich.edu
#SBATCH --mail-type=END,FAIL
#SBATCH --cpus-per-task=4
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=16G
#SBATCH --time=8:00:00
#SBATCH --account=${ACCOUNT}
#SBATCH --partition=gpu
#SBATCH --output=/home/[UNIQNAME]/rob572_project/logs/%x-%j.log
#SBATCH --gres=gpu:1
source /home/[UNIQNAME]/.bashrc
conda activate rob572_env
cd ~/rob572_project
python train.py \
--data_dir ~/rob572_project/data \
--output_dir ~/rob572_project/results/lr${LR}_bs${BS} \
--lr ${LR} \
--batch_size ${BS} \
--epochs 50
EOT
done
done

Check your jobs:
squeue -u [UNIQNAME]
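Before pointing the sweep loop above at `sbatch`, it can be worth a dry run that replaces the submission with `echo`, so you can verify the job names and the total job count against the class account's capacity:

```shell
# Dry run of the sweep: print each (lr, batch size) combination
# instead of submitting it.
LR_LIST="0.01 0.001 0.0001"
BATCH_SIZES="16 32 64"
COUNT=0
for LR in $LR_LIST; do
  for BS in $BATCH_SIZES; do
    echo "would submit: rob572_lr${LR}_bs${BS}"
    COUNT=$((COUNT + 1))
  done
done
echo "total jobs: $COUNT"   # prints: total jobs: 9
```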
Check all jobs under the class account:
squeue -A rob572w26_class
Detailed job info:
scontrol show job [JOBID]
View resource usage for the class account:
squeue -A rob572w26_class -O "JobID,UserName,tres-per-job,tres-per-node,TimeUsed,TimeLeft"
Cancel a job:
scancel [JOBID]
Cancel all your jobs:
scancel -u [UNIQNAME]
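When you have many jobs queued, counting states is often more useful than reading the whole `squeue` table. This sketch simulates the output of `squeue -u [UNIQNAME] -h -o "%T"` (no header, one state per job) with `printf`; on the cluster, pipe the real command instead:

```shell
# Simulated `squeue -u [UNIQNAME] -h -o "%T"` output, one job state per line
states=$(printf 'RUNNING\nPENDING\nRUNNING\nPENDING\nPENDING\n')
# Tally jobs by state
echo "$states" | sort | uniq -c
```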
- Keep a clear directory structure for your project (e.g., `data/`, `src/`, `logs/`, `results/`, `notebooks/`).
- Use `argparse` to parameterize your scripts so you can easily run different experiments.
- Log your results programmatically (to files or tools like TensorBoard/Weights & Biases) instead of copying from terminal output.
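The suggested layout can be created in one command. In this sketch `mktemp` stands in for your home directory so it runs anywhere; on the cluster, use `~/rob572_project` directly:

```shell
BASE=$(mktemp -d)/rob572_project   # use ~/rob572_project on the cluster
# The directory names are suggestions, not requirements
mkdir -p "$BASE/data" "$BASE/src" "$BASE/logs" "$BASE/results" "$BASE/notebooks"
ls "$BASE"
```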
- Don't hog resources. Only request what you need — if you need 1 GPU, don't request 4.
- Cancel idle jobs. If you're done debugging, cancel your interactive session with `scancel`.
- Use batch mode for long runs. Interactive sessions are for debugging, not overnight training.
- Check the queue before submitting large jobs — if the class account is busy, wait or use fewer resources.
- Always push your code to GitHub or another Git host. The cluster is not a backup service.
- Use `.gitignore` to exclude large data files, model checkpoints, and log files from your repo.
- Some packages cache data in your home directory. Redirect caches to scratch to save space:
  # Add to your ~/.bashrc
  export HF_DATASETS_CACHE="/scratch/rob572w26_class_root/rob572w26_class/[UNIQNAME]/cache/huggingface"
  export HF_HOME="/scratch/rob572w26_class_root/rob572w26_class/[UNIQNAME]/cache/hf_home"
- Use `/tmpssd` on compute nodes for fast local I/O during training jobs.
- Use interactive mode to test your code with a small dataset before submitting a batch job.
- Check job logs in your `logs/` directory if a batch job fails.
- Use `scontrol show job [JOBID]` to see why a job is pending or failed.
| Command | Description |
|---|---|
| `ssh [uniqname]@greatlakes.arc-ts.umich.edu` | Connect to Great Lakes |
| `conda create -n rob572_env python=3.10 -y` | Create conda environment |
| `conda activate rob572_env` | Activate environment |
| `salloc --account=rob572w26_class --partition=gpu --gres=gpu:1 --mem=16G --time=4:00:00` | Request interactive GPU session |
| `sbatch train.sh` | Submit a batch job |
| `squeue -u [UNIQNAME]` | Check your jobs |
| `squeue -A rob572w26_class` | Check all class jobs |
| `scancel [JOBID]` | Cancel a job |
| `scontrol show job [JOBID]` | Detailed job info |
| `tmux new -s rob572` | Start a tmux session |
| `tmux attach -t rob572` | Reattach to tmux session |
If you have questions about the cluster, email arc-support@umich.edu. For project-specific questions, reach out to the instructor or GSI.