umfieldrobotics/NAVARCH569-ROB472-ROB572-Marine-Robotics-Cluster-Tutorial

Great Lakes Cluster Guide for ROB572: Marine Robotics

This guide walks you through setting up and using the Great Lakes HPC cluster for your ROB572 class project. Whether you're running simulations, training models for underwater perception, or processing sonar and other sensor data, this guide covers everything you need to get started.

  1. Getting Access
  2. VS Code Remote Access
  3. Setting Up a Conda Environment
  4. Running GPU Jobs in Interactive Mode
  5. Running GPU Jobs in Batch Mode
  6. Monitoring Jobs
  7. Tips and Tricks
  8. Command Cheat Sheet

Getting Access

To use the Great Lakes cluster for your ROB572 class project, you need a user login. If you don't already have one, submit a login request through ARC.

Once you have a login, you can submit jobs to the class account. Use the following line in all of your batch scripts:

#SBATCH --account=rob572w26_class

If you're new to HPC, check out the Great Lakes User Guide and consider attending an ARC Training Event. For any account issues, email arc-support@umich.edu.

Important Notes

  • Always use --account=rob572w26_class when submitting jobs.
  • Be mindful of resource usage — this is a shared class account, so avoid requesting more resources than you need and cancel jobs you're no longer using.
  • Do not leave idle interactive sessions running. Other students need access too.

VS Code Remote Access

VPN and SSH Setup

  1. VPN: If you're off campus, connect to the U-M VPN first. Download the client from https://its.umich.edu/enterprise/wifi-networks/vpn/getting-started.

  2. Install Remote - SSH: In VS Code, open the Extensions view (Ctrl+Shift+X), search for "Remote - SSH", and install it.

  3. Connect: Open the Command Palette (Ctrl+Shift+P), type Remote-SSH: Connect to Host..., and enter:

    ssh [uniqname]@greatlakes.arc-ts.umich.edu
    

    Enter your password and complete Duo two-factor authentication. When prompted for the OS type, select Linux.

  4. Open a workspace: Once connected, open a terminal in VS Code. We recommend creating a dedicated project directory:

    mkdir -p ~/rob572_project
    

    Then go to File > Open Folder and select rob572_project. This keeps your project files organized and ensures VS Code extensions (like Python IntelliSense) work correctly.
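Optionally, you can shorten the connection command from step 3 with a host alias. A minimal ~/.ssh/config entry on your local machine might look like this (the alias name greatlakes is our choice):

```
Host greatlakes
    HostName greatlakes.arc-ts.umich.edu
    User [uniqname]
```

With this in place, ssh greatlakes connects directly, and the alias also appears automatically in Remote-SSH's host list.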

Python Extensions and Environment Setup

Install these VS Code extensions for a smooth development experience:

  1. Python — Official Python extension with IntelliSense, linting, debugging, and formatting.
  2. Jupyter — Create, edit, and run Jupyter notebooks directly in VS Code.

Search for them in the Extensions view and click Install.

GitHub Copilot Setup

GitHub Copilot is an AI coding assistant available as a VS Code extension. As a student, you can get it free through the GitHub Student Developer Pack.

After getting access, sign in with your GitHub account by clicking the user icon in the bottom-left of VS Code.


Setting Up a Conda Environment

Managing Storage on the Cluster

Your home directory has limited space (~80 GB). For class projects, this is usually sufficient, but if you need more space (large datasets, multiple environments), consider using scratch storage:

/scratch/rob572w26_class_root/rob572w26_class/[uniqname]

You can create a symlink from your home directory for convenience:

mkdir -p /scratch/rob572w26_class_root/rob572w26_class/[uniqname]/conda
ln -s /scratch/rob572w26_class_root/rob572w26_class/[uniqname]/conda ~/conda

Note: Scratch storage is temporary — files are deleted after 90 days. You'll receive an email before deletion. Always back up important work (e.g., push to GitHub).
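To help decide what to back up before the purge window closes, here is a small sketch (ours, not an ARC tool) that lists files under a directory whose modification time is older than a threshold; the 80-day default is our choice:

```python
import time
from pathlib import Path

def stale_files(root, max_age_days=80):
    """Return files under root not modified within the last max_age_days."""
    cutoff = time.time() - max_age_days * 86400
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.stat().st_mtime < cutoff
    )

# Example: check your scratch directory for files at risk of deletion
# for path in stale_files("/scratch/rob572w26_class_root/rob572w26_class/[uniqname]"):
#     print(path)
```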

Conda Environment Management

Common conda commands you'll use throughout your project:

  1. Create a new environment:

    conda create -n rob572_env python=3.10 -y
    
  2. Activate the environment:

    conda activate rob572_env
    
  3. Deactivate the environment:

    conda deactivate
    
  4. List all environments:

    conda env list
    
  5. Export an environment (useful for sharing with teammates):

    conda env export --name rob572_env > environment.yml
    
  6. Recreate from an exported file:

    conda env create -f environment.yml
    
  7. Remove an environment:

    conda env remove --name rob572_env
    

In VS Code, you can select your conda environment by opening a .py file and clicking the Python version in the bottom-right corner. Choose your rob572_env environment, and VS Code will automatically use it for terminals and code execution.

Installing Packages

Below is an example setup script for a marine robotics project that uses deep learning. Adjust package versions to match your project's requirements.

#!/bin/bash
CONDA_ENV_NAME=rob572_env
UNIQNAME=[YOUR_UNIQNAME]

# Download and install miniconda (skip if already installed)
mkdir -p ~/Downloads && cd ~/Downloads
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p ~/conda/miniconda

# Initialize conda
source ~/conda/miniconda/bin/activate
conda init
source ~/.bashrc

# Clean up installer
rm -f ~/Downloads/Miniconda3-latest-Linux-x86_64.sh

# Create environment and install packages
conda create -n ${CONDA_ENV_NAME} python=3.10 -y
conda activate ${CONDA_ENV_NAME}

# GPU support (adjust CUDA version as needed)
conda install -c "nvidia/label/cuda-11.8.0" cuda-toolkit -y
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia -y

# Common scientific/robotics packages
conda install matplotlib scikit-learn pandas scipy -y
pip install opencv-python tensorboard

Tip: Adjust the CUDA and PyTorch versions to match the requirements of the libraries you plan to use. If your project uses ROS, you may need a separate environment or Docker container — consult the instructor for guidance.


Running GPU Jobs in Interactive Mode

Interactive mode is ideal for debugging and short tests. For long-running experiments, use batch mode (next section).

Web-Based Access

The Great Lakes Portal offers interactive apps (Jupyter, VS Code, Basic Desktop) under the "Interactive Apps" tab. For GPU work, we recommend the Basic Desktop option, which provides a full desktop environment.

When submitting, use:

  • Account: rob572w26_class
  • Partition: gpu (check with the instructor for the correct partition)
  • Time: Keep it short (a few hours) for debugging

Resource guidelines: A single node typically has 8 GPUs, 32 CPU cores, and 372 GB memory. To be a good neighbor, limit CPUs to ~4 per GPU and memory to ~48 GB per GPU.

Connecting to Your Node via SSH

Once your interactive session is running, find the hostname in the session details (e.g., gl1709.arc-ts.umich.edu). From a VS Code terminal connected to Great Lakes:

ssh gl1709.arc-ts.umich.edu

Then cd to your project directory and activate your conda environment manually.

Command-Line Access

Request an interactive GPU session directly from the terminal:

salloc --job-name=debug --cpus-per-task=4 --nodes=1 --mem=16G --time=4:00:00 --account=rob572w26_class --partition=gpu --gres=gpu:1

Check your job status:

squeue -u [UNIQNAME]

Connect to the allocated node:

srun --jobid=[JOBID] --pty bash

Warning: If you close the salloc terminal, your job will be terminated. Use tmux or screen to keep your session alive:

tmux new -s rob572

You can detach with Ctrl+B then D, and reattach later with tmux attach -t rob572.

Jupyter Notebooks with GPU

To run a Jupyter notebook on a GPU node and connect from VS Code:

  1. SSH into the allocated node, cd to your project directory, and activate your conda environment.
  2. Start the notebook server:
    jupyter notebook --no-browser --port=8888 --ip=0.0.0.0
    
  3. Copy the URL with the token from the terminal output.
  4. In VS Code, open your .ipynb file, click "Select Kernel" in the top-right, choose "Existing Jupyter Server", and paste the URL.

Running GPU Jobs in Batch Mode

Batch mode is the recommended way to run long experiments. It queues your job and runs it when resources are available — no need to keep a terminal open.

Creating a Batch Script

Here's a template batch script for a ROB572 project:

#!/bin/bash

#SBATCH --job-name=rob572_train
#SBATCH --mail-user=[UNIQNAME]@umich.edu
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --cpus-per-task=4
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=16G
#SBATCH --time=8:00:00
#SBATCH --account=rob572w26_class
#SBATCH --partition=gpu
#SBATCH --output=/home/[UNIQNAME]/rob572_project/logs/%x-%j.log
#SBATCH --gres=gpu:1

# Set up environment
source /home/[UNIQNAME]/.bashrc
conda activate rob572_env
cd ~/rob572_project

# Optional: copy dataset to local SSD for faster I/O
mkdir -p /tmpssd/[UNIQNAME]
cp -r ~/rob572_project/data /tmpssd/[UNIQNAME]/

# Run your training script
python train.py \
    --data_dir /tmpssd/[UNIQNAME]/data \
    --output_dir ~/rob572_project/results \
    --epochs 50 \
    --batch_size 32 \
    --lr 0.001

Submit the job:

sbatch train.sh

Note: Copying data to /tmpssd (local SSD) avoids slow network transfers and can significantly speed up data loading. This is especially helpful for projects with large datasets (e.g., sonar imagery, point clouds).

Make sure the logs/ directory exists before submitting:

mkdir -p ~/rob572_project/logs

Parameterized Python Code

Structure your training scripts to accept command-line arguments, making it easy to run different experiments without editing code:

import argparse

def main(args):
    # Your training/simulation logic here
    print(f"Training with lr={args.lr}, epochs={args.epochs}")

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="ROB572 Project Training")
    parser.add_argument("--data_dir", type=str, required=True, help="Path to dataset")
    parser.add_argument("--output_dir", type=str, required=True, help="Path to save results")
    parser.add_argument("--batch_size", type=int, default=32, help="Batch size")
    parser.add_argument("--lr", type=float, default=0.001, help="Learning rate")
    parser.add_argument("--epochs", type=int, default=10, help="Number of epochs")
    args = parser.parse_args()
    main(args)
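You can sanity-check your argument parsing locally, without submitting a job, by handing parse_args an explicit list instead of letting it read sys.argv. A minimal sketch mirroring a few of the flags above:

```python
import argparse

parser = argparse.ArgumentParser(description="ROB572 Project Training")
parser.add_argument("--data_dir", type=str, required=True, help="Path to dataset")
parser.add_argument("--lr", type=float, default=0.001, help="Learning rate")
parser.add_argument("--epochs", type=int, default=10, help="Number of epochs")

# Pass an explicit argv-style list; unspecified flags fall back to defaults
args = parser.parse_args(["--data_dir", "data/", "--lr", "0.01"])
print(args.data_dir, args.lr, args.epochs)  # data/ 0.01 10
```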

Parameterized Batch Scripts

To sweep over hyperparameters, use a loop that submits multiple jobs:

#!/bin/bash
ACCOUNT=rob572w26_class

LR_LIST=(0.01 0.001 0.0001)
BATCH_SIZES=(16 32 64)

for LR in "${LR_LIST[@]}"; do
for BS in "${BATCH_SIZES[@]}"; do
sbatch <<EOT
#!/bin/bash
#SBATCH --job-name=rob572_lr${LR}_bs${BS}
#SBATCH --mail-user=[UNIQNAME]@umich.edu
#SBATCH --mail-type=END,FAIL
#SBATCH --cpus-per-task=4
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=16G
#SBATCH --time=8:00:00
#SBATCH --account=${ACCOUNT}
#SBATCH --partition=gpu
#SBATCH --output=/home/[UNIQNAME]/rob572_project/logs/%x-%j.log
#SBATCH --gres=gpu:1

source /home/[UNIQNAME]/.bashrc
conda activate rob572_env
cd ~/rob572_project

python train.py \
    --data_dir ~/rob572_project/data \
    --output_dir ~/rob572_project/results/lr${LR}_bs${BS} \
    --lr ${LR} \
    --batch_size ${BS} \
    --epochs 50
EOT
done
done
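The nested loops above enumerate a 3x3 grid of hyperparameters. If you prefer to generate the sweep in Python (for example, to drive sbatch from a launcher script), itertools.product yields the same combinations; the job-name format here mirrors the bash script:

```python
import itertools

lr_list = [0.01, 0.001, 0.0001]
batch_sizes = [16, 32, 64]

# One (lr, batch_size) pair per job, same naming scheme as the bash sweep
job_names = [
    f"rob572_lr{lr}_bs{bs}"
    for lr, bs in itertools.product(lr_list, batch_sizes)
]

print(len(job_names))  # 9 jobs in the sweep
```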

Monitoring Jobs

Check your jobs:

squeue -u [UNIQNAME]

Check all jobs under the class account:

squeue -A rob572w26_class

Detailed job info:

scontrol show job [JOBID]

View resource usage for the class account:

squeue -A rob572w26_class -O "JobID,UserName,tres-per-job,tres-per-node,TimeUsed,TimeLeft"

Cancel a job:

scancel [JOBID]

Cancel all your jobs:

scancel -u [UNIQNAME]

Tips and Tricks

Project Organization

  • Keep a clear directory structure for your project (e.g., data/, src/, logs/, results/, notebooks/).
  • Use argparse to parameterize your scripts so you can easily run different experiments.
  • Log your results programmatically (to files or tools like TensorBoard/Weights & Biases) instead of copying from terminal output.
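As one way to log results programmatically, here is a minimal CSV logger sketch (the file path and field names are our choices, not a required format):

```python
import csv
import os

def log_metrics(path, row, fieldnames=("epoch", "loss", "accuracy")):
    """Append one row of metrics, writing a header if the file is new."""
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if write_header:
            writer.writeheader()
        writer.writerow(row)

# Usage inside a training loop:
# log_metrics("results/metrics.csv", {"epoch": 1, "loss": 0.52, "accuracy": 0.81})
```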

Being a Good Cluster Citizen

  • Don't hog resources. Only request what you need — if you need 1 GPU, don't request 4.
  • Cancel idle jobs. If you're done debugging, cancel your interactive session with scancel.
  • Use batch mode for long runs. Interactive sessions are for debugging, not overnight training.
  • Check the queue before submitting large jobs — if the class account is busy, wait or use fewer resources.

Version Control

  • Always push your code to GitHub or another Git host. The cluster is not a backup service.
  • Use .gitignore to exclude large data files, model checkpoints, and log files from your repo.
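For example, a starting .gitignore matching the directory layout suggested earlier (all entries are illustrative; adjust to your project):

```
# Large data and outputs
data/
results/
logs/

# Model checkpoints
*.ckpt
*.pth

# Python bytecode
__pycache__/
```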

Storage Tips

  • Some packages cache data in your home directory. Redirect caches to scratch to save space:
    # Add to your ~/.bashrc
    export HF_DATASETS_CACHE="/scratch/rob572w26_class_root/rob572w26_class/[UNIQNAME]/cache/huggingface"
    export HF_HOME="/scratch/rob572w26_class_root/rob572w26_class/[UNIQNAME]/cache/hf_home"
  • Use /tmpssd on compute nodes for fast local I/O during training jobs.

Debugging

  • Use interactive mode to test your code with a small dataset before submitting a batch job.
  • Check job logs in your logs/ directory if a batch job fails.
  • Use scontrol show job [JOBID] to see why a job is pending or failed.

Command Cheat Sheet

  • Connect to Great Lakes: ssh [uniqname]@greatlakes.arc-ts.umich.edu
  • Create a conda environment: conda create -n rob572_env python=3.10 -y
  • Activate the environment: conda activate rob572_env
  • Request an interactive GPU session: salloc --account=rob572w26_class --partition=gpu --gres=gpu:1 --mem=16G --time=4:00:00
  • Submit a batch job: sbatch train.sh
  • Check your jobs: squeue -u [UNIQNAME]
  • Check all class jobs: squeue -A rob572w26_class
  • Cancel a job: scancel [JOBID]
  • Show detailed job info: scontrol show job [JOBID]
  • Start a tmux session: tmux new -s rob572
  • Reattach to a tmux session: tmux attach -t rob572

If you have questions about the cluster, email arc-support@umich.edu. For project-specific questions, reach out to the instructor or GSI.
