Unified HPC job submission across multiple schedulers
Write your jobs once, run them on any cluster - SGE, Slurm, PBS, or locally for testing.
- Unified CLI - Same commands work across SGE, Slurm, PBS
- Python API - Programmatic job submission with dependencies and pipelines
- Auto-detection - Automatically finds your cluster's scheduler
- Interactive TUI - Monitor jobs with a terminal dashboard
- Job Dependencies - Chain jobs with afterok, afterany, afternotok
- Array Jobs - Batch processing with throttling support
- Virtual Environment Handling - Automatic venv activation on compute nodes
- Module Integration - Load environment modules in job scripts
- Dry-run Mode - Preview generated scripts before submission
```bash
pip install hpc-runner
```

Or with uv:

```bash
uv pip install hpc-runner
```

```bash
# Basic job submission
hpc run python train.py

# With resources
hpc run --cpu 4 --mem 16G --time 4:00:00 "python train.py"

# GPU job
hpc run --queue gpu --cpu 4 --mem 32G "python train.py --epochs 100"

# Preview without submitting
hpc run --dry-run --cpu 8 "make -j8"

# Interactive session
hpc run --interactive bash

# Array job
hpc run --array 1-100 "python process.py --task-id \$SGE_TASK_ID"

# Wait for completion
hpc run --wait python long_job.py
```

```python
from hpc_runner import Job

# Create and submit a job
job = Job(
    command="python train.py",
    cpu=4,
    mem="16G",
    time="4:00:00",
    queue="gpu",
)
result = job.submit()

# Wait for completion
status = result.wait()
print(f"Exit code: {result.returncode}")

# Read output
print(result.read_stdout())
```

```python
from hpc_runner import Job

# First job
preprocess = Job(command="python preprocess.py", cpu=8, mem="32G")
result1 = preprocess.submit()

# Second job runs after the first succeeds
train = Job(command="python train.py", cpu=4, mem="48G", queue="gpu")
train.after(result1, type="afterok")
result2 = train.submit()
```

```python
from hpc_runner import Pipeline

with Pipeline("ml_workflow") as p:
    p.add("python preprocess.py", name="preprocess", cpu=8)
    p.add("python train.py", name="train", depends_on=["preprocess"], queue="gpu")
    p.add("python evaluate.py", name="evaluate", depends_on=["train"])

# Auto-submitted when the with-block exits.
# Results are accessible on the pipeline object.
for name, result in p.results.items():
    print(name, result.job_id, result.status)

p.wait()
```

| Scheduler | Status | Notes |
|---|---|---|
| SGE | Fully implemented | qsub, qstat, qdel, qrsh |
| Local | Fully implemented | Run as subprocess (for testing) |
| Slurm | Planned | sbatch, squeue, scancel |
| PBS | Planned | qsub, qstat, qdel |
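The Local backend runs the command as a plain subprocess instead of submitting it to a scheduler, which is what makes it useful for testing. A minimal sketch of that model (the `run_local` helper is hypothetical, not the package's API):

```python
import subprocess
from pathlib import Path

def run_local(command: str, stdout_path: str) -> int:
    """Run a job command as a local subprocess, writing stdout to a
    file the way a scheduler would. Illustrative helper only."""
    with open(stdout_path, "w") as out:
        proc = subprocess.run(
            command, shell=True,
            stdout=out, stderr=subprocess.STDOUT,  # mirror merged-output mode
        )
    return proc.returncode

rc = run_local("echo hello", "job.out")
print(rc, Path("job.out").read_text().strip())
```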
Auto-detection checks for a scheduler in this order:

- `HPC_SCHEDULER` environment variable
- SGE (`SGE_ROOT` or `qstat` available)
- Slurm (`sbatch` available)
- PBS (`qsub` with PBS)
- Local fallback
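The detection order above amounts to an environment-variable override followed by a series of PATH probes. A sketch of the same logic (`detect_scheduler` is an illustrative helper, not the package's API):

```python
import os
import shutil

def detect_scheduler() -> str:
    """Mirror the documented detection order."""
    # 1. Explicit override always wins
    explicit = os.environ.get("HPC_SCHEDULER")
    if explicit:
        return explicit
    # 2. SGE: SGE_ROOT set or qstat on PATH
    if os.environ.get("SGE_ROOT") or shutil.which("qstat"):
        return "sge"
    # 3. Slurm: sbatch on PATH
    if shutil.which("sbatch"):
        return "slurm"
    # 4. PBS: qsub on PATH
    if shutil.which("qsub"):
        return "pbs"
    # 5. No scheduler found: run locally
    return "local"
```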
hpc-runner uses TOML configuration files. Discovery uses a two-level model:
- Project config: `$HPC_RUNNER_CONFIG` env var (if set), otherwise `<git-root>/hpc-runner.toml`
- Local override: `./hpc-runner.toml` in cwd (when different from level 1)

Both levels are deep-merged (the local override wins). You can bypass discovery with `--config /path/to/file.toml`.
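Deep-merging here means nested tables are merged key by key, with the local override winning on conflicts. A sketch of those semantics (illustrative, not the package's exact merge code):

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Merge two config dicts; values from `override` win,
    nested tables are merged recursively."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

project = {"defaults": {"cpu": 1, "mem": "4G"}}
local = {"defaults": {"mem": "16G"}}
print(deep_merge(project, local))
```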
```toml
[defaults]
cpu = 1
mem = "4G"
time = "1:00:00"
inherit_env = true

[schedulers.sge]
parallel_environment = "smp"
memory_resource = "mem_free"
purge_modules = true

[types.gpu]
queue = "gpu"
resources = [{name = "gpu", value = 1}]

[types.interactive]
queue = "interactive"
time = "8:00:00"
```

Use named job types:

```bash
hpc run --job-type gpu "python train.py"
```

SGE clusters vary widely in how resources are named. The `[schedulers.sge]` section lets you match your site's conventions without touching job definitions.

How job fields map to SGE flags:

| Job Field | SGE Flag | Configurable Via |
|---|---|---|
| `cpu` | `-pe <pe_name> <slots>` | `parallel_environment` |
| `mem` | `-l <resource>=<value>` | `memory_resource` |
| `time` | `-l <resource>=<value>` | `time_resource` |
| `queue` | `-q <queue>` | direct |
| `resources` | `-l <name>=<value>` | direct |
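The mapping above is mechanical enough to sketch. The following illustrates how job fields could be rendered into `qsub` flags given the `[schedulers.sge]` settings (illustrative field names taken from the table, not from the package source):

```python
def sge_flags(job: dict, cfg: dict) -> list[str]:
    """Render job fields into qsub flags per the mapping table."""
    flags: list[str] = []
    if "cpu" in job:  # CPU slots go through the configured parallel environment
        flags += ["-pe", cfg["parallel_environment"], str(job["cpu"])]
    if "mem" in job:  # memory uses the site-specific resource name
        flags += ["-l", f"{cfg['memory_resource']}={job['mem']}"]
    if "time" in job:
        flags += ["-l", f"{cfg['time_resource']}={job['time']}"]
    if "queue" in job:  # queue maps directly
        flags += ["-q", job["queue"]]
    for res in job.get("resources", []):  # extra resources map directly
        flags += ["-l", f"{res['name']}={res['value']}"]
    return flags

cfg = {"parallel_environment": "smp",
       "memory_resource": "mem_free",
       "time_resource": "h_rt"}
print(sge_flags({"cpu": 4, "mem": "16G"}, cfg))
```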
Full `[schedulers.sge]` reference:

```toml
[schedulers.sge]
# Resource naming -- these must match your site's SGE configuration
parallel_environment = "smp"   # PE name for CPU slots (some sites use "mpi", "threaded", etc.)
memory_resource = "mem_free"   # Memory resource name (common alternatives: "h_vmem", "virtual_free")
time_resource = "h_rt"         # Time limit resource name (commonly "h_rt")

# Output handling
merge_output = true            # Merge stderr into stdout (-j y)

# Module system
purge_modules = true           # Run 'module purge' before loading job modules
silent_modules = false         # Suppress module command output (-s flag)
module_init_script = ""        # Path to module init script (auto-detected if empty)

# Environment
expand_makeflags = true        # Expand $NSLOTS in MAKEFLAGS for parallel make
unset_vars = []                # Environment variables to unset in jobs,
                               # e.g. ["https_proxy", "http_proxy"]
```

Fully populated config example:
```toml
[defaults]
scheduler = "auto"
cpu = 1
mem = "4G"
time = "1:00:00"
queue = "batch.q"
use_cwd = true
inherit_env = true
stdout = "hpc.%N.%J.out"
modules = ["gcc/12.2", "python/3.11"]
resources = [
    { name = "scratch", value = "20G" }
]

[schedulers.sge]
parallel_environment = "smp"
memory_resource = "mem_free"
time_resource = "h_rt"
merge_output = true
purge_modules = true
silent_modules = false
expand_makeflags = true
unset_vars = ["https_proxy", "http_proxy"]

[tools.python]
cpu = 4
mem = "16G"
time = "4:00:00"
queue = "short.q"
modules = ["-", "python/3.11"]  # leading "-" replaces the list instead of merging
resources = [
    { name = "tmpfs", value = "8G" }
]

[types.interactive]
queue = "interactive.q"
time = "8:00:00"
cpu = 2
mem = "8G"

[types.gpu]
queue = "gpu.q"
cpu = 8
mem = "64G"
time = "12:00:00"
resources = [
    { name = "gpu", value = 1 }
]
```

Launch the interactive job monitor:

```bash
hpc monitor
```

Key bindings:

- `q` - Quit
- `r` - Refresh
- `u` - Toggle user filter (my jobs / all)
- `/` - Search
- `Enter` - View job details
- `Tab` - Switch tabs
```
hpc run [OPTIONS] [-- COMMAND]

Options:
  --job-name TEXT       Job name
  --cpu INTEGER         Number of CPUs
  --mem TEXT            Memory (e.g., 16G, 4096M)
  --time TEXT           Time limit (e.g., 4:00:00)
  --queue TEXT          Queue/partition name
  --nodes INTEGER       Number of nodes (MPI jobs)
  --ntasks INTEGER      Number of tasks (MPI jobs)
  --directory PATH      Working directory
  --job-type TEXT       Job type from config
  --module TEXT         Module to load (repeatable)
  --module-path PATH    Module path to use (repeatable)
  --stdout TEXT         Stdout file path pattern
  --stderr TEXT         Separate stderr file (default: merged)
  --array TEXT          Array spec (e.g., 1-100, 1-100%5)
  --depend TEXT         Job dependencies
  --inherit-env/--no-inherit-env
  --interactive         Run interactively (qrsh/srun)
  --local               Run locally (no scheduler)
  --dry-run             Show script without submitting
  --wait                Wait for completion
  --keep-script         Keep job script for debugging
  -h, --help            Show help
```

Other commands:

```
hpc status [JOB_ID]   Check job status (--history, --watch, --json)
hpc cancel JOB_ID     Cancel a job
hpc monitor           Interactive TUI
hpc config show       Show active configuration
hpc config init       Create starter config
```
```bash
# Setup environment
source sourceme
source sourceme --clean  # Clean rebuild

# Run tests
pytest
pytest -v
pytest -k "test_job"

# Type checking
mypy src/hpc_runner

# Linting
ruff check src/hpc_runner
ruff format src/hpc_runner
```

MIT License - see LICENSE file for details.