Unified Slurm scripts by pentschev · Pull Request #241 · rapidsai/velox-testing

pentschev · 2026-02-18T15:44:34Z

Unified Launcher: Overview and Motivation

The presto/slurm/unified/ directory provides an alternative launcher for running Presto TPC-H benchmarks on Slurm. It is intended to produce the same benchmark results as the existing presto/slurm/presto-nvl72/ scripts while consolidating the execution logic into fewer files with a more linear control flow.

How the original launcher works

The existing launcher involves several files and external tools that coordinate across multiple directories in the repository:

launch-run.sh parses CLI arguments and submits the job via sbatch, passing parameters as exported environment variables.
run-presto-benchmarks.slurm receives those environment variables, sets up paths and computed values (e.g., CONFIGS pointing to presto/docker/config/generated/gpu/), then delegates to a shell script.
run-presto-benchmarks.sh sources two helper files (echo_helpers.sh and functions.sh) and orchestrates the benchmark phases: setup, coordinator launch, worker launch, schema creation, query execution, and result collection.
functions.sh contains the actual implementation of each phase. During the setup phase, if generated configs do not already exist, it invokes presto/scripts/generate_presto_config.sh.
generate_presto_config.sh calls the pbench binary (presto/pbench/pbench genconfig) to render Go-style templates from presto/docker/config/template/ using parameters from presto/docker/config/params.json. After generation, it applies variant-specific patches (e.g., uncommenting GPU optimizer flags, toggling multi-worker settings) and duplicates per-worker configs via the duplicate_worker_configs() function, which applies sed transformations to adjust ports, node IDs, and multi-node flags.

In total, a single benchmark run traverses at least six files across four directories, plus the pbench binary and its template/parameter inputs.
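The hand-off between the first two steps (CLI arguments turned into exported environment variables for the batch script) could look roughly like the sketch below. It is not taken from launch-run.sh: the variable names NUM_WORKERS and SCALE_FACTOR and the helper build_sbatch_cmd are hypothetical, and the command is only printed rather than submitted.

```shell
#!/usr/bin/env bash
set -euo pipefail

# build_sbatch_cmd NUM_NODES SCALE_FACTOR SCRIPT
# Prints the sbatch command line a launcher could run.
# NUM_WORKERS and SCALE_FACTOR are hypothetical variable names,
# not necessarily the ones the real scripts export.
build_sbatch_cmd() {
  local nodes=$1 scale_factor=$2 script=$3
  local cmd=(
    sbatch
    --nodes="$nodes"
    # --export=ALL,... forwards the caller's environment plus the listed variables.
    --export="ALL,NUM_WORKERS=${nodes},SCALE_FACTOR=${scale_factor}"
    "$script"
  )
  printf '%s\n' "${cmd[*]}"
}

# Dry run: print the command instead of submitting it.
build_sbatch_cmd 2 1000 run-presto-benchmarks.slurm
```

Inside the batch script, the forwarded values would then be read as ordinary environment variables.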

How the unified launcher works

The unified launcher consists of two files:

launch.sh parses CLI arguments and submits the job via sbatch.
run.slurm contains all logic in a single file: environment setup, config generation (from local templates in config-templates/), coordinator launch, worker launch, schema setup, and query execution.

Configuration templates live in config-templates/ within the unified directory itself and use simple __PLACEHOLDER__ substitution via sed, with no external tooling required. Memory settings are computed dynamically from the node's actual RAM using the same formulas defined in params.json.
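As a minimal sketch of the __PLACEHOLDER__ substitution described above, assuming placeholder names and a helper (render_template, HTTP_PORT, COORD_HOST) that are illustrative only and not taken from run.slurm:

```shell
#!/usr/bin/env bash
set -euo pipefail

# render_template TEMPLATE KEY=VALUE...
# Replaces every __KEY__ token in TEMPLATE with VALUE and prints the result.
# Values must not contain '|' or '&', which are special in the sed expression.
render_template() {
  local template=$1; shift
  local args=()
  local kv
  for kv in "$@"; do
    args+=(-e "s|__${kv%%=*}__|${kv#*=}|g")
  done
  sed "${args[@]}" "$template"
}

# Demo with a throwaway template; the placeholder names are hypothetical.
tmpl=$(mktemp)
cat > "$tmpl" <<'EOF'
http-server.http.port=__HTTP_PORT__
discovery.uri=http://__COORD_HOST__:__HTTP_PORT__
EOF

render_template "$tmpl" HTTP_PORT=8080 COORD_HOST=node01
rm -f "$tmpl"
```

Because the substitution is plain sed, the only dependency is a POSIX userland, which is the point of dropping the pbench rendering step.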

Parallel container loading

One structural improvement in the unified launcher is that all container images are loaded in parallel. The original launcher waits on the host for the coordinator to become active before launching workers, and waits for workers to register before launching the benchmark container. Each of these phases incurs a container image load delay. The unified launcher issues all srun commands immediately and moves the dependency waits inside the containers themselves, so image loading happens concurrently.
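The pattern can be sketched as follows. This is a simulation under stated assumptions, not the code in run.slurm: the three shell functions stand in for backgrounded srun steps, marker files stand in for real readiness checks (such as an HTTP probe of the coordinator), and wait_until is a hypothetical helper.

```shell
#!/usr/bin/env bash
set -euo pipefail

# wait_until TIMEOUT_SECONDS COMMAND...
# Polls COMMAND once per second until it succeeds or the timeout expires.
# Returns 0 on success and 1 on timeout.
wait_until() {
  local timeout=$1; shift
  local elapsed=0
  until "$@" >/dev/null 2>&1; do
    if (( elapsed >= timeout )); then
      return 1
    fi
    sleep 1
    elapsed=$(( elapsed + 1 ))
  done
}

state=$(mktemp -d)

# Each function stands in for one container step; the dependency wait
# runs inside the step itself, not on the host between launches.
coordinator() { sleep 1; touch "$state/coordinator.ready"; }
worker()      { wait_until 10 test -e "$state/coordinator.ready"; touch "$state/worker.ready"; }
benchmark()   { wait_until 10 test -e "$state/worker.ready"; echo "benchmark started"; }

# All steps start immediately, so their startup costs overlap.
coordinator &
worker &
benchmark &
wait

rm -rf "$state"
```

With host-side waits, the three startup delays add up; with the waits moved inside the steps, total startup is closer to the slowest single image load plus the dependency chain.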

Scope

This unified launcher was developed primarily for the use case of running TPC-H GPU benchmarks on Slurm with a straightforward configuration. The original implementation may have been architected with additional considerations in mind -- for example, supporting multiple variant types (CPU, GPU, Java), integration with broader infrastructure tooling via pbench, or other workflows that benefit from the separation of template rendering and config patching. There may be more advanced use cases where the original implementation's flexibility is preferred or required, and this unified alternative is not intended to replace it in those scenarios.

Note

Please note that this PR is based on an older version of #202. Subsequent changes to the benchmark code appear to have broken it, so some work may be needed to bring it up to date. That said, the PR in its current form was still running on NVL72.


misiugodfrey and others added 29 commits January 28, 2026 08:15

Slurm scripts

16b3b23

untested refactor

ba18211

Refactor

5883a96

fix config bug

0e61ed7

more generate fixes

a8c995b

Appeneded to launch

30efdd6

reverted script changes and copy metadata

5a3e4b4

Merge branch 'main' into misiug/slurmscripts

cf4db8f

remove dead code

cb86355

remove absolute paths

82da5a2

Merge branch 'main' of https://github.com/rapidsai/velox-testing into misiug/slurmscripts

e2d76ce

Add simplified Slurm scripts

ac96833

Reduce timeout

56d8703

Increase coordinator/workers timeout

d56f868

Fix worker configuration

a7316f7

Fix memory calculation

780cbf4

Fix missing cluster tag and GPU optimizations

da7e297

Remove cluster-tag (Java-only coordinator config)

1f52268

Reintroduce config comments

2e7e05d

Remove tpcds properties, match LD_LIBRARY_PATH

9159fe5

Copy pre-analyzed hive metastore data

ac47c24

Reduce startup latency starting all containers simultaneously

9754227

Disable cuDF JIT expression

d0c63e8

Combine setup and run benchmarks in single step for faster startup

4b1c21e

Remove senseless defaults

e85006e

Add missing files to directory structure

c2b06a7

Remove unnecessary properties

653a469

Remove tpch.properties

70f2106

Rename to unified

89be5ea

pentschev wants to merge 29 commits into rapidsai:main from pentschev:slurm-unified


2 participants
