Updates to slurm launcher #68

Open
raviguptaamd wants to merge 1 commit into coketaste/refactor-dis from raviguptaamd/update_slurm_launcher
Conversation

@raviguptaamd

Motivation

PR Summary: SGLang Disaggregated Inference Support for Madengine v2

Overview
This PR adds support for running SGLang Disaggregated (P/D) inference workloads on SLURM clusters. The key architectural change is introducing a "baremetal launcher" path for disaggregated inference launchers that manage their own Docker containers via SLURM.

Files Modified

  1. src/madengine/cli/commands/build.py (+18 lines)
    Purpose: Add CLI options for flexible build workflows
    Changes:
    - Added `--use-image` option: skip the Docker build and use a pre-built image
    - Added `--build-on-compute` option: build Docker images on a SLURM compute node
    Safety:
    - Both options are optional and default to disabled
    - Existing build workflows are unchanged when the options are not specified
    - No impact on other commands

```python
# New options (both optional, default False/None)
use_image: Optional[str] = None
build_on_compute: bool = False
```
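For illustration, the two flags could be wired up as below. This is a hedged sketch using `argparse`; madengine's actual CLI framework may differ, and `build_parser` is a hypothetical helper, not a function from the PR.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Sketch of how --use-image and --build-on-compute might be declared."""
    parser = argparse.ArgumentParser(prog="madengine build")
    # Skip the Docker build and run with a pre-built image instead
    parser.add_argument("--use-image", default=None,
                        help="Use a pre-built Docker image instead of building")
    # Build the Docker image on a SLURM compute node rather than the login node
    parser.add_argument("--build-on-compute", action="store_true",
                        help="Build Docker images on a SLURM compute node")
    return parser
```

With no flags given, both options stay at their disabled defaults, matching the safety note above.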

  2. src/madengine/deployment/slurm.py (+344 lines)
    Purpose: Support baremetal execution for disagg launchers
    Changes:
    - Added `self.reservation` parsing from the slurm config
    - Added baremetal launcher detection in `prepare()`:
      `if launcher_normalized in ["sglang-disagg", "vllm-disagg"]: return self._prepare_baremetal_script(model_info)`
    - Added `_prepare_baremetal_script()` method that generates simple wrapper scripts
    Safety:
    - Detection is explicit: it only triggers for the `sglang-disagg` or `vllm-disagg` launchers
    - All other launchers continue to use the existing `job.sh.j2` template path
    - Falls back to the standard flow if the launcher is not in the baremetal list

```python
# Only these launchers trigger the baremetal path
BAREMETAL_LAUNCHERS = ["sglang-disagg", "vllm-disagg"]
```
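The detection described above can be sketched as a small predicate. The constant and method names come from this PR summary, but the normalization helper here is a simplified stand-in, not the real madengine code:

```python
BAREMETAL_LAUNCHERS = ["sglang-disagg", "vllm-disagg"]

def normalize_launcher(launcher: str) -> str:
    # Simplified stand-in for whatever normalization madengine applies
    return launcher.strip().lower()

def uses_baremetal_path(launcher: str) -> bool:
    """True when the launcher manages its own containers via SLURM."""
    return normalize_launcher(launcher) in BAREMETAL_LAUNCHERS
```

Any launcher outside the list (e.g. `torchrun`) falls through to the existing template path.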

  3. src/madengine/execution/container_runner.py (+206 lines)
    Purpose: Execute scripts directly on baremetal for disagg launchers
    Changes:
    - Added `BAREMETAL_LAUNCHERS` constant at the top of the file
    - Added `_run_on_baremetal()` method for direct script execution
    - Added launcher detection before Docker container creation
    Safety:
    - The check is explicit and exits early:

      ```python
      if launcher_normalized in BAREMETAL_LAUNCHERS:
          return self._run_on_baremetal(...)
      # Standard Docker path continues below
      ```
    - All other launchers (torchrun, deepspeed, vllm, sglang, etc.) continue to Docker execution
    - No changes to the Docker container creation logic
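The early-exit dispatch in the runner can be illustrated as follows. `_run_on_baremetal()` and `BAREMETAL_LAUNCHERS` are named in this PR, but the class and the method bodies below are hypothetical stand-ins that return markers instead of executing anything:

```python
BAREMETAL_LAUNCHERS = ["sglang-disagg", "vllm-disagg"]

class ContainerRunnerSketch:
    """Illustrative stand-in for the runner's dispatch logic."""

    def run(self, launcher: str, script: str) -> str:
        launcher_normalized = launcher.strip().lower()
        if launcher_normalized in BAREMETAL_LAUNCHERS:
            # Disagg launchers manage their own Docker containers via SLURM
            return self._run_on_baremetal(script)
        # All other launchers continue down the Docker path
        return self._run_in_docker(script)

    def _run_on_baremetal(self, script: str) -> str:
        return f"baremetal:{script}"

    def _run_in_docker(self, script: str) -> str:
        return f"docker:{script}"
```

Because the baremetal branch returns before any container setup, the Docker creation logic is never reached for disagg launchers.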

  4. src/madengine/orchestration/build_orchestrator.py (+283 lines)
    Purpose: Support pre-built images and compute-node builds
    Changes:
    - Added `_execute_with_prebuilt_image()` for the `--use-image` option
    - Added `_execute_build_on_compute()` for the `--build-on-compute` option
    Safety:
    - Both methods are only called when the respective CLI options are specified
    - The standard `execute()` flow is unchanged when the options are not used:

      ```python
      def execute(self, ..., use_image=None, build_on_compute=False):
          if use_image:
              return self._execute_with_prebuilt_image(...)
          if build_on_compute:
              return self._execute_build_on_compute(...)
          # Standard build flow continues...
      ```
Impact Analysis

| Component | Affected Launchers | Other Launchers |
| --- | --- | --- |
| slurm.py | sglang-disagg, vllm-disagg | No change - uses existing template |
| container_runner.py | sglang-disagg, vllm-disagg | No change - uses Docker execution |
| build.py | All (optional flags) | No change - flags default to disabled |
| build_orchestrator.py | All (optional flags) | No change - standard build if flags not used |
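The orchestrator dispatch in item 4 can be sketched end to end. The method names match the PR summary, but this class and its return values are illustrative stand-ins, not the real build logic; note that `use_image` is checked first, so it takes precedence if both flags are somehow set:

```python
class BuildOrchestratorSketch:
    """Illustrative stand-in for the orchestrator's option dispatch."""

    def execute(self, use_image=None, build_on_compute=False):
        if use_image:
            # --use-image: skip the build entirely and reuse the image
            return self._execute_with_prebuilt_image(use_image)
        if build_on_compute:
            # --build-on-compute: defer the build to a SLURM compute node
            return self._execute_build_on_compute()
        # Standard build flow when neither flag is given
        return self._standard_build()

    def _execute_with_prebuilt_image(self, image):
        return f"prebuilt:{image}"

    def _execute_build_on_compute(self):
        return "compute-build"

    def _standard_build(self):
        return "standard-build"
```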

Backward Compatibility
- Existing models: all existing models should continue to work unchanged
- Existing launchers: torchrun, deepspeed, megatron-lm, vllm, sglang (non-disagg) are all unchanged
- Existing build workflows: default behavior preserved
- Existing SLURM deployments: standard job template path unchanged

Testing
Tested with:
- pyt_sglang_disagg_qwen3-32b model on a 3-node SLURM cluster
- Both salloc (existing allocation) and sbatch (new job) modes
- Pre-built image workflow with `--use-image`
- Benchmarks completed successfully (GSM8K accuracy + throughput sweep)
