Updates to slurm launcher #68

Open
raviguptaamd wants to merge 1 commit into coketaste/refactor-dis from raviguptaamd/update_slurm_launcher
Conversation

@raviguptaamd

Motivation

PR Summary: SGLang Disaggregated Inference Support for Madengine v2

Overview
This PR adds support for running SGLang Disaggregated (P/D) inference workloads on SLURM clusters. The key architectural change is introducing a "baremetal launcher" path for disaggregated inference launchers that manage their own Docker containers via SLURM.

Files Modified

  1. src/madengine/cli/commands/build.py (+18 lines)
    Purpose: Add CLI options for flexible build workflows
    Changes:
    - Added `--use-image` option: skip the Docker build and use a pre-built image
    - Added `--build-on-compute` option: build Docker images on a SLURM compute node
    Safety:
    - Both options are optional and default to disabled
    - Existing build workflows are unchanged when the options are not specified
    - No impact on other commands

```python
# New options (both optional, default False/None)
use_image: Optional[str] = None
build_on_compute: bool = False
```
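For illustration, the two flags could be wired up as below. This is a hedged sketch using `argparse`; madengine's actual CLI framework may differ, and `build_parser` is a hypothetical helper, not a function from the PR.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Sketch of how --use-image and --build-on-compute might be declared."""
    parser = argparse.ArgumentParser(prog="madengine build")
    # Skip the Docker build and run with a pre-built image instead
    parser.add_argument("--use-image", default=None,
                        help="Use a pre-built Docker image instead of building")
    # Build the Docker image on a SLURM compute node rather than the login node
    parser.add_argument("--build-on-compute", action="store_true",
                        help="Build Docker images on a SLURM compute node")
    return parser
```

With no flags given, both options stay at their disabled defaults, matching the safety note above.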

  2. src/madengine/deployment/slurm.py (+344 lines)
    Purpose: Support baremetal execution for disagg launchers
    Changes:
    - Added `self.reservation` parsing from the slurm config
    - Added baremetal launcher detection in `prepare()`:
      `if launcher_normalized in ["sglang-disagg", "vllm-disagg"]: return self._prepare_baremetal_script(model_info)`
    - Added `_prepare_baremetal_script()` method that generates simple wrapper scripts
    Safety:
    - Detection is explicit: it only triggers for the `sglang-disagg` or `vllm-disagg` launchers
    - All other launchers continue to use the existing `job.sh.j2` template path
    - Falls back to the standard flow if the launcher is not in the baremetal list

```python
# Only these launchers trigger the baremetal path
BAREMETAL_LAUNCHERS = ["sglang-disagg", "vllm-disagg"]
```
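The detection described above can be sketched as a small predicate. The constant and method names come from this PR summary, but the normalization helper here is a simplified stand-in, not the real madengine code:

```python
BAREMETAL_LAUNCHERS = ["sglang-disagg", "vllm-disagg"]

def normalize_launcher(launcher: str) -> str:
    # Simplified stand-in for whatever normalization madengine applies
    return launcher.strip().lower()

def uses_baremetal_path(launcher: str) -> bool:
    """True when the launcher manages its own containers via SLURM."""
    return normalize_launcher(launcher) in BAREMETAL_LAUNCHERS
```

Any launcher outside the list (e.g. `torchrun`) falls through to the existing template path.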

  3. src/madengine/execution/container_runner.py (+206 lines)
    Purpose: Execute scripts directly on baremetal for disagg launchers
    Changes:
    - Added `BAREMETAL_LAUNCHERS` constant at the top of the file
    - Added `_run_on_baremetal()` method for direct script execution
    - Added launcher detection before Docker container creation
    Safety:
    - The check is explicit and exits early:

      ```python
      if launcher_normalized in BAREMETAL_LAUNCHERS:
          return self._run_on_baremetal(...)
      # Standard Docker path continues below
      ```
    - All other launchers (torchrun, deepspeed, vllm, sglang, etc.) continue to Docker execution
    - No changes to the Docker container creation logic
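The early-exit dispatch in the runner can be illustrated as follows. `_run_on_baremetal()` and `BAREMETAL_LAUNCHERS` are named in this PR, but the class and the method bodies below are hypothetical stand-ins that return markers instead of executing anything:

```python
BAREMETAL_LAUNCHERS = ["sglang-disagg", "vllm-disagg"]

class ContainerRunnerSketch:
    """Illustrative stand-in for the runner's dispatch logic."""

    def run(self, launcher: str, script: str) -> str:
        launcher_normalized = launcher.strip().lower()
        if launcher_normalized in BAREMETAL_LAUNCHERS:
            # Disagg launchers manage their own Docker containers via SLURM
            return self._run_on_baremetal(script)
        # All other launchers continue down the Docker path
        return self._run_in_docker(script)

    def _run_on_baremetal(self, script: str) -> str:
        return f"baremetal:{script}"

    def _run_in_docker(self, script: str) -> str:
        return f"docker:{script}"
```

Because the baremetal branch returns before any container setup, the Docker creation logic is never reached for disagg launchers.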

  4. src/madengine/orchestration/build_orchestrator.py (+283 lines)
    Purpose: Support pre-built images and compute-node builds
    Changes:
    - Added `_execute_with_prebuilt_image()` for the `--use-image` option
    - Added `_execute_build_on_compute()` for the `--build-on-compute` option
    Safety:
    - Both methods are only called when the respective CLI options are specified
    - The standard `execute()` flow is unchanged when the options are not used:

      ```python
      def execute(self, ..., use_image=None, build_on_compute=False):
          if use_image:
              return self._execute_with_prebuilt_image(...)
          if build_on_compute:
              return self._execute_build_on_compute(...)
          # Standard build flow continues...
      ```
Impact Analysis

| Component | Affected Launchers | Other Launchers |
| --- | --- | --- |
| slurm.py | sglang-disagg, vllm-disagg | No change - uses existing template |
| container_runner.py | sglang-disagg, vllm-disagg | No change - uses Docker execution |
| build.py | All (optional flags) | No change - flags default to disabled |
| build_orchestrator.py | All (optional flags) | No change - standard build if flags not used |
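The orchestrator dispatch in item 4 can be sketched end to end. The method names match the PR summary, but this class and its return values are illustrative stand-ins, not the real build logic; note that `use_image` is checked first, so it takes precedence if both flags are somehow set:

```python
class BuildOrchestratorSketch:
    """Illustrative stand-in for the orchestrator's option dispatch."""

    def execute(self, use_image=None, build_on_compute=False):
        if use_image:
            # --use-image: skip the build entirely and reuse the image
            return self._execute_with_prebuilt_image(use_image)
        if build_on_compute:
            # --build-on-compute: defer the build to a SLURM compute node
            return self._execute_build_on_compute()
        # Standard build flow when neither flag is given
        return self._standard_build()

    def _execute_with_prebuilt_image(self, image):
        return f"prebuilt:{image}"

    def _execute_build_on_compute(self):
        return "compute-build"

    def _standard_build(self):
        return "standard-build"
```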

Backward Compatibility
- Existing models: all existing models should continue to work unchanged
- Existing launchers: torchrun, deepspeed, megatron-lm, vllm, sglang (non-disagg) are all unchanged
- Existing build workflows: default behavior preserved
- Existing SLURM deployments: standard job template path unchanged

Testing
Tested with:
- pyt_sglang_disagg_qwen3-32b model on a 3-node SLURM cluster
- Both salloc (existing allocation) and sbatch (new job) modes
- Pre-built image workflow with `--use-image`
- Benchmarks completed successfully (GSM8K accuracy + throughput sweep)
