PR Summary: SGLang Disaggregated Inference Support for Madengine v2
Overview
This PR adds support for running SGLang Disaggregated (P/D) inference workloads on SLURM clusters. The key architectural change is introducing a "baremetal launcher" path for disaggregated inference launchers that manage their own Docker containers via SLURM.
Files Modified
build.py
Purpose: Add CLI options for flexible build workflows
Changes:
Added --use-image option: Skip Docker build and use pre-built image
Added --build-on-compute option: Build Docker images on SLURM compute node
Safety:
Both options are optional and default to disabled
Existing build workflows unchanged when options not specified
No impact on other commands
New options (both optional; defaults `None`/`False`):

```python
use_image: Optional[str] = None
build_on_compute: bool = False
```
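As a rough illustration of how these two options could be wired into the CLI, here is a minimal argparse sketch. The option names come from this summary; the actual madengine CLI wiring (argparse vs. click/typer, help text) is an assumption.

```python
# Hypothetical sketch only: option names are from the PR summary,
# the argparse wiring is illustrative, not madengine's real CLI code.
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="madengine build")
    # Skip the Docker build entirely and run against a pre-built image.
    parser.add_argument("--use-image", dest="use_image", default=None,
                        help="Tag of a pre-built Docker image to use")
    # Build the image on a SLURM compute node instead of the login node.
    parser.add_argument("--build-on-compute", dest="build_on_compute",
                        action="store_true", default=False,
                        help="Build the Docker image on a SLURM compute node")
    return parser

# With no flags given, both options stay disabled, matching the
# "default behavior preserved" guarantee above.
defaults = build_parser().parse_args([])
assert defaults.use_image is None and defaults.build_on_compute is False
```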
slurm.py
Purpose: Support baremetal execution for disagg launchers
Changes:
Added self.reservation parsing from slurm config
Added baremetal launcher detection in prepare():
```python
if launcher_normalized in ["sglang-disagg", "vllm-disagg"]:
    return self._prepare_baremetal_script(model_info)
```
Added _prepare_baremetal_script() method that generates simple wrapper scripts
Safety:
Detection is explicit: Only triggers for sglang-disagg or vllm-disagg launchers
All other launchers continue to use existing job.sh.j2 template path
Fallback to standard flow if launcher not in baremetal list
Only these launchers trigger the baremetal path:

```python
BAREMETAL_LAUNCHERS = ["sglang-disagg", "vllm-disagg"]
```
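The detection-plus-fallback behavior described above can be sketched as a small standalone function. The function name, the `-launch` command in the wrapper, and the template marker are hypothetical; only the launcher list and the "wrapper script vs. job.sh.j2 template" split come from the summary.

```python
# Illustrative sketch of the baremetal detection and fallback; the
# wrapper contents and helper names are assumptions, not the PR's code.
BAREMETAL_LAUNCHERS = ["sglang-disagg", "vllm-disagg"]

def prepare_script(launcher: str, model_name: str) -> str:
    """Return a simple wrapper script for baremetal launchers; any
    other launcher falls through to the existing template path."""
    launcher_normalized = launcher.strip().lower()
    if launcher_normalized in BAREMETAL_LAUNCHERS:
        # Simple wrapper: the disagg launcher manages its own containers.
        return "\n".join([
            "#!/bin/bash",
            f"# baremetal wrapper for {model_name}",
            f"exec {launcher_normalized}-launch {model_name}",  # hypothetical command
        ])
    # All other launchers keep using the job.sh.j2 template flow.
    return "TEMPLATE:job.sh.j2"
```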
src/madengine/execution/container_runner.py (+206 lines)
Purpose: Execute scripts directly on baremetal for disagg launchers
Changes:
Added BAREMETAL_LAUNCHERS constant at top of file
Added _run_on_baremetal() method for direct script execution
Added launcher detection before Docker container creation
Safety:
Check is explicit and early exit:
```python
if launcher_normalized in BAREMETAL_LAUNCHERS:
    return self._run_on_baremetal(...)
# Standard Docker path continues below
```
All other launchers (torchrun, deepspeed, vllm, sglang, etc.) continue to Docker execution
No changes to Docker container creation logic
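The early-exit pattern above can be demonstrated with a self-contained sketch. The `run_on_baremetal`/`run_in_docker` bodies are stand-ins for the real subprocess and Docker logic; only the dispatch shape is taken from the summary.

```python
# Hedged sketch of the container runner's early-exit check; the run_*
# helpers are placeholders, not madengine's actual execution code.
BAREMETAL_LAUNCHERS = ["sglang-disagg", "vllm-disagg"]

def run(launcher: str, script: str) -> str:
    launcher_normalized = launcher.strip().lower()
    if launcher_normalized in BAREMETAL_LAUNCHERS:
        return run_on_baremetal(script)   # disagg path: no Docker container
    return run_in_docker(script)          # all other launchers: unchanged

def run_on_baremetal(script: str) -> str:
    return f"baremetal:{script}"          # stand-in for direct script execution

def run_in_docker(script: str) -> str:
    return f"docker:{script}"             # stand-in for container creation
```

Because the check happens before any container setup, the Docker code path is untouched for every non-disagg launcher.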
src/madengine/orchestration/build_orchestrator.py (+283 lines)
Purpose: Support pre-built images and compute-node builds
Changes:
Added _execute_with_prebuilt_image() for --use-image option
Added _execute_build_on_compute() for --build-on-compute option
Safety:
Both methods are only called when respective CLI options are specified
Standard execute() flow unchanged when options not used:
```python
def execute(self, ..., use_image=None, build_on_compute=False):
    if use_image:
        return self._execute_with_prebuilt_image(...)
    if build_on_compute:
        return self._execute_build_on_compute(...)
    # Standard build flow continues...
```
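The dispatch order matters: `use_image` is checked before `build_on_compute`, and the default path is reached only when neither flag is set. A runnable sketch (return values are placeholders for the real orchestrator results):

```python
# Minimal sketch of the build orchestrator's dispatch; the tuple return
# values are stand-ins used only to make the precedence observable.
def execute(use_image=None, build_on_compute=False):
    if use_image:
        return ("prebuilt", use_image)    # --use-image path
    if build_on_compute:
        return ("compute-build", None)    # --build-on-compute path
    return ("standard", None)             # default flow, unchanged
```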
Impact Analysis

| Component | Affected Launchers | Other Launchers |
| --- | --- | --- |
| slurm.py | sglang-disagg, vllm-disagg | No change; uses existing template |
| container_runner.py | sglang-disagg, vllm-disagg | No change; uses Docker execution |
| build.py | All (optional flags) | No change; flags default to disabled |
| build_orchestrator.py | All (optional flags) | No change; standard build if flags not used |
Backward Compatibility
Existing models: All continue to work unchanged
Existing launchers: torchrun, deepspeed, megatron-lm, vllm, sglang (non-disagg) all unchanged
Existing build workflows: Default behavior preserved
Existing SLURM deployments: Standard job template path unchanged
Testing
Tested with:
pyt_sglang_disagg_qwen3-32b model on 3-node SLURM cluster
Both salloc (existing allocation) and sbatch (new job) modes
Pre-built image workflow with --use-image
Benchmarks completed successfully (GSM8K accuracy + throughput sweep)