Integrate sglang disagg models running on SLURM Cluster #42
Open
The SGLang Disagg models were added in this PR: https://github.com/ROCm/MAD-private/pull/112
… improve robustness
…fig which should work on SLURM login nodes
Motivation
Integrate the sglang disagg models, which run prefill and decode on separate nodes, so that madengine can launch them on a SLURM cluster.
Technical Details
Refactor madengine to run the sglang_disagg models from the MAD-private repo.
(1) The adopted models have been added to models.json.
(2) The interface is the same as in legacy madengine, e.g., madengine run --tags sglang_disagg_pd_qwen3-32B --additional-context "{'slurm_args': {'FRAMEWORK': 'sglang_disagg', 'PREFILL_NODES': '2', 'DECODE_NODES': '2', 'PARTITION': 'amd-rccl', 'TIME': '12:00:00', 'DOCKER_IMAGE': ''}}"
(3) A slurm_args field is added to the context. Its keys are FRAMEWORK, PREFILL_NODES, DECODE_NODES, PARTITION, TIME, and DOCKER_IMAGE. If DOCKER_IMAGE is empty, the default image in run.sh is used. MODEL_NAME is read from the args attribute of the selected model's entry in models.json: it is the value that follows --model, e.g., DeepSeek-V2 for --model DeepSeek-V2.
(4) If the flow finds slurm_args in the context, it executes the script 'scripts/sglang_disagg/run.sh' to submit the job to the SLURM cluster directly and skips the run_model function, which would otherwise build the Docker image and run the container.
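The handling described in (3) and (4) can be sketched roughly as follows. This is an illustration only, not the code in this PR: the helper names (parse_additional_context, extract_model_name, build_slurm_env) are hypothetical, and the sketch assumes that the context string is a Python-literal dict, that the settings reach run.sh as environment variables, and that leaving DOCKER_IMAGE unset lets run.sh fall back to its default image.

```python
import ast
import shlex

# Keys listed in the PR description for the slurm_args field.
SLURM_KEYS = ("FRAMEWORK", "PREFILL_NODES", "DECODE_NODES",
              "PARTITION", "TIME", "DOCKER_IMAGE")
RUN_SCRIPT = "scripts/sglang_disagg/run.sh"


def parse_additional_context(text):
    """Parse the dict passed via --additional-context (assumed Python-literal syntax)."""
    context = ast.literal_eval(text)
    if not isinstance(context, dict):
        raise ValueError("additional context must be a dict")
    return context


def extract_model_name(args):
    """Return the value that follows --model in a model's args string."""
    tokens = shlex.split(args)
    for i, token in enumerate(tokens):
        if token == "--model" and i + 1 < len(tokens):
            return tokens[i + 1]
        if token.startswith("--model="):
            return token.split("=", 1)[1]
    raise ValueError("no --model value found in args")


def build_slurm_env(context, model_entry):
    """Return the variables for run.sh, or None when slurm_args is absent."""
    slurm_args = context.get("slurm_args")
    if not slurm_args:
        return None  # no slurm_args: the caller keeps the regular run_model path
    # Empty values (such as DOCKER_IMAGE == '') are omitted on purpose.
    env = {key: str(slurm_args[key]) for key in SLURM_KEYS if slurm_args.get(key)}
    env["MODEL_NAME"] = extract_model_name(model_entry["args"])
    return env


if __name__ == "__main__":
    context = parse_additional_context(
        "{'slurm_args': {'FRAMEWORK': 'sglang_disagg', 'PREFILL_NODES': '2', "
        "'DECODE_NODES': '2', 'PARTITION': 'amd-rccl', 'TIME': '12:00:00', "
        "'DOCKER_IMAGE': ''}}"
    )
    # Hypothetical models.json entry reduced to the one attribute used here.
    env = build_slurm_env(context, {"args": "--model DeepSeek-V2"})
    print(RUN_SCRIPT, env)
```

A caller could merge the returned dict into os.environ and invoke the script with subprocess.run; when build_slurm_env returns None, the existing run_model path would be used instead.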
Test Plan
Test Result
Submission Checklist