label workers by logical_resources to constrain deployment#129

Merged
dongwang218 merged 1 commit into main from logical_resources
Feb 12, 2026

@dongwang218
Contributor

Why ?

For #58 and #111: small models occupy multiple nodes, causing fragmentation.

How ?

  • Manually control model placement by labeling nodes, then deploying models against those node labels.

Test plan

// request a node and label it with small_model
matrix --cluster_id rllm start_cluster --add_workers 1 --slurm '{"account": data, "qos": h100_data_high}' --logical_resources "{'small_model': 1}"

// deploying models only on the nodes labeled with "small_model"
matrix deploy_applications --cluster_id rllm --action add --applications "[{'model_name': '/datasets/pretrained-llms/Llama-3.1-8B-Instruct', 'min_replica': 8, 'model_size': '8B', 'name': '8B', 'use_grpc': 'false', 'ray_resources': {'resources': {'small_model': 0.01}}}]"
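
Because the --applications value is a quoted Python-style literal, a quoting slip is easy to miss. The sketch below is only a local sanity check, assuming the string is meant to parse as a Python literal; `requested_labels` is a hypothetical helper and not part of matrix. It lists which custom resources each application requests, so the label can be compared with the one given to --logical_resources.

```python
import ast

# The --applications value from the test plan above, copied verbatim.
APPLICATIONS_ARG = (
    "[{'model_name': '/datasets/pretrained-llms/Llama-3.1-8B-Instruct', "
    "'min_replica': 8, 'model_size': '8B', 'name': '8B', 'use_grpc': 'false', "
    "'ray_resources': {'resources': {'small_model': 0.01}}}]"
)


def requested_labels(applications_arg: str) -> dict:
    """Hypothetical helper: map each application name to the custom resources it requests."""
    # literal_eval raises ValueError or SyntaxError if the quoting is broken.
    apps = ast.literal_eval(applications_arg)
    return {
        app["name"]: app.get("ray_resources", {}).get("resources", {})
        for app in apps
    }


print(requested_labels(APPLICATIONS_ARG))  # → {'8B': {'small_model': 0.01}}
```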

@dongwang218 dongwang218 requested a review from jc-audet February 9, 2026 22:27
@meta-cla meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) Feb 10, 2026
@yangli5t
Contributor

yangli5t commented Feb 12, 2026

Curious about the test plan command: why is it {'small_model': 0.01} instead of {'small_model': 1}? logical_resources is defined as tp.Dict[str, int], so shouldn't it be 1?

@dongwang218
Copy link
Contributor Author

> Curious about the test plan command: why is it {'small_model': 0.01} instead of {'small_model': 1}? logical_resources is defined as tp.Dict[str, int], so shouldn't it be 1?

Each labeled node counts as 1, so if you have two nodes labeled with small_model, the system has 2 units of that resource. Each replica uses at least 2 * 0.01, so 8 replicas use 0.16 in total. That is why 0.01 is safe but, e.g., 0.1 is not: the same 8 replicas would ask for 1.6, which is more than the 1 unit a single node provides.
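
The arithmetic above can be checked with a short sketch. It assumes the worst case where all replicas land on one labeled node, takes the factor of 2 per replica directly from the comment above, and treats each labeled node as providing 1 unit; the function names are hypothetical and not part of matrix.

```python
def demand_on_one_node(replicas: int, per_request: float, requests_per_replica: int = 2) -> float:
    """Total label demand if every replica is packed onto a single node (worst case)."""
    return replicas * requests_per_replica * per_request


def fits_on_one_node(replicas: int, per_request: float, node_capacity: float = 1.0) -> bool:
    """True if the worst-case demand stays within what one labeled node provides."""
    return demand_on_one_node(replicas, per_request) <= node_capacity


# 8 replicas at 0.01 need about 0.16 of a node's label, so they fit.
print(fits_on_one_node(8, 0.01))  # → True
# 8 replicas at 0.1 need about 1.6, which exceeds one node's capacity of 1.
print(fits_on_one_node(8, 0.1))   # → False
```

Under these assumptions the fractional request acts only as a placement label: it should be small enough that the label itself never becomes the limiting resource on a node.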

@dongwang218 dongwang218 merged commit 319c68a into main Feb 12, 2026
8 checks passed
@dongwang218 dongwang218 deleted the logical_resources branch February 12, 2026 03:04
2 participants