
Allocate workers in one slurm job#126

Merged
dongwang218 merged 8 commits into main from avoid_array
Jan 12, 2026
Conversation


@dongwang218 dongwang218 commented Jan 7, 2026

Why?

Currently there are many slurm jobs: one for the head and several for the workers. For running training jobs, we want a single worker job so slurm can make a topology-aware allocation.

How?

This diff adds a parameter, --use_array False, so that

  • there are only two slurm jobs: one for the head and one for all the workers.
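A minimal sketch of the difference, assuming hypothetical names (worker_sbatch_args and ray_worker.sh are illustrative, not the actual matrix internals): with a job array, each worker is its own slurm job and may land anywhere; without it, one job requests all worker nodes at once, so slurm can place them together.

```python
def worker_sbatch_args(num_workers, use_array=True):
    """Sketch: build the sbatch command line for launching Ray workers.

    use_array=True  -> old behavior: a job array, one array task per worker.
    use_array=False -> new behavior: one job spanning all worker nodes,
                       letting slurm do a topology-aware allocation.
    """
    if use_array:
        # one slurm job per worker via a job array
        return ["sbatch", "--array=0-%d" % (num_workers - 1),
                "--nodes=1", "ray_worker.sh"]
    # a single slurm job that requests every worker node up front
    return ["sbatch", "--nodes=%d" % num_workers,
            "--ntasks-per-node=1", "ray_worker.sh"]

print(worker_sbatch_args(2, use_array=False))
```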

Test plan

matrix start_cluster --add_workers 2 --slurm '{"account": data, "qos": h100_lowest}' --use_array False

squeue --me
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2721455 h100 ray_work dongwang R 0:10 2 h100-011-158,h100-017-251
2721453 h100 ray_head dongwang R 1:10 1 h100-010-023

matrix deploy_applications --action add --applications "[{'model_name': '/datasets/pretrained-llms/Llama-3.1-8B-Instruct', 'use_grpc': 'false', 'min_replica': 16, 'model_size': '8B', 'name': '8B'}]"

@meta-cla bot added the CLA Signed label (this label is managed by the Meta Open Source bot) on Jan 7, 2026
@dongwang218 dongwang218 changed the title Use at most two slurm jobs to launch Ray cluster Allocate workers in one slurm job Jan 11, 2026
@dongwang218 dongwang218 merged commit 7f2cc93 into main Jan 12, 2026
8 checks passed
@dongwang218 dongwang218 deleted the avoid_array branch January 12, 2026 22:00
