[WIP] Implement RDMA P2P weight update using TransferEngine #1164
Conversation
```python
from typing import Sequence
import torch

def register_memory_region_v1(named_param_with_buffers: Sequence[tuple[str, torch.Tensor]], transfer_engine):
    weight_mr_dict = {}
    for name, weight in named_param_with_buffers:
        ret = transfer_engine.register(weight.data_ptr(), weight.numel() * weight.element_size())
        weight_mr_dict[name] = ret
    return weight_mr_dict
```
Maybe we can use `transfer_engine.batch_register_memory` instead?
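For reference, a rough sketch of what a batched variant might look like; the `batch_register_memory(ptrs, lengths)` signature is an assumption and would need to be checked against the actual TransferEngine binding:

```python
from typing import Sequence
import torch

def register_memory_region_batched(named_param_with_buffers: Sequence[tuple[str, torch.Tensor]], transfer_engine):
    names, ptrs, lengths = [], [], []
    for name, weight in named_param_with_buffers:
        names.append(name)
        ptrs.append(weight.data_ptr())
        lengths.append(weight.numel() * weight.element_size())
    # One call instead of one register() per tensor (signature assumed).
    transfer_engine.batch_register_memory(ptrs, lengths)
    return {name: (ptr, length) for name, ptr, length in zip(names, ptrs, lengths)}
```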
I will remove this function; we only use the efficient registration version below.
Regarding rollout TP, the tensor slice for the down-projection layer is non-contiguous in physical memory relative to the all-gathered full parameters on the training side. How should we handle this for efficient remote transfer (e.g., RDMA)?
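For context, a minimal illustration of the issue (small stand-in shapes, not this PR's code): a column slice of a row-major weight is strided, so it cannot be fetched as a single contiguous RDMA read:

```python
import torch

# Small stand-in for an all-gathered down-projection weight on the training side.
full_weight = torch.randn(16, 64)

# The slice a rollout TP rank would own along the sharded (input) dimension.
tp_rank, tp_size = 0, 2
cols = full_weight.shape[1] // tp_size
shard = full_weight[:, tp_rank * cols : (tp_rank + 1) * cols]

# Each row of the slice is separated by a stride in memory, so the slice is
# not a single contiguous region that a remote read could fetch directly.
print(shard.is_contiguous())  # False
```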
With the current design, we force the source side to construct a model replica that has the exact same shape as the rollout side.
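A hedged sketch of that idea (illustrative names, not this PR's actual code): the trainer keeps a contiguous buffer with the rollout-side shard shape and copies the strided slice into it, so the remote side reads one contiguous, pre-registered region:

```python
import torch

def build_rollout_replica(full_weight: torch.Tensor, tp_size: int) -> torch.Tensor:
    """Contiguous buffer shaped exactly like one rollout-side shard."""
    cols = full_weight.shape[1] // tp_size
    return torch.empty(full_weight.shape[0], cols,
                       dtype=full_weight.dtype, device=full_weight.device)

def stage_for_transfer(full_weight: torch.Tensor, replica: torch.Tensor,
                       tp_rank: int, tp_size: int) -> None:
    cols = full_weight.shape[1] // tp_size
    # copy_() handles the strided source; the destination stays contiguous,
    # so it can be registered once and read remotely as a single region.
    replica.copy_(full_weight[:, tp_rank * cols : (tp_rank + 1) * cols])
```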
Doesn't creating a rollout replica occupy more GPU memory (VRAM)?
Yep. The focus of the design is to avoid costly registration/deregistration with the transfer engine. We will add CPU offloading while keeping the virtual memory registration.
I wonder whether using virtual memory registration might degrade performance, since the NIC only reads physical addresses?
We will reopen once we hit our design target; stay tuned.

This is a WIP PR that demonstrates an efficient implementation of remote weight updates through RDMA and TransferEngine, instead of through NCCL.
It is tested with sglang 0.5.6.post2.

We reuse the sglang engine memory registration introduced here: remote_instance_engine_info. To reduce registration overhead, we propose a solution with an engine replica on the trainer side to enable zero-copy, single-registration weight transfer.
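A rough sketch of the intended update loop under this design. The registration call mirrors the snippet above; the write call (`transfer_sync_write`) and its arguments are assumptions for illustration, not the exact TransferEngine API used in this PR:

```python
import torch

def setup(replica_params: dict[str, torch.Tensor], transfer_engine) -> None:
    # One-time registration of the rollout-shaped replica buffers; avoids
    # per-update register/deregister overhead.
    for buf in replica_params.values():
        transfer_engine.register(buf.data_ptr(), buf.numel() * buf.element_size())

def update_weights(trainer_params, replica_params, transfer_engine,
                   session_id, remote_addrs) -> None:
    for name, buf in replica_params.items():
        # Re-stage the latest trainer weights into the already-registered buffer.
        buf.copy_(trainer_params[name])
        # Push to the rollout engine's pre-registered region (call name assumed).
        transfer_engine.transfer_sync_write(
            session_id,
            buf.data_ptr(),
            remote_addrs[name],
            buf.numel() * buf.element_size(),
        )
```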
The goal of this PR is to demonstrate a first working example using Qwen3-4B with TP on Sglang, and TP on the Megatron trainer.
`CUDA_VISIBLE_DEVICES="1,2" python ./tests/test_weight_transfer.py --mode=rdma` tests a minimal E2E example.