Conversation

@JD-ETH (Contributor) commented Dec 20, 2025

This is a WIP PR that demonstrates an efficient implementation of remote weight updates through RDMA and TransferEngine, instead of through NCCL.

It is tested with sglang 0.5.6.post2.

[figure: replica]

We reuse the sglang engine memory registration introduced here: remote_instance_engine_info. To reduce registration overhead, we propose a solution with an engine replica on the trainer side to enable zero-copy, single-registration weight transfer.

The goal of this PR is to demonstrate a first working example using Qwen3-4B with TP on Sglang, and TP on the Megatron trainer.

CUDA_VISIBLE_DEVICES="1,2" python ./tests/test_weight_transfer.py --mode=rdma runs a minimal E2E example.

def register_memory_region_v1(named_param_with_buffers: Sequence[tuple[str, torch.Tensor]], transfer_engine):
    weight_mr_dict = {}
    for name, weight in named_param_with_buffers:
        # Register each weight's device buffer with the transfer engine.
        ret = transfer_engine.register(weight.data_ptr(), weight.numel() * weight.element_size())
        weight_mr_dict[name] = ret
    return weight_mr_dict
Collaborator:

Maybe we can use transfer_engine.batch_register_memory.

@JD-ETH (author):
I will remove this function; we only use the efficient registration version below.
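
For reference, a batched variant could look roughly like the sketch below. batch_register_memory is the call suggested above, but the signature used here (parallel lists of addresses and lengths) is an assumption and may differ from the actual TransferEngine API.

def register_memory_regions_batched(named_param_with_buffers, transfer_engine):
    # Hypothetical: collect every weight's address and length, then register
    # them in a single call instead of one call per tensor.
    names, addresses, lengths = [], [], []
    for name, weight in named_param_with_buffers:
        names.append(name)
        addresses.append(weight.data_ptr())
        lengths.append(weight.numel() * weight.element_size())
    # Assumed signature: batch_register_memory(addresses, lengths) -> results
    rets = transfer_engine.batch_register_memory(addresses, lengths)
    return dict(zip(names, rets))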

@lilei199908 (Collaborator) commented:

Regarding rollout TP: the tensor slice for the down-projection layer is non-contiguous in physical memory relative to the all-gathered full parameters on the training side. How should we handle this for efficient remote transfer (e.g., RDMA)?

@JD-ETH (author) commented Dec 23, 2025

With the current design, we force the source side to construct a model replica that has the exact same shape as the rollout side.
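
To make the contiguity point concrete, here is a small, self-contained illustration (toy shapes, not from the PR): the rollout rank's column slice of the trainer's all-gathered down_proj weight is non-contiguous, whereas a replica tensor built with the exact rollout shape is one contiguous buffer that can be registered once and read directly over RDMA.

import torch

# Toy shapes for illustration only.
full = torch.randn(4096, 11008)      # all-gathered down_proj weight on the trainer side
tp_rank, tp_size = 0, 2
cols = full.shape[1] // tp_size

# The rollout rank's column slice of the full tensor is not contiguous in memory.
shard_view = full[:, tp_rank * cols:(tp_rank + 1) * cols]
print(shard_view.is_contiguous())    # False

# A replica parameter with the exact rollout shape is a single contiguous buffer.
replica = shard_view.contiguous()
print(replica.is_contiguous())       # True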

@lilei199908 (Collaborator) commented:

Doesn't creating a rollout replica occupy more GPU memory (VRAM)?

@JD-ETH (author) commented Dec 24, 2025

Yep. The focus of the design is to avoid costly registration/deregistration with the transfer engine. We will add CPU offloading while keeping the virtual memory registration.
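
A rough sketch of the "register once, reuse forever" idea (illustrative only; the helper class and the host-staging approach below are assumptions on my part, and the virtual-memory-based offloading mentioned above is not shown): the replica buffers are registered a single time, every weight update writes into those same buffers, and offloading only moves their contents to pinned host memory without dropping the registration.

import torch

class ReplicaWeights:
    """Illustrative only: register replica buffers once, reuse them for every update."""

    def __init__(self, named_weights, transfer_engine):
        self.weights = dict(named_weights)   # replica tensors with the rollout-side shapes
        for w in self.weights.values():
            # Single registration at startup; never repeated on the update path.
            transfer_engine.register(w.data_ptr(), w.numel() * w.element_size())
        # Pinned host buffers used for staging between updates.
        self.host = {n: torch.empty(w.shape, dtype=w.dtype, pin_memory=True)
                     for n, w in self.weights.items()}

    def offload(self):
        # Stage contents on the host; the GPU registration stays valid.
        for n, w in self.weights.items():
            self.host[n].copy_(w, non_blocking=True)

    def restore(self):
        # Copy back into the same registered buffers before the next RDMA read.
        for n, w in self.weights.items():
            w.copy_(self.host[n], non_blocking=True)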

@lilei199908 (Collaborator) commented:

I wonder if using virtual memory registration might degrade performance, because the NIC only reads physical addresses?

@JD-ETH (author) commented Jan 4, 2026

We will reopen once we hit our design target; stay tuned.

@JD-ETH closed this on Jan 4, 2026.