
Conversation

Contributor

@GuanxingLu GuanxingLu commented Dec 17, 2025

Approaches Considered:

  1. Merge/Unmerge: Merge the LoRA weights into the base model -> send the merged weights -> unmerge for training.
  2. File-based (current): Save the adapter to disk -> the rollout engine loads it via load_lora_adapter (e.g., AReal).
  3. Direct tensor: Send LoRA tensors directly through memory (e.g., verl).

Current Implementation: Implemented Option 2 using UpdateWeightFromDisk.

Limitation: When rollout engine offloading is enabled, the base model weights must be re-transmitted on each update (though the overhead is manageable).
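Roughly, the Option 2 flow looks like the sketch below. This is illustrative only: the helper name, request payloads, and the /tmp path are placeholders rather than the exact code in this PR.

```python
# Illustrative sketch of the file-based sync (Option 2); names/paths are assumptions.
import os
import requests

def sync_lora_via_disk(peft_model, rollout_url, step, lora_name="policy_lora"):
    # 1. Dump only the adapter weights; this is tiny compared to the base model.
    adapter_dir = os.path.join("/tmp/lora_sync", f"step_{step}")
    peft_model.save_pretrained(adapter_dir)

    # 2. Ask the rollout engine to (re)load the adapter from disk.
    #    The engine-side entry points are load_lora_adapter / unload_lora_adapter.
    requests.post(
        f"{rollout_url}/load_lora_adapter",
        json={"lora_name": lora_name, "lora_path": adapter_dir},
    )
```

Note that this flow assumes the trainer and the rollout engine can see the same filesystem.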

Status:
✅ Trained Qwen3-4B with LoRA on a single A100 40G (full-parameter training triggers OOM).
🚧 Reward curve pending.

TODO:

  • Implement Options 1 & 3 for comparison.

Update

Current Status:

  • Added a '--offload-rollout-level' argument: level 1 offloads only the KV cache / CUDA graphs; level 2 offloads the weights plus the KV cache / CUDA graphs. When the rollout engine does not release the base model weights, only the LoRA weights need to be synced (see the sketch after this list).
  • Modified the UpdateWeight base class (this change may be somewhat invasive) to support LoRA updates alongside base model weight updates.
  • The current LoRA weight update logic works in both colocated and disaggregated modes.
  • Fixed the nits you mentioned earlier. However, I cannot remove _lora_loaded, because unload_lora_adapter throws an error if no LoRA adapter is loaded.
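To make the offload-level behavior concrete, a rough sketch is below. The argument wiring is real in spirit, but the updater methods (update_base_weights, load_lora_adapter, unload_lora_adapter) are hypothetical stand-ins for the actual UpdateWeight code in this PR.

```python
# Rough sketch only; the updater method names are placeholders, not the PR's exact API.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--offload-rollout-level",
    type=int,
    default=1,
    choices=[1, 2],
    help="1: offload KV cache / CUDA graphs only; 2: also offload base weights",
)

def update_rollout_weights(args, updater):
    if args.offload_rollout_level >= 2:
        # Level 2 released the base weights, so they must be re-sent first.
        updater.update_base_weights()
    # The LoRA adapter is small, so it is re-synced on every update.
    if updater._lora_loaded:
        # unload_lora_adapter errors out if no adapter is loaded, hence the flag.
        updater.unload_lora_adapter()
    updater.load_lora_adapter()
    updater._lora_loaded = True
```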

More TODOs:

  • Save/Load logic.
  • Efficient layer-wise parameter summoning for the LoRA weight update.
  • Option 3: Send LoRA tensors rather than files (rough idea sketched below).
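As a rough idea of what Option 3 could look like (not implemented here), the LoRA tensors could be gathered and broadcast over a process group instead of written to files. The sketch below ignores FSDP sharding (which is why the efficient layer summoning item above matters), and the function/group names are illustrative.

```python
# Illustrative-only sketch of Option 3: broadcast LoRA tensors instead of files.
# Assumes full (unsharded) LoRA parameters and an existing trainer<->rollout group.
import torch.distributed as dist

def broadcast_lora_tensors(peft_model, group, src_rank=0):
    # Collect only the LoRA parameters; for a 4B model at a small rank this is
    # typically tens of MB, far below the full base-model payload.
    lora_tensors = {
        name: param.detach().contiguous()
        for name, param in peft_model.named_parameters()
        if "lora_" in name
    }
    # Iterate in a deterministic order so sender and receiver stay in lockstep.
    for name in sorted(lora_tensors):
        dist.broadcast(lora_tensors[name], src=src_rank, group=group)
    return lora_tensors
```

The receiving side would need pre-allocated buffers with matching names and shapes, so some metadata exchange (or a fixed adapter config) is still required before the broadcast.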

Discussion

Feedback on the trade-offs of these approaches is welcome!

@PopSoda2002
Collaborator

Great work!

Collaborator

@PopSoda2002 PopSoda2002 left a comment


I have some nits for you :)

@GuanxingLu GuanxingLu force-pushed the feature/fsdp_lora branch 2 times, most recently from a3cde86 to d67ec28 on December 18, 2025 at 17:18
@GuanxingLu GuanxingLu marked this pull request as ready for review December 18, 2025 17:19
@GuanxingLu

This comment was marked as duplicate.

@PopSoda2002
Collaborator

@PopSoda2002 Could you please check this? Thank you!


I moved the status update and the remaining TODOs to the PR description.

@PopSoda2002
Collaborator

Can you add some test cases in wandb compared against a no-LoRA baseline? I can help with this part.

@GuanxingLu
Contributor Author

Sure, you can run the tests. I am also running them in my local setup.

@PopSoda2002
Collaborator

@GuanxingLu Also, could you please fix the linting? You can use pre-commit.

@PopSoda2002 PopSoda2002 changed the title from "[WIP] Support LoRA training for FSDP backend." to "[FSDP Support LoRA training for FSDP backend." on Dec 18, 2025
@PopSoda2002 PopSoda2002 changed the title from "[FSDP Support LoRA training for FSDP backend." to "[FSDP][1/n] Support LoRA training for FSDP backend." on Dec 18, 2025
Co-authored-by: PopSoda2002 <zhouhp.me@gmail.com>
@GuanxingLu
Contributor Author

Sorry about that; the pre-commit checks now pass.

@GuanxingLu GuanxingLu force-pushed the feature/fsdp_lora branch 2 times, most recently from 1331fc3 to b1ee5aa on December 20, 2025 at 15:38
refs = [engine.flush_cache.remote() for engine in self.rollout_engines]
ray.get(refs)

refs = [
Collaborator


I am curious how long it takes for SGLang to read the LoRA weights from disk. Would it be possible to pass them through NCCL? I am not sure about the file size.

Collaborator


Also, we cannot assume that every node in distributed training shares a file system, so the read-from-disk approach may not work here.

if self.args.colocate:
self.weight_updater = UpdateWeightFromTensor(self.args, self.model)
else:
self.weight_updater = UpdateWeightFromDistributed(self.args, self.model)
Collaborator


Can I ask why this was changed?

@THUDM THUDM deleted a comment from PopSoda2002 Dec 21, 2025
