
Conversation

Contributor

@GuanxingLu GuanxingLu commented Dec 17, 2025

Approaches Considered:

  1. Merge/Unmerge: Merge the LoRA weights into the base model -> send the merged weights -> unmerge for training.
  2. File-based (current): Save the adapter to disk -> the rollout engine loads it via load_lora_adapter (e.g., AReal).
  3. Direct tensor: Send LoRA tensors directly through memory (e.g., verl).

Current Implementation: Implemented Option 2 using UpdateWeightFromDisk.

Limitation: When rollout engine offloading is enabled, the base model weights must be re-transmitted on each update (though the overhead is manageable).
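Roughly, the Option 2 flow looks like the sketch below. This is illustrative only: the helper name, request payloads, and the /tmp path are placeholders rather than the exact code in this PR.

```python
# Illustrative sketch of the file-based sync (Option 2); names/paths are assumptions.
import os
import requests

def sync_lora_via_disk(peft_model, rollout_url, step, lora_name="policy_lora"):
    # 1. Dump only the adapter weights; this is tiny compared to the base model.
    adapter_dir = os.path.join("/tmp/lora_sync", f"step_{step}")
    peft_model.save_pretrained(adapter_dir)

    # 2. Ask the rollout engine to (re)load the adapter from disk.
    #    The engine-side entry points are load_lora_adapter / unload_lora_adapter.
    requests.post(
        f"{rollout_url}/load_lora_adapter",
        json={"lora_name": lora_name, "lora_path": adapter_dir},
    )
```

Note that this flow assumes the trainer and the rollout engine can see the same filesystem.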

Status:
✅ Trained Qwen3-4B with LoRA on a single A100 40G (full-parameter training triggers OOM).
🚧 Reward curve pending.

TODO:

  • Implement Options 1 & 3 for comparison.

Update

Current Status:

  • Added a '--offload-rollout-level' argument: level 1 offloads only the KV cache / CUDA graphs; level 2 offloads the weights plus the KV cache / CUDA graphs. When the rollout engine does not release the base model weights, only the LoRA weights need to be synced (see the sketch after this list).
  • Modified the UpdateWeight base class (this change may be somewhat invasive) to support LoRA updates alongside base model weight updates.
  • The current LoRA weight update logic works in both colocated and disaggregated modes.
  • Fixed the nits you mentioned earlier. However, I cannot remove _lora_loaded, because unload_lora_adapter throws an error if no LoRA adapter is loaded.
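To make the offload-level behavior concrete, a rough sketch is below. The argument wiring is real in spirit, but the updater methods (update_base_weights, load_lora_adapter, unload_lora_adapter) are hypothetical stand-ins for the actual UpdateWeight code in this PR.

```python
# Rough sketch only; the updater method names are placeholders, not the PR's exact API.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--offload-rollout-level",
    type=int,
    default=1,
    choices=[1, 2],
    help="1: offload KV cache / CUDA graphs only; 2: also offload base weights",
)

def update_rollout_weights(args, updater):
    if args.offload_rollout_level >= 2:
        # Level 2 released the base weights, so they must be re-sent first.
        updater.update_base_weights()
    # The LoRA adapter is small, so it is re-synced on every update.
    if updater._lora_loaded:
        # unload_lora_adapter errors out if no adapter is loaded, hence the flag.
        updater.unload_lora_adapter()
    updater.load_lora_adapter()
    updater._lora_loaded = True
```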

More TODOs:

  • Save/Load logic.
  • Efficient layer-wise parameter summoning for the LoRA weight update.
  • Option 3: Send LoRA tensors rather than files (rough idea sketched below).
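As a rough idea of what Option 3 could look like (not implemented here), the LoRA tensors could be gathered and broadcast over a process group instead of written to files. The sketch below ignores FSDP sharding (which is why the efficient layer summoning item above matters), and the function/group names are illustrative.

```python
# Illustrative-only sketch of Option 3: broadcast LoRA tensors instead of files.
# Assumes full (unsharded) LoRA parameters and an existing trainer<->rollout group.
import torch.distributed as dist

def broadcast_lora_tensors(peft_model, group, src_rank=0):
    # Collect only the LoRA parameters; for a 4B model at a small rank this is
    # typically tens of MB, far below the full base-model payload.
    lora_tensors = {
        name: param.detach().contiguous()
        for name, param in peft_model.named_parameters()
        if "lora_" in name
    }
    # Iterate in a deterministic order so sender and receiver stay in lockstep.
    for name in sorted(lora_tensors):
        dist.broadcast(lora_tensors[name], src=src_rank, group=group)
    return lora_tensors
```

The receiving side would need pre-allocated buffers with matching names and shapes, so some metadata exchange (or a fixed adapter config) is still required before the broadcast.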

Discussion

Feedback on the trade-offs of these approaches is welcome!

@PopSoda2002
Collaborator

Great work!

Collaborator

@PopSoda2002 PopSoda2002 left a comment


I have some nits for you :)

@GuanxingLu GuanxingLu force-pushed the feature/fsdp_lora branch 2 times, most recently from a3cde86 to d67ec28 on December 18, 2025 at 17:18
@GuanxingLu GuanxingLu marked this pull request as ready for review December 18, 2025 17:19
@GuanxingLu

This comment was marked as duplicate.

@PopSoda2002
Collaborator

@PopSoda2002 Could you please check this? Thank you!


I moved the status update and the remaining TODOs to the PR description.

@PopSoda2002
Collaborator

Can you add some test cases in wandb compared against a no-LoRA baseline? I can help with this part.

@GuanxingLu
Contributor Author

Sure, you can run the tests. I am also running them in my local setup.

@PopSoda2002
Collaborator

@GuanxingLu Also, could you please fix the linting? You can use pre-commit.

@PopSoda2002 PopSoda2002 changed the title from "[WIP] Support LoRA training for FSDP backend." to "[FSDP Support LoRA training for FSDP backend." on Dec 18, 2025
@PopSoda2002 PopSoda2002 changed the title from "[FSDP Support LoRA training for FSDP backend." to "[FSDP][1/n] Support LoRA training for FSDP backend." on Dec 18, 2025
Co-authored-by: PopSoda2002 <zhouhp.me@gmail.com>
@GuanxingLu
Contributor Author

Sorry about that; the pre-commit checks now pass.

@GuanxingLu GuanxingLu force-pushed the feature/fsdp_lora branch 2 times, most recently from 1331fc3 to b1ee5aa on December 20, 2025 at 15:38
refs = [engine.flush_cache.remote() for engine in self.rollout_engines]
ray.get(refs)

refs = [
Collaborator


I am curious how long it takes for SGLang to read the LoRA weights from disk. Would it be possible to pass them through NCCL? I am not sure about the file size.

Collaborator


Also, we cannot assume that every node in distributed training shares a file system, so the read-from-disk approach may not work here.

if self.args.colocate:
self.weight_updater = UpdateWeightFromTensor(self.args, self.model)
else:
self.weight_updater = UpdateWeightFromDistributed(self.args, self.model)
Collaborator


Can I ask why this was changed?

@THUDM THUDM deleted a comment from PopSoda2002 Dec 21, 2025
