[FSDP][1/n] Support LoRA training for FSDP backend. #1140
base: main
Conversation
Great work!
PopSoda2002 left a comment:
I have some nits for you :)
I moved these to the PR description.
Can you add some test cases in wandb comparing against runs without LoRA? I can help with this part.
Sure, you can run the tests; I am also running them in my local setup.
@GuanxingLu Also, can you please fix the linting? You can use
Co-authored-by: PopSoda2002 <zhouhp.me@gmail.com>
Sorry, now
```python
refs = [engine.flush_cache.remote() for engine in self.rollout_engines]
ray.get(refs)

refs = [
```
I am curious how long it takes for SGLang to read LoRA weights from disk. Is it possible to pass them through NCCL? I am not sure about the file size.
And we cannot assume that in distributed training every node shares a file system, so the read-from-disk approach may not work here.
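To make the suggested alternative concrete, here is a minimal sketch of broadcasting the (small) LoRA adapter tensors through a torch.distributed/NCCL process group instead of going through disk. The process group spanning the trainer and the rollout engines, the rank layout, and the `lora_state_dict` contents are all assumptions, not the PR's actual code.

```python
# Hedged sketch: push LoRA adapter weights over NCCL instead of writing them to disk.
# Assumes a torch.distributed process group that includes the training actor (src rank)
# and the rollout engine ranks, and that `lora_state_dict` holds only the adapter tensors.
import torch.distributed as dist


def broadcast_lora_weights(lora_state_dict, group, src_rank=0):
    """Broadcast each LoRA tensor from the trainer to all rollout ranks."""
    for name in sorted(lora_state_dict.keys()):  # deterministic order on every rank
        tensor = lora_state_dict[name].detach().to("cuda", non_blocking=True)
        dist.broadcast(tensor, src=src_rank, group=group)
        lora_state_dict[name] = tensor
    return lora_state_dict
```

Because only the adapter matrices are transferred, the payload stays small even for multi-billion-parameter base models, which is what makes an NCCL path attractive when nodes do not share a file system.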
slime/backends/fsdp_utils/actor.py (outdated)
```python
if self.args.colocate:
    self.weight_updater = UpdateWeightFromTensor(self.args, self.model)
else:
    self.weight_updater = UpdateWeightFromDistributed(self.args, self.model)
```
Can I ask why this was changed?
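For context on what is being selected here, a hedged sketch of how a LoRA-specific path could slot into this branch. `UpdateWeightFromDisk` is the class the PR description mentions below, but the `lora_rank` flag and the constructor signature are assumptions, not the PR's actual diff.

```python
# Hedged sketch (not the PR's exact code): pick the disk-based updater when LoRA is
# enabled, otherwise keep the original tensor/distributed paths. Treating
# `args.lora_rank > 0` as the "LoRA on" signal is an assumption.
if getattr(self.args, "lora_rank", 0) > 0:
    self.weight_updater = UpdateWeightFromDisk(self.args, self.model)
elif self.args.colocate:
    self.weight_updater = UpdateWeightFromTensor(self.args, self.model)
else:
    self.weight_updater = UpdateWeightFromDistributed(self.args, self.model)
```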
Approaches Considered:
Current Implementation: Option 2, using UpdateWeightFromDisk (see the adapter-save sketch after this list).
Limitation: When rollout engine offloading is enabled, the base model must be re-transmitted on each update (though the overhead is manageable).
Status:
✅ Trains Qwen3-4B with LoRA on a single A100 40G (full-parameter training triggers OOM).
🚧 Reward curve pending.
TODO:
Update
Discussion: Feedback on the trade-offs of these approaches is welcome!
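Since the description mentions an UpdateWeightFromDisk path, here is a small sketch of what the disk-side handoff could look like with peft: the trainer writes only the adapter weights, which also speaks to the file-size question above, since the adapter is typically a few MB to a few hundred MB rather than the full model. The `adapter_dir` path, the LoRA hyperparameters, and how the rollout engine picks the files up are assumptions, not part of this PR.

```python
# Hedged sketch of the disk handoff assumed by an UpdateWeightFromDisk-style updater.
# Only the LoRA adapter (not the base model) is written, so the file stays small.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B")  # base weights are untouched
lora_cfg = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])  # example values
model = get_peft_model(base, lora_cfg)

adapter_dir = "/tmp/slime_lora_adapter"  # hypothetical path shared with the rollout engine
model.save_pretrained(adapter_dir)       # writes only the adapter weights and config

# Rough size check: only the LoRA matrices are saved, so this is MBs, not GBs.
num_lora_params = sum(p.numel() for n, p in model.named_parameters() if "lora_" in n)
print(f"adapter params: {num_lora_params:,} (~{num_lora_params * 2 / 1e6:.1f} MB in bf16)")
# How SGLang is told to (re)load the adapter from `adapter_dir` is outside this sketch.
```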