vLLM Server Sync via LoRA Adapter Reload (avoid merge + full weight sync) for GRPO #5188
lfranceschetti wants to merge 3 commits into huggingface:main
What does this PR do?
Add a new `vllm_sync_strategy="lora_adapter"` option for TRL's vLLM server mode that, when using PEFT LoRA/QLoRA, updates the vLLM-side policy by saving and reloading LoRA adapters instead of merging adapters and syncing full model weights (the current behavior). This is intended to make GRPO/online RL training with PEFT more reliable and efficient, especially for QLoRA-style training, while keeping `vllm_sync_strategy="weights"` as the default.

Fixes
- Sync is slow and memory-intensive: the merge-sync-unmerge loop iterates over all named parameters and sequentially updates each one on the vLLM server. With ZeRO-3 this can lead to OOMs (also from my own experience). With `vllm_sync_strategy="lora_adapter"`, only the small LoRA checkpoint is written and loaded. Relates to "GRPOTrainer: Parallelism for updating named params" #3557 (fixes it if LoRA is used).
- QLoRA merge causes quantization rounding errors: merging LoRA weights into a 4-bit quantized base model and then unmerging is lossy. The `vllm_sync_strategy="lora_adapter"` approach avoids merging entirely. Fixes "[GRPO] bnb quantization + vllm" #3466; however, support for 4-bit models on the vLLM server still needs to be implemented (working on it).
- NCCL weight transfer is brittle over long runs: the current approach sometimes fails after hours of training due to NCCL communication errors. The `vllm_sync_strategy="lora_adapter"` approach replaces NCCL with a simple file write plus an HTTP reload request, which is more robust. Fixes "GRPOTrainer fails to transfer weights to vLLM with `_move_model_to_vllm` after 7.5 hours of the job running" #2840 (if LoRA is used).

Other Advantages
Decoupled from parameter naming: the `"weights"` strategy iterates `state_dict()` keys and must match each one to the vLLM-side parameter name, which is fragile across PEFT versions and model architectures (see the manual prefix stripping in #2818). The `"lora_adapter"` strategy saves a standard PEFT checkpoint and lets vLLM load it through its own adapter path, so the two sides never need to agree on internal parameter names.

Prior PRs and Why This PR is Different
This approach was originally proposed in #2730 but was superseded by #2818, which chose the merge-sync-unmerge approach because it supports all PEFT adapter types (DoRA, IA3, etc.), not just LoRA.
This PR addresses that objection by:

- Keeping `vllm_sync_strategy="weights"` as the default: full backward compatibility, works with any adapter type.
- Adding `vllm_sync_strategy="lora_adapter"` as an opt-in optimization for the most common case (standard LoRA/rsLoRA), raising a clear error for non-LoRA adapter types (IA3, Prefix Tuning, etc.) and guiding users back to the `vllm_sync_strategy="weights"` default.

The original #2730 noted a potential vLLM memory leak from repeatedly loading adapters. This might still be an issue; however, only very small memory increases have been observed during experimental runs.
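To illustrate the parameter-naming fragility mentioned under "Other Advantages", here is a minimal sketch of the kind of key rewriting the `"weights"` strategy must perform before vLLM recognizes the names. The exact prefixes vary by PEFT version; the helper below is illustrative, not TRL's actual code.

```python
# Illustrative only, not TRL's actual implementation. PEFT wraps the
# base model, so state_dict() keys carry wrapper prefixes that the
# vLLM side does not know about.
def to_vllm_name(peft_key: str) -> str:
    """Map a PEFT-wrapped parameter name back to the base-model name."""
    # PEFT typically nests the model under `base_model.model.` ...
    prefix = "base_model.model."
    if peft_key.startswith(prefix):
        peft_key = peft_key[len(prefix):]
    # ... and wraps adapted linears so the original weight lives
    # under a `.base_layer` attribute.
    return peft_key.replace(".base_layer", "")

print(to_vllm_name(
    "base_model.model.model.layers.0.self_attn.q_proj.base_layer.weight"
))
# -> model.layers.0.self_attn.q_proj.weight
```

Any change in how PEFT composes these prefixes silently breaks the mapping, which is exactly the coupling the `"lora_adapter"` strategy avoids.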
Future Improvements
vLLM's `load_inplace` feature (vllm-project/vllm#31326, merged Jan 2026) could further improve this idea in the future but is not required. A TRL adaptation is suggested in vllm-project/vllm#20149.
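For concreteness, the file-write-plus-HTTP-reload flow described above could look roughly like the sketch below. `save_pretrained` is the standard PEFT API for writing an adapter checkpoint; the endpoint name, payload fields, and helper names are assumptions for illustration, not the actual TRL server API.

```python
import json
import urllib.request


def build_reload_payload(adapter_name: str, adapter_path: str) -> dict:
    # Payload fields are assumptions; the real server API may differ.
    return {"lora_name": adapter_name, "lora_path": adapter_path}


def sync_via_adapter_reload(peft_model, server_url: str, step: int, out_dir: str) -> None:
    """Sketch: sync the policy by writing the small LoRA checkpoint and
    asking the vLLM server to reload it over HTTP (no NCCL involved)."""
    # 1) Write only the LoRA adapter (megabytes, not full model weights).
    adapter_path = f"{out_dir}/policy_step_{step}"
    peft_model.save_pretrained(adapter_path)
    # 2) A single HTTP request replaces the per-parameter NCCL broadcast.
    body = json.dumps(
        build_reload_payload(f"policy_step_{step}", adapter_path)
    ).encode()
    req = urllib.request.Request(
        f"{server_url}/load_lora_adapter",  # hypothetical endpoint name
        data=body,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```

Because the transfer is an ordinary file write plus a request to a stateless HTTP endpoint, a failed sync can simply be retried, unlike a wedged NCCL collective.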