Merge safety-tuned and multilingual-tuned LLMs by analyzing per-layer/module weight changes.

Swap at layer granularity (entire transformer layers):

```bash
python layer_swap.py \
  -b meta-llama/Llama-3.1-8B-Instruct \
  -s ./checkpoint/safety_model \
  -m ./checkpoint/multi_model \
  -o ./checkpoint/merged
```

Swap at module granularity (attention vs. FFN separately):

```bash
python module_swap.py \
  -b meta-llama/Llama-3.1-8B-Instruct \
  -s ./checkpoint/safety_model \
  -m ./checkpoint/multi_model \
  -o ./checkpoint/merged
```

| Argument | Description | Default |
|---|---|---|
| `-b, --base-model` | Base model (HuggingFace ID or path) | Required |
| `-s, --safety-model` | Safety-tuned model path | Required |
| `-m, --multi-model` | Multilingual-tuned model path | Required |
| `-o, --output` | Output path | Required |
| `--tau` | Threshold for the per-layer/module decision | `0.001` |
| `--alpha` | Blend ratio | `0.5` |
| `--figure-dir` | Directory for figures | `<output>/figures` |
- Compute relative weight changes (ΔW) of each fine-tuned model against the base model
- For each layer/module, compare the safety change against the multilingual change
- Decision per layer/module (with `diff` = safety change − multilingual change):
  - `diff > tau` → use the safety weights
  - `diff < -tau` → use the multilingual weights
  - otherwise → blend the two with ratio α
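The per-tensor decision rule can be sketched as below. This is a minimal illustration, not the actual implementation in `layer_swap.py`/`module_swap.py`: the relative-change metric (Frobenius-norm ratio) and the function names `relative_change`/`merge_tensor` are assumptions for the sketch.

```python
import torch

def relative_change(base: torch.Tensor, tuned: torch.Tensor) -> float:
    # Relative weight change ||W_tuned - W_base|| / ||W_base|| (assumed metric)
    return (tuned - base).norm().item() / (base.norm().item() + 1e-12)

def merge_tensor(base: torch.Tensor, safety: torch.Tensor, multi: torch.Tensor,
                 tau: float = 0.001, alpha: float = 0.5) -> torch.Tensor:
    """Pick or blend one layer/module tensor per the decision rule."""
    diff = relative_change(base, safety) - relative_change(base, multi)
    if diff > tau:       # safety model changed this tensor noticeably more
        return safety.clone()
    if diff < -tau:      # multilingual model changed it noticeably more
        return multi.clone()
    # Comparable changes: blend with ratio alpha
    return alpha * safety + (1 - alpha) * multi
```

In practice the same rule would be applied to every parameter tensor of a layer (layer granularity) or to the attention and FFN tensors separately (module granularity).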
See data/README.md for preparing multilingual SFT datasets.