# DP-LLM: Runtime Model Adaptation with Dynamic Layer-wise Precision Assignment [Paper]
Dynamic-Precision LLM (DP-LLM) is a runtime model adaptation mechanism that supports dynamic layer-wise precision assignment.
**Warning:** This repository currently contains only the performance evaluation code. The latency measurement code will be added soon.
## Prerequisites

The prerequisites are identical to those of [Any-Precision LLM](https://github.com/SNU-ARC/any-precision-llm):
- Python 3.11
- CUDA Toolkit 12 or higher
- gcc-9 or higher
## Setup

The setup process is identical to that of Any-Precision LLM.
- Clone this repository.

  ```bash
  git clone https://github.com/SNU-ARC/dp-llm
  cd dp-llm
  ```

- Install the required Python packages.

  ```bash
  pip install -r requirements.txt
  ```

- Install the Any-Precision CUDA kernels.

  ```bash
  cd any_precision/modules/kernels
  pip install .
  ```

## Usage

```python
from any_precision import DPLLMForCausalLM

model = DPLLMForCausalLM.from_quantized(
    model_path,
    precisions=precisions,
    max_mem_dict=max_mem_dict,
    linear_reg_d=linear_reg_d,
    jl_d=jl_d,
    T_d=T_d,
    prefill_by_decode=True,  # True for perplexity evaluation,
                             # False for downstream tasks
)
```

- `precisions`: An array of the available precisions.
- `max_mem_dict`: A dictionary containing the maximum precision assigned to each linear layer.
- `linear_reg_d`: A dictionary containing the parameters of the linear-regression-based relative error estimator.
- `jl_d`: A dictionary containing the parameters of the random-projection-based relative error estimator (denoted as `G` in the paper).
- `T_d`: A dictionary containing the threshold values for each linear layer (denoted as `T` in the paper).
- `prefill_by_decode`: Set to `True` for efficient perplexity evaluation. When `True`, the model activates dynamic precision assignment during the prefill phase. When `False`, the maximum precision is used for the prefill phase, and dynamic precision assignment is active only during the decoding phase.
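For intuition, here is a toy sketch (ours, not the repository's implementation) of the decision these components enable: estimate the relative error at each available precision and pick the lowest precision whose estimated error stays below the layer's threshold. All names below are illustrative.

```python
def select_precision(precisions, estimated_rel_error, threshold):
    """Toy per-layer precision selector.

    precisions:          available bit-widths, e.g. [3, 4, 5, 6]
    estimated_rel_error: hypothetical dict mapping precision -> estimated
                         relative output error (what the estimators provide)
    threshold:           this layer's threshold (T in the paper, from T_d)
    """
    for p in sorted(precisions):
        if estimated_rel_error[p] <= threshold:
            return p  # lowest precision that is accurate enough
    return max(precisions)  # nothing meets the threshold: use max precision

# e.g. select_precision([3, 4, 5, 6], {3: 0.12, 4: 0.04, 5: 0.02, 6: 0.01}, 0.05)
# returns 4
```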
## Pre-finetuned Results

Some fine-tuned results are provided for quick evaluation. They can be found at https://github.com/SNU-ARC/DP-LLM_pre_finetuned. Load each `.pt` file in the directory with `torch.load`, then pass the loaded objects as the `max_mem_dict`, `linear_reg_d`, `jl_d`, and `T_d` arguments (see the sketch below the configuration list).
The following configurations are available:
- Meta-Llama-3-8B, {3, 4, 5, 6}-bit precisions, 5.0-bit memory budget, with target precisions 3.25, 3.5, 3.75, 4.0, 4.25, 4.5, and 4.75
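For quick evaluation with the configuration above, a minimal loading sketch (the `.pt` file names here are placeholders; use the actual names from the pre-finetuned repository):

```python
import torch

from any_precision import DPLLMForCausalLM

# Placeholder file names -- substitute the actual .pt files from
# https://github.com/SNU-ARC/DP-LLM_pre_finetuned
max_mem_dict = torch.load("max_mem_dict.pt")
linear_reg_d = torch.load("linear_reg_d.pt")
jl_d = torch.load("jl_d.pt")
T_d = torch.load("T_d.pt")

model = DPLLMForCausalLM.from_quantized(
    "cache/packed/anyprec-(Meta-Llama-3-8B-hf)-w8_orig3-gc1-c4_s100_blk512",
    precisions=[3, 4, 5, 6],  # the precisions in the provided configuration
    max_mem_dict=max_mem_dict,
    linear_reg_d=linear_reg_d,
    jl_d=jl_d,
    T_d=T_d,
    prefill_by_decode=True,  # True for perplexity evaluation
)
```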
## Quantization

Please refer to https://github.com/SNU-ARC/any-precision-llm#quantization for detailed instructions.

```bash
python quantize.py <model> [options]
```
## Fine-tuning

Run `0_set_configs.py` to record the linear layer sizes and write them to the config files.

```bash
python 0_set_configs.py <ap_model_path>
# e.g.
# python 0_set_configs.py cache/packed/anyprec-(Meta-Llama-3-8B-hf)-w8_orig3-gc1-c4_s100_blk512
```
Run `1_find_maxmem.py` to find the layer-wise maximum precisions.

```bash
python 1_find_maxmem.py <model> <ap_model_path> --hessian_path path/to/hessian --memory_budget <memory budgets>
# e.g.
# python 1_find_maxmem.py \
#     meta-llama/Meta-Llama-3-8B-hf \
#     cache/packed/anyprec-(Meta-Llama-3-8B-hf)-w8_orig3-gc1-c4_s100_blk512 \
#     --hessian_path \
#     cache/packed/gradients/(Meta-Llama-3-8B-hf)-c4_s100_blk512.pt \
#     --memory_budget 4.0 5.0
```
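As a point of intuition (our interpretation, not something the script documents): a fractional memory budget such as 5.0 bits is naturally read as a size-weighted average of the per-layer maximum precisions. A toy check with hypothetical layer names:

```python
def weighted_avg_bits(max_precisions, param_counts):
    """Size-weighted average bit-width of a layer-wise precision assignment.

    max_precisions: hypothetical dict, linear layer name -> max precision
    param_counts:   hypothetical dict, linear layer name -> parameter count
    """
    total = sum(param_counts.values())
    return sum(max_precisions[n] * param_counts[n] for n in max_precisions) / total

# Two equally sized layers at 4 and 6 bits average to 5.0 bits,
# which would exactly meet a 5.0-bit memory budget.
print(weighted_avg_bits({"up_proj": 4, "down_proj": 6},
                        {"up_proj": 1 << 20, "down_proj": 1 << 20}))  # 5.0
```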
Run `2_finetune.py` to find the layer-wise average precisions.

```bash
python 2_finetune.py <ap_model_path> --maxmem <memory budget> --targ_bits <target precision>
# e.g.
# python 2_finetune.py \
#     cache/packed/anyprec-(Meta-Llama-3-8B-hf)-w8_orig3-gc1-c4_s100_blk512 \
#     --maxmem 5.0 --targ_bits 3.5
```
Run `3_save_estimator.py` to create the error estimators.

```bash
python 3_save_estimator.py <model> <ap_model_path> --arr_path <finetuned result>
# e.g.
# python 3_save_estimator.py \
#     meta-llama/Meta-Llama-3-8B-hf \
#     cache/packed/anyprec-(Meta-Llama-3-8B-hf)-w8_orig3-gc1-c4_s100_blk512 \
#     --arr_path \
#     finetuned_results/anyprec-()-w8_orig3-gc1-c4_s100_blk512/finetuned_max5.0_3b-6b_th_pb_train_0.01_1.0_5ep_targ3.5b_init_0-1000_adam.pt
```
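The random-projection estimator relies on the fact that a Johnson-Lindenstrauss-style projection approximately preserves norms, so relative output errors can be estimated in a much lower-dimensional space. A toy illustration of that idea (ours, not the repository's code; all dimensions are made up):

```python
import math

import torch

torch.manual_seed(0)
d, k = 4096, 16  # hypothetical hidden size and projection dimension

y = torch.randn(d)               # stand-in for a full-precision layer output
y_q = y + 0.01 * torch.randn(d)  # stand-in for a low-precision layer output

# Gaussian random projection, scaled to preserve norms in expectation
G = torch.randn(k, d) / math.sqrt(k)

exact = (y_q - y).norm() / y.norm()
estimate = (G @ (y_q - y)).norm() / (G @ y).norm()
print(f"exact relative error {exact:.4f} vs projected estimate {estimate:.4f}")
```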
Run `4_save_th.py` to save the threshold values.

```bash
python 4_save_th.py <ap_model_path> --arr_path <finetuned result>
# e.g.
# python 4_save_th.py \
#     cache/packed/anyprec-(Meta-Llama-3-8B-hf)-w8_orig3-gc1-c4_s100_blk512 \
#     --arr_path \
#     finetuned_results/anyprec-()-w8_orig3-gc1-c4_s100_blk512/finetuned_max5.0_3b-6b_th_pb_train_0.01_1.0_5ep_targ3.5b_init_0-1000_adam.pt
```
## Evaluation

Run `test_pp.py` to evaluate DP-LLM's perplexity.

```bash
python test_pp.py <ap_model_path> --estimator_results <estimator directory>
# e.g.
# python test_pp.py \
#     cache/packed/anyprec-(Meta-Llama-3-8B-hf)-w8_orig3-gc1-c4_s100_blk512 \
#     --estimator_results \
#     estimator_private_values/anyprec-()-w8_orig3-gc1-c4_s100_blk512/finetuned_max5.0_3b-6b_th_pb_train_0.01_1.0_5ep_targ3.5b_init_0-1000_adam
```
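`test_pp.py` is the supported way to reproduce the perplexity numbers. Purely as an illustration of what such an evaluation does, here is a generic chunked perplexity loop, assuming `model` is loaded as in the Usage sketch and follows the standard Hugging Face causal-LM interface (an assumption on our part), with WikiText-2 as a stand-in dataset:

```python
import math

import torch
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-hf")
test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
ids = tokenizer("\n\n".join(test["text"]), return_tensors="pt").input_ids

seq_len = 2048
nll_sum, token_count = 0.0, 0
with torch.no_grad():
    for i in range(0, ids.size(1), seq_len):
        chunk = ids[:, i : i + seq_len].to(model.device)
        if chunk.size(1) < 2:
            break
        # HF causal LMs shift labels internally; loss is the mean NLL
        # over the chunk's predicted tokens
        out = model(chunk, labels=chunk)
        n_predicted = chunk.size(1) - 1
        nll_sum += out.loss.item() * n_predicted
        token_count += n_predicted

print("perplexity:", math.exp(nll_sum / token_count))
```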
## Citation

Please cite our paper if you find our work useful:

```bibtex
@inproceedings{kwon2025dp,
  title={DP-LLM: Runtime Model Adaptation with Dynamic Layer-wise Precision Assignment},
  author={Sangwoo Kwon and Seong Hoon Seo and Jae W. Lee and Yeonhong Park},
  booktitle={Proceedings of the 39th Conference on Neural Information Processing Systems},
  year={2025}
}
```