Jérémie Dentan1, Davide Buscaldi1, 2, Aymen Shabou3, Sonia Vanier1
1LIX (École Polytechnique, IP Paris, CNRS) 2LIPN (Sorbonne Paris Nord) 3Crédit Agricole SA
This repository implements the experiments of our paper "Predicting memorization within Large Language Models fine-tuned for classification", published at ECAI 2025.
Large Language Models have received significant attention due to their abilities to solve a wide range of complex tasks. However, these models memorize a significant proportion of their training data, posing a serious threat when disclosed at inference time. To mitigate this unintended memorization, it is crucial to understand what elements are memorized and why. This area of research is largely unexplored, with most existing works providing a posteriori explanations. To address this gap, we propose a new approach to detect memorized samples a priori in LLMs fine-tuned for classification tasks. This method is effective from the early stages of training and readily adaptable to other classification settings, such as training vision models from scratch. Our method is supported by new theoretical results, and requires a low computational budget. We achieve strong empirical results, paving the way for the systematic identification and protection of vulnerable samples before they are memorized.
Copyright 2023-present Laboratoire d'Informatique de Polytechnique. Apache Licence v2.0.
Please cite this work as follows:
@inproceedings{dentan_predicting_2025,
title = {Predicting Memorization within Large Language Models Fine-Tuned for Classification},
author = {Dentan, Jérémie and Buscaldi, Davide and Shabou, Aymen and Vanier, Sonia},
booktitle = {Proceedings of the 28th European Conference on Artificial Intelligence (ECAI 2025)},
year = {2025},
note = {To appear},
url = {https://arxiv.org/abs/2409.18858}
}

This repository contains the source code needed to reproduce our results, except for the experiments on the CIFAR-10 dataset. For those experiments, we provide a separate repository with the corresponding source code: https://github.com/orailix/predict_llm_memorization_cifar10
The repository contains three main directories:
- `grokking_llm` contains the Python source code for the experiments
- `scripts` contains the Bash and Slurm scripts for deployment on an HPC cluster
- `figures` contains the notebooks to reproduce the figures of the paper
Important notice: the module we developed is called `grokking_llm` because the original purpose of this project was to study the grokking phenomenon in LLMs.
- Apart from the training configs and the deployment configs (see below), two config files are necessary:
- `main.cfg`: declares where the HuggingFace cache should be stored (for deployment on an offline HPC cluster, for example), as well as the paths where outputs and logs should be stored. A hypothetical sketch of such a file is shown after this list.
- `env_vars.cfg`: optionally declares environment variables. For example, on an HPC cluster with shared CPUs, you might have to set `OMP_NUM_THREADS` so that default libraries do not use more threads than are actually available.
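For illustration, the sketch below shows the kind of path configuration `main.cfg` might hold and how it can be read with Python's standard `configparser`. The `[paths]` section and key names are placeholders we chose for this example, not the module's actual schema; refer to the config templates in the repository for the real keys.

```python
# Illustrative sketch only: the [paths] section and key names below are
# placeholders, not the actual schema expected by grokking_llm.
import configparser

example_main_cfg = """
[paths]
# HuggingFace cache location (useful on an offline HPC cluster)
hf_cache = /scratch/my_user/hf_cache
# Where training outputs and logs are written
outputs = /scratch/my_user/grokking_llm/outputs
logs = /scratch/my_user/grokking_llm/logs
"""

parser = configparser.ConfigParser()
parser.read_string(example_main_cfg)
print(dict(parser["paths"]))
```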
- `training_cfg.py`: every training config is mapped to an instance of this class. The instance is associated with an alphanumeric hash (the `config_id`), and all outputs associated with this training config are stored in `outputs/individual/<config_id>`. You can use `TrainingCfg.autoconfig` to retrieve any config that was already created. The naming idea is illustrated after this list.
- `deployment_cfg.py`: a deployment config describes the procedure for training models with many training configs. For example, we use a deployment config to vary the random split of the dataset between 0 and 99 to train shadow models. Similarly, every deployment config is associated with a `deployment_id`, and its outputs are stored in `outputs/deployment/<deployment_id>`.
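To make the naming scheme concrete, here is a small self-contained illustration of the idea behind `config_id`: the config's contents are hashed deterministically, and the hash names the output directory. This is our own sketch of the concept; the fields and the hashing scheme are assumptions, not the code actually used in `training_cfg.py`.

```python
# Our own illustration of the config_id idea: hash a canonical serialisation of
# a training config and use it to name outputs/individual/<config_id>.
# The fields and the hashing scheme are assumptions, not the repository's code.
import hashlib
import json
from pathlib import Path

training_cfg = {"model": "mistral-7b", "dataset": "mmlu", "lora_r": 8, "random_split": 0}

# Canonical serialisation -> deterministic alphanumeric hash.
config_id = hashlib.sha256(json.dumps(training_cfg, sort_keys=True).encode()).hexdigest()[:16]

output_dir = Path("outputs") / "individual" / config_id
print(output_dir)  # e.g. outputs/individual/<16-character hash of the config contents>
```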
- Contains the scripts needed to train models and manage datasets
- In appendix A of the paper, we explain the difference between local and global measures of memorization. In this paper, we use the terms dynamic and static to refer to these concepts, respectively.
- `grokking_llm.measures_dyn` contains the scripts for the local measures, i.e. the ones aligned with our threat model: a practitioner willing to audit a fixed model trained on a fixed dataset.
- `grokking_llm.measures_stat` contains the scripts for the global measures, i.e. the ones not aligned with our threat model: we obtain average vulnerability metrics over a population of models trained on random splits of a dataset. A toy illustration of this aggregation follows this list.
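As a generic illustration of the static (global) setting, the toy sketch below averages a per-sample vulnerability signal over the shadow models whose random split contained the sample. It only conveys the aggregation idea; the actual metrics are defined in the paper and implemented in `grokking_llm.measures_stat`.

```python
# Toy illustration of a "static" (global) measure: average a per-sample
# vulnerability signal over the shadow models that trained on the sample.
# This is NOT the metric implemented in grokking_llm.measures_stat.
import numpy as np

rng = np.random.default_rng(0)
n_models, n_samples = 100, 1000

# member[m, i] is True if sample i belongs to the random training split of shadow model m.
member = rng.random((n_models, n_samples)) < 0.5
# score[m, i] is some per-model vulnerability signal for sample i (random toy values here).
score = rng.random((n_models, n_samples))

# Global vulnerability of sample i: mean signal over the models that trained on it.
global_vulnerability = (score * member).sum(axis=0) / np.maximum(member.sum(axis=0), 1)
print(global_vulnerability[:5])
```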
- `01_main_figures.ipynb`: code used for the main figures of the paper
- `01_compare_memorization.ipynb`: code used for figure 6 in the appendix
We provide our Bash and Slurm scripts for deployment on an HPC cluster. We used the Jean-Zay HPC cluster from IDRIS, with Nvidia A100 80GB GPUs and 40-core Intel Xeon 6248 CPUs. Training took between 3 and 10 hours on a single GPU. Overall, our experiments are equivalent to around 5,000 hours of single-GPU compute and 4,000 hours of single-core CPU compute.
- `arc_mistral`: deployment scripts for a Mistral 7B model [1] trained on the ARC dataset [2]
- `ethics_mistral`: deployment scripts for a Mistral 7B model [1] trained on the ETHICS dataset [3]
- `mmlu_mistral`: deployment scripts for a Mistral 7B model [1] trained on the MMLU dataset [4]
- `mmlu_llama`: deployment scripts for a Llama 2 7B model [5] trained on the MMLU dataset [4]
- `mmlu_gemma`: deployment scripts for a Gemma 7B model [6] trained on the MMLU dataset [4]
- [1] Albert Q. Jiang et al. Mistral 7B, October 2023. http://arxiv.org/abs/2310.06825
- [2] Michael Boratko et al. Systematic Classification of Knowledge, Reasoning, and Context within the ARC Dataset. In Proceedings of the Workshop on Machine Reading for Question Answering, 2018. http://aclweb.org/anthology/W18-2607
- [3] Dan Hendrycks et al. Aligning AI With Shared Human Values. In ICLR, 2021. https://openreview.net/forum?id=dNy_RKzJacY
- [4] Dan Hendrycks et al. Measuring Massive Multitask Language Understanding. In ICLR, 2021. https://openreview.net/forum?id=d7KBjmI3GmQ
- [5] Hugo Touvron et al. LLaMA: Open and Efficient Foundation Language Models, February 2023. https://arxiv.org/abs/2302.13971
- [6] Gemma Team et al. Gemma: Open Models Based on Gemini Research and Technology, April 2024. http://arxiv.org/abs/2403.08295
This work received financial support from Crédit Agricole SA through the research chair “Trustworthy and responsible AI” with École Polytechnique. This work was performed using HPC resources from GENCI-IDRIS 2023-AD011014843. We thank Arnaud Grivet Sébert and Mohamed Dhouib for discussions on this paper.