
secureIT-project/peft4apr


Source code under the MIT license; data under the CC BY 4.0 license.

Replication Package for "The Impact of Fine-tuning Large Language Models on Automated Program Repair"

This repository contains the replication package for the paper "The Impact of Fine-tuning Large Language Models on Automated Program Repair", by Roman Macháček, Anastasiia Grishina, Max Hort and Leon Moonen, published in the research track of the 41st International Conference on Software Maintenance and Evolution (ICSME 2025). A preprint of the paper is included in the root directory of this repository and will be deposited on arXiv.

This project builds on code from the clm project (c) 2023 The ASSET research group led by Lin Tan, Purdue University, licensed under the BSD 3-Clause License (see jasper/LICENSE.BSD). All modifications and new contributions are (c) 2025 by the authors of this package and distributed under the MIT License (see LICENSE.MIT). The data and preprint are distributed under the CC BY 4.0 license. The source code is also made available on GitHub.

Description

This replication package contains the scripts for our empirical study into the impact of fine-tuning LLMs on their APR performance. We compare different fine-tuning strategies for APR: no fine-tuning, full-model fine-tuning, and two parameter-efficient fine-tuning (PEFT) techniques (LoRA and IA3). The performance of the resulting models is then compared on three widely used APR benchmarks.

The code provided in this replication package builds on code by Jiang et al. (paper, codebase), adapting it to accommodate the benchmarking of additional models and parameter-efficient fine-tuning. Our extensions were developed in a modular fashion to facilitate the easy integration of additional models, techniques, and benchmarks.

Organisation

The repository is organized as follows:

  • src/: This directory contains the Python source code related to the various fine-tuning methods and the overall experimental setup.

    • finetune.py: This script handles the full fine-tuning process for the LLMs.
    • lora.py: Implements the LoRA (Low-Rank Adaptation) parameter-efficient fine-tuning technique.
    • ia3.py: Implements the IA3 (Infused Adapter by Inhibiting and Amplifying Inner Activations) parameter-efficient fine-tuning technique.
    • evaluator.py: This script is used to evaluate the performance of the fine-tuned models.
  • src/analysers/: This directory contains scripts for analyzing the results from model evaluations and for inspecting the models themselves.

    • analyse_benchmark.py: This script processes the JSON files produced after validating model outputs. It aggregates the results, counting the number of "plausible," "wrong," "uncompilable," and "timeout" patches for each model and benchmark configuration. It then generates a summary of these statistics in LaTeX table format for reporting and comparing performance.
    • analyse_model.py: This script is for inspecting the models, particularly those that have been fine-tuned using PEFT techniques. It loads a model from a given checkpoint and calculates the ratio of trainable parameters to total parameters, which helps in understanding the efficiency of the fine-tuning approach.
  • src/benchmarks/: This directory forms the core of the evaluation framework. It contains the logic for preparing data, generating model outputs, and validating the generated patches against different program repair benchmarks.

    • benchmark.py: This file defines the benchmark-specific validation logic. It contains a base Benchmark class and specialized classes for HumanEval, QuixBugs, and Defects4j. Each class knows how to take a generated patch, insert it into the corresponding benchmark's source code, compile the project, and run its test suite to check if the patch is a valid fix.
    • models.py: This file defines classes for each of the supported Large Language Models (e.g., CodeGen, CodeT5, StarCoder, DeepSeekCoder, CodeLlama, Bloom). These classes handle the model-specific details, including loading the correct tokenizer and model from Hugging Face, preparing the input prompts in the format the model expects, generating the output, and processing the output to extract the code patch. It also contains logic for preparing data for fine-tuning.
    • generate_inputs.py and generate_input_finetuned.py: These scripts are responsible for creating the input files for the models. They take the raw benchmark data and use the helper classes in models.py to convert it into the specific JSON format that the generation scripts expect.
    • generate_outputs.py and generate_outputs_finetuned.py: These scripts run the models to generate patches. They load the input files created by the generate_inputs scripts, pass the prompts to the appropriate model, and save the generated code into output JSON files.
    • validate.py: This script orchestrates the final validation step. It takes the output files from the generation scripts and uses the classes in benchmark.py to run the test suite for each generated patch, saving the results (e.g., "plausible", "wrong") to a new JSON file.
    • collators.py: This file provides data collator functions used during the training/fine-tuning process. These functions are responsible for taking batches of data samples and padding them to a uniform length, which is necessary for efficient model training.
    • fim.py: This script implements the "Fill-in-the-Middle" (FIM) functionality, a specific data transformation technique used by models like StarCoder and CodeLlama to handle code completion and infilling tasks.
  • src/utils/: contains utility scripts and modules that provide supporting functions for the main processes of the project.

    • dataset.py: This file defines the CLMDataset class used for language modeling.
    • metrics.py: This script is responsible for computing and handling evaluation metrics.
    • preprocess.py: This file contains functions for preprocessing data before it's fed into the models.
  • jasper/: This directory contains a copy of the Jasper project, an AST-based Java Parser for Program Repair.

    • The code is included verbatim from its source.
  • datasets/: This directory is created as part of the setup process documented below. It will contain subdirectories for each of the datasets used for finetuning (CLM) and benchmarking (QuixBugs, Defects4J, and HumanEval-Java). After running the experiments, it will also contain the results. More details on the structure are provided below, in the section Data Locations.

  • models/: This directory is created as part of running the experiments and contains the checkpoints for fine-tuned models. More details on the structure are provided below in the Data Locations section.

    • We have also deposited individual archive files containing the checkpoints for the fine-tuned models. These come in two versions:
      • <model>-fmft.tar.zst for the full-model fine-tuned checkpoints, and
      • <model>-peft.tar.zst for the parameter-efficient fine-tuned checkpoints (both LoRA and IA3)
    • They can be downloaded and extracted inside the root folder of the project to populate the models/ directory instead of running the fine-tuning steps yourself. To stay below Zenodo size limitations of 200GB, the files are compressed using the Zstandard format. Most modern tar implementations support Zstandard, so you can unpack them using tar --zstd -xvf <model>-<ft>.tar.zst.
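The status aggregation that analyse_benchmark.py performs (counting "plausible," "wrong," "uncompilable," and "timeout" patches per model and benchmark) can be sketched as follows. The JSON layout and the count_statuses helper are illustrative assumptions for this sketch, not the package's exact schema; only the four status labels come from the description above.

```python
from collections import Counter

# Illustrative validation-file shape (assumed, not the package's exact schema):
# {"data": {"<bug_id>": {"output": [{"correctness": "<status>"}, ...]}}}
STATUSES = ("plausible", "wrong", "uncompilable", "timeout")

def count_statuses(validation: dict) -> Counter:
    """Count how many generated patches fall into each validation status."""
    counts = Counter({s: 0 for s in STATUSES})
    for bug in validation["data"].values():
        for patch in bug["output"]:
            counts[patch["correctness"]] += 1
    return counts
```

A table row per model/benchmark pair can then be emitted from these counts, e.g. formatted as LaTeX for the paper's tables.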
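The "Fill-in-the-Middle" transformation implemented by fim.py can be sketched as below. The special-token strings and the prefix-suffix-middle (PSM) ordering follow the StarCoder convention; apply_fim is a hypothetical helper written for illustration, not the package's actual API.

```python
import random

# FIM sentinel tokens in the StarCoder vocabulary
FIM_PREFIX, FIM_MIDDLE, FIM_SUFFIX = "<fim_prefix>", "<fim_middle>", "<fim_suffix>"

def apply_fim(sample: str, rng: random.Random) -> str:
    """Split a training sample at two random cut points and emit the
    PSM (prefix-suffix-middle) ordering used for infilling training."""
    i, j = sorted(rng.sample(range(len(sample)), 2))
    prefix, middle, suffix = sample[:i], sample[i:j], sample[j:]
    # The model sees prefix and suffix as context and learns to produce the middle.
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"
```

Training on such reordered samples is what lets models like StarCoder and CodeLlama infill a buggy region given the code before and after it.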

Prerequisites

Setting up the environment and installing dependencies

The Python requirements and the environment can be set up as follows:

# Add the Conda-forge to the channels
conda config --add channels conda-forge
conda config --set channel_priority strict
	
# Set up the environment
conda create --name coderepair python=3.11.5
conda activate coderepair
	
# Install main ML requirements
conda install pytorch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 pytorch-cuda=12.1 -c pytorch -c nvidia
conda install transformers==4.36.2
conda install accelerate==0.23.0
pip install peft==0.6.0
pip install evaluate==0.4.3
pip install numpy==1.26.0
pip install wandb==0.15.12

# CodeBleu
pip install codebleu==0.7.0
pip install tree-sitter-java==0.21

# other requirements
pip install zenodo-get tqdm click

Next, we need to install Jasper: an AST-based Java Parser for Program Repair. A copy of the code is provided, along with the authors' instructions, in jasper/README.md. Note that Java 8 is needed for compiling Jasper:

conda install openjdk=8
cd jasper
mkdir target
javac -cp ".:lib/*" -d target src/main/java/clm/jasper/*.java src/main/java/clm/codet5/*.java src/main/java/clm/codegen/*.java src/main/java/clm/plbart/*.java src/main/java/clm/incoder/*.java src/main/java/clm/finetuning/*.java

Setting up APR benchmarks and fine-tuning datasets

The following APR benchmarks and fine-tuning datasets need to be installed: QuixBugs, Defects4J, HumanEval-Java, and CLM.

You can download them from within the project root, using:

mkdir -p datasets

# Download QuixBugs, Defects4J, and CLM
git clone https://github.com/jkoppel/QuixBugs.git datasets/QuixBugs
git clone https://github.com/rjust/defects4j.git datasets/defects4j
zenodo_get 7559208 -o ./datasets/clm
mv datasets/QuixBugs datasets/quixbugs

# Humaneval-Java
cd datasets
wget https://raw.githubusercontent.com/lin-tan/clm/refs/heads/main/humaneval-java/humaneval-java.tar.gz
tar -xzvf humaneval-java.tar.gz
cd humaneval-java
cp -r src src_bak
cd ..

# Quixbugs
cd quixbugs
cp -r java_programs java_programs_bak
cd ..

# Set the correct version for Defects4J
cd defects4j
git checkout tags/v2.0.1 -b d4j-2.0.1 --force
cd ..

# Add bug location info
mv quixbugs_loc.txt quixbugs
mv humaneval_loc.txt humaneval-java
mv defects4j_loc.txt defects4j

# Create project folders for the benchmarks
mkdir humaneval-java/proj
mkdir quixbugs/proj
mkdir defects4j/proj

# Make a temporary folder to store copies of the benchmarks (so they can easily be reinitialized)
mkdir tmp
cp -r humaneval-java tmp
cp -r quixbugs tmp
cp -r defects4j tmp

# Install Defects4j
cd defects4j
conda install perl
conda install compilers
conda install -c conda-forge gcc_linux-64 sysroot_linux-64=2.17
cpan App::cpanminus
./init.sh
export PATH=$PATH:$(pwd)/framework/bin

Environment.yml

For reference, we provide a file environment.yml in the root of the replication package that includes our complete conda environment after following these installation steps.

Setting up model access

The code downloads the following models from HuggingFace: Salesforce/codegen-1B-multi, Salesforce/codegen-2B-multi, Salesforce/codegen-6B-multi, Salesforce/codet5-small, Salesforce/codet5-base, Salesforce/codet5-large, bigcode/starcoderbase-1b, bigcode/starcoderbase-3b, bigcode/starcoderbase-7b, deepseek-ai/deepseek-coder-1.3b-base, deepseek-ai/deepseek-coder-6.7b-base, bigscience/bloom-560m, bigscience/bloom-1b7, bigscience/bloom-7b1, and meta-llama/CodeLlama-7b-hf.

Note that the StarCoder and CodeLlama models require you to log in to HuggingFace and accept a license on the respective model pages. After that, you can cache your HuggingFace access token using:

huggingface-cli login

Usage

There are several scripts to run depending on the various steps of the experiment:

  • Full fine-tuning of various models using src/finetune.py

     python finetune.py \
     --experiment_name <EXPERIMENT_NAME> \
     --model_name <MODEL_NAME> \
     --dataset_name <DATASET_NAME> \
     --checkpoint <CHECKPOINT> \
     --batch_size_train <BATCH_SIZE_TRAIN> \
     --batch_size_test <BATCH_SIZE_TEST> \
     --epochs <EPOCHS> \
     --max_length <MAX_LENGTH> \
     --max_new_tokens <MAX_NEW_TOKENS>
  • PEFT fine-tuning with src/lora.py and src/ia3.py

    • LoRA fine-tuning:

       python lora.py \
       --experiment_name <EXPERIMENT_NAME> \
       --model_name <MODEL_NAME> \
       --dataset_name <DATASET_NAME> \
       --checkpoint <CHECKPOINT> \
       --batch_size_train <BATCH_SIZE_TRAIN> \
       --batch_size_test <BATCH_SIZE_TEST> \
       --epochs <EPOCHS> \
       --max_length <MAX_LENGTH> \
       --max_new_tokens <MAX_NEW_TOKENS> \
       --rank <RANK> \
       --scaling <SCALING>
    • IA3 fine-tuning:

       python ia3.py \
       --experiment_name <EXPERIMENT_NAME> \
       --model_name <MODEL_NAME> \
       --dataset_name <DATASET_NAME> \
       --checkpoint <CHECKPOINT> \
       --batch_size_train <BATCH_SIZE_TRAIN> \
       --batch_size_test <BATCH_SIZE_TEST> \
       --epochs <EPOCHS> \
       --max_length <MAX_LENGTH> \
       --max_new_tokens <MAX_NEW_TOKENS>
  • Prepare inputs and outputs of models using src/benchmarks/generate_inputs.py, src/benchmarks/generate_outputs.py

     # Raw models
     python src/benchmarks/generate_inputs.py <MODEL_NAME> <DATASET_NAME>
     python src/benchmarks/generate_outputs.py <MODEL_NAME> <DATASET_NAME>
     
     # Fine-tuned models
     python src/benchmarks/generate_input_finetuned.py <DATASET_NAME>
     python src/benchmarks/generate_outputs_finetuned.py <MODEL_NAME> <MODEL_TYPE> <MODEL_CHECKPOINT> <DATASET_NAME> <ADAPTER_NAME>
  • Benchmarking of various models using src/benchmarks/validate.py

     python src/benchmarks/validate.py <MODEL_NAME> <DATASET_NAME> <OUTPUT_RESULTS_PATH>
  • Perform post-processing of the results using src/analysers/analyse_benchmark.py

     python src/analysers/analyse_benchmark.py <OUTPUT_RESULTS_PATH> <DATASET_NAME> <MODEL_NAME>

Data Locations

Model checkpoints from training are stored inside models folder:

  • models/
    • <adapter_name>/
      • clm/
        • <model_name>/
          • <model_type>/
            • <checkpoint>/ : checkpoints from the fine-tuning

All benchmarks, datasets, and experiment results are stored inside datasets folder:

  • datasets/
    • clm/ : fine-tuning dataset
    • defects4j/ : benchmark
    • humaneval-java/ : benchmark
    • quixbugs/ : benchmark
    • results/
      • <benchmark_name>/
        • <model_name>/
          • <finetuned>/
            • input_<model_name>_finetuned.json : inputs for fine-tuned models
            • <outputs>.json : predictions for the inputs
            • <validations>.json : validations of the outputs
          • input_c1.json : inputs for raw (not-fine-tuned) models, without buggy line comments
          • input_c2.json : inputs for raw (not-fine-tuned) models, with buggy line comments
          • <outputs>.json : predictions for the inputs
          • <validations>.json : validations of the outputs
    • tmp/: folder with copies of benchmarks for fast restoring of the initial setup (before repairs were done).

Example

We illustrate the pipeline for the codegen-350M model (assuming that the replication package is extracted in ~/peft4apr and execution is started in the root of this folder):

# First generate inputs
python src/benchmarks/generate_inputs.py codegen humaneval
python src/benchmarks/generate_input_finetuned.py humaneval

# Generate outputs before fine-tuning (check generate_outputs.py for running all the models)
python src/benchmarks/generate_outputs.py codegen humaneval

# Validate generated results
python src/benchmarks/validate.py codegen humaneval ~/peft4apr/datasets/results/humaneval/codegen

# Analyse the validations
python src/analysers/analyse_benchmark.py ~/peft4apr/datasets/results/humaneval/codegen humaneval codegen-350M

# (Optional) Fine-tune
python finetune.py \
  --experiment_name "Finetuning_CLM_Example" \
  --model_name "codegen" \
  --dataset_name "clm" \
  --checkpoint "Salesforce/codegen-350M-multi" \
  --batch_size_train 16 \
  --batch_size_test 16 \
  --epochs 3 \
  --max_length 768 \
  --max_new_tokens 768

# Use checkpoint for generation
python src/benchmarks/generate_outputs_finetuned.py codegen Salesforce/codegen-350M-multi ~/peft4apr/models/none/clm/codegen/Salesforce-codegen-350M-multi/checkpoint-18171 humaneval none

# Validate generated results
python src/benchmarks/validate.py codegen humaneval ~/peft4apr/datasets/results/humaneval/codegen/finetuned/none

# Analyse the validations
python src/analysers/analyse_benchmark.py ~/peft4apr/datasets/results/humaneval/codegen/finetuned/none humaneval codegen-350M

Citation and Zenodo links

Please cite this work by referring to the published paper:

Roman Macháček, Anastasiia Grishina, Max Hort and Leon Moonen. 2025. The Impact of Fine-tuning Large Language Models on Automated Program Repair. In Proceedings of the 41st International Conference on Software Maintenance and Evolution (ICSME). IEEE, 13 pages.

@inproceedings{machacek2025:impact,
    title = {{The Impact of Fine-tuning Large Language Models on Automated Program Repair}},
    author = {Mach\'{a}\v{c}ek, Roman and Grishina, Anastasiia and Hort, Max and Moonen, Leon},
    booktitle = {{Proceedings of the 41st International Conference on Software Maintenance and Evolution (ICSME)}},
    year = {2025},
    pages = {13},
    publisher = {{IEEE}},
    language = {en}
}

This replication package has been registered at Zenodo with DOI: 10.5281/zenodo.16359186.

Acknowledgement

This work has been financially supported by the Research Council of Norway through the secureIT project (RCN contract #288787), and by the European Union through the Horizon Europe Marie Skłodowska-Curie Actions (#101151798). The empirical evaluation made use of the Experimental Infrastructure for Exploration of Exascale Computing (eX3), financially supported by the Research Council of Norway under contract #270053.
