This repository hosts the code of KERAG-R.
- Method 1: Using the Provided Docker Image (Recommended)
We implement KERAG-R with Python 3.10.13 and PyTorch 2.5.1+cu121.
The ./KERAG-R/requirements.txt file lists the core dependencies.
Our experiments are conducted on a computing cluster.
Pull the prebuilt Docker image as the base environment:
docker pull reconmmendationsystem/notebook:cuda12.1_unsloth

After starting the container, install the following additional packages in the order shown to avoid version conflicts:
pip install vllm==0.6.5
pip install transformers==4.47.0
pip install https://download.pytorch.org/whl/cu121/torch-2.5.1%2Bcu121-cp310-cp310-linux_x86_64.whl#sha256=92af92c569de5da937dd1afb45ecfdd598ec1254cf2e49e3d698cb24d71aae14
pip install accelerate==1.2.0
pip install peft==0.13.2
pip install jsonlines
pip install flash-attn==2.8.3
- Method 2: Installing Directly on a Local or Cluster Environment (Without Docker)
If you do not wish to use the provided Docker image, you can install the dependencies directly in a fresh Python 3.10 environment.
- Create and activate a virtual environment (optional but recommended):
conda create -n kerag-r python=3.10
conda activate kerag-r

Or, using venv:

python3.10 -m venv kerag-r
source kerag-r/bin/activate
- Install dependencies from requirements.txt:
pip install -r ./KERAG-R/requirements.txt
- Install the pinned versions listed in Method 1 (vllm, transformers, torch, accelerate, peft, jsonlines, flash-attn), in the same order, to ensure compatibility.
This method allows you to run KERAG-R without Docker, but make sure your CUDA version matches the PyTorch wheel you install (the wheel above targets CUDA 12.1).
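As a quick sanity check after either method, you can confirm that the installed PyTorch build sees CUDA (torch.version.cuda should print 12.1):

python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"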
- Quick Start

Run main.py in the ./KERAG-R/train+inference/ directory; the output file is saved in the same ./KERAG-R/train+inference/ path:
python main.py pipeline \
--hf_token "hf_xxx" \
--model_name meta-llama/Llama-3.1-8B-Instruct \
--train_data_file ./listwisetrain.jsonl \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 16 \
--infer_model meta-llama/Llama-3.1-8B-Instruct \
--adapter_dir ./trained_model \
--infer_data_file ./test.jsonl \
--batch_size 80
- Load original dataset
User-item interaction files and knowledge graph files are in the ./KERAG-R/dataset directory.
The interaction files for ml-10m and AmazonBook are too large to be included in this repository. Please download them from their official sources before running the code.
- Preprocessing the dataset
All datasets can be processed by following the steps below. Alternatively, users can directly use the preprocessed files provided for the ml-1m dataset in this repository.
Place the ratings file in the ./KERAG-R/data-process/ directory, then run split.py to split the dataset. The input file is ratings.csv. The output files are like.txt, dislike.txt, train_set.txt, valid_set.txt, and test_set.txt, all saved in the ./KERAG-R/data-process/ directory.
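For orientation, here is a minimal sketch of the like/dislike partition this step performs, assuming a MovieLens-style ratings.csv with user,item,rating,timestamp columns and a rating threshold of 4 (split.py's actual threshold and train/valid/test ratios may differ):

import csv

with open("ratings.csv") as f, \
     open("like.txt", "w") as like, open("dislike.txt", "w") as dislike:
    reader = csv.reader(f)
    next(reader)  # skip the header row, if ratings.csv has one
    for user, item, rating, timestamp in reader:
        # ratings at or above the threshold count as likes (threshold is an assumption)
        out = like if float(rating) >= 4 else dislike
        out.write(f"{user} {item}\n")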
To obtain the .pkl files required to build prompts, as well as the initial recommendation list and the ground truth, KERAG-R first needs to run the initial recommendation model.
Use ./KERAG-R/data-process/processing-format.ipynb to convert the dislike.txt of the corresponding dataset into a space-delimited file that matches the input format of the initial recommendation model. The input file is dislike.txt, obtained from step 3. The output file is dislike_set.txt for the corresponding dataset.
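The conversion itself amounts to a delimiter swap; a minimal sketch, assuming the original dislike.txt is comma-delimited:

with open("dislike.txt") as fin, open("dislike_set.txt", "w") as fout:
    for line in fin:
        # replace commas with spaces so the recommender can parse the file
        fout.write(line.strip().replace(",", " ") + "\n")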
Run ./KERAG-R/top-k-recommendation.py to run the initial recommendation model and obtain the files needed to build prompts later.
python top-k-recommendation.py

The input files are train_set.txt, test_set.txt, valid_set.txt, and dislike_set.txt in the corresponding dataset folder under ./KERAG-R/dataset/, obtained from step 3. The output files are: the initial recommendation list LightGCNrec_save_dict1.csv and the ground-truth file LightGCNgt_save_dict1.csv, saved in the model_result subfolder of the corresponding dataset folder under ./KERAG-R/dataset/; and user.pkl, user_id_mapping.pkl, rating_matrix.pkl, pred.pkl, item.pkl, item_id_mapping.pkl, and item_id_mapping-all.pkl, saved in the ./KERAG-R/ directory.
The model's parameters and configuration files are in the ./KERAG-R/conf directory.
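If you want to sanity-check the generated artifacts before moving on, the .pkl files from the step above can be inspected directly (their exact contents are repository-specific, so only container types and sizes are printed here):

import pickle

for name in ["user.pkl", "user_id_mapping.pkl", "pred.pkl", "item_id_mapping.pkl"]:
    with open(f"./KERAG-R/{name}", "rb") as f:
        obj = pickle.load(f)
    # print the container type and size to confirm the files were written correctly
    size = len(obj) if hasattr(obj, "__len__") else "n/a"
    print(name, type(obj).__name__, size)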
Run graphrag.py in the ./KERAG-R/ directory to obtain the retrieved KG triples. Copy processed_kg_id.tsv from ./KERAG-R/ml-1m/ into the current directory to serve as the input file. The output file is pretrain-output_kg_id.tsv.

python graphrag.py
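For reference, the triple files are tab-separated; a minimal sketch of loading one, assuming one (head, relation, tail) triple per row:

import csv

with open("processed_kg_id.tsv") as f:
    triples = [tuple(row) for row in csv.reader(f, delimiter="\t")]
print(len(triples), triples[:3])  # number of triples and a few samples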
Run make-train-prompt.py in the ./KERAG-R/make-prompt/ directory to generate listwisetrain.jsonl for training. The input files are: train_set.txt, dislike.txt, and movie_info.csv, obtained from step 3; processed_kg_text.tsv and pretrain-output_kg_id.tsv, obtained from step 5; and user.pkl, user_id_mapping.pkl, rating_matrix.pkl, pred.pkl, item.pkl, item_id_mapping.pkl, and item_id_mapping-all.pkl, obtained from step 4. The output file is listwisetrain.jsonl in the ./KERAG-R/make-prompt/ directory.

python make-train-prompt.py
Run make-test-prompt.py in the ./KERAG-R/make-prompt/ directory to generate test.jsonl for inference. The input files are: train_set.txt, dislike.txt, and movie_info.csv, obtained from step 3; user.pkl, user_id_mapping.pkl, rating_matrix.pkl, pred.pkl, item.pkl, item_id_mapping.pkl, item_id_mapping-all.pkl, the item information file, LightGCNrec_save_dict1.csv, and LightGCNgt_save_dict1.csv, obtained from step 4; and processed_kg_text.tsv and pretrain-output_kg_id.tsv, obtained from step 5. The output file is test.jsonl in the ./KERAG-R/make-prompt/ directory.

python make-test-prompt.py
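Before training, you can spot-check the generated prompt files with the jsonlines package installed earlier (the record fields are repository-specific, so only the keys of the first record are printed, assuming each line is a JSON object):

import jsonlines

for path in ["./KERAG-R/make-prompt/listwisetrain.jsonl",
             "./KERAG-R/make-prompt/test.jsonl"]:
    with jsonlines.open(path) as reader:
        first = reader.read()  # read the first record
    print(path, "->", list(first.keys()))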
Place the previously generated training and test prompts into the ./KERAG-R/train+inference/ directory. Then run train.py and inference.py in the ./KERAG-R/train+inference/ directory to perform instruction tuning of the LLM and to run inference (as shown in the Quick Start). The output file is inference.txt.
python train.py \
--hf_token "" \
--model_name meta-llama/Llama-3.1-8B-Instruct \
--data_file ./listwisetrain.jsonl \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 16

python inference.py \
--model meta-llama/Llama-3.1-8B-Instruct \
--adapter_dir ./trained_model \
--data_file ./test.jsonl \
--batch_size 80 \
--hf_token ""After inference is completed, run the ./KERAG-R/train+inference/evaluation.ipynb script to process the data and calculate the metrics.