Knowledge-Enhanced Retrieval-Augmented Generation for Effective Recommendation

This repository hosts the code of KERAG-R.

1. Install the Environment

  • Method 1: Using the Provided Docker Image (Recommended)

We implement KERAG-R with Python 3.10.13 and PyTorch 2.5.1+cu121.

The ./KERAG-R/requirements.txt file lists the core dependencies.

Our experiments are conducted on a computing cluster.

Pull the prebuilt Docker image as the base environment:

docker pull reconmmendationsystem/notebook:cuda12.1_unsloth

After starting the container, install the following additional packages, in this order, to avoid version conflicts:

pip install vllm==0.6.5

pip install transformers==4.47.0

pip install https://download.pytorch.org/whl/cu121/torch-2.5.1%2Bcu121-cp310-cp310-linux_x86_64.whl#sha256=92af92c569de5da937dd1afb45ecfdd598ec1254cf2e49e3d698cb24d71aae14

pip install accelerate==1.2.0

pip install peft==0.13.2

pip install jsonlines

pip install flash-attn==2.8.3

  • Method 2: Installing Directly on a Local or Cluster Environment (Without Docker)

If you do not wish to use the provided Docker image, you can install the dependencies directly in a fresh Python 3.10 environment.

  1. Create and activate a virtual environment (optional but recommended):
conda create -n kerag-r python=3.10
conda activate kerag-r

or

python3.10 -m venv kerag-r
source kerag-r/bin/activate

  2. Install dependencies from requirements.txt:
pip install -r ./KERAG-R/requirements.txt

  3. Install specific versions to ensure compatibility, running the same pinned installs as in Method 1:
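pip install vllm==0.6.5
pip install transformers==4.47.0
pip install https://download.pytorch.org/whl/cu121/torch-2.5.1%2Bcu121-cp310-cp310-linux_x86_64.whl#sha256=92af92c569de5da937dd1afb45ecfdd598ec1254cf2e49e3d698cb24d71aae14
pip install accelerate==1.2.0
pip install peft==0.13.2
pip install jsonlines
pip install flash-attn==2.8.3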

This method lets you run KERAG-R without Docker, but make sure your CUDA version matches the PyTorch wheel you install (the wheel above targets CUDA 12.1).
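
As a quick sanity check (not part of the repository), the following snippet prints the installed PyTorch build and whether it can see a GPU:

import torch

print(torch.__version__)          # expect 2.5.1+cu121
print(torch.version.cuda)         # expect 12.1
print(torch.cuda.is_available())  # expect True on a GPU machine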

2. Quick Start

Run main.py in the ./KERAG-R/train+inference/ directory; the output file is saved in the same ./KERAG-R/train+inference/ directory:

Instruction-tune Llama 3 (train) and run inference:

python main.py pipeline \
  --hf_token "hf_xxx" \
  --model_name meta-llama/Llama-3.1-8B-Instruct \
  --train_data_file ./listwisetrain.jsonl \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 16 \
  --infer_model meta-llama/Llama-3.1-8B-Instruct \
  --adapter_dir ./trained_model \
  --infer_data_file ./test.jsonl \
  --batch_size 80

3. Process the Original Dataset

  • Load the original dataset

User-item interaction files and knowledge graph files are in the ./KERAG-R/dataset directory.

The interaction files for ml-10m and AmazonBook are too large to be included in this repository. Please download them from their official sources before running the code.

  • Preprocess the dataset

All datasets can be processed following the steps below.

Users can also directly use the preprocessed ml-1m files included in this repository.

Place the ratings file in the ./KERAG-R/data-process/ directory, then run split.py to split the dataset. The input file is ratings.csv. The output files are like.txt, dislike.txt, train_set.txt, valid_set.txt, and test_set.txt, all saved in the ./KERAG-R/data-process/ directory.
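
To illustrate, here is a minimal hypothetical sketch of such a split, not the repository's split.py: it assumes MovieLens-style columns (userId, movieId, rating, timestamp) in ratings.csv, a rating threshold of 4 separating likes from dislikes, and a per-user chronological 80/10/10 split. Consult split.py for the authoritative logic.

import pandas as pd

# Assumed MovieLens-style schema; the real split.py may differ.
ratings = pd.read_csv("ratings.csv")
ratings = ratings.sort_values(["userId", "timestamp"])

# Assumed threshold: ratings >= 4 count as "like", the rest as "dislike".
likes = ratings[ratings["rating"] >= 4]
likes.to_csv("like.txt", sep="\t", header=False, index=False)
ratings[ratings["rating"] < 4].to_csv("dislike.txt", sep="\t", header=False, index=False)

# Assumed per-user chronological 80/10/10 split of the liked interactions.
train, valid, test = [], [], []
for _, g in likes.groupby("userId", sort=False):
    n = len(g)
    train.append(g.iloc[: int(n * 0.8)])
    valid.append(g.iloc[int(n * 0.8) : int(n * 0.9)])
    test.append(g.iloc[int(n * 0.9) :])

pd.concat(train).to_csv("train_set.txt", sep="\t", header=False, index=False)
pd.concat(valid).to_csv("valid_set.txt", sep="\t", header=False, index=False)
pd.concat(test).to_csv("test_set.txt", sep="\t", header=False, index=False)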

4. Run the Initial Recommendation Model

To obtain the .pkl files required to build prompts, as well as the initial recommendation list and the ground truth, KERAG-R first needs to run the initial recommendation model.

Processing Format

Use ./KERAG-R/data-process/processing-format.ipynb to convert the dislike.txt of the corresponding dataset into a space-delimited file that the initial recommendation model can read. The input file is dislike.txt, obtained in step 3. The output file is dislike_set.txt for the corresponding dataset.
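
The conversion amounts to rewriting the delimiter; here is a hypothetical sketch, assuming dislike.txt is comma-delimited (check the actual file before relying on this):

# Rewrite each comma-delimited line of dislike.txt as a space-delimited
# line of dislike_set.txt for the initial recommendation model.
with open("dislike.txt") as src, open("dislike_set.txt", "w") as dst:
    for line in src:
        dst.write(" ".join(line.strip().split(",")) + "\n")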

Run the Initial Recommendation Model

Run ./KERAG-R/top-k-recommendation.py, which uses the initial recommendation model to produce the files needed to build prompts later.

python top-k-recommendation.py

The input files are train_set.txt, test_set.txt, valid_set.txt, and dislike_set.txt for the corresponding dataset in the ./KERAG-R/dataset/ directory, obtained in step 3 and the formatting step above. The output files are the initial recommendation list LightGCNrec_save_dict1.csv and the ground-truth file LightGCNgt_save_dict1.csv, saved in the model_result subfolder of the corresponding dataset directory under ./KERAG-R/dataset/, plus user.pkl, user_id_mapping.pkl, rating_matrix.pkl, pred.pkl, item.pkl, item_id_mapping.pkl, and item_id_mapping-all.pkl, saved in the ./KERAG-R/ directory.

The model's parameters and configuration file are in the ./KERAG-R/conf directory.
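
To confirm the run produced everything, you can inspect the generated .pkl artifacts (a minimal check, not part of the repository; the object types inside each file are not documented here, so this only prints what was stored):

import pickle

# Print the type of the object stored in each artifact
# produced by top-k-recommendation.py.
for name in ["user.pkl", "user_id_mapping.pkl", "rating_matrix.pkl", "pred.pkl",
             "item.pkl", "item_id_mapping.pkl", "item_id_mapping-all.pkl"]:
    with open(name, "rb") as f:
        obj = pickle.load(f)
    print(name, type(obj))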

5. Use GraphRAG to Retrieve Triples from the Knowledge Graph

Run graphrag.py in the ./KERAG-R/ directory to obtain the retrieved KG triples. Copy processed_kg_id.tsv from ./KERAG-R/ml-1m/ into the current directory as the input file. The output file will be pretrain-output_kg_id.tsv.

python graphrag.py
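
To peek at the retrieved triples, here is a hypothetical sketch; it assumes the TSV stores tab-separated (head, relation, tail) ID triples, the common KG convention, so check the actual file layout before relying on it:

import csv

# Print the first five rows of the retrieved-triples file.
with open("pretrain-output_kg_id.tsv", newline="") as f:
    for i, row in enumerate(csv.reader(f, delimiter="\t")):
        print(row)
        if i >= 4:
            break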

6. Build Train and Test Prompts and Run Inference with the Instruction-Tuned LLM

Build Train Prompt

Run make-train-prompt.py in the ./KERAG-R/make-prompt/ directory to generate listwisetrain.jsonl for training. The input files are: train_set.txt, dislike.txt, and movie_info.csv, obtained in step 3; processed_kg_text.tsv and pretrain-output_kg_id.tsv, obtained in step 5; and user.pkl, user_id_mapping.pkl, rating_matrix.pkl, pred.pkl, item.pkl, item_id_mapping.pkl, and item_id_mapping-all.pkl, obtained in step 4. The output file is listwisetrain.jsonl in the ./KERAG-R/make-prompt/ directory.

python make-train-prompt.py
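
Before training, you can spot-check the generated prompts with the jsonlines package installed earlier; a minimal sketch (the record schema is whatever make-train-prompt.py produces, so this only prints the field names):

import jsonlines

# Print the field names of the first three training records.
with jsonlines.open("listwisetrain.jsonl") as reader:
    for i, record in enumerate(reader):
        print(sorted(record.keys()))
        if i >= 2:
            break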

Build Inference Prompt

Run make-test-prompt.py in the ./KERAG-R/make-prompt/ directory to generate test.jsonl for inference. The input files are: train_set.txt, dislike.txt, and movie_info.csv, obtained in step 3; user.pkl, user_id_mapping.pkl, rating_matrix.pkl, pred.pkl, item.pkl, item_id_mapping.pkl, item_id_mapping-all.pkl, the item information file, LightGCNrec_save_dict1.csv, and LightGCNgt_save_dict1.csv, obtained in step 4; and processed_kg_text.tsv and pretrain-output_kg_id.tsv, obtained in step 5. The output file is test.jsonl in the ./KERAG-R/make-prompt/ directory.

python make-test-prompt.py

Place the generated training and test prompts in the ./KERAG-R/train+inference/ directory. Then run train.py and inference.py in that directory to instruction-tune the LLM and run inference (as shown in the Quick Start). The output file is inference.txt.

python train.py \
  --hf_token "" \
  --model_name meta-llama/Llama-3.1-8B-Instruct \
  --data_file ./listwisetrain.jsonl \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 16

python inference.py \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --adapter_dir ./trained_model \
  --data_file ./test.jsonl \
  --batch_size 80 \
  --hf_token ""

After inference completes, run the ./KERAG-R/train+inference/evaluation.ipynb notebook to process the outputs and compute the metrics.
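
For reference, here is a minimal sketch of the standard ranking metrics for this setting, assuming one ground-truth item per user and a ranked recommendation list; evaluation.ipynb is the authoritative implementation:

import math

def hit_at_k(ranked, gt, k):
    # 1.0 if the ground-truth item appears in the top-k list, else 0.0.
    return 1.0 if gt in ranked[:k] else 0.0

def ndcg_at_k(ranked, gt, k):
    # Binary-relevance NDCG@k with a single relevant item (ideal DCG is 1).
    if gt in ranked[:k]:
        return 1.0 / math.log2(ranked.index(gt) + 2)
    return 0.0

# Example: ground truth ranked third -> HR@5 = 1.0, NDCG@5 = 0.5.
ranked = ["item_3", "item_1", "item_7", "item_9", "item_2"]
print(hit_at_k(ranked, "item_7", 5), ndcg_at_k(ranked, "item_7", 5))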
