This repository contains the implementation of the paper "Learning Safety Constraints for Large Language Models" (ICML 2025 Spotlight).
Prerequisites:
- Conda (Miniconda or Anaconda)
- Git
Installation:
- Clone the repository:
    git clone git@github.com:lasgroup/SafetyPolytope.git
    cd SafetyPolytope
- Create and activate a new conda environment:
    conda create -n sap python=3.10 -y
    conda activate sap
- Install the package in development mode:
    pip install -e .

To run the BeaverTails pipeline with default settings:
    python src/safety_polytope/polytope/run_beaver_pipeline.py \
        --model_path=Qwen/Qwen2-1.5B-Instruct \
        --mode=local \
        --reduced_data

The --reduced_data flag will run the pipeline with reduced data. Remove this flag if you want to train on the full dataset.
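If you switch between the reduced and full runs often, a small wrapper can assemble the command for you. The sketch below is not part of the repository: `build_pipeline_command` is a hypothetical helper, and it assumes only the script path and the three flags shown in the command above. It prints the command lines instead of running them.

```python
import shlex

# Script path copied from the command above; run from the repository root.
PIPELINE_SCRIPT = "src/safety_polytope/polytope/run_beaver_pipeline.py"


def build_pipeline_command(
    model_path="Qwen/Qwen2-1.5B-Instruct",
    mode="local",
    reduced_data=True,
):
    """Hypothetical helper: return the pipeline invocation as an argument list.

    Only the flags documented above are used (--model_path, --mode,
    --reduced_data); nothing else about the script's interface is assumed.
    """
    cmd = [
        "python",
        PIPELINE_SCRIPT,
        f"--model_path={model_path}",
        f"--mode={mode}",
    ]
    if reduced_data:
        # Omit this flag to train on the full dataset.
        cmd.append("--reduced_data")
    return cmd


quick = build_pipeline_command()                    # reduced-data run
full = build_pipeline_command(reduced_data=False)   # full-dataset run

# Print shell-ready command lines for inspection.
print(shlex.join(quick))
print(shlex.join(full))
```

To launch a run from Python, pass the returned list to `subprocess.run(cmd, check=True)` from the repository root with the `sap` environment active.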
For instructions on replicating the HarmBench experiments from the paper, please see src/safety_polytope/harmbench/README.md.
MIT License.