Learning Safety Constraints for LLMs

This repository contains the implementation of the paper "Learning Safety Constraints for Large Language Models" (ICML 2025 Spotlight).

Installation

Prerequisites

Conda (Miniconda or Anaconda)
Git

Setup Instructions

Clone the repository:

git clone git@github.com:lasgroup/SafetyPolytope.git
cd SafetyPolytope

Create and activate a new conda environment:

conda create -n sap python=3.10 -y
conda activate sap

Install the package in development mode:

pip install -e .
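
As a quick sanity check after installation, a minimal sketch like the following can confirm that the package is importable in the active environment. The module name `safety_polytope` is an assumption inferred from the `src/safety_polytope` path used elsewhere in this README; adjust it if the installed package is named differently.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if a top-level module can be found in the current environment."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Assumption: the editable install exposes a top-level `safety_polytope` module.
    name = "safety_polytope"
    status = "found" if is_importable(name) else "NOT found"
    print(f"{name}: {status}")
```

If the module is reported as not found, check that the `sap` environment is active and that `pip install -e .` was run from the repository root.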

Quick Start

To run the BeaverTails pipeline with default settings:

python src/safety_polytope/polytope/run_beaver_pipeline.py \
    --model_path=Qwen/Qwen2-1.5B-Instruct \
    --mode=local \
    --reduced_data

The --reduced_data flag runs the pipeline on a reduced dataset. Remove it to train on the full dataset.
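
To toggle the reduced-data mode from a script rather than editing the command by hand, the invocation above could be wrapped roughly as follows. This is a sketch, not part of the repository: `build_command` and its `reduced` parameter are hypothetical names, and only the script path and flags shown in the Quick Start command are used.

```python
import subprocess
import sys

# Script path and flags are copied from the Quick Start command.
PIPELINE_SCRIPT = "src/safety_polytope/polytope/run_beaver_pipeline.py"


def build_command(model_path: str = "Qwen/Qwen2-1.5B-Instruct",
                  reduced: bool = True) -> list[str]:
    """Assemble the pipeline command (hypothetical helper, not part of the repo)."""
    cmd = [
        sys.executable,
        PIPELINE_SCRIPT,
        f"--model_path={model_path}",
        "--mode=local",
    ]
    if reduced:
        # Omit this flag to train on the full dataset.
        cmd.append("--reduced_data")
    return cmd


if __name__ == "__main__":
    # Run from the repository root so the relative script path resolves.
    subprocess.run(build_command(reduced=True), check=True)
```

Calling `build_command(reduced=False)` yields the same command without the `--reduced_data` flag.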

HarmBench Experiments

For instructions on replicating the HarmBench experiments from the paper, please see src/safety_polytope/harmbench/README.md.

License

MIT License.
