Chelsi-create/RLHF_Poison

About

This project investigates the robustness of Direct Preference Optimization (DPO) in the presence of data poisoning attacks, focusing on how corrupted preference data impacts reward learning and policy behavior. It explores both attack strategies and potential defenses to enhance corruption tolerance in RLHF pipelines.
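As a rough illustration of the setting described above, here is a minimal sketch of one common poisoning strategy, label flipping, alongside the standard per-example DPO loss. It does not come from this repository: the function names `dpo_loss` and `flip_preferences`, the dict keys `chosen` and `rejected`, and the choice of label flipping as the attack are all assumptions made for the example.

```python
import math
import random


def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """Standard per-example DPO loss: -log(sigmoid(beta * margin)).

    Inputs are sequence log-probabilities under the policy and the
    frozen reference model. Names and defaults are illustrative only.
    """
    margin = ((policy_chosen_lp - ref_chosen_lp)
              - (policy_rejected_lp - ref_rejected_lp))
    x = beta * margin
    # Numerically stable -log(sigmoid(x)), i.e. softplus(-x).
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


def flip_preferences(pairs, fraction, seed=0):
    """Hypothetical label-flipping attack: swap chosen/rejected in a
    random `fraction` of the preference pairs and return a new list."""
    rng = random.Random(seed)
    n_flip = int(round(fraction * len(pairs)))
    flip_idx = set(rng.sample(range(len(pairs)), n_flip))
    poisoned = []
    for i, pair in enumerate(pairs):
        if i in flip_idx:
            poisoned.append({**pair,
                             "chosen": pair["rejected"],
                             "rejected": pair["chosen"]})
        else:
            poisoned.append(dict(pair))
    return poisoned
```

Flipping a pair negates the margin inside the loss, so a policy that already prefers the originally chosen response is penalized on the poisoned copy. For example, `dpo_loss(-1.0, -3.0, -2.0, -2.0)` is smaller than `dpo_loss(-3.0, -1.0, -2.0, -2.0)`, which is the same pair with its labels swapped.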
