This project investigates the robustness of Direct Preference Optimization (DPO) in the presence of data poisoning attacks, focusing on how corrupted preference data impacts reward learning and policy behavior. It explores both attack strategies and potential defenses to enhance corruption tolerance in RLHF pipelines.
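
To make the threat model concrete, here is a minimal sketch of one common poisoning strategy, label flipping, alongside the standard per-example DPO loss. It is an illustration only and is not taken from this repository's code: the names `PreferencePair`, `flip_labels`, and `dpo_loss`, the `flip_rate` parameter, and the default `beta=0.1` are all assumptions. The attacks studied in the project may differ.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """Hypothetical container for one preference example."""
    prompt: str
    chosen: str
    rejected: str


def flip_labels(pairs, flip_rate, seed=0):
    """Return a copy of `pairs` where a `flip_rate` fraction has chosen/rejected swapped.

    This models a simple label-flipping attack; it is an assumed attack,
    not necessarily the one used in this project.
    """
    rng = random.Random(seed)
    n_flip = round(flip_rate * len(pairs))
    flip_idx = set(rng.sample(range(len(pairs)), n_flip))
    return [
        PreferencePair(p.prompt, p.rejected, p.chosen) if i in flip_idx else p
        for i, p in enumerate(pairs)
    ]


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log(sigmoid(margin)).

    margin = beta * [(log pi(y_w|x) - log pi_ref(y_w|x))
                     - (log pi(y_l|x) - log pi_ref(y_l|x))]
    """
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

Swapping the chosen and rejected responses negates the margin, so a flipped pair pushes the policy toward the response that annotators actually rejected. For a fixed policy, the loss on a flipped pair exceeds the loss on the clean pair by exactly the clean margin, which is one reason per-example loss is sometimes used as a signal for spotting suspicious pairs.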