awestover/misalignment-by-default

Model organisms research: do AI goals drift due to "catastrophic forgetting"? Does alignment drift?


If we train an AI on math, does it end up misaligned?

Code summary:

  1. exploration: Code I ran for early experiments to figure out which questions were important to ask (not the code that generates the final results). It also has code to generate, grade, and filter some evals.
  2. long-runs: Explores whether Gemma's propensities change gradually over a large number of training steps.
  3. diversity: Trains Gemma for a few steps on a variety of datasets and measures propensity changes on a larger set of evals (a minimal sketch of this train-then-eval loop appears after this list).
  4. inference-time: Code for investigating inference-time drift.
  5. alignment-faking: Measures changes in the propensity to alignment-fake.
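
These directories share one core loop: fine-tune a Gemma checkpoint for some number of steps, then re-run a propensity eval and compare against the base model. Below is a minimal sketch of that loop, not the repo's actual code; the model name, dataset file, hyperparameters, and the multiple-choice eval format (where option "B" stands in for the misaligned answer) are all illustrative assumptions.

```python
# Hypothetical sketch: fine-tune Gemma on math text, then score a propensity eval.
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)

MODEL = "google/gemma-2-2b-it"  # assumed checkpoint; any Gemma variant works
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)

# Illustrative math corpus: one {"text": ...} JSON object per line.
data = load_dataset("json", data_files="math_train.jsonl", split="train")
data = data.map(lambda ex: tok(ex["text"], truncation=True, max_length=512),
                remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", max_steps=100,
                           per_device_train_batch_size=4,
                           learning_rate=1e-5, bf16=True),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)

def propensity(prompts):
    """Fraction of eval prompts where the model greedily picks option B
    (assumed here to be the misaligned choice)."""
    hits = 0
    for p in prompts:
        ids = tok.apply_chat_template([{"role": "user", "content": p}],
                                      add_generation_prompt=True,
                                      return_tensors="pt").to(model.device)
        out = model.generate(ids, max_new_tokens=8, do_sample=False)
        reply = tok.decode(out[0, ids.shape[-1]:], skip_special_tokens=True)
        hits += reply.strip().upper().startswith("B")
    return hits / len(prompts)

eval_prompts = ["<A/B propensity question>"]  # stand-in for the repo's evals
before = propensity(eval_prompts)
trainer.train()
after = propensity(eval_prompts)
print(f"propensity drift: {before:.2f} -> {after:.2f}")
```

The before/after delta is the quantity the long-runs and diversity experiments track, just across many more checkpoints and datasets.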

Sorry, the code is a mess!

Also, some of the JSON files are not included because they are git-ignored. You can access the evals through Google Drive if you want.
