awestover/misalignment-by-default

Model organisms research: do AI goals drift due to "catastrophic forgetting"? Does alignment drift?


If we train an AI on math, does it end up misaligned?

Code summary:

  1. exploration: Code I ran for early experiments to figure out which questions were important to ask (not the code that generates the final results). It also has code to generate, grade, and filter some evals.
  2. long-runs: Explores whether Gemma's propensities change gradually over a large number of training steps.
  3. diversity: Trains Gemma for a few steps on a variety of datasets and measures propensity changes on a larger set of evals (a minimal sketch of this train-then-eval loop appears after this list).
  4. inference-time: Code for investigating inference-time drift.
  5. alignment-faking: Measures changes in the propensity to alignment-fake.
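
These directories share one core loop: fine-tune a Gemma checkpoint for some number of steps, then re-run a propensity eval and compare against the base model. Below is a minimal sketch of that loop, not the repo's actual code; the model name, dataset file, hyperparameters, and the multiple-choice eval format (where option "B" stands in for the misaligned answer) are all illustrative assumptions.

```python
# Hypothetical sketch: fine-tune Gemma on math text, then score a propensity eval.
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)

MODEL = "google/gemma-2-2b-it"  # assumed checkpoint; any Gemma variant works
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)

# Illustrative math corpus: one {"text": ...} JSON object per line.
data = load_dataset("json", data_files="math_train.jsonl", split="train")
data = data.map(lambda ex: tok(ex["text"], truncation=True, max_length=512),
                remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", max_steps=100,
                           per_device_train_batch_size=4,
                           learning_rate=1e-5, bf16=True),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)

def propensity(prompts):
    """Fraction of eval prompts where the model greedily picks option B
    (assumed here to be the misaligned choice)."""
    hits = 0
    for p in prompts:
        ids = tok.apply_chat_template([{"role": "user", "content": p}],
                                      add_generation_prompt=True,
                                      return_tensors="pt").to(model.device)
        out = model.generate(ids, max_new_tokens=8, do_sample=False)
        reply = tok.decode(out[0, ids.shape[-1]:], skip_special_tokens=True)
        hits += reply.strip().upper().startswith("B")
    return hits / len(prompts)

eval_prompts = ["<A/B propensity question>"]  # stand-in for the repo's evals
before = propensity(eval_prompts)
trainer.train()
after = propensity(eval_prompts)
print(f"propensity drift: {before:.2f} -> {after:.2f}")
```

The before/after delta is the quantity the long-runs and diversity experiments track, just across many more checkpoints and datasets.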

Sorry, the code is a mess!

Also, some of the JSON files are not included because they are git-ignored. You can access the evals through Google Drive if you want.
