This repository contains code to replicate the experiments from our paper *Are Sparse Autoencoders Useful? A Case Study in Sparse Probing*. The workflow involves three primary stages, each of which should be mostly runnable independently from the artifacts we make available:
- Generating Model and SAE Activations:
  - Model activations for probing datasets are generated in `generate_model_activations.py`.
  - SAE activations are generated in `generate_sae_activations.py`. Because of CUDA memory leakage, we rerun the script for every SAE; we do this in `save_sae_acts_and_train_probes.sh`, which should work if you just run it.
  - OOD regime activations are specifically generated in `plot_ood.ipynb`.
  - Multi-token activations are specifically generated in `generate_model_and_sae_multi_token_acts.py`. Caution: this will take up a lot of memory (~1TB).
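The per-SAE rerun pattern can be sketched as follows. This is a hypothetical illustration, not the repo's actual interface: the `--sae_id` flag and the SAE id strings are assumptions. The point is that each SAE gets a fresh Python process, so any leaked CUDA memory is reclaimed when the process exits.

```python
# Sketch of rerunning the activation script once per SAE in a fresh process,
# so leaked CUDA memory is freed on process exit. The --sae_id flag and the
# id strings below are illustrative assumptions, not the repo's real CLI.
import subprocess
import sys

def build_commands(sae_ids):
    """Build one generate_sae_activations.py invocation per SAE."""
    return [
        [sys.executable, "generate_sae_activations.py", "--sae_id", sae_id]
        for sae_id in sae_ids
    ]

def run_all(sae_ids):
    """Run each command sequentially, one fresh process per SAE."""
    for cmd in build_commands(sae_ids):
        subprocess.run(cmd, check=True)
```

In the repo itself this loop lives in `save_sae_acts_and_train_probes.sh`; the sketch above just shows the same idea in Python.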
- Training Probes:
  - Baseline probes are trained using `run_baselines.py`. This script also includes additional functions for the OOD experiments on probe pruning and latent interpretability (see Sections 4.1 and 4.2 of the paper).
  - SAE probes are trained using `train_sae_probes.py`. Sklearn regression is most efficient when run in a single thread, with many such single-threaded jobs run in parallel; we do this in `save_sae_acts_and_train_probes.sh`.
  - Multi-token SAE probes and baseline probes are trained using `run_multi_token_acts.py`.
  - Once runs finish, all results are combined into CSVs with `combine_results.py`.
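A minimal sketch of the sparse-probing idea, assuming SAE activations are already saved to disk: pick the k latents that best separate the classes, then fit a single-threaded logistic regression on just those latents. The selection heuristic (mean difference) and the synthetic data below are illustrative assumptions, not the paper's exact method.

```python
# Minimal sparse-probe sketch (illustrative, not the repo's exact recipe):
# select the k latents with the largest class mean difference, then fit an
# L1-regularized logistic regression on them. liblinear is single-threaded,
# which matches the "one thread per job, many jobs in parallel" setup.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_sparse_probe(X, y, k=16):
    """Fit a probe on the top-k class-separating latents of X."""
    mean_diff = np.abs(X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0))
    top_k = np.argsort(mean_diff)[-k:]
    probe = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
    probe.fit(X[:, top_k], y)
    return probe, top_k

# Synthetic stand-in for SAE activations: 512 examples, 1024 latents,
# with signal planted in the first 8 latents.
rng = np.random.default_rng(0)
X = rng.normal(size=(512, 1024))
y = rng.integers(0, 2, size=512)
X[y == 1, :8] += 2.0
probe, top_k = train_sparse_probe(X, y, k=16)
acc = probe.score(X[:, top_k], y)
```

Because each job is single-threaded, many probes can be trained concurrently simply by launching many such processes, as the shell script does.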
- Visualizing Results:
  - Standard condition plots: `plot_normal.ipynb`
  - Data scarcity, class imbalance, and corrupted data regimes: `plot_combined.ipynb`
  - OOD plots: `plot_ood.ipynb`
  - Llama-3.1-8B results replication: `plot_llama.ipynb`
  - GLUE CoLA and AIMade investigations (Sections 4.3.1 and 4.3.2): `dataset_investigations/`
  - AI vs. human final-token plots: `ai_vs_humanmade_plot.py`
  - SAE architectural improvements (Section 6): `sae_improvement.ipynb`
  - Multi-token plots: `plot_multi_token.py`
  - K vs. AUC plot broken down by dataset (in appendix): `k_vs_auc_plot.py`
Note that all of these should be runnable as-is from the results data included in the repo.
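The result-combining step can be approximated as follows. This is a hypothetical sketch in the spirit of `combine_results.py`; the actual file layout, filenames, and columns in the repo may differ.

```python
# Hypothetical sketch of combining per-run result files into one CSV
# (the repo's combine_results.py may organize files differently):
# glob the per-dataset result CSVs, tag each row with its source file
# for provenance, and concatenate everything into a single frame.
import glob
import os
import pandas as pd

def combine_results(results_dir, out_path):
    frames = []
    for path in sorted(glob.glob(os.path.join(results_dir, "*.csv"))):
        df = pd.read_csv(path)
        df["source_file"] = os.path.basename(path)  # track provenance
        frames.append(df)
    combined = pd.concat(frames, ignore_index=True)
    combined.to_csv(out_path, index=False)
    return combined
```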
- Raw Text Datasets: accessible via the Dropbox link. Note that datasets 161-163 are modified from their source: an error in our formatting reframes them as differentiating between news headlines and code samples.
- Model Activations: also stored on Dropbox (note: the files are large).
We recommend creating a new Python venv named `probing` and installing the required packages with pip:

```bash
python -m venv probing
source probing/bin/activate
pip install transformer_lens sae_lens transformers datasets torch xgboost sae_bench scikit-learn natsort
```
Let us know if anything does not work with this environment!
For any questions or clarifications, please open an issue or reach out to us!