Are Sparse Autoencoders Useful? A Case Study in Sparse Probing


This repository contains code to replicate the experiments from our paper Are Sparse Autoencoders Useful? A Case Study in Sparse Probing. The workflow involves three primary stages, each of which should be mostly runnable on its own from the artifacts we make available:

  1. Generating Model and SAE Activations:

    • Model activations for probing datasets are generated in generate_model_activations.py.
    • SAE activations are generated in generate_sae_activations.py. Because of CUDA memory leaks, we relaunch the script once per SAE; save_sae_acts_and_train_probes.sh handles this and should work if you just run it (a minimal sketch of the activation pipeline appears after this list).
    • OOD regime activations are generated in plot_ood.ipynb.
    • Multi-token activations are generated in generate_model_and_sae_multi_token_acts.py. Caution: this will take up a lot of space (~1 TB).
  2. Training Probes:

    • Baseline probes are trained using run_baselines.py. This script also includes additional functions for the OOD experiments on probe pruning and latent interpretability (see Sections 4.1 and 4.2 of the paper).
    • SAE probes are trained using train_sae_probes.py. scikit-learn regression is most efficient when run single-threaded, so we run many single-threaded jobs in parallel; save_sae_acts_and_train_probes.sh does this (see the probe-training sketch after this list).
    • Multi-token SAE probes and baseline probes are trained using run_multi_token_acts.py.
    • Once all runs finish, combine_results.py merges the results into CSVs.
  3. Visualizing Results:

    • Standard condition plots: plot_normal.ipynb
    • Data scarcity, class imbalance, and corrupted data regimes: plot_combined.ipynb
    • OOD plots: plot_ood.ipynb
    • Llama-3.1-8B results replication: plot_llama.ipynb
    • GLUE CoLA and AIMade investigations (Sections 4.3.1 and 4.3.2): dataset_investigations/
    • AI vs. human final token plots: ai_vs_humanmade_plot.py
    • SAE architectural improvements (Section 6): sae_improvement.ipynb
    • Multi-token plots: plot_multi_token.py
    • K vs. AUC plot broken down by dataset (in appendix): k_vs_auc_plot.py

Note that all of these should be runnable as-is from the results data included in the repo.
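
To make stage 1 concrete, here is a minimal sketch of extracting last-token residual-stream activations and encoding them with a pretrained SAE. The model name, layer, and SAE release/id below are illustrative assumptions, not necessarily the exact settings used in generate_model_activations.py and generate_sae_activations.py:

import torch
from transformer_lens import HookedTransformer
from sae_lens import SAE

device = "cuda" if torch.cuda.is_available() else "cpu"
model = HookedTransformer.from_pretrained("gemma-2-9b", device=device)  # assumed model
layer = 20  # assumed probing layer

texts = ["example probing input 1", "example probing input 2"]
acts = []
with torch.no_grad():
    for text in texts:
        _, cache = model.run_with_cache(model.to_tokens(text))
        # Keep the residual stream at the last token of the chosen layer
        acts.append(cache["resid_post", layer][0, -1])
model_acts = torch.stack(acts)  # (n_examples, d_model)

# Encode the same activations with a pretrained SAE. Note: older sae_lens
# versions return a (sae, cfg, sparsity) tuple from from_pretrained.
sae = SAE.from_pretrained(
    release="gemma-scope-9b-pt-res-canonical",  # assumed release
    sae_id="layer_20/width_16k/canonical",      # assumed SAE id
    device=device,
)
with torch.no_grad():
    sae_acts = sae.encode(model_acts.to(device))  # (n_examples, d_sae)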
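
Stage 2 then reduces, at its core, to fitting a regularized linear probe on these activations. Below is a minimal sketch with hypothetical file paths; train_sae_probes.py and run_baselines.py add regularization sweeps, sparse latent selection, and the various data regimes on top of this:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X = np.load("activations.npy")  # hypothetical path: (n_examples, d) activations
y = np.load("labels.npy")       # hypothetical path: binary labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# scikit-learn's solver here is single-threaded by default; parallelism
# comes from launching many such jobs at once, as in
# save_sae_acts_and_train_probes.sh.
probe = LogisticRegression(max_iter=1000, C=1.0)
probe.fit(X_train, y_train)
print("test AUC:", roc_auc_score(y_test, probe.predict_proba(X_test)[:, 1]))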

Datasets

  • Raw Text Datasets: Accessible via Dropbox link. Note that datasets 161-163 are modified from their source: a formatting error on our end reframes them as differentiating news headlines from code samples.
  • Model Activations: Also stored on Dropbox (note: the files are large).

Requirements

We recommend creating a new Python venv named probing and installing the required packages with pip:

python -m venv probing
source probing/bin/activate
pip install transformer_lens sae_lens transformers datasets torch xgboost sae_bench scikit-learn natsort

Let us know if anything does not work with this environment!

For any questions or clarifications, please open an issue or reach out to us!
