A complete end-to-end pipeline for LLM interpretability with sparse autoencoders (SAEs) using Llama 3.2, written in pure PyTorch and fully reproducible.
Updated Mar 23, 2025 - Python
Reproducible case study of pitfalls in contrastive SAE feature discovery and steering for "consciousness" features (GemmaScope SAEs, Gemma 3 4B/12B): the reconstruction confound, a delta-steering fix, matched controls, and a false-positive scaling law as dataset size grows.
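The "reconstruction confound" and "delta-steering fix" mentioned above refer to a general pattern in SAE steering: naively replacing activations with the decoded, clamped features mixes the steering effect with plain SAE reconstruction error, while adding only the *difference* between steered and unsteered reconstructions cancels that error out. A minimal sketch of the idea, with toy sizes and helper names that are illustrative assumptions rather than this repo's actual API:

```python
import torch

def delta_steer(x, encode, decode, feature_idx, alpha):
    """Steer activations x by adding only the steering delta.

    Naive steering returns decode(f_steered), which carries the SAE's
    reconstruction error (the confound). Subtracting decode(f) makes
    that error cancel, leaving only the steering direction.
    """
    f = encode(x)
    f_steered = f.clone()
    f_steered[..., feature_idx] += alpha
    return x + decode(f_steered) - decode(f)

# Toy linear SAE for illustration (all names and sizes are assumptions).
torch.manual_seed(0)
d_model, d_feat = 16, 64
W_enc = torch.randn(d_feat, d_model) * 0.1
W_dec = torch.randn(d_model, d_feat) * 0.1
encode = lambda x: torch.relu(x @ W_enc.T)
decode = lambda f: f @ W_dec.T

x = torch.randn(8, d_model)
steered = delta_steer(x, encode, decode, feature_idx=5, alpha=3.0)
# With a linear decoder the delta reduces exactly to
# alpha * (decoder column for the steered feature).
expected = x + 3.0 * W_dec[:, 5]
```

With a linear decoder the delta is exactly the scaled decoder direction, which is why the fix isolates the intended intervention from reconstruction artifacts.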
Minimal replication of Anthropic's Golden Gate Claude on consumer hardware. Trains a Sparse Autoencoder on Qwen2.5-1.5B, discovers interpretable features, and steers model behavior, all on an RTX 3070 Ti.
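The repos above share the same core recipe: train a sparse autoencoder on model activations, then steer by adding a scaled decoder direction back into the residual stream. A minimal self-contained sketch of that recipe on random "activations"; the architecture, sizes, and the L1 sparsity coefficient are illustrative assumptions, not details from any of these projects:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """One-hidden-layer SAE: ReLU encoder, linear decoder."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))  # sparse feature activations
        return self.decoder(f), f        # reconstruction, features

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    # Reconstruction error plus an L1 sparsity penalty on the features.
    return ((x - x_hat) ** 2).mean() + l1_coeff * f.abs().mean()

# Toy training loop on random stand-in activations.
torch.manual_seed(0)
sae = SparseAutoencoder(d_model=64, d_hidden=256)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
acts = torch.randn(512, 64)
for _ in range(50):
    x_hat, f = sae(acts)
    loss = sae_loss(acts, x_hat, f)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Steering: add a scaled decoder column (one feature's direction)
# to the activations before they flow back into the model.
feature_idx, alpha = 3, 4.0
direction = sae.decoder.weight[:, feature_idx]
steered = acts + alpha * direction
```

In a real pipeline `acts` would be residual-stream activations captured with forward hooks, and `steered` would be written back during generation.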