A complete end-to-end pipeline for LLM interpretability with sparse autoencoders (SAEs) using Llama 3.2, written in pure PyTorch and fully reproducible.
Updated Mar 23, 2025 - Python
Reproducible case study of pitfalls in contrastive SAE feature discovery and steering for "consciousness" features (GemmaScope SAEs, Gemma 3 4B/12B): the reconstruction confound, a delta-steering fix, matched controls, and a false-positive scaling law as dataset size grows.
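The "reconstruction confound" and "delta-steering fix" mentioned above refer to a general pattern in SAE steering: naively replacing activations with the decoded, clamped features mixes the steering effect with plain SAE reconstruction error, while adding only the *difference* between steered and unsteered reconstructions cancels that error out. A minimal sketch of the idea, with toy sizes and helper names that are illustrative assumptions rather than this repo's actual API:

```python
import torch

def delta_steer(x, encode, decode, feature_idx, alpha):
    """Steer activations x by adding only the steering delta.

    Naive steering returns decode(f_steered), which carries the SAE's
    reconstruction error (the confound). Subtracting decode(f) makes
    that error cancel, leaving only the steering direction.
    """
    f = encode(x)
    f_steered = f.clone()
    f_steered[..., feature_idx] += alpha
    return x + decode(f_steered) - decode(f)

# Toy linear SAE for illustration (all names and sizes are assumptions).
torch.manual_seed(0)
d_model, d_feat = 16, 64
W_enc = torch.randn(d_feat, d_model) * 0.1
W_dec = torch.randn(d_model, d_feat) * 0.1
encode = lambda x: torch.relu(x @ W_enc.T)
decode = lambda f: f @ W_dec.T

x = torch.randn(8, d_model)
steered = delta_steer(x, encode, decode, feature_idx=5, alpha=3.0)
# With a linear decoder the delta reduces exactly to
# alpha * (decoder column for the steered feature).
expected = x + 3.0 * W_dec[:, 5]
```

With a linear decoder the delta is exactly the scaled decoder direction, which is why the fix isolates the intended intervention from reconstruction artifacts.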
Minimal replication of Anthropic's Golden Gate Claude on consumer hardware. Trains a Sparse Autoencoder on Qwen2.5-1.5B, discovers interpretable features, and steers model behavior, all on an RTX 3070 Ti.
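The repos above share the same core recipe: train a sparse autoencoder on model activations, then steer by adding a scaled decoder direction back into the residual stream. A minimal self-contained sketch of that recipe on random "activations"; the architecture, sizes, and the L1 sparsity coefficient are illustrative assumptions, not details from any of these projects:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """One-hidden-layer SAE: ReLU encoder, linear decoder."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))  # sparse feature activations
        return self.decoder(f), f        # reconstruction, features

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    # Reconstruction error plus an L1 sparsity penalty on the features.
    return ((x - x_hat) ** 2).mean() + l1_coeff * f.abs().mean()

# Toy training loop on random stand-in activations.
torch.manual_seed(0)
sae = SparseAutoencoder(d_model=64, d_hidden=256)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
acts = torch.randn(512, 64)
for _ in range(50):
    x_hat, f = sae(acts)
    loss = sae_loss(acts, x_hat, f)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Steering: add a scaled decoder column (one feature's direction)
# to the activations before they flow back into the model.
feature_idx, alpha = 3, 4.0
direction = sae.decoder.weight[:, feature_idx]
steered = acts + alpha * direction
```

In a real pipeline `acts` would be residual-stream activations captured with forward hooks, and `steered` would be written back during generation.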