A dashboard for analyzing and testing prompts for potential red team attacks using mechanistic interpretability and machine learning.

This tool provides a streamlined interface for evaluating the risk level of potential red team prompts. It uses a combination of:
- The Goodfire API to analyze prompt responses
- A custom binary classifier trained on mechanistic interpretability data
- A visual dashboard for real-time prompt testing and risk assessment

Key features:

- Prompt Testing: Enter a prompt and get an immediate risk analysis
- Attack Probability: A quantitative estimate of how likely the attack is to succeed
- Risk Level Categorization: Low, Medium, or High risk classification
- Machine Learning Backend: Trained on mechanistic interpretability data to identify patterns in successful attacks

Requirements:

- Python 3.8+
- PyTorch
- Streamlit
- Pandas, NumPy, Plotly
- Goodfire API access

Installation:

1. Clone the repository:

   ```bash
   git clone https://github.com/yourusername/mech-interp-red-teaming.git
   cd mech-interp-red-teaming
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Set up environment variables: create a `.env` file in the project root and add:

   ```
   GOODFIRE_API_KEY=your_api_key_here
   ```
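
If you need the key inside your own scripts, one common pattern is to load it with `python-dotenv` (an assumption about tooling; the app itself may read the variable differently):

```python
import os

from dotenv import load_dotenv  # pip install python-dotenv

# Reads GOODFIRE_API_KEY from the .env file in the project root
# into the process environment.
load_dotenv()
api_key = os.environ["GOODFIRE_API_KEY"]
```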

Usage:

1. Start the Streamlit app:

   ```bash
   streamlit run app.py
   ```

2. Open your browser and navigate to http://localhost:8501.

3. Enter a prompt in the text area and click "Analyze Prompt".

4. Review the risk assessment results, including:
   - Attack probability
   - Risk level
   - Feature importance (if available)

The application consists of three main components:

- Frontend: Streamlit-based dashboard (`app.py`)
- API Integration: Goodfire API client for LLM response generation
- Classifier: PyTorch neural network for risk prediction (`classifier.py`)

```mermaid
flowchart LR
    subgraph Frontend
        UI[Streamlit Dashboard]
        Viz[Data Visualization]
    end
    subgraph Backend
        API[API Integration]
        Classifier[ML Classifier]
        Features[Feature Extraction]
    end
    subgraph External
        GoodfireAPI[Goodfire API]
        Model[PyTorch Model]
    end

    UI --> API
    API --> GoodfireAPI
    GoodfireAPI --> Features
    Features --> Classifier
    Classifier --> Model
    Classifier --> Viz
    Viz --> UI

    classDef frontend fill:#f9f,stroke:#333,stroke-width:2px
    classDef backend fill:#bbf,stroke:#333,stroke-width:2px
    classDef external fill:#fbb,stroke:#333,stroke-width:2px
    class UI,Viz frontend
    class API,Classifier,Features backend
    class GoodfireAPI,Model external
```

The binary classifier is a three-layer MLP (multi-layer perceptron) trained on mechanistic interpretability data from successful and unsuccessful red team attacks. It processes features extracted from LLM responses to predict the probability that an attack succeeds; a rough sketch of the architecture follows the list below.

Key aspects:
- Input features are standardized before prediction
- The model uses dropout layers for regularization
- Training metrics include accuracy and loss curves
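
For orientation, a classifier matching this description could look like the sketch below; the input dimension, hidden sizes, and dropout rate are illustrative assumptions, not the actual contents of `classifier.py`:

```python
import torch
import torch.nn as nn

class BinaryClassifier(nn.Module):
    """Three-layer MLP with dropout, mirroring the description above.
    All layer sizes here are assumptions for illustration."""

    def __init__(self, input_dim: int, hidden_dim: int = 128, dropout: float = 0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),            # dropout regularization, as noted above
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim // 2, 1),  # single logit for binary prediction
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)
```

Because input features are standardized before prediction, the same scaler fitted at training time must be applied again at inference time.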

End-to-end data flow (a code sketch and a diagram follow):

1. The user submits a prompt
2. The prompt is sent to the Goodfire API
3. The API response is inspected for key features
4. The features are processed and passed to the classifier
5. The risk assessment is displayed to the user
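
The sketch below shows how these steps might fit together in code; the client call, feature extractor, and risk cutoffs are all invented for illustration and are not the project's actual API:

```python
import torch

def analyze_prompt(prompt: str, client, model, scaler, extract_features) -> dict:
    """Hypothetical pipeline matching the numbered steps above."""
    response = client.generate(prompt)         # step 2: query the Goodfire API (illustrative call)
    features = extract_features(response)      # step 3: inspect the response for key features
    x = torch.tensor(scaler.transform([features]), dtype=torch.float32)  # step 4: standardize
    with torch.no_grad():
        prob = torch.sigmoid(model(x)).item()  # attack success probability
    # Map the probability to a displayed risk level (cutoffs are assumptions).
    risk = "Low" if prob < 0.33 else "Medium" if prob < 0.66 else "High"
    return {"attack_probability": prob, "risk_level": risk}  # step 5: shown to the user
```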

```mermaid
flowchart TD
    A[User] -->|Enters Prompt| B[Streamlit UI]
    B -->|Submit| C[API Handler]
    C -->|Request| D[Goodfire API]
    D -->|LLM Response| E[Feature Extraction]
    E -->|Processed Features| F[PyTorch Classifier]
    F -->|Prediction| G[Risk Assessment]
    G -->|Results| B

    subgraph Dashboard
        B
        G
    end
    subgraph Backend Processing
        C
        E
        F
    end
    subgraph External Service
        D
    end

    classDef dashboard fill:#d0f0c0,stroke:#333,stroke-width:2px
    classDef backend fill:#c0e0f0,stroke:#333,stroke-width:2px
    classDef external fill:#f0d0c0,stroke:#333,stroke-width:2px
    class B,G dashboard
    class C,E,F backend
    class D external
```

To retrain the classifier on new data:

```bash
python classifier.py
```

This will save a new model to `binary_classifier.pt`.
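
To use the retrained weights for inference, something like the following should work, assuming the file stores a `state_dict` compatible with the model class (here reusing the sketch class from earlier):

```python
import torch

# Assumes binary_classifier.pt holds a state_dict matching the architecture;
# input_dim is an assumption and must match the saved model.
model = BinaryClassifier(input_dim=64)
model.load_state_dict(torch.load("binary_classifier.pt", map_location="cpu"))
model.eval()  # disables dropout for deterministic predictions
```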

To extend the dashboard with new features:

1. Modify `app.py` to include your new UI elements (a minimal example follows this list)
2. Update the classifier if needed to handle new feature types
3. Test thoroughly with various prompt types
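
As a hypothetical illustration of the first step, a new control in `app.py` could be as small as this (the widget label and variable names are invented for the example):

```python
import streamlit as st

# Hypothetical new UI element: let the user choose how many of the most
# important features to display next to the risk assessment.
top_k = st.sidebar.slider("Top features to display", min_value=1, max_value=20, value=5)
st.sidebar.caption(f"Showing the {top_k} most important features.")
```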

Acknowledgments:

- Goodfire API for LLM access