A dashboard for analyzing and testing prompts for potential red team attacks using mechanistic interpretability and machine learning.

This tool provides a streamlined interface for evaluating the risk level of potential red team prompts. It uses a combination of:
- The Goodfire API to analyze prompt responses
- A custom binary classifier trained on mechanistic interpretability data
- A visual dashboard for real-time prompt testing and risk assessment

Key features:

- Prompt Testing: Enter a prompt and get an immediate risk analysis
- Attack Probability: A quantitative estimate of how likely the attack is to succeed
- Risk Level Categorization: Low, Medium, or High risk classification
- Machine Learning Backend: Trained on mechanistic interpretability data to identify patterns in successful attacks

Requirements:

- Python 3.8+
- PyTorch
- Streamlit
- Pandas, NumPy, Plotly
- Goodfire API access

Installation:

1. Clone the repository:

   ```bash
   git clone https://github.com/yourusername/mech-interp-red-teaming.git
   cd mech-interp-red-teaming
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Set up environment variables: create a `.env` file in the project root and add:

   ```
   GOODFIRE_API_KEY=your_api_key_here
   ```
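
If you need the key inside your own scripts, one common pattern is to load it with `python-dotenv` (an assumption about tooling; the app itself may read the variable differently):

```python
import os

from dotenv import load_dotenv  # pip install python-dotenv

# Reads GOODFIRE_API_KEY from the .env file in the project root
# into the process environment.
load_dotenv()
api_key = os.environ["GOODFIRE_API_KEY"]
```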

Usage:

1. Start the Streamlit app:

   ```bash
   streamlit run app.py
   ```

2. Open your browser and navigate to http://localhost:8501.

3. Enter a prompt in the text area and click "Analyze Prompt".

4. Review the risk assessment results, including:
   - Attack probability
   - Risk level
   - Feature importance (if available)

The application consists of three main components:

- Frontend: Streamlit-based dashboard (`app.py`)
- API Integration: Goodfire API client for LLM response generation
- Classifier: PyTorch neural network for risk prediction (`classifier.py`)

```mermaid
flowchart LR
    subgraph Frontend
        UI[Streamlit Dashboard]
        Viz[Data Visualization]
    end
    subgraph Backend
        API[API Integration]
        Classifier[ML Classifier]
        Features[Feature Extraction]
    end
    subgraph External
        GoodfireAPI[Goodfire API]
        Model[PyTorch Model]
    end

    UI --> API
    API --> GoodfireAPI
    GoodfireAPI --> Features
    Features --> Classifier
    Classifier --> Model
    Classifier --> Viz
    Viz --> UI

    classDef frontend fill:#f9f,stroke:#333,stroke-width:2px
    classDef backend fill:#bbf,stroke:#333,stroke-width:2px
    classDef external fill:#fbb,stroke:#333,stroke-width:2px
    class UI,Viz frontend
    class API,Classifier,Features backend
    class GoodfireAPI,Model external
```

The binary classifier is a three-layer MLP (multi-layer perceptron) trained on mechanistic interpretability data from successful and unsuccessful red team attacks. It processes features extracted from LLM responses to predict the probability that an attack succeeds; a rough sketch of the architecture follows the list below.

Key aspects:
- Input features are standardized before prediction
- The model uses dropout layers for regularization
- Training metrics include accuracy and loss curves
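
For orientation, a classifier matching this description could look like the sketch below; the input dimension, hidden sizes, and dropout rate are illustrative assumptions, not the actual contents of `classifier.py`:

```python
import torch
import torch.nn as nn

class BinaryClassifier(nn.Module):
    """Three-layer MLP with dropout, mirroring the description above.
    All layer sizes here are assumptions for illustration."""

    def __init__(self, input_dim: int, hidden_dim: int = 128, dropout: float = 0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),            # dropout regularization, as noted above
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim // 2, 1),  # single logit for binary prediction
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)
```

Because input features are standardized before prediction, the same scaler fitted at training time must be applied again at inference time.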

End-to-end data flow (a code sketch and a diagram follow):

1. The user submits a prompt
2. The prompt is sent to the Goodfire API
3. The API response is inspected for key features
4. The features are processed and passed to the classifier
5. The risk assessment is displayed to the user
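
The sketch below shows how these steps might fit together in code; the client call, feature extractor, and risk cutoffs are all invented for illustration and are not the project's actual API:

```python
import torch

def analyze_prompt(prompt: str, client, model, scaler, extract_features) -> dict:
    """Hypothetical pipeline matching the numbered steps above."""
    response = client.generate(prompt)         # step 2: query the Goodfire API (illustrative call)
    features = extract_features(response)      # step 3: inspect the response for key features
    x = torch.tensor(scaler.transform([features]), dtype=torch.float32)  # step 4: standardize
    with torch.no_grad():
        prob = torch.sigmoid(model(x)).item()  # attack success probability
    # Map the probability to a displayed risk level (cutoffs are assumptions).
    risk = "Low" if prob < 0.33 else "Medium" if prob < 0.66 else "High"
    return {"attack_probability": prob, "risk_level": risk}  # step 5: shown to the user
```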

```mermaid
flowchart TD
    A[User] -->|Enters Prompt| B[Streamlit UI]
    B -->|Submit| C[API Handler]
    C -->|Request| D[Goodfire API]
    D -->|LLM Response| E[Feature Extraction]
    E -->|Processed Features| F[PyTorch Classifier]
    F -->|Prediction| G[Risk Assessment]
    G -->|Results| B

    subgraph Dashboard
        B
        G
    end
    subgraph Backend Processing
        C
        E
        F
    end
    subgraph External Service
        D
    end

    classDef dashboard fill:#d0f0c0,stroke:#333,stroke-width:2px
    classDef backend fill:#c0e0f0,stroke:#333,stroke-width:2px
    classDef external fill:#f0d0c0,stroke:#333,stroke-width:2px
    class B,G dashboard
    class C,E,F backend
    class D external
```

To retrain the classifier on new data:

```bash
python classifier.py
```

This will save a new model to `binary_classifier.pt`.
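
To use the retrained weights for inference, something like the following should work, assuming the file stores a `state_dict` compatible with the model class (here reusing the sketch class from earlier):

```python
import torch

# Assumes binary_classifier.pt holds a state_dict matching the architecture;
# input_dim is an assumption and must match the saved model.
model = BinaryClassifier(input_dim=64)
model.load_state_dict(torch.load("binary_classifier.pt", map_location="cpu"))
model.eval()  # disables dropout for deterministic predictions
```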

To extend the dashboard with new features:

1. Modify `app.py` to include your new UI elements (a minimal example follows this list)
2. Update the classifier if needed to handle new feature types
3. Test thoroughly with various prompt types
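
As a hypothetical illustration of the first step, a new control in `app.py` could be as small as this (the widget label and variable names are invented for the example):

```python
import streamlit as st

# Hypothetical new UI element: let the user choose how many of the most
# important features to display next to the risk assessment.
top_k = st.sidebar.slider("Top features to display", min_value=1, max_value=20, value=5)
st.sidebar.caption(f"Showing the {top_k} most important features.")
```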

Acknowledgments:

- Goodfire API for LLM access