A research system for distinguishing human-written from AI-generated text using frozen BERT embeddings, with comprehensive neuron-level interpretability analysis.
Two-part research pipeline:
- Classification: Frozen BERT → lightweight sklearn classifiers (>95% accuracy)
- Interpretability: Statistical analysis of 9,216 neurons across all 12 BERT layers
Key Features:
- ✅ Config-driven experiments with automatic caching
- ✅ Comprehensive neuron analysis (Mann-Whitney U + AUC + Cohen's d)
- ✅ Hierarchical clustering of discriminative neurons
- ✅ Publication-ready visualizations and statistical validation
- ✅ Wandb integration for experiment tracking
Main Discovery: 1,350 discriminative neurons (14.6%) identified across BERT layers
Key Results:
- Late layers (9-12) contain 44.7% of discriminative neurons → semantic processing dominates
- Balanced detection: 688 AI-preferring vs 662 human-preferring neurons (1.04:1 ratio)
- Strong effects: 84.3% of discriminative neurons have |Cohen's d| > 0.8
- Three functional clusters: Human specialists (distributed), early AI detectors, late AI detectors
Statistical Validation:
- All 1,350 discriminative neurons survive Bonferroni correction (α = 0.001, adjusted α' = 0.001 / 9,216 ≈ 1.09×10⁻⁷)
- Median p-value: 8.68×10⁻³⁴
- Low redundancy: mean correlation = 0.146
Human-vs-AI-text-Text-Classification-Analysis-/
│
├── configs/                       # YAML-based experiment configs
│   ├── experiments/               # Complete pipeline configs
│   ├── tokenizers/                # BERT tokenizer settings
│   ├── encoders/                  # Frozen BERT encoder settings
│   └── classifiers/               # sklearn classifier configs (SGD, LogReg, etc.)
│
├── data/
│   ├── raw/AI_Human.csv           # Kaggle dataset
│   └── processed/                 # Auto-cached tokenized/encoded data
│
├── models/                        # Trained classifiers (.pkl files)
│
├── scripts/
│   ├── run_training.py            # Main training pipeline
│   ├── train_all_classifiers.py   # Batch training script
│   ├── extract_activations.py     # Extract neuron activations (all 12 layers)
│   ├── analyze_activations.py     # Statistical analysis (deprecated, use notebooks)
│   ├── tokenize_dataset.py        # Standalone tokenization
│   ├── encode_dataset.py          # Standalone encoding
│   └── train_classifier.py        # Standalone training
│
├── utils/                         # Core utilities
│   ├── dataset_tokenizer.py       # Tokenization logic
│   ├── dataset_encoder.py         # Encoding logic
│   ├── classifier_trainer.py      # sklearn training wrapper
│   ├── activation_extractor.py    # Layer activation extraction
│   └── training_pipeline.py       # End-to-end pipeline orchestration
│
├── notebooks/                     # Analysis & visualization
│   ├── neurons_analysis.ipynb     # Statistical analysis of discriminative neurons
│   ├── neuron_clustering.ipynb    # Hierarchical clustering & UMAP visualization
│   └── dataset_analysis.ipynb     # Dataset exploration
│
├── results/
│   ├── activations/               # Neuron data (all 12 layers, 3000 samples)
│   │   ├── layer_{1-12}_activations.npy   # Raw activation matrices
│   │   ├── layer_{1-12}_neuron_stats.csv  # Per-neuron statistics
│   │   └── metadata.json                  # Analysis metadata
│   └── figures/                   # Publication-ready plots
│
└── requirements.txt
pip install -r requirements.txt
# Download dataset from Kaggle and place at: data/raw/AI_Human.csv
# https://www.kaggle.com/datasets/shanegerami/ai-vs-human-text

# Run complete pipeline (auto-caches tokenization/encoding)
python scripts/run_training.py sgd
# With wandb tracking (optional)
python scripts/run_training.py sgd --wandb-project my-project

What happens:
1. Tokenize → data/processed/AI_Human/tokenized/bert-base-uncased/
2. Encode (frozen BERT) → data/processed/AI_Human/encoded/bert_bert/
3. Train SGD classifier → logs metrics

Subsequent runs: only step 3 re-runs (steps 1-2 are cached).
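For intuition on what step 3 does, here is a minimal sketch of training a lightweight classifier on the cached frozen-BERT embeddings. The .npy file names are hypothetical; the real pipeline is orchestrated by utils/training_pipeline.py and the YAML configs.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical file names for the cached features/labels
# (the real cache lives under data/processed/AI_Human/encoded/bert_bert/).
X = np.load("data/processed/AI_Human/encoded/bert_bert/embeddings.npy")  # (n_samples, 768)
y = np.load("data/processed/AI_Human/encoded/bert_bert/labels.npy")      # 0 = Human, 1 = AI

# Stratified 80/20 split, matching the dataset section below
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

clf = SGDClassifier(loss="log_loss", random_state=42)
clf.fit(X_train, y_train)
print(f"Test accuracy: {accuracy_score(y_test, clf.predict(X_test)):.3f}")
```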
Experiments use YAML files in configs/experiments/. See existing configs and configs/README.md for examples.
- Source: [Kaggle - AI vs Human Text](https://www.kaggle.com/datasets/shanegerami/ai-vs-human-text)
- Task: Binary classification (0 = Human, 1 = AI-generated)
- Split: 80/20 train/test (stratified)
- Preprocessing: BERT tokenization, max 512 tokens, CLS token pooling
- Analysis subset: 3,000 samples (1,500 AI + 1,500 Human) for neuron analysis
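The preprocessing above amounts to [CLS]-token pooling over a frozen BERT encoder. A minimal sketch with HuggingFace Transformers (the texts are placeholders, not the actual encoding script):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()  # frozen encoder: no fine-tuning, no gradient updates

texts = ["An example human-written essay ...", "An example AI-generated essay ..."]
batch = tokenizer(texts, truncation=True, max_length=512,
                  padding=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**batch, output_hidden_states=True)

# [CLS] pooling: first token of the last hidden state, one 768-d vector per text
cls_features = outputs.last_hidden_state[:, 0, :]            # shape (len(texts), 768)

# hidden_states[1:] are the 12 layer outputs; their [CLS] activations feed the neuron analysis
layer_cls = [h[:, 0, :] for h in outputs.hidden_states[1:]]  # 12 tensors of shape (len(texts), 768)
```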
Core: PyTorch, HuggingFace Transformers, scikit-learn
Analysis: NumPy, Pandas, SciPy, UMAP, matplotlib, seaborn
Tracking: Wandb (optional)
Comprehensive interpretability study of all 9,216 neurons across BERT's 12 layers.
# 1. Train classifier (creates tokenized dataset)
python scripts/run_training.py sgd
# 2. Extract activations from all 12 layers (~10 min, 3000 samples)
python scripts/extract_activations.py --layers 1 2 3 4 5 6 7 8 9 10 11 12 --samples 3000
# 3. Run analysis notebooks (recommended over deprecated analyze_activations.py)
# Open: notebooks/neurons_analysis.ipynb
# Open: notebooks/neuron_clustering.ipynb

Per-neuron analysis (9,216 neurons = 12 layers × 768 neurons):
- Mann-Whitney U Test: tests whether the AI and Human activation distributions differ
  - Significance level: α = 0.001
  - Bonferroni correction: α' = 0.001 / 9,216 ≈ 1.09×10⁻⁷
- Effect Sizes:
  - AUC (Area Under ROC): discriminative power (0.5 = random, 0 or 1 = perfect)
  - Cohen's d: magnitude of difference (|d| > 0.8 = large effect)
- Discriminative Neuron Criteria:
  - p < α' (survives Bonferroni correction)
  - AUC > 0.7 (AI-preferring) OR AUC < 0.3 (Human-preferring)
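A minimal sketch of these per-neuron statistics for a single layer, assuming the activation matrices and labels saved by extract_activations.py (file names as listed under Output Files below); the actual analysis lives in notebooks/neurons_analysis.ipynb:

```python
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.metrics import roc_auc_score

acts = np.load("results/activations/layer_12_activations.npy")   # (3000, 768)
labels = np.load("results/activations/labels.npy")               # 0 = Human, 1 = AI

alpha_corrected = 0.001 / (12 * 768)   # Bonferroni over 9,216 tests ≈ 1.09e-7
ai, human = acts[labels == 1], acts[labels == 0]

n_discriminative = 0
for j in range(acts.shape[1]):
    _, p = mannwhitneyu(ai[:, j], human[:, j], alternative="two-sided")
    auc = roc_auc_score(labels, acts[:, j])   # > 0.5 means higher activation on AI text
    pooled_sd = np.sqrt((ai[:, j].var(ddof=1) + human[:, j].var(ddof=1)) / 2)
    cohens_d = (ai[:, j].mean() - human[:, j].mean()) / pooled_sd  # reported in layer_*_neuron_stats.csv
    if p < alpha_corrected and (auc > 0.7 or auc < 0.3):
        n_discriminative += 1

print(f"Discriminative neurons in layer 12: {n_discriminative}/768")
```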
Outputs (neurons_analysis.ipynb):
- Table 1: Summary statistics (1,350 discriminative neurons)
- Figure 1: Layer-wise distribution (percentage & AI:Human ratio)
- Figure 2: Top-10 neuron activation boxplots
- Figure 3: Mean activation scatter (AI vs Human)
- Statistical validation table (effect sizes, p-values, correlations)
Key Findings:
- 14.6% of neurons are discriminative
- Late layers (9-12): 604 neurons (44.7% of discriminative)
- Peak: Layer 11 with 181 discriminative neurons
- Mean |Cohen's d| = 0.967 (very large effects)
Outputs (neuron_clustering.ipynb):
- UMAP visualization (layers & preference)
- Hierarchical clustering dendrogram
- Silhouette analysis (optimal K=3)
- Cluster characterization tables
Discovered Clusters:
- Cluster 0 (n=689, 51%): Human specialists, distributed across all layers
- Cluster 1 (n=212, 16%): Early AI detectors (90.6% from layers 1-4)
- Cluster 2 (n=449, 33%): Late AI detectors (56% from layers 10-12)
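A rough sketch of how such clusters can be recovered with hierarchical clustering plus silhouette analysis. The per-neuron feature matrix below is a random placeholder (the notebook's actual features, e.g. AUC, |Cohen's d| and layer index per discriminative neuron, and its linkage choice may differ):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Placeholder features: one row per discriminative neuron (1,350 rows assumed)
rng = np.random.default_rng(0)
neuron_features = rng.normal(size=(1350, 3))

features = StandardScaler().fit_transform(neuron_features)

Z = linkage(features, method="ward")   # agglomerative clustering (Ward linkage assumed)
for k in range(2, 6):
    cluster_ids = fcluster(Z, t=k, criterion="maxclust")
    print(f"K={k}: silhouette = {silhouette_score(features, cluster_ids):.3f}")
# In this study, silhouette analysis selected K = 3 (the three clusters above).
```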
Activation Data: results/activations/
- layer_{1-12}_activations.npy - Raw activation matrices (3,000 samples × 768 neurons)
- layer_{1-12}_neuron_stats.csv - Per-neuron statistics (AUC, p-value, Cohen's d, etc.)
- labels.npy - Sample labels (0 = Human, 1 = AI)
- metadata.json - Analysis configuration
Figures: results/figures/
- figure{1-3}_*.png - Statistical analysis plots
- neuron_umap_*.png - UMAP visualizations
- neuron_dendrogram_cut.png - Hierarchical clustering
- neuron_silhouette_scores.png - Cluster validation
import pandas as pd
import numpy as np
# Load neuron statistics for layer 12
df = pd.read_csv('results/activations/layer_12_neuron_stats.csv')
# Top discriminative neuron
top = df.nlargest(1, 'auc_deviation').iloc[0]
print(f"Layer 12, Neuron {top['neuron_idx']}: AUC={top['auc']:.3f}, Cohen's d={top['cohens_d']:.2f}")
# Count discriminative neurons
disc = df[df['discriminative']]
print(f"Discriminative: {len(disc)}/768 ({len(disc)/768*100:.1f}%)")
print(f"AI-preferring: {(disc['auc'] > 0.7).sum()}")
print(f"Human-preferring: {(disc['auc'] < 0.3).sum()}")- SGD Classifier: >95% accuracy on test set
- Multiple classifiers tested: Logistic Regression, Random Forest, Decision Tree, Linear SVC
- Key insight: Frozen BERT embeddings are highly discriminative
- 1,350 discriminative neurons identified (14.6% of 9,216 total)
- Balanced bidirectionality: 688 AI-preferring vs 662 Human-preferring (1.04:1)
- Layer distribution: U-shaped pattern (early and late layers dominate)
- Effect sizes: 84.3% have large effects (|Cohen's d| > 0.8)
- Functional organization: 3 distinct clusters with specialized roles
- HuggingFace Transformers
- Dataset: Kaggle AI vs Human Text
- UMAP: Uniform Manifold Approximation and Projection
- scikit-learn
MIT License - Free for research and education
Status: ✅ Research Complete | Last Updated: December 2024