
# Benchmarked: A Multi-Seed, Multi-Condition Study of Prompt-Based Emotion Classification with Efficient Transformers

arXiv Open In Colab Python PyTorch Transformers License


## Overview

This repository contains the complete, reproducible experimental pipeline for our paper on prompt-based emotion classification using efficient transformer encoders. We present a controlled, multi-condition study covering five encoder architectures, classical baselines, three prompt template formulations, multi-seed robustness analysis, minority-class rebalancing, parameter-efficient prefix-tuning, few-shot evaluation, and zero-shot cross-domain generalisation — all within a single self-contained notebook.

Core finding: Multi-seed evaluation reverses single-run model rankings. ELECTRA-base wins at seed 42 (93.30% accuracy) but achieves the lowest mean across three seeds (0.9267 ± 0.0027), while DistilBERT achieves the highest multi-seed mean (0.9290 ± 0.0040). Prompt wording is statistically neutral across all tested conditions.


## Results at a Glance

### Main Benchmark — SetFit/emotion (seed 42, Template A)

| Model | Acc | Macro-F1 | W-F1 | Throughput (sps) |
|---|---|---|---|---|
| ELECTRA-base | 0.9330 | 0.8930 | 0.9334 | 1,015 |
| RoBERTa-base | 0.9305 | 0.8798 | 0.9296 | 987 |
| DistilBERT | 0.9270 | 0.8831 | 0.9275 | 1,954 |
| ALBERT-base-v2 | 0.9260 | 0.8853 | 0.9262 | 597 |
| DistilRoBERTa | 0.9260 | 0.8829 | 0.9266 | 2,068 |
| TF-IDF + LinearSVM | 0.8795 | 0.8173 | 0.8775 | 1,970,808 |
| TF-IDF + LR | 0.8215 | 0.7070 | 0.8081 | 261,837 |

### Multi-Seed Robustness (Seeds: 42, 2024, 7)

| Model | Mean Acc ± Std | Mean Macro-F1 ± Std |
|---|---|---|
| DistilBERT+Prompt | 0.9290 ± 0.0040 | 0.8864 ± 0.0107 |
| DistilRoBERTa+Prompt | 0.9288 ± 0.0030 | 0.8869 ± 0.0034 |
| RoBERTa+Prompt | 0.9277 ± 0.0027 | 0.8867 ± 0.0050 |
| ELECTRA+Prompt | 0.9267 ± 0.0027 | 0.8820 ± 0.0017 |

ELECTRA's seed-42 win is within the range of random seed variation. Single-run model selection is unreliable at this performance tier.
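The mean ± std aggregation used in the table above can be reproduced from per-seed metrics in a few lines. The helper name and the sample values below are illustrative (the paper may use sample rather than population standard deviation):

```python
import statistics

def aggregate_seeds(per_seed_metrics: dict[int, float]) -> tuple[float, float]:
    """Aggregate one metric across seeds into (mean, std).

    Uses the population standard deviation (pstdev); an assumption —
    the notebook may use the sample std instead.
    """
    values = list(per_seed_metrics.values())
    return statistics.fmean(values), statistics.pstdev(values)

# Hypothetical per-seed accuracies for one model (seeds match the study)
accs = {42: 0.9330, 2024: 0.9262, 7: 0.9278}
mean, std = aggregate_seeds(accs)
print(f"{mean:.4f} ± {std:.4f}")
```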

### Prompt Ablation — McNemar's Test

| Model | Δ Acc | p-value |
|---|---|---|
| DistilBERT | +0.0020 | 0.752 (ns) |
| DistilRoBERTa | +0.0035 | 0.418 (ns) |
| ELECTRA-base | +0.0115 | 0.026 (*) |

Prompt wording is neutral for DistilBERT and DistilRoBERTa. ELECTRA shows marginal significance that does not replicate across seeds (bootstrap CIs all span zero).
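For reference, a continuity-corrected McNemar test needs only the two discordant counts from the paired predictions. This is a stdlib sketch of the standard test, not the notebook's own implementation (which lives in Cell 4):

```python
import math

def mcnemar_test(b: int, c: int) -> tuple[float, float]:
    """Continuity-corrected McNemar test on discordant pairs.

    b = examples condition A classifies correctly and B incorrectly,
    c = the reverse. Returns (chi-square statistic, two-sided p-value)
    under a chi-square distribution with 1 degree of freedom.
    """
    if b + c == 0:
        return 0.0, 1.0  # no disagreements: conditions are identical
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of chi-square with 1 df: P(X > x) = erfc(sqrt(x/2))
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

chi2, p = mcnemar_test(10, 2)
print(f"chi2 = {chi2:.3f}, p = {p:.3f}")
```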

### Few-Shot Sample Efficiency

| Training Set | N | Acc | Macro-F1 |
|---|---|---|---|
| 16 shots/class | 96 | 0.3165 | 0.2501 |
| 64 shots/class | 384 | 0.6740 | 0.6223 |
| 256 shots/class | 1,536 | 0.8590 | 0.8237 |
| Full fine-tune | 16,000 | 0.9260 | 0.8829 |

At 256 shots per class (9.6% of training data), DistilRoBERTa achieves 92.9% of full fine-tune accuracy.
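The few-shot subsets are drawn per class (16 × 6 = 96, and so on). A hypothetical sampler in this spirit — the notebook's own few-shot cell may differ — could look like:

```python
import random
from collections import defaultdict

def sample_k_per_class(texts, labels, k, seed=42):
    """Draw k training examples per class with a fixed seed
    (illustrative helper, not the notebook's exact code)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for text, label in zip(texts, labels):
        by_class[label].append(text)
    subset = []
    for label in sorted(by_class):
        for text in rng.sample(by_class[label], k):
            subset.append((text, label))
    return subset

# e.g. 16 shots/class over 6 emotion classes -> 96 training examples
```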


## Repository Structure

emotion-classification/
│
├── notebooks/
│   └── Emotion_Experiment.ipynb      # Complete self-contained pipeline
│
├── results/
│   └── sample/                       # Sample JSON outputs for reference
│       ├── classical_metrics.json
│       ├── mcnemar_tests.json
│       ├── multiseed_aggregated.json
│       ├── prompt_variant_results.json
│       ├── fewshot_results.json
│       ├── weighted_loss_results.json
│       └── oversample_results.json
│
├── figures/                          # Publication figures (generated by notebook)
│   └── README.md
│
├── scripts/
│   ├── verify_environment.py         # Pre-run environment check
│   └── summarise_results.py          # Load Drive results → terminal table
│
├── .github/
│   └── ISSUE_TEMPLATE/
│       ├── bug_report.md
│       └── question.md
│
├── requirements.txt
├── CITATION.cff
├── LICENSE
└── README.md

## Quickstart

### Option 1 — Google Colab (Recommended)

Click the badge above or open directly:

https://colab.research.google.com/github/YOUR_USERNAME/emotion-classification/blob/main/notebooks/Emotion_Experiment.ipynb
  1. Set runtime to GPU (Runtime → Change runtime type → T4 GPU)
  2. Run Cell 1 (mounts Drive, installs packages)
  3. Restart runtime when prompted
  4. Run cells sequentially top to bottom

All results save automatically to MyDrive/emotion_experiment/. If the session crashes, re-running any cell reloads from Drive and skips completed experiments.

### Option 2 — Local / HPC

git clone https://github.com/YOUR_USERNAME/emotion-classification.git
cd emotion-classification
pip install -r requirements.txt
python scripts/verify_environment.py   # confirm GPU and package versions
jupyter notebook notebooks/Emotion_Experiment.ipynb

Note: The notebook uses `from google.colab import drive` in Cell 1. On a local machine, skip that cell and manually set `DRIVE_ROOT = Path("./outputs")` in Cell 2 before running.
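One way to make the same cell work in both environments is to fall back on `ImportError`. This is a sketch; the `DRIVE_ROOT` name comes from the note above, and the Colab mount path is an assumption based on the `MyDrive/emotion_experiment/` output location:

```python
from pathlib import Path

try:
    # Running in Colab: mount Drive and persist results there
    from google.colab import drive
    drive.mount("/content/drive")
    DRIVE_ROOT = Path("/content/drive/MyDrive/emotion_experiment")
except ImportError:
    # Local / HPC: keep everything in a local outputs directory
    DRIVE_ROOT = Path("./outputs")

DRIVE_ROOT.mkdir(parents=True, exist_ok=True)
```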


## Experiment Pipeline

The notebook executes the following phases in order:

| Cell | Phase | Description | ~Time (T4) |
|---|---|---|---|
| 1 | Setup | Mount Drive, install packages, restart | 2 min |
| 2 | Imports & Config | Global seed, paths, hyperparameters | <1 min |
| 3 | Dataset | Load SetFit/emotion; verify splits | 1 min |
| 4 | Helpers | Metrics, McNemar, throughput functions | <1 min |
| 5 | Classical Baselines | TF-IDF + LR and LinearSVM | 3 min |
| 6 | Transformer Function | `run_transformer()` definition | <1 min |
| 7 | Main Benchmark | 8 runs (5 prompt + 3 no-prompt) | 4–5 hr |
| 7B | Prompt Variants | 3 templates × DistilRoBERTa | ~45 min |
| 7C | Multi-Seed | 5 models × 3 seeds | ~3 hr |
| 7D | Weighted Loss | ELECTRA + DistilRoBERTa | ~1.5 hr |
| 7E | Soft Prompt | Prefix-tuning on DistilRoBERTa | ~30 min |
| 7F | Few-Shot | k ∈ {16, 64, 256} shots/class | ~45 min |
| 7G | Oversampling | Balanced training set (32,172 samples) | ~1.5 hr |
| 8 | McNemar Tests | Significance testing across all ablations | 5 min |
| 9 | Summary Table | Aggregate all results into CSV | 2 min |
| 10 | Figures | 6 publication-quality figures (PDF + PNG) | 5 min |
| 11 | LaTeX Tables | Ready-to-paste table snippets | 2 min |
| 12 | Reproducibility | Verify all experimental safeguards | 1 min |
| 13–16 | Cross-Domain | GoEmotions + MELD zero-shot evaluation | ~30 min |

## Hyperparameters

All transformer models use identical hyperparameters for fair comparison:

| Parameter | Value |
|---|---|
| Optimiser | AdamW |
| Learning rate | 2 × 10⁻⁵ |
| Weight decay | 0.01 |
| Train batch size | 32 |
| Eval batch size | 64 |
| Max epochs | 5 |
| Early stopping patience | 2 (val loss) |
| Max sequence length | 128 |
| Precision | FP16 (GPU) |
| Primary seed | 42 |
| Multi-seed runs | 42, 2024, 7 |

Classical baselines: TF-IDF (50K features, unigrams+bigrams, sublinear TF) + LogisticRegression / LinearSVC (both C=1).
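The classical baselines can be sketched in a few lines of scikit-learn. The settings mirror the description above (50K features, unigrams+bigrams, sublinear TF, C=1); the variable names are illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# The pipeline fits TF-IDF on the training split only and applies
# .transform() to any later data, matching the reproducibility notes.
svm_baseline = make_pipeline(
    TfidfVectorizer(max_features=50_000, ngram_range=(1, 2), sublinear_tf=True),
    LinearSVC(C=1.0),
)

# Usage (train_texts/train_labels/test_texts are placeholders):
# svm_baseline.fit(train_texts, train_labels)
# preds = svm_baseline.predict(test_texts)
```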


## Dataset

SetFit/emotion — HuggingFace Hub (SetFit/emotion)

| Split | Samples |
|---|---|
| Train | 16,000 |
| Validation | 2,000 |
| Test | 2,000 |

Six emotion classes: joy (33.5%), sadness (29.2%), anger (13.5%), fear (12.1%), love (8.2%), surprise (3.6%).
Majority-to-minority ratio: 9.3× (joy vs. surprise).

Official pre-defined splits are used as-is. No re-splitting or stratification is performed. TF-IDF is fitted on training split only.
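The quoted 9.3× majority-to-minority ratio follows directly from the class shares above:

```python
# Train-split class shares (%) as stated in the dataset description
dist = {"joy": 33.5, "sadness": 29.2, "anger": 13.5,
        "fear": 12.1, "love": 8.2, "surprise": 3.6}

majority = max(dist, key=dist.get)
minority = min(dist, key=dist.get)
ratio = dist[majority] / dist[minority]
print(f"{majority} vs. {minority}: {ratio:.1f}x imbalance")  # joy vs. surprise: 9.3x
```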


## Models

| Model | HuggingFace ID | Parameters | Pretraining |
|---|---|---|---|
| DistilBERT | distilbert-base-uncased | 66M | MLM (distilled from BERT) |
| DistilRoBERTa | distilroberta-base | 82M | MLM (distilled from RoBERTa) |
| RoBERTa-base | roberta-base | 125M | MLM (robust pretraining) |
| ALBERT-base-v2 | albert-base-v2 | 12M* | MLM + SOP |
| ELECTRA-base | google/electra-base-discriminator | 110M | Replaced-token detection |

*ALBERT has 12M unique parameters; cross-layer weight sharing means the same parameters are reused across all 12 transformer layers of the forward pass.


## Reproducibility

All experiments are deterministic under a fixed seed:

  • Python random, NumPy, PyTorch, and HuggingFace seeds set globally
  • Official HuggingFace dataset splits used without modification
  • TF-IDF fitted on training split only (.transform() for val/test)
  • Test set evaluated exactly once per model, after all training decisions
  • Throughput measured post-training with torch.no_grad()
  • save_only_model=True — optimizer state not saved (saves ~500 MB per run)
  • All result files persisted to Google Drive as JSON after every model

The notebook includes a Reproducibility Checklist (Cell 12) and a Drive file map that verifies every expected output file is present before figures and tables are generated.
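The global seeding in the first bullet typically follows a pattern like the one below. This is a sketch, not the notebook's exact cell; the torch calls are guarded so the snippet also runs in CPU-only environments:

```python
import os
import random

import numpy as np

def set_seed(seed: int = 42) -> None:
    """Seed Python, NumPy, and (if available) PyTorch RNGs globally."""
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass  # the classical baselines only need the Python/NumPy seeds

set_seed(42)
```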


## Computing Environment

| Resource | Specification |
|---|---|
| GPU | NVIDIA T4 (15 GB VRAM), Google Colab |
| Training precision | FP16 mixed precision |
| CPU (classical baselines) | Intel Xeon, Google Colab |
| Disk (Drive) | ~4 GB for all checkpoints + results |
| Total wall time | ~12–14 hours (full pipeline, no skips) |

Session crashes are handled gracefully: every experiment checks Drive for existing results before training.
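The check-before-train pattern can be sketched as a small caching helper. The function name and signature are ours; the notebook inlines this logic in each experiment cell:

```python
import json
from pathlib import Path

def load_or_run(result_path: Path, experiment_fn):
    """Return cached JSON results if they exist on Drive;
    otherwise run the experiment and persist its results."""
    if result_path.exists():
        return json.loads(result_path.read_text())
    result = experiment_fn()
    result_path.parent.mkdir(parents=True, exist_ok=True)
    result_path.write_text(json.dumps(result))
    return result
```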


## Citation

If you use this code or results, please cite:

@inproceedings{gayen2025prompt,
  title     = {Benchmarked: A Multi-Seed, Multi-Condition Study of Prompt-Based Emotion Classification with Efficient Transformers},
  author    = {Gayen, Avijit and Naskar, Sayak and Mishra, Shilpi and Jana, Angshuman},
  booktitle = {[Venue]},
  year      = {2025},
  url       = {https://arxiv.org/abs/XXXX.XXXXX}
}

## License

This project is licensed under the MIT License.
Model weights are subject to their respective HuggingFace model card licenses.
The SetFit/emotion dataset is subject to its original data license.


## Acknowledgements

The authors thank Google for Colab GPU compute resources and the open-source communities behind HuggingFace Transformers, Datasets, and scikit-learn.