Benchmarked: A Multi-Seed, Multi-Condition Study of Prompt-Based Emotion Classification with Efficient Transformers
This repository contains the complete, reproducible experimental pipeline for our paper on prompt-based emotion classification using efficient transformer encoders. We present a controlled, multi-condition study covering five encoder architectures, classical baselines, three prompt template formulations, multi-seed robustness analysis, minority-class rebalancing, parameter-efficient prefix-tuning, few-shot evaluation, and zero-shot cross-domain generalisation — all within a single self-contained notebook.
Core finding: Multi-seed evaluation reverses single-run model rankings. ELECTRA-base wins at seed 42 (93.30% accuracy) but achieves the lowest mean across three seeds (0.9267 ± 0.0027), while DistilBERT achieves the highest multi-seed mean (0.9290 ± 0.0040). Prompt wording is statistically neutral across all tested conditions.
| Model | Accuracy | Macro-F1 | Weighted F1 | Throughput (samples/s) |
|---|---|---|---|---|
| ELECTRA-base | 0.9330 | 0.8930 | 0.9334 | 1,015 |
| RoBERTa-base | 0.9305 | 0.8798 | 0.9296 | 987 |
| DistilBERT | 0.9270 | 0.8831 | 0.9275 | 1,954 |
| ALBERT-base-v2 | 0.9260 | 0.8853 | 0.9262 | 597 |
| DistilRoBERTa | 0.9260 | 0.8829 | 0.9266 | 2,068 |
| TF-IDF + LinearSVM | 0.8795 | 0.8173 | 0.8775 | 1,970,808 |
| TF-IDF + LR | 0.8215 | 0.7070 | 0.8081 | 261,837 |

| Model | Mean Accuracy ± Std | Mean Macro-F1 ± Std |
|---|---|---|
| DistilBERT+Prompt | 0.9290 ± 0.0040 | 0.8864 ± 0.0107 |
| DistilRoBERTa+Prompt | 0.9288 ± 0.0030 | 0.8869 ± 0.0034 |
| RoBERTa+Prompt | 0.9277 ± 0.0027 | 0.8867 ± 0.0050 |
| ELECTRA+Prompt | 0.9267 ± 0.0027 | 0.8820 ± 0.0017 |
ELECTRA's seed-42 win is within the range of random seed variation. Single-run model selection is unreliable at this performance tier.
| Model | Δ Acc (prompt vs. no-prompt) | p-value |
|---|---|---|
| DistilBERT | +0.0020 | 0.752 (ns) |
| DistilRoBERTa | +0.0035 | 0.418 (ns) |
| ELECTRA-base | +0.0115 | 0.026 (*) |
Prompt wording is neutral for DistilBERT and DistilRoBERTa. ELECTRA shows marginal significance that does not replicate across seeds (bootstrap CIs all span zero).
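The per-model p-values above come from McNemar's test on paired test-set predictions (Cell 8). A minimal exact version can be computed from just the two discordant counts; the function name and signature below are illustrative, not the notebook's actual helper:

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Exact two-sided McNemar p-value.

    b: examples model A classifies correctly but model B gets wrong.
    c: the reverse. Concordant pairs do not enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Under H0, discordant outcomes are Binomial(n, 0.5); double the
    # smaller tail for a two-sided test, clipped at 1.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

For example, 9 vs. 1 discordant pairs gives p ≈ 0.021, while well-matched models (b ≈ c) give p near 1.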
| Training Set | N | Acc | Macro-F1 |
|---|---|---|---|
| 16 shots/class | 96 | 0.3165 | 0.2501 |
| 64 shots/class | 384 | 0.6740 | 0.6223 |
| 256 shots/class | 1,536 | 0.8590 | 0.8237 |
| Full fine-tune | 16,000 | 0.9260 | 0.8829 |
At 256 shots per class (9.6% of training data), DistilRoBERTa achieves 92.9% of full fine-tune accuracy.
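The k-shot training subsets are stratified draws of k examples per class from the training split. A hedged sketch of that sampling step (the helper name is illustrative, not the notebook's):

```python
import random
from collections import defaultdict

def sample_k_shot(texts, labels, k, seed=42):
    """Draw k examples per class to build a few-shot training subset."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for text, label in zip(texts, labels):
        by_class[label].append(text)
    subset = []
    for label, examples in sorted(by_class.items()):
        for text in rng.sample(examples, k):  # sample without replacement
            subset.append((text, label))
    return subset
```

With the six emotion classes, k = 16 yields the 96-example subset in the table above (6 × 16).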
```
emotion-classification/
│
├── notebooks/
│   └── Emotion_Experiment.ipynb     # Complete self-contained pipeline
│
├── results/
│   └── sample/                      # Sample JSON outputs for reference
│       ├── classical_metrics.json
│       ├── mcnemar_tests.json
│       ├── multiseed_aggregated.json
│       ├── prompt_variant_results.json
│       ├── fewshot_results.json
│       ├── weighted_loss_results.json
│       └── oversample_results.json
│
├── figures/                         # Publication figures (generated by notebook)
│   └── README.md
│
├── scripts/
│   ├── verify_environment.py        # Pre-run environment check
│   └── summarise_results.py         # Load Drive results → terminal table
│
├── .github/
│   └── ISSUE_TEMPLATE/
│       ├── bug_report.md
│       └── question.md
│
├── requirements.txt
├── CITATION.cff
├── LICENSE
└── README.md
```
Click the badge above or open directly:
https://colab.research.google.com/github/YOUR_USERNAME/emotion-classification/blob/main/notebooks/Emotion_Experiment.ipynb
- Set runtime to GPU (Runtime → Change runtime type → T4 GPU)
- Run Cell 1 (mounts Drive, installs packages)
- Restart runtime when prompted
- Run cells sequentially top to bottom
All results save automatically to `MyDrive/emotion_experiment/`. If the session crashes, re-running any cell reloads from Drive and skips completed experiments.
```bash
git clone https://github.com/YOUR_USERNAME/emotion-classification.git
cd emotion-classification
pip install -r requirements.txt
python scripts/verify_environment.py   # confirm GPU and package versions
jupyter notebook notebooks/Emotion_Experiment.ipynb
```

Note: The notebook runs `from google.colab import drive` in Cell 1. On a local machine, skip that cell and manually set `DRIVE_ROOT = Path("./outputs")` in Cell 2 before running.
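One way to make the same notebook run in both environments is to fall back automatically when the Colab module is absent. A sketch consistent with the note above (the notebook's actual Cell 1/2 logic may differ):

```python
from pathlib import Path

try:
    # Running in Colab: mount Drive and persist results there.
    from google.colab import drive  # type: ignore
    drive.mount("/content/drive")
    DRIVE_ROOT = Path("/content/drive/MyDrive/emotion_experiment")
except ImportError:
    # Local machine: fall back to a local output directory.
    DRIVE_ROOT = Path("./outputs")

DRIVE_ROOT.mkdir(parents=True, exist_ok=True)
```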
The notebook executes the following sequential phases:
| Cell | Phase | Description | ~Time (T4) |
|---|---|---|---|
| 1 | Setup | Mount Drive, install packages, restart | 2 min |
| 2 | Imports & Config | Global seed, paths, hyperparameters | <1 min |
| 3 | Dataset | Load SetFit/emotion; verify splits | 1 min |
| 4 | Helpers | Metrics, McNemar, throughput functions | <1 min |
| 5 | Classical Baselines | TF-IDF + LR and LinearSVM | 3 min |
| 6 | Transformer Function | `run_transformer()` definition | <1 min |
| 7 | Main Benchmark | 8 runs (5 prompt + 3 no-prompt) | 4–5 hr |
| 7B | Prompt Variants | 3 templates × DistilRoBERTa | ~45 min |
| 7C | Multi-Seed | 5 models × 3 seeds | ~3 hr |
| 7D | Weighted Loss | ELECTRA + DistilRoBERTa | ~1.5 hr |
| 7E | Soft Prompt | Prefix-tuning on DistilRoBERTa | ~30 min |
| 7F | Few-Shot | k ∈ {16, 64, 256} shots/class | ~45 min |
| 7G | Oversampling | Balanced training set (32,172 samples) | ~1.5 hr |
| 8 | McNemar Tests | Significance testing across all ablations | 5 min |
| 9 | Summary Table | Aggregate all results into CSV | 2 min |
| 10 | Figures | 6 publication-quality figures (PDF + PNG) | 5 min |
| 11 | LaTeX Tables | Ready-to-paste table snippets | 2 min |
| 12 | Reproducibility | Verify all experimental safeguards | 1 min |
| 13–16 | Cross-Domain | GoEmotions + MELD zero-shot evaluation | ~30 min |
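The prompt-based conditions (Cells 7 and 7B) wrap each input in a natural-language template before tokenisation. A minimal sketch of that step; the template strings below are illustrative placeholders, not the paper's actual three formulations:

```python
# Hypothetical template variants; the real templates live in the notebook.
TEMPLATES = {
    "none": "{text}",
    "question": "What emotion does this express? {text}",
    "cloze": "{text} The emotion here is",
}

def apply_template(text: str, variant: str = "question") -> str:
    """Wrap a raw input in the chosen prompt template."""
    return TEMPLATES[variant].format(text=text)
```

The wrapped string is then tokenised and fine-tuned exactly like the no-prompt condition, so template wording is the only varying factor.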
All transformer models use identical hyperparameters for fair comparison:
| Parameter | Value |
|---|---|
| Optimiser | AdamW |
| Learning rate | 2 × 10⁻⁵ |
| Weight decay | 0.01 |
| Train batch size | 32 |
| Eval batch size | 64 |
| Max epochs | 5 |
| Early stopping patience | 2 (val loss) |
| Max sequence length | 128 |
| Precision | FP16 (GPU) |
| Primary seed | 42 |
| Multi-seed runs | 42, 2024, 7 |
Classical baselines: TF-IDF (50K features, unigrams+bigrams, sublinear TF) + LogisticRegression / LinearSVC (both C=1).
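Under those settings, the stronger classical baseline can be sketched as a scikit-learn pipeline (a sketch of the stated configuration, not the notebook's exact code):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# TF-IDF settings from above: 50K features, unigrams+bigrams, sublinear TF.
svm_baseline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=50_000,
                              ngram_range=(1, 2),
                              sublinear_tf=True)),
    ("clf", LinearSVC(C=1.0)),
])
```

Because the vectoriser sits inside the pipeline, calling `fit` on the training split and `predict` on val/test automatically keeps TF-IDF fitted on training data only.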
Dataset: `SetFit/emotion`, hosted on the HuggingFace Hub.
| Split | Samples |
|---|---|
| Train | 16,000 |
| Validation | 2,000 |
| Test | 2,000 |
Six emotion classes: joy (33.5%), sadness (29.2%), anger (13.5%), fear (12.1%), love (8.2%), surprise (3.6%).
Majority-to-minority ratio: 9.3× (joy vs. surprise).
Official pre-defined splits are used as-is. No re-splitting or stratification is performed. TF-IDF is fitted on training split only.
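The weighted-loss condition (Cell 7D) counteracts the 9.3× class imbalance by up-weighting minority classes. A common scheme is inverse-frequency weighting, sketched below (the notebook's exact weighting may differ):

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Per-class weights proportional to inverse class frequency,
    normalised so a perfectly balanced dataset gives weight 1.0 everywhere."""
    counts = Counter(labels)
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * count)
            for label, count in sorted(counts.items())}
```

With the split above, surprise (3.6%) would receive roughly 9× the weight of joy (33.5%) in the cross-entropy loss.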
| Model | HuggingFace ID | Parameters | Pretraining |
|---|---|---|---|
| DistilBERT | `distilbert-base-uncased` | 66M | MLM (distilled from BERT) |
| DistilRoBERTa | `distilroberta-base` | 82M | MLM (distilled from RoBERTa) |
| RoBERTa-base | `roberta-base` | 125M | MLM (robust pretraining) |
| ALBERT-base-v2 | `albert-base-v2` | 12M* | MLM + SOP |
| ELECTRA-base | `google/electra-base-discriminator` | 110M | Replaced-token detection |
*ALBERT has 12M unique parameters; cross-layer weight sharing means the full forward pass executes 12 layers.
All experiments are deterministic under a fixed seed:

- Python `random`, NumPy, PyTorch, and HuggingFace seeds set globally
- Official HuggingFace dataset splits used without modification
- TF-IDF fitted on the training split only (`.transform()` for val/test)
- Test set evaluated exactly once per model, after all training decisions
- Throughput measured post-training under `torch.no_grad()`
- `save_only_model=True`: optimizer state not saved (saves ~500 MB per run)
- All result files persisted to Google Drive as JSON after every model
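The global seeding step above can be sketched as follows; this version covers only the stdlib and NumPy generators, while the notebook additionally seeds PyTorch (`torch.manual_seed`) and HuggingFace (`transformers.set_seed`):

```python
import os
import random

import numpy as np

def set_global_seed(seed: int = 42) -> None:
    """Seed Python's stdlib and NumPy RNGs for reproducible runs."""
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
```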
The notebook includes a Reproducibility Checklist (Cell 12) and a Drive file map that verifies every expected output file is present before figures and tables are generated.
| Resource | Specification |
|---|---|
| GPU | NVIDIA T4 (15 GB VRAM), Google Colab |
| Training precision | FP16 mixed precision |
| CPU (classical baselines) | Intel Xeon, Google Colab |
| Disk (Drive) | ~4 GB for all checkpoints + results |
| Total wall time | ~12–14 hours (full pipeline, no skips) |
Session crashes are handled gracefully: every experiment checks Drive for existing results before training.
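That crash-recovery pattern reduces to a small cache-or-compute helper around each experiment's JSON results; the name and signature below are illustrative, not the notebook's actual code:

```python
import json
from pathlib import Path

def load_or_run(result_path: Path, run_fn):
    """Return cached results if present; otherwise run, persist, and return."""
    if result_path.exists():
        return json.loads(result_path.read_text())
    result = run_fn()                        # e.g. train + evaluate one model
    result_path.parent.mkdir(parents=True, exist_ok=True)
    result_path.write_text(json.dumps(result))
    return result
```

On a fresh session, the first call trains and saves; every later call (or re-run after a crash) loads from disk and skips training entirely.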
If you use this code or results, please cite:

```bibtex
@inproceedings{gayen2025prompt,
  title     = {Benchmarked: A Multi-Seed, Multi-Condition Study of Prompt-Based Emotion Classification with Efficient Transformers},
  author    = {Gayen, Avijit and Naskar, Sayak and Mishra, Shilpi and Jana, Angshuman},
  booktitle = {[Venue]},
  year      = {2025},
  url       = {https://arxiv.org/abs/XXXX.XXXXX}
}
```

This project is licensed under the MIT License.
Model weights are subject to their respective HuggingFace model card licenses.
The SetFit/emotion dataset is subject to its original data license.
The authors thank Google for Colab GPU compute resources and the open-source communities behind HuggingFace Transformers, Datasets, and scikit-learn.