
# Benchmarked: A Multi-Seed, Multi-Condition Study of Prompt-Based Emotion Classification with Efficient Transformers

arXiv Open In Colab Python PyTorch Transformers License


## Overview

This repository contains the complete, reproducible experimental pipeline for our paper on prompt-based emotion classification using efficient transformer encoders. We present a controlled, multi-condition study covering five encoder architectures, classical baselines, three prompt template formulations, multi-seed robustness analysis, minority-class rebalancing, parameter-efficient prefix-tuning, few-shot evaluation, and zero-shot cross-domain generalisation — all within a single self-contained notebook.

Core finding: Multi-seed evaluation reverses single-run model rankings. ELECTRA-base wins at seed 42 (93.30% accuracy) but achieves the lowest mean across three seeds (0.9267 ± 0.0027), while DistilBERT achieves the highest multi-seed mean (0.9290 ± 0.0040). Prompt wording is statistically neutral across all tested conditions.


## Results at a Glance

### Main Benchmark — SetFit/emotion (seed 42, Template A)

| Model | Acc | Macro-F1 | W-F1 | Throughput (sps) |
|---|---|---|---|---|
| ELECTRA-base | 0.9330 | 0.8930 | 0.9334 | 1,015 |
| RoBERTa-base | 0.9305 | 0.8798 | 0.9296 | 987 |
| DistilBERT | 0.9270 | 0.8831 | 0.9275 | 1,954 |
| ALBERT-base-v2 | 0.9260 | 0.8853 | 0.9262 | 597 |
| DistilRoBERTa | 0.9260 | 0.8829 | 0.9266 | 2,068 |
| TF-IDF + LinearSVM | 0.8795 | 0.8173 | 0.8775 | 1,970,808 |
| TF-IDF + LR | 0.8215 | 0.7070 | 0.8081 | 261,837 |

### Multi-Seed Robustness (Seeds: 42, 2024, 7)

| Model | Mean Acc ± Std | Mean Macro-F1 ± Std |
|---|---|---|
| DistilBERT+Prompt | 0.9290 ± 0.0040 | 0.8864 ± 0.0107 |
| DistilRoBERTa+Prompt | 0.9288 ± 0.0030 | 0.8869 ± 0.0034 |
| RoBERTa+Prompt | 0.9277 ± 0.0027 | 0.8867 ± 0.0050 |
| ELECTRA+Prompt | 0.9267 ± 0.0027 | 0.8820 ± 0.0017 |

ELECTRA's seed-42 win is within the range of random seed variation. Single-run model selection is unreliable at this performance tier.
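The mean ± std aggregation used in the table above can be reproduced from per-seed metrics in a few lines. The helper name and the sample values below are illustrative (the paper may use sample rather than population standard deviation):

```python
import statistics

def aggregate_seeds(per_seed_metrics: dict[int, float]) -> tuple[float, float]:
    """Aggregate one metric across seeds into (mean, std).

    Uses the population standard deviation (pstdev); an assumption —
    the notebook may use the sample std instead.
    """
    values = list(per_seed_metrics.values())
    return statistics.fmean(values), statistics.pstdev(values)

# Hypothetical per-seed accuracies for one model (seeds match the study)
accs = {42: 0.9330, 2024: 0.9262, 7: 0.9278}
mean, std = aggregate_seeds(accs)
print(f"{mean:.4f} ± {std:.4f}")
```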

### Prompt Ablation — McNemar's Test

| Model | Δ Acc | p-value |
|---|---|---|
| DistilBERT | +0.0020 | 0.752 (ns) |
| DistilRoBERTa | +0.0035 | 0.418 (ns) |
| ELECTRA-base | +0.0115 | 0.026 (*) |

Prompt wording is neutral for DistilBERT and DistilRoBERTa. ELECTRA shows marginal significance that does not replicate across seeds (bootstrap CIs all span zero).
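For reference, a continuity-corrected McNemar test needs only the two discordant counts from the paired predictions. This is a stdlib sketch of the standard test, not the notebook's own implementation (which lives in Cell 4):

```python
import math

def mcnemar_test(b: int, c: int) -> tuple[float, float]:
    """Continuity-corrected McNemar test on discordant pairs.

    b = examples condition A classifies correctly and B incorrectly,
    c = the reverse. Returns (chi-square statistic, two-sided p-value)
    under a chi-square distribution with 1 degree of freedom.
    """
    if b + c == 0:
        return 0.0, 1.0  # no disagreements: conditions are identical
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of chi-square with 1 df: P(X > x) = erfc(sqrt(x/2))
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

chi2, p = mcnemar_test(10, 2)
print(f"chi2 = {chi2:.3f}, p = {p:.3f}")
```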

### Few-Shot Sample Efficiency

| Training Set | N | Acc | Macro-F1 |
|---|---|---|---|
| 16 shots/class | 96 | 0.3165 | 0.2501 |
| 64 shots/class | 384 | 0.6740 | 0.6223 |
| 256 shots/class | 1,536 | 0.8590 | 0.8237 |
| Full fine-tune | 16,000 | 0.9260 | 0.8829 |

At 256 shots per class (9.6% of training data), DistilRoBERTa achieves 92.9% of full fine-tune accuracy.
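The few-shot subsets are drawn per class (16 × 6 = 96, and so on). A hypothetical sampler in this spirit — the notebook's own few-shot cell may differ — could look like:

```python
import random
from collections import defaultdict

def sample_k_per_class(texts, labels, k, seed=42):
    """Draw k training examples per class with a fixed seed
    (illustrative helper, not the notebook's exact code)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for text, label in zip(texts, labels):
        by_class[label].append(text)
    subset = []
    for label in sorted(by_class):
        for text in rng.sample(by_class[label], k):
            subset.append((text, label))
    return subset

# e.g. 16 shots/class over 6 emotion classes -> 96 training examples
```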


## Repository Structure

emotion-classification/
│
├── notebooks/
│   └── Emotion_Experiment.ipynb      # Complete self-contained pipeline
│
├── results/
│   └── sample/                       # Sample JSON outputs for reference
│       ├── classical_metrics.json
│       ├── mcnemar_tests.json
│       ├── multiseed_aggregated.json
│       ├── prompt_variant_results.json
│       ├── fewshot_results.json
│       ├── weighted_loss_results.json
│       └── oversample_results.json
│
├── figures/                          # Publication figures (generated by notebook)
│   └── README.md
│
├── scripts/
│   ├── verify_environment.py         # Pre-run environment check
│   └── summarise_results.py          # Load Drive results → terminal table
│
├── .github/
│   └── ISSUE_TEMPLATE/
│       ├── bug_report.md
│       └── question.md
│
├── requirements.txt
├── CITATION.cff
├── LICENSE
└── README.md

## Quickstart

### Option 1 — Google Colab (Recommended)

Click the badge above or open directly:

https://colab.research.google.com/github/YOUR_USERNAME/emotion-classification/blob/main/notebooks/Emotion_Experiment.ipynb
  1. Set runtime to GPU (Runtime → Change runtime type → T4 GPU)
  2. Run Cell 1 (mounts Drive, installs packages)
  3. Restart runtime when prompted
  4. Run cells sequentially top to bottom

All results save automatically to MyDrive/emotion_experiment/. If the session crashes, re-running any cell reloads from Drive and skips completed experiments.

### Option 2 — Local / HPC

git clone https://github.com/YOUR_USERNAME/emotion-classification.git
cd emotion-classification
pip install -r requirements.txt
python scripts/verify_environment.py   # confirm GPU and package versions
jupyter notebook notebooks/Emotion_Experiment.ipynb

Note: The notebook uses `from google.colab import drive` in Cell 1. On a local machine, skip that cell and manually set `DRIVE_ROOT = Path("./outputs")` in Cell 2 before running.
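One way to make the same cell work in both environments is to fall back on `ImportError`. This is a sketch; the `DRIVE_ROOT` name comes from the note above, and the Colab mount path is an assumption based on the `MyDrive/emotion_experiment/` output location:

```python
from pathlib import Path

try:
    # Running in Colab: mount Drive and persist results there
    from google.colab import drive
    drive.mount("/content/drive")
    DRIVE_ROOT = Path("/content/drive/MyDrive/emotion_experiment")
except ImportError:
    # Local / HPC: keep everything in a local outputs directory
    DRIVE_ROOT = Path("./outputs")

DRIVE_ROOT.mkdir(parents=True, exist_ok=True)
```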


## Experiment Pipeline

The notebook executes the following phases in order:

| Cell | Phase | Description | ~Time (T4) |
|---|---|---|---|
| 1 | Setup | Mount Drive, install packages, restart | 2 min |
| 2 | Imports & Config | Global seed, paths, hyperparameters | <1 min |
| 3 | Dataset | Load SetFit/emotion; verify splits | 1 min |
| 4 | Helpers | Metrics, McNemar, throughput functions | <1 min |
| 5 | Classical Baselines | TF-IDF + LR and LinearSVM | 3 min |
| 6 | Transformer Function | `run_transformer()` definition | <1 min |
| 7 | Main Benchmark | 8 runs (5 prompt + 3 no-prompt) | 4–5 hr |
| 7B | Prompt Variants | 3 templates × DistilRoBERTa | ~45 min |
| 7C | Multi-Seed | 5 models × 3 seeds | ~3 hr |
| 7D | Weighted Loss | ELECTRA + DistilRoBERTa | ~1.5 hr |
| 7E | Soft Prompt | Prefix-tuning on DistilRoBERTa | ~30 min |
| 7F | Few-Shot | k ∈ {16, 64, 256} shots/class | ~45 min |
| 7G | Oversampling | Balanced training set (32,172 samples) | ~1.5 hr |
| 8 | McNemar Tests | Significance testing across all ablations | 5 min |
| 9 | Summary Table | Aggregate all results into CSV | 2 min |
| 10 | Figures | 6 publication-quality figures (PDF + PNG) | 5 min |
| 11 | LaTeX Tables | Ready-to-paste table snippets | 2 min |
| 12 | Reproducibility | Verify all experimental safeguards | 1 min |
| 13–16 | Cross-Domain | GoEmotions + MELD zero-shot evaluation | ~30 min |

## Hyperparameters

All transformer models use identical hyperparameters for fair comparison:

| Parameter | Value |
|---|---|
| Optimiser | AdamW |
| Learning rate | 2 × 10⁻⁵ |
| Weight decay | 0.01 |
| Train batch size | 32 |
| Eval batch size | 64 |
| Max epochs | 5 |
| Early stopping patience | 2 (val loss) |
| Max sequence length | 128 |
| Precision | FP16 (GPU) |
| Primary seed | 42 |
| Multi-seed runs | 42, 2024, 7 |

Classical baselines: TF-IDF (50K features, unigrams+bigrams, sublinear TF) + LogisticRegression / LinearSVC (both C=1).
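The classical baselines can be sketched in a few lines of scikit-learn. The settings mirror the description above (50K features, unigrams+bigrams, sublinear TF, C=1); the variable names are illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# The pipeline fits TF-IDF on the training split only and applies
# .transform() to any later data, matching the reproducibility notes.
svm_baseline = make_pipeline(
    TfidfVectorizer(max_features=50_000, ngram_range=(1, 2), sublinear_tf=True),
    LinearSVC(C=1.0),
)

# Usage (train_texts/train_labels/test_texts are placeholders):
# svm_baseline.fit(train_texts, train_labels)
# preds = svm_baseline.predict(test_texts)
```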


## Dataset

SetFit/emotion — HuggingFace Hub (SetFit/emotion)

| Split | Samples |
|---|---|
| Train | 16,000 |
| Validation | 2,000 |
| Test | 2,000 |

Six emotion classes: joy (33.5%), sadness (29.2%), anger (13.5%), fear (12.1%), love (8.2%), surprise (3.6%).
Majority-to-minority ratio: 9.3× (joy vs. surprise).

Official pre-defined splits are used as-is. No re-splitting or stratification is performed. TF-IDF is fitted on training split only.
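The quoted 9.3× majority-to-minority ratio follows directly from the class shares above:

```python
# Train-split class shares (%) as stated in the dataset description
dist = {"joy": 33.5, "sadness": 29.2, "anger": 13.5,
        "fear": 12.1, "love": 8.2, "surprise": 3.6}

majority = max(dist, key=dist.get)
minority = min(dist, key=dist.get)
ratio = dist[majority] / dist[minority]
print(f"{majority} vs. {minority}: {ratio:.1f}x imbalance")  # joy vs. surprise: 9.3x
```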


## Models

| Model | HuggingFace ID | Parameters | Pretraining |
|---|---|---|---|
| DistilBERT | distilbert-base-uncased | 66M | MLM (distilled from BERT) |
| DistilRoBERTa | distilroberta-base | 82M | MLM (distilled from RoBERTa) |
| RoBERTa-base | roberta-base | 125M | MLM (robust pretraining) |
| ALBERT-base-v2 | albert-base-v2 | 12M* | MLM + SOP |
| ELECTRA-base | google/electra-base-discriminator | 110M | Replaced-token detection |

*ALBERT has 12M unique parameters; cross-layer weight sharing means the same parameters are reused across all 12 transformer layers of the forward pass.


## Reproducibility

All experiments are deterministic under a fixed seed:

  • Python random, NumPy, PyTorch, and HuggingFace seeds set globally
  • Official HuggingFace dataset splits used without modification
  • TF-IDF fitted on training split only (.transform() for val/test)
  • Test set evaluated exactly once per model, after all training decisions
  • Throughput measured post-training with torch.no_grad()
  • save_only_model=True — optimizer state not saved (saves ~500 MB per run)
  • All result files persisted to Google Drive as JSON after every model

The notebook includes a Reproducibility Checklist (Cell 12) and a Drive file map that verifies every expected output file is present before figures and tables are generated.
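The global seeding in the first bullet typically follows a pattern like the one below. This is a sketch, not the notebook's exact cell; the torch calls are guarded so the snippet also runs in CPU-only environments:

```python
import os
import random

import numpy as np

def set_seed(seed: int = 42) -> None:
    """Seed Python, NumPy, and (if available) PyTorch RNGs globally."""
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass  # the classical baselines only need the Python/NumPy seeds

set_seed(42)
```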


## Computing Environment

| Resource | Specification |
|---|---|
| GPU | NVIDIA T4 (15 GB VRAM), Google Colab |
| Training precision | FP16 mixed precision |
| CPU (classical baselines) | Intel Xeon, Google Colab |
| Disk (Drive) | ~4 GB for all checkpoints + results |
| Total wall time | ~12–14 hours (full pipeline, no skips) |

Session crashes are handled gracefully: every experiment checks Drive for existing results before training.
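The check-before-train pattern can be sketched as a small caching helper. The function name and signature are ours; the notebook inlines this logic in each experiment cell:

```python
import json
from pathlib import Path

def load_or_run(result_path: Path, experiment_fn):
    """Return cached JSON results if they exist on Drive;
    otherwise run the experiment and persist its results."""
    if result_path.exists():
        return json.loads(result_path.read_text())
    result = experiment_fn()
    result_path.parent.mkdir(parents=True, exist_ok=True)
    result_path.write_text(json.dumps(result))
    return result
```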


## Citation

If you use this code or results, please cite:

@inproceedings{gayen2025prompt,
  title     = {Benchmarked: A Multi-Seed, Multi-Condition Study of Prompt-Based Emotion Classification with Efficient Transformers},
  author    = {Gayen, Avijit and Naskar, Sayak and Mishra, Shilpi and Jana, Angshuman},
  booktitle = {[Venue]},
  year      = {2025},
  url       = {https://arxiv.org/abs/XXXX.XXXXX}
}

## License

This project is licensed under the MIT License.
Model weights are subject to their respective HuggingFace model card licenses.
The SetFit/emotion dataset is subject to its original data license.


## Acknowledgements

The authors thank Google for Colab GPU compute resources and the open-source communities behind HuggingFace Transformers, Datasets, and scikit-learn.