Evaluating phishing email detection classification models using curated datasets

xcelt/phishdec

Phishdec Project: Evaluating Phishing Email Classification Models

An artefact for the dissertation:

Phishing Email Detection: A Comparison of Traditional Naive Bayes and Ensemble Random Forest Classification Models using Curated Datasets

By Xue Ling Teh

Overview

Datasets*

  1. phishing
  2. validate
  3. enron
  4. ling
  5. spamassassin
  6. ceas_08
  7. trec_05
  8. trec_06
  9. trec_07

*Oversized files have not been uploaded to the GitHub remote because they exceed GitHub's file size limit.

Data preprocessing and cleaning

Missing values are handled with one of the following options:

  • Drop rows that contain null, NA or NaN values

OR

  • Impute (fill) NA or NaN values
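
The two options above can be sketched without any dependencies. This is a minimal illustration and not the project's own code: the column names (`subject`, `body`, `label`), the helper names, and the choice of an empty string as the fill value are all assumptions. With pandas, the same two options correspond to `DataFrame.dropna()` and `DataFrame.fillna()`.

```python
import math


def is_missing(value):
    """Treat None, float NaN, and empty/'NA'/'NaN'/'null' strings as missing."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and value.strip().lower() in {"", "na", "nan", "null"}


def drop_missing(rows):
    """Option 1: keep only rows in which every field is present."""
    return [row for row in rows if not any(is_missing(v) for v in row.values())]


def fill_missing(rows, columns, fill_value=""):
    """Option 2: replace missing values in the named columns with fill_value."""
    return [
        {k: (fill_value if k in columns and is_missing(v) else v) for k, v in row.items()}
        for row in rows
    ]


# Hypothetical sample rows; label 1 = phishing, 0 = legitimate (assumed encoding).
emails = [
    {"subject": "Verify your account", "body": "Click here now", "label": 1},
    {"subject": None, "body": "Meeting moved to 3pm", "label": 0},
    {"subject": "Invoice", "body": float("nan"), "label": 1},
]

dropped = drop_missing(emails)
filled = fill_missing(emails, columns={"subject", "body"})
print(len(dropped), len(filled))
```

Dropping is simpler but discards data; filling keeps every row, which may matter for the smaller datasets.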

Model Types

  1. Naive Bayes (NB): Multinomial
  2. Random Forest (RF)
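
To show what the multinomial Naive Bayes model computes, here is a small from-scratch sketch with Laplace smoothing. It is illustrative only and is not taken from the repository: the class name `TinyMultinomialNB`, the whitespace tokenizer, and the toy training emails are assumptions.

```python
import math
from collections import Counter, defaultdict


class TinyMultinomialNB:
    """Word-count Naive Bayes with additive (Laplace) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(documents, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocabulary = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            # Start from the log prior, then add one log likelihood per word.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in text.lower().split():
                if word not in self.vocabulary:
                    continue  # words never seen in training are ignored
                count = self.word_counts[label][word]
                score += math.log(
                    (count + self.alpha) / (total_words + self.alpha * vocab_size)
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data (made up for illustration).
train_texts = [
    "verify your account password now",
    "urgent click link to claim prize",
    "meeting agenda for monday attached",
    "lunch on friday with the team",
]
train_labels = ["phishing", "phishing", "legitimate", "legitimate"]

model = TinyMultinomialNB().fit(train_texts, train_labels)
prediction_a = model.predict("click to verify your password")
prediction_b = model.predict("agenda for the team meeting")
print(prediction_a, prediction_b)
```

A random forest is too long to sketch from scratch here. In scikit-learn, the two model types correspond to `sklearn.naive_bayes.MultinomialNB` and `sklearn.ensemble.RandomForestClassifier`.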

Evaluation Metrics

  • Accuracy
  • Confusion Matrix
    • True Positive (TP)
    • True Negative (TN)
    • False Positive (FP)
    • False Negative (FN)
  • Classification Report:
    • precision
    • recall
    • f1-score
    • support: number of true samples of each class in the evaluated data
  • 5×2 cross-validation paired t-test
    • t statistic
    • p-value
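
The metrics above can be derived by hand, which helps when reading the reports. The sketch below is dependency-free and is not the project's implementation. The function names and sample numbers are assumptions, and the test statistic follows the standard 5×2 cross-validation formulation: the first score difference divided by the square root of the mean of the five per-repetition variances, compared against a t distribution with 5 degrees of freedom.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, F1 and support for the positive class."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "support": tp + fn}


def two_sided_p_value_df5(t_stat):
    """Two-sided p-value for Student's t with 5 degrees of freedom (closed form)."""
    theta = math.atan(abs(t_stat) / math.sqrt(5))
    cos2 = math.cos(theta) ** 2
    area = (2 / math.pi) * (
        theta + math.sin(theta) * math.cos(theta) * (1 + 2 * cos2 / 3)
    )
    return 1 - area


def paired_ttest_5x2cv(differences):
    """differences: five (d1, d2) pairs of score differences (model A - model B),
    one pair per repetition of 2-fold cross-validation."""
    variances = []
    for d1, d2 in differences:
        mean = (d1 + d2) / 2
        variances.append((d1 - mean) ** 2 + (d2 - mean) ** 2)
    # Assumes the differences are not all identical (non-zero variance).
    t_stat = differences[0][0] / math.sqrt(sum(variances) / 5)
    return t_stat, two_sided_p_value_df5(t_stat)


# Made-up confusion-matrix counts and accuracy differences.
metrics = binary_metrics(tp=40, tn=50, fp=5, fn=5)
score_differences = [(0.02, 0.01), (0.03, 0.02), (0.01, 0.02),
                     (0.02, 0.03), (0.02, 0.01)]
t_stat, p_value = paired_ttest_5x2cv(score_differences)
print(metrics, t_stat, p_value)
```

A p-value below the chosen significance level (commonly 0.05) suggests the two models' performance differs by more than chance would explain.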

Setup

Prerequisites

  • Python ~= 3.13.0 (3.13.0 or a later 3.13.x patch release)

Getting Started (Windows)

  1. Create a virtual environment named venv: python -m venv venv
  2. Activate the virtual environment: venv\Scripts\activate
  3. Install the required dependencies listed in requirements.txt: python -m pip install -r requirements.txt
  4. Run the program: python main.py

To zip up this project: git archive --format=zip --output ./artefact.zip HEAD

Notes

Output filename convention: title*_datasetnumber_randomstate.fileextension

*A title of more than one word uses underscores (_) between the words

e.g. nb_c_matrix_1_860.png indicates a png image of a Naive Bayes Confusion Matrix of Dataset 1 with random state 860

Default random state is 42.
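
The convention can be expressed as a pair of helpers. This is a sketch under the naming rule stated above. The function names are hypothetical and do not come from the repository, and the `rf_report` title is a made-up example.

```python
from pathlib import Path

DEFAULT_RANDOM_STATE = 42  # default stated in the notes above


def build_filename(title, dataset_number, random_state=DEFAULT_RANDOM_STATE,
                   extension="png"):
    """Compose title_datasetnumber_randomstate.fileextension."""
    return f"{title}_{dataset_number}_{random_state}.{extension}"


def parse_filename(filename):
    """Split a filename back into its parts.

    The title may itself contain underscores, so split from the right:
    the last two underscore-separated fields are always the dataset
    number and the random state.
    """
    path = Path(filename)
    title, dataset_number, random_state = path.stem.rsplit("_", 2)
    return {
        "title": title,
        "dataset_number": int(dataset_number),
        "random_state": int(random_state),
        "extension": path.suffix.lstrip("."),
    }


example_name = build_filename("nb_c_matrix", 1, 860)
example_parts = parse_filename(example_name)
default_name = build_filename("rf_report", 3, extension="txt")
print(example_name, example_parts, default_name)
```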

References

Downey, A. (2022) Think Bayes 2 — Think Bayes. Available from: https://allendowney.github.io/ThinkBayes2/ [Accessed 19 October 2024].

Downey, A. (2024) Think Python — Think Python. Available from: https://allendowney.github.io/ThinkPython/ [Accessed 16 September 2024].

Chakraborty, S. (2023) Phishing Email Detection. DOI: https://doi.org/10.34740/kaggle/dsv/6090437.

Champa, A.I., Rabbi, M.F. & Zibran, M.F. (2024) Curated Datasets and Feature Analysis for Phishing Email Detection with Machine Learning. In: 2024 IEEE 3rd International Conference on Computing and Machine Intelligence (ICMI). April 2024 pp. 1–7. DOI: https://doi.org/10.1109/ICMI60790.2024.10585821.

Miltchev, R., Dimitar, R. & Evgeni, G. (2024) Phishing validation emails dataset. DOI: https://doi.org/10.5281/ZENODO.13474746.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M. & Duchesnay, É. (2011) Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research. 12(85): 2825–2830.

Toledo Jr, T. (2021) Statistical Tests for Comparing Classification Algorithms. 23 November 2021. Towards Data Science. Available from: https://towardsdatascience.com/statistical-tests-for-comparing-classification-algorithms-ac1804e79bb7/ [Accessed 12 March 2025]. 
