An artefact for the dissertation:
Phishing Email Detection: A Comparison of Traditional Naive Bayes and Ensemble Random Forest Classification Models using Curated Datasets
By Xue Ling Teh
- phishing
- validate
- enron
- ling
- spamassassin
- ceas_08
- trec_05
- trec_06
- trec_07
*Oversized files have not been uploaded to the GitHub remote because of the file size limit.
One of the following options:
- Drop empty rows with null, NA or NaN values
OR
- Impute/Fill NA or NaN values
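The two cleaning options above can be sketched in plain Python as follows. This is only an illustration under assumptions: the column names (`subject`, `body`, `label`) and the helper names are hypothetical, and the project itself may well perform this step with a data-frame library instead (in pandas, the corresponding calls would be `DataFrame.dropna` and `DataFrame.fillna`).

```python
import math


def is_missing(value):
    """Return True for None or a float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def drop_incomplete(rows, columns):
    """Option 1: keep only rows that have a value in every listed column."""
    return [row for row in rows
            if not any(is_missing(row.get(col)) for col in columns)]


def impute_missing(rows, columns, fill_value=""):
    """Option 2: keep every row, replacing missing values with fill_value."""
    cleaned = []
    for row in rows:
        fixed = dict(row)  # copy so the input rows are left untouched
        for col in columns:
            if is_missing(fixed.get(col)):
                fixed[col] = fill_value
        cleaned.append(fixed)
    return cleaned


# Hypothetical sample rows; real column names depend on the dataset.
emails = [
    {"subject": "Meeting", "body": "See agenda", "label": 0},
    {"subject": None, "body": "Verify your account", "label": 1},
    {"subject": "Prize", "body": float("nan"), "label": 1},
]
text_columns = ["subject", "body"]

dropped = drop_incomplete(emails, text_columns)   # option 1
imputed = impute_missing(emails, text_columns)    # option 2
```

Dropping shrinks the dataset but leaves only complete rows; imputing keeps every row at the cost of introducing placeholder text.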
- Naive Bayes (NB): Multinomial
- Random Forest (RF)
- Accuracy
- Confusion Matrix
- True Positive (TP)
- True Negative (TN)
- False Positive (FP)
- False Negative (FN)
- Classification Report:
- precision
- recall
- f1-score
- support: total number of occurrences of a specific class in the dataset
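How the metrics listed above relate to one another can be shown with a small standard-library sketch. It assumes a binary encoding in which 1 marks a phishing email (the positive class) and 0 a legitimate one; that encoding, the function names, and the sample labels are illustrative assumptions, not taken from the project code.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP and FN for a binary classification task."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1
        else:
            if actual == positive:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn


def summarise(y_true, y_pred, positive=1):
    """Derive accuracy and the positive-class report values from the counts."""
    if not y_true:
        raise ValueError("y_true must not be empty")
    tp, tn, fp, fn = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1-score": f1,
        "support": tp + fn,  # number of true positive-class samples
    }


# Hypothetical labels: 1 = phishing, 0 = legitimate.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
counts = confusion_counts(y_true, y_pred)
report = summarise(y_true, y_pred)
```

A full classification report repeats the precision, recall, f1-score and support calculation once per class; the sketch covers the positive class only.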
- 5×2 Cross-Validation Paired T-test
- t statistic
- p-value
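For readers unfamiliar with the 5×2 cross-validation paired t-test, the sketch below shows how the t statistic and p-value are usually derived from the score differences between two models. In the usual formulation, 2-fold cross-validation is repeated five times, the statistic is the first difference divided by the square root of the mean per-repetition variance, and it is compared against a t distribution with 5 degrees of freedom. The function names and the sample differences are hypothetical, and the project may instead rely on a ready-made library routine.

```python
import math


def five_by_two_cv_t(differences):
    """t statistic from five (fold 1, fold 2) pairs of score differences.

    Each value is score(model A) - score(model B) on one fold.
    """
    if len(differences) != 5 or any(len(pair) != 2 for pair in differences):
        raise ValueError("expected 5 repetitions of 2 folds")
    variances = []
    for d1, d2 in differences:
        mean = (d1 + d2) / 2
        variances.append((d1 - mean) ** 2 + (d2 - mean) ** 2)
    denominator = math.sqrt(sum(variances) / 5)
    if denominator == 0:
        raise ValueError("variance is zero, so t is undefined")
    return differences[0][0] / denominator


def two_sided_p_value_5dof(t):
    """Two-sided p-value for Student's t with 5 degrees of freedom.

    Uses the closed form that exists for odd degrees of freedom.
    """
    theta = math.atan(abs(t) / math.sqrt(5))
    central_mass = (2 / math.pi) * (
        theta
        + math.sin(theta) * (math.cos(theta) + (2 / 3) * math.cos(theta) ** 3)
    )
    return 1 - central_mass


# Hypothetical accuracy differences (model A minus model B), not real results.
accuracy_differences = [
    (0.02, 0.04), (0.03, 0.01), (0.05, 0.02), (0.01, 0.03), (0.04, 0.02),
]
t_stat = five_by_two_cv_t(accuracy_differences)
p_value = two_sided_p_value_5dof(t_stat)
```

A p-value below the chosen significance level (commonly 0.05) suggests that the difference in performance between the two models is unlikely to be due to chance alone.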
- Python ~= 3.13.0
- Create a virtual environment named venv:
  python -m venv venv
- Install the required dependencies in requirements.txt:
  python -m pip install -r requirements.txt
- Activate the virtual environment:
  venv\Scripts\activate
- Run the program:
  python main.py
To zip up this project:
git archive --format=zip --output ./artefact.zip HEAD
Filename convention: title*_datasetnumber_randomstate.fileextension
*A title with more than one word contains extra underscores (_).
e.g. nb_c_matrix_1_860.png is a PNG image of the Naive Bayes confusion matrix for Dataset 1 with random state 860.
Default random state is 42.
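The naming convention can be expressed as a pair of small helpers, shown here as a sketch. The function names and the constant are hypothetical and are not taken from main.py; only the pattern and the default random state of 42 come from the description above.

```python
DEFAULT_RANDOM_STATE = 42  # default stated in this README


def build_filename(title, dataset_number, extension,
                   random_state=DEFAULT_RANDOM_STATE):
    """Compose title_datasetnumber_randomstate.fileextension."""
    return f"{title}_{dataset_number}_{random_state}.{extension}"


def parse_filename(filename):
    """Split a filename back into its four parts.

    The title may itself contain underscores, so the split is done
    from the right-hand side.
    """
    stem, extension = filename.rsplit(".", 1)
    title, dataset_number, random_state = stem.rsplit("_", 2)
    return title, int(dataset_number), int(random_state), extension


example = build_filename("nb_c_matrix", 1, "png", random_state=860)
parts = parse_filename(example)
```

Splitting from the right is what keeps multi-word titles such as nb_c_matrix intact when a filename is parsed.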
Downey, A. (2022) Think Bayes 2 — Think Bayes. Available from: https://allendowney.github.io/ThinkBayes2/ [Accessed 19 October 2024].
Downey, A. (2024) Think Python — Think Python. Available from: https://allendowney.github.io/ThinkPython/ [Accessed 16 September 2024].
Chakraborty, S. (2023) Phishing Email Detection. DOI: https://doi.org/10.34740/kaggle/dsv/6090437.
Champa, A.I., Rabbi, M.F. & Zibran, M.F. (2024) Curated Datasets and Feature Analysis for Phishing Email Detection with Machine Learning. In: 2024 IEEE 3rd International Conference on Computing and Machine Intelligence (ICMI). April 2024 pp. 1–7. DOI: https://doi.org/10.1109/ICMI60790.2024.10585821.
Miltchev, R., Dimitar, R. & Evgeni, G. (2024) Phishing validation emails dataset. DOI: https://doi.org/10.5281/ZENODO.13474746.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M. & Duchesnay, É. (2011) Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research. 12(85): 2825–2830.
Toledo Jr, T. (2021) Statistical Tests for Comparing Classification Algorithms. 23 November 2021. Towards Data Science. Available from: https://towardsdatascience.com/statistical-tests-for-comparing-classification-algorithms-ac1804e79bb7/ [Accessed 12 March 2025].