This repository showcases a practical, coursework-style project from my Master's studies in Big Data Analytics at the University of East London. The aim is to demonstrate how core big data tools and techniques can be applied end to end to a realistic cybersecurity dataset, from data management and querying through to advanced analytics and machine learning.
The work is built around the UNSW-NB15 intrusion detection dataset, a widely used benchmark for research on network traffic and cyber attacks. The project combines:
- Big data management using Hadoop and HDFS
- Batch querying and feature exploration using Apache Hive
- Scalable advanced analytics and machine learning using PySpark
- A short discussion of alternative technologies for big data systems
⚠️ Note on academic integrity:
This repository is a restructured, summarised version of work originally completed as part of a university assignment. It is shared here purely as a portfolio of practical skills in big data analytics and systems. Anyone using this material for their own coursework should adapt, extend, and properly acknowledge it rather than copying directly.
Learning objectives:
- Apply the basics of big data engineering: HDFS storage, schema design, and data loading.
- Use HiveQL to:
  - Explore temporal patterns of malicious traffic
  - Work with mathematical and string functions at scale
  - Build simple derived features using conditional logic (a sketch of a representative query follows this list)
- Use PySpark to:
  - Clean and prepare a large intrusion detection dataset
  - Compute descriptive statistics and correlations
  - Perform hypothesis testing and density estimation on large numeric features
  - Build both binary and multi-class classification models at scale
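To give a flavour of the HiveQL side, the sketch below runs a Hive-style query through a Hive-enabled SparkSession, so that every example in this README stays in Python. It is illustrative only: the table name `unsw_nb15` and the use of `stime`, `attack_cat`, `sbytes`, `dbytes`, and `label` are assumptions based on the published UNSW-NB15 schema, not the exact scripts in `hive/`.

```python
from pyspark.sql import SparkSession

# Hive-enabled session so the query resolves against the Hive metastore.
spark = (SparkSession.builder
         .appName("hive-style-exploration")
         .enableHiveSupport()
         .getOrCreate())

# Hourly profile of malicious traffic, with a derived feature built from
# conditional logic (CASE WHEN), in the spirit of the scripts in hive/.
hourly = spark.sql("""
    SELECT hour(from_unixtime(stime))                        AS hour_of_day,
           lower(trim(attack_cat))                           AS attack_category,
           count(*)                                          AS events,
           round(avg(sbytes + dbytes), 2)                    AS avg_total_bytes,
           sum(CASE WHEN sbytes > dbytes THEN 1 ELSE 0 END)  AS upload_heavy_flows
    FROM unsw_nb15
    WHERE label = 1                      -- malicious records only
    GROUP BY hour(from_unixtime(stime)), lower(trim(attack_cat))
    ORDER BY hour_of_day, events DESC
""")
hourly.show(10, truncate=False)
```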
Repository structure:
- `data/` – Information about the UNSW-NB15 dataset and how to obtain it.
- `hive/` – HiveQL scripts used for big data querying and basic feature engineering.
- `pyspark/` – PySpark scripts for advanced analytics, statistical exploration, and classification models.
Technology stack:
- Apache Hadoop / HDFS – distributed storage backend.
- Apache Hive – SQL-like querying over large datasets using a schema-on-read model.
- PySpark (Apache Spark 3.x) – distributed in-memory analytics and machine learning:
  - DataFrames + SQL
  - MLlib (Logistic Regression, Random Forest)
  - `pyspark.ml.stat` (correlation, KS test; a statistics sketch follows this list)
- Matplotlib / Pandas – used locally for visualisation of Spark outputs.
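As a minimal sketch of how those statistics APIs fit together (the HDFS path is hypothetical, and the assumption is that `dur`, `sbytes`, and `dbytes` are numeric columns, as they are in the published UNSW-NB15 schema):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation, KolmogorovSmirnovTest
from pyspark.mllib.stat import KernelDensity

spark = SparkSession.builder.appName("stats-sketch").getOrCreate()

# Hypothetical HDFS path; see data/README.md for the real location.
df = (spark.read.csv("hdfs:///data/unsw_nb15/UNSW-NB15.csv",
                     header=True, inferSchema=True)
      .na.drop(subset=["dur", "sbytes", "dbytes"]))

# Pearson correlation matrix over a handful of numeric features.
vec = VectorAssembler(inputCols=["dur", "sbytes", "dbytes"],
                      outputCol="features").transform(df)
corr = Correlation.corr(vec, "features", "pearson").head()[0]
print(corr.toArray())

# Kolmogorov-Smirnov test: is `dur` standard-normal? (Almost certainly not.)
ks = KolmogorovSmirnovTest.test(df, "dur", "norm", 0.0, 1.0).head()
print(ks.statistic, ks.pValue)

# Kernel density estimate of flow duration (RDD-based MLlib API).
kd = KernelDensity()
kd.setSample(df.select("dur").rdd.map(lambda row: float(row[0])))
kd.setBandwidth(0.5)
print(kd.estimate([0.1, 1.0, 10.0]))
```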
Getting started:
- Clone the repository:

  ```bash
  git clone https://github.com/DrFarouk/big-data-analytics.git
  cd big-data-analytics
  ```

- Set up a Hadoop + Spark environment (local or cluster).
- Follow `data/README.md` to download the UNSW-NB15 CSV and place it in HDFS.
- Run:
  - Hive scripts from the `hive/` directory.
  - PySpark scripts from the `pyspark/` directory (a minimal loading sketch follows these steps).
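The PySpark scripts all start by reading the CSV back out of HDFS. A minimal sketch of that loading-and-cleaning step, where the HDFS path is hypothetical and treating a null `attack_cat` as normal traffic follows the published dataset description rather than the repository's exact code:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("prep-sketch").getOrCreate()

# Hypothetical HDFS location; match whatever data/README.md instructs.
raw = spark.read.csv("hdfs:///data/unsw_nb15/UNSW-NB15.csv",
                     header=True, inferSchema=True)

clean = (raw
         # attack_cat is empty for benign flows in the published dataset.
         .withColumn("attack_cat",
                     F.lower(F.trim(F.coalesce(F.col("attack_cat"),
                                               F.lit("normal")))))
         .withColumn("label", F.col("label").cast("int"))
         .na.drop(subset=["dur", "sbytes", "dbytes"]))

clean.groupBy("attack_cat").count().orderBy(F.desc("count")).show()
```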
By the end of this project, I:
- Gained practical experience with schema-on-read data management and large-scale SQL querying in Hive.
- Learned how to conduct distributed statistical analysis and machine learning using PySpark.
- Built and evaluated both binary and multi-class classifiers for intrusion detection.
- Developed an informed view of where tools like Presto, Impala, Dask, Flink, and RAPIDS fit in the broader big data landscape.
Example outputs:
- Bar chart showing average source and destination bytes by attack category.
- Heatmap showing the Pearson correlation matrix.
- ROC curve for a simple logistic regression model (binary classification).
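The ROC curve above comes from a model along these lines. The sketch below is illustrative rather than the repository's exact pipeline: the feature subset is arbitrary, and the column names follow the published UNSW-NB15 schema.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression, RandomForestClassifier
from pyspark.ml.evaluation import (BinaryClassificationEvaluator,
                                   MulticlassClassificationEvaluator)

spark = SparkSession.builder.appName("models-sketch").getOrCreate()

features = ["dur", "sbytes", "dbytes", "spkts", "dpkts"]  # illustrative subset
df = (spark.read.csv("hdfs:///data/unsw_nb15/UNSW-NB15.csv",
                     header=True, inferSchema=True)
      .withColumn("attack_cat",
                  F.coalesce(F.col("attack_cat"), F.lit("normal")))
      .withColumn("label", F.col("label").cast("double"))
      .na.drop(subset=features + ["label", "attack_cat"]))
train, test = df.randomSplit([0.8, 0.2], seed=42)

assembler = VectorAssembler(inputCols=features, outputCol="features")

# Binary task: attack (1) vs normal (0).
lr = LogisticRegression(featuresCol="features", labelCol="label")
binary = Pipeline(stages=[assembler, lr]).fit(train)
auc = BinaryClassificationEvaluator(labelCol="label").evaluate(
    binary.transform(test))  # default metric: areaUnderROC
print(f"binary ROC AUC: {auc:.3f}")

# Multi-class task: predict the attack category.
indexer = StringIndexer(inputCol="attack_cat", outputCol="cat_label")
rf = RandomForestClassifier(featuresCol="features", labelCol="cat_label",
                            numTrees=100)
multi = Pipeline(stages=[indexer, assembler, rf]).fit(train)
f1 = MulticlassClassificationEvaluator(labelCol="cat_label",
                                       metricName="f1").evaluate(
    multi.transform(test))
print(f"multi-class F1: {f1:.3f}")
```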


