This repository showcases a practical, coursework-style project from my Master's studies in Big Data Analytics at the University of East London. The aim is to demonstrate how core big data tools and techniques can be applied end to end to a realistic cybersecurity dataset, from data management and querying through to advanced analytics and machine learning.
The work is built around the UNSW-NB15 intrusion detection dataset, a widely used benchmark for research on network traffic and cyber attacks. The project combines:
- Big data management using Hadoop and HDFS
- Batch querying and feature exploration using Apache Hive
- Scalable advanced analytics and machine learning using PySpark
- A short discussion of alternative technologies for big data systems
⚠️ Note on academic integrity:
This repository is a restructured, summarised version of work originally completed as part of a university assignment. It is shared here purely as a portfolio of practical skills in big data analytics and systems. Anyone using this material for their own coursework should adapt, extend, and properly acknowledge it rather than copying directly.
Learning objectives:
- Apply the basics of big data engineering: HDFS storage, schema design, and data loading.
- Use HiveQL to:
  - Explore temporal patterns of malicious traffic
  - Work with mathematical and string functions at scale
  - Build simple derived features using conditional logic (a sketch of a representative query follows this list)
- Use PySpark to:
  - Clean and prepare a large intrusion detection dataset
  - Compute descriptive statistics and correlations
  - Perform hypothesis testing and density estimation on large numeric features
  - Build both binary and multi-class classification models at scale
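To give a flavour of the HiveQL side, the sketch below runs a Hive-style query through a Hive-enabled SparkSession, so that every example in this README stays in Python. It is illustrative only: the table name `unsw_nb15` and the use of `stime`, `attack_cat`, `sbytes`, `dbytes`, and `label` are assumptions based on the published UNSW-NB15 schema, not the exact scripts in `hive/`.

```python
from pyspark.sql import SparkSession

# Hive-enabled session so the query resolves against the Hive metastore.
spark = (SparkSession.builder
         .appName("hive-style-exploration")
         .enableHiveSupport()
         .getOrCreate())

# Hourly profile of malicious traffic, with a derived feature built from
# conditional logic (CASE WHEN), in the spirit of the scripts in hive/.
hourly = spark.sql("""
    SELECT hour(from_unixtime(stime))                        AS hour_of_day,
           lower(trim(attack_cat))                           AS attack_category,
           count(*)                                          AS events,
           round(avg(sbytes + dbytes), 2)                    AS avg_total_bytes,
           sum(CASE WHEN sbytes > dbytes THEN 1 ELSE 0 END)  AS upload_heavy_flows
    FROM unsw_nb15
    WHERE label = 1                      -- malicious records only
    GROUP BY hour(from_unixtime(stime)), lower(trim(attack_cat))
    ORDER BY hour_of_day, events DESC
""")
hourly.show(10, truncate=False)
```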
Repository structure:
- `data/` – Information about the UNSW-NB15 dataset and how to obtain it.
- `hive/` – HiveQL scripts used for big data querying and basic feature engineering.
- `pyspark/` – PySpark scripts for advanced analytics, statistical exploration, and classification models.
Technology stack:
- Apache Hadoop / HDFS – distributed storage backend.
- Apache Hive – SQL-like querying over large datasets using a schema-on-read model.
- PySpark (Apache Spark 3.x) – distributed in-memory analytics and machine learning:
  - DataFrames + SQL
  - MLlib (Logistic Regression, Random Forest)
  - `pyspark.ml.stat` (correlation, KS test; a statistics sketch follows this list)
- Matplotlib / Pandas – used locally for visualisation of Spark outputs.
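As a minimal sketch of how those statistics APIs fit together (the HDFS path is hypothetical, and the assumption is that `dur`, `sbytes`, and `dbytes` are numeric columns, as they are in the published UNSW-NB15 schema):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation, KolmogorovSmirnovTest
from pyspark.mllib.stat import KernelDensity

spark = SparkSession.builder.appName("stats-sketch").getOrCreate()

# Hypothetical HDFS path; see data/README.md for the real location.
df = (spark.read.csv("hdfs:///data/unsw_nb15/UNSW-NB15.csv",
                     header=True, inferSchema=True)
      .na.drop(subset=["dur", "sbytes", "dbytes"]))

# Pearson correlation matrix over a handful of numeric features.
vec = VectorAssembler(inputCols=["dur", "sbytes", "dbytes"],
                      outputCol="features").transform(df)
corr = Correlation.corr(vec, "features", "pearson").head()[0]
print(corr.toArray())

# Kolmogorov-Smirnov test: is `dur` standard-normal? (Almost certainly not.)
ks = KolmogorovSmirnovTest.test(df, "dur", "norm", 0.0, 1.0).head()
print(ks.statistic, ks.pValue)

# Kernel density estimate of flow duration (RDD-based MLlib API).
kd = KernelDensity()
kd.setSample(df.select("dur").rdd.map(lambda row: float(row[0])))
kd.setBandwidth(0.5)
print(kd.estimate([0.1, 1.0, 10.0]))
```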
Getting started:
- Clone the repository:

  ```bash
  git clone https://github.com/DrFarouk/big-data-analytics.git
  cd big-data-analytics
  ```

- Set up a Hadoop + Spark environment (local or cluster).
- Follow `data/README.md` to download the UNSW-NB15 CSV and place it in HDFS.
- Run:
  - Hive scripts from the `hive/` directory.
  - PySpark scripts from the `pyspark/` directory (a minimal loading sketch follows these steps).
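The PySpark scripts all start by reading the CSV back out of HDFS. A minimal sketch of that loading-and-cleaning step, where the HDFS path is hypothetical and treating a null `attack_cat` as normal traffic follows the published dataset description rather than the repository's exact code:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("prep-sketch").getOrCreate()

# Hypothetical HDFS location; match whatever data/README.md instructs.
raw = spark.read.csv("hdfs:///data/unsw_nb15/UNSW-NB15.csv",
                     header=True, inferSchema=True)

clean = (raw
         # attack_cat is empty for benign flows in the published dataset.
         .withColumn("attack_cat",
                     F.lower(F.trim(F.coalesce(F.col("attack_cat"),
                                               F.lit("normal")))))
         .withColumn("label", F.col("label").cast("int"))
         .na.drop(subset=["dur", "sbytes", "dbytes"]))

clean.groupBy("attack_cat").count().orderBy(F.desc("count")).show()
```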
By the end of this project, I:
- Gained practical experience with schema-on-read data management and large-scale SQL querying in Hive.
- Learned how to conduct distributed statistical analysis and machine learning using PySpark.
- Built and evaluated both binary and multi-class classifiers for intrusion detection.
- Developed an informed view of where tools like Presto, Impala, Dask, Flink, and RAPIDS fit in the broader big data landscape.
Example outputs:
- Bar chart showing average source and destination bytes by attack category.
- Heatmap showing the Pearson correlation matrix.
- ROC curve for a simple logistic regression model (binary classification).
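The ROC curve above comes from a model along these lines. The sketch below is illustrative rather than the repository's exact pipeline: the feature subset is arbitrary, and the column names follow the published UNSW-NB15 schema.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression, RandomForestClassifier
from pyspark.ml.evaluation import (BinaryClassificationEvaluator,
                                   MulticlassClassificationEvaluator)

spark = SparkSession.builder.appName("models-sketch").getOrCreate()

features = ["dur", "sbytes", "dbytes", "spkts", "dpkts"]  # illustrative subset
df = (spark.read.csv("hdfs:///data/unsw_nb15/UNSW-NB15.csv",
                     header=True, inferSchema=True)
      .withColumn("attack_cat",
                  F.coalesce(F.col("attack_cat"), F.lit("normal")))
      .withColumn("label", F.col("label").cast("double"))
      .na.drop(subset=features + ["label", "attack_cat"]))
train, test = df.randomSplit([0.8, 0.2], seed=42)

assembler = VectorAssembler(inputCols=features, outputCol="features")

# Binary task: attack (1) vs normal (0).
lr = LogisticRegression(featuresCol="features", labelCol="label")
binary = Pipeline(stages=[assembler, lr]).fit(train)
auc = BinaryClassificationEvaluator(labelCol="label").evaluate(
    binary.transform(test))  # default metric: areaUnderROC
print(f"binary ROC AUC: {auc:.3f}")

# Multi-class task: predict the attack category.
indexer = StringIndexer(inputCol="attack_cat", outputCol="cat_label")
rf = RandomForestClassifier(featuresCol="features", labelCol="cat_label",
                            numTrees=100)
multi = Pipeline(stages=[indexer, assembler, rf]).fit(train)
f1 = MulticlassClassificationEvaluator(labelCol="cat_label",
                                       metricName="f1").evaluate(
    multi.transform(test))
print(f"multi-class F1: {f1:.3f}")
```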


