This repository contains code used for experiments in my BSc final thesis, “Multi-class Classification of Botnet Detection by Active Learning.”
Labeling malware samples and network traffic is an expensive process in the cybersecurity industry. This active learning framework allows for the efficient development of effective machine learning models using a limited dataset. This thesis compares well-known query strategies to identify which strategy and parameters yield the best results while minimizing the required data samples.
- Margin Sampling is the optimal strategy in terms of stability and convergence speed.
- If multiple instances are required in each iteration, Ranked Batch-mode Sampling with a small unlabeled pool may perform well.
To get started, clone the repository:
git clone https://github.com/kei5uke/botnet-active-learning.git
Then, change your current directory and install the dependencies:
cd active_learning
pip install -r requirements.txt
Next, install the MedBIoT and N-BaIoT datasets and store them in the /dataset directory
The file structure is shown in the directory, so be sure to install the datasets accordingly
You can find the datasets here:
Only a small portion of the datasets is used for the experiments
To generate dataset pickles, run python3 Make_df_MedBIoT.py and python3 Make_df_N-BaIoT.py
Change common variables in global_variable.py and shared variables in each file
Now you are ready to run any experimental code in /active_learning directory


