Malware Detection & Classification with Convolutional Neural Networks

This repository contains the implementation of a deep learning-based malware classification system developed for the PFE Camp '26. The project explores the effectiveness of treating Windows executable binaries as 2D images, allowing a Convolutional Neural Network (CNN) to "see" and classify malware families based on visual byte-level textures.

Project Overview

Traditional malware detection relies on signatures or complex heuristics. This project implements a Static Analysis pipeline that transforms raw .bytes files from the Microsoft Malware Classification Challenge into grayscale images. By leveraging the spatial hierarchy of CNNs, the model identifies structural patterns unique to specific malware families.
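The byte-to-image step described above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the function name `bytes_text_to_image`, the fixed row `width`, and the choice to map unreadable `??` tokens to 0 are all assumptions. It relies on the dataset's .bytes layout, where each line is an address followed by hexadecimal byte values. Resizing the result to a fixed 256x256 or 512x512 resolution is omitted.

```python
import numpy as np

def bytes_text_to_image(text: str, width: int = 512) -> np.ndarray:
    """Convert the text of a .bytes file into a 2D uint8 grayscale array."""
    values = []
    for line in text.splitlines():
        tokens = line.split()
        # The first token on each line is the address; the rest are hex bytes.
        for tok in tokens[1:]:
            # "??" marks an unreadable byte; mapping it to 0 is an assumption.
            values.append(0 if tok == "??" else int(tok, 16))
    data = np.array(values, dtype=np.uint8)
    # Zero-pad the tail so the byte stream fills a whole number of rows.
    height = max(1, -(-len(data) // width))
    padded = np.zeros(height * width, dtype=np.uint8)
    padded[: len(data)] = data
    return padded.reshape(height, width)

# Tiny hand-written sample in the .bytes layout (not real malware data).
sample = "00401000 56 8D 44 24\n00401004 08 ?? FF 00\n"
image = bytes_text_to_image(sample, width=4)
print(image.shape)  # (2, 4)
```

Each pixel intensity is simply the byte value (0 to 255), so regions such as padding, packed code, or text sections tend to show up as visually distinct textures.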

Final Performance (Current Best)

Validation Accuracy: 98.3%
Kaggle Public Score (log loss, lower is better): 0.06005
Kaggle Private Score (log loss, lower is better): 0.06321
Leaderboard Rank: Top 175
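For context on the Kaggle scores, the competition metric is multi-class log loss: the mean negative log-probability the model assigns to the true class. The sketch below is an illustrative implementation, not the project's evaluation code; the function name and the clipping constant `eps` are assumptions.

```python
import math

def multiclass_log_loss(true_labels, probabilities, eps=1e-15):
    """Mean negative log-probability assigned to each sample's true class."""
    total = 0.0
    for label, probs in zip(true_labels, probabilities):
        # Clip the probability to avoid log(0) on confident wrong predictions.
        p = min(max(probs[label], eps), 1.0 - eps)
        total -= math.log(p)
    return total / len(true_labels)

# Made-up example with 3 classes and 2 samples.
labels = [0, 2]
probs = [[0.9, 0.05, 0.05], [0.1, 0.1, 0.8]]
loss = multiclass_log_loss(labels, probs)
print(round(loss, 4))  # 0.1643
```

As a reference point, a uniform guess over 9 classes scores ln 9, about 2.197, so lower values indicate confident, correct predictions.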

Dataset

The project utilizes the Microsoft Malware Classification Challenge (BIG 2015) dataset:

Samples: 10,868 training files.
Classes: 9 distinct malware families:
1. Ramnit | 2. Lollipop | 3. Kelihos_ver3 | 4. Vundo | 5. Simda | 6. Tracur | 7. Kelihos_ver1 | 8. Obfuscator.ACY | 9. Gatak.

Iterative Methodology & Evolution

The project followed an incremental improvement strategy to minimize log loss, the competition's scoring metric:

Milestone	Resolution	Key Features	Accuracy
Baseline	256x256	Simple 3-layer CNN	~80%
Stage 2	256x256	Increased depth & epochs	~85%
Stage 3	256x256	Deep 5-layer architecture	~90%
Stage 4	256x256	Added Batch Normalization	~95%
Final Static	512x512	LR Scheduler & Dropout	98.3%
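The table does not say which learning-rate scheduler was used in the final stage. As one possible illustration, the sketch below uses PyTorch's `ReduceLROnPlateau`, which lowers the learning rate when the validation loss stops improving. The scheduler choice, the `factor` and `patience` values, the stand-in model, and the loss values are all assumptions.

```python
import torch
from torch import nn

model = nn.Linear(4, 9)  # stand-in for the real CNN (assumption)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Halve the learning rate after the validation loss stalls for > 2 epochs.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=2
)

val_losses = [0.30, 0.30, 0.30, 0.30, 0.30]  # made-up plateau
for epoch, val_loss in enumerate(val_losses):
    # In a real loop, training and validation for the epoch would run here.
    scheduler.step(val_loss)
    print(epoch, optimizer.param_groups[0]["lr"])
```

In a real training loop, `scheduler.step` would receive the validation loss measured at the end of each epoch.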

Model Architecture

The final model is a custom deep CNN optimized for 512x512 inputs:

Feature Extractor: 6 Convolutional blocks with nn.BatchNorm2d and nn.ReLU.
Downsampling: 5 nn.MaxPool2d layers.
Global Pooling: nn.AdaptiveAvgPool2d((1, 1)) for resolution independence.
Classifier: Dense layer with nn.Dropout(p=0.2) to reduce overfitting.
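The components listed above could be assembled roughly as follows. This is a sketch under stated assumptions, not the trained model: the class name `MalwareCNN`, the channel widths, the 3x3 kernels, the placement of the pooling layers, and the position of the dropout before the linear layer are all hypothetical. Only the block count, the five pooling layers, the global pooling, and `p=0.2` come from the description.

```python
import torch
from torch import nn

def conv_block(in_ch: int, out_ch: int, pool: bool) -> nn.Sequential:
    """Conv -> BatchNorm -> ReLU, optionally followed by 2x2 max pooling."""
    layers = [
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    ]
    if pool:
        layers.append(nn.MaxPool2d(2))
    return nn.Sequential(*layers)

class MalwareCNN(nn.Module):
    def __init__(self, num_classes: int = 9,
                 channels=(16, 32, 64, 128, 256, 256)):  # widths are assumed
        super().__init__()
        blocks, in_ch = [], 1  # single-channel grayscale input
        for i, out_ch in enumerate(channels):
            # Pool after the first five blocks only: 6 conv blocks, 5 pools.
            blocks.append(conv_block(in_ch, out_ch, pool=i < 5))
            in_ch = out_ch
        self.features = nn.Sequential(*blocks)
        self.pool = nn.AdaptiveAvgPool2d((1, 1))
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Dropout(p=0.2), nn.Linear(in_ch, num_classes)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.pool(self.features(x)))

model = MalwareCNN().eval()
with torch.no_grad():
    for size in (256, 512):
        logits = model(torch.zeros(1, 1, size, size))
        print(size, tuple(logits.shape))  # (1, 9) for both sizes
```

Because the adaptive average pooling collapses any spatial size to 1x1, the same weights accept both 256x256 and 512x512 inputs, which is what "resolution independence" refers to.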

Tech Stack

Framework: PyTorch (v2.x)
GPU: NVIDIA GeForce RTX 3080 Ti (CUDA acceleration)
Language: Python 3.12
Libraries: pandas, NumPy, pathlib, tqdm, scikit-learn

Directory Structure

.
├── data
│   ├── processed_tensors_256
│   │   └── train
│   ├── processed_tensors_512
│   │   └── train
│   ├── test_raw
│   │   └── test
│   └── train_raw
│       └── train
├── models
│   └── all
└── submissions
    └── hidden

Repository Structure

Malware_Classification.ipynb: The main notebook containing all logic for data preprocessing, model architecture, training loops, and inference.
data/: Contains the raw Microsoft Malware dataset (.bytes and .asm files) and the processed_tensors_256/ and processed_tensors_512/ folders.
models/: Stores trained model weights (.pth files), including the final high-performance model.
submissions/: Stores generated submission.csv files for Kaggle benchmarking.
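As an illustration of what a file in submissions/ might contain, the sketch below writes per-class probabilities to CSV. The helper name `write_submission` is hypothetical, and the column layout (`Id`, then `Prediction1` through `Prediction9`) is assumed to match the competition's sample submission; it should be checked against that file before uploading.

```python
import csv
import io

def write_submission(rows, out_file):
    """Write (sample_id, probs) pairs, with 9 probabilities each, as CSV."""
    writer = csv.writer(out_file)
    # Header layout is assumed to follow the competition's sample submission.
    writer.writerow(["Id"] + [f"Prediction{i}" for i in range(1, 10)])
    for sample_id, probs in rows:
        writer.writerow([sample_id] + [f"{p:.6f}" for p in probs])

# Demo with an in-memory buffer and a made-up sample id.
buffer = io.StringIO()
write_submission([("sample_a", [1 / 9] * 9)], buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with `open(path, "w", newline="")` in place of the buffer.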

Future Work (Phase 2 & 3)

RGB Representation: Implementing a 3-channel image mapping (Red: Raw Bytes, Green: Local Entropy, Blue: ASM Metadata).
Benign vs. Malware: Extending the system to distinguish between safe Windows executables and malicious files.
Dynamic Analysis: Integrating sandbox-based behavior monitoring (API calls, network activity) as a secondary classification layer.
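The "Local Entropy" channel proposed for the RGB representation could be computed along these lines. This is only a sketch of one plausible approach: the function name, the non-overlapping windowing, and the window size of 256 bytes are assumptions, since the planned mapping is not specified in detail.

```python
import math
from collections import Counter

def window_entropy(data: bytes, window: int = 256):
    """Shannon entropy (bits per byte) of each non-overlapping window."""
    entropies = []
    for start in range(0, len(data), window):
        chunk = data[start:start + window]
        counts = Counter(chunk)
        # H = sum over byte values of -p * log2(p), ranging from 0 to 8 bits.
        h = sum(-(c / len(chunk)) * math.log2(c / len(chunk))
                for c in counts.values())
        entropies.append(h)
    return entropies

uniform = bytes(range(256))  # every byte value once: maximum entropy
constant = bytes(256)        # all zeros: minimum entropy
print(window_entropy(uniform + constant))  # [8.0, 0.0]
```

High-entropy windows often indicate compressed or encrypted regions, so scaling these values to the 0 to 255 range (for example `round(h / 8 * 255)`) would give one candidate intensity for the green channel.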

Author: Luka Marković
Project: PFE Camp '26 Proposal Implementation
