This repository contains the implementation of a deep learning-based malware classification system developed for the PFE Camp '26. The project explores the effectiveness of treating Windows executable binaries as 2D images, allowing a Convolutional Neural Network (CNN) to "see" and classify malware families based on visual byte-level textures.
Traditional malware detection relies on signatures or complex heuristics. This project instead implements a static analysis pipeline that converts the raw `.bytes` files from the Microsoft Malware Classification Challenge into grayscale images. Because a CNN learns spatially hierarchical features, the model can pick up structural patterns that are characteristic of specific malware families.
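The conversion step can be sketched roughly as follows. This is a minimal sketch, not the notebook's actual code: it assumes the BIG 2015 `.bytes` layout (an address followed by hex byte tokens on each line, with `??` marking unreadable bytes). The function names `parse_bytes_text` and `bytes_to_image`, the zero fill for `??`, and the index-sampling resize are all illustrative assumptions.

```python
import numpy as np


def parse_bytes_text(text: str) -> np.ndarray:
    """Turn the contents of a .bytes hex dump into a flat uint8 array."""
    values = []
    for line in text.splitlines():
        tokens = line.split()
        # The first token on each line is the address; the rest are byte values.
        for token in tokens[1:]:
            # Assumption: unreadable bytes ("??") are mapped to 0.
            values.append(0 if token == "??" else int(token, 16))
    return np.array(values, dtype=np.uint8)


def bytes_to_image(data: np.ndarray, side: int = 256) -> np.ndarray:
    """Fit a byte stream of any length into a side x side grayscale image.

    Assumption: the stream is resampled by picking evenly spaced byte
    positions, so long files are subsampled and short files are stretched.
    """
    if data.size == 0:
        return np.zeros((side, side), dtype=np.uint8)
    positions = np.linspace(0, data.size - 1, side * side).astype(np.int64)
    return data[positions].reshape(side, side)


# Hypothetical two-line excerpt in the .bytes format.
sample = "00401000 56 8D ?? FF\n00401010 00 01\n"
image = bytes_to_image(parse_bytes_text(sample), side=4)
```

Each pixel is one byte value in the range 0 to 255, so the resulting array can be saved as a single-channel tensor and fed to the CNN directly.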
- Validation Accuracy: 98.3%
- Kaggle Public Score (log loss, lower is better): 0.06005
- Kaggle Private Score (log loss, lower is better): 0.06321
- Leaderboard Rank: Top 175
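For context on the Kaggle scores, multi-class log loss is the mean negative log-probability the model assigns to the true class. The sketch below is illustrative only: the function name `multiclass_log_loss` and the clipping bound `eps` are assumptions, not taken from the notebook.

```python
import math


def multiclass_log_loss(probs, labels, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    probs:  list of per-sample probability rows (each row sums to 1).
    labels: list of zero-based true class indices.
    eps:    clipping bound so log(0) never occurs (an assumption here).
    """
    total = 0.0
    for row, label in zip(probs, labels):
        p = min(max(row[label], eps), 1.0 - eps)
        total += math.log(p)
    return -total / len(labels)


# A model that guesses uniformly over the 9 families scores ln(9).
uniform = [[1 / 9] * 9 for _ in range(4)]
baseline = multiclass_log_loss(uniform, [0, 3, 5, 8])
```

A uniform guess over nine classes gives ln(9), roughly 2.197, which is the reference point against which scores near 0.06 should be read.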
The project utilizes the Microsoft Malware Classification Challenge (BIG 2015) dataset:
- Samples: 10,868 training files.
- Classes: 9 distinct malware families:
  1. Ramnit | 2. Lollipop | 3. Kelihos_ver3 | 4. Vundo | 5. Simda | 6. Tracur | 7. Kelihos_ver1 | 8. Obfuscator.ACY | 9. Gatak
The project followed an incremental improvement strategy to optimize the Log Loss score:
| Milestone | Resolution | Key Features | Accuracy |
|---|---|---|---|
| Baseline | 256x256 | Simple 3-layer CNN | ~80% |
| Stage 2 | 256x256 | Increased depth & epochs | ~85% |
| Stage 3 | 256x256 | Deep 5-layer architecture | ~90% |
| Stage 4 | 256x256 | Added Batch Normalization | ~95% |
| Final Static | 512x512 | LR Scheduler & Dropout | 98.3% |
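The final row of the table combines a learning-rate scheduler with dropout. The README does not say which scheduler was used, so the sketch below assumes `StepLR` purely for illustration; the tiny stand-in model, the random data, and all hyperparameters are likewise assumptions.

```python
import torch
from torch import nn

# Tiny stand-in model; the real network is the CNN described in this README.
model = nn.Sequential(
    nn.Linear(8, 16),
    nn.ReLU(),
    nn.Dropout(p=0.2),
    nn.Linear(16, 9),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Assumption: halve the learning rate every 2 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.5)
criterion = nn.CrossEntropyLoss()

inputs = torch.randn(32, 8)            # random stand-in features
targets = torch.randint(0, 9, (32,))   # random stand-in labels

lr_history = []
for epoch in range(4):
    model.train()                      # train mode enables dropout
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()
    lr_history.append(optimizer.param_groups[0]["lr"])
    scheduler.step()                   # advance the schedule once per epoch
```

With these settings the first two epochs run at 1e-3 and the next two at 5e-4. Calling `model.eval()` before validation or inference switches dropout off.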
The final model is a custom deep CNN optimized for 512x512 inputs:
- Feature Extractor: 6 convolutional blocks with `nn.BatchNorm2d` and `nn.ReLU`.
- Downsampling: 5 `nn.MaxPool2d` layers.
- Global Pooling: `nn.AdaptiveAvgPool2d((1, 1))` for resolution independence.
- Classifier: Dense layer with `Dropout(p=0.2)` to prevent overfitting.
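One way the listed components could fit together is sketched below. Only the block count, the number of pooling layers, the global pooling layer, and the dropout rate come from this README. The class name `MalwareCNN`, the helper `conv_block`, the channel widths, the kernel size, and the placement of the pooling layers are assumptions.

```python
import torch
from torch import nn


def conv_block(in_ch: int, out_ch: int, pool: bool) -> nn.Sequential:
    """Conv -> BatchNorm -> ReLU, optionally followed by 2x2 max pooling."""
    layers = [
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    ]
    if pool:
        layers.append(nn.MaxPool2d(2))
    return nn.Sequential(*layers)


class MalwareCNN(nn.Module):
    def __init__(self, num_classes: int = 9,
                 widths=(16, 32, 64, 128, 256, 256)):  # assumed widths
        super().__init__()
        blocks = []
        in_ch = 1  # single-channel grayscale input
        for i, out_ch in enumerate(widths):
            # Pool after the first five blocks only: 6 blocks, 5 MaxPool2d.
            blocks.append(conv_block(in_ch, out_ch, pool=i < 5))
            in_ch = out_ch
        self.features = nn.Sequential(*blocks)
        # Collapses any spatial size to 1x1, so the input resolution may vary.
        self.pool = nn.AdaptiveAvgPool2d((1, 1))
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(p=0.2),
            nn.Linear(in_ch, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.pool(self.features(x)))


model = MalwareCNN().eval()
with torch.no_grad():
    logits = model(torch.zeros(1, 1, 512, 512))  # logits has shape (1, 9)
```

Five 2x2 pooling steps reduce a 512x512 input to a 16x16 feature map before global pooling. Because of the adaptive pooling layer, the same weights also accept 256x256 inputs.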
- Framework: PyTorch (v2.x)
- GPU: NVIDIA GeForce RTX 3080 Ti (CUDA acceleration)
- Language: Python 3.12
- Libraries: Pandas, NumPy, Pathlib, tqdm, Scikit-learn
.
├── data
│   ├── processed_tensors_256
│   │   └── train
│   ├── processed_tensors_512
│   │   └── train
│   ├── test_raw
│   │   └── test
│   └── train_raw
│       └── train
├── models
│   └── all
└── submissions
    └── hidden
- `Malware_Classification.ipynb`: The main notebook containing all logic for data preprocessing, model architecture, training loops, and inference.
- `data/`: Contains the raw Microsoft Malware dataset (`.bytes` and `.asm` files) and the `processed_tensors/` folder.
- `models/`: Stores trained model weights (`.pth` files), including the final high-performance model.
- `submissions/`: Stores generated `submission.csv` files for Kaggle benchmarking.
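Writing a submission file could look like the sketch below. It assumes the competition's column layout of `Id` followed by `Prediction1` through `Prediction9`. The helper name `write_submission` and the sample ID `sample_a` are hypothetical.

```python
import csv
import io


def write_submission(rows, out, num_classes: int = 9) -> None:
    """Write (file_id, probabilities) pairs as a Kaggle-style CSV.

    rows: iterable of (file_id, list of per-class probabilities).
    out:  any writable text stream, e.g. an open file or io.StringIO.
    """
    writer = csv.writer(out, lineterminator="\n")
    header = ["Id"] + [f"Prediction{i}" for i in range(1, num_classes + 1)]
    writer.writerow(header)
    for file_id, probs in rows:
        if len(probs) != num_classes:
            raise ValueError(f"{file_id}: expected {num_classes} probabilities")
        writer.writerow([file_id] + [f"{p:.6f}" for p in probs])


# Hypothetical single-row example written to an in-memory buffer.
buffer = io.StringIO()
write_submission([("sample_a", [1 / 9] * 9)], buffer)
```

In practice the probabilities would be the softmax of the model's logits for each test file, and `out` would be a file opened under `submissions/`.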
- RGB Representation: Implementing a 3-channel image mapping (Red: Raw Bytes, Green: Local Entropy, Blue: ASM Metadata).
- Benign vs. Malware: Extending the system to distinguish between safe Windows executables and malicious files.
- Dynamic Analysis: Integrating sandbox-based behavior monitoring (API calls, network activity) as a secondary classification layer.
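As a rough illustration of the proposed local entropy channel, Shannon entropy can be computed over fixed-size windows of the byte stream. This is a sketch of one possible approach, not existing project code: the function name `local_entropy`, the non-overlapping windows, and the window size of 256 bytes are assumptions.

```python
import numpy as np


def local_entropy(data: np.ndarray, window: int = 256) -> np.ndarray:
    """Shannon entropy in bits for each non-overlapping window of bytes.

    A trailing partial window is dropped. Values range from 0.0
    (all bytes identical) to 8.0 (all 256 byte values equally likely).
    """
    n_windows = data.size // window
    entropies = np.zeros(n_windows, dtype=np.float64)
    for i in range(n_windows):
        chunk = data[i * window:(i + 1) * window]
        counts = np.bincount(chunk, minlength=256)
        p = counts[counts > 0] / window
        entropies[i] = -(p * np.log2(p)).sum()
    return entropies


# One constant window followed by one window containing every byte value once.
constant = np.zeros(256, dtype=np.uint8)
varied = np.arange(256, dtype=np.uint8)
result = local_entropy(np.concatenate([constant, varied]))
```

Here `result` holds 0.0 for the constant window and 8.0 for the varied one. Scaling by 255 / 8 would map the values onto the same 0 to 255 range as the raw-byte channel.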
Author: Luka Marković
Project: PFE Camp '26 Proposal Implementation