FlowFrontiers/MalwareDet-JA4vsFlowStats

ECH-Resilient Malware Detection via Flow-Level Statistical Features

Supporting repository for the paper "ECH-Resilient Malware Detection via Flow-Level Statistical Features" by Márton Pál Lipcsey-Magyar, Attila Ármin Madarász, and Adrian Pekar.

Paper Abstract

The deployment of Encrypted Client Hello (ECH) challenges TLS fingerprinting, the dominant approach for encrypted malware detection. This paper presents a comprehensive evaluation of flow-based statistical features as an ECH-resilient alternative. Through rigorous validation against the official JA4+ implementation, we demonstrate that only 64.9% of malware families possess unique signatures, fundamentally limiting fingerprinting recall.

Our results show that Random Forest classifiers using combined flow statistics achieve 98.11% F1-score for binary malware detection with 97.22% recall—substantially exceeding fingerprinting's theoretical maximum of 64.9%. These findings establish flow-based classification as a practical approach for maintaining network security visibility as encryption technologies advance.

Repository Structure

├── reproduce-research/          # Validation pipelines
│   ├── paper-pipeline/          # Reproduce using the original authors' data
│   ├── nfstream-pipeline/       # Reproduce using NFStream extraction
│   └── verify-ja4-calculation/  # JA4+ conformance validation
│
└── paper-code/                  # Main classification system (Python)

See paper-code/README.md for detailed usage instructions.

Key Results

Binary Classification (Malware vs Benign)

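| Model          | Feature Set | Accuracy | Precision | Recall | F1-Score |
|----------------|-------------|----------|-----------|--------|----------|
| Random Forest  | Combined    | 97.07%   | 99.02%    | 97.22% | 98.11%   |
| Random Forest  | Core        | 96.55%   | 98.66%    | 96.91% | 97.78%   |
| Random Forest  | SPLT        | 96.65%   | 98.74%    | 96.95% | 97.84%   |
| Neural Network | Combined    | 90.03%   | 94.39%    | 92.78% | 93.58%   |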
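
As a quick sanity check on the table above, the F1-score column can be recomputed from the Precision and Recall columns, since F1 is their harmonic mean. The sketch below is not part of the repository; the helper name `f1_score` and the dictionary layout are illustrative assumptions, and the inputs are the rounded percentages from the table.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the binary classification table.
reported = {
    ("Random Forest", "Combined"): (0.9902, 0.9722),
    ("Random Forest", "Core"): (0.9866, 0.9691),
    ("Random Forest", "SPLT"): (0.9874, 0.9695),
    ("Neural Network", "Combined"): (0.9439, 0.9278),
}

# Recompute F1 as a percentage rounded to two decimals, like the table.
recomputed = {
    key: round(100 * f1_score(p, r), 2) for key, (p, r) in reported.items()
}

for (model, features), f1 in recomputed.items():
    print(f"{model:15s} {features:9s} F1 = {f1:.2f}%")
```

Under these inputs the recomputed values agree with the F1-Score column to two decimal places.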

Full Multiclass (101 Families)

| Model         | Feature Set | Accuracy | Macro F1 |
|---------------|-------------|----------|----------|
| Random Forest | Combined    | 61.62%   | 54.81%   |
| Random Forest | Core        | 59.66%   | 52.39%   |
| FAISS k-NN    | Combined    | 43.97%   | 34.30%   |
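
Macro F1 sits below accuracy in every row above. One common reason for such a gap is that macro F1 averages per-class F1 with equal weight, so poorly recognised small classes pull it down even when most flows are classified correctly. The toy sketch below illustrates only that mechanism; the class names and counts are made up, the helper names are hypothetical, and nothing here is taken from the repository's evaluation code.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} over every label seen in y_true or y_pred."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up example: a large family is always recognised, a small one never is.
y_true = ["common_family"] * 8 + ["rare_family"] * 2
y_pred = ["common_family"] * 10

acc = accuracy(y_true, y_pred)
macro = macro_f1(y_true, y_pred)
print(f"accuracy = {acc:.3f}, macro F1 = {macro:.3f}")
```

In this toy case accuracy is 0.8 while macro F1 is about 0.444, because the missed rare class contributes an F1 of zero to the average.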

Comparison with TLS Fingerprinting

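| Metric           | TLS Fingerprinting (JA4+JA4S+SNI) | Flow-Based ML (RF+Combined) |
|------------------|-----------------------------------|-----------------------------|
| Recall           | ≤64.9% (theoretical max)          | 97.22%                      |
| F1-Score         | ≤78.6%                            | 98.11%                      |
| ECH-Resilient    | No                                | Yes                         |
| Malware Coverage | 64.9%                             | 100%                        |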
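
The F1 ceiling for fingerprinting follows from the recall ceiling: for a fixed recall, F1 grows with precision, so the best case is perfect precision. The sketch below works that out under the assumption of precision = 1.0 and the rounded recall of 64.9%. The function name is hypothetical, and the paper's exact computation is not reproduced here; with these rounded inputs the result is about 78.7%, within rounding distance of the ≤78.6% reported above.

```python
def f1_upper_bound(max_recall: float, precision: float = 1.0) -> float:
    """F1 for a given recall ceiling; precision defaults to the best case (1.0)."""
    return 2 * precision * max_recall / (precision + max_recall)


# Assumption: recall ceiling taken as the rounded 64.9% from the table.
fingerprint_bound = f1_upper_bound(0.649)
print(f"F1 upper bound at 64.9% recall: {100 * fingerprint_bound:.1f}%")

# Lower precision can only reduce the bound, e.g. at 90% precision:
print(f"Same recall, 90% precision:     {100 * f1_upper_bound(0.649, 0.90):.1f}%")
```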

Interactive Exploration

Open paper-code/notebooks/malware_classification_experiments.ipynb for an interactive notebook with all experiments, visualizations, and analysis.

Dataset

The experiments use the malware traffic dataset from:

Matoušek, P., Přívora, J., & Ryšavý, O. (2024). "TLS Traffic Analysis: Malware Classification with JA4+ Fingerprints"

Dataset characteristics:

  • 16,542 flows across 101 families (59 malware, 42 benign)
  • Sources: Desktop malware, mobile malware, desktop apps, mobile apps
  • Authenticated and labeled network traces

Note: The full dataset is not included in this repository. Please refer to the original paper for access.

Features

Core Flow Statistics (33 features)

  • Volumetric: Packet counts, byte volumes (bidirectional, src→dst, dst→src)
  • Temporal: Flow duration per direction
  • Statistical: Packet size distributions (min, mean, stddev, max)
  • Timing: Packet inter-arrival times (PIAT) distributions
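
To make the four groups above concrete, the following sketch derives volumetric, temporal, packet-size, and PIAT statistics from a list of packets, once for the whole flow and once per direction. It is an illustration only, not the repository's extraction code: the packet tuple layout, the feature names, and the use of population standard deviation are all assumptions. Under this particular layout the function yields 11 values per view and 33 in total, which matches the feature count above, but the authors' exact feature list may differ.

```python
from statistics import mean, pstdev


def summarize(values):
    """min / mean / stddev / max of a list, with zeros for an empty list."""
    if not values:
        return {"min": 0.0, "mean": 0.0, "stddev": 0.0, "max": 0.0}
    return {
        "min": float(min(values)),
        "mean": float(mean(values)),
        "stddev": float(pstdev(values)),  # population stddev (an assumption)
        "max": float(max(values)),
    }


def flow_stats(packets):
    """packets: time-ordered list of (timestamp_ms, size_bytes, direction),
    where direction is "src2dst" or "dst2src" (hypothetical layout)."""
    views = {
        "bidirectional": packets,
        "src2dst": [p for p in packets if p[2] == "src2dst"],
        "dst2src": [p for p in packets if p[2] == "dst2src"],
    }
    features = {}
    for name, view in views.items():
        times = [t for t, _, _ in view]
        sizes = [s for _, s, _ in view]
        # Packet inter-arrival times: gaps between consecutive packets.
        piats = [later - earlier for earlier, later in zip(times, times[1:])]
        features[f"{name}_packets"] = len(view)
        features[f"{name}_bytes"] = sum(sizes)
        features[f"{name}_duration_ms"] = times[-1] - times[0] if times else 0
        for stat, value in summarize(sizes).items():
            features[f"{name}_{stat}_ps"] = value
        for stat, value in summarize(piats).items():
            features[f"{name}_{stat}_piat_ms"] = value
    return features


# Made-up three-packet flow.
example = [(0, 100, "src2dst"), (10, 200, "dst2src"), (30, 300, "src2dst")]
features = flow_stats(example)
print(len(features), features["bidirectional_bytes"], features["src2dst_mean_ps"])
```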

Sequential Packet Lengths (25 features)

  • First 25 packet sizes in arrival order
  • Captures protocol-specific patterns
  • Early detection capability
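
A fixed-length sequence feature needs a rule for flows shorter or longer than 25 packets. The sketch below truncates long flows and pads short ones; the padding value of -1 and the function name are assumptions made for illustration, not details confirmed by this README.

```python
SEQUENCE_LENGTH = 25
PAD_VALUE = -1  # assumption: marker for "no packet at this position"


def packet_length_sequence(packet_sizes, length=SEQUENCE_LENGTH, pad=PAD_VALUE):
    """Return the first `length` packet sizes in arrival order, padded to `length`."""
    head = list(packet_sizes[:length])
    return head + [pad] * (length - len(head))


# Made-up flows: one shorter and one longer than the sequence length.
short_flow = packet_length_sequence([517, 1460, 93])
long_flow = packet_length_sequence(list(range(40)))

print(short_flow[:5])   # first three sizes followed by padding
print(len(long_flow))   # truncated to the fixed length
```

Because only the first packets of a flow are used, a vector like this is available before the flow ends, which is what makes early detection possible.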

Combined Feature Set (58 features)

  • Synergy between macro-level (flow statistics) and micro-level (SPLT) patterns
  • Best performance across all tasks

Experimental Design

  • 3 Classification Tasks: Binary, Full Multiclass (101 classes), Malware-only (59 classes)
  • 3 Feature Sets: Core (33), SPLT (25), Combined (58)
  • 3 ML Models: Neural Network, Random Forest, FAISS k-NN
  • Total: 27 experimental configurations
  • Reproducibility: Fixed random seeds (42), stratified 80/20 splits
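
The design above can be sketched in a few lines: the 27 configurations are the Cartesian product of tasks, feature sets, and models, and a seeded stratified split keeps each class at the same 80/20 ratio. This is a standard-library illustration under stated assumptions, not the repository's code; the identifier names, the toy labels, and the hand-written `stratified_split` helper are all hypothetical.

```python
import itertools
import random
from collections import defaultdict

TASKS = ["binary", "full_multiclass", "malware_only"]
FEATURE_SETS = {"core": 33, "splt": 25, "combined": 58}
MODELS = ["neural_network", "random_forest", "faiss_knn"]
SEED = 42

# Every (task, feature set, model) combination: 3 x 3 x 3 = 27.
configurations = list(itertools.product(TASKS, FEATURE_SETS, MODELS))


def stratified_split(labels, test_fraction=0.2, seed=SEED):
    """Return (train_indices, test_indices) with per-class proportions preserved."""
    rng = random.Random(seed)  # local generator, so the split is repeatable
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Made-up labels: 50 malware flows and 50 benign flows.
labels = ["malware"] * 50 + ["benign"] * 50
train_idx, test_idx = stratified_split(labels)

print(len(configurations), len(train_idx), len(test_idx))
```

Calling `stratified_split` again with the same seed returns the same indices, which is the property a fixed seed is meant to provide.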

Authors

  • Márton Pál Lipcsey-Magyar - Budapest University of Technology and Economics
  • Attila Ármin Madarász - Budapest University of Technology and Economics
  • Adrian Pekar - Budapest University of Technology and Economics & CUJO LLC

Contact

For questions about the paper or code:

Acknowledgments

Supported by the János Bolyai Research Scholarship of the Hungarian Academy of Sciences and the Celtic-Next project RAI-6Green: Robust and AI Native 6G for Green Networks (C2023/1-9, funded by 2024-1.2.6-EUREKA-2024-00009).


Note: This repository contains the complete implementation and validation pipelines supporting the paper. All experimental results are reproducible using the provided code and methodology.
