LogBoost is a lightweight framework that boosts log-based anomaly detection by automatically reducing redundant log templates. Based on our proposed similarity measurement, it ranks log templates by importance and identifies templates that are ineffective for anomaly detection. By filtering out these "noise" templates, LogBoost optimizes the training data, thereby improving the efficiency and accuracy of downstream anomaly detection models.
In modern distributed systems, logs are generated at an unprecedented rate and often contain a vast amount of redundant information. Traditional anomaly detection models struggle with this noise, incurring high computational costs and reduced accuracy.
LogBoost addresses this by introducing a "Cherry-Picking" mechanism. It evaluates the contribution of different log templates to the anomaly detection task using a semantic similarity metric. By selectively preserving high-value log sequences and discarding redundant "noise," LogBoost acts as a universal enhancer for various downstream models, including Deep Learning (e.g., DeepLog, LogAnomaly) and Machine Learning (e.g., SVM, XGBoost) approaches.
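To make the idea concrete, here is a minimal sketch of similarity-based template filtering. The `cherry_pick` helper and the 0.95 threshold are illustrative assumptions, not the actual LogBoost API (which lives in `logboost/boost/`):

```python
import numpy as np

def cherry_pick(template_vectors: np.ndarray, threshold: float = 0.95):
    """Return indices of templates to keep (hypothetical helper).

    template_vectors: (n_templates, dim) matrix of semantic embeddings.
    threshold: templates whose cosine similarity to an already-kept
               template exceeds this value are treated as redundant "noise".
    """
    # Normalize rows so a dot product equals cosine similarity.
    norms = np.linalg.norm(template_vectors, axis=1, keepdims=True)
    vecs = template_vectors / np.clip(norms, 1e-12, None)

    kept = []
    for i, v in enumerate(vecs):
        # Keep a template only if it is not a near-duplicate of a kept one.
        if all(float(v @ vecs[j]) < threshold for j in kept):
            kept.append(i)
    return kept

# Example: five random 300-dimensional template embeddings.
vectors = np.random.rand(5, 300)
print(cherry_pick(vectors))
```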
- 🍒 Smart Cherry-Picking: Automatically identifies and filters out redundant log templates based on semantic similarity measurements, optimizing the quality of training data.
- 🚀 Performance Boosting: By reducing the dimensionality and noise of the input data, LogBoost significantly reduces training time while maintaining or improving detection accuracy.
- 📚 Comprehensive Model Zoo:
  - Deep Learning: Implementations of LSTM-based models (DeepLog) and semantic-based models (LogAnomaly).
  - Machine Learning: Wrappers for Random Forest, XGBoost, SVM, and Logistic Regression.
  - Sequence Matching: Includes robust sequence-matching algorithms such as RobustLog.
- 🛠️ End-to-End Pipeline: Provides a complete workflow from log parsing (Drain/Spell) and feature extraction to model training and evaluation, sketched in the toy example below.
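As a rough illustration of those pipeline stages, here is a toy, self-contained sketch. It is not the LogBoost implementation; it only mirrors the parse → extract order, with training and evaluation left to the models under `logboost/models/`:

```python
import re

def parse(lines):
    """Toy 'parsing' step: mask numbers so similar messages share a template."""
    return [re.sub(r"\d+", "<*>", ln) for ln in lines]

def to_sequences(templates, window=2):
    """Toy feature extraction: sliding windows over template IDs."""
    ids = {t: i for i, t in enumerate(dict.fromkeys(templates))}
    seq = [ids[t] for t in templates]
    return [seq[i:i + window] for i in range(len(seq) - window + 1)]

logs = [
    "Block 123 received",
    "Block 456 received",
    "Error on node 7",
]
templates = parse(logs)            # two unique templates after masking
windows = to_sequences(templates)  # model-ready index windows
print(templates)  # ['Block <*> received', 'Block <*> received', 'Error on node <*>']
print(windows)    # [[0, 0], [0, 1]]
```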
LogBoost/
├── logboost/
│ ├── boost/ # Core logic for Data Boosting & Cherry-Picking
│ ├── dataGenerator/ # Data preprocessing and vectorization
│ │ ├── tensor.py # PyTorch Tensor generation
│ │ ├── sample.py # Sliding window & negative sampling
│ │ └── ... # Specific processors for HDFS/Spark
│ ├── models/ # Model Implementations
│ │ ├── lstm.py # Deep Learning models (DeepLog, LogAnomaly)
│ │ └── ml.py # Machine Learning wrappers (XGB, RF, SVM)
│ └── utils/ # Utilities
│ ├── train.py # Training loops
│ ├── predict.py # Inference logic
│ └── visualize.py # Result visualization tools
├── demo/ # Entry points for running experiments
│ ├── boostlog.py # Main demo for the LogBoost algorithm
│ ├── deeplog.py # Baseline: DeepLog
│ ├── loganomaly.py # Baseline: LogAnomaly
│ ├── robustlog.py # Baseline: RobustLog
│ ├── xgb.py # Baseline: XGBoost
│ └── ...
├── data/ # Raw datasets (Zipped)
├── logparse_result/ # Intermediate parsing results (Templates/Vectors)
└── requirements.txt # Dependency list
- Python: 3.8+
- PyTorch: 1.8+ (CUDA recommended for Deep Learning models)
- Scikit-learn: For Machine Learning models
- Clone the repository:
git clone https://github.com/your-repo/LogBoost.git
cd LogBoost
- Install dependencies:
pip install -r requirements.txt
The repository contains compressed datasets and parsing results to save space. You must unzip them before running any models.
# Unzip raw data
cd data
unzip spark.zip
# (Optional: unzip other datasets if available)
# Unzip intermediate parsing results (Templates & Vectors)
cd ../logparse_result
unzip results.zip
cd ..
You can run different models using the scripts provided in the demo/ folder. The scripts require specific arguments to select the dataset, target setting, and whether to use the original data or LogBoost-enhanced data.
Command Syntax:
python demo/<model>.py <mode> <dataset> <target> <boost_type> <device>
Arguments:
- `model`: `deeplog`, `loganomaly`, or `robustlog`
- `mode`:
  - `train`: Train the model.
  - `predict`: Run inference (prediction).
  - `evaluation`: Evaluate model performance (not available for RobustLog).
- `dataset`: `hdfs` or `spark`
- `target`:
  - `deep`: Target the HDFS-A dataset setting.
  - `swiss`: Target the HDFS-B dataset setting.
  - `spark`: Target the Spark dataset setting.
- `boost_type`:
  - `origin`: Use original (baseline) data.
  - `boost`: Use LogBoost-enhanced data.
- `device`: `cpu` or `cuda`
Examples:
- Run DeepLog on HDFS-A (Original vs. Boosted):
# Train original DeepLog on CPU
python demo/deeplog.py train hdfs deep origin cpu
# Train LogBoost-enhanced DeepLog on CPU
python demo/deeplog.py train hdfs deep boost cpu
- Run LogAnomaly on Spark:
# Train original LogAnomaly
python demo/loganomaly.py train spark spark origin cpu
# Predict with LogBoost-enhanced LogAnomaly
python demo/loganomaly.py predict spark spark boost cpu
- Run RobustLog on HDFS-B:
python demo/robustlog.py train hdfs swiss boost cpu
Command Syntax:
python demo/<model>.py <dataset> <feature_type> <target> <boost_type>
Arguments:
- `model`: `xgb` or `randomforest`
- `dataset`: `hdfs` or `spark`
- `feature_type` (illustrated in the sketch below):
  - `seq`: Use sequence vectors.
  - `frq`: Use frequency vectors.
- `target`: `deep` (HDFS-A), `swiss` (HDFS-B), or `spark`
- `boost_type`: `origin` or `boost`
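For intuition, the two feature types can be illustrated like this. The values and encoding are assumed for the example only; the actual vectorization lives in `logboost/dataGenerator/`:

```python
# Toy illustration (assumed encoding) of the two feature types for a
# window of template IDs [3, 1, 3, 2] drawn from 4 known templates.
window = [3, 1, 3, 2]
n_templates = 4

# Sequence vector: the ordered template IDs themselves.
seq_vector = window

# Frequency vector: occurrence count of each template ID in the window.
frq_vector = [window.count(t) for t in range(1, n_templates + 1)]

print(seq_vector)  # [3, 1, 3, 2]
print(frq_vector)  # [1, 1, 2, 0]
```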
Examples:
- Run XGBoost on HDFS-A (Sequence Vector):
# Baseline
python demo/xgb.py hdfs seq deep origin
# Boosted
python demo/xgb.py hdfs seq deep boost
- Run RandomForest on Spark (Frequency Vector):
# Baseline
python demo/randomforest.py spark frq spark origin
# Boosted
python demo/randomforest.py spark frq spark boost
To generate the boosted datasets yourself (performing the "Cherry-Picking" analysis), you can run:
python demo/boostlog.py
(Note: You may need to modify the `options` dictionary inside `demo/boostlog.py` to select the specific dataset and parameters you wish to process.)
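For reference, such an options dictionary might look roughly like this. The key names below are assumptions for illustration; consult `demo/boostlog.py` for the actual fields:

```python
# Hypothetical shape of the options dictionary in demo/boostlog.py.
# Key names are assumptions -- check the script for the real ones.
options = {
    "dataset": "hdfs",             # hdfs or spark
    "target": "deep",              # deep (HDFS-A), swiss (HDFS-B), or spark
    "similarity_threshold": 0.95,  # cutoff for treating templates as redundant
    "output_dir": "logparse_result/",
}
```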
The LogBoost framework supports and compares against the following state-of-the-art models:
| Model | Type | Key Technology | Description |
|---|---|---|---|
| LogBoost | Ours | Cherry-Picking | Filters noise templates to boost downstream model performance. |
| DeepLog | Deep Learning | LSTM | Models log patterns as a natural language sequence. |
| LogAnomaly | Deep Learning | LSTM + Semantics | Utilizes template semantic vectors to handle new log patterns. |
| RobustLog | Sequence Match | Attention/Matching | Robust against parsing errors and noise. |
| XGBoost | Machine Learning | Gradient Boosting | High-performance classifier based on log count vectors. |
| RandomForest | Machine Learning | Ensemble | Baseline classifier using decision trees. |
| SVM | Machine Learning | Hyperplane | Standard baseline for linear classification tasks. |
The framework is optimized for standard log anomaly detection datasets:
- HDFS: Distributed file system logs.
- Spark: Large-scale data processing engine logs (Spark-SDA).
- (And other custom datasets processed via the `dataGenerator` module.)