LogBoost is a lightweight framework that boosts log-based anomaly detection by automatically reducing redundant log templates. Based on our proposed similarity measurement, it ranks log templates by importance and identifies templates that are ineffective for anomaly detection. By filtering out these "noise" templates, LogBoost optimizes the training data, thereby improving the efficiency and accuracy of downstream anomaly detection models.
In modern distributed systems, logs are generated at an unprecedented rate and often contain a vast amount of redundant information. Traditional anomaly detection models struggle with this noise, incurring high computational costs and reduced accuracy.
LogBoost addresses this by introducing a "Cherry-Picking" mechanism. It evaluates the contribution of different log templates to the anomaly detection task using a semantic similarity metric. By selectively preserving high-value log sequences and discarding redundant "noise," LogBoost acts as a universal enhancer for various downstream models, including Deep Learning (e.g., DeepLog, LogAnomaly) and Machine Learning (e.g., SVM, XGBoost) approaches.
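To make the idea concrete, here is a minimal sketch of similarity-based template filtering. The `cherry_pick` helper and the 0.95 threshold are illustrative assumptions, not the actual LogBoost API (which lives in `logboost/boost/`):

```python
import numpy as np

def cherry_pick(template_vectors: np.ndarray, threshold: float = 0.95):
    """Return indices of templates to keep (hypothetical helper).

    template_vectors: (n_templates, dim) matrix of semantic embeddings.
    threshold: templates whose cosine similarity to an already-kept
               template exceeds this value are treated as redundant "noise".
    """
    # Normalize rows so a dot product equals cosine similarity.
    norms = np.linalg.norm(template_vectors, axis=1, keepdims=True)
    vecs = template_vectors / np.clip(norms, 1e-12, None)

    kept = []
    for i, v in enumerate(vecs):
        # Keep a template only if it is not a near-duplicate of a kept one.
        if all(float(v @ vecs[j]) < threshold for j in kept):
            kept.append(i)
    return kept

# Example: five random 300-dimensional template embeddings.
vectors = np.random.rand(5, 300)
print(cherry_pick(vectors))
```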
- 🍒 Smart Cherry-Picking: Automatically identifies and filters out redundant log templates based on semantic similarity measurements, optimizing the quality of training data.
- 🚀 Performance Boosting: By reducing the dimensionality and noise of the input data, LogBoost significantly reduces training time while maintaining or improving detection accuracy.
- 📚 Comprehensive Model Zoo:
  - Deep Learning: Implementations of LSTM-based models (DeepLog) and semantic-based models (LogAnomaly).
  - Machine Learning: Wrappers for Random Forest, XGBoost, SVM, and Logistic Regression.
  - Sequence Matching: Includes robust sequence-matching algorithms such as RobustLog.
- 🛠️ End-to-End Pipeline: Provides a complete workflow from log parsing (Drain/Spell) and feature extraction to model training and evaluation, sketched in the toy example below.
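As a rough illustration of those pipeline stages, here is a toy, self-contained sketch. It is not the LogBoost implementation; it only mirrors the parse → extract order, with training and evaluation left to the models under `logboost/models/`:

```python
import re

def parse(lines):
    """Toy 'parsing' step: mask numbers so similar messages share a template."""
    return [re.sub(r"\d+", "<*>", ln) for ln in lines]

def to_sequences(templates, window=2):
    """Toy feature extraction: sliding windows over template IDs."""
    ids = {t: i for i, t in enumerate(dict.fromkeys(templates))}
    seq = [ids[t] for t in templates]
    return [seq[i:i + window] for i in range(len(seq) - window + 1)]

logs = [
    "Block 123 received",
    "Block 456 received",
    "Error on node 7",
]
templates = parse(logs)            # two unique templates after masking
windows = to_sequences(templates)  # model-ready index windows
print(templates)  # ['Block <*> received', 'Block <*> received', 'Error on node <*>']
print(windows)    # [[0, 0], [0, 1]]
```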
LogBoost/
├── logboost/
│ ├── boost/ # Core logic for Data Boosting & Cherry-Picking
│ ├── dataGenerator/ # Data preprocessing and vectorization
│ │ ├── tensor.py # PyTorch Tensor generation
│ │ ├── sample.py # Sliding window & negative sampling
│ │ └── ... # Specific processors for HDFS/Spark
│ ├── models/ # Model Implementations
│ │ ├── lstm.py # Deep Learning models (DeepLog, LogAnomaly)
│ │ └── ml.py # Machine Learning wrappers (XGB, RF, SVM)
│ └── utils/ # Utilities
│ ├── train.py # Training loops
│ ├── predict.py # Inference logic
│ └── visualize.py # Result visualization tools
├── demo/ # Entry points for running experiments
│ ├── boostlog.py # Main demo for the LogBoost algorithm
│ ├── deeplog.py # Baseline: DeepLog
│ ├── loganomaly.py # Baseline: LogAnomaly
│ ├── robustlog.py # Baseline: RobustLog
│ ├── xgb.py # Baseline: XGBoost
│ └── ...
├── data/ # Raw datasets (Zipped)
├── logparse_result/ # Intermediate parsing results (Templates/Vectors)
└── requirements.txt # Dependency list
- Python: 3.8+
- PyTorch: 1.8+ (CUDA recommended for Deep Learning models)
- Scikit-learn: For Machine Learning models
- Clone the repository:
git clone https://github.com/your-repo/LogBoost.git
cd LogBoost
- Install dependencies:
pip install -r requirements.txt
The repository contains compressed datasets and parsing results to save space. You must unzip them before running any models.
# Unzip raw data
cd data
unzip spark.zip
# (Optional: unzip other datasets if available)
# Unzip intermediate parsing results (Templates & Vectors)
cd ../logparse_result
unzip results.zip
cd ..
You can run different models using the scripts provided in the demo/ folder. The scripts require specific arguments to select the dataset, target setting, and whether to use the original data or LogBoost-enhanced data.
Command Syntax:
python demo/<model>.py <mode> <dataset> <target> <boost_type> <device>
Arguments:
- `model`: `deeplog`, `loganomaly`, or `robustlog`
- `mode`:
  - `train`: Train the model.
  - `predict`: Run inference (prediction).
  - `evaluation`: Evaluate model performance (not available for RobustLog).
- `dataset`: `hdfs` or `spark`
- `target`:
  - `deep`: Target the HDFS-A dataset setting.
  - `swiss`: Target the HDFS-B dataset setting.
  - `spark`: Target the Spark dataset setting.
- `boost_type`:
  - `origin`: Use original (baseline) data.
  - `boost`: Use LogBoost-enhanced data.
- `device`: `cpu` or `cuda`
Examples:
- Run DeepLog on HDFS-A (Original vs. Boosted):
# Train original DeepLog on CPU
python demo/deeplog.py train hdfs deep origin cpu
# Train LogBoost-enhanced DeepLog on CPU
python demo/deeplog.py train hdfs deep boost cpu
- Run LogAnomaly on Spark:
# Train original LogAnomaly
python demo/loganomaly.py train spark spark origin cpu
# Predict with LogBoost-enhanced LogAnomaly
python demo/loganomaly.py predict spark spark boost cpu
- Run RobustLog on HDFS-B:
python demo/robustlog.py train hdfs swiss boost cpu
Command Syntax:
python demo/<model>.py <dataset> <feature_type> <target> <boost_type>
Arguments:
- `model`: `xgb` or `randomforest`
- `dataset`: `hdfs` or `spark`
- `feature_type` (illustrated in the sketch below):
  - `seq`: Use sequence vectors.
  - `frq`: Use frequency vectors.
- `target`: `deep` (HDFS-A), `swiss` (HDFS-B), or `spark`
- `boost_type`: `origin` or `boost`
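For intuition, the two feature types can be illustrated like this. The values and encoding are assumed for the example only; the actual vectorization lives in `logboost/dataGenerator/`:

```python
# Toy illustration (assumed encoding) of the two feature types for a
# window of template IDs [3, 1, 3, 2] drawn from 4 known templates.
window = [3, 1, 3, 2]
n_templates = 4

# Sequence vector: the ordered template IDs themselves.
seq_vector = window

# Frequency vector: occurrence count of each template ID in the window.
frq_vector = [window.count(t) for t in range(1, n_templates + 1)]

print(seq_vector)  # [3, 1, 3, 2]
print(frq_vector)  # [1, 1, 2, 0]
```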
Examples:
- Run XGBoost on HDFS-A (Sequence Vector):
# Baseline
python demo/xgb.py hdfs seq deep origin
# Boosted
python demo/xgb.py hdfs seq deep boost
- Run RandomForest on Spark (Frequency Vector):
# Baseline
python demo/randomforest.py spark frq spark origin
# Boosted
python demo/randomforest.py spark frq spark boost
To generate the boosted datasets yourself (performing the "Cherry-Picking" analysis), you can run:
python demo/boostlog.py
(Note: You may need to modify the `options` dictionary inside `demo/boostlog.py` to select the specific dataset and parameters you wish to process.)
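For reference, such an options dictionary might look roughly like this. The key names below are assumptions for illustration; consult `demo/boostlog.py` for the actual fields:

```python
# Hypothetical shape of the options dictionary in demo/boostlog.py.
# Key names are assumptions -- check the script for the real ones.
options = {
    "dataset": "hdfs",             # hdfs or spark
    "target": "deep",              # deep (HDFS-A), swiss (HDFS-B), or spark
    "similarity_threshold": 0.95,  # cutoff for treating templates as redundant
    "output_dir": "logparse_result/",
}
```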
The LogBoost framework supports and compares against the following state-of-the-art models:
| Model | Type | Key Technology | Description |
|---|---|---|---|
| LogBoost | Ours | Cherry-Picking | Filters noise templates to boost downstream model performance. |
| DeepLog | Deep Learning | LSTM | Models log patterns as a natural language sequence. |
| LogAnomaly | Deep Learning | LSTM + Semantics | Utilizes template semantic vectors to handle new log patterns. |
| RobustLog | Sequence Match | Attention/Matching | Robust against parsing errors and noise. |
| XGBoost | Machine Learning | Gradient Boosting | High-performance classifier based on log count vectors. |
| RandomForest | Machine Learning | Ensemble | Baseline classifier using decision trees. |
| SVM | Machine Learning | Hyperplane | Standard baseline for linear classification tasks. |
The framework is optimized for standard log anomaly detection datasets:
- HDFS: Distributed file system logs.
- Spark: Large-scale data processing engine logs (Spark-SDA).
- (And other custom datasets processed via the `dataGenerator` module.)