LogGen is a deep learning-based framework designed to automate the generation and injection of logging statements into source code (Java/C++). It leverages a Multi-Task Learning architecture combined with Retrieval-Augmented Generation (RAG) to determine where to log (Location Prediction) and what to log (Content Generation).
Key features include:
- Multi-Task EMA Network: Simultaneously predicts function-level necessity and line-level insertion points using an Encoder-Multihead-Attention architecture.
- Semantic Feature Fusion: Combines positional embeddings with LLM-generated function summaries for robust context understanding.
- RAG-Enhanced Content: Retrieves similar historical log patterns based on component clustering to generate context-aware log messages.
- Syntactic Awareness: Filters noise (comments/empty lines) and utilizes explicit syntactic features for precise localization.
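For orientation, here is a minimal PyTorch sketch of how such an encoder + multi-head-attention network with two task heads can be wired (dimensions follow the defaults in `config/settings.yaml`; the class and layer names are illustrative, and the actual architecture lives in `src/model/ema_network.py`):

```python
import torch
import torch.nn as nn

class EMASketch(nn.Module):
    """Illustrative multi-task net: per-line tagging + function-level necessity."""

    def __init__(self, input_dim=389, hidden_dim=128, num_heads=4):
        super().__init__()
        self.adapter = nn.Linear(input_dim, hidden_dim)   # project fused line features
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.line_head = nn.Linear(hidden_dim, 1)         # where-to-log (per line)
        self.func_head = nn.Linear(hidden_dim, 1)         # whether-to-log (per function)

    def forward(self, x):                                 # x: (batch, lines, input_dim)
        h = torch.relu(self.adapter(x))
        h, _ = self.attn(h, h, h)                         # self-attention across lines
        line_logits = self.line_head(h).squeeze(-1)       # (batch, lines)
        func_logits = self.func_head(h.mean(dim=1)).squeeze(-1)  # (batch,)
        return line_logits, func_logits
```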
```
LogGen/
├── config/
│   ├── settings.yaml          # Central configuration for network, training, and paths
│   └── prompts.yaml           # Prompts used for LLM interaction
├── data/
│   ├── input_project/         # Place your raw source code here
│   ├── output_project/        # Resulting code with injected logs
│   ├── processed/             # Intermediate tensors (.pt) and component maps
│   └── cache/                 # Caches for LLM summaries to speed up processing
├── src/
│   ├── core/
│   │   ├── pipeline.py        # Inference logic (LLM summary -> Model -> Injection)
│   │   ├── trainer.py         # Training loop with multi-task & Focal Loss
│   │   └── constants.py       # Language-specific keywords (loggers, syntax)
│   ├── data_process/
│   │   ├── dataset_builder.py # Data preprocessing, vectorization, and labeling
│   │   ├── summarizer.py      # LLM-based function summarization module
│   │   └── code_parser.py     # Tree-sitter based code parsing logic
│   ├── model/
│   │   └── ema_network.py     # Neural network architecture (Adapter + EMA)
│   ├── rag/                   # RAG components (knowledge base & clustering)
│   └── services/              # Wrappers for the LLM (Ollama) and log generation
├── main.py                    # Entry point for the CLI
├── clean.py                   # Utility script to clean caches and outputs
└── requirements.txt           # Python dependencies
```
To run LogGen, ensure your environment meets the following requirements:
- OS: Linux / macOS / Windows (WSL recommended)
- Python: 3.8+
- Hardware: GPU recommended (NVIDIA CUDA) for efficient training and inference.
- External Services:
  - Ollama: Required for generating function summaries and log content.
  - Hugging Face Model: `sentence-transformers/all-MiniLM-L6-v2` (local copy recommended).
```bash
git clone https://github.com/your-username/LogGen.git
cd LogGen
```
Create a virtual environment and install the required packages.
```bash
conda create -n loggen python=3.10
conda activate loggen
pip install -r requirements.txt
```
Note: Ensure torch is installed with CUDA support if you have a GPU.
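To confirm the installed build can actually see your GPU, you can run:

```bash
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```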
Download and run Ollama, then pull the default model specified in `config/settings.yaml` (default: `llama3.1:8b`).
```bash
# In a separate terminal
ollama serve

# Pull the model
ollama pull llama3.1:8b
```
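To sanity-check that the server is reachable and the model was pulled (assuming Ollama's default port, 11434), you can list the locally available models:

```bash
curl http://localhost:11434/api/tags
```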
To avoid network timeouts, download `sentence-transformers/all-MiniLM-L6-v2` manually from Hugging Face and place it in the project root.
```
# Directory structure should look like:
# LogGen/all-MiniLM-L6-v2/config.json ...
```
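A quick way to verify the local copy loads correctly (using the `sentence-transformers` package from `requirements.txt`; the example sentence is illustrative):

```python
from sentence_transformers import SentenceTransformer

# Point at the local directory instead of the Hugging Face Hub
model = SentenceTransformer("./all-MiniLM-L6-v2")
vec = model.encode("parses the AST and emits tokens")
print(vec.shape)  # (384,) -- the semantic part of the 389-dim feature vector
```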
The project is driven by `config/settings.yaml`. You may need to modify the following based on your dataset (an illustrative excerpt follows this list):
- Project Paths:
  - `project.input_root`: Path to your training/testing source code.
- Network Hyperparameters:
  - `network.input_dim`: Fixed to 389 (384 semantic + 5 syntactic).
  - `network.hidden_dim`: Default 128.
- Training Tuning:
  - `training.log_threshold`: Confidence threshold for inference (default: 0.05).
  - `training.lambda_func_loss`: Weight for the auxiliary function-level task.
- Language Keywords:
  - Modify `src/core/constants.py` if your project uses custom logger names (e.g., `myLogger.debug`).
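The key names above follow the documentation, but the nesting and the `lambda_func_loss` value shown here are assumptions; treat the shipped `config/settings.yaml` as authoritative:

```yaml
project:
  input_root: data/input_project
network:
  input_dim: 389        # 384 semantic + 5 syntactic (fixed)
  hidden_dim: 128
training:
  log_threshold: 0.05   # confidence threshold for inference
  lambda_func_loss: 0.5 # illustrative value; weight of the function-level task
```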
This step parses the code, generates function summaries via the LLM, and constructs feature vectors.
Command:
```bash
python main.py build --input data/input_project
```
- Process:
  - Parses code using Tree-sitter.
  - Invokes the LLM to generate summaries for all functions (cached in `data/cache`).
  - Fuses semantic vectors + syntactic features + positional embeddings (see the sketch below).
  - Saves the dataset to `data/processed/dataset.pt`.
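The 389-dimensional input is simply the concatenation of the feature groups; a toy illustration of where that number comes from (the actual syntactic flags are defined in `src/data_process/dataset_builder.py`):

```python
import torch

semantic = torch.randn(384)                     # MiniLM embedding (fused with positional info)
syntactic = torch.tensor([1., 0., 0., 1., 0.])  # 5 explicit syntactic flags -- illustrative values
features = torch.cat([semantic, syntactic])     # per-line feature vector
assert features.shape == (389,)                 # matches network.input_dim
```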
Trains the EMA model using the Multi-Task objective (Line detection + Function classification).
Command:
```bash
python main.py train
```
- Metrics: Monitors line-level F1/precision/recall and function-level accuracy.
- Loss: Uses Focal Loss for line detection to handle class imbalance (a reference sketch follows this list).
- Artifacts: Best model weights are saved to `checkpoints/ema_model.pth`.
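For reference, a standard binary focal loss looks like the sketch below (`alpha`/`gamma` are the common defaults from Lin et al., 2017; `trainer.py` may use different values):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss: down-weights easy negatives so the rare
    'insert a log here' lines dominate the gradient."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)                                    # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()
```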
Predicts insertion points on new code and generates log content using RAG.
Command:
```bash
python main.py run --input data/test_project
```
- Logic:
  - Generates on-the-fly summaries for the target code.
  - Predicts insertion points using dynamic thresholding & NMS (see the sketch below).
  - Generates log statements and writes the modified code to `data/output_project`.
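A simplified sketch of that post-processing step (the gating formula and window size here are assumptions; `src/core/pipeline.py` holds the real logic):

```python
def select_insertion_points(line_probs, func_prob, base_thr=0.05, window=3):
    """Dynamic thresholding + 1-D NMS over per-line probabilities."""
    # A confident function-level "needs logging" score keeps the bar near
    # base_thr; a low score raises it -- this exact form is an assumption.
    thr = base_thr / max(func_prob, 1e-6)
    picks = []
    for i, p in enumerate(line_probs):
        if p < thr:
            continue
        lo, hi = max(0, i - window), min(len(line_probs), i + window + 1)
        if p >= max(line_probs[lo:hi]):  # NMS: keep only the local maximum
            picks.append(i)
    return picks

print(select_insertion_points([0.01, 0.60, 0.55, 0.02], func_prob=0.9))  # -> [1]
```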
To reset the project, remove caches, or clear training data, use the cleaning utility:
```bash
python clean.py
```
- Warning: This deletes `data/processed`, `data/output_project`, `checkpoints`, and `__pycache__`.
LogGen employs a weighted fusion mechanism to represent code functions:
- Positional embedding: based on the file path/package structure.
- Semantic embedding: derived from the function name + LLM summary.
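A minimal sketch of the fusion, assuming a simple convex combination (the weight and exact scheme are illustrative; see `src/data_process/dataset_builder.py` for the real implementation):

```python
import torch

alpha = 0.3               # fusion weight -- illustrative value
e_pos = torch.randn(384)  # positional embedding (file path / package structure)
e_sem = torch.randn(384)  # semantic embedding (function name + LLM summary)
fused = alpha * e_pos + (1 - alpha) * e_sem  # weighted fusion of the two views
```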
The model optimizes two objectives simultaneously:
- Line-Level Task: Sequence tagging (Log / No-Log) using Focal Loss.
- Function-Level Task: Binary classification (Does this function need logging?).
The Function-Level prediction acts as a global gate during inference, dynamically adjusting the sensitivity of the line-level detector.
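Putting the two objectives together, the combined training loss has the shape below, reusing the `focal_loss` sketch from the training section (`lambda_func` corresponds to `training.lambda_func_loss`; the additive form is an assumption):

```python
import torch.nn.functional as F

def multitask_loss(line_logits, line_targets, func_logits, func_targets, lambda_func=0.5):
    # Per-line focal loss + weighted auxiliary function-level BCE.
    line_loss = focal_loss(line_logits, line_targets)
    func_loss = F.binary_cross_entropy_with_logits(func_logits, func_targets)
    return line_loss + lambda_func * func_loss
```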