
LogGen: Integrating Traditional Model and LLM with Code Analysis for Precise Log Generation

LogGen is a deep learning-based framework designed to automate the generation and injection of logging statements into source code (Java/C++). It leverages a Multi-Task Learning architecture combined with Retrieval-Augmented Generation (RAG) to determine where to log (Location Prediction) and what to log (Content Generation).

Key features include:

  • Multi-Task EMA Network: Simultaneously predicts function-level necessity and line-level insertion points using an Encoder-Multihead-Attention architecture.
  • Semantic Feature Fusion: Combines positional embeddings with LLM-generated function summaries for robust context understanding.
  • RAG-Enhanced Content: Retrieves similar historical log patterns based on component clustering to generate context-aware log messages.
  • Syntactic Awareness: Filters noise (comments/empty lines) and utilizes explicit syntactic features for precise localization.

📂 Project Directory Structure

LogGen/
├── config/
│   ├── settings.yaml          # Central configuration for Network, Training, and Paths
│   └── prompts.yaml           # Prompts used for LLM interaction
├── data/
│   ├── input_project/         # Place your raw source code here
│   ├── output_project/        # Resulting code with injected logs
│   ├── processed/             # Intermediate tensors (.pt) and component maps
│   └── cache/                 # Caches for LLM summaries to speed up processing
├── src/
│   ├── core/
│   │   ├── pipeline.py        # Inference logic (LLM summary -> Model -> Injection)
│   │   ├── trainer.py         # Training loop with Multi-task & Focal Loss
│   │   └── constants.py       # Language-specific keywords (Loggers, syntax)
│   ├── data_process/
│   │   ├── dataset_builder.py # Data preprocessing, vectorization, and labeling
│   │   ├── summarizer.py      # LLM-based function summarization module
│   │   └── code_parser.py     # Tree-sitter based code parsing logic
│   ├── model/
│   │   └── ema_network.py     # Neural Network architecture (Adapter + EMA)
│   ├── rag/                   # RAG components (Knowledge Base & Clustering)
│   └── services/              # Wrappers for LLM (Ollama) and Log Generation
├── main.py                    # Entry point for the CLI
├── clean.py                   # Utility script to clean cache and outputs
└── requirements.txt           # Python dependencies


🛠️ Environment Prerequisites

To run LogGen, ensure your environment meets the following requirements:

  • OS: Linux / macOS / Windows (WSL recommended)
  • Python: 3.8+
  • Hardware: GPU recommended (NVIDIA CUDA) for efficient training and inference.
  • External Services:
     • Ollama: Required for generating function summaries and log content.
     • Hugging Face Model: sentence-transformers/all-MiniLM-L6-v2 (a local copy is recommended).

🚀 Installation & Setup

1. Clone the Repository

git clone https://github.com/your-username/LogGen.git
cd LogGen

2. Install Dependencies

Create a virtual environment and install the required packages.

conda create -n loggen python=3.10
conda activate loggen
pip install -r requirements.txt

Note: Ensure torch is installed with CUDA support if you have a GPU.
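
A quick way to verify the CUDA build (a minimal sketch; assumes PyTorch is already installed in the active environment):

# Verify that PyTorch can see the GPU
import torch
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))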

3. Setup LLM (Ollama)

Download and run Ollama. Pull the default model specified in config/settings.yaml (default: llama3.1:8b).

# In a separate terminal
ollama serve
# Pull the model
ollama pull llama3.1:8b

4. Local Embedding Model (Crucial)

To avoid network timeouts, download sentence-transformers/all-MiniLM-L6-v2 manually from Hugging Face and place it in the project root.

# Directory structure should look like:
# LogGen/all-MiniLM-L6-v2/config.json ...
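
One way to fetch the model into that location (a sketch assuming the huggingface_hub package is available; cloning the model repository with git-lfs works as well):

# Download the embedding model into the project root
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="sentence-transformers/all-MiniLM-L6-v2",
    local_dir="all-MiniLM-L6-v2",   # matches the directory layout shown above
)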

⚙️ Configuration & Key API Modifications

The project is driven by config/settings.yaml. You may need to adjust the following for your dataset (a sketch of reading these keys follows the list):

  1. Project Paths:
     • project.input_root: Path to your training/testing source code.
  2. Network Hyperparameters:
     • network.input_dim: Fixed at 389 (384 semantic + 5 syntactic features).
     • network.hidden_dim: Default 128.
  3. Training Tuning:
     • training.log_threshold: Confidence threshold for inference (default: 0.05).
     • training.lambda_func_loss: Weight of the auxiliary function-level loss.
  4. Language Keywords:
     • Modify src/core/constants.py if your project uses custom logger names (e.g., myLogger.debug).
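
A minimal sketch of reading these keys (assumes PyYAML and that the dotted names above map to nested YAML sections; check config/settings.yaml for the authoritative layout):

# Read the configuration keys listed above
import yaml

with open("config/settings.yaml") as f:
    cfg = yaml.safe_load(f)

input_root    = cfg["project"]["input_root"]          # source code to process
input_dim     = cfg["network"]["input_dim"]           # 389 = 384 semantic + 5 syntactic
hidden_dim    = cfg["network"]["hidden_dim"]          # default 128
log_threshold = cfg["training"]["log_threshold"]      # inference confidence threshold
lambda_func   = cfg["training"]["lambda_func_loss"]   # weight of the function-level loss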

📊 Workflow

Step 1: Data Preparation

This step parses the code, generates function summaries via LLM, and constructs feature vectors.

Command:

python main.py build --input data/input_project
  • Process:
  1. Parses code using Tree-sitter.
  2. Invokes LLM to generate summaries for all functions (Cached in data/cache).
  3. Fuses Semantic vectors + Syntactic features + Positional embeddings.
  4. Saves the dataset to data/processed/dataset.pt.
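
To spot-check the result (a minimal sketch; assumes dataset.pt is a standard torch-serialized object, as .pt files usually are):

# Peek at the artifacts produced by the build step
import torch

dataset = torch.load("data/processed/dataset.pt")
print(type(dataset))
if isinstance(dataset, dict):            # if it is a dict of tensors, list keys and shapes
    for key, value in dataset.items():
        print(key, getattr(value, "shape", None))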

Step 2: Model Training

Trains the EMA model using the Multi-Task objective (Line detection + Function classification).

Command:

python main.py train
  • Metrics: Monitors Line-Level F1/Precision/Recall and Function-Level Accuracy.
  • Loss: Uses Focal Loss for line detection to handle class imbalance.
  • Artifacts: Best model weights are saved to checkpoints/ema_model.pth.
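
For reference, a minimal binary focal-loss sketch in PyTorch (the standard formulation; the exact variant and hyperparameters used by src/core/trainer.py may differ):

# Standard binary focal loss, shown for orientation only
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # Down-weights easy negatives so the rare "insert a log here" lines dominate the loss
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = targets * p + (1 - targets) * (1 - p)              # probability of the true class
    alpha_t = targets * alpha + (1 - targets) * (1 - alpha)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()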

Step 3: Inference (Log Injection)

Predicts insertion points on new code and generates log content using RAG.

Command:

python main.py run --input data/test_project
  • Logic:
  1. Generates on-the-fly summaries for the target code.
  2. Predicts insertion points using Dynamic Thresholding & NMS.
  3. Generates log statements and writes the modified code to data/output_project.
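
The dynamic-thresholding and NMS logic lives in src/core/pipeline.py; as a rough illustration of the idea, a greedy 1-D non-maximum suppression over per-line scores could look like this (threshold and window values are illustrative):

# Greedy 1-D NMS over per-line insertion scores (illustration only)
def suppress(scores, threshold=0.05, window=1):
    # Keep a line only if it beats the threshold and is a local maximum within
    # +/- `window` lines, so adjacent candidates collapse to one insertion point.
    keep = []
    for i, s in enumerate(scores):
        if s < threshold:
            continue
        lo, hi = max(0, i - window), min(len(scores), i + window + 1)
        if s >= max(scores[lo:hi]):
            keep.append(i)
    return keep

print(suppress([0.01, 0.4, 0.35, 0.02, 0.6, 0.58]))   # -> [1, 4]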

🧹 Maintenance

To reset the project, remove caches, or clear training data, use the cleaning utility:

python clean.py
  • Warning: This deletes data/processed, data/output_project, checkpoints, and __pycache__.

🧩 Architectural Highlights

Feature Fusion Strategy

LogGen employs a weighted fusion mechanism to represent code functions with two complementary embeddings (a small sketch follows the list):

  • A positional embedding derived from the file path/package structure.
  • A semantic embedding derived from the function name plus its LLM-generated summary.
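
A minimal sketch of such a weighted fusion (the weight name and value are illustrative; the actual fusion is performed during the build step described above):

# Convex combination of positional and semantic embeddings (illustrative)
import torch

def fuse(pos_emb, sem_emb, weight=0.3):
    # Blend the path-based positional embedding with the LLM-summary-based
    # semantic embedding; `weight` here is a made-up illustrative value.
    return weight * pos_emb + (1.0 - weight) * sem_emb

fused = fuse(torch.randn(384), torch.randn(384))
print(fused.shape)   # torch.Size([384])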

Multi-Task Learning

The model optimizes two objectives simultaneously:

  1. Line-Level Task: Sequence tagging (Log / No-Log) using Focal Loss.
  2. Function-Level Task: Binary classification (Does this function need logging?).

The Function-Level prediction acts as a global gate during inference, dynamically adjusting the sensitivity of the line-level detector.
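
One simple way to realize such a gate (an illustrative assumption; the actual rule implemented in src/core/pipeline.py may differ) is to scale the per-line scores by the function-level probability before thresholding:

# Use the function-level probability to modulate the line-level scores
import torch

def gated_line_scores(line_logits, func_logit):
    # If the model thinks the whole function needs no logging, all line scores shrink;
    # if it is confident the function needs logging, line scores pass through almost unchanged.
    func_prob = torch.sigmoid(func_logit)       # P(function needs logging)
    line_probs = torch.sigmoid(line_logits)     # per-line insertion probabilities
    return line_probs * func_prob               # compared against training.log_threshold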

