
LogGen: Integrating Traditional Model and LLM with Code Analysis for Precise Log Generation

LogGen is a deep learning-based framework designed to automate the generation and injection of logging statements into source code (Java/C++). It leverages a Multi-Task Learning architecture combined with Retrieval-Augmented Generation (RAG) to determine where to log (Location Prediction) and what to log (Content Generation).

Key features include:

  • Multi-Task EMA Network: Simultaneously predicts function-level necessity and line-level insertion points using an Encoder-Multihead-Attention architecture.
  • Semantic Feature Fusion: Combines positional embeddings with LLM-generated function summaries for robust context understanding.
  • RAG-Enhanced Content: Retrieves similar historical log patterns based on component clustering to generate context-aware log messages.
  • Syntactic Awareness: Filters noise (comments/empty lines) and utilizes explicit syntactic features for precise localization.

📂 Project Directory Structure

LogGen/
├── config/
│   ├── settings.yaml          # Central configuration for Network, Training, and Paths
│   └── prompts.yaml           # Prompts used for LLM interaction
├── data/
│   ├── input_project/         # Place your raw source code here
│   ├── output_project/        # Resulting code with injected logs
│   ├── processed/             # Intermediate tensors (.pt) and component maps
│   └── cache/                 # Caches for LLM summaries to speed up processing
├── src/
│   ├── core/
│   │   ├── pipeline.py        # Inference logic (LLM summary -> Model -> Injection)
│   │   ├── trainer.py         # Training loop with Multi-task & Focal Loss
│   │   └── constants.py       # Language-specific keywords (Loggers, syntax)
│   ├── data_process/
│   │   ├── dataset_builder.py # Data preprocessing, vectorization, and labeling
│   │   ├── summarizer.py      # LLM-based function summarization module
│   │   └── code_parser.py     # Tree-sitter based code parsing logic
│   ├── model/
│   │   └── ema_network.py     # Neural Network architecture (Adapter + EMA)
│   ├── rag/                   # RAG components (Knowledge Base & Clustering)
│   └── services/              # Wrappers for LLM (Ollama) and Log Generation
├── main.py                    # Entry point for the CLI
├── clean.py                   # Utility script to clean cache and outputs
└── requirements.txt           # Python dependencies


🛠️ Environment Prerequisites

To run LogGen, ensure your environment meets the following requirements:

  • OS: Linux / macOS / Windows (WSL recommended)
  • Python: 3.8+
  • Hardware: GPU recommended (NVIDIA CUDA) for efficient training and inference.
  • External Services:
     • Ollama: Required for generating function summaries and log content.
     • Hugging Face Model: sentence-transformers/all-MiniLM-L6-v2 (a local copy is recommended).

🚀 Installation & Setup

1. Clone the Repository

git clone https://github.com/your-username/LogGen.git
cd LogGen

2. Install Dependencies

Create a virtual environment and install the required packages.

conda create -n loggen python=3.10
conda activate loggen
pip install -r requirements.txt

Note: Ensure torch is installed with CUDA support if you have a GPU.
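
A quick way to verify the CUDA build (a minimal sketch; assumes PyTorch is already installed in the active environment):

# Verify that PyTorch can see the GPU
import torch
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))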

3. Setup LLM (Ollama)

Download and run Ollama. Pull the default model specified in config/settings.yaml (default: llama3.1:8b).

# In a separate terminal
ollama serve
# Pull the model
ollama pull llama3.1:8b

4. Local Embedding Model (Crucial)

To avoid network timeouts, download sentence-transformers/all-MiniLM-L6-v2 manually from Hugging Face and place it in the project root.

# Directory structure should look like:
# LogGen/all-MiniLM-L6-v2/config.json ...
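
One way to fetch the model into that location (a sketch assuming the huggingface_hub package is available; cloning the model repository with git-lfs works as well):

# Download the embedding model into the project root
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="sentence-transformers/all-MiniLM-L6-v2",
    local_dir="all-MiniLM-L6-v2",   # matches the directory layout shown above
)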

⚙️ Configuration & Key API Modifications

The project is driven by config/settings.yaml. You may need to adjust the following for your dataset (a sketch of reading these keys follows the list):

  1. Project Paths:
     • project.input_root: Path to your training/testing source code.
  2. Network Hyperparameters:
     • network.input_dim: Fixed at 389 (384 semantic + 5 syntactic features).
     • network.hidden_dim: Default 128.
  3. Training Tuning:
     • training.log_threshold: Confidence threshold for inference (default: 0.05).
     • training.lambda_func_loss: Weight of the auxiliary function-level loss.
  4. Language Keywords:
     • Modify src/core/constants.py if your project uses custom logger names (e.g., myLogger.debug).
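
A minimal sketch of reading these keys (assumes PyYAML and that the dotted names above map to nested YAML sections; check config/settings.yaml for the authoritative layout):

# Read the configuration keys listed above
import yaml

with open("config/settings.yaml") as f:
    cfg = yaml.safe_load(f)

input_root    = cfg["project"]["input_root"]          # source code to process
input_dim     = cfg["network"]["input_dim"]           # 389 = 384 semantic + 5 syntactic
hidden_dim    = cfg["network"]["hidden_dim"]          # default 128
log_threshold = cfg["training"]["log_threshold"]      # inference confidence threshold
lambda_func   = cfg["training"]["lambda_func_loss"]   # weight of the function-level loss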

📊 Workflow

Step 1: Data Preparation

This step parses the code, generates function summaries via LLM, and constructs feature vectors.

Command:

python main.py build --input data/input_project
  • Process:
  1. Parses code using Tree-sitter.
  2. Invokes LLM to generate summaries for all functions (Cached in data/cache).
  3. Fuses Semantic vectors + Syntactic features + Positional embeddings.
  4. Saves the dataset to data/processed/dataset.pt.
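
To spot-check the result (a minimal sketch; assumes dataset.pt is a standard torch-serialized object, as .pt files usually are):

# Peek at the artifacts produced by the build step
import torch

dataset = torch.load("data/processed/dataset.pt")
print(type(dataset))
if isinstance(dataset, dict):            # if it is a dict of tensors, list keys and shapes
    for key, value in dataset.items():
        print(key, getattr(value, "shape", None))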

Step 2: Model Training

Trains the EMA model using the Multi-Task objective (Line detection + Function classification).

Command:

python main.py train
  • Metrics: Monitors Line-Level F1/Precision/Recall and Function-Level Accuracy.
  • Loss: Uses Focal Loss for line detection to handle class imbalance.
  • Artifacts: Best model weights are saved to checkpoints/ema_model.pth.
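
For reference, a minimal binary focal-loss sketch in PyTorch (the standard formulation; the exact variant and hyperparameters used by src/core/trainer.py may differ):

# Standard binary focal loss, shown for orientation only
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # Down-weights easy negatives so the rare "insert a log here" lines dominate the loss
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = targets * p + (1 - targets) * (1 - p)              # probability of the true class
    alpha_t = targets * alpha + (1 - targets) * (1 - alpha)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()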

Step 3: Inference (Log Injection)

Predicts insertion points on new code and generates log content using RAG.

Command:

python main.py run --input data/test_project
  • Logic:
  1. Generates on-the-fly summaries for the target code.
  2. Predicts insertion points using Dynamic Thresholding & NMS.
  3. Generates log statements and writes the modified code to data/output_project.
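
The dynamic-thresholding and NMS logic lives in src/core/pipeline.py; as a rough illustration of the idea, a greedy 1-D non-maximum suppression over per-line scores could look like this (threshold and window values are illustrative):

# Greedy 1-D NMS over per-line insertion scores (illustration only)
def suppress(scores, threshold=0.05, window=1):
    # Keep a line only if it beats the threshold and is a local maximum within
    # +/- `window` lines, so adjacent candidates collapse to one insertion point.
    keep = []
    for i, s in enumerate(scores):
        if s < threshold:
            continue
        lo, hi = max(0, i - window), min(len(scores), i + window + 1)
        if s >= max(scores[lo:hi]):
            keep.append(i)
    return keep

print(suppress([0.01, 0.4, 0.35, 0.02, 0.6, 0.58]))   # -> [1, 4]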

🧹 Maintenance

To reset the project, remove caches, or clear training data, use the cleaning utility:

python clean.py
  • Warning: This deletes data/processed, data/output_project, checkpoints, and __pycache__.

🧩 Architectural Highlights

Feature Fusion Strategy

LogGen employs a weighted fusion mechanism to represent code functions with two complementary embeddings (a small sketch follows the list):

  • A positional embedding derived from the file path/package structure.
  • A semantic embedding derived from the function name plus its LLM-generated summary.
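
A minimal sketch of such a weighted fusion (the weight name and value are illustrative; the actual fusion is performed during the build step described above):

# Convex combination of positional and semantic embeddings (illustrative)
import torch

def fuse(pos_emb, sem_emb, weight=0.3):
    # Blend the path-based positional embedding with the LLM-summary-based
    # semantic embedding; `weight` here is a made-up illustrative value.
    return weight * pos_emb + (1.0 - weight) * sem_emb

fused = fuse(torch.randn(384), torch.randn(384))
print(fused.shape)   # torch.Size([384])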

Multi-Task Learning

The model optimizes two objectives simultaneously:

  1. Line-Level Task: Sequence tagging (Log / No-Log) using Focal Loss.
  2. Function-Level Task: Binary classification (Does this function need logging?).

The Function-Level prediction acts as a global gate during inference, dynamically adjusting the sensitivity of the line-level detector.
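
One simple way to realize such a gate (an illustrative assumption; the actual rule implemented in src/core/pipeline.py may differ) is to scale the per-line scores by the function-level probability before thresholding:

# Use the function-level probability to modulate the line-level scores
import torch

def gated_line_scores(line_logits, func_logit):
    # If the model thinks the whole function needs no logging, all line scores shrink;
    # if it is confident the function needs logging, line scores pass through almost unchanged.
    func_prob = torch.sigmoid(func_logit)       # P(function needs logging)
    line_probs = torch.sigmoid(line_logits)     # per-line insertion probabilities
    return line_probs * func_prob               # compared against training.log_threshold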

