
Lightweight Feed-Forward Neural Network for language identification using character N-Grams, PyTorch, and Scikit-Learn.

Michael-Ott-03/language_classifier

N-Gram Language Classifier

Python PyTorch Scikit-Learn

🧠 Project Overview

This repository implements a Feed-Forward Neural Network (FNN) to classify the language of a single input word.

Instead of relying on large-scale embedding models such as BERT, which are computationally expensive, this project demonstrates how classical NLP feature engineering combined with a lightweight neural network can achieve high accuracy. It analyzes Character N-Grams to capture morphological patterns unique to each language (e.g., the suffix "ung" in German vs. "tion" in English).

Supported Languages: 🇬🇧 English | 🇩🇪 German | 🇪🇸 Spanish | 🇫🇷 French | 🇮🇹 Italian | 🇳🇴 Norwegian | 🇵🇱 Polish

⚙️ Methodology

1. Feature Engineering (N-Grams)

The raw text data is transformed into numerical vectors using a Bag-of-N-Grams approach.

  • Technique: Character-level N-Grams (Range: 2-4 characters).
  • Vectorization: CountVectorizer from Scikit-Learn.
  • Why: This captures sub-word structures such as prefixes, suffixes, and common character combinations, independently of a word's semantic meaning.

2. Neural Network Architecture

The classifier is a fully connected neural network built with PyTorch:

  • Input Layer: Matches the N-Gram vocabulary size (~5000 features).
  • Hidden Layers: Two linear layers with ReLU activation to learn non-linear decision boundaries.
  • Regularization: Dropout layers ($p=0.25$) prevent overfitting to specific N-Grams.
  • Output: Softmax probabilities for the 7 language classes.
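
A minimal PyTorch sketch of such an architecture follows; the class name and hidden-layer width are illustrative assumptions, not the repository's exact code:

```python
import torch
import torch.nn as nn

class LanguageFNN(nn.Module):
    """Fully connected classifier: input -> two hidden ReLU layers -> logits."""

    def __init__(self, vocab_size: int = 5000, hidden: int = 128, n_classes: int = 7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vocab_size, hidden),
            nn.ReLU(),
            nn.Dropout(p=0.25),           # regularization against memorizing N-Grams
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Dropout(p=0.25),
            nn.Linear(hidden, n_classes)  # raw logits; softmax applied afterwards
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

model = LanguageFNN()
probs = torch.softmax(model(torch.zeros(1, 5000)), dim=1)  # one row of 7 class probabilities
```

Keeping the final layer as raw logits (and applying softmax only at inference) is the conventional pairing with PyTorch's `CrossEntropyLoss`, which expects logits during training.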

3. Automated Data Pipeline

The system is designed for reproducibility. It automatically:

  1. Checks for local data.
  2. If missing, downloads raw word lists from verified GitHub sources.
  3. Samples data to ensure balanced classes.
  4. Splits into Stratified Train/Test sets.
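
The balancing and splitting steps above can be sketched as follows (the word lists and split ratio here are illustrative placeholders):

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy balanced dataset: 10 samples per language (placeholder words).
words = ["house", "haus", "casa", "maison", "amore", "hus", "dom"] * 10
labels = ["en", "de", "es", "fr", "it", "no", "pl"] * 10

# stratify=labels keeps per-language proportions identical in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    words, labels, test_size=0.2, stratify=labels, random_state=42
)

print(Counter(y_test))  # every language contributes exactly 2 test samples
```

Without `stratify`, a random 80/20 split could over- or under-represent a language in the test set, skewing the reported accuracy.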

🚀 Usage

Prerequisites

  • Python 3.10+
  • CUDA (Optional, supported for NVIDIA GPUs)

Installation

pip install -r requirements.txt

Running the Application

The main.py script handles the entire workflow (Training -> Inference -> Saving).

python main.py

Workflow

  1. Train: Select "y" to train a new model from scratch.
  2. Predict: Enter words interactively to see language probabilities.
  3. Save: Save your trained model and vectorizer for future use.
  4. Load: Next time you run the script, you can load your saved model instantly.

📂 Project Structure

language-classifier/
├── data/                  # Raw word lists (auto-downloaded)
├── models/                # Saved model checkpoints & metadata
├── src/
│   ├── config.py          # Hyperparameters & Paths
│   ├── dataset.py         # Downloader & Vectorization pipeline
│   ├── model.py           # PyTorch Neural Network Class
│   ├── train.py           # Training loop & Evaluation
│   ├── inference.py       # Prediction logic
│   └── utils.py           # Save/Load utilities
├── main.py                # Entry point
└── requirements.txt       # Dependencies

📊 Performance

The model typically achieves ~85% accuracy on unseen test data after 15 epochs, demonstrating that N-Gram features remain effective even for distinguishing closely related European languages with significant lexical overlap.

👤 Author

Michael Ott (Michael.Ott.03@posteo.de)
