
Lightweight Feed-Forward Neural Network for language identification using character N-Grams, PyTorch, and Scikit-Learn.

Michael-Ott-03/language_classifier

N-Gram Language Classifier

Python PyTorch Scikit-Learn

🧠 Project Overview

This repository implements a Feed-Forward Neural Network (FNN) to classify the language of a single input word.

Instead of relying on large-scale embedding models such as BERT, which are computationally expensive, this project demonstrates how classical NLP feature engineering combined with a lightweight neural network can achieve high accuracy. It analyzes Character N-Grams to capture morphological patterns unique to each language (e.g., the suffix "ung" in German vs. "tion" in English).

Supported Languages: 🇬🇧 English | 🇩🇪 German | 🇪🇸 Spanish | 🇫🇷 French | 🇮🇹 Italian | 🇳🇴 Norwegian | 🇵🇱 Polish

⚙️ Methodology

1. Feature Engineering (N-Grams)

The raw text data is transformed into numerical vectors using a Bag-of-N-Grams approach.

  • Technique: Character-level N-Grams (Range: 2-4 characters).
  • Vectorization: CountVectorizer from Scikit-Learn.
  • Why: This captures sub-word structures such as prefixes, suffixes, and common character combinations, independently of a word's semantic meaning.

2. Neural Network Architecture

The classifier is a fully connected neural network built with PyTorch:

  • Input Layer: Matches the N-Gram vocabulary size (~5000 features).
  • Hidden Layers: Two linear layers with ReLU activation to learn non-linear decision boundaries.
  • Regularization: Dropout layers ($p=0.25$) prevent overfitting to specific N-Grams.
  • Output: Softmax probabilities for the 7 language classes.
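
A minimal PyTorch sketch of such an architecture follows; the class name and hidden-layer width are illustrative assumptions, not the repository's exact code:

```python
import torch
import torch.nn as nn

class LanguageFNN(nn.Module):
    """Fully connected classifier: input -> two hidden ReLU layers -> logits."""

    def __init__(self, vocab_size: int = 5000, hidden: int = 128, n_classes: int = 7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vocab_size, hidden),
            nn.ReLU(),
            nn.Dropout(p=0.25),           # regularization against memorizing N-Grams
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Dropout(p=0.25),
            nn.Linear(hidden, n_classes)  # raw logits; softmax applied afterwards
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

model = LanguageFNN()
probs = torch.softmax(model(torch.zeros(1, 5000)), dim=1)  # one row of 7 class probabilities
```

Keeping the final layer as raw logits (and applying softmax only at inference) is the conventional pairing with PyTorch's `CrossEntropyLoss`, which expects logits during training.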

3. Automated Data Pipeline

The system is designed for reproducibility. It automatically:

  1. Checks for local data.
  2. If missing, downloads raw word lists from verified GitHub sources.
  3. Samples data to ensure balanced classes.
  4. Splits into Stratified Train/Test sets.
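
The balancing and splitting steps above can be sketched as follows (the word lists and split ratio here are illustrative placeholders):

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy balanced dataset: 10 samples per language (placeholder words).
words = ["house", "haus", "casa", "maison", "amore", "hus", "dom"] * 10
labels = ["en", "de", "es", "fr", "it", "no", "pl"] * 10

# stratify=labels keeps per-language proportions identical in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    words, labels, test_size=0.2, stratify=labels, random_state=42
)

print(Counter(y_test))  # every language contributes exactly 2 test samples
```

Without `stratify`, a random 80/20 split could over- or under-represent a language in the test set, skewing the reported accuracy.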

🚀 Usage

Prerequisites

  • Python 3.10+
  • CUDA (Optional, supported for NVIDIA GPUs)

Installation

pip install -r requirements.txt

Running the Application

The main.py script handles the entire workflow (Training -> Inference -> Saving).

python main.py

Workflow

  1. Train: Select "y" to train a new model from scratch.
  2. Predict: Enter words interactively to see language probabilities.
  3. Save: Save your trained model and vectorizer for future use.
  4. Load: Next time you run the script, you can load your saved model instantly.

📂 Project Structure

language-classifier/
├── data/                  # Raw word lists (auto-downloaded)
├── models/                # Saved model checkpoints & metadata
├── src/
│   ├── config.py          # Hyperparameters & Paths
│   ├── dataset.py         # Downloader & Vectorization pipeline
│   ├── model.py           # PyTorch Neural Network Class
│   ├── train.py           # Training loop & Evaluation
│   ├── inference.py       # Prediction logic
│   └── utils.py           # Save/Load utilities
├── main.py                # Entry point
└── requirements.txt       # Dependencies

📊 Performance

The model typically achieves ~85% accuracy on unseen test data after 15 epochs, demonstrating that N-Gram features remain effective even for distinguishing closely related European languages with significant lexical overlap.

👤 Author

Michael Ott (Michael.Ott.03@posteo.de)
