This repository implements a Feed-Forward Neural Network (FNN) to classify the language of a single input word.
Instead of relying on large-scale embeddings (like BERT), which are computationally expensive, this project demonstrates how classical NLP feature engineering combined with a lightweight neural network can achieve high accuracy. It analyzes character N-Grams to capture morphological patterns unique to each language (e.g., the suffix "ung" in German vs. "tion" in English).
Supported Languages: 🇬🇧 English | 🇩🇪 German | 🇪🇸 Spanish | 🇫🇷 French | 🇮🇹 Italian | 🇳🇴 Norwegian | 🇵🇱 Polish
The raw text data is transformed into numerical vectors using a Bag-of-N-Grams approach.
- Technique: Character-level N-Grams (Range: 2-4 characters).
- Vectorization: `CountVectorizer` from Scikit-Learn.
- Why: This captures sub-word structures like prefixes, suffixes, and common character combinations, independent of the word's semantic meaning.
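The feature extraction above can be sketched in a few lines (a minimal illustration; the exact `CountVectorizer` settings used in `dataset.py` may differ):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Character-level 2-4 grams, as described above.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(2, 4))

# Each row becomes a sparse bag-of-n-grams count vector.
X = vectorizer.fit_transform(["nation", "bildung"])

# The learned vocabulary contains language-typical fragments,
# e.g. "tion" (English) and "ung" (German).
```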
The classifier is a fully connected neural network built with PyTorch:
- Input Layer: Matches the N-Gram vocabulary size (~5000 features).
- Hidden Layers: Two linear layers with ReLU activation to learn non-linear decision boundaries.
- Regularization: Dropout layers ($p=0.25$) prevent overfitting to specific N-Grams.
- Output: Softmax probabilities for the 7 language classes.
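The architecture can be sketched roughly as follows (hidden-layer width is an assumption; the actual class in `model.py` may be structured differently):

```python
import torch
import torch.nn as nn

class LanguageClassifier(nn.Module):
    def __init__(self, vocab_size=5000, hidden=256, num_classes=7, dropout=0.25):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vocab_size, hidden),   # input layer matches vocab size
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden, hidden),       # second hidden layer
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden, num_classes),  # raw logits for 7 languages
        )

    def forward(self, x):
        # Return logits; softmax is applied at inference time only,
        # since nn.CrossEntropyLoss expects raw logits during training.
        return self.net(x)

model = LanguageClassifier()
probs = torch.softmax(model(torch.zeros(1, 5000)), dim=1)
```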
The system is designed for reproducibility. It automatically:
- Checks for local data.
- If missing, downloads raw word lists from verified GitHub sources.
- Samples data to ensure balanced classes.
- Splits into Stratified Train/Test sets.
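The balanced-sampling and stratified-split steps can be sketched like this (toy placeholder words; the real pipeline downloads full word lists):

```python
from collections import Counter
from sklearn.model_selection import train_test_split

langs = ["en", "de", "es", "fr", "it", "no", "pl"]
# Toy data: 4 placeholder words per language, so classes are balanced.
words = [f"word{i}_{lang}" for lang in langs for i in range(4)]
labels = [lang for lang in langs for _ in range(4)]

# stratify=labels keeps every class equally represented in both splits.
train_w, test_w, train_y, test_y = train_test_split(
    words, labels, test_size=0.25, stratify=labels, random_state=42
)
```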
- Python 3.10+
- CUDA (Optional, supported for NVIDIA GPUs)
```
pip install -r requirements.txt
```
The `main.py` script handles the entire workflow (Training -> Inference -> Saving).
```
python main.py
```
- Train: Select "y" to train a new model from scratch.
- Predict: Enter words interactively to see language probabilities.
- Save: Save your trained model and vectorizer for future use.
- Load: Next time you run the script, you can load your saved model instantly.
```
language-classifier/
├── data/               # Raw word lists (auto-downloaded)
├── models/             # Saved model checkpoints & metadata
├── src/
│   ├── config.py       # Hyperparameters & paths
│   ├── dataset.py      # Downloader & vectorization pipeline
│   ├── model.py        # PyTorch neural network class
│   ├── train.py        # Training loop & evaluation
│   ├── inference.py    # Prediction logic
│   └── utils.py        # Save/load utilities
├── main.py             # Entry point
└── requirements.txt    # Dependencies
```
The model typically achieves ~85% accuracy on unseen test data after 15 epochs. This demonstrates the effectiveness of N-Gram features, even when distinguishing between closely related European languages with significant lexical overlap.
Michael Ott (Michael.Ott.03@posteo.de)