Classify weather conditions (fog/smog, rain, sandstorm) from raw images using two ML pipelines — one fully unsupervised-to-supervised, and one iterative self-training — with no deep learning required.
Labeling image datasets is expensive and time-consuming. This project explores whether classical computer vision features combined with semi-supervised learning strategies can classify weather conditions with minimal labeled data — a practical constraint in real-world environmental monitoring.
The goal: achieve competitive accuracy using only aggregate spatial pixel statistics, K-Means clustering, and SVM classification.
| Category | Technology | Why |
|---|---|---|
| Language | Python 3 | Standard for ML/CV research |
| Image Processing | OpenCV, Pillow | Efficient resize, blur, grayscale pipeline |
| Numerical Computing | NumPy | Fast array operations on pixel matrices |
| ML — Unsupervised | scikit-learn KMeans | Cluster discovery without ground-truth labels |
| ML — Supervised | scikit-learn SVC | Probabilistic SVM enables confidence-gated self-training |
| Data Wrangling | pandas | CSV merging across weather classes |
Three weather categories sourced as image folders:
| Class | Label ID | Examples |
|---|---|---|
| Fog / Smog | 0 | ~854 images |
| Sandstorm | 1 | ~627 images |
| Rain | 2 | ~827 images |
Images are preprocessed to 224×224, Gaussian-blurred, converted to grayscale, normalized [0,1], and inverted. Each image is reduced to a 2D feature vector: the summed pixel values of its top half and bottom half — a simple but effective spatial descriptor for weather scenes.
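The normalize → invert → half-sum step above can be sketched in NumPy. This is a minimal illustration, not the repo's exact code: it assumes the image has already been resized to 224×224, blurred, and converted to grayscale (e.g. with OpenCV), and the function name `half_sum_features` is hypothetical.

```python
import numpy as np

def half_sum_features(gray):
    """gray: 224x224 uint8 grayscale image (already resized and blurred).

    Normalizes to [0, 1], inverts, and returns the summed pixel values
    of the top and bottom halves -- the 2-D descriptor used downstream.
    """
    inv = 1.0 - gray.astype(np.float64) / 255.0  # normalize, then invert
    return [float(inv[:112].sum()),              # top-half sum
            float(inv[112:].sum())]              # bottom-half sum
```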
- Two independent classification pipelines for comparative analysis of unsupervised vs. semi-supervised approaches.
- Zero-label K-Means baseline (`uSupToSup.py`): discovers 3 clusters, then uses a voting heuristic on known-label samples to resolve the cluster-to-class mapping — no labels are consumed during the clustering phase itself.
- Iterative self-training (`supRecurs.py`): starts with a small labeled seed and repeatedly promotes high-confidence (>70%) predictions to the training set across 100 iterations. Forces remaining uncertain samples into training at iteration 99 to guarantee convergence.
- Feature engineering over raw pixels: Gaussian blur reduces noise before feature extraction; inversion makes atmospheric haze (bright pixels) the dominant signal.
- Reproducible train/test splits: 80/20 random stratified sampling using `random.sample`.
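A per-class split with `random.sample` might look like the sketch below. This is an illustration of the stated strategy, not the repo's code; the function name `stratified_split` is hypothetical, and the explicit seed is an assumption added here to make the split reproducible.

```python
import random

def stratified_split(labels, test_frac=0.2, seed=42):
    """Draw an 80/20 split independently within each class label.

    Returns (train_indices, test_indices) over the dataset.
    """
    random.seed(seed)  # assumed: seeding for reproducibility
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train_idx, test_idx = [], []
    for idx in by_class.values():
        n_test = max(1, round(len(idx) * test_frac))
        test = set(random.sample(idx, n_test))  # sample within this class
        test_idx.extend(sorted(test))
        train_idx.extend(i for i in idx if i not in test)
    return train_idx, test_idx
```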
```mermaid
flowchart TD
    A["Raw Image Folders\nrain/ fogsmog/ sandstorm/"] --> B["folderProcess.py\nResize → Blur → Grayscale\n→ Normalize → Invert"]
    B -->|"per-class .csv"| C["combo.py\nDataset Merger"]
    C -->|"all224.csv"| D["all224.json\n2D Feature Vectors + Labels"]
    D --> E{"Choose Pipeline"}
    E -->|"Unsupervised → Supervised"| F["uSupToSup.py\nKMeans (k=3)\n→ Cluster-Label Vote Mapping\n→ SVM Classifier"]
    E -->|"Self-Training"| G["supRecurs.py\nSVC (prob=True)\n→ 100-Iteration Bootstrap\n(confidence threshold = 0.70)"]
    F --> H["Evaluate\nF1 Score (macro) + Accuracy"]
    G --> H
```
Edit the path and label in `folderProcess.py` for each class, then run:

```shell
python folderProcess.py   # generates rain.csv, fogsmog.csv, sandstorm.csv
python combo.py           # generates all224.csv
```

Note: The ML scripts expect `all224.json` — a JSON version of the combined features. Convert `all224.csv` to JSON or adapt the loader as needed.
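One way to do that conversion with pandas is sketched below. The records-oriented layout and the `csv_to_json` helper name are assumptions, not the scripts' documented format — adjust `orient` if the loader expects a different shape.

```python
import pandas as pd

def csv_to_json(csv_path, json_path):
    """Write the combined feature CSV back out as records-oriented JSON.

    Assumed layout: a flat list of {column: value} records; adapt the
    `orient` argument if your loader wants something else.
    """
    pd.read_csv(csv_path).to_json(json_path, orient="records")
```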
```shell
# Unsupervised → Supervised (KMeans + SVM)
python uSupToSup.py

# Semi-supervised self-training
python supRecurs.py
```

Both scripts print macro F1 score and accuracy on the held-out test set.
```shell
pip install opencv-python pillow numpy pandas scikit-learn
```

K-Means assigns clusters {0, 1, 2} arbitrarily — there's no guarantee cluster 0 corresponds to fog/smog. Naively evaluating cluster assignments against true labels produces misleading metrics.
Solution (uSupToSup.py:81–168): After clustering the training set, the code finds the index of the first occurrence of each known label in the sorted dataset. It then tallies how the K-Means predictions distribute across that label's slice of data, and assigns the cluster ID that appears most often to that weather class. This voting heuristic resolves the permutation ambiguity without requiring any labeled training examples during the K-Means phase itself.
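The voting idea can be condensed as follows. This is a simplified sketch of the heuristic, not the code at those line numbers, and the function name `map_clusters_to_classes` is hypothetical; note that with badly separated clusters two classes could vote for the same cluster ID, which the sketch does not guard against.

```python
from collections import Counter

def map_clusters_to_classes(cluster_preds, true_labels):
    """For each true class, pick the cluster ID K-Means assigned most
    often to that class's samples. Returns {cluster_id: class_label}."""
    mapping = {}
    for cls in set(true_labels):
        votes = Counter(c for c, y in zip(cluster_preds, true_labels) if y == cls)
        mapping[votes.most_common(1)[0][0]] = cls  # majority vote wins
    return mapping
```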
In early iterations, the SVC trained on a small seed may not assign >70% probability to any unlabeled sample, causing the unlabeled pool to never shrink.
Solution (supRecurs.py:116–125): At iteration 99 (the final pass), the confidence gate is lifted and every remaining unlabeled sample is force-assigned the argmax class. This guarantees the model trains on the full dataset by iteration 100, preventing infinite stagnation while still preserving the high-confidence-first ordering across the preceding 99 rounds.
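The loop structure described above — confidence gate, then a force-assign on the final pass — can be sketched like this. It is a minimal reconstruction under the stated 0.70 threshold, not the code in `supRecurs.py`; the `self_train` name and argument layout are assumptions.

```python
import numpy as np
from sklearn.svm import SVC

def self_train(X_seed, y_seed, X_pool, threshold=0.70, max_iter=100):
    """Confidence-gated self-training: promote confident predictions each
    round; on the final pass the gate is lifted and every remaining
    sample is force-assigned its argmax class."""
    X_train, y_train = list(X_seed), list(y_seed)
    pool = list(X_pool)
    clf = SVC(probability=True)  # probabilistic SVM for the gate
    for it in range(max_iter):
        clf.fit(np.asarray(X_train), np.asarray(y_train))
        if not pool:
            break
        proba = clf.predict_proba(np.asarray(pool))
        final_pass = it == max_iter - 1
        keep = []
        for x, p in zip(pool, proba):
            if p.max() > threshold or final_pass:
                X_train.append(x)
                y_train.append(clf.classes_[p.argmax()])
            else:
                keep.append(x)  # stays in the unlabeled pool
        pool = keep
    clf.fit(np.asarray(X_train), np.asarray(y_train))  # fit on everything
    return clf
```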
```
weatherIdentification/
├── folderProcess.py   # Image preprocessing pipeline
├── combo.py           # Dataset CSV merger
├── uSupToSup.py       # Unsupervised → Supervised (KMeans + SVM)
├── supRecurs.py       # Semi-supervised self-training (SVC)
├── all224.csv         # Combined feature/label store
└── data.zip           # Raw image archive
```
Companion paper: "Weather Classification using Semi-Supervised Learning" (included in repo)