Classify weather conditions (fog/smog, rain, sandstorm) from raw images using two ML pipelines — one fully unsupervised-to-supervised, and one iterative self-training — with no deep learning required.
Labeling image datasets is expensive and time-consuming. This project explores whether classical computer vision features combined with semi-supervised learning strategies can classify weather conditions with minimal labeled data — a practical constraint in real-world environmental monitoring.
The goal: achieve competitive accuracy using only aggregate spatial pixel statistics, K-Means clustering, and SVM classification.
| Category | Technology | Why |
|---|---|---|
| Language | Python 3 | Standard for ML/CV research |
| Image Processing | OpenCV, Pillow | Efficient resize, blur, grayscale pipeline |
| Numerical Computing | NumPy | Fast array operations on pixel matrices |
| ML — Unsupervised | scikit-learn KMeans | Cluster discovery without ground-truth labels |
| ML — Supervised | scikit-learn SVC | Probabilistic SVM enables confidence-gated self-training |
| Data Wrangling | pandas | CSV merging across weather classes |
Three weather categories sourced as image folders:
| Class | Label ID | Examples |
|---|---|---|
| Fog / Smog | 0 | ~854 images |
| Sandstorm | 1 | ~627 images |
| Rain | 2 | ~827 images |
Images are preprocessed to 224×224, Gaussian-blurred, converted to grayscale, normalized [0,1], and inverted. Each image is reduced to a 2D feature vector: the summed pixel values of its top half and bottom half — a simple but effective spatial descriptor for weather scenes.
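The normalize → invert → half-sum step above can be sketched in NumPy. This is a minimal illustration, not the repo's exact code: it assumes the image has already been resized to 224×224, blurred, and converted to grayscale (e.g. with OpenCV), and the function name `half_sum_features` is hypothetical.

```python
import numpy as np

def half_sum_features(gray):
    """gray: 224x224 uint8 grayscale image (already resized and blurred).

    Normalizes to [0, 1], inverts, and returns the summed pixel values
    of the top and bottom halves -- the 2-D descriptor used downstream.
    """
    inv = 1.0 - gray.astype(np.float64) / 255.0  # normalize, then invert
    return [float(inv[:112].sum()),              # top-half sum
            float(inv[112:].sum())]              # bottom-half sum
```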
- Two independent classification pipelines for comparative analysis of unsupervised vs. semi-supervised approaches.
- Zero-label K-Means baseline (`uSupToSup.py`): discovers 3 clusters, then uses a voting heuristic on known-label samples to resolve the cluster-to-class mapping — no labels are consumed during the clustering phase itself.
- Iterative self-training (`supRecurs.py`): starts with a small labeled seed and repeatedly promotes high-confidence (>70%) predictions to the training set across 100 iterations. Forces remaining uncertain samples into training at iteration 99 to guarantee convergence.
- Feature engineering over raw pixels: Gaussian blur reduces noise before feature extraction; inversion makes atmospheric haze (bright pixels) the dominant signal.
- Reproducible train/test splits: 80/20 random stratified sampling using `random.sample`.
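A per-class split with `random.sample` might look like the sketch below. This is an illustration of the stated strategy, not the repo's code; the function name `stratified_split` is hypothetical, and the explicit seed is an assumption added here to make the split reproducible.

```python
import random

def stratified_split(labels, test_frac=0.2, seed=42):
    """Draw an 80/20 split independently within each class label.

    Returns (train_indices, test_indices) over the dataset.
    """
    random.seed(seed)  # assumed: seeding for reproducibility
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train_idx, test_idx = [], []
    for idx in by_class.values():
        n_test = max(1, round(len(idx) * test_frac))
        test = set(random.sample(idx, n_test))  # sample within this class
        test_idx.extend(sorted(test))
        train_idx.extend(i for i in idx if i not in test)
    return train_idx, test_idx
```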
```mermaid
flowchart TD
    A["Raw Image Folders\nrain/ fogsmog/ sandstorm/"] --> B["folderProcess.py\nResize → Blur → Grayscale\n→ Normalize → Invert"]
    B -->|"per-class .csv"| C["combo.py\nDataset Merger"]
    C -->|"all224.csv"| D["all224.json\n2D Feature Vectors + Labels"]
    D --> E{"Choose Pipeline"}
    E -->|"Unsupervised → Supervised"| F["uSupToSup.py\nKMeans (k=3)\n→ Cluster-Label Vote Mapping\n→ SVM Classifier"]
    E -->|"Self-Training"| G["supRecurs.py\nSVC (prob=True)\n→ 100-Iteration Bootstrap\n(confidence threshold = 0.70)"]
    F --> H["Evaluate\nF1 Score (macro) + Accuracy"]
    G --> H
```
Edit the path and label in `folderProcess.py` for each class, then run:

```shell
python folderProcess.py   # generates rain.csv, fogsmog.csv, sandstorm.csv
python combo.py           # generates all224.csv
```

Note: The ML scripts expect `all224.json` — a JSON version of the combined features. Convert `all224.csv` to JSON or adapt the loader as needed.
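One way to do that conversion with pandas is sketched below. The records-oriented layout and the `csv_to_json` helper name are assumptions, not the scripts' documented format — adjust `orient` if the loader expects a different shape.

```python
import pandas as pd

def csv_to_json(csv_path, json_path):
    """Write the combined feature CSV back out as records-oriented JSON.

    Assumed layout: a flat list of {column: value} records; adapt the
    `orient` argument if your loader wants something else.
    """
    pd.read_csv(csv_path).to_json(json_path, orient="records")
```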
```shell
# Unsupervised → Supervised (KMeans + SVM)
python uSupToSup.py

# Semi-supervised self-training
python supRecurs.py
```

Both scripts print macro F1 score and accuracy on the held-out test set.
```shell
pip install opencv-python pillow numpy pandas scikit-learn
```

K-Means assigns clusters {0, 1, 2} arbitrarily — there's no guarantee cluster 0 corresponds to fog/smog. Naively evaluating cluster assignments against true labels produces misleading metrics.
Solution (uSupToSup.py:81–168): After clustering the training set, the code finds the index of the first occurrence of each known label in the sorted dataset. It then tallies how the K-Means predictions distribute across that label's slice of data, and assigns the cluster ID that appears most often to that weather class. This voting heuristic resolves the permutation ambiguity without requiring any labeled training examples during the K-Means phase itself.
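The voting idea can be condensed as follows. This is a simplified sketch of the heuristic, not the code at those line numbers, and the function name `map_clusters_to_classes` is hypothetical; note that with badly separated clusters two classes could vote for the same cluster ID, which the sketch does not guard against.

```python
from collections import Counter

def map_clusters_to_classes(cluster_preds, true_labels):
    """For each true class, pick the cluster ID K-Means assigned most
    often to that class's samples. Returns {cluster_id: class_label}."""
    mapping = {}
    for cls in set(true_labels):
        votes = Counter(c for c, y in zip(cluster_preds, true_labels) if y == cls)
        mapping[votes.most_common(1)[0][0]] = cls  # majority vote wins
    return mapping
```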
In early iterations, the SVC trained on a small seed may not assign >70% probability to any unlabeled sample, causing the unlabeled pool to never shrink.
Solution (supRecurs.py:116–125): At iteration 99 (the final pass), the confidence gate is lifted and every remaining unlabeled sample is force-assigned the argmax class. This guarantees the model trains on the full dataset by iteration 100, preventing infinite stagnation while still preserving the high-confidence-first ordering across the preceding 99 rounds.
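The loop structure described above — confidence gate, then a force-assign on the final pass — can be sketched like this. It is a minimal reconstruction under the stated 0.70 threshold, not the code in `supRecurs.py`; the `self_train` name and argument layout are assumptions.

```python
import numpy as np
from sklearn.svm import SVC

def self_train(X_seed, y_seed, X_pool, threshold=0.70, max_iter=100):
    """Confidence-gated self-training: promote confident predictions each
    round; on the final pass the gate is lifted and every remaining
    sample is force-assigned its argmax class."""
    X_train, y_train = list(X_seed), list(y_seed)
    pool = list(X_pool)
    clf = SVC(probability=True)  # probabilistic SVM for the gate
    for it in range(max_iter):
        clf.fit(np.asarray(X_train), np.asarray(y_train))
        if not pool:
            break
        proba = clf.predict_proba(np.asarray(pool))
        final_pass = it == max_iter - 1
        keep = []
        for x, p in zip(pool, proba):
            if p.max() > threshold or final_pass:
                X_train.append(x)
                y_train.append(clf.classes_[p.argmax()])
            else:
                keep.append(x)  # stays in the unlabeled pool
        pool = keep
    clf.fit(np.asarray(X_train), np.asarray(y_train))  # fit on everything
    return clf
```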
```
weatherIdentification/
├── folderProcess.py   # Image preprocessing pipeline
├── combo.py           # Dataset CSV merger
├── uSupToSup.py       # Unsupervised → Supervised (KMeans + SVM)
├── supRecurs.py       # Semi-supervised self-training (SVC)
├── all224.csv         # Combined feature/label store
└── data.zip           # Raw image archive
```
Companion paper: "Weather Classification using Semi-Supervised Learning" (included in repo)