This is a solution to the classification challenge Pump it Up: Data Mining the Water Table by DrivenData, available at
https://www.drivendata.org/competitions/7/pump-it-up-data-mining-the-water-table/, built with scikit-learn in a Jupyter notebook.
The following pre-processing and feature engineering techniques were used in my solution.
- Converted all categorical values in the dataframe to lowercase.
- Every value that appears fewer than 30 times in a column (categorical features only) was replaced with `other`.
- Handled missing values. Missing values of the following features were filled with the corresponding values.

  | Feature Name | Value |
  | --- | --- |
  | construction_year | 1959 |
  | longitude | mean of the longitude grouped by district_code |
  | latitude | mean of the latitude grouped by district_code |
  | scheme_name | -1 |
  | scheme_management | -1 |
  | funder | -1 |
  | subvillage | -1 |
  | installer | -1 |
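The steps above can be sketched as follows, assuming the data sits in a pandas DataFrame with the competition's column names; the toy values and the threshold of 2 (the write-up uses 30) are only for illustration.

```python
# Sketch of lowercasing, rare-value replacement, and missing-value
# filling; toy data, and threshold=2 instead of the write-up's 30.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "district_code": [1, 1, 2, 2],
    "longitude": [35.0, np.nan, 33.0, 34.0],
    "latitude": [np.nan, -6.0, -4.0, -5.0],
    "construction_year": [np.nan, 1990.0, 2005.0, np.nan],
    "funder": ["GOV", "Gov", "UNICEF", np.nan],
})

# 1) Lowercase every categorical (object-dtype) value.
cat_cols = df.select_dtypes(include="object").columns
for col in cat_cols:
    df[col] = df[col].str.lower()

# 2) Replace values occurring fewer than `threshold` times with "other".
threshold = 2
for col in cat_cols:
    counts = df[col].value_counts()
    df[col] = df[col].replace(counts[counts < threshold].index, "other")

# 3) Fill missing values: constants first, then group-wise means.
df["construction_year"] = df["construction_year"].fillna(1959)
df["funder"] = df["funder"].fillna(-1)  # likewise scheme_name, installer, ...
for col in ["longitude", "latitude"]:
    df[col] = df[col].fillna(df.groupby("district_code")[col].transform("mean"))
```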
- One-hot encoding was done to the features `public_meeting` and `permit`, ignoring missing values.
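A minimal sketch of that step, assuming pandas: `pd.get_dummies` leaves NaN rows as all zeros by default (`dummy_na=False`), which matches "ignoring missing values".

```python
# One-hot encode public_meeting and permit; rows with NaN end up
# with zeros in every dummy column (dummy_na=False is the default).
import numpy as np
import pandas as pd

df = pd.DataFrame({"public_meeting": [True, False, np.nan],
                   "permit": [False, np.nan, True]})
df = pd.get_dummies(df, columns=["public_meeting", "permit"])
```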
- Label encoding was done to the following features.
- funder
- installer
- basin
- subvillage
- lga
- ward
- scheme_management
- scheme_name
- extraction_type
- extraction_type_class
- management
- management_group
- payment_type
- water_quality
- quantity_group
- source
- source_class
- waterpoint_type
- Standard normalization was done to the following features.
- amount_tsh
- population
- num_private
- gps_height
- num_days (created feature)
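This step can be sketched with scikit-learn's `StandardScaler` (zero mean, unit variance); the toy values and the truncated column list are illustrative.

```python
# Standardize the numeric features in place with StandardScaler.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"amount_tsh": [0.0, 500.0, 1000.0],
                   "population": [10.0, 200.0, 30.0]})

num_cols = ["amount_tsh", "population"]  # plus num_private, gps_height, num_days
df[num_cols] = StandardScaler().fit_transform(df[num_cols])
```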
- The following features were removed for the corresponding reasons.

  | Feature Name | Reason |
  | --- | --- |
  | region | redundant with region_code |
  | wpt_name | unique for every record |
  | recorded_by | same for every record |
  | id | unique for every record |
  | extraction_type_group | highly correlated with extraction_type |
  | quantity | highly correlated with quantity_group |
  | payment | highly correlated with payment_type |
  | quality_group | highly correlated with water_quality |
  | waterpoint_type_group | highly correlated with waterpoint_type |
  | source_type | highly correlated with source |
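The removals above amount to a single `drop` call; `df` is assumed to hold the raw competition columns (only a few are shown here).

```python
# Drop redundant, constant, or highly correlated columns in one call.
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "region": ["iringa", "arusha"],
                   "region_code": [11, 2], "quantity": ["dry", "enough"]})

df = df.drop(columns=["id", "region", "quantity"])  # plus the rest listed above
```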
- Created a new feature `num_days` from the existing feature `date_recorded` and carried out standard normalization.
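A sketch of the derived feature: the write-up does not state the reference date, so counting days since the earliest `date_recorded` is an assumption made here for illustration.

```python
# num_days = days elapsed since the earliest date_recorded (assumed
# reference point; the original reference date is not specified).
import pandas as pd

df = pd.DataFrame({"date_recorded": ["2011-03-14", "2011-03-15", "2013-01-01"]})
df["date_recorded"] = pd.to_datetime(df["date_recorded"])
df["num_days"] = (df["date_recorded"] - df["date_recorded"].min()).dt.days
```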
- Features `permit_False` (derived from applying one-hot encoding to `permit`) and `num_private` were dropped due to low feature importance (<0.002).
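The pruning step can be sketched against any fitted tree ensemble exposing `feature_importances_`; the model, data, and column names below are synthetic stand-ins, not the solution's actual classifier.

```python
# Drop features whose importance falls below 0.002 in a fitted
# tree-based model (synthetic example, not the real training data).
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = pd.DataFrame(rng.rand(200, 3),
                 columns=["gps_height", "num_private", "permit_False"])
y = (X["gps_height"] > 0.5).astype(int)  # target driven only by gps_height

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
low = X.columns[model.feature_importances_ < 0.002]
X = X.drop(columns=low)
```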