This is a solution to the classification challenge Pump it Up: Data Mining the Water Table by DrivenData, available at
https://www.drivendata.org/competitions/7/pump-it-up-data-mining-the-water-table/, built with scikit-learn in a Jupyter notebook.
The following pre-processing and feature engineering techniques were used in my solution.
- Converted all categorical values in the dataframe to lowercase.
- Every value that appears fewer than 30 times in a column (categorical features only) was replaced with `other`.
- Handled missing values. Missing values of the following features were filled with the corresponding values.

  | Feature Name | Value |
  | --- | --- |
  | construction_year | 1959 |
  | longitude | mean of the longitude grouped by district_code |
  | latitude | mean of the latitude grouped by district_code |
  | scheme_name | -1 |
  | scheme_management | -1 |
  | funder | -1 |
  | subvillage | -1 |
  | installer | -1 |
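The steps above can be sketched as follows, assuming the data sits in a pandas DataFrame with the competition's column names; the toy values and the threshold of 2 (the write-up uses 30) are only for illustration.

```python
# Sketch of lowercasing, rare-value replacement, and missing-value
# filling; toy data, and threshold=2 instead of the write-up's 30.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "district_code": [1, 1, 2, 2],
    "longitude": [35.0, np.nan, 33.0, 34.0],
    "latitude": [np.nan, -6.0, -4.0, -5.0],
    "construction_year": [np.nan, 1990.0, 2005.0, np.nan],
    "funder": ["GOV", "Gov", "UNICEF", np.nan],
})

# 1) Lowercase every categorical (object-dtype) value.
cat_cols = df.select_dtypes(include="object").columns
for col in cat_cols:
    df[col] = df[col].str.lower()

# 2) Replace values occurring fewer than `threshold` times with "other".
threshold = 2
for col in cat_cols:
    counts = df[col].value_counts()
    df[col] = df[col].replace(counts[counts < threshold].index, "other")

# 3) Fill missing values: constants first, then group-wise means.
df["construction_year"] = df["construction_year"].fillna(1959)
df["funder"] = df["funder"].fillna(-1)  # likewise scheme_name, installer, ...
for col in ["longitude", "latitude"]:
    df[col] = df[col].fillna(df.groupby("district_code")[col].transform("mean"))
```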
- One-hot encoding was done to the features `public_meeting` and `permit`, ignoring missing values.
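A minimal sketch of that step, assuming pandas: `pd.get_dummies` leaves NaN rows as all zeros by default (`dummy_na=False`), which matches "ignoring missing values".

```python
# One-hot encode public_meeting and permit; rows with NaN end up
# with zeros in every dummy column (dummy_na=False is the default).
import numpy as np
import pandas as pd

df = pd.DataFrame({"public_meeting": [True, False, np.nan],
                   "permit": [False, np.nan, True]})
df = pd.get_dummies(df, columns=["public_meeting", "permit"])
```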
- Label encoding was done to the following features.
- funder
- installer
- basin
- subvillage
- lga
- ward
- scheme_management
- scheme_name
- extraction_type
- extraction_type_class
- management
- management_group
- payment_type
- water_quality
- quantity_group
- source
- source_class
- waterpoint_type
- Standard normalization was done to the following features.
- amount_tsh
- population
- num_private
- gps_height
- num_days (created feature)
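This step can be sketched with scikit-learn's `StandardScaler` (zero mean, unit variance); the toy values and the truncated column list are illustrative.

```python
# Standardize the numeric features in place with StandardScaler.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"amount_tsh": [0.0, 500.0, 1000.0],
                   "population": [10.0, 200.0, 30.0]})

num_cols = ["amount_tsh", "population"]  # plus num_private, gps_height, num_days
df[num_cols] = StandardScaler().fit_transform(df[num_cols])
```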
- The following features were removed for the corresponding reasons.

  | Feature Name | Reason |
  | --- | --- |
  | region | redundant with region_code |
  | wpt_name | unique for every record |
  | recorded_by | same for every record |
  | id | unique for every record |
  | extraction_type_group | highly correlated with extraction_type |
  | quantity | highly correlated with quantity_group |
  | payment | highly correlated with payment_type |
  | quality_group | highly correlated with water_quality |
  | waterpoint_type_group | highly correlated with waterpoint_type |
  | source_type | highly correlated with source |
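The removals above amount to a single `drop` call; `df` is assumed to hold the raw competition columns (only a few are shown here).

```python
# Drop redundant, constant, or highly correlated columns in one call.
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "region": ["iringa", "arusha"],
                   "region_code": [11, 2], "quantity": ["dry", "enough"]})

df = df.drop(columns=["id", "region", "quantity"])  # plus the rest listed above
```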
- Created a new feature `num_days` from the existing feature `date_recorded` and carried out standard normalization.
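A sketch of the derived feature: the write-up does not state the reference date, so counting days since the earliest `date_recorded` is an assumption made here for illustration.

```python
# num_days = days elapsed since the earliest date_recorded (assumed
# reference point; the original reference date is not specified).
import pandas as pd

df = pd.DataFrame({"date_recorded": ["2011-03-14", "2011-03-15", "2013-01-01"]})
df["date_recorded"] = pd.to_datetime(df["date_recorded"])
df["num_days"] = (df["date_recorded"] - df["date_recorded"].min()).dt.days
```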
- Features `permit_False` (derived from applying one-hot encoding to `permit`) and `num_private` were dropped due to low feature importance (<0.002).
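The pruning step can be sketched against any fitted tree ensemble exposing `feature_importances_`; the model, data, and column names below are synthetic stand-ins, not the solution's actual classifier.

```python
# Drop features whose importance falls below 0.002 in a fitted
# tree-based model (synthetic example, not the real training data).
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = pd.DataFrame(rng.rand(200, 3),
                 columns=["gps_height", "num_private", "permit_False"])
y = (X["gps_height"] > 0.5).astype(int)  # target driven only by gps_height

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
low = X.columns[model.feature_importances_ < 0.002]
X = X.drop(columns=low)
```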