Repo for the Regression House Price competition project:
https://www.kaggle.com/c/iowa-house-prices-regression-techniques/data
-
Exploratory Data Analysis, outliers identification and data cleaning.
-
Modelling using Random Forest, CatBoost amd XGBoost Regressors.
-
Hyperparameters tuning usings RandomizedSearchCV amd GridSearchCV.
-
Test set predictions.
Python Version: 3.8.2
Packages: Pandas, Numpy, Matplotlib, Seaborn, SKlearn, XGBoost, CatBoost
The EDA made shows how data is distributed and relation between different features. Following few highlights from the graphs dispalyed:
First use the pandas api to clean the Train dataset (df1) as follow:
- Fill missing numerical values with feature median
- Convert Object data into numerical
- Create a binary column for missing data with Boolean values
Then functionize the whole process with a preprocess_data(df) function that performs same transformations.
-
Split Data into
trainandtestdata -
Create
fit_and_score(model)function to instantiate and compare accuracy from different estimators simultaneously. -
3 different estimators: Random Forest Classifier XGBoost Classifier CatBoost Classifier
-
Hyperparameter tuning using RandomizedSearchCV and GridSearchCV for the two best performant classifiers.
Once evaluated the best model which gives the lowest RSLME, use it to make predictions on test data.


