Missing value imputation replaces missing data in a dataset with substituted values. This project explores Python implementations of several missing value imputation methods.
MEAN/MEDIAN/MODE IMPUTATION: These statistical methods replace missing values with the mean, median, or mode of the observed values of the same variable. Mean Imputation: Replaces missing values with the mean (average) of the observed values. It suits numerical data without outliers, because outliers can pull the mean far away from typical values. Median Imputation: Replaces missing values with the median of the observed values. It is more robust than mean imputation for data with outliers or a skewed distribution. Mode Imputation: Replaces missing values with the mode (the most frequently occurring value). It is typically used for categorical data.
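The three strategies above can be sketched with pandas as follows. This is a minimal illustration, not the project's own code: the DataFrame, its column names, and its values are made up for the example, and the choice of which strategy goes with which column is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two numerical columns and one categorical column.
df = pd.DataFrame({
    "age": [25.0, np.nan, 31.0, 40.0],
    "income": [30_000.0, 32_000.0, np.nan, 250_000.0],  # 250,000 is an outlier
    "city": ["Lagos", "Accra", None, "Lagos"],
})

imputed = df.copy()

# Mean imputation: reasonable here because "age" has no extreme values.
imputed["age"] = df["age"].fillna(df["age"].mean())

# Median imputation: the outlier in "income" would distort the mean.
imputed["income"] = df["income"].fillna(df["income"].median())

# Mode imputation: mode() can return several values on ties; take the first.
imputed["city"] = df["city"].fillna(df["city"].mode().iloc[0])

print(imputed)
```

By default `mean()`, `median()`, and `mode()` skip missing entries, so each statistic is computed from the observed values only.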
PREDICTIVE IMPUTATION: Predictive imputation uses statistical models to predict and fill in missing values based on relationships observed in the rest of the data. Some methods include: Regression Imputation: Fits a regression model on the rows where the variable is observed, using other, related variables as predictors, and uses the model's predictions to fill the missing values. K-Nearest Neighbors (KNN) Imputation: Identifies the ‘k’ samples in the dataset that are most similar to the observation with missing data and imputes the value from these ‘k’ neighbours, using their average for numerical variables or their majority value for categorical variables.
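A rough sketch of both ideas using only NumPy is shown below. The function names `regression_impute` and `knn_impute` are hypothetical helpers written for this illustration. The sketch makes simplifying assumptions: the regression uses a single, fully observed predictor and a straight-line fit, and the KNN version only borrows values from rows with no missing entries, measuring Euclidean distance on the columns the incomplete row has observed. Libraries such as scikit-learn provide a more complete implementation (`sklearn.impute.KNNImputer`).

```python
import numpy as np

def regression_impute(x, y):
    """Fill NaNs in y with predictions from a straight-line fit of y on x.

    Assumes x is fully observed (hypothetical helper for illustration).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = ~np.isnan(y)
    # Fit y = slope * x + intercept on the rows where y is known.
    slope, intercept = np.polyfit(x[observed], y[observed], deg=1)
    filled = y.copy()
    filled[~observed] = slope * x[~observed] + intercept
    return filled

def knn_impute(X, k=2):
    """Fill NaNs in each row with the column mean over its k nearest complete rows."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    donors = X[~np.isnan(X).any(axis=1)]  # rows with no missing values
    for i in range(len(X)):
        missing = np.isnan(X[i])
        if not missing.any():
            continue
        # Euclidean distance, using only the columns this row has observed.
        diffs = donors[:, ~missing] - X[i, ~missing]
        dists = np.sqrt((diffs ** 2).sum(axis=1))
        nearest = np.argsort(dists)[:k]
        filled[i, missing] = donors[nearest][:, missing].mean(axis=0)
    return filled

# Made-up example data.
y_filled = regression_impute([1, 2, 3, 4, 5], [2.0, 4.0, np.nan, 8.0, 10.0])
X_filled = knn_impute(
    [[1.0, 2.0, np.nan],
     [1.1, 2.1, 10.0],
     [0.9, 1.9, 12.0],
     [8.0, 9.0, 50.0]],
    k=2,
)
print(y_filled)
print(X_filled)
```

In practice the features should usually be scaled before computing distances, otherwise the variable with the largest range dominates the choice of neighbours.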
LAST OBSERVATION CARRIED FORWARD (LOCF) AND NEXT OBSERVATION CARRIED BACKWARD (NOCB): These imputation methods are typically used for time series data or longitudinal studies, where the ordering of observations is meaningful. Last Observation Carried Forward (LOCF): Replaces a missing value with the last observed value before it. It assumes that the best guess for a missing value is the most recently observed one. Next Observation Carried Backward (NOCB): The reverse of LOCF. It replaces a missing value with the next observed value after it.
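In pandas, LOCF and NOCB correspond to forward fill and backward fill. The sketch below uses a made-up daily series; the dates and readings are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings with gaps, already sorted by time.
readings = pd.Series(
    [np.nan, 20.5, np.nan, np.nan, 22.0, np.nan],
    index=pd.date_range("2024-01-01", periods=6, freq="D"),
)

# LOCF: each gap takes the most recent earlier observation.
locf = readings.ffill()

# NOCB: each gap takes the next later observation.
nocb = readings.bfill()

# LOCF cannot fill a gap at the start and NOCB cannot fill one at the end,
# so the two are sometimes chained to cover both edges.
combined = readings.ffill().bfill()

print(pd.DataFrame({"raw": readings, "locf": locf, "nocb": nocb, "combined": combined}))
```

Both methods depend on the row order, so the data should be sorted by time (and, in longitudinal data, filled separately for each subject) before they are applied.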