The objective of this exercise is to compare the performance of five different regression algorithms on the Diabetes dataset (from scikit-learn). The ultimate goal is to predict a patient's disease progression after one year (a numeric target) and interpret the strengths and weaknesses of each model based on standard regression metrics.
This study was conducted as part of the coursework for a Master's program in Machine Learning / Data Science.
| 🏷️ Feature | Description |
|---|---|
| Dataset | Diabetes Dataset (scikit-learn) |
| Input Features (X) | 10 baseline variables (age, sex, body mass index, average blood pressure, and six blood serum measurements). |
| Target Variable (y) | Quantitative measure of disease progression one year after baseline. (A continuous numeric value). |
| Task | Regression (Predicting a number). |
- Load the `diabetes` dataset from `sklearn.datasets`.
- Split the data into training and testing sets (e.g., 80% train, 20% test).
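These two steps can be sketched with scikit-learn's standard API (the 80/20 split and `random_state=42` are illustrative choices, not requirements):

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

# Load the Diabetes dataset: 442 samples, 10 baseline features.
X, y = load_diabetes(return_X_y=True)

# Hold out 20% of the samples for testing; fixing random_state makes
# the split reproducible across runs.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (353, 10) (89, 10)
```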
The following five models will be trained on the training data:
- Linear Regression (Baseline model)
- Polynomial Regression (Exploring non-linearity, e.g., degree 2)
- Decision Tree Regression
- Random Forest Regression (Ensemble method to reduce variance)
- k-Nearest Neighbors (KNN) Regression
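One way to set up and fit all five models is shown below; the hyperparameters (`degree=2`, `n_estimators=100`, `n_neighbors=5`) are illustrative defaults, not tuned values:

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    # Degree-2 polynomial expansion feeding a linear model.
    "Polynomial Regression": make_pipeline(
        PolynomialFeatures(degree=2), LinearRegression()
    ),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    # 100 trees averaged to reduce the variance of a single tree.
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    # KNN is distance-based, so features are scaled first.
    "KNN": make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5)),
}

for name, model in models.items():
    model.fit(X_train, y_train)
```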
A custom or standard evaluation function must be used to calculate the following metrics for each model's predictions on the test set:
- MAE (Mean Absolute Error)
- MSE (Mean Squared Error)
- RMSE (Root Mean Squared Error)
- $R^2$ ($R$-squared, the coefficient of determination)
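A small custom helper built on `sklearn.metrics` could look like this (the function name `evaluate` is my own choice):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate(y_true, y_pred):
    """Return MAE, MSE, RMSE, and R^2 for one set of predictions."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "MAE": mean_absolute_error(y_true, y_pred),
        "MSE": mse,
        "RMSE": np.sqrt(mse),  # RMSE is back in the target's original units
        "R2": r2_score(y_true, y_pred),
    }

# Sanity check on a tiny hand-computable example.
print(evaluate([1, 2, 3], [1, 2, 4]))
```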
Combine all results into a single Pandas DataFrame, using the model name as the index, for easy side-by-side comparison.
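The table layout might be built as follows; the metric values here are placeholders purely to show the structure, not actual results:

```python
import pandas as pd

# Placeholder per-model metric dicts -- in practice these come from
# evaluating each fitted model's predictions on the test set.
results = {
    "Linear Regression": {"MAE": 42.8, "MSE": 2900.2, "RMSE": 53.9, "R2": 0.45},
    "Random Forest": {"MAE": 44.1, "MSE": 3018.6, "RMSE": 54.9, "R2": 0.43},
}

# One row per model, one column per metric, model name as the index.
comparison = pd.DataFrame.from_dict(results, orient="index")
comparison.index.name = "Model"
print(comparison)
```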
The core of the exercise is to analyze the final comparison table.
- Criteria: The best model will generally have the lowest RMSE and the highest $R^2$ (closest to 1).
- Interpretation: State which model consistently scores best across the error metrics, indicating it is the most robust predictor for the Diabetes progression data.
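With the comparison DataFrame in hand, the winner can be read off programmatically; the numbers below are again illustrative placeholders:

```python
import pandas as pd

# Placeholder comparison table (values illustrative only).
comparison = pd.DataFrame(
    {"RMSE": [53.9, 54.9, 70.2], "R2": [0.45, 0.43, 0.06]},
    index=["Linear Regression", "Random Forest", "Decision Tree"],
)

# Lowest RMSE and highest R^2 should point to the same model; if they
# disagree, inspect the other metrics before declaring a winner.
print("Best by RMSE:", comparison["RMSE"].idxmin())
print("Best by R2:  ", comparison["R2"].idxmax())
```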
| Condition | Implication | Common Model Type |
|---|---|---|
| High Train Error & High Test Error | Underfitting (Model is too simple or too constrained). | Often seen in simple Linear Regression if the relationship is complex. |
| Low Train Error & High Test Error | Overfitting (Model learned the noise in the training data). | Often seen in unconstrained Decision Trees or high-degree Polynomial Regression. |
| Low Train Error & Low Test Error | Good Fit (Model generalizes well). | Typically seen in Random Forest or well-tuned models. |
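The diagnosis in the table above can be checked empirically by comparing train and test scores, for example for the two tree-based models (`random_state=42` is an arbitrary fixed seed):

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

dt = DecisionTreeRegressor(random_state=42).fit(X_train, y_train)
rf = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_train, y_train)

# .score() returns R^2; a large train-test gap is the overfitting signature.
dt_gap = dt.score(X_train, y_train) - dt.score(X_test, y_test)
rf_gap = rf.score(X_train, y_train) - rf.score(X_test, y_test)
print(f"Decision Tree gap: {dt_gap:.2f}, Random Forest gap: {rf_gap:.2f}")
```

An unconstrained Decision Tree typically reaches a near-perfect train score while the Random Forest shows a visibly smaller gap, matching the second and third rows of the table.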
Analyze how the characteristics of each algorithm relate to their performance:
- Tree-based (DT & RF): If Random Forest significantly outperforms Linear Regression, it suggests the relationship between the features and disease progression is non-linear.
- Linear/Polynomial: If Linear Regression is poor but Polynomial Regression is better, it suggests a curved relationship.
- KNN: If KNN performs poorly, it might indicate that the data is not well-structured in Euclidean space, or that feature scaling was critical (and perhaps not perfectly implemented), or the dataset is too small/noisy.
- Random Forest's Robustness: Random Forest should outperform a single Decision Tree thanks to its lower variance, visible as a smaller gap between train and test scores.