
📈 Regression Model Comparison Exercise

Project Goal

The objective of this exercise is to compare the performance of five different regression algorithms on the Diabetes dataset (from scikit-learn). The ultimate goal is to predict a patient's disease progression after one year (a numeric target) and interpret the strengths and weaknesses of each model based on standard regression metrics.

This study was conducted as coursework for a Master's program in Machine Learning/Data Science.


💻 Dataset and Task Details

| 🏷️ Feature | Description |
| --- | --- |
| Dataset | Diabetes dataset (scikit-learn) |
| Input Features (X) | 10 baseline variables: age, sex, body mass index, average blood pressure, and six blood serum measurements |
| Target Variable (y) | Quantitative measure of disease progression one year after baseline (a continuous numeric value) |
| Task | Regression (predicting a number) |

⚙️ Instructions and Workflow

1. Data Preparation

  • Load the diabetes dataset from sklearn.datasets.
  • Split the data into training and testing sets (e.g., 80% train, 20% test).
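A minimal sketch of this step; the 80/20 split and `random_state=42` are conventional choices, not mandated by the exercise:

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

# Load the Diabetes dataset (442 samples, 10 features)
X, y = load_diabetes(return_X_y=True)

# 80% train / 20% test; a fixed random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```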

2. Regression Models to be Trained

The following five models will be trained on the training data:

  1. Linear Regression (Baseline model)
  2. Polynomial Regression (Exploring non-linearity, e.g., degree 2)
  3. Decision Tree Regression
  4. Random Forest Regression (Ensemble method to reduce variance)
  5. k-Nearest Neighbors (KNN) Regression
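One way to set this up is to collect the five models in a dictionary so they can be trained and evaluated in a single loop. The hyperparameters below (degree-2 polynomial via a pipeline, 100 trees, k=5) are illustrative defaults, not requirements of the exercise:

```python
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor

models = {
    # Baseline linear model
    "Linear Regression": LinearRegression(),
    # Degree-2 polynomial features followed by a linear fit
    "Polynomial Regression (deg 2)": make_pipeline(
        PolynomialFeatures(degree=2), LinearRegression()
    ),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    # Ensemble of 100 trees to reduce variance
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "KNN (k=5)": KNeighborsRegressor(n_neighbors=5),
}
```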

3. Evaluation Function (Metrics)

A custom or standard evaluation function must be used to calculate the following metrics for each model's predictions on the test set:

  • MAE (Mean Absolute Error)
  • MSE (Mean Squared Error)
  • RMSE (Root Mean Squared Error)
  • $R^2$ ($R$-squared)
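One possible shape for the evaluation function, built on scikit-learn's metric helpers (RMSE is computed as the square root of MSE, which works across scikit-learn versions):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate(y_true, y_pred):
    """Return the four standard regression metrics as a dict."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "MAE": mean_absolute_error(y_true, y_pred),
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "R2": r2_score(y_true, y_pred),
    }
```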

4. Comparison

Combine all results into a single Pandas DataFrame, using the model name as the index, for easy side-by-side comparison.


🧠 Interpreting the Results

The core of the exercise is to analyze the final comparison table.

1. Which Model Performs Best?

  • Criteria: The best model will generally have the lowest RMSE and the highest $R^2$ (closest to 1).
  • Interpretation: State which model consistently scores the best across the error metrics, indicating it is the most robust predictor for the Diabetes progression data.

2. Underfitting vs. Overfitting Indicators

| Condition | Implication | Common Model Type |
| --- | --- | --- |
| High train error & high test error | **Underfitting** (model is too simple or too constrained) | Often seen in simple Linear Regression if the relationship is complex |
| Low train error & high test error | **Overfitting** (model learned the noise in the training data) | Often seen in unconstrained Decision Trees or high-degree Polynomial Regression |
| Low train error & low test error | **Good fit** (model generalizes well) | Typically seen in Random Forest or well-tuned models |

3. Algorithm Pros/Cons Reflected in the Data

Analyze how the characteristics of each algorithm relate to their performance:

  • Tree-based (DT & RF): If Random Forest significantly outperforms Linear Regression, it suggests the relationship between the features and disease progression is non-linear.
  • Linear/Polynomial: If Linear Regression is poor but Polynomial Regression is better, it suggests a curved relationship.
  • KNN: If KNN performs poorly, it might indicate that the data is not well-structured in Euclidean space, that feature scaling was critical (and perhaps not applied), or that the dataset is too small or too noisy for distance-based prediction.
  • Random Forest's Robustness: Random Forest should outperform a single Decision Tree because averaging many trees reduces variance, which shows up as a smaller gap between train and test scores.
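The KNN scaling point above can be tested directly. Note that scikit-learn's `load_diabetes` returns features that are already mean-centered and scaled by default, so scaling may matter less here than on raw data; wrapping KNN in a pipeline with `StandardScaler` still makes the experiment explicit and leak-free, because the scaler is fit on the training fold only. A minimal sketch (`n_neighbors=5` is an illustrative default):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor

# Scaling happens inside the pipeline, so when fit() is called it is
# computed from the training data only — no leakage from the test set.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))
# scaled_knn.fit(X_train, y_train) can then be evaluated with the same
# metrics function and compared against the unscaled KNN's test scores.
```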
