The objective of this exercise is to compare the performance of five different regression algorithms on the Diabetes dataset (from scikit-learn). The ultimate goal is to predict a patient's disease progression after one year (a numeric target) and interpret the strengths and weaknesses of each model based on standard regression metrics.
This study was conducted as part of the coursework for a Master's program in Machine Learning / Data Science.
| 🏷️ Feature | Description |
|---|---|
| Dataset | Diabetes Dataset (scikit-learn) |
| Input Features (X) | 10 baseline variables (age, sex, body mass index, average blood pressure, and six blood serum measurements). |
| Target Variable (y) | Quantitative measure of disease progression one year after baseline. (A continuous numeric value). |
| Task | Regression (Predicting a number). |
- Load the `diabetes` dataset from `sklearn.datasets`.
- Split the data into training and testing sets (e.g., 80% train, 20% test).
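These two steps can be sketched with scikit-learn's standard API (the 80/20 split and `random_state=42` are illustrative choices, not requirements):

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

# Load the Diabetes dataset: 442 samples, 10 baseline features.
X, y = load_diabetes(return_X_y=True)

# Hold out 20% of the samples for testing; fixing random_state makes
# the split reproducible across runs.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (353, 10) (89, 10)
```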
The following five models will be trained on the training data:
- Linear Regression (Baseline model)
- Polynomial Regression (Exploring non-linearity, e.g., degree 2)
- Decision Tree Regression
- Random Forest Regression (Ensemble method to reduce variance)
- k-Nearest Neighbors (KNN) Regression
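One way to set up and fit all five models is shown below; the hyperparameters (`degree=2`, `n_estimators=100`, `n_neighbors=5`) are illustrative defaults, not tuned values:

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    # Degree-2 polynomial expansion feeding a linear model.
    "Polynomial Regression": make_pipeline(
        PolynomialFeatures(degree=2), LinearRegression()
    ),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    # 100 trees averaged to reduce the variance of a single tree.
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    # KNN is distance-based, so features are scaled first.
    "KNN": make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5)),
}

for name, model in models.items():
    model.fit(X_train, y_train)
```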
A custom or standard evaluation function must be used to calculate the following metrics for each model's predictions on the test set:
- MAE (Mean Absolute Error)
- MSE (Mean Squared Error)
- RMSE (Root Mean Squared Error)
- $R^2$ ($R$-squared, the coefficient of determination)
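A small custom helper built on `sklearn.metrics` could look like this (the function name `evaluate` is my own choice):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate(y_true, y_pred):
    """Return MAE, MSE, RMSE, and R^2 for one set of predictions."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "MAE": mean_absolute_error(y_true, y_pred),
        "MSE": mse,
        "RMSE": np.sqrt(mse),  # RMSE is back in the target's original units
        "R2": r2_score(y_true, y_pred),
    }

# Sanity check on a tiny hand-computable example.
print(evaluate([1, 2, 3], [1, 2, 4]))
```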
Combine all results into a single Pandas DataFrame, using the model name as the index, for easy side-by-side comparison.
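The table layout might be built as follows; the metric values here are placeholders purely to show the structure, not actual results:

```python
import pandas as pd

# Placeholder per-model metric dicts -- in practice these come from
# evaluating each fitted model's predictions on the test set.
results = {
    "Linear Regression": {"MAE": 42.8, "MSE": 2900.2, "RMSE": 53.9, "R2": 0.45},
    "Random Forest": {"MAE": 44.1, "MSE": 3018.6, "RMSE": 54.9, "R2": 0.43},
}

# One row per model, one column per metric, model name as the index.
comparison = pd.DataFrame.from_dict(results, orient="index")
comparison.index.name = "Model"
print(comparison)
```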
The core of the exercise is to analyze the final comparison table.
- Criteria: The best model will generally have the lowest RMSE and the highest $R^2$ (closest to 1).
- Interpretation: State which model consistently scores best across the error metrics, indicating it is the most robust predictor for the Diabetes progression data.
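With the comparison DataFrame in hand, the winner can be read off programmatically; the numbers below are again illustrative placeholders:

```python
import pandas as pd

# Placeholder comparison table (values illustrative only).
comparison = pd.DataFrame(
    {"RMSE": [53.9, 54.9, 70.2], "R2": [0.45, 0.43, 0.06]},
    index=["Linear Regression", "Random Forest", "Decision Tree"],
)

# Lowest RMSE and highest R^2 should point to the same model; if they
# disagree, inspect the other metrics before declaring a winner.
print("Best by RMSE:", comparison["RMSE"].idxmin())
print("Best by R2:  ", comparison["R2"].idxmax())
```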
| Condition | Implication | Common Model Type |
|---|---|---|
| High Train Error & High Test Error | Underfitting (Model is too simple or too constrained). | Often seen in simple Linear Regression if the relationship is complex. |
| Low Train Error & High Test Error | Overfitting (Model learned the noise in the training data). | Often seen in unconstrained Decision Trees or high-degree Polynomial Regression. |
| Low Train Error & Low Test Error | Good Fit (Model generalizes well). | Typically seen in Random Forest or well-tuned models. |
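The diagnosis in the table above can be checked empirically by comparing train and test scores, for example for the two tree-based models (`random_state=42` is an arbitrary fixed seed):

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

dt = DecisionTreeRegressor(random_state=42).fit(X_train, y_train)
rf = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_train, y_train)

# .score() returns R^2; a large train-test gap is the overfitting signature.
dt_gap = dt.score(X_train, y_train) - dt.score(X_test, y_test)
rf_gap = rf.score(X_train, y_train) - rf.score(X_test, y_test)
print(f"Decision Tree gap: {dt_gap:.2f}, Random Forest gap: {rf_gap:.2f}")
```

An unconstrained Decision Tree typically reaches a near-perfect train score while the Random Forest shows a visibly smaller gap, matching the second and third rows of the table.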
Analyze how the characteristics of each algorithm relate to their performance:
- Tree-based (DT & RF): If Random Forest significantly outperforms Linear Regression, it suggests the relationship between the features and disease progression is non-linear.
- Linear/Polynomial: If Linear Regression is poor but Polynomial Regression is better, it suggests a curved relationship.
- KNN: If KNN performs poorly, it might indicate that the data is not well-structured in Euclidean space, or that feature scaling was critical (and perhaps not perfectly implemented), or the dataset is too small/noisy.
- Random Forest's Robustness: Random Forest should outperform a single Decision Tree thanks to its lower variance, visible as a smaller gap between train and test scores.