PCA and Linear Regression
This project performs exploratory data analysis, Principal Component Analysis (PCA), and linear regression on a prostate cancer dataset. The objective is to understand which clinical factors influence the prostate-specific antigen (PSA) level and to build interpretable predictive models.
Python 3 is required to run this project.
All required Python packages are listed in the requirements.txt file.
From the project root directory, install the dependencies:
pip install -r requirements.txt

If pip does not work, use:
python3 -m pip install -r requirements.txt

Dataset
The dataset is located in data/prostate.csv.
Important note: The dataset is space-separated rather than comma-separated. All scripts handle this format explicitly.
How to Run the Project
All commands below must be executed from the project root directory.
cd prostate_project

Step 1: Load and Check the Data
This script loads the dataset, checks the number of observations and variables, and verifies that there are no missing values.
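A minimal sketch of what loading and checking a space-separated file could look like, using only the standard library. It is not the code in src/01_load_and_check.py: the helper names, the inline sample values, and the choice to treat "NA" as missing are all assumptions made for illustration.

```python
import csv
import io


def load_space_separated(handle):
    """Read a space-separated table with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter=" "))


def count_missing(rows):
    """Count absent, empty, or 'NA' cells per column (the 'NA' marker is an assumption)."""
    counts = {name: 0 for name in rows[0]}
    for row in rows:
        for name in counts:
            if row.get(name) in (None, "", "NA"):
                counts[name] += 1
    return counts


# Made-up sample values, not taken from data/prostate.csv.
SAMPLE = """vol wht age psa
0.5 16.0 50 0.7
0.4 27.5 58 0.9
0.6 15.0 74 0.8
"""

rows = load_space_separated(io.StringIO(SAMPLE))
n_obs = len(rows)        # number of observations
n_vars = len(rows[0])    # number of variables
missing = count_missing(rows)
print(n_obs, n_vars, missing)
```

To try it on the real file, the same helper could be called with open("data/prostate.csv", newline="") in place of the in-memory sample.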
Run the script with:
python3 src/01_load_and_check.py

Step 2: Descriptive Statistics and Visualization
This script computes descriptive statistics and generates boxplots and scatter plots for exploratory analysis.
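As a rough illustration of the descriptive statistics involved, the sketch below computes the five-number summary that a boxplot is built from. The function name and the input values are hypothetical, and the quartile method chosen here may differ from what src/02_descriptive_analysis.py uses.

```python
import statistics


def five_number_summary(values):
    """Return min, quartiles, and max for a sequence of numbers."""
    ordered = sorted(values)
    # "inclusive" treats the data as the full population when interpolating.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return {
        "min": ordered[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": ordered[-1],
    }


# Illustrative values only.
summary = five_number_summary([1, 2, 3, 4, 5])
print(summary)
```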
Run the script with:
python3 src/02_descriptive_analysis.py

Step 3: Log Transformation
This script applies a logarithmic transformation to skewed variables and saves a transformed dataset.
Log-transformed variables:
- vol
- wht
- bh
- pc
- psa
Age is intentionally left untransformed.
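The transformation described above can be sketched as follows. This assumes the natural logarithm and assumes the transformed values keep their original column names; neither detail is stated here, so check src/03_log_transform.py for the actual choices. The function name and sample row are hypothetical.

```python
import math

# Columns listed above as log transformed; age is deliberately excluded.
LOG_COLUMNS = ("vol", "wht", "bh", "pc", "psa")


def log_transform_row(row):
    """Return a copy of row with the skewed columns replaced by their natural log."""
    out = dict(row)
    for name in LOG_COLUMNS:
        # math.log raises ValueError for zero or negative input; how the
        # project handles such values is not covered by this sketch.
        out[name] = math.log(row[name])
    return out


# Made-up patient values for illustration.
row = {"vol": 1.0, "wht": math.e, "bh": 2.0, "pc": 10.0, "psa": 4.0, "age": 50}
transformed = log_transform_row(row)
print(transformed)
```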
Run the script with:
python3 src/03_log_transform.py

After execution, the following file is created:
data/prostate_log.csv

Step 4: Principal Component Analysis
This script performs PCA on standardized explanatory variables and produces:
- Proportion of variance explained plot
- Cumulative proportion of variance explained plot
- Correlation circle
- PCA biplot
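The quantities behind these plots can be sketched with NumPy alone. This is one common way to compute a PCA on standardized variables (eigendecomposition of the correlation matrix), not necessarily how src/04_pca.py does it; NumPy as a dependency, the function name, and the synthetic input are assumptions.

```python
import numpy as np


def pca_standardized(X):
    """PCA on standardized columns via the correlation matrix."""
    X = np.asarray(X, dtype=float)
    # Standardize: zero mean and unit sample variance per column.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    corr = (Z.T @ Z) / (len(Z) - 1)
    eigvals, eigvecs = np.linalg.eigh(corr)      # returned in ascending order
    order = np.argsort(eigvals)[::-1]            # sort components by variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = eigvals / eigvals.sum()          # proportion of variance explained
    cumulative = np.cumsum(explained)            # cumulative proportion
    scores = Z @ eigvecs                         # observation coordinates (biplot)
    # Variable-component correlations: the coordinates on a correlation circle.
    loadings = eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))
    return explained, cumulative, scores, loadings


# Synthetic data for illustration only: 30 observations, 4 variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
explained, cumulative, scores, loadings = pca_standardized(X)
print(explained, cumulative)
```

The first two columns of loadings give the points of a correlation circle, and the first two columns of scores give the observation coordinates of a biplot.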
Run the script with:
python3 src/04_pca.py

Step 5: Simple Linear Regression
This script computes correlations between the target variable and predictors, identifies the most correlated predictor, and fits a simple linear regression model.
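A small standard-library sketch of this procedure: compute the Pearson correlation of each predictor with the target, keep the one with the largest absolute correlation, and fit a least-squares line. Ranking by absolute value is an assumption, and the function names and toy values are hypothetical rather than taken from src/05_simple_regression.py.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fit_simple(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope


# Toy values for illustration only.
target = [1.0, 2.0, 3.0, 4.0, 5.0]
predictors = {
    "vol": [0.5, 1.0, 1.5, 2.0, 2.5],
    "age": [60, 55, 70, 65, 62],
}

correlations = {name: pearson(x, target) for name, x in predictors.items()}
best = max(correlations, key=lambda name: abs(correlations[name]))
intercept, slope = fit_simple(predictors[best], target)
print(best, intercept, slope)
```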
Run the script with:
python3 src/05_simple_regression.py

Step 6: Best Subset Selection
This script evaluates all combinations of predictors and selects the best multiple linear regression model using adjusted R².
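One way to sketch an exhaustive search scored by adjusted R² is shown below, using NumPy least squares. The function names and the synthetic data are assumptions and are not drawn from src/06_best_subset.py. Adjusted R² is computed as 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors.

```python
import itertools

import numpy as np


def adjusted_r2(X, y):
    """Adjusted R² of an OLS fit of y on the columns of X plus an intercept."""
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


def best_subset(columns, y):
    """Try every non-empty combination of predictors; keep the best score."""
    names = list(columns)
    best_names, best_score = None, -np.inf
    for k in range(1, len(names) + 1):
        for subset in itertools.combinations(names, k):
            X = np.column_stack([columns[name] for name in subset])
            score = adjusted_r2(X, y)
            if score > best_score:
                best_names, best_score = subset, score
    return best_names, best_score


# Synthetic data for illustration: y depends on "vol" and "wht" only.
rng = np.random.default_rng(0)
n = 60
columns = {name: rng.normal(size=n) for name in ("vol", "wht", "age")}
y = 2.0 * columns["vol"] - 1.0 * columns["wht"] + rng.normal(scale=0.5, size=n)

subset, score = best_subset(columns, y)
print(subset, score)
```

With p predictors the search fits 2^p - 1 models, so this approach is practical only for a small number of predictors.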
Run the script with:
python3 src/06_best_subset.py

Step 7: Final Prediction
This script fits the final regression model and predicts PSA for a new patient.
Steps performed:
- Log transformation of new patient inputs
- Prediction of log PSA
- Back transformation to PSA scale
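The three steps above can be sketched as follows. The coefficients, the choice of predictors, and the patient values are placeholders, not the fitted model from src/07_prediction.py, and the natural logarithm is assumed for the transformation.

```python
import math


def predict_psa(patient, intercept, coefs, log_columns):
    """Predict PSA from a linear model fitted on the log scale."""
    log_psa = intercept
    for name, beta in coefs.items():
        value = patient[name]
        if name in log_columns:
            value = math.log(value)   # step 1: log-transform the input
        log_psa += beta * value       # step 2: accumulate predicted log PSA
    return math.exp(log_psa)          # step 3: back-transform to the PSA scale


# Placeholder coefficients and patient values (hypothetical, for illustration).
intercept = 0.5
coefs = {"vol": 0.7, "age": 0.01}
patient = {"vol": 2.0, "age": 60}

psa = predict_psa(patient, intercept, coefs, log_columns={"vol"})
print(psa)
```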
Run the script with:
python3 src/07_prediction.py

The predicted PSA value is displayed in the terminal.
Thank you.