A comprehensive analysis and machine learning approach for predicting stroke likelihood using demographic and health-related data.
This project conducts an in-depth analysis of the Stroke Prediction Dataset to identify key predictors of stroke and develop models that can support early detection and prevention efforts.
- Understand the structure and distribution of stroke-related features
- Handle missing values and data imbalances effectively
- Identify patterns and correlations in health indicators
- Build and compare multiple machine learning models
- Achieve high recall for stroke detection (prioritizing identification of actual stroke cases)
- Total Records: 5,110 patient records
- Features: 11 attributes including demographic and health indicators
- Target Variable: Stroke occurrence (binary: 0/1)
- Class Imbalance: Only 4.87% positive stroke cases
- BMI: Contains actual missing (NaN) values
- Smoking Status: 30.22% marked as "Unknown"
- Missing BMI data shows strong correlation with stroke occurrence
- Missing data is not missing at random - systematic patterns identified
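A pattern like this can be probed by comparing the stroke rate between rows with and without a recorded BMI. A minimal sketch on toy data (the values below are illustrative, not the dataset's):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset: compare stroke rate among rows
# with missing BMI vs. rows where BMI is recorded.
df = pd.DataFrame({
    "bmi": [22.0, np.nan, 30.0, np.nan, 26.5, 24.0],
    "stroke": [0, 1, 0, 1, 0, 0],
})
rates = df.groupby(df["bmi"].isna())["stroke"].mean()
print(rates)  # a large gap between the two groups suggests the data is not MAR
```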
- Age: Strong positive correlation with stroke (primary risk factor)
- Glucose Level: Moderate influence on stroke likelihood
- BMI: Less predictive power than expected
- Gender, Hypertension, Heart Disease: Significant correlations with stroke
- Removed `id` column (non-informative)
- Removed single "Other" gender entry (insufficient data)
- BMI Categorization: Converted continuous BMI to categories:
  - Underweight (≤18.5)
- Normal weight (18.5-25)
- Overweight (25-30)
- Obese (>30)
- Child (<20 years)
- Unknown (missing values)
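This binning can be sketched with `pandas.cut`; the function and column names below are illustrative, not the project's actual code:

```python
import numpy as np
import pandas as pd

def categorize_bmi(df):
    """Map continuous BMI to the categories used in the analysis.
    Children (<20 years) and missing BMI values get their own labels."""
    bins = [-np.inf, 18.5, 25, 30, np.inf]
    labels = ["Underweight", "Normal weight", "Overweight", "Obese"]
    cat = pd.cut(df["bmi"], bins=bins, labels=labels).astype("object")
    cat[df["bmi"].isna()] = "Unknown"  # keep missingness as its own signal
    cat[df["age"] < 20] = "Child"      # children are categorized separately
    return cat

demo = pd.DataFrame({"age": [45, 10, 60], "bmi": [27.3, 17.0, np.nan]})
print(categorize_bmi(demo).tolist())  # ['Overweight', 'Child', 'Unknown']
```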
- Smoking Status: Corrected "Unknown" to "never smoked" for children ≤12 years
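This correction amounts to a single masked assignment in pandas (a minimal sketch with toy rows; the real column names may differ slightly):

```python
import pandas as pd

df = pd.DataFrame({
    "age": [8, 45, 11],
    "smoking_status": ["Unknown", "Unknown", "formerly smoked"],
})

# Children aged <=12 with an "Unknown" smoking history are relabeled
# "never smoked"; adults keep their "Unknown" label.
mask = (df["age"] <= 12) & (df["smoking_status"] == "Unknown")
df.loc[mask, "smoking_status"] = "never smoked"
print(df["smoking_status"].tolist())  # ['never smoked', 'Unknown', 'formerly smoked']
```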
- Feature Engineering:
- Binary encoding for categorical variables
- Normalization of continuous variables
- One-hot encoding for multi-categorical features
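The encoding and normalization steps above can be sketched with scikit-learn's `ColumnTransformer` (column names are illustrative; the real dataset uses similar fields):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric = ["age", "avg_glucose_level"]
categorical = ["gender", "smoking_status"]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric),                            # normalize continuous features
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),  # one-hot encode categoricals
])

demo = pd.DataFrame({
    "age": [45, 60], "avg_glucose_level": [90.0, 210.0],
    "gender": ["Male", "Female"], "smoking_status": ["never smoked", "smokes"],
})
X = preprocess.fit_transform(demo)
print(X.shape)  # (2, 6): 2 scaled columns + 4 one-hot columns
```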
- Applied SMOTE (Synthetic Minority Oversampling Technique)
- Balanced training set from 4.87% to 50% positive cases
- Important: Oversampling applied only to training data, test set remains unchanged
- Logistic Regression (baseline)
- Decision Tree Classifier
- XGBoost Classifier
- Random Forest Classifier
- Tuned Logistic Regression (hyperparameter optimized)
- Grid Search with Cross-Validation for hyperparameter tuning
- Pipeline Approach to properly handle SMOTE during cross-validation
- F1-weighted scoring for model selection
- Decision Tree: `max_depth=3, criterion='entropy'`
- XGBoost: `learning_rate=0.001, max_depth=2, n_estimators=200`
- Random Forest: `n_estimators=500, max_depth=2, bootstrap=True`
- Tuned Logistic Regression: `C=0.01, penalty='l2', solver='liblinear'`
| Model | Accuracy | AUC | Precision | Recall | F1-Score (weighted) |
|---|---|---|---|---|---|
| Logistic Regression (No SMOTE) | 94% | 0.85 | 0.5 | 0.02 | 0.91 |
| Logistic Regression (After SMOTE) | 82% | 0.80 | 0.16 | 0.45 | 0.86 |
| Decision Tree | 85% | 0.65 | 0.17 | 0.35 | 0.88 |
| XGBoost | 66.14% | 0.83 | 0.13 | 0.89 | 0.75 |
| Random Forest | 66.14% | 0.84 | 0.14 | 0.89 | 0.75 |
| Logistic Regression (Tuned) | 78% | 0.83 | 0.18 | 0.71 | 0.83 |
- XGBoost: Top recall (89%, tied with Random Forest) - best at identifying actual stroke cases
- Random Forest: Strong recall (89%) with good AUC (0.84)
- Tuned Logistic Regression: Balanced performance across all metrics
- High Recall Models (XGBoost, Random Forest): Excel at detecting true positives but produce more false positives
- Balanced Models (Tuned Logistic Regression): Better precision-recall balance
- Medical Context: High recall preferred due to serious consequences of missing stroke cases
In medical diagnosis, false negatives are more critical than false positives:
- Missing a stroke case (false negative) can be life-threatening
- False positives lead to additional testing but ensure patient safety
- Models prioritizing recall help ensure no stroke cases are missed
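This trade-off is visible directly in the confusion matrix. A toy example (numbers invented for illustration): 3 actual strokes, of which the model catches 2 while raising 2 false alarms:

```python
from sklearn.metrics import confusion_matrix, recall_score

y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 1, 0, 0, 0, 0, 0]

# recall = TP / (TP + FN): the fraction of actual strokes caught
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"recall = {tp}/{tp + fn} = {recall_score(y_true, y_pred):.2f}")
```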
Based on the analysis, key stroke predictors include:
- Age (strongest predictor)
- Hypertension
- Heart Disease
- Average Glucose Level
- BMI Category (including "Unknown" status)
- Smoking Status
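With a fitted tree ensemble, a ranking like this can be read off `feature_importances_`. A minimal sketch on synthetic data (feature names are illustrative, not the project's trained model):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=4, random_state=0)
names = ["age", "hypertension", "heart_disease", "avg_glucose_level"]

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
# Impurity-based importances, normalized to sum to 1.
ranking = pd.Series(clf.feature_importances_, index=names).sort_values(ascending=False)
print(ranking)
```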
```
stroke-prediction/
│
├── data/
│   └── healthcare-dataset-stroke-data.csv
│
├── notebooks/
│   └── stroke_analysis.ipynb
│
├── visualizations/
│   ├── numerical_distributions.png
│   ├── categorical_distributions.png
│   ├── correlation_matrix_after.png
│   ├── roc_curves_comparison.png
│   ├── pr_curves_comparison.png
│   └── metrics_comparison.png
│
├── models/
│   └── trained_models.pkl
│
└── README.md
```
```
pandas>=1.3.0
numpy>=1.20.0
scikit-learn>=1.0.0
xgboost>=1.5.0
plotly>=5.0.0
matplotlib>=3.4.0
seaborn>=0.11.0
imbalanced-learn>=0.8.0
```

```bash
git clone https://github.com/suphyusinhtet/stroke-prediction.git
cd stroke-prediction
```
```bash
pip install -r requirements.txt
```

```python
# Load the trained model
import pickle

with open('models/best_model.pkl', 'rb') as f:
    model = pickle.load(f)

# Make predictions
predictions = model.predict(X_test)
probabilities = model.predict_proba(X_test)
```

The project includes comprehensive visualizations:
- Feature distribution analysis
- Correlation matrices
- ROC and Precision-Recall curves
- Model performance comparisons
- Radar charts for metric comparison
- Feature Engineering: Create additional risk score features
- Advanced Models: Experiment with neural networks and ensemble methods
- Cross-Validation: Implement more robust validation strategies
- Clinical Validation: Collaborate with medical professionals for validation
- Real-time Prediction: Develop API for real-time stroke risk assessment
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
This project is licensed under the MIT License - see the LICENSE file for details.
- Dataset provided by fedesoriano on Kaggle
- Medical guidelines from CDC for BMI categorization
- Research insights from stroke prevention studies
For questions or collaborations, please reach out:
- Email: suphyusinhtet@gmail.com
- LinkedIn: Su Phyu Sin Htet
- GitHub: @suphyusinhtet