A machine learning solution for breast cancer detection with a user-friendly web interface. This project compares six classification algorithms (Random Forest, XGBoost, CatBoost, LightGBM, SVM, and a Neural Network) and serves real-time predictions through a web application.
- Features
- Project Structure
- Installation
- Quick Start
- Usage Guide
- Web Interface
- API Documentation
- Model Comparison
- Understanding the Results
- Deployment
- Dataset Information
- Medical Disclaimer
- Contributing
- License
- 📊 Multiple Model Comparison: Evaluates 6 different machine learning algorithms
  - Random Forest
  - XGBoost
  - CatBoost
  - LightGBM
  - Support Vector Machine (SVM)
  - Neural Network
- 🎯 Automatic Best Model Selection: Selects the best-performing model based on F1-score
- 🚀 Production-Ready API: FastAPI-based REST API for real-time predictions
- 💻 User-Friendly Web Interface:
  - Beautiful, responsive web UI
  - Interactive form with all 30 features
  - Real-time predictions
  - Color-coded results (Green for Benign, Red for Malignant)
- 📚 Educational Tooltips:
  - Help icons (?) next to each field
  - Plain-language explanations for all medical terminology
  - Tooltips on hover with detailed descriptions
  - Clear explanations of Malignant vs Benign
- 🎲 Random Example Generation:
  - Load random examples from the dataset
  - Different example each time
  - Pre-filled form for quick testing
- 📈 Comprehensive Evaluation:
  - Accuracy, Precision, Recall, F1-score
  - 5-fold Cross-Validation
  - Detailed classification reports
  - Confusion matrix analysis
- 🔧 Deployment Ready:
  - Docker support
  - Cloud deployment guides (Render, Heroku, Railway, AWS, GCP, Azure)
  - Traditional hosting deployment instructions
Breast Cancer Detection/
├── Breast_cancer_dataset.csv # Dataset with 569 samples and 30 features
├── train_models.py # Model training and comparison script
├── app.py # FastAPI application with web interface
├── requirements.txt # Python dependencies
├── static/
│ └── index.html # User-friendly web interface
├── Dockerfile # Docker container configuration
├── docker-compose.yml # Docker Compose setup
├── Procfile # Heroku deployment configuration
├── render.yaml # Render.com deployment configuration
├── runtime.txt # Python version specification
├── deploy.sh # Quick deployment script
├── setup_hosting.sh # Hosting deployment script
├── start_server.sh # Server start script
├── passenger_wsgi.py # WSGI entry point for hosting providers
├── DEPLOYMENT.md # Comprehensive deployment guide
├── DEPLOYMENT_QUICKSTART.md # Quick deployment guide
├── DEPLOYMENT_HOSTING.md # Traditional hosting deployment guide
├── RENDER_DEPLOYMENT.md # Render.com specific guide
├── .gitignore # Git ignore file
└── README.md # This file
- Python 3.8 or higher
- pip (Python package manager)
git clone https://github.com/ShaonINT/Breast_Cancer_Detection.git
cd Breast_Cancer_Detection
pip install -r requirements.txt
Required packages:
- pandas
- numpy
- scikit-learn
- xgboost
- catboost
- lightgbm
- fastapi
- uvicorn
- pydantic
- joblib
python train_models.py
This will:
- Load and preprocess the dataset
- Train all 6 models
- Evaluate each model
- Select the best model (highest F1-score)
- Save model files (best_model.pkl, model_metadata.pkl)
Expected Output:
================================================================================
BREAST CANCER DETECTION - MODEL COMPARISON
================================================================================
Training and evaluating models...
Random Forest: Accuracy: 0.9649, F1-Score: 0.9500
XGBoost: Accuracy: 0.9737, F1-Score: 0.9630
CatBoost: Accuracy: 0.9561, F1-Score: 0.9367
LightGBM: Accuracy: 0.9649, F1-Score: 0.9500
SVM: Accuracy: 0.9737, F1-Score: 0.9630
Neural Network: Accuracy: 0.9737, F1-Score: 0.9630
Best Model: XGBoost (F1-Score: 0.9630)
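The selection step can be sketched in a few lines. This is only an illustration, assuming the F1-scores are collected in a dict keyed by model name; the `f1_scores` variable and the `select_best_model` helper are hypothetical names, not taken from train_models.py. The values are copied from the expected output above.

```python
# Hypothetical sketch: pick the model with the highest F1-score.
# Scores are copied from the expected output above.
f1_scores = {
    "Random Forest": 0.9500,
    "XGBoost": 0.9630,
    "CatBoost": 0.9367,
    "LightGBM": 0.9500,
    "SVM": 0.9630,
    "Neural Network": 0.9630,
}


def select_best_model(scores):
    """Return (name, score) of the highest-scoring model.

    max() keeps the first maximal entry it sees, so ties are resolved
    by the insertion order of the dict (an assumption of this sketch).
    """
    name = max(scores, key=scores.get)
    return name, scores[name]


best_name, best_f1 = select_best_model(f1_scores)
print(f"Best Model: {best_name} (F1-Score: {best_f1:.4f})")
```

Three models tie at 0.9630 in this run. The sketch resolves the tie by dict order; how the training script itself breaks ties is not stated here.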
python app.py
Then open your browser:
- Web Interface: http://localhost:8000
- API Documentation: http://localhost:8000/docs
- Alternative Docs: http://localhost:8000/redoc
- Navigate to http://localhost:8000
- Read the Information Box at the top to understand:
  - What the measurements mean
  - What Malignant and Benign mean
  - How to use the tool
- Fill in the Form:
  - Hover over the ? icon next to any field for explanations
  - Or click "Load Random Example" to fill with sample data
  - Enter all 30 feature values
- Get Prediction:
  - Click the "Get Prediction" button
  - View results:
    - B (Benign) in green = Non-cancerous
    - M (Malignant) in red = Cancerous
    - Probability percentage
    - Confidence level (High/Medium/Low)
- Try Another:
  - Click "Load Random Example" for a different sample
  - Or "Clear Form" to start fresh
Each field has:
- Label: The feature name
- ? Icon: Hover for detailed explanation
- Help Text: Brief description below the label
- Input Field: Enter the numeric value
Field Categories:
- Mean Features: Average measurements across all cells (10 fields)
- Standard Error Features: Variation/uncertainty in measurements (10 fields)
- Worst Features: Most abnormal/severe measurements (10 fields)
- Responsive Layout: Works on desktop, tablet, and mobile
- Color-Coded Results:
- Green background for Benign (B) results
- Red background for Malignant (M) results
- Clear Visual Hierarchy: Organized sections for easy navigation
- Interactive Tooltips: Hover over ? icons for explanations
- Plain Language: All medical terms explained in simple language
- Info Box: Overview of measurements and terminology at the top
- Result Explanations: Each prediction includes explanation of what it means
- Random Examples: Get different examples each time
- Real Data: Examples come from actual dataset
- Pre-filled Form: One-click to populate all fields
Base URL: http://localhost:8000
GET /: Returns the web interface HTML page.
GET /health: Health check endpoint to verify the API and model status.
Response:
{
"status": "healthy",
"model_loaded": true,
"model_name": "XGBoost"
}
GET /example: Get a random example from the dataset with proper precision formatting.
Response:
{
"radius_mean": 17.99,
"texture_mean": 10.38,
"perimeter_mean": 122.8,
...
}
POST /predict: Single prediction endpoint.
Request Body:
{
"radius_mean": 17.99,
"texture_mean": 10.38,
"perimeter_mean": 122.8,
"area_mean": 1001.0,
"smoothness_mean": 0.1184,
"compactness_mean": 0.2776,
"concavity_mean": 0.3001,
"concave_points_mean": 0.1471,
"symmetry_mean": 0.2419,
"fractal_dimension_mean": 0.07871,
"radius_se": 1.095,
"texture_se": 0.9053,
"perimeter_se": 8.589,
"area_se": 153.4,
"smoothness_se": 0.006399,
"compactness_se": 0.04904,
"concavity_se": 0.05373,
"concave_points_se": 0.01587,
"symmetry_se": 0.03003,
"fractal_dimension_se": 0.006193,
"radius_worst": 25.38,
"texture_worst": 17.33,
"perimeter_worst": 184.6,
"area_worst": 2019.0,
"smoothness_worst": 0.1622,
"compactness_worst": 0.6656,
"concavity_worst": 0.7119,
"concave_points_worst": 0.2654,
"symmetry_worst": 0.4601,
"fractal_dimension_worst": 0.1189
}
Response:
{
"prediction": "Malignant (M)",
"probability": 0.95,
"confidence": "High"
}
POST /predict/batch: Batch prediction for multiple samples at once.
Request Body:
[
{
"radius_mean": 17.99,
...
},
{
"radius_mean": 13.54,
...
}
]
Response:
{
"predictions": [
{
"prediction": "Malignant (M)",
"probability": 0.95,
"confidence": "High"
},
{
"prediction": "Benign (B)",
"probability": 0.12,
"confidence": "High"
}
],
"count": 2
}
import requests
url = "http://localhost:8000/predict"
data = {
"radius_mean": 17.99,
"texture_mean": 10.38,
# ... all other features
}
response = requests.post(url, json=data)
result = response.json()
print(f"Prediction: {result['prediction']}")
print(f"Probability: {result['probability']*100:.2f}%")
print(f"Confidence: {result['confidence']}")
curl -X POST "http://localhost:8000/predict" \
-H "Content-Type: application/json" \
-d '{
"radius_mean": 17.99,
"texture_mean": 10.38,
...
}'
- Random Forest: Ensemble learning with decision trees
- XGBoost: Gradient boosting framework
- CatBoost: Gradient boosting with categorical features support
- LightGBM: Fast gradient boosting framework
- SVM (Support Vector Machine): Kernel-based classification
- Neural Network: Multi-layer perceptron classifier
Each model is evaluated using:
- Accuracy: Overall correctness of predictions
- Precision: Correctness of positive (malignant) predictions
- Recall: Ability to find all malignant cases
- F1-Score: Harmonic mean of precision and recall (used for model selection)
- Cross-Validation: 5-fold CV for robust performance estimation
The model with the highest F1-score is automatically selected as the best model. F1-score balances precision and recall, which suits medical screening tasks where both false positives (unnecessary follow-up) and false negatives (missed malignancies) carry a cost.
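To make the metric definitions concrete, here is a minimal sketch that derives all four from confusion-matrix counts, with malignant as the positive class. The `classification_metrics` helper and the counts passed to it are illustrative assumptions, not results from this project.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts.

    tp/fp/fn/tn are true positives, false positives, false negatives and
    true negatives, with malignant treated as the positive class.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Illustrative counts only (not results from this project).
metrics = classification_metrics(tp=40, fp=2, fn=2, tn=70)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because F1 ignores true negatives, it can differ noticeably from accuracy when the classes are imbalanced.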
The model returns one of two predictions:
Benign (B):
- Meaning: The cells appear normal and healthy
- Result: The tissue is NOT cancerous
- Display: Shown with green color
- What to do: Continue regular monitoring as recommended by your doctor
Malignant (M):
- Meaning: The cells show abnormalities that may indicate cancer
- Result: The tissue may be cancerous
- Display: Shown with red color
- What to do: Consult a healthcare professional immediately for proper medical evaluation and diagnosis
- High: Probability ≥ 80% or ≤ 20% (very confident prediction)
- Medium: Probability between 70-80% or 20-30% (moderately confident)
- Low: Probability between 30-70% (less confident, may require more tests)
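The confidence bands above can be expressed as a small function. This is a sketch rather than the code in app.py: the `confidence_level` name is hypothetical, and treating exactly 70% and 30% as Medium is an assumption, since the listed ranges share those endpoints.

```python
def confidence_level(probability):
    """Map the malignant-class probability (0.0 to 1.0) to a confidence label.

    Boundary handling is an assumption: 0.8 and 0.2 count as High,
    0.7 and 0.3 count as Medium.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if probability >= 0.8 or probability <= 0.2:
        return "High"
    if probability >= 0.7 or probability <= 0.3:
        return "Medium"
    return "Low"


for p in (0.95, 0.12, 0.75, 0.5):
    print(p, confidence_level(p))
```

Note that confidence is symmetric around 50%: a probability of 0.12 is a High-confidence benign prediction, not a low-confidence one.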
The probability value indicates how confident the model is that the sample is malignant (M). For example:
- 0.95 = 95% probability of being malignant
- 0.12 = 12% probability of being malignant (88% chance it's benign)
- Go to render.com and sign up
- Click "New +" → "Web Service"
- Connect your GitHub repository: ShaonINT/Breast_Cancer_Detection
- Configure:
  - Build Command: pip install -r requirements.txt
  - Start Command: python app.py
- Click "Create Web Service"
- Your app will be live in ~2 minutes!
📖 Detailed Guide: See RENDER_DEPLOYMENT.md
- Go to railway.app
- Connect GitHub repository
- Railway auto-detects and deploys
- Done!
docker build -t breast-cancer-detection .
docker run -p 8000:8000 breast-cancer-detection
See DEPLOYMENT_HOSTING.md for detailed instructions.
- Quick Start: DEPLOYMENT_QUICKSTART.md
- Comprehensive Guide: DEPLOYMENT.md
- Traditional Hosting: DEPLOYMENT_HOSTING.md
- Render.com: RENDER_DEPLOYMENT.md
Kaggle Dataset: Breast Cancer Dataset
The dataset is available on Kaggle and is based on the Wisconsin Breast Cancer Database (WBCD). This dataset is widely used in machine learning research for breast cancer classification tasks.
The dataset contains features computed from digitized images of Fine Needle Aspirate (FNA) samples of breast masses. FNA is a diagnostic procedure where a thin needle is used to extract a small sample of cells from a breast mass for microscopic examination.
Key Characteristics:
- Source: Features are derived from images of cell nuclei obtained through FNA procedures
- Purpose: Classify breast masses as benign (non-cancerous) or malignant (cancerous)
- Medical Context: This is a binary classification problem critical for early breast cancer detection
- Data Quality: Well-curated dataset with minimal missing values
- Total Samples: 569 instances
- Features: 30 numerical features
- Target Distribution:
- B (Benign): 357 cases (62.7%) - Non-cancerous tissue
- M (Malignant): 212 cases (37.3%) - Cancerous tissue
- Class Imbalance: Slight imbalance toward benign cases (typical in medical datasets)
The 30 features are organized into three categories, each measuring 10 different characteristics of cell nuclei:
- Mean Features (10 features): Average measurements across all cells in the image
  - Provides overall characterization of cell nuclei
- Standard Error Features (10 features): Standard error (variation) in measurements
  - Indicates consistency and variability across cells
- Worst Features (10 features): Largest (worst/most abnormal) measurements found, computed as the mean of the three largest values
  - Captures the most severe abnormalities in cell nuclei
Each of the 10 measured characteristics provides different insights into cell structure:
- Radius: Distance from center to points on the perimeter (size indicator)
- Texture: Standard deviation of gray-scale values (surface appearance variation)
- Perimeter: Distance around the boundary of the cell nucleus
- Area: Size of the cell nucleus (often larger in malignant cells)
- Smoothness: Local variation in radius lengths (boundary smoothness)
- Compactness: Perimeter² / Area - 1.0 (how circular/compact the shape is)
- Concavity: Severity of concave portions of the contour (indentations)
- Concave Points: Number of concave portions (frequency of indentations)
- Symmetry: How symmetrical the cell nucleus is (normal cells are more symmetrical)
- Fractal Dimension: Complexity of the boundary ("coastline approximation")
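The 30 field names used by the API follow directly from these 10 characteristics and the three categories. The sketch below rebuilds them and checks a payload for missing fields before it is sent; `FEATURE_NAMES` and `missing_features` are hypothetical helpers for illustration, not part of app.py.

```python
# The 10 measured characteristics, spelled as in the API field names.
CHARACTERISTICS = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave_points", "symmetry",
    "fractal_dimension",
]
CATEGORIES = ["mean", "se", "worst"]

# Category-major order, matching the request body shown in the API section.
FEATURE_NAMES = [f"{c}_{cat}" for cat in CATEGORIES for c in CHARACTERISTICS]


def missing_features(payload):
    """Return the expected feature names that are absent from a payload."""
    return [name for name in FEATURE_NAMES if name not in payload]


partial = {"radius_mean": 17.99, "texture_mean": 10.38}
print(len(FEATURE_NAMES))
print(len(missing_features(partial)))
```

Running a check like this on the client side gives a clearer error than a rejected request when a field is misspelled or left out.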
This dataset is particularly valuable because:
- Early Detection: Enables classification based on cell characteristics visible in FNA samples
- Minimally Invasive: FNA is less invasive than surgical biopsies
- Fast Results: Can provide quicker preliminary diagnosis
- Pattern Recognition: Machine learning models can identify subtle patterns not easily visible to the human eye
In this project, the dataset is preprocessed by:
- Normalizing feature names (replacing spaces with underscores)
- Encoding target variable (M=1, B=0)
- Handling missing values (if any)
- Splitting into training (80%) and testing (20%) sets with stratification
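The preprocessing steps above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code in train_models.py: the helper names, the seed, and the per-class rounding are all hypothetical, and a library splitter may round the set sizes differently.

```python
import random


def normalize_name(name):
    """Replace spaces in a column name with underscores."""
    return name.strip().replace(" ", "_")


def encode_target(label):
    """Encode the diagnosis label: M (malignant) -> 1, B (benign) -> 0."""
    return {"M": 1, "B": 0}[label]


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_indices, test_indices), splitting each class separately
    so both sets keep roughly the original class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Synthetic labels with the dataset's class counts (357 B, 212 M).
labels = [encode_target("B")] * 357 + [encode_target("M")] * 212
train_idx, test_idx = stratified_split(labels)
print(normalize_name("concave points_mean"))
print(len(train_idx), len(test_idx))
```

Splitting per class keeps the malignant share of the test set close to the 37.3% seen in the full dataset, which a plain random split does not guarantee.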
Note: For detailed explanations of each feature, hover over the ? icons in the web interface!
IMPORTANT: This tool is for educational and research purposes only.
- ❌ NOT a substitute for professional medical diagnosis
- ❌ NOT intended for clinical decision-making
- ✅ ALWAYS consult with a qualified healthcare provider
- ✅ Results should be interpreted by medical professionals
This application does not provide medical advice. Always seek the advice of a physician or other qualified health provider with any questions regarding a medical condition.
- Data Preprocessing:
  - Remove ID column
  - Normalize feature names (spaces to underscores)
  - Remove unnamed/empty columns
  - Encode target variable (M=1, B=0)
  - Handle missing values
- Data Splitting:
  - 80% training set
  - 20% test set
  - Stratified split to maintain class distribution
- Feature Scaling:
  - Automatic scaling for SVM and Neural Network
  - Other models use raw features
- Model Evaluation:
  - 5-fold cross-validation
  - Multiple metrics (Accuracy, Precision, Recall, F1)
  - Best model selection based on F1-score
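The feature scaling step deserves a short illustration. The sketch below assumes standardization (zero mean, unit variance); the pipeline description does not say which scaler is used, so treat that choice, the helper names, and the sample values as assumptions. The key point is that scaling parameters are learned from the training set only and then reused for test data.

```python
from statistics import mean, pstdev


def fit_standardizer(values):
    """Learn the mean and population standard deviation from training values."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        sigma = 1.0  # constant column: avoid division by zero
    return mu, sigma


def standardize(values, mu, sigma):
    """Apply previously learned parameters to any set of values."""
    return [(v - mu) / sigma for v in values]


# Illustrative radius_mean values only.
train_radius = [17.99, 13.54, 20.57, 11.42]
test_radius = [12.45]

mu, sigma = fit_standardizer(train_radius)
scaled_train = standardize(train_radius, mu, sigma)
scaled_test = standardize(test_radius, mu, sigma)  # reuses training parameters
print([round(v, 3) for v in scaled_train])
print([round(v, 3) for v in scaled_test])
```

Distance- and gradient-based models such as SVMs and neural networks are sensitive to feature magnitude, whereas tree-based models split on thresholds and are largely unaffected by it, which is consistent with scaling being applied only to the former.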
After training, the following files are generated:
- best_model.pkl: The selected best model (joblib format)
- model_metadata.pkl: Metadata including:
  - Model name
  - Feature names and order
  - Performance metrics
  - Whether scaler is needed
- scaler.pkl: Feature scaler (if required by the model)
# Test the API
python app.py
# Then visit http://localhost:8000/health
# Test predictions
curl http://localhost:8000/predict -X POST -H "Content-Type: application/json" -d '{...}'
- train_models.py: Model comparison and training pipeline
- app.py: FastAPI application with endpoints and model loading
- static/index.html: Complete web interface with explanations
- Deployment files: Docker, Render, Heroku, hosting configurations
- Start the server: python app.py
- Open browser: http://localhost:8000
- Click "Load Random Example"
- Click "Get Prediction"
- View result: M (Malignant) or B (Benign)
import requests
# Get a random example
example = requests.get('http://localhost:8000/example').json()
# Make a prediction
prediction = requests.post(
'http://localhost:8000/predict',
json=example
).json()
print(f"Result: {prediction['prediction']}")
print(f"Confidence: {prediction['confidence']}")
import requests
examples = [
requests.get('http://localhost:8000/example').json()
for _ in range(5)
]
results = requests.post(
'http://localhost:8000/predict/batch',
json=examples
).json()
for i, result in enumerate(results['predictions']):
    print(f"Sample {i+1}: {result['prediction']}")
- Mean: Average value across all cells
- SE (Standard Error): How much values vary
- Worst: Most abnormal/severe value
- Benign (B): Non-cancerous (healthy)
- Malignant (M): Cancerous (abnormal)
Every field has:
- Help icon (?): Hover to see detailed explanation
- Brief description: Plain language summary
- Tooltip: Full explanation of what the measurement means
- DEPLOYMENT_QUICKSTART.md: Quick deployment guide
- DEPLOYMENT.md: Comprehensive deployment options
- DEPLOYMENT_HOSTING.md: Traditional hosting guide
- RENDER_DEPLOYMENT.md: Render.com specific guide
- deploy.sh: Quick deployment script
- setup_hosting.sh: Hosting setup automation
- start_server.sh: Server start script
Contributions are welcome! Please feel free to submit a Pull Request.
- Additional model architectures
- Enhanced feature engineering
- UI/UX improvements
- Documentation improvements
- Performance optimizations
This project is for educational and research purposes only.
Shaon Biswas
- GitHub: @ShaonINT
- Repository: Breast_Cancer_Detection
- Dataset Source: Breast Cancer Dataset on Kaggle by wasiqaliyasir
- Original Dataset: Wisconsin Breast Cancer Database (WBCD)
- Libraries: scikit-learn, XGBoost, CatBoost, LightGBM, FastAPI
- Community: Open source machine learning community
For issues, questions, or contributions:
- Open an issue on GitHub
- Check the deployment guides
- Review the documentation files
⭐ If you find this project useful, please consider giving it a star on GitHub!
