Comprehensive Exploratory Data Analysis & Preprocessing Pipeline
A Python-based end-to-end Exploratory Data Analysis (EDA) project aligned with UN Sustainable Development Goal 15 (Life on Land), focusing on Mountain Forest Cover and Degraded Mountain Land.
The project delivers a complete EDA + preprocessing pipeline that transforms raw SDG data into clean, insightful, and machine-learning-ready outputs.
- Analyze mountain land degradation trends
- Identify geographic, temporal, and environmental patterns
- Generate high-quality visual insights
- Prepare clean and ML-ready datasets
- Ensure reproducibility using requirements.txt
- Dataset structure & feature inspection
- Numerical vs categorical feature separation
- Summary statistics and profiling
- Missing value detection
- Duplicate identification
- Outlier detection using IQR method
- Duplicate removal
- Median imputation (numerical)
- Mode imputation (categorical)
- Most affected geographic regions
- Degradation by bioclimatic belt
- Land cover distribution
- Temporal trend analysis
- Correlation analysis
- Geographic degradation ranking
- Bioclimatic & land cover plots
- Time-series trends
- Distribution & correlation plots
- Label encoding of categorical features
- Feature scaling using StandardScaler
- Dimensionality reduction with PCA
- ML-ready dataset export
- Auto-generated EDA report
- Cleaned dataset export
- ML-ready dataset export
- Python 3.9+
- Pandas: Data manipulation
- NumPy: Numerical computation
- Matplotlib & Seaborn: Visualization
- SciPy: Statistical analysis
- Scikit-learn: Scaling, Encoding & PCA
- OpenPyXL: Excel file handling
SDG_ExploratoryDataAnalysis/
│
├── data/
│   └── Goal15.xlsx
│
├── outputs/
│   ├── plots/
│   │   ├── geographic_analysis.png
│   │   ├── bioclimatic_landcover.png
│   │   ├── temporal_trends.png
│   │   └── comprehensive_analysis.png
│   │
│   └── results/
│       ├── cleaned_data.csv
│       ├── ml_ready_data.csv
│       └── eda_report.txt
│
├── main.py
├── requirements.txt
└── README.md
- Python 3.9 or higher
- pip package manager
All required libraries are listed in requirements.txt.
Install them using:
pip install -r requirements.txt
This ensures environment consistency and reproducibility.
git clone https://github.com/AnkeshGG/SDG_ExploratoryDataAnalysis.git
cd SDG_ExploratoryDataAnalysis
pip install -r requirements.txt
python main.py
- Geographic degradation ranking
- Bioclimatic belt & land cover analysis
- Temporal degradation trends
- Correlation and distribution plots
cleaned_data.csv: Cleaned dataset
ml_ready_data.csv: Encoded & scaled dataset
eda_report.txt: Detailed EDA summary and insights
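As a rough illustration of how an encoded and scaled file like ml_ready_data.csv could be produced with the scikit-learn tools this project names (label encoding, StandardScaler, PCA), consider the sketch below. The function `make_ml_ready`, the choice of two principal components, and the toy column names are assumptions made for the example; the actual logic in main.py may differ.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import LabelEncoder, StandardScaler


def make_ml_ready(df: pd.DataFrame, n_components: int = 2) -> pd.DataFrame:
    """Label-encode non-numeric columns, standardize all columns, append PCA components."""
    encoded = df.copy()
    for col in encoded.select_dtypes(exclude="number").columns:
        encoded[col] = LabelEncoder().fit_transform(encoded[col].astype(str))

    # Standardize every column to zero mean and unit variance.
    scaled = StandardScaler().fit_transform(encoded)
    result = pd.DataFrame(scaled, columns=encoded.columns, index=encoded.index)

    # Project the scaled features onto the leading principal components.
    components = PCA(n_components=n_components).fit_transform(scaled)
    for i in range(n_components):
        result[f"pc{i + 1}"] = components[:, i]
    return result


# Hypothetical cleaned data; column names are illustrative only.
cleaned = pd.DataFrame({
    "belt": ["Alpine", "Montane", "Nival", "Alpine"],
    "land_cover": ["Forest", "Grassland", "Forest", "Cropland"],
    "year": [2000, 2005, 2010, 2015],
    "degraded_area": [10.0, 12.5, 9.0, 14.0],
})

ml_ready = make_ml_ready(cleaned)
# ml_ready.to_csv("outputs/results/ml_ready_data.csv", index=False)
print(ml_ready.round(3))
```

Label encoding assigns arbitrary integer codes, which implies an ordering that unordered categories such as land cover type do not have. One-hot encoding is a common alternative when the downstream model is sensitive to that.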
UN SDG Goal: 15 (Life on Land)
Indicator: 15.4 (Mountain land degradation)
Key Attributes:
- Geographic Area
- Time Period
- Bioclimatic Belt
- Land Cover Type
- Degraded Area (sq. km)
Note: If the dataset file is missing, the pipeline automatically generates realistic sample data for demonstration purposes.
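One way such a fallback could work is sketched below. The column names follow the key attributes listed above, but the functions `load_dataset` and `make_sample_data`, the category values, the row count, and the gamma distribution for degraded area are all assumptions for illustration, not the project's actual generator.

```python
from pathlib import Path

import numpy as np
import pandas as pd


def make_sample_data(n_rows: int = 200, seed: int = 42) -> pd.DataFrame:
    """Build a synthetic frame with the README's key attributes (values are made up)."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "Geographic Area": rng.choice(["Asia", "Europe", "Africa", "Americas"], n_rows),
        "Time Period": rng.choice([2000, 2005, 2010, 2015, 2018], n_rows),
        "Bioclimatic Belt": rng.choice(["Nival", "Alpine", "Montane"], n_rows),
        "Land Cover Type": rng.choice(["Forest", "Grassland", "Cropland"], n_rows),
        # Right-skewed, non-negative areas; the distribution is an arbitrary choice.
        "Degraded Area (sq. km)": rng.gamma(shape=2.0, scale=500.0, size=n_rows).round(2),
    })


def load_dataset(path: str = "data/Goal15.xlsx") -> pd.DataFrame:
    """Read the Excel file if it exists, otherwise fall back to synthetic data."""
    file = Path(path)
    if file.exists():
        return pd.read_excel(file)  # reading .xlsx requires openpyxl
    return make_sample_data()


df = load_dataset("path/that/does/not/exist.xlsx")
print(df.head())
```

Fixing the random seed keeps the demonstration output identical across runs, which fits the project's reproducibility goal.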
- Environmental & sustainability analysis
- Climate and forest degradation studies
- Machine learning feature engineering
- Academic mini/major projects
- Interactive dashboards (Streamlit / Plotly)
- Degradation trend forecasting
- Region-wise clustering
- Integration with live UN SDG APIs
- Predictive ML models
Contributions are welcome!
- Fork the repository
- Create a feature branch
- Commit your changes
- Push to your fork
- Submit a pull request
This project is licensed under the MIT License.
Ankesh Kumar
CSE Undergraduate | Data & ML Enthusiast
- GitHub: AnkeshGG
- LinkedIn: Ankesh Kumar
- Medium: ankeshGG
This project demonstrates how structured EDA and preprocessing pipelines can convert raw sustainability data into insightful, reproducible, and ML-ready outputs, contributing toward UN SDG Goal 15 (Life on Land).
If you found this useful, consider starring the repository.