An end-to-end Python-based exploratory data analysis and preprocessing pipeline for UN SDG Goal 15, focusing on mountain forest cover and degraded mountain land with ML-ready outputs.

🌲 UN SDG Goal 15 – Mountain Forest Cover

Comprehensive Exploratory Data Analysis & Preprocessing Pipeline


A Python-based end-to-end Exploratory Data Analysis (EDA) project aligned with UN Sustainable Development Goal 15 – Life on Land, focusing on Mountain Forest Cover and Degraded Mountain Land.

The project delivers a complete EDA + preprocessing pipeline that transforms raw SDG data into clean, insightful, and machine-learning-ready outputs.


🎯 Project Objective

  • Analyze mountain land degradation trends
  • Identify geographic, temporal, and environmental patterns
  • Generate high-quality visual insights
  • Prepare clean and ML-ready datasets
  • Ensure reproducibility using requirements.txt

🚀 Key Features

📊 Exploratory Data Analysis

  • Dataset structure & feature inspection
  • Numerical vs categorical feature separation
  • Summary statistics and profiling
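One way to approach the inspection steps above is to split columns by dtype and then profile the numerical ones. This is a minimal sketch rather than the code in main.py; the toy frame and its column names (`geo_area`, `year`, `bioclimatic_belt`, `degraded_area_sqkm`) are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy frame; the column names are illustrative, not the real schema.
df = pd.DataFrame({
    "geo_area": ["Nepal", "Peru", "Kenya"],
    "year": [2015, 2018, 2018],
    "bioclimatic_belt": ["Alpine", "Montane", "Montane"],
    "degraded_area_sqkm": [120.5, 340.0, 75.2],
})


def split_features(frame: pd.DataFrame) -> tuple[list[str], list[str]]:
    """Return (numerical, categorical) column names based on dtype."""
    numerical = frame.select_dtypes(include="number").columns.tolist()
    categorical = frame.select_dtypes(exclude="number").columns.tolist()
    return numerical, categorical


numerical_cols, categorical_cols = split_features(df)

# Summary statistics for the numerical columns only.
summary = df[numerical_cols].describe()
print(numerical_cols, categorical_cols)
print(summary)
```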

🔬 Data Quality Analysis

  • Missing value detection
  • Duplicate identification
  • Outlier detection using IQR method
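The three checks above can be sketched as follows. The IQR rule flags values outside `[Q1 - k*IQR, Q3 + k*IQR]`; the multiplier `k = 1.5` is the conventional default and is assumed here, as are the helper name `iqr_outlier_mask` and the column name `degraded_area_sqkm`.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask that is True where a value lies outside the IQR fences."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Comparisons with NaN are False, so missing values are never flagged.
    return (values < lower) | (values > upper)


# Hypothetical sample with one missing value, one repeated row, one extreme value.
df = pd.DataFrame({"degraded_area_sqkm": [10.0, 12.0, 11.0, 13.0, 12.0, 500.0, None]})

quality = {
    "missing": int(df["degraded_area_sqkm"].isna().sum()),
    "duplicates": int(df.duplicated().sum()),
    "outliers": int(iqr_outlier_mask(df["degraded_area_sqkm"]).sum()),
}
print(quality)
```

Detection is kept separate from removal here, so flagged rows can be reviewed before any are dropped.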

🧹 Data Cleaning

  • Duplicate removal
  • Median imputation (numerical)
  • Mode imputation (categorical)
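A minimal sketch of these cleaning steps, assuming a hypothetical `clean` helper and illustrative column names (`belt`, `degraded_area_sqkm`); the actual implementation in main.py may differ.

```python
import pandas as pd


def clean(frame: pd.DataFrame) -> pd.DataFrame:
    """Drop duplicate rows, then impute: median for numeric, mode for the rest."""
    cleaned = frame.drop_duplicates().reset_index(drop=True)
    for col in cleaned.columns:
        if cleaned[col].isna().sum() == 0:
            continue  # nothing to impute in this column
        if pd.api.types.is_numeric_dtype(cleaned[col]):
            fill = cleaned[col].median()
        else:
            modes = cleaned[col].mode(dropna=True)
            if modes.empty:
                continue  # column is entirely missing; leave it untouched
            fill = modes.iloc[0]
        cleaned[col] = cleaned[col].fillna(fill)
    return cleaned


# Hypothetical raw data: one duplicated row and one gap in each column.
raw = pd.DataFrame({
    "belt": ["Alpine", "Montane", "Montane", None, "Montane", "Montane"],
    "degraded_area_sqkm": [10.0, 20.0, 20.0, 30.0, None, 40.0],
})
cleaned = clean(raw)
print(cleaned)
```

Duplicates are removed before imputation so that repeated rows do not skew the median or the mode.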

📈 Statistical & Pattern Analysis

  • Most affected geographic regions
  • Degradation by bioclimatic belt
  • Land cover distribution
  • Temporal trend analysis
  • Correlation analysis
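Most of the analyses listed above reduce to grouped aggregations. The sketch below uses a small hypothetical frame; the column names and values are invented for illustration and are not taken from Goal15.xlsx.

```python
import pandas as pd

# Hypothetical data; values are made up for the example.
df = pd.DataFrame({
    "geo_area": ["Nepal", "Nepal", "Peru", "Peru", "Kenya", "Kenya"],
    "year": [2015, 2018, 2015, 2018, 2015, 2018],
    "bioclimatic_belt": ["Alpine", "Alpine", "Montane", "Montane", "Montane", "Montane"],
    "degraded_area_sqkm": [100.0, 150.0, 300.0, 320.0, 50.0, 80.0],
})

# Most affected regions: total degraded area per region, largest first.
top_regions = (
    df.groupby("geo_area")["degraded_area_sqkm"].sum().sort_values(ascending=False)
)

# Degradation by bioclimatic belt: mean degraded area per belt.
by_belt = df.groupby("bioclimatic_belt")["degraded_area_sqkm"].mean()

# Temporal trend: total degraded area per year.
yearly = df.groupby("year")["degraded_area_sqkm"].sum()

# Correlation between the numerical columns (Pearson).
corr = df[["year", "degraded_area_sqkm"]].corr(method="pearson")

print(top_regions, by_belt, yearly, corr, sep="\n\n")
```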

📊 Automated Visualizations

  • Geographic degradation ranking
  • Bioclimatic & land cover plots
  • Time-series trends
  • Distribution & correlation plots
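As an illustration of how one of these plots might be produced without a display, here is a hedged sketch of a ranking bar chart. The function name `plot_region_ranking`, the figure size, and the sample totals are assumptions; the output goes to a temporary directory so the example does not touch outputs/plots/.

```python
import tempfile
from pathlib import Path

import matplotlib

matplotlib.use("Agg")  # non-interactive backend, suitable for scripted runs
import matplotlib.pyplot as plt
import pandas as pd


def plot_region_ranking(totals: pd.Series, out_path: Path) -> Path:
    """Save a horizontal bar chart of degraded area per region."""
    ordered = totals.sort_values(ascending=True)  # largest bar ends up on top
    fig, ax = plt.subplots(figsize=(8, 4))
    ax.barh(list(ordered.index), ordered.to_numpy())
    ax.set_xlabel("Degraded area (sq. km)")
    ax.set_title("Degraded mountain land by geographic area")
    fig.tight_layout()
    out_path.parent.mkdir(parents=True, exist_ok=True)
    fig.savefig(out_path, dpi=150)
    plt.close(fig)  # release the figure's memory in long-running pipelines
    return out_path


# Hypothetical totals, invented for the example.
totals = pd.Series({"Peru": 620.0, "Nepal": 250.0, "Kenya": 130.0})
saved = plot_region_ranking(
    totals, Path(tempfile.mkdtemp()) / "plots" / "geographic_analysis.png"
)
print(saved)
```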

🤖 Machine Learning Preparation

  • Label encoding of categorical features
  • Feature scaling using StandardScaler
  • Dimensionality reduction with PCA
  • ML-ready dataset export
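The preparation steps above could be chained roughly as follows. This is a sketch under assumptions: the column names, the choice of two principal components, and the `pc1`/`pc2` output names are illustrative and not confirmed by the project.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Hypothetical cleaned data; values are invented for the example.
df = pd.DataFrame({
    "geo_area": ["Nepal", "Nepal", "Peru", "Peru", "Kenya", "Kenya"],
    "bioclimatic_belt": ["Alpine", "Montane", "Montane", "Alpine", "Montane", "Alpine"],
    "year": [2015, 2018, 2015, 2018, 2015, 2018],
    "degraded_area_sqkm": [100.0, 150.0, 300.0, 320.0, 50.0, 80.0],
})

# 1. Label-encode each categorical column, keeping the encoder for inverse lookups.
encoded = df.copy()
encoders = {}
for col in ["geo_area", "bioclimatic_belt"]:
    encoders[col] = LabelEncoder()
    encoded[col] = encoders[col].fit_transform(encoded[col].astype(str))

# 2. Standardize every column to zero mean and unit variance.
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(encoded), columns=encoded.columns)

# 3. Project onto two principal components (the count is an assumption).
pca = PCA(n_components=2)
components = pca.fit_transform(scaled)

ml_ready = scaled.assign(pc1=components[:, 0], pc2=components[:, 1])
print(ml_ready.head())
print("Explained variance ratio:", pca.explained_variance_ratio_)
```

Label encoding assigns arbitrary integer order to categories, which distance-based and linear models may treat as meaningful; one-hot encoding is a common alternative when that matters.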

📋 Report Generation

  • Auto-generated EDA report
  • Cleaned dataset export
  • ML-ready dataset export
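A minimal sketch of the export step, assuming a hypothetical `write_outputs` helper and an invented report layout; the real eda_report.txt likely contains more detail. The example writes to a temporary directory instead of outputs/results/.

```python
import tempfile
from pathlib import Path

import pandas as pd


def write_outputs(cleaned: pd.DataFrame, out_dir: Path) -> tuple[Path, Path]:
    """Write the cleaned CSV and a plain-text summary report."""
    out_dir.mkdir(parents=True, exist_ok=True)
    csv_path = out_dir / "cleaned_data.csv"
    report_path = out_dir / "eda_report.txt"

    cleaned.to_csv(csv_path, index=False)

    lines = [
        "EDA REPORT",
        f"Rows: {len(cleaned)}",
        f"Columns: {cleaned.shape[1]}",
        f"Missing values: {int(cleaned.isna().sum().sum())}",
        "",
        cleaned.describe().to_string(),
    ]
    report_path.write_text("\n".join(lines), encoding="utf-8")
    return csv_path, report_path


# Hypothetical cleaned data.
cleaned = pd.DataFrame({
    "year": [2015, 2018, 2020],
    "degraded_area_sqkm": [100.0, 150.0, 175.0],
})
csv_path, report_path = write_outputs(cleaned, Path(tempfile.mkdtemp()) / "results")
print(report_path.read_text(encoding="utf-8"))
```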

🛠️ Tech Stack

  • Python 3.9+
  • Pandas – Data manipulation
  • NumPy – Numerical computation
  • Matplotlib & Seaborn – Visualization
  • SciPy – Statistical analysis
  • Scikit-learn – Scaling, Encoding & PCA
  • OpenPyXL – Excel file handling

📁 Project Structure

SDG_ExploratoryDataAnalysis/
│
├── data/
│   └── Goal15.xlsx
│
├── outputs/
│   ├── plots/
│   │   ├── geographic_analysis.png
│   │   ├── bioclimatic_landcover.png
│   │   ├── temporal_trends.png
│   │   └── comprehensive_analysis.png
│   │
│   └── results/
│       ├── cleaned_data.csv
│       ├── ml_ready_data.csv
│       └── eda_report.txt
│
├── main.py
├── requirements.txt
└── README.md

⚙️ System Requirements

  • Python 3.9 or higher
  • pip package manager

📦 Dependencies

All required libraries are listed in requirements.txt.

Install them using:

pip install -r requirements.txt

Installing from requirements.txt helps keep environments consistent and results reproducible across machines.


⚑ Quick Start

1️⃣ Clone the Repository

git clone https://github.com/AnkeshGG/SDG_ExploratoryDataAnalysis.git
cd SDG_ExploratoryDataAnalysis

2️⃣ Install Dependencies

pip install -r requirements.txt

3️⃣ Run the EDA Pipeline

python main.py

📦 Generated Outputs

📊 Visualizations

  • Geographic degradation ranking
  • Bioclimatic belt & land cover analysis
  • Temporal degradation trends
  • Correlation and distribution plots

📁 Datasets

  • cleaned_data.csv – Cleaned dataset
  • ml_ready_data.csv – Encoded & scaled dataset

📄 Report

  • eda_report.txt – Detailed EDA summary and insights

📌 Dataset Description

UN SDG Goal: 15 – Life on Land
Target: 15.4 – Conservation of mountain ecosystems (degraded mountain land is reported under Indicator 15.4.2)

Key Attributes:

  • Geographic Area
  • Time Period
  • Bioclimatic Belt
  • Land Cover Type
  • Degraded Area (sq. km)

Note: If the dataset file is missing, the pipeline automatically generates synthetic sample data so that it can still be run for demonstration purposes.
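The fallback described in the note might look something like the sketch below. The function name `load_or_generate`, the column names, the category values, and the distributions are all assumptions made for illustration; they do not reflect the actual schema of Goal15.xlsx.

```python
from pathlib import Path

import numpy as np
import pandas as pd


def load_or_generate(path: str, n_rows: int = 200, seed: int = 42) -> pd.DataFrame:
    """Load the Excel file if present; otherwise build a synthetic stand-in."""
    file_path = Path(path)
    if file_path.exists():
        return pd.read_excel(file_path)  # .xlsx reading relies on openpyxl

    # Seeded generator so the synthetic data is identical on every run.
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "geo_area": rng.choice(["Nepal", "Peru", "Kenya", "Chile"], size=n_rows),
        "year": rng.integers(2000, 2021, size=n_rows),  # upper bound is exclusive
        "bioclimatic_belt": rng.choice(["Nival", "Alpine", "Montane"], size=n_rows),
        "land_cover": rng.choice(["Forest", "Grassland", "Cropland"], size=n_rows),
        # Right-skewed, non-negative areas (illustrative distribution choice).
        "degraded_area_sqkm": rng.gamma(shape=2.0, scale=50.0, size=n_rows).round(2),
    })


# Hypothetical path that is not expected to exist, to trigger the fallback.
df = load_or_generate("data/does_not_exist.xlsx")
print(df.head())
```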


💡 Use Cases

  • Environmental & sustainability analysis
  • Climate and forest degradation studies
  • Machine learning feature engineering
  • Academic mini/major projects

🧪 Future Enhancements

  • Interactive dashboards (Streamlit / Plotly)
  • Degradation trend forecasting
  • Region-wise clustering
  • Integration with live UN SDG APIs
  • Predictive ML models

🤝 Contributing

Contributions are welcome!

  1. Fork the repository
  2. Create a feature branch
  3. Commit your changes
  4. Push to your fork
  5. Submit a pull request

📄 License

This project is licensed under the MIT License.


👨‍💻 Author

Ankesh Kumar
CSE Undergraduate | Data & ML Enthusiast


🌍 Final Note

This project demonstrates how structured EDA and preprocessing pipelines can convert raw sustainability data into insightful, reproducible, and ML-ready outputs, contributing toward UN SDG Goal 15 – Life on Land.

⭐ If you found this useful, consider starring the repository. Built with love
