Data Mining: Iris Data Exploration

This repository contains the solution for Project 1 of a Data Mining course. The project performs Exploratory Data Analysis (EDA) on the classic Iris Dataset: univariate and bivariate statistical analysis, correlation analysis, and data visualization, used together to understand the characteristics of the features and the relationships between them.

Dataset

The analysis is performed on the Iris Dataset, originally introduced by Ronald Fisher in 1936.

Source: UCI Machine Learning Repository
Instances: 150
Features: 4 (Sepal Length, Sepal Width, Petal Length, Petal Width)
Classes: 3 (Iris Setosa, Iris Versicolor, Iris Virginica)

Features

This project implements the following data mining tasks:

Univariate Summary Statistics (dist/statistics.csv): Calculates detailed statistics for sepal width grouped by species:
- Missing values, Min, Max, Mean
- Quartiles (Q1, Median, Q3), 95th Percentile
- Range, IQR, Standard Deviation (Sample & Population), MAD
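
The statistics listed above can be sketched without any third-party dependency. This is a minimal illustration, not the notebook's actual code: the `SAMPLE` rows are a small hand-typed stand-in for the dataset, `percentile` and `summarize` are hypothetical helper names, the percentile rule is assumed to be linear interpolation, and MAD is assumed to mean the median absolute deviation from the median (it could equally be the mean absolute deviation).

```python
import statistics
from collections import defaultdict

# Illustrative (species, sepal width in cm) pairs; a hand-typed stand-in,
# not the full 150-row dataset.
SAMPLE = [
    ("setosa", 3.5), ("setosa", 3.0), ("setosa", 3.2), ("setosa", 3.1), ("setosa", 3.6),
    ("versicolor", 3.2), ("versicolor", 3.2), ("versicolor", 3.1), ("versicolor", 2.3), ("versicolor", 2.8),
    ("virginica", 3.3), ("virginica", 2.7), ("virginica", 3.0), ("virginica", 2.9), ("virginica", 3.0),
]


def percentile(sorted_vals, p):
    """Percentile by linear interpolation between the two nearest ranks."""
    k = (len(sorted_vals) - 1) * p / 100
    lo = int(k)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (k - lo)


def summarize(values):
    """Return the summary statistics for one group; None marks a missing value."""
    present = sorted(v for v in values if v is not None)
    q1, median, q3 = (percentile(present, p) for p in (25, 50, 75))
    return {
        "missing": len(values) - len(present),
        "min": present[0],
        "max": present[-1],
        "mean": statistics.fmean(present),
        "q1": q1,
        "median": median,
        "q3": q3,
        "p95": percentile(present, 95),
        "range": present[-1] - present[0],
        "iqr": q3 - q1,
        "std_sample": statistics.stdev(present),       # divides by n - 1
        "std_population": statistics.pstdev(present),  # divides by n
        # Assumption: MAD = median absolute deviation from the median.
        "mad": statistics.median(abs(v - median) for v in present),
    }


# Group the sepal widths by species, then summarize each group.
by_species = defaultdict(list)
for species, width in SAMPLE:
    by_species[species].append(width)

stats = {species: summarize(widths) for species, widths in by_species.items()}
```

Each entry of `stats` corresponds to one row of a per-species table; writing those rows out with the `csv` module would give a file shaped like dist/statistics.csv.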
Correlation Analysis (dist/correlation.csv): Computes the Pearson correlation matrix for all numerical features and identifies the minimum and maximum correlations.
- A 4x4 matrix of Pearson correlation coefficients between Sepal Length, Sepal Width, Petal Length, and Petal Width
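
As a rough sketch of the correlation step, the snippet below builds the 4x4 Pearson matrix by hand and picks the most and least correlated feature pairs. The `ROWS` values are a small illustrative sample rather than the full dataset, `pearson` is a hypothetical helper, and restricting the minimum/maximum search to off-diagonal pairs is an assumption about what the notebook reports.

```python
import math
from itertools import combinations

FEATURES = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# Illustrative rows in FEATURES order (cm); not the full dataset.
ROWS = [
    (5.1, 3.5, 1.4, 0.2), (4.9, 3.0, 1.4, 0.2), (4.7, 3.2, 1.3, 0.2),
    (7.0, 3.2, 4.7, 1.4), (6.4, 3.2, 4.5, 1.5), (5.5, 2.3, 4.0, 1.3),
    (6.3, 3.3, 6.0, 2.5), (5.8, 2.7, 5.1, 1.9), (7.1, 3.0, 5.9, 2.1),
]


def pearson(x, y):
    """Pearson's r: covariance divided by the product of standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# One list of values per feature.
columns = {name: [row[i] for row in ROWS] for i, name in enumerate(FEATURES)}

# Full 4x4 matrix, stored as a dict of dicts.
matrix = {a: {b: pearson(columns[a], columns[b]) for b in FEATURES} for a in FEATURES}

# Assumption: min/max are taken over the six distinct off-diagonal pairs.
pairs = {(a, b): matrix[a][b] for a, b in combinations(FEATURES, 2)}
max_pair = max(pairs, key=pairs.get)
min_pair = min(pairs, key=pairs.get)
```

The diagonal of `matrix` is 1 by construction, which is why it is excluded when searching for the extreme correlations.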
Data Visualization:
- Histograms: Distribution of Petal Length and Sepal Width.
- 3D Histogram: Joint distribution of Petal Length and Sepal Width.
- Box Plots: Distribution of features per species to identify outliers.
- Scatter Plots: Pairwise relationships between all features, colored by species.
- 3D Scatter Plot: Separability analysis using Sepal Length, Sepal Width, and Petal Length.
- PDF Curves: Probability Density Function estimation for Petal Length per species.
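
To make the PDF-curve item concrete, here is one possible sketch that fits a normal distribution to Petal Length for each species and evaluates its density on a grid. The normal fit is an assumption (the notebook may use a kernel density estimate instead), the `PETAL_LENGTH` values are a small illustrative sample, and `pdf_curve` is a hypothetical helper. Plotting is left out so the example runs with the standard library only.

```python
from statistics import NormalDist

# Illustrative petal lengths (cm) per species; not the full dataset.
PETAL_LENGTH = {
    "setosa": [1.4, 1.4, 1.3, 1.5, 1.4],
    "versicolor": [4.7, 4.5, 4.9, 4.0, 4.6],
    "virginica": [6.0, 5.1, 5.9, 5.6, 5.8],
}


def pdf_curve(values, grid):
    """Fit a normal distribution to the values and evaluate its PDF on the grid."""
    dist = NormalDist.from_samples(values)  # sample mean and sample std dev
    return [dist.pdf(x) for x in grid]


# Evaluation grid: 0.0 to 8.0 cm in steps of 0.1 cm.
grid = [i / 10 for i in range(81)]

# One density curve per species, all on the same grid.
curves = {species: pdf_curve(values, grid) for species, values in PETAL_LENGTH.items()}

# Grid point at which each fitted density is highest.
peaks = {
    species: grid[max(range(len(grid)), key=curve.__getitem__)]
    for species, curve in curves.items()
}
```

Each list in `curves` can be drawn against `grid` with any plotting library; well-separated peaks are what suggest that Petal Length alone distinguishes the species reasonably well.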

Project Structure

.
├── src/
│   └── run.ipynb          # Main Jupyter Notebook containing the analysis code
├── dist/
│   ├── statistics.csv     # Generated univariate statistics
│   └── correlation.csv    # Generated correlation matrix
├── data/                  # Directory for dataset storage 
├── requirements.txt       # Python dependencies
└── README.md              # Project documentation

References

Fisher, R. A. (1936). "The use of multiple measurements in taxonomic problems". Annals of Eugenics.
Stevens, S. S. (1946). "On the theory of scales of measurement". Science.
UCI Machine Learning Repository: Iris Dataset

elhamabedi/iris-data-analysis

Folders and files

Latest commit

History

Repository files navigation

Data Mining: Iris Data Exploration

Dataset

Features

Project Structure

References

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages