This is the repository of Group 6 for the Data Mining course 2024/2025 at the University of Pisa. It contains the code and notebooks used to analyze the dataset provided for the project. You can read our final report here: Final Report
The repository is organized as follows:
📂 .
├── 🛠️ environment.yml
├── 📄 project.pdf
├── 📘 README.md
└── 📂 src
    ├── 📁 dataset
    ├── 🐍 generic_utils.py
    ├── 📂 task1_data_understanding
    │   ├── 📒 cyclist_analysis.ipynb
    │   ├── 📒 data_distribution_refined.ipynb
    │   ├── 🐍 dataunderstanding.py
    │   ├── 📒 races_analysis.ipynb
    │   ├── 🐍 transformations.py
    │   └── 🐍 utils.py
    ├── 📂 task2_data_transformation
    │   ├── 📒 feature engineering_cyclists.ipynb
    │   ├── 📒 feature_engineering.ipynb
    │   ├── 📒 outlier_detection.ipynb
    │   ├── 📒 races_understanding.ipynb
    │   └── 🐍 utils.py
    ├── 📂 task3_clustering
    │   ├── 📒 dbscan.ipynb
    │   ├── 📒 hierarchical.ipynb
    │   ├── 📒 kmeans_clustering.ipynb
    │   ├── 📒 optics.ipynb
    │   ├── 🐍 transformations.py
    │   └── 🐍 utils.py
    ├── 📂 task4_prediction
    │   ├── 📒 decisione_trees_classification .ipynb
    │   ├── 📒 knn_classification.ipynb
    │   ├── 🗂️ params_dt
    │   ├── 🗂️ params_knn
    │   ├── 🗂️ params_ripper
    │   ├── 📒 ripper_classification.ipynb
    │   ├── 📒 bagging_classification.ipynb
    │   ├── 📒 decision_trees_classification .ipynb
    │   ├── 📒 nn_classification.ipynb
    │   ├── 📒 boosting.ipynb
    │   └── 🐍 preprocessing.py
    └── 📂 task5_xai
        ├── 📒 bagging_explanation.ipynb
        ├── 🐍 preprocessing.py
        ├── 🐍 transformations.py
        └── 📒 xgbc_explanation.ipynb

The repository contains the following files and folders:
- `environment.yml`: file describing the environment used for the project.
- `project.pdf`: file containing the instructions for the project.
- `README.md`: file containing the description of the repository.
- `src`: folders with the scripts and notebooks used for the analysis.
  - `dataset`: folder containing the dataset used for the project.
  - `task1_data_understanding`: files for the first task of the project.
    - `cyclist_analysis.ipynb`: notebook containing the analysis of the cyclists' dataframe.
    - `data_distribution_refined.ipynb`: notebook containing the analysis of the distribution of the data.
    - `dataunderstanding.py`: utility functions for the data understanding task.
    - `races_analysis.ipynb`: notebook containing the analysis of the races' dataframe.
    - `transformations.py`: utility functions for normalizing the data.
    - `utils.py`: utility functions for the data understanding task.
  - `task2_data_transformation`: files for the second task of the project.
    - `feature engineering_cyclists.ipynb`: notebook containing the feature engineering and the updated understanding of the cyclists' dataframe.
    - `feature_engineering.ipynb`: notebook containing the feature engineering of the races' dataframe.
    - `outlier_detection.ipynb`: notebook containing the outlier detection for both the cyclists and the races.
    - `races_understanding.ipynb`: notebook containing the data understanding of the new features of the races.
    - `utils.py`: utility functions for the data transformation task.
  - `task3_clustering`: files for the third task of the project.
    - `dbscan.ipynb`: notebook containing the DBSCAN clustering of the data.
    - `hierarchical.ipynb`: notebook containing the hierarchical clustering of the data.
    - `kmeans_clustering.ipynb`: notebook containing the KMeans clustering of the data.
    - `optics.ipynb`: notebook containing the OPTICS clustering of the data.
    - `transformations.py`: utility functions for normalizing the data.
    - `utils.py`: utility functions for the clustering task.
  - `task4_prediction`: files for the fourth task of the project.
    - `decision_trees_classification .ipynb`: notebook containing the decision tree classification of the data.
    - `knn_classification.ipynb`: notebook containing the KNN classification of the data.
    - `params_dt`: folder containing the parameters for the decision tree classification.
    - `params_knn`: folder containing the parameters for the KNN classification.
    - `params_ripper`: folder containing the parameters for the RIPPER classification.
    - `ripper_classification.ipynb`: notebook containing the RIPPER classification of the data.
    - `bagging_classification.ipynb`: notebook containing the bagging classification of the data.
    - `decisione_trees_classification .ipynb`: notebook containing the decision tree classification of the data.
    - `nn_classification.ipynb`: notebook containing the neural network classification of the data.
    - `boosting.ipynb`: notebook containing the boosting classification of the data.
    - `preprocessing.py`: utility functions for preprocessing the data.
  - `task5_xai`: files for the fifth task of the project.
    - `bagging_explanation.ipynb`: notebook containing the explanation of the bagging classification.
    - `preprocessing.py`: utility functions for preprocessing the data.
    - `transformations.py`: utility functions for normalizing the data.
    - `xgbc_explanation.ipynb`: notebook containing the explanation of the XGBoost classification.
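
To get a quick overview of the layout described above, the notebooks can be listed per task folder with a few lines of Python. This is only a minimal sketch: `notebooks_by_task` is a hypothetical helper that is not part of the repository, and it assumes the task folders sit directly under `src` and are named `task*`, as in the tree above.

```python
from collections import defaultdict
from pathlib import Path


def notebooks_by_task(src_dir):
    """Group notebook file names by the task folder that contains them.

    Hypothetical helper for illustration; it is not shipped with the repository.
    """
    grouped = defaultdict(list)
    # Match only notebooks that live directly inside a "task*" folder.
    for notebook in sorted(Path(src_dir).glob("task*/*.ipynb")):
        grouped[notebook.parent.name].append(notebook.name)
    return dict(grouped)


if __name__ == "__main__":
    # Assumes the script is run from the repository root.
    for task, notebooks in notebooks_by_task("src").items():
        print(task)
        for name in notebooks:
            print(f"  {name}")
```

Run from the repository root, this prints each task folder followed by the notebooks it contains. Python modules such as `utils.py` and the `params_*` folders are skipped on purpose, since the pattern only matches `.ipynb` files.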