Predict order profitability and delivery performance using historical Amazon sales data.
- Classification
- Regression
We are a team of data analysts who work with data from start to finish: cleaning it, analyzing it, and presenting insights in a clear and meaningful way. This project applies machine learning to an Amazon sales dataset.
The presentation is available here.
The analysis uses data from Amazon India covering 2022-03-31 to 2022-06-29:
| Dataset | Source | Purpose |
|---|---|---|
| Amazon Sales Report | Kaggle: `mdsazzatsardar/amazonsalesreport/` | Core data for orders across 20+ Indian states. |
- The initial day focused on exploratory data analysis (EDA) and defining the analytical framework.
- Acquired Amazon sales data from Kaggle.
- Performed initial exploratory data analysis (EDA) to understand structure, distributions, and class imbalance.
- Cleaned and standardized column names
- Handled missing values and inconsistent data types
- Began feature engineering to improve the models' predictive capability.
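
The cleaning steps above can be sketched with pandas roughly as follows. This is a minimal illustration, not the project's actual notebook code: the column names (`Order ID`, `Amount`, `ship-state`), the sample rows, and the median fill strategy are all assumptions made for the example.

```python
import pandas as pd


def standardize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with snake_case column names."""
    out = df.copy()
    out.columns = (
        out.columns.str.strip()
        .str.lower()
        .str.replace(r"[^a-z0-9]+", "_", regex=True)
        .str.strip("_")
    )
    return out


def clean_orders(df: pd.DataFrame) -> pd.DataFrame:
    """Standardize names, fix dtypes, and fill missing values."""
    out = standardize_columns(df)
    # Coerce amounts to numeric; unparseable entries become NaN.
    out["amount"] = pd.to_numeric(out["amount"], errors="coerce")
    # Assumed strategy: fill missing amounts with the column median.
    out["amount"] = out["amount"].fillna(out["amount"].median())
    # Fill missing states, then normalize whitespace and casing.
    out["ship_state"] = out["ship_state"].fillna("unknown").str.strip().str.lower()
    return out


# Hypothetical sample rows, not taken from the Kaggle dataset.
raw = pd.DataFrame(
    {
        "Order ID ": ["A1", "A2", "A3", "A4"],
        "Amount": ["100.0", None, "300", "bad"],
        "ship-state": ["Maharashtra", " KARNATAKA", None, "Delhi"],
    }
)
cleaned = clean_orders(raw)
print(cleaned)
```

Median imputation is only one option; dropping rows with missing amounts may be more appropriate when the amount is itself the prediction target.
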
- Implemented a K-Nearest Neighbors (KNN) classification model and three regression models: Linear Regression, Decision Tree, and Random Forest.
- Applied feature scaling and categorical encoding
- Split data into stratified training and test sets
- Evaluated classification performance using accuracy and regression performance using R², MAE, and RMSE
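
A minimal sketch of the classification workflow described above, using scikit-learn. The data is synthetic and the feature names (`amount`, `qty`, `category`), the 80/20 split for the classifier, and `n_neighbors=5` are assumptions for illustration; the project's real features and hyperparameters may differ.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic, deliberately imbalanced stand-in for the order data.
rng = np.random.default_rng(0)
n = 200
orders = pd.DataFrame(
    {
        "amount": rng.uniform(100, 2000, size=n),
        "qty": rng.integers(1, 5, size=n),
        "category": rng.choice(["kurta", "set", "top"], size=n),
        "status": rng.choice(
            ["Delivered", "Cancelled", "Returned"], size=n, p=[0.8, 0.12, 0.08]
        ),
    }
)
X = orders.drop(columns="status")
y = orders["status"]

# stratify=y keeps the class proportions the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scale numeric features (KNN is distance-based) and one-hot encode categoricals.
preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), ["amount", "qty"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["category"]),
    ]
)
knn = Pipeline(
    [("preprocess", preprocess), ("model", KNeighborsClassifier(n_neighbors=5))]
)
knn.fit(X_train, y_train)
accuracy = knn.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```

Wrapping the preprocessing in a `Pipeline` means the scaler and encoder are fitted on the training split only, which avoids leaking test-set statistics into the model.
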
- Analyzed model performance using accuracy, confusion matrix, precision, recall, and F1-score
- Identified limitations caused by class imbalance
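
To illustrate why accuracy alone can hide weak minority-class performance, here is a small dependency-free sketch that computes per-class precision, recall, and F1 from confusion counts. The labels and predictions are made-up toy values, not the project's results, and `per_class_report` is a hypothetical helper name.

```python
from collections import Counter


def per_class_report(y_true, y_pred):
    """Compute precision, recall, and F1 for every label."""
    # counts[(true, predicted)] is one cell of the confusion matrix.
    counts = Counter(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    report = {}
    for label in labels:
        tp = counts[(label, label)]
        fp = sum(counts[(other, label)] for other in labels if other != label)
        fn = sum(counts[(label, other)] for other in labels if other != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (
            2 * precision * recall / (precision + recall)
            if precision + recall
            else 0.0
        )
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report


# Toy imbalanced example: 18 delivered orders, 2 cancelled orders.
y_true = ["Delivered"] * 18 + ["Cancelled"] * 2
# The model misses one of the two cancelled orders.
y_pred = ["Delivered"] * 18 + ["Delivered", "Cancelled"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
report = per_class_report(y_true, y_pred)
print(f"accuracy = {accuracy:.2f}")
for label, scores in report.items():
    print(label, {name: round(value, 2) for name, value in scores.items()})
```

In this toy case overall accuracy is 95%, yet recall for the rare `Cancelled` class is only 50%, which is the same pattern the imbalance limitation describes.
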
- Created visualizations for presentation
- Prepared final project slides
- Accuracy: ~92%
- Strong performance for Delivered orders
- Lower performance for Cancelled and Returned orders due to class imbalance
- Three models were evaluated using an 80/20 train–test split: Linear Regression, Decision Tree, and Random Forest.
- Among the tested models, Random Forest achieved the best overall performance.
- Random Forest produced the highest R² score (0.52), explaining the most variance in the target variable.
- It also achieved the lowest MAE (135.6) and lowest RMSE (186.1), indicating more accurate predictions compared to the other models.
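
For reference, the three regression metrics reported above can be computed as in the sketch below. The input values are toy numbers chosen for easy arithmetic, not the project's predictions, and `regression_metrics` is a hypothetical helper name.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, and R² for paired true and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    # MAE: average absolute error, in the target's own units.
    mae = sum(abs(e) for e in errors) / n
    # RMSE: square root of the mean squared error; penalizes large misses more.
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    # R²: share of the target's variance explained by the predictions.
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot
    return {"mae": mae, "rmse": rmse, "r2": r2}


# Toy values for illustration only.
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 320.0, 380.0]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

Read this way, an R² of 0.52 means roughly half of the variance in the target is explained by the model, with the remainder left unexplained.
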
- Dataset is highly imbalanced
- KNN struggles to identify rare classes
- Performance depends heavily on the quality and scope of the available data.
- Random Forest models are less interpretable than simpler models like Linear Regression.
- Results are based on a single train–test split and may vary with different data partitions.
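
One common way to check how much results depend on a particular split is k-fold cross-validation. The sketch below is not part of the original analysis: it uses synthetic data from `make_regression`, and the choice of 5 folds and 50 trees is an assumption for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data standing in for the real feature matrix.
X, y = make_regression(n_samples=200, n_features=5, noise=20.0, random_state=0)

model = RandomForestRegressor(n_estimators=50, random_state=0)
folds = KFold(n_splits=5, shuffle=True, random_state=0)

# One R² score per fold; the spread shows sensitivity to the data partition.
scores = cross_val_score(model, X, y, cv=folds, scoring="r2")
print("R² per fold:", scores.round(3))
print(f"mean = {scores.mean():.3f}, std = {scores.std():.3f}")
```

A small standard deviation across folds would suggest the single-split results are stable; a large one would suggest they depend on which rows landed in the test set.
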
Overall, the machine learning models demonstrated strong predictive performance, with KNN effectively classifying delivery outcomes and Random Forest providing the most accurate regression results. While high accuracy was achieved, class imbalance and unexplained variance highlight the limitations of the data and models. These results show the value of machine learning for real-world analysis while emphasizing the need for careful evaluation and further model refinement.
Alan, Pati, Pedro, Charul.