πŸ—οΈ Big-Data Salary Prediction project overview

End-to-end Spark / Hadoop project that ingests Kaggle job-description data, turns it into an analytics-ready Hive warehouse, runs Spark-SQL EDA, trains & tunes Spark-ML regression models, and surfaces everything in an Apache Superset dashboard.

High-level architecture of the project

```mermaid
flowchart LR
    %% ────────────────────────────
    %% Stage I — Data Collection
    %% ────────────────────────────
    subgraph STAGE1["Stage I — Data Collection"]
        direction TB
        Kaggle["Kaggle CLI<br/>(job-descriptions.csv)"] --> PostgreSQL["PostgreSQL"]
        PostgreSQL --> Sqoop["Sqoop Import"]
        Sqoop --> HDFS1["HDFS<br/>Avro (+ schema)"]
    end

    %% ────────────────────────────
    %% Stage II — Data Warehouse & EDA
    %% ────────────────────────────
    subgraph STAGE2["Stage II — Data Warehouse &<br/>EDA"]
        direction TB
        Hive["Hive Externals<br/>(partitioned & bucketed)"] --> SparkSQL["Spark SQL<br/>(6 analyses)"]
    end

    %% ────────────────────────────
    %% Stage III — Predictive Analytics
    %% ────────────────────────────
    subgraph STAGE3["Stage III — Predictive<br/>Analytics"]
        direction TB
        Preproc["Data Preprocessing<br/>(Spark DataFrame ops)"] --> SparkML["ML Modelling<br/>(Spark ML Pipeline)"]
        SparkML --> LR["Linear Regression"]
        SparkML --> GBT["Gradient-Boosted Trees"]
    end

    %% ────────────────────────────
    %% Stage IV — Presentation & Delivery
    %% ────────────────────────────
    subgraph STAGE4["Stage IV — Presentation &<br/>Delivery"]
        direction TB
        HiveExt["Hive Externals<br/>(metrics & predictions)"] --> Superset["Apache Superset<br/>Dashboards"]
    end

    %% ────────────────────────────
    %% Cross-stage flow (left → right)
    %% ────────────────────────────
    HDFS1 --> Hive
    SparkSQL --> Preproc
    LR --> HiveExt
    GBT --> HiveExt
```

✨ Key Features

  • One-click pipeline. 4 bash stages or a single main.sh.
  • Optimised Hive layout. Avro + partitioning (work_type) + bucketing (preference) for low scan cost.
  • Scalable ML. Linear Regression vs. Gradient-Boosted Trees with 3-fold CV, persisted in HDFS.
  • Metrics at a glance. KL-divergence, RMSE, R² and hyper-parameter grids ready for BI tools.
  • Dashboard ready. External Hive tables expose CSV outputs directly to Superset.
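The "Optimised Hive layout" bullet combines Avro storage with partitioning on work_type and bucketing on preference. A minimal sketch of what such a layout looks like as DDL, assembled in Python so the pieces are visible; the table name, column list, and bucket count here are assumptions for illustration, not the project's actual schema (which lives in sql/):

```python
def hive_layout_ddl(table, location, partition_col="work_type",
                    bucket_col="preference", n_buckets=8):
    """Build an illustrative DDL string for an external, Avro-backed
    Hive table, partitioned and bucketed as described above.
    Columns and bucket count are placeholders, not the real schema."""
    return (
        f"CREATE EXTERNAL TABLE {table} (\n"
        f"  job_title STRING,\n"
        f"  salary DOUBLE,\n"
        f"  {bucket_col} STRING\n"
        f")\n"
        f"PARTITIONED BY ({partition_col} STRING)\n"
        f"CLUSTERED BY ({bucket_col}) INTO {n_buckets} BUCKETS\n"
        f"STORED AS AVRO\n"
        f"LOCATION '{location}';"
    )

ddl = hive_layout_ddl("job_descriptions_ext", "/user/hive/warehouse/jobs")
print(ddl)
```

Partition pruning on work_type lets queries skip whole directories, and bucketing on preference keeps each file's scan cost bounded.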

🗂️ Repository Layout

```text
├── data/            # Raw download + ML data splits (synced from HDFS)
├── docs/
│   ├── img/         # ← Put your screenshots here
│   └── report_*.md  # In-depth reports for each stage
├── models/          # Trained Spark-ML models
├── output/          # Avro schemas, EDA CSVs, predictions, evaluation logs
├── scripts/         # Bash & Python automation
├── sql/             # PostgreSQL & Hive DDL / DML
└── .venv/           # Project-scoped virtualenv
```

⚡ Quick Start

Prerequisites
Python 3.11 • Hadoop 3 • Hive 3 • Spark 3.5 • Sqoop 1.4 • PostgreSQL 15 • Kaggle CLI

```bash
# clone & bootstrap
git clone https://github.com/IVproger/bigdata_course_project.git && cd bigdata_course_project
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# store secrets
echo "POSTGRES_PASSWORD=********" > secrets/.psql.pass

# full run (≈ 30 min on a 4-node cluster)
bash main.sh
```

Each stage can be invoked separately if you prefer:

```bash
bash scripts/stage1.sh   # ingest → PostgreSQL → HDFS
bash scripts/stage2.sh   # Hive warehouse + Spark-SQL EDA
bash scripts/stage3.sh   # Spark-ML training & tuning
bash scripts/stage4.sh   # metrics → Hive for BI
```

πŸ” Stage Breakdown

| Stage | What happens | Key outputs |
| ----- | ------------ | ----------- |
| 1 | Data Collection: Kaggle → PostgreSQL → Sqoop → Avro in HDFS | warehouse/*.avro |
| 2 | Warehouse & EDA: partitioned + bucketed Hive table, 6 Spark-SQL analyses | output/q*_results.csv |
| 3 | Predictive ML: Linear vs. GBT, 3-fold CV, log-salary target | models/**, output/model*_predictions.csv |
| 4 | Presentation: KL divergence, Hive externals for Superset | output/evaluation.csv, output/kl_divergence.csv, output/model1_predictions.csv, output/model2_predictions.csv |
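Stage 3 fits the models against a log-salary target, so the RMSE and R² reported later are computed on log-transformed values. A pure-Python stand-in for that evaluation step (the salaries below are made up, and the helper mimics, rather than reproduces, Spark's RegressionEvaluator):

```python
import math

def evaluate_log_target(salaries, predictions):
    """RMSE and R^2 on log-transformed salaries, mirroring the
    stage-3 evaluation; a toy stand-in for Spark's RegressionEvaluator."""
    y = [math.log(s) for s in salaries]
    y_hat = [math.log(p) for p in predictions]
    n = len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    rmse = math.sqrt(ss_res / n)
    mean_y = sum(y) / n
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    r2 = 1.0 - ss_res / ss_tot
    return rmse, r2

# Toy example with invented salaries and predictions
rmse, r2 = evaluate_log_target([50_000, 80_000, 120_000],
                               [52_000, 78_000, 110_000])
```

The log transform tames the heavy right tail of salaries, so errors on cheap and expensive jobs contribute on a comparable scale.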

Details live in docs/report_*.md for auditors and graders.


📊 Dashboard Preview

Dashboard Link

Screenshots: Data description · EDA · ML Modelling


🔬 Results

| Model | RMSE (log) | R² (log) | KL-Div. (salary) |
| ----- | ---------- | -------- | ---------------- |
| Linear Reg. | 0.092 | -1.59E-6 | 18.77 |
| GBT | 0.091 | 1.05E-4 | 16.3 |

→ GBT shows both a better RMSE and a lower KL divergence, indicating a tighter fit on the heavy-tailed salary distribution.
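The KL-divergence column compares the distribution of predicted salaries against the actual one. A minimal discrete version over shared histogram bins, as a sketch of the idea behind output/kl_divergence.csv rather than the project's exact code (the bin counts below are invented):

```python
import math

def kl_divergence(actual_counts, pred_counts, eps=1e-12):
    """Discrete KL(actual || predicted) over shared histogram bins.
    eps guards against empty predicted bins; counts are normalised
    to probabilities before summing."""
    a_total, p_total = sum(actual_counts), sum(pred_counts)
    p = [c / a_total for c in actual_counts]
    q = [c / p_total for c in pred_counts]
    return sum(pi * math.log(pi / max(qi, eps))
               for pi, qi in zip(p, q) if pi > 0)

# Identical histograms diverge by 0; a shifted one by a positive amount.
same = kl_divergence([5, 30, 40, 20, 5], [5, 30, 40, 20, 5])
shifted = kl_divergence([5, 30, 40, 20, 5], [20, 40, 30, 5, 5])
```

A lower value means the predicted salary histogram tracks the actual one more closely, which is why GBT's 16.3 beats Linear Regression's 18.77 above.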


🤝 Contributing

Pull requests welcome! Please open an issue first to discuss major changes.

  1. Fork ➜ create feature branch (git checkout -b feat/my-feature)
  2. Commit + push (git commit -m "feat: add …" → git push origin)
  3. Open PR → pass CI.

📄 License

Distributed under the MIT License. See LICENSE for details.


πŸ™ Acknowledgements

  • Kaggle for the open job-descriptions dataset
  • Apache Software Foundation for the Hadoop ecosystem
  • University Big-Data Engineering course staff for project guidance

Happy crunching, and may your HDFS never fill up!

About

🚀 Team14 bigdata course project: Kaggle → Hive → Spark-SQL → ML → Superset
