End-to-end Spark / Hadoop project that ingests Kaggle job-description data, turns it into an analytics-ready Hive warehouse, runs Spark-SQL EDA, trains & tunes Spark-ML regression models, and surfaces everything in an Apache Superset dashboard.
High-level architecture of the project
```mermaid
flowchart LR
    %% ----------------------------
    %% Stage I – Data Collection
    %% ----------------------------
    subgraph STAGE1["Stage I – Data Collection"]
        direction TB
        Kaggle["Kaggle CLI<br/>(job-descriptions.csv)"] --> PostgreSQL["PostgreSQL"]
        PostgreSQL --> Sqoop["Sqoop Import"]
        Sqoop --> HDFS1["HDFS<br/>Avro (+ schema)"]
    end

    %% ----------------------------
    %% Stage II – Data Warehouse & EDA
    %% ----------------------------
    subgraph STAGE2["Stage II – Data Warehouse &<br/>EDA"]
        direction TB
        Hive["Hive Externals<br/>(partitioned & bucketed)"] --> SparkSQL["Spark SQL<br/>(6 analyses)"]
    end

    %% ----------------------------
    %% Stage III – Predictive Analytics
    %% ----------------------------
    subgraph STAGE3["Stage III – Predictive<br/>Analytics"]
        direction TB
        Preproc["Data Preprocessing<br/>(Spark DataFrame ops)"] --> SparkML["ML Modelling<br/>(Spark ML Pipeline)"]
        SparkML --> LR["Linear Regression"]
        SparkML --> GBT["Gradient-Boosted Trees"]
    end

    %% ----------------------------
    %% Stage IV – Presentation & Delivery
    %% ----------------------------
    subgraph STAGE4["Stage IV – Presentation &<br/>Delivery"]
        direction TB
        HiveExt["Hive Externals<br/>(metrics & predictions)"] --> Superset["Apache Superset<br/>Dashboards"]
    end

    %% ----------------------------
    %% Cross-stage flow (left → right)
    %% ----------------------------
    HDFS1 --> Hive
    SparkSQL --> Preproc
    LR --> HiveExt
    GBT --> HiveExt
```
- One-click pipeline. 4 bash stages or a single `main.sh`.
- Optimised Hive layout. Avro + partitioning (`work_type`) + bucketing (`preference`) for low scan cost (see the sketch after this list).
- Scalable ML. Linear Regression vs. Gradient-Boosted Trees with 3-fold CV, persisted in HDFS.
- Metrics at a glance. KL-divergence, RMSE, R² and hyper-parameter grids ready for BI tools.
- Dashboard ready. External Hive tables expose CSV outputs directly to Superset.
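The Hive layout can be pictured with a short PySpark sketch. It is illustrative only: the table name, the feature columns beyond `work_type` and `preference`, the HDFS path, and the bucket count are assumptions, and the project's real DDL lives under `sql/`.

```python
from pyspark.sql import SparkSession

# Hive-enabled session; reading Avro needs the spark-avro package on the classpath.
spark = (SparkSession.builder
         .appName("warehouse-layout-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Avro files produced by the Sqoop import (path is an assumption).
jobs = spark.read.format("avro").load("hdfs:///user/project/warehouse/job_descriptions")

# Partition by work_type and bucket by preference, as described above.
# The bucket count (8) and table name are placeholders.
(jobs.write
     .format("avro")
     .partitionBy("work_type")
     .bucketBy(8, "preference")
     .sortBy("preference")
     .mode("overwrite")
     .saveAsTable("projectdb.jobs_part_buck"))
```

Queries that filter on `work_type` then prune whole partitions, which is where the low scan cost comes from.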
```
├── data/        # Raw download + ML data splits (synced from HDFS)
├── docs/
│   ├── img/         # ← Put your screenshots here
│   └── report_*.md  # In-depth reports for each stage
├── models/      # Trained Spark-ML models
├── output/      # Avro schemas, EDA CSVs, predictions, evaluation logs
├── scripts/     # Bash & Python automation
├── sql/         # PostgreSQL & Hive DDL / DML
└── .venv/       # Project-scoped virtualenv
```
Prerequisites
Python 3.11 • Hadoop 3 • Hive 3 • Spark 3.5 • Sqoop 1.4 • PostgreSQL 15 • Kaggle CLI
```bash
# clone & bootstrap
git clone https://github.com/IVproger/bigdata_course_project.git && cd bigdata_course_project
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# store secrets
echo "POSTGRES_PASSWORD=********" > secrets/.psql.pass

# full run (≈ 30 min on a 4-node cluster)
bash main.sh
```

Each stage can be invoked separately if you prefer:

```bash
bash scripts/stage1.sh   # ingest → PostgreSQL → HDFS
bash scripts/stage2.sh   # Hive warehouse + Spark-SQL EDA
bash scripts/stage3.sh   # Spark-ML training & tuning
bash scripts/stage4.sh   # metrics → Hive for BI
```

| Stage | What happens | Key outputs |
|---|---|---|
| 1 Data Collection | Kaggle → PostgreSQL → Sqoop Avro in HDFS | `warehouse/*.avro` |
| 2 Warehouse & EDA | Partitioned + bucketed Hive table, 6 Spark-SQL analyses | `output/q*_results.csv` |
| 3 Predictive ML | Linear vs. GBT, 3-fold CV, log-salary target | `models/**`, `output/model*_predictions.csv` |
| 4 Presentation | KL divergence, Hive externals for Superset | `output/evaluation.csv`, `output/kl_divergence.csv`, `output/model1_predictions.csv`, `output/model2_predictions.csv` |
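The Stage 4 hand-off is a thin Hive external table over each CSV output so Superset can query it in place. A minimal sketch, issued through `spark.sql` here for consistency with the other examples (the project may run the same DDL through Hive directly); the column names, types and HDFS location are assumptions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# External table over the evaluation CSV; Superset reads it via the Hive connection.
# Columns and location are illustrative, not the project's actual schema.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS projectdb.evaluation (
        model STRING,
        rmse  DOUBLE,
        r2    DOUBLE
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION 'hdfs:///user/project/output/evaluation'
    TBLPROPERTIES ('skip.header.line.count' = '1')
""")
```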
Details live in docs/report_*.md for auditors and graders.
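To make the Stage 3 row concrete, here is a stripped-down sketch of comparing Linear Regression and GBT with 3-fold cross-validation on a log-salary target, the setup the results below summarise. Feature columns, grid values, table name and model paths are placeholders, not the project's actual configuration.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression, GBTRegressor
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Log-transform the heavy-tailed salary so the models fit a better-behaved target.
df = (spark.table("projectdb.jobs_part_buck")            # table name assumed
           .withColumn("label", F.log("salary")))

assembler = VectorAssembler(inputCols=["experience_years", "company_size"],  # placeholder features
                            outputCol="features")
evaluator = RegressionEvaluator(labelCol="label", metricName="rmse")
train, test = df.randomSplit([0.8, 0.2], seed=42)

lr, gbt = LinearRegression(labelCol="label"), GBTRegressor(labelCol="label")
candidates = [
    (lr,  ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()),
    (gbt, ParamGridBuilder().addGrid(gbt.maxDepth, [3, 5]).build()),
]

for estimator, grid in candidates:
    cv = CrossValidator(estimator=Pipeline(stages=[assembler, estimator]),
                        estimatorParamMaps=grid, evaluator=evaluator, numFolds=3)
    model = cv.fit(train)
    print(type(estimator).__name__, "test RMSE (log):", evaluator.evaluate(model.transform(test)))
    model.write().overwrite().save(f"hdfs:///user/project/models/{type(estimator).__name__}")
```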
| Model | RMSE (log) | R² (log) | KL-Div. (salary) |
|---|---|---|---|
| Linear Reg. | 0.092 | -1.59E-6 | 18.77 |
| GBT | 0.091 | 1.05E-4 | 16.3 |
→ GBT shows both better RMSE and lower KL divergence, indicating a tighter fit on the heavy-tailed salary distribution.
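The KL-divergence column compares the distribution of actual salaries with the distribution of predicted salaries (after undoing the log transform). A minimal sketch of how such a figure can be computed, assuming a shared histogram binning and a small smoothing constant; the bin count and column names are illustrative and the project's own evaluation script may differ:

```python
import numpy as np

def kl_divergence(actual, predicted, bins=50, eps=1e-9):
    """KL(actual || predicted) over a shared salary histogram."""
    lo = min(actual.min(), predicted.min())
    hi = max(actual.max(), predicted.max())
    p, edges = np.histogram(actual, bins=bins, range=(lo, hi))
    q, _ = np.histogram(predicted, bins=edges)
    p = p.astype(float) + eps   # smooth empty bins before normalising
    q = q.astype(float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

# Usage sketch: predictions are in log-salary space, so exponentiate first.
# pdf = model.transform(test).select("salary", "prediction").toPandas()
# print(kl_divergence(pdf["salary"].to_numpy(), np.exp(pdf["prediction"].to_numpy())))
```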
Pull requests welcome! Please open an issue first to discuss major changes.
- Fork → create a feature branch (`git checkout -b feat/my-feature`)
- Commit + push (`git commit -m "feat: add …"` → `git push origin`)
- Open a PR → pass CI.
Distributed under the MIT License. See LICENSE for details.
- Kaggle for the open job-descriptions dataset
- Apache Software Foundation for the Hadoop ecosystem
- University Big-Data Engineering course staff for project guidance
Happy crunching, and may your HDFS never fill up!


