This is a technical assessment for a Data Engineering position. You will be working with Apache Airflow, PostgreSQL, and Apache Superset to create data pipelines and analytics.
Before starting, ensure you have the following installed:
- Docker (version 20.10 or higher)
- Docker Compose (version 2.0 or higher)
- At least 4GB of available RAM
- Available ports: 5432, 8080, 8088
Navigate to the project directory and start all services:

```bash
docker compose up -d
```

This will start:
- PostgreSQL database with e-commerce sample data
- Apache Airflow (webserver, scheduler, worker, triggerer)
- Apache Superset for data visualization
- Redis for Airflow's Celery executor
Initial startup takes 2-3 minutes. Monitor progress with:

```bash
docker compose ps
```

Wait until all services show "healthy" status.
Once all services are healthy, access them at:
| Service | URL | Credentials |
|---|---|---|
| Airflow | http://localhost:8080 | Username: `admin`, Password: `admin` |
| Superset | http://localhost:8088 | Username: `admin`, Password: `admin` |
| PostgreSQL | localhost:5432 | User: `datauser`, Password: `datapass`, Database: `datauser` |
Check that the database is loaded with sample data:

```bash
docker exec postgres-data psql -U datauser -d datauser -c "SELECT COUNT(*) FROM orders;"
```

You should see 50 orders in the database.
The PostgreSQL database contains e-commerce data with the following tables:

`customers`
- `id` (SERIAL PRIMARY KEY)
- `name` (VARCHAR)
- `email` (VARCHAR UNIQUE)
- `created_at` (TIMESTAMP)

50 customers in the database.

`products`
- `id` (SERIAL PRIMARY KEY)
- `name` (VARCHAR)
- `category` (VARCHAR)
- `price` (DECIMAL)
- `stock` (INTEGER)
- `created_at` (TIMESTAMP)

100 products across multiple categories (Electronics, Home & Office, Stationery, Accessories).

`orders`
- `id` (SERIAL PRIMARY KEY)
- `customer_id` (INTEGER, FK to customers)
- `order_date` (TIMESTAMP)
- `total_amount` (DECIMAL)
- `status` (VARCHAR: 'pending', 'processing', 'shipped', 'completed')

50 orders spanning January to March 2024.

`order_items`
- `id` (SERIAL PRIMARY KEY)
- `order_id` (INTEGER, FK to orders)
- `product_id` (INTEGER, FK to products)
- `quantity` (INTEGER)
- `price` (DECIMAL)

Multiple items per order.

`order_summary`: Pre-built view joining orders with customers and aggregating order items.
Create an Apache Airflow DAG that performs the following data transformations. The DAG should run daily and create/update the following tables:
Create a table named `detailed_orders` that merges orders and order_items with the following specifications:

Required Columns:
- `order_id` - The order ID
- `customer_id` - The customer ID
- `order_date_epoch` - Order date converted to epoch time (Unix timestamp)
- `total_amount` - Total order amount
- `status` - Order status
- `items` - A JSON array of order items, where each item contains `product_id`, `product_name`, `quantity`, and `price`
Example row structure:

```
order_id: 1
customer_id: 1
order_date_epoch: 1705747800
total_amount: 1342.97
status: 'completed'
items: [
  {"product_id": 1, "product_name": "Laptop Pro 15", "quantity": 1, "price": 1299.99},
  {"product_id": 2, "product_name": "Wireless Mouse", "quantity": 1, "price": 29.99},
  {"product_id": 3, "product_name": "USB-C Cable", "quantity": 1, "price": 12.99}
]
```
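One possible shape for this transformation (a sketch under the schema above, not the required solution) is a single SQL statement using PostgreSQL's `json_agg`, run from an Airflow task. The `build_detailed_orders` helper name is an assumption; in the real DAG, `hook` would be a `PostgresHook` bound to the `postgres_data` connection.

```python
# Sketch of the detailed_orders build step. The SQL assumes the table and
# column names described in the data model above; adjust if yours differ.
DETAILED_ORDERS_SQL = """
DROP TABLE IF EXISTS detailed_orders;
CREATE TABLE detailed_orders AS
SELECT
    o.id                                     AS order_id,
    o.customer_id,
    EXTRACT(EPOCH FROM o.order_date)::bigint AS order_date_epoch,
    o.total_amount,
    o.status,
    json_agg(json_build_object(
        'product_id',   oi.product_id,
        'product_name', p.name,
        'quantity',     oi.quantity,
        'price',        oi.price
    ))                                       AS items
FROM orders o
JOIN order_items oi ON oi.order_id = o.id
JOIN products p     ON p.id = oi.product_id
GROUP BY o.id, o.customer_id, o.order_date, o.total_amount, o.status;
"""

def build_detailed_orders(hook):
    """Rebuild detailed_orders; `hook` is a PostgresHook for postgres_data."""
    hook.run(DETAILED_ORDERS_SQL)
```

Dropping and recreating the table on each run is one easy way to keep this step idempotent.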
Create a summary table named `order_tracking` that tracks the number of orders by status.

Required Columns:
- `status` - Order status ('pending', 'processing', 'shipped', 'completed')
- `order_count` - Number of orders with this status
- `total_value` - Total dollar value of orders with this status
- `last_updated` - Timestamp when this record was last updated
Example:

```
status: 'completed'
order_count: 35
total_value: 15234.50
last_updated: 2024-03-20 10:30:00
```
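An idempotent refresh of this summary can be sketched with PostgreSQL's `INSERT ... ON CONFLICT` upsert. This assumes a primary key on `status` (a design choice, not stated in the spec), and `build_order_tracking` is a hypothetical task function taking a `PostgresHook`-like object:

```python
# Sketch of an idempotent order_tracking refresh: repeated runs update
# existing rows instead of inserting duplicates.
ORDER_TRACKING_SQL = """
CREATE TABLE IF NOT EXISTS order_tracking (
    status       VARCHAR PRIMARY KEY,
    order_count  INTEGER,
    total_value  DECIMAL,
    last_updated TIMESTAMP
);
INSERT INTO order_tracking (status, order_count, total_value, last_updated)
SELECT status, COUNT(*), SUM(total_amount), NOW()
FROM orders
GROUP BY status
ON CONFLICT (status) DO UPDATE SET
    order_count  = EXCLUDED.order_count,
    total_value  = EXCLUDED.total_value,
    last_updated = EXCLUDED.last_updated;
"""

def build_order_tracking(hook):
    """Refresh order_tracking; `hook` is a PostgresHook for postgres_data."""
    hook.run(ORDER_TRACKING_SQL)
```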
Create a table named `predictions` with predictions based on historical data.

Required Columns:
- `name` (VARCHAR) - Name/description of the prediction
- `amount` (DECIMAL) - Predicted value
Required Predictions:

1. Total orders for the current month
   - `name`: 'monthly_orders_prediction'
   - `amount`: Predicted number of orders for the current month based on historical trends
2. Product category sales predictions, one per category (Electronics, Home & Office, Stationery, Accessories)
   - `name`: 'category_<category_name>_prediction'
   - `amount`: Predicted number of units that will sell this month for that category
Example rows:

```
name: 'monthly_orders_prediction', amount: 45
name: 'category_Electronics_prediction', amount: 120
name: 'category_Home & Office_prediction', amount: 85
name: 'category_Stationery_prediction', amount: 200
name: 'category_Accessories_prediction', amount: 150
```
Prediction Method: You can use any reasonable method for predictions, such as:
- Simple moving average
- Linear trend analysis
- Average growth rate
- Or any other statistical method you prefer
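For instance, a simple moving average fits in a few lines of plain Python. This is only a sketch: the three-month window, the helper name, and the history values are all arbitrary choices, not part of the assessment data.

```python
CATEGORIES = ["Electronics", "Home & Office", "Stationery", "Accessories"]

def moving_average_prediction(monthly_counts, window=3):
    """Predict next month's value as the mean of the last `window` months."""
    recent = monthly_counts[-window:]
    return sum(recent) / len(recent)

# Build (name, amount) rows in the required shape -- history values made up.
history = {"Electronics": [100, 110, 120], "Home & Office": [80, 85, 90],
           "Stationery": [180, 200, 220], "Accessories": [140, 150, 160]}
rows = [(f"category_{c}_prediction", moving_average_prediction(history[c]))
        for c in CATEGORIES]
print(rows[0])  # ('category_Electronics_prediction', 110.0)
```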
Use Apache Superset to create visualizations and a dashboard based on your data.
Requirements:

1. Create at least 3 visualizations using the tables you created:
   - One visualization from `detailed_orders` (e.g., orders over time, revenue by status)
   - One visualization from `order_tracking` (e.g., pie chart of orders by status)
   - One visualization from `predictions` (e.g., bar chart comparing predictions across categories)
2. Create a dashboard that combines all your visualizations
3. Take a screenshot of your dashboard showing all visualizations
4. Save the screenshot in the `screenshots/` directory with the filename `dashboard.png`
Tips for Superset:
- Access Superset at http://localhost:8088
- Go to SQL Lab → Select "PostgreSQL Data" database
- Create datasets from your tables: Data → Datasets → + Dataset
- Create charts: Charts → + Chart
- Create dashboard: Dashboards → + Dashboard
- Add your charts to the dashboard
- Take a screenshot (use your OS screenshot tool or browser extension)
- DAG File: Create your DAG in the `dags/` directory (e.g., `dags/data_pipeline.py`)
- Connection: Use the pre-configured Airflow connection `postgres_data` to connect to PostgreSQL
- Idempotency: Your DAG should be idempotent - it should be safe to run multiple times without causing errors or duplicate data
- Error Handling: Include appropriate error handling and logging
- Task Dependencies: Structure your tasks with proper dependencies
- Documentation: Include docstrings explaining what each task does
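The error-handling and logging points can be covered by a small wrapper around each SQL step. This is a sketch: `run_step` is a hypothetical helper, and `hook` is any object exposing a `.run(sql)` method (such as Airflow's `PostgresHook` bound to `postgres_data`). Re-raising after logging lets Airflow mark the task as failed and apply its retry policy.

```python
import logging

log = logging.getLogger("data_pipeline")

def run_step(hook, sql, step_name):
    """Run one SQL step with logging; re-raise on failure so Airflow sees it.

    `hook` is anything with a .run(sql) method, e.g. a PostgresHook for the
    postgres_data connection. `step_name` is used only for log messages.
    """
    log.info("starting step %s", step_name)
    try:
        hook.run(sql)
    except Exception:
        log.exception("step %s failed", step_name)
        raise
    log.info("finished step %s", step_name)
```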
After creating your DAG:

- Enable and trigger the DAG in the Airflow UI
- Verify that all tasks complete successfully
- Check that the three tables were created:

```bash
# Check detailed_orders
docker exec postgres-data psql -U datauser -d datauser -c "SELECT COUNT(*) FROM detailed_orders;"

# Check order_tracking
docker exec postgres-data psql -U datauser -d datauser -c "SELECT * FROM order_tracking;"

# Check predictions
docker exec postgres-data psql -U datauser -d datauser -c "SELECT * FROM predictions;"
```

- Create your Superset dashboard with visualizations from all three tables
- Save a screenshot of your dashboard as `screenshots/dashboard.png`
Useful commands:

```bash
# Follow logs for a single service
docker compose logs -f [service-name]

# Stop all services
docker compose down

# Stop all services and remove volumes
docker compose down -v

# Start all services
docker compose up -d

# Open a psql shell inside the database container
docker exec -it postgres-data psql -U datauser -d datauser
```

Services not starting:

```bash
docker compose logs
```

Permission errors:

```bash
chmod -R 755 dags/
```

Database connection issues:

```bash
docker compose logs postgres-data
```

Airflow DAG not appearing:

- Wait 1-2 minutes for Airflow to scan the dags folder
- Check for syntax errors: `docker compose logs airflow-scheduler`
- Verify the file is in the `dags/` directory
When complete, ensure:

- Your DAG file is in the `dags/` directory
- All tables are created and populated
- The DAG runs successfully in Airflow
- Your Superset dashboard screenshot is saved as `screenshots/dashboard.png`
- Any additional documentation or notes are included in comments or a README file
Good luck!