Personal notes and lab solutions for the Data Engineer Handbook Bootcamp

pizofreude/data-engineer-notes

Data Engineer Bootcamp Notes

Welcome to my personal notes for the Data Engineer Handbook Bootcamp.

This repository is my learning journal, containing summaries, key concepts, and lab solutions for the 6-week bootcamp. It complements my forked Data Engineer Handbook repo.


Repository Structure

data-engineer-notes/
├── README.md
├── resources.md
├── assets/
├── images/
├── week00/
│   ├── summary.md
│   ├── key-concepts.md
│   ├── lab-notes.md
│   └── lab00/
│       ├── solution.ipynb
│       └── ...     # Artifacts from bootcamp materials
├── week01/
│   └── (similar structure)
├── ...
└── week06/
    └── (similar structure)

Week Notes

Each week contains:

  • Summary: Key takeaways from the week
  • Key Concepts: Detailed explanations and examples of core ideas
  • Lab Notes: Observations, detailed notes, and troubleshooting during labs
  • Labs: Solutions for each lab

Links


Learning Progress

  • Module 1: Bootcamp Orientation - Database setup and Boot Camp Kickoff [Week 0]
    • Bootcamp Kickoff | 20 min
    • Boot Camp Database Setup | 20 min
  • Module 2: Dimensional Data Modeling [Week 1]
    • Dimensional Data Modeling Complex Data Type and Cumulation Day 1 Lecture | 43 min
    • Dimensional Data Modeling Complex Data Type and Cumulation Day 1 Lab | 41 min
    • Dimensional Data Modeling: Building Slowly Changing Dimensions Day 2 Lecture | 40 min
    • Dimensional Data Modeling: Building Slowly Changing Dimensions Day 2 Lab | 45 min
    • Dimensional Data Modeling: Graph Data Modeling Day 3 Lecture | 34 min
    • Dimensional Data Modeling: Graph Data Modeling Day 3 Lab | 46 min
    • Dimensional Data Modeling - Week 1 Assignment
  • Module 3: Fact Data Modeling [Week 2]
    • Fact Data Modeling: Core Concepts, Deduplication Day 1 Lecture | 52 min
    • Fact Data Modeling: Practical Insights into Data Modeling Day 1 Lab | 40 min
    • Fact Data Modeling: Core Elements in Data Modeling Day 2 Lecture | 31 min
    • Fact Data Modeling: Compact Tables for Efficient Data Representation Day 2 Lab | 45 min
    • Fact Data Modeling: Minimizing Shuffle and Reducing Facts Day 3 Lecture | 32 min
    • Fact Data Modeling: Practical Guide to Formatting and Aggregating Data Day 3 Lab | 30 min
    • Fact Data Modeling - Week 2 Assignment
  • Module 4: Apache Spark Fundamentals [Week 3]
    • Apache Spark: Architecture, Optimization, and Best Practices Day 1 Lecture | 48 min
    • Apache Spark: Hands-On for Broadcast and Hash Joins Day 1 Lab | 26 min
    • Apache Spark: Managing Spark Jobs and Notebooks Day 2 Lecture | 34 min
    • Apache Spark: User-Defined Functions and Broadcast Join Day 2 Lab | 36 min
    • Unit Testing Spark Jobs: Importance, Challenges, and Leadership Perspectives Lecture | 41 min
    • Unit Testing Spark Jobs: Mastering Spark and PySpark Testing Lab | 27 min
    • Spark Fundamentals - Week 3 Assignment
  • Module 5: Applying Analytical Patterns [Week 4]
    • Applying Analytical Patterns: Exploring SQL, Scaling Projects and Aggregation Analysis Day 1 Lecture | 52 min
    • Applying Analytical Patterns: Mastering Growth Accounting and Retention Analysis Day 1 Lab | 34 min
    • Applying Analytical Patterns: Recursive CTEs and Window Functions Day 2 Lecture | 44 min
    • Applying Analytical Patterns: Aggregations and Cardinality Reduction Day 2 Lab | 33 min
    • Applying Analytical Patterns - Week 4 Assignment
  • Module 6: Real-time pipelines with Flink and Kafka [Week 5]
    • Flink Lab Setup | 7 min
    • Streaming Pipelines: Mastering Streaming and Real-time Pipelines Day 1 Lecture | 50 min
    • Streaming Pipelines: Setting up Streaming Pipelines Day 1 Lab | 40 min
    • Streaming Pipelines: Exploring Data Collection and Processing Day 2 Lecture | 31 min
    • Streaming Pipelines: Kafka, Postgres, Spark Integrations and Parallelism Day 2 Lab | 39 min
    • Flink - Week 5 Assignment
  • Module 7: Data Visualization and Impact [Week 6 Part 1]
    • Data Visualization and Impact: Mastering Data Engineering Day 1 Lecture | 39 min
    • Data Visualization and Impact: Hands-On with the CSV files Day 1 Lab | 8 min
    • Data Visualization and Impact: Insights and Best Practices Day 2 Lecture | 23 min
    • Data Visualization and Impact: Exploring Data Visualization and Aggregation Techniques Day 2 Lab | 37 min
    • Data Visualization - Week 6 1st Assignment
  • Module 8: Data Pipeline Maintenance [Week 6 Part 2]
    • Data Pipeline Maintenance: Navigating the Complexities of Data Engineering Day 1 Lecture | 67 min
    • Data Pipeline Maintenance: Strategies for Maintenance and Dock Building Day 2 Lecture | 77 min
    • Data Pipeline Maintenance - Week 6 2nd Assignment
  • Module 9: KPIs and Experimentation [Week 6 Part 3]
    • KPIs and Experimentation: Decoding Business Success: Metrics, Growth Strategies and Collaborative Approaches Day 1 Lecture | 55 min
    • KPIs and Experimentation: Setting up and Analysing Experiments Day 1 Lab | 36 min
    • KPIs and Experimentation: Leading and Lagging Metrics Day 2 Lecture | 65 min
    • KPIs and Experimentation - Week 6 3rd Assignment
  • Module 10: Data Quality Patterns [Week 7]
    • Data Quality Patterns: MIDAS Process from Airbnb Day 1 Lecture | 45 min
    • Data Quality Patterns: Spec-Building Document Day 1 Lab | 33 min
    • Data Quality Patterns: WAP Patterns Day 2 Lecture | 27 min

💻 Daily Practice System

This repository now includes a comprehensive practice tracking system to organize daily coding practice across multiple platforms:

  • practice/ - Platform-organized coding problems (LeetCode, StrataScratch, HackerRank, NeetCode, Codewars, etc.)
  • concepts/ - Reference notes on data structures, algorithms, SQL patterns, and system design
  • interview-prep/ - Interview-specific preparation materials (behavioral, technical, system design)
  • logs/ - Daily practice logs and progress tracking with statistics dashboard

Quick Start

# Create today's log entry
./scripts/new-day.sh

# Start a new problem
# ./scripts/create-problem.sh <platform> <difficulty> "problem-name"
./scripts/create-problem.sh leetcode medium "problem-name"

# Create a concept note
# ./scripts/link-concept.sh "concept-name" <category>
./scripts/link-concept.sh "Window Functions" sql-patterns

# Generate weekly stats
python scripts/generate-stats.py

✅ Manual Update Checklist: For Every Problem You Solve

Here's your streamlined checklist for logging each problem.

📋 The 6-Step Workflow

Step 1: Start Your Day ⏱️ 30 seconds

Run once per day (first thing in the morning):

./scripts/new-day.sh

✅ Done! No manual edits needed for this step.


Step 2: Scaffold the Problem ⏱️ 30 seconds

For each new problem you're about to solve:

./scripts/create-problem.sh <platform> <difficulty> "<problem-slug>"

Examples:

./scripts/create-problem.sh codewars easy "absolute-value-log-base"
./scripts/create-problem.sh leetcode medium "rank-scores"
./scripts/create-problem.sh stratascratch hard "revenue-analysis"

✅ Done! Folder created, template copied, ready to code.
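For readers curious what the scaffolding step amounts to, here is a minimal Python sketch of the same idea. It is not the repository's `create-problem.sh`; the function name `scaffold_problem` and the tiny `NOTES_TEMPLATE` string are hypothetical stand-ins, and only the `practice/<platform>/<difficulty>/<problem-slug>` layout and the pre-filled date come from this README.

```python
from datetime import date
from pathlib import Path
import tempfile

# Hypothetical stand-in for the notes template; the real template lives in the repo.
NOTES_TEMPLATE = "# [Problem Name]\n\n- **Date Solved:** {date_solved}\n"


def scaffold_problem(root: Path, platform: str, difficulty: str, slug: str) -> Path:
    """Create practice/<platform>/<difficulty>/<slug>/ with a notes.md inside."""
    folder = root / "practice" / platform / difficulty / slug
    folder.mkdir(parents=True, exist_ok=True)
    notes = folder / "notes.md"
    if not notes.exists():  # never overwrite notes that were already filled in
        notes.write_text(
            NOTES_TEMPLATE.format(date_solved=date.today().isoformat()),
            encoding="utf-8",
        )
    return folder


if __name__ == "__main__":
    # Demonstrate in a throwaway directory so nothing in the repo is touched.
    with tempfile.TemporaryDirectory() as tmp:
        created = scaffold_problem(Path(tmp), "leetcode", "medium", "rank-scores")
        print(created.relative_to(tmp))
```

Refusing to overwrite an existing `notes.md` is a design choice made for this sketch: re-running the scaffold on a problem you have already documented should be harmless.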


Step 3: Write Your Solution ⏱️ 10-30 minutes (solving time)

Navigate to the problem folder:

cd practice/<platform>/<difficulty>/<problem-slug>

Open and write your solution:

code solution.sql    # For SQL problems
# OR
code solution.py     # For Python/algorithm problems

What to do:

  • ✏️ Paste your working solution code
  • ✏️ Add comments explaining key logic (optional but recommended)
  • 💾 Save the file

Example:

-- Calculate absolute value and logarithm base 64
SELECT
  ABS(number1) AS abs,
  LOG(64, number2) AS log
FROM decimals;
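A quick way to sanity-check the math in the example query outside the database is to reproduce it in Python. The helper name `log_base` is hypothetical; it only mirrors the `LOG(base, value)` argument order used in the SQL above.

```python
import math


def log_base(base: float, value: float) -> float:
    """Logarithm of value in the given base, mirroring the LOG(base, value) argument order."""
    # math.log takes (value, base), the reverse of the SQL call above.
    return math.log(value, base)


# 64 squared is 4096, so the base-64 logarithm of 4096 should be 2.
assert math.isclose(log_base(64, 4096), 2.0)

# ABS returns the distance from zero, like Python's built-in abs().
assert abs(-7.5) == 7.5
```

`math.isclose` is used instead of `==` because floating-point logarithms are not guaranteed to land exactly on an integer.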

Step 4: Document Your Solution ⏱️ 10-15 minutes

Open the notes file:

code notes.md

You need to manually update these sections:

A. Metadata (Top of file)
# [Problem Name]  ← CHANGE THIS

## 📋 Metadata
- **Platform:** [Platform name]  ← CHANGE THIS
- **Difficulty:** [Easy/Medium/Hard] (Optional: add platform rating like "7 kyu")  ← CHANGE THIS
- **Date Solved:** 2026-01-03  ← ✅ ALREADY FILLED BY SCRIPT
- **Time Spent:** XX minutes  ← CHANGE THIS
- **Status:** [✅ Solved | 🔄 Revisit | ❌ Stuck]  ← CHANGE THIS

Example:

# Absolute Value and Log to Base

## 📋 Metadata
- **Platform:** Codewars
- **Difficulty:** Easy (7 kyu)
- **Date Solved:** 2026-01-03  ← Script filled this
- **Time Spent:** 15 minutes
- **Status:** ✅ Solved

B. Links
## 🔗 Links
- [Problem URL]  ← PASTE THE ACTUAL URL HERE

Example:

## 🔗 Links
- https://www.codewars.com/kata/594a8f2f7ca3c692a4000041/train/sql

C. Topics & Tags (Check the boxes)
## 📚 Topics & Tags
- [ ] SQL
- [ ] Window Functions
- [ ] Joins
- [ ] CTEs
- [ ] Python
- [ ] Dynamic Programming

Check the relevant ones:

## 📚 Topics & Tags
- [x] SQL  ← Put 'x' inside
- [x] Mathematical Functions
- [ ] Window Functions
- [ ] Joins

D. Problem Statement
## 📝 Problem Statement
[Paste the problem description here]

### Example Input/Output
```markdown
Input: 
Output: 
```

What to do:

  • ✏️ Copy-paste the problem description from the platform
  • ✏️ Add example input/output (if provided)

E. Approach
## 💡 Approach

### Initial Thoughts
[What was your first idea?  What patterns did you recognize?]

### Solution Strategy
1. Step 1
2. Step 2
3. Step 3

What to do:

  • ✏️ Write your thought process (2-3 sentences)
  • ✏️ List the steps you took (bullet points)

Example:

## 💡 Approach

### Initial Thoughts
Straightforward application of SQL math functions: ABS for absolute value, LOG for logarithm with custom base.

### Solution Strategy
1. Use `ABS(number1)` to get absolute values
2. Use `LOG(64, number2)` for logarithm base 64
3. Alias columns as required (`abs`, `log`)

F. Solution
## 🖥️ Solution

### Attempt 1 (Initial)
```sql
-- Your first solution here
```

Result: [Passed/Failed/Timeout]

What to do:

  • ✏️ Paste your solution code (can be the same as solution.sql)
  • ✏️ Note whether it passed or failed

If you optimized it, add:

### Attempt 2 (Optimized) ⭐
```sql
-- Improved solution
```

Result: ✅ Passed with better performance

G. Complexity Analysis
## ⚡ Complexity Analysis
- **Time Complexity:** O(?)
- **Space Complexity:** O(?)

What to do:

  • ✏️ Fill in the Big O notation
  • ✏️ If you don't know, write: "Time: O(n) - single pass through table"

Example:

## ⚡ Complexity Analysis
- **Time Complexity:** O(n) - single pass through table
- **Space Complexity:** O(n) - result set same size as input

H. Key Learnings

## 🎓 Key Learnings
1. 
2. 
3. 

What to do:

  • ✏️ Write 2-4 things you learned (this is THE MOST IMPORTANT SECTION!)

Example:

## 🎓 Key Learnings
1. **ABS()** - Returns absolute value (distance from zero)
2. **LOG(base, value)** - PostgreSQL syntax for custom base logarithm
3. PostgreSQL's single-argument `LOG(value)` is base 10, while MySQL's is the natural logarithm; both accept `LOG(base, value)`
4. Base-64 logarithm: `LOG(64, 4096) = 2` because 64² = 4096

I. Related Concepts (Optional)

## 🏷️ Related Concepts
See: `concepts/sql-patterns/[concept-file].md`

What to do:

  • ✏️ If you created a concept note, link it here
  • ⏭️ Skip if you haven't created a concept yet

Example:

## 🏷️ Related Concepts
See:  `concepts/sql-patterns/sql-mathematical-functions.md`

Step 5: Update Daily Log ⏱️ 5 minutes

Open today's log:

cd ../../../../    # Return to repo root
code logs/2026/01-january.md
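The monthly log path and the per-day heading follow a predictable pattern, so both can be derived from a date. The sketch below assumes the `logs/YYYY/MM-month.md` naming and the `### Weekday, Month DD, YYYY` heading shown in this README; the function names are hypothetical, and how `new-day.sh` actually builds these strings is not documented here.

```python
from datetime import date


def monthly_log_path(day: date) -> str:
    """Build logs/YYYY/MM-month.md, for example logs/2026/01-january.md."""
    # %B is the full month name; it is English unless the process locale was changed.
    return f"logs/{day:%Y}/{day:%m}-{day:%B}.md".lower()


def day_heading(day: date) -> str:
    """Build the heading that marks one day's section in the monthly log."""
    return "### " + day.strftime("%A, %B %d, %Y")


solved_on = date(2026, 1, 3)
print(monthly_log_path(solved_on))  # logs/2026/01-january.md
print(day_heading(solved_on))       # ### Saturday, January 03, 2026
```

Deriving the weekday from the date, rather than typing it, avoids headings where the day name and the date disagree.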

Find today's date section and fill in:

A. Time Spent

### Saturday, January 03, 2026
⏱️ Time:   X hours  ← CHANGE THIS

Example:

⏱️ Time:  1.5 hours

B. Problems Completed

#### ✅ Completed
1. 

Add each problem with:

  • Problem name and difficulty
  • Key topics
  • Link to your solution
  • One-line key learning

Example:

#### ✅ Completed
1. **Codewars - Absolute Value and Log to Base** (Easy/7kyu)
   - Topics: ABS(), LOG(), Mathematical functions
   - [Solution](../../practice/codewars/easy/absolute-value-log-base/)
   - Key learning: single-argument LOG() is base 10 in PostgreSQL but the natural log in MySQL

2. **LeetCode 178 - Rank Scores** (Medium)
   - Topics: Window functions, DENSE_RANK
   - [Solution](../../practice/leetcode/medium/178-rank-scores/)
   - Key learning: DENSE_RANK vs RANK vs ROW_NUMBER differences

C. Learnings

#### 💡 Learnings
- 

Write 2-4 broader learnings from today:

Example:

#### 💡 Learnings
- Mathematical functions in SQL are database-specific (PostgreSQL vs MySQL syntax)
- Always check for NULL values when using LOG() with user input
- ABS() is useful for calculating distances and differences
- Created concept note: `concepts/sql-patterns/sql-mathematical-functions.md`

D. Tomorrow's Plan

#### 🎯 Tomorrow
- [ ] 

Plan 2-3 things for tomorrow:

Example:

#### 🎯 Tomorrow
- [ ] LeetCode 180 - Consecutive Numbers (Window functions practice)
- [ ] StrataScratch - Revenue analysis problem
- [ ] Review:  Self-joins pattern

Step 6: Commit & Push ⏱️ 1 minute

git status

# Add your changes
git add practice/<platform>/<difficulty>/<problem-slug>/
git add logs/2026/01-january.md

# If you created a concept note, add it too
git add concepts/

# Commit with descriptive message
git commit -m "✅ [Platform]: [Problem Name] - [Key Topic]"

# Push to GitHub
git push

Example commit messages:

git commit -m "✅ Codewars: Absolute Value and Log to Base - SQL math functions"
git commit -m "✅ LeetCode 178: Rank Scores - Window functions"
git commit -m "✅ StrataScratch: Revenue Analysis - CTEs and aggregations"
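To keep commit messages consistent with the pattern above, the message can be assembled from its three parts. This is a small illustrative helper, not something the repository ships; the name `commit_message` is hypothetical.

```python
def commit_message(platform: str, problem: str, topic: str) -> str:
    """Format a commit message as '✅ <platform>: <problem> - <topic>'."""
    return f"✅ {platform}: {problem} - {topic}"


print(commit_message("LeetCode 178", "Rank Scores", "Window functions"))
```

The resulting string can be passed to `git commit -m` as-is.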

📊 Weekly: Update Stats Dashboard ⏱️ 5 minutes

Run every Sunday (or end of week):

python scripts/generate-stats.py

Copy the output:

## 📊 All-Time Stats

| Platform      | Easy | Medium | Hard | Total |
|---------------|------|--------|------|-------|
| Codewars      | 5    | 2      | 0    | 7     |
| Leetcode      | 12   | 8      | 1    | 21    |
| **Total**     | **17** | **10** | **1** | **28** |

📅 Last Updated: 2026-01-05 20:30

Paste it into:

code logs/README.md

Replace the old stats section with the new output.
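To make the stats table less of a black box, here is a rough sketch of how such a table could be produced by counting problem folders. It assumes the `practice/<platform>/<difficulty>/<problem-slug>/` layout described earlier; `count_problems` and `render_table` are hypothetical names, and the real `scripts/generate-stats.py` may work differently.

```python
from collections import Counter
from pathlib import Path
import tempfile

DIFFICULTIES = ("easy", "medium", "hard")


def count_problems(practice_dir: Path) -> dict[str, Counter]:
    """Count problem folders under <practice_dir>/<platform>/<difficulty>/."""
    counts: dict[str, Counter] = {}
    for platform_dir in sorted(p for p in practice_dir.iterdir() if p.is_dir()):
        per_level = Counter()
        for level in DIFFICULTIES:
            level_dir = platform_dir / level
            if level_dir.is_dir():
                per_level[level] = sum(1 for p in level_dir.iterdir() if p.is_dir())
        counts[platform_dir.name] = per_level
    return counts


def render_table(counts: dict[str, Counter]) -> str:
    """Render the counts as a Markdown table with a bold totals row."""
    lines = [
        "| Platform | Easy | Medium | Hard | Total |",
        "|----------|------|--------|------|-------|",
    ]
    totals = Counter()
    for platform, per_level in counts.items():
        row = [per_level[d] for d in DIFFICULTIES]  # missing levels count as 0
        totals.update(per_level)
        cells = " | ".join(str(n) for n in row)
        lines.append(f"| {platform.capitalize()} | {cells} | {sum(row)} |")
    total_row = [totals[d] for d in DIFFICULTIES]
    bold = " | ".join(f"**{n}**" for n in total_row)
    lines.append(f"| **Total** | {bold} | **{sum(total_row)}** |")
    return "\n".join(lines)


if __name__ == "__main__":
    # Build a tiny fake practice/ tree so the demo does not depend on the repo.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for slug in ("codewars/easy/abs-log", "leetcode/medium/rank-scores"):
            (root / slug).mkdir(parents=True)
        print(render_table(count_problems(root)))
```

Counting folders rather than reading a manifest matches the "Platform-Agnostic" feature claimed below: any new platform directory is picked up without configuration.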

Also update:

## 🔥 Current Streaks
- **Daily Practice:** X days  ← UPDATE THIS MANUALLY

Commit:

git add logs/README.md
git commit -m "📊 Update weekly practice stats"
git push
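The daily-practice streak is updated by hand in this workflow, but it is simple to compute if you have the set of dates on which you practiced. The sketch below is an assumption about how one might do it, with a hypothetical `current_streak` function; it treats a streak as still alive if today has not been logged yet but yesterday was.

```python
from datetime import date, timedelta


def current_streak(practice_days: set[date], today: date) -> int:
    """Count consecutive practice days ending today, or yesterday if today is not logged yet."""
    day = today if today in practice_days else today - timedelta(days=1)
    streak = 0
    while day in practice_days:
        streak += 1
        day -= timedelta(days=1)
    return streak


days = {date(2026, 1, 1), date(2026, 1, 2), date(2026, 1, 3), date(2026, 1, 5)}
print(current_streak(days, date(2026, 1, 5)))  # 1, because January 4 was missed
print(current_streak(days, date(2026, 1, 4)))  # 3, counting back from January 3
```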

✅ Quick Reference Checklist

Print this and keep it next to you:

□ Step 1: ./scripts/new-day.sh (once per day)

For each problem:
□ Step 2: ./scripts/create-problem.sh <platform> <difficulty> "<slug>"
□ Step 3: Write solution in solution.sql or solution.py
□ Step 4: Fill in notes.md:
    □ Change title
    □ Update metadata (platform, difficulty, time, status)
    □ Paste problem URL
    □ Check topic tags
    □ Paste problem statement
    □ Write approach & strategy
    □ Paste solution code
    □ Add complexity analysis
    □ Write key learnings (MOST IMPORTANT!)
    □ Link concept note (if created)

□ Step 5: Update logs/2026/01-january.md:
    □ Time spent today
    □ Add problem to "Completed" list
    □ Write today's learnings
    □ Plan tomorrow's focus

□ Step 6: git add → commit → push

Weekly:
□ Sunday: Run generate-stats.py
□ Update logs/README.md, the monthly log, and practice/README.md with new stats

πŸ’‘ Time-Saving Tips

Minimal Version (10 min per problem)

If you're short on time, focus on:

  1. ✅ Solution code (solution.sql)
  2. ✅ Key learnings in notes.md
  3. ✅ Daily log entry

Skip the rest for now and come back later to fill it in.


Batch Update

If you solve multiple problems:

  1. Scaffold all problems first
  2. Solve all problems
  3. Update all notes.md files
  4. Update daily log once (list all problems)
  5. Single commit at the end

Use Snippets/Shortcuts

Create editor snippets for repetitive sections like complexity analysis, common tags, etc.


🎯 Summary: What You MUST Do Manually

| File                  | What to Update                                   |
|-----------------------|--------------------------------------------------|
| solution.sql          | Your code                                        |
| notes.md              | Title, metadata, URL, approach, learnings        |
| logs/YYYY/MM-month.md | Time, problems list, learnings, tomorrow's plan  |
| logs/README.md        | Weekly stats (copy from script output)           |

Everything else is automated! 🎉

Features

  • Automation Scripts: Quickly scaffold new problems and logs with templates
  • Platform-Agnostic: Automatically discovers and tracks any coding platform
  • Comprehensive Templates: Detailed templates for problems, concepts, and daily logs
  • Progress Tracking: Statistics generation and progress dashboards
  • Knowledge Base: Structured concept notes linked to practice problems

See practice/README.md for detailed usage instructions and workflow.