This repository serves as my official Learning Journal and portfolio for the Level 5 Data Engineer Apprenticeship.
It documents my practical experience, the knowledge I have gained, project work, and personal reflections, aligned with the apprenticeship curriculum and the Data Engineer (Level 5) standard.
The journal is organized into modules/folders reflecting the core areas of data engineering practice, ensuring all competencies required for the End-Point Assessment (EPA) are logged and demonstrable.
- ./01_Core_Concepts/: Notes, definitions, and foundational knowledge (e.g., Data Architecture, Ethics, Governance).
- ./02_SQL_Data_Modelling/: SQL scripts, data modelling diagrams, and database practice.
- ./03_Python_ETL/: Python scripts for data manipulation (pandas), scripting, and basic ETL/ELT processes.
- ./04_Data_Pipelines_Orchestration/: Code and configurations for building, automating, and monitoring data workflows (e.g., Airflow, Azure Data Factory).
- ./05_Cloud_Infrastructure/: Notes and setup scripts (IaC) related to cloud platforms (AWS/Azure/GCP) for data solutions.
- ./06_Capstone_Project/: The final, significant project used for EPA preparation (e.g., building a complete, end-to-end data platform).
- ./Documentation/: Technical documentation, requirements gathering, and professional discussion preparation.
This journal documents hands-on mastery in the following core areas:
- Python: Advanced scripting, data manipulation with pandas/NumPy, and software development best practices (testing, version control); a minimal ETL sketch follows this list.
- SQL: Complex queries, stored procedures, performance tuning, and schema and data management (DDL/DML).
- PySpark/Scala (Optional): Working with distributed computing frameworks for Big Data processing.
- Data Modelling: Relational (3NF/dimensional) and NoSQL modelling for different use cases.
- Data Warehousing: Concepts of Data Lakes, Data Lakehouses, and Data Marts (e.g., Snowflake, Microsoft Fabric).
- Cloud Platforms: Implementation of data solutions using [AWS / Azure / GCP] services (e.g., S3/Blob Storage, EC2/VMs, RDS/Managed Databases).
- Pipeline Tools: Building and managing robust data pipelines using orchestrators such as Apache Airflow or Azure Data Factory (see the Airflow sketch after this list).
- Streaming: Experience with batch, micro-batch, and real-time streaming concepts (e.g., Kafka, Azure Stream Analytics).
- Version Control: Professional usage of Git and GitHub for collaborative development.
- Containerization: Introduction to Docker for dependency management and reproducible environments.
- CI/CD: Implementing basic Continuous Integration and Continuous Deployment for data pipelines (e.g., GitHub Actions).
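To illustrate the kind of work kept in ./03_Python_ETL/, here is a minimal sketch of an extract-transform-load script. The file names, column handling, and target path are illustrative assumptions rather than code from the journal itself.

```python
# Minimal ETL sketch: read a raw CSV, tidy it with pandas, and write a
# curated Parquet file. Paths and transformations are hypothetical.
import pandas as pd


def run_etl(source_csv: str, target_parquet: str) -> None:
    # Extract: read the raw CSV export
    df = pd.read_csv(source_csv)

    # Transform: drop incomplete rows and standardise column names
    df = df.dropna()
    df.columns = [col.strip().lower().replace(" ", "_") for col in df.columns]

    # Load: write the cleaned data to a columnar format for downstream use
    df.to_parquet(target_parquet, index=False)


if __name__ == "__main__":
    run_etl("raw/sales_export.csv", "curated/sales.parquet")
```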
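And as a companion to the Pipeline Tools entry above, a minimal orchestration sketch, assuming Apache Airflow 2.4+ as the scheduler. The DAG id, schedule, and task callables are illustrative assumptions, not a pipeline from this repository.

```python
# Minimal Airflow DAG sketch: two placeholder tasks run daily, with the
# load step depending on the extract step.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    # Placeholder for the extract step (e.g., pull a file from blob storage)
    pass


def load():
    # Placeholder for the load step (e.g., write curated data to a warehouse)
    pass


with DAG(
    dag_id="example_daily_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # Run load only after extract succeeds
    extract_task >> load_task
```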