This is a project for experimenting with `dlt` for data ingestion in an ETL pipeline. It uses a file of NZ Census data (17 columns × 35.6 million records) in Parquet format as the source and applies a custom cleaning function to the data before writing it back to the filesystem.
The project is structured to allow for easy modification and testing of different data ingestion and transformation strategies.
- Ingests census data from Parquet files located in the `files/input` directory.
- Cleans the ingested data using a custom cleaning function (`helpers.generic.clean_data`); an illustrative sketch follows this list.
- Writes the cleaned data back to the filesystem (configured via `dlt.destinations.filesystem`).
- Exports the inferred schema to `files/schemas/export`.
- Uses `dlt` for pipeline orchestration and data loading.
- Uses `pyarrow` for efficient data handling.
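The contents of `helpers.generic.clean_data` are not reproduced in this README. As a rough illustration of where that step sits in the flow, the sketch below cleans a `pyarrow` record batch; the whitespace-trimming rule and column handling are assumptions for illustration, not the project's actual logic.

```python
# Illustrative sketch only: the real helpers.generic.clean_data is not
# reproduced in this README, and the trimming rule below is an assumed example.
import pyarrow as pa
import pyarrow.compute as pc


def clean_data(batch: pa.RecordBatch) -> pa.RecordBatch:
    """Apply simple column-level cleaning to a pyarrow RecordBatch."""
    table = pa.Table.from_batches([batch])

    # Example rule: strip surrounding whitespace from every string column.
    for name in table.column_names:
        column = table[name]
        if pa.types.is_string(column.type):
            table = table.set_column(
                table.schema.get_field_index(name),
                name,
                pc.utf8_trim_whitespace(column),
            )

    # Return a single RecordBatch so downstream steps see the same shape.
    return table.combine_chunks().to_batches()[0]
```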
This project uses `uv` for dependency management.
- Install `uv`: if you don't have `uv` installed, follow the official installation instructions.
- Create a virtual environment:

  ```
  uv venv
  ```

- Activate the virtual environment:
  - macOS/Linux: `source .venv/bin/activate`
  - Windows: `.venv\Scripts\activate`
- Install dependencies:

  ```
  uv sync
  ```

  This installs the project in editable mode along with the development dependencies.
To run the main data ingestion pipeline:
```
python census_pipeline.py
```

The pipeline will:

- Read Parquet files from `files/input/`.
- Apply the `census_clean` transformer.
- Write the cleaned data (in Parquet format) to the destination configured in the pipeline (defaulting to a local filesystem location managed by `dlt`).
- Log progress to the console.
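For reference, the sketch below shows roughly how a `census_pipeline.py` like the one described above could be wired together with `dlt`. It is a minimal reconstruction based on this README, not the project's actual script: `census_source`, the batching logic, and the `files/output` bucket URL are assumptions; only the input path, `census_clean`, `clean_data`, and the schema export location come from the description above.

```python
# Minimal sketch of a pipeline matching the description above.
# `census_source` and the output location are assumptions, not the
# project's actual code.
from pathlib import Path

import dlt
import pyarrow.parquet as pq

from helpers.generic import clean_data


@dlt.resource(name="census")
def census_source(input_dir: str = "files/input"):
    # Stream each Parquet file in Arrow record batches so the full
    # 35.6m-row dataset never has to fit in memory at once.
    for path in sorted(Path(input_dir).glob("*.parquet")):
        for batch in pq.ParquetFile(path).iter_batches():
            yield batch


@dlt.transformer(name="census_clean")
def census_clean(batch):
    # Apply the project's cleaning function to each batch.
    yield clean_data(batch)


if __name__ == "__main__":
    pipeline = dlt.pipeline(
        pipeline_name="census",
        destination=dlt.destinations.filesystem(bucket_url="files/output"),
        dataset_name="census",
        export_schema_path="files/schemas/export",
    )
    load_info = pipeline.run(
        census_source() | census_clean,
        loader_file_format="parquet",
    )
    print(load_info)
```

With a layout like this, running `python census_pipeline.py` would write Parquet load packages under the configured bucket URL and export the inferred schema to `files/schemas/export`.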