This is a project for experimenting with `dlt` for data ingestion in an ETL pipeline. It uses a file of NZ Census data (17 columns × 35.6 million records) in Parquet format as the source and applies a custom cleaning function to the data before writing it back to the filesystem.
The project is structured to allow for easy modification and testing of different data ingestion and transformation strategies.
- Ingests census data from Parquet files located in the `files/input` directory.
- Cleans the ingested data using a custom cleaning function (`helpers.generic.clean_data`); an illustrative sketch follows this list.
- Writes the cleaned data back to the filesystem (configured via `dlt.destinations.filesystem`).
- Exports the inferred schema to `files/schemas/export`.
- Uses `dlt` for pipeline orchestration and data loading.
- Uses `pyarrow` for efficient data handling.
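The contents of `helpers.generic.clean_data` are not reproduced in this README. As a rough illustration of where that step sits in the flow, the sketch below cleans a `pyarrow` record batch; the whitespace-trimming rule and column handling are assumptions for illustration, not the project's actual logic.

```python
# Illustrative sketch only: the real helpers.generic.clean_data is not
# reproduced in this README, and the trimming rule below is an assumed example.
import pyarrow as pa
import pyarrow.compute as pc


def clean_data(batch: pa.RecordBatch) -> pa.RecordBatch:
    """Apply simple column-level cleaning to a pyarrow RecordBatch."""
    table = pa.Table.from_batches([batch])

    # Example rule: strip surrounding whitespace from every string column.
    for name in table.column_names:
        column = table[name]
        if pa.types.is_string(column.type):
            table = table.set_column(
                table.schema.get_field_index(name),
                name,
                pc.utf8_trim_whitespace(column),
            )

    # Return a single RecordBatch so downstream steps see the same shape.
    return table.combine_chunks().to_batches()[0]
```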
This project uses `uv` for dependency management.
- Install `uv`: if you don't have `uv` installed, follow the official installation instructions.
- Create a virtual environment:

  ```
  uv venv
  ```

- Activate the virtual environment:
  - macOS/Linux: `source .venv/bin/activate`
  - Windows: `.venv\Scripts\activate`
- Install dependencies:

  ```
  uv sync
  ```

  This installs the project in editable mode along with the development dependencies.
To run the main data ingestion pipeline:
```
python census_pipeline.py
```

The pipeline will:

- Read Parquet files from `files/input/`.
- Apply the `census_clean` transformer.
- Write the cleaned data (in Parquet format) to the destination configured in the pipeline (defaulting to a local filesystem location managed by `dlt`).
- Log progress to the console.
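For reference, the sketch below shows roughly how a `census_pipeline.py` like the one described above could be wired together with `dlt`. It is a minimal reconstruction based on this README, not the project's actual script: `census_source`, the batching logic, and the `files/output` bucket URL are assumptions; only the input path, `census_clean`, `clean_data`, and the schema export location come from the description above.

```python
# Minimal sketch of a pipeline matching the description above.
# `census_source` and the output location are assumptions, not the
# project's actual code.
from pathlib import Path

import dlt
import pyarrow.parquet as pq

from helpers.generic import clean_data


@dlt.resource(name="census")
def census_source(input_dir: str = "files/input"):
    # Stream each Parquet file in Arrow record batches so the full
    # 35.6m-row dataset never has to fit in memory at once.
    for path in sorted(Path(input_dir).glob("*.parquet")):
        for batch in pq.ParquetFile(path).iter_batches():
            yield batch


@dlt.transformer(name="census_clean")
def census_clean(batch):
    # Apply the project's cleaning function to each batch.
    yield clean_data(batch)


if __name__ == "__main__":
    pipeline = dlt.pipeline(
        pipeline_name="census",
        destination=dlt.destinations.filesystem(bucket_url="files/output"),
        dataset_name="census",
        export_schema_path="files/schemas/export",
    )
    load_info = pipeline.run(
        census_source() | census_clean,
        loader_file_format="parquet",
    )
    print(load_info)
```

With a layout like this, running `python census_pipeline.py` would write Parquet load packages under the configured bucket URL and export the inferred schema to `files/schemas/export`.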