flattener

A Python utility for flattening complex nested data structures generated by the Connect for Cancer Prevention study questionnaires.

Overview

The flattener tool converts Connect study data tables to Parquet files within Google Cloud Storage (GCS), builds and executes a DuckDB SQL script to process these files, and generates flattened Parquet files that are then loaded into BigQuery tables.

This tool is deployed as a Cloud Run function which accepts API requests to initiate the flattening process. The use of this package is orchastrated as an Airflow DAG, available in the flattener-orchestrator repository.

Features

Correctly traversing complex nested structures (STRUCTs within STRUCTs)
Expanding array fields into separate indicator columns
Maintaining proper hierarchical relationships between fields
Generating a flattened Parquet file with intuitive column naming

Usage

TODO

How It Works

Struct Traversal

The tool obtains the schema of a Parquet file and then recursively explores it to identify all fields and their hierarchical relationships. For each field, it:

Builds the correct hierarchical path (e.g., "parent"."child"."grandchild")
Creates an intuitive alias using underscores (e.g., parent_child_grandchild)
Adds the field to the SELECT expression with proper SQL syntax

Array Field Expansion

For fields of type VARCHAR[], the tool:

Identifies all unique values in the array across the entire dataset
For each unique value, creates a string binary indicator column (1 if present, 0 if not)
Names these columns using a pattern of original_field_D_value

For example, an array field containing values ["foo", "bar"] would expand to:

original_field_D_foo = 1
original_field_D_bar = 1

SQL Generation

The tool generates a SQL query that:

Selects all identified fields with proper paths and aliases
Reads from the input Parquet file
Outputs to a new Parquet file

Column Naming Conventions

The tool follows these naming conventions:

Regular nested fields: parent_child_grandchild
Expanded array values: parent_array_field_D_value

Example

Input Schema (Simplified)

struct_column STRUCT(
  nested_struct STRUCT(
    string_field VARCHAR, 
    number_field BIGINT, 
    array_field VARCHAR[]
  )
)

Output Columns

struct_column_nested_struct_string_field
struct_column_nested_struct_number_field
struct_column_nested_struct_array_field_D_value1
struct_column_nested_struct_array_field_D_value2
...

Name		Name	Last commit message	Last commit date
Latest commit History 134 Commits
core		core
.gitignore		.gitignore
Dockerfile		Dockerfile
README.md		README.md
cloudbuild.yaml		cloudbuild.yaml
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

flattener

Overview

Features

Usage

How It Works

Struct Traversal

Array Field Expansion

SQL Generation

Column Naming Conventions

Example

Input Schema (Simplified)

Output Columns

About

Uh oh!

Releases

Packages

Uh oh!

Contributors 2

Uh oh!

Languages

Analyticsphere/flattener

Folders and files

Latest commit

History

Repository files navigation

flattener

Overview

Features

Usage

How It Works

Struct Traversal

Array Field Expansion

SQL Generation

Column Naming Conventions

Example

Input Schema (Simplified)

Output Columns

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 2

Uh oh!

Languages

Packages