GH-62: refactor(metadata): Rename ColumnProfile to ColumnInformation by Urfoex · Pull Request #63 · getml/getml-io

Urfoex · 2025-08-29T03:09:49Z

The Pydantic model ColumnProfile and its corresponding field column_profile have been renamed to ColumnInformation and columns respectively.

This change improves clarity and consistency. "Information" is a more accurate and general term for the data being stored (name, role, statistics) than "Profile". The field name columns is more concise and conventional.

All related functions, variables, test fixtures, and test data have been updated to reflect this renaming. No functional changes are introduced.

GH-62: refactor: Add support for text column statistics

Introduces support for calculating and storing summary statistics for columns with the text role.

This includes:

Adding a ColumnStatisticsText model.
Refactoring ColumnStatistics models to use a common ColumnStatisticsBase class, reducing code duplication.
Making statistical fields (like min, max, avg) optional to correctly handle columns with only NULL values.
Introducing a ColumnType enum to replace string literals for column types, improving type safety and readability.
Updating test fixtures and helpers to align with the new data models and support the text role.

GH-62: ci: Optimize workflow triggers with path filtering

Adds path filters to the pull_request and push triggers for the Python test workflow.

This prevents the workflow from running on changes unrelated to the source code, tests, or project configuration (e.g., documentation updates).

Additionally, this change adds a push trigger to run tests on merges to main and develop, ensuring the integrity of the primary branches.

GH-62: refactor: Generalize Parquet serialization and statistics logic

Extract the logic for serializing data to Parquet and calculating column statistics into a new, more generic serialize_dataframe function in serialize/parquet.py.

This new function is now used for serializing both getML DataFrames/Views and prediction results. This refactoring improves code reuse and separation of concerns.

Key changes:

serialize_dataframe accepts callables for saving the Parquet file and retrieving column roles, decoupling it from specific data structures like getml.DataFrame.
serialize_predictions now creates a pyarrow.Table directly from the numpy array and uses the new serialize_dataframe function, avoiding the overhead of creating an intermediate getml.DataFrame.
Moves shared logic out of serialize_dataframe_or_view.py, simplifying it significantly.

GH-62: refactor(pydantic): Use frozen=True for immutable models

Modernize Pydantic model definitions by replacing the model_config dictionary with the frozen=True class keyword argument.

This change simplifies the code, makes it more readable, and aligns with current Pydantic v2 best practices for creating immutable models. No functional changes are introduced.

Copilot

Pull Request Overview

This PR refactors Pydantic model names and field names for improved clarity and consistency. The main changes include renaming ColumnProfile to ColumnInformation, updating the corresponding field name from column_profile to columns, and adopting modern Pydantic v2 practices by using frozen=True instead of model_config dictionaries.

Key changes include:

Renaming ColumnProfile to ColumnInformation and column_profile field to columns
Implementing frozen=True for immutable Pydantic models
Adding support for text column statistics and introducing ColumnType enum
Generalizing Parquet serialization logic into a reusable serialize_dataframe function

Reviewed Changes

Copilot reviewed 38 out of 39 changed files in this pull request and generated 1 comment.

Show a summary per file

File	Description
tests/unit/types.py	Updates type aliases to use new `ColumnInformationType` naming
tests/unit/serialize/test_pipeline.py	Updates test assertions to use new target name `target0`
tests/unit/serialize/test_parquet.py	New test file for generalized Parquet serialization functionality
tests/unit/serialize/test_dataframe_or_view.py	Simplified tests using new shared helper functions
tests/unit/serialize/test_container.py	Updates to use new column information naming and helper functions
tests/unit/metadata/test_pipeline_information.py	Updates type references and test data to use new naming
tests/unit/metadata/test_container_information.py	Updates to use new `ColumnInformationType`
tests/unit/conftest.py	Major refactoring of test fixtures and helpers to support new naming and structure
Integration test files	Updates Pydantic model definitions to use `frozen=True` and expected JSON output changes

_{Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.}

tests/unit/conftest.py

gemini-code-assist

Code Review

This is a substantial pull request that introduces several valuable refactorings and improvements. The renaming of ColumnProfile to ColumnInformation enhances clarity. The modernization of Pydantic models to use frozen=True is a great update. The core of this PR is the generalization of Parquet serialization and statistics calculation, which significantly improves code reuse and separation of concerns. Adding support for text column statistics and refactoring the statistics models with a base class and optional fields is also a solid improvement for correctness and maintainability. Overall, these are excellent changes. I've identified a couple of areas in the new serialization logic where robustness could be improved to handle potential edge cases gracefully.

src/getml_io/serialize/parquet.py

The Pydantic model `ColumnProfile` and its corresponding field `column_profile` have been renamed to `ColumnInformation` and `columns` respectively. This change improves clarity and consistency. "Information" is a more accurate and general term for the data being stored (name, role, statistics) than "Profile". The field name `columns` is more concise and conventional. All related functions, variables, test fixtures, and test data have been updated to reflect this renaming. No functional changes are introduced. GH-62: refactor: Add support for text column statistics Introduces support for calculating and storing summary statistics for columns with the `text` role. This includes: - Adding a `ColumnStatisticsText` model. - Refactoring `ColumnStatistics` models to use a common `ColumnStatisticsBase` class, reducing code duplication. - Making statistical fields (like min, max, avg) optional to correctly handle columns with only NULL values. - Introducing a `ColumnType` enum to replace string literals for column types, improving type safety and readability. - Updating test fixtures and helpers to align with the new data models and support the `text` role. GH-62: ci: Optimize workflow triggers with path filtering Adds path filters to the pull_request and push triggers for the Python test workflow. This prevents the workflow from running on changes unrelated to the source code, tests, or project configuration (e.g., documentation updates). Additionally, this change adds a `push` trigger to run tests on merges to `main` and `develop`, ensuring the integrity of the primary branches. GH-62: refactor: Generalize Parquet serialization and statistics logic Extract the logic for serializing data to Parquet and calculating column statistics into a new, more generic `serialize_dataframe` function in `serialize/parquet.py`. This new function is now used for serializing both getML DataFrames/Views and prediction results. This refactoring improves code reuse and separation of concerns. Key changes: - `serialize_dataframe` accepts callables for saving the Parquet file and retrieving column roles, decoupling it from specific data structures like `getml.DataFrame`. - `serialize_predictions` now creates a `pyarrow.Table` directly from the numpy array and uses the new `serialize_dataframe` function, avoiding the overhead of creating an intermediate `getml.DataFrame`. - Moves shared logic out of `serialize_dataframe_or_view.py`, simplifying it significantly. GH-62: refactor(pydantic): Use `frozen=True` for immutable models Modernize Pydantic model definitions by replacing the `model_config` dictionary with the `frozen=True` class keyword argument. This change simplifies the code, makes it more readable, and aligns with current Pydantic v2 best practices for creating immutable models. No functional changes are introduced.

srnnkls

lgtm.

awaismirza92

LGTM

Urfoex requested a review from Copilot August 29, 2025 03:09

Urfoex self-assigned this Aug 29, 2025

Urfoex added the enhancement New feature or request label Aug 29, 2025

Copilot AI reviewed Aug 29, 2025

View reviewed changes

tests/unit/conftest.py Show resolved Hide resolved

gemini-code-assist bot reviewed Aug 29, 2025

View reviewed changes

src/getml_io/serialize/parquet.py Show resolved Hide resolved

src/getml_io/serialize/parquet.py Show resolved Hide resolved

Urfoex marked this pull request as ready for review August 29, 2025 03:15

Urfoex requested review from awaismirza92 and srnnkls August 29, 2025 03:15

Urfoex force-pushed the feature/GH-18-serialize-pipeline-scores branch from d8cdb40 to a8d9594 Compare August 29, 2025 20:40

Base automatically changed from feature/GH-18-serialize-pipeline-scores to develop August 29, 2025 20:40

Urfoex force-pushed the refactor/GH-62-incorporate-review-feedback branch from b4ffcde to 96fa168 Compare August 29, 2025 20:41

srnnkls approved these changes Sep 1, 2025

View reviewed changes

awaismirza92 approved these changes Sep 1, 2025

View reviewed changes

Urfoex merged commit ad8eb81 into develop Sep 1, 2025
3 checks passed

Urfoex deleted the refactor/GH-62-incorporate-review-feedback branch September 1, 2025 10:16

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GH-62: refactor(metadata): Rename ColumnProfile to ColumnInformation#63

GH-62: refactor(metadata): Rename ColumnProfile to ColumnInformation#63
Urfoex merged 1 commit intodevelopfrom
refactor/GH-62-incorporate-review-feedback

Urfoex commented Aug 29, 2025

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

Uh oh!

Uh oh!

srnnkls left a comment

Uh oh!

awaismirza92 left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Comments

Conversation

Urfoex commented Aug 29, 2025

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

srnnkls left a comment

Choose a reason for hiding this comment

Uh oh!

awaismirza92 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Comments