Skip to content

GH-62: refactor(metadata): Rename ColumnProfile to ColumnInformation#63

Merged
Urfoex merged 1 commit intodevelopfrom
refactor/GH-62-incorporate-review-feedback
Sep 1, 2025
Merged

GH-62: refactor(metadata): Rename ColumnProfile to ColumnInformation#63
Urfoex merged 1 commit intodevelopfrom
refactor/GH-62-incorporate-review-feedback

Conversation

@Urfoex
Copy link
Collaborator

@Urfoex Urfoex commented Aug 29, 2025

The Pydantic model ColumnProfile and its corresponding field column_profile have been renamed to ColumnInformation and columns respectively.

This change improves clarity and consistency. "Information" is a more accurate and general term for the data being stored (name, role, statistics) than "Profile". The field name columns is more concise and conventional.

All related functions, variables, test fixtures, and test data have been updated to reflect this renaming. No functional changes are introduced.

GH-62: refactor: Add support for text column statistics

Introduces support for calculating and storing summary statistics for columns with the text role.

This includes:

  • Adding a ColumnStatisticsText model.
  • Refactoring ColumnStatistics models to use a common ColumnStatisticsBase class, reducing code duplication.
  • Making statistical fields (like min, max, avg) optional to correctly handle columns with only NULL values.
  • Introducing a ColumnType enum to replace string literals for column types, improving type safety and readability.
  • Updating test fixtures and helpers to align with the new data models and support the text role.

GH-62: ci: Optimize workflow triggers with path filtering

Adds path filters to the pull_request and push triggers for the Python test workflow.

This prevents the workflow from running on changes unrelated to the source code, tests, or project configuration (e.g., documentation updates).

Additionally, this change adds a push trigger to run tests on merges to main and develop, ensuring the integrity of the primary branches.

GH-62: refactor: Generalize Parquet serialization and statistics logic

Extract the logic for serializing data to Parquet and calculating column statistics into a new, more generic serialize_dataframe function in serialize/parquet.py.

This new function is now used for serializing both getML DataFrames/Views and prediction results. This refactoring improves code reuse and separation of concerns.

Key changes:

  • serialize_dataframe accepts callables for saving the Parquet file and retrieving column roles, decoupling it from specific data structures like getml.DataFrame.
  • serialize_predictions now creates a pyarrow.Table directly from the numpy array and uses the new serialize_dataframe function, avoiding the overhead of creating an intermediate getml.DataFrame.
  • Moves shared logic out of serialize_dataframe_or_view.py, simplifying it significantly.

GH-62: refactor(pydantic): Use frozen=True for immutable models

Modernize Pydantic model definitions by replacing the model_config dictionary with the frozen=True class keyword argument.

This change simplifies the code, makes it more readable, and aligns with current Pydantic v2 best practices for creating immutable models. No functional changes are introduced.

@Urfoex Urfoex requested a review from Copilot August 29, 2025 03:09
@Urfoex Urfoex self-assigned this Aug 29, 2025
@Urfoex Urfoex added the enhancement New feature or request label Aug 29, 2025
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR refactors Pydantic model names and field names for improved clarity and consistency. The main changes include renaming ColumnProfile to ColumnInformation, updating the corresponding field name from column_profile to columns, and adopting modern Pydantic v2 practices by using frozen=True instead of model_config dictionaries.

Key changes include:

  • Renaming ColumnProfile to ColumnInformation and column_profile field to columns
  • Implementing frozen=True for immutable Pydantic models
  • Adding support for text column statistics and introducing ColumnType enum
  • Generalizing Parquet serialization logic into a reusable serialize_dataframe function

Reviewed Changes

Copilot reviewed 38 out of 39 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
tests/unit/types.py Updates type aliases to use new ColumnInformationType naming
tests/unit/serialize/test_pipeline.py Updates test assertions to use new target name target0
tests/unit/serialize/test_parquet.py New test file for generalized Parquet serialization functionality
tests/unit/serialize/test_dataframe_or_view.py Simplified tests using new shared helper functions
tests/unit/serialize/test_container.py Updates to use new column information naming and helper functions
tests/unit/metadata/test_pipeline_information.py Updates type references and test data to use new naming
tests/unit/metadata/test_container_information.py Updates to use new ColumnInformationType
tests/unit/conftest.py Major refactoring of test fixtures and helpers to support new naming and structure
Integration test files Updates Pydantic model definitions to use frozen=True and expected JSON output changes

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This is a substantial pull request that introduces several valuable refactorings and improvements. The renaming of ColumnProfile to ColumnInformation enhances clarity. The modernization of Pydantic models to use frozen=True is a great update. The core of this PR is the generalization of Parquet serialization and statistics calculation, which significantly improves code reuse and separation of concerns. Adding support for text column statistics and refactoring the statistics models with a base class and optional fields is also a solid improvement for correctness and maintainability. Overall, these are excellent changes. I've identified a couple of areas in the new serialization logic where robustness could be improved to handle potential edge cases gracefully.

@Urfoex Urfoex marked this pull request as ready for review August 29, 2025 03:15
@Urfoex Urfoex requested review from awaismirza92 and srnnkls August 29, 2025 03:15
@Urfoex Urfoex force-pushed the feature/GH-18-serialize-pipeline-scores branch from d8cdb40 to a8d9594 Compare August 29, 2025 20:40
Base automatically changed from feature/GH-18-serialize-pipeline-scores to develop August 29, 2025 20:40
The Pydantic model `ColumnProfile` and its corresponding field `column_profile` have been renamed to `ColumnInformation` and `columns` respectively.

This change improves clarity and consistency. "Information" is a more accurate and general term for the data being stored (name, role, statistics) than "Profile". The field name `columns` is more concise and conventional.

All related functions, variables, test fixtures, and test data have been updated to reflect this renaming. No functional changes are introduced.

GH-62: refactor: Add support for text column statistics

Introduces support for calculating and storing summary statistics for columns with the `text` role.

This includes:
- Adding a `ColumnStatisticsText` model.
- Refactoring `ColumnStatistics` models to use a common `ColumnStatisticsBase` class, reducing code duplication.
- Making statistical fields (like min, max, avg) optional to correctly handle columns with only NULL values.
- Introducing a `ColumnType` enum to replace string literals for column types, improving type safety and readability.
- Updating test fixtures and helpers to align with the new data models and support the `text` role.

GH-62: ci: Optimize workflow triggers with path filtering

Adds path filters to the pull_request and push triggers for the Python test workflow.

This prevents the workflow from running on changes unrelated to the source code, tests, or project configuration (e.g., documentation updates).

Additionally, this change adds a `push` trigger to run tests on merges to `main` and `develop`, ensuring the integrity of the primary branches.

GH-62: refactor: Generalize Parquet serialization and statistics logic

Extract the logic for serializing data to Parquet and calculating column statistics into a new, more generic `serialize_dataframe` function in `serialize/parquet.py`.

This new function is now used for serializing both getML DataFrames/Views and prediction results. This refactoring improves code reuse and separation of concerns.

Key changes:
- `serialize_dataframe` accepts callables for saving the Parquet file and retrieving column roles, decoupling it from specific data structures like `getml.DataFrame`.
- `serialize_predictions` now creates a `pyarrow.Table` directly from the numpy array and uses the new `serialize_dataframe` function, avoiding the overhead of creating an intermediate `getml.DataFrame`.
- Moves shared logic out of `serialize_dataframe_or_view.py`, simplifying it significantly.

GH-62: refactor(pydantic): Use `frozen=True` for immutable models

Modernize Pydantic model definitions by replacing the `model_config` dictionary with the `frozen=True` class keyword argument.

This change simplifies the code, makes it more readable, and aligns with current Pydantic v2 best practices for creating immutable models. No functional changes are introduced.
@Urfoex Urfoex force-pushed the refactor/GH-62-incorporate-review-feedback branch from b4ffcde to 96fa168 Compare August 29, 2025 20:41
Copy link
Collaborator

@srnnkls srnnkls left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm.

Copy link

@awaismirza92 awaismirza92 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Urfoex Urfoex merged commit ad8eb81 into develop Sep 1, 2025
3 checks passed
@Urfoex Urfoex deleted the refactor/GH-62-incorporate-review-feedback branch September 1, 2025 10:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants

Comments