Conversation
There was a problem hiding this comment.
Pull Request Overview
This PR refactors Pydantic model names and field names for improved clarity and consistency. The main changes include renaming ColumnProfile to ColumnInformation, updating the corresponding field name from column_profile to columns, and adopting modern Pydantic v2 practices by using frozen=True instead of model_config dictionaries.
Key changes include:
- Renaming
ColumnProfiletoColumnInformationandcolumn_profilefield tocolumns - Implementing
frozen=Truefor immutable Pydantic models - Adding support for text column statistics and introducing
ColumnTypeenum - Generalizing Parquet serialization logic into a reusable
serialize_dataframefunction
Reviewed Changes
Copilot reviewed 38 out of 39 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| tests/unit/types.py | Updates type aliases to use new ColumnInformationType naming |
| tests/unit/serialize/test_pipeline.py | Updates test assertions to use new target name target0 |
| tests/unit/serialize/test_parquet.py | New test file for generalized Parquet serialization functionality |
| tests/unit/serialize/test_dataframe_or_view.py | Simplified tests using new shared helper functions |
| tests/unit/serialize/test_container.py | Updates to use new column information naming and helper functions |
| tests/unit/metadata/test_pipeline_information.py | Updates type references and test data to use new naming |
| tests/unit/metadata/test_container_information.py | Updates to use new ColumnInformationType |
| tests/unit/conftest.py | Major refactoring of test fixtures and helpers to support new naming and structure |
| Integration test files | Updates Pydantic model definitions to use frozen=True and expected JSON output changes |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
There was a problem hiding this comment.
Code Review
This is a substantial pull request that introduces several valuable refactorings and improvements. The renaming of ColumnProfile to ColumnInformation enhances clarity. The modernization of Pydantic models to use frozen=True is a great update. The core of this PR is the generalization of Parquet serialization and statistics calculation, which significantly improves code reuse and separation of concerns. Adding support for text column statistics and refactoring the statistics models with a base class and optional fields is also a solid improvement for correctness and maintainability. Overall, these are excellent changes. I've identified a couple of areas in the new serialization logic where robustness could be improved to handle potential edge cases gracefully.
d8cdb40 to
a8d9594
Compare
The Pydantic model `ColumnProfile` and its corresponding field `column_profile` have been renamed to `ColumnInformation` and `columns` respectively. This change improves clarity and consistency. "Information" is a more accurate and general term for the data being stored (name, role, statistics) than "Profile". The field name `columns` is more concise and conventional. All related functions, variables, test fixtures, and test data have been updated to reflect this renaming. No functional changes are introduced. GH-62: refactor: Add support for text column statistics Introduces support for calculating and storing summary statistics for columns with the `text` role. This includes: - Adding a `ColumnStatisticsText` model. - Refactoring `ColumnStatistics` models to use a common `ColumnStatisticsBase` class, reducing code duplication. - Making statistical fields (like min, max, avg) optional to correctly handle columns with only NULL values. - Introducing a `ColumnType` enum to replace string literals for column types, improving type safety and readability. - Updating test fixtures and helpers to align with the new data models and support the `text` role. GH-62: ci: Optimize workflow triggers with path filtering Adds path filters to the pull_request and push triggers for the Python test workflow. This prevents the workflow from running on changes unrelated to the source code, tests, or project configuration (e.g., documentation updates). Additionally, this change adds a `push` trigger to run tests on merges to `main` and `develop`, ensuring the integrity of the primary branches. GH-62: refactor: Generalize Parquet serialization and statistics logic Extract the logic for serializing data to Parquet and calculating column statistics into a new, more generic `serialize_dataframe` function in `serialize/parquet.py`. This new function is now used for serializing both getML DataFrames/Views and prediction results. This refactoring improves code reuse and separation of concerns. Key changes: - `serialize_dataframe` accepts callables for saving the Parquet file and retrieving column roles, decoupling it from specific data structures like `getml.DataFrame`. - `serialize_predictions` now creates a `pyarrow.Table` directly from the numpy array and uses the new `serialize_dataframe` function, avoiding the overhead of creating an intermediate `getml.DataFrame`. - Moves shared logic out of `serialize_dataframe_or_view.py`, simplifying it significantly. GH-62: refactor(pydantic): Use `frozen=True` for immutable models Modernize Pydantic model definitions by replacing the `model_config` dictionary with the `frozen=True` class keyword argument. This change simplifies the code, makes it more readable, and aligns with current Pydantic v2 best practices for creating immutable models. No functional changes are introduced.
b4ffcde to
96fa168
Compare
The Pydantic model
ColumnProfileand its corresponding fieldcolumn_profilehave been renamed toColumnInformationandcolumnsrespectively.This change improves clarity and consistency. "Information" is a more accurate and general term for the data being stored (name, role, statistics) than "Profile". The field name
columnsis more concise and conventional.All related functions, variables, test fixtures, and test data have been updated to reflect this renaming. No functional changes are introduced.
GH-62: refactor: Add support for text column statistics
Introduces support for calculating and storing summary statistics for columns with the
textrole.This includes:
ColumnStatisticsTextmodel.ColumnStatisticsmodels to use a commonColumnStatisticsBaseclass, reducing code duplication.ColumnTypeenum to replace string literals for column types, improving type safety and readability.textrole.GH-62: ci: Optimize workflow triggers with path filtering
Adds path filters to the pull_request and push triggers for the Python test workflow.
This prevents the workflow from running on changes unrelated to the source code, tests, or project configuration (e.g., documentation updates).
Additionally, this change adds a
pushtrigger to run tests on merges tomainanddevelop, ensuring the integrity of the primary branches.GH-62: refactor: Generalize Parquet serialization and statistics logic
Extract the logic for serializing data to Parquet and calculating column statistics into a new, more generic
serialize_dataframefunction inserialize/parquet.py.This new function is now used for serializing both getML DataFrames/Views and prediction results. This refactoring improves code reuse and separation of concerns.
Key changes:
serialize_dataframeaccepts callables for saving the Parquet file and retrieving column roles, decoupling it from specific data structures likegetml.DataFrame.serialize_predictionsnow creates apyarrow.Tabledirectly from the numpy array and uses the newserialize_dataframefunction, avoiding the overhead of creating an intermediategetml.DataFrame.serialize_dataframe_or_view.py, simplifying it significantly.GH-62: refactor(pydantic): Use
frozen=Truefor immutable modelsModernize Pydantic model definitions by replacing the
model_configdictionary with thefrozen=Trueclass keyword argument.This change simplifies the code, makes it more readable, and aligns with current Pydantic v2 best practices for creating immutable models. No functional changes are introduced.