Conversation
There was a problem hiding this comment.
Pull Request Overview
This PR refactors the serialization of pipeline predictions to unify it with the existing feature set serialization approach, removing code duplication and improving consistency. It also introduces strict type checking with basedpyright into the CI pipeline.
- Removes custom
TableInformationclass andserialize_ndarrayfunction, using the standard DataFrame serialization instead - Converts NumPy arrays from predictions to
getml.DataFrameviapyarrow.Tablebefore serializing - Updates all related code, tests, and expected outputs to reflect the new naming scheme and data structure
- Adds
basedpyrightto CI with necessary type ignore annotations
Reviewed Changes
Copilot reviewed 33 out of 33 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| src/getml_io/serialize/pipeline.py | Main refactoring to convert predictions to DataFrame and use unified serialization |
| src/getml_io/serialize/ndarray.py | Removed - no longer needed with unified approach |
| src/getml_io/metadata/table_information.py | Removed - replaced by DataFrameInformation |
| src/getml_io/metadata/dataframe_information.py | Fixed potential None value for std in column statistics |
| src/getml_io/metadata/utils.py | Simplified to only handle DataFrameInformation |
| tests/* | Updated tests to reflect new naming scheme and data structures |
| .github/workflows/python-tests.yml | Added basedpyright type checking to CI |
Comments suppressed due to low confidence (1)
tests/unit/serialize/test_pipeline.py:214
- The test no longer verifies that the prediction files actually exist on disk, which was an important validation. Consider adding this assertion back or explaining why it's no longer needed.
expected_path = path / f"prediction.{subset_name}.parquet"
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
There was a problem hiding this comment.
Code Review
This pull request is a great refactoring effort that unifies the serialization of pipeline predictions and feature sets. By removing the custom TableInformation class and serialize_ndarray function, and instead leveraging the existing serialize_dataframe_or_view function, the code is now more consistent and maintainable. The introduction of basedpyright for stricter type checking is also a valuable addition for improving code quality. The changes are well-implemented and the tests are updated accordingly. I have one suggestion to make the exception handling more specific for better robustness.
srnnkls
left a comment
There was a problem hiding this comment.
great simplification. lgtm.
|
Added one comment on wasteful data conversions. |
ed82b6a to
80daa12
Compare
…alization This commit refactors the serialization of pipeline predictions to align with the existing process for feature sets. Previously, predictions (as NumPy arrays) were handled by a custom `serialize_ndarray` function and tracked using a separate `TableInformation` data class. This led to code duplication and inconsistency. The changes include: - Removing the `TableInformation` class and its related exceptions. - Deleting the `serialize_ndarray` utility function. - Updating the `serialize_predictions` function to: 1. Convert the NumPy array from `pipeline.predict()` into a `getml.DataFrame` via `pyarrow.Table`. 2. Use the existing `serialize_dataframe_or_view` function to store the prediction data and its metadata. - Simplifying the `derive_instance_with_relative_path` utility, as it now only needs to handle `DataFrameInformation`. Additionally, this commit introduces `basedpyright` into the CI pipeline for stricter static type checking, along with necessary `pyright: ignore` annotations for third-party library interactions. A minor bug where `std` in column statistics could be `None` has also been fixed.
da5401c to
0ccae38
Compare
…alization
This commit refactors the serialization of pipeline predictions to align with the existing process for feature sets.
Previously, predictions (as NumPy arrays) were handled by a custom
serialize_ndarrayfunction and tracked using a separateTableInformationdata class. This led to code duplication and inconsistency.The changes include:
TableInformationclass and its related exceptions.serialize_ndarrayutility function.serialize_predictionsfunction to:pipeline.predict()into agetml.DataFrameviapyarrow.Table.serialize_dataframe_or_viewfunction to store the prediction data and its metadata.derive_instance_with_relative_pathutility, as it now only needs to handleDataFrameInformation.Additionally, this commit introduces
basedpyrightinto the CI pipeline for stricter static type checking, along with necessarypyright: ignoreannotations for third-party library interactions. A minor bug wherestdin column statistics could beNonehas also been fixed.