GH-54: refactor(serialization): Unify prediction and feature set seri… by Urfoex · Pull Request #57 · getml/getml-io

Urfoex · 2025-08-19T21:48:37Z

…alization

This commit refactors the serialization of pipeline predictions to align with the existing process for feature sets.

Previously, predictions (as NumPy arrays) were handled by a custom serialize_ndarray function and tracked using a separate TableInformation data class. This led to code duplication and inconsistency.

The changes include:

Removing the TableInformation class and its related exceptions.
Deleting the serialize_ndarray utility function.
Updating the serialize_predictions function to:
1. Convert the NumPy array from pipeline.predict() into a getml.DataFrame via pyarrow.Table.
2. Use the existing serialize_dataframe_or_view function to store the prediction data and its metadata.
Simplifying the derive_instance_with_relative_path utility, as it now only needs to handle DataFrameInformation.

Additionally, this commit introduces basedpyright into the CI pipeline for stricter static type checking, along with necessary pyright: ignore annotations for third-party library interactions. A minor bug where std in column statistics could be None has also been fixed.

Copilot

Pull Request Overview

This PR refactors the serialization of pipeline predictions to unify it with the existing feature set serialization approach, removing code duplication and improving consistency. It also introduces strict type checking with basedpyright into the CI pipeline.

Removes custom TableInformation class and serialize_ndarray function, using the standard DataFrame serialization instead
Converts NumPy arrays from predictions to getml.DataFrame via pyarrow.Table before serializing
Updates all related code, tests, and expected outputs to reflect the new naming scheme and data structure
Adds basedpyright to CI with necessary type ignore annotations

Reviewed Changes

Copilot reviewed 33 out of 33 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
src/getml_io/serialize/pipeline.py	Main refactoring to convert predictions to DataFrame and use unified serialization
src/getml_io/serialize/ndarray.py	Removed - no longer needed with unified approach
src/getml_io/metadata/table_information.py	Removed - replaced by DataFrameInformation
src/getml_io/metadata/dataframe_information.py	Fixed potential None value for std in column statistics
src/getml_io/metadata/utils.py	Simplified to only handle DataFrameInformation
tests/*	Updated tests to reflect new naming scheme and data structures
.github/workflows/python-tests.yml	Added basedpyright type checking to CI

Comments suppressed due to low confidence (1)

tests/unit/serialize/test_pipeline.py:214

The test no longer verifies that the prediction files actually exist on disk, which was an important validation. Consider adding this assertion back or explaining why it's no longer needed.

    expected_path = path / f"prediction.{subset_name}.parquet"

_{Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.}

src/getml_io/serialize/pipeline.py

gemini-code-assist

Code Review

This pull request is a great refactoring effort that unifies the serialization of pipeline predictions and feature sets. By removing the custom TableInformation class and serialize_ndarray function, and instead leveraging the existing serialize_dataframe_or_view function, the code is now more consistent and maintainable. The introduction of basedpyright for stricter type checking is also a valuable addition for improving code quality. The changes are well-implemented and the tests are updated accordingly. I have one suggestion to make the exception handling more specific for better robustness.

src/getml_io/metadata/utils.py

srnnkls

great simplification. lgtm.

src/getml_io/serialize/pipeline.py

srnnkls · 2025-08-25T09:17:30Z

Added one comment on wasteful data conversions.

awaismirza92

LGTM

…alization This commit refactors the serialization of pipeline predictions to align with the existing process for feature sets. Previously, predictions (as NumPy arrays) were handled by a custom `serialize_ndarray` function and tracked using a separate `TableInformation` data class. This led to code duplication and inconsistency. The changes include: - Removing the `TableInformation` class and its related exceptions. - Deleting the `serialize_ndarray` utility function. - Updating the `serialize_predictions` function to: 1. Convert the NumPy array from `pipeline.predict()` into a `getml.DataFrame` via `pyarrow.Table`. 2. Use the existing `serialize_dataframe_or_view` function to store the prediction data and its metadata. - Simplifying the `derive_instance_with_relative_path` utility, as it now only needs to handle `DataFrameInformation`. Additionally, this commit introduces `basedpyright` into the CI pipeline for stricter static type checking, along with necessary `pyright: ignore` annotations for third-party library interactions. A minor bug where `std` in column statistics could be `None` has also been fixed.

Urfoex requested a review from Copilot August 19, 2025 21:48

Urfoex self-assigned this Aug 19, 2025

Urfoex added the enhancement New feature or request label Aug 19, 2025

Copilot AI reviewed Aug 19, 2025

View reviewed changes

src/getml_io/serialize/pipeline.py Show resolved Hide resolved

src/getml_io/serialize/pipeline.py Show resolved Hide resolved

gemini-code-assist bot reviewed Aug 19, 2025

View reviewed changes

src/getml_io/metadata/utils.py Show resolved Hide resolved

Urfoex marked this pull request as ready for review August 19, 2025 21:54

Urfoex requested a review from srnnkls August 19, 2025 21:54

Urfoex linked an issue Aug 22, 2025 that may be closed by this pull request

Convert prediction NDArray to DataFrame and use existing dataframe-serialization #54

Closed

Urfoex requested a review from awaismirza92 August 25, 2025 08:43

srnnkls approved these changes Aug 25, 2025

View reviewed changes

srnnkls reviewed Aug 25, 2025

View reviewed changes

src/getml_io/serialize/pipeline.py Show resolved Hide resolved

awaismirza92 approved these changes Aug 26, 2025

View reviewed changes

Urfoex mentioned this pull request Aug 28, 2025

Incorporate Review Feedback #62

Closed

Urfoex force-pushed the feature/GH-16-serialize-pipeline-fields-data-model branch from ed82b6a to 80daa12 Compare August 29, 2025 20:32

Base automatically changed from feature/GH-16-serialize-pipeline-fields-data-model to develop August 29, 2025 20:33

Urfoex force-pushed the feature/GH-54-convert-predict-ndarray-to-dataframe branch from da5401c to 0ccae38 Compare August 29, 2025 20:34

Urfoex merged commit f9b3dc6 into develop Aug 29, 2025
3 checks passed

Urfoex deleted the feature/GH-54-convert-predict-ndarray-to-dataframe branch August 29, 2025 20:36

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GH-54: refactor(serialization): Unify prediction and feature set seri…#57

GH-54: refactor(serialization): Unify prediction and feature set seri…#57
Urfoex merged 1 commit intodevelopfrom
feature/GH-54-convert-predict-ndarray-to-dataframe

Urfoex commented Aug 19, 2025

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

Uh oh!

srnnkls left a comment

Uh oh!

Uh oh!

srnnkls commented Aug 25, 2025

Uh oh!

awaismirza92 left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Comments

Conversation

Urfoex commented Aug 19, 2025

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Uh oh!

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

srnnkls left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

srnnkls commented Aug 25, 2025

Uh oh!

awaismirza92 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Comments