Skip to content

GH-54: refactor(serialization): Unify prediction and feature set seri…#57

Merged
Urfoex merged 1 commit intodevelopfrom
feature/GH-54-convert-predict-ndarray-to-dataframe
Aug 29, 2025
Merged

GH-54: refactor(serialization): Unify prediction and feature set seri…#57
Urfoex merged 1 commit intodevelopfrom
feature/GH-54-convert-predict-ndarray-to-dataframe

Conversation

@Urfoex
Copy link
Collaborator

@Urfoex Urfoex commented Aug 19, 2025

…alization

This commit refactors the serialization of pipeline predictions to align with the existing process for feature sets.

Previously, predictions (as NumPy arrays) were handled by a custom serialize_ndarray function and tracked using a separate TableInformation data class. This led to code duplication and inconsistency.

The changes include:

  • Removing the TableInformation class and its related exceptions.
  • Deleting the serialize_ndarray utility function.
  • Updating the serialize_predictions function to:
    1. Convert the NumPy array from pipeline.predict() into a getml.DataFrame via pyarrow.Table.
    2. Use the existing serialize_dataframe_or_view function to store the prediction data and its metadata.
  • Simplifying the derive_instance_with_relative_path utility, as it now only needs to handle DataFrameInformation.

Additionally, this commit introduces basedpyright into the CI pipeline for stricter static type checking, along with necessary pyright: ignore annotations for third-party library interactions. A minor bug where std in column statistics could be None has also been fixed.

@Urfoex Urfoex requested a review from Copilot August 19, 2025 21:48
@Urfoex Urfoex self-assigned this Aug 19, 2025
@Urfoex Urfoex added the enhancement New feature or request label Aug 19, 2025
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR refactors the serialization of pipeline predictions to unify it with the existing feature set serialization approach, removing code duplication and improving consistency. It also introduces strict type checking with basedpyright into the CI pipeline.

  • Removes custom TableInformation class and serialize_ndarray function, using the standard DataFrame serialization instead
  • Converts NumPy arrays from predictions to getml.DataFrame via pyarrow.Table before serializing
  • Updates all related code, tests, and expected outputs to reflect the new naming scheme and data structure
  • Adds basedpyright to CI with necessary type ignore annotations

Reviewed Changes

Copilot reviewed 33 out of 33 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
src/getml_io/serialize/pipeline.py Main refactoring to convert predictions to DataFrame and use unified serialization
src/getml_io/serialize/ndarray.py Removed - no longer needed with unified approach
src/getml_io/metadata/table_information.py Removed - replaced by DataFrameInformation
src/getml_io/metadata/dataframe_information.py Fixed potential None value for std in column statistics
src/getml_io/metadata/utils.py Simplified to only handle DataFrameInformation
tests/* Updated tests to reflect new naming scheme and data structures
.github/workflows/python-tests.yml Added basedpyright type checking to CI
Comments suppressed due to low confidence (1)

tests/unit/serialize/test_pipeline.py:214

  • The test no longer verifies that the prediction files actually exist on disk, which was an important validation. Consider adding this assertion back or explaining why it's no longer needed.
    expected_path = path / f"prediction.{subset_name}.parquet"

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request is a great refactoring effort that unifies the serialization of pipeline predictions and feature sets. By removing the custom TableInformation class and serialize_ndarray function, and instead leveraging the existing serialize_dataframe_or_view function, the code is now more consistent and maintainable. The introduction of basedpyright for stricter type checking is also a valuable addition for improving code quality. The changes are well-implemented and the tests are updated accordingly. I have one suggestion to make the exception handling more specific for better robustness.

@Urfoex Urfoex marked this pull request as ready for review August 19, 2025 21:54
@Urfoex Urfoex requested a review from srnnkls August 19, 2025 21:54
@Urfoex Urfoex requested a review from awaismirza92 August 25, 2025 08:43
Copy link
Collaborator

@srnnkls srnnkls left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great simplification. lgtm.

@srnnkls
Copy link
Collaborator

srnnkls commented Aug 25, 2025

Added one comment on wasteful data conversions.

Copy link

@awaismirza92 awaismirza92 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Urfoex Urfoex force-pushed the feature/GH-16-serialize-pipeline-fields-data-model branch from ed82b6a to 80daa12 Compare August 29, 2025 20:32
Base automatically changed from feature/GH-16-serialize-pipeline-fields-data-model to develop August 29, 2025 20:33
…alization

This commit refactors the serialization of pipeline predictions to align with the existing process for feature sets.

Previously, predictions (as NumPy arrays) were handled by a custom `serialize_ndarray` function and tracked using a separate `TableInformation` data class. This led to code duplication and inconsistency.

The changes include:
- Removing the `TableInformation` class and its related exceptions.
- Deleting the `serialize_ndarray` utility function.
- Updating the `serialize_predictions` function to:
  1. Convert the NumPy array from `pipeline.predict()` into a `getml.DataFrame` via `pyarrow.Table`.
  2. Use the existing `serialize_dataframe_or_view` function to store the prediction data and its metadata.
- Simplifying the `derive_instance_with_relative_path` utility, as it now only needs to handle `DataFrameInformation`.

Additionally, this commit introduces `basedpyright` into the CI pipeline for stricter static type checking, along with necessary `pyright: ignore` annotations for third-party library interactions. A minor bug where `std` in column statistics could be `None` has also been fixed.
@Urfoex Urfoex force-pushed the feature/GH-54-convert-predict-ndarray-to-dataframe branch from da5401c to 0ccae38 Compare August 29, 2025 20:34
@Urfoex Urfoex merged commit f9b3dc6 into develop Aug 29, 2025
3 checks passed
@Urfoex Urfoex deleted the feature/GH-54-convert-predict-ndarray-to-dataframe branch August 29, 2025 20:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Convert prediction NDArray to DataFrame and use existing dataframe-serialization

3 participants

Comments