GH-16: feat: Add comprehensive pipeline metadata serialization #55

Merged
Urfoex merged 1 commit into develop from feature/GH-16-serialize-pipeline-fields-data-model
Aug 29, 2025

Conversation

@Urfoex
Collaborator

@Urfoex Urfoex commented Aug 14, 2025

This commit significantly expands the metadata captured for a getML pipeline, moving towards a fully self-contained and reproducible project format.

Key changes include:

  • Introduced Pydantic models for all major getML components:
    • FeatureLearners (FastProp, Multirel, etc.)
    • Predictors & FeatureSelectors (XGBoost, LinearRegression, etc.)
    • Preprocessors (CategoryTrimmer, Mapping, etc.)
    • DataModel, Placeholders, and Joins
    • Roles and Relationships
  • Used Pydantic's discriminated unions for polymorphic components like FeatureLearner and Predictor, ensuring type-safe serialization and deserialization.
  • Updated PipelineInformation to embed the full configuration of the pipeline, including its structure, components, and parameters.
  • Implemented serialization logic to convert live getML objects into these new Pydantic models.
  • Refactored DataFrameInformation by renaming profile to column_profile for clarity.
  • Expanded unit and integration tests to assert the correctness of the new, richer metadata.
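
A minimal sketch of the discriminated-union pattern named above, assuming Pydantic v2. The two classes and their single `num_features` field are simplified, hypothetical stand-ins and not the models from this PR; only the idea of a `type` discriminator is taken from the description.

```python
from typing import Annotated, Literal, Union

from pydantic import Field, TypeAdapter
from pydantic.dataclasses import dataclass


@dataclass
class FastProp:
    # Hypothetical, reduced field set for illustration only.
    num_features: int = 200
    type: Literal["FastProp"] = "FastProp"


@dataclass
class Multirel:
    # Hypothetical, reduced field set for illustration only.
    num_features: int = 100
    type: Literal["Multirel"] = "Multirel"


# The "type" field is the discriminator: its value selects the concrete class.
FeatureLearner = Annotated[Union[FastProp, Multirel], Field(discriminator="type")]

adapter = TypeAdapter(FeatureLearner)

# Deserialization picks Multirel because of the "type" tag.
learner = adapter.validate_python({"type": "Multirel", "num_features": 30})

# Serialization keeps the tag, so the round trip restores the same class.
payload = adapter.dump_python(learner)
restored = adapter.validate_python(payload)
```

With this setup, an unknown `type` value fails validation instead of silently falling back to another member of the union.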

@Urfoex Urfoex requested review from Copilot and srnnkls August 14, 2025 15:23
@Urfoex Urfoex self-assigned this Aug 14, 2025
@Urfoex Urfoex added the enhancement label Aug 14, 2025

Copilot AI left a comment

Pull Request Overview

This PR significantly expands the metadata captured for getML pipelines by adding comprehensive serialization support for all major components, creating a fully self-contained and reproducible project format.

Key changes include:

  • Introduced Pydantic models for FeatureLearners, Predictors, Preprocessors, and DataModel components
  • Used discriminated unions for polymorphic components with type-safe serialization
  • Renamed profile to column_profile in DataFrameInformation for clarity

Reviewed Changes

Copilot reviewed 35 out of 35 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| tests/unit/serialize/test_pipeline_information.py | Updated test expectations to include comprehensive pipeline metadata structure |
| tests/unit/serialize/test_pipeline.py | Added serialization tests for feature learners, predictors, and preprocessors |
| tests/unit/serialize/test_dataframe_or_view.py | Updated for profile to column_profile renaming |
| tests/unit/serialize/test_container.py | Updated for profile to column_profile renaming |
| tests/unit/metadata/test_utils.py | Added test for non-dataclass validation in derive_instance_with_relative_path |
| tests/unit/metadata/test_pipeline_information.py | Updated test expectations for expanded pipeline metadata |
| tests/unit/metadata/test_container_information.py | Updated for profile to column_profile renaming |
| tests/unit/conftest.py | Expanded fixtures to support new pipeline metadata structure |
| src/getml_io/serialize/*.py | New serialization modules for roles, placeholders, data models, and pipeline components |
| src/getml_io/metadata/*.py | Updated metadata models with new structures and renamed fields |
| src/getml_io/getml/*.py | New Pydantic models for getML components with discriminated unions |
| tests/integration/data/*.json | Updated expected JSON outputs with expanded metadata |


@gemini-code-assist

Summary of Changes

This pull request significantly enhances the metadata serialization capabilities for getML pipelines, aiming to create a fully self-contained and reproducible project format. It introduces a comprehensive set of Pydantic models for various getML components, such as feature learners, predictors, preprocessors, and data model elements. The PipelineInformation structure has been updated to capture the entire configuration of a pipeline, including its components and parameters. This change also involves implementing the necessary serialization logic to convert live getML objects into these new Pydantic models and refactoring existing data structures for clarity. The expanded test suite ensures the correctness of this richer metadata.

Highlights

  • Pydantic Models for getML Components: Introduced Pydantic models for all major getML components, including FeatureLearners (FastProp, Multirel), Predictors & FeatureSelectors (XGBoost, LinearRegression), Preprocessors (CategoryTrimmer, Mapping), DataModel, Placeholders, Joins, Roles, and Relationships.
  • Type-Safe Polymorphic Serialization: Leveraged Pydantic's discriminated unions to ensure type-safe serialization and deserialization for polymorphic components like FeatureLearner and Predictor.
  • Comprehensive Pipeline Configuration: Updated the PipelineInformation structure to embed the full configuration of the pipeline, encompassing its structure, components, and parameters, moving towards a fully self-contained project format.
  • Serialization Logic for getML Objects: Implemented serialization logic to convert live getML objects into the newly defined Pydantic models, enabling persistent storage and reproducibility of pipeline metadata.
  • DataFrameInformation Refactoring: Refactored DataFrameInformation by renaming the profile field to column_profile for improved clarity and consistency.
  • Enhanced Test Coverage: Expanded unit and integration tests to thoroughly assert the correctness and integrity of the new, richer metadata serialization.
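
One plausible shape for converting a live object into a Pydantic model, assuming Pydantic v2 and the `dataclasses.asdict` plus `TypeAdapter` combination that the changelog mentions for `serialize_predictor`. `LiveLinearRegression` is a hypothetical stand-in for a getML predictor, and both field sets are reduced for illustration; the actual function in the PR may look different.

```python
import dataclasses
from typing import Literal

from pydantic import TypeAdapter
from pydantic.dataclasses import dataclass as pydantic_dataclass


@dataclasses.dataclass
class LiveLinearRegression:
    # Hypothetical stand-in for a live getML predictor object.
    learning_rate: float = 0.9
    reg_lambda: float = 1e-10


@pydantic_dataclass
class LinearRegression:
    # Simplified metadata model; the "type" tag identifies the component.
    learning_rate: float
    reg_lambda: float
    type: Literal["LinearRegression"] = "LinearRegression"


def serialize_predictor(live: LiveLinearRegression) -> LinearRegression:
    # asdict flattens the live object into plain values; the adapter then
    # validates those values while building the metadata model.
    return TypeAdapter(LinearRegression).validate_python(dataclasses.asdict(live))


info = serialize_predictor(LiveLinearRegression(learning_rate=0.5))
```

Going through validation rather than copying attributes by hand means a field that drifts out of sync between the live class and the model surfaces as an error at serialization time.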

Changelog

  • src/getml_io/getml/feature_learning.py
    • Added a new file defining Pydantic dataclasses for various feature learners (FastProp, Fastboost, Multirel, Relboost, RelMT) and an Annotated union type for FeatureLearner.
  • src/getml_io/getml/predictors.py
    • Added a new file defining Pydantic dataclasses for different predictor types (LinearRegression, LogisticRegression, ScaleGBMClassifier, ScaleGBMRegressor, XGBoostClassifier, XGBoostRegressor) and union types for FeatureSelector and Predictor.
  • src/getml_io/getml/preprocessors.py
    • Added a new file defining Pydantic dataclasses for various preprocessor types (CategoryTrimmer, EmailDomain, Imputation, Mapping, Seasonal, Substring, TextFieldSplitter) and a union type for Preprocessor.
  • src/getml_io/getml/relationships.py
    • Added a new file defining a Relationship Enum to map getML relationship types to string values.
  • src/getml_io/getml/roles.py
    • Added a new file defining a Role Enum to map getML roles to string values and a Roles dataclass for sequences of role strings.
  • src/getml_io/metadata/container_information.py
    • Added a TODO comment related to adjusting relative paths in the serialization function.
  • src/getml_io/metadata/data_model_information.py
    • Added a new file defining the DataModelInformation dataclass for structuring population and peripheral placeholder information.
  • src/getml_io/metadata/dataframe_information.py
    • Removed the local Role Enum definition and now import it from getml_io.getml.roles.
    • Renamed the profile field to column_profile within the DataFrameInformation dataclass.
  • src/getml_io/metadata/feature_sets.py
    • Updated the FeatureSets TypeAlias from dict to Mapping for broader type compatibility.
  • src/getml_io/metadata/pipeline_information.py
    • Introduced a LossFunction Enum.
    • Expanded the PipelineInformationDict and PipelineInformation dataclasses to include comprehensive pipeline configuration details such as feature learners, selectors, predictors, preprocessors, data model, and various flags and parameters.
    • Updated the _serialize_model method to serialize all the newly added pipeline configuration fields.
  • src/getml_io/metadata/placeholder_information.py
    • Added a new file defining JoinInformation and PlaceholderInformation dataclasses to represent detailed data model structures, including nested joins.
  • src/getml_io/metadata/prediction_results.py
    • Updated the PredictionResults TypeAlias from dict to Mapping.
  • src/getml_io/metadata/utils.py
    • Modified derive_instance_with_relative_path to include a TypeError check for non-dataclass instances and explicitly use dataclasses.replace.
  • src/getml_io/serialize/data_model.py
    • Added a new file with serialize_data_model function to convert getML DataModel objects into DataModelInformation.
  • src/getml_io/serialize/dataframe_or_view.py
    • Updated imports to use Mapping and cast.
    • Changed the import source for Role to getml_io.getml.roles and added serialize_role import.
    • Renamed _calculate_profile to _calculate_column_profile and updated its usage.
    • Added cast operations for DuckDB query results.
    • Modified _get_column_statistics_adapter to utilize the new serialize_role function.
  • src/getml_io/serialize/pipeline.py
    • Introduced imports for various getML components and their corresponding Pydantic models.
    • Extended the serialize_pipeline function to include serialization of feature learners, selectors, predictors, preprocessors, and the data model.
    • Added new serialization functions: serialize_feature_learner, serialize_predictor, and serialize_preprocessor, leveraging dataclasses.asdict and TypeAdapter for conversion.
    • Adjusted serialize_predictions and serialize_feature_sets to initialize empty PredictionResults and FeatureSets as dictionary literals.
  • src/getml_io/serialize/placeholder.py
    • Added a new file with serialize_placeholder function to convert getML Placeholder objects into PlaceholderInformation, including handling nested joins.
  • src/getml_io/serialize/roles.py
    • Added a new file with serialize_roles and serialize_role functions to convert getML Roles and individual Role objects to their Pydantic counterparts.
  • tests/integration/assertions.py
    • Updated type hints from list to Sequence in assertion functions.
    • Renamed profile to column_profile in assert_dataframe_information.
    • Added new assertions for various pipeline information fields, including feature learners, selectors, categorical flags, classification/regression types, loss function, peripheral, predictors, preprocessors, share selected features, tags, targets, and data model.
  • tests/integration/data/datasets.py
    • Updated type hints to use Mapping and Sequence.
  • tests/integration/data/getmlproject.py
    • Updated type hints to use Sequence and chain for iteration.
  • tests/integration/data/loans/expected.container.json
    • Updated profile keys to column_profile and adjusted some statistical values.
  • tests/integration/data/loans/expected.pipeline.json
    • Updated the pipeline id.
    • Changed profile keys to column_profile and modified some statistical values.
    • Added new expected fields for feature learners, selectors, predictors, preprocessors, and the data model.
  • tests/integration/data/numerical/expected.container.json
    • Updated profile keys to column_profile.
  • tests/integration/data/numerical/expected.pipeline.json
    • Updated the pipeline id.
    • Changed profile keys to column_profile and adjusted some statistical values.
    • Added new expected fields for feature learners, selectors, predictors, preprocessors, and the data model.
  • tests/integration/data/robot/expected.container.json
    • Updated profile keys to column_profile.
  • tests/integration/data/robot/expected.pipeline.json
    • Updated the pipeline id.
    • Changed profile keys to column_profile.
    • Added new expected fields for feature learners, selectors, predictors, preprocessors, and the data model.
  • tests/unit/conftest.py
    • Updated type hints to use Sequence and Mapping.
    • Renamed profile_default to column_profile_default.
    • Adjusted container_information and dataframe_information fixtures to use column_profile.
    • Introduced new fixtures: roles_empty, placeholder_information_empty, data_model_information_empty, placeholder_information, data_mode_information, fast_prop, linear_regression, category_trimmer.
    • Expanded pipeline_information_empty and pipeline_information fixtures to include new pipeline metadata fields.
    • Updated mock_pipeline to set various pipeline attributes for serialization.
    • Modified mock_duckdb_execute_factory to use Mapping and Sequence.
  • tests/unit/metadata/test_container_information.py
    • Updated import for Role to getml_io.getml.roles and introduced new type hints.
    • Adjusted expected JSON to reflect column_profile instead of profile.
    • Refactored expected JSON into dedicated helper functions.
  • tests/unit/metadata/test_pipeline_information.py
    • Updated imports for Relationship and Role to their new locations, and added LossFunction import.
    • Adjusted expected JSON to use column_profile and include all new pipeline metadata fields.
    • Refactored expected JSON into dedicated helper functions.
  • tests/unit/metadata/test_utils.py
    • Added a new test case test_derive_instance_with_relative_path_not_dataclass to verify that a TypeError is raised when a non-dataclass instance is passed.
  • tests/unit/serialize/test_container.py
    • Introduced new type hints.
    • Updated expected JSON to use column_profile.
    • Refactored expected JSON into dedicated helper functions.
  • tests/unit/serialize/test_dataframe_or_view.py
    • Updated import for Role to getml_io.getml.roles.
    • Renamed profile_to_json to column_profiles_to_json and updated its usage.
    • Adjusted expected data to use column_profile.
  • tests/unit/serialize/test_pipeline.py
    • Added new unit tests for serialize_feature_learner, serialize_predictor, and serialize_preprocessor to cover various getML component types.
  • tests/unit/serialize/test_pipeline_information.py
    • Introduced new type hints.
    • Updated expected JSON to include all new pipeline metadata fields.
    • Refactored expected JSON into dedicated helper functions.
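
The `TypeError` guard described for `derive_instance_with_relative_path` can be sketched with the standard library alone. The function name, signature, and the `Info` record below are hypothetical; only the combination of a dataclass check with `dataclasses.replace` comes from the changelog entry.

```python
import dataclasses
from dataclasses import dataclass
from pathlib import Path
from typing import Any


@dataclass(frozen=True)
class Info:
    # Hypothetical metadata record with a single path field.
    path: Path


def derive_with_relative_path(instance: Any, base: Path) -> Any:
    # dataclasses.replace only accepts dataclass *instances*; is_dataclass
    # also returns True for dataclass types, so those are rejected as well.
    if not dataclasses.is_dataclass(instance) or isinstance(instance, type):
        raise TypeError(
            f"Expected a dataclass instance, got {type(instance).__name__}"
        )
    # replace returns a copy, which also works for frozen dataclasses.
    return dataclasses.replace(instance, path=instance.path.relative_to(base))


original = Info(Path("/data/project/df.parquet"))
derived = derive_with_relative_path(original, Path("/data/project"))
```

Checking up front gives callers a clear message instead of whatever error `dataclasses.replace` would raise deeper in the call.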

Activity

  • Urfoex opened this pull request.
  • Copilot left a nitpick comment suggesting the use of descriptive constants for Substring parameters to improve readability and maintainability.
  • Copilot left a comment highlighting an incomplete functionality (TODO) and suggested converting NDArray to DataFrame for serialization to avoid technical debt.
  • Copilot left a nitpick comment suggesting to extract nested cast operations into separate variables or functions for better clarity.
  • Copilot left a nitpick comment suggesting to extract deeply nested cast operations into a separate function for better readability.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a comprehensive metadata serialization for getML pipelines using Pydantic models, which is a significant step towards creating fully self-contained and reproducible projects. The changes are well-structured, leveraging Pydantic's discriminated unions for polymorphic components and centralizing type definitions. The test suite has also been commendably expanded to cover this new, richer metadata. My review identified a minor typo in a test fixture name, which I've commented on. Overall, this is a high-quality and impactful contribution.

@Urfoex Urfoex force-pushed the feature/GH-16-serialize-pipeline-fields-data-model branch 4 times, most recently from 8bec36c to ed82b6a August 19, 2025 18:36
@Urfoex Urfoex linked an issue Aug 22, 2025 that may be closed by this pull request
3 tasks
@Urfoex Urfoex requested a review from awaismirza92 August 25, 2025 08:43
Collaborator

@srnnkls srnnkls left a comment

lgtm.

@awaismirza92 awaismirza92 left a comment

LGTM.

  1. (nitpick) You can probably disable ruff's PLR0911 rule:
(getml-io) awais@Admins-MacBook-Air:~/code17/projects/getML/github/getml-io% uv run ruff check --extend-ignore FIX .
src/getml_io/serialize/pipeline.py:266:5: PLR0911 Too many return statements (7 > 6)
    |
266 | def serialize_preprocessor(
    |     ^^^^^^^^^^^^^^^^^^^^^^ PLR0911
267 |     preprocessor: getml_preprocessor.CategoryTrimmer
268 |     | getml_preprocessor.EmailDomain
    |

Found 1 error.
  2. (nitpick) The GitHub workflow python-tests.yaml currently runs only for PRs against main & develop, which I think is too narrow. It should probably run for all PRs, or workflow_dispatch: should be added so that it can be run manually for a PR's feature branch.
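
For context on the first nitpick: PLR0911 fires when a function has more return statements than the configured limit (7 against a limit of 6 in the output above), which is typical for a function with one branch per component type. Besides disabling the rule, one hypothetical way to stay under the limit is table-based dispatch. The classes and helper names below are illustrative stand-ins, not the getML classes or the code in this PR.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CategoryTrimmer:
    # Hypothetical stand-in with a reduced field set.
    max_num_categories: int = 999


@dataclass
class EmailDomain:
    # Hypothetical stand-in without parameters.
    pass


def _serialize_category_trimmer(preprocessor: CategoryTrimmer) -> dict:
    return {
        "type": "CategoryTrimmer",
        "max_num_categories": preprocessor.max_num_categories,
    }


def _serialize_email_domain(preprocessor: EmailDomain) -> dict:
    return {"type": "EmailDomain"}


# One table entry per supported type replaces one return branch per type.
_SERIALIZERS: dict[type, Callable[..., dict]] = {
    CategoryTrimmer: _serialize_category_trimmer,
    EmailDomain: _serialize_email_domain,
}


def serialize_preprocessor(preprocessor: object) -> dict:
    serializer = _SERIALIZERS.get(type(preprocessor))
    if serializer is None:
        raise TypeError(f"Unsupported preprocessor: {type(preprocessor).__name__}")
    return serializer(preprocessor)
```

The dispatcher keeps a single return statement no matter how many types are registered, at the cost of losing the exhaustive union in the function signature.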

@awaismirza92

I take back the first comment (about PLR0911). I have just seen that it's already implemented in PR 57.

@Urfoex
Collaborator Author

Urfoex commented Aug 28, 2025

> LGTM.
>
> 1. (nitpick) You can probably disable ruff's `PLR0911` rule:
> (getml-io) awais@Admins-MacBook-Air:~/code17/projects/getML/github/getml-io% uv run ruff check --extend-ignore FIX .
> src/getml_io/serialize/pipeline.py:266:5: PLR0911 Too many return statements (7 > 6)
>     |
> 266 | def serialize_preprocessor(
>     |     ^^^^^^^^^^^^^^^^^^^^^^ PLR0911
> 267 |     preprocessor: getml_preprocessor.CategoryTrimmer
> 268 |     | getml_preprocessor.EmailDomain
>     |
>
> Found 1 error.
> 2. (nitpick) The GitHub workflow `python-tests.yaml` currently runs only for PRs against `main` & `develop`, which I think is too narrow. Probably, it should run for all PRs or `workflow_dispatch:` be added to allow running it manually for the feature branch of a PR.

@2:
True, that would probably be good. For merge protection into develop and main the current setup still works, though: if I merge this branch into its parent branch and then want to merge the parent into develop, the tests are run.
I'm adding all branches to the workflow, so it is visible right away when something is off: #62
👍

Base automatically changed from refactor/GH-16-serialize-dataframe-statistics to develop August 29, 2025 20:31
@Urfoex Urfoex force-pushed the feature/GH-16-serialize-pipeline-fields-data-model branch from ed82b6a to 80daa12 August 29, 2025 20:32
@Urfoex Urfoex merged commit 372bb28 into develop Aug 29, 2025
0 of 3 checks passed
@Urfoex Urfoex deleted the feature/GH-16-serialize-pipeline-fields-data-model branch August 29, 2025 20:33

Labels

enhancement New feature or request

Development

Successfully merging this pull request may close these issues.

Implement data model metadata serializers

3 participants