Conversation
There was a problem hiding this comment.
Pull Request Overview
This PR significantly expands the metadata captured for getML pipelines by adding comprehensive serialization support for all major components, creating a fully self-contained and reproducible project format.
Key changes include:
- Introduced Pydantic models for FeatureLearners, Predictors, Preprocessors, and DataModel components
- Used discriminated unions for polymorphic components with type-safe serialization
- Renamed
profiletocolumn_profilein DataFrameInformation for clarity
Reviewed Changes
Copilot reviewed 35 out of 35 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/unit/serialize/test_pipeline_information.py | Updated test expectations to include comprehensive pipeline metadata structure |
| tests/unit/serialize/test_pipeline.py | Added serialization tests for feature learners, predictors, and preprocessors |
| tests/unit/serialize/test_dataframe_or_view.py | Updated for profile to column_profile renaming |
| tests/unit/serialize/test_container.py | Updated for profile to column_profile renaming |
| tests/unit/metadata/test_utils.py | Added test for non-dataclass validation in derive_instance_with_relative_path |
| tests/unit/metadata/test_pipeline_information.py | Updated test expectations for expanded pipeline metadata |
| tests/unit/metadata/test_container_information.py | Updated for profile to column_profile renaming |
| tests/unit/conftest.py | Expanded fixtures to support new pipeline metadata structure |
| src/getml_io/serialize/*.py | New serialization modules for roles, placeholders, data models, and pipeline components |
| src/getml_io/metadata/*.py | Updated metadata models with new structures and renamed fields |
| src/getml_io/getml/*.py | New Pydantic models for getML components with discriminated unions |
| tests/integration/data/*.json | Updated expected JSON outputs with expanded metadata |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
Summary of ChangesThis pull request significantly enhances the metadata serialization capabilities for getML pipelines, aiming to create a fully self-contained and reproducible project format. It introduces a comprehensive set of Pydantic models for various getML components, such as feature learners, predictors, preprocessors, and data model elements. The Highlights
Changelog
Activity
|
There was a problem hiding this comment.
Code Review
This pull request introduces a comprehensive metadata serialization for getML pipelines using Pydantic models, which is a significant step towards creating fully self-contained and reproducible projects. The changes are well-structured, leveraging Pydantic's discriminated unions for polymorphic components and centralizing type definitions. The test suite has also been commendably expanded to cover this new, richer metadata. My review identified a minor typo in a test fixture name, which I've commented on. Overall, this is a high-quality and impactful contribution.
8bec36c to
ed82b6a
Compare
awaismirza92
left a comment
There was a problem hiding this comment.
LGTM.
- (nitpick) You can probably disable ruff's
PLR0911rule:
(getml-io) awais@Admins-MacBook-Air:~/code17/projects/getML/github/getml-io% uv run ruff check --extend-ignore FIX .
src/getml_io/serialize/pipeline.py:266:5: PLR0911 Too many return statements (7 > 6)
|
266 | def serialize_preprocessor(
| ^^^^^^^^^^^^^^^^^^^^^^ PLR0911
267 | preprocessor: getml_preprocessor.CategoryTrimmer
268 | | getml_preprocessor.EmailDomain
|
Found 1 error.
- (nitpick) The GitHub workflow
python-tests.yamlcurrently runs only for PRs againstmain&develop, which I think is too narrow. Probably, it should run for all PRs orworkflow_dispatch:be added to allow running it manually for the feature branch of a PR.
|
I take back the first comment (about |
@2: |
This commit significantly expands the metadata captured for a getML pipeline, moving towards a fully self-contained and reproducible project format. Key changes include: - Introduced Pydantic models for all major getML components: - FeatureLearners (FastProp, Multirel, etc.) - Predictors & FeatureSelectors (XGBoost, LinearRegression, etc.) - Preprocessors (CategoryTrimmer, Mapping, etc.) - DataModel, Placeholders, and Joins - Roles and Relationships - Used Pydantic's discriminated unions for polymorphic components like `FeatureLearner` and `Predictor`, ensuring type-safe serialization and deserialization. - Updated `PipelineInformation` to embed the full configuration of the pipeline, including its structure, components, and parameters. - Implemented serialization logic to convert live getML objects into these new Pydantic models. - Refactored `DataFrameInformation` by renaming `profile` to `column_profile` for clarity. - Expanded unit and integration tests to assert the correctness of the new, richer metadata.
ed82b6a to
80daa12
Compare
This commit significantly expands the metadata captured for a getML pipeline, moving towards a fully self-contained and reproducible project format.
Key changes include:
FeatureLearnerandPredictor, ensuring type-safe serialization and deserialization.PipelineInformationto embed the full configuration of the pipeline, including its structure, components, and parameters.DataFrameInformationby renamingprofiletocolumn_profilefor clarity.