refactor(serialize): Decouple serialization logic from disk I/O by Urfoex · Pull Request #67 · getml/getml-io

Urfoex · 2025-09-16T19:18:23Z

#56: refactor(serialize): Decouple serialization logic from disk I/O

Refactor the serialization logic to decouple the process of extracting project metadata and data from the process of writing them to disk. This makes the serialization logic more flexible, testable, and enables in-memory inspection of project structures without any file I/O.

The main change is that the `serialize_*` functions (e.g., `serialize_project`, `serialize_container`) now accept an optional `target_storage_directory` argument:

- If a storage directory is provided, the project and its data artifacts are saved to disk as before.
- If `target_storage_directory` is `None`, only the metadata `*Information` objects are generated and returned, with no disk I/O performed.

This allows for a two-phase approach where project information can be gathered and analyzed in memory before deciding whether to persist it.
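The optional-directory pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `TableInformation` and `serialize_table` are hypothetical names, the input is a plain list of dicts rather than a getML object, and JSON stands in for Parquet so the sketch has no third-party dependencies.

```python
from __future__ import annotations

import json
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class TableInformation:
    """Hypothetical stand-in for one of the metadata `*Information` objects."""

    name: str
    n_rows: int
    columns: list[str]


def serialize_table(
    name: str,
    rows: list[dict[str, object]],
    target_storage_directory: Path | None = None,
) -> TableInformation:
    """Always build the metadata; touch the disk only if a directory is given."""
    columns = sorted({key for row in rows for key in row})
    information = TableInformation(name=name, n_rows=len(rows), columns=columns)

    if target_storage_directory is not None:
        target_storage_directory.mkdir(parents=True, exist_ok=True)
        # Assumption: JSON replaces Parquet here purely to keep the sketch stdlib-only.
        (target_storage_directory / f"{name}.json").write_text(json.dumps(rows))

    return information
```

Calling `serialize_table("population", rows)` returns the metadata without creating any files; passing a directory as the third argument returns the same metadata and additionally persists the rows.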

Key changes include:

- **Optional I/O:** `serialize_*` functions now conditionally perform disk I/O based on the presence of a `target_storage_directory`.
- **Decoupled Metadata:** The `path` attribute has been removed from `DataFrameInformation`, removing the tight coupling between metadata and a physical file location.
- **Clearer Naming:** `ProjectInformation` was renamed to `ProjectIdentification` to clarify its role in locating a project. Renamed `serialize_*` to `convert_*` for functions that only transform getML objects to pydantic models without I/O.
- **Centralized I/O:** Parquet I/O and column statistics calculation have been centralized into dedicated modules (`serialize/column_information.py`, `serialize/parquet.py`).
- **Improved Return Values:** `serialize_project` now consistently returns the `ProjectInformation` object, allowing for immediate in-memory use.
- **Refactored Tests:** Tests have been updated to align with the new architecture, allowing for separate testing of the metadata generation and file I/O stages.
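To illustrate the "Decoupled Metadata" point, here is a small sketch of metadata that carries no file location, with the location derived only at I/O time. The class and function names are hypothetical, the `<name>.parquet` naming convention is an assumption, and a dataclass is used in place of the pydantic models the PR mentions so the example stays dependency-free.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class DataFrameInformationSketch:
    """Hypothetical, simplified metadata: note that there is no `path` field."""

    name: str
    n_rows: int


def parquet_path_for(
    information: DataFrameInformationSketch,
    storage_directory: Path,
) -> Path:
    """Derive the file location when writing, instead of storing it in the metadata.

    Assumption: one file per data frame, named `<name>.parquet`.
    """
    return storage_directory / f"{information.name}.parquet"
```

Because the metadata object is independent of any directory, the same instance can be resolved against different storage locations, and tests of metadata generation need no temporary files.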

Copilot

Pull Request Overview

This pull request refactors the serialization architecture to decouple data extraction from disk I/O, enabling optional disk serialization. The changes introduce a two-phase approach where extraction and serialization can be performed independently.

- Renamed `ProjectInformation` to `ProjectIdentification` for clarity, as it is used to locate projects rather than store project data
- Refactored serialization functions to accept optional `target_storage_directory` parameters, allowing metadata-only extraction when `None`
- Removed the `path` attribute from `DataFrameInformation`, as it tightly coupled metadata to file locations
- Centralized column statistics calculation and introduced new utility modules for DuckDB operations and Parquet handling

Reviewed Changes

Copilot reviewed 46 out of 46 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| Multiple test files | Updated imports and function calls to use `ProjectIdentification` instead of `ProjectInformation` |
| `src/getml_io/serialize/project.py` | Refactored to return `ProjectInformation` and support optional disk serialization |
| `src/getml_io/serialize/pipeline.py` | Updated function signatures to support optional storage and renamed serialize functions to convert functions |
| `src/getml_io/serialize/container.py` | Modified to support optional disk serialization throughout the container hierarchy |
| `src/getml_io/utils/duckdb.py` | New utility module for DuckDB operations and summary statistics |
| `src/getml_io/serialize/column_information.py` | New module centralizing column information building logic |
| `src/getml_io/metadata/dataframe_information.py` | Removed `path` attribute and moved column statistics to separate module |

src/getml_io/serialize/container.py

src/getml_io/serialize/parquet.py

gemini-code-assist

Code Review

This pull request introduces a significant and well-executed refactoring to make disk serialization optional. The changes effectively decouple data extraction from I/O operations by modifying serialization functions to accept an optional storage directory. When no directory is provided, the functions now return in-memory metadata objects, which greatly enhances the library's flexibility. The code has been reorganized into more logical modules, such as separating column statistics and DuckDB utilities, which improves maintainability. My review includes suggestions to further improve the code structure, enhance the robustness of the test suite by addressing a fragile mocking strategy and fixing a bug in a test, and to increase type safety by addressing suppressed type errors.

tests/unit/serialize/test_container.py

src/getml_io/serialize/project.py

src/getml_io/serialize/column_information.py

tests/unit/conftest.py

awaismirza92

LGTM

srnnkls

lgtm

Urfoex requested a review from Copilot September 16, 2025 19:18

Urfoex self-assigned this Sep 16, 2025

Urfoex added the enhancement label (New feature or request) Sep 16, 2025

Copilot AI reviewed Sep 16, 2025

src/getml_io/serialize/container.py (resolved)

src/getml_io/serialize/parquet.py (resolved)

gemini-code-assist bot reviewed Sep 16, 2025

tests/unit/serialize/test_container.py (resolved)

src/getml_io/serialize/project.py (outdated, resolved)

src/getml_io/serialize/column_information.py (resolved)

tests/unit/conftest.py (resolved)

Urfoex force-pushed the feature/GH-52-serialize-pipeline-tables branch from bf44654 to c3ce193 Compare September 17, 2025 15:59

Base automatically changed from feature/GH-52-serialize-pipeline-tables to develop September 17, 2025 16:58

Urfoex force-pushed the feature/GH-56-optional-disk-serialization branch 6 times, most recently from aa504be to 34eb80e Compare September 17, 2025 23:37

Urfoex force-pushed the feature/GH-56-optional-disk-serialization branch from 34eb80e to 5a9341c Compare September 18, 2025 00:01

Urfoex marked this pull request as ready for review September 18, 2025 00:05

Urfoex requested review from awaismirza92 and srnnkls September 18, 2025 00:06

Urfoex changed the title ~~Feature/gh 56 optional disk serialization~~ refactor(serialize): Decouple serialization logic from disk I/O Sep 18, 2025

Urfoex linked an issue Sep 18, 2025 that may be closed by this pull request

Provide option to not serialize parquet files to disk, and stream content into duckdb for statistics #56

Closed

awaismirza92 approved these changes Sep 18, 2025

srnnkls approved these changes Oct 9, 2025

Urfoex merged commit 171d1f9 into develop Oct 9, 2025
3 checks passed

Urfoex deleted the feature/GH-56-optional-disk-serialization branch October 9, 2025 20:06
