Skip to content

refactor(serialize): Decouple serialization logic from disk I/O#67

Merged
Urfoex merged 1 commit intodevelopfrom
feature/GH-56-optional-disk-serialization
Oct 9, 2025
Merged

refactor(serialize): Decouple serialization logic from disk I/O#67
Urfoex merged 1 commit intodevelopfrom
feature/GH-56-optional-disk-serialization

Conversation

@Urfoex
Copy link
Collaborator

@Urfoex Urfoex commented Sep 16, 2025

#56: refactor(serialize): Decouple serialization logic from disk I/O

Refactor the serialization logic to decouple the process of extracting project metadata and data from the process of writing them to disk. This makes the serialization logic more flexible, testable, and enables in-memory inspection of project structures without any file I/O.

The main change is that the serialize_* functions (e.g., serialize_project, serialize_container) now accept an optional target_storage_directory argument:

  • If a storage directory is provided, the project and its data artifacts are saved to disk as before.
  • If target_storage_directory is None, only the metadata *Information objects are generated and returned, with no disk I/O performed.

This allows for a two-phase approach where project information can be gathered and analyzed in memory before deciding whether to persist it.

Key changes include:

  • Optional I/O: serialize_* functions now conditionally perform disk I/O based on the presence of a target_storage_directory.
  • Decoupled Metadata: The path attribute has been removed from DataFrameInformation, removing the tight coupling between metadata and a physical file location.
  • Clearer Naming: ProjectInformation was renamed to ProjectIdentification to clarify its role in locating a project. Renamed serialize_* to convert_* for functions that only transform getML objects to pydantic models without I/O.
  • Centralized I/O: Parquet I/O and column statistics calculation have been centralized into dedicated modules (serialize/column_information.py, serialize/parquet.py).
  • Improved Return Values: serialize_project now consistently returns the ProjectInformation object, allowing for immediate in-memory use.
  • Refactored Tests: Tests have been updated to align with the new architecture, allowing for separate testing of the metadata generation and file I/O stages.

@Urfoex Urfoex requested a review from Copilot September 16, 2025 19:18
@Urfoex Urfoex self-assigned this Sep 16, 2025
@Urfoex Urfoex added the enhancement New feature or request label Sep 16, 2025
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This pull request refactors the serialization architecture to decouple data extraction from disk I/O, enabling optional disk serialization. The changes introduce a two-phase approach where extraction and serialization can be performed independently.

  • Renamed ProjectInformation to ProjectIdentification for clarity as it's used to locate projects rather than store project data
  • Refactored serialization functions to accept optional target_storage_directory parameters, allowing metadata-only extraction when None
  • Removed path attribute from DataFrameInformation as it tightly coupled metadata to file locations
  • Centralized column statistics calculation and introduced new utility modules for DuckDB operations and Parquet handling

Reviewed Changes

Copilot reviewed 46 out of 46 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
Multiple test files Updated imports and function calls to use ProjectIdentification instead of ProjectInformation
src/getml_io/serialize/project.py Refactored to return ProjectInformation and support optional disk serialization
src/getml_io/serialize/pipeline.py Updated function signatures to support optional storage and renamed serialize functions to convert functions
src/getml_io/serialize/container.py Modified to support optional disk serialization throughout the container hierarchy
src/getml_io/utils/duckdb.py New utility module for DuckDB operations and summary statistics
src/getml_io/serialize/column_information.py New module centralizing column information building logic
src/getml_io/metadata/dataframe_information.py Removed path attribute and moved column statistics to separate module

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a significant and well-executed refactoring to make disk serialization optional. The changes effectively decouple data extraction from I/O operations by modifying serialization functions to accept an optional storage directory. When no directory is provided, the functions now return in-memory metadata objects, which greatly enhances the library's flexibility. The code has been reorganized into more logical modules, such as separating column statistics and DuckDB utilities, which improves maintainability. My review includes suggestions to further improve the code structure, enhance the robustness of the test suite by addressing a fragile mocking strategy and fixing a bug in a test, and to increase type safety by addressing suppressed type errors.

@Urfoex Urfoex force-pushed the feature/GH-52-serialize-pipeline-tables branch from bf44654 to c3ce193 Compare September 17, 2025 15:59
Base automatically changed from feature/GH-52-serialize-pipeline-tables to develop September 17, 2025 16:58
@Urfoex Urfoex force-pushed the feature/GH-56-optional-disk-serialization branch 6 times, most recently from aa504be to 34eb80e Compare September 17, 2025 23:37
Refactor the serialization logic to decouple the process of extracting project metadata and data from the process of writing them to disk. This makes the serialization logic more flexible, testable, and enables in-memory inspection of project structures without any file I/O.

The main change is that the `serialize_*` functions (e.g., `serialize_project`, `serialize_container`) now accept an optional `target_storage_directory` argument:

- If a storage directory is provided, the project and its data artifacts are saved to disk as before.
- If `target_storage_directory` is `None`, only the metadata `*Information` objects are generated and returned, with no disk I/O performed.

This allows for a two-phase approach where project information can be gathered and analyzed in memory before deciding whether to persist it.

Key changes include:

- **Optional I/O:** `serialize_*` functions now conditionally perform disk I/O based on the presence of a `target_storage_directory`.
- **Decoupled Metadata:** The `path` attribute has been removed from `DataFrameInformation`, removing the tight coupling between metadata and a physical file location.
- **Clearer Naming:** `ProjectInformation` was renamed to `ProjectIdentification` to clarify its role in locating a project. Renamed `serialize_*` to `convert_*` for functions that only transform getML objects to pydantic models without I/O.
- **Centralized I/O:** Parquet I/O and column statistics calculation have been centralized into dedicated modules (`serialize/column_information.py`, `serialize/parquet.py`).
- **Improved Return Values:** `serialize_project` now consistently returns the `ProjectInformation` object, allowing for immediate in-memory use.
- **Refactored Tests:** Tests have been updated to align with the new architecture, allowing for separate testing of the metadata generation and file I/O stages.
@Urfoex Urfoex force-pushed the feature/GH-56-optional-disk-serialization branch from 34eb80e to 5a9341c Compare September 18, 2025 00:01
@Urfoex Urfoex marked this pull request as ready for review September 18, 2025 00:05
@Urfoex Urfoex changed the title Feature/gh 56 optional disk serialization refactor(serialize): Decouple serialization logic from disk I/O Sep 18, 2025
Copy link

@awaismirza92 awaismirza92 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Collaborator

@srnnkls srnnkls left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@Urfoex Urfoex merged commit 171d1f9 into develop Oct 9, 2025
3 checks passed
@Urfoex Urfoex deleted the feature/GH-56-optional-disk-serialization branch October 9, 2025 20:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Provide option to not serialize parquet files to disk, and stream content into duckdb for statistics

3 participants

Comments