Conversation
Pull Request Overview
This pull request refactors the serialization architecture to decouple data extraction from disk I/O, enabling optional disk serialization. The changes introduce a two-phase approach where extraction and serialization can be performed independently.
- Renamed `ProjectInformation` to `ProjectIdentification` for clarity, as it is used to locate projects rather than store project data
- Refactored serialization functions to accept optional `target_storage_directory` parameters, allowing metadata-only extraction when `None` (see the sketch below)
- Removed the `path` attribute from `DataFrameInformation`, as it tightly coupled metadata to file locations
- Centralized column statistics calculation and introduced new utility modules for DuckDB operations and Parquet handling
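To make the two-phase approach concrete, here is a minimal usage sketch. The import path and function name come from this pull request, but the exact call signature (in particular the positional project-name argument) is an assumption for illustration and may differ from the actual API.

```python
from pathlib import Path

from getml_io.serialize.project import serialize_project

# Phase 1: extract metadata only. With no target directory, no disk I/O is
# performed and the in-memory *Information objects are returned.
# (The positional project-name argument is assumed for illustration.)
project_info = serialize_project("my_project", target_storage_directory=None)

# Inspect the extracted metadata before deciding whether to persist it.
print(project_info)

# Phase 2: persist the project and its data artifacts to disk.
serialize_project("my_project", target_storage_directory=Path("exports/my_project"))
```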
Reviewed Changes
Copilot reviewed 46 out of 46 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| Multiple test files | Updated imports and function calls to use `ProjectIdentification` instead of `ProjectInformation` |
| `src/getml_io/serialize/project.py` | Refactored to return `ProjectInformation` and support optional disk serialization |
| `src/getml_io/serialize/pipeline.py` | Updated function signatures to support optional storage and renamed serialize functions to convert functions |
| `src/getml_io/serialize/container.py` | Modified to support optional disk serialization throughout the container hierarchy |
| `src/getml_io/utils/duckdb.py` | New utility module for DuckDB operations and summary statistics (see sketch below) |
| `src/getml_io/serialize/column_information.py` | New module centralizing column information building logic |
| `src/getml_io/metadata/dataframe_information.py` | Removed `path` attribute and moved column statistics to separate module |
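For the new DuckDB utility module listed above, the following is a rough sketch of how per-column summary statistics can be gathered with DuckDB's built-in `SUMMARIZE` statement. It only illustrates the general technique and is not the actual `getml_io.utils.duckdb` implementation; the file path is a placeholder.

```python
import duckdb

# Sketch only: compute per-column summary statistics (min, max, null counts,
# approximate distinct counts, quantiles, ...) for a Parquet file.
stats = duckdb.sql(
    "SUMMARIZE SELECT * FROM read_parquet('exports/my_project/population.parquet')"
)
stats.show()
```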
Code Review
This pull request introduces a significant and well-executed refactoring to make disk serialization optional. The changes effectively decouple data extraction from I/O operations by modifying serialization functions to accept an optional storage directory. When no directory is provided, the functions now return in-memory metadata objects, which greatly enhances the library's flexibility. The code has been reorganized into more logical modules, such as separating column statistics and DuckDB utilities, which improves maintainability. My review includes suggestions to further improve the code structure, to make the test suite more robust by addressing a fragile mocking strategy and fixing a bug in one test, and to increase type safety by addressing suppressed type errors.
bf44654 to c3ce193
aa504be to 34eb80e
Refactor the serialization logic to decouple the process of extracting project metadata and data from the process of writing them to disk. This makes the serialization logic more flexible and testable, and enables in-memory inspection of project structures without any file I/O.

The main change is that the `serialize_*` functions (e.g., `serialize_project`, `serialize_container`) now accept an optional `target_storage_directory` argument:

- If a storage directory is provided, the project and its data artifacts are saved to disk as before.
- If `target_storage_directory` is `None`, only the metadata `*Information` objects are generated and returned, with no disk I/O performed.

This allows for a two-phase approach where project information can be gathered and analyzed in memory before deciding whether to persist it.

Key changes include:

- **Optional I/O:** `serialize_*` functions now conditionally perform disk I/O based on the presence of a `target_storage_directory`.
- **Decoupled Metadata:** The `path` attribute has been removed from `DataFrameInformation`, removing the tight coupling between metadata and a physical file location.
- **Clearer Naming:** `ProjectInformation` was renamed to `ProjectIdentification` to clarify its role in locating a project. Functions that only transform getML objects to pydantic models without I/O were renamed from `serialize_*` to `convert_*`.
- **Centralized I/O:** Parquet I/O and column statistics calculation have been centralized into dedicated modules (`serialize/column_information.py`, `serialize/parquet.py`).
- **Improved Return Values:** `serialize_project` now consistently returns the `ProjectInformation` object, allowing for immediate in-memory use.
- **Refactored Tests:** Tests have been updated to align with the new architecture, allowing the metadata generation and file I/O stages to be tested separately.
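As a rough illustration of the "Optional I/O" and "Clearer Naming" points above, a `serialize_*` function could delegate to a pure `convert_*` helper and only touch the filesystem when a directory is given. Everything in this sketch (the `DataFrameInformation` fields, the `convert_data_frame` helper, and the duck-typed `df` object) is a simplified stand-in, not the actual `getml_io` code.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Optional


@dataclass
class DataFrameInformation:
    """Simplified stand-in for the real pydantic metadata model."""
    name: str
    n_rows: int


def convert_data_frame(df: Any) -> DataFrameInformation:
    """Pure transformation: build the metadata object with no disk I/O."""
    return DataFrameInformation(name=df.name, n_rows=len(df))


def serialize_data_frame(
    df: Any,
    target_storage_directory: Optional[Path] = None,
) -> DataFrameInformation:
    """Extract metadata and, only if a directory is given, persist the data."""
    info = convert_data_frame(df)

    if target_storage_directory is not None:
        target_storage_directory.mkdir(parents=True, exist_ok=True)
        # Hypothetical Parquet export via a duck-typed method on df.
        df.to_parquet(target_storage_directory / f"{info.name}.parquet")

    # The metadata is returned in both cases, enabling in-memory inspection.
    return info
```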
34eb80e to 5a9341c
#56: refactor(serialize): Decouple serialization logic from disk I/O