Official support for the Apache Parquet format #410

I'm a radio astronomer interested in using this Julia-native implementation of the Apache Arrow in-memory format for black hole imaging with the [Event Horizon Telescope](https://eventhorizontelescope.org/). First of all, thanks for developing this package! We became interested in it because the Apache Arrow and Parquet formats are being considered as a [major candidate for the next generation radio astronomy data format](https://github.com/ratt-ru/casa-arrow/discussions/1).

I'm wondering whether there are plans to implement IO functions for the Apache Parquet format in this package. I read a previous [issue](https://github.com/apache/arrow-julia/issues/227) on this topic. As far as I can tell, there is not yet a way to load columnar data from a Parquet file directly into Arrow.jl's in-memory representation, or to write it back. The only pure-Julia route seems to be converting the Parquet data on disk into the Arrow IPC format using both Parquet.jl and Arrow.jl, and then reloading it into memory with Arrow.jl.
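The round trip described above can be sketched roughly as follows. This is a minimal illustration, assuming Parquet.jl's `read_parquet`/`write_parquet` and Arrow.jl's `Arrow.write`/`Arrow.Table`; the toy table, its column names, and the file names are hypothetical and only stand in for real data.

```julia
using Parquet, Arrow

# Hypothetical toy table standing in for real data (column names are made up).
toy = (antenna = Int32[1, 2, 3], amplitude = Float64[0.5, 1.2, 0.9])

dir = mktempdir()
parquet_path = joinpath(dir, "toy.parquet")
arrow_path = joinpath(dir, "toy.arrow")

# Create a Parquet file so the example is self-contained.
write_parquet(parquet_path, toy)

# Step 1: read the Parquet file with Parquet.jl (a Tables.jl-compatible table).
pq = read_parquet(parquet_path)

# Step 2: write it back to disk in the Arrow IPC format with Arrow.jl.
Arrow.write(arrow_path, pq)

# Step 3: reload the IPC file as an Arrow.jl table.
at = Arrow.Table(arrow_path)
```

The intermediate IPC file in steps 2 and 3 is the extra disk write and read that a direct Parquet reader would avoid.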

This is problematic for our use case, and is a major obstacle to using this package and Apache's columnar formats in Julia. I think the key issues are:
- This sort of disk-based conversion via [Parquet.jl](https://github.com/JuliaIO/Parquet.jl) and Arrow.jl is not computationally optimal, as it involves an extra disk write and read. This will be a major overhead in our use case.
- The Arrow IPC format [does not prioritize long-term storage and archival usage](https://arrow.apache.org/faq/#what-about-arrow-files-then), so it would not satisfy the requirements of our community. Relying purely on the IPC format therefore won't be a solution.
- The current Julia packages for the Apache Parquet format (e.g. [Parquet.jl](https://github.com/JuliaIO/Parquet.jl) and [Parquet2.jl](https://gitlab.com/ExpandingMan/Parquet2.jl)) do not seem to fully support nested types, which are key to handling [our radio astronomy data in Apache's columnar formats](https://github.com/ratt-ru/casa-arrow), whereas Arrow.jl does support them for the Arrow in-memory and IPC formats.

Given the many similarities and overlaps between the Apache Parquet and Arrow specifications, I feel it is more straightforward to request Parquet IO features in Arrow.jl than to request the missing features from the existing Julia Parquet packages. Any thoughts on this are appreciated. Thanks!
