Fix Parquet boolean columns mapping to NaNs (#26) #437

Open
ChrisJohnNOAA wants to merge 1 commit into ERDDAP:main from ChrisJohnNOAA:parquet_boolean
@ChrisJohnNOAA
Contributor

  • Fix Parquet boolean columns read as NaN in ERDDAP

Modified Table.readParquet to explicitly handle BOOLEAN types from the Parquet schema, mapping them to ERDDAP's internal byte representation (1=true, 0=false, 127=NaN). Reading the value directly with g.getBoolean() avoids the data loss caused by failed string conversions.
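
For illustration only, the read-side mapping described above could be sketched as follows. This is not the code from the PR: the class name `BooleanByteReadSketch` and the helper `toErddapByte` are hypothetical, and a `null` argument stands in for a missing Parquet value. Only the three byte values (1, 0, 127) come from the description.

```java
class BooleanByteReadSketch {
    // Byte values as stated in the PR description.
    static final byte TRUE_BYTE = 1;
    static final byte FALSE_BYTE = 0;
    static final byte MISSING_BYTE = 127; // same as Byte.MAX_VALUE

    // Hypothetical helper: maps a boolean (or null for "no value") to a byte,
    // without going through a String conversion.
    static byte toErddapByte(Boolean value) {
        if (value == null) {
            return MISSING_BYTE;
        }
        return value ? TRUE_BYTE : FALSE_BYTE;
    }

    public static void main(String[] args) {
        Boolean[] column = {true, false, null};
        for (Boolean b : column) {
            System.out.println(b + " -> " + toErddapByte(b));
        }
    }
}
```

In the real reader the boolean would presumably come from `g.getBoolean()` rather than from an in-memory array.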

Updated CustomWriteSupport.java to correctly interpret ERDDAP's boolean byte representations when writing back to Parquet.
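
The write direction is the inverse of that mapping. A minimal sketch, assuming the same byte values, might look like the block below. The class `BooleanByteWriteSketch` and the helper `toParquetBoolean` are hypothetical names, not part of CustomWriteSupport.java, and the handling of bytes other than 0, 1, and 127 is an assumption made here for the example.

```java
class BooleanByteWriteSketch {
    static final byte MISSING_BYTE = 127; // missing-value marker from the PR description

    // Hypothetical helper: returns null when the stored byte marks a missing
    // value, so the caller can leave the Parquet field unset.
    static Boolean toParquetBoolean(byte stored) {
        if (stored == MISSING_BYTE) {
            return null;
        }
        // Assumption: any other non-zero byte is treated as true.
        return stored != 0;
    }

    public static void main(String[] args) {
        byte[] column = {1, 0, 127};
        for (byte b : column) {
            System.out.println(b + " -> " + toParquetBoolean(b));
        }
    }
}
```

Returning `null` for the missing marker keeps "no value" distinct from `false`, which is the distinction the original bug lost.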

Added logic to skip the simplification phase for non-String columns in Table.readParquet to prevent type corruption for boolean columns.
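
As a rough illustration of that guard, the sketch below selects only String columns as candidates for simplification and leaves already-typed columns alone. The `Column` class and `columnsToSimplify` method are hypothetical stand-ins invented for this example; they are not ERDDAP's Table or PrimitiveArray API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SimplifySkipSketch {
    // Hypothetical stand-in for a typed table column.
    static final class Column {
        final String name;
        final Class<?> elementType;

        Column(String name, Class<?> elementType) {
            this.name = name;
            this.elementType = elementType;
        }
    }

    // Only String columns are candidates for type simplification; columns
    // that already carry a concrete type (e.g. byte-encoded booleans) are skipped.
    static List<String> columnsToSimplify(List<Column> columns) {
        List<String> names = new ArrayList<>();
        for (Column c : columns) {
            if (c.elementType == String.class) {
                names.add(c.name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        List<Column> columns = Arrays.asList(
            new Column("flag", Byte.class),      // boolean stored as byte: skipped
            new Column("station", String.class)  // still text: may be simplified
        );
        System.out.println(columnsToSimplify(columns));
    }
}
```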

Added TableParquetTests.java as a reproduction and regression test.

Description

Fixes #425

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Checklist before requesting a review

  • I have performed a self-review of my code
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Development

Successfully merging this pull request may close these issues.

Parquet files with boolean columns generate a byte data type with no data in the column.