Skip to content

Conversation

@hackintoshrao
Copy link

Summary

Parquet decimals can be stored using multiple physical types depending on precision:

  • INT32 for precision <= 9
  • INT64 for precision <= 18
  • FIXED_LEN_BYTE_ARRAY for any precision
  • BYTE_ARRAY for any precision

The previous implementation only accepted FIXED_LEN_BYTE_ARRAY for all decimals and rejected valid parquet files with error:

unexpected physical type INT32 for decimal(7, 2), expected FIXED_LEN_BYTE_ARRAY

This caused AddFiles to fail when importing datasets (like TPC-DS) that use INT32/INT64 for small precision decimals, which is valid per the Parquet specification.

Changes

  • Refactors createStatsAgg to switch on Iceberg logical type first, then handle physical representations (matches iceberg-java's ParquetConversions.java approach)
  • For DecimalType, accepts all valid parquet physical types
  • Updates DataFileStatsFromMeta to handle INT32/INT64 decimal statistics
  • Adds wrappedDecByteArrayStats for BYTE_ARRAY encoded decimals

Test plan

  • Existing tests pass
  • Build succeeds
  • Tested with TPC-DS parquet files that use INT32 decimals

Parquet decimals can be stored using multiple physical types depending
on precision:
- INT32 for precision <= 9
- INT64 for precision <= 18
- FIXED_LEN_BYTE_ARRAY for any precision
- BYTE_ARRAY for any precision

The previous implementation only accepted FIXED_LEN_BYTE_ARRAY for all
decimals and rejected valid parquet files with error:

  unexpected physical type INT32 for decimal(7, 2), expected FIXED_LEN_BYTE_ARRAY

This caused AddFiles to fail when importing datasets (like TPC-DS) that
use INT32/INT64 for small precision decimals, which is valid per the
Parquet specification.

This change:
- Refactors createStatsAgg to switch on Iceberg logical type first,
  then handle physical representations (matches iceberg-java's
  ParquetConversions.java approach)
- For DecimalType, accepts all valid parquet physical types
- Updates DataFileStatsFromMeta to handle INT32/INT64 decimal statistics
- Adds wrappedDecByteArrayStats for BYTE_ARRAY encoded decimals
@zeroshade
Copy link
Member

Can we add a unit test and/or integration test for this?

Otherwise this looks good to me

Adds TestDecimalPhysicalTypes which creates Parquet files with decimals
stored as INT32 (precision <= 9) and INT64 (precision <= 18) physical
types, then verifies that DataFileStatsFromMeta correctly extracts
column statistics without panicking.

This test validates the fix for the decimal physical type handling.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants