Add BROTLI compression support for Parquet decoding #176

Open
Siya-05 wants to merge 1 commit into DataHaskell:main from Siya-05:add-brotli-parquet

Siya-05 commented Mar 5, 2026

Summary

This PR fixes Parquet page decompression for DataPageHeaderV2 pages and adds support for BROTLI-compressed pages. The previous implementation decompressed the entire page payload based only on the column codec. That is incorrect for DataPageHeaderV2, where repetition and definition levels are stored as an uncompressed prefix and only the data body may be compressed (controlled by dataPageHeaderV2IsCompressed).

Changes

  • Correct DataPageHeaderV2 decoding:
    • Split the page payload into:
      • prefix = repetitionLevelsBytes + definitionLevelsBytes
      • body = remaining bytes
    • Decompress only the body when dataPageHeaderV2IsCompressed == True
    • Recombine as prefix <> decompressedBody
  • Preserve existing behavior for non-V2 pages:
    • Continue decompressing the full page payload using the column codec.
  • Add BROTLI decompression path:
    • BROTLI is handled both for V2 (body-only) and non-V2 pages (full payload).
  • Add a safety check:
    • Throw a clear error if prefixLen > compressedPageSize for V2 pages, preventing silent corruption on malformed/truncated inputs.
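
The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `V2Header`, `Decompress`, `decodePagePayload`, and `toyDecompress` are hypothetical names, the codec is passed in as a function instead of being selected from the column metadata, and the malformed-input case is reported with `Left` where the PR throws an error.

```haskell
import           Data.ByteString       (ByteString)
import qualified Data.ByteString       as BS
import qualified Data.ByteString.Char8 as BC

-- Hypothetical, trimmed-down view of the DataPageHeaderV2 fields used below.
data V2Header = V2Header
  { repetitionLevelsBytes :: Int   -- byte length of the repetition levels
  , definitionLevelsBytes :: Int   -- byte length of the definition levels
  , isCompressed          :: Bool  -- stands in for dataPageHeaderV2IsCompressed
  }

-- Assumed shape of a codec decompressor (ZSTD, SNAPPY, BROTLI, ...).
type Decompress = ByteString -> Either String ByteString

-- | Decode a page payload. 'Nothing' means a non-V2 page, which is
-- decompressed in full. 'Just hdr' means a V2 page, where only the body
-- after the levels prefix is decompressed.
decodePagePayload
  :: Decompress -> Maybe V2Header -> ByteString -> Either String ByteString
decodePagePayload decompress Nothing payload = decompress payload
decodePagePayload decompress (Just hdr) payload
  -- Safety check: the levels prefix cannot be longer than the page itself.
  | prefixLen > BS.length payload =
      Left ("V2 levels prefix (" ++ show prefixLen
            ++ " bytes) exceeds page size ("
            ++ show (BS.length payload) ++ " bytes)")
  -- Uncompressed V2 page: the payload is already in its final form.
  | not (isCompressed hdr) = Right payload
  -- Compressed V2 page: keep the prefix as-is, decompress only the body.
  | otherwise = (prefix <>) <$> decompress body
  where
    prefixLen      = repetitionLevelsBytes hdr + definitionLevelsBytes hdr
    (prefix, body) = BS.splitAt prefixLen payload

-- Toy stand-in for a real codec: "decompression" just reverses the bytes.
toyDecompress :: Decompress
toyDecompress = Right . BS.reverse

-- 2 bytes of repetition levels, 2 of definition levels, then the body "cba".
example :: Either String ByteString
example =
  decodePagePayload toyDecompress (Just (V2Header 2 2 True)) (BC.pack "RRDDcba")
```

With the toy codec, `example` evaluates to `Right "RRDDabc"`: the four-byte levels prefix passes through untouched and only the body is transformed. A header whose prefix length exceeds the payload length yields a `Left` with a descriptive message instead of a silently corrupted page.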

Notes

  • This change is intentionally minimal and avoids refactoring the codec switch to keep the diff focused on correctness.
  • Handling of the existing codecs (ZSTD, SNAPPY, UNCOMPRESSED) is unchanged for non-V2 pages; BROTLI is the only new codec path.

Refers to #166.
