[GLUTEN-11550][VL][UT] Enable Variant test suites#11726
[GLUTEN-11550][VL][UT] Enable Variant test suites#11726baibaichen wants to merge 1 commit intoapache:mainfrom
Conversation
4a21089 to
794a1fe
Compare
|
Run Gluten Clickhouse CI on x86 |
794a1fe to
c6759c0
Compare
|
Run Gluten Clickhouse CI on x86 |
c6759c0 to
d53e74e
Compare
|
Run Gluten Clickhouse CI on x86 |
d53e74e to
ff31977
Compare
|
Run Gluten Clickhouse CI on x86 |
ff31977 to
f70182f
Compare
|
Run Gluten Clickhouse CI on x86 |
f70182f to
cbba654
Compare
Enable GlutenVariantEndToEndSuite, GlutenVariantShreddingSuite, and
GlutenParquetVariantShreddingSuite for both spark40 and spark41.
Changes:
1. VeloxValidatorApi: Detect variant shredded structs produced by
Spark's PushVariantIntoScan (checking __VARIANT_METADATA_KEY
metadata) to trigger fallback to Spark's native Parquet reader.
2. ParquetMetadataUtils: Refactor to single-pass architecture.
- Extract parquetFooters() returning Iterator[Either[Exception,
ParquetMetadata]] to read each footer once.
- Merge validateVariantAnnotation into isUnsupportedMetadata
with metadataValidationEnabled parameter.
- Variant annotation check is always-on (correctness); codec,
timezone, encryption checks are gated by config.
- VeloxBackend simplified to single validateMetadata() call.
3. Spark41Shims: Add shouldFallbackForParquetVariantAnnotation to
detect Parquet variant logical type annotations. Add
needsVariantAnnotationCheck to avoid unnecessary I/O on
Spark versions that don't support variant annotations.
4. pom.xml: Add -Dfile.encoding=UTF-8 to test JVM args. On JDK 17
with LANG=C (CI containers centos-8/9), the default charset is
US-ASCII causing garbled output for multi-byte characters. JDK
18+ defaults to UTF-8 via JEP 400.
See: https://github.com/apache/spark/blob/v4.0.1/common/variant/src/main/java/org/apache/spark/types/variant/VariantUtil.java#L508
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Run Gluten Clickhouse CI on x86 |
cbba654 to
aea9428
Compare
|
Run Gluten Clickhouse CI on x86 |
|
@jinchengchenghh @zhztheplayer please review |
There was a problem hiding this comment.
Pull request overview
Adds Spark 4.1-specific detection and fallback behavior for Parquet Variant annotation / variant shredding so Velox avoids reading unsupported encodings, and enables previously-disabled Variant-related test suites.
Changes:
- Add Spark shim hooks to decide when Parquet Variant annotation should trigger fallback (Spark 4.1).
- Refactor Parquet metadata validation to read each footer once and include an always-on Variant annotation check when needed.
- Enable Variant-related test suites for Spark 4.0/4.1 and force UTF-8 file encoding for test JVMs.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| shims/spark41/src/main/scala/org/apache/gluten/sql/shims/spark41/Spark41Shims.scala | Implements Spark 4.1 logic for checking Parquet Variant annotations via footer schema. |
| shims/common/src/main/scala/org/apache/gluten/sql/shims/SparkShims.scala | Adds default shim APIs for Variant-annotation fallback decisions. |
| backends-velox/src/main/scala/org/apache/gluten/utils/ParquetMetadataUtils.scala | Refactors footer scanning and adds Variant-annotation-driven fallback path. |
| backends-velox/src/main/scala/org/apache/gluten/backendsapi/velox/VeloxValidatorApi.scala | Detects “variant shredded” StructType via field metadata and rejects it for Velox. |
| backends-velox/src/main/scala/org/apache/gluten/backendsapi/velox/VeloxBackend.scala | Minor cleanup around metadata validation call chaining. |
| gluten-ut/spark41/src/test/scala/org/apache/gluten/utils/velox/VeloxTestSettings.scala | Enables Variant-related suites for Spark 4.1. |
| gluten-ut/spark40/src/test/scala/org/apache/gluten/utils/velox/VeloxTestSettings.scala | Enables Variant-related suites for Spark 4.0. |
| pom.xml | Forces UTF-8 file.encoding for test JVM execution. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
backends-velox/src/main/scala/org/apache/gluten/utils/ParquetMetadataUtils.scala
Show resolved
Hide resolved
backends-velox/src/main/scala/org/apache/gluten/utils/ParquetMetadataUtils.scala
Show resolved
Hide resolved
backends-velox/src/main/scala/org/apache/gluten/backendsapi/velox/VeloxValidatorApi.scala
Show resolved
Hide resolved
| // These structs have all fields annotated with __VARIANT_METADATA_KEY metadata. | ||
| // Velox cannot read the variant shredding encoding in Parquet files. | ||
| if ( | ||
| struct.fields.nonEmpty && |
There was a problem hiding this comment.
Do we need to add nonEmpty check?
| struct.fields.nonEmpty && | ||
| struct.fields.forall(_.metadata.contains("__VARIANT_METADATA_KEY")) | ||
| ) { | ||
| return Some(s"Variant shredded struct is not supported: $struct") |
There was a problem hiding this comment.
Might the struct $struct too heavy?
| if (!GlutenConfig.get.parquetMetadataValidationEnabled) { | ||
| None | ||
| val enabled = GlutenConfig.get.parquetMetadataValidationEnabled | ||
| if (enabled || SparkShimLoader.getSparkShims.needsVariantAnnotationCheck) { |
There was a problem hiding this comment.
Can we reuse parquetMetadataValidationEnabled? the default value is false
|
|
||
| def isParquetFileEncrypted(footer: ParquetMetadata): Boolean | ||
|
|
||
| def shouldFallbackForParquetVariantAnnotation(footer: ParquetMetadata): Boolean = false |
There was a problem hiding this comment.
We would we not fallback?
What changes are proposed in this pull request?
Enable
GlutenVariantEndToEndSuite,GlutenVariantShreddingSuite, andGlutenParquetVariantShreddingSuitefor both spark40 and spark41.Changes:
VeloxValidatorApi.scala: Detect variant shredded structs (produced by Spark's
PushVariantIntoScan) by checking for__VARIANT_METADATA_KEYmetadata on struct fields. Triggers fallback to Spark's native Parquet reader since Velox cannot read variant shredding encoding.ParquetMetadataUtils.scala: Refactor to single-pass architecture — extract
parquetFooters()returningIterator[Either[Exception, ParquetMetadata]]to read each footer once. MergevalidateVariantAnnotationintoisUnsupportedMetadatawithmetadataValidationEnabledparameter. Variant annotation check is always-on (correctness); codec, timezone, encryption checks are gated by config.Spark41Shims.scala: Add
shouldFallbackForParquetVariantAnnotationto detect Parquet variant logical type annotations. AddneedsVariantAnnotationCheckto avoid unnecessary I/O on Spark versions that don't support variant annotations.pom.xml: Add
-Dfile.encoding=UTF-8to test JVM args. On JDK 17 withLANG=C(CI containers centos-8/9), the default charset is US-ASCII causing garbled output for multi-byte characters. JDK 18+ defaults to UTF-8 via JEP 400. See VariantUtil.java#L508.How was this patch tested?
GlutenVariantEndToEndSuite14✅,GlutenVariantShreddingSuite8✅,GlutenParquetVariantShreddingSuite5✅GlutenVariantEndToEndSuite14✅,GlutenVariantShreddingSuite8✅,GlutenParquetVariantShreddingSuite7✅Was this patch authored or co-authored using generative AI tooling?
Generated-by: GitHub Copilot CLI
Related issue: #11550
Note: This PR subsumes #11723 (GlutenParquetVariantShreddingSuite).