[GLUTEN-4889][VL] feat: Support approx_percentile aggregate function #11651

Open
Yizhou-Yang wants to merge 34 commits into apache:main from Yizhou-Yang:percentile0225

Yizhou-Yang commented Feb 25, 2026

What

Add Velox approx_percentile support for Spark.

Why

Velox uses a KLL sketch while Spark uses the GK algorithm, so their intermediate data formats are incompatible (KLL: 9-field StructType vs GK: single BinaryType buffer). As a result, fallback between Velox and Spark requires separate handling.

How

  • VeloxApproximatePercentile: A DeclarativeAggregate with 9 aggBufferAttributes matching Velox's KLL sketch layout.
  • Spark-side KLL implementation (KllSketchHelper/KllSketchAdd/KllSketchMerge/KllSketchEval): Simplified KLL operations for fallback, binary-compatible with Velox's C++ accumulator.
  • ApproxPercentileRewriteRule: Rewrites Spark's ApproximatePercentile to the Velox-compatible version.
  • All four offload/fallback modes are supported: full offload, partial fallback, final fallback, and full fallback.
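
One way to read the four modes listed above is as the combinations of which engine runs the partial aggregation and which runs the final aggregation. The sketch below enumerates them under that reading; the mode-to-engine mapping and all names (`Engine`, `MODES`, `crosses_engines`) are illustrative assumptions, not identifiers from the PR.

```python
from enum import Enum


class Engine(Enum):
    VELOX = "velox"  # native execution
    SPARK = "spark"  # JVM fallback


# ASSUMED mapping: (engine for partial aggregation, engine for final aggregation).
MODES = {
    "full offload": (Engine.VELOX, Engine.VELOX),
    "partial fallback": (Engine.SPARK, Engine.VELOX),
    "final fallback": (Engine.VELOX, Engine.SPARK),
    "full fallback": (Engine.SPARK, Engine.SPARK),
}


def crosses_engines(mode: str) -> bool:
    """True when one engine writes the intermediate buffer and the other reads it."""
    partial, final = MODES[mode]
    return partial is not final


for name in MODES:
    print(f"{name}: buffer crosses engines = {crosses_engines(name)}")
```

Under this reading, the two mixed modes are the ones where the Spark-side buffer has to match the Velox KLL layout, since one engine consumes what the other produced.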

Key decisions

  • Accuracy stored as IntegerType (Spark's original value); Velox computes epsilon = 1.0/accuracy internally.
  • KLL chosen over GK for Spark-side fallback to maintain intermediate data compatibility with Velox.
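
The accuracy-to-epsilon conversion mentioned above is a single division, and a later commit in this PR derives the sketch parameter K from accuracy through Velox's kFromEpsilon. The sketch below shows both steps. The Python function names are hypothetical, and the constants in `k_from_epsilon` are an assumption (recalled from the DataSketches rank-error fit that the Velox formula appears to follow); they should be checked against the Velox source before being relied on.

```python
import math


def epsilon_from_accuracy(accuracy: int) -> float:
    """Map Spark's integer accuracy to a relative rank error of 1/accuracy."""
    if accuracy <= 0:
        raise ValueError("accuracy must be positive")
    return 1.0 / accuracy


def k_from_epsilon(eps: float) -> int:
    """ASSUMED form of kFromEpsilon: invert eps ~= 2.296 / k**0.9723.

    The constants are not taken from this PR; verify them against Velox.
    """
    return math.ceil(math.exp(1.0285 * math.log(2.296 / eps)))


eps = epsilon_from_accuracy(10000)  # 10000 is Spark's default accuracy
print(eps, k_from_epsilon(eps))
```

Under this assumed formula, a smaller epsilon yields a larger K, so a single hardcoded K cannot match every accuracy setting.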

Velox dependency

facebookincubator/velox#16320


Related issue: #4889

github-actions bot added the CORE (works for Gluten Core) and VELOX labels Feb 25, 2026
@github-actions

Run Gluten Clickhouse CI on x86

Yizhou-Yang changed the title from "feat:support gluten-level approx_percentile" to "[GLUTEN-4889][VL] feat:support gluten-level approx_percentile" Feb 25, 2026

@jinchengchenghh (Contributor)

Please update get-velox.sh to test your PR so that you can verify both changes work together. You may update this line: https://github.com/apache/incubator-gluten/blob/5d3f7145cd7fc258aa10b434ea4ec651bd82c764/ep/build-velox/src/get-velox.sh#L28

@jinchengchenghh (Contributor)

Do we need the config? Usually we offload the function to native by default.

@Yizhou-Yang (Author)

> Please update get-velox.sh to test your PR so that you can verify both changes work together. You may update this line:
> https://github.com/apache/incubator-gluten/blob/5d3f7145cd7fc258aa10b434ea4ec651bd82c764/ep/build-velox/src/get-velox.sh#L28

Added facebookincubator/velox#16320 and removed the config.

github-actions bot added the BUILD label and removed the CORE (works for Gluten Core) label Feb 25, 2026

github-actions bot added the CORE (works for Gluten Core) label Mar 2, 2026

github-actions bot removed the CORE (works for Gluten Core) label Mar 2, 2026

@jinchengchenghh (Contributor)

Please update the PR description to explain that the KLL sketch format is different, which is why we handle fallback separately.

@Yizhou-Yang (Author) commented Mar 3, 2026

> Please update the PR description to explain that the KLL sketch format is different, which is why we handle fallback separately.

Done.

jinchengchenghh and others added 16 commits March 18, 2026 15:33
…ministic across Spark and Velox implementations
…sensitive tests

1. VeloxApproxPercentile: Fix accuracyBuf type from IntegerType to DoubleType
   to match Velox intermediate type (row(..., double, ...)).
2. VeloxApproxPercentile: Compute K dynamically from accuracy using Velox's
   kFromEpsilon formula instead of hardcoded DEFAULT_K=200. This ensures the
   Spark fallback path has the same precision as Velox native execution.
3. GlutenApproximatePercentileQuerySuite: Override precision-sensitive tests
   with tolerance-based assertions. KLL sketch and GK algorithm inherently
   select different values at percentile boundaries (e.g., for 1..1000,
   exact 25th percentile=250.25, GK returns 250 while KLL returns 251).
   This difference cannot be eliminated by increasing precision.
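
The boundary example in item 3 can be reproduced with the Python standard library. The sketch below computes the interpolated 25th percentile of 1..1000 and compares the two reported results against it with a tolerance rather than exact equality. The `within_tolerance` helper and the tolerance of 1.0 are hypothetical and are not taken from GlutenApproximatePercentileQuerySuite.

```python
from statistics import quantiles

data = range(1, 1001)

# The default "exclusive" method interpolates at rank p * (n + 1) = 250.25.
exact_p25 = quantiles(data, n=4)[0]


def within_tolerance(actual: float, expected: float, tol: float) -> bool:
    """Tolerance-based comparison used in place of exact equality."""
    return abs(actual - expected) <= tol


gk_result, kll_result = 250, 251  # values reported in the commit message
print(exact_p25)
print(within_tolerance(gk_result, exact_p25, tol=1.0))
print(within_tolerance(kll_result, exact_p25, tol=1.0))
```

Both reported values sit within one element of the interpolated percentile, so an assertion of this shape accepts either algorithm's answer while an equality check would fail for one of them.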

github-actions bot added the CORE (works for Gluten Core) label Mar 18, 2026

Labels

CLICKHOUSE, CORE (works for Gluten Core), VELOX

3 participants