Skip to content

Conversation

@Mazen-Ghanaym
Copy link

Which issue does this PR close?

Closes #2973.

Rationale for this change

The startsWith and endsWith string functions were previously delegated to DataFusion's built-in scalar functions, which introduced unnecessary overhead and did not fully leverage Comet's native execution capabilities. This PR implements optimized native expressions to improve performance.

What changes are included in this PR?

This PR introduces custom StartsWithExpr and EndsWithExpr physical expressions with the following optimizations:

startsWith:

  • Uses Arrow's compute::starts_with kernel with a pre-allocated pattern array to avoid per-batch allocations.
  • Achieves 1.1X speedup over Spark.

endsWith:

  • Uses direct buffer access to the underlying StringArray data, bypassing iterator overhead.
  • Manually calculates suffix offsets and performs raw byte slice comparison (memcmp).
  • Achieves 1.0X parity with Spark (improved from 0.9X regression).

Files Changed:

  • native/spark-expr/src/string_funcs/starts_ends_with.rs (NEW)
  • native/spark-expr/src/string_funcs/mod.rs
  • native/core/src/execution/expressions/strings.rs
  • native/core/src/execution/planner/expression_registry.rs
  • spark/src/main/scala/org/apache/comet/serde/strings.scala
  • spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala
  • spark/src/test/scala/org/apache/spark/sql/benchmark/CometStringExpressionBenchmark.scala

How are these changes tested?

  1. Existing Tests: The implementation passes all existing Comet tests, including TPC-DS and TPC-H correctness suites which exercise string functions.
  2. Benchmark Verification: Performance was verified using CometStringExpressionBenchmark:
    • startsWith: 1.1X faster than Spark (Comet 1887ms vs Spark 2028ms)
    • endsWith: 1.0X parity with Spark (Comet 3389ms vs Spark 3354ms)
  3. CI Verification: A temporary workflow was used to verify the benchmark executes correctly in GitHub Actions CI environment and the results are in the Benchmark Results

Benchmark Results

Environment: OpenJDK 64-Bit Server VM 11.0.29+7-LTS on Linux 6.11.0-1018-azure
Processor: AMD EPYC 7763 64-Core Processor

startsWith

Case Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
Spark 1657 1669 17 0.6 1580.2 1.0X
Comet (Scan) 1740 1755 20 0.6 1659.6 1.0X
Comet (Scan + Exec) 1546 1546 1 0.7 1474.1 1.1X

endsWith

Case Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
Spark 1625 1632 10 0.6 1549.9 1.0X
Comet (Scan) 1731 1732 1 0.6 1651.2 0.9X
Comet (Scan + Exec) 1562 1563 0 0.7 1490.0 1.0X

Summary:

  • startsWith: 1.1X faster than Spark (1546ms vs 1657ms)
  • endsWith: 1.0X parity with Spark (1562ms vs 1625ms)

Mazen-Ghanaym and others added 6 commits December 27, 2025 03:32
- Implements a hybrid optimization strategy for string primitives.
- Uses Arrow compute kernels for startsWith with pre-allocated pattern arrays to avoid per-batch allocation overhead.
- Uses direct buffer access and manual suffix calculation for endsWith to bypass iterator overhead and match JVM intrinsic performance.
- Achieves 1.1X speedup for startsWith and parity (1.0X) for endsWith compared to Spark.

Closes apache#2973
@Mazen-Ghanaym Mazen-Ghanaym changed the title Feat/optimize strings 2973 feat: Optimize startsWith and endsWith string functions Dec 27, 2025
@coderfender
Copy link
Contributor

Thank you for the PR @Mazen-Ghanaym . Any reason why we can't make the Datafusion's version faster here?

@Mazen-Ghanaym
Copy link
Author

Thank you for the PR @Mazen-Ghanaym . Any reason why we can't make the Datafusion's version faster here?

I spent around 4 days trying to optimize, and here is the short story of the journey
Day 1-2: Tried DataFusion built-ins Started with ScalarFunctionExpr using DataFusion's native starts_with/ends_with. Result: 1.0X – just matched Spark, no improvement.

Day 2: Tried unsafe Rust with direct byte access Attempted raw pointer manipulation for maximum speed. Failed due to compatibility issues.

Day 3: Tried safe Rust with stdlib slices. Used .starts_with() and .ends_with() on string slices with manual iteration. Result: 0.9X – actually slower than Spark, I think it's due to iterator overhead.

Day 3-4: Arrow compute kernels + pre-allocated pattern The breakthrough: I realized the overhead came from repeatedly processing the pattern each batch. Pre-allocating the pattern as a StringArray once and calling Arrow's compute::starts_with directly gave us 1.1X for startsWith.

For endsWith, Arrow's kernel was still slightly slower (0.9X), so I went with direct buffer access and manual suffix calculation to reach 1.0X parity.

I tried to optimize further but I don't know any optimizations that can beat Java in these direct, simple operations.

@Mazen-Ghanaym Mazen-Ghanaym changed the title feat: Optimize startsWith and endsWith string functions perf: Optimize startsWith and endsWith string functions Dec 27, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Improve performance of startsWith and endsWith expressions

2 participants