Bloom filter optimization

### Description

Bloom Filter Distribution in Spark and Velox

Current Status:

Spark Bloom Filter:
Spark implements Bloom filters with multiple hash functions. The number of hash functions is calculated based on the expected number of elements and the desired false positive rate. This approach is memory efficient but introduces overhead when computing multiple hash functions for each element.

Velox Bloom Filter:
Velox currently uses a simpler Bloom filter implementation. While efficient for small datasets, it lacks the flexibility of controlling the number of hash functions independently, which can lead to suboptimal false positive rates or unnecessary computational overhead.

Proposed Improvement:

Implement a native Bloom filter in Velox similar to Spark’s design:

Support multiple hash functions

Optimize memory allocation for large datasets.

Reduce false positive rates by matching the hash function count to the dataset size and expected error rate.

Maintain API compatibility for existing Velox operators that rely on Bloom filters.

Benefits:

Align Velox’s Bloom filter behavior with Spark, improving interoperability for query execution plans.

Solve the current hash function number issue in Velox, where the fixed or suboptimal hash function count can impact performance and accuracy.

Enable more efficient distributed filtering in Velox-powered pipelines.

### Gluten version

None

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bloom filter optimization #11771

Description

Gluten version

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Bloom filter optimization #11771

Description

Description

Gluten version

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions