Bloom filter optimization #11771

@jinchengchenghh

Description

Bloom Filter Distribution in Spark and Velox

Current Status:

Spark Bloom Filter:
Spark implements Bloom filters with multiple hash functions. The number of hash functions is derived from the expected number of elements and the desired false-positive rate. This approach is memory-efficient but adds overhead, since several hash functions must be evaluated for each element.

Velox Bloom Filter:
Velox currently uses a simpler Bloom filter implementation. While efficient for small datasets, it lacks the flexibility to control the number of hash functions independently, which can lead to suboptimal false-positive rates or unnecessary hashing overhead.

Proposed Improvement:

Implement a native Bloom filter in Velox similar to Spark’s design:

Support multiple hash functions.

Optimize memory allocation for large datasets.

Reduce false positive rates by matching the hash function count to the dataset size and expected error rate.

Maintain API compatibility for existing Velox operators that rely on Bloom filters.

Benefits:

Align Velox’s Bloom filter behavior with Spark’s, improving interoperability for query execution plans.

Fix the current hash-function-count issue in Velox, where a fixed or suboptimal number of hash functions can hurt both performance and accuracy.

Enable more efficient distributed filtering in Velox-powered pipelines.

Gluten version

None

Labels

enhancement (New feature or request)