Description
Bloom Filter Distribution in Spark and Velox
Current Status:
Spark Bloom Filter:
Spark implements Bloom filters with multiple hash functions. The number of hash functions is calculated based on the expected number of elements and the desired false positive rate. This approach is memory efficient but introduces overhead when computing multiple hash functions for each element.
Velox Bloom Filter:
Velox currently uses a simpler Bloom filter implementation. While efficient for small datasets, it lacks the flexibility of controlling the number of hash functions independently, which can lead to suboptimal false positive rates or unnecessary computational overhead.
Proposed Improvement:
Implement a native Bloom filter in Velox similar to Spark’s design:
Support multiple hash functions
Optimize memory allocation for large datasets.
Reduce false positive rates by matching the hash function count to the dataset size and expected error rate.
Maintain API compatibility for existing Velox operators that rely on Bloom filters.
Benefits:
Align Velox’s Bloom filter behavior with Spark, improving interoperability for query execution plans.
Solve the current hash function number issue in Velox, where the fixed or suboptimal hash function count can impact performance and accuracy.
Enable more efficient distributed filtering in Velox-powered pipelines.
Gluten version
None