-
Notifications
You must be signed in to change notification settings - Fork 1
Data Structures
The bloom filter is a probabilistic data structure that is used to test whether an element is a member of a set.
Since this structure is probabilistic, it may produce false-positives. An acceptable probability of these false-positives occurring can be set when creating the instance using the fp_prob argument.
This implementation uses the general-purpose hashing function MurmurHash3 to produce 128-bit hashes.
The functions are seeded by seeds saved in bf-seeds.txt, if this file is missing, the seeds are generated during runtime and saved to a file by that name.
The size of the bit array is determined by first computing m, which is then rounded up to the nearest power of 2.
This is because the implementation has been optimized for speed rather than memory usage.
To do this, a custom modulo function modulo(num: u128, divisor: u128) from the utils crate is used when finding the bit location from the hash.
The particular implementation of the function requires the length of the bit array to be a power of 2, which is found using the closest_pow(n: u64) function from the same crate.
The initial size m of the bit array is determined using the formula:
-
$n$ - number of expected elements -
$p$ - acceptable probability of false-positives$$\Huge{m = -\left \lfloor {\frac{n\ln{p}}{(\ln{2})^2}}\right \rceil}$$ The number of hash functionskis determined using the formula:$$\Huge{k = \left \lfloor {\frac{m}{n}\ln{2}}\right \rceil}$$
external:
internal:
pub fn new(item_count: u64, fp_prob: f64) -> Self
Creates and returns an instance of the BloomFilter struct.
-
item_count- the expected number of elements
-
fp_prob- the acceptable false-positive probability
pub fn add(&mut self, item: &str) -> Result<()>
Takes an item, hashes it, writes 1 to the corresponding location in the bit array and returns a Result enum.
pub fn check(&self, item: &str) -> Result<bool>
Takes an item, hashes it, checks if the corresponding bit is a 1 and returns a bool wrapped in a Result.
The count-min sketch is a probabilistic data structure that is used to keep track of the number of times an element has been added to a large set, for each distinct element.
Since this structure is probabilistic, it may produce false-positives. The desired accuracy can be adjusted using the desired_accuracy argument, and the certainty of achieving that accuracy is adjusted using the certainty argument.
This implementation uses the general-purpose hashing function MurmurHash3 to produce 128-bit hashes.
The functions are seeded by seeds saved in cms-seeds.txt, if this file is missing, the seeds are generated during runtime and saved to a file by that name.
The number of columns is determined by first computing m, which is then rounded up to the nearest power of 2.
This is because the implementation has been optimized for speed rather than memory usage.
To do this, a custom modulo function modulo(num: u128, divisor: u128) from the utils crate is used when finding the bit location from the hash.
The particular implementation of the function requires the number of columns to be a power of 2, which is found using the closest_pow(n: u64) function from the same crate.
The initial number of columns m is determined using the formula:
$$\large{\text{$\varepsilon$ - the desired accuracy}}$$
The number of hash functions k is determined using the formula:
$$\large{\text{$\delta$ - certainty of achieving the desired accuracy}}$$
external:
internal:
pub fn new(desired_accuracy: f64, certainty: f64) -> CountMinSketch
Creates and returns an instance of the CountMinSketch struct.
pub fn add(&mut self, item: &str) -> Result<()>
Takes an item, hashes it, writes 1 to the corresponding location in the matrix and returns a Result enum.
pub fn count(&self, item: &str) -> Result<u64>
Takes an item, hashes it with multiple functions, finds all the numbers at the corresponding locations, and returns the smallest number wrapped in a Result.