Skip to content

Data Structures

Teodor Đurić edited this page Nov 28, 2023 · 6 revisions

Bloom filter

The bloom filter is a probabilistic data structure that is used to test whether an element is a member of a set.

Since this structure is probabilistic, it may produce false-positives. An acceptable probability of these false-positives occurring can be set when creating the instance using the fp_prob argument.

This implementation uses the general-purpose hashing function MurmurHash3 to produce 128-bit hashes.
The functions are seeded by seeds saved in bf-seeds.txt, if this file is missing, the seeds are generated during runtime and saved to a file by that name.

The size of the bit array is determined by first computing m, which is then rounded up to the nearest power of 2. This is because the implementation has been optimized for speed rather than memory usage. To do this, a custom modulo function modulo(num: u128, divisor: u128) from the utils crate is used when finding the bit location from the hash. The particular implementation of the function requires the length of the bit array to be a power of 2, which is found using the closest_pow(n: u64) function from the same crate.


The initial size m of the bit array is determined using the formula: $$\large{\left \lfloor \right \rceil \text{- nearest integer}}$$

  • $n$ - number of expected elements
  • $p$ - acceptable probability of false-positives $$\Huge{m = -\left \lfloor {\frac{n\ln{p}}{(\ln{2})^2}}\right \rceil}$$ The number of hash functions k is determined using the formula: $$\Huge{k = \left \lfloor {\frac{m}{n}\ln{2}}\right \rceil}$$

dependencies

external:

internal:

functions

new

pub fn new(item_count: u64, fp_prob: f64) -> Self

Creates and returns an instance of the BloomFilter struct.

  • item_count - the expected number of elements
  • fp_prob       - the acceptable false-positive probability

add

pub fn add(&mut self, item: &str) -> Result<()>

Takes an item, hashes it, writes 1 to the corresponding location in the bit array and returns a Result enum.

check

pub fn check(&self, item: &str) -> Result<bool>

Takes an item, hashes it, checks if the corresponding bit is a 1 and returns a bool wrapped in a Result.

Count-min sketch

The count-min sketch is a probabilistic data structure that is used to keep track of the number of times an element has been added to a large set, for each distinct element.

Since this structure is probabilistic, it may produce false-positives. The desired accuracy can be adjusted using the desired_accuracy argument, and the certainty of achieving that accuracy is adjusted using the certainty argument.

This implementation uses the general-purpose hashing function MurmurHash3 to produce 128-bit hashes.
The functions are seeded by seeds saved in cms-seeds.txt, if this file is missing, the seeds are generated during runtime and saved to a file by that name.

The number of columns is determined by first computing m, which is then rounded up to the nearest power of 2. This is because the implementation has been optimized for speed rather than memory usage. To do this, a custom modulo function modulo(num: u128, divisor: u128) from the utils crate is used when finding the bit location from the hash. The particular implementation of the function requires the number of columns to be a power of 2, which is found using the closest_pow(n: u64) function from the same crate.


The initial number of columns m is determined using the formula: $$\large{\text{$\varepsilon$ - the desired accuracy}}$$ $$\Huge{m = \left \lceil \frac{e}{\varepsilon} \right \rceil}$$

The number of hash functions k is determined using the formula: $$\large{\text{$\delta$ - certainty of achieving the desired accuracy}}$$ $$\Huge{k = \left \lceil \ln{\frac{1}{\delta}} \right \rceil}$$


dependencies

external:

internal:

functions

new

pub fn new(desired_accuracy: f64, certainty: f64) -> CountMinSketch

Creates and returns an instance of the CountMinSketch struct.

add

pub fn add(&mut self, item: &str) -> Result<()>

Takes an item, hashes it, writes 1 to the corresponding location in the matrix and returns a Result enum.

count

pub fn count(&self, item: &str) -> Result<u64>

Takes an item, hashes it with multiple functions, finds all the numbers at the corresponding locations, and returns the smallest number wrapped in a Result.

Clone this wiki locally