Introduction to Bloom Filters

A Bloom filter is a probabilistic, space-efficient, data structure that is used to test whether an element is in a set. False negative matches are not possible, so if we search for an element and get negative response, we are 100% percent sure that it doesn’t exist in the set. On the other hand, false positive matches are possible, which means that if the search returns a positive response, the element may be in the set.

Description

A Bloom filter is an array of m bits, with all values set to 0 when it’s empty. There are also k hash functions, which are used to map each element to k positions in the array. When we want to insert a new element in the array, we apply each hash function to it and get an index. Then, we just have to set the value of that array index to 1.

Use Cases

  • Check that a web object is cached only if it has accessed more than once
  • Reduce the disk lookups for non-existent rows or columns
  • Check if a given password is a dictionary word
  • Avoid recommending articles a user has previously read
  • Detect previously stored data
  • Distributed caching

Example

Insert Element

Let’s say we have a Bloom Filter with size m=10 and two hash functions, hash1 and hash2. In the beginning, the array looks like this:

[ 0 0 0 0 0 0 0 0 0 0 ]

We want to add the word hello in the set. First, we pass the word to the function hash1, which returns the index 4. We now set the that index to 1 and the Bloom Filter looks like this:

[ 0 0 0 0 1 0 0 0 0 0 ]

Then, we pass the same word to the function hash2, which returns the index 2. We set that index to 1 and the Bloom Filter changes to the following:

[ 0 0 1 0 1 0 0 0 0 0 ]

Search for Element

To check if an element is in the set, we pass it to the same hash functions, get the indexes and check whether any bit of those positions is 0. If so, we are sure that the element is not in the set. If all bits are set to 1, the element may be in the set. The probability of a false positive depends on the total elements, the size of the array and the number of the hash functions we use. To calculate it, we can use the following type:

(1−e−kn/m)k

k: Number of hash functions
n: Number of elements in the set
m: Number of bits of the array

Remove Element

Element removal from a Bloom Filter is impossible, because other elements may point to the same position, too. Of course, we can set those indexes to 0, but we will remove other elements that point to the same positions. For example, let’s say the hash indexes of the words hello and world are the same, 1 and 3, and we have the following array:

[ 0 1 0 1 0 0 0 0 0 0 ]

In order to remove the word hello, we set those bits to 0, so the array becomes:

[ 0 0 0 0 0 0 0 0 0 0 ]

We did remove the word hello, but, as you can see, we also removed the word world.

Optimal number of hash functions

For a given m and n, the value of k that minimizes the false positive probability is k = (m/n)ln2. If you want to find the optimum parameters for a Bloom filter, you can use this Bloom Filter Calculator.

References

Wikipedia
Bloom Filters by Example

Leave a Reply

Your email address will not be published. Required fields are marked *