Bloom Filter Notes

August 2011

For the basics, read the Wikipedia article and the Scalable Bloom Filters paper. Make sure you understand the false positive rate p and its relationship to m (bits) and k (hashes/key).

Hints:

Common calculations:

from math import ceil, log

def optimal_bloomfilter(n, p):
    """
    Calculate optimal m (vector bits) and k (hash count) for a bloom filter
    with capacity `n` and an expected false positive rate of `p`. Returns a
    tuple (m, k).
    """
    n, p = float(n), float(p)
    m = -1 * (n * log(p)) / pow(log(2), 2)
    k = (m / n) * log(2)
    return int(ceil(m)), int(ceil(k))


def bloomfilter_capacity(m, p):
    """
    Calculate max capacity n for a bloom filter with size `m` bits and an
    expected false positive rate of `p`. Returns a tuple (n, k).
    """
    n = m * (pow(log(2), 2) / abs(log(p)))
    k = log(1 / p, 2)
    return (int(round(n)), int(ceil(k)))
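
For example, sizing a filter for 1000 keys at a 1% false positive rate (the 0.01 row in the table below), and then inverting the calculation:

>>> optimal_bloomfilter(1000, 0.01)
(9586, 7)
>>> bloomfilter_capacity(9586, 0.01)
(1000, 7)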

For a fixed p the ratio m/n (bits/key) is constant, i.e. memory scales linearly as n goes up, so you can use the following table as a quick reference guide:

|--------------|--------------|--------------|--------------|--------------|
|   capacity   |    max p     |  mem (KiB)   |   bits/key   |   hash/key   |
|--------------|--------------|--------------|--------------|--------------|
|        1000  |         0.1  |        0.59  |        4.79  |           4  |
|        1000  |        0.05  |        0.76  |        6.24  |           5  |
|        1000  |        0.01  |        1.17  |        9.59  |           7  |
|        1000  |       0.005  |        1.35  |       11.03  |           8  |
|        1000  |       0.001  |        1.76  |       14.38  |          10  |
|        1000  |      0.0005  |        1.93  |       15.82  |          11  |
|        1000  |      0.0001  |        2.34  |       19.17  |          14  |
|--------------|--------------|--------------|--------------|--------------|

Here is another table which rounds m/n (bits/key) and k up to whole numbers, since we have to deal with integer bits and hash counts in the real world, and shows the resulting expected p:

|--------------|--------------|--------------|--------------|
|    max p     |  expected p  |   bits/key   |   hash/key   |
|--------------|--------------|--------------|--------------|
|       0.100  |      0.0918  |           5  |           3  |
|       0.050  |      0.0423  |           7  |           3  |
|       0.010  |      0.0094  |          10  |           5  |
|       0.005  |      0.0046  |          12  |           5  |
|       0.001  |      0.0009  |          15  |           8  |
|      0.0005  |      0.0005  |          16  |          10  |
|      0.0001  |      0.0001  |          20  |          10  |
|--------------|--------------|--------------|--------------|
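
The expected p column follows from the standard approximation p = (1 - e^(-kn/m))^k. A small sketch to reproduce it:

from math import exp

def expected_fp_rate(bits_per_key, k):
    """
    Expected false positive rate for integer m/n (bits/key) and k hashes,
    using the standard approximation p = (1 - e^(-kn/m))^k.
    """
    return pow(1.0 - exp(-float(k) / bits_per_key), k)

>>> expected_fp_rate(10, 5)
0.009431...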

There are a lot of implementations floating around on the net; some are perfectly fine, but most break down on one of two things:

  1. The choice of hash algorithm h.
  2. Verifying the actual false positive rate p.

Dealing with hashing is pretty straightforward: don't bother with cryptographic hashing (SHA/MD5) or CRC32, simply find a Murmur hash implementation for your platform/language combination and you're set. Next, read the Less Hashing, Same Performance: Building a Better Bloom Filter paper and understand how you can simulate all k hash functions with just two actual hash computations:

g_i(x) = (h1(x) + i * h2(x)) mod m

h1 = murmur_hash(key, 0)
h2 = murmur_hash(key, h1)
for i in 0..k-1:
    g[i] = (h1 + i * h2) % m
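
In Python this might look like the following sketch, assuming the third-party mmh3 MurmurHash3 package (pip install mmh3); any Murmur implementation that takes a seed will do:

import mmh3

def bloom_indexes(key, k, m):
    """Derive all k bit positions for `key` from two Murmur hashes."""
    h1 = mmh3.hash(key, 0) & 0xffffffff   # mask the signed result to uint32
    h2 = mmh3.hash(key, h1) & 0xffffffff  # second hash, seeded with the first
    return [(h1 + i * h2) % m for i in range(k)]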

The second issue, verifying that the implementation you intend to use actually holds up to the expected p, is extremely important. If you're using a library which doesn't contain a test suite for checking the false positive rate, you absolutely should write something like this:

  1. Retrieve or calculate the filter capacity n.

  2. Fill up the filter with n keys.

  3. Now run a large series of lookups against the filter with random keys. Make sure the keys were never inserted into the filter, so that every hit is a true false positive, and keep track of the errors so that you can calculate the observed false positive rate (see the sketch after this list).
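
Here's a sketch of such a check, assuming a hypothetical filter object with an add() method and `in` (membership) lookups; adapt it to the API of whatever library you're testing:

import random
import string

def random_key(length=16):
    return "".join(random.choice(string.ascii_letters) for _ in range(length))

def measure_fp_rate(bf, n, lookups=100000):
    """
    Fill `bf` with n unique random keys, then probe it with keys that were
    never inserted and return the observed false positive rate.
    """
    inserted = set()
    while len(inserted) < n:
        key = random_key()
        if key not in inserted:
            inserted.add(key)
            bf.add(key)
    errors = 0
    for _ in range(lookups):
        key = random_key()
        if key in inserted:
            continue              # skip the rare collision with an inserted key
        if key in bf:             # filter says yes for an unseen key
            errors += 1
    return errors / float(lookups)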

We expect the observed rate to hover around the expected p value since we're dealing with a probabilistic data structure, but large deviations need to be investigated further. Also look out for implementations where the observed rate is far lower than p, as this is often a sign that the filter is not correctly tuned: m is probably way higher than it needs to be.

I've written a small Python package to document most of my findings and to serve as a future reference for doing bloom filter calculations. Let me know if you find any bugs or if it's helpful.

Source: https://github.com/bkz/bloom

Resources:

Almeida, P. S., Baquero, C., Preguiça, N., and Hutchison, D. 2007. Scalable Bloom Filters. Inf. Process. Lett. 101, 6 (Mar. 2007), 255-261. DOI=http://dx.doi.org/10.1016/j.ipl.2006.10.007

Kirsch, A., and Mitzenmacher, M. 2006. Less Hashing, Same Performance: Building a Better Bloom Filter. In Proceedings of the 14th Annual European Symposium on Algorithms (ESA), pp. 456-467.