# Bloom Filter Notes

#### August 2011

For the basics read up on the Wikipedia article and the Scalable Bloom
Filters paper. Make sure you understand the false positive rate `p` and the
relationship to `m` (bits) and `k` (hashes/key).

Hints:

- The capacity `n` is reached when around 50% of the bits are set.
- Hashing `k` times can be optimized to near constant time.
- It's better to try to minimize `m/n`, especially if you are doing network
  traffic or dealing with huge datasets. Don't blindly set an expected `p`;
  you might be able to shave off a bit or so per key by simply relaxing `p`
  by a small amount.
- If using a third-party bitvector implementation as storage for your own
  custom solution, verify that it actually clears the bits on allocation.
- Use separate slices of the bitvector (i.e. `m` is divided into `k` slices)
  to get better distribution and predictable "linear scan" lookup patterns.
- Build a growable bloom filter by combining Q filters: start out with Q1 and
  add keys until you reach full capacity, then add Q2 and use it until you
  need to add Q3, and so on. To check for a key you simply look it up in
  Q1..QN (see the sketch after this list).
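
A minimal sketch of the growable approach, assuming a hypothetical
fixed-capacity `BloomFilter(capacity, p)` class that supports `add()` and
the `in` operator:

```
class GrowableBloom(object):
    """Chain fixed-size filters; grow by appending a new one when full."""

    def __init__(self, capacity, p):
        # `BloomFilter` is a stand-in for any fixed-capacity implementation.
        self.capacity, self.p = capacity, p
        self.filters = [BloomFilter(capacity, p)]
        self.count = 0

    def add(self, key):
        # Current filter chain is full: add the next Q filter.
        if self.count >= self.capacity * len(self.filters):
            self.filters.append(BloomFilter(self.capacity, self.p))
        self.filters[-1].add(key)
        self.count += 1

    def __contains__(self, key):
        # A key may live in any of Q1..QN, so check them all.
        return any(key in f for f in self.filters)
```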

Common calculations:

```
from math import ceil, log

def optimal_bloomfilter(n, p):
    """
    Calculate optimal m (vector bits) and k (hash count) for a bloom
    filter with capacity `n` and an expected false positive rate of `p`.
    Returns a tuple (m, k).
    """
    n, p = float(n), float(p)
    m = -1 * (n * log(p)) / pow(log(2), 2)
    k = (m / n) * log(2)
    return int(ceil(m)), int(ceil(k))

def bloomfilter_capacity(m, p):
    """
    Calculate max capacity n and hash count k for a bloom filter with
    size `m` bits and an expected false positive rate of `p`.
    Returns a tuple (n, k).
    """
    n = m * (pow(log(2), 2) / abs(log(p)))
    k = log(1 / p, 2)
    return (int(round(n)), int(ceil(k)))
```
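
For example, sizing a filter for a million keys at a 1% false positive rate
(assuming the two functions above are in scope), and checking that the
second function inverts the first:

```
>>> optimal_bloomfilter(1000000, 0.01)
(9585059, 7)
>>> bloomfilter_capacity(9585059, 0.01)
(1000000, 7)
```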

The relationship between `p` and `m/n` scales linearly as `n` goes up so you
can use the following table as a quick reference guide:

```
|----------|---------|----------|----------|----------|
| capacity | max p   | mem (kB) | bits/key | hash/key |
|----------|---------|----------|----------|----------|
| 1000     | 0.1     | 0.59     | 4.79     | 4        |
| 1000     | 0.05    | 0.76     | 6.24     | 5        |
| 1000     | 0.01    | 1.17     | 9.59     | 7        |
| 1000     | 0.005   | 1.35     | 11.03    | 8        |
| 1000     | 0.001   | 1.76     | 14.38    | 10       |
| 1000     | 0.0005  | 1.93     | 15.82    | 11       |
| 1000     | 0.0001  | 2.34     | 19.17    | 14       |
|----------|---------|----------|----------|----------|
```
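
The rows follow directly from `optimal_bloomfilter` above; for instance the
first row:

```
>>> m, k = optimal_bloomfilter(1000, 0.1)
>>> m, k
(4793, 4)
>>> m / 8.0 / 1024   # memory in kB
0.5850830078125
```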

Here is another table which optimizes (i.e. minimizes) `m/n` (bits/key) and
`k` hashes since we have to deal with whole bits in the real world:

```
|--------|------------|----------|----------|
| max p  | expected p | bits/key | hash/key |
|--------|------------|----------|----------|
| 0.1    | 0.0918     | 5        | 3        |
| 0.05   | 0.0423     | 7        | 3        |
| 0.01   | 0.0094     | 10       | 5        |
| 0.005  | 0.0046     | 12       | 5        |
| 0.001  | 0.0009     | 15       | 8        |
| 0.0005 | 0.0005     | 16       | 10       |
| 0.0001 | 0.0001     | 20       | 10       |
|--------|------------|----------|----------|
```
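
The expected p column can be reproduced with the standard approximation
`p = (1 - e^(-kn/m))^k`; a small helper, assuming whole bits/key and hash
counts:

```
from math import exp

def expected_p(bits_per_key, k):
    """Expected false positive rate given whole m/n and k."""
    return pow(1 - exp(-float(k) / bits_per_key), k)

# e.g. 10 bits/key with 5 hashes:
# >>> round(expected_p(10, 5), 4)
# 0.0094
```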

There are a lot of implementations floating around on the net, some perfectly
fine but most seem to break down on two things:

- The hash algorithm `h`.
- Not verifying the false positive rate `p`.

Dealing with hashing is pretty straight-forward: don't bother with
cryptographic hashing (SHA/MD5) or CRC32 - simply find a Murmur hash
implementation for your platform/language combination and you're set. The
next step is to read the Less Hashing, Same Performance: Building a Better
Bloom Filter paper and understand how you can turn the `k` hashing
operations into an almost constant time operation:

```
g_i(x) = h1(x) + i * h2(x) mod m

h1 = murmur_hash(key, 0)
h2 = murmur_hash(key, h1)
for i in 1..k:
    g(i) = (h1 + i * h2) % bits_per_slice
```
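
In Python this boils down to just two Murmur hashes per key; a minimal
sketch, assuming the third-party `mmh3` package is installed:

```
import mmh3

def bit_positions(key, k, bits_per_slice):
    """Derive k bit positions from just two Murmur hashes of `key`."""
    h1 = mmh3.hash(key, 0) & 0xffffffff    # first hash, fixed seed
    h2 = mmh3.hash(key, h1) & 0xffffffff   # second hash, seeded with h1
    # Position i indexes into slice i of the bitvector.
    return [(h1 + i * h2) % bits_per_slice for i in range(k)]
```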

The second issue, verifying that the implementation you're intending to use
actually holds up to the expected `p`, is **extremely important**. If you're
using a library which doesn't contain a test suite for checking the false
positive rate you *absolutely should* write something like this:

1. Retrieve or calculate the filter capacity `n`.
2. Fill up the filter with `n` keys.
3. Now run a large series of lookups against the filter with random keys.
   Make sure the keys were never inserted, so that any hit is a true false
   positive, and keep track of the errors so that you can calculate the
   `fp/n` rate (a sketch of such a test follows below).
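
A minimal sketch of such a test, assuming a hypothetical filter object `bf`
with `add()` and `in` support and a known capacity `n`:

```
def measure_fp_rate(bf, n, lookups=100000):
    """Fill the filter to capacity, then probe with keys known to be absent."""
    for i in range(n):
        bf.add("key-%d" % i)
    errors = 0
    for i in range(lookups):
        # "probe-" keys never overlap the inserted "key-" set, so any
        # hit is a true false positive.
        if ("probe-%d" % i) in bf:
            errors += 1
    return float(errors) / lookups
```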

We expect `fp` to hover around the expected `p` value since we're dealing with
a probabilistic data-structure, but large deviations need to be investigated
further. Also look out for implementations where the `fp` is **very low** as
this is a good sign that the filter is not correctly tuned; `m` is probably
way higher than it needs to be.

I've written a small Python package to document most of my findings and to serve as a future reference for doing bloom filter calculations. Let me know if you find any bugs or if it's helpful.

Source: https://github.com/bkz/bloom

### Resources:

Almeida, P. S., Baquero, C., Preguiça, N., and Hutchison, D. 2007. Scalable Bloom Filters. Inf. Process. Lett. 101, 6 (Mar. 2007), 255-261. DOI=http://dx.doi.org/10.1016/j.ipl.2006.10.007

Kirsch, A., and Mitzenmacher, M. 2006. Less Hashing, Same Performance: Building a Better Bloom Filter. In Proceedings of the 14th Annual European Symposium on Algorithms (ESA), pp. 456-467.