Bloom Filter Notes
08 Aug 2011
For the basics read up on the Wikipedia article and the Scalable Bloom
Filters paper. Make sure you understand the false positive rate p and the
relationship to m (bits) and k (hashes/key).
Hints:
The capacity
nis reached when around 50% of bits are set.Hashing
kcan be optimized to near constant time - it's better to try to minimizem/n- especially if you are doing network traffic or dealing with huge datasets. Don't blindly set a expectedp, you might be able to shave of a bit or so per key by simply relaxingpby a small amount.If using a third-party bitvector implementation as storage for your own custom solution, verify that it actually clears out the bits on allocation.
Use separate slices of the bitvector (i.e.
mis divied intokelements) to get better distribution and predictable "linear scan" lookup patterns.Build a growable bloom filter by combining Q filters, start out with Q1 and add keys until to you reach full capacity, add Q2 and use it until you need to add Q3 and so on. To check for a key you simply lookup the key in Q1..N.
Common calculations:
def optimal_bloomfilter(n, p):
"""
Calculate optimal m (vector bits) and k (hash count) for a bloom filter
with capacity `n` and an expected positive failure rate of `p`. Returns a
tuple (m, k).
"""
n, p = float(n), float(p)
m = -1 * (n * log(p)) / pow(log(2), 2)
k = (m / n) * log(2)
return int(ceil(m)), int(ceil(k))
def bloomfilter_capacity(m, p):
"""
Calculate max capacity n for bloom filter with size `m` bits and expected
`p` false positive rate.
"""
n = m * (pow(log(2), 2) / abs(log(p)))
k = log(1/p, 2)
return (int(round(n)), int(ceil(k)))
The relationship between p and m/n scale linearly as n goes up so you can
use the following table as quick reference guide:
|--------------|--------------|--------------|--------------|--------------|
| capacity | max p | mem (kB) | bits/key | hash/key |
|--------------|--------------|--------------|--------------|--------------|
| 1000 | 0.1 | 0.59 | 4.79 | 4 |
| 1000 | 0.05 | 0.76 | 6.24 | 5 |
| 1000 | 0.01 | 1.17 | 9.59 | 7 |
| 1000 | 0.005 | 1.35 | 11.03 | 8 |
| 1000 | 0.001 | 1.76 | 14.38 | 10 |
| 1000 | 0.0005 | 1.93 | 15.82 | 11 |
| 1000 | 0.0001 | 2.34 | 19.17 | 14 |
|--------------|--------------|--------------|--------------|--------------|
Here is another table which optimizes (i.e. minimizes) m/n (bits/key) and k
hashes since we have to deal with even bits in the real world:
|--------------|--------------|--------------|--------------|
| max p | expected p | bits/key | hash/key |
|--------------|--------------|--------------|--------------|
| 0.100 | 0.0918 | 5 | 3 |
| 0.050 | 0.0423 | 7 | 3 |
| 0.010 | 0.0094 | 10 | 5 |
| 0.005 | 0.0046 | 12 | 5 |
| 0.001 | 0.0009 | 15 | 8 |
| 0.001 | 0.0005 | 16 | 10 |
| 0.000 | 0.0001 | 20 | 10 |
|--------------|--------------|--------------|--------------|
There are a lot of implementations floating around on the net, some perfectly fine but most seem to break down on two things:
- Hash algorithm
h. - Not verifying the false positive rate
p.
Dealing with hashing is pretty straight-forward, don't bother with
cryptographic hashing (SHA/MD5) or CRC32 - simply find a Murmur hash
implementation for your platform/language combination and you're set. Next step
is to read the Less Hashing, Same Performance: Building a Better Bloom
Filter paper and understand how you can turn the hashing k operation into
an almost constant time operation:
gi(x) = h1(x) + i * h2(x) mod m
h1 = murmur_hash(key, 0)
h2 = murmur_hash(key, h1)
for n in 1..k:
g(n) = (h1 + i * h2) % bits_per_slice
The second issue, verifying that the implementation you're intending to use
actually holds up to the expected p is extremely important. If you're
using a library which doesn't contain a test suite for checking the false
positive rate you absolutely should to write something like this:
Retrieve or calculate the filter capacity
n.Fill upp the filter with
nkeys.Now run a large series of lookups against the filter with random keys. Make sure the keys are true false positives and keep track of the errors so that you can calculate the
fp/n.
We expect fp to hover around the expected p value since we're dealing with
a probabilistic data-structure but large deviations need to be further
investigated. Also look out for implementations where the fp is very low
as this a good sign that the filter is not correctly tuned, m is probably way
higher than it needs to be.
I've a written a small Python package to document most of my findings and to serve as a future reference for doing bloom filter calculations. Let me know if you find any bugs or if its helpful.
Source: https://github.com/bkz/bloom
Resources:
Almeida, P. S., Baquero, C., PreguiƧa, N., and Hutchison, D. 2007. Scalable Bloom Filters. Inf. Process. Lett. 101, 6 (Mar. 2007), 255-261. DOI= http://dx.doi.org/10.1016/j.ipl.2006.10.007 link
Kirsh, A., and Mitzenmacher, M. 2006. Less Hashing, Same Performance: Building a Better Bloom Filter. In Proceedings of the 14th Annual European Symposium on Algorithms (ESA), pp. 456-467. link