False positive rate of a Bloom filter

Bloom filters are a powerful tool for approximate set membership queries. They efficiently determine whether an element is possibly in a set, with minimal memory overhead. However, like any probabilistic data structure, Bloom filters come with a trade-off: the potential for false positives. Understanding and quantifying the probability of false positives is crucial for effectively utilizing Bloom filters in various applications. In this blog post, we explore how to calculate the false positive probability of a Bloom filter.

Computing the False Positive Probability

Use the following form to compute the false positive probability of a Bloom filter.

  • n: number of items that have been inserted into the Bloom filter
  • m: size of the Bloom filter in number of bits
  • k: number of hash functions (leave empty to use the optimal number of hash functions)
Input:
n:
(valid input e.g. 4294967296 or 2^32)
m:
k:
Result:
Optimal k:
False positive probability for k=:

Mathematical Backgrounds

A Bloom filter is a memory efficient lookup data structure that allows membership queries. If a query is for an element that was added to the Bloom filter, the answer will always be “yes”. If a query is for an element that was not added to the Bloom filter, the answer is “no” with high probability. We have a false positive if the answer is “yes” even though the element was not added to the Bloom filter.

The probability of having a false positive is given by

\[p \approx (1-e^{-kn/m})^k\]

Proof: Let \(m\) be the size of the Bloom filter in bits, \(n\) the number of elements that were added to the Bloom filter and \(k\) the number of hash functions used to assign an element to the buckets in the \(m\) bit array.

Let’s assume we query the Bloom filter for an element that was not added. We won’t get a false positive if at least one of the \(k\) buckets to which the element is mapped to contains a 0.

We can compute the probability that the first bucket to which the element is mapped to as follows:

If we add an element to the Bloom filter we set \(k\) random buckets to 1. It follows that if we add \(n\) elements, we set \(nk\) random buckets to 1 (buckets can be chosen multiple times). Each time we select a bucket we have to choose a bucket that is not the first bucket. As we have \(m\) buckets we have to select one from \(m-1\) buckets. The probability of that is \((m-1)/m\). As we do this \(nk\) times, the probability that the first bucket is still zero after inserting \(n\) elements is

\[\left(\frac{m-1}{m}\right)^{nk} = \left(1-\frac{1}{m}\right)^{nk}\]

If \(m\) is large we can use

\[\lim_{m\to \infty}\left(1-\frac{1}{m}\right)^m = \frac{1}{e}\]

to approximate the probability as follows:

\[\left(1-\frac{1}{m}\right)^{nk} = \left(\left(1-\frac{1}{m}\right)^m\right)^{nk/m} \approx \left(\frac{1}{e}\right)^{nk/m} =e^{-kn/m}\]

It follows that the probability that the first bucket is 1 is

\[1-e^{-kn/m}\]

As we get a false positive when all \(k\) buckets are set to 1 we get for the probability of having a false positive

\[\left(1-e^{-kn/m}\right)^k\]

Optimal Number of Hash Functions

If we use a Bloom filter we want to use a number of hash functions which minimize the false positives. We can compute that number as follows:

Let \(f(k) = \left(1-e^{-kn/m}\right)^k\) be the number of false positives depending on \(k\). To get the minimum of that function we determine the \(k\) for which the first derivative of that function equals to zero, i.e.

\[\frac{\mathrm{d}}{\mathrm{d}k} \left(1-e^{-kn/m}\right)^k = 0\]

From here we know that the zeros of the derivative of a function \(f(x)\) for \(f(x)>0\) are also the zeros of the derivative of the function \(\ln f(x)\). Thus, to get the zeros of the equation above, we can instead compute the zeros of the derivative of

\[\frac{\mathrm{d}}{\mathrm{d}k} \ln \left[ \left(1-e^{-kn/m}\right)^k \right] = \frac{\mathrm{d}}{\mathrm{d}k} k \ln \left(1-e^{-kn/m}\right)\]

By applying the product rule we get

\[\ln \left(1-e^{-kn/m}\right) + k \frac{\mathrm{d}}{\mathrm{d}k} \ln \left(1-e^{-kn/m}\right)\]

Now, by applying the chain rule we get

\[\begin{split} \ln \left(1-e^{-kn/m}\right) & + k \frac{1}{1-e^{-kn/m}} \frac{\mathrm{d}}{\mathrm{d}k}(1-e^{-kn/m}) \\[2ex] & = \ln \left(1-e^{-kn/m}\right) + k \frac{1}{1-e^{-kn/m}} (0-e^{-kn/m} \frac{-n}{m}) \\[2ex] & = \ln \left(1-e^{-kn/m}\right) + k\frac{n}{m} \frac{e^{-kn/m}}{1-e^{-kn/m}} \end{split}\]

If we replace \(e^{-kn/m}\) by \(x\) we get

\[\ln \left(1-x\right) + k\frac{n}{m} \frac{x}{1-x}\]

If \(x=e^{-kn/m}\) we know that \(\ln x = -kn/m\). Therefore, we can replace \(k\frac{n}{m}\) by \(-\ln x\) and get

\[\ln \left(1-x\right) - \frac{x}{1-x} \ln x\]

This equation is zero if

\[\ln \left(1-x\right) = \frac{x}{1-x} \ln x\]

or

\[(1-x)\ln \left(1-x\right) = x \ln x\]

The left-hand and right-hand side are equal when \(1-x=x\) which is the case when \(x=1/2\). As \(x=e^{-kn/m}\) we have to solve this equation for \(k\).

\[\begin{split} \frac{1}{2} &= e^{-kn/m} \\[2ex] \ln \frac{1}{2} &= -\frac{kn}{m}\\[2ex] -\ln \frac{1}{2} &= \frac{kn}{m}\\[2ex] -\ln 1 + \ln 2 &= \frac{kn}{m}\\[2ex] \ln 2 &= \frac{kn}{m}\\[2ex] \frac{m}{n} \ln 2 &= k \end{split}\]

Now we know that \(\frac{\mathrm{d}}{\mathrm{d}k} \left(1-e^{-kn/m}\right)^k = 0\) for \(k=\frac{m}{n} \ln 2\).

However, we still need to figure out whether or not \(x=1/2\) is a minimum of \(f(k)\). This is the case when \(f(k)\) is decreasing for \(x<1/2\) and increasing for \(x>1/2\).

From above, we know that

\[\ln(1-x) - \frac{x}{1-x} \ln x\]

is zero for \(x=1/2\).

This equation is equal to

\[\frac{(1-x)\ln(1-x) - x \ln x}{1-x}\]

As \(\ln(1-x) > \ln x\) for \(0<x< 1/2\) and \(1-x > x\) it follows that

\[(1-x)\ln(1-x) > x\ln x\]

With that we know that the numerator \((1-x)\ln(1-x) - x\ln x > 0\).

As the numerator and denominator are positive for \(0<x<1/2\) the function \(f(k)\) is increasing for \(0<x<1/2\), or in other words for \(k>\frac{m}{n}\ln 2\).

Proof: As \(k=-\frac{m}{n}\ln x\) we need to show for \(0<x<1/2\) that

\[k = -\frac{m}{n}\ln x > \frac{m}{n}\ln 2\]

If follows

\[\begin{split} -\ln x &> \ln 2 \\[2ex] 0 &> \ln 2 + \ln x \\[2ex] 0 &> \ln 2x\\[2ex] e^0 &> 2x\\[2ex] 1 &> 2x\\[2ex] x &< \frac{1}{2} \end{split}\]

As we did not get a contradiction we know that \(k>\frac{m}{n}\ln 2\) for \(0<x<1/2\). \(\qquad_\square\)

Now, let’s look at the case when \(1/2 < x < 1\). In that case we have \(\ln x > \ln (1-x)\) and \(x > 1-x\). From that it follows

\[\begin{split} x\ln x &> (1-x)\ln (1-x) \\[2ex] 0 &> (1-x)\ln(1-x)-x\ln x \end{split}\]

As the denominator is positive and the numerator is negative, which makes the first derivative negative between \(1/2<x<1\), the function \(f(k)\) is decreasing in that interval, or in other words for \(k<\frac{m}{n}\ln 2\).

As \(f(k)\) is decreasing for \(k<\frac{m}{n}\ln 2\) and increasing for \(k>\frac{m}{n}\ln 2\), we have a minimum for \(k=\frac{m}{n}\ln 2\).

Therefore, if we use \(k=\frac{m}{n}\ln 2\) different hash functions to hash an element to the different buckts of the Bloom filter, the number of false positives is minimized.

Number of Bits Required to Achieve a Certain False Positive

Assuming we’re using the optimal number of hash functions, i.e. \(k=\frac{m}{n}\ln 2\) we can compute the number of bits \(m\) required to achieve a certain false positive rate. As shown above the probability for a false positive is given by

\[p = (1-e^{-kn/m})^k\]

Substituting \(k\) we get

\[\begin{split} p &= \left(1-e^{-\frac{m}{n}\ln 2 n/m}\right)^{\frac{m}{n}\ln 2} \\[2ex] &= \left(1-e^{-\ln 2}\right)^{\frac{m}{n}\ln 2} \\[2ex] &= \left(1-\frac{1}{2}\right)^{\frac{m}{n}\ln 2} \\[2ex] &= \left(\frac{1}{2}\right)^{\frac{m}{n}\ln 2} \\[2ex] \ln p &= \frac{m}{n}\ln 2\cdot \ln \frac{1}{2} \\[2ex] &= \frac{m}{n}\ln 2\cdot (\ln 2) \\[2ex] &= -\frac{m}{n}\ln^2 2\\[2ex] \end{split}\]

It follows for the number of bits required to achieve the false positive probability \(p\):

\[m = -n\frac{\ln p}{\ln^2 2} \approx -2.08n\ln p\]

References