Bloomfilters are a powerful tool for approximate set membership queries. They efficiently determine whether an element is possibly in a set, with minimal memory overhead. However, like any probabilistic data structure, Bloomfilters come with a trade-off: the potential for false positives. Understanding and quantifying the probability of false positives is crucial for effectively utilizing Bloomfilters in various applications. In this blog post, we explore how to calculate the false positive probability of a Bloomfilter.

Computing the False Positive Probability

Use the following form to compute the false positive probability of a Bloomfilter.

Parameters
Number of items that have been inserted:
Invalid input
Size of the Bloomfilter in number of bits:
Invalid input
Number of hash functions:
(leave empty to use the optimal number of hash functions)
Invalid input
Result
False positive probability:
$p = $
Optimal number of hash functions:
$k_{opt} = $

Mathematical Backgrounds

A Bloomfilter is a memory efficient lookup data structure that allows membership queries. If a query is for an element that was added to the Bloomfilter, the answer will always be “yes”. If a query is for an element that was not added to the Bloomfilter, the answer is “no” with high probability. We have a false positive if the answer is “yes” even though the element was not added to the Bloomfilter.

The probability of having a false positive is given by

$$ p \approx (1-e^{-kn/m})^k $$

Proof: Let $m$ be the size of the Bloomfilter in bits, $n$ the number of elements that were added to the Bloomfilter and $k$ the number of hash functions used to assign an element to the buckets in the $m$ bit array.

Let’s assume we query the Bloomfilter for an element that was not added. We won’t get a false positive if at least one of the $k$ buckets to which the element is mapped to contains a 0.

We can compute the probability that the first bucket to which the element is mapped to as follows:

If we add an element to the Bloomfilter we set $k$ random buckets to 1. It follows that if we add $n$ elements, we set $nk$ random buckets to 1 (buckets can be chosen multiple times). Each time we select a bucket we have to choose a bucket that is not the first bucket. As we have $m$ buckets we have to select one from $m-1$ buckets. The probability of that is $(m-1)/m$. As we do this $nk$ times, the probability that the first bucket is still zero after inserting $n$ elements is

$$ \left(\frac{m-1}{m}\right)^{nk} = \left(1-\frac{1}{m}\right)^{nk} $$

If $m$ is large we can use

$$ \lim_{m\to \infty}\left(1-\frac{1}{m}\right)^m = \frac{1}{e} $$

to approximate the probability as follows:

$$ \left(1-\frac{1}{m}\right)^{nk} = \left(\left(1-\frac{1}{m}\right)^m\right)^{nk/m} \approx \left(\frac{1}{e}\right)^{nk/m} =e^{-kn/m} $$

It follows that the probability that the first bucket is 1 is

$$ 1-e^{-kn/m} $$

As we get a false positive when all $k$ buckets are set to 1 we get for the probability of having a false positive

$$ \left(1-e^{-kn/m}\right)^k $$

Optimal Number of Hash Functions

If we use a Bloomfilter we want to use a number of hash functions which minimize the false positives. We can compute that number as follows:

Let $f(k) = \left(1-e^{-kn/m}\right)^k$ be the number of false positives depending on $k$. To get the minimum of that function we determine the $k$ for which the first derivative of that function equals to zero, i.e.

$$ \frac{\mathrm{d}}{\mathrm{d}k} \left(1-e^{-kn/m}\right)^k = 0 $$

From here we know that the zeros of the derivative of a function $f(x)$ for $f(x)>0$ are also the zeros of the derivative of the function $\ln f(x)$. Thus, to get the zeros of the equation above, we can instead compute the zeros of the derivative of

$$ \frac{\mathrm{d}}{\mathrm{d}k} \ln \left[ \left(1-e^{-kn/m}\right)^k \right] = \frac{\mathrm{d}}{\mathrm{d}k} k \ln \left(1-e^{-kn/m}\right) $$

By applying the product rule we get

$$ \ln \left(1-e^{-kn/m}\right) + k \frac{\mathrm{d}}{\mathrm{d}k} \ln \left(1-e^{-kn/m}\right) $$

Now, by applying the chain rule we get

$$ \begin{split} \ln \left(1-e^{-kn/m}\right) & + k \frac{1}{1-e^{-kn/m}} \frac{\mathrm{d}}{\mathrm{d}k}(1-e^{-kn/m}) \\[2ex] & = \ln \left(1-e^{-kn/m}\right) + k \frac{1}{1-e^{-kn/m}} (0-e^{-kn/m} \frac{-n}{m}) \\[2ex] & = \ln \left(1-e^{-kn/m}\right) + k\frac{n}{m} \frac{e^{-kn/m}}{1-e^{-kn/m}} \end{split} $$

If we replace $e^{-kn/m}$ by $x$ we get

$$ \ln \left(1-x\right) + k\frac{n}{m} \frac{x}{1-x} $$

If $x=e^{-kn/m}$ we know that $\ln x = -kn/m$. Therefore, we can replace $k\frac{n}{m}$ by $-\ln x$ and get

$$ \ln \left(1-x\right) - \frac{x}{1-x} \ln x $$

This equation is zero if

$$ \ln \left(1-x\right) = \frac{x}{1-x} \ln x $$

or

$$ (1-x)\ln \left(1-x\right) = x \ln x $$

The left-hand and right-hand side are equal when $1-x=x$ which is the case when $x=1/2$. As $x=e^{-kn/m}$ we have to solve this equation for $k$.

$$ \begin{split} \frac{1}{2} &= e^{-kn/m} \\[2ex] \ln \frac{1}{2} &= -\frac{kn}{m}\\[2ex] -\ln \frac{1}{2} &= \frac{kn}{m}\\[2ex] -\ln 1 + \ln 2 &= \frac{kn}{m}\\[2ex] \ln 2 &= \frac{kn}{m}\\[2ex] \frac{m}{n} \ln 2 &= k \end{split} $$

Now we know that $\frac{\mathrm{d}}{\mathrm{d}k} \left(1-e^{-kn/m}\right)^k = 0$ for $k=\frac{m}{n} \ln 2$.

However, we still need to figure out whether or not $x=1/2$ is a minimum of $f(k)$. This is the case when $f(k)$ is decreasing for $x<1/2$ and increasing for $x>1/2$.

From above, we know that

$$ \ln(1-x) - \frac{x}{1-x} \ln x $$

is zero for $x=1/2$.

This equation is equal to

$$ \frac{(1-x)\ln(1-x) - x \ln x}{1-x} $$

As $\ln(1-x) > \ln x$ for $0 and $1-x > x$ it follows that

$$ (1-x)\ln(1-x) > x\ln x $$

With that we know that the numerator $(1-x)\ln(1-x) - x\ln x > 0$.

As the numerator and denominator are positive for $0 the function $f(k)$ is increasing for $0, or in other words for $k>\frac{m}{n}\ln 2$.

Proof: As $k=-\frac{m}{n}\ln x$ we need to show for $0 that

$$ k = -\frac{m}{n}\ln x > \frac{m}{n}\ln 2 $$

If follows

$$ \begin{split} -\ln x &> \ln 2 \\[2ex] 0 &> \ln 2 + \ln x \\[2ex] 0 &> \ln 2x\\[2ex] e^0 &> 2x\\[2ex] 1 &> 2x\\[2ex] x &< \frac{1}{2} \end{split} $$

As we did not get a contradiction we know that $k>\frac{m}{n}\ln 2$ for $0. $\qquad_\square$

Now, let’s look at the case when $1/2 < x < 1$. In that case we have $\ln x > \ln (1-x)$ and $x > 1-x$. From that it follows

$$ \begin{split} x\ln x &> (1-x)\ln (1-x) \\[2ex] 0 &> (1-x)\ln(1-x)-x\ln x \end{split} $$

As the denominator is positive and the numerator is negative, which makes the first derivative negative between $1/2, the function $f(k)$ is decreasing in that interval, or in other words for $k<\frac{m}{n}\ln 2$.

As $f(k)$ is decreasing for $k<\frac{m}{n}\ln 2$ and increasing for $k>\frac{m}{n}\ln 2$, we have a minimum for $k=\frac{m}{n}\ln 2$.

Therefore, if we use $k=\frac{m}{n}\ln 2$ different hash functions to hash an element to the different buckts of the Bloomfilter, the number of false positives is minimized.

Number of Bits Required to Achieve a Certain False Positive

Assuming we’re using the optimal number of hash functions, i.e. $k=\frac{m}{n}\ln 2$ we can compute the number of bits $m$ required to achieve a certain false positive rate. As shown above the probability for a false positive is given by

$$ p = (1-e^{-kn/m})^k $$

Substituting $k$ we get

$$ \begin{split} p &= \left(1-e^{-\frac{m}{n}\ln 2 n/m}\right)^{\frac{m}{n}\ln 2} \\[2ex] &= \left(1-e^{-\ln 2}\right)^{\frac{m}{n}\ln 2} \\[2ex] &= \left(1-\frac{1}{2}\right)^{\frac{m}{n}\ln 2} \\[2ex] &= \left(\frac{1}{2}\right)^{\frac{m}{n}\ln 2} \\[2ex] \ln p &= \frac{m}{n}\ln 2\cdot \ln \frac{1}{2} \\[2ex] &= \frac{m}{n}\ln 2\cdot (\ln 2) \\[2ex] &= -\frac{m}{n}\ln^2 2\\[2ex] \end{split} $$

It follows for the number of bits required to achieve the false positive probability $p$:

$$ m = -n\frac{\ln p}{\ln^2 2} \approx -2.08n\ln p $$

References