Bloomfilters are a powerful tool for approximate set membership queries. They efficiently determine whether an element is possibly in a set, with minimal memory overhead. However, like any probabilistic data structure, Bloomfilters come with a trade-off: the potential for false positives. Understanding and quantifying the probability of false positives is crucial for effectively utilizing Bloomfilters in various applications. In this blog post, we explore how to calculate the false positive probability of a Bloomfilter.
Computing the False Positive Probability
Use the following form to compute the false positive probability of a Bloomfilter.
(leave empty to use the optimal number of hash functions)
$p = $
$k_{opt} = $
Mathematical Backgrounds
A Bloomfilter is a memory efficient lookup data structure that allows membership queries. If a query is for an element that was added to the Bloomfilter, the answer will always be “yes”. If a query is for an element that was not added to the Bloomfilter, the answer is “no” with high probability. We have a false positive if the answer is “yes” even though the element was not added to the Bloomfilter.
The probability of having a false positive is given by
Proof: Let $m$ be the size of the Bloomfilter in bits, $n$ the number of elements that were added to the Bloomfilter and $k$ the number of hash functions used to assign an element to the buckets in the $m$ bit array.
Let’s assume we query the Bloomfilter for an element that was not added. We won’t get a false positive if at least one of the $k$ buckets to which the element is mapped to contains a 0.
We can compute the probability that the first bucket to which the element is mapped to as follows:
If we add an element to the Bloomfilter we set $k$ random buckets to 1. It follows that if we add $n$ elements, we set $nk$ random buckets to 1 (buckets can be chosen multiple times). Each time we select a bucket we have to choose a bucket that is not the first bucket. As we have $m$ buckets we have to select one from $m-1$ buckets. The probability of that is $(m-1)/m$. As we do this $nk$ times, the probability that the first bucket is still zero after inserting $n$ elements is
If $m$ is large we can use
to approximate the probability as follows:
It follows that the probability that the first bucket is 1 is
As we get a false positive when all $k$ buckets are set to 1 we get for the probability of having a false positive
Optimal Number of Hash Functions
If we use a Bloomfilter we want to use a number of hash functions which minimize the false positives. We can compute that number as follows:
Let $f(k) = \left(1-e^{-kn/m}\right)^k$ be the number of false positives depending on $k$. To get the minimum of that function we determine the $k$ for which the first derivative of that function equals to zero, i.e.
From here we know that the zeros of the derivative of a function $f(x)$ for $f(x)>0$ are also the zeros of the derivative of the function $\ln f(x)$. Thus, to get the zeros of the equation above, we can instead compute the zeros of the derivative of
By applying the product rule we get
Now, by applying the chain rule we get
If we replace $e^{-kn/m}$ by $x$ we get
If $x=e^{-kn/m}$ we know that $\ln x = -kn/m$. Therefore, we can replace $k\frac{n}{m}$ by $-\ln x$ and get
This equation is zero if
or
The left-hand and right-hand side are equal when $1-x=x$ which is the case when $x=1/2$. As $x=e^{-kn/m}$ we have to solve this equation for $k$.
Now we know that $\frac{\mathrm{d}}{\mathrm{d}k} \left(1-e^{-kn/m}\right)^k = 0$ for $k=\frac{m}{n} \ln 2$.
However, we still need to figure out whether or not $x=1/2$ is a minimum of $f(k)$. This is the case when $f(k)$ is decreasing for $x<1/2$ and increasing for $x>1/2$.
From above, we know that
is zero for $x=1/2$.
This equation is equal to
As $\ln(1-x) > \ln x$ for $0
With that we know that the numerator $(1-x)\ln(1-x) - x\ln x > 0$.
As the numerator and denominator are positive for $0
Proof: As $k=-\frac{m}{n}\ln x$ we need to show for $0
If follows
As we did not get a contradiction we know that $k>\frac{m}{n}\ln 2$ for $0
Now, let’s look at the case when $1/2 < x < 1$. In that case we have $\ln x > \ln (1-x)$ and $x > 1-x$. From that it follows
As the denominator is positive and the numerator is negative, which makes the first derivative negative between $1/2
As $f(k)$ is decreasing for $k<\frac{m}{n}\ln 2$ and increasing for $k>\frac{m}{n}\ln 2$, we have a minimum for $k=\frac{m}{n}\ln 2$.
Therefore, if we use $k=\frac{m}{n}\ln 2$ different hash functions to hash an element to the different buckts of the Bloomfilter, the number of false positives is minimized.
Number of Bits Required to Achieve a Certain False Positive
Assuming we’re using the optimal number of hash functions, i.e. $k=\frac{m}{n}\ln 2$ we can compute the number of bits $m$ required to achieve a certain false positive rate. As shown above the probability for a false positive is given by
Substituting $k$ we get
It follows for the number of bits required to achieve the false positive probability $p$: