Let’ assume we have a hash function with an output space $m=2^b$. Then, the probability of a hash collision when hashing $n$ distinct inputs (or messages) is roughly
Use the following form to compute the hash collision for a given $n$ and $b$:
b=
Approximated collision probability:
Proof: If we know the probability $p'$ of not having a hash collision, we can compute the probability $p$ of having a hash collision as:
Let’s assume we hash the first message. Obviously, $p'=0$ as we have only one hash so far.
If we hash the second message, the second hash must not be equal to the first hash for not having a hash collision. If the hash function outputs $m=2^b$ different values, the second hash is allowed to take any of the $m-1$ outputs which haven’t been chosen by the hash function yet. The probability for that is $(m-1)/2^b$.
If we hash the third message, the third hash must be different to the first and the second hash for not having a hash collision. Thus, the third hash is allowed to take any of the $m-2$ outputs which haven’t been chosen by the hash function yet. The probability for that is $(m-2)/2^b$.
If we continue with that, we get
As the probability for not having a collision for the first hash it equal to one we can also add the term for the first hash to this equation. This small trick makes it possible to work with that equation more easily later:
We can simply this equation and get:
Here, we can see that the terms $m (m-1)(m-2) \dots\ (m-(n-1))$ are equal to the $n$ largest terms of $m!$.
Now, if we use $m!$ instead of these terms we get:
With that we can now compute the probablity of a hash collision as follows:
Large Numbers
Unfortunately, we can use the formula above only for relatively small numbers. Due to $m!$ and $m^n$ we quickly get numbers of magnitude which standard calculators cannot handle anymore. Therefore, to be able to compute the collision probability even for large $m$, we need to find an approximation of our result.
An intermediate result we got was
which we can write as
Next, we use the Tayler expansion of the expontial function to approximate each term. The Taylor expansion of the exponential function is
For very small values of $x$ the terms after $1+x$ are becoming also very small and we can approximate the exponential function as
With this approximation we can approximate the terms of $p'$. For intance, with $x=-\frac{1}{m}$ we can approximate the term $(1-\frac{1}{m})$ with $e^{-1/m}$. We can continue with that and replace each term $(1-\frac{i}{m})$ with $e^{-i/m}$. We get:
With Gauss’s formular to get the sum of the first $n$ positive integers
we get
and finnally for the probabilty of a collision when computing the hash of $n$ distince values we get