Let’ assume we have a hash function with an output space \(m=2^b\). Then, the probability of a hash collision when hashing \(n\) distinct inputs (or messages) is roughly

\[p \approx 1-e^{\frac{-n(n-1)}{2m}}\]

Use the following form to compute the hash collision for a given \(n\) and \(b\):

n= (valid input e.g. 4294967296 or 2^32)
b=
Approximated collision probability:

Proof: If we know the probability \(p'\) of not having a hash collision, we can compute the probability \(p\) of having a hash collision as:

\[p=1-p'\]

Let’s assume we hash the first message. Obviously, \(p'=0\) as we have only one hash so far.

If we hash the second message, the second hash must not be equal to the first hash for not having a hash collision. If the hash function outputs \(m=2^b\) different values, the second hash is allowed to take any of the \(m-1\) outputs which haven’t been chosen by the hash function yet. The probability for that is \((m-1)/2^b\).

If we hash the third message, the third hash must be different to the first and the second hash for not having a hash collision. Thus, the third hash is allowed to take any of the \(m-2\) outputs which haven’t been chosen by the hash function yet. The probability for that is \((m-2)/2^b\).

If we continue with that, we get

\[p' = \underbrace{\frac{m-1}{m}}_{\text{2nd hash}} \cdot \underbrace{\frac{m-2}{m}}_{\text{3rd hash}} \cdot \ \dots \ \cdot \underbrace{\frac{m-(n-2)}{m}}_{\text{(n-1)-th hash}} \cdot \underbrace{\frac{m-(n-1)}{m}}_{\text{n-th hash}}\]

As the probability for not having a collision for the first hash it equal to one we can also add the term for the first hash to this equation. This small trick makes it possible to work with that equation more easily later:

\[p' = \underbrace{\frac{m-0}{m}}_{\text{1st hash}} \cdot \underbrace{\frac{m-1}{m}}_{\text{2nd hash}} \cdot \underbrace{\frac{m-2}{m}}_{\text{3rd hash}} \cdot \ \dots \ \cdot \underbrace{\frac{m-(n-2)}{m}}_{\text{(n-1)-th hash}} \cdot \underbrace{\frac{m-(n-1)}{m}}_{\text{n-th hash}}\]

We can simply this equation and get:

\[\begin{equation} \begin{split} P'(n) &= \frac{1}{m^n}\cdot m (m-1)(m-2) \dots\ (m-(n-1)) \end{split} \end{equation}\]

Here, we can see that the terms \(m (m-1)(m-2) \dots\ (m-(n-1))\) are equal to the \(n\) largest terms of \(m!\).

\[m! = \underbrace{m(m-1)(m-2)\dots(m-(n-1))}_{\text{terms from p'}}\underbrace{(m-n)\dots 3 \cdot 2 \cdot 1}_{(m-n)!}\]

Now, if we use \(m!\) instead of these terms we get:

\[p' = \frac{1}{m^n} \cdot \frac{m!}{(m-n)!}\]

With that we can now compute the probablity of a hash collision as follows:

\[p = 1-p' = 1- \frac{m!}{m^n(m-n)!}\]

Large Numbers

Unfortunately, we can use the formula above only for relatively small numbers. Due to \(m!\) and \(m^n\) we quickly get numbers of magnitude which standard calculators cannot handle anymore. Therefore, to be able to compute the collision probability even for large \(m\), we need to find an approximation of our result.

An intermediate result we got was

\[p' = \frac{m-1}{m} \cdot \frac{m-2}{m} \cdot \ \dots \ \cdot \frac{m-(n-2)}{m} \cdot \frac{m-(n-1)}{m}\]

which we can write as

\[p' = \left(1-\frac{1}{m}\right) \cdot \left(1-\frac{2}{m}\right) \cdot \ \dots \ \cdot \left(1-\frac{n-2}{m}\right) \cdot \left(1-\frac{n-1}{m}\right)\]

Next, we use the Tayler expansion of the expontial function to approximate each term. The Taylor expansion of the exponential function is

\[e^x = \sum_{k=0}^\infty\frac{x^k}{k!} = 1+x+\frac{x^2}{2}+\frac{x^3}{6}+\dots\]

For very small values of \(x\) the terms after \(1+x\) are becoming also very small and we can approximate the exponential function as

\[e^x \approx 1+x\]

With this approximation we can approximate the terms of \(p'\). For intance, with \(x=-\frac{1}{m}\) we can approximate the term \((1-\frac{1}{m})\) with \(e^{-1/m}\). We can continue with that and replace each term \((1-\frac{i}{m})\) with \(e^{-i/m}\). We get:

\[p' \approx e^{-\frac{1}{m}} \cdot e^{-\frac{2}{m}} \cdot \ \dots \ \cdot e^{-\frac{n-1}{m}} = e^{-\frac{1+2+3+\ \dots\ +(n-1)}{m}}\]

With Gauss’s formular to get the sum of the first \(n\) positive integers

\[1+2+3+\ \dots\ +n = \sum_{k=1}^n = \frac{n(n+1)}{2}\]

we get

\[p' \approx e^{-\frac{n(n-1)}{2m}}\]

and finnally for the probabilty of a collision when computing the hash of \(n\) distince values we get

\[p = 1-p' \approx e^{-\frac{n(n-1)}{2m}}\]