What are the odds a hash value will be duplicated?
As readers of this blog will have noted, hash values are given in bit values. SHA-1 is a 160 bit hash function; MD5 is a 128 bit function. Moderns computers use a binary system of 0's and 1's, to represent data. A 2 bit system would use two digits, and have four possible values:
00
01
10
11
. . . we can calculate the total number of values by getting the value of 2² = 4.
A 32 bit hash value (such as CRC - cyclic redundancy check ) would have 4,294,967,296 (2³²) different values. So the chances of a duplicate hash value being generated would be more than 1 in 4 billion, right? Not, so fast.
Long, long ago, when I was in elementary school, I recall a teacher and some students discussing the likelihood that one of the children in a class of perhaps 30 had the same birthday. I felt sure that the chances would be very small - 1 in 365. I was wrong. Actually the probability that any two students will have the same birthday is 70% in a class of 30, or 50% in a class of only 23. Mathematicians term the chances of a collision attack (generating the same hash value by change) the Birthday Paradox.
The key to understanding this paradox, is realizing that in a class of 23, it's not just one person's birthday that is being compared against the other 22, but there are 231 other comparisons of birthdays of the other people in the set to one another. The math can be done by first figuring out what the chances are that there is no one in the class of 23 with the same birthday. First we determine the number of pairs in a set of 23:
(23 * 22) / 2 = 253
. . . and then calculate the probability of any two of them having the same value (or date) in a set of 365 (assuming those values are randomly assigned) :
(364/365) to the power of 253, which equals 0.49952284596341798556893827614273.
So, if we minus this result from 100, we see that the probability is greater than 50% that two students will have the same birthday.
So if we want to know what the chances of two equal hash values being generated by a 32 bit function for a set of 100,000 files, we make this calculation:
(100,000 * 99,999) /2 = 4999950000
. . . and then (4,294,967,295/4,294,967,296) to the power of 4999950000 which is 0.367878914873230000000000000000. So we see that in a set of 100,000 there is a 63% probability that two hash values will have the same value.
See the math further explained here: https://betterexplained.com/articles/understanding-the-birthday-paradox/