
As readers of this blog will have noted, hash values are described by their length in bits. SHA-1 is a 160 bit hash function; MD5 is a 128 bit function. Modern computers use a binary system of 0's and 1's to represent data. A 2 bit system would use two digits, and have four possible values:

00

01

10

11

. . . we can calculate the total number of values by getting the value of 2² = 4.
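
The same count can be checked in a few lines of Python. This is only an illustrative sketch; the helper name `bit_patterns` is made up for this example.

```python
from itertools import product

def bit_patterns(n_bits):
    """Return every possible n-bit value as a string of 0s and 1s."""
    return ["".join(bits) for bits in product("01", repeat=n_bits)]

patterns = bit_patterns(2)
print(patterns)       # ['00', '01', '10', '11']
print(len(patterns))  # 4, the same as 2 ** 2
print(2 ** 32)        # 4294967296 possible values for a 32 bit hash
```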

A 32 bit hash value (such as CRC-32, a cyclic redundancy check) would have 4,294,967,296 (2³²) different values. So the chances of a duplicate hash value being generated would be about 1 in 4 billion, right? Not so fast.

Long, long ago, when I was in elementary school, I recall a teacher and some students discussing the likelihood that two of the children in a class of perhaps 30 had the same birthday. I felt sure that the chances would be very small - 1 in 365. I was wrong. Actually the probability that any two students will have the same birthday is about 70% in a class of 30, and about 50% in a class of only 23. Mathematicians call this result the Birthday Paradox, and the same math governs the chances of a collision (two different inputs generating the same hash value by chance).

The key to understanding this paradox is realizing that in a class of 23, it's not just one person's birthday that is being compared against the other 22; there are also 231 other comparisons of the birthdays of the other people in the set to one another. The math can be done by first figuring out the chances that no two people in the class of 23 share a birthday. First we determine the number of pairs in a set of 23:

(23 * 22) / 2 = 253

. . . and then calculate the probability that none of those pairs has the same value (or date) in a set of 365 (assuming those values are randomly assigned, and treating each pair as independent, which is a close approximation) :

(364/365) to the power of 253, which equals 0.49952284596341798556893827614273.

So, if we subtract this result from 1, we get about 0.5005, which means the probability is just over 50% that two students will have the same birthday.
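
For anyone who wants to check the arithmetic, here is a rough Python sketch. The function names are my own. `pair_estimate` follows the pairs method described above, while `exact_probability` multiplies out the chance that each successive person misses every earlier birthday, assuming all 365 days are equally likely.

```python
from math import comb

def pair_estimate(people, days=365):
    """Approximate chance of a shared birthday using the pairs method."""
    pairs = comb(people, 2)                  # 23 people -> 253 pairs
    no_match = ((days - 1) / days) ** pairs  # chance that no pair matches
    return 1 - no_match

def exact_probability(people, days=365):
    """Exact chance of a shared birthday, assuming equally likely days."""
    no_match = 1.0
    for i in range(people):
        no_match *= (days - i) / days
    return 1 - no_match

print(round(pair_estimate(23), 4))      # 0.5005
print(round(exact_probability(23), 4))  # 0.5073
print(round(exact_probability(30), 4))  # 0.7063
```

The two methods agree closely for small classes, which is why the simpler pairs method is good enough for a quick estimate.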

So if we want to know the chances of two equal hash values being generated by a 32 bit function for a set of 100,000 files, we make this calculation:

(100,000 * 99,999) / 2 = 4,999,950,000

. . . and then (4,294,967,295/4,294,967,296) to the power of 4,999,950,000, which is approximately 0.3122. So we see that in a set of 100,000 files there is about a 69% probability that two files will have the same hash value.
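
A base this close to 1 raised to a power this large is hard to evaluate accurately on an ordinary calculator, so the calculation is worth checking in code. The following is a rough Python sketch of the same pairs method. The function name `collision_probability` is my own, and it assumes hash values are spread evenly at random, which is an idealization.

```python
from math import comb, expm1, log1p

def collision_probability(n_files, hash_bits):
    """Estimate the chance that at least two of n_files share a hash value."""
    values = 2 ** hash_bits    # number of possible hash values
    pairs = comb(n_files, 2)   # number of file pairs to compare
    # (1 - 1/values) ** pairs, computed with logarithms so that the
    # tiny 1/values term is not lost to floating point rounding
    log_no_collision = pairs * log1p(-1 / values)
    return -expm1(log_no_collision)  # same as 1 - exp(log_no_collision)

print(round(collision_probability(100_000, 32), 3))  # 0.688
print(collision_probability(100_000, 128))           # MD5-sized hash
print(collision_probability(100_000, 160))           # SHA-1-sized hash
```

For the 128 bit and 160 bit cases the result is vanishingly small, which is why accidental collisions are not a practical concern for MD5 or SHA-1 even though they are for a 32 bit function.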



A SHA-1 hash function, like the MD5 hash function, is used in electronic discovery to determine whether or not two electronic files are duplicates. Now, like MD5, the SHA-1 hash function has been compromised. A cryptology group at the CWI Institute in the Netherlands worked with Google to generate PDF files with different content, but identical SHA-1 hash values. They announced the results of their project on February 23, 2017. Hash values are not only used in electronic discovery to detect duplicates, but are also used in the Transport Layer Security (TLS) and Secure Sockets Layer (SSL) protocols that internet browsers rely on to transmit data online, and for digital signatures on legal documents.

The CWI/Google team has posted a notice of its success in breaking SHA-1 to the site https://shattered.io/. From this site you can download two PDFs, one blue, and one red, that have identical SHA-1 hash values. I did so, and tested them out, and they do indeed have identical SHA-1 hash values.
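
If you would like to repeat the test without a forensic tool, here is a minimal Python sketch using the standard `hashlib` module. The function names are my own, and the file names in the usage note are placeholders for wherever you save the two downloads.

```python
import hashlib

def file_digest(path, algorithm="sha1", chunk_size=65536):
    """Return the hex digest of a file, read in chunks to limit memory use."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def same_digest(path_a, path_b, algorithm="sha1"):
    """True if both files produce the same digest under the given algorithm."""
    return file_digest(path_a, algorithm) == file_digest(path_b, algorithm)
```

Calling `same_digest("blue.pdf", "red.pdf")` on the two downloaded files should return True, while `same_digest("blue.pdf", "red.pdf", "sha256")` should return False, since the files differ in content and SHA-256 is not affected by this attack.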

The team has also posted their research paper to the site, which goes into great technical detail about their research. I'll admit that I don't understand much of the math in the paper, but here are a few key points that it makes which will be helpful to the electronic discovery professional:

1. Their method requires immense computational power, but is 100,000 times faster than a brute force attack.

2. Theoretical attacks against SHA-1 have been proposed since 2005, but it continues to be widely used. MD5 was broken in 2004. SHA-1 was published by NIST in 1995.

3. The team's attack is estimated to take 150 days on a single quad-core CPU.

4. Using computational power rented from Amazon, the collision attack would cost $110,000 to implement.


  • Jan 24, 2017

As discussed in Craig Ball's Electronic Discovery Workbook, FTK Imager can be downloaded here for free. FTK Imager is part of the Forensic Toolkit developed by AccessData. FTK Imager allows you to create an image of a hard drive in different segments that can later be reconstructed. It uses MD5 hash values to confirm that the data has been copied correctly.

FTK Imager has many functions, but one of its more helpful ones is the ability to generate a list of hash values. Just go to File . . . Add Evidence Item. Select the type as the 'Contents of a folder' and the files in the folder will be loaded up.

. . . then simply go to File . . . Export File Hash List and an Excel .csv file will be created showing the SHA1 and MD5 hash values of each file in the selected folder.
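
For comparison, a similar list can be produced with a short Python script. This is only a sketch of the general idea and not how FTK Imager itself works; the function names and the column layout are my own assumptions, and it only looks at files directly inside the folder, not subfolders.

```python
import csv
import hashlib
from pathlib import Path

def hash_file(path, chunk_size=65536):
    """Return (md5, sha1) hex digests for one file, reading it once."""
    md5, sha1 = hashlib.md5(), hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
            sha1.update(chunk)
    return md5.hexdigest(), sha1.hexdigest()

def export_hash_list(folder, csv_path):
    """Write a CSV of MD5 and SHA1 values for each file in folder."""
    csv_path = Path(csv_path).resolve()
    rows = 0
    with open(csv_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["MD5", "SHA1", "FileName"])  # assumed column layout
        for path in sorted(Path(folder).iterdir()):
            # skip subfolders, and the output file if it sits in the same folder
            if path.is_file() and path.resolve() != csv_path:
                md5, sha1 = hash_file(path)
                writer.writerow([md5, sha1, str(path)])
                rows += 1
    return rows
```

Calling `export_hash_list("C:/evidence", "hash_list.csv")`, with a folder path of your own in place of the placeholder, returns the number of files hashed and leaves a .csv file that opens in Excel.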


Sean O'Shea has more than 20 years of experience in the litigation support field with major law firms in New York and San Francisco. He is an ACEDS Certified eDiscovery Specialist and a Relativity Certified Administrator.

The views expressed in this blog are those of the owner and do not reflect the views or opinions of the owner’s employer.


© 2015 by Sean O'Shea . Proudly created with Wix.com
