ssdeep for Context Triggered Piecewise Hashing
Jesse Kornblum, a computer forensics specialist with ManTech, a Department of Defense security technology contractor, has developed a program that performs 'Context Triggered Piecewise Hashing'. A conventional SHA-1 or MD5 hash value will change completely if even a slight change is made to an electronic file. Kornblum's ssdeep program (which can be downloaded for Windows here, http://ssdeep.sourceforge.net/) builds on piecewise hashing, in which a "hash is generated for many discrete fixed-size segments of the file", as he explains in his paper, "Identifying almost identical files using context triggered piecewise hashing", available here. Instead of fixed-size segments, ssdeep uses a rolling hash of the file's content to decide where each segment ends, so a small insertion or deletion only changes the part of the signature near the edit. The paper explains the math behind his file comparison methodology.
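The idea of letting the content decide where each segment ends can be sketched in a few lines of Python. This is a toy illustration only, not ssdeep's actual algorithm: the window size, the trigger rule (a rolling sum of bytes), the truncated MD5 used for each piece, and the name piecewise_signature are all assumptions picked to keep the example short.

```python
import hashlib

WINDOW = 7       # bytes in the rolling window (illustrative choice)
BLOCK_SIZE = 64  # a trigger fires roughly once per BLOCK_SIZE bytes

def piecewise_signature(data: bytes) -> list:
    """Cut data at content-defined trigger points and hash each piece."""
    pieces = []
    start = 0
    rolling = 0
    for i, byte in enumerate(data):
        # Maintain the sum of the last WINDOW bytes incrementally.
        rolling += byte
        if i >= WINDOW:
            rolling -= data[i - WINDOW]
        # Only trigger on a full window, so the decision depends purely
        # on the nearby bytes and not on the position in the file.
        if i >= WINDOW - 1 and rolling % BLOCK_SIZE == BLOCK_SIZE - 1:
            pieces.append(hashlib.md5(data[start:i + 1]).hexdigest()[:4])
            start = i + 1
    if start < len(data):
        # Hash whatever is left after the last trigger point.
        pieces.append(hashlib.md5(data[start:]).hexdigest()[:4])
    return pieces

# Made-up sample text; the edited copy drops the first word.
original = (b"It is a truth universally acknowledged, that a single man in "
            b"possession of a good fortune, must be in want of a wife. ") * 8
edited = original[len(b"It "):]

sig_a = piecewise_signature(original)
sig_b = piecewise_signature(edited)
print(len(sig_a), "pieces in original,", len(sig_b), "pieces in edited")
print(len(set(sig_a) & set(sig_b)), "distinct piece hashes in common")
```

Because each cut point depends only on the bytes just before it, removing a word at the start of the data shifts the later cut points but leaves the pieces between them unchanged, so most of the signature survives the edit.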
Extract the files in the zip file posted to the SourceForge site and save them in a folder named 'Temp' at the root of your C drive. Holding SHIFT + CTRL, right-click in this folder and select 'Open command window here'. Enter the command:
ssdeep -lrd C:\Temp
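. . . having saved the files you want to review in subfolders at C:\Temp. The -r flag tells ssdeep to recurse into subfolders, -d compares every file it finds against the others, and -l prints relative rather than absolute paths. In this example I am comparing several Word documents saved in subfolders at C:\Temp:
. . . ssdeep lists the pairs of files it determines to be either identical or very similar, with a score from 0 to 100 to indicate how similar they are.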
On the first line, Memo2.docx is shown to be very similar to Memo - Copy.docx. In fact, the second file is just a copy of the first from which the first word has been deleted. A score of 57 is given.
On the third and fourth lines, pairs of Word documents are shown to be similar despite the fact that their present content is completely different. This is likely because I created each one by opening the first file, deleting its content, pasting in the first page of the other novel, and re-saving it with a new file name. ssdeep hashes the raw bytes of a file rather than just its visible text, so it can pick up on whatever the documents still share beneath the surface, which may include data carried over from the file they were created from.
The fifth line shows an example of a true duplicate: the MD5 and SHA-1 hash values of the two Memo.docx files are the same, and ssdeep shows a score of 100. I just copied the file to a new location. See the screen grab below from HashMyFiles, which reflects this and also shows how just a small change from Memo2.docx to Memo - Copy.docx results in a completely different hash value.
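The behaviour of the conventional hashes described here is easy to reproduce with Python's standard hashlib module. The byte strings below are made-up stand-ins for the Memo files, not their real contents, and the helper name digests is just for this sketch.

```python
import hashlib

def digests(data: bytes) -> dict:
    """Return the MD5 and SHA-1 hex digests of data."""
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
    }

# Hypothetical contents standing in for the documents.
memo = b"Please review the attached memo before Friday."
memo_copy = bytes(memo)                  # exact copy in a new location
memo_edited = memo[len(b"Please "):]     # first word deleted

for label, data in [("original", memo), ("copy", memo_copy), ("edited", memo_edited)]:
    d = digests(data)
    print(f"{label:9} md5={d['md5']} sha1={d['sha1']}")
```

The copy produces the same digests as the original, while the edited version produces digests that share nothing recognisable with them. That is the gap a similarity score is meant to fill.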
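Not shown in the ssdeep results are two Word files both named 'pride.docx', one saved at C:\Temp\Unchanged and the other at C:\Temp\Draft1, even though they both contain the exact same words from the first page of Pride and Prejudice. The metadata of the files is different. A .docx file is a compressed archive, so a difference like this is likely enough to change the stored bytes that ssdeep actually compares.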
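When two .docx files hold the same words but do not match on their raw bytes, one option is to compare the visible text instead. The sketch below is one possible approach using only the Python standard library. It assumes the usual .docx layout, in which the body text lives in word/document.xml inside the archive. The helper names docx_text and text_digest are made up for this example.

```python
import hashlib
import zipfile
import xml.etree.ElementTree as ET

# XML namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def docx_text(source) -> str:
    """Return the visible body text of a .docx, one paragraph per line.

    source may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Join the text runs that make up this paragraph.
        paragraphs.append("".join(t.text or "" for t in para.iter(W_NS + "t")))
    return "\n".join(paragraphs)

def text_digest(source) -> str:
    """SHA-1 of the visible text only, ignoring metadata and packaging."""
    return hashlib.sha1(docx_text(source).encode("utf-8")).hexdigest()

# Example call, using the paths from this post:
# text_digest(r"C:\Temp\Unchanged\pride.docx") == text_digest(r"C:\Temp\Draft1\pride.docx")
```

This only looks at the main document body. Headers, footers, footnotes and comments are stored in other parts of the archive and are ignored here.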
Kornblum's site has other tools on it which are worth checking out. See: http://jessekornblum.com/tools/