Litigation Support Tip of the Night

April 19, 2019

More than once I've come across email attachments in document productions for which the file name was 'winmail.dat' and the attachment had not been processed.  This stems from a well-known issue that Microsoft Support has addressed here.  Some email clients cannot process emails sent from MS Outlook in rich text format.   In that case the message is delivered in plain text, and the winmail.dat file contains the rich text formatting, embedded images and file attachments.   This encoding is known as TNEF, Transport Neutral Encapsulation Format.

This error causes a minor security breach, as the sender's login user name and .pst folder paths can be found if the file is opened in a text editor.

See this example from the Enron Email data set. 

It's hard to say if the EDRM's own processing stripped out some of the original information, but we can see a file path and what may be a login ID. 
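As a rough illustration of what opening the file in a text editor reveals, here is a minimal Python sketch that checks for the TNEF signature (0x223E9F78, stored little-endian at the start of the stream) and lists runs of printable ASCII. The helper names and the sample bytes are made up for illustration. A real winmail.dat may also store strings as UTF-16, which this simple scan would miss.

```python
import re
import struct

TNEF_SIGNATURE = 0x223E9F78  # first four bytes of a TNEF stream, little-endian


def is_tnef(data: bytes) -> bool:
    """Return True if the data starts with the TNEF signature."""
    if len(data) < 4:
        return False
    (signature,) = struct.unpack("<I", data[:4])
    return signature == TNEF_SIGNATURE


def printable_strings(data: bytes, min_len: int = 6) -> list[str]:
    """Pull out runs of printable ASCII, roughly what a text editor would show."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [match.decode("ascii") for match in re.findall(pattern, data)]


# Synthetic bytes for illustration only; this is NOT a real winmail.dat,
# and the path below is a hypothetical example.
sample = (
    struct.pack("<I", TNEF_SIGNATURE)
    + b"\x01\x00"
    + b"C:\\Users\\jdoe\\Outlook\\archive.pst\x00"
)

print(is_tnef(sample))
print(printable_strings(sample))
```

To try this on a real attachment, read the file with open(path, "rb").read() and pass the bytes to both functions.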

January 18, 2016

Andre Ross gives a very good description of the shingling process in this blog post: http://digfor.blogspot.com/2013/03/fruity-shingles.html .    As discussed in the tip of the night for January 16, 2016, document shingling involves comparing the overlapping word n-grams of two different text files.   Ross notes that shingling involves the calculation of Jaccard Similarity, "the number of items in the intersection of A and B divided by the number of items in the union of A and B"  or

 

Sim(A,B) = |A ∩ B| / |A ∪ B|

 

. . . so we get a figure based on the number of n-grams the two files have in common divided by the total number of unique n-grams used in both.
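The formula can be sketched in a few lines of Python using sets. This is only an illustration. The function name is my own, and treating two empty sets as identical is an assumption rather than something Ross's post specifies.

```python
def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    if not union:
        # Assumption: two empty sets are treated as identical.
        return 1.0
    return len(a & b) / len(union)


# Two small sets of trigrams that share two entries.
a = {"now is the", "is the time", "the time for"}
b = {"is the time", "the time for", "time for all"}

print(jaccard(a, b))  # 2 shared / 4 unique = 0.5
```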

 

Here's an example. 

 

1.  In Fig. 1 we see three text files, which were edited over a period of several weeks.   The August version is almost the same as the July version, but one phrase has been moved around.  In the September version, while the original first sentence is still present in parts, an entirely new phrase has been added and more changes have been made.

 

 

2. In Fig. 2, we run the n-gram generator discussed in the tip for the night of January 16, 2016, and copy out the three-word overlapping n-grams for each of the three text files to an Excel spreadsheet.

 

 

 

 

3.  In Excel, the n-grams from each text file are pasted into columns A, C, and E, and then we run VLOOKUP formulas in column B to check which of the n-grams from the July version in column A are the same as those in the August version in column C  [18], and which of the n-grams from the August version in column C match those in column E for the September version [8].

 

 

 

 

 

 

 

 

 

4. On a second worksheet, we combine the July and August n-grams into one de-duped set, and the August and September n-grams into another, to get totals of 36 and 49 unique n-grams respectively.

 

 

 

 

 

 

 

5.  So while the July and August versions have a Jaccard similarity of 0.5 (18 / 36), the August and September versions only have a Jaccard similarity of 0.16 (8 / 49).
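The spreadsheet steps can be mirrored with Python sets, as a rough sketch. The function names and the mini-columns are hypothetical and are not the data in the figures. Note that the set version counts each distinct n-gram once, whereas a VLOOKUP column would flag a repeated n-gram on every row where it appears.

```python
def shared_count(first_column, second_column):
    """Step 3 analogue: distinct n-grams in the first column that also appear in the second."""
    return len(set(first_column) & set(second_column))


def combined_unique_count(first_column, second_column):
    """Step 4 analogue: size of the combined, de-duplicated list."""
    return len(set(first_column) | set(second_column))


# Hypothetical mini-columns, not the data from the figures.
july = ["the cat sat", "cat sat on", "sat on the"]
august = ["cat sat on", "sat on the", "on the mat", "the mat today"]
print(shared_count(july, august), combined_unique_count(july, august))  # 2 5

# Plugging in the counts reported in the steps above.
print(round(18 / 36, 2))  # 0.5
print(round(8 / 49, 2))   # 0.16
```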

January 16, 2016

Shingling is a method of determining the degree of similarity between two electronic files by measuring how many n-grams the two have in common.  N-grams are sequences of a set number of words that appear in a text file, created so that the second word of the present n-gram is always the first word of the succeeding n-gram.   So n-grams for this phrase, where n=3 (or where we want to generate 'trigrams'):

 

Now is the time for all good men to come to the aid of their party.

 

. . . would be:

 

Now is the

is the time

the time for

time for all

for all good

all good men

good men to

men to come

to come to

come to the

to the aid

the aid of

aid of their

of their party
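The same list can be produced with a short Python sketch. This is not how the extraction tool mentioned below works internally. It is just one simple way to build overlapping word n-grams, and stripping punctuation from the edges of each word is my own assumption.

```python
import string


def word_ngrams(text: str, n: int = 3) -> list[str]:
    """Return overlapping n-word sequences, sliding forward one word at a time."""
    # Split on whitespace and strip punctuation from the edges of each word.
    words = [word.strip(string.punctuation) for word in text.split()]
    words = [word for word in words if word]
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


phrase = "Now is the time for all good men to come to the aid of their party."

for gram in word_ngrams(phrase):
    print(gram)
```

The phrase has 16 words, so it yields 14 trigrams, from 'Now is the' through 'of their party'.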

 

The idea is to create word groupings that overlap with one another.  If you want to generate n-grams, download the Win32 version of the N-gram extraction tool from this site: http://homepages.inf.ed.ac.uk/lzhang10/ngram.html

 

Just download the zip file and extract the files to a folder.    Save the text file that you want to analyze in the same folder, hold CTRL + SHIFT and right-click in the folder, and select 'Open command window here'.    In the command prompt type:

 

text2ngram -n3 now.txt

 

. . . 'now.txt' being the name of the file you want to generate n-grams for.   You'll get the results shown in this screen grab: 

 

 

 

January 6, 2016

This night's tip comes from the site of Jesse Kornblum, as it did back on December 28, 2015.     He has custom programs which allow you to compare the size of different files and also generate bitmaps.  Download the filecompare and colorize programs on this web page, http://jessekornblum.livejournal.com/290358.html . 

 

The programs are designed to compare files in which the differences were made 'in place' (bytes overwritten at the same position), not versions of a file in which changes were made with insertions or deletions.
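To see why that distinction matters, here is a minimal Python sketch of a position-by-position comparison. It is not Kornblum's implementation and does not reproduce the output format of his tools. The function name and sample strings are my own.

```python
def differing_offsets(a: bytes, b: bytes) -> list[int]:
    """Return the offsets at which two byte strings differ, compared position by position."""
    shared = min(len(a), len(b))
    diffs = [i for i in range(shared) if a[i] != b[i]]
    # Bytes past the end of the shorter input count as differences too.
    diffs.extend(range(shared, max(len(a), len(b))))
    return diffs


original = b"Lorem ipsum dolor sit amet"
in_place = b"Lorem IPSUM dolor sit amet"   # same length, five bytes overwritten
inserted = b"Lorem  ipsum dolor sit amet"  # one byte inserted, the rest shifted

print(differing_offsets(original, in_place))       # [6, 7, 8, 9, 10]
print(len(differing_offsets(original, inserted)))  # far more than one
```

The in-place edit is reported as a small, localized change, while the single inserted byte shifts everything after it, so nearly every later position registers as different.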

 

Once you have downloaded the two executables to a folder, you just need to press SHIFT + CTRL in that folder, right click and select 'Open command window here'.

 

Just enter commands such as

 

filecompare lorem.txt lorem-edit.txt > edit2.dat

 

colorize edit2.dat

 

. . . as shown in Fig. 1. 

 

 

. . . a bitmap file will be generated like the one shown here in Fig. 2. 

 

 

. . . showing differences in the two examined files.

 


Sean O'Shea has more than 15 years of experience in the litigation support field with major law firms in New York and San Francisco.   He is an ACEDS Certified eDiscovery Specialist and a Relativity Certified Administrator.

The views expressed in this blog are those of the owner and do not reflect the views or opinions of the owner’s employer.

 

All content provided on this blog is for informational purposes only. The owner of this blog makes no representations as to the accuracy or completeness of any information on this site or found by following any link on this site. The owner will not be liable for any errors or omissions in this information nor for the availability of this information. The owner will not be liable for any losses, injuries, or damages from the display or use of this information.

 

This policy is subject to change at any time.