N-gram Generator


Shingling is a method of determining the degree of similarity between two electronic files by measuring how many n-grams the two have in common. N-grams are sequences of a set number of consecutive words from a text file, generated so that each n-gram begins one word later than the one before it. In other words, the second word of the present n-gram is always the first word of the succeeding n-gram. So the n-grams for this phrase, where n=3 (that is, where we want to generate 'trigrams'):

Now is the time for all good men to come to the aid of their party.

. . . would be:

Now is the

is the time

the time for

time for all

for all good

all good men

good men to

men to come

to come to

come to the

to the aid

the aid of

aid of their

of their party
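The list above can be reproduced with a short Python sketch. This is only an illustration, not the method the extraction tool uses: the function names `word_ngrams` and `shared_fraction` are made up for this example, the rule for what counts as a word (letters, digits and apostrophes, with punctuation dropped) is an assumption, and the Jaccard ratio is just one possible way to express "how many n-grams the two have in common".

```python
import re


def word_ngrams(text, n=3):
    """Return the overlapping word n-grams of text, in order."""
    # Assumption: a "word" is a run of letters, digits, or apostrophes,
    # so the period at the end of the sentence is dropped.
    words = re.findall(r"[A-Za-z0-9']+", text)
    # Slide a window of n words forward one word at a time.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def shared_fraction(text_a, text_b, n=3):
    """Jaccard overlap of the two texts' n-gram sets (one possible measure)."""
    # Lowercase both texts so that case differences do not hide a match.
    a = set(word_ngrams(text_a.lower(), n))
    b = set(word_ngrams(text_b.lower(), n))
    if not a and not b:
        return 0.0
    # Shared n-grams divided by all distinct n-grams in either text.
    return len(a & b) / len(a | b)


sentence = "Now is the time for all good men to come to the aid of their party."
trigrams = word_ngrams(sentence)
for gram in trigrams:
    print(gram)
```

The sentence has 16 words, so the sketch prints the same 14 trigrams listed above, from 'Now is the' through 'of their party'. Calling `shared_fraction` on two texts returns a value between 0 (no trigrams in common) and 1 (identical sets of trigrams).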

The idea is to create word groupings that overlap with one another. If you want to generate n-grams, download the Win32 version of the N-gram extraction tool from this site: http://homepages.inf.ed.ac.uk/lzhang10/ngram.html

Just download the zip file and extract the files to a folder. Save the text file that you want to analyze in the same folder, then hold CTRL + SHIFT, right-click in the folder, and select 'Open command window here'. At the command prompt, type:

text2ngram -n3 now.txt

. . . 'now.txt' being the name of the file you want to generate n-grams for. You'll get the results shown in this screen grab:



© 2015 by Sean O'Shea