Outline of Craig Ball's Electronic Discovery Workbook Part 9 - De-Duplication
Here's a continuation of my outline of the 2016 edition of Craig Ball's Electronic Discovery Workbook, which I last posted about on May 5, 2017.
A. Single Instance Archival Solutions - can reduce the unnecessary replication of ESI.
1. Ball suggests that a third of messages are duplicates in such archives.
B. Problem of Replication in Attorney Review
1. Ball: "Failing to deduplicate substantial collections of ESI before attorney review is tantamount to cheating the client."
2. Case Study: A new review platform couldn't tag emails that had already been produced. When the same Outlook message is exported as a .msg file at different times, each exported file has a different hash value.
C. Mechanized De-duplication of ESI
1. Hashing files.
a. emails with the same displayable content may have different hash values because they traversed different paths to reach the same recipients.
b. hash values must be preserved throughout the process.
2. Hashing only the segments of a message that are likely to match (subject, to, cc, etc.), and excluding those that may differ, such as message headers containing server paths and unique message IDs.
a. normalizing the message data first, e.g., stripping aliases from addresses and alphabetizing them.
3. Textual comparison of segments of a message to determine if they are sufficiently identical for purposes of review.
4. $100 tool available to hash Outlook .pst files that are under 2 GB each. [What tool, Craig?]
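The segment-hashing approach in C.2 can be sketched in Python. This is only an illustration under stated assumptions, not the method used by Ball or by any particular review platform: the function names, the choice of fields, the separator byte, and the two sample messages are all hypothetical, and it only handles plain-text, single-part messages.

```python
import hashlib
from email import message_from_string
from email.utils import getaddresses


def normalized_addresses(msg, header):
    """Return bare addresses for a header, lower-cased and alphabetized,
    with display names (aliases) dropped."""
    values = msg.get_all(header, [])
    return sorted(addr.lower() for _name, addr in getaddresses(values) if addr)


def message_dedupe_hash(raw_message):
    """Hash only the segments likely to match across copies of a message.

    Assumption: single-part, plain-text messages. Received paths and
    Message-ID headers are deliberately left out of the hash.
    """
    msg = message_from_string(raw_message)
    segments = [
        ",".join(normalized_addresses(msg, "From")),
        ",".join(normalized_addresses(msg, "To")),
        ",".join(normalized_addresses(msg, "Cc")),
        " ".join((msg.get("Subject") or "").split()),  # collapse whitespace
        " ".join(msg.get_payload().split()),           # collapse whitespace
    ]
    # Hypothetical separator so adjacent fields cannot run together.
    joined = "\x1f".join(segments)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


# Two made-up copies of one message that took different server paths.
COPY_A = """\
Received: from mx1.example.com by mail.example.com; Mon, 2 May 2016 09:00:01 -0500
Message-ID: <0001@mx1.example.com>
From: Ann Lee <ann@example.com>
To: Bob Ray <bob@example.com>, "Cy Dee" <cy@example.com>
Subject: Q2 forecast

Numbers attached.
"""

COPY_B = """\
Received: from mx2.example.com by mail.example.com; Mon, 2 May 2016 09:00:04 -0500
Message-ID: <0002@mx2.example.com>
From: ann@example.com
To: cy@example.com, BOB@example.com
Subject: Q2 forecast

Numbers attached.
"""

if __name__ == "__main__":
    # Hashing each whole message gives different values...
    print(hashlib.md5(COPY_A.encode("utf-8")).hexdigest())
    print(hashlib.md5(COPY_B.encode("utf-8")).hexdigest())
    # ...while the normalized segment hashes agree.
    print(message_dedupe_hash(COPY_A))
    print(message_dedupe_hash(COPY_B))
```

Which fields to include is a judgment call. Adding a field such as the sent date would make matching stricter, and dropping the body would make it looser.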
D. Hash Values
1. ESI is just a series of numbers (byte encoding schemes), and algorithms can generate a smaller, fixed-length value from them: a message digest or hash value.
a. Message Digest 5 (MD5) (a 128-bit value written as 32 hexadecimal characters, allowing about 340 trillion trillion trillion possible values) and Secure Hash Algorithm One (SHA-1) are the most common hash algorithms used in e-discovery.
2. In sets of duplicate files with the same hash values, the first file is called the pivot and the set of duplicates is the occurrence log.
3. System metadata (e.g., the file name) is not contained in the file, so it is not included in the calculation when the file is hashed.
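The points above about hashing file contents can be tried with Python's standard hashlib module. This is a minimal sketch: the helper names and the throwaway files are hypothetical, and treating the first path in sorted order as the pivot is an assumption made only for illustration.

```python
import hashlib
import tempfile
from collections import defaultdict
from pathlib import Path


def file_hashes(path, chunk_size=1 << 20):
    """Return (MD5, SHA-1) hex digests of a file's contents.

    Only the bytes inside the file are read, so the file name and other
    system metadata never enter the calculation.
    """
    md5, sha1 = hashlib.md5(), hashlib.sha1()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            md5.update(chunk)
            sha1.update(chunk)
    return md5.hexdigest(), sha1.hexdigest()


def group_duplicates(paths):
    """Map each MD5 value to (pivot, occurrence_log).

    Assumption: the pivot is simply the first path in sorted order, and
    the occurrence log lists every path sharing that hash.
    """
    groups = defaultdict(list)
    for path in sorted(paths):
        groups[file_hashes(path)[0]].append(path)
    return {digest: (found[0], found) for digest, found in groups.items()}


def demo():
    """Build three throwaway files, two with identical content, and group them."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "memo.txt").write_bytes(b"Quarterly results\n")
        (root / "copy of memo.txt").write_bytes(b"Quarterly results\n")
        (root / "other.txt").write_bytes(b"Different content\n")
        groups = group_duplicates(root.iterdir())
        return {
            digest: (pivot.name, [p.name for p in log])
            for digest, (pivot, log) in groups.items()
        }


if __name__ == "__main__":
    for digest, (pivot, log) in demo().items():
        print(digest, "pivot:", pivot, "occurrences:", log)
```

The two files with identical content land in the same group even though their names differ, while the third file gets its own hash.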
E. Word .docx files
1. A Word .docx file is a mix of text and rich media encoded in XML, then compressed with the ZIP algorithm.
2. The encoding scheme will be completely different if the document is written to TIFF or PDF, so the resulting files will not share a hash value with the original.
3. No two optical scans will be the same.
4. Ball's testing showed that saving the same Word document to a PDF with the same settings will sometimes generate files with the same hash values.
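Because a .docx is XML inside a ZIP container, a file-level hash covers the container as well as the text. The sketch below is an illustration only, not taken from Ball's workbook. It uses Python's zipfile module to build two stand-in packages that hold identical XML but carry different internal timestamps. The part name, the sample XML, and the timestamps are assumptions, and the packages are not complete Word files that would open in Word.

```python
import hashlib
import io
import zipfile

# Hypothetical stand-in for the main document part of a .docx package.
SAMPLE_XML = b"<w:document><w:body><w:p>Hello, world</w:p></w:body></w:document>"


def build_package(xml_bytes, date_time):
    """Zip the XML into an in-memory package, stamping the entry with date_time."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        info = zipfile.ZipInfo("word/document.xml", date_time=date_time)
        info.compress_type = zipfile.ZIP_DEFLATED
        zf.writestr(info, xml_bytes)
    return buf.getvalue()


def container_hash(package_bytes):
    """MD5 of the whole package, as a file-level hash would see it."""
    return hashlib.md5(package_bytes).hexdigest()


def part_hashes(package_bytes):
    """MD5 of each decompressed part inside the package."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as zf:
        return {
            name: hashlib.md5(zf.read(name)).hexdigest()
            for name in sorted(zf.namelist())
        }


first = build_package(SAMPLE_XML, (2016, 5, 2, 9, 0, 0))
second = build_package(SAMPLE_XML, (2016, 5, 2, 9, 5, 0))

if __name__ == "__main__":
    # The container hashes differ because the stored timestamps differ...
    print(container_hash(first), container_hash(second))
    # ...but the decompressed XML part hashes are identical.
    print(part_hashes(first) == part_hashes(second))
```

Hashing the decompressed parts instead of the container is one way to compare content across re-saved packages. Whether that is appropriate for a given review is a separate question.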