If you're not already using it, you can download a 30-day trial version of Lexis LAW PreDiscovery at this site,
http://www.lexisnexis.com/litigation/products/law-prediscovery. LAW is one of the most widely used eDiscovery applications. It can import data from more than 2000 file types, interface with more than a hundred different types of scanners, and follow through with the endorsing, OCRing, and production of the imported documents. The installer for the trial version downloads without a file extension. To make it run, rename the file so that it ends in '.exe'. When installing, be sure to include the options not just for LAW but also for Early Data Analyzer and the TIFF/PDF drivers. Over the next month I'm going to spend some time test driving LAW.
A good way to start is to load one of the PST folders from the EDRM Enron data set. Open a PST archive in Outlook and then save the attachments to a folder using the macro described in the Tip of the Night for May 14, 2015. Then, in LAW, go to File . . . Import . . . Electronic Discovery and choose the option on the right for 'File(s)'. Select the files you saved from Outlook and import them. The LAW Electronic Discovery Loader will give you the following options:
1. filter in or out certain file types.
2. import compound documents.
3. dedupe based on MD5 or SHA-1 hash values, at either the level of the full data set or the custodian level, either logging the existence of the dupes, excluding them, or both.
4. choose the type of data you want to import from .pst files (just email, or contacts, calendar entries, etc.)
5. exclude emails from a particular date range.
6. capture custom metadata, detect hidden rows, columns, & sheets for Excel files, track changes for Word & Excel files, and speaker notes for PowerPoint presentations.
7. de-NIST according to the National Software Reference Library database.
8. convert the imported documents to TIFF images, and index extracted text.
9. identify the language of the text.
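
To make option 3 more concrete, here is a rough Python sketch of what hash-based deduping at the full data set level versus the custodian level amounts to. This is not LAW's actual implementation, just an illustration of the idea. The function names (file_hash, find_duplicates), the scope values 'global' and 'custodian', and the (custodian, path) input format are all my own assumptions for the example.

```python
import hashlib
from pathlib import Path


def file_hash(path, algorithm="md5"):
    """Return the hex digest of a file's contents, reading it in chunks."""
    digest = hashlib.new(algorithm)  # accepts "md5" or "sha1"
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(items, algorithm="md5", scope="global"):
    """Split (custodian, path) pairs into unique files and duplicates.

    scope="global" treats a file as a duplicate if the same hash was seen
    anywhere in the data set; scope="custodian" only compares files that
    belong to the same custodian. (Hypothetical names, for illustration.)
    """
    if scope not in ("global", "custodian"):
        raise ValueError("scope must be 'global' or 'custodian'")
    seen = set()
    unique, duplicates = [], []
    for custodian, path in items:
        value = file_hash(path, algorithm)
        key = value if scope == "global" else (custodian, value)
        if key in seen:
            duplicates.append((custodian, path))  # log it, exclude it, or both
        else:
            seen.add(key)
            unique.append((custodian, path))
    return unique, duplicates
```

With a global scope, an identical attachment held by two custodians is kept only once; with a custodian scope, each custodian keeps their own copy. The duplicates list corresponds to what you would log, exclude, or both.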