
To set up an optimized index in a Relativity workspace, follow these steps:

1. Create a saved search for files between 0 and 30 MB.

2. Display only the extracted text field.

3. Under Indexing & Analytics . . . Structured Analytics, create a new structured analytics set. Enter a name and set prefix, select the saved search, and run a repeated content operation. Keep the default settings in the 'Repeated Content Identification' section, except for 'Minimum Number of Occurrences'. For that setting, enter a value equal to 0.5% of the documents in the saved search. In this example the saved search contains 1,446 documents, so the value is 7 (.005 times 1,446 is about 7.2). The operation will then look for segments of 10 to 100 words, on 4 lines or fewer, within the 16 lines at the bottom of each document, which appear at least 7 times. [A different approach should be followed for a saved search of more than 100,000 documents.]

4. Click 'Run Structured Analytics' in the console. Make the appropriate selection if you are supplementing the search with new documents.

5. View the results of the repeated content operation.

6. Make note of the resulting text blocks which contain boilerplate language or non-authored content.

7. Go to Indexing & Analytics . . . Analytics Indexes, and create a new index. Use the saved search as both the training set and the searchable set.

8. Optimize the training set (to take out documents with only numbers, bad OCR, etc.), remove English signatures and footers, and enable the email header filter.

9. Add selected repeated content filters to the index.

10. Finally, click Populate Index: Full on the console to populate and build the index.
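The arithmetic behind the 'Minimum Number of Occurrences' value in step 3 can be sketched in a few lines of Python. This is only an illustration of the 0.5% rule of thumb described above. The function name is made up for this example, and rounding down to a whole number (with a floor of 1) is an assumption, since the setting needs an integer and the steps above do not say how to round.

```python
import math


def minimum_occurrences(document_count: int, fraction: float = 0.005) -> int:
    """Return a whole-number threshold equal to a share of the saved search.

    Rounding down, and never returning less than 1, are assumptions made
    for this sketch rather than documented Relativity behavior.
    """
    return max(1, math.floor(document_count * fraction))


# The saved search in the example above holds 1,446 documents.
print(minimum_occurrences(1446))  # 7
```

For a saved search of 1,446 documents this gives 7, which matches the value used in step 3.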


 
 

Relativity Structured Analytics can identify repeated content (such as email disclaimers) in order to improve searchable document indexes. You want to focus on the authored content of documents and avoid having to sort through search results with hits in boilerplate language. However, if you want an admin to find this repeated content, there's a significant limitation.

While repeated content can be found by manually entering language that you know to search for, in Relativity 9.6, under Indexing & Analytics . . . Structured Analytics Set, you can use the repeated content operation to automatically find repeated content in a set of documents in a saved search. Each segment of repeated content must fall within a word count range designated by the admin, and appear at least a set number of times, also designated by the admin. Each segment must also fit within a maximum number of lines, often no more than 4.
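To make the three criteria concrete, here is a minimal Python sketch of a check that a candidate segment meets the word count range, the line limit, and the occurrence threshold. It is not Relativity's implementation, and the class, field, and function names are hypothetical. The default values (10 to 100 words, 4 lines, 7 occurrences) are simply the ones used in the steps above.

```python
from dataclasses import dataclass


@dataclass
class RepeatedContentSettings:
    # Hypothetical names; the values mirror the example in the steps above.
    min_words: int = 10
    max_words: int = 100
    max_lines: int = 4
    min_occurrences: int = 7


def qualifies(segment: str, occurrences: int, settings: RepeatedContentSettings) -> bool:
    """Return True if a segment meets all three admin-defined criteria."""
    word_count = len(segment.split())
    # Count only non-blank lines of the segment.
    line_count = len([line for line in segment.splitlines() if line.strip()])
    return (
        settings.min_words <= word_count <= settings.max_words
        and line_count <= settings.max_lines
        and occurrences >= settings.min_occurrences
    )


disclaimer = (
    "This message is confidential and intended only for the named recipient.\n"
    "If you received it in error, please delete it."
)
settings = RepeatedContentSettings()
print(qualifies(disclaimer, occurrences=12, settings=settings))      # True
print(qualifies("Regards, Pat", occurrences=500, settings=settings))  # False
```

The second call fails on word count alone: a two-word sign-off is too short to count as a segment, no matter how often it appears.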

A key thing to keep in mind about repeated content identification is that Relativity structured analytics will only search for repeated content from the bottom of documents. Disclaimers listed in headers, or boilerplate language repeated throughout a document, will not be identified in the automated search. Accordingly, the admin must set a number of tail lines to search. The tail lines are the number of non-blank lines from the bottom of a document.

Relativity recommends setting 'Number of tail lines to analyze' to 16. The max is 200. Increasing the setting much above 16 will cause the operation to run for a long time.
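The idea of tail lines can be illustrated with a short Python sketch that returns the last non-blank lines of a document's extracted text. This is only a model of the concept described above, not the way Relativity implements it. The function name is hypothetical, the default of 16 and the cap of 200 come from the settings mentioned above, and treating 1 as the lowest allowed value is an assumption.

```python
MAX_TAIL_LINES = 200  # maximum mentioned above


def tail_lines(text: str, count: int = 16) -> list[str]:
    """Return the last `count` non-blank lines of `text`, in document order."""
    # A lower bound of 1 is an assumption made for this sketch.
    if not 1 <= count <= MAX_TAIL_LINES:
        raise ValueError(f"count must be between 1 and {MAX_TAIL_LINES}")
    non_blank = [line.strip() for line in text.splitlines() if line.strip()]
    return non_blank[-count:]


email = (
    "CONFIDENTIAL - header disclaimer\n"
    "\n"
    "Hello team,\n"
    "\n"
    "Please see attached.\n"
    "\n"
    "Regards,\n"
    "Pat\n"
    "\n"
    "This email is confidential.\n"
)
print(tail_lines(email, count=3))
# ['Regards,', 'Pat', 'This email is confidential.']
```

With a small tail setting, the disclaimer in the header of this sample never enters the window, which is the limitation described above.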


 
 

Sean O'Shea has more than 20 years of experience in the litigation support field with major law firms in New York and San Francisco. He is an ACEDS Certified eDiscovery Specialist and a Relativity Certified Administrator.

The views expressed in this blog are those of the owner and do not reflect the views or opinions of the owner’s employer.


© 2015 by Sean O'Shea . Proudly created with Wix.com
