
Creating an Optimized Analytics Index

Apr 23, 2019

To set up an optimized index in a Relativity workspace, follow these steps:

1. Create a saved search for files between 0 and 30 MB.

2. Display only the extracted text field.

3. Under Indexing & Analytics . . . Structured Analytics, create a new structured analytics set. Enter a name and set prefix, select the saved search, and select the repeated content operation. Keep the default settings in the 'Repeated Content Identification' section, except for 'Minimum Number of Occurrences'. For that setting, enter a value equal to 0.5% of the number of documents in the saved search. In this example the saved search contains 1,446 documents, so the value is 7 (.005 times 1446 is 7.23). The operation will then look for segments of 10 to 100 words, on 4 lines or fewer, within the last 16 lines of each document, which appear at least 7 times. [A different approach should be followed for a saved search of more than 100,000 documents.]

4. Click 'Run Structured Analytics' in the console. Make the appropriate selection if you are supplementing the search with new documents.

5. View the results of the repeated content operation.

6. Make note of the resulting text blocks which contain boilerplate language or non-authored content.

7. Go to Indexing & Analytics . . . Analytics Indexes, and create a new index. Use the saved search as both the training set and the searchable set.

8. Optimize the training set (to take out documents with only numbers, bad OCR, etc.), remove English signatures and footers, and enable the email header filter.

9. Add selected repeated content filters to the index.

10. Finally, click 'Populate Index: Full' on the console to populate and build the index.
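
The 0.5% calculation in step 3 can be sketched in a few lines of Python. This is only an illustration of the arithmetic. The function name min_occurrences is hypothetical and not part of Relativity, and rounding down (with a floor of 1) is an assumption that happens to match the value of 7 used above for 1,446 documents.

```python
import math


def min_occurrences(doc_count, fraction=0.005):
    """Suggest a 'Minimum Number of Occurrences' value for a saved search.

    Assumption (not from Relativity): round down to a whole number,
    and never go below 1.
    """
    return max(1, math.floor(doc_count * fraction))


if __name__ == "__main__":
    # 0.005 * 1446 = 7.23, which rounds down to 7
    print(min_occurrences(1446))
```

Enter the resulting number by hand in the 'Minimum Number of Occurrences' field. As noted in step 3, this approach is not meant for saved searches of more than 100,000 documents.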

Relativity Repeated Content Identification - From the Bottom Up

Apr 19, 2019

Relativity Structured Analytics can identify repeated content (such as email disclaimers) in order to improve searchable document indexes. You want to focus on the authored content of documents and avoid having to sort through search results with hits in boilerplate language. However, if you want an admin to find this repeated content automatically, there's a significant limitation.

While repeated content can be filtered by manually entering language that you already know to look for, in Relativity 9.6, under Indexing & Analytics . . . Structured Analytics Set, you can use the repeated content operation to automatically find repeated content in the documents in a saved search. Each segment of repeated content must fall within a word count range designated by the admin, and appear at least a set number of times, also designated by the admin. Each segment must also be no longer than a set number of lines, often no more than 4.
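
To make the three thresholds concrete, here is a minimal Python sketch of the kind of check described above. It is not Relativity's implementation. The function name qualifies and its parameters are hypothetical, and the default values (10 to 100 words, 4 lines, 7 occurrences) are just the example numbers from the optimized index post.

```python
def qualifies(segment, occurrences,
              min_words=10, max_words=100,
              max_lines=4, min_occurrences=7):
    """Return True if a text segment meets all three admin-set thresholds.

    Hypothetical illustration only: word count within range, no more than
    max_lines non-blank lines, and seen at least min_occurrences times.
    """
    word_count = len(segment.split())
    line_count = len([ln for ln in segment.splitlines() if ln.strip()])
    return (min_words <= word_count <= max_words
            and line_count <= max_lines
            and occurrences >= min_occurrences)


disclaimer = ("This email and any attachments are confidential\n"
              "and intended solely for the named recipient.")

if __name__ == "__main__":
    print(qualifies(disclaimer, occurrences=9))      # 14 words, 2 lines, 9 hits
    print(qualifies("Best regards", occurrences=500))  # too few words
```

A segment has to pass all three tests. A two-word sign-off fails on word count no matter how often it appears, and a long disclaimer that shows up only a few times fails on occurrences.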

A key thing to keep in mind about repeated content identification is that Relativity structured analytics will only search for repeated content from the bottom of documents. Disclaimers listed in headers, or boilerplate language repeated throughout a document, will not be identified in the automated search. Accordingly, the admin must set a number of tail lines to search. The tail lines are the number of non-blank lines counted from the bottom of a document.

Relativity recommends setting 'Number of tail lines to analyze' to 16. The maximum is 200. Increasing the setting much above 16 will cause the operation to run for a long time.
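
The idea of tail lines is easy to picture with a short Python sketch. This is an illustration of "non-blank lines from the bottom" only, not how Relativity actually processes documents, and the function name tail_lines is hypothetical.

```python
def tail_lines(text, n=16):
    """Return the last n non-blank lines of a document's extracted text.

    Hypothetical illustration: blank lines are skipped before counting,
    so only lines with visible content count toward n.
    """
    if n <= 0:
        return []
    non_blank = [ln for ln in text.splitlines() if ln.strip()]
    return non_blank[-n:]


doc = (
    "CONFIDENTIAL HEADER\n"
    "\n"
    "Body line one\n"
    "\n"
    "Body line two\n"
    "\n"
    "Footer disclaimer line\n"
)

if __name__ == "__main__":
    # Only the bottom of the document is examined; the header is never seen.
    print(tail_lines(doc, 2))
```

With n set to 2, only 'Body line two' and 'Footer disclaimer line' are in scope, which shows why a disclaimer placed in a header falls outside the automated search.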

LITIGATION SUPPORT TIP OF THE NIGHT
