Structured Analytics and Textual Near Duplicate Identification

Sean O'Shea
Apr 9, 2018

As discussed in the Tip of the Night for March 21, 2017, you can add near-duplicate documents to a list in Relativity.

A Relativity admin can control how the Textual Near Duplicate Identification operation calculates the similarity between documents. The minimum similarity percentage cannot be set lower than 80 percent. This setting can be adjusted when creating or editing a Structured Analytics Set under the Indexing & Analytics tab. Note that if the email threading operation is run on the same set, textual near duplicate identification will only run on non-emails.
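The two rules above (the 80 percent floor and the exclusion of emails when threading is also run) can be pictured with a small Python sketch. This is not Relativity's code or API. The function names and the `is_email` key are hypothetical, and the upper bound of 100 is an assumption based only on the setting being a percentage.

```python
# Illustrative sketch only; names and data layout are hypothetical,
# not part of Relativity.
MIN_ALLOWED = 80   # lowest value the setting accepts, per the text above
MAX_ALLOWED = 100  # assumed upper bound, since the setting is a percentage


def validate_min_similarity(percent):
    """Return the percentage unchanged if it falls in the allowed range."""
    if not MIN_ALLOWED <= percent <= MAX_ALLOWED:
        raise ValueError(
            f"Minimum similarity must be between {MIN_ALLOWED} and {MAX_ALLOWED}"
        )
    return percent


def eligible_for_near_dup(docs, email_threading_enabled):
    """Pick the documents that near duplicate identification would look at.

    docs is a list of dicts with hypothetical 'id' and 'is_email' keys.
    When email threading runs on the same set, only non-emails are kept.
    """
    if email_threading_enabled:
        return [d for d in docs if not d["is_email"]]
    return list(docs)


docs = [
    {"id": "DOC001", "is_email": True},
    {"id": "DOC002", "is_email": False},
]
threshold = validate_min_similarity(90)
non_emails = eligible_for_near_dup(docs, email_threading_enabled=True)
```

With threading enabled, only DOC002 is left in `non_emails`; a value such as 75 would be rejected by `validate_min_similarity`.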

Also available in the Textual Near Duplicate section of a Structured Analytics Set is an option to ignore numbers when running the identification. If 'Ignore Numbers' is set to 'YES', the operation will disregard numbers when it compares documents.

Strings that begin with letters and include numbers will not be excluded. Relativity includes a chart illustrating this in its online documentation.

This feature may be particularly useful in productions which include large numbers of Excel files. It can help locate revisions of spreadsheets that contain changing values in columns with headings matching those from other spreadsheets.
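To see why ignoring numbers helps with spreadsheet revisions, here is a rough Python sketch. It does not reproduce Relativity's algorithm: `difflib.SequenceMatcher` from the standard library is swapped in as a stand-in similarity measure, and the tokenizing rule (drop tokens made up only of digits, commas, and periods) is an assumption. The one rule taken from the text above is that strings beginning with a letter, such as "Q1", are kept.

```python
import re
from difflib import SequenceMatcher

# Assumption: a "number" is a token made only of digits, commas, and periods.
_NUMERIC = re.compile(r"^[\d.,]*\d[\d.,]*$")


def strip_numbers(text):
    """Split on whitespace and drop purely numeric tokens.

    Tokens that begin with a letter (e.g. "Q1") never match the pattern,
    so they are kept, as described in the text above.
    """
    return [tok for tok in text.split() if not _NUMERIC.match(tok)]


def similarity_percent(a, b, ignore_numbers):
    """Stand-in similarity score (0 to 100) between two extracted texts."""
    tokens_a = strip_numbers(a) if ignore_numbers else a.split()
    tokens_b = strip_numbers(b) if ignore_numbers else b.split()
    return 100 * SequenceMatcher(None, tokens_a, tokens_b).ratio()


# Two hypothetical revisions of a spreadsheet: same headings, new values.
v1 = "Region Q1 Sales Units East 1200 340 West 980 275"
v2 = "Region Q1 Sales Units East 1450 410 West 1010 300"

with_numbers = similarity_percent(v1, v2, ignore_numbers=False)
without_numbers = similarity_percent(v1, v2, ignore_numbers=True)
```

In this toy case the score with numbers included falls well below an 80 percent threshold, while the score with numbers ignored does not, because only the headings are compared.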

Be aware that Structured Analytics will not be run on documents with more than 30 MB of text.
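If you want to spot such documents before running a set, a pre-check along these lines is one option. This is a sketch under assumptions: the post does not say how Relativity measures the 30 MB, so the code treats it as 30 x 1024 x 1024 bytes of UTF-8 text, and the function name is hypothetical.

```python
# Assumption: 30 MB means 30 * 1024 * 1024 bytes of UTF-8 encoded text.
MAX_TEXT_BYTES = 30 * 1024 * 1024


def within_text_limit(text, limit_bytes=MAX_TEXT_BYTES):
    """Return True if the extracted text is no larger than the limit."""
    return len(text.encode("utf-8")) <= limit_bytes


small_ok = within_text_limit("Quarterly sales report")
```

Documents for which this returns False would be candidates for exclusion from Structured Analytics.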

LITIGATION SUPPORT TIP OF THE NIGHT

New tips for paralegals and litigation support professionals are posted to this site each week. Click on the blog headings for more detail.

