Structured Analytics and Textual Near Duplicate Identification


As discussed in the Tip of the Night for March 21, 2017, you can add documents that are near duplicates to a list in Relativity.

A Relativity admin can control how the Textual Near Duplicate operation calculates the similarity between documents. The minimum similarity percentage cannot be set lower than 80 percent. This setting can be adjusted when configuring a new Structured Analytics Set under the Indexing & Analytics tab. Note that if the email threading operation is run on the same set, textual near duplicate identification will only run on non-emails.
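The two rules in the paragraph above can be sketched in a few lines of Python. This is only an illustration, not Relativity code: the function names, the `id` and `is_email` keys, and the upper bound of 100 are assumptions made for the example. Only the 80 percent floor and the non-email rule come from the text.

```python
MIN_ALLOWED_SIMILARITY = 80  # lower bound stated in the text above


def validate_min_similarity(percent):
    """Reject a minimum similarity percentage outside 80-100 (hypothetical check)."""
    # The upper bound of 100 is an assumption; a percentage cannot exceed it.
    if not MIN_ALLOWED_SIMILARITY <= percent <= 100:
        raise ValueError(
            f"Minimum similarity must be between {MIN_ALLOWED_SIMILARITY} and 100"
        )
    return percent


def near_dup_candidates(docs, email_threading_enabled):
    """Return the ids of documents that would go through near duplicate identification.

    docs is a list of dicts with hypothetical 'id' and 'is_email' keys.
    """
    if email_threading_enabled:
        # With email threading on the same set, only non-emails are compared.
        return [d["id"] for d in docs if not d["is_email"]]
    return [d["id"] for d in docs]


docs = [
    {"id": "DOC001", "is_email": True},
    {"id": "DOC002", "is_email": False},
]
print(near_dup_candidates(docs, email_threading_enabled=True))
```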

The Textual Near Duplicate section of a Structured Analytics Set also offers an option to ignore numbers when identifying near duplicates. If 'Ignore Numbers' is set to 'YES', the operation will disregard numbers when comparing documents.

Strings that begin with letters and include numbers will not be excluded. Relativity's online documentation includes a chart illustrating this.

This feature may be particularly useful in productions which include large numbers of Excel files. It can help locate revisions of spreadsheets that contain changing values in columns with headings matching those from other spreadsheets.
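To see why ignoring numbers helps with spreadsheet revisions, here is a rough Python sketch. It does not reproduce Relativity's algorithm, which this post does not describe. Jaccard similarity over word sets is used purely as a stand-in, and the rule for what counts as a number (a token with at least one digit and no letters) is an assumption. All function names are made up for the example.

```python
def is_number_token(token):
    """Assumed rule: a token is a 'number' if it has a digit and no letters."""
    has_digit = any(ch.isdigit() for ch in token)
    has_letter = any(ch.isalpha() for ch in token)
    return has_digit and not has_letter


def tokenize(text, ignore_numbers=False):
    """Lowercase and split on whitespace, optionally dropping number tokens."""
    tokens = text.lower().split()
    if ignore_numbers:
        # Tokens such as "q1" start with a letter, so they are kept.
        tokens = [t for t in tokens if not is_number_token(t)]
    return tokens


def jaccard_similarity(text_a, text_b, ignore_numbers=False):
    """Stand-in similarity measure: shared distinct tokens / all distinct tokens."""
    set_a = set(tokenize(text_a, ignore_numbers))
    set_b = set(tokenize(text_b, ignore_numbers))
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


# Two revisions of a spreadsheet row: same headings, different values.
rev1 = "Region Q1 Sales 1200 Units 35"
rev2 = "Region Q1 Sales 1350 Units 41"

print(jaccard_similarity(rev1, rev2))                       # 0.5
print(jaccard_similarity(rev1, rev2, ignore_numbers=True))  # 1.0
```

With the numbers left in, the two revisions share only half of their distinct tokens and would fall well below an 80 percent threshold. With numbers ignored, only the headings are compared and the revisions match.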

Be aware that Structured Analytics will not be run on documents with more than 30 MB of text.
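Before building a set, it can be handy to flag documents that would exceed this limit. The Python sketch below is a hypothetical pre-check, not part of Relativity. It assumes 1 MB means 1,048,576 bytes and that size is measured on UTF-8 encoded extracted text; Relativity may measure it differently.

```python
# Assumption: 30 MB = 30 * 1024 * 1024 bytes of UTF-8 encoded text.
MAX_TEXT_BYTES = 30 * 1024 * 1024


def split_by_text_size(docs, limit=MAX_TEXT_BYTES):
    """Split a {doc_id: extracted_text} dict into eligible and skipped ids."""
    eligible, skipped = [], []
    for doc_id, text in docs.items():
        size = len(text.encode("utf-8"))
        # Documents with more text than the limit would not be analyzed.
        if size > limit:
            skipped.append(doc_id)
        else:
            eligible.append(doc_id)
    return eligible, skipped


# Small demo with an artificially low limit so it runs instantly.
demo = {"DOC001": "x" * 10, "DOC002": "x" * 50}
eligible, skipped = split_by_text_size(demo, limit=20)
print(eligible, skipped)
```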


Sean O'Shea has more than 20 years of experience in the litigation support field with major law firms in New York and San Francisco. He is an ACEDS Certified eDiscovery Specialist and a Relativity Certified Administrator.

The views expressed in this blog are those of the owner and do not reflect the views or opinions of the owner’s employer.



© 2015 by Sean O'Shea.
