Relativity Repeated Content Identification - From the Bottom Up

Relativity Structured Analytics can identify repeated content (such as email disclaimers) in order to improve searchable document indexes. The goal is to focus on the authored content of documents and avoid sorting through search results with hits in boilerplate language. However, an admin who relies on this operation to find repeated content should be aware of a significant limitation.

While repeated content can be found by manually entering language that you already know to look for, in Relativity 9.6, under Indexing & Analytics . . . Structured Analytics Set, you can use the repeated content operation to automatically find repeated content in the documents returned by a saved search. Each segment of repeated content must fall within a word count range designated by the admin, and must appear at least a minimum number of times, also designated by the admin. Each segment is also capped at a set number of lines, often no more than 4.
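To make the interaction of these thresholds concrete, here is a minimal Python sketch. It is not Relativity's implementation: the function name, the parameter names, and the default values (other than the 4-line cap mentioned above) are hypothetical, and it assumes a segment is counted at most once per document.

```python
from collections import Counter


def filter_repeated(candidates_per_doc, min_words=6, max_words=100,
                    min_occurrences=2, max_lines=4):
    """Keep candidate segments that satisfy all three thresholds.

    candidates_per_doc: one list of candidate text segments per document.
    All names and defaults here are illustrative, not Relativity settings.
    """
    counts = Counter()
    for candidates in candidates_per_doc:
        # Assumption: count each segment at most once per document.
        counts.update(set(candidates))

    kept = {}
    for segment, n_docs in counts.items():
        n_words = len(segment.split())
        n_lines = len(segment.splitlines())
        if (min_words <= n_words <= max_words
                and n_lines <= max_lines
                and n_docs >= min_occurrences):
            kept[segment] = n_docs
    return kept


disclaimer = "This message is confidential and intended only for the named recipient."
docs = [
    [disclaimer, "Thanks, Bob"],
    [disclaimer, "See you Monday"],
    [disclaimer, "Thanks, Bob"],
]
repeated = filter_repeated(docs, min_occurrences=3)
```

In this example only the disclaimer survives: "Thanks, Bob" is too short and appears in too few documents, so it is treated as ordinary content.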

A key thing to keep in mind about repeated content identification is that Relativity Structured Analytics will only search for repeated content starting from the bottom of documents. Disclaimers that appear in headers, or boilerplate language repeated throughout a document, will not be identified in the automated search. Accordingly, the admin must set a number of tail lines to search. The tail lines are the non-blank lines counted up from the bottom of a document.
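The definition of tail lines can be illustrated with a short Python sketch. This only shows which lines would fall inside the analyzed window under the definition above. The function name is hypothetical and the code does not reproduce how Relativity actually reads documents.

```python
def tail_lines(text, n=16):
    """Return the last n non-blank lines of text, in document order."""
    non_blank = [line.strip() for line in text.splitlines() if line.strip()]
    # Guard n <= 0 explicitly, because non_blank[-0:] would return every line.
    return non_blank[-n:] if n > 0 else []


email = (
    "CONFIDENTIAL HEADER\n"
    "\n"
    "Hi Ann,\n"
    "\n"
    "Please see attached.\n"
    "\n"
    "Bob\n"
    "\n"
    "This email is confidential.\n"
    "Delete it if received in error.\n"
)

window = tail_lines(email, n=3)
```

With a window of 3 tail lines, the footer disclaimer is inside the window but the header disclaimer is not, which is why boilerplate at the top of a document goes undetected.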

Relativity recommends setting 'Number of tail lines to analyze' to 16. The maximum is 200, but increasing the setting much above 16 will cause the operation to run for a long time.

