Repeated Content Filter Types
The Tip of the Night for April 24, 2018 discussed how to find repeated content in a document set. In Relativity 9.6, repeated content filters are linked directly to Analytics indexes, and are not managed from an Analytics profile. Regular expression searches can be used to find repeated content.
To create a repeated content filter, follow these steps:
1. On the Indexing & Analytics tab, select Repeated Content Filters.
2. Give the filter a name and then select one of two filter types:
A. Regular Expression: You can set a Regex search to find particular strings. For example, the Regex search \bIBM_[0-9]{9}\b would find a Bates number made up of the prefix 'IBM_' followed by nine digits.
Only one Regex search can be entered.
B. Repeated Content: Filter out specified text.
3. In the Configuration field, enter the Regex search or the specified text. The matching text is removed from the Analytics index or Structured Analytics Set, but the text loaded into Relativity is not altered.
4. Set the Ready to Index field to 'Yes' to mark the filter as ready to be linked to an Analytics index.
5. The Number of Occurrences and Word Count fields will be populated with a value after a structured analytics set is run.
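The Regex example from step 2 can be tried outside Relativity before a filter is created. The following is a minimal sketch in Python, used here only as a stand-in: Relativity's own regular expression engine may differ in syntax, and the function name strip_for_index and the sample sentence are hypothetical. It shows the pattern matching a nine-digit Bates number and, as described in step 3, the source text being left unaltered while a filtered copy is produced.

```python
import re

# Pattern from step 2: the prefix "IBM_" followed by exactly nine digits.
BATES_PATTERN = re.compile(r"\bIBM_[0-9]{9}\b")


def strip_for_index(text: str) -> str:
    """Return a copy of the text with matching Bates numbers removed.

    Hypothetical helper: it only imitates the idea of filtering text
    before indexing; it is not how Relativity applies a filter.
    """
    return BATES_PATTERN.sub("", text)


# Hypothetical sample text: one nine-digit number and one five-digit number.
original = "Produced as IBM_000123456 and IBM_12345 pursuant to the order."
indexed = strip_for_index(original)

print(BATES_PATTERN.findall(original))  # the strings the pattern matches
print(indexed)                          # filtered copy
print(original)                         # source text, unchanged
```

Because of the \b word boundaries and the {9} quantifier, a number with fewer or more than nine digits after the prefix is not matched.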
As noted in the Tip of the Night for April 24, 2018, repeated content that has been identified can be viewed by clicking 'View Repeated Content Filter Results' in the Structured Analytics Set console. Multiple filters will be displayed. Review each one, and link to the Analytics index only those containing text that should be excluded from the index.
Note also that the repeated content filter uses a set naming convention. The Structured Analytics Set prefix and timestamp precede an identifier that ranks repeated content patterns by the product of their word count and their number of occurrences.
After this identifier, the minimum number of occurrences, the minimum word count, the maximum word count, and the actual number of occurrences (see OccCnt#) are given.
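The ranking behind the identifier can be illustrated with a short calculation. The sketch below is in Python and rests on assumptions: the class name, the labels, and the sample figures are hypothetical and are not taken from any Relativity output, and it assumes that a larger product ranks first, which the naming convention itself does not state.

```python
from dataclasses import dataclass


@dataclass
class RepeatedPattern:
    """Hypothetical record for one repeated content pattern."""
    label: str
    word_count: int
    occurrences: int

    @property
    def score(self) -> int:
        # Product of word count and number of occurrences.
        return self.word_count * self.occurrences


def rank_patterns(patterns):
    """Sort patterns by score, largest first (assumed ordering)."""
    return sorted(patterns, key=lambda p: p.score, reverse=True)


# Illustrative figures only.
samples = [
    RepeatedPattern("confidentiality footer", word_count=40, occurrences=500),
    RepeatedPattern("short signature line", word_count=6, occurrences=900),
    RepeatedPattern("long disclaimer", word_count=120, occurrences=50),
]

ranked = rank_patterns(samples)
for rank, pattern in enumerate(ranked, start=1):
    print(rank, pattern.label, pattern.score)
```

With these figures, a short line that occurs very often can still rank below a long block that occurs less often, since the score weighs both the length and the frequency of the pattern.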