Repeated Content Filters
A Relativity admin can help improve the accuracy of searches through the use of repeated content filters. Repeated content filters remove boilerplate language and confidentiality footers from an Analytics Index and help you avoid false positives in searches. To find the repeated content in a document set, follow these steps.
1. Under Indexing & Analytics . . . Structured Analytics Set, click 'New Structured Analytics Set'.
2. Set a name for the Structured Analytics Set, designate a prefix, and then select a saved search containing the documents you want to analyze.
3. Under Select Operations, check 'Repeated Content Identification'.
4. In the 'Repeated Content Identification' section, set the following parameters:
a. 'Minimum number of occurrences' - the fewest times a phrase must appear in order to be considered repeated content.
b. 'Minimum number of words'; 'Maximum number of words' - the word count range for the repeated content.
c. 'Maximum number of lines to return' - the maximum number of lines a single repeated content segment can span.
d. 'Number of tail lines to analyze' - how many lines from the bottom (not counting blank lines) will be scanned for repeated content.
5. Click 'Save', and then on the console click 'Run Structured Analytics'.
6. You will be prompted to update all documents or only new documents added to the set. The repopulate text option must be checked if any text was changed.
7. The process will run and the results will be imported into the workspace.
8. Go back to the console and select 'View Repeated Content Filter Results'.
9. Under Indexing & Analytics . . . Repeated Content Filters, a new filter will have been created.
10. The filter will have multiple records showing the identified segments.
As you can see, the EDRM Enron Email Data Set disclaimer was identified:
EDRM Enron Email Data Set has been produced in EML, PST and NSF format by ZL Technologies, Inc. This Data Set is licensed under a Creative Commons Attribution 3.0 United States License <http://creativecommons.org/licenses/by/3.0/us/> . To provide attribution, please cite to "ZL Technologies, Inc. (http://www.zlti.com)."
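To make the parameters in step 4 more concrete, here is a minimal Python sketch of how repeated content could be identified in the tail lines of a document set. It is not Relativity's implementation, which this post does not describe. It is a simplified illustration that matches whole lines only and does not model 'Maximum number of lines to return'. The function names find_repeated_content and strip_repeated_content, their defaults, and the sample email bodies are all hypothetical.

```python
from collections import Counter


def find_repeated_content(documents, min_occurrences=2, min_words=6,
                          max_words=100, tail_lines=8):
    """Return lines found in the tails of at least min_occurrences documents.

    Hypothetical helper: an illustration of the step 4 parameters,
    not Relativity's actual algorithm.
    """
    counts = Counter()
    for text in documents:
        # Blank lines are not counted, mirroring 'Number of tail lines to analyze'.
        lines = [line.strip() for line in text.splitlines() if line.strip()]
        tail = lines[-tail_lines:] if tail_lines > 0 else []
        # Count each candidate once per document so a single document
        # cannot reach the occurrence threshold on its own.
        for line in set(tail):
            if min_words <= len(line.split()) <= max_words:
                counts[line] += 1
    return sorted(line for line, n in counts.items() if n >= min_occurrences)


def strip_repeated_content(text, repeated):
    """Drop every line that matches an identified repeated segment."""
    repeated = set(repeated)
    return "\n".join(
        line for line in text.splitlines() if line.strip() not in repeated
    )


# One sentence of the EDRM disclaimer, used as the shared footer.
footer = ("This Data Set is licensed under a Creative Commons "
          "Attribution 3.0 United States License")

# Made-up email bodies, for illustration only.
docs = [
    "Please review the attached draft.\n\n" + footer,
    "Lunch on Friday?\n" + footer + "\n",
    "Meeting moved to 3pm.\n" + footer,
]

found = find_repeated_content(docs, min_occurrences=3, min_words=5)
print(found)
print(strip_repeated_content(docs[2], found))
```

In this sketch, only the footer appears in all three documents, so it is the only line that reaches the occurrence threshold. Raising min_occurrences above 3, or setting min_words above the footer's word count, would return nothing, which is the same trade-off you tune in the Structured Analytics Set.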