Adding New Documents to Textual Near Duplicate Groups Can Create Anomalous Groups of One
In Relativity, you can run an incremental analysis for email threading and textual near duplicate identification. Under the Indexing & Analytics tab, select Structured Analytics and go to the Structured Analytics Set console, which includes the option 'Run Incremental Analysis'.
Incremental analysis picks up documents newly added to the workspace that match the criteria of the saved search used for the full analysis. Usually the newly added documents will match one of the textual near duplicate groups from the initial, full analysis, in which case they are simply added to that group. Relativity groups documents together by comparing each document against the principal document, which is the largest document in the analyzed group. When a document falls within the designated similarity percentage of the principal document (the default is 90%), it is added to that principal's group. If it does not, a new group is created, with a new principal to be compared against successive documents.
If there are documents in a saved search which are not sufficiently similar to any other documents in a selected set, the Textual Near Duplicate Principal and the Textual Near Duplicate Group fields will not be filled in.
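The grouping behavior described above can be sketched roughly as follows. This is only an illustration, not Relativity's actual algorithm: the function names are hypothetical, document size is assumed to mean text length, and Python's difflib ratio stands in for whatever similarity measure Relativity uses.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity score in [0, 1] (assumption, not Relativity's measure)."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def group_near_duplicates(docs: dict[str, str], threshold: float = 0.90) -> dict[str, list[str]]:
    """Return {principal_id: [member_ids]} with the principal listed first.

    Documents that match nothing are omitted, mirroring the blank
    Textual Near Duplicate Principal and Group fields.
    """
    # Process the largest documents first so each principal is the
    # largest member of its group.
    ordered = sorted(docs, key=lambda doc_id: len(docs[doc_id]), reverse=True)
    groups: dict[str, list[str]] = {}
    for doc_id in ordered:
        for principal in groups:
            if similarity(docs[principal], docs[doc_id]) >= threshold:
                groups[principal].append(doc_id)
                break
        else:
            # No principal was similar enough, so this document starts
            # a new group and becomes its principal.
            groups[doc_id] = [doc_id]
    # Drop groups that never gained a second member.
    return {p: members for p, members in groups.items() if len(members) > 1}


docs = {"A": "x" * 100, "B": "x" * 95, "C": "y" * 50}
result = group_near_duplicates(docs)
```

In this toy set, A and B are near duplicates and C matches nothing, so only the group led by A is reported and C is left without group fields.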
There are special considerations when an incremental analysis is performed.
1. If a new document matches a pre-existing textual near duplicate group, but is larger than the principal, the principal will not be changed. The new, larger document from the incremental analysis will be an 'orphan document'.
2. If a new document matches a document from the pre-existing set that was not assigned to a group, and is smaller than the pre-existing document, a new textual near duplicate group will be created with the pre-existing document as the principal.
3. If a new document matches a document from the pre-existing set that was not assigned to a group, and is larger than the pre-existing document, the new document will become the principal of a new group, but the pre-existing document it matches is not added to this new group.
The third scenario is clearly an anomaly in that a group of one is created.
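The three incremental scenarios can be modeled with a small simulation. This is a hedged sketch, not Relativity's implementation: the State class and add_incremental function are hypothetical, the match between the new and pre-existing document is supplied by the caller rather than computed, 'orphan' is assumed to mean the new document receives no group, and equal sizes are treated as 'not larger' because the scenarios above do not cover ties.

```python
from dataclasses import dataclass, field


@dataclass
class State:
    """Result of the full analysis: document sizes and existing groups."""
    sizes: dict[str, int]
    groups: dict[str, list[str]] = field(default_factory=dict)  # principal -> members


def add_incremental(state: State, new_id: str, new_size: int, matched_id: str) -> str:
    """Apply the incremental rules for a new document that matches matched_id."""
    state.sizes[new_id] = new_size
    if matched_id in state.groups:
        # The matched document is the principal of an existing group.
        if new_size > state.sizes[matched_id]:
            # Scenario 1: the principal is never replaced; the larger
            # new document is left as an orphan (assumed: no group).
            return "orphan"
        state.groups[matched_id].append(new_id)
        return "joined existing group"
    # The matched document was not assigned to any group.
    if new_size <= state.sizes[matched_id]:
        # Scenario 2: the pre-existing document becomes the principal.
        state.groups[matched_id] = [matched_id, new_id]
        return "new group, pre-existing principal"
    # Scenario 3: the new document becomes principal, but the
    # pre-existing match is not pulled in, leaving a group of one.
    state.groups[new_id] = [new_id]
    return "group of one"


state = State(sizes={"P": 100, "Q": 90, "U": 50, "V": 30},
              groups={"P": ["P", "Q"]})
outcome_1 = add_incremental(state, "N1", 120, "P")
outcome_2 = add_incremental(state, "N2", 40, "U")
outcome_3 = add_incremental(state, "N3", 60, "V")
```

After these three calls, the group led by P is unchanged, U leads a new two-document group, and N3 sits alone in its own group while V remains ungrouped.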