Email Threading and Near Dupe Identification in Relativity
Here's a brief demo on how to run email threading and near duplicate analyses together in Relativity to group email conversations and identify near duplicates among the attachments and individual files.
Note that the email threading and near duplicate operations should never be run in an analytics set at the same time as a repeated content and/or language identification operation.
1. The first step is to create a saved search containing all of the files to be analyzed. The threading analysis is run on the extracted text field. Documents with more than 30 MB of extracted text will be excluded from the analysis.
2. At Indexing & Analytics . . . Analytics Profiles, click 'New Analytics Profile'.
3. Map the email header fields. (This step can actually be omitted, and the analysis can be run on the extracted text only.)
4. In the Email metadata fields section select the Parent Document ID. The Parent Document ID has to be set for attachments but can be left blank for the parent emails. The Parent Document ID for the attachments should be the same as the Control Number for the parent email. As noted in the Tip of the Night for May 30, 2018, the Conversation ID field should not be selected in most situations.
5. Next go to Indexing & Analytics, and select Structured Analytics Set. Click the New Structured Analytics Set button.
6. Set the saved search as the set to analyze and check off 'Email threading' and 'Textual near duplicate identification' as the operations.
7. In the Email Threading section, under 'Select profile for field mappings', select the profile you just created from the drop-down menu. In this example, select the radio button to use email header fields. On the right, the Destination Email Thread Group and Destination Email Duplicate ID fields will be used to show emails from the same group and duplicate emails.
8. In the Textual Near Duplicate Identification section, set the minimum similarity percentage to a value between 80 and 100. This indicates how similar a near dupe must be to a principal document. Setting it to a value lower than 90 will slow down the operation. You should choose to ignore numbers to keep the operation running quickly. The Destination Textual Near Duplicate Group field shows the related items for your near duplicate groups.
9. Next save the set and then click 'Run Structured Analysis' on the console to the right. You'll be given the option to either run the operation on all documents or only new documents added to the set.
The document set will be synced; file sizes will be calculated; and new documents will be added.
10. When the operation is completed, you'll see a summary indicating how many documents and emails were analyzed, and how many emails were 'inclusive', meaning they have unique content not shown in other emails.
11. Back on the console to the right, you can click 'View Email Threading Summary' to get an even better overview.
12. New fields are generated with the prefix set for the analytics set. In this view we can see that the field named 'SA S003 Textual Near Duplicate Group' shows the control number of a principal document; the same value is listed in this field for two other documents which are 98-99% similar to the principal document. The field named 'SA S03: Textual Near Duplicate Principal' will be listed as 'No' for the two documents which match the principal document.
13. The email threads are grouped with sequential numbers in the 'SA S03:: Email Duplicate ID' field (showing how many total threads are in the set); a common ID in the 'SA S::Email Thread Group' field; and IDs unique for each thread in the individual groups in the 'SA S03:: Email Threading ID'.
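If you export the results to a CSV file, the near duplicate fields described in step 12 can be summarized outside of Relativity. The Python sketch below is only an illustration of that idea, not anything Relativity provides. The column names are hypothetical (your own fields will carry the prefix of your analytics set), and the sample rows are made up. It assumes the group field holds the principal document's control number and the principal field holds 'Yes' or 'No', as in the view above.

```python
import csv
import io

# Hypothetical column names; real exports will use the prefix of your analytics set.
CONTROL = "Control Number"
GROUP = "Textual Near Duplicate Group"
PRINCIPAL = "Textual Near Duplicate Principal"


def summarize_near_dupe_groups(rows):
    """Map each near duplicate group to the non-principal documents in it."""
    groups = {}
    for row in rows:
        group_id = (row.get(GROUP) or "").strip()
        if not group_id:
            continue  # this document is not in any near duplicate group
        members = groups.setdefault(group_id, [])
        is_principal = (row.get(PRINCIPAL) or "").strip().lower() == "yes"
        if not is_principal:
            members.append(row[CONTROL])
    return groups


# Made-up sample export, for illustration only.
SAMPLE = """Control Number,Textual Near Duplicate Group,Textual Near Duplicate Principal
DOC0001,DOC0001,Yes
DOC0002,DOC0001,No
DOC0003,DOC0001,No
DOC0004,,
"""

if __name__ == "__main__":
    rows = csv.DictReader(io.StringIO(SAMPLE))
    for principal, members in summarize_near_dupe_groups(rows).items():
        print(principal, "->", ", ".join(members))
```

To use it on a real export, replace the sample text with an open file handle and adjust the three column names to match your own field names.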