
Language Identification

To identify the languages used in a data set in Relativity, take the following steps.

1. In a Workspace, go to Indexing & Analytics and select Structured Analytics Set.

2. Create a new set. After assigning it a name and prefix, select the set you want to analyze. In the Select Operations section, check 'Language identification'.

3. Next, from the console on the right, click 'Run Structured Analytics'.

4. The run proceeds in three stages. In the first stage, the analysis is set up and the file size is calculated.

5. In stage 2, the structured analytics operations will run.

6. In the final stage, the results will be imported into Relativity.

7. After the Structured Analytics Set has run, click 'View Language Identification Summary' in the console.

A report will be created detailing the primary language of documents in the data set, and the secondary language (if any).

The percentages in the reports are based on bytes of text.
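
To illustrate what a byte-based percentage means, here is a minimal Python sketch. It is not Relativity's implementation: the function name `language_percentages`, the (language, text) input format, and the choice of UTF-8 as the encoding are all assumptions made for illustration.

```python
from collections import Counter


def language_percentages(segments):
    """Return {language: percent of total bytes} for (language, text) pairs.

    Hypothetical helper: assumes the text has already been split into
    segments labeled by language, and that bytes are counted in UTF-8.
    """
    byte_counts = Counter()
    for language, text in segments:
        # Count encoded bytes, not characters.
        byte_counts[language] += len(text.encode("utf-8"))

    total = sum(byte_counts.values())
    if total == 0:
        return {}
    return {lang: 100.0 * n / total for lang, n in byte_counts.items()}


# Four ASCII characters versus two Greek characters.
print(language_percentages([("English", "abcd"), ("Greek", "αβ")]))
```

Under these assumptions the two segments carry equal weight, because each Greek letter occupies two bytes in UTF-8. A percentage based on bytes can therefore differ from one based on character or word counts.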

Up to three languages can be identified in each document. The Analytics engine can detect some languages, such as Thai and Greek, solely on the basis of the characters that are unique to those languages. Chinese, Japanese and Korean are identified on the basis of single characters, whereas other languages are identified using quadgrams (sequences of four characters). Punctuation is ignored. Three to six languages are considered for each quadgram and then added to a comprehensive log. Word lists are not used as a reference. Instead, a training set is generated from web pages in each of the 173 supported languages.
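
The quadgram approach described above can be sketched roughly as follows. This is a toy illustration only, not the Analytics engine's actual algorithm: the function names, the scoring formula, and the two tiny training samples are all hypothetical, and the sketch covers only the quadgram path, not the character-based detection used for languages such as Thai, Greek, Chinese, Japanese and Korean.

```python
from collections import Counter


def quadgrams(text):
    """Return every run of four consecutive characters, ignoring punctuation."""
    # Keep letters only; punctuation, digits, and symbols become spaces.
    letters_only = "".join(ch if ch.isalpha() else " " for ch in text.lower())
    normalized = " ".join(letters_only.split())  # collapse repeated spaces
    return [normalized[i:i + 4] for i in range(len(normalized) - 3)]


def train(samples):
    """Build one quadgram frequency profile per language from sample text."""
    return {lang: Counter(quadgrams(text)) for lang, text in samples.items()}


def identify(text, profiles, max_languages=3):
    """Rank languages by how often the text's quadgrams occur in each profile."""
    scores = Counter()
    for gram, count in Counter(quadgrams(text)).items():
        for lang, profile in profiles.items():
            total = sum(profile.values())
            if total and gram in profile:
                # Weight by the quadgram's relative frequency in the profile.
                scores[lang] += count * profile[gram] / total
    return [lang for lang, _ in scores.most_common(max_languages)]


# Toy training text (hypothetical); a real training set would be far larger.
SAMPLES = {
    "English": "the quick brown fox jumps over the lazy dog and then "
               "the other thing with this",
    "Spanish": "el rápido zorro marrón salta sobre el perro perezoso "
               "y luego la otra cosa con esto",
}
PROFILES = train(SAMPLES)

print(identify("The dog jumps over the fox.", PROFILES))
print(identify("El perro salta sobre el zorro.", PROFILES))
```

No word list is consulted at any point; the profiles are built purely from character sequences in the training text, which mirrors the training-set idea described above. Returning at most three languages reflects the stated limit of three languages per document.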
