Here's another installment in my outline of John Tredennick's 'TAR for Smart People'. I last posted an installment on January 15, 2017. Tonight's installment is on Chapter 10, Using TAR in International Litigation - Does Predictive Coding Work for Non-English Languages?
A. Department of Justice Memorandum
1. A March 2014 DOJ memorandum acknowledges that TAR offers parties a chance to reduce costs in responding to a second request in a proposed merger or acquisition. The memo states that it is not certain that TAR is effective with foreign-language or mixed-language documents.
2. TAR works by recognizing tokens - words, numbers, acronyms, misspellings or gibberish.
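To make the idea of a token concrete, here is a minimal Python sketch. The `tokenize` function and the sample sentence are my own illustration, not anything from the book or from Catalyst's software. It shows that a word, a number, an acronym, a misspelling and plain gibberish all come out the same way, as strings to be counted.

```python
import re

def tokenize(text: str) -> list[str]:
    """Return lowercase tokens: any unbroken run of ASCII letters or digits."""
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]

# Hypothetical sample: an acronym, a misspelling, a number and gibberish.
sample = "FTC reveiw of merger: 2nd request, xqzt 2014"
print(tokenize(sample))
# → ['ftc', 'reveiw', 'of', 'merger', '2nd', 'request', 'xqzt', '2014']
```

Nothing in the function checks whether a token is a real word. The misspelled 'reveiw' and the meaningless 'xqzt' are kept just like 'merger'.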
B. TAR for Non-English Documents
1. TAR doesn't understand English or any other language. It simply analyzes words algorithmically according to their frequency in relevant documents compared to their frequency in irrelevant documents. When a reviewer marks a document as relevant or irrelevant, the software ranks the words in that document based on frequency or proximity. TAR in effect creates huge searches using the words ranked during training.
2. If documents are properly tokenized, the TAR process will work. Words are recognized only because each one has a space or comma before and after it. Chinese and Japanese don't use spaces or punctuation to separate words. The search system used by Catalyst indexes Asian characters.
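The two points in section B can be sketched together in a few lines of Python. This is only an illustration under my own assumptions. The overlapping character pairs in `char_bigrams` are a common stand-in for real Japanese word segmentation and are not necessarily how Catalyst indexes Asian characters. The smoothed log-ratio in `token_weights` is one simple way to compare frequency in relevant and irrelevant documents, and it ignores proximity. All function names and sample texts are hypothetical.

```python
import math
from collections import Counter

def char_bigrams(text: str) -> list[str]:
    """Overlapping two-character tokens, for text written without spaces."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]

def token_weights(relevant_docs, irrelevant_docs) -> dict[str, float]:
    """Weight each token by the log-ratio of its (add-one smoothed)
    frequency in relevant documents to its frequency in irrelevant ones."""
    rel = Counter(t for doc in relevant_docs for t in doc)
    irr = Counter(t for doc in irrelevant_docs for t in doc)
    vocab = set(rel) | set(irr)
    rel_total = sum(rel.values()) + len(vocab)
    irr_total = sum(irr.values()) + len(vocab)
    return {
        t: math.log((rel[t] + 1) / rel_total) - math.log((irr[t] + 1) / irr_total)
        for t in vocab
    }

def score(doc_tokens, weights) -> float:
    """Sum the weights of a document's tokens; unseen tokens count as zero."""
    return sum(weights.get(t, 0.0) for t in doc_tokens)

# Hypothetical training documents, already marked by a reviewer.
relevant = [char_bigrams("契約違反の通知"), "breach of contract notice".split()]
irrelevant = [char_bigrams("昼食の予定"), "lunch schedule for friday".split()]
weights = token_weights(relevant, irrelevant)

# Two unreviewed Japanese snippets, scored with the same weights.
contract_score = score(char_bigrams("契約違反"), weights)
lunch_score = score(char_bigrams("昼食会"), weights)
print(contract_score > lunch_score)  # the contract snippet ranks higher
```

The scoring code never looks at which language a token came from. Once the Japanese text has been broken into tokens, it is ranked exactly like the English text.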
C. Case Study
1. Review of a mixed set of Japanese and English documents. The first step tokenized the Japanese documents - the Japanese text was broken into words and phrases. Lawyers reviewed a set of 500 documents to be used as a reference set by the system for its analysis. Then they reviewed a sample set of 600 documents, marking them relevant or irrelevant. The system then ranked the remaining documents for relevance.
2. The system was able to identify 98% of likely relevant documents and put them at the front of the review queue. Only 48% of the total document set needed to be reviewed to cover the bulk of likely relevant documents.
3. A random sample was reviewed from the remaining 52% of the document set, and the review team found only 3% of these documents to be relevant.
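The sampling step in the case study lends itself to a short worked calculation. The sketch below is my own and rests on stated assumptions. The 52% and 3% figures come from the case study, but the collection size and the count of relevant documents found are hypothetical placeholders, since the chapter outline gives neither. The function names are mine as well.

```python
def estimated_missed(collection_size: int, unreviewed_fraction: float,
                     sample_relevant_rate: float) -> float:
    """Estimate relevant documents left in the unreviewed pile by applying
    the relevant rate seen in the random sample to the whole pile."""
    return collection_size * unreviewed_fraction * sample_relevant_rate

def estimated_recall(relevant_found: float, missed: float) -> float:
    """Share of all estimated relevant documents that the review found."""
    return relevant_found / (relevant_found + missed)

# 52% unreviewed and 3% relevant in the sample are from the case study.
# The collection size of 100,000 is a hypothetical round number.
missed = estimated_missed(100_000, 0.52, 0.03)
print(round(missed))  # about 1,560 documents, or roughly 1.56% of the collection

# Hypothetical count of relevant documents found in the reviewed 48%.
relevant_found = 20_000
print(estimated_recall(relevant_found, missed))
```

Whatever the real collection size was, the first calculation scales with it: 3% of 52% is about 1.56% of the collection left behind as likely relevant documents.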