
Here's another installment in my outline of John Tredennick's 'TAR for Smart People'. I last posted an installment on January 27, 2017. This night's installment is on Chapter 11, Case Study: Using TAR to Find Hot Docs for Depositions: How Insight Predict Succeeded Where Keywords Failed, and Chapter 12, Case Study: Using TAR to Expedite Multi-Language Review.

11. Case Study: Using TAR to Find Hot Docs for Depositions: How Insight Predict Succeeded Where Keywords Failed

A. Challenge: Quickly Find Hot Docs in Garbled Production

1. Multi-district litigation involving a medical device. 77K electronic documents were produced shortly before depositions. There was no metadata, and poor scanning led to bad OCR. Of the results of focused keyword searches, only 5% were possible deposition exhibits and 46% were relevant.

B. Solution: Using Insight Predict to Prioritize Hot Documents for Review

1. The lead attorney QC'd the documents already tagged as hot, plus a few hundred more targeted search hits and samples.

2. Using those seeds, Insight Predict ranked the entire document set for hot documents.

3. The top 1,000 documents were then pulled for attorneys to evaluate.

4. TAR increased the rate of potential exhibits to 27% and of relevant documents to 65%.

C. Good Results from Difficult Data

1. Insight Predict allows you to use an unlimited number of seeds from judgmental sampling.

12. Case Study: Using TAR to Expedite Multi-Language Review: How Insight Predict's Unique Capabilities Cut Review by Two-Thirds

A. The Challenge: Quickly Review Multi-Language Documents

A shareholder class action alleging violations of securities laws required review of a mixed set of Spanish and English documents - 66K files to be reviewed for responsive documents to produce.

B. The Solution: Prioritize Documents for Review Using Insight Predict

1. A few hundred emails from key custodians that attorneys had already reviewed were used as seeds to train the Predict engine.

2. Separate rankings were generated for the two languages. On-demand batches were sent to the review team from the top of each ranking.

3. After Predict was trained with the initial seeds, the responsiveness rate of the batches sent to the review team increased from 10% to 40%.

4. 91% recall was achieved after reviewing only one-third of the documents.

C. Uncovering a Hidden Trove

1. Contextual Diversity Sampling - Predict looks for the largest sets of documents that have no reviewer judgments, finds the best examples within them, and sends those examples to the review team. (A rough sketch of this idea follows this list.)

2. In the case study, contextual diversity sampling found hundreds of spreadsheets that had been omitted from the initial manual seeds.

3. Every responsive document was sorted into the top third of the review set, making review of the remaining two-thirds unnecessary.
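Catalyst hasn't published the internals of contextual diversity sampling, but here's a rough Python sketch of the general idea as I understand it: cluster the documents that have no reviewer judgments and pull a representative example from the largest clusters the reviewers haven't touched yet. The function, parameters, and thresholds below are my own illustrative assumptions, not Catalyst's code.

```python
# Illustrative approximation only - NOT Catalyst's implementation.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def contextual_diversity_sample(unreviewed_texts, reviewed_texts, n_clusters=50, n_picks=10):
    n_clusters = min(n_clusters, len(unreviewed_texts))
    vectorizer = TfidfVectorizer(max_features=50_000)
    X_all = vectorizer.fit_transform(unreviewed_texts + reviewed_texts)
    X_unrev = X_all[: len(unreviewed_texts)]
    X_rev = X_all[len(unreviewed_texts):]

    # Cluster only the documents that have no reviewer judgments.
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X_unrev)
    reviewed_clusters = set(km.predict(X_rev)) if len(reviewed_texts) else set()

    # Rank the clusters containing no reviewed documents by size - biggest "unseen" pockets first.
    sizes = np.bincount(km.labels_, minlength=n_clusters)
    unseen = [c for c in np.argsort(sizes)[::-1] if c not in reviewed_clusters]

    picks = []
    for c in unseen[:n_picks]:
        members = np.where(km.labels_ == c)[0]
        # Send the document closest to the cluster centroid as that cluster's best example.
        dists = np.linalg.norm(X_unrev[members].toarray() - km.cluster_centers_[c], axis=1)
        picks.append(int(members[np.argmin(dists)]))
    return picks  # indices into unreviewed_texts to batch out to the review team
```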


 
 

On Wednesday, February 1, 2017, I attended a presentation at the Legaltech New York show hosted by Driven. The presentation was entitled Next Frontier in TAR: Choose Your Own Algorithm. Tara Emory and Phil Favro of Driven hosted David Grossman, the associate director of Georgetown University's Information Retrieval Lab. [He is not the same Grossman who prepared the Grossman-Cormack Glossary; that's Maura Grossman of Wachtell, Lipton, Rosen & Katz.] The following is my brief summary of the PowerPoint presentation that was shown and of Grossman's comments. This is a difficult subject, and there was not time for a comprehensive treatment of it. This post simply presents some basic concepts discussed during the hour-long talk and notes some commonly referenced topics.

Algorithms that allow for categorization have been around for decades, and many are in the public domain. However, there are no pure TAR algorithms in the public domain. Grossman noted that it was not possible to draw conclusions from the early TREC Legal Track studies about which algorithms work best; different groups had different rates of success depending on the types of documents being reviewed.

Favro noted Judge Facciola's decision in United States v. O'Keefe, 537 F. Supp. 2d 14 (D.D.C. 2008), in which he stated that for lawyers or a court to decide which search terms would work best during review would be "truly to go where angels fear to tread", and suggested that making determinations about the best algorithms to use poses a similar challenge.

Emory noted that certain algorithms tolerate noise better than others. Grossman usually uses a system that combines a 'witch's brew' of algorithms. There are five common types of algorithms used for TAR:

1. Latent Semantic Indexing

2. Logistic Regression

3. Support Vector Machine Learning

4. Gradient Boosting

5. Recurrent Neural Networks

Latent Semantic Indexing involves a word-occurrence matrix. It is not a machine learning algorithm. It is based on the concept that words used in the same context have similar meanings. A search will return results that are conceptually similar to the search terms even if they don't actually contain any of those terms. Think in terms of a matrix with rows listing unique words and columns listing word counts per paragraph.
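Here's a minimal Python sketch of the concept, using scikit-learn's TruncatedSVD as a stand-in for latent semantic indexing (the documents and query are made up). After the term matrix is reduced to a handful of latent "concepts", a query about pacemakers can score well against a document about a cardiac device even though the two share no terms.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "The physician implanted the cardiac device.",
    "The surgeon placed a pacemaker in the patient.",
    "Quarterly sales figures for the northeast region.",
]

# Term-document matrix: one column per unique word, one row per document.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Reduce the matrix to a small number of latent "concepts".
lsi = TruncatedSVD(n_components=2, random_state=0)
X_lsi = lsi.fit_transform(X)

# Compare a query to the documents in concept space rather than by shared terms.
query = lsi.transform(vectorizer.transform(["pacemaker implant"]))
print(cosine_similarity(query, X_lsi))
```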

Logistic Regression uses statistics to estimate the probability that a document is responsive.

Support Vector Machine Learning involves the use of a geometric model, finding a hyperplane to separate relevant documents from non-relevant documents.

Gradient Boosting uses a decision tree algorithm; it's somewhat similar to a Boolean search. For each document, votes from multiple decision trees are tallied.
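For a concrete picture, here's a short sketch that uses generic scikit-learn classifiers as stand-ins for the three supervised approaches above (the seed documents and labels are invented, and this is not any TAR vendor's actual implementation): logistic regression returns a probability of responsiveness, the linear SVM scores each document by its distance from the separating hyperplane, and gradient boosting pools the votes of many small decision trees.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Made-up seed documents and reviewer calls (1 = responsive, 0 = not responsive).
train_docs = [
    "recall testing results for the implant attached",
    "please approve the budget for the device recall",
    "lunch menu for friday",
    "parking garage closed next week",
]
labels = [1, 1, 0, 0]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(train_docs).toarray()

log_reg = LogisticRegression().fit(X, labels)      # probability estimates
svm = LinearSVC().fit(X, labels)                   # separating hyperplane
gbt = GradientBoostingClassifier().fit(X, labels)  # ensemble of small decision trees

new_doc = vectorizer.transform(["results of the recall testing"]).toarray()
print(log_reg.predict_proba(new_doc)[:, 1])  # estimated probability the document is responsive
print(svm.decision_function(new_doc))        # signed distance from the hyperplane
print(gbt.predict_proba(new_doc)[:, 1])      # the trees' combined vote, expressed as a probability
```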

Recurrent Neural Networks take the sequence of words in a document and map it to internal representations. Grossman noted that at the ImageNet competition in Canada (which evaluates algorithms for categorizing images and videos), a recurrent neural net algorithm won the challenge by a margin 20% greater than that of the algorithm which won the previous year.
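As a rough illustration only (not the model discussed in the talk), a recurrent network reads a document one token at a time and carries an internal state forward, so word order matters, unlike a bag-of-words model. A minimal Keras sketch, with arbitrary sizes and made-up text:

```python
import tensorflow as tf

# Turn raw text into integer token sequences (padded/truncated to 200 tokens).
vectorize = tf.keras.layers.TextVectorization(max_tokens=20_000, output_sequence_length=200)
texts = ["recall testing results attached", "friday lunch menu"]  # made-up examples
vectorize.adapt(texts)
sequences = vectorize(texts)

# The RNN reads each sequence token by token, carrying its state forward.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=20_000, output_dim=64),
    tf.keras.layers.SimpleRNN(64),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # probability of responsiveness
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(sequences, tf.constant([1.0, 0.0]), epochs=1, verbose=0)
```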

The two key variables in TAR algorithms are tokenization (how text strings are broken up) and feature weighting. Bags of words or n-grams (sequences of consecutive words or characters) may be used as tokens. Features are the text fragments and metadata used to classify documents. There is a distinction between TAR 1.0, which involves iterative sampling, and TAR 2.0, which uses continuous active learning. Grossman made reference to a 1965 article by J.J. Rocchio, Relevance Feedback in Information Retrieval, which concerns how marking the results of an initial query as relevant or non-relevant can be used to perform a refined query.
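To make the tokenization and feature-weighting points concrete, here's a small sketch showing word tokens versus character n-grams, TF-IDF weighting versus raw counts, and the classic Rocchio update, which nudges a query vector toward documents judged relevant and away from documents judged non-relevant. The example strings and the alpha/beta/gamma weights are conventional illustrations, not values from the talk.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

text = ["device recall report"]

# Two tokenization choices: a bag of words vs. character 3-grams.
print(CountVectorizer().fit(text).get_feature_names_out())
print(CountVectorizer(analyzer="char_wb", ngram_range=(3, 3)).fit(text).get_feature_names_out())

# Feature weighting: TF-IDF downweights terms that appear in many documents.
print(TfidfVectorizer().fit_transform(["recall recall report", "budget report"]).toarray())

# Rocchio relevance feedback: move the query vector toward documents the
# reviewer judged relevant and away from documents judged non-relevant.
def rocchio(query_vec, relevant_vecs, nonrelevant_vecs, alpha=1.0, beta=0.75, gamma=0.15):
    new_query = alpha * query_vec
    if len(relevant_vecs):
        new_query += beta * np.mean(relevant_vecs, axis=0)
    if len(nonrelevant_vecs):
        new_query -= gamma * np.mean(nonrelevant_vecs, axis=0)
    return new_query
```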

Favro referred to the 'TAR tax' as defined by Gareth Evans of Gibson Dunn: opposing parties may attempt to extract concessions from a party in exchange for agreeing to that party's use of TAR.


 
 

Here's another installment in my outline of John Tredennick's 'TAR for Smart People'. I last posted an installment on January 15, 2017. This night's installment is on Chapter 10, Using TAR in International Litigation - Does Predictive Coding Work for Non-English Languages?

A. Department of Justice Memorandum

1. A March 2014 DOJ memorandum acknowledges that TAR offers parties a chance to reduce costs in responding to a second request in a proposed merger or acquisition. The memo states that it is not certain that TAR is effective with foreign-language or mixed-language documents.

2. TAR works by recognizing tokens - words, numbers, acronyms, misspellings or gibberish.

B. TAR for Non-English Documents

1. TAR doesn't understand English or any other language. It simply analyzes words algorithmically, according to their frequency in relevant documents compared to their frequency in irrelevant documents. When a reviewer marks a document as relevant or irrelevant, the software ranks the words in the document based on frequency or proximity. TAR in effect creates huge searches using the words ranked during training. (A toy sketch of this frequency weighting follows this section.)

2. If documents are properly tokenized, the TAR process will work. Words are only recognized because they have a space or punctuation mark before and after them. Chinese and Japanese don't use spaces or punctuation to separate words. The search system used by Catalyst indexes Asian characters.
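As a toy sketch of the frequency weighting described in point 1 above (the documents, labels, and add-one smoothing are my own invention - real TAR engines use proper statistical models rather than this crude ratio), count how often each token appears in documents marked relevant versus irrelevant, then score unreviewed documents with those weights:

```python
from collections import Counter

relevant = ["recall testing results attached", "device recall budget approved"]
irrelevant = ["friday lunch menu", "parking garage notice"]

# Count how often each token appears in relevant vs. irrelevant training documents.
rel_counts = Counter(word for doc in relevant for word in doc.split())
irr_counts = Counter(word for doc in irrelevant for word in doc.split())

def token_weight(word):
    # Add-one smoothing so unseen tokens do not divide by zero.
    return (rel_counts[word] + 1) / (irr_counts[word] + 1)

def score(doc):
    words = doc.split()
    return sum(token_weight(w) for w in words) / max(len(words), 1)

# Unreviewed documents ranked with the training-derived weights, highest score first.
print(sorted(["results of recall testing", "menu for friday lunch"], key=score, reverse=True))
```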

C. Case Study

1. Review of a mixed set of Japanese and English documents. The first step tokenized the Japanese documents - the Japanese text was broken into words and phrases. Lawyers reviewed a set of 500 documents to be used as a reference set by the system for its analysis. Then they reviewed a sample set of 600 documents, marking them relevant or irrelevant. The system ranked the remaining documents for relevance. (A small tokenization sketch follows this list.)

2. The system was able to identify 98% of likely relevant documents and put them at the front of the review queue. Only 48% of the total document set needed to be reviewed to cover the bulk of the likely relevant documents.

3. A random sample was reviewed from the remaining 52% of the document set, and the review team found only 3% of those documents to be relevant.
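As a small illustration of the tokenization step mentioned in point 1, one common workaround for text without word boundaries is to index character n-grams instead of whitespace-delimited words. The Japanese strings below are arbitrary examples, and this is not Catalyst's indexing pipeline:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Japanese has no spaces between words, so whitespace tokenization fails.
# Character bigrams give the engine usable tokens without a dictionary.
docs = [
    "製品の回収について報告します",    # "reporting on the product recall"
    "来週の会議の議題を送ります",      # "sending next week's meeting agenda"
    "Recall testing results attached",
]

vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(2, 2))
X = vectorizer.fit_transform(docs)
print(X.shape)  # every document, Japanese or English, is now a vector of bigram features
print([t for t in vectorizer.get_feature_names_out() if "回" in t])  # e.g. ['の回', '回収']
```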


 
 

