Legal Tech Presentation on TAR by David Grossman
On Wednesday, February 1, 2017, I attended a presentation at the New York Legal Tech show hosted by Driven. The presentation was entitled Next Frontier in TAR: Choose Your Own Algorithm. Tara Emory and Phil Favro of Driven hosted David Grossman, the associate director of Georgetown University's Information Retrieval Lab. [He is not the same Grossman who prepared the Grossman-Cormack Glossary. That is Maura Grossman of Wachtell Lipton LLP.] The following is my brief summary of the PowerPoint presentation that was shown and of Grossman's comments. This is a difficult subject, and there was not time to study it comprehensively. This post simply presents some basic concepts discussed during the hour-long talk and notes some commonly referenced topics.
Algorithms that allow for categorization have been around for decades, and many are in the public domain. However, there are no pure TAR algorithms in the public domain. Grossman noted that it was not possible to draw conclusions from the early TREC Legal Track studies about which algorithms work best. Different groups had different rates of success depending on the types of documents being reviewed.
Favro noted Judge Facciola's decision in United States v. O'Keefe, 537 F.Supp.2d 14 (D.D.C. 2008), in which he stated that for lawyers or a court to decide which search terms would work best during review would be "truly to go where angels fear to tread". Favro suggested that determining the best algorithms to use poses a similar challenge.
Emory noted that certain algorithms tolerate noise better than others. Grossman usually uses a system that has a 'witch's brew' of algorithms. There are five common types of algorithms used for TAR:
1. Latent Semantic Indexing
2. Logistic Regression
3. Support Vector Machine Learning
4. Gradient Boosting
5. Recurrent Neural Networks
Latent Semantic Indexing involves a word occurrence matrix. It is not a machine learning algorithm. It is based on the concept that words used in the same context have similar meanings. A search will return results that are conceptually similar to the search terms even if they don't actually contain any of those terms. Think in terms of a matrix with one row for each unique word and one column for each paragraph, where each cell holds the count of that word in that paragraph.
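To make the matrix idea concrete, here is a minimal sketch in Python that builds only the word occurrence matrix. The function name and the three sample passages are my own assumptions, not anything shown in the talk. Latent Semantic Indexing would go on to apply a truncated singular value decomposition to a matrix like this one; that step is left out here.

```python
from collections import Counter

def occurrence_matrix(passages):
    """Return (vocabulary, matrix): one row per unique word, one column per passage."""
    counts = [Counter(passage.lower().split()) for passage in passages]
    vocabulary = sorted(set().union(*counts))
    # A Counter returns 0 for words that do not occur in a passage.
    matrix = [[count[word] for count in counts] for word in vocabulary]
    return vocabulary, matrix

# Hypothetical sample passages, for illustration only.
passages = [
    "the contract was signed",
    "the agreement was signed",
    "lunch was late",
]

vocabulary, matrix = occurrence_matrix(passages)
for word, row in zip(vocabulary, matrix):
    print(f"{word:10s} {row}")
```

In this toy data "contract" and "agreement" never appear together, but they appear alongside the same neighboring words. That kind of shared context is what the technique relies on to treat terms as related.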
Logistic Regression uses statistics to estimate the probability that the document is responsive.
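As a rough illustration of what "estimating the probability" means, the sketch below computes a logistic score for a document from term counts. The weights, bias, and term names are hypothetical values picked by hand; a real system would learn them from documents that reviewers have already coded.

```python
import math

def responsive_probability(features, weights, bias):
    """Logistic model: squash a weighted sum of features into a probability in (0, 1)."""
    score = bias + sum(weights.get(name, 0.0) * value
                       for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical weights: positive values push toward "responsive".
weights = {"merger": 1.2, "indemnify": 0.9, "lunch": -1.5}
bias = -0.5

doc_a = {"merger": 2, "indemnify": 1}   # term counts for one document
doc_b = {"lunch": 1}

prob_a = responsive_probability(doc_a, weights, bias)
prob_b = responsive_probability(doc_b, weights, bias)
print(f"doc_a: {prob_a:.3f}  doc_b: {prob_b:.3f}")
```

Documents can then be ranked by this probability, or compared against a cutoff chosen by the review team.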
Support Vector Machine Learning involves the use of a geometric model, finding a hyperplane to separate relevant documents from non-relevant documents.
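The geometric picture can be sketched as follows. This shows only how a hyperplane that has already been found is used to classify a document; it does not show the training step, which picks the hyperplane leaving the widest margin to the nearest training documents. The two-feature representation and the hyperplane values are assumptions made for illustration.

```python
import math

def signed_distance(point, normal, offset):
    """Signed distance from a point to the hyperplane normal . x + offset = 0."""
    dot = sum(n * p for n, p in zip(normal, point))
    norm = math.sqrt(sum(n * n for n in normal))
    return (dot + offset) / norm

def classify(point, normal, offset):
    """Documents on the positive side of the hyperplane are called relevant."""
    return "relevant" if signed_distance(point, normal, offset) >= 0 else "non-relevant"

# Hypothetical hyperplane over two features:
# [count of "merger", count of "lunch"].
normal = [1.0, -1.0]
offset = 0.0

doc_a = [3.0, 0.0]
doc_b = [0.0, 2.0]
print(classify(doc_a, normal, offset), round(signed_distance(doc_a, normal, offset), 3))
print(classify(doc_b, normal, offset), round(signed_distance(doc_b, normal, offset), 3))
```

The size of the distance gives a rough sense of how far a document sits from the boundary, which is one way such a system can rank documents rather than just label them.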
Gradient Boosting uses a decision tree algorithm. It's somewhat similar to a Boolean search. Votes are counted for results from multiple decision trees for each document.
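Here is a small sketch of the idea, using one-split decision trees ("stumps") on a single numeric feature. In the usual formulation of gradient boosting the trees are built one after another, each fitted to the errors left by the trees before it, and their outputs are added together. The data, the single feature, the learning rate, and the function names are all assumptions made for illustration.

```python
def fit_stump(xs, residuals):
    """Pick the threshold on one feature that best splits the residuals."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:  # every x is identical, so no split is possible
        mean = sum(residuals) / len(residuals)
        return (float("inf"), mean, mean)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value

def fit_boosted(xs, ys, rounds=20, learning_rate=0.5):
    """Fit stumps in sequence, each one to the current residual errors."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(xs)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, stumps, learning_rate

def predict_boosted(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)

# Hypothetical training data: x = count of a key term, y = 1 if coded responsive.
xs = [0, 0, 1, 3, 4, 5]
ys = [0, 0, 0, 1, 1, 1]
model = fit_boosted(xs, ys)
print(round(predict_boosted(model, 5), 3), round(predict_boosted(model, 0), 3))
```

Each stump is a single yes/no test on a feature, which is where the resemblance to a Boolean search comes from.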
Recurrent Neural Networks take the sequence of words and map it to internal representations. Grossman noted that at the ImageNet competition in Canada (which evaluates algorithms for categorizing images and videos) a recurrent neural net algorithm won the challenge by a 20% greater margin than the algorithm that won the previous year.
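To show what it means for a model to take the sequence of words into account, below is a minimal sketch of the forward pass of a simple recurrent network. The word vectors and weight matrices are hypothetical numbers picked by hand; a real network would learn them from training data.

```python
import math

# Hypothetical toy word vectors and weights, for illustration only.
EMBEDDINGS = {"not": [1.0, 0.0], "privileged": [0.0, 1.0]}
W_INPUT = [[0.5, -0.3], [0.2, 0.8]]
W_HIDDEN = [[0.1, 0.4], [-0.6, 0.3]]

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def rnn_encode(words):
    """Read the words in order, updating a hidden state after each one."""
    hidden = [0.0, 0.0]
    for word in words:
        from_input = matvec(W_INPUT, EMBEDDINGS[word])
        from_hidden = matvec(W_HIDDEN, hidden)
        hidden = [math.tanh(a + b) for a, b in zip(from_input, from_hidden)]
    return hidden

forward = rnn_encode(["not", "privileged"])
backward = rnn_encode(["privileged", "not"])
print(forward)
print(backward)
```

The same two words in a different order produce a different internal state, whereas a plain bag-of-words count would treat the two phrases as identical.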
The two key variables in TAR algorithms are tokenization (how text strings are broken up) and feature weighting. Bags of words or N-grams (consecutive words or characters) may be used as tokens. Features are the text fragments and metadata that are used to classify documents. There is a distinction between TAR 1.0, which involves iterative sampling, and TAR 2.0, which uses continuous active learning. Grossman made reference to a 1965 article by J.J. Rocchio, Relevance Feedback in Information Retrieval, which concerns how marking the results of an initial query as relevant or non-relevant can be used to perform a refined query.
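The sketch below illustrates both ideas from the paragraph above: splitting text into word or character N-grams, and a Rocchio-style update that moves a query toward documents marked relevant and away from those marked non-relevant. The function names, the sample texts, and the alpha, beta, and gamma weights are assumptions chosen for illustration.

```python
from collections import Counter

def word_ngrams(text, n):
    """Consecutive runs of n words."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def char_ngrams(text, n):
    """Consecutive runs of n characters."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def centroid(vectors):
    """Average of a list of term-count vectors."""
    if not vectors:
        return {}
    total = Counter()
    for vector in vectors:
        total.update(vector)
    return {term: count / len(vectors) for term, count in total.items()}

def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Refined query = alpha*query + beta*mean(relevant) - gamma*mean(non_relevant)."""
    rel = centroid(relevant)
    non = centroid(non_relevant)
    terms = set(query) | set(rel) | set(non)
    return {t: alpha * query.get(t, 0) + beta * rel.get(t, 0) - gamma * non.get(t, 0)
            for t in terms}

print(word_ngrams("the merger agreement", 2))
print(char_ngrams("merger", 3))

# Hypothetical query and reviewer feedback, using single words as tokens.
query = Counter(word_ngrams("merger agreement", 1))
relevant = [Counter(word_ngrams("merger agreement indemnify clause", 1))]
non_relevant = [Counter(word_ngrams("lunch agreement", 1))]
refined = rocchio(query, relevant, non_relevant)
print(sorted(refined.items(), key=lambda item: -item[1]))
```

Terms that appear only in the relevant feedback, such as "indemnify", enter the refined query with a positive weight, while terms that appear only in the non-relevant feedback, such as "lunch", get a negative one.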
Favro referred to the 'TAR Tax' as defined by Gareth Evans of Gibson Dunn LLP. An opposing party may attempt to extract concessions from a party in exchange for agreeing to that party's use of TAR.