
Here's another installment in my outline of John Tredennick's 'TAR for Smart People'. I last posted an installment on December 18, 2016. Tonight's installment is on Chapter 9 - Comparing Active Learning to Random Sampling - Using Zipf's Law to Evaluate Which is More Effective for TAR.


A. Schieneman-Gricks Study vs. Grossman-Cormack Study

1. Judgmental Seeds (selecting via continuous active learning and contextual diversity) superior to random seeds as per Catalyst.

2. Cormack-Grossman and Ralph Losey believe random sampling is not as effective.

3. OrcaTec takes the opposite view, favoring random sampling on the ground that judgmental seed selection can introduce bias.

4. Contextual Diversity models the entire document population.

B. What is Contextual Diversity?

1. Contextual Diversity Algorithm identifies documents based on how different they are from ones already seen.

C. Contextual Diversity: Explicitly Modeling the Unknown

1. The algorithm selects the documents containing the highest proportion of terms that do not appear in the documents already reviewed.
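Catalyst's actual algorithm is proprietary; what follows is a minimal Python sketch of the general idea, with hypothetical documents and a simple unseen-term score:

# Minimal sketch: pick the unreviewed document whose terms are least covered
# by the terms already seen in the reviewed documents.
def contextual_diversity_pick(reviewed_docs, unreviewed_docs):
    seen_terms = set()
    for doc in reviewed_docs:
        seen_terms.update(doc.lower().split())

    def novelty(doc):
        terms = set(doc.lower().split())
        if not terms:
            return 0.0
        # fraction of this document's terms not seen in any reviewed document
        return len(terms - seen_terms) / len(terms)

    return max(unreviewed_docs, key=novelty)

reviewed = ["the merger agreement was signed in march"]
unreviewed = ["quarterly earnings report for the pipeline project",
              "the agreement was signed by the board in march"]
print(contextual_diversity_pick(reviewed, unreviewed))
# picks the earnings/pipeline document, since most of its terms are still unseen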

D. Zipf's Law

1. You can expect the most frequent word in a large population to be twice as frequent as the second most common word, three times as frequent as the third most common word, and so on (a short worked example appears at the end of this section).

2. The diagram below depicts random sampling. Each bubble is a subtopic in the document set. The grid shows how random sampling does not necessarily draw a sample from each subtopic.

3. This diagram depicts the contextual diversity approach. The red dots are seed documents selected from each of the subtopics. The subtopics covered by large groups of documents are not oversampled.
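To make item 1 concrete, here is a minimal Python sketch of the relative frequencies Zipf's law predicts (the starting count is invented):

# Zipf's law: the word of rank r appears about 1/r as often as the most frequent word.
top_word_count = 60000   # hypothetical count of the most frequent word in the collection
for rank in range(1, 6):
    expected = top_word_count / rank
    print(f"rank {rank}: ~{expected:,.0f} occurrences")
# rank 1: ~60,000  rank 2: ~30,000  rank 3: ~20,000  rank 4: ~15,000  rank 5: ~12,000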



 
 

Here's another installment in my outline of John Tredennick's 'TAR for Smart People'. I last posted an installment on November 23, 2016. Tonight's installment is on Chapter 8 - Subject Matter Experts.

8. Subject Matter Experts - What Role Should They Play in TAR 2.0 Training?

- Senior attorneys don't want to spend time reviewing irrelevant documents, and review teams don't want to wait for the expert to find time to review seed documents for new uploads.

A. Research Population

Catalyst conducted a study on how important SMEs are to the review process, using TREC (Text REtrieval Conference) assessments of Enron data and comparing the decisions of ordinary reviewers against those of the topic authorities.

B. Methodology

Documents for training were selected at random and included documents on which the SMEs and the reviewers agreed as well as documents on which they disagreed.

C. Experts vs. Review Teams: Which Produced the Better Ranking?

The study assumed that the topic authorities made the correct decisions; their judgments were not independently evaluated.

D. Using the Experts to QC Reviewer Judgments

- In a third set of training documents, the SME corrected the reviewer decisions.

- The prediction software was used to flag documents where the reviewer and the algorithm disagreed:

a. the reviewer tagged the document relevant, but the software ranked it as highly likely to be irrelevant.

b. the reviewer tagged the document non-relevant, but the software ranked it as highly likely to be relevant.

The biggest outliers were selected, and the SME checked the top 10% of the training documents. The ranking was then re-run based on the changed values and plotted as a separate line on the yield curve.
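A minimal sketch of this kind of disagreement-based QC selection (hypothetical data and scoring, not Catalyst's implementation):

# Flag the training documents where the reviewer's tag and the model's
# relevance score disagree the most, so an expert can re-check them.
def qc_candidates(training_docs, top_fraction=0.10):
    # each item: (doc_id, reviewer_tag, model_score), where reviewer_tag is
    # 1 = relevant / 0 = not, and model_score is the predicted probability of relevance
    def disagreement(item):
        _, tag, score = item
        return abs(tag - score)
    ranked = sorted(training_docs, key=disagreement, reverse=True)
    cutoff = max(1, int(len(ranked) * top_fraction))
    return ranked[:cutoff]

docs = [("doc1", 1, 0.05),   # tagged relevant, model says almost certainly not
        ("doc2", 0, 0.92),   # tagged non-relevant, model says almost certainly relevant
        ("doc3", 1, 0.88),
        ("doc4", 0, 0.10)]
print(qc_candidates(docs, top_fraction=0.5))   # doc1 and doc2, the biggest outliers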

E. Plotting the Differences: Expert vs. Reviewer Yield Curves

Yield Curve -

x-axis - percentage of documents reviewed.

y-axis - percentage of relevant documents found.
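A yield curve can be computed directly from a ranked list of review decisions; here is a minimal sketch with invented labels (not the study's data):

# Given documents sorted by predicted relevance (best first), compute the points
# of a yield curve: after reviewing the top k documents, what percentage of all
# relevant documents has been found?
def yield_curve(ranked_labels):
    total_relevant = sum(ranked_labels)
    found = 0
    points = []
    for k, label in enumerate(ranked_labels, start=1):
        found += label
        pct_reviewed = 100 * k / len(ranked_labels)
        recall = 100 * found / total_relevant
        points.append((pct_reviewed, recall))
    return points

labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]   # 1 = relevant, hypothetical ranking
for pct_reviewed, recall in yield_curve(labels):
    print(f"{pct_reviewed:.0f}% reviewed -> {recall:.0f}% recall")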

The gray line representing linear review shows that the percentage of relevant documents found increases at a constant rate when documents are reviewed in random order: when 20% of the documents have been reviewed, 20% of the responsive documents will have been found.

On the first issue tested, the review team's training required reviewing a slightly higher percentage of documents to reach an 80% recall rate, but beyond 80% recall the SMEs and the reviewers performed equally well. The rankings generated by the expert-only review were almost identical to the rankings produced by the review team with QC assistance from the expert. On the second issue, the three methods (the reviewers alone, the experts alone, and the reviewers with the SME correcting some decisions) performed equally well. On the third issue, the expert and reviewer methods were equally good, but the 'expert QC' method did not perform as well in getting from 80% to 90% recall. The fourth issue actually showed the expert significantly underperforming the reviewers. Generally speaking, the three methods produced these results:

Issue 1 - 20% of documents must be reviewed to get 80% recall.

Issue 2 - 50% of documents must be reviewed to get 80% recall.

Issue 3 - 30% of documents must be reviewed to get 80% recall.

Issue 4 - 42% of documents must be reviewed to get 80% recall, except for the expert working alone, who had to review 65%.

The results were obtained using Insight Predict, Catalyst's proprietary algorithm, but they still suggest that the notion that only an SME can train a system may not be correct.

SMEs aren't always available and bill at higher rates. Catalyst suggests experts should spend their time interviewing witnesses and finding important documents to feed into the system, rather than training predictive coding systems.


 
 

Here's another installment in my outline of John Tredennick's 'TAR for Smart People'. On November 13, 2016, I posted an outline of chapters 4 and 5. This part of the outline covers Chapters 6 and 7.

6. Five Myths About Technology Assisted Review - How TAR 2.0 Overcomes the Limits of Earlier Systems

1. You Only Get One Bite At the Apple

With TAR 2.0, the reviewers who create the initial seed set are then given the next most likely relevant documents for review. Their tags are fed back into the system, so it makes better decisions.
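A minimal sketch of that feedback loop, using a naive term-overlap score as a stand-in for a real classifier (not Catalyst's actual system):

# Continuous active learning loop: review a batch, feed the tags back,
# re-rank the remaining documents, review the next batch, and so on.
def score(text, relevant_terms):
    return len(set(text.lower().split()) & relevant_terms)

def cal_loop(documents, judge, batch_size=2, rounds=3):
    unreviewed = dict(documents)      # doc_id -> text
    relevant_terms = set()
    reviewed = {}                     # doc_id -> reviewer's tag
    for _ in range(rounds):
        if not unreviewed:
            break
        # rank the remaining documents and take the top batch for review
        batch = sorted(unreviewed,
                       key=lambda d: score(unreviewed[d], relevant_terms),
                       reverse=True)[:batch_size]
        for doc_id in batch:
            tag = judge(unreviewed[doc_id])          # the human review decision
            reviewed[doc_id] = tag
            if tag:                                  # feed the tag back into the "model"
                relevant_terms |= set(unreviewed[doc_id].lower().split())
            del unreviewed[doc_id]
    return reviewed

docs = {"d1": "merger agreement draft", "d2": "fantasy football picks",
        "d3": "signed merger agreement", "d4": "lunch menu"}
print(cal_loop(docs.items(), judge=lambda text: "merger" in text))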

2. Subject Matter Experts are Required for TAR Training

TAR 2.0 systems take variation in responsiveness decisions into account and present outliers to an expert for correction.

3. You Must Train on Randomly Selected Documents

Modern TAR systems allow you to submit as many training documents as you like. You can use diversity sampling (documents the system knows the least about), systematic sampling (every nth document), and random sampling.
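A minimal sketch of the latter two strategies (the document IDs are invented; diversity sampling resembles the contextual diversity sketch earlier on this page):

import random

# Systematic sampling: take every nth document from the collection.
def systematic_sample(doc_ids, n):
    return doc_ids[::n]

# Random sampling: draw a simple random sample of a given size.
def random_sample(doc_ids, size, seed=None):
    return random.Random(seed).sample(doc_ids, size)

doc_ids = [f"DOC{i:04d}" for i in range(1, 101)]
print(systematic_sample(doc_ids, 10))       # every 10th document
print(random_sample(doc_ids, 5, seed=42))   # 5 documents chosen at random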

4. You Can't Start TAR Training Until You Have All of Your Documents

TAR 1.0 required that all documents be collected before training began. TAR 2.0 systems rank all of the documents each time and don't use a control set to determine the effectiveness of the ranking.

5. TAR Doesn't Work for Non-English Documents

TAR is a mathematical process that ranks documents based on word frequency; it doesn't matter which language the words are in. Chinese, Japanese, and Korean text must first be broken into word segments - this is tokenization - after which a TAR 2.0 system can rank it like any other text.
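Commercial systems use language-specific segmenters; as a crude stand-in just to show what tokenization means for text written without spaces, overlapping character bigrams can be counted like ordinary words (illustrative only):

# Break unspaced CJK text into overlapping two-character tokens.
def char_bigrams(text):
    return [text[i:i + 2] for i in range(len(text) - 1)]

print(char_bigrams("電子証拠開示"))   # ['電子', '子証', '証拠', '拠開', '開示']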

7. TAR 2.0: Continuous Ranking - Is One Bite at the Apple Really Enough?

Can We Reduce the Review Count Even Further?

TAR 2.0 - continuous ranking throughout the review process.

Catalyst uses contextual diversity sampling to select documents dissimilar from those you have already reviewed.

Research Study One - 85K total documents; 11K responsive; 13% richness. With TAR 1.0 you have to review 60K documents to reach 80% recall; with TAR 2.0, only 27K documents. For 95% recall, TAR 1.0 would require reviewing 77K documents, while TAR 2.0 requires only 36K - a 49% savings.

Research Study Two - 57K total documents; 11K responsive. With TAR 1.0 you have to review 29K documents to reach 80% recall; with TAR 2.0, only 23K. For 95% recall, TAR 1.0 would require reviewing 46K documents, while TAR 2.0 requires only 31K - a 25% savings.
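The savings percentages appear to be measured against the size of the full collection rather than against the TAR 1.0 review count - that is my reading, not something stated in the outline - and a quick check is roughly consistent with it:

# Rough check of the savings figures, treating the reduction in review effort
# as a share of the full collection (counts are rounded to the nearest thousand).
def savings_vs_collection(tar1_reviewed, tar2_reviewed, total_docs):
    return 100 * (tar1_reviewed - tar2_reviewed) / total_docs

print(round(savings_vs_collection(77_000, 36_000, 85_000)))   # ~48, study one (49% per the outline)
print(round(savings_vs_collection(46_000, 31_000, 57_000)))   # ~26, study two (25% per the outline)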

Research Study Three - Privilege Review: 85K documents; only 1K privileged. With TAR 1.0, 2K documents must be reviewed to reach 80% recall; there is no gain from TAR 2.0 because the process would already be complete. For 95% recall, 18K documents must be reviewed with TAR 1.0, but only 14K with TAR 2.0. A privilege review should be supplemented with a check of the names and organizations in the communications.

Subject Matter Experts - it is better to have the expert review a portion of the documents tagged by the review team than to have him or her review all of the training documents.


 
 

Sean O'Shea has more than 20 years of experience in the litigation support field with major law firms in New York and San Francisco.   He is an ACEDS Certified eDiscovery Specialist and a Relativity Certified Administrator.

The views expressed in this blog are those of the owner and do not reflect the views or opinions of the owner’s employer.



