TAR for Smart People
Here's an outline for the first three chapters:
Losey in introduction disagrees with failure to use SMEs (subject matter experts)
1. INTRODUCTION TO TAR
Analogy to how Pandora works – thumbs up or thumbs down.
TAR, CAR, P/C, Predictive Ranking
Main justification is cost savings; TAR can eliminate up to 95% of documents from review.
Some systems require that collection be completed first; but not Catalyst's Insight Predict.
Begin either by feeding in relevant documents or by letting P/C select documents randomly for review.
Documents get ranked by relevance; this ranking must be tested. Conduct systematic random sampling from beginning of ranking to the end. Every nth document . . .; also random sample for relevancy in total set.
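The "every nth document" idea can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's method: the function name `systematic_sample` and its parameters are hypothetical, and it assumes the ranking is available as a simple list of document IDs ordered from most to least relevant.

```python
import random

def systematic_sample(ranked_ids, sample_size, seed=None):
    """Pick every nth document from a ranked list, starting at a random offset."""
    if sample_size <= 0 or sample_size > len(ranked_ids):
        raise ValueError("sample_size must be between 1 and len(ranked_ids)")
    step = len(ranked_ids) // sample_size          # the "n" in "every nth document"
    start = random.Random(seed).randrange(step)    # random start inside the first interval
    return [ranked_ids[start + i * step] for i in range(sample_size)]

# Hypothetical ranking of 1,000 documents, sampled at 10 evenly spaced points.
sample = systematic_sample(list(range(1000)), 10, seed=1)
print(sample)
```

Because the sample is spread evenly from the top of the ranking to the bottom, the reviewed sample shows how relevance falls off along the ranking rather than only at the top.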
"you may stop after reviewing the first 20% of the ranking because you have found 80% of the relevant documents. Your argument is that the cost to review the remaining 80% of the document population just to find the remaining 20% of the relevant documents is unduly burdensome." This is how a ranking with predictive coding might look:
. . . the yellow segments being the relevant documents.
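The 20% / 80% argument is a statement about recall at a given review depth. The sketch below works through it on a made-up ranking of 100 documents; the function name `recall_at_depth` and the data are hypothetical, and in practice the relevance labels would come from a reviewed sample rather than from the full set.

```python
def recall_at_depth(ranked_labels, fraction):
    """Share of all relevant documents found in the top `fraction` of the ranking."""
    total_relevant = sum(ranked_labels)
    if total_relevant == 0:
        return 0.0
    depth = int(len(ranked_labels) * fraction)
    return sum(ranked_labels[:depth]) / total_relevant

# Hypothetical ranking of 100 documents: 1 = relevant, 0 = not relevant.
# 8 of the 10 relevant documents sit in the top 20 positions.
ranking = [1] * 8 + [0] * 12 + [0] * 40 + [1] + [0] * 30 + [1] + [0] * 8

print(recall_at_depth(ranking, 0.20))  # prints 0.8
```

Here reviewing the remaining 80 documents would surface only 2 more relevant ones, which is the shape of the burden argument quoted above.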
2. CONTINUOUS ACTIVE LEARNING FOR TECHNOLOGY ASSISTED REVIEW
Continuous Active Learning (CAL) vs. SPL or SAL. The Grossman/Cormack study indicates that CAL performed better in eight different case studies.
TAR 1.0 – SPL / SAL
SME tags random sample of 500 plus documents as a control set.
SAL or SPL training process: SME tags documents as relevant or irrelevant – the seed set.
TAR software uses these selections to train a classification / ranking algorithm to identify other relevant documents, and then compares its results against the control set to gauge accuracy.
SME may need to train more to improve classifier.
When the classifier is stable, i.e. no longer getting better at identifying relevant documents, the process ends.
The algorithm is then run against the entire document production.
SME reviews a random sample to tell how well the algorithm did in finding relevant documents.
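Gauging accuracy against the control set usually comes down to comparing the classifier's calls with the SME's tags. A minimal sketch, assuming both are available as parallel lists of booleans (the function name `precision_recall` and the sample data are hypothetical, and the notes do not say which metrics a given TAR product reports):

```python
def precision_recall(predicted, actual):
    """Compare classifier calls with SME tags on the control set."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)        # predicted relevant, SME agrees
    fp = sum(1 for p, a in pairs if p and not a)    # predicted relevant, SME disagrees
    fn = sum(1 for p, a in pairs if not p and a)    # relevant document the classifier missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical five-document control set.
sme_tags = [True, True, False, False, True]
classifier_calls = [True, False, False, True, True]
print(precision_recall(classifier_calls, sme_tags))
```

Tracking these two numbers after each training round is one way to see whether the classifier has stopped improving.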
Simple Passive Learning uses random documents for training.
Simple Active Learning does not rely on random documents. Instead, it suggests starting with whatever relevant documents you can find, often through keyword search, to initiate the training. The system then selects marginal documents it is unsure about, and the SME has to spend a lot of time on these.
Judgments on relevant documents don't get fed back.
Requires SME or senior attorney to review thousands of documents.
Doesn't handle rolling uploads well.
Doesn't handle data sets with low level of relevancy well.
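The difference between how SPL and SAL pick training documents can be shown side by side. This is a rough sketch under stated assumptions: the function names are hypothetical, and SAL's "unsure" documents are modeled as those whose estimated probability of relevance is closest to 0.5, which is a common form of uncertainty sampling but not necessarily what any particular product does.

```python
import random

def spl_batch(unreviewed_ids, batch_size, seed=None):
    """Simple Passive Learning: training documents are chosen at random."""
    return random.Random(seed).sample(unreviewed_ids, batch_size)

def sal_batch(scores, batch_size):
    """Simple Active Learning: pick the documents the classifier is least sure about.

    `scores` maps doc_id -> estimated probability of relevance in [0, 1].
    """
    return sorted(scores, key=lambda d: abs(scores[d] - 0.5))[:batch_size]

# Hypothetical classifier scores for four unreviewed documents.
scores = {"a": 0.95, "b": 0.52, "c": 0.10, "d": 0.45}
print(sal_batch(scores, 2))  # prints ['b', 'd']
```

The marginal documents "b" and "d" are exactly the ones that are hardest to judge, which is why the SME spends so much time on them.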
TAR 2.0 – CAL
No control set needed. Rankings fluctuate across the entire set continuously.
Can rank millions of documents in minutes.
In the eight case studies conducted by Grossman and Cormack, SPL TAR required the review of an average of just under 80% more documents than CAL to reach a 75% recall rate, differing by as many as 93,000 documents in one case and as few as 3,000 in another.
Debate about the utility of 'relevance feedback' – feeding the highest ranked documents to the reviewers for their judgment.
Losey recommends a multi-modal approach on his blog, e-Discovery Team.
Catalyst combats bias with contextual diversity samples that show the reviewer very different documents – samples of clusters.
Continuous Learning Process
Find as many relevant documents as possible and feed them into the system for ranking.
Have team review mostly highly ranked docs, but some with contextual diversity sampling.
Senior attorney should QC some documents that have been reviewed.
Continue until recall rate reached.
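The process above can be sketched as a loop. This is a toy illustration only, not Catalyst's algorithm: the ranking model is a crude word-count scorer swapped in for a real classifier, contextual diversity sampling and QC are left out, and every name (`cal_review`, `judge`, `est_total_relevant`) is hypothetical. It also assumes an estimate of the total number of relevant documents is available, which in practice would come from sampling.

```python
from collections import Counter

def score(text, weights):
    """Sum the learned weights of the distinct words in a document."""
    return sum(weights[w] for w in set(text.lower().split()))

def cal_review(docs, judge, seed_ids, batch_size, target_recall, est_total_relevant):
    """Review top-ranked documents in batches, re-ranking after every batch."""
    judged = {}          # doc_id -> reviewer's relevance call
    weights = Counter()  # word -> (times seen in relevant docs) - (times seen in others)

    def learn(doc_id):
        relevant = judge(doc_id)
        judged[doc_id] = relevant
        for w in set(docs[doc_id].lower().split()):
            weights[w] += 1 if relevant else -1

    for doc_id in seed_ids:            # step 1: feed in known relevant documents
        learn(doc_id)

    # Continue until the estimated recall target is reached.
    while sum(judged.values()) < target_recall * est_total_relevant:
        unreviewed = [d for d in docs if d not in judged]
        if not unreviewed:
            break
        # Re-rank everything not yet reviewed, then review the top of the ranking.
        unreviewed.sort(key=lambda d: score(docs[d], weights), reverse=True)
        for doc_id in unreviewed[:batch_size]:
            learn(doc_id)
    return judged

# Hypothetical six-document collection; documents 1, 3 and 5 are relevant.
docs = {1: "merger price talks", 2: "lunch menu friday", 3: "merger agreement draft",
        4: "parking notice", 5: "price of merger deal", 6: "holiday party lunch"}
relevant = {1, 3, 5}
result = cal_review(docs, lambda d: d in relevant, seed_ids=[1],
                    batch_size=1, target_recall=1.0, est_total_relevant=3)
print(sorted(result))  # prints [1, 3, 5]
```

In this toy run the reviewer never has to look at the three irrelevant documents, because every judgment is fed straight back into the ranking.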
Differences Between TAR 1.0 and 2.0
One-time learning vs. continuous active learning
Trains on Small Set vs. Analysis of Whole Collection Each Time.
SME does training vs. Review Teams train during review.
Use Random Seeds vs. Judgmental Seeds
Doesn't Work Well with Low Richness or Small Cases vs. Good for Low Richness & Small Cases.
3. HOW MUCH CAN CAL SAVE?
Grossman/Cormack Study Results
CAL yields superior results to SAL with uncertainty sampling, while avoiding the issue of stabilization – determining when training is adequate.
In sample sets, to get 75% of relevant files with 2K training seeds:
With 293K-1.2M docs: CAL 6K-18K docs reviewed; SAL 5K-237K docs reviewed; SPL 9K-521K docs reviewed.
Averages in study
Sample size 640K docs.
CAL 9K docs reviewed.
SPL 207K docs reviewed
SAL 93K docs reviewed.
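The averages above are easier to compare as shares of the collection. A small worked calculation follows; it uses only the figures in these notes, and treating the 640K figure as the average collection size is an assumption. The names are illustrative.

```python
COLLECTION_SIZE = 640_000  # average from the notes (assumed to be collection size)
DOCS_REVIEWED = {"CAL": 9_000, "SAL": 93_000, "SPL": 207_000}  # averages from the notes

def share_reviewed(method):
    """Fraction of the collection reviewed under the given method."""
    return DOCS_REVIEWED[method] / COLLECTION_SIZE

def extra_docs_vs_cal(method):
    """How many more documents the method reviews than CAL."""
    return DOCS_REVIEWED[method] - DOCS_REVIEWED["CAL"]

for method in DOCS_REVIEWED:
    print(f"{method}: {share_reviewed(method):.1%} of collection, "
          f"{extra_docs_vs_cal(method):,} more docs than CAL")
```

On these averages CAL reviews about 1.4% of the collection, against roughly 14.5% for SAL and 32.3% for SPL.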
SME rate per hour
Reviewer rate per hour
Docs per hour reviewed
When SME can review more training documents, savings from CAL are less.
CAL allows for addition of documents without time and cost of retraining.
CAL review begins immediately; TAR 1.0 SME must complete training first.
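The three inputs listed above (SME rate, reviewer rate, documents per hour) are enough for a rough cost comparison. The notes give no values for them, so every rate below is a placeholder chosen for illustration, not a figure from the study. The sketch also assumes that in TAR 1.0 the training documents are reviewed by the SME and counted within the total, while in CAL the review team handles everything. Function names are hypothetical.

```python
def review_cost(docs, docs_per_hour, rate_per_hour):
    """Cost of reviewing `docs` documents at a given speed and hourly rate."""
    return docs / docs_per_hour * rate_per_hour

def tar1_cost(total_docs, training_docs, sme_rate, reviewer_rate, docs_per_hour):
    """TAR 1.0: SME reviews the training documents, reviewers handle the rest."""
    return (review_cost(training_docs, docs_per_hour, sme_rate)
            + review_cost(total_docs - training_docs, docs_per_hour, reviewer_rate))

def cal_cost(total_docs, reviewer_rate, docs_per_hour):
    """CAL: the review team trains the system as it reviews."""
    return review_cost(total_docs, docs_per_hour, reviewer_rate)

# Placeholder inputs (assumptions, not from the notes).
SME_RATE = 350.0        # dollars per hour
REVIEWER_RATE = 55.0    # dollars per hour
DOCS_PER_HOUR = 60

# Document counts taken from the study averages and the 2K training seeds.
sal = tar1_cost(93_000, 2_000, SME_RATE, REVIEWER_RATE, DOCS_PER_HOUR)
cal = cal_cost(9_000, REVIEWER_RATE, DOCS_PER_HOUR)
print(f"SAL: ${sal:,.0f}  CAL: ${cal:,.0f}  savings: ${sal - cal:,.0f}")
```

Changing the placeholder rates or the number of SME-reviewed training documents shows how sensitive the savings are to those inputs.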