More than a year ago now I posted an outline of the first three chapters of John Tredennick's 'TAR for Smart People'. This is a continuation of that outline using the first edition. An updated edition has been posted at: http://www.catalystsecure.com/tarforsmartpeople .
4. TAR 2.0 Capabilities Allow Use in Even More E-Discovery Tasks
• Document Review
  o Classification - responsive / non-responsive
  o Protection - privileged / trade secrets
  o Knowledge Generation - specific issues / deposition witnesses
• Metrics
  o Recall – the percentage of all relevant documents that are actually retrieved.
  o Precision – the percentage of retrieved documents that are actually relevant.
• Classification Tasks
  o FRCP and Sedona Principles – e-discovery is limited by principles of reasonableness and proportionality.
  o 80% recall is a common TAR target. The gold standard of linear review can’t do better than this and costs more.
  o Recall usually gets more attention than precision.
• Protection Tasks
  o For confidential information, 100% recall is necessary.
  o Stack techniques: use TAR, keyword searching, and human review together. TAR makes systematic errors; human reviewers make random ones.
• Knowledge Generation Tasks
  o Precision is the most important metric.
  o Prioritize the document population by issue.
  o Concentrate the most interesting documents for review first.
  o TAR imperfectly concentrates interesting documents near the top of the responsiveness ranking.
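The two metrics above are simple ratios, and a short sketch may make the difference between them concrete. The function names and the document counts here are hypothetical and chosen only for illustration; they do not come from the book.

```python
def recall(relevant_retrieved: int, relevant_total: int) -> float:
    """Share of all relevant documents that the review actually retrieved."""
    return relevant_retrieved / relevant_total


def precision(relevant_retrieved: int, retrieved_total: int) -> float:
    """Share of the retrieved documents that are actually relevant."""
    return relevant_retrieved / retrieved_total


# Hypothetical review: 10,000 relevant documents exist in the collection,
# 50,000 documents were retrieved, and 8,000 of those are relevant.
print(f"recall    = {recall(8_000, 10_000):.0%}")     # recall    = 80%
print(f"precision = {precision(8_000, 50_000):.0%}")  # precision = 16%
```

In this made-up case the review hits the common 80% recall target even though only 16% of what was retrieved is relevant, which is why the two numbers need to be read together.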
5. Measuring Recall for E-Discovery Review: An Introduction to Recall Sampling
• Reach a high level of recall (75%) after reviewing only a small percentage of documents (5%). The discard pile then includes so few relevant documents that more review is not economically justified.
• Hypo: 1M document production. 1% relevant – 10K documents.
  o Using Sampling to Estimate Richness
    Statistical sampling to estimate richness, using a randomly selected subset.
    Concepts
      • 1. Point Estimate – most likely value for a population characteristic.
      • 2. Confidence Interval – range of values around the point estimate that we believe contains the true value, e.g. 8K to 12K.
      • 3. Margin of Error – maximum amount by which a point estimate might deviate from the true value.
      • 4. Confidence Level – chance that the confidence interval will include the true value.
      • 5. Sample Size – the higher the confidence level, the more docs must be reviewed.
    Determine the sample size with the Raosoft calculator. Inputs:
      • Document set size 1,000,000
      • Confidence level
      • Margin of Error 4%
      • RESULT: 600 documents.
  o Initial Sampling Results
    If 6 relevant documents are found in the sample, estimate 1% richness.
  o Determining the Exact Confidence Interval
    Use a binomial calculator to determine the confidence interval.
The binomial calculator reports the bounds of the interval as decimal proportions. We multiply these decimal values against the total number of documents in our collection (1,000,000) to calculate our exact confidence interval. In this case, it runs from 3,700 to 21,600. We believe there are 10,000 relevant documents in our collection (our point estimate) but it could be as high as 21,600 (or as low as 3,700). Let’s move on to our review.
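For readers without a binomial calculator to hand, here is a minimal sketch of the calculation. It assumes the "exact" interval is the Clopper-Pearson interval and that the confidence level is 95%; neither is stated in this outline. The function names are my own, and only the Python standard library is used.

```python
import math


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p), with each term computed in log space."""
    if p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 1.0 if k >= n else 0.0
    total = 0.0
    for i in range(k + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                    + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_term)
    return min(total, 1.0)


def _bisect(f, lo: float = 0.0, hi: float = 1.0, iters: int = 100) -> float:
    """Root of a decreasing function f on [lo, hi], where f(lo) > 0 > f(hi)."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_interval(k: int, n: int, confidence: float = 0.95) -> tuple[float, float]:
    """Clopper-Pearson interval for k relevant documents in a sample of n."""
    alpha = 1 - confidence
    # Upper bound: the p at which P(X <= k) has fallen to alpha / 2.
    upper = 1.0 if k == n else _bisect(lambda p: binom_cdf(k, n, p) - alpha / 2)
    # Lower bound: the p at which P(X >= k) has risen to alpha / 2.
    lower = 0.0 if k == 0 else _bisect(
        lambda p: alpha / 2 - (1 - binom_cdf(k - 1, n, p)))
    return lower, upper


low, high = exact_interval(6, 600)  # 6 relevant documents in a sample of 600
print(f"{low * 1_000_000:,.0f} to {high * 1_000_000:,.0f} relevant documents")
```

Under those assumptions the bounds should come out close to the 3,700 and 21,600 quoted above; small differences from a given online calculator are possible because of rounding.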
The Review - if we find 7,500 relevant docs in the first 50K reviewed, we may have to review the remaining 950K docs to get 2,500 more – not reasonable. But what if there are actually 21,600 relevant documents? Then recall is only 35% (7,500/21,600), which is not sufficient.
Sampling the Discard Pile – again use a sample size of 600. E.g. 2 relevant documents are found.
Use the binomial calculator again – there could be as many as 11,400 relevant documents in the 950K docs left. In that case we would be getting only 7,500/18,900 – only 40% recall.
Narrow the margin of error and use the Raosoft calculator again – e.g. with a 1% margin of error we must review 9,508 documents.
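The sample sizes in this chapter can be reproduced with the standard sample-size formula for a proportion with a finite population correction. The sketch below assumes a 95% confidence level and a 50% response distribution; those settings are my assumption (they are not stated in this outline), as is the idea that the online calculator uses this same formula. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def sample_size(population: int, margin_of_error: float,
                confidence: float = 0.95, proportion: float = 0.5) -> int:
    """Sample size for estimating a proportion, with finite population correction."""
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    x = z * z * proportion * (1 - proportion)
    n = population * x / ((population - 1) * margin_of_error ** 2 + x)
    return math.ceil(n)


print(sample_size(1_000_000, 0.04))  # 600   (whole collection, 4% margin)
print(sample_size(950_000, 0.01))    # 9508  (discard pile, 1% margin)
```

Note how cutting the margin of error from 4% to 1% multiplies the sample by roughly sixteen, since the sample size grows with the inverse square of the margin.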
Find 31 relevant documents. We estimate there are 3,097 relevant documents in the discard pile, about the same as before (950,000*(31/9508)). The range could be 2,090 to 4,370 documents. Using these values for our exact confidence interval, the recall range goes from 63% (7,500/11,870) to 78% (7,500/9,590).
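The arithmetic behind that recall range is easy to check. This is only a sketch with function names of my own; the 2,090 and 4,370 bounds are taken from the text above rather than recomputed.

```python
def discard_estimate(discard_size: int, sample_hits: int, sample_size: int) -> float:
    """Point estimate of relevant documents left in the discard pile."""
    return discard_size * sample_hits / sample_size


def recall_range(found: int, discard_low: int, discard_high: int) -> tuple[float, float]:
    """Worst-case and best-case recall given bounds on what remains unreviewed."""
    worst = found / (found + discard_high)  # many relevant documents left behind
    best = found / (found + discard_low)    # few relevant documents left behind
    return worst, best


print(round(discard_estimate(950_000, 31, 9_508)))  # 3097

worst, best = recall_range(7_500, 2_090, 4_370)
print(f"{worst:.0%} to {best:.0%}")  # 63% to 78%
```

The same `recall_range` call with the earlier, wider bound of 11,400 documents gives the 40% worst case from the 600-document sample, which shows how much the larger sample tightens the answer.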