Clustify & TAR 3.0
The January 8, 2016 tip of the night has been updated with comments by Bill Dimm clarifying some of the statements that I made about this presentation on TAR. His explanations may help you better understand his points about TAR if you had trouble fully grasping the subject like I did. See the entries in italics below.
Bill Dimm of Hot Neuron is the author of the Predictive Coding Book and the developer of the Clustify software. Last month he gave a particularly effective presentation on the different versions of Technology Assisted Review (TAR). This ACEDS-sponsored presentation can be viewed here.
As Dimm discusses in the presentation, the main goals of predictive coding are to reduce the costs involved in Review, find relevant documents more quickly, and determine when documents are being tagged inconsistently. Predictive coding trains a system on how to find relevant documents. Many people will have heard that the increased use of TAR will lead to expert reviewers taking over the Review process from larger teams of staff attorney reviewers. Dimm refers to a Xerox study which challenges the notion that the best way to tag relevant documents is to use subject matter experts as opposed to first pass reviewers. Having too precise a notion of what is relevant can exclude relevant documents for subtle reasons the system can't understand. A large amount of data reviewed by lower cost reviewers may trump a smaller amount of data reviewed by highly compensated experts.
Prevalence, or richness, refers to the percentage of documents in the set that are actually relevant; recall is the percentage of the relevant documents that are found; and precision is the percentage of the documents determined to be relevant that actually are relevant. Dimm's presentation includes a Precision Recall Curve that shows precision on the vertical axis and recall on the horizontal axis. The curve slopes downward: as recall increases, you need to review more and more non-relevant documents for each relevant document that you find. TAR 1.0 uses a control set, first training and then doing document review.
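To make the three measures concrete, here is a minimal sketch that computes them for a made-up collection. The function name, the document ids and the counts are all hypothetical; they are not taken from Dimm's presentation.

```python
def review_metrics(actual_relevant, predicted_relevant, total_docs):
    """Return (prevalence, recall, precision) as fractions.

    actual_relevant and predicted_relevant are sets of document ids.
    """
    true_positives = len(actual_relevant & predicted_relevant)
    prevalence = len(actual_relevant) / total_docs        # share of all docs that are relevant
    recall = true_positives / len(actual_relevant)        # share of relevant docs that were found
    precision = true_positives / len(predicted_relevant)  # share of flagged docs that are relevant
    return prevalence, recall, precision


# Hypothetical collection of 1,000 documents, 100 of them relevant (ids 0-99).
actual = set(range(100))
# Hypothetical system output: 120 documents flagged, 75 of them truly relevant.
predicted = set(range(75)) | set(range(500, 545))

prevalence, recall, precision = review_metrics(actual, predicted, 1000)
print(f"prevalence={prevalence:.1%} recall={recall:.1%} precision={precision:.1%}")
```

In this invented example prevalence is 10 percent, recall is 75 percent and precision is 62.5 percent, so reaching that recall meant reading 45 non-relevant documents along with the 75 relevant ones.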
Dimm emphasizes that it's improper to use terminology like 95% confidence or ‘statistically significant sample’ when referring to the overall sample set size.
> That sentence should refer specifically to training set size. The point of this section is that you can specify how big the sample must be to measure what percentage are relevant (or what percentage of boxes contain gold), but you cannot specify how big a sample must be to train the system -- training is about identifying WHICH documents are relevant, not what PERCENTAGE of documents are relevant. In the case of the boxes of gold/lead, if the boxes all look the same no amount of sampling is going to tell you which ones contain gold (but it can tell you what percentage contain gold). On the other hand, if all boxes of the same color contain the same metal you can figure out which ones contain gold by taking a sample, but the size of the sample required depends on the number of colors. Likewise, the size of the sample needed to identify relevant documents depends on the nature of the words that indicate relevance.
It's true that in a hypothetical set of 1 million boxes, some with gold and some with lead, if you select 400 at random and find 80 with gold, you can estimate that 20 percent of all the boxes contain gold, plus or minus five percent. If the sample is 1,600 and 320 are found to have gold, you can estimate 20 percent prevalence plus or minus 2.5 percent. Quadrupling the sample size cuts the uncertainty in half. But we still can't say with absolute certainty how many boxes actually contain gold.
The amount of training depends on the task involved. Dimm gives the example of the difference between finding golf documents versus documents regarding the electronics industry. With a simpler topic like golf it's easier to find the handful of words that relevant docs will contain. Depending on what you're reviewing, 300 or 4,500 training documents may be necessary. There is no magic number of training documents.
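The box example can be checked with the usual normal-approximation margin of error. This is a sketch under two assumptions that the text does not state: a 95 percent confidence level (z = 1.96) and the conservative worst-case proportion of 0.5. The function name is my own.

```python
import math

Z_95 = 1.96  # z-score for 95% confidence (assumed; the text does not give a level)


def margin_of_error(sample_size, proportion=0.5, z=Z_95):
    """Normal-approximation margin of error for a sampled proportion.

    proportion=0.5 is the conservative worst case; pass the observed
    proportion for a tighter estimate.
    """
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)


for n, gold in [(400, 80), (1600, 320)]:
    observed = gold / n
    print(f"n={n}: observed prevalence {observed:.0%}, "
          f"worst-case margin {margin_of_error(n):.4f}, "
          f"margin at observed rate {margin_of_error(n, observed):.4f}")
```

The worst-case margins come out to roughly 0.049 and 0.0245, which round to the five and 2.5 percent figures. Because the sample size sits under a square root, four times the sample gives half the margin.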
Dimm's formula states that the optimal number of training docs equals prevalence times the number of non-training docs times desired recall divided by precision at desired recall.
> That formula is for the number of documents you'll need to review from the top of the list (sorted by relevance score) to achieve the desired recall. The formula itself does not tell you the optimal number of training docs. To find that you have to add the result from that formula to the number of training docs to find the total docs requiring review, then increase the number of training docs and see whether the total docs requiring review increases or decreases (if it increases, you should stop training).
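The procedure Dimm describes in that comment can be sketched in a few lines. The population size, prevalence, and the table of precision values at each training amount are invented for illustration; in practice the precision at the desired recall would have to be measured after each round of training.

```python
def docs_to_review(prevalence, non_training_docs, desired_recall, precision_at_recall):
    """Documents to review from the top of the ranked list to reach the desired recall."""
    return prevalence * non_training_docs * desired_recall / precision_at_recall


def total_review(training_docs, population, prevalence, desired_recall, precision_at_recall):
    """Training documents plus the ranked documents that still need review."""
    non_training = population - training_docs
    return training_docs + docs_to_review(
        prevalence, non_training, desired_recall, precision_at_recall
    )


# Hypothetical numbers: 100,000 documents, 5% prevalence, 75% desired recall.
population, prevalence, desired_recall = 100_000, 0.05, 0.75
# Hypothetical precision at 75% recall measured after each amount of training.
precision_by_training = {500: 0.30, 1000: 0.45, 2000: 0.55, 4000: 0.58}

totals = {
    training: total_review(training, population, prevalence, desired_recall, precision)
    for training, precision in precision_by_training.items()
}
best = min(totals, key=totals.get)
for training, total in totals.items():
    print(f"{training} training docs -> about {total:.0f} docs reviewed in total")
print(f"lowest total review at {best} training docs")
```

With these made-up figures the total falls as training grows from 500 to 2,000 documents and then rises again at 4,000, so 2,000 is where training should stop.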
A Static Control Set fails to account for a shifting understanding of relevance as Review progresses. Dimm contrasts it with a Rolling Control Set, which uses the most recently reviewed random docs as the control set, so that it reflects what we learn about the case and our notion of what's relevant as Review moves along.
Other approaches to training include judgmental sampling, where the reviewer selects docs that will help the system to learn, and active learning, where the software picks docs based on how it thinks it will learn faster. Active learning uses many algorithms; it is not based on just one type of mathematical approach.
To help us think about the nature of Review, Dimm proposes a thought experiment about how to teach a child to recognize dogs. Would you show a kid 9 birds for each dog, or 99 birds for each dog, or lots of dogs, a few birds, and some wolves and foxes for contrast? You don't necessarily need to slog through lots of non-relevant documents to do your job.
TAR 2.0 involves continuous active learning (CAL). You start with a small amount of training data and keep reviewing until nothing relevant is found any longer. TAR 2.0 software makes adjustments as non-relevant documents are found. To show what percentage of a document set has to be reviewed to get 75 percent recall, Dimm gives an example where, with 6.9 percent prevalence, you may need to review 12 percent with random sampling versus 8 percent with CAL. But if there is 0.52 percent prevalence in a data set, then reviewing 29 percent will be necessary with random sampling, but only 9 percent with CAL.
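One way to picture a rolling control set is as a fixed-size window over the most recently reviewed random documents. The sketch below is only an illustration of that idea; the class name, the window size and the `predict` callback are hypothetical, and it is not a description of how Clustify or any other product implements it.

```python
from collections import deque


class RollingControlSet:
    """Keeps only the most recently reviewed randomly selected documents."""

    def __init__(self, max_size):
        # A deque with maxlen drops the oldest entry when a new one arrives.
        self.docs = deque(maxlen=max_size)

    def add(self, doc_id, is_relevant):
        """Record a randomly selected document and the reviewer's tag."""
        self.docs.append((doc_id, is_relevant))

    def recall(self, predict):
        """Share of relevant control docs that predict(doc_id) flags as relevant."""
        relevant = [doc_id for doc_id, label in self.docs if label]
        found = sum(1 for doc_id in relevant if predict(doc_id))
        return found / len(relevant)


control = RollingControlSet(max_size=4)
for doc_id in range(6):                   # six hypothetical random docs, reviewed in order
    control.add(doc_id, doc_id % 2 == 0)  # made-up tags: even ids are relevant

print([doc_id for doc_id, _ in control.docs])  # only the four most recent remain
print(control.recall(lambda doc_id: doc_id >= 4))
```

Because older tags fall out of the window, the measurement leans on decisions made with the reviewers' current understanding of relevance rather than their earliest one.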
> It's 6% (not 9%) with CAL, and I would emphasize that this is a specific example (a difficult categorization task) not a general statement.
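The CAL loop described above (start small, review the top-ranked documents, retrain, stop when a batch turns up nothing relevant) can be sketched on a toy corpus. Everything here is an assumption made for illustration: the corpus, the word-count scoring, the batch size and the stopping rule are deliberately simplistic and are not Dimm's method or any vendor's algorithm.

```python
from collections import defaultdict


def make_corpus():
    """Toy corpus of 200 docs; every 20th doc is relevant (5% prevalence)."""
    docs = []
    for i in range(200):
        relevant = i % 20 == 0
        if relevant:
            words = {"merger", "price", f"w{i % 7}"}
        elif i % 20 == 1:
            words = {"merger", "lunch", f"w{i % 7}"}  # near miss: shares a key word
        else:
            words = {"lunch", "golf", f"w{i % 7}"}
        docs.append((frozenset(words), relevant))
    return docs


def cal_review(docs, seed_ids, batch_size=5):
    """Review top-scoring docs in batches until a batch contains nothing relevant."""
    reviewed = {}               # doc id -> reviewer's tag
    weights = defaultdict(int)  # word -> (relevant count - non-relevant count)

    def learn(doc_id):
        words, label = docs[doc_id]  # the stored label stands in for the human reviewer
        reviewed[doc_id] = label
        for word in words:
            weights[word] += 1 if label else -1

    for doc_id in seed_ids:
        learn(doc_id)

    while len(reviewed) < len(docs):
        unreviewed = [i for i in range(len(docs)) if i not in reviewed]
        # Highest score first; the sort is stable, so ties keep id order.
        unreviewed.sort(key=lambda i: -sum(weights[w] for w in docs[i][0]))
        batch = unreviewed[:batch_size]
        found = 0
        for doc_id in batch:
            if docs[doc_id][1]:
                found += 1
            learn(doc_id)
        if found == 0:
            break
    return reviewed


corpus = make_corpus()
reviewed = cal_review(corpus, seed_ids=[0, 2])  # one relevant and one non-relevant seed
found = sum(reviewed.values())
print(f"reviewed {len(reviewed)} of {len(corpus)} docs, found {found} relevant")
```

On this easy toy corpus the loop finds all 10 relevant documents after reviewing 17 of the 200. Real collections are far messier, which is why the percentages Dimm quotes vary so much with prevalence and with how hard the categorization task is.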
Dimm was the first to use the term 'TAR 3.0'. It involves using CAL on cluster centers. You use a clustering algorithm to group documents together and then run the continuous active learning process on those clusters. With TAR 3.0 a big control set is not necessary. A random sample may require reviewing as many as 100 times as many docs to get the same results as TAR 3.0.
> This statement is misleading due to some missing details. TAR 3.0 gives you a prevalence estimate that is calculated from the training docs (no additional document review needed). In the specific case where prevalence was very low (0.24%), the prevalence estimate from the TAR 3.0 method was so good that making a comparable measurement with random sampling would require 100 times as many documents. It is important to note that that is only the case when prevalence is very low and I was talking only about making a prevalence estimate, not about the cost of the overall process of training and document review.
>
> I'm not making any claims about TAR 3.0 being 100 times more efficient than TAR 1.0. TAR 1.0 may require reviewing 1,000 documents for a control set and a few thousand training documents. TAR 3.0 doesn't require a control set and requires only a few hundred training documents. In both cases you may or may not choose to review documents that are predicted to be relevant (which can be substantial if the document population is large) and you may review a sample to estimate the appropriate relevance score cutoff. That's a more realistic comparison.
TAR 1.0 doesn't work well for low prevalence data sets, whereas TAR 2.0 and TAR 3.0 do. TAR 1.0 requires a control set, but TAR 2.0 and TAR 3.0 do not. TAR 2.0 won't include diverse documents, whereas TAR 1.0 and TAR 3.0 do.