TAR for Smart People Outline - Chapter 9
Here's another installment in my outline of John Tredennick's 'TAR for Smart People'. I last posted an installment on December 18, 2016. Tonight's installment is on Chapter 9 - Comparing Active Learning to Random Sampling - Using Zipf's Law to Evaluate Which is More Effective for TAR.
A. Schieneman-Gricks Study vs. Grossman-Cormack Study
1. Judgmental seeds (selected via continuous active learning and contextual diversity) are superior to random seeds, as per Catalyst.
2. Cormack-Grossman and Ralph Losey believe random sampling is not as effective.
3. OrcaTec - thinks random sampling leads to bias.
4. Contextual Diversity models the entire document population.
B. What is Contextual Diversity?
1. Contextual Diversity Algorithm identifies documents based on how different they are from ones already seen.
C. Contextual Diversity: Explicitly Modeling the Unknown
1. The algorithm will select the document containing the highest percentage of terms that are not included in documents already reviewed.
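To make the selection rule above concrete, here is a minimal Python sketch. It is not Catalyst's actual algorithm. It is a toy version that assumes documents are plain strings, that "terms" are lowercased whitespace-separated words, and that novelty is the share of a document's distinct terms not yet seen in reviewed documents. The names `tokenize` and `select_most_novel` are hypothetical.

```python
def tokenize(text):
    """Split a document into a set of lowercase terms (a simplifying assumption)."""
    return set(text.lower().split())


def select_most_novel(unreviewed, reviewed):
    """Return (index, score) of the unreviewed document with the highest
    share of distinct terms that do not appear in any reviewed document."""
    seen = set()
    for doc in reviewed:
        seen |= tokenize(doc)

    best_index, best_score = None, -1.0
    for i, doc in enumerate(unreviewed):
        terms = tokenize(doc)
        if not terms:
            continue  # skip empty documents
        score = len(terms - seen) / len(terms)
        if score > best_score:
            best_index, best_score = i, score
    return best_index, best_score


# Made-up example documents, for illustration only.
reviewed = ["merger agreement draft", "merger closing schedule"]
unreviewed = [
    "merger agreement signed",
    "pricing memo for distributors",
    "closing schedule update",
]

index, score = select_most_novel(unreviewed, reviewed)
print(unreviewed[index], score)
```

In this toy data the pricing memo is picked because none of its terms appear in the reviewed documents, while the other two candidates mostly repeat terms the reviewer has already seen.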
D. Zipf's Law
1. You can expect the most frequent word in a large population to be twice as frequent as the second most common word, three times as frequent as the third most common word, and so on.
2. The diagram below depicts random sampling. Each bubble is a subtopic in the document set. The grid shows how random sampling does not necessarily draw a sample from each subtopic.
3. This diagram depicts the contextual diversity approach. The red dots are seed documents selected from each of the topics.
The subtopics which are covered by large groups of documents are not oversampled.
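The Zipf relationship in item 1, and the reason random sampling can miss small subtopics, can be sketched in a few lines of Python. This is an illustration only. It assumes subtopic sizes follow the same 1/rank pattern that Zipf's Law describes for word frequencies, and it assumes a simple random sample drawn with replacement. The topic count of 100 and sample size of 50 are made-up values, and the names `zipf_shares` and `miss_probability` are hypothetical.

```python
def zipf_shares(num_topics):
    """Share of the collection held by each topic when sizes are proportional
    to 1/rank (rank 1 is the largest topic)."""
    weights = [1.0 / rank for rank in range(1, num_topics + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def miss_probability(share, sample_size):
    """Probability that a random sample of the given size (with replacement)
    contains no document from a topic holding this share of the collection."""
    return (1.0 - share) ** sample_size


shares = zipf_shares(100)  # assumed: 100 subtopics

# Zipf ratios: the top item is twice as frequent as the second and
# three times as frequent as the third.
print(shares[0] / shares[1], shares[0] / shares[2])

# Chance that a 50-document random sample misses the largest and smallest topics.
print(miss_probability(shares[0], 50))
print(miss_probability(shares[-1], 50))
```

Under these assumptions the largest subtopic is almost certain to appear in the sample, while the smallest one is likely to be missed entirely. That is the gap contextual diversity is meant to close by picking a seed from each subtopic.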