This blog discussed Catalyst's TAR for Smart People on the night of October 5, 2015, and noted on the night of November 13, 2015, its recommendation to use the Raosoft Calculator to determine the sample size needed for predictive coding. John Tredennick's guide also recommends using a binomial calculator to put a confidence interval around the share of relevant documents found in a sample, which shows how far the actual number of relevant documents in the full set may differ from what the sample suggests. So if you have a sample set of 1,000 documents and find 50 relevant documents, and the complete document set is 2,000,000, the sample indicates a richness level of 5 percent. Extrapolating that percentage to the full set gives a single best guess (a point estimate) of 100,000 relevant documents in the total population. The binomial calculator lets us set a confidence interval, a likely range in which the actual number of relevant documents will fall.
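For readers who like to check the arithmetic, here is a minimal Python sketch of the point estimate in the example above. The function name point_estimate is my own invention for illustration and does not come from the guide.

```python
def point_estimate(relevant_in_sample, sample_size, population_size):
    """Scale the relevant count found in the sample up to the full document set."""
    # Multiply before dividing so the whole-number example stays exact.
    return round(relevant_in_sample * population_size / sample_size)


# Figures from the example: 50 relevant out of 1,000 sampled, 2,000,000 in total.
richness = 50 / 1000  # share of the sample that is relevant
estimate = point_estimate(50, 1000, 2_000_000)
print(f"richness: {richness:.1%}, point estimate: {estimate:,} documents")
```

The point estimate is only the middle of the story. It says nothing about how far off the sample might be, which is what the confidence interval is for.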
A binomial calculator for confidence intervals can be found here: http://statpages.org/confint.html. The number of relevant documents found in the sample set is entered as the numerator and the total sample size as the denominator. After clicking 'compute' you find that the Proportion x/N of relevant documents in the sample is 5%. To turn the 'Exact Confidence Interval around Proportion' into document counts, multiply its lower and upper bounds by the size of your document population. In my example the confidence interval would be 74,600 to 130,800 relevant documents in the total set of 2,000,000 documents. See the screen grab of the calculator below.
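If you would rather not depend on a web page, the same kind of interval can be approximated in a few lines of Python. This is only a sketch under my own assumptions: that the calculator's 'exact' interval is the standard Clopper-Pearson interval, and that the figures above reflect a 95% confidence level. The function names (binom_pmf, binom_cdf, solve, exact_interval) are made up for this illustration and are not taken from the calculator's source.

```python
import math


def binom_pmf(k, n, p):
    """Probability of exactly k relevant documents in a sample of n."""
    if p <= 0.0:
        return 1.0 if k == 0 else 0.0
    if p >= 1.0:
        return 1.0 if k == n else 0.0
    # Work in logs so large binomial coefficients do not overflow.
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log1p(-p))
    return math.exp(log_pmf)


def binom_cdf(k, n, p):
    """Probability of k or fewer relevant documents in a sample of n."""
    return min(1.0, sum(binom_pmf(i, n, p) for i in range(k + 1)))


def solve(f, lo=0.0, hi=1.0, iters=60):
    """Bisection: find where f crosses zero between lo and hi."""
    f_lo = f(lo)
    for _ in range(iters):
        mid = (lo + hi) / 2
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_interval(x, n, confidence=0.95):
    """Clopper-Pearson interval for x relevant documents out of n sampled."""
    tail = (1 - confidence) / 2
    # Lower bound: the proportion at which seeing x or more has probability `tail`.
    lower = 0.0 if x == 0 else solve(lambda p: (1 - binom_cdf(x - 1, n, p)) - tail)
    # Upper bound: the proportion at which seeing x or fewer has probability `tail`.
    upper = 1.0 if x == n else solve(lambda p: binom_cdf(x, n, p) - tail)
    return lower, upper


low, high = exact_interval(50, 1000)
population = 2_000_000
print(f"proportion: {low:.4f} to {high:.4f}")
print(f"documents:  {low * population:,.0f} to {high * population:,.0f}")
```

With the example's numbers the bounds should land close to the 74,600 and 130,800 figures above, though small differences from the calculator's rounding are possible.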