
This month, Magistrate Judge Andrew Peck of the S.D.N.Y. issued an important decision on the use of Technology Assisted Review, or predictive coding. In Hyles v. New York City, 10 Civ. 3119 (S.D.N.Y.), Judge Peck denied the plaintiff's application to force the defendant to use TAR, even though he acknowledged that "TAR is cheaper, more efficient and superior to keyword searching." As previously noted on this blog, Judge Peck issued Da Silva Moore v. Publicis Groupe & MSL Group, one of the first judicial decisions to approve the use of TAR. However, in this recent decision, Judge Peck ruled that Sedona Principle 6, which states that parties are best able to judge for themselves which technologies are best for the preservation and production of ESI, should supersede the obvious advantages of TAR.

It should be noted that this is an employment discrimination suit. The review was staged, beginning with only 9 custodians and then, only if necessary, expanding to an additional 6. Presumably very large amounts of ESI are not involved.

Judge Peck closed his decision by noting that, as TAR becomes more widely used, there may come a time when it will be unreasonable for a party to decline to use it in the electronic discovery process.


 
 

A ROC (Receiver Operating Characteristic) curve plots the True Positive Rate (the percentage of the truly responsive documents that are correctly identified: True Positive / (True Positive + False Negative), also known as Recall or Sensitivity) against the False Positive Rate (the percentage of non-responsive documents incorrectly identified as responsive: False Positive / (False Positive + True Negative), which is the same as 1 - Specificity, or 1 - True Negative / (False Positive + True Negative)). [Precision is different and is defined as True Positive / (True Positive + False Positive).]
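For anyone who wants to check these definitions against their own review counts, here is a minimal Python sketch of the same formulas (the rates function and its tp, fp, tn and fn inputs are my own illustration, not part of any particular review tool):

def rates(tp, fp, tn, fn):
    # Recall / True Positive Rate / Sensitivity: share of truly responsive documents that are found
    recall = tp / (tp + fn)
    # False Positive Rate (1 - Specificity): share of non-responsive documents wrongly flagged as responsive
    fpr = fp / (fp + tn)
    # Precision: share of the documents flagged as responsive that really are responsive
    precision = tp / (tp + fp)
    return recall, fpr, precision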

The graph below has two curves. The red curve shows documents which are not actually responsive, and the green curve shows documents which are in fact responsive. The X axis shows the probability that a document is responsive, and the Y axis shows the document count.

A cut-off rate needs to be chosen - a probability of responsiveness at or above which the TAR model treats a document as responsive, and below which it treats a document as non-responsive. So for example, the black line shows a cut-off rate of 50% - at that threshold we have 15 documents incorrectly judged to be responsive (false positives) and 2 responsive documents incorrectly determined to be non-responsive (false negatives). Since 242 documents have been reviewed in total, the accuracy rate would be about 93%.

The graph showing the ROC Curve will have the False Positive Rate listed on the X axis and the True Positive Rate listed on the Y axis. In our example, when the cut-off rate is set at 50%, the True Positive Rate is 115/(115 + 2), or 98.3%, and the False Positive Rate is 15/(15 + 110), or 12% (that is, 1 minus a specificity of 88%). When we change the cut-off rate to 80%, the True Positive Rate becomes 75/(75 + 42), or 64%, and the False Positive Rate is 0/(0 + 125), or 0%. Based on these data points we can plot an ROC curve (shown in purple) that visualizes all of the possible thresholds.
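Plugging the counts from this example into the rates sketch above reproduces these figures (the individual counts are read off the example graph, so treat them as illustrative):

# cut-off at 50%: 115 true positives, 15 false positives, 110 true negatives, 2 false negatives
r50, fpr50, p50 = rates(115, 15, 110, 2)     # recall ~ 0.983, false positive rate ~ 0.12
# cut-off at 80%: 75 true positives, 0 false positives, 125 true negatives, 42 false negatives
r80, fpr80, p80 = rates(75, 0, 125, 42)      # recall ~ 0.64, false positive rate 0.0
accuracy_50 = (115 + 110) / 242              # ~ 0.93, the accuracy figure mentioned above
# (fpr50, r50) and (fpr80, r80) are two of the points that trace out the ROC curve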

When predictive coding software does a good job of separating responsive documents from non-responsive documents (i.e., it does not assign mid-range probabilities of responsiveness, say 30% to 70%, to similar numbers of responsive and non-responsive documents), the ROC curve will bow toward the upper left of the graph, as shown in this example. If the software does not assign divergent responsiveness probabilities to responsive and non-responsive documents, the ROC curve will lie closer to an imaginary diagonal line drawn across the graph from the bottom left to the top right - the line that random guessing would produce.
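If you have the model's responsiveness probability for each document in a reviewed sample, a statistics library can sweep every possible cut-off for you and report the area under the curve. A short sketch using scikit-learn's roc_curve (the y_true and scores lists here are made-up sample data, not from any real review):

from sklearn.metrics import roc_curve, roc_auc_score

# 1 = responsive, 0 = non-responsive; scores are the model's responsiveness probabilities
y_true = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.95, 0.90, 0.85, 0.70, 0.65, 0.40, 0.35, 0.30, 0.20, 0.10]

fpr, tpr, thresholds = roc_curve(y_true, scores)   # one (FPR, TPR) point per candidate cut-off
print(list(zip(fpr, tpr)))
print(roc_auc_score(y_true, scores))               # area under the curve: 1.0 is perfect, 0.5 is the diagonal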


 
 

As explained in last night's tip, an F-score is a combined measure of precision and recall. An F1 score gives equal weight to precision and recall. A version of the equation which allows different weights to be assigned to precision or recall would be expressed this way:

Fß = (1 + ß²) × (Precision × Recall) / ((ß² × Precision) + Recall)

The beta symbol, ß, is the variable that sets how much weight recall receives relative to precision. The term F2 score is used when twice as much weight is given to recall as to precision. When giving twice as much weight to precision, an F0.5 score is used.

In Excel, with the precision in column A and the recall in column B, we can use these formulas to calculate these two scores:

F 0.5 score Excel =((1.25)*((A9*B9)/((0.25*A9)+B9)))

F2 score Excel =((5)*((A9*B9)/((4*A9)+B9)))

. . . and a formula for the general Fß equation, allowing the user to grant varying weights to precision or recall would be:

=((1+(C9^2))*((A9*B9)/((C9^2*A9)+B9)))

. . . where column C gives the ß value.
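The same general Fß calculation can also be written outside of Excel. A brief Python sketch, using example precision and recall values of my own choosing:

def f_beta(precision, recall, beta):
    # weighted combination of precision and recall, matching the Excel formula above
    return (1 + beta**2) * precision * recall / ((beta**2 * precision) + recall)

print(f_beta(0.8, 0.6, 1))     # F1 - equal weight to precision and recall
print(f_beta(0.8, 0.6, 2))     # F2 - recall weighted twice as heavily
print(f_beta(0.8, 0.6, 0.5))   # F0.5 - precision weighted twice as heavily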

See a demonstration of these formulas in an Excel spreadsheet on my YouTube channel.


 
 

Sean O'Shea has more than 20 years of experience in the litigation support field with major law firms in New York and San Francisco.   He is an ACEDS Certified eDiscovery Specialist and a Relativity Certified Administrator.

The views expressed in this blog are those of the owner and do not reflect the views or opinions of the owner’s employer.

If you have a question or comment about this blog, please make a submission using the form to the right. 


© 2015 by Sean O'Shea.
