Article: Predictive Coding Comes of Age

November 01, 2012
Business Litigation Reports

So-called “predictive coding,” which uses a small number of manually coded documents to analyze and predict appropriate coding for a much larger set of documents, has become a hot topic in e-discovery. This past year brought the first reported judicial decisions explicitly authorizing the practice. The year also saw some of the first disputes concerning the appropriate methodologies for this technique.

In coming years, the use of predictive coding will continue to grow as litigants seek to limit discovery costs. Judges may also continue to endorse the practice, even incorporating it into model e-discovery orders. But early adopters should proceed with caution; the practice is likely to generate many disputes as acceptable methodologies and best practices are established.

The Evolution of Computer-Assisted Document Review
As companies have moved away from paper file systems and toward electronically stored information (ESI), the number of documents that must be collected and reviewed in civil litigation has skyrocketed. A number of technologies have been used to handle this explosion in discoverable information. Predictive coding is the latest technical evolution for reviewing and producing large data sets.

Manual Review: Not long ago, manual, linear, “eyes-on-the-page” analysis was the predominant method of document review. The process started with collecting documents that were potentially responsive to formal requests for production. The data collections, especially in complex civil litigation, often contained millions of pages. A small army of junior associates, contract attorneys, and even paralegals would then mobilize to manually review the documents for responsiveness, privilege, and confidentiality.

Although many still consider manual review to be the “gold standard,” it is rife with performance and quality shortcomings. Analysts estimate that when operating at a maximum review speed of about 100 documents per hour, a decision on relevance, responsiveness, privilege, or confidentiality would need to be made in an average of 36 seconds. See Nicholas M. Pace and Laura Zakaras, Where the Money Goes: Understanding Litigant Expenditures for Producing Electronic Discovery (RAND Corporation 2012) (hereafter “Pace & Zakaras”). As a result, the document review in a large case could take thousands of man-hours. This significant expenditure of time and money does not come with a guarantee of accuracy; studies suggest that up to 95% of reviewer disagreement is the result of human error and not simply close questions of relevance. See Maura R. Grossman & Gordon V. Cormack, Inconsistent Assessment of Responsiveness in E-Discovery: Difference of Opinion or Human Error? 9 (ICAIL 2011 / DESI IV: Workshop on Setting Standards for Searching Elec. Stored Info. in Discovery, Research Paper).
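The arithmetic behind these figures can be checked with a short sketch. The review rate of 100 documents per hour comes from the text above; the corpus size is a hypothetical value chosen only for illustration.

```python
def seconds_per_document(docs_per_hour):
    """Average time available for each coding decision at a given review rate."""
    return 3600 / docs_per_hour


def review_hours(num_documents, docs_per_hour):
    """Total reviewer hours for a linear, single-pass manual review."""
    return num_documents / docs_per_hour


RATE = 100             # documents per hour, the maximum rate cited in the text
CORPUS_SIZE = 500_000  # hypothetical corpus size, not taken from any case

print(seconds_per_document(RATE))       # 36.0
print(review_hours(CORPUS_SIZE, RATE))  # 5000.0
```

Even at the maximum rate, a single pass over a corpus of this assumed size consumes thousands of reviewer hours, before any second-level or privilege review.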

Keyword Search: Keyword searching is a rudimentary form of computer assistance that narrows the scope and number of documents for further manual review. In a typical keyword search, the producing party runs a set of keywords against emails and other electronic documents to identify a smaller set of documents to be manually reviewed for responsiveness. Typically, multiple keywords and Boolean relationships among them can be utilized. Keyword searching offers performance improvements over manual searching, and is highly common in modern e-discovery. Courts have explicitly endorsed the practice and have even incorporated keyword restrictions and search terms into model orders for e-discovery. See, e.g., Federal Circuit’s Model E-Discovery Order for Patent Cases, available online at (proposing that email productions occur using “five search terms per custodian”).

Yet keyword searching is also rife with shortcomings. Keyword searches are frequently overinclusive and underinclusive; search terms fail to capture many relevant documents, while simultaneously generating many false positives. When search terms turn out to be more common than expected in a document set, keyword searching will return a high number of documents that contain the keyword but have no possible relevance to the case—forcing the producing party to use expensive manual review to find truly relevant documents. A poorly chosen keyword often returns more “junk” than responsive documents. For that reason, great care must be taken by the producing party to identify appropriate keywords, often with the assistance of the document custodians themselves. Creativity must be employed to ensure that common synonyms, misspellings, acronyms, and abbreviations are included and keywords likely to generate false positives are excluded.
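The over- and underinclusiveness described above is often measured as precision (the share of hits that are actually responsive) and recall (the share of responsive documents that were hit). The following is a minimal sketch under stated assumptions: the four-document corpus, the search terms, and the function names are all hypothetical, and real review platforms offer far richer query syntax.

```python
import re


def matches(text, all_of=(), any_of=(), none_of=()):
    """Boolean keyword query: every all_of term, at least one any_of term
    (if any are given), and no none_of term must appear in the text."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    return (
        all(term in words for term in all_of)
        and (not any_of or any(term in words for term in any_of))
        and not any(term in words for term in none_of)
    )


def precision_recall(hits, responsive):
    """Precision and recall of a set of hits against the truly responsive set."""
    true_positives = len(hits & responsive)
    precision = true_positives / len(hits) if hits else 0.0
    recall = true_positives / len(responsive) if responsive else 0.0
    return precision, recall


# Hypothetical corpus; documents 1 and 3 are the responsive ones.
docs = {
    1: "Please review the merger agreement draft before Friday",
    2: "Lunch menu: the chef has a new agreement with the bakery",
    3: "The acquisition terms are attached for the board",
    4: "Merger of the two softball leagues is postponed",
}
responsive = {1, 3}

hits = {doc_id for doc_id, text in docs.items()
        if matches(text, any_of=("merger", "agreement"))}
precision, recall = precision_recall(hits, responsive)
print(sorted(hits), precision, recall)
```

In this toy corpus the query returns documents 1, 2, and 4: two of the three hits are false positives, and the responsive document that says "acquisition" rather than "merger" is missed entirely.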

Predictive Coding: Predictive coding is the latest evolution of computer-assisted document searching. As with manual and keyword searching, the process begins by collecting a corpus of potentially responsive documents from the client. Next, attorneys review a small set of randomly selected documents to identify a “seed set” of documents that clearly fit, or clearly do not fit, the desired document categories. Then, the predictive coding software uses the “seed” documents to create a template to use when screening new documents. Some systems produce a simple yes/no, while others assign a score (for example, on a 0 to 100 basis) relating to responsiveness or privilege. Attorneys then audit the identified documents to validate their relevance, responsiveness, or privilege. The computer uses the attorneys’ audit results to modify its search algorithm. The search algorithm is repeatedly audited and rerun until the system’s predictions and the reviewers’ audits sufficiently coincide. Typically, the senior lawyer (or team) needs to review only a few thousand documents to train the computer, at which point the system has learned enough to make confident predictions about a much larger data set, such as the relevance of millions of documents.
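The train, audit, and retrain cycle described above can be sketched in a few lines. This is an illustration only, not any vendor's algorithm: the word-weight model is a simple stand-in chosen for brevity, and the corpus, the seed set, the 90% agreement target, and every function name are assumptions made for the example.

```python
import math
import random
import re
from collections import Counter


def tokens(text):
    return set(re.findall(r"[a-z]+", text.lower()))


def train(labeled):
    """Fit smoothed per-word log-odds weights from (text, is_responsive) pairs."""
    pos, neg = Counter(), Counter()
    n_pos = n_neg = 0
    for text, is_responsive in labeled:
        if is_responsive:
            pos.update(tokens(text))
            n_pos += 1
        else:
            neg.update(tokens(text))
            n_neg += 1
    return {
        w: math.log((pos[w] + 1) / (n_pos + 2)) - math.log((neg[w] + 1) / (n_neg + 2))
        for w in set(pos) | set(neg)
    }


def score(weights, text):
    """Responsiveness score on a 0 to 100 scale."""
    z = sum(weights.get(w, 0.0) for w in tokens(text))
    z = max(-30.0, min(30.0, z))  # keep exp() well within float range
    return 100 / (1 + math.exp(-z))


def review_loop(corpus, seed, attorney_code, batch=4, target=0.9, max_rounds=10):
    """Retrain and audit until predictions agree with the attorney's coding."""
    rng = random.Random(0)  # fixed seed so the sketch is repeatable
    labeled = dict(seed)
    rounds = 0
    while rounds < max_rounds:
        unlabeled = [d for d in corpus if d not in labeled]
        if not unlabeled:
            break
        rounds += 1
        weights = train([(corpus[d], r) for d, r in labeled.items()])
        sample = rng.sample(unlabeled, min(batch, len(unlabeled)))
        agreed = 0
        for d in sample:
            predicted = score(weights, corpus[d]) >= 50
            actual = attorney_code(d)  # stands in for a human audit decision
            agreed += predicted == actual
            labeled[d] = actual        # audit results feed the next round
        if agreed / len(sample) >= target:
            break
    return train([(corpus[d], r) for d, r in labeled.items()]), rounds


# Hypothetical corpus and attorney coding, for illustration only.
CORPUS = {
    0: "turbine blade failure report", 1: "lunch order for friday",
    2: "turbine warranty claim", 3: "friday lunch menu",
    4: "blade inspection turbine", 5: "parking lunch reminder",
    6: "turbine failure analysis", 7: "holiday lunch schedule",
}
TRUTH = {0: True, 1: False, 2: True, 3: False, 4: True, 5: False, 6: True, 7: False}
SEED = {0: True, 1: False}

weights, rounds = review_loop(CORPUS, SEED, TRUTH.__getitem__)
for doc_id, text in CORPUS.items():
    print(doc_id, round(score(weights, text)), TRUTH[doc_id])
```

The stopping rule is the point of the sketch: the loop ends when an audited sample agrees with the model often enough, not when every document has been read.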

Once a predictive model is generated, there are several ways the review might proceed. In the context of a review for relevance and responsiveness, one option might be to assume that all documents with scores above a particular threshold can be classified safely as responsive, while all those with scores below a particular threshold can be safely classified as not responsive. Only those documents with scores in the middle would require eyes-on review. Another option would be to perform eyes-on review of only those documents exceeding a particular score in order to confirm the application’s decisions, while dropping the remainder from all further work. Forgoing manual review altogether is also a possibility, though likely not advisable, given the potential for unexpected error. As these examples illustrate, the umbrella term “predictive coding” can be used to describe a number of different ways that predictions are used and applied. The individuals supervising the review must pick appropriate cut-off points and use their best judgment as to whether and how humans will review and refine codes that are automatically applied.
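The first option, with two cut-off points and eyes-on review only for the middle band, might look like the sketch below. The thresholds of 20 and 80 and the document scores are hypothetical; choosing real cut-offs is a judgment call for the supervising attorneys.

```python
def triage(scores, low=20, high=80):
    """Split documents into three buckets using two score cut-offs.

    scores maps a document ID to a 0-100 responsiveness score.
    The default cut-offs are illustrative assumptions, not recommendations.
    """
    buckets = {"responsive": [], "manual_review": [], "not_responsive": []}
    for doc_id, value in sorted(scores.items()):
        if value >= high:
            buckets["responsive"].append(doc_id)
        elif value <= low:
            buckets["not_responsive"].append(doc_id)
        else:
            buckets["manual_review"].append(doc_id)
    return buckets


# Hypothetical scores produced by a predictive model.
example_scores = {"DOC-001": 97, "DOC-002": 55, "DOC-003": 4, "DOC-004": 80, "DOC-005": 21}
result = triage(example_scores)
print(result)
```

Widening the gap between the two cut-offs sends more documents to manual review and fewer to automatic classification, which is the cost and risk trade-off the supervising attorneys have to weigh.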

Used carefully, predictive coding has the potential to offer significant performance and cost benefits, without compromising accuracy. Litigants are already touting the cost-saving potential; some defendants have claimed predictive coding would reduce time for production and review from ten man-years to less than two man-weeks, and would cost roughly 1% of the cost of human review. See Global Aerospace, Inc. v. Landow Aviation, 2012 WL 1431215 (Cir. Ct. Loudoun Cty. Va. 2012). As to accuracy, predictive coding has not been shown to be any less accurate than traditional manual review. (Pace & Zakaras, pp. 61-66.) Some studies suggest that predictive coding identifies at least as many documents of interest as traditional eyes-on review, with about the same level of inconsistency, and may in fact offer more accurate review for responsiveness than most manual reviews. (Pace & Zakaras, p. xviii) Actual cost savings will depend on a number of factors, including the size of the document set, challenges to the predictive coding methodology, and the document review methodology against which predictive coding is compared—but used in the right circumstances, the cost-saving potential of predictive coding is obvious.

Recent Decisions
While keyword searching has been the most frequently used form of computer-assisted document review, a handful of recent cases have considered the use of predictive coding. As courts become more familiar with the practice, some are explicitly endorsing and recommending it.

Global Aerospace may be the first case actually ordering the use of predictive coding. Global Aerospace, Inc. v. Landow Aviation, 2012 WL 1431215 (Cir. Ct. Loudoun Cty. Va. 2012). The defendants argued that, with more than 2 million documents to review, it would take reviewers more than 20,000 hours to perform the task, or 10 man-years of billable time. Global Aerospace, Inc. v. Landow Aviation, 2012 WL 1419842 (Va. Cir. Ct. April 9, 2012). But with predictive coding, it would take less than two weeks at a cost of roughly 1/100 that of manual human review. Id. Having heard arguments, the Court ordered that Defendants could proceed with the use of predictive coding for processing and production of ESI. Global Aerospace, Inc. v. Landow Aviation, 2012 WL 1431215 (Va. Cir. Ct. April 23, 2012).

Global Aerospace stopped short of an unqualified approval of predictive coding. For example, predictive coding cannot work effectively if a representative corpus is not used for the initial training. The Global Aerospace court noted that the receiving party was free to challenge the completeness of the contents of the production and the manner in which predictive coding was used for new documents. Id.

In Moore v. Publicis, perhaps the most significant judicial decision on predictive coding to date, the Southern District of New York (Magistrate Judge Peck) held that “computer-assisted review is an acceptable way to search for relevant ESI in appropriate cases.” Moore v. Publicis Groupe, 2012 U.S. Dist. LEXIS 23350, 2012 WL 607412 (S.D.N.Y. 2012). The Court reasoned that computer-assisted review complied with the doctrine of proportionality of Federal Rule of Civil Procedure 26(b)(2)(C), and that predictive coding was an acceptable form of computer-assisted review. Id. at *12 (“…computer-assisted review is an available tool and should be seriously considered for use in large-data-volume cases where it may save the producing party (or both parties) significant amounts of legal fees in document review.”)

As courts have endorsed the voluntary use of predictive coding, parties have also sought to compel their adversaries to use the technique. In Kleen Products, Defendants sought to use keyword search-term processing, in which they had already invested much time and effort; but Plaintiffs moved to compel the use of predictive coding, arguing that keyword search methods were inadequate and flawed. Kleen Products, LLC v. Packaging Corp. of America, No. 10-C5711, Dkt. 412 (N.D. Ill. Sept. 28, 2012). The Court held evidentiary hearings in February and March 2012, during which it urged the parties to reach a compromise—for example, adopting Defendants’ keyword-based approach, but refining or supplementing terms and review procedures to meet Plaintiffs’ concerns. Ultimately, the parties reached agreement before a ruling on the motion to compel was reached. But Kleen illustrates that disputes over keyword search-terms may extend far beyond the sufficiency of specific terms going forward. Parties may challenge the notion of keyword searching itself—perhaps using the availability of predictive coding as leverage to obtain significant concessions on proposed keywords.

A recent case management order in In re: Actos provides further insight into the predictive coding processes that parties are likely to agree to and courts to sanction. In re: Actos (Pioglitazone) Products Liability Litigation, MDL No. 6:11-md-2299, Dkt. 1539 (W.D. La. July 27, 2012). The agreed-upon order in Actos allows each side to nominate three reviewers to work collaboratively to code the seed set of documents. The extremely detailed protocol contains numerous levels of sampling and review, as well as meet-and-confer check points throughout the procedure, including regarding the relevance threshold that would trigger manual review by the producing party.

Predictive Coding Done with Care
Litigants interested in utilizing predictive coding should keep several principles from these cases in mind. First and foremost, the producing party should attempt to gain the receiving party’s consent to the use of predictive coding. The more transparency the producing party offers into the procedure, the less likely it is that the receiving party will successfully move to compel an alternative document production methodology later in the case. An agreement regarding the basic methodology and the custodians from whom documents will be collected is recommended. Moreover, using jointly appointed reviewers for the document training set may ease concerns with the process.

Second, the producing party should negotiate a “claw-back provision” that will allow recovery of documents that are improperly produced as a result of the predictive coding methodology. These could include documents that are irrelevant, privileged, or that should be, but were not, marked as confidential under a protective order. Such a provision is especially important if any portion of the documents marked responsive by the predictive coding methodology will not be manually reviewed.

Third, great care should be taken in preparing the initial “seed set” of documents that will be used to program the predictive coding algorithm. If the producing party does not actually involve the receiving party in the selection of the seed set, the producing party should be prepared to disclose the entire seed set to the receiving party and the court, which may raise work-product protection concerns. It is also important that the persons reviewing the initial seed set have a strong grasp of the issues in the case. Because of the importance of the initial seed set, it is critical that persons reviewing the seed set make accurate decisions; any errors in the seed set will become systemic throughout the larger review.

Fourth, the producing party should consider whether it is appropriate to use different seed sets for different custodians. For example, in a patent case, responsive documents that are held by an engineer may look very different from responsive documents held by an employee in the marketing or finance departments.

Fifth, the producing party should work closely with its e-vendor to ensure that the methodology is statistically justifiable. This includes ensuring that the seed set is drawn at random from the document collection, that the seed set is sufficiently large, and that the confidence interval and confidence level are either agreed upon between the parties or statistically justifiable.
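One common way to relate sample size to a confidence level and confidence interval is the standard formula for estimating a proportion from a simple random sample. The sketch below assumes that formula, treats the confidence interval as a plus-or-minus margin of error, and uses the conservative worst-case proportion of 0.5; the function name and parameter values are illustrative, and the actual protocol should be settled with the e-vendor and the opposing party.

```python
import math
from statistics import NormalDist


def sample_size(confidence_level, margin_of_error, population=None, proportion=0.5):
    """Documents to sample so an estimated proportion falls within the margin.

    confidence_level: e.g. 0.95 for 95%
    margin_of_error:  e.g. 0.02 for plus or minus 2 percentage points
    population:       total documents; if given, a finite-population
                      correction is applied
    proportion:       assumed prevalence; 0.5 is the worst case
    """
    z = NormalDist().inv_cdf(0.5 + confidence_level / 2)  # two-sided z value
    n = (z ** 2) * proportion * (1 - proportion) / margin_of_error ** 2
    if population is not None:
        n = n / (1 + (n - 1) / population)
    return math.ceil(n)


print(sample_size(0.95, 0.02))
print(sample_size(0.95, 0.02, population=2_000_000))
print(sample_size(0.99, 0.02))
```

Under these assumptions, a 95% confidence level with a margin of plus or minus 2 points calls for roughly 2,400 randomly drawn documents, and that number barely changes with the size of the collection. This is why a sample in the low thousands can be defensible even for a corpus of millions.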

Potential Stumbling Blocks and Pitfalls of Predictive Coding
Litigants planning to use predictive coding should be aware of potential pitfalls that could render the practice either more costly or less appropriate than manual review or keyword-driven review. For example, predictive coding may be inappropriate in a case that does not involve a sufficiently large body of documents. If the receiving party is dissatisfied with the results of the predictive coding, the producing party may face a motion to compel a more traditional document review methodology, thereby eliminating any cost savings. The danger of such a motion is especially high now, when predictive coding is in its earliest stages and best practices have not yet been developed. Where the corpus of documents contains highly sensitive information, a full manual review of any documents automatically selected for production may also be required to reduce the likelihood of damaging disclosure. This may entail significantly greater expense than keyword-driven reviews. Finally, predictive coding is not presently suitable for files that are not primarily text-based, such as video or audio files, necessitating the continued manual review of those materials.

As the amount of electronically stored information held by companies continues to grow at an exponential pace, widespread dissatisfaction with traditional manual and keyword review will likely lead to even greater use of predictive coding in 2013. This transition will offer cost savings for some, and headaches for others. As predictive coding grows, so too will litigation concerning predictive coding’s appropriate use and methodology. But the potential for significant cost savings is undeniable for large-scale reviews. Cost-conscious litigants in document-intensive cases would be wise to consider predictive coding as one tool to rein in growing e-discovery costs.