Changes to the Federal Rules of Civil Procedure in December 2006 led to an explosion in the amount of electronically stored information that needs to be found and turned over in civil litigation in the United States. Traditional manual review approaches (rooms full of low-paid lawyers and paralegals reading paper documents) have collapsed under this burden, spawning a multibillion-dollar electronic discovery (e-discovery) software and services industry. Information retrieval technology, particularly supervised machine learning for text classification, plays a pivotal role in that industry.
I will review the major technological and process challenges in e-discovery, the ways in which machine learning has been brought to bear on these challenges, and results from benchmarking efforts in this area, in particular the NIST TREC Legal Track. I will also outline a new theoretical framework for studying supervised learning algorithms, Finite Population Annotation (FPA). FPA was inspired by the technical and legal context of the e-discovery setting, but is arguably an appropriate model for a range of practical applications of active and transductive learning.