And it looks like we are live from the MIT Media Lab. Welcome, everybody, in the room and up there in internet land, and all of you who will be watching this video in the future as part of our curriculum, legal hackathons, and other venues. Welcome to the first kickoff talk of the new MIT Computational Law lecture series. And really, we couldn't have lucked out with a better speaker than Jason R. Baron, who has been a pioneer, as I'm sure many of you in the room know, in the burgeoning area of e-discovery. At least in my group, the Human Dynamics group — some researchers from the group are here — where we do big data analytics, this area of e-discovery is thought of as the part of the law that has really been first to adopt and adapt and embrace and apply the new science of data, this emerging discipline you could call computational social science.

Within economics, and certainly within marketing and within health care, there's an emergence of the use of computational data science, as opposed to, let's say, anything that's merely quantitative and analytic. When I was in college, we certainly did surveys and ran regression analyses. Quantitative is not new. Something is new with data science: the application of repeatable patterns that are themselves based upon models discerned from the data, as opposed to models that someone comes up with — say, a model of a rational economic actor, or a model in psychology of an id and superego, or a model, within the law, of an ordinary reasonable person. There are many models that drive different fields of practice that have corresponding social-science academic overlays. In data science, one of the things I've noted is different in my group is that we frequently derive the models from the data. Professor Sandy Pentland, who runs our group, calls that reality mining. And we've gotten extremely good results in terms of its descriptive power over the phenomena, whether economic, social, or otherwise behavioral — and also its predictive power.

So I must admit, I come from a background that's commercial and transactional and also, to some extent, I suppose, legislative — that was my area of practice when I was in the law — and hacky, interested in creating tools and technologies. It would be a bad day for me if I were involved in litigation. Nonetheless, litigation is the first thing everybody here at MIT thinks of and asks me about when they want to know what computational law is and what is happening at law.mit.edu. So it's with great anticipation and great pleasure that I introduce to you Jason Baron, to talk about the path to the emerging field of e-discovery; to fill in some gaps and connect some dots for those of us who are not familiar with the current state of the art; to tell the story of how we got here; to give us a look, from the shoulders of a giant, at the shape of the law today in the area of analytics applied to e-discovery and litigation; and then a look over the horizon at what's coming. I hope this will be an enriching source for researchers here in the Media Lab and for our colleagues at MIT and in the law.mit.edu computational law program as we think about what to research, where the trends are going, and the hard challenges. I also hope it'll be useful to those of you in practice. So with that, welcome to the MIT Media Lab, Jason — so glad you're here. Please take it away.

Thank you so much, Dazza.
This fulfills a check on my life checklist. My late father was at MIT for 40 years in the aeronautical engineering department. I grew up at MIT, and I am just absolutely delighted to be able to give this lecture here at Dazza's invitation. I must say, for everybody in the MIT community: Dazza talked about standing on the shoulders of giants. I am not a giant. I'm exactly one smoot in height. You all know what that means, or you can look it up.

OK, so: the path to predictive coding in e-discovery search. What some of us have been trying to do, in an e-discovery bubble — a small part of the legal profession, but hopefully one inflating at a rapid pace — is to talk about lawyers being smarter, more analytical, more quantitative. What better place than MIT for a lecture along these lines? I graduated from law school in 1980. When I came out of law school, I worked at the Justice Department doing litigation for the National Archives, and my early years as a trial lawyer were years where I basically opened boxes. Discovery in civil litigation, in a large case, was 100 boxes, or maybe 1,000 boxes, or maybe a team of us would go to a warehouse and look through boxes of hard-copy documents. There was no digital world of law as such. But fast-forward through the 1990s — the introduction of the internet, email, the web, the networked world we're in, and larger and larger volumes of data — we all live in a digital world, and lawyers have needed to adapt to that. They've done that by admitting that the amount of data in the world is doubling every couple of years. We're gonna see an acceleration of that pace given the internet of things, with smart devices in all corners of our houses, our cars, the world we're in, feeding data, which will be evidence in civil litigation — from products liability cases to accident cases to employment cases looking at Facebook data. Every kind of social media data, every kind of database, is potentially evidence for use in civil litigation. I'm gonna put aside criminal law for this lecture, because I've spent my 37 years of practice in the civil litigation world.

We are essentially in an infinite world. No one, even at MIT, viscerally knows the difference between an exabyte, a petabyte, and a yottabyte. We don't live in that world; we can look at the numbers exponentially, but we don't feel it. And lawyers have a difficulty: when they get to be as old as I am, senior lawyers in law firms don't know the difference between a hundred boxes and an exabyte or a yottabyte. It's just more — but it's a lot more, and it needs new techniques so that we can search for relevant documents in the vast galaxies, the vast spaces, that are out there. This slide is a dinosaur. This is the last time, I hope, that an investigation of 350 billion pages is done by Boolean searches. We're gonna talk about keywords and Boolean searches in a moment. But here, a large number of contract attorneys at low wages were tasked to go look at the results of Boolean searches against a large database — to look for hits, to look for relevant documents. We can do better than that; we're smarter than that. And we're gonna talk about the path from keywords to new methods. It's a new reality, given all of this data. There's a term I'm going to use today that you all should know: ESI, electronically stored information.
ESI was defined in the 2006 Federal Rules of Civil Procedure for the first time. Up till then, from 1934 to 2006, lawyers talked about documents and maybe electronic documents. But now we talk about ESI. And when we talk about data in databases — email is a semi-structured kind of database — there are worlds of data out there, and it's all evidence in lawsuits. So when I talk about ESI, it's all covered. Now, there's a minefield of lawsuits, as I've referred to it. Part of the world I live in is one where litigation demands the preservation of this ESI early. You can't wait a year into a lawsuit. You gotta have a conversation with the other side right away about all of the electronic evidence that potentially will yield relevant documents down the road. And so e-discovery has been really, really important. In fact, less than 1% of civil litigation goes to trial. It's all discovery, settlement, dismissals of cases. Courts will impose sanctions if we lose ESI, so we need to preserve it. We need to know how to search it. With hindsight, sometimes courts will be upset with parties that didn't preserve the information. And increasingly they expect lawyers to understand how to run advanced search techniques against these databases. They don't wanna wait. Courts don't wanna wait a year for manual searching through the boxes. They know that there are analytical methods out there — some of them do — and they're imposing those expectations as a higher baseline on parties.

Now, how do lawyers approach the search task in e-discovery? And how is it different from finding a restaurant tonight, which is gonna be very difficult given the crowds for Super Bowl 51 in Boston today? What constitutes the state of the art in e-discovery search, and what kinds of benchmarking efforts exist? We're gonna talk about that. We're gonna talk about the difference between Google searches and the kind of expectations that I'm under in this space. Traditional document review is that box function, folder function, turning pages. It's extremely labor-intensive. You get tired, you get bored — but people do it. And the quality of the coding of relevant versus non-relevant documents when you're doing manual review has never really been measured. It was thought to be the gold standard of the profession, but it really was not measured until the kind of research I was involved with in the last 10 years. The RAND study is a good place to take a look at that.

Now, the information retrieval task: searching the haystack — which is just another metaphor for this large universe — to find relevant needles. Not just one needle in the haystack, but all the needles. We need to find all the needles that are relevant. Unlike a restaurant that you're searching on Google tonight — you want to find, whatever, a wonderful French restaurant in Boston or Cambridge; you type in "French restaurant Boston," "French restaurant Cambridge"; you get 100,000 hits on Google, ranked by popularity; you look at the first few pages, and if you're really compulsive you'll look at a few more, but you're not gonna look at the entire long tail of hits generated on Google — my task as a lawyer under Rule 26 of the Federal Rules of Civil Procedure is to find every relevant document in the litigation. So if you have a billion documents, it's not just the first few pages. And by the way, they're not rank-ordered like on Google. You go to a corporate database — it's not rank-ordered by popularity.
They're just documents, just emails. And so the question is: how do you do a reasonable search in a world where you're expected to be complete and comprehensive? My task is harder because I need to find all the relevant emails, and I would like to find just the relevant emails and no others. I don't wanna spend my time on false positives — and we'll get to that. Email is still the 800-pound gorilla in e-discovery. It is — and I understand we've moved past email, especially here at the MIT Media Lab; there's a lot going on in the space beyond email, which seems like yesterday's generation. But in the corporate space and the government space, tremendous numbers of official messages and corporate transactions are done by email. It's a candid medium, and it's still the repository to go search.

Now, I spent my life at the National Archives, and before that at DOJ, on White House email. In 2002, what arrived on my desk was a request to produce documents — 1,726 requests — related to the RICO action, United States v. Philip Morris, the action against Philip Morris and other companies brought by the Justice Department for a racketeering conspiracy. And the last of those requests to produce said that the National Archives should consider all of the other requests to be part of its burden to go find documents. The National Archives runs all the presidential libraries, so I had to look at paper documents back to Harry Truman. But we were particularly interested — they were interested, the defendants in that case — in White House email, because Al Gore and Bill Clinton had brought that lawsuit. By the way, the lawsuit is still going on. I had to search 20 million Clinton-era email records. How did I do it? Well, I did it the same way that lawyers in the United States, still in 2017, approach the task: dream up keywords. So I dreamed up some keywords to go search against the database, had archivists and IT people run them, added some other, noisier terms that might generate false positives after discussions with the other side, and reported back.

So what kind of terms? Well, you can see on the left side: tobacco, cigarettes, smoking, tar, nicotine. These are the kind of things that anyone in this audience, anybody who's watching, might dream up for a keyword search. And then there were other terms suggested by the defendants. The thing about these other terms is that they led to interesting results — which is why I have Julie Andrews on the slide, and Al Gore in Marlboro country. What I quickly realized is that there's a tremendous number of false positives in the search space using keywords. Keywords have tremendous limitations. Marlboro: if you type that into the White House email database, you get a lot of messages about Upper Marlboro, Maryland, where some people in the DC area live. MSA didn't generate just emails about the Master Settlement Agreement; it hit medical savings accounts and metropolitan statistical areas. All sorts of different terms come up. And my favorite was the Tobacco Institute, when I was asked by the defendants to type in TI. What I got was a Spanish pronoun and a lot of Julie Andrews references: do, re, mi, fa, so, la, ti, do. And so in my world of constructing Boolean search strings — which is still being done, in the Lehman Brothers investigation and by other investigators today — I realized that this is a losing game. It has limitations.
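A minimal sketch of the over-inclusiveness problem just described, using an invented mini-corpus: a literal whole-word match on "marlboro" or "ti" happily hits documents that have nothing to do with tobacco. The documents and the `keyword_hits` helper are hypothetical, for illustration only.

```python
# Hypothetical mini-corpus illustrating keyword false positives.
import re

documents = {
    1: "Meeting moved to Upper Marlboro, Maryland next Tuesday.",
    2: "Marlboro advertising restrictions under the settlement.",
    3: "Ti, a drink with jam and bread -- sing-along lyrics attached.",
    4: "The Tobacco Institute (TI) briefing is attached for review.",
}

def keyword_hits(docs, term):
    """Return ids of documents containing `term` as a whole word, any case."""
    pattern = re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
    return [doc_id for doc_id, text in docs.items() if pattern.search(text)]

print(keyword_hits(documents, "marlboro"))  # [1, 2] -- doc 1 is a Maryland town
print(keyword_hits(documents, "ti"))        # [3, 4] -- doc 3 is song lyrics
```

Precision collapses as soon as a term has everyday meanings the query drafter didn't anticipate, and no amount of negotiating ANDs and ORs fully fixes that.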
It's both over-inclusive and under-inclusive. The number of false positives generated is huge, and it doesn't get a key document that might not contain a keyword. So you start with 20 million. You go to 200,000 hits based on the keywords. You find 100,000 relevant emails after a six-month search. You produce 80,000 to the other side. You create a privilege log. The problem is that this 200,000-out-of-20-million process doesn't scale when you're up at billions of objects. Here's another way of looking at the process — I'm glad it's all on the screen there. It basically took six months and 25 people to go through 20 million emails, come up with 200,000 hits, and find 100,000 relevant documents. This kind of process is done every day in litigation. Note, however, something that I ask my students to try to glom onto here: if you have 200,000 hits, what happened to the 19,800,000? That's dark matter you haven't looked at. Today, in 2017, based on work that we've done and the rising expectations of the legal field, there's a quality-control process: you go sample against the discard pile, the 19,800,000. (A sketch of how that estimate works appears at the end of this passage.) Back in 2002, I didn't do that. And there are a lot of QC measures that we now employ.

Okay, so is the world growing? Of course it is. There were 32 million emails from the Clinton White House in the automated records management system. We're up to 200 million under George W. Bush, 300 million under President Obama, plus a lot of other types of media. Attachments make the volumes much, much greater than even what the curve here shows. And as of 2019, the entire government is gonna be preserving its emails, and all electronic records that are permanent in nature are going to be transferred to the Archives in digital form. There'll be no more paper at the National Archives. So how accessible will these documents be? I'm gonna leave that to the very last thing I say today, as a sidebar issue that I am attached to.

If we actually had to look at a billion documents, a billion emails, manually, it would take 54 years at 50 documents a minute. We can't do manual review. Any judge in the country who thinks that manual processes work for these large volumes is crazy. But even if you look at 1%, it's too much — too much cost. So even the 1% that keyword searching generated for me is too much for the future. We need to move on to other methods, and we've been exploring this space for the last decade. Elections come around fast; volumes are gonna be dozens of times larger in 2020. Maybe my next lecture at MIT will be then; we'll come back and see how many yottabytes are in a typical collection.

So: myth, hype, and reality — I'll do this very fast. Manual review was thought to be the gold standard. There are studies going back to Blair and Maron that say lawyers think they do very well with keywords when they don't. The information retrieval problem is hard. It is hard in trying to parse text; it's harder with audio and video and every form of evidence that lawyers have to deal with. There's a vast field of information retrieval research that exists, but someone had to put it together — to marry up PhDs in computer science and lawyers to have a smart conversation. And I've been part of that movement. There are ambiguities in language. I won't go through it, but for any term that you can think of as a keyword, you can think of a bag of words that are like it, and you can also think of ambiguities of that term. "George Bush" is fundamentally ambiguous: is this Bush 41 or Bush 43?
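Returning to the discard-pile sampling mentioned above: a minimal sketch, with invented numbers, of how reviewers read a simple random sample of the documents the keywords never retrieved, then project the sampled rate onto the whole pile to estimate how much was left behind.

```python
# Simulated discard-pile ("elusion") sampling; all quantities are invented.
import math
import random

random.seed(42)

DISCARD_PILE_SIZE = 19_800_000   # documents the keyword search never retrieved
TRUE_RELEVANCE_RATE = 0.005      # unknown in practice; assumed for the demo
SAMPLE_SIZE = 2_000              # documents a human team will actually review

# Reviewers code each sampled document relevant (True) or not (False).
sample = [random.random() < TRUE_RELEVANCE_RATE for _ in range(SAMPLE_SIZE)]
elusion_rate = sum(sample) / SAMPLE_SIZE

# Project the sampled rate onto the whole discard pile, with a rough
# 95% margin of error from the normal approximation.
estimated_missed = elusion_rate * DISCARD_PILE_SIZE
moe = 1.96 * math.sqrt(elusion_rate * (1 - elusion_rate) / SAMPLE_SIZE)
print(f"sampled elusion rate: {elusion_rate:.3%} (+/- {moe:.3%})")
print(f"estimated relevant documents left behind: {estimated_missed:,.0f}")
```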
Any word you pick is going to be filled with ambiguity. There are a lot of difficulties with keyword-based information retrieval as we know it: misspellings, OCR performing not so well, abbreviations. The Enron dataset has all sorts of codes in it, where people talk in code speak, and a priori you wouldn't know the keywords involved. Those are all issues. Judges started to understand this around 2008. I'll go past these slides — they can be read later. Judge Grimm said, hey, based on the research going on, keyword searches have limitations. Judge Facciola said it would be truly to go where angels fear to tread to try to dream up keywords and combinations. And Judge Peck, in an important decision, issued basically a wake-up call to the profession: we gotta do better. And this all happened around 2008. So I hope I've convinced you that this approach is flawed. It's still in use today throughout the United States and around the world, but there is an alternative set of analytic procedures out there. How do we efficiently search these volumes? How do we improve what's called recall and precision? I'm gonna get to that in a second. What alternatives to keyword searching exist, and how do we benchmark them?

Okay. So when I go to bed at night, and I'm on a big case involving millions of documents, what I want from my algorithm is to find relevant documents and only relevant documents. I wanna be in the top-left of this quadrant or the bottom-right. I don't wanna retrieve anything that is not relevant, and I want the algorithm to tell me all the relevant stuff out of the billion that occupy the universe. I'll deal with false positives — it's inefficient, but I'll go through them. But what I don't want, what keeps me awake, is a false negative: a relevant document that is not retrieved. That's a bad thing. So here, for the computational people at MIT: this numerator and denominator constitute advanced thinking in the law. And trust me, judges have problems with denominators. They see two numerators of different sizes, they don't look at the denominators, and they come up with crazy decisions. I've been citing an example, but maybe I'll do that offline, since this is being recorded.

Recall: the number of relevant documents retrieved over the total number of relevant documents. If you have 10 relevant documents in the universe and you retrieve five, that's 50% recall. It's a measure of accuracy, of completeness. Precision is the opposite concept: the number of relevant documents retrieved over the total number that you have retrieved. So if you retrieved 100 documents and had to go through all of them to find those five, that's five out of 100 — a very low rate of precision; you're very inefficient in finding those five documents. But then again, if there were only five relevant documents in the collection, you'd have 100% recall. Now, one might ask, how do you know? You can do it by sampling.

Here is a set of slides that show differences in recall and precision. The universe is the square, the rectangle. The red circle is the set of relevant documents in the universe. The gray circle is the search that you've conducted, under whatever algorithm you have. And what it shows is that recall is about 30% — the search captured about 30% of the red circle. But the precision was low, because there are a lot of documents outside the red circle that you have to go through.
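Those definitions translate directly into set arithmetic. A minimal sketch with made-up document ids, matching the 50%-recall, 5%-precision example above:

```python
# Recall and precision over sets of document ids (ids are made up).
def recall(retrieved: set, relevant: set) -> float:
    """Relevant documents retrieved / all relevant documents."""
    return len(retrieved & relevant) / len(relevant)

def precision(retrieved: set, relevant: set) -> float:
    """Relevant documents retrieved / all retrieved documents."""
    return len(retrieved & relevant) / len(retrieved)

relevant = set(range(1, 11))                        # 10 relevant docs exist
retrieved = {1, 2, 3, 4, 5} | set(range(100, 195))  # 5 found + 95 false positives

print(recall(retrieved, relevant))     # 0.5  -- half the needles were found
print(precision(retrieved, relevant))  # 0.05 -- 5 relevant of 100 retrieved
```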
And there's something called the F1 measure, the harmonic mean of recall and precision. That's 12% here, which is not really that great. Here is another example where recall and precision are both middling: you're getting some of the documents, but you're also having to go through a lot of documents. If both numbers are low, the F1 measure is low. Where you want to be is sort of like a lunar eclipse: the circles come together. You want your algorithm to find as many of the relevant documents as exist, but not give you false positives — no noise, all signal. Here the recall is very high, 76%; precision is even higher, 84%; and the harmonic mean comes to 80%. That's where you want to be in the space, under any method that can be employed, anything in the black box.

At TREC we studied this. TREC, the Text REtrieval Conference, has run since 1992. In 2006, we became part of the project by introducing a legal track, with hypothetical complaints and hypothetical queries, which are requests to produce. I had lawyers negotiating Boolean searches against each other — I'll show you an example. And then we used that as a baseline against which to look at other methods emerging in the space, including various sorts of analytics that now go under the terms predictive coding and technology-assisted review. Where I started, I wanted to really understand the difference between Boolean and these other methods. That may not have been the only, or the right, question to ask, but we used a tobacco-settlement universe of 7 million documents, and we used an Enron database. We invited the world's vendors and the world's academics to come play in the sandbox of the TREC legal track. The track ran for five years.

This is an example of a complaint, where the request for production asked for all documents concerning the high cost of fertilizers. We had two lawyers negotiate a Boolean query as the baseline to compare the different analytics methods against. One lawyer would propose some terms with ANDs and ORs, various ways to parse the story; the rejoinder from the other lawyer would be, no, you missed some terms, you really need additional terms. Then they came to a final query that was a consensus of the two sides, and running that query against the tobacco-settlement database produced 3,078 documents. What we wanted to learn is whether those were the only relevant documents, or whether other methods would produce other documents. And what we learned in the very first year of TREC was something that was startling to the profession but shouldn't have been, because it approximately replicated the Blair and Maron study of about 25 years earlier: the Boolean searches that had been negotiated by lawyers did well on certain topics — there were many, many topics based on the fictional complaints — but there were many other ways to find relevant documents, and the other analytical methods found additional unique documents. And a manual search on top of the original searches found unique documents, too. Another way to look at it: overall, about 70% of the relevant documents across all of the topics, all of the hypothetical complaints, were found by some technique other than keywords. So this was a very good quantitative benchmark saying: we've got to be smarter in this space. We've got to use the other methods that are out there.
Some topics can be very precise, where you use a keyword and get all the relevant documents. But many were missed. So we did comparisons, and there's a large space of dark matter — relevant documents that you only get by other analytical methods. I'm gonna go past these topics, because that's not where I wanna be in this talk; I wanna talk about predictive coding. But for present purposes, we came up with tables of recall and precision, and the rates were much higher using certain smart techniques than with the Boolean searches. We also found that it's all over the map: the various vendors and academics had very different rates of recall and precision against the topics. It's very difficult to do a linear regression here and say there's some line. It's basically all over the place, which means you have to be very careful, very smart, with respect to the methods used. And then we did other kinds of graphs, so you can see here that using a topic-authority expert along the way, to ask questions about relevance, generally increases the accuracy of what is going on — the F1 measure. We also had an appeal process for relevance judgments, and the more appeals, and the more time spent on appeals, the better and richer the relevant sets in the collection.

The bottom line on TREC is that beating Boolean is hard work. We know there's dark matter out there in terms of the true number of relevant documents, but traditional keyword search doesn't cut it. The 2009 results especially showed substantial gains, with recall and precision up at 70%-80% levels rather than 30% or 40%, based on certain techniques used. And so the legal profession started waking up around 2008, 2009, to the possibility of other types of searches. There's a large space of models that I don't have time to cover — this lecture can't be on every type of model — but I will talk about one black-box model here that has been really interesting as an alternative to keyword searching. Anybody who's interested should go to the Sedona Conference commentary for a taxonomy of different search methods.

Predictive coding, as we now understand it, is essentially a clustering of documents. It uses clustering techniques and an iterative process, with humans in the loop at the front end looking at seed sets — an initial set of documents — to then have software look at the entire collection. Now, there's a black-box element to this. I very rarely put the next two slides up when I lecture to lawyers, because frankly — and you know who you are out there — we all struggle with things like vector-space technologies. But let's just go; I have one preliminary slide, and then we'll get to the two slides I was talking about. First, predictive coding defined: a process for prioritizing or coding a collection of electronic documents using a computer system that harnesses human judgment. So on the front end, we're deciding, for a small subset of documents, whether they're relevant or not, and then feeding that into a computer, using a black-box algorithm, to generate a ranking of all of the billion documents in the collection.
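A minimal sketch of that front-end workflow, assuming scikit-learn is available. The six-document collection, the seed-set labels, and the choice of a TF-IDF logistic-regression model are all invented for illustration; real predictive-coding tools use their own proprietary algorithms.

```python
# Toy predictive-coding pass: humans code a seed set, a model learns from
# it, and the software ranks the whole collection by predicted relevance.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

collection = [
    "pricing memo on fertilizer costs for Q3",
    "fantasy football league standings",
    "high cost of fertilizers discussed with suppliers",
    "lunch order for the team meeting",
    "supplier negotiation over fertilizer price increases",
    "holiday schedule reminder",
]

seed_indices = [0, 1, 3, 4]      # documents a human reviewer has coded
seed_labels = [1, 0, 0, 1]       # 1 = relevant, 0 = not relevant

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(collection)

model = LogisticRegression().fit(X[seed_indices], seed_labels)

# Rank every document by the model's predicted probability of relevance.
scores = model.predict_proba(X)[:, 1]
for score, doc in sorted(zip(scores, collection), reverse=True):
    print(f"{score:.2f}  {doc}")
```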
We use it in a way that is supervised, that is active: humans are involved in judgment, looking at samples, training the system on the front end, and not just letting the software do its thing alone. Here's the first of the two slides that I don't usually present. I wouldn't purport to be an expert where one should really turn to a PhD in information retrieval, but all of you, I think, can understand the concept of a vector space, where a document is a vector of its words. So if you have the Gettysburg Address, you have all of the words in the address as a vector, and the dimensional space is one where a1 through a100 are the 100 words in that document — and some words have greater frequency than others. Basically, every algorithm out there that has been shown to do very, very well at relevance ranking in this space is some variation on the vector space model. Things like latent semantic indexing, or probabilistic latent semantic indexing, are all based on the idea of term frequency. So you have basic document-to-document similarity: you look at a document and all its words, you look at another document, and another, and you produce clusters of the documents that are closest together in the multi-dimensional space, essentially using cosines between vectors. (There's a small sketch of this at the end of this passage.) That's all I'm gonna say about it, because I am sure everyone at MIT, and many people watching, could do a better job of explaining it if we had an hour. But it is a black-box algorithm, and a very interesting issue in the law is whether courts need to have this all explained to them in an evidentiary proceeding, or whether they can just assume that the black box works if it's producing better results than keywords.

We are also attuned to all sorts of other methods that look at metadata in certain ways, so you can see spikes in conversations between people. You can evaluate data sets visually, to see what's going on within the data set. What's really cool in the legal space is that in e-discovery there's a multitude of methods being used beyond keywords to find relevant evidence in large data sets. And it's a challenge: some of these methods don't scale very well. So maybe the MIT graduates of the future can find a scalable way to visualize a billion objects rather than 30,000. My good friend Ralph Losey has a model for predictive coding. It's basically a workflow process where you use a variety of different search methods, applying analytical techniques in an iterative way — training the system and going back over and over again — so that it's stable, so that it's finding relevant documents in a particular way. I'm not gonna talk any more about that; we can point people to Ralph's work.

Maura Grossman and Gordon Cormack wrote what many believe is a seminal article based on the TREC legal track. They wrote it in 2011, saying that technology-assisted review — predictive coding — can be as accurate as, or more accurate than, human review, and obviously quicker. You can go through a million documents using software in a weekend, or a week, rather than spending six months or a year on a manual process. So computers — John Henry versus the machine — are clearly a lot more efficient. The question is: are they more accurate?
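Before turning to the accuracy results, here is the vector-space sketch promised above: a bare-bones, pure-Python version in which each document is a term-frequency vector and document-to-document similarity is the cosine between vectors. Real systems layer weighting (TF-IDF) and dimensionality reduction (latent semantic indexing) on top of this skeleton.

```python
# Term-frequency vectors and cosine similarity, the core of the vector
# space model described above.
import math
from collections import Counter

def tf_vector(text: str) -> Counter:
    """Sparse term-frequency vector: word -> count."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)

d1 = tf_vector("four score and seven years ago our fathers brought forth")
d2 = tf_vector("seven years ago our fathers founded this nation")
d3 = tf_vector("quarterly fertilizer pricing memo")

print(cosine(d1, d2))  # high: shared vocabulary, close together in the space
print(cosine(d1, d3))  # 0.0: no words in common, orthogonal vectors
```

Clustering then amounts to grouping documents whose pairwise cosines are high.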
What Maura and Gordon showed in their paper — this is the key graph in the study — is that, looking at their results from Waterloo and at other results from the TREC legal track in a special experiment they performed, the recall and precision rates were better on average than manual review. With a lot of caveats, their article was pointed to as a step forward for the profession. And Judge Peck, whom I mentioned before, came out with a decision in 2012, Da Silva Moore, that is a landmark in the law. It's an inflection point, which said that in his court he will accept what Maura and Gordon found, what the TREC legal track found, what a lot of us in research have found: that technology-assisted review, using predictive coding and analytical techniques, is a valid, reasonable approach to discovery. It's not the last word, and there are lots of open issues in the space, but it is definitely something that he put his tag on. He put a protocol together on supervised learning. He talked about issue tags, which leads me to believe that there's a lot of conversation we could have, in an information governance space, about analytics using issues as a relevance measure against collections.

And post-Da Silva Moore, we have two approaches that competed for the last few years in people's minds. When you select documents for the next iteration of training, do you basically let the system pick random documents, so there's no human bias attached — and from there, marking which of the documents the system throws at you are relevant, you train the system? Or do you put a lot of judgment in on the front end: do you feed the algorithm documents that are already known to be relevant — some could even be privileged, but very special documents? There was debate about bias versus accuracy and how those two methods play out. As of 2011, when the legal track ended, the questions were: should we be using seed sets for training? How should those seed sets be worked up, random versus judgmental? Should lawyers rely on random sampling, or on known relevant documents? That was the animating question.

What Maura and Gordon have found more recently — and I will defer to Maura for the research in the second lecture in this series, which I hope will happen if Dazza invites her — is that, based on their research and the research of others, a continuous active learning process works better. There's not as much reliance on randomness; you have the software picking documents from the start, and you throw documents in at the beginning, but you basically do away with training seed sets as such, and you have the software work continuously to throw out, based on its current knowledge of relevance, a set of documents for further examination. What they have found using CAL — continuous active learning — is that there's an efficiency here: the recall curves on the graph go up very fast in terms of the number of documents reviewed and then level out, compared with other methods. I don't have time to go through it here, but it's a very interesting result. There are still a lot of open issues to explore in the space — certainly, again, not the last word, because this is a new, emergent phenomenon, and people are looking at ways to tweak these methods. Hopefully the MIT community will be excited about that.
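A toy version of a CAL-style loop as just described, assuming scikit-learn; the corpus, the simulated reviewer (`truth`), and the stopping behavior (the loop simply runs the collection dry) are invented. Production CAL protocols, such as Grossman and Cormack's, add principled stopping rules and much more.

```python
# Schematic continuous-active-learning loop: retrain on everything coded
# so far, always surface the highest-scoring uncoded document next.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

collection = [
    "fertilizer pricing memo for suppliers",
    "fantasy football standings",
    "high cost of fertilizers, supplier call notes",
    "holiday party logistics",
    "fertilizer cost increases and hedging",
    "printer out of toner again",
]
truth = [1, 0, 1, 0, 1, 0]        # stand-in for human relevance judgments

X = TfidfVectorizer().fit_transform(collection)
coded = {0: truth[0]}             # begin with a single coded document
while len(coded) < len(collection):
    ids = sorted(coded)
    labels = [coded[i] for i in ids]
    uncoded = [i for i in range(len(collection)) if i not in coded]
    if len(set(labels)) > 1:      # need both classes before training
        model = LogisticRegression().fit(X[ids], labels)
        scores = model.predict_proba(X[uncoded])[:, 1]
        pick = uncoded[max(range(len(uncoded)), key=lambda k: scores[k])]
    else:
        pick = uncoded[0]         # no signal yet: just take the next one
    coded[pick] = truth[pick]     # the "reviewer" codes the surfaced doc
    print(f"reviewed doc {pick}: relevant={bool(truth[pick])}")
```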
I have faith in analytics. I have faith in the black box — we know we can do better than keywords, than manual search. Very few judges and lawyers out there are fully comfortable with these kinds of techniques; to a large extent they rely on a vendor community and legal techies to explain what's going on. The black box is still a mystery, but we are in the early days. We're really in the first decade past the ESI notion — that electronic stuff is important. And so we are on a journey toward better algorithms in the space, and I've been very happy to be part of that. I think the way to advance the cause is to have communities talk to each other, because we do talk in different languages. It's very important that the MIT community, and the community of PhDs in computer science and information science, have conversations with lawyers about the open issues in e-discovery — what kinds of methods may prove valuable in looking at larger and larger data sets, which is going to be the story of the rest of the 21st century.

To that end, I've been very happy to spend the last two and a half years working on a book called Perspectives on Predictive Coding, which is an attempt to corral 30 or so authors to talk about these methods in various contexts, whether antitrust, or the defense side of the bar, or the plaintiff's side of the bar. Maura and Gordon have an original article in there. I recommend that people interested in this check it out. It's an American Bar Association book, and I get no royalties from it; I'm just saying that for the advancement of knowledge.

Lastly, I have been privileged to work on a series of workshops that try to combine PhD knowledge with lawyer knowledge. We've been doing this for 10 years, around the world, and the next one is in London on June 12th, as a workshop at the AI and law conference known as ICAIL. Our workshop is on something very close to my heart, in terms of where I want to spend my next decade or so: using these analytical techniques to open up what I call dark archives. What we're going to look at at the workshop is how you use analytics of the type we're talking about — vector-space stuff — to go into large public-record collections, like White House email, that would otherwise not be open for many, many decades, except maybe through a lawsuit or a FOIA request for some of it. How do you open all of that up, if archivists insist on going through it page by page? The problem that we found in our space — and it's a problem in law and in the privacy world of the EU — is that the documents are filled with sensitivities. So the workshop is going to be about how to filter and extract sensitive information from large collections: privileged documents, personal documents, documents with Social Security numbers, medical information, criminal information, telephone numbers and passport numbers — not only strings that can be matched by regular expressions, but information that is contextually personal. In an archival sense, how do you go into vast collections of records and pull out the sensitive material, so that what remains is essentially free of PII, personally identifiable information, and we can have access to those large collections? That's what the workshop is going to be on.
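A minimal sketch of the two layers of that filtering problem: regular expressions catch identifier-shaped strings, but deciding that an unformatted number or a name is sensitive takes context, which is where the analytics come in. The patterns below are simplified, hypothetical US formats, not production redaction rules.

```python
# Pattern-based first pass at PII detection; context must come later.
import re

PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "passport": re.compile(r"\b[A-Z]\d{8}\b"),  # one common letter+digits style
}

def flag_sensitive(text: str):
    """Return (kind, matched string) pairs for identifier-shaped strings."""
    return [(kind, m.group()) for kind, rx in PATTERNS.items()
            for m in rx.finditer(text)]

record = "Call 617-555-0143 re: claimant 042-68-4425, passport C12345678."
print(flag_sensitive(record))
# [('ssn', '042-68-4425'), ('phone', '617-555-0143'), ('passport', 'C12345678')]
```

What the regexes cannot see — that a bare nine-digit number is an SSN only in a benefits file, or that a name plus a diagnosis is sensitive — is exactly the contextual judgment the analytical methods are being asked to learn.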
I invite people who are watching this, and of course those in this room, to come to London on June 12th and participate, and be part of that conversation, which is now getting going. The last slide I have here is references. There's a tremendous amount of work that's been done: we have the book, we have articles and law reviews that I point people to, the RAND study, the TREC legal track, and the Sedona Conference, which has been a rich source of information. There are a lot of us who would love to talk to people about these kinds of issues; we're excited about it. So again, I very much appreciate the opportunity to give this very fast lecture here. I will turn it over to the room for questions, and to Dazza. And please feel free to contact me if you're outside this room. Thanks very much.

Any thoughts or questions on all that?

I'm curious about other kinds of databases — not just text: images, audio files, sounds. Have you tried to mine these kinds of new databases using new techniques such as machine learning?

Okay, so the question is: beyond text, there are databases that include all sorts of other types of digital objects — images, for one; it could be voice messages or video conferencing or whatever. Many types of documents have metadata associated with them, and we can look at the metadata for images, for voice messages, whatever. We can also try numerous techniques. There is certainly software out there that can convert audio messages to text, and then we can apply our techniques against the text. And to some extent we're getting much smarter about other forms of media. But as I think I said, the text issue is still so problematic for lawyers — and most of the good stuff in litigation, the gotcha documents, are emails and traditional documents and spreadsheets, more traditional stuff than you might think — that we're still concentrating on that rich lode, finding efficient ways to get at it. But you're absolutely right that as the world goes on, we're gonna need to figure out analytical techniques for all sorts of other types of digital objects.

I have a question, and it relates to the prep talk that you provided when we were with the data scientists. The last time I thought deeply about this, I was probably in my second year at law school, in the rules of evidence course, trying to understand the concept of relevance and how it plays out. What is relevant is so subjective, after all. But on the other hand, you need to machine those criteria in order to do the scoring and get all the results we're talking about here. Could you speak a little about the state of the art for expressing relevance computationally, and especially the subtle aspects, where you've got, let's say, a plaintiff or litigant that has a theory of the case? I mean, there's an almost flat, evenly applied, neutral question of whether something is relevant to proving an element. But something might also be relevant to an apparently subtle aspect of the case — admissible and generally relevant, but important because of the way the person is seeking to enter something into evidence or lay a foundation, or because of the theory of the case: the things they wanna emphasize to be persuasive, some testimony they wanna bring out, or some perspective they want to emphasize. It may be highly relevant to that. And now we're getting into things that are idiosyncratic to the litigants.
How do you begin to express relevance at the level of, almost, a neutral crosswalk of the law to data in general, and then specifically to the litigants in the case and what they find particularly relevant, for maybe even a secret litigation strategy?

Okay, well, I'm glad that was an easy one. It's always difficult when the MIT computer scientist also has a law degree. The issues are subtle, but maybe not as subtle as one would expect after hearing your setup. I do believe that one needs to be a subject-matter expert in the particular domain of law that a lawsuit is in to do an outstanding job of parsing documents for relevance. We as lawyers all become subject-matter experts in a hurry. When I was at the Justice Department, I had to learn something about nuclear fission, or statistics in a census-adjustment case, or whatever the issue was. We all have to get up to speed very fast, and we use experts to do that. So there is something to be said for being a subject-matter expert, and a lawyer who knows the legal process — taking the words of a complaint and the words of a request to produce, with a theory of the case, and figuring out what relevant documents are responsive to the case at hand. Now, there's a dimensionality-reduction aspect to it: when you negotiate these requests to produce documents, they tend to be simpler than the entire theory of the case. Give me any and all documents on this test conducted on a certain date. Keywords may fail on that, but you and I could probably parse it, knowing the complaint and knowing the context of the case, to figure out whether a document responds to that query. There's a background of context for the whole complaint, and that has to be factored in. But I do believe it's not a mysterious process in the end. People of goodwill can be trained to be subject-matter experts, to understand the context of the case, and to do a pretty good job.

Now, having said that: for a long time in the legal profession, we believed that junior lawyers were perfectly capable of parsing large collections of boxes, or even ESI, by themselves, putting documents into piles of relevant and not relevant, privileged and not privileged. And what has been measured, as part of the TREC legal track, is something called inter-assessor disagreement. It turns out, as you might think if you thought about it, that you can put a bunch of lawyers in a room and they will come to different opinions about what's relevant. In fact, you may even disagree with yourself: you started out two weeks ago saying a document's relevant, and now that you know more about the case, and you've talked to more people or learned something, another document you said was not relevant might be relevant, or vice versa. (There's a small sketch below of how that disagreement can be measured.) So it is a dynamic process, but I don't think it's a mysterious one. I think it's one that can be subjected to knowledge — to senior lawyers having a theory of a case and informing junior lawyers about it — and there's a way to control for the error process, by having a consensus measure of relevance if you have enough people in the room providing those judgments. It's very important for the software to get relevance right, because, sort of like a Frankenstein monster, the software will compound an error to a very large degree when it's trained improperly.
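A quick way to put a number on the inter-assessor disagreement just described: Cohen's kappa over two reviewers' relevance calls on the same documents, corrected for the agreement you'd expect by chance. The judgments below are invented.

```python
# Cohen's kappa for two reviewers' binary relevance calls (1 = relevant).
def cohens_kappa(a, b):
    """Chance-corrected agreement between two equal-length label lists."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a1, p_b1 = sum(a) / n, sum(b) / n
    expected = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)
    return (observed - expected) / (1 - expected)

reviewer_1 = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1]
reviewer_2 = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1]
print(cohens_kappa(reviewer_1, reviewer_2))  # ~0.4: only moderate agreement
```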
So one needs to get to a consensus relevance position early, so that the algorithm is essentially doing the right thing, subject to quality-control checks.

That's great. And just speaking to my colleagues, researchers in Human Dynamics: note the phrase "consensus relevance position" at the start of the tuning of the algorithm — really powerful.

Yeah.

May I just ask a follow-up? In your experience, when you're on projects, on cases, or in practice, how do people express a consensus relevance approach? I mean, when I was in law school, it would have been unparsable narrative — blah, blah — in Microsoft Word or on paper. How do you do it in a way that is machinable, that's computable, that can set parameters and instructions so that the machine will follow? Is there a...

Well, I think we can try to invent the methods of the future here. The way it's really done today is in document shops and in law firms, where you have a large number of people looking at particular documents, not necessarily using these fancy analytics, but there is a supervisory structure. That means there is some training at the beginning as to the context of the lawsuit, and a review — usually one-on-one, junior to senior — to see whether errors are being made by the individual. You can do that across any number of lawyers or analysts looking at a collection, and try to control for patterns of errors, and come up with further training for the people involved. That's really a human-based method; it doesn't assume there's gonna be an analytical follow-up. It becomes very important to measure the error rates of the various people, and to look at what's being done before you train an algorithm, because the consequences are much higher.

Okay. So what I got from what you said was that it's fundamentally not machinable, but it's based upon whoever the senior attorneys are — basically training the algorithms by training the people, making judgment calls: this is relevant, this is relevant, that isn't — in a certain way, pretty similar to what you would have done on paper. And part of what I've gleaned from that further — we're always talking about the future of work here, and in law, every time I go into a non-law conference, actually almost every conference, a big consternation is: is this just gonna put us out of a job? Will the robots be doing much of the processes that are currently the bread and butter in litigation and in other professional services? I think the answer you just provided, and understanding its contours, may hold something, at least partly, for the law schools and others that are training people up: what's unique about the professional service of law, and how it is that we interact with what is computational. Sorry to interrupt you.

No, no, no — I interrupted you. I would say I'm a great believer that there's some future for lawyers. Maybe my daughter goes to law school one day, and there'll be a profession for her. But I must concede that another part of me says, as sort of a Bayesian construct, if you're in a set of lawsuits that look a lot like each other — like a big corporation being sued all the time over the same kinds of documents — there would be less need for human judgment on the front end, because the algorithm has already learned on an a priori basis.
So you have priors as to what's relevant and what's not; you basically have a ground-truth set based on a prior lawsuit. So I could see examples where you can use the robot-lawyer software of the future to reduce the task and reduce the need for human intervention on the front end. However, most lawsuits have enough uniqueness about them that it is still worth having some kind of conversation between lawyers and subject-matter experts on the front end. I don't think human knowledge as such is yet assimilated in a way that software can just take off from. But with every passing year, Watson and its equivalents are getting closer to doing the things that law students and young lawyers do in terms of research. And, you know, we'll see what the future holds. I'm a great believer in AI — I'm just trying to keep it at bay for the rest of my career.

Oh, that's beautiful. Other thoughts?

Well, just picking up on that thread: you talked about establishing a good ground-truth set. So if you have a set of, say, FCPA-type learning, don't you think you can move that right over to what we would call the left side of the EDRM — to information governance? So you could kind of prevent a violation, and the corporation could catch it internally — especially with Microsoft making a big move into this space. These tools now actually exist for corporations.

Yes, I think that's a wonderful comment. And I must say that I feel terrible that I was not the first person today to mention the words information governance, because I live in that world now, on the left side of the EDRM model, and advise clients about what they're doing. The smartest corporations in the space can use analytics in precisely the way you suggest: they can take a look, either very early in anticipation of possible lawsuits or on an ongoing basis, at their large data sets, see what they have, see what might be problematic. I'm not sure the energy or the resources are there for most corporations to do that kind of thing, but once there is a hint of a lawsuit, there's a growing trend to use early case assessment to take a look at how vulnerable the corporation is. And if you have knowledge from past lawsuits about what kinds of problematic documents you have, that will aid it. I know that my partner Bennett Borden at Drinker Biddle has successfully gone in early to convince clients that there are sets of documents out there and that settlement is worth discussing early — and sometimes to exonerate clients that may be under some kind of cloud. So I very much support it, and I very much support the idea of using all of these fancy methods across every domain of law. That's what we try to do in our own practice. We try to convince lawyers in other places, like mergers and acquisitions or employment law, to think about using analytics to get greater visibility into their data sets. If the MIT community can help us achieve greater visibility at lesser cost, then this kind of lecture has accomplished something.

May I ask you to extrapolate upon that just a little bit?

I thought I was done!

Okay — oh, we got a four or five minute late start; may we have, whatever, consensus to go a little later? So: what you just said really is at the heart of what we focus on in this computational law program, which is more or less applied analytics in business contexts.
So, for example, the enterprise systems that Microsoft sells — some of which may have just been alluded to — how we set them up for contract workflow, or sizing up risk to get into a new market, or all sorts of decision support: everyday stuff is really where we live. And it seems to me that the valuable resource, the investment made in these analytics engines that have been applied to subsets of enterprise data for litigation purposes, could in fact be repurposed for all kinds of business purposes. Some of them could be centered directly on new revenue, cost containment, and risk containment — a fundamental reapplication to other purposes. But I think the big news for the legal track would be applying them preventatively, which I think was really the center of the point the questioner was presenting: thinking about the life cycle of these cases, and utilizing these methods as part of the regular course of business to flag and identify potential problems — litigation risks, for example — and maybe take action up front to remediate or avoid the problem. Almost like we do in security, where there's some pattern of behavior that, even if we don't know why, may be correlated with breaches or other kinds of losses, and you can allocate resources — what patches to apply, what updates to do, where we need a different kind of crypto — just based on analyzing patterns of behavior, once you get good at it. I wonder if we couldn't take a page from the security and adaptive-practices book, where security teams apply scarce resources based on data, bring it over to the legal track, and prevent litigation, or reduce its impact, with the same tools. What do you think?

So, first question: are we still streaming or not? If this is off the record, we can wrap it up and then have a secret session in chambers.

Yeah, why don't we continue this over pizza. But also, if you tell me when we're off, I will tell you what I would say off the record. Okay. And let that be a lesson to everybody in TV land about why it's so great to be in person, where we can have conversations and also form relationships — and I have to say it's great to now have a relationship with you. You're welcome at MIT, and I hope you can come back and participate in our programs and continue to educate us.

And I do wanna say, I'm happy to talk to anyone out there about the question that you asked. It's a very interesting one, and I've given a talk at Georgetown recently about the world we're heading into in terms of security. I can't emphasize enough that the kinds of issues I've been talking about apply across all legal domains. So by all means, we would love to have the MIT community involved in that conversation.

Okay, you don't have to ask twice, right? So we'll be in touch on that. Until next time: thanks for tuning in. Find out about the next talk at law.mit.edu. And if you have any questions and you weren't able to participate live, go ahead and use the form; when we collect a few, I'll send them on to Jason, and perhaps — this does happen frequently after talks — we can maintain the dialogue asynchronously online. Thanks very much. Bye.