Hello, this is Dazza Greenwood, and I realized as I was editing this video for law.mit.edu that a little preface would be in order to let you know, number one, that fundamentally this is a spotlight on not just the computational law research that's happening now in the research units that law.mit.edu is part of, namely the Media Lab's Human Dynamics Group and Connection Science, both of which are somewhat interdisciplinary and cross-departmental themselves, but more specifically on the area of generative AI for law. So not the law of generative AI, but how generative AI is now being used for law and for legal processes. The other thing is, as I'm editing this, I'm realizing that my attempt to produce video like this from my local park was only nominally successful, so I apologize in advance for some of the audio lag and the poor bandwidth on my side, but hopefully that's okay because the main attraction is our star graduate student, Robert Mahari. So with that said, enjoy the show.

And now, a segment that you've all been waiting for and that's been long promised: a peek behind the curtain at the research into computational law that's happening at MIT, here in the Media Lab and the related MIT Connection Science research group, where I serve and where law.mit.edu is situated. I want to reintroduce you all to my friend and colleague, our star grad student, who's in the midst of a PhD at MIT after a very successful sojourn at Harvard Law School, where he picked up a JD, and before which he was at MIT, so I think we can claim original provenance of Robert Mahari in academia. He got two degrees at MIT as a youth, including one in chemical engineering, and he's back, baby, diving with both feet and his whole body into computational law, and I couldn't be more delighted to call you a collaborator.
And so I just wanted to share in this video, as I said, a bit of a look into what is happening in the research in this area. That's a little different from what you'll see in the MIT Computational Law Report and the sort of stuff we spotlight in IdeaFlow and in our workshops and so forth. This is more what I consider our day-in, day-out at MIT, which is primarily a research institution. So with that, Robert Mahari, thank you so much for taking time out of your incredibly busy days to join again and to share with people what you've been working on. So Robert, take it away.

Perfect. Thank you so much, Dazza, for the generous introduction, as always. And you deserve a lot of credit too: you are the first person who introduced me to the whole idea of computational law. So thanks for that, and it's been a fun ride. What I thought I would do today is share some highlights. We'll be moving quite quickly through a couple of research projects, just to give people a sense of what the research questions are, what the open problems are, and really to convey, I hope, the breadth that computational law research offers. And maybe people will be excited and want to collaborate.

So with that, I'll start by talking a little bit about legal research, as in the research you do when you're working on a new case, and how large language models can help us with that. As a very quick review and reminder, we're in a common law system, right? That is a system of judgment that is really built on citations to precedent, and it's an exciting opportunity for us to leverage large language models to predict precedent. This, incidentally, is one of my favorite cases from law school. It involved a conference that went completely off the rails in a hotel: people were bringing all sorts of animals into the hotel, they were firing guns, and ultimately a passerby was hit.
And the Minnesota Supreme Court had to figure out whether the hotel was liable, and they ended up citing and quoting a case from New Jersey where something similar had happened. The conclusion was essentially that if the hotel was aware of the danger that its guests posed, then it had a duty to protect innocent passersby. This kind of reasoning based on precedent is ubiquitous in common law jurisdictions, and we wanted to see whether we could build a system that retrieves past cases, and specifically these kinds of quotations from past cases, and then, once the retrieval has been done, generates an argument to be made.

So the task that we're going to focus on is: given a legal argument that you'd like to make, can we predict passages of relevant precedent? We treat this as a classification problem. Essentially we say, given a list of all the possible precedent you could cite, let's predict the precedent that's most relevant to your specific argument. We construct training data by mining published judicial opinions, finding passages of quoted precedent, looking at the context that surrounds that precedent when it's used, and then trying to predict the passage of precedent given the context that surrounds it. We're able to generate a tremendous amount of training data. I'll just pause here to underscore that legal data is so rich, and there's quite a lot of it, so you can create these huge data sets that allow us to do some really interesting things, like this project on passage retrieval. Anyway, we were able to create this big training data set and then train different models: one more advanced transformer-based model, and one simpler one.
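To make the training-data construction concrete, here is a minimal sketch, with hypothetical code that is not the actual research pipeline, of mining (context, quoted-passage) pairs from an opinion, where quoted precedent is naively approximated as any long run of text inside double quotes:

```python
import re

def mine_training_pairs(opinion_text, context_window=200):
    """Extract (surrounding-context, quoted-passage) pairs from a
    judicial opinion.  The context around each quotation, with the
    quotation itself removed, becomes the retrieval query; the quoted
    passage becomes the retrieval target."""
    pairs = []
    for match in re.finditer(r'"([^"]{40,})"', opinion_text):
        passage = match.group(1)
        start, end = match.span()
        context = (opinion_text[max(0, start - context_window):start]
                   + opinion_text[end:end + context_window])
        pairs.append((context.strip(), passage))
    return pairs
```

A real pipeline would be far more careful about citation formats and quotation boundaries, but the shape of the data, context as input and quoted precedent as label, is the point.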
And the key takeaway is that, especially if we look at the more advanced model, the correct passage, out of thousands and thousands of potential options, is among the top 10 or top 20 retrieved examples over 90% of the time, more like 96 to 99% of the time. So really impressive results: large language models appear to be quite good at this. I'll also flag that a much simpler feed-forward neural network, which is kind of old-school machine learning, performs quite well at this task.

And then once we have the passages retrieved, well, we built this little demo to show people how this looks. Here, we've given an input argument to the engine, and it's retrieving these passages that you can see here. Once these passages are retrieved, we can give them to a model like ChatGPT. We can essentially treat ChatGPT like a grammar engine, right? We can ask it to write a brief based on the passages of precedent that we provide, and that addresses a lot of the hallucination issues. It will go ahead and put it all together and write a brief that, in my opinion, is actually quite passable as an initial starting point for an attorney. We've analyzed this quite a lot, and it turns out that by providing the passages up front, you really do address a lot of the hallucination issues, but you also get stylistically correct briefs. So something interesting and further to explore. Just to conclude this little section: AI and large language models seem to be really good at finding precedent, simple models work quite well, and we can then combine retrieval tools with large language models to generate briefs. So I'm going to pause briefly here and give Dazza an opportunity to ask some questions.

Oh, thank you so much. And you know, auto law has always been one of my favorite projects of yours.
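One way to picture the "grammar engine" step is the prompt assembly below. This is a hedged sketch with assumed prompt wording, not the actual system: the retrieved passages are numbered and the model is instructed to rely only on them, which is what helps curb hallucinated citations.

```python
def build_brief_prompt(argument, passages):
    """Assemble a drafting prompt that grounds the model in retrieved
    precedent.  The model is told to cite only the numbered passages
    it was given, rather than inventing authorities."""
    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "You are drafting a legal brief. Rely ONLY on the precedent passages "
        "below and cite them by bracketed number.\n\n"
        f"Passages:\n{numbered}\n\n"
        f"Argument to support:\n{argument}\n"
    )
```

The resulting string would then be sent to a chat model; the API call itself is omitted here since it depends on the vendor.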
And so one question I have for you, and you sort of touched on it, but could you be a little more explicit about how the project itself has evolved with the evolution of generative AI? When you started this, it was pre-ChatGPT, it was pre-GPT-3.5 and some of the capabilities that it now has, and now we have GPT-4. I recall you using BERT initially to start to identify and classify some of the key holdings and parts of the opinions in cases. What's changed in this project with the advent of this modern generation of AI, and how?

Yeah, so I think I, and we as a community, have been quite lucky to have been around right at this cusp, where AI was good enough to do interesting things four or five years ago, and now it's really good and can do all these new things. When we first started this, it was really focused on retrieving the precedent, and the idea was: let's just focus on doing what a legal research platform does already, helping you find the precedent you're looking for. We had thought about and considered, you know, maybe some sort of Word plugin where you're typing along, you press tab, and the plugin suggests what you should cite at that place. But what large language models can do is the generation piece: they can generate really high-quality text. Now, they're not necessarily natively good at finding legal precedent, but that's the research that we've been doing for years, so we are good at that. And combining the two has led to some really exciting things, where we can take all of the infrastructure we've built for legal research and then layer the generative AI language model on top of that to produce text. I think we're at the start of something, and it'll be exciting to see how lawyers actually leverage this.
But the possibilities, I think, are significantly expanded now, in good ways and bad, and hopefully people can think about some of the risks here as well.

Indeed. Yeah, this is foundational. Okay, so I know we've got a few more interesting projects and we're barely scratching the surface of your recent research, but let's get to the next one.

Okay. So the next thing I wanted to talk about is more general: the opportunities we have to use machine learning and large language models to extract data from legal documents. For anyone who's interacted with a legal document in their work, you'll know documents are long, they are complex, and you have to go to law school to really understand what's going on. So the question is: can machine learning help us, as researchers and as individuals, by summarizing, by finding answers to specific questions, or by pulling out specific information? And the answer is yes. There are lots of applications you can think of. The challenge before large language models, a little bit to your last question, Dazza, was that you had to collect and label a bunch of information and train a model that knew nothing about law. Those models would perform well, but the precondition was that you needed to have the training data, and that was expensive to come by, because legal documents are long and complex and you need expertise to understand them. Now we can just ask. Machine learning people will call this zero-shot learning: you can give a document to a large language model and ask it, hey, can you tell me how much the attorney made in this case? As long as that information is contained in the document, you've got a pretty good chance of getting the right answer. And then you can do a little bit of prompt engineering, which I'm sure Dazza is going to tell you about, to get even better results.
But this really unlocks a lot. So I'll give you an example of a research project from the pre-ChatGPT days where we were interested in understanding attorneys' fees in class action lawsuits. The cool thing about class actions, as an aside, is that the final settlement is published and has to be approved by the judge. Usually we don't know how much attorneys earned, but class actions give us this unique insight, because there's a judicial opinion that tells us. The problem is that it's not always super clear. You can see from this little extract that there are all sorts of numbers floating around, and it's not entirely clear what the correct answer is. By the way, the correct answer is about $140,000, which is in the last sentence, where the judge says, I will grant the petition. But there are all these other numbers floating around, and there's this "lodestar," and what is that? So what we did, and this is the ChatGPT playground, is we just said: based on the judicial opinion, which is long, identify the final fee and costs awarded to the attorney. And we can press play, and lo and behold, this will work. The video is a little bit long, so I'm going to fast forward. It tells us dutifully that the final fee and costs awarded were $140,628, which is the right number. So that's exciting. That means we can now plug in large language models where we had problems in the past, namely this information extraction step, and then go on to do regular regressions or analyses or whatever we want to do. This really unlocks lots of possibilities. It comes at a cost, something like 12 cents per opinion, but for a lot of these applications that's a trivial amount of money; maybe you want to label 100 or 1,000 opinions, and that's doable. And if the cost is an issue, if it's prohibitive, what you can do is label a few with ChatGPT.
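A rough sketch of the extraction step described above might look like this. The prompt wording and the parsing are illustrative assumptions rather than the exact prompt used in the project, and the model call itself is left out; we just normalize whatever dollar figure the model replies with:

```python
import re

# Illustrative prompt template, modeled on the description above.
EXTRACTION_PROMPT = (
    "Based on the judicial opinion below, identify the final fee and costs "
    "awarded to the attorney. Answer with a single dollar amount.\n\n"
    "Opinion:\n{opinion}"
)

def parse_dollar_amount(model_reply):
    """Normalize a free-text reply like 'The final fee and costs awarded
    were $140,628.' into a float, or return None if no figure appears."""
    match = re.search(r"\$?(\d[\d,]*(?:\.\d+)?)", model_reply)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))
```

Once each opinion is reduced to a number like this, the downstream regressions and analyses Robert mentions become straightforward.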
And then you can train a cheaper model, fine-tune a cheaper model, to do the labeling for you, and that works quite well too. So the takeaways are: we can go from unstructured raw legal data to structured legal data, we can do this quickly and cheaply, and then we can do all sorts of interesting quantitative analysis, extracting data and insights, pulling information from different documents, understanding historical evolution, all sorts of interesting applications. I hope this gets people excited about some of the things you can do. And I'll hand it back to you, Dazza.

Thank you so much. This is another great example of a project that started with the prior generation of technology, and where you really just blew the ceiling off of what was possible with the current generation. One thing I'll highlight here is that everybody who follows law.mit.edu knows we love structured legal data. I feel like this is doing such important work: getting these natural-language, very narrative, fuzzy, hard-to-parse legal documents and opinions into something we can then use as the starting point for proper analytics, and turning that into actionable, valuable knowledge. But there's another aspect you mentioned as well, which I'll just add as a segue. You said, and now we can use it as training data. I feel like if there's a theme of 2024, it's going to be: let's take a closer look at the data underlying these models and what having the right kind of data makes possible. And I think that might be a good segue to your very next project.

Sure. What I'll do is give you a quick example of how we've used this kind of information extraction in a concrete research project, and then I'll move to the project that you're hinting at, about data provenance.
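The "label a few with a big model, then train a cheaper one" recipe can be sketched as below. This is a toy illustration with made-up snippets and labels, assuming scikit-learn is available; the real pipeline, features, and label set are different.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Pretend these few snippets were labeled by an expensive LLM call.
texts = [
    "the petition for attorney fees is granted",
    "fee request granted and costs awarded in full",
    "motion for fees granted by the court",
    "the petition for attorney fees is denied",
    "fee request denied by the court",
    "motion for fees denied in full",
]
llm_labels = ["granted", "granted", "granted", "denied", "denied", "denied"]

# A cheap classifier distills those labels and can then tag the rest
# of the corpus at essentially zero marginal cost.
cheap_model = make_pipeline(TfidfVectorizer(), LogisticRegression())
cheap_model.fit(texts, llm_labels)
prediction = cheap_model.predict(["the fee motion is granted"])[0]
```

In practice you would label hundreds of examples with the large model, hold some out to check the cheap model's agreement, and only then turn it loose on the full corpus.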
So let's start with this project where we were interested in judicial impartiality. This is a big, important pillar of justice that goes back to the Book of Leviticus and the Magna Carta. It's hard to study impartiality for lots of reasons, but one of them is data access. So what we did is match a couple of databases together, and then we used the overlap between the structured database and the unstructured database essentially as training data. We said, okay, based on this overlap, based on the cases that we're able to annotate using the structured data, can we then annotate the rest of the data? The kinds of things we were after were: how was the case decided? What kind of case was it, a civil rights case, a contracts case? And then some other key information about the judge involved in the case. We were able to create two big data sets, one on case data (did the plaintiff or the defendant lose, who was the judge, things like that) and one on the judges themselves: their workload, their experience, party affiliation. And then we were able to start doing analysis at a scale that people haven't been able to do before about predicting judgments using factors that are unrelated to the case details. I won't go into the methodologies too much, but the key thing is that this gives us insight into judicial reasoning and decision making and impartiality. The bottom line is this red line at 50%: that's the accuracy you would expect if you were just guessing whether the judge decides for or against the plaintiff. And this model, trained on the factors you see underneath, which appear extraneous to the case at hand (related, of course, to judicial philosophy, but not to the specific case), does quite well. For lots of cases we can get accuracies approaching 65 to 70%.  So that's quite exciting.
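The evaluation idea, comparing a predictor built only on case-extraneous judge factors against the 50% coin-flip baseline, can be sketched with synthetic data. Everything below is made up for illustration (the toy effect of party affiliation, the simple majority rule standing in for the real model):

```python
import random

random.seed(0)

# Synthetic docket: the outcome depends mildly on a case-extraneous
# factor, here a hypothetical judge party affiliation (0 or 1).
cases = []
for _ in range(2000):
    party = random.randint(0, 1)
    plaintiff_won = random.random() < (0.35 + 0.30 * party)
    cases.append((party, plaintiff_won))

train, test = cases[:1500], cases[1500:]

def majority_rule(train_rows):
    """Toy 'model': for each party value, predict the majority outcome
    seen in the training split."""
    rule = {}
    for party in (0, 1):
        wins = [won for p, won in train_rows if p == party]
        rule[party] = sum(wins) > len(wins) / 2
    return rule

rule = majority_rule(train)
accuracy = sum(rule[p] == won for p, won in test) / len(test)
baseline = 0.5  # expected accuracy from pure guessing
```

If `accuracy` lands meaningfully above `baseline`, the extraneous factor carries predictive signal, which is exactly the kind of gap the red-line comparison in the talk is measuring.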
So that's just an example of how you might be able to leverage this kind of data. But now let me tell you about data provenance, which is going to feel a little bit like a pivot, but I hope people will see how this is also a kind of computational law. Let me close this and minimize this and tell you about data provenance.

One version of computational law is to say: can we improve the practice of law? Insofar as there's an intersection between law and technology, there's the law of technology, where lots of people are interested in regulating AI and privacy, and those are important things. Then there's the technology of law, which is how technology improves the legal profession, and there, I would argue, fewer people are working, and I think there's a lot of green space that people can tackle. But even in this law-of-technology area, there are some blind spots, some areas, especially when you get into the more technical side, where I will say a computational lawyer can really add a lot of value. And this is, I think, a good example of that.

So you might be aware that generative AI like ChatGPT and other models are trained on huge amounts of data. Not all of that data is created equally. You have some data that's scraped from the web, unstructured large scrapes of the web (Common Crawl is a good example), and there are scrapes of Wikipedia; that's usually what people think of when they talk about training data. But there's also a bunch of data that was created in a custom way to train AI models to be good at certain things. People call this fine-tuning data, instruction-tuning data, alignment data.
And the recent advances in generative AI, maybe not that recent anymore, but the advances of the last year or so, have in large part been at least catalyzed by these custom-made data sets. So as part of this data provenance initiative, we've put together a team of machine learning experts and lawyers to try to understand and gain insight into where this data has come from, how it's being used, and, most interestingly from our perspective, what kind of legal limitations were placed on this data. This figure is complicated, but the key part is that data sets are often grouped into collections. You have original sources, then the data will appear in a place like Hugging Face or Papers with Code, then someone will take that data and put it into a bigger collection, and ultimately it gets used for AI development. There are lots of stages, and people lose track of where their data is going, or of where the data they're using to train their model really came from. What we did is find the original licenses from the original sources and then categorize those licenses along a couple of important dimensions. The one I'll talk about is what kind of use the license permits. We find that if we compare the original, correct license based on the source with the license according to the various aggregators where the data set is hosted, there are a lot of errors. You can see, in the reddish-pinkish triangle, all the situations where something is incorrectly labeled. The first column is probably the most problematic: that's where licenses according to the aggregators say, hey, commercial use of this data set is fine, but actually commercial usage is either not permitted or the original source doesn't mention it at all. And that can pose, as you might imagine, a real issue.
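The core of that audit, comparing what the original source's license permits against what an aggregator claims, can be sketched in a few lines. The license categories and data set names below are hypothetical placeholders, and the real categorization involved lawyers reading actual license texts:

```python
# Ordering of hypothetical use categories from least to most permissive.
PERMISSIVENESS = {
    "unspecified": 0,     # source is silent on permitted use
    "non-commercial": 1,  # research / NC use only
    "commercial": 2,      # commercial use allowed
}

def audit(source_license, aggregator_license):
    """Flag a data set whose aggregator listing is MORE permissive than
    its original source: the problematic first-column cases."""
    return PERMISSIVENESS[aggregator_license] > PERMISSIVENESS[source_license]

# (source license, license as listed by the aggregator)
datasets = {
    "dataset_a": ("non-commercial", "commercial"),
    "dataset_b": ("commercial", "commercial"),
    "dataset_c": ("unspecified", "commercial"),
}
flagged = [name for name, (src, agg) in datasets.items() if audit(src, agg)]
```

Here `dataset_a` and `dataset_c` get flagged: the aggregator advertises commercial use that the source either forbade or never granted.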
In doing this project, we also came up with another interesting finding, which is to say: hold on a second, if this data was created for the sole purpose of training AI models, then the fair use discourse that you'll often hear applied to training data might not apply in the same way. The whole idea of fair use is that you have a secondary purpose that is distinct from the primary purpose. I wrote an article, say, to be in a newspaper, and now that article is being used to train an AI model; those two purposes at least appear quite distinct. But when you create supervised data, well, you created that for the purpose of training an AI model, and so fair use might not apply in the same way. There's also this question about the market effect. If I use your poem to train my AI, does that affect the market for the poem? Once I start creating new poems, that's a different question, but the moment I've trained an AI, it's not as if I've created a new poem that competes with your poem. But when I take your supervised data in a way you didn't permit and use it to train my model, the alternative is that I would have paid you for it; I'm directly competing for the market for your data set. So this is an interesting analysis, and we were actually able to write it up, with the help of folks at the BU Technology Law Clinic, as a comment to the US Copyright Office. Just to highlight something completely different, but I think still related, still in this realm of computational law research, applying some of the principles that I hope you'll be learning about and thinking about. So, handing it back to you, Dazza.

Thank you so much. That is fascinating.
And incidentally, I was referring to the second-to-last project when I said there was training data involved, because there was, and you're also correct that the crowning jewel was the last project you mentioned, when it comes to taking a closer look at this training data, which is so very essential. Let me help you with an advocate's argument as to why your last project can also validly be considered computational law, as opposed to yet another law review article about how some legal framework may or may not apply to AI or automation or technology. And that is because, just as you said, being a computational lawyer, which is a nice phrase, some of those skills were needed in order to do this legal analysis in the first place. You had to find a way that accurately and completely captured the relevant legal aspects of these licenses, to represent that information in a form that was data you could then analyze and show on things like charts and graphs. And you did that. Your team did the hard work of reading the licenses and categorizing them correctly, and then having the right kind of identifiers and metadata around each one so that you could do this analysis and see whether there was some difference between what the license actually permitted and restricted and how it was being characterized, if at all. And that, to me, is the hard work: the first step of computational law is rendering the law in a form in which it can be computed. So anyway, that's my advocacy on behalf of your project as to why it can also totally be considered computational law, although it is mostly about the law of these data sets. I think that's incredibly fascinating. And it makes me want to ask: what are you working on now that isn't yet capable of being rendered on a slide?
And what do you foresee for the rest of this new dawning year of 2024 and into 2025? What's on the horizon at the cutting edge of MIT research in this area of computational law, with a heavy emphasis and thumb on the scale for generative AI?

That's a good question. There are a couple of things. Of course, these projects are all, by definition, early stage, so there are next steps for essentially all of the things I've talked about today. We are, for example, trying to really build out this retrieval-augmented generation platform and think about how you design a platform like this. What would be exciting would be to go to pro se litigants, give them a tool like this, and see how it changes how they interact with the courtroom and with legal questions; that's more of the human-computer interaction side of things. Another big research project that we've been grappling with, as you know, for quite a while is that we can start thinking of law like a network. We're starting to think about citation networks, and you have all these documents along the way, but that raises the question: common law, judge-made law, has evolved in some way. Can we get a better understanding, a better grasp, of that evolution? I think that's a key research problem. And honestly, on the one hand we're overwhelmed, because we have all of this new data and all of these new tools; on the other hand, it's non-trivial to take a system of knowledge and try to say, well, where does it come from? Where is it going? Who are the key people who are changing it, or the key events that have changed it? Can we find moments in history, moments in law, that have given rise to new changes?
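The "law as a network" idea can be made concrete with a tiny sketch. The cases below are entirely made up; the point is just that once opinions and their citations are parsed into a graph, simple measures like citation counts become a first, crude proxy for influential precedent:

```python
from collections import defaultdict

# Toy citation network: each opinion lists the earlier opinions it cites.
citations = {
    "case_1990": [],
    "case_1995": ["case_1990"],
    "case_2001": ["case_1990", "case_1995"],
    "case_2010": ["case_1995", "case_2001"],
    "case_2015": ["case_2001"],
    "case_2020": ["case_2001", "case_2010"],
}

def citation_counts(graph):
    """Count how often each case is cited (the node's in-degree)."""
    counts = defaultdict(int)
    for cited_cases in graph.values():
        for cited in cited_cases:
            counts[cited] += 1
    return dict(counts)

counts = citation_counts(citations)
most_influential = max(counts, key=counts.get)
```

Real work on doctrinal evolution would go well beyond in-degree, for example tracking how citation patterns shift over time, but the graph representation is the shared starting point.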
All those kinds of questions seem incredibly well suited to computational law and computational legal analysis. And then finally, there's the broader question of what the practice of law looks like, what the business of law looks like. It's early days for that, and we can see lots of folks misusing these tools or misunderstanding these tools. But I think it's pretty clear that the legal profession will embrace a lot of generative AI tools, a lot of large language models. And the clients of lawyers, whom we sometimes forget about but who really matter, will also be embracing those tools and making the connection: hey, is my attorney, who's billing me $1,000 an hour or more, is she using these tools? Because if not, then I'd like to know why I'm being billed all these hours. So I think there are going to be interesting shifts in the legal profession that are worthy of research in and of themselves: understanding how you deal with legal risk, how you manage legal services, understanding the law. There are so many questions, honestly, an overwhelming number, so we'll be busy at work. And if people are excited about this, then I've done my job; hopefully there will be lots more people working on this stuff. So thank you, this was very fun.

Thank you. Yeah. Well, I'm excited, and I know other people are too. This has probably been one of the most requested segments: just a research update. So while I have you, before we end, I've got one extra question for you. It's partly because you are really still fairly freshly out of law school, what was it, a couple of years ago or something like that? Two, three? So you probably remember it better than I do.
And the question I would have is: what do you foresee, not so much in the practice of law, which you just went over, or research, of course, or industry, but for law school itself? What do you think would be some good directions for law schools to look at as ways to reckon with, to recognize, and to start to address, and I would say to support and reflect, the advent of generative AI as a part of law practice? What sorts of activities or courses or skills or other implications? And let me just give you a starting point: say I'm at a law school and I've decided to prohibit use of generative AI for any meaningful aspect of legal education. If you start from that baseline, what more might be possible with the liberalization of some of those types of restrictions, and what kind of application of generative AI would be beneficial and appropriate, or maybe even necessary, for competent, well-educated, ready-to-practice lawyers coming out of law school?

Yeah. So I'm actually probably more receptive to the argument that, hey, we shouldn't have these kinds of tools in law schools because they'll prevent us from learning about the law. I think there is a little bit of something to that argument. I don't think I could have done the research that I've gone on to do without the slog of using Lexis and Westlaw and the other legal research platforms, understanding what a KeyCite and a headnote are, and Shepardizing. You do have to learn those things, the way you probably have to learn long division and other kinds of things, even though they're not actually used day to day. However, it also makes sense to me that you would learn a little bit, maybe not about the specific tools and vendors, but more about the risks and the limitations and the opportunities. As a lawyer, I need to understand what kind of precedent I can cite, right?
And I need to understand that some things are good law and some things are bad law. Well, by the same token, I need to understand what tools are out there that will retrieve cases, and what tools will let me compile cases into arguments. I really expect that legal practice is going to embrace those. But we need some kind of AI literacy among the lawyers who will use the tools, not because they're going to be developing those tools necessarily, but because that kind of literacy is needed for responsible legal technology usage, to responsibly use these tools. So that's one side of things: yes, it's fine if you need to do manual legal research, but it also seems highly appropriate for there to be at least one class that covers some of these AI literacy topics. However, the other thing that I think is a real opportunity for law schools with clinics is to say, well, the purpose of clinics is, one, to be practice-oriented, and two, to serve clients. So there's an interesting opportunity to consider the existence of a meta-clinic, a clinic where law school students who are interested in developing tools help the other clinics: a clinic for the other clinics, kind of. And especially now, some of these tools have become very accessible; you don't need a computer science degree to be able to use ChatGPT. By the same token, you don't need a lot of technical know-how to build really impactful tools. You need design thinking skills, but not a lot of technical know-how.
And I think there's a really cool opportunity for law schools to start not just encouraging their students to build these tools, to train that muscle if they want to, but to do so in a really impactful way, and maybe also to dabble and cross over into some of the human-computer interaction literature and communities and start exploring what these kinds of tools look like and how we design them responsibly. So there are lots of options for law schools. I think it's an exciting time, actually, to be at a law school, to be a law school professor. And I remain, as always, the optimist: law schools will find ways to integrate this into their syllabi in productive, responsible ways.

Hear, hear. Well, may it be so. And I did have a little cherry on top, which is that I'm hearing now from more and more of my friends and colleagues at law schools who are sharing really innovative ways they're starting to integrate the use of generative AI into their pedagogy and into their syllabi and curricula. You know, there are a thousand flowers blooming right now. But one thing I can say for sure is that I love the way you're incorporating it into your research, and I can't wait to see what you come up with next. So thanks very much for taking the time to share what you've been working on, Robert, and don't be shy about sharing the next block of projects when they come up, and I'll be sure to vector them right onto the screen.

Perfect. Thank you so much for having me. Take care. Thanks.