OK, welcome, welcome, everybody. Hi, I'm Dazza Greenwood from the MIT Media Lab and also executive director of law.mit.edu, which is the convener of today's workshop, the eighth annual MIT Computational Law Workshop. I just want to start by saying, having done these things since the late 90s at MIT on this topic of law and technology, I honestly believe this is the best program yet. And that's owed largely to a breakthrough in widely accessible generative artificial intelligence and its applications for law and its impact on law and legal processes. So with that, welcome. And by way of introduction, we really look to Michael and Dan as pillars in this emerging space of computational law. Dan arguably coined the phrase before we started focusing on it at MIT. And Dan, I just want to recognize you again and thank you for being a member of the board of advisors of the MIT Computational Law Report. So with that, can you please show us how on earth you got GPT to pass, in part, the bar exam? Well, again, greetings from Chicago. And Mike is joining us from Michigan, near the campus of Michigan State University. I guess maybe I can take you back a little bit first. We've been working in this area of large language models on the academic side for a while now, and more recently on the commercial side. Years ago, we had a company called LexPredict, and we did a bunch of things in that company, including litigation prediction and contract analytics. And we released several open-source libraries. One was called LexNLP, and it was focused on what I think would now be called classic NLP: the historic workflow in which people undertook NLP tasks, which is now increasingly being displaced by deep learning as the base method. And so, unfortunately, this is just the nature of things.
You know, the libraries that we built back in 2016 to 2018 have been eclipsed by other methods. You could still use some of what was done before, but I think it's, you know, unfortunately fallen by the wayside. So last year, on the academic side, we worked with a pan-European group on something called LexGLUE. This gave us an opportunity to really work heavily in the area: it was a benchmark analysis of several leading language models, including BERT, Longformer, BigBird, and so forth, on a wide range of legal tasks. And we got the paper into the ACL conference, which is probably the best conference, or one of the best conferences, on natural language processing. So in November, when we got out of our non-competes, it was once more unto the breach: we started another company. And it was in the context of doing that work, building out a bunch of these core tools, that we'd been telling folks, hey, there's been a material increase in the quality of these large language models. But we could not come up with a great way to show that to people. And then, of course, on November 30th, about seven weeks ago, ChatGPT enters the fold, and here we are. I run this MOOC at Bucerius Law School in Germany, along with some other schools, including SMU in Singapore. And in the very last session, we did an introduction of Richard Susskind using ChatGPT. Giving these intros is always difficult; you want to show the proper amount of fealty and what have you. So we said, you know, we're going to outsource this to ChatGPT. And I'll say it gave a pretty high-fidelity introduction and even thanked Richard for his presentation.
So right before Christmas, I called Mike and I said, I think this is it. I think what we should do is try the bar exam. There had been a few efforts, a couple of people had shown a few things online, but I said, you know, we need a rigorous, systematic treatment of this, not just plugging stuff in and seeing what comes back out. Can we go through this in a more systematic manner? So we got done with Christmas, we put our heads down, and a few days later we had version one; now we're on to the second version of the paper. I guess I'd just say this: language is the coin of the realm in law, and most roads in law lead to a document. That document is expressed in natural language, both historically and for the foreseeable future. We have had successive waves of legal technology, but most of the tools that have been built to date, including anything we've built and any other tool, and I'll stand by this, really have not had a very good account of legal language. There have been clever hacks to work on these problems in an indirect way, but never a frontal assault on the problem. And the problem is that there's a lot of semantic nuance in legal language, and in general language, by the way. Now we've seen this material increase in the quality of tools, and that brought us around to ask: can we work on a problem that would help demonstrate to people the nature of these capabilities, and how they're increasing? And so we started on the bar exam. I'm going to pass it over to Mike, and I'll be minding the slides. I just wanted to set us up with that. Over to you. So, yeah, we did what you would hope we did, or at least I think it's what you'd hope we did. We went to the source of the exam, in any sense that there is an exam in a singular, correct sense, right?
It's the NCBE's model exam. There are different components to the exam. Some of those are obviously better suited to something like GPT; for example, the MEE or the MPT are probably things that GPT could do, maybe even at an adequate or passable level. But we chose the MBE portion in particular because there's not really any degree of subjectivity. It features complex syntax in the questions, questions that are, if we're being honest, purposely written to trick people, with the length of the sentences, the complexity of the sentence structure, the nature of the presentation of the facts, extraneous adjectives, all this kind of stuff. And there's no question as to whether Dan and Mike graded it correctly, right? We don't have access to all these NCBE or state bar graders, and so were we to do the MEE or MPT, there would be questions about whether we had faithfully reproduced the assessment that actual students sitting for the exam would face. None of those questions arise for the MBE. So is it only the MBE? Yes. But does that allow us to speak more objectively? Also yes. So here's an example of what we got. This is from the NCBE's public documentation. We can't reproduce in full all of our questions because they are copyrighted, but you can buy them for 200 bucks. And you can see, I think this is, let's see, one, two, three; this one isn't actually so bad. These are, what, four different sentences here. Sometimes these questions run one to two sentences with that many words each. And the question is a four-option multiple-choice question. I'll point out, just to be very pedantic here: the question is asking for a binary answer, but of course there are not two choices; there are actually four. So the prompt, if read literally, which is what GPT will do sometimes, and some people do, is not really aligned with the question. And this is obviously just a part of dealing with natural language.
So while, if you want to be really pedantic, you'd say the questions are poorly written by the NCBE and trick even GPT, it's also just the way your client is going to speak to you. They're not going to be that precise. So deal with it. Next slide, Dan. So again, as a baseline, the rates at which students sitting for this exam correctly answer questions are presented in the rightmost column in this table. And if you've ever procured legal services and you're not an attorney who sat for the bar, those numbers might not instill a lot of confidence, right? You don't want to know that your counsel forgets Rule 34(a) and gets you into a spoliation situation because they only got 59% on the bar, but that's the way it works. So these are the numbers "to beat," if you will, or at least they represent the efforts and abilities of people who spend a lot of time on this. Yeah, another key point here is that ChatGPT is kind of the name du jour for what OpenAI offers. They offer, and have offered, a number of models. Some of the models are multi-modality models that do different things; some of them just do one thing. text-davinci-003 is the best model that we could get to answer the questions. There's also a Codex model that has larger token windows and is supposedly better on some tasks, but text-davinci-003 was the best and largest model that actually responded. It's technically different from ChatGPT as you experience it, but supposedly the foundation. So with that detail aside, we get to the meat of this. And I think it was great, Megan and Dazza, that you guys talked a little bit about very related concepts, right? The degree to which the prompt can impact the model's response is, in some sense, Megan, like you said, not much different than with humans.
In many circumstances, the way we frame problems, the way we pose the outcomes, the way we contextualize which shared body of knowledge we're drawing on, or whether there even is a shared body of knowledge, all those things have a huge impact on how we as humans carry on conversations, and we see the same with these models. Now, we have, I don't know, let's say 70 years of somewhat rigorous psychology that can at least inform human-human interaction. We do not have anywhere near that much longitudinal research on how human-computer interaction with these LLMs works. So what we did is try seven things that you might ask a normal student to do from a heuristic perspective when helping them take a test, or that you might use if you've ever written questions as a professor. What's the answer? What's the answer with a justification or explanation? Then some variations on that with rank-ordering the top two or three choices. In our follow-up work on the CPA exam, we did a little bit more with source elicitation and source constraints, which I think you touched on, Dazza. But for this paper, we just did these seven prompts. And when we did that, as Dan said, we wanted to do it in a very rigorous, scientific way, not just copy-paste a couple of bar prep questions into the thing. So we tried just about every switch and dial that's exposed on the API to ensure that the results were robust, that this wasn't just some local-optimum API parameter value where it magically worked. And the results varied by only six or seven percent across every setting that we tried. The only thing of note here, qualitatively, is that the temperature and some of these parameters have to do with how random, or how reliable and deterministic, the answers from GPT are.
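The sweep just described can be pictured as a simple grid over the sampling parameters exposed by the completions API. This is an illustrative sketch, not the study's actual code; the specific parameter values and the `max_tokens` setting are hypothetical stand-ins for the ranges actually tested.

```python
from itertools import product

# Hypothetical values for the "switches and dials" swept in the study;
# the reported result was that accuracy varied only ~6-7% across settings.
TEMPERATURES = [0.0, 0.5, 1.0]   # 0.0 = deterministic (greedy) decoding
TOP_PS = [0.75, 1.0]             # nucleus-sampling cutoff
BEST_OF = [1, 2]                 # server-side resampling count

def parameter_grid():
    """Yield one API-settings dict per combination in the sweep."""
    for temp, top_p, best_of in product(TEMPERATURES, TOP_PS, BEST_OF):
        yield {
            "model": "text-davinci-003",
            "temperature": temp,
            "top_p": top_p,
            "best_of": best_of,
            "max_tokens": 16,  # answers are short: a choice letter, maybe a rank list
        }

grid = list(parameter_grid())
# 3 temperatures x 2 top_p x 2 best_of = 12 runs per prompt style
```

Each settings dict would then be paired with each of the seven prompt styles, so robustness is checked across the full cross-product rather than a single lucky configuration.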
If you're doing anything where you really need to explain what you're doing, or cite that you did something at a certain time in a certain way, you should be careful about your temperature values, because the only way to get deterministic, reproducible output from GPT, to the best of our abilities, is to set the temperature to zero. So we tried all these different things, and like I said, the short answer was it didn't really matter. And everybody asks: did you fine-tune it? The answer is yes, to the extent that we had a couple hundred test questions, and no, it didn't help, and no, we don't know exactly why, although we have a lot of theories, and there's other research about how fragile some of these models are. The question is best answered by just not using GPT, which is something we're working on. As far as the results, I imagine a lot of you have probably seen them, because it's been kind of hard to avoid in the press lately, but I didn't believe it at first. That's the short answer, because of how hard the problem was, and because prior research, even from the likes of Thomson Reuters with a lot of effort, had been, let's just say, not anywhere near close to this. The model does worse than the students, but not much worse in a handful of categories, and the model's top two responses are correct far more often than random guessing would predict, which suggests it's very close to doing even better than it's doing right now. And I don't know which section you hated most, for those of you who've taken the bar exam or who have practical experience with the law, or which of these you think you actually still live today, but a lot of the questions in the exam are difficult. Some are more fact-specific.
Some involve information that might be deemed outside the scope of the contextualization; in con law, for example, a bunch of the questions have to do with, let's say, foreign relations, or stuff that may have actually been harmed by the contextualization prompts. But it did better than anyone expected, ourselves included, I think it's safe to say. And Dan, if you want to go to the next slide: I think it's clear that sometime, I think we said in the paper zero to 18 months from when we published, some model will likely meet the threshold for the NCBE's estimated passage rate. When that will be, I don't know. I'm leaning towards the under on that range now, not the over, based on the acceleration that we're all seeing in the market. And I don't know whether you want to talk more about what that means for the bar exam, or for attorneys who practice, or for public policy, or for clients, but any and all of those questions are obviously relevant and salient right now, and real questions to ask. Maybe I'll say one thing about this. This was not in the first version we put out, but we thought it would be very helpful, because again, we want to show people the progress: let's go back and run the historical gamut of GPT models to give people a view. Even in 2019, GPT-2, which people have used in papers to do things like draft patent applications, is not even able to process the question, so it's at 0%. Then text-ada-001... go ahead. Sorry, Mike. And we've been using some of the open models, like the EleutherAI models, the BLOOM models, all these kinds of models out there, testing them on a variety of tasks.
And the prior generation of models, models that could run on 48 gigs of VRAM before some of the latest 8-bit or compression techniques, were struggling even to respond to the prompt, right? You give a four-option multiple-choice question with a 500-token intro and it just wouldn't even work. So something has materially changed, even in the last six to 12 months, in terms of the state of the art. I feel it too. That's partly why we've dedicated so much of this workshop to this, and why we're going to be focusing on it through the year. Something big is happening right now. Something has changed. There's been a major breakthrough. So I'm glad you're both on it. Sorry to interrupt, but I just want to emphasize that point: hey, everybody, listen up. This is different than it was even just 12 months ago, nine months ago even. Yeah, and I think that this chart is pretty much the proof. And I'll just show you one other example: the same result that you see in the bottom corner. I should have made a larger version of the graphic, but you see the same story. That's the CPA test. Now, it gets clobbered on the math part of the CPA, and you can read that paper, but it's the same basic story. You see this material jump between GPT 3.0 and 3.5, bottom left corner. Okay, back over to you, Mike, for anything else? Yeah, and I think the biggest point, if you think about what the bar exam really tests, Dan, as you said earlier, is that it's mostly a test of syntax. There's some testing of legal theory, and some practical elements in the MBE at least, the kind of thing that you see in law school and that the state bars care about. But honestly, I think many practicing attorneys, especially as they lean corporate, care more about the things that are tested on the CPA exam from a concept perspective than, let's say, whichever question California decides to throw onto the exam this year.
So the CPA exam is an interesting semantic or conceptual counterpoint to the syntactic performance on the bar exam. And to me, viewed in complement to each other, they show this isn't just a quantum leap in syntax capability. This is also a semantic, conceptual awareness that was previously either not present or not able to be exposed. So there we are. I know I think I saw a couple of questions come in. Yeah, we've got a few. I can help surface them for your convenience. Do you want to pick and choose? By all means. So one question that's kind of seminal, it's high level, and then we'll get into the nitty-gritty: what does this mean for the future of the bar exam? If generative AI, let's say in the next revolution or evolution, passes, overall passes the bar, does that mean that our bar exam or CPA test should evolve, and how? And can I just offer one provocation to that? I've been thinking about this a lot lately as I've been trying to grapple with what this means, and how we adapt to it and make the most of it rather than fall under the bus. When motor vehicles came along, that was a big change, right? People could go a lot faster and a lot further. That didn't mean we changed the rules of the Olympics for running or for the marathon. So we have things that humans do, and we have capabilities that machines provide that allow us to extend our reach and our power and our vision in certain ways. But it strikes me that the most important thing here is to look at the technology and not necessarily judge it solely against human intelligence. Let's take a look at it for what it is. Now, having said that, let me ask you guys: what is it? What does it mean for us and for the bar exam? Dan, do you want me to answer, because I have less to lose, not being faculty? Well, yeah, go ahead. I mean, I'm not any big defender of the bar exam. So go ahead, though. I think the question is why it exists, right?
And there's a degree to which it exists in the absence of a regimented system with transparency. You could talk about things like economics, the market for lemons, information asymmetries, or you could just acknowledge that there might be longstanding gatekeeping dynamics at work here, and that the NCBE itself has adjusted the difficulty of the exam solely to reduce passage rates, which doesn't strike me as necessarily relevant to the qualification of practitioners if they're just changing it to make it harder so that fewer people pass every year. I don't carry a bar card, so I can kind of say what I want on this front. But I think the question is, again, we said this in the intro, because it's where I truly am on this: there is legal demand. There's an uncontroversial quantity of unmet legal demand in the market; lots of people have tried to measure it. The access-to-justice gap exists because we don't have either supply or access, or whatever lens you want to take. People aren't getting the legal services they need, for one or more reasons. To me, as long as we have this unmet volume, which is not an insubstantial volume of demand for legal services, especially among people who are probably, if we're being very blunt, not getting access to the best attorneys anyway, then we truly do have an ethical responsibility, regardless of what the hell the state bars tell you, a true ethical responsibility in an absolute sense, to try to figure out how to use these tools to help people. And does that mean give them ChatGPT and say, do exactly what it says, and I'll bill you for it? It absolutely does not mean that, right? But so long as we have so many people who can't afford or access services, the ethical obligation is to figure out how to solve that, and I don't see anything else that can scale and get anywhere near as close as what we just presented. Is it ready?
No. But is there any other system capable of scaling to the total volume of questions that people ask in our legal systems? I'd be happy to see it. Indeed. And speaking of the total number of questions, we have another question here about what that corpus should be. The question is: how much of GPT davinci's poor performance relates to a lack of legal-specific texts, traditional opinions and so forth, in the training set? And what do you think about the next version of GPT, in terms of honing it or doing post-training fine-tuning on this domain? Or, if we just make it big enough, will it surpass these barriers? Yeah, one of the things that nobody knows, because it's kind of a closed model despite the name of the company, is the provenance of the model. There have been a number of publications that are peer-reviewed, although peer review is limited in situations like this, and they describe training on a set of data whose best open analog is called The Pile, used in a number of models like the ones from the EleutherAI community and BLOOM. The Pile includes a large volume of material, including but not limited to the Free Law Project and Nolo. And in our CPA paper, for example, we explicitly ask the model to include a source, an authority, or a reference to the authority for its answer, and frequently it will show you a URL for something like Nolo or LII or a similar source. So I've kind of gone back and forth on this, but as we've collected a little more information, I do believe that GPT and many of these models have in them most of the public law, if you will. Do they have every complaint in PACER? I don't think so. But do they have most of the public law that you would think would be required to answer these questions? I think the answer is yes.
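The source-elicitation idea mentioned for the CPA paper, asking the model to name an authority alongside its answer, can be sketched as a prompt template. The wording and helper below are a hypothetical reconstruction for illustration, not the actual prompt used in that work.

```python
def build_sourced_prompt(question: str, choices: dict) -> str:
    """Assemble a multiple-choice prompt that also asks the model to
    cite an authority (a statute, a case, or a secondary source such
    as Nolo or LII) supporting its chosen answer."""
    lines = [
        "Answer the following question, then cite the primary or",
        "secondary authority that supports your answer.",
        "",
        question,
        "",
    ]
    # Present the options in letter order, (A) through (D).
    for letter, text in sorted(choices.items()):
        lines.append(f"({letter}) {text}")
    lines += ["", "Answer and authority:"]
    return "\n".join(lines)

prompt = build_sourced_prompt(
    "Which rule governs spoliation sanctions in federal court?",
    {"A": "FRCP 11", "B": "FRCP 37(e)", "C": "FRE 403", "D": "FRCP 12(b)(6)"},
)
```

The trailing "Answer and authority:" cue nudges a completion model to emit both pieces in one pass; whatever URL or citation comes back can then be checked against known sources, which is how one notices references to sites like Nolo or LII.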
So then it comes down to the architecture of the model: what data was used in reinforcement, what pre-processing or post-processing or other models are in the pipeline that we experience as this singular model. And for GPT, I don't know how to tell you the answer. I do know the other models out there seem to know about source material; source hallucination is an issue at large, and there are techniques to handle it. I know you referenced LangChain too, which is a great way to control some of this stuff, but "it's an open question" is, I guess, the most appropriate answer for this community. Well, one thing that should be said, though, is that we picked this test because it is not really available on the internet. That's important, because otherwise you're sort of feeding it things it's already seen. Obviously there's a concern, if you do it again, that maybe it's been gobbled up by now; I think this is always an issue, but... The answers for this exam were never sent to GPT. We only sent the questions and took the answers back; we were trying to keep it clean. But you do worry with some of the others; there are bar materials out on the web, and they've probably been gobbled up in this kind of vacuum cleaner that they use to build The Pile or what have you, or Common Crawl, or whatever. So that's what we were trying to do. We can't be absolutely certain, because nobody knows for sure, but this is not generally available on the web. That's what we can say. We're going to need to start to segue. I know you're incredibly busy, but I encourage you to stick on for a few more minutes if you can, Michael and Dan, because I want you to see what Jesse's come up with in his startup, by way of a new modality that lawyers and others can use for prompting. And the last little bit of color: Megan and I are in the midst of a research project trying to probe what these models can do vis-à-vis fiduciary duties.
One of the things I'm starting to work on with Gabe Teninbaum and with Jonathan Askin and others is to get faculty and experts in fiduciary duties to help us come up with completely new fact patterns and cases that have not only never been published but have never been thought of before, so we can finally have confidence that there hasn't been leakage into the training data. It's an extraordinary measure, but at some point we have to put our foot down and be absolutely sure that we're at least getting performance on things that are novel. We did that with the CPA exam, so maybe we can share that more. We created de novo questions from the curriculum.
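One crude, illustrative way to screen a newly drafted fact pattern for training-data leakage is to check for long verbatim n-gram overlaps against whatever reference corpus you can assemble (bar prep materials, published casebooks, and so on). This sketch is an assumption-laden proxy for the de novo approach described, not a method from the papers; exact shingle overlap catches only literal reuse, not paraphrase.

```python
def ngrams(text: str, n: int = 8) -> set:
    """Return the set of n-word shingles in the text (case-folded)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_score(candidate: str, corpus_docs: list, n: int = 8) -> float:
    """Fraction of the candidate's n-grams that appear verbatim in any
    corpus document. A high score suggests the fact pattern may echo
    published material a model could have seen during training."""
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0  # candidate shorter than n words: nothing to compare
    seen = set()
    for doc in corpus_docs:
        seen |= cand & ngrams(doc, n)
    return len(seen) / len(cand)
```

A score near zero is necessary but not sufficient for novelty; the expert-drafted, never-published fact patterns described above remain the stronger guarantee.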