Hey everybody, David Shapiro here. Back after a hiatus. I've got a lot going on, and you'll be excited for some news coming up. First, I want to address the big elephant in the room: I just put most of my videos back online. This comes after, of course, making a recent video explaining why I took them down. I did leave some of my videos and some of my code down, but most of them are back up, both videos and repositories. After talking with people, I figured out how to strike the balance between creating tools that will help people rather than replace people. It's inevitable that things are going to change, but a tool is a tool. Anyways, I don't need to get lost in it; go watch the other video.

Now that I'm back, it's time to get our hands dirty again. One of the questions that pops up a lot is people wanting to train GPT-3, as in, "How do I fine-tune a question answering bot so that I can talk about..." and I think someone asked about the case law in Argentina or something. I don't have that data, but the short answer is: you don't. Fine-tuning doesn't work that way. Fine-tuning is about teaching the model a structure. You don't teach it new knowledge with fine-tuning; you teach it patterns. ChatGPT is a pattern: I ask a question and it writes a response, then you ask a follow-up question and it writes another wall of text. That's the pattern. ChatGPT was not taught anything new. It's only taught new stuff when you retrain the underlying model, and you can't do that; it's way too expensive to retrain the underlying model.

So I figured, let's pick something that will be a good exemplar of this. Just to do a quick recap, the genesis of this whole project was that people ask how to fine-tune a question answering thing that will do case law or any other kind of knowledge base. It's all the same under the hood: you have a collection of documents, wherever they happen to be, and you want to do QA against that with GPT-3. So here we go.
I already have one that answers complex questions from multiple documents, but this is a little bit different, because there are going to be a few steps to this. Anyways, to show you what I mean, I went over to ChatGPT and asked what kind of law system establishes law by precedent, and it says this is the common law system, as opposed to civil law. Common law means that a Supreme Court decision kind of sets the law of the land. So if you want to understand the American legal system, you really need to understand case law, and more often than not it comes down to Supreme Court decisions, because that is the highest court in the land. They set the tone for everything, so Supreme Court decisions really teach you how it works.

So I went over to the Library of Congress and found that you can download all Supreme Court opinions, grouped by case topic. They're also grouped by volume or justice, but topic is going to be more relevant, right? Because let's imagine that you're an antitrust lawyer and you want to say, give me everything about antitrust law. I need to know everything there is about this so that I understand the legal precedent. Because on the one hand, there are established procedures, right? Procedural things. Oh, and I know all this because my fiance's cousin is training to be a lawyer, and when they come visit, this is what we talk about, because we're nerds. So there's all kinds of procedural stuff that I don't even remember, but the idea is that when you have rule by law, it is all about procedure and protocol rather than emotion. We actually have a very stoic system: we're going to think through this, we're going to look at the letter of the law, we're going to have an impartial system.
Of course, when you have an impartial system that requires expert navigation, that automatically privileges people with access to lawyers, aka people with training or money. Privilege is a whole other topic. Anyways, the system is there. It's a very sober system: let's read through the established protocols, friend-of-the-court briefs and all that. I watch LegalEagle too; LegalEagle is great. So all that kind of stuff is fine, but interpreting established law, common law or case law, is a very specific topic.

So let's take antitrust law. Let's see, how many did it say? Antitrust: there are 362 documents, all available online as PDFs. They've been scanned, and I believe they've also all been OCR'd. Let's take a quick look and close some of these superfluous tabs. Yeah, so PDF 661. And when you highlight the text, you can see that it's been OCR'd, which means we should be able to scrape it even though it was scanned. So we should be able to get this information. Let me go over here to opinions, opinions PDF. We'll save this one and, whoops, I put it in the wrong folder.

You might have seen I had a recent document scraping video; this is the lead-up to that. This is why you need something like document scraping. Oh, and the whole reason for this is that I went and asked ChatGPT, tell me about this case law, and it said, I don't know what you're talking about, even though this is a real case. I said it was a Supreme Court case decided in 1953, and it still doesn't know it, right? Because it's not connected to any external data source. One of the biggest weaknesses of ChatGPT is that it's a mind in a bottle. It has no contact with the outside world.
The only way that ChatGPT can interact with anything is via this chat interface. From an architectural standpoint, that's not actually that difficult to fix, but you introduce a whole lot of new problems, especially when you consider that there are billions of terabytes of text data out there to search, and a lot of it isn't accessible because it's in PDFs or private databases or something. So you need a link between the language model, which can read anything, and the stuff that you want it to read. That's what we're working on here. Okay, so now that you're caught up: one of the greatest flaws of ChatGPT is that it's not connected to anything. It's in a vacuum.

Okay, cool. So now what? Well, we've got our data here. It's in text, but it's not necessarily machine readable. So the first thing we've got to do is take our PDF and use this script that I wrote. Let me just show it to you real quick. It takes everything in the folder of PDFs and converts it. Let me go ahead and run this; it should go pretty quick, and then we'll look at converted. So here it is. Tada. And this repo is public, by the way. Oh, and one thing that I did was add a little token so that it keeps the page breaks. I might remove that. Actually, no, let's keep it, because it's a helpful demarcation. When you read a PDF, you have to read it page by page, and sometimes knowing where there's a page break is helpful. So we'll keep that.

All right, let's come back to converted. We'll copy this, bring it back over to opinions underscore text, and paste it there. I'm going to download a bunch of these. I'm going to pause the video; you don't need to watch me downloading, but this is what I'm going to do.
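The conversion script only flashes by on screen, but the core of a converter like this is just a few lines. Here's a minimal sketch assuming the pypdf library; the folder names and the exact page-break token are stand-ins for whatever the repo actually uses.

```python
import os

PAGE_TOKEN = "\n<<NEW PAGE>>\n"  # illustrative page-break marker

def pages_to_text(pages):
    """Join per-page text with a page-break token, so downstream steps
    can use page boundaries as rough semantic boundaries."""
    return PAGE_TOKEN.join(page.strip() for page in pages)

def convert_folder(pdf_dir="pdfs", out_dir="converted"):
    """Convert every PDF in pdf_dir to a .txt file in out_dir."""
    from pypdf import PdfReader  # pip install pypdf
    os.makedirs(out_dir, exist_ok=True)
    for name in os.listdir(pdf_dir):
        if not name.lower().endswith(".pdf"):
            continue
        reader = PdfReader(os.path.join(pdf_dir, name))
        # extract_text() works here because the scans have been OCR'd
        text = pages_to_text(page.extract_text() or "" for page in reader.pages)
        with open(os.path.join(out_dir, name[:-4] + ".txt"), "w", encoding="utf-8") as f:
            f.write(text)
```

Calling `convert_folder()` on a folder of downloaded opinions produces one text file per PDF, with the page-break token preserved between pages.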
So I'm not going to spend the time to download all 300; I'll sort them by most popular or whatever, and we'll have a whole bunch of Supreme Court case law about antitrust. We'll be right back.

Okay, I downloaded files until I got rate limited. So be kind to your data sources and don't abuse them. Many websites will do this if they detect that you are scraping; if they don't offer a bulk download, there's probably a reason for it. Anyways, it didn't give me a warning that I had violated any terms of service, and it didn't mention any consequences. It just said, we're temporarily rate limiting you. That's fine. This is all public information from the Library of Congress, so I think it's more of a technical thing.

So anyways, what I'm doing here is converting it all to text. Let's go to converted and delete the ones that we don't need. This is infinitely more case law than I ever want to read. I mean, I'm not going to read one of them, let alone 22 of them. Let's copy these over to my repo and replace that one. Okay, so now we have 1.7 megabytes of antitrust case law, going back to the late 80s. So if we build something that understands this, we should have the ability to interact with a machine that can explain the current common law of antitrust for America. Hey, who knows? Maybe LegalEagle will watch this and want to do a collaboration or comment on how accurate it is. That would be cool. Someone please get Devin to check it out and comment on, one, my accuracy, but also the value of this tool.

Okay, so what do we do next? Well, here's the thing: the biggest limitation is the token limit of large language models.
It's this weird paradox, right? The model itself, I don't remember exactly how big these are, but they're many gigabytes; I think GPT-3 takes something like 700 gigabytes of VRAM. It's enormous. But despite how big it is, it can't take in that much information at once. It's like blowing information in through a straw. Same thing with your brain: your brain is three pounds of neurons, a hundred billion neurons, seven thousand synaptic connections per neuron, but you can only speak at a few bytes per second. Your input and output rate is very slow compared to the processing power of your brain and the amount of information in it. The same is true of GPT-3 and all language models right now.

Not only that, they have a very short memory; they can only work on one task at a time. So it is not possible for the machine to tell us all about this, because even ChatGPT, which is GPT-3.5, the most recent thing, is still limited. And even if the window goes up by a factor of a hundred, there's still too much information here for the model to take in. This is a problem we're going to be contending with for the foreseeable future, until there's some fundamentally different kind of AI model that can read all of this, or until it's easier to fine-tune something. Honestly, the easiest thing would be to include all of this data in the baseline model, the foundation model, so it just knows it intrinsically. But these models are really expensive to retrain, so until we get to that point, we're going to have to figure out ways of using external databases or knowledge bases.

So that's the problem statement. We've got 1.7 megabytes of text here; what do we do with it? Well, this is really dry stuff, super dry. So what can we do with it?
Well, one thing we can do: I've got this handy-dandy thing where the text is broken up by page, right? And you see that in many cases a sentence continues across the page break, so the page barrier is not necessarily a good semantic barrier. What we mean by a semantic or logical barrier is that you might still cut something off right in the middle of an idea or a thought. But it's still a good enough place to break, because when you look at how long this is, it's 20,000 characters, so probably about two prompt windows' worth. We can have GPT-3 read most of this.

Actually, let me pause for a second and do a quick experiment. Instead of just telling you, I'll show you. Okay, so we put this in here. It's 5,800 tokens long; our maximum is 4,000. So if we just split something like this in half, right? It's 20,000 characters, so we split it in half and summarize it that way, and we might be able to do something with it. But the problem then is that we don't know exactly what we want out of it.

So let's think about this. What kind of information would we want if we wanted to make, like, a Wikipedia? Maybe that's a good way to go: what are the implications here? So in this case, boxing matches, a suit against Don King. Oh, this is fun. RICO charges, okay, and they refer to other codes. Basically, what this document is doing is using language to build a web of reasoning and logic. This actually sounds kind of like a knowledge graph. So I'm wondering, what if we use this to build a knowledge graph? I've never built a knowledge graph. This is fun. So maybe the goal here is: let's build a knowledge graph. Okay, so let's go over to ChatGPT and ask it what a knowledge graph is and how to build one. Okay, I got logged into ChatGPT. What is a knowledge graph?
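As a side note, you can sanity-check whether a document will fit in the prompt window before pasting it in. A common rough heuristic for English text is about four characters per token (the 20,000-character page above measured 5,800 actual tokens, so legal text runs a bit denser than the heuristic suggests); this sketch uses that rule of thumb rather than a real tokenizer.

```python
PROMPT_LIMIT = 4000  # token limit of the models used in this video

def estimate_tokens(text):
    # Rule of thumb: ~4 characters per token for typical English prose.
    # A real tokenizer (e.g. OpenAI's tiktoken) gives exact counts.
    return len(text) // 4

def fits_in_window(text, limit=PROMPT_LIMIT):
    return estimate_tokens(text) <= limit
```

For example, `fits_in_window("x" * 20000)` comes back False, which matches what we just saw in the playground.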
Let's see what it says. A knowledge graph is a data model that represents a collection of interconnected data and concepts, typically organized around entities and their relationships. It is used to represent and organize large volumes of structured and unstructured data in a way that allows for easy querying and visualization of relationships. Okay, and then it looks like it froze: including search engines, recommendation systems, and natural language... yeah, it's going to freeze up. So once this unfreezes, the next question I'll ask is, can I hit escape? You cannot abort. The next question I'll ask is what kind of format it is. I'll pause the video until it unfreezes.

Okay, I think it was just frozen, because I refreshed the screen and it's fine. So I'm asking, how can I code a knowledge graph? It says you can manually build a knowledge graph if you have a small amount of data, by creating nodes for the entities. You can also use tools like Graphviz or Gephi to visualize and edit your knowledge graph. Interesting. Okay, using an NLP tool, that's exactly what I'm going to do; a graph database; or a pre-existing knowledge graph. Cool. So I wonder what kind of format these take? I wonder if it knows. Neo4j or Amazon Neptune, cool.

Let's see, what file format is a knowledge graph? Like, do I use JSON or something? Let's see: there are a number of different file formats you can use to represent a knowledge graph. Some common ones are GraphML, an XML-based file format supported by tools like Gephi and yEd. Okay. RDF, the Resource Description Framework. I don't know anything about knowledge graphs other than the theory. JSON-LD is a lightweight linked data format. CSV? Really? CSV is simple: one row per relationship, with columns for the source and target nodes. Okay.
I am personally a big fan of JSON because it's human readable. CSV is human readable too, but it's a little messier, especially when things get really complicated. So: can you give me an example of a JSON-LD knowledge graph? Let's say, for instance, I want to see some nodes about the history of France. I'm kind of a Francophile; I visited France and I really love it there. Okay, sure. Here's an example of JSON-LD. All right. So it looks like each node is actually pretty simple: it's got an ID, a type, a name, and a description. That's really simple. Nationality, oh, interesting. It looks like some of the properties are kind of arbitrary. French Revolution, start date, Napoleon Bonaparte. But yeah, I really like it.

While this is running, I'll tell you a little bit about France. I visited France 10 years ago, in 2012. Okay, here we go. Let's see: how does JSON-LD establish relationships? I don't see any examples of connections in the above example. Okay, so while it's telling me about the @id, all right, it'll explain. Anyways, the culture in France is somewhat similar to America in that we both think very highly of ourselves, but there are some really stark differences, namely the pace of life. Sure, if you go to the big cities like Paris, it's rush, rush, rush, but if you get outside of Paris, even in some of the larger cities, people just have a different attitude towards life. The meal portion sizes are smaller, things like that, and people are less in a hurry. And I hear that Italy is even more so, where nothing happens quickly. So maybe it's just a European thing. Anyways, it's very refreshing to see a modern, powerful nation, because France is the number three exporter of military hardware or something like that.
I don't remember exactly, but this is a powerful, modern country that has a much slower pace of life and a different attitude towards enjoying things. Okay, let's see what it says about how these things link. It says: in the example I provided, I use the @id property, for example in the following snippet. The nationality property is set to the @id of France. Oh, okay. So this is the connection: if you're referring to another node, you point at its @id. Got it. So nationality is like a property; the properties attached to each node are arbitrary, and then you can also have one node connect back to another. Got it, got it. Okay, cool.

I wonder if we can just have GPT-3 rewrite this as a JSON-LD knowledge graph. If ChatGPT knows it this well, and text-davinci-003 is built on the same underlying model, GPT-3.5, it's entirely possible this will work. Okay, let's see. Convert the following SCOTUS opinion document into a JSON-LD formatted knowledge graph. And then we'll add some vertical white space, just to be friendly to the thing. And we'll come down to... let's see, it's just a little bit too long. Let's cut this roughly in half. New page, so "in addition A's acts," blah, blah, blah, okay. Oops, come back. We'll just save that there, give it some more vertical white space, and then write "JSON-LD knowledge graph."

Also, one thing I have discovered is that I actually prefer to turn the temperature down lately. The reason is that the most recent instruct-aligned models do almost exactly what you want, and with a temperature of zero you get really good, consistent results. So I have changed my default temperature to zero, and everything else to zero as well. It's pretty well aligned. Okay, so let's see if this works. It looks like it's going to do, like, the whole thing. Vehicle true, vehicle, vehicle, okay.
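To make that @id linking concrete, here is a tiny hand-written JSON-LD fragment in the same spirit as ChatGPT's France example (the node contents are illustrative, not from any real knowledge graph): the nationality property on one node points at the @id of another node, and that pointer is the edge of the graph.

```python
import json

# Two nodes; the second references the first via its "@id".
graph = {
    "@context": "https://schema.org",
    "@graph": [
        {
            "@id": "#france",
            "@type": "Country",
            "name": "France",
            "description": "A country in Western Europe.",
        },
        {
            "@id": "#napoleon",
            "@type": "Person",
            "name": "Napoleon Bonaparte",
            # The relationship: this property points at another node's @id.
            "nationality": {"@id": "#france"},
        },
    ],
}

print(json.dumps(graph, indent=2))
```

The properties on each node are arbitrary key-value pairs; only `@id`, `@type`, and `@context` have special meaning in JSON-LD.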
So that's not quite what I had in mind. What I was hoping is that it would break down the opinions and the, what do you call these, the citations, right, where you reference something. So let's give it a little more instruction about what I want. Specifically: focus on dates, decisions, opinions, and reasoning. The purpose of this knowledge graph is to be searchable by lawyers for legal precedent and case law. And let's say specifically by trial lawyers. Basically I'm telling it this is a research tool. Here, I'll just tell it: this is a research tool for preparing for trials before the Supreme Court. I'm just trying to think, what would Devin say on LegalEagle?

Okay, so let's try this again and see if this changes how it composes this thing. Decision, opinion, reasoning, excellent. "Opinion in the circumstances requires no more formal legal distinction between person and enterprise." Okay, that's interesting. It's still not quite right; I'm still missing something. What is it that I want from this? Maybe we can't go straight to this. Hang on, I think someone's moving around. Let me close my door. I'll be right back.

Okay, sorry about that. So it's breaking it down into one thing, but I guess I need to think about what nodes I want out of this. Well, here, let me go ahead and save this prompt because it's pretty good. First, I think the first thing we need to do is get the whole document rewritten in such a way that it can fit inside a single prompt window. If we have the whole thing a little more condensed, then we should be able to get a proper result. But we also need to think about what kinds of nodes we want. You know: the Second Circuit did this, RICO requires this, in this other case it said that.
So I guess each node is going to be every case cited. Yeah, okay: the case cited and why. I think that's each node. All right, cool. So let's see. Each node should be a case citation, precedent, or prior opinion. I'm probably using the wrong term, but include... what the heck was the word? My goodness. Why is my brain doing this? I need more coffee. Unique identifier. Property, not parameter: property. Each node should have several properties, such as date, case number, involved parties, reasoning for inclusion in this opinion, and other relevant information. Okay.

So let's see if we can get the nodes that we want, because if we can convert each chunk straight to nodes, that might save us a step, but I suspect we're going to have to summarize it first. This is really cool. I was really skeptical about ChatGPT, but I'm becoming less skeptical. Oh, this is good. Yes, it's working. Okay, so let's save this prompt because it worked really well. I'll save this as, let's go up here, prompt, JSON-LD, citation nodes. This is an example, so we'll say example prompt. Okay, so we got the nodes that we want, I believe. Oh man, this is going to be fun, because then I can try to figure out how to take all this and visualize it. I wonder if we can visualize it with Python.

All right, but let's pause for a second, because this is only half the document. That's not good enough, right? Do we want to just read it raw and go straight to it? Let's try summarizing it and see if we can get the whole opinion into one document. Now, here's the thing: some of these opinions are like 200 pages long. So how are we going to do that? Because in order for the result to make sense, you kind of do need the whole thing, but you also don't want to lose detail. So let's think about this for a second.
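Pulling the prompt together, a call to text-davinci-003 along these lines would look roughly like this. The prompt wording follows what I typed on screen, but the function names are mine, and this uses the legacy openai Completion API (openai < 1.0), so treat it as a sketch.

```python
PROMPT_TEMPLATE = """Convert the following SCOTUS opinion into a JSON-LD formatted knowledge graph.
Each node should be a case citation, precedent, or prior opinion. Each node should
have several properties, such as date, case number, involved parties, reasoning for
inclusion in this opinion, and other relevant information. This is a research tool
for trial lawyers preparing for trials before the Supreme Court.


{chunk}


JSON-LD knowledge graph:"""

def build_prompt(chunk):
    # The vertical white space around the document helps delimit it for the model.
    return PROMPT_TEMPLATE.format(chunk=chunk.strip())

def graph_from_chunk(chunk):
    import openai  # legacy SDK: pip install "openai<1"
    response = openai.Completion.create(
        engine="text-davinci-003",
        prompt=build_prompt(chunk),
        temperature=0,  # instruct-aligned models give consistent output at 0
        max_tokens=1500,
    )
    return response["choices"][0]["text"].strip()
```

With temperature pinned to zero, repeated runs on the same chunk should produce essentially the same node list, which matters if we want a reproducible graph.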
Let's see. Rewrite the following SCOTUS opinion... let's say as a list of assertions. No, we'll say summarize, because summarize implies that you want to reduce word count. Remove superfluous language while retaining specific details. Let's see if that works. Summary, okay. Yeah, those are good notes, but it's not retaining the information that I want to see, such as the nodes. Okay, so rather than have it read the document multiple times, I think what we'll do is break it into chunks of, let's see, how long is this? We'll do chunks of 13,000 characters. That seems good. So we'll do chunks of 13,000 and go straight to knowledge graphs, because that worked really well. That worked exceptionally well. Okay, so let's clean this up, and we'll come down here and do "chunk," and then "JSON-LD knowledge graph." And then File, Save As: prompt, JSON-LD citation nodes. And I need to take a quick bio break. I'll be right back. I'm sure you wanted to know that.

All right, actually, I just realized that this video is running long. It's already 30 minutes and I'm a bit fried, so we'll come back. We've got our bearings now. When we come back for the next video, we will start doing the data prep, because that's a lot of fun; let me tell you, that's why I don't want to do it right now. We'll take all of these opinions and split them into chunks while keeping some of the essential information with each chunk, and we'll have to do a little figuring about how to format the knowledge graph correctly, because there are some problems to solve. So we'll split it into chunks, we'll prepare the data, we'll do some experiments with generating a knowledge graph, and that's probably all that part two will have. Then part three will be actually loading this into a database or a visualizer. All right gang, thanks for watching. It's good to be back. Take care.
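The chunking step we'll start with next time can be sketched now. Assuming the page-break token from the conversion script, this splits a converted opinion into roughly 13,000-character chunks, breaking at page boundaries so we don't cut mid-page (the token string and sizes match what's used in this video, but treat the code as a sketch).

```python
PAGE_TOKEN = "\n<<NEW PAGE>>\n"  # assumed marker from the PDF conversion step

def chunk_text(text, max_chars=13000):
    """Greedily pack whole pages into chunks of at most ~max_chars characters."""
    chunks, current = [], ""
    for page in text.split(PAGE_TOKEN):
        if current and len(current) + len(page) > max_chars:
            # This page won't fit; close out the current chunk.
            chunks.append(current)
            current = page
        elif current:
            current += "\n" + page
        else:
            current = page
    if current:
        chunks.append(current)
    return chunks
```

One caveat: a single page longer than the limit still ends up as one oversized chunk, so a second character-level split may be needed for extreme cases like those 200-page opinions with dense pages.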