So we're on time, and we've got a lot to go through, so I'm going to start. My name is Van Lindberg. To give you all a little background: I've been doing computing for, I don't know, 25 years. I've been involved with open source for about that long, and I've been doing AI since 2008. So when the entire world started going crazy about AI, it was great for me. This talk is listed in the AI and data track, but it's also listed as intermediate, because it has both: we go into both legal concepts and AI concepts. So this is probably going to be a mixed audience. How many of you are primarily technologists? Okay. How many of you are primarily lawyers, or legally trained? Okay, not quite half-and-half. Because I'm going to go into both technical details and legal details, I'm going to disappoint both groups, just in different parts of the presentation. For everyone who has been involved in AI: I'm going to be working not at a 10,000-foot level and not at a one-inch level, but somewhere between 10 and 100 feet. Low enough that we can talk about some of the details of how things are trained, but high enough that we can get through it. We are going to be abstracting a lot of things.
I've done my best to make this technically accurate, including just enough detail, but also accessible to those of us who are legally trained. For those of you who are legally trained, I'm probably going to be talking about some cases or things you're already aware of, and if I'm going over stuff you already know, I'm sorry; it's there for the people who are more technologically oriented. The good news is that this is an unusual enough and rapidly developing enough area that there's something new in here for pretty much everybody. Another thing I'd like to mention: when I talk about AI, I'm going to be talking essentially only about generative ML. I'm aware of all the other AI; it's been around forever, of course. But generative ML is the thing that has really started to drive this AI revolution, and it's the thing that is most interesting from a legal standpoint, because everything else is just a variety of algorithms. You know, fuzzy logic back in the '90s: oh wow, we get to figure out percentages of yes or no and then quantize it. Or A*. It's fine. But none of those things are as interesting, or as legally challenging, as generative ML. And when I'm talking about ML, I'm really going to be talking about the advances of the past five years, and really the past three or four. So, moving on. I've got 40 minutes, so this is also the short version of this talk. For anyone who's interested, the long version was published in the law review a little earlier this year. I've also put up a short link; it just redirects to the long one. It's a long analysis of how AI training works (model training, inference, and generation), as well as the copyright implications from a US copyright perspective. Because we're in Europe,
I'm going to mention a few things from the European perspective, but this is going to be mostly US-law focused, especially because that's where a lot of the current action is. For anyone who is interested in the European side, Professor Guadamuz has a great article; he touches a little on Europe in general, though it's more specific to the UK. I haven't found as good a review article on these same topics for Europe generally; if I do, I'll let y'all know at some point. Okay. Now, today we're going to talk a lot about models. "Model" is probably the most misunderstood term out there. A lot of people say "model" and mean the magic black box that does what they want it to do. But we really need to understand what models are, how they work, and how they're trained in order to apply the correct legal analysis. So before diving into the mechanics of ML training, I'd like to start with an analogy, one that I've found really helps a lot of people get a good mental picture of what's going on in model training. Assume that this guy right here is an art inspector, and he's hired to inspect all the paintings in the Louvre. Of course, when he's hired he knows absolutely nothing about art. He doesn't even know what makes art good, what makes art bad, or what the different types of art are. So he decides: what I'm going to do is just measure everything. He goes in and starts measuring the size of each painting, and the number of distinct detectable colors in each painting, and even random things like the number of syllables in the artist's name, or which corner they sign in. And then all the other facts about it: the name of the artwork, the name of the artist, where they lived. Things that seem random, like the colors six inches away from each other. Everything possible
you can think of, he measures, and he writes it down in his little notebook, his database. Now, before long this gets pretty boring, so he decides he's going to start playing a game. Every time, before he makes a measurement, he's going to take the things he knows so far and guess. He'll think: what do I know about this? Well, it was done in this year, it's in that era, okay, I'm going to guess that the answer to this question is X. At first, of course, his guesses are terrible, but after looking at thousands or millions of different paintings, his guesses actually get to be pretty good. After he's gone through all of this, he starts to notice patterns, and he can effectively guess all sorts of things about a painting from just a little bit of information. So after all this time goes by and he's made all these inferences, he becomes sort of the world's foremost expert at giving information about these paintings. Before, he was trying to guess things from what he knew; now, you give him a little bit of information and he's able to infer the rest. This is a lot like how model training works. You really go through these five steps, where step five, of course, is "repeat." Just like the art inspector, you go through and you measure things. When a lot of people think about model training, they talk about how it's "reading" this thing or "sucking in" all this content. To a certain extent, yes; but to a certain extent, not really. What it's really doing is making statistical measurements, measuring the statistical probabilities associated with various things. It makes a prediction, and then it checks, because as you go through the training process there is a known answer. For example, in the context of
something like GPT, the next word is the actual answer. So it guesses. You can say "it was a dark and stormy..." and many of you will say "night"; that's a high-probability answer. Or you could say "the wizard raised his ___ and zapped the creature." That blank could be "hand," it could be "wand," it could be "staff." It's unlikely to be "elephant." There are all these probable endings, probable answers for that blank. Once the model finds out the actual answer, it goes back and updates all of its probabilities, so that it will be a little more likely to give a correct answer, or at least an answer with the correct statistical probabilities, the next time. Then it repeats this millions or billions of times. That is really the training procedure for almost any type of ML. The specific things being guessed and measured vary from one application to the next, but this is the same essential procedure. So, in order to build a model, you actually create what's called an architecture. To be specific, when we're talking about a model, a model really has three separate parts. The first is a logical architecture, represented like this here: a logical way of thinking about how we're going to take these inputs, how we're going to take them apart, how we're going to analyze them, how we're going to make our predictions, and then how we're going to represent the output. That's a logical construct, which is then represented in a particular way by the second part: the code you write to implement the model architecture, the thing you write in PyTorch or whatever. The code you write in PyTorch is not actually the model architecture.
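To make those three parts concrete, here's a toy sketch, entirely my own illustration with made-up layer sizes, not any real model. The shape of the code is one implementation of an architecture; the weights `W1` and `W2` are nothing but arrays of numbers.

```python
import numpy as np

# Toy sketch (illustrative only, not any real model): the *architecture*
# is the structure below: an input layer, one hidden layer, an output.
# The *weights* W1 and W2 are nothing but arrays of numbers.
rng = np.random.default_rng(42)
W1 = rng.normal(size=(4, 8))   # input (4 numbers) -> hidden (8 nodes)
W2 = rng.normal(size=(8, 2))   # hidden -> output (2 predictions)

def forward(x):
    hidden = np.tanh(x @ W1)   # the "guessing" middle layer
    return hidden @ W2         # the final prediction

x = np.array([0.1, -0.5, 0.3, 0.9])   # a token, already turned into numbers
print(forward(x))                     # just numbers out
```

Code like this is just one possible implementation of the logical architecture; the same structure could be written in PyTorch, C, or anything else.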
It's the implementation. The model architecture itself is something in your head; it's a mental construct. The final part is the weights, the data. So what happens is you go through, and you have the input layer. This takes each separate token, which is basically a number: some piece of the input has been changed into a number. This could represent anything: a pixel value, a particular word, a value in a log file. It doesn't really matter, but it's been changed into a useful number. It goes through the hidden layers in the middle; that's the part that essentially does the guessing. Then it produces an output, the final prediction. As you go through, you have what's called backpropagation: that's where the model updates its analysis in order to try to make its next prediction better. So as you go through this set of predictions, there's essentially a probability, for each of those different parts of the hidden layers, that a value will be modified, or passed on, or sent to one place or another. These are called the weights. If you were to open them up, you'd see that these weights are essentially probabilistic. Unlike a financial model,
it's not deterministic. It doesn't have fixed, known inputs and outputs; it's a probabilistic mapping from a set of inputs to a set of outputs, based on these statistical measurements. If you think about it, it's really a lot like Bayes' theorem, for those of you who are aware of it. Bayes' theorem is a way of estimating the likelihood of a particular output based on known inputs. What we're looking at with these new machine learning architectures is essentially a multi-billion-parameter Bayesian calculation. And if you were to open up one of these models, specifically the weights, you would essentially see something that looks like this: just a huge matrix of numbers, corresponding to the probabilities associated with passing values through all those different nodes in the hidden layers. It is simply a pile of numbers. It's not creative. It's not expressive. It is something that has been developed through an essentially mechanical process of refining these probabilities millions or billions of times. Now, why is this important? One thing is that it's hard to say what any specific probability actually represents. Researchers have been able to identify certain neurons, certain things, where they think they can identify what some of them do. There was an interesting experiment a few months ago where people fed a lower-level set of weights into a higher-level LLM and had it guess what the weights represented. But we really don't know. So what does this actually have to do with the law? The answer is everything, because if you don't apply the proper technical underpinnings, if you don't apply the correct facts, then you're going to start getting all sorts of bad law. The reason why is that when you're dealing with the law, it really is an argument about what is the proper analogy.
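Before we dig into those analogies, here's the guess-check-update loop from the art-inspector story as toy code. This is my own illustration, a simple next-word counter; real LLMs update billions of continuous weights by gradient descent, not integer counts, but the loop is the same: predict, compare with the known answer, update.

```python
from collections import defaultdict

# Toy illustration (mine, not how GPT works internally): a next-word
# "model" that guesses, checks the known answer, and updates its counts.
class ToyNextWordModel:
    def __init__(self):
        # counts[w1][w2] = how often w2 has followed w1 so far
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, text):
        words = text.lower().split()
        for w1, w2 in zip(words, words[1:]):
            # the known answer is simply the word that actually came next;
            # "updating the probabilities" here is just bumping a count
            self.counts[w1][w2] += 1

    def predict(self, word):
        followers = self.counts[word.lower()]
        if not followers:
            return None  # never seen this word during training
        # answer with the highest-probability continuation
        return max(followers, key=followers.get)

model = ToyNextWordModel()
model.train("it was a dark and stormy night and "
            "the wizard raised his wand and zapped the creature")
print(model.predict("stormy"))  # -> night
```

Notice that nothing here "reads" in any human sense: the model ends up holding only statistical measurements of what tends to follow what.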
What are the proper comparisons? If your mental model of AI is "the black box that does whatever I want," then all of a sudden you get people imputing all sorts of things into it that simply aren't true. You start imputing logic and emotion and intent into it, when it's not that. It is simply a really complicated statistical equation. That's it. It's not creative. It has some randomness, randomness that you put into it, but it is not expressive. It is just a pile of numbers. Unless you actually see what's inside and start dealing with the underlying facts, people infer all these sorts of things. They anthropomorphize it and say, well, if a human were doing this, it would be thinking that. The problem is, it's not a human, so you can't say that. There's something we go through in law school called issue spotting: given a set of facts, what are all the different issues that are going to come up? When you're looking at these AI issues, it really comes down to (I'm missing a bullet here) these four things, and I'm especially going to spend a lot of time on the first two, though I'll touch on data privacy and terms of service as well. So, when we're talking about intellectual property: how do you apply intellectual property to machine learning? The first thing you have to ask is, applying the law to what part of machine learning? Are we applying it to the training process? Are we applying it to the model itself, this architecture and code and weights? Are we applying it to the outputs? You actually have to analyze each of these things independently, because they're not the same thing.
They all work together as part of a system to produce the output, but the inputs are not the outputs, and neither one is the model in the middle. So let's start with training. A lot of the questions around intellectual property in AI right now are about how much copyrighted material you can use to train a machine learning model. This is a really interesting question, because a lot of artists especially, but in some cases coders, and people here, are very concerned: hey, I have this thing that is my work. You're reading my work, you're learning from it; you would have nothing if it weren't for my work, so you should be paying me for this. But this is where we get back to why we went through that technical explanation in the first place. Copyright, and this is true pretty much everywhere, in Europe and Asia as well as the United States, doesn't protect every use of a work. Speaking from the US perspective, there are only a few specific verbs that are actually protected by copyright: to copy, to create derivative works, to perform. Everything else may be a use of the work, but it is either a use that is outside of copyright, or it is a fair use, which means it has been judged to fall outside copyright's protection. One classic fair use is doing analysis of a work. A lot of times this shows up as summarizing a work and talking about it: I'm going to review a book in the New York Review of Books. That review is an acceptable fair use.
I can read someone else's book and write a review of it, and even though I'm using some of their concepts and talking about what they did, I'm allowed to do that, because it's a fair use; it's not one of the things protected by copyright. Another thing that's been around for a long time is the ability to study a work. There has been this practice of doing n-grams and textual analysis for a long time, where you study a work and ask: how many verbs are there, how many nouns, which words usually appear together, what collocations? These sorts of statistical analyses are also not something that has been protected by copyright in the US. Probably the leading case here for our purposes is Google Books. Now, how many of you have heard of Google Books? Just about everybody. Google Books was the effort by Google and HathiTrust to buy a bunch of books, scan them, and then use them both to improve Google's search engine and to allow people to search inside the books. The Authors Guild said, very similarly to what a lot of authors are saying now: you're using our books, you are copying our stuff, this is not allowed, you need to pay us for this opportunity. It went all the way up to the Second Circuit, and then it was appealed to the Supreme Court, and the Supreme Court said no. So this is what the law is, at least in the Second Circuit: ingesting a bunch of works for the purpose of creating a search engine, or a very fancy database, is not prohibited by copyright. It is a fair use, in large part because, number one, the search index Google created is not a replacement for the book. The court said the only thing copyright is going to protect against is something that competes against the book itself in the marketplace, and these search snippets are actually allowing people to do new things with the book.
It creates new uses; it's fair use. Now, a lot of people are looking at that and asking: is that still true even when the result of generative AI is to create new works that, in general, compete against an author or an artist? Well, here's where you get to that very specific thing. Copyright is about a work: a very specific work that can be infringed. This painting, this book, this article. Copyright actually doesn't protect your position in the marketplace, or your ability to produce works in the marketplace in general, at all. In fact, copyright is designed to encourage the creation of new works, in part by allowing other people to use parts of your works to generate new ones. And so, when you do this model training, what you're doing is creating a statistical analysis of these works; you're creating measurements of these works. Now, this has not been decided, but what I argue in that paper is that this is essentially the same thing Google was doing when it created the search index: making a bunch of measurements and creating something that is not competition for the work, something completely different. So my belief and my argument is that it is going to be found to be fair use. That said, nobody knows, and it really comes down, like I said, to this argument of analogies. That's why I've been trying to get people to focus on the technical details, the technical facts: unless you get down to that technical level and see what it's doing, it looks very much like "hey, you're just copying my stuff, just like a human would." But it's not; it's just making these measurements. Now, there is one tricky thing, and that's at the bottom.
That's memorization. This is the thing that is getting everybody up in arms. What happens when you create one of these LLMs or image generators and it actually produces something that's exactly the same as one of your inputs? The answer is: that's infringing. So don't get me wrong, it is one hundred percent possible to create copyright-infringing outputs from a model. Remember how I said there are the inputs, there's the model itself, and then there are the outputs? The model itself, as I'm saying here, I believe is highly likely to be found to be fair use. The outputs? Maybe, maybe not. It really depends on the specific output, and you can, one hundred percent of the time if you wanted to, create a copyright-infringing output from one of these generative machine learning systems. The easiest way to do it: go to one of the image generators and type in a copyrighted character, like Iron Man. Now, an interesting thing: in the US and a lot of other places, a character that is sufficiently delineated, that has specific details, can actually be copyrighted. That means the protection isn't tied to one specific book, or comic, or picture; if you copy those details, even in a different pose, you can create a copyright-infringing work. So you type in "Iron Man," and it'll give you back a completely new picture of Iron Man that is absolutely copyright-infringing, one hundred percent, because it's infringing on the copyrighted Iron Man character. In the context of code, because code doesn't have as many degrees of freedom as English, you're also going to have a lot of cases where the output is driven toward one particular result. What happens in memorization is that the probabilities all collapse to one particular output. The model actually hasn't memorized a copy of the output;
it's memorized how to recreate it, which is a version of copying. The copy isn't literally included in there, but the result is the creation of a copyright-infringing output. The interesting thing about machine learning, though, is that as you reduce duplication in your training data, and as your model gets bigger, the chances that you will encounter memorization go down. In fact, there was an interesting study where researchers tried to extract copyright-infringing outputs from one of these image models. Against a model trained on 90,000 images, by spending an immense amount of compute, they were able to extract 108 of the input images, images that would have been infringing outputs. But when they tried it against the full-sized LAION-5B training set and the full Stability AI model, they were able to find only a few; I forget exactly how many, but it ended up being something like 0.0013 percent of the inputs that they were able to reproduce. As duplication in the inputs is reduced and these models get bigger, that number will go down. Meanwhile, we have lots of different lawsuits going on, and you're probably interested in what's happening in each of them. The top four here have all been filed by the Joseph Saveri Law Firm in the United States, a plaintiffs' class action firm, together with Matthew Butterick, who's sort of an independent lawyer. Then there's Getty Images versus Stability AI, which is actually two cases, one in Delaware and one in the UK, both asserting the same thing; I'm going to treat that one separately because it deals with a few different issues. The one that probably most of you heard of first was actually Doe v. GitHub.
That was the Copilot case. What's really interesting about this is that they say it's a copyright case, but, unusually, they did not assert copyright infringement. There is no "you copied our stuff" in that entire lawsuit. Instead there is "you have removed our copyright information," and "you have read in our stuff, and therefore all code that you create is necessarily a derivative work." They play very loose between the legal concept of a derivative work, which means that specific expression from one work has been copied into another, and this broader sense of "everything is derived from these inputs, therefore everything is a derivative work." On that one, the defendants have filed a motion to dismiss, in part because, they say: look, you're trying to make a copyright infringement lawsuit, but you've not actually argued copyright infringement. You've argued all this other stuff: you're competing against us in the market, it's unfair competition; you're using our names; you're removing our copyright and license information. But you're never actually saying "you infringed our copyright," and you just can't do that; you can't make a fake copyright infringement claim and dress it up without actually alleging infringement. The problem, most likely, is that the plaintiffs can't find something that is infringing; they can't get a particular infringing output for their particular inputs. So that motion to dismiss is out right now. The next one is Andersen v. Stability AI. This one is about Stable Diffusion, the image generator. Here, again, you've got poor analogies coming to the fore: they talk about Stable Diffusion being essentially a "21st-century collage tool."
What they're saying is: you're breaking everything up into pixels and then creating a collage of all those pixels, and therefore everything is derivative of all the inputs. That one has also had a motion to dismiss, and it sounds like probably almost everything is going to be dismissed, at least preliminarily, for a couple of reasons. Number one, two of the three named plaintiffs didn't actually have copyrights they could assert. They simply said "you've infringed my copyright," but one of the requirements is that you've registered the copyright, and they hadn't. The other plaintiff, amusingly enough, doesn't have her work in the most recent version of Stable Diffusion, because it didn't pass the filter; they had a filter for whether an image was aesthetically pleasing enough, and hers didn't pass, so the defendants have been able to show it's not in there. The other thing is that this broader argument, that everything the model produces infringes every single work in the training set, is a real stretch. The final two cases are essentially about GPT-4 and LLaMA, and the training of those. There are a number of authors involved, most notably Sarah Silverman, who is the lead plaintiff on both; she's an author and comic that you may have heard of. The way they're arguing these is interesting. They are arguing copyright infringement directly, but instead of saying "look, here's our work, here's the output we got the model to create, and here's the copied material," what they're doing is saying: "please give me a summary of Sarah Silverman's work." Because the model can create a summary, they're saying the work must be in there somewhere, we just don't know how to get it out. But of course it can create a summary; remember how I talked about the critics in the New York Review of Books? Creating a summary isn't actually one of the
things that copyright protects. So the long and short is: all four of these are not good lawyering. In fact, if you really want to protect artists and authors, and you want them to be paid for training, you should hope these lawyers get kicked off the case really fast, because they're creating bad law, or they're about to. The most interesting case is actually the Getty Images case. That one is also about Stability, and it is about copyright infringement. In that case, again, they have not been able to find a specific copied image between the two; they've been able to find images that are very reminiscent. But the really interesting thing is that the model did learn that you should have the little Getty Images watermark on things, so it's creating these terrible-looking photos with a bad version of the Getty Images watermark on them. So the strongest argument Getty Images actually has is that Stability is infringing their trademark by creating bad images, sort of like Getty images, that carry their trademark. That argument may win, but notice: that's not a copyright argument. Another interesting thing is copyrightability. The UK actually says AI-generated works are copyrightable; they did that a few years ago. The US Copyright Office is saying the opposite, and this came up in the Zarya of the Dawn case. This was a case involving Kris Kashtanova. Kris had created a comic book: Kris wrote the various parts of it and created all the images with Midjourney, filed for a copyright, received the copyright, and then said, "guess what, I got a copyright on my book that was partially created with Midjourney." The Copyright Office said: wait, wait, hold on there. You need to explain what you did. So my friend Max and I helped Kris write a response, and we talked about all the ways in which these images were
generated. The Office came back and said: you know what, you don't have enough control over what's coming out; it's too random. We're going to say that unless there is substantial human authorship applied after whatever comes out of the AI generator, it's not copyrightable. Kris decided not to appeal that, but we're actually pursuing the question in a different case. Right now, though, the answer is that anything that comes out of one of these AI generators is not copyrightable. So if you are using Copilot, by the way: anything that comes out of Copilot is in the public domain, at least right now. I do believe, though, that the Copyright Office is essentially speedrunning through its own history with photography. It used to be that photographs were not copyrightable. Then they said, well, a photo is copyrightable if you do enough around pushing the button: if you select the lighting and the costumes and the composition, we'll say that's copyrightable. Then, about 20 years later, they said: you know what, any time you take a picture, it has something of the author in it; photos are copyrightable. And that's been the law since the early 1900s. Right now, for AI, they've gone from "not copyrightable" to "copyrightable if you add enough to it," and I believe they're going to arrive at essentially the same place they did with photos: saying, you know what, by default it's going to be copyrightable, because a human was involved. But we're not there yet. However, about two weeks ago, they did ask for comments in the Federal Register.
They're collecting them until October 19th, and they asked about this exact issue. One final thing, because I think I've got about two minutes. Big question: who is responsible? Pointing this specifically at this group: most of you are probably most interested in who is responsible for the outputs of Copilot. Most of these various generators say: whatever you create, that's your responsibility. However, Microsoft has decided to stand behind Copilot with an indemnity clause. The interesting thing about this clause is that it doesn't apply if the code is based on, but differs from, a suggestion provided by Copilot. So unless you plug in exactly what Copilot says, verbatim, the indemnity doesn't apply, which means it is unlikely to apply in most realistic circumstances. But you'd have to see how they handle it in practice. All right. I don't have a lot of time to talk about this, but two quick notes on trade secrets. A lot of the primary vendors for AI, particularly OpenAI, have a one-way confidentiality clause in their default terms, which means that their stuff is confidential and yours is not. The other thing is that it's very hard to keep a trade secret about almost anything in AI, because, it turns out, especially for the weights, almost no IP really applies. That is about all. I think I've got negative one minute for questions, so I'm going to end here. I'll be right outside, and I'm happy to talk to anyone. Thank you for coming today.