Let me introduce myself. My name is Van Lindberg. I'm an IP lawyer, and I've been doing open source for a couple of decades now. I also run a company called OspoCo, an open source program office as a service, where we essentially act as an adjunct to a lot of internal open source program offices. The genesis of this talk came from discussions in the broader open source community as well as specific questions from various clients. Especially in the past two months, this has just exploded, really ever since ChatGPT came out and got everybody excited about this sort of thing.

When this talk was proposed, it was about GitHub Copilot, but we're going to expand it a little and talk about generative tools generally, and specifically generative code tools. There are other types of generative AI, and I think they're fascinating; I talked about them yesterday, but they're not the focus of what we're going to talk about today. Instead, this is the advice that I have been giving my clients, and what I think needs to happen from an open source program office perspective: what should we be advising our internal clients?

Now, why do OSPOs get asked about AI? It turns out AI is actually a really good fit for the OSPO, because it has that same sort of mix: it involves open source licenses, it's a mixed technical and legal question, it is very developer-centric, and you've already got a group that deals with licenses that are new and, to regular lawyers, strange. So a lot of the time OSPOs are being asked: what should we be doing about this?

Before I get too far, I have to address the elephant in the room, and that is the Copilot litigation. Let me first say that there are two separate filings. I am personally not impressed at all with the quality of these filings, and I do not think they will be successful on their own merits, aside from the broader question that they really want to raise and that we're going to talk about today. So if these are dismissed, you should take that as a weakness of these particular cases, as opposed to something broader.

When Copilot came out, a lot of people were worried about what it means for open source and free software. The FSF put out a call for papers, and a number of people responded. The Software Freedom Conservancy put out some positions; the OSI published various things. The issues people were worried about were, number one: is this a derivative work? When you create this big Copilot model, is that a derivative work of our code? And number two: what about the code that comes out, particularly when it reproduces one of its inputs? If it inadvertently copies in GPL- or AGPL- or whatever-licensed code and you don't attribute, what is the liability associated with that?

But a lot of the arguments that were made were actually more emotive than legal. People were acting, in some ways, like very classical IP holders: I own this, I put this out in a particular way because I wanted it that way, and now it is being used in a way that I didn't foresee and am not sure I agree with.
And while there wasn't necessarily the call for "and they should pay me," which is the traditional way IP holders finish that sentence, there was "and they should adhere to my license," which is the FOSS equivalent. The problem is that a lot of these emotive arguments weren't made with good legal backing. And to be fair, a lot of this is unknown and up in the air. I've done an analysis, which I'll mention in a second, where I conclude it was very likely fair use to train Copilot and CodeWhisperer and all these various tools on all this code, regardless of the license.

But one of the things I wanted to talk about, and that we're going to come back to, is the discussion about what this is going to do to our communities. Is it going to make it so that people don't learn how to create software? Is it going to splinter our communities? Is there going to be a lot of low-quality stuff? Are we going to ruin the wellspring of creativity that allowed them to make these large language models in the first place? That, I think, is an interesting policy argument; it is still emotive, but a bit more logical.

Let's start with: was it even okay to create these models? I'm just going to spend a couple of minutes on this; like I said, I talked about it at length yesterday. There are a lot of people in the open source community, more properly in the free software community, who are saying: we believe it was not correct that they were able to create these. The Software Freedom Conservancy said: "GitHub has meanwhile artfully avoided the question of whether the trained model is a work based on the input. We contend that it probably is. However, given that fair use is an affirmative defense to copyright infringement, they are obviously anticipating a claim that the trained model is, in fact, a work based on the inputs to the model. Why else would they even seem to be bringing up fair use rather than simply say that their use is fully not infringing? Anyway, we have no way to even explore these questions authoritatively without examining the model fully fixed in its tangible medium. We don't expect GitHub to produce that unless compelled by a third party."

The issue is that what is done in the model ends up being very transformative. Yes, they're talking about fair use, which is technically an affirmative defense. But there are certain precedents so well established that raising them can get you out of a lawsuit early and effectively cut things off; in those circumstances fair use acts less like an affirmative defense and more like a threshold question. And a lot of the idea of transformativeness comes down to: are two things substitutes for each other in the marketplace? The fact is that the Copilot model itself is not a substitute for the code. Anyway, I don't want to go too deep into that. If you want the long version, there's a paper with footnotes and everything; it goes through the legality of creating and using the model.

But I want to start talking about what happens when our people want to start using Copilot to add to our code. Say I'm a project maintainer.
Someone says: I want to use GitHub Copilot to start adding things to your project. What are the concerns, and what are the things we really need to be thinking about? They usually come down to three things: infringement, security, and protectability. We're going to take each of these in turn.

With regard to infringement, this is the big one. What we're talking about here is memorization, and specifically the generation of possibly infringing material. This is a known issue, especially for code generation as opposed to other types of generative AI. Code generation is more likely to memorize because it tends to center on a smaller number of known patterns for solving a particular technical problem. In other words, code is more functional, and that functionality drives greater similarity, which makes it more likely that the generated code looks very similar to existing code. The other thing is that we as a community have created an ecosystem with lots and lots of code reuse. That's fantastic, but it means you get duplicates of well-known code in the training set, and the model learns: oh, if you want to write an SSL function, this is how it's done. Then it spits out part of OpenSSL.

Now, while Copilot has started developing the ability to detect certain types of regurgitated inputs and alert you, something CodeWhisperer has had since day one, which was good for them, I'm not sure that capability is in wide release yet, at least in a form that specifically helps you attribute the code; I think they're just trying to block it right now. What's more, Microsoft with Copilot, OpenAI with ChatGPT, Google with Bard, whomever you're talking about, specifically do not take any liability whatsoever for infringement in the output. This means that if you do get inadvertent copying into your code base, you can and likely would be liable for copyright infringement. The fact that the copying went through a model doesn't excuse the copying. You're going to need to follow the license, and sometimes those licenses will be incompatible with yours. In some ways, it doesn't even matter that it's inadvertent, because copyright infringement is not something where you need what's called mens rea, the intent to copy. You can infringe unconsciously. And what they would probably say is: look, there's a transitive chain. Copilot had access to this code, they gave you that access, you asked for a certain solution, and what came out was copied. Through that chain, you essentially copied it.

So what do you need to do? First: how many of y'all are familiar with snippet scanning as a tool for license detection? Excellent, not all of you. Basically, one of the primary tools for managing open source compliance has always been scanning tools. The original ones, like Black Duck Protex, which goes way back, did snippet scans: among other things, the tool would do a rolling fuzzy match of windows of lines of code against all the code it knew about. So it would be able to say, look, these 15 lines look like they were cut and pasted from Stack Overflow or from this other project. A lot of the concern, especially early on, was about copying and pasting, as opposed to the linking issues that came up later.
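To make the mechanics concrete, here is a minimal sketch of that rolling-match idea. It is not how Protex or any particular scanner actually works; the window size, the normalization, and the exact-hash lookup are my assumptions for illustration, where real scanners use fuzzier, token-level fingerprints.

```python
import hashlib

WINDOW = 8  # lines per window; real tools tune this (assumption)

def normalize(line: str) -> str:
    # Collapse whitespace so reformatting doesn't defeat the match.
    return " ".join(line.split())

def fingerprints(source: str, window: int = WINDOW):
    """Yield (start_index, hash) for every window of consecutive lines."""
    lines = [normalize(l) for l in source.splitlines() if l.strip()]
    for i in range(len(lines) - window + 1):
        chunk = "\n".join(lines[i:i + window])
        yield i, hashlib.sha256(chunk.encode()).hexdigest()

def build_index(known_sources: dict[str, str]) -> dict[str, str]:
    """Map every window hash in the known corpus to the work it came from."""
    index = {}
    for name, src in known_sources.items():
        for _, h in fingerprints(src):
            index[h] = name
    return index

def scan(candidate: str, index: dict[str, str]) -> list[tuple[int, str]]:
    """Report windows of candidate code that match known code."""
    return [(i, index[h]) for i, h in fingerprints(candidate) if h in index]
```

The shape is the point: slide a window over the code and look each window up against everything you know. A production scanner fuzzes the match, tokenizing away identifier names and winnowing the hashes, rather than requiring byte-identical windows.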
Snippet scanning fell out of favor because it is slow and because it produces a lot of false positives. What most modern scanners do instead is query the package manager: what packages are you using? They assume you're not actually copying and pasting anything, because everything is in libraries anyway, and why would you copy in a library when you could just use it? But the only way you're really going to have certainty now is to get back to snippet scanning and understand what is inadvertently coming into your code. This is going to be a hard transition. It's also probably going to be one of the best things for open source compliance ever, because all of a sudden you've got the AI people, who are really excited about using these tools, wanting to put in snippet scanning so that they can keep using them. They're not nearly as excited about open source compliance on its own.

So, takeaway number one: snippet scanning. I always recommend it. Takeaway number two: there are ways in which you can use Copilot or similar tools that drastically reduce your chance of copyright infringement, and it's something we've known forever: test-driven development. What happens with infringement is that you start writing a function, it matches something the model knew, the model spits out the rest of the function, and a lot of those completions are very, very similar to existing code. Test-driven development starts the other way around. It says: I am going to define the necessary outcomes for the functions I create, and then I am going to write software that passes my tests. You can actually use Copilot or ChatGPT to help you generate tests for the thing you haven't written yet, because your tests and your boundary conditions are highly specific to your particular situation. It is much less likely that someone else has had the same weird interaction of all your libraries and your particular code structure anywhere else. So you generate the tests, and then you go back and say: now help me generate functions that pass these tests.

Doing this sort of test-driven development has two benefits. First, like I said, it corners you into a place where copying is unlikely. Second, even if there is copying, you have put the output through a functional filter: tests don't care about the implementation, only about whether it works. That functional filter, in some ways, lets you do a semi-clean-room style of re-implementation, even if the result is similar. With regard to infringement, those are the two takeaways, and those are the things a lot of people are worried about.
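Before moving on, here is a minimal sketch of that test-first loop, assuming pytest. The function, its name, and the boundary conditions are hypothetical, purely to show the shape of the workflow.

```python
import pytest

# Step 1: write the tests first. The boundary conditions encode *your*
# requirements, which makes a memorized, verbatim completion unlikely.

def test_no_discount_below_threshold():
    assert apply_volume_discount(price=10.0, quantity=5) == 50.0

def test_ten_percent_discount_at_threshold():
    assert apply_volume_discount(price=10.0, quantity=100) == 900.0

def test_rejects_negative_quantity():
    with pytest.raises(ValueError):
        apply_volume_discount(price=10.0, quantity=-1)

# Step 2: only then ask the tool to "write apply_volume_discount so these
# tests pass." Something like the following might come back; the tests,
# not the implementation, are what you actually specified.

def apply_volume_discount(price: float, quantity: int) -> float:
    """Total price, with a 10% discount at or above 100 units."""
    if quantity < 0:
        raise ValueError("quantity must be non-negative")
    total = price * quantity
    return total * 0.9 if quantity >= 100 else total
```

The tests are the functional filter the talk describes: they pin down behavior without saying anything about how the body is written.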
The second concern is security, and I'm going to break it into two pieces: the security of the code you write, and trade secret security.

The security of the code you write is basically the same issue as copyright infringement. The model has learned bad habits, or it has learned things that are known to be vulnerable, and it has no external way of deciding whether one thing is good or bad. If you ask for a particular solution, and the solution that is widely out there has a built-in vulnerability, or the model's memorization hands you something vulnerable, well, then you've got it. You can't trust the code that comes out of an LLM to be good code, and most of the time it's not. These tools are better as a teacher, as an interactive way of querying what you're doing. There are tools that, like snippet scanning, look for vulnerable structures within your code, as opposed to just vulnerable libraries, and help highlight those. It's a specialized kind of linter, same sort of idea. That may be something you want to invest in if you're going to be heavily using Copilot or similar tools to write your code.
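As an illustration of what "looking for vulnerable structures" means, here is a minimal sketch, not any particular commercial tool; the rule set is a toy of my own choosing, and real tools do much richer analysis (taint tracking, data flow, and so on).

```python
import ast

# Toy rules: flag calls that are structurally risky regardless of which
# library version they come from. A real tool has hundreds of rules.
RISKY_NAMES = {"eval", "exec"}
RISKY_ATTRS = {("pickle", "loads"), ("yaml", "load"), ("os", "system")}

def audit(source: str) -> list[str]:
    """Flag structurally suspicious calls in a piece of Python source."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        fn = node.func
        if isinstance(fn, ast.Name) and fn.id in RISKY_NAMES:
            findings.append(f"line {node.lineno}: call to {fn.id}()")
        elif (isinstance(fn, ast.Attribute)
              and isinstance(fn.value, ast.Name)
              and (fn.value.id, fn.attr) in RISKY_ATTRS):
            findings.append(f"line {node.lineno}: {fn.value.id}.{fn.attr}()")
    return findings

# Usage: run it over whatever the assistant produced before merging.
print(audit("import pickle\ndata = pickle.loads(blob)"))
# -> ['line 2: pickle.loads()']
```

The structural point matters because the vulnerable pattern arrives pasted into your code, where a package-manager-based scanner will never see it.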
Then what about trade secrets? In a lot of cases, this is really about what happens when I write my code and ship it off to the service. What does the service do with it? Do they keep it? Am I violating any of my agreements by using this? Are they going to share it with other people? The broad takeaway is: if you're going to use one of these things, get a paid plan, because the paid plans and the APIs are architected to maintain confidentiality, and at that point it's really just a cloud service. If you trust Dropbox or AWS to host your code, you're already sending your code to a third party, but it's within the realm of your care, custody, and control, and you can say: no, we're not sharing that. If, however, you rely on the purely public ChatGPT, the terms of service say they can keep all the inputs you provide and use them for further training, and they will. There have been at least a couple of incidents, most recently at Samsung, where very specific information that employees put in got used for subsequent training and then came out, leaking the trade secret to someone else in a later session. So the takeaway on that kind of security: make sure you use a paid plan that is architected so your data stays your own.

The last issue is the protectability of outputs. In some ways this is the least of the issues: even if you're using it for internal code, non-copyrightable code can still be a trade secret, and it can still be useful. Right now the law is that AI-generated code is not copyrightable, full stop. I believe that is wrong and won't stand, but it is currently the law. The only things that are copyrightable, and therefore licensable under an open source license, are what humans have subsequently done with the AI output to add further expression. There are test cases going on trying to move that line: what about the initial interaction, the back and forth? Is that sufficient creativity? But right now that's the line from the Copyright Office. So if you were ever challenged, at least right now, there would be a question as to whether copyright applied to any of this code, no matter how it looked.

This is coming up in the FOSS community with articles like "If GitHub Is My Copilot, Who Wrote My Software?" and this idea that if we have a bunch of unattributed code out there, who owns it? Whose was it? It's just floating out there. I would submit that if something like this is going to be put out in public, and it becomes copyrightable and licensable, then you can put an open source license on it. And if it is not, is it really a bad thing that there is a lot of completely unencumbered code out there? I don't think so.

Now, these are the current issues, and I think a lot of them are driven by the current state of language models and the current state of training. Honestly, the next time they do a big revision of Codex or the various code-focused models, they're probably going to be a lot more careful about avoiding duplication, and there will be other improvements to the training process, so the models will be less likely to generate copyright-infringing, or vulnerable, code.

This is where we get back to that policy argument: is this going to hurt our communities? What should OSPOs do about the long term, and is this going to harm or help our communities? Well, there was an interesting internal article that was leaked from Google a week or two ago called "We Have No Moat." In it, one of the Google engineers working on their products said, basically: we thought we would be able to maintain a competitive advantage by having these large language models, because they were hard and big and expensive to build. And then they looked at what happened with LLaMA and with Stable Diffusion, and they said: as soon as the open-source community got their hands on these things, the amount of creativity and capability that blossomed was astounding. Stable Diffusion is now leading far and above DALL-E and DALL-E 2, which came first. On the image side, DALL-E used to be the focus of a lot of scholarly work; all of that has moved over to Stable Diffusion, because Stable Diffusion gives people the ability to actually test and evolve it with new knowledge. The same thing is happening with LLMs. What the article pointed out is: hey, you can do these low-rank adaptations, these cheap stackable fine-tunes, that turn out to be as capable as what we're doing, and the open-source community is just getting started.

So yes, AI is coming for open source. If they haven't already, your internal code bases and your projects are going to see a whole raft of new code that is partially written by Copilot or other AI tools. And of course the rule applies: how many of you have heard the saying that 90% of everything is crap? It's true. And 90% of everything will still be crap when you've got ten times more of it, because it's easier to create with generative AI. But that doesn't mean it doesn't create value, or that the remaining 10% or 1% isn't something we can use, because a lot of the rest will fall away.

I would actually argue that generative AI is going to be the best thing ever for open source, and the reason why comes down to economics. What happens when you make something cheaper? People buy more of it. What generative AI does is make the cost of generating new code cheaper.
It expands the range of people who can participate in the generation of code a hundredfold. This is going to be the best thing ever for diversity and inclusion, because whole rafts of people who before may have thought this was too difficult for them now have a tool and an infinitely patient mentor: someone who will explain the various libraries to them, and who will help them get to a working implementation they can actually submit as a PR for a human to start going through. Yes, a lot of that is not going to be great code. But humans who are just starting out don't create great code either; it's no different. What we are going to have is this: where you had a hundred contributors before, you're going to have a thousand. And maybe only 10% of those become regular contributors or maintainers; still, you've gone up an order of magnitude in the number of people involved in your projects and in the open source community. And that is where open source gets its strength.

So this is the last question. If that is the happy path, and you may disagree with me, but if we can harness generative AI to drive inclusion, project health, and greater possibility, then how do we encourage it and make it happen? I would say it is really about helping people understand and develop processes that can deal with an order of magnitude more code, because that's what we're going to have. Maintainer burnout is already something you and your offices are worried about and working on. Invest in the people, and invest in the processes and automation that will take those things to the next level of capability. That not only has benefits right now; it is also what will enable this huge wave of AI-generated code that is coming for open source.

Now here's where I pause and ask for questions.

For those on the live stream: the question was, if AI is going to be this great democratizer of code, doesn't that put more pressure on the bottlenecks? And he also asked: couldn't AI also be used to alleviate those bottlenecks? I think it absolutely could. I don't have a specific structure or model in my head at this moment, but that seems like a solvable issue, and it would be one of the concrete things your OSPO could do to help sponsor, to bring us to that happy path.

The next comment was: doesn't this kill community metrics? I'd submit they're mostly already dead. And the reason why, and believe me, I love me some CHAOSS, is that while there are still meaningful metrics, they're starting to be gamed a little. What's also happening is that bots and GitHub Actions are an increasing part of every development cycle. Those aren't human, and they're already messing up your metrics. Unless you are filtering for those things, your metrics are already off. Now, if something is being submitted every fifteen seconds, there are ways to filter that out if you want to, or you just look at different metrics.
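As one concrete way to do that filtering, here is a minimal sketch. The bot-name patterns are my own assumptions; a real pipeline would use richer signals (account type, commit cadence) than a name heuristic.

```python
import re
import subprocess
from collections import Counter

# Hypothetical patterns; tune these to the bots your project actually uses.
BOT_PATTERN = re.compile(r"\[bot\]$|^dependabot|^renovate|-ci$", re.IGNORECASE)

def contributor_counts(repo_path: str) -> Counter:
    """Count commits per author name, excluding obvious bot accounts."""
    authors = subprocess.run(
        ["git", "-C", repo_path, "log", "--format=%an"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    return Counter(name for name in authors if not BOT_PATTERN.search(name))

# Usage: compare raw and filtered counts to see how skewed the raw numbers are.
# print(contributor_counts(".").most_common(10))
```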
The next question: in terms of infringement, does limiting the size of what the tool generates solve the problem? The answer is actually no, and the reason why is that people have been most successful at generating long runs of copyrighted material through successive prompting of shorter generations. So it's a great idea, but at least so far it's empirically false.

Then there was someone over here, and then we'll come back to the middle. The question was whether a short snippet can even be copyrightable. Absolutely it can. Some people use an informal guideline of 150 to 300 characters, and there's a famous Supreme Court case about a single sentence, because it was the heart of the work. I think that's unlikely to come up here, but the answer is very lawyerly: it depends.

So the next question: what happens if, as part of a product, we've used generated code that wasn't considered copyrightable when we used it, and the law later changes? Could someone come back and bite us, basically? In that case, perhaps yes. There would be things you could argue, laches and so on, but if something is posted by a human, and there's going to be a human involved somewhere, assume that human is going to be a copyright holder in every effective sense, because my strong belief is that's more or less where the law is going to end up.

A related comment: I'm never going to take a bunch of generated code as-is; I'm always going to modify it, because it's not going to be quite what I wanted. What happens then? The answer is that the things you have added and changed are due to your human input. Those changes are definitely copyrightable, and they're going to be spread throughout the code, so that code is effectively copyrightable. And going back to that last question, I think that is essentially where the law ends up overall. Code and copyright have never been great together anyway, but the more a human went through it, the stronger the possible copyright; assume you're going to have copyrighted code regardless of where it came from.

That's a training question, maybe. Bigger models are actually better and less likely to cause copyright infringement issues, and my strong argument is that the training is going to be fair use anyway.

Then there was another gentleman right over here. Great question. The question was: I said the Copilot litigation was not good; why? The answer is that in order to support a copyright infringement case, you need a specific, identifiable work, one you have registered the copyright in, that you can point to coming out the other side. They don't have that. Instead, they're making a generalized argument: this model must be a derivative work because they used code to train it, and therefore everything it generates is a derivative work, without actually making the argument and tying it through to any particular work at all. They're trying to do a class action. It's not going to succeed.

I think that's our time. Thanks, y'all. Thank you.