Howdy, all. My name is Van Lindberg, and I'm happy to be here today to talk about best practices for AI-assisted code generation. Wow, we've got a lot of people here; this is fantastic. If you go to the schedule, the slides have been uploaded, so you can take a look at them there.

Let's start with AI-assisted code generation itself. How many of you have used some sort of Copilot, CodeWhisperer, one of those tools? Almost everyone here. What do you all think? I think it's actually pretty good. Like all these tools, it is limited in what it can do, but within those limits, it's pretty good. Here's the thing: AI code generation is here to stay. I have the opportunity to work with lots and lots of different OSPOs across many organizations, and every single one of them is being asked about AI-assisted code generation. Most places are rolling it out, and if they're not rolling it out, they're worried that it's coming in from the bottom, because it is already there. Just like open source back in the day, it's coming in from the bottom. So if you are not already planning for and dealing with AI-assisted code generation, you are going to need to deal with it, because otherwise it's going to be in your organization anyway. In fact, if your organization is of any significant size, say bigger than 100 people, it's probably already here and already being used. If you're not dealing with it, you need to.

One of the interesting things about AI-assisted code generation is that right now it's an okay tool, but it's getting better. And even as it is, there have been some studies, predominantly by GitHub, and they're motivated to say that it's good.
But also, based on what I'm hearing, it is a significant productivity helper for a lot of developers. GitHub did a survey to see what benefits it would bring, and yes, it leads to greater increases in code output. But they said the biggest change they observed was improved developer satisfaction: between 60 and 75 percent of users reported that they felt more fulfilled with their job, felt less frustrated when coding, and were more able to focus on satisfying work when using Copilot. The second finding was conserved mental energy: users reported that Copilot helped them stay in the flow (73 percent) and preserved mental effort during repetitive tasks (87 percent). Especially for that last one, you know that some languages have a lot of repetitive boilerplate. Sorry, did I say something? It really makes that a lot easier; seriously, half of Java is already generated by various code-generating tools. It makes that sort of thing, and some refactoring, easier to deal with. AI-assisted code generation is here to stay.

So why in the world are we talking about best practices at all? Because, as with almost every new tool, there are risks. When we've gone through them, we've essentially categorized them into five buckets, and we're going to talk about each one in turn.

The first is copyright infringement. This, by the way, is essentially the classic open-source risk: inadvertent copying and pasting. One of the interesting things about code LLMs is that because code is more restrictive than natural language, you're actually going to find an increased incidence of memorized code generation, i.e., the model spitting out something very similar to an input, because code is just more constrained. There are reasons why things are the way they are.
Either it's to comply with an API, or it's more efficient, et cetera. That makes it more likely that you're going to observe commonalities between inputs and outputs. The second thing is that code, unlike English, is much more likely to involve reuse; that's the whole thing we've been trying to do for a long time. So when these LLMs are trained on code, there are a lot of circumstances where code has been vendored or copied, completely legitimately, but you end up with lots of duplication in the training data. Duplication in your training code means you're much likelier to get a memorized output for a certain input, which again means outputs that match the inputs, which leads, if you're not careful, to copyright infringement. Even though 99% of the time these code generation tools are going to come out with unique code, 1% of the time they're probably going to come out with something that matches one of their inputs.

Now, some of these tools, most notably CodeWhisperer right now, say that if the output matches one of their inputs closely or exactly, they'll give you inline attribution information. That's fantastic. I know Copilot is also working on the same sort of feature. But even then, we've observed in certain circumstances that, even with the tool blocking all similar code, you can get output that is just under the similarity threshold, and over repeated generations you can actually create something that is recognizably infringing. It is not a solved problem, so you need to be aware that even with all the filters turned on, you can still inadvertently create infringing code. When I say infringing code, most of the time that's not a big deal, but it leads to two things.
Number one, if you don't give notice and attribution and copy the license, you will be infringing. It just is so. The second is that code coming from a reciprocally licensed, copyleft source can impact your end licensing, which may not be what you want.

Here's the other thing: almost universally, these code generation tools put the responsibility on you for avoiding copyright infringement. They say: we don't own any of this stuff; if you caused it to generate something, that's on you. Now, it is true that GitHub/Microsoft have started to provide an indemnity clause for companies using Copilot. I have seen this indemnity clause. It says that if someone accuses you of copyright infringement, we will protect you (that's what indemnify means), provided you have turned on all of our avoidance tools, and provided the code does not vary at all from what the tool gave you. Yes, you see exactly the problem. Unless you accept the code exactly, without editing it, the indemnity doesn't apply. Now, if it's a really high-profile case, maybe they would still protect you, but if you're a smaller player, don't expect that this indemnity is going to do you much good. You need to do something else to avoid the risk.

Here's the second issue, and this is particularly for companies that have proprietary code. A lot of these LLMs are not trained on your code; there is no in-the-loop training process. But if you have a very constrained set of circumstances, it could be that you want to train on your code as well, in order to get higher-quality outputs.
But that means you're going to need to show other people your code, and that can lead to difficult questions about trade secrets, loss of data, et cetera. Plus, everybody in the world wants to create a data product and sell you your own data back these days. The thing to watch out for here: in most cases, if you negotiate, you can get an agreement that says anything trained on your code will be done in a separate overlay, a fine-tuning that is exclusive to you and will not be used for anybody else. But beware of the defaults. The default terms for Anthropic and OpenAI, in particular, do not give you any confidentiality whatsoever. Under their default terms, their stuff is confidential; yours is not. If you hand it to them, they don't have to maintain it as a secret at all. So if you've got third-party code or third-party secrets, or it's important intellectual property, and you want to do this sort of training, make sure you've negotiated a confidentiality term.

Here's the next one: security. Why am I talking about security? Because these models were trained on all the open source code their makers could get their hands on. Most of that open source code has CVEs, and if it doesn't have a CVE yet, it will have one at some point in the future. This is not surprising; it's just the nature of code. As a result, the model has learned a bunch of good habits, and it has also learned a bunch of bad habits. Any time it generates, it's going to generate the most likely thing it saw. And if the most likely thing happens to have an embedded security problem, then guess what: it will very gladly generate that embedded security problem for you, and it will tell you it's the most common thing that people do. Great.
So be aware that, again, by the nature of the beast, you're going to encounter this.

Here is an issue that takes a little more explanation. Right now, a lot of people don't realize this. All these tools that let you generate code or generate images say, "we don't own anything that you create." Great. Here's the secret: at least in the United States, neither do you. The reason is that there is a doctrine in the United States that says only humans are eligible for copyright. How many of you remember the monkey selfie? A few people. This was an occasion where Naruto, a monkey, got hold of a camera and managed to take a selfie, and various people said, this is really cool, and tried to get the copyright on behalf of the monkey. The Copyright Office and the courts said: sorry, only humans are eligible for copyright; this photo is in the public domain. There have been a couple of other such situations where people have tried various things, saying this was generated not by a human, and the courts have consistently said: nope, it has to be a human; only humans are eligible for copyright.

Well, last year (I don't know how many of you watch the intersection of generative images and comic books) there was a comic book called Zarya of the Dawn, by an artist named Kris Kashtanova. Kris wrote a comic and then used Midjourney to generate the images for the comic book. Then, because it was being copied, Kris submitted it for copyright registration and received a copyright, and happened to post on social media: look, I got a copyright for a comic book that was partially generated with Midjourney. The Copyright Office freaked out and said: wait, hold on a second, let us know why we should not revoke your copyright. So my friend Max and I got involved; Kris was actually our client.
We wrote back to the Copyright Office and described the process Kris went through in generating these images. It was actually a very involved process. Some of these images had hundreds and hundreds of revisions over time. The prompts were carefully crafted, and Kris used the output from one generation as the input to the next, over and over again. We described all of this, and how the images were cropped and retouched and so on. And we were, to some extent, unsuccessful. The Copyright Office came back and said: we don't believe there was enough control by the artist over the output. Which, by the way, is crap. But as a result, the official guidance is that the only thing eligible for copyright, at least right now, is the edits that a human makes after the thing is generated. The raw output of the generator is, at least in the United States right now, public domain: not copyrightable, usable by anybody, and not protectable in any way, other than by trade secret, I suppose. Only the human edits are protectable.

Just to tie up the story: Kris did end up getting a modified copyright covering the arrangement and compilation of the elements, but the individual images are still open. We're still working on that. I don't think this rule is going to last more than a few years; they are reexamining it, and I'd be happy to talk about why I think it's going to go away over time. But for right now, you need to understand that it matters if your licensing or your business or your project in some way requires the copyright. For example, if you want to GPL something, you need the copyright in order for your GPL license to be effective. If you want to license code out as part of your business, you need the copyright for that to be effective.
So right now, you need to deal with the non-copyrightability of these outputs.

This final one we haven't actually observed yet, but I think it's something important to think about and plan for starting now: the idea that if we get too far away from the code, we stop understanding what it does. This may or may not occur. I remember back when they started allowing kids to have calculators in school, people said everyone was going to forget how to do math. There's a great story by Isaac Asimov called The Feeling of Power, about a society that had forgotten how to do math until one person got hold of a pencil, learned to do arithmetic by hand, and got this feeling of power from doing it himself. We haven't observed people stopping learning math, but some skills have perhaps atrophied a little.

In the context of code, there's a concept that you really need to understand what the machine is doing, the logical interactions. That is part of the programmer's craft, and if you get too far away from it, that can be an issue. That's the whole issue with reading code versus writing it: when you're writing it, you know what's going on; reading it later, you may not. So you want to make sure these tools don't let you get too far away from the code.

Given these issues, we have, in cooperation with a number of different organizations, developed some best practices that we think really address a lot of them. Fair warning: these have evolved, and I expect they will continue to evolve, but this is what we are recommending right now for anyone dealing with AI-assisted code.

Number one is a housekeeping thing: mark any files that have partially AI-generated content with a comment.
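As a concrete illustration, a marking convention might look like the sketch below. The marker text and its placement are assumptions (any string your team standardizes on will do); the point is only that the marker is machine-findable inside each affected file:

```python
# utils/parse_config.py  (hypothetical file)

def parse_timeout(raw: str) -> float:
    """Parse a timeout value in seconds from a config string.

    AI-generated: initial draft produced by a code assistant,
    reviewed and edited by a human before check-in.
    """
    # Accept values like "1.5s" or "2"; strip a trailing unit suffix.
    value = float(raw.strip().rstrip("s"))
    if value < 0:
        raise ValueError("timeout must be non-negative")
    return value
```

Because the marker lives in the file itself, later tooling (scanners, CI checks) can find every AI-touched file with a plain text search.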
Not the project, not the whole repository; do it on a file-by-file level. You can actually instruct the model, as part of your prompt, to output "AI-generated" as part of the docstring or a comment for every function, or whatever granularity works for you. This is going to be useful for a number of things we'll see shortly, but mainly it helps you know exactly where the AI-assisted code is. If you don't know where it is, you have to assume it's essentially everywhere in your code base, which makes things a lot more difficult to handle.

The second best practice, and this one is a little surprising to people, is to use test-driven development. First of all, I assume most of you know what test-driven development is. How many of you have actually, legitimately, done it? Maybe a third. In test-driven development, you figure out the failure conditions, the success conditions, the constraints for your code; you document those as tests; and then you write your code to satisfy those constraints.

Why is this important? Two reasons that go directly to our first couple of issues, copyright infringement and security. Due to the way these LLMs generate code, they are most likely to generate infringing code in two circumstances. The first is when you give the model a very broad request, "do X big thing for me," in which case it will find the most common way of doing that and tend to follow the most well-worn path. The second is when you ask for a very, very niche thing, but a niche that has already been filled by somebody, like "show me the best way to multiply these two matrices in this particular way." If you can keep the request relatively constrained, but not inside someone else's niche, you're much less likely to generate copyright-infringing code.
So how do you do that? You take advantage of the fact that your code has a unique context. You're trying to do something specific, probably unlike what other people have done, and you can use that specificity to drive the generation of your code. You can even use the generative AI to help you write the tests, because a lot of developers hate writing tests; that's one of the reasons they don't do it. Gen AI is great at writing tests, especially if you give it the parameters. Then you write the code, including having the gen AI help you generate it, to match the tests. Because you've already made the request specific to your circumstance, it is much less likely to go down someone else's path, and more likely to generate something specific to you.

Here's the other thing, for security, which we've talked about: if you start to encode some of your security guarantees as part of those tests, guess what? The model is much less likely to generate something that includes an insecure path. That doesn't mean it absolutely won't, but it is much less likely to. This kills two birds with one stone, and it also ends up making your entire code base more robust. It's a best practice for software development generally. So this is something you can use to bootstrap a test-driven development culture and really improve the robustness of your project, while making the generative AI safer and more useful for you at the same time.

The third best practice is scanning, specifically snippet scanning. Who knows what I'm talking about when I say snippet scanning? A few people.
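To make that test-first loop concrete, here is a minimal sketch. The function and its constraints are invented for illustration: the constraints, including a security guarantee, are written down as tests first, and only then is the implementation written (by hand or with an assistant) to satisfy them:

```python
import re

# Step 1: encode the constraints as tests, before any implementation exists.
def test_redact_token():
    # success conditions
    assert redact("key=abc123") == "key=[REDACTED]"
    assert redact("no secrets here") == "no secrets here"
    # security guarantee: the secret value must never survive in the output
    assert "abc123" not in redact("key=abc123")

# Step 2: write (or have the assistant generate) code to satisfy the tests.
def redact(text: str) -> str:
    """Replace the value of any 'key=' style assignment with a placeholder."""
    return re.sub(r"(key=)\S+", r"\1[REDACTED]", text)

test_redact_token()  # passes once the implementation meets the constraints
```

Because the tests pin down your specific behavior up front, the generated implementation is steered toward your context rather than toward the most well-worn public path.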
Back in the early days of open source, the main thing people worried about was: what if someone copies and pastes random GPL code into my code? So you got tools like Black Duck; all of you have probably heard of Black Duck at some point. Black Duck's main claim to fame is snippet scanning: they do a fuzzy match and say, this code looks kind of like something in our database, and then you have to go through and triage the matches. It turns out snippet scanning has two problems. Number one, it's noisy; it generates a lot of false positives. Number two, it's slow. And over time, we observed that for the most part, people weren't copying and pasting. To the extent they still copy and paste, it's from Stack Overflow, not from GitHub. Most people are actually just pulling in libraries and using them as-is. So the focus of compliance moved toward library-oriented review, which works in almost all cases, and almost every scanning tool on the market moved away from snippet scanning, except for three: SCANOSS; Black Duck, which still does it; and what's that third one? FossID. That's the one. Those are basically the three that still do deep snippet scanning. Everything else moved to the library-oriented approach.

And then, lo and behold, here comes generative AI, where the risk moves away from library-oriented concerns back to essentially inadvertent copying and pasting. It's a rewind back to 1998. The problem is that snippet scanning is still noisy and still slow. So, do you remember that first best practice, marking the files that have AI-generated content? This is why. You don't necessarily want to scan your entire code base. You can go through, look for the places that have AI-generated content, and scan only those files.
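A minimal sketch of that targeting step, plus the triage suppression I'll describe next, might look like this. The marker string, the match-record shape, and the annotations file format are all assumptions, not the output of any particular scanner:

```python
import json
from pathlib import Path

MARKER = "AI-generated"  # whatever marker your prompts emit into files

def files_to_scan(root: str) -> list[Path]:
    """Return only the source files carrying the AI marker, so the slow,
    noisy snippet scanner runs on a fraction of the tree."""
    return [p for p in Path(root).rglob("*.py")
            if MARKER in p.read_text(encoding="utf-8", errors="ignore")]

def new_matches(scan_results: list[dict], annotations_file: str) -> list[dict]:
    """Drop scanner matches a human already triaged in a previous run.

    Each match is keyed by "file|match"; annotations_file holds prior
    decisions, e.g. {"src/a.py|gpl-3.0:foo.c": "false positive"}.
    """
    try:
        decided = json.loads(Path(annotations_file).read_text())
    except FileNotFoundError:
        decided = {}  # first run: nothing triaged yet
    return [m for m in scan_results
            if f"{m['file']}|{m['match']}" not in decided]
```

Run the scanner over `files_to_scan(...)`, feed its report through `new_matches(...)`, and only the genuinely new hits reach a human.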
That speeds things up by an order of magnitude and makes them a lot less messy and easier to deal with.

Here's another major change that really helps if you can do it. Because snippet scanning was difficult and slow, people tended to do it right before a release, and it would take a long time to clear. We have been helping our clients do snippet scanning weekly instead. The reason is that snippet scanning tools tend to be stable in their output: if a tool reports a false positive on day one, it will still report that false positive one week later. So if you take your last set of outputs and annotate them with your human decisions ("this is a false positive," "this is a false positive," and so on), you can then post-process the next output, skip all the decisions you've already made, and surface only the new matches. It turns out that if you do this every week, there's maybe one match each week that you really need to investigate, which you can either mark as a false positive or mitigate right there. It makes everything a lot easier, and as a side effect you stay up to date all the time. It's fantastic. It does require that you essentially put your snippet scanning on a cron job, and that you have a tool or a team that lets you manage and handle these annotations. We've got a custom tool that does this; I'm not familiar with any tool on the market that shows you only the diff from your last annotated scan, but it is definitely possible.

The next best practice is the idea that AI does drafts and humans do edits. This goes directly to preserving copyrightability. The current guidance from the U.S.
Copyright Office says that when you submit something for registration that has some portion of AI-generated content, you need to declare that AI was used, and in general you need to describe the way in which the AI was involved. This does not have to be highly specific. You don't need to say "lines 2 through 40 were generated by AI and lines 40 through 800 by a human." Their guidance is actually to keep it very general: something like "a first draft was generated by AI, which was subsequently edited by a human," or "post-processing was done by an AI," that sort of high-level description.

So if you create an expectation, in your open source project or in your business, that any AI-generated code must be reviewed and edited by a human before it is checked in, and that the review and edit are acknowledged, then you've made sure a human has touched essentially every commit before it goes in. As a result, if you need to argue copyrightability to the Copyright Office, you can say "a draft was generated by the AI, which was subsequently edited by a human," and you can back that up with the underlying documentation and the logs from the repository. This lets you maintain copyrightability, because at that point you really won't be able to tease apart the AI-generated parts from the human-edited parts. It's all mixed together, and as a result your entire project is, by default, copyrightable, because human contributions are sprinkled throughout it.
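One lightweight way to create that record is a commit-message trailer enforced by a hook or CI job. This is a sketch under assumptions: the `Human-Reviewed:` trailer name is an invented convention, and the commit dictionaries stand in for whatever your repository tooling provides:

```python
REQUIRED_TRAILER = "Human-Reviewed:"  # invented convention, not a standard

def commit_has_review_record(message: str) -> bool:
    """True if the commit message carries a human-review trailer, e.g.
    'Human-Reviewed: AI draft edited by A. Developer'."""
    return any(line.strip().startswith(REQUIRED_TRAILER)
               for line in message.splitlines())

def check_commits(commits: list[dict]) -> list[str]:
    """Return ids of commits that touch AI-marked files but lack the
    review trailer; a CI job can fail when this list is non-empty."""
    return [c["id"] for c in commits
            if c.get("touches_ai_files")
            and not commit_has_review_record(c["msg"])]
```

The repository history then doubles as the documentation you'd point to when describing the human's role to the Copyright Office.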
Now, like I said, I actually think that at some point this rule is going to reverse, and they're going to say that as long as a human was involved in the generation at all, the output is copyrightable. But for now, and "now" could be as long as five or ten years, if you want to make sure your code is copyrightable, this is the procedure to follow: create a record that humans touched all the code before you check it in. So again, the court has not ruled on this. This is currently the result of a position paper from the U.S. Copyright Office, in part due to the Zarya of the Dawn case that we fought; it was essentially part of their underlying response to generative AI. Until somebody actually goes to court and challenges it (and like I said, I do expect a court would overturn it), this is essentially the rules of the road.

The final one: how do you avoid losing context? This is one of those places where, contrary to fears, I think LLMs in general can actually be a great tool, because one of the things they do really, really well is help you summarize and regain context. They help you take that long email and understand its main points. They take that long video call and tell you the actual things that came out of it. Properly prompted, they can also take a file of code, or several files of code, and describe the data flow within it. They can help you have better and more complete documentation, and because it is to a large extent the computer creating that documentation, you can run the process again and again through an automated pipeline, so the documentation is likely to stay up to date.
Now, here's the warning: at least for now, these tools also make stuff up sometimes. They hallucinate, or they may misunderstand. So you need somebody with actual knowledge to go through, edit, and make sure the documentation is correct. But by leveraging these tools to drive a culture of documentation, you not only improve your code and its accessibility; the very process of reviewing and correcting the LLM's mistakes helps maintain that understanding in your developers' minds, which helps them understand the code and come back to it more easily in the future. This is a two-edged sword: yes, if you lean on it too much, it could result in the loss of context and knowledge later. But used in the right way, it can actually be a powerful tool for maintaining and improving the context and knowledge around your code.
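As one sketch of what that automated pipeline could look like, here is a staleness check that flags source files whose generated summary is missing or older than the code. Everything here is an assumption for illustration: the directory layout, the one-summary-per-file convention, and the `regenerate_summary` placeholder where you'd call whatever LLM you use, with a human reviewing the result:

```python
from pathlib import Path

def stale_docs(src_dir: str, docs_dir: str) -> list[Path]:
    """Return source files whose generated summary is missing or older
    than the code, i.e. candidates for regeneration and human review."""
    stale = []
    for src in Path(src_dir).rglob("*.py"):
        doc = Path(docs_dir) / (src.stem + ".md")
        if not doc.exists() or doc.stat().st_mtime < src.stat().st_mtime:
            stale.append(src)
    return stale

def regenerate_summary(src: Path) -> str:
    """Placeholder: call your LLM of choice here, then have somebody
    with actual knowledge review and correct the output."""
    raise NotImplementedError
```

Run on a schedule, this keeps the documentation pass cheap: only the files that actually changed come back for regeneration and human review.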
So here's the summary of these five practices. I'll throw in one more thought, because I was having a discussion with a few people about the impact of code generation on open source projects, and a lot of people were worried: is this going to ruin our communities? I'll tell you, in my opinion, code generation, and AI in general, is going to be the greatest tool we have ever had for increasing inclusion and diversity in our communities. Why? Because it lowers the bar for participation. It makes it easier for a broader number of people to start understanding, to start participating. If we use these tools correctly, this could be a revolution in the number of people who engage with code, who engage with open source, who engage with all of the things and the communities that we love and work in. This is the promise of generative AI in general.

Now, we tend to be techno-optimists here, so I think a lot of you will agree, but there are some people who are really, really worried about this. Matthew Butterick, one of the lead attorneys for the lawsuit against GitHub and Copilot, is involved in open source, and part of his concern in filing that lawsuit was that this is going to hollow out our communities over time. But like I said with regard to using it to improve documentation, it can also be used to improve accessibility, to reach out to new people, to empower new people to participate that we never thought of before. If you think of it as that sort of tool, one that opens new doors and new capabilities, then I think this is simply the start of a huge and wonderful new time for open source and for OSPOs in general.

Now, we have four minutes; let's open it up for questions. Yes: you said you'd be happy to... yeah. In case anyone didn't hear, the question is: why do I think that the copyright
rules about the copyrightability of AI-generated, AI-assisted code will change. That goes to the fact that we are essentially speed-running the history of the copyrightability of photographs. Photographs, when they first came out, were not copyrightable either, for a lot of the same reasons people now raise about AI code generation: it's not art; the camera is the one that recorded all this; it's a recording by the machine. There were artists who said, well, this is useful for science, but it's not art; an artist may use this to take a reference picture, but it's not art until the human actually does something with it. Congress actually added photographs to the Copyright Act around 1870, even though daguerreotypes had been around since 1840 or so. But even then, a lot of people did not respect the copyright in photographs, until there was a court case: Sarony. In this case, someone copied a photograph that another person had taken; it went to the courts and ultimately up to the Supreme Court, and the Supreme Court said: you know what, we're going to say that this one is copyrighted. It was actually a photograph of Oscar Wilde, the author. And they said the reason this photograph is copyrighted is all the things the photographer did around the photo: he posed the subject, he organized the lighting, he had the costumes, he put things in the background. Everything but the photograph itself was artistic, and that was enough artistic input, apart from the camera, to grant the copyright. But they left the door open to say that some photographs were not copyrightable.

Fast forward about 20 years, and there was another case about other photographs. These photographs were not really artistic; I think they were photographs of, like,
circus signs or something like that; I don't know, something very prosaic. Again they were copied, and it went up to the Supreme Court. This is the Donaldson case (Bleistein v. Donaldson), and in it the Court said: you know what, any photograph bears at least some mark of the person who created it, some trace of that person's creative expression, and as a result we are going to treat all photographs as copyrightable by default. In some cases you can overcome that presumption, you can show that a particular photograph is not copyrightable, but since 1903, since Donaldson, the Copyright Office has just said: okay, it's a photograph, we're going to assume it's copyrightable. That has been the rule for 120 years.

Right now, we are essentially right after Sarony. The Copyright Office has decided that output is copyrightable only if a human adds enough after the machine is done. I think we're going to get to the place where they say: actually, a human is always required to drive the machine; there's always a human providing the prompt, always a human editing and guiding; that human is going to leave at least some creative mark, so by default we're going to say it is copyrightable. And I'm presuming they're going to say: we're not going to try to judge what came from the human versus what came from the machine. Because if you think about it, I can take out my phone, go click, and I've just created a copyrighted photograph. I put literally no thought into that, but it is totally copyrighted. And the reasons courts have given for finding a photo as simple as that to be copyrighted border on the ridiculous: because of the brand of camera that was used, because of the type of film, because of the time of day that was chosen, because of the location the person was in. There was even a great case where someone was taking pictures of other art and there was a flaw in the film, and they
said: well, the artist didn't intend this, but because the artist chose to adopt this flaw as part of the art, we're going to recognize it as some sort of artistic expression. There are just as many comparable decisions that go into every type of AI generation: which LLM do you use, which tool do you use, how do you craft your prompt, how do you decide what to keep. There's no reason why a picture that is captured with a camera, and something that is described in words and captured that way, should be treated differently in copyrightability. Sorry, that was probably a longer answer than you wanted. Okay, we are out of time, so thank you all, and have a great day.