Hi, everybody. I'm Joanna Lee. I'm the Vice President of Strategic Programs and Legal at the Linux Foundation and the Cloud Native Computing Foundation. I'm a lawyer by background, and I've been working in technology, open source, and open standards for quite some time. My happy place is very much at the intersection of business and economics, policy and law, and technology.

Today we'll talk about some of the legal risks and compliance challenges that generative AI presents when it's used for developing software, in particular open source software. Then I'll go through some practical guidance, both for open source developers and projects, and I'll touch briefly on what compliance looks like within a company or organization. In the audience here we have Van Lindberg, who's really the expert on that, so if you have further questions, feel free to find either of us in the hallways afterward.

How many people in the room are lawyers or have a legal or compliance background? Okay, great. But certainly not everybody, so since we're not all lawyers, I'm going to go over a few copyright basics, particularly in the context of AI.

Copyright is a form of intellectual property protection. It gives the owner of the copyright certain exclusive rights in an authored work, which could be a literary or dramatic work, a poem, software code, images, artwork, and so on. Those are exclusive rights to distribute, reproduce, and do certain other things with that copyrighted work. For a work to be copyrightable, it has to have been independently created by a human being; that means not a monkey, and not an AI tool. It also has to have some minimal degree of creativity, and what meets that minimum spark of creativity is really going to vary depending on the facts and on what court and jurisdiction you're in. Copyright does not protect facts and ideas.
It will protect the unique expression of an idea, but something like a mathematical formula, statistics, or just raw facts is not copyrightable. If, however, you turn a mathematical formula into some kind of poetic song, that unique expression could then be subject to copyright.

So, as we've discussed, only works created by a human are eligible for copyright protection, at least under current law in most jurisdictions, including the US and the EU. Now, what about a work that is created in part by a machine and in part by a human? Let's say you use an AI tool: you put a lot of creativity and thinking into the prompts you use, you take the code generated by that AI tool, and you edit it for suitability, reliability, and so on. The portions that you have contributed to the end product are eligible for copyright protection. At least, that's the current law, which could change.

Even though AI on its own can't be the author of a copyrighted work, AI tools can infringe pre-existing copyrighted works of third parties. AI tools obviously train on lots of data, and if they're designed to generate code, they're going to train on a lot of pre-existing code, which may be subject to third-party copyrights. If the AI tool reproduces any of that training data in its output, and there aren't adequate permissions from the copyright holders, or the license that applies to that work isn't being complied with, now we have a legal issue: either copyright infringement or a licensing compliance problem. And there is pending litigation, the Copilot litigation and Getty Images v. Stability AI, that relates to this risk of what's often accidental copyright infringement.

There's also some uncertainty about how copyright law applies at the training stage. On the last slide we talked really about inference and the output, when the output reproduces training data. There are separate questions about whether training itself is an act that requires a license if you're training on copyrighted works, and whether doing it without a license is infringement. There are legal doctrines in various jurisdictions: in the US there's the doctrine of fair use, and in the EU there's a text and data mining exception. But there's no absolute certainty yet about whether training an AI on copyrighted works is fair use, whether it qualifies for an exception, or whether it's even considered reproduction in the first place. The best practices we're going to talk about later don't try to solve this issue, because it's still quite ambiguous; we're going to focus on the licensing and copyright issues at the inference and output stage.

We've already talked a little about the copyright concerns. There's also a license compatibility issue. Let's say you're using an AI model that's trained on pre-existing code that is subject to a variety of licenses. They could even all be open source licenses, but some of that code might be licensed under the GPL, which is a copyleft license, and some of it might be licensed under Apache or BSD, which are permissive licenses. So if you're taking code from the AI output and contributing it to an open source project, you have to make sure that the license that applies to that output is compatible with the license of the project you are contributing to.

There are some other challenges regarding licensing compliance. Currently, most AI tools don't actually let you know when the output is similar to data the tool was trained on. And if you don't know what pre-existing code has been reproduced, and you don't know what license terms apply to it, how are you going to comply with those license terms?
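To make that detection problem concrete: one family of techniques for catching reproduced code after the fact is the snippet comparison scanning that compliance teams already use, which I'll come back to later. Here is a deliberately tiny sketch in Python of the basic idea, hashing normalized windows of code and looking them up in an index of known open source snippets. This is a toy illustration only; the window size, the normalization, the function names, and the license IDs are all invented for the example, and real scanners are far more sophisticated.

```python
# Toy sketch of snippet-comparison scanning. All names and parameters
# here are hypothetical; real tools use much more robust matching.
import hashlib

WINDOW = 3  # lines per snippet window; real scanners tune this carefully


def normalize(line: str) -> str:
    """Crude normalization: drop all whitespace so reformatting
    alone doesn't hide a match."""
    return "".join(line.split())


def windows(code: str):
    """Yield every WINDOW-line run of non-empty, normalized lines."""
    lines = [normalize(l) for l in code.splitlines() if normalize(l)]
    for i in range(len(lines) - WINDOW + 1):
        yield "\n".join(lines[i : i + WINDOW])


def fingerprint(snippet: str) -> str:
    return hashlib.sha256(snippet.encode()).hexdigest()


def build_index(known_snippets):
    """known_snippets maps a license ID to a list of known code bodies;
    returns {fingerprint: license ID}."""
    index = {}
    for license_id, sources in known_snippets.items():
        for code in sources:
            for w in windows(code):
                index[fingerprint(w)] = license_id
    return index


def scan(ai_output: str, index):
    """Return the set of licenses whose known code the AI output
    appears to reproduce verbatim (modulo whitespace)."""
    return {index[fingerprint(w)] for w in windows(ai_output) if fingerprint(w) in index}
```

If `scan` comes back non-empty, you at least know which license terms you would need to investigate before contributing the output anywhere; an empty result, of course, proves much less, which is exactly the limitation of this whole approach.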
There are even some open source licenses, including the GPL, that have certain disclosure requirements around making source code available. And obviously, you can't comply with the terms of a license if you don't know what the license terms are. So that's one challenge with many AI tools today. We'll talk a little later about how AI tools are evolving to enable compliance.

This also presents a challenge when generating SBOMs. (Sorry, we're having some microphone issues here.) If you don't know the origin of the code being produced by an AI tool, how can you generate a complete and accurate SBOM? And as you all know, this is not just a legal issue; it's a software supply chain and security vulnerability tracking issue as well.

There are some other challenges that are unique to open source software. Here's an example of terms and conditions of an AI tool that aren't actually consistent with the Open Source Definition. OpenAI's terms and conditions don't just apply to your use of OpenAI's tools; some provisions also apply to your use of the output, the output you get if you use ChatGPT or another OpenAI tool. I'm just going to read a couple of examples. "You may not use the services to develop foundation models or other large-scale models that compete with OpenAI": so, a restriction on developing competitive models using ChatGPT output. Another restriction: public content created in part using OpenAI may not be related to political campaigns, adult content, spam, hateful content, content that incites violence, or other uses that may cause social harm. We're seeing these ethics and social harm restriction clauses in lots of licenses for AI tools now, and while these are certainly legitimate provisions to put in a contract, they are not consistent with the Open Source Definition.

So if you take content generated using an OpenAI tool and you contribute it to an open source project, there's already an incompatibility between your contractual obligations to OpenAI and the license terms of the project, which are all subject to the Open Source Definition. The Open Source Definition requires that, in order for something to qualify as open source, there can't be restrictions on the field of use, on using the program in a specific field of endeavor, and there can't be restrictions on who can use it. So a restriction on developing competing products, or a restriction on doing things that would be socially harmful, even if it's laudable from an ethics perspective, doesn't meet the definition of open source.

Also, if you take AI-generated content and contribute it to a project that uses the Developer Certificate of Origin, there's a question about whether that's consistent with the DCO. The DCO requires that when you make a contribution, you certify that one of its provisions is true. Paragraph (a) says that the contribution was created in whole or in part by you, and that you have the right to submit it under the open source license indicated in the file. If the contribution was created wholly by an AI tool, it wasn't created in whole or in part by you; but if you edit it and make your own human contributions to it, then it was created at least in part by you. Paragraph (b) says the contribution is based upon previous work that, to the best of my knowledge, is covered under an appropriate open source license, and that I have the right to contribute it under the license indicated in the file. Again, if you're using a tool that is reproducing pre-existing works in its output, and you don't know the origin of those pre-existing works or what license terms apply, you can't actually certify that paragraph (b) is true, right?
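In practice, the DCO certification is expressed as the Signed-off-by trailer that `git commit -s` appends to a commit message. When you have reviewed and edited the AI output yourself, so that paragraph (a) can honestly apply, a commit message might look something like the following. The "Assisted-by" trailer here is a hypothetical convention some teams use to flag AI involvement for downstream compliance review; it is not a standard, so check your project's contribution guidelines. All names are placeholders.

```
parser: handle empty input without crashing

Initial fix suggested by an AI coding assistant; I reviewed, tested,
and edited the change before submitting it, so the contribution was
created at least in part by me.

Assisted-by: <name and version of the AI tool>
Signed-off-by: Jane Developer <jane@example.com>
```

Recording this in the commit history, rather than only in your head, is what makes the later "tag AI-generated content" practice I'll describe actually usable by downstream adopters.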
There are analogous concerns with contributor license agreements, including the Apache CLA, which is perhaps the most commonly used CLA. For projects using a CLA, you're being asked as a contributor to make certain certifications, and if you're using AI-generated content, it's not totally clear that you can make those representations.

There are also evolving laws and regulations in this area. Earlier in the summer, ChatGPT was temporarily banned in Italy due to a privacy issue. There's legislation in the EU, the Artificial Intelligence Act, and an executive order that was recently issued in the US. This regulation is coming in a range of jurisdictions, and it's going to subject AI providers and AI users, including open source AI models, to a number of compliance obligations.

Another concern doesn't apply so much in the context of an open source project, because everything is out in the open and we're not really dealing with trade secrets when we're doing collaborative open source development, but it's very important inside a company developing proprietary software: the risk of trade secret loss. For example, if you ask ChatGPT for advice on how to treat your sore throat, and then your neighbor asks ChatGPT, "Is my neighbor not feeling well?", ChatGPT is not required to keep your personal medical information secret. Or if you ask for marital counseling advice and your neighbor asks, "Is Joanna over there having issues in her marriage?", ChatGPT does not have to keep that secret. Similarly, if you're a company and you're feeding in prompts containing, let's say, proprietary code and asking ChatGPT or a similar tool, "Can you please identify the bugs in here for me?", or you're feeding it a recording of a meeting and asking it to generate a transcript, and that's confidential information: ChatGPT is learning from the prompts you feed it, and it's not required to keep them secret. That is a major concern inside companies that are using AI tools provided by third-party vendors.

The good news is that, for the most part, these are not brand-new problems in open source. Even ten years ago, a developer could go to Stack Overflow, copy code that they really shouldn't be copying, and contribute it to an open source project without permission, without an appropriate license, without a compatible license. That can happen today. A contributor shouldn't do that, and hopefully it happens very rarely, though there are people who are new to open source and don't really understand how licensing works. But when you're using an AI tool, it introduces that risk at a much more systematized, broader scale.

I also want to point out that although this talk is focused on using generative AI to create software, there are obviously other types of content that can be generated using AI, and depending on the type of content, I think the risks are slightly different. Images and graphic artwork, for example, are almost always going to be subject to copyright protection. Documentation and blogs may be subject to copyright protection too, but they're relatively easy to fix: you can correct documentation or take down a blog if you get notice from a copyright holder saying, "Hey, this is infringing my content." Whereas with software and code, because of dependencies, it's not always that easy to just remove the allegedly infringing content when you learn about it.

So how do we manage all these risks? First I'm going to focus on guidance for open source projects and contributors. There's a lot of overlap, and many of these best practices apply within an organization or company as well. Then I'll talk a little bit about managing trade secret leakage risk inside a company.

Best practice number one: use AI to augment your efforts, not to replace your judgment or thinking. Make sure you're reviewing and editing any content generated using AI tools for quality, reliability, suitability, compatibility, and so on. That's partly because, remember, AI tools train on data that's not perfect. They can be training on code that has bugs, errors, and vulnerabilities, and those can be reproduced in the output, so you still need to check the output. It's also for the copyright, DCO, and CLA reasons we talked about earlier: if you have contributed at least edits or some of your own thinking and content, that resolves the issues around consistency with the DCO, and at least part of that contribution could be subject to copyright.

Also, review the terms and conditions of the AI tool. In the earlier example, we looked at a couple of provisions in the OpenAI terms and conditions that are inconsistent with the Open Source Definition, but there are tools that don't place restrictions on use of the output that would be inconsistent with the Open Source Definition. It just depends on the tool.

This next point is not new with AI: whenever you are reproducing pre-existing software code and contributing it to an open source project, or just using it in a proprietary product, you provide notice and attribution, right? Notice to the copyright holders, notice of what license applies, and so on.

Also, if you are using an AI tool that enables compliance, utilize those features. For example, some AI tools, including AWS CodeWhisperer and GitHub Copilot, include optional features you can turn on. You can either filter out of the recommendations any code that's similar to pre-existing third-party code, so you're not even shown suggestions that reproduce the training data; or you can instead elect to have those suggestions shown to you, but turn on what's called a code referencing feature, which will tell you, "Hey, this suggestion matches existing code," and provide you with a list of the matching repositories. Then you can go to those repositories and see what the terms of the licenses are. These features usually aren't turned on automatically, so you have to go into the tool settings to turn them on. But I do encourage use of these features, because they resolve a lot of the licensing compliance and copyright issues.

So what happens when the tool flags multiple matches? You might be using a code referencing feature in a tool like Copilot and get a list of ten, twenty, fifty matches. What do you do with that? Do you have to provide notice and attribution with your contribution for every single match? No, you don't. Even today, if you were to copy directly from any one of those repositories, you are not required to research what other software programs throughout the world are also potential matches. You only have to provide notice and attribution for one. So how do you go about choosing which one? At the Linux Foundation and CNCF, at least, we're not telling developers how they need to go about selecting one, but here are some examples of what's happening in companies, in their policies for which program you provide attribution to. You could go with the very oldest match; the thinking there is that maybe that was the original, and all the other programs copied from that source. You could choose any match that has a compatible license, or the oldest match with a compatible license, or you could just select one randomly.

The Linux Foundation has published higher-level guidance, available in the policy section of the Linux Foundation website, that essentially says: first, please check the terms and conditions of the tool you're using; and second, if you know that the output includes pre-existing third-party materials, please provide notice and attribution and comply with the terms of the license. Within CNCF, we're working on a much more comprehensive guidance document that's going to include essentially all the content in these slides, so you'll see these options in the CNCF guidance when it's published, probably sometime early next year.

So what if the tool you're using doesn't include features that enable compliance?
Well, there are other ways of getting comfortable that you're not inadvertently infringing third-party copyrights, using existing practices and tools that are used in open source compliance today. One is snippet comparison scanning, using a tool like Black Duck, for example. For many companies that's just part of their routine compliance checks: they subject software code to a scan before it goes into production, or even earlier, during the development process.

Also, if you talk to your company counsel, they might advise you that a particular AI-generated output would not be subject to copyright protection anyway, even if it had been produced by a human. If it's not copyrightable, you don't actually need a license; it's effectively equivalent to being in the public domain. For example, the code might be just a very simple expression of a mathematical equation, a very simple function. Again, there has to be some minimum spark of creativity for a work to be subject to copyright protection at all.

Or maybe you're using an AI model that was developed completely within your organization, so you're familiar with how it was trained and how it generates suggestions. For example, if it was trained only on internal proprietary code, and you're at an organization that has that much code, that might be another basis for getting comfortable that you're not accidentally infringing third-party rights.

Also consider the following, and this is not a requirement, and in the open source project context I wouldn't even necessarily say it's a best practice yet, but it is becoming a very common practice within organizations. When you are using an AI tool and including its output in code you're developing, you include, either in the file or in the commit comments, a notice or tag that says this was generated in part using AI tools. In some organizations, you might also include the prompts and identify which tool you used, so that it's in the history. That accomplishes a few things. One is that if your organization ever does want to register for copyright protection, there's a record of the prompt history. Some companies will even have you record logs of what the output was and what your edits were. I wouldn't say that's a universal practice, but it is a practice observed in some companies.

In an open source project context, if you as a contributor add that information, you're actually enabling downstream compliance, because many adopters, many companies, subject files and code tagged as AI-generated to an additional compliance review, whether that's a person in the legal department looking at it or an additional compliance scan.

And comply with the policies of your employer; many organizations have more stringent guidelines than projects do. I think it's probably a minority of open source projects that have already published guidance on how to responsibly, at least from a legal perspective, contribute AI-generated content, but more and more will. So make sure you familiarize yourself with the policies of the project you're contributing to, but also, if your employer has more stringent guidelines, comply with those as well.

Here are some additional considerations for what AI compliance looks like inside an organization. The approaches really vary widely. In some ways it looks a lot like what open source compliance looked like in companies in the late '90s and early 2000s, when there was a lot of fear, uncertainty, and doubt. There are companies today that completely prohibit use of generative AI tools, whether for code development or for any purpose. And just like in the late '90s and early 2000s, if a company prohibits its developers from using open source code, guess what: the developers probably are using it anyway, right?
And I think that's probably the same with AI tools. I'm definitely more a fan of not prohibiting it, but instead educating and providing guidance on how to do this responsibly.

Many companies allow use of generative AI tools, but only tools that have been vetted and selected by the company and that the company has a license to. Some companies permit use of generative AI only for certain uses and contexts but not others; for example, they might have a policy that says you can use generative AI for bug fixes, but not to generate shiny new features, because they really need to make sure those are copyrightable. Many companies allow use of generative AI tools but subject any AI output included in code to additional compliance checks and reviews. And some companies only allow it for developers with certain credentials and training. That's for a couple of reasons. One is to make sure that whoever is using the AI tools actually understands the best practices, the company policies, and how to use the tools. And secondly, the thinking is often that a really junior developer without much experience may not yet have the judgment to critically review the output of an AI tool, whereas a more senior, experienced developer is going to be able to review it more critically.

Some of the commonly used methods for minimizing trade secret risk, and this continues to evolve, are restrictions on the types of information that can be used in prompts given to AI tools. Some companies will say: you can feed the AI tool these types of prompts, but please don't put confidential meeting minutes or proprietary code in there. Some companies restrict the length of prompts to a certain number of characters; that helps ensure that all the code for a proprietary program isn't getting fed into the tool without proper authorization. Some are using models that have filters or restrictions on the model's ability to learn from the prompts.

A common approach is really around hosting and deployment: using LLMs that are self-hosted, owned, and controlled, so the company knows where the data is going. One example of this: Hugging Face recently announced its SafeCoder tool. The way this works is that they offer LLMs trained on permissively licensed data, so they come pre-trained, but then the company fine-tunes that LLM on its own internal proprietary software. The company gets the best of both worlds: the model has already been trained on a broad set of data, but it's fine-tuned for that company's purposes. And the model is going to be entirely self-hosted and maintained, and the data is never going to leave the company's own virtual private cloud, which prevents accidental trade secret leakage.

I also just want to note that because the laws and the technology itself are evolving so rapidly, anything that's been described as guidance or best practices here could very well change in six months, or even three months. So it's important, if you develop a policy either as an open source community or as a company, that you're constantly reviewing and iterating.

Are we out of time, or do we have time for questions? Okay. Well, I'll be around, so I'm happy to chat with any of you in the hallway. Thank you.