Thank you, everyone, for coming. Today we will be talking about GenAI's impact on open source, and when I say open source I mean that whole pipeline of great stuff that the community produces and that companies end up consuming. My name is Roman; I wear multiple hats. My day job, as of late, is co-founder of an AI company building a set of open source tools, so think of us a little bit like an open source Hugging Face for AI. I also happen to be VP of Legal Affairs at the Apache Software Foundation, and I have been a hands-on contributor to a number of big data projects in the Apache family. Luckily, I have a lot of good friends in the Apache and Linux Foundation communities, and a lot of the content we produced to guide developers on how to use GenAI, including some of the content in this slide deck, comes from good people I had the privilege to collaborate with. Two of them definitely deserve a shout-out: Joanna Lee from the Linux Foundation legal team helped a lot, and Henri Yandell at the ASF was also instrumental in creating the framework we're going to discuss today. But before we do, since we're here at the Cassandra Summit, let me start with a blog post. Everybody's talking about GenAI: some people say it will replace developers in a couple of years, some say it will never replace a senior developer; basically, your opinion is as good as anybody else's. But here was a guy who is a very prolific contributor to Cassandra, you could say the number one guy in Cassandra, who was very curious to see how GenAI could make him even more productive than he already is. And let me tell you, I've known him through the community work that he's done: he is an extremely productive coder.
So he took on the challenge of taking a few GenAI tools out for a spin while working on a real problem: adding vector search to Cassandra in six weeks. He documented it all in a blog post, and I highly recommend that you read it. There is a link, but you can just google it; the title is pretty self-explanatory. What I like about it is that it walks through a really senior engineer's journey through the GenAI tool set, and if you need inspiration on how to be more of a 10x engineer, that's one of those posts. The blog post was great, but what it turned into is a legal JIRA. For those of you unfamiliar with how we do business at Apache: Apache is a software development organization, so everything we do, including the legal matters we discuss, is mediated through JIRA. If anybody has a request, be it for a code change or for a clarification, it always ends up in a JIRA.
So we have a legal section of JIRA at Apache, and that blog post generated quite a discussion, because at the end of it Jonathan said that he's contributing the work he did researching GenAI tools back to the community, as he always does. But now the question is: whose code is it? Is it Jonathan's? Is it the GenAI's? He went through all the various ways you could use one of those tools, and it was genuinely unclear to the community what to do about it: do we accept it, do we not accept it, what do we do? If you're interested in a real, thoughtful discussion about how the community views these types of problems, I highly recommend googling LEGAL-656, or, once this presentation is made available, just clicking on the link, and reading through the discussion. What is obvious from the discussion, and from the blog post, is that at this point the ship has sailed; you cannot stop GenAI contributions. The open source community really took the red pill, and there is a reason why: if Jonathan tells you that GenAI makes him much more productive, you can imagine what it would do for a junior engineer. Not using those tools is just not an option, especially if you're building something in the open, in a community; you kind of have to use them to stay ahead of the game, or to be your best self. Now, when I talk about open source, especially for today's presentation, I mean a very particular kind of open source, because open source can mean different things to different people. Donald Knuth develops TeX as an open source project, but it's a very different open source project compared to Cassandra. So today we are talking specifically about software, and we're
not talking about images or artwork or anything like that. We mean software that evolves through peer-reviewed contributions. It's not software where somebody sits on a mountain, like Don Knuth, and just gives you updates now and again; it is developed very incrementally and peer-reviewed very thoughtfully. The governance typically resides with a formal entity, so it's not some random person's project on GitHub. That project is interesting too, and I'm sure the guy, or gal, who has it would also benefit from GenAI, but we're talking about open source with a more formal structure: something like Cassandra within the Apache Software Foundation would totally qualify. When we talk about how the software itself gets developed, we're focusing very specifically on the following flow of contributions. The contribution is developed (that's a phase: you can sit on it and develop it), then it gets submitted, then it gets reviewed, then maybe you update it based on the feedback you're receiving, and finally the contribution gets accepted. All four phases have very specific requirements for how you deal with things like intellectual property, attribution, and licensing, and we will be going through them in this presentation as well. Finally, when I say contribution, I mean not just source code but anything that gets reflected in a form that ends up in the GitHub or SVN repository. Obviously that could be source code, which immediately comes to mind, but it could be documentation, or images (GenAI is pretty good at generating images these days), or even blog posts; if you go back to Jonathan's whirlwind tour of GenAI, you will see he tried to use it for blogging as well. Interestingly enough, you can actually use
GenAI to generate patents based on the code that you're producing, and you can even use it to generate standards text, and all of that will take form in your repository. So we're not just talking about your source code contributions. And I would like to highlight, because this is all so new, that when we say GenAI we all tend to assume the one thing we were first introduced to, but let's take a wider look. If you look at how GenAI is consumed these days, it gets consumed in a few different ways. Obviously there are the SaaS tools: GitHub Copilot, Amazon CodeWhisperer, and so on. That's something you have no control over; you just hook it up to what you're doing and it gives you an output. Sometimes you can tune it to produce output drawn only from, say, source code under a particular license, so there are definitely knobs, but you don't really have insight into the model or into how the tool works. And then there are the EULAs, which we take for granted these days, because who reads EULAs? I don't read EULAs; nobody reads EULAs, because we assume that even with a EULA in place, they're not coming after you. Maybe they'll sell your data to the highest bidder, some advertising company, but they're not really coming after you. With these tools, you really have to read the EULAs, because they are very specific and often contain language that is actually incompatible with open source contribution. For example, from OpenAI, and this is taken literally from ChatGPT: you may not use the services to develop foundation models or other large-scale models that compete with OpenAI. That is the language, literally, in the
tool that you're using, and if you're unfamiliar with that language, if you miss it, you might actually be in legal trouble. Then the question becomes who gets sued and who doesn't, but it is still a danger you need to be aware of. EULAs in the GenAI space, I feel, tend to be much less clear. As a developer I'm used to standard developer tools, IntelliJ or Eclipse or whatever, and it's pretty clear what they're trying to communicate to me. When I say clear, some of you may remember the software license people called the "don't be evil" license: Douglas Crockford, basically as a joke, put that clause in the license of his JSON parser, saying, in effect, you can use my software for whatever you want, just don't be evil. Organizations like Apache had to go out of their way to ban that piece of software, because "don't be evil" is just not clear language; it's not language you want debated in court. Now, the EULAs of GenAI products are full of that kind of unclear language. For example: published content created in part using OpenAI may not be related to political campaigns, adult content, spam, hateful content, content that incites violence, and other uses that may cause social harm. Sometimes discussions on an Apache mailing list feel like they may incite violence, so am I violating the EULA when I generate code and that code then gets discussed? Nobody really knows, but the point is: pay attention to this. So those are the SaaS tools; they're a category of their own, they're very well known, and I would say 80 to 90 percent of the time when you hear GenAI, people mean SaaS tools. But nowadays there are also standalone tools, in the very traditional software development sense. Tools like JetBrains IntelliJ, and that whole family of tooling, have GenAI capabilities built in right now. What's interesting is that everybody talking about GenAI talks about just generating code or artifacts, but with IntelliJ the tooling tries to help you throughout the software development cycle. For example, when you see an error message coming out of your compiler, or you hit a particularly interesting debugging situation, a little prompt will often pop up asking whether you want AI to help you with that. If you click on it, it tries to do what we all do when we google an unfamiliar error message, but through the AI tools. So it's not just generating code; it's also helping you as a software developer, which comes in really handy a lot of the time. But that, too, is under a certain EULA, now applied to a standalone product. Finally, you can train your own model from scratch. Few people try this; it's very difficult, so I probably wouldn't recommend it, but if you train your model from scratch, you know exactly what data it was trained on, and you have all the answers about what the tool will give back to you. The middle ground is fine-tuning an existing model, and here I highly recommend the blog post from Hugging Face called "personal copilot", where they took one of the code-assistant GenAI models available in the open source and fine-tuned it on a set of sources the model was unfamiliar with. You can do that; you can make it give you results that align with your coding style, or your team's coding style. So there are a lot of knobs
that you can tweak that way, but if you do, then all bets are off, because now you're taking something that was trained on one set of data, fine-tuning it on yet another set of data, and it's not even clear what legal terms the output would be covered under. So it's fun to do, but if you're trying to do it in production or for a big company, really talk to your legal advisor. Now that you have all these options and GenAI has made you a 10x engineer, it's time, like I told you: you developed something, you developed it in a tenth of the time it would typically take, you're all happy, but now it's time to contribute it back. If you're just sending code to some other person's GitHub project, not a lot of people think about what legal contract that creates between you and that other person. The answer is, typically, inbound license equals outbound license: if the project is covered by the Apache License, your inbound contribution will also be covered by the Apache License. But it's actually unclear; there's a lot of legal debate about how that needs to be clarified. The takeaway is that for the more formal projects, under the governance of, say, the Linux Foundation or the Apache Software Foundation, we have clarified it by putting a particular document in place. In the case of Apache it is typically the ICLA, the Individual Contributor License Agreement, which has all of the language relevant to the contract created when something gets contributed back to the project. For example, there is section number five, which says: you represent that each of your contributions is your original creation. Now, if you just created it with GenAI, are you
violating the contract that you literally signed? Because in order to contribute to any Apache project as a PMC member or a committer, you have to have an ICLA on file with the Apache Software Foundation; you literally signed that contract with your own signature. It's not an implicit contract that got created. So are you violating it now? Well, let's talk about that. There's also item number seven, which says: should you wish to submit work that is not your original creation, you may submit it to the Foundation separately from your contribution, and attribute it accordingly. So if you're trying to separate what GenAI gave you from what is really your creation, what guidelines do you need to follow? Because the contract itself already makes you follow them. Now, a lot of the Linux Foundation projects don't use ICLAs; they use the DCO, the Developer Certificate of Origin. It is, in a way, also a very explicit contract, and it also has problems with GenAI. If you take a step back and think about all of these contractual obligations: why do we care? Why does the ICLA contain language that is so explicit about things being your original creation? We care because open source was literally made possible by copyright. If the copyright doctrine didn't exist, we probably wouldn't have open source. Even back when copyleft got created, copyleft was a clever hack: a way to leverage copyright concepts in order to create the kind of software we all enjoy. But it still operated within the assumption that copyright is a really important legal construct that we all need to understand, and the language you see in the ICLA is there exactly to protect some of the copyright aspects of the work you're contributing. Now, I'll just give you
a few tidbits on how to think about the problem space when it comes to copyright. First, the good news: only works made by a human mind are copyrightable, not anything that comes out of a machine, or even an animal. If you are not familiar with this, there's the famous ape selfie case; google it. A lawsuit was brought involving a photographer who claimed copyright on pictures taken by an ape: the ape would literally take the camera and sometimes take selfies, sometimes take pictures. The photographer owned the camera and handed it to the ape, so, the argument went, the photographer has the copyright; some people said no, it's the ape who has the copyright. An actual lawsuit was brought in the United States, and the ruling was that an ape cannot hold copyright; only a human mind can produce a copyrightable work. That's a very useful precedent for anything generated by AI: by itself, AI recombining the training data it was given cannot create a copyrightable work. You, as a human, can take that output and create a copyrightable work on top of it, but if all you take is that recombination of artifacts the AI is spewing back at you, it is not even copyrightable. That's a very interesting takeaway from all of this: it is not even copyrightable. In certain contexts that might actually be good news; just pause for a second and appreciate that fact. Now, copyright protects some rights, obviously, but it also has some rights carved out. For example, even for a copyrightable work, say a Disney movie or a song, one very clear carve-out we're all familiar with is using it for parody. So if you're parodying
a work, and the amount of the original work you borrow is not too long, even the Hollywood mafia will not come after you, because that is a carved-out chunk of copyright that you can still use. That applies to software as well, in the sense that you can recombine some of the work to a certain extent, as long as the expression is not really creative. And that's very interesting to me, because what is creative? Is a single line of code creative? Do we measure it by lines of code, by the complexity of the code? How do we measure it? Some people say that if it's not creative, we can apply the exception. So consider the two cases. First, AI produces something that is not a copy at all, something the world hasn't seen; well, that is not copyrightable, as we've established, like the ape taking a selfie. Second, AI produces something that's just a copy of what it has seen before; that has happened. But if it's small enough and not creative, you can apply this exception: even if it matches some existing code one to one, it's not creative enough, therefore I can use it. An analogy: we all know we're not supposed to take code from Stack Overflow, because it's covered by a weird license, but if it's just a single line of code, not creative enough, I think we are all within our rights to go ahead and take that single line we googled. It's kind of the same deal. We also, obviously, have copyright issues in training versus inference. Did the AI engine actually have the right to use the data it got trained on? Only the maker of the tool can answer that. Versus: what came out of the tool, and can you use it in
your work? These are very useful questions to answer. So far it probably sounds like we've got 99 problems, and that is true; we have a lot of questions that aren't clear, and I've only stated a few here. There are other risks, too. People copy-paste proprietary data into the ChatGPT window, because that's how you interact with a SaaS tool, and if the data you pasted contains trade secrets, you might end up violating your corporate policy, and there will be some level of trade secret loss. There's obviously loss of privacy. There's also concern about intentional manipulation of AI models: makers of AI tools effectively steering the models to produce certain output. Suppose you're a cryptographer working on a new cryptographic algorithm, and you're using a GenAI tool to help with the mundane tasks, writing the for and while loops. What do you know, maybe a subtle bug gets injected, and all of a sudden the keys you're generating are less secure, but it's not obvious from just looking at the code. That could potentially happen, especially if you're using tools coming from unknown state actors, or even well-known state actors, well known for not following international law; there are a lot of AI models coming out of countries that are on a gray list these days. So that is also something we all need to consider, and we're considering it because we cannot stand still. Like I said, the ship has sailed: whether we like it or not, at the Apache Software Foundation, the Linux
Foundation, or any big corporation, developers will be using GenAI, because it makes them so much more productive. So what did we do at Apache? We came up with a recommendation about six months ago. It's a very basic one, just the beginning; there was a blog post we published, go read it. It was created in concert with the Linux Foundation and with some of their efforts to help one of their biggest communities, Kubernetes, figure out the same legal issues. But there are still a lot of questions left, questions that document barely touches but everybody keeps asking. If you keep looking at those legal JIRAs I told you about, you will see people coming back with questions like: should we clearly mark all contributions created with GenAI assistance? Should it be policy within Apache that if you use one of those tools, you at least leave us a breadcrumb in the commit message saying a particular tool was used? The stance Apache takes today is that we recommend it but we don't enforce it. We'll see how the whole field evolves; maybe at some point we will start enforcing it. Another big question is how everything we've described for an Apache developer dovetails into corporate policy, because every Fortune 500 company, and then some, now has a GenAI policy for its internal developers: what is permitted and what is not when working on the proprietary, closed source code base. If that policy is completely out of sync, not congruent, with what Apache is doing, you now have a problem, because you have this huge supply chain of open source coming from the
foundation side, and if it doesn't mesh well with internal corporate policies, you are effectively getting cut off from all of your favorite open source libraries, because there is a mismatch, a gap, and it would be incredibly difficult to bridge. Once code is in one of the big open source projects, it's next to impossible to isolate it and get it out. Corporations take different stances toward GenAI, and here I've just summarized some. It is a zoo, a zoo of risk-averseness. Some companies, for example, authorize the use of generative AI only for developers with certain credentials; it's a "you must be this tall to ride this ride" kind of situation, where you need to qualify to use GenAI within a corporate setting. It gets to that level; I know companies that actually do that. At Apache we don't do any of that. So what do we do? This list is literally taken from one of the Linux Foundation legal team's presentations, and it's very much the same for Apache. As open source foundations, we were given a choice. We could take the most cautious approach. We could decide use by use: maybe allow it if you're just debugging or tinkering with the code, but not for the final contribution. We could decide on a tool-by-tool basis: you must use GitHub Copilot and nothing else. Or we could completely trust the developer and say, do whatever you want. Not surprisingly, the approach we took, between Apache and the Linux Foundation, with Eclipse joining as well, so I think we'll see a coalition of these different open source foundations, is: trust the developer, but provide the guidance. So for all use cases and
all AI tools, for all projects: just publish the guidance, and give developers an understanding of what they're dealing with. Because we don't really like reading legal contracts; nobody likes that, and even if you signed the ICLA, you're pretty much not up to date on what you signed up for. So at least refresh developers' memory, and that's what we're doing. The ASF's answer is published as the ASF Generative Tooling Guidance; you can google it, or once this presentation gets published you can just click on it. It's a pretty clear list of what we can say to you today, and a lot of corporations came to me afterwards and said thank you, because they are now reusing it internally, at least as a starting point for discussion, and sometimes even as a set of policies for how they do things in a corporate setting. It's a very useful list, and it's constantly evolving, so keep up with it; refresh your page, so to speak. It currently stands at version 1.0; I think we will do a 1.1 with a few updates people have suggested relatively soon. We are not anticipating a 2.0 until the legal landscape or something of that magnitude changes, and GenAI is a very fast-moving target. You have changes in the law: GenAI is a favorite subject not just of corporations but of policymakers. You now have the policymakers of the EU telling you something about GenAI, and I'm sure the policymakers of the United States will tell you, as a software developer, something about GenAI as well. That, of course, will apply to us, because one little fact that we software developers don't really appreciate is that the law actually trumps our license. If our license says, for example, that the software is provided to you without any kind of warranty, but all of a sudden Europe passes a law
that a warranty must be provided, your license is worthless. Well, you could just not do business in Europe, but that's a separate issue. The law always trumps the license, and in this particular case the legal landscape is very active. It's not like the beginning of the open source movement, where we got lucky and nobody noticed us until we were pretty big and could speak for ourselves. Tolerance for risk and ambiguity among adopters of open source software is also changing. Some companies, especially when it comes to AI, get scared pretty easily, because it's a big, big liability: within the United States legal framework everything is based on precedent, and not a single big precedent has been set in court yet. But there will be, I can guarantee there will be, and nobody wants to be that first one. There are a few additional bullet points to consider. Written guidance could also include a pre-approved list of AI tools; we don't really do that, because that would be picking winners, and we don't pick winners in the open source community. We could also work with things like SPDX a little more, and provide attribution and a machine-readable notion of which AI tool generated what. I actually want to talk to the SPDX folks within the Linux Foundation a bit more; I don't think they're quite catching up to that yet, but maybe there is work being done that I'm not aware of. And that's pretty much it, so thank you so much. We actually have a workshop and a panel today focused on what it means to produce open source AI, where these and some other questions will get discussed. If you're curious about these subjects, come to the panel called "The Definition of Open AI" and the workshop run by the OSI later today, at 4:30 I think. Thank you so much.
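[Editor's sketch] To make the SPDX idea mentioned above concrete, here is a purely hypothetical fragment showing what a machine-readable breadcrumb could look like in SPDX-style tag-value form. The AI-related fields (`AIToolName`, `AIToolVersion`) are illustrative inventions for this sketch, not fields from any published SPDX specification; the file name and tool name are likewise made up.

```text
## Hypothetical SPDX-style tag-value fragment recording GenAI assistance.
## "AIToolName" and "AIToolVersion" are illustrative only; they are not
## part of any adopted SPDX specification.
FileName: ./src/vector/SearchIndex.java
SPDXID: SPDXRef-File-SearchIndex
LicenseConcluded: Apache-2.0
FileComment: <text>Portions drafted with a GenAI coding assistant;
output reviewed and edited by the committer.</text>
AIToolName: ExampleAI Code Assistant
AIToolVersion: 1.2
```

The point of a fragment like this is simply that tooling, rather than humans grepping commit messages, could answer "which AI tool touched which file" across a whole supply chain.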