There's definitely a lot of confusion in the market, and it's one of the main drivers of this investigation, this research on what constitutes open source AI.

Hi, this is Swapnil Bhartiya, and we are here at Open Source Summit in Bilbao. Today we have with us, once again, Stefano Maffulli, Executive Director of the Open Source Initiative. Stefano, it's great to have you on the show.

Wonderful. Thank you for inviting me.

This topic is also very interesting because it's about one of the hottest technologies these days: AI, generative AI, all those things. Of course, AI and ML have been around for a very long time, but recently they have been getting a lot of buzz because of generative AI, ChatGPT, and all those things. And Musk also tweets about it. What is interesting is the relationship between AI and open source, because AI is not just a LAMP stack, just four components. It's much more complicated: models are there, of course, and the technology around them. So I would like to understand from you, from the open source community, and because you represent the OSI as well: how do you folks look at AI?

You really touched the most important points, and my experience was exactly that. Looking at generative AI, at Copilot when it was first released and tools like that, I thought: wait a second, this changes everything. AI has been around for a long time, but for the first time I was seeing the crisis that these new systems introduce into the open source definition. I went back to look at the open source definition again, and I asked myself: wait a second, what is the source code, which is number two in the open source definition? What is the source code of Copilot in this case? Or, similarly, of ChatGPT? What do you consider the source code? So I started diving a little deeper and noticed how different the whole stack is, the whole set of components that go into an AI system: a completely different story from what goes into an operating system. And we need new frameworks; that's what I realized. We really need to start from scratch with a fresh mind, thinking about whether the principles, the principles of open source and freedom, can still be applied to AI systems. I think we should really be making an effort to recreate the sets of freedoms that have enabled fast evolution and innovation in computer science. We need the same sort of safeguards and values reflected in AI.
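To make the point about the stack concrete, here is a minimal, purely illustrative Python sketch of the distinct artifacts that typically make up an AI system. Every name and description here is an assumption for illustration, not taken from any real release, but it shows why "the source code" is no longer one well-defined thing:

```python
# Purely illustrative: the distinct artifacts of a typical AI system, each of
# which raises its own "is this open?" question. All names are hypothetical.
ai_system = {
    "inference_code": "code that loads the model and serves answers (classic source code)",
    "training_code": "data preprocessing and training scripts (also classic source code)",
    "training_data": "the dataset; touched by copyright, privacy, labor, and contract law",
    "model_weights": "binary artifact produced by training, with no obvious 'source' form",
}

# For a LAMP-style program, "give me the source code" points at one artifact.
# For an AI system, it is unclear which of these is the preferred form for
# making modifications, which is the question this interview keeps returning to.
for artifact, role in ai_system.items():
    print(f"{artifact}: {role}")
```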
And that brings me to a not entirely different but related topic, because you have to rethink things when we look at open source licenses and AI. If you go all the way back to the basic definitions, whether at the OSI or the FSF: the FSF came up with the four freedoms, and the OSI came up with ten criteria. The basic idea is that there is source code, and when you make changes and distribute that code, I want to see the changes. Like Linus says: I don't really care what you do, but if you're making any changes, I would love to see them, because they may improve my code base. Everything else is secondary. And sometimes just being able to see the code is not enough; I should be able to modify it. Which, as I said, may be not related, because one of the questions we were going to discuss today was also the movement we're seeing where companies are changing their licenses. It could be because venture capitalists are pushing; sometimes they say, hey, the cloud vendors are pushing us. If you look at cloud, cloud does not redistribute any code, and when you're not redistributing, you're not compelled to release the source code, so you can package everything into a service. So the folks who are actually creating the software see the cloud vendors leveraging it without giving back, because the license does not ask them to do that. I bring this up because, as we touched on last time: do we need to come up with a new definition now that the market has changed? When you folks wrote these definitions, cloud was not there; we didn't even know about these things. So I do want to talk about this, and since you did bring it up, I'm putting you on the spot to talk about these two topics, which are not related, but related.

They are related. And if you want, there is a little bit of a continuum, at least in my mind. Forty years ago, and we celebrate this week 40 years of the GNU project: the GNU project started 40 years ago. It came with the GNU Manifesto and with the practice of sharing code and sharing modifications, and it had those sets of freedoms embedded in licenses, in copyright licenses. They were set 40 years ago, and 40 years ago a computer was a completely different system from what we have today. The open source definition came out 25 years ago, so 15 years after GNU. In these past 25 years, like you were saying, software consumption, distribution, and execution, these concepts, have changed. They changed somewhat with the introduction of cloud consumption or mobile consumption, and they're changing even more radically now. That's an example I was giving to another person earlier: you have a phone that is snapping a picture, and the picture is fixed up on your phone after being uploaded to a remote service that does some analysis on it and sends it back to your phone. Now, if you wanted to have the source code of that system, what would that look like? How would you modify it? How would you rerun it on your own premises? We haven't asked those questions, but those are still fairly easy to answer. I'm not going to say it would be easy, but it's still working in the same realm and within the same legal frameworks that fit the open source definition. When we start thinking instead about AI systems, we are in a completely different game. It's a completely different story. The open source definition is really thought out and designed to work within copyright law, which is a fairly uniform law that looks similar across the world. When we look at AI systems, we start seeing the dependencies on data. Data is covered by different laws. There is copyright, yes, but there is also privacy law. There are labor laws that go around data work. There is contract law, terms of service: so many different pieces that make the whole scenario completely different.

Looking at this discussion, and of course you also ran a birds-of-a-feather session here: what kind of discussions are you hearing, not only from the OSI community and the larger open source community, but also from the larger vendor ecosystem or the users' ecosystem?

There is a lot of confusion, there are a lot of hidden agendas, and there is a lot of fear in general in the market.
Fear that is also reflected in the rush to deliver legislation and regulation around generative AI systems. The birds-of-a-feather sessions, the conversations I've had this week, have been deeply focused on one of the crucial questions we have to ask ourselves: what are we going to do? The crucial question is: what kind of dependency is there between the original training dataset that went into an AI system and the system itself? Are we going to consider that original dataset the equivalent of the source code for the model and the weights that have been trained? Or can we be a little more relaxed and think of the model itself, the model and the weights, as the preferred form for making modifications to an AI system? This is a big issue, and it's not very clear what we can do, because it intersects with technical limitations that still exist in this space. For example, if a model has been trained on private information, there is no easy way to remove that private information from the model itself without retraining the whole system, which brings us to the cost of training these systems. Is that feasible? Do we really want to force burning megawatts again (we're talking megawatts just to train something like ChatGPT), or do we want to have other technical solutions in place? And what should the policies be? The complexity of the conversation is really daunting.

But we're making good progress on the principles: what are the basic principles that we want to see reflected in an open source definition for AI? The principles seem to be quite straightforward, going basically into re-reading the GNU Manifesto while putting on the lens of AI. We want to be able to reinforce those principles, and it looks like there isn't much divergence of opinion on that front. The principles of the GNU Manifesto are still quite valid: the fact that, as developers, for developers and for society, we want to have self-sovereignty over data and code. We want to be able to innovate without fear of retribution and without having to ask for permission at every single step. Now, that said, the implementation phase is going to be a little more challenging, especially on the data front, because as a community of open source developers we haven't really thought much about data. We haven't paid attention, because until yesterday there was a very clear separation between the code you wrote and the compiler you used, whatever license it had, or the IDE, the developer environment, you were using. You could use a proprietary developer environment and compiler and write open source code, or vice versa, use an open source compiler to write proprietary code. We know where copyright starts and ends. With data, we start to get a little more confused.
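A minimal sketch of the retraining point above, using a toy least-squares model (an illustration assuming nothing beyond NumPy; none of this comes from any real AI system): the trained weights are an entangled function of every example, so there is no inverse operation that subtracts one record, and honoring a deletion means retraining on the remaining data.

```python
# Toy illustration of why "unlearning" one record is hard: the weights are a
# lossy, entangled function of every training example.
import numpy as np

def train(examples):
    # Fit a weight vector by least squares over all examples at once.
    X = np.array([features for features, _ in examples])
    y = np.array([target for _, target in examples])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w  # every example influenced every component of w

data = [([1.0, 0.0], 1.0), ([0.0, 1.0], 2.0), ([1.0, 1.0], 3.5)]
w_all = train(data)

# There is no operation that takes w_all and "subtracts" data[-1]; the only
# reliable way to honor a deletion request is to retrain from scratch on the
# remaining records, which is the megawatt-scale cost described above.
w_without_last = train(data[:-1])
print(w_all)           # weights trained on all three records
print(w_without_last)  # different weights: the removed record left a trace
```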
One more thing that is happening with ChatGPT is that it needs access to the widely available data out there, and now a lot of companies have started blocking ChatGPT from accessing that data. But the fear is the same. And since you brought up the GNU Manifesto: when I used to talk to Richard Stallman, his focus was more on users, the rights of users, the protection of users. When we look at AI, and I'm not talking about legacy AI (we can use the word legacy for AI now too), who is the user here? And how do we protect the rights of the user?

Well, the user, I think, is the same user that Richard had in mind. I had many conversations with him too, and he always thought of a user as anyone sitting in front of a computer, whether or not they had the technical skills or the knowledge to program or reprogram a system on their own; if they were not programmers, they still had the freedom to exercise by hiring a programmer. I'm thinking the same thing with AI systems. We need to include the person at the very end of the spectrum, lifting a phone and asking a question. I think we would want that person to have complete control and self-sovereignty, to be able to say: I don't like the answers I'm getting, and I need to be able to ask someone to modify the system so that it fits my purpose. One of the very basic things we were doing when Linux started to appear on the commercial horizon was to go to policymakers, to the politicians who were writing tenders for public procurement of new software and new systems, and to say: you need to change your procurement laws so that software that comes with these freedoms can be deployed, because citizens, by interacting with public administrations, need to be able to say: give me the source code, I want to see why that form I submitted was denied, or I want to make sure the information I submitted is being treated carefully. So this is transparency and control distributed to society. With open source software, in the classic sense, we get one step closer to having control as a society. With totally opaque and obscure software, we don't. With AI, we have absolutely no control, for a variety of reasons, including technical ones. I don't want to say it's only about the availability of the source code, or whatever the equivalent of source code is, but we still need to be moving towards that, so that society as a whole has the possibility to control its digital information and digital interactions.

The only difference I see here, or the paradox I see with AI: when we look at traditional open source, back in the GNU days, the user was purely a user. When I sit in front of a computer to open an office document, whether in LibreOffice or whichever suite it is, I'm simply a consumer. In the case of AI, take autonomous cars: when I drive the car, the cameras collect data from my driving. And as a journalist I see that most of the people who consume content are also the ones creating it. So here the chemistry is a bit different. The user is also the one contributing, feeding the data, because without the data the AI would have no knowledge. That's where I see that it's a bit more complicated, because we are not looking at pure users. So what control do I have over how the data I'm generating is used?

It is complicated, but what I would like to highlight is that this is not a problem of open source, though. Where I'm positioning the Open Source Initiative, this work we're doing to understand open source AI is going to have, as a side effect, more knowledge for the community to influence policies. I think our work and our job, in this case specifically, is to influence policies that say that society needs to have access to that data.
In other words, I think we should be carefully thinking about making illegal those terms of service that deprive the collectivity of access to data that has been generated by society collectively. It's not the job of the open source movement to write licenses to address this. I think we're heading to a place where, by signing terms of service (like in the car case, or on Facebook, Instagram, other social media, or even Reddit), the deal is: come to my community, come to my website, contribute your knowledge for free, write your blog posts and things like that, and then I will be exclusively capable of signing a deal to resell the knowledge you created for free to a third party. I think we need to think carefully about who we vote for, and get policies made that defend collective creations instead of making them proprietary and handing them over to someone else.

I do see that when it comes to AI, there are a lot of folks who talk about legislation and a lot of regulation, but their motives don't always seem to be pure. It seems like there are a lot of hidden agendas. How do we also ensure there are no bad actors doing things either to stifle competition or to gain dominance? How much does that concern you?

Oh, it's a big concern of mine. Frankly, the work we are doing on finding this definition of open source AI is highlighting exactly that for me: the existence of dark forces, and I'm not even sure where they're sitting; they're so dark at this stage. One thing I've noticed, for example, is a strange convergence of people. There are content creators, who have absolutely every right to be upset about the privatization of their content: people who released software under the GPL thinking, or hoping, that it would stay under the same conditions, but also artists, comic designers, and even individuals like me. I uploaded my pictures to Flickr and put a Creative Commons license on them. I never thought, at the time I did it, that I would be feeding a machine that is now capable of recognizing me when I go shopping, because I trained it to recognize my face even as it changes over time; think of how I looked in my 20s versus now, right? That kind of information being in the hands of private institutions is really dangerous and bad, and I understand why they're upset. But I also see the contrary effect: the push to reinforce, to increase the reach of copyright, to say you cannot use my blog posts, you cannot use my pictures, you cannot use my content in general without asking me for permission. That, in my mind, leads to a very nasty side effect. The side effect I see is that only large corporations like Meta, Google, Microsoft, and Amazon, or large governments, or bad actors, are going to be able to enter into commercial agreements with the large aggregators of content, think Reddit or Getty Images. They will be the only ones able to assemble large datasets and therefore train large language models, or other large models that require a lot of data, because they have the power to do so.
And smaller groups like EleutherAI or LAION will not be able to do it, or they will be sued out of existence, paradoxically because they disclose all of their sources. They go above and beyond, publishing scientific papers and publishing the source code of the tools they used to assemble the data; they try to disclose the provenance. And by doing so, they expose themselves to takedown requests, which are starting to appear, and to lawsuits that are very easy to file. On the other end of the spectrum, OpenAI the company and others, by not revealing their sources, are protected. They will be sued, and they are being sued, right? But now it's Getty Images who needs to prove that their data, their pictures, have been used to train Stable Diffusion, for example. They may not succeed, and they still need to make an investment to try.

When it comes to AI, you know, just the way in political terms in the US we use the words RINO and DINO, Republican in name only or Democrat in name only, in the open source world we can also start using OSINO, open source in name only. A very good example is Llama 2: it was said to be open source, but it's not. So what are your thoughts about things like that, where we are whitewashing open source, or saying, call it proprietary and it should be okay? How much concern, once again, do you have about that, and what are your thoughts on it?

There's definitely a lot of confusion in the market, and it's one of the main drivers of this investigation, this research on what constitutes open source AI. Llama 2 is not open source. You can read the license itself: it has at least a couple of points where it doesn't make any sense. That said, I can see a lot of scare tactics being used, becoming popular in the market, and those scare tactics probably force the hands of these large proponents and developers of AI systems. In fact, there is a surprising (at least at the very beginning, for me, very surprising) awareness among AI developers of the impact of the software, of the systems, they're developing. They seem to be quite afraid of the capabilities, and they preemptively try to write licensing documents and agreements that say: hey, I'm releasing this; I need to do it because I'm a scientist and I need to publish my knowledge and the research I'm doing. At the same time, I'm aware that this thing puts me in a dangerous position; maybe it gets out of control and provokes, I don't know, a massive spam campaign or something like that. So I'm going to put disclaimers here and there as much as possible. The other awareness I've noticed, an uber-awareness if you want, is of the energy consumption and other issues that go into these large AI systems, issues that didn't exist before. So developers seem to be more aware. It may also be a generational thing: the old neckbeards from the 70s had fewer concerns about the impact of the work they were doing. No one really thought about the risks of putting their software into missiles and cannons. But today's generation seems to be more aware and attentive. The problem with all of this goes back to the principles we want to see upheld. The golden rule of the GNU Manifesto is that if I like a program, I want to make sure I can share it with others who may like it.
And I think that very simple rule has enabled the tremendous progress that computer science has made. Because if you think about it, computer science and open source and free software have evolved together. Since the early days, in the late 70s, when free software was coming about, which was also the beginning of the software industry, the progress has been really difficult to untangle. Today, software that is open source, that is distributed with freedoms, is everywhere. I'm quite sure that the progress of the scientific discipline of computer science has been enabled by the four freedoms; it could not have happened otherwise. It's not a casual correlation; it's causation. I'm quite sure of that. Of course, one can try to disprove me. If we want to make the same sort of fast progress, we must try to have something written down, an agreement on what I need in order to share an AI system if I like it.

Stefano, thank you so much for taking time out today to talk about this complicated but also critical topic. It is still very much a work in progress, and as you said, we have to redefine what open source means in the context of AI. I would love to chat with you again as we make some progress in this space, but I really, really appreciate your time, brother. Thank you.

Thank you so much. We'll be at All Things Open to really start writing down the very first draft of the open source definition for AI, and I hope we can continue talking about it next year too.