Hello everyone. My apologies, I don't think the monitor here likes Linux, so I'm just using Alex's laptop. Thank you for that. So my name is Nick Vidal. I work with the OSI, the Open Source Initiative, and it's a pleasure to be here. Thank you, Alexi, for the invitation. Today we're going to talk about open source AI. There has been a lot of talk about open source AI, but there's no clear definition of what open source AI actually is. In fact, the European AI Act has been progressing rapidly in Europe, and that legal document refers to "free and open source AI," but nowhere does it actually define what open source AI is. So the OSI is working together with the community to try to come up with a definition. With these slides I'm going to give a quick overview of what AI, and generative AI in particular, actually is, so I'll pass through this quickly. This is Stable Diffusion from Stability AI, and this is the prompt: what if Elon Musk was living on Mars? This one here: what if WALL-E, the Disney character, was a real human person? These are some amazing images created by generative AI. Next we have the Joker as a Disney princess. This one's a bit scary. And finally, Harry Potter surfing at the beach. All of this was created using open source models, and it's quite amazing that they were able to generate that, right? It's groundbreaking. So we have diffusion models, and we also have other approaches, right? Like attention and transformers. So what is generative AI? Basically, we have a prompt, where we enter text or an image or whatever, and it generates a response. Here we have "What's the capital of Germany?" and, using a large language model, it's going to respond: the capital of Germany is Berlin. This is the diagram of how this works, basically. We have a prompt, where somebody asks a question, and this goes to an application. Let me see if we can advance here. All right.
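The prompt → application → model → response flow in that diagram can be sketched as a toy Python example. Everything here is illustrative: the "model" is just a lookup table standing in for a real large language model, and none of the names come from a real API.

```python
# Toy sketch of the prompt -> application -> model -> response flow.
# The "model" is a lookup table standing in for a real LLM; all names
# in this example are illustrative assumptions, not a real API.

def toy_llm(prompt: str) -> str:
    """Stand-in for a large language model: maps a known prompt to a response."""
    knowledge = {
        "What's the capital of Germany?": "The capital of Germany is Berlin.",
    }
    return knowledge.get(prompt, "I don't know.")


def application(user_prompt: str) -> str:
    """Application layer: forwards the user's prompt to the model, returns the response."""
    return toy_llm(user_prompt)


print(application("What's the capital of Germany?"))
```

A real deployment would replace `toy_llm` with an inference call to an actual model, but the shape of the pipeline is the same.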
And the application is going to ask the large language model for a response. To create this large language model, we use training data, a huge amount of data. And the model is going to return a response to the user. Besides training data, you can also do some fine-tuning to adapt the large language model to specific use cases. For example, we can adapt it to generate programming code, right? Or for question-and-answer chat, or other specific tasks, like medical research or other topics. Another way of enhancing this is by using prompt engineering or RAG, where you can use live documents or live data. When you use a prompt, it's going to check, using embeddings, whether there's something more recent or more precise, and it's going to work with the large language model and return a result that is much more precise. So this is the basic diagram for generative AI, and it's very generic: it works for both proprietary models and open source models. But let's dive into what open source AI actually is. What's interesting here is that we can break this into four components. First we have the software. The software is the easiest part: we clearly know what is or isn't open source software. There are at least 25 years of history behind this with the OSI-approved licenses, and, even before that, the free software movement with its four freedoms. So you can clearly say whether a piece of software is free and open source software or not. The software side is very well defined. With data, however, this is more challenging, because there are privacy issues and copyright issues around data. Sometimes you cannot fully release the data or the datasets. So there are some challenges around that, right? Maybe the data is under Creative Commons and you have to give attribution. How does that work for large language models?
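The RAG idea described above can be sketched in a few lines: embed the documents, find the one closest to the prompt, and prepend it as context before calling the model. Real systems use learned vector embeddings and a vector database; here a bag-of-words vector with cosine similarity stands in, purely for illustration, and all names are made up for this sketch.

```python
# Minimal RAG sketch: retrieve the most similar document and prepend it
# as context. Bag-of-words vectors stand in for real learned embeddings.

import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector over lowercase tokens."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


documents = [
    "Berlin is the capital and largest city of Germany.",
    "Paris is the capital and largest city of France.",
]


def retrieve(query: str) -> str:
    """Return the stored document most similar to the query."""
    q = embed(query)
    return max(documents, key=lambda d: cosine(q, embed(d)))


def augmented_prompt(query: str) -> str:
    """Prepend the retrieved context; this string would go to the LLM."""
    return f"Context: {retrieve(query)}\nQuestion: {query}"


print(augmented_prompt("What is the capital of Germany?"))
```

The point of the pattern is that the retrieved context can come from live or private data, so the model can answer with information that was never in its training set.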
There's also the model architecture and the weights. When Llama was launched, the first version, the weights were only shared with researchers. And those weights got leaked on 4chan. So people started using Llama even though they weren't allowed to use the weights, even if they were not researchers, because the weights had leaked, right? We know that there's a whole community around open models, open source models. On Hugging Face we can see the leaderboard, where we have a whole bunch of models coming out every week or every day. It's very challenging to keep up with everything that's happening. So this is pretty exciting. There was also a leaked document from an engineer at Google arguing that open source is really growing and outpacing proprietary models. We're not there yet, but even if your company has the best engineers and the best team, it's very challenging to compete with a global community of developers and researchers, and with the iteration that comes from everyone building on top of each other's models. So this document basically says that open source will outpace proprietary models, and that Google and OpenAI have no moat, right? And we're seeing this happening, this rapid innovation and iteration. What's also happening is that companies like OpenAI and other organizations are trying to create regulatory barriers for startups, right? Make it really challenging for startups to comply with all the laws and all the regulations, so that only the big players will be able to play this game. But we've also seen a lot of companies trying to promote a more open ecosystem. We have Hugging Face, for example, and Stability AI, who have written letters to the American Congress to promote open source AI. Recently, Meta launched Llama 2. It's not open source; the OSI published an article explaining why Llama 2 is not open source. But Llama 2 is one of the most open models there is. It's almost open source, right?
And it's rapidly being adopted. We also see questions around privacy. When you use OpenAI, when you use ChatGPT, companies are basically giving away their data to a third party. We've seen Apple, Goldman Sachs, and Samsung prohibit the use of ChatGPT because of those concerns. When you use open source AI, when you use open models, you can run them on-prem, and so your data does not get leaked. For companies, startups or even large enterprises, having your own large language model, running on-prem or using privacy-enhancing technologies in the cloud, makes much more sense. It's much better, because you have a lot of sensitive data and you don't want to give that away, right? So here are some advantages of using open source AI. First, transparency of the data: the data that you use for training or fine-tuning is transparent. With a proprietary model, you don't know what has been used to train or fine-tune it, right? There's also the question of independence. You're not dependent on a company making changes to those models; if you want to fine-tune a model, if you want to change its direction, with open source you can do that. Also security: it's much more secure, because you don't have to talk to a third party; you can run this on-prem or with privacy-enhancing technologies in the cloud. And finally, customization: you have total control over the data, the model, and the software, so it's much more customizable. There are other advantages as well. It allows for fast iteration and collaboration with a global community, and it's just more diverse, right? We see this happening at Hugging Face, with models being created week after week, and it's very difficult to keep up with all this innovation. And finally, another advantage: it allows for easier interoperability, because since it's open source, you can integrate it with your own systems.
It allows for cost reduction, especially at large scale. OpenAI's ChatGPT can be quite expensive if you're making several thousand API calls, but if you're running a model as part of your own infrastructure at large scale, it makes much more sense. And also performance, because you don't have the latency of transmitting data back and forth; you're running this close to your data, right? So these are some of the advantages of open source. I also found this interesting tweet by Emad Mostaque, the CEO of Stability AI, a company valued at four billion dollars. It says: for people asking "but how are open models going to make money for business?", imagine this: every government, every institution, every company, every one of them is going to have a model, an open model that's auditable, in the future, hopefully, right? This is like a trillion-dollar market. How can we enable that? It has huge potential, right? So basically, as a conclusion, comparing proprietary models and open models, open models offer a whole bunch of advantages. There's no data leakage; you have privacy and security. You can customize everything; it's not restrictive. It's not a black box; it's more transparent. There's no lock-in. It's much more heterogeneous, more diverse; it's not homogeneous, right? So open source offers a lot of advantages. And as part of the OSI, we really believe in open source. There has been a lot of talk about open source AI and what exactly that is, and we are working to actually define what open source AI is. What's interesting is that the OSI by itself does not define what open source software is; it's a community consensus. What the OSI does is bring together a whole bunch of communities of developers and researchers to come up with a definition of what open source software is. And we're doing the same process for open source AI: we're bringing together several communities.
We're working with nonprofits like the Linux Foundation and Wikimedia and a whole bunch of other organizations, trying to come up with a definition. And why is that important? When we have a clear definition, it allows companies and organizations to clearly understand how they can use those open source models. We need that. Just as open source software has enabled companies and organizations to benefit from open source, we're trying to do the same for open source AI. So let me try to open a tab here. Can I open a tab? Yeah, yeah. All right. I think it moves around with me. Let me see. It's weird, right? Oh, wonderful. But you won't be able to see it. Do you just bring up a URL and then move the screen? Yeah, that's all. You can just bring up a URL and move to the right. All right. So this is the URL, opensource.org/deepdive, where we're working together with the community to create a definition of what open source AI actually is. And this has been pretty exciting. This started in June: we had a meeting at the Mozilla headquarters where we gathered about 15 people to start those discussions. Later we had a community review in Portland, at FOSSY. We were also part of the Linux Foundation Open Source Congress in Geneva, in Switzerland. Next week we're going to be holding the Deep Dive webinars online, which is very exciting. We're here in Bilbao to have those discussions and to review the definition. And next month we're going to be at All Things Open in North Carolina to present the first release candidate of this definition of open source AI. After we have this release candidate, we're going to put it online and allow people to comment on those changes. And hopefully by 2024 we'll have a clear definition, so that it will help everyone have a clear understanding of what open source AI is. We have this webinar series happening next week, after Bilbao.
And we have a whole bunch of very interesting speakers, from the Alan Turing Institute and other organizations. Everyone's welcome to join. So that's it. I'm open to questions, and thank you so much for being part of this. Just a quick question: you said there's a community review happening in Bilbao. Where is that happening? So, right here. I'm hoping to have those discussions, and I'm going to show a draft of this definition of open source AI. I hope to connect with each one of you one-on-one to have those discussions, so feel free to reach out to me. Thanks. Is the draft at that link, on the opensource.org Deep Dive? It's not there yet. We have a small group that we're working with, and we're going to have the official release candidate at All Things Open. But I'll be happy to share it with you. Okay, so you can share it privately, one-on-one, for feedback. Yeah. This is David. My question is this: since large language models today are very expensive to train, I'm wondering how an open source large language model can be sustained against the big proprietary ones, like Llama and things beyond that. Basically, how relevant can an open source large language model be, in terms of capability, compared to the proprietary ones, given how expensive those training resources are? That's my question. And secondly, the licenses of the available models seem to be changing to non-commercial licenses; you can see that happening on Hugging Face. So if training from scratch is very expensive, and deriving from those models also becomes very restricted, how do you foresee open source large language models heading down the road? Yes, that's a very good question. Indeed, it's very expensive to create a foundational model.
But we do have some organizations, like LAION or EleutherAI, which we are in contact with, that have support from governments or other sponsors who give them access to GPUs for training, right? So it is a challenge, but the community is finding a way, basically finding a way. We're also seeing that larger does not always mean better. We're seeing smaller large language models, or what we call small language models, that are just as good, or sometimes even better, than those very large foundational models, because they're very focused on an area, like for example healthcare or finance. The quality of the data is really important, and when you focus like that, it's cheaper to create those models, and sometimes they're better as well. Regarding your second question: yes, we do have those challenges, and this is something that we have to help define and help understand around the licenses, and how we can make things more open while respecting all the other constraints, like copyright and privacy. So this is an open challenge that we have to work out. Yeah. Any questions? I have a question, Nick. You compared proprietary models and open source models, right? And the hypothesis is that municipalities, hospitals will have these open source models, and you say the advantage is no data leakage. But if you deploy a model on-prem and you have multiple users, there can still be leakage between the users, right? And what I'm also wondering about, and I think it's a hard problem, is what your thoughts are on this. Let's say a hospital trains a model, right? For their use cases, obviously, they will have patient data. So you cannot have all of that model be open source and available. It may be based on open source technologies,
it may run on open source tools, but I don't know if you want this model to be available for inspection by anyone who wants to see it. So how do we define this? Because the model now has chunks of data in it, right? What's the community thinking, given what you started in June? How do we have an open source model which encapsulates sensitive data? How do you distinguish between the sensitive data and the model, and what does it mean to be open source if you cannot give this data to anybody who asks for it? Yes, so this is one of the limitations of large language models, and we're seeing more and more a combination of large language models with knowledge graphs and RAG or prompt engineering, to allow for better access control, right? When you have very sensitive data, you probably don't want to have that as part of the model. You want to have it either in a knowledge graph or somewhere where you can have better control over this data. That's the direction we're seeing things go. Yeah. Thank you. Any other questions? Tim, do you see anything on the live stream? I think we have a few people on the live stream, and I didn't see questions yet. Okay, great. Thank you, Nick. All right, thank you so much. Thanks.