All right, thanks. Thanks for taking the time out of a busy schedule. We're going to be talking a little bit about what AI is doing with open source and software, and how that's impacting us. I want to start from the fact that AI systems are being deployed at scale. They're making decisions at scale today. They're already out in the wild, out of the research labs, and they're affecting life in surprising ways, at least for me; some of the technologists in the room may not be surprised. I looked at a story reported recently by the New York Times, where a father sent a picture of his child to his doctor through a Google application on his phone, and the child was naked. The AI system automatically detected child nudity and flagged it as abuse, and all hell broke loose. The AI made this father's life hell. The New York Times wrote a story about it, and the headline is fantastic. But the way I read it is this: there is a bug, clearly a bug, in the AI system that Google deployed, and it's a very complex one. There are definitely problems in how Google came to the decision of flagging and suspending this father's account. But what interests me more is what happened inside the AI system that led to that conclusion.

To make it a better story, and just for us, let's imagine that Google is out of the picture, and that no humans were responsible for the review process that flagged and blocked the father's account. Let's imagine there was a meeting where someone decided: oh, this is a great idea, we have this fantastic algorithm that detects child nudity with such precision that we can really let it rip and make it autonomous. So this AI is deployed, it locks accounts automatically, it sends notifications to the police saying: hey, here's a child abuser, we got him, go get him. Essentially, this imaginary corporation has unleashed a very fast robot that makes judgment calls involving the police, and gets those calls wrong. So there is a bug, but how do we fix it as a society? What do we need in order to get this corrected?

If this were ordinary software, when there's a bug in code that humans have written, compiled, and deployed on a machine, we have an answer. We know what to tell people. We have a framework that says open source is a better way to write and deploy software; we've been saying for over 20 years that open source is inherently better than proprietary software. We know how to balance the rights of developers with the rights of users. We have frameworks, licenses, a legal understanding of what needs to be done for software itself. But when it comes to AI, do we have the same understanding of what we need to fix a bug? In reality, many of the pieces that make up an AI system don't fall squarely within the Open Source Definition: things like the original training data, for example. Where is the boundary between data and software? And many of these AIs, and I'm using the term in a very generous and wide way, are black boxes. We don't understand what's happening inside, how they come to a conclusion.
We cannot really inspect the networks inside that make the judgment calls. So if we want to apply the same open source principles to these black boxes, and to the case of the father caught in this decision-making hell, what should he demand? "Dear Google, send me the source code"? But what is the source code in this case? Or how about the police? The police are now on the receiving end of lots of these decisions, lots of signals saying: go get this person, he's a child abuser. But they also come to recognize that 80% of these calls are bogus. So what should the police tell this hypothetical Google? How are you going to fix the bug in your system? You need to do this, you need to do that. What are those conclusions?

So we need to take a step back and understand what goes into automated decision making. I like to think about pictures, because I'm an architect, not really a developer. In the old days, when you snapped a picture, you uploaded it to your own computer and had to spend a lot of time cataloging it, adding metadata like tags, trying to remember whether you were at some place or at someone's birthday. Software like GNOME Shotwell, in this case, has no knowledge of what's inside the picture. But we know very precisely what went into building Shotwell: we have a list of dependencies, we have the licenses, we know the rights we have for each of the packages that go into it. If we find bugs, we know who to blame, we know how to fix them, we know where to file an issue. We know everything, basically. But the modern picture managers like Google Photos or Apple Photos see the pictures: they have a way to recognize friends, places, family members, and they tell pizza from pasta very reliably. There is something in that black box.

So I have a little clip here from a documentary... Okay, that was an episode of Silicon Valley, in case you didn't recognize it. It's a TV show; if you haven't watched it, I highly recommend it. It really depicts the life of a startup in Silicon Valley. There are two important points in that clip. First, the core tech needs to be fed a lot of data, a lot of images and labels, in order to do its job. Building the machine that recognizes hot dog versus not hot dog was simple; the rest of it is hard. The other important thing in that episode is that creating the data set, collecting all those images, is boring and expensive. So who wants to do that? Who has the capacity to do that? I think this is the answer: large corporations have that capacity. The web search engines scrape the internet all the time; they have a tremendous head start. They've been collecting our data for a long time. Think of all the pictures you've been uploading since before Facebook existed, since the days of MySpace; I've lost track of all the pictures I've put online. And nowadays mobile phones are basically sensors out in the world, doing this work automatically all the time. So when we look at the pipeline that generates AI systems, we have data coming in at one end and decisions coming out at the other end.
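To make that pipeline concrete, here is a minimal sketch, not from the talk, of what "feeding the core tech a lot of images and labels" means in practice: a toy supervised training loop for a hot dog / not hot dog classifier. The folder layout, model choice, and hyperparameters are all illustrative assumptions, not anything the show or any real product actually uses.

```python
# Minimal sketch (illustrative, not from the talk): supervised training of a
# tiny binary image classifier. The expensive part is the labeled data, i.e.
# the hypothetical folders data/train/hotdog/ and data/train/not_hotdog/.
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms, models

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

train_set = datasets.ImageFolder("data/train", transform=transform)  # labels come from folder names
loader = DataLoader(train_set, batch_size=32, shuffle=True)

model = models.resnet18(weights=None)              # small off-the-shelf network
model.fc = nn.Linear(model.fc.in_features, 2)      # two classes: hot dog / not hot dog

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(5):
    for images, labels in loader:                  # every image needs a human-provided label
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        optimizer.step()
```

The point of the sketch is the dependency the talk highlights: without the boring, expensive work of collecting and labeling the images, the training loop has nothing to learn from.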
If something goes wrong in this pipeline, someone goes to jail, or gets denied a loan, or gets denied something else. These systems are deployed at scale, making lots of fast decisions all the time. If something is wrong, the developers carry a tremendous responsibility, and that's one of the reasons why governments around the world are both excited and very concerned about the general availability of these tools. So going back to the example of the father, the police, and society: how do we protect each other, and how do we ask for a fair AI? Probably we need access to all of these elements in order to understand and fix the AI. But we also need the hardware to rebuild it and to test that the issue is actually fixed. Is that all? Is that enough? I don't know. But it looks to me like we are at least at the starting point of understanding what the pieces are.

The main issue is that as a society we don't really have guiding principles, guardrails, to orient and drive the conversation, the way the GNU Manifesto and later the Open Source Definition oriented the conversations about how citizens interact with governments and with each other through the digital space. Everyone is moving at their own pace: corporations with their own self-interest, governments with their own interests, academia at its own pace. So what about us? What about the open source communities? What are we doing? These are the questions that led the Open Source Initiative to launch Deep Dive: AI, a three-part event to uncover the peculiarities of AI systems and start establishing those boundaries, or at least start to understand where we'll have to put the guardrails to orient the conversation and to define open source in the context of AI.

Here is what we've discovered so far. The first thing is that a model, the output of the training process, may not be copyrightable; it's not covered by copyright, at least not in the US. And yet the developers, researchers, and corporations that share models publicly have been putting open source software licenses on them, in other words licenses that were meant for open source software. Is that the right thing to do or not? The other funny thing is that the output of the model itself is not covered by copyright either. That raises a very interesting question, especially with the release of the new Stable Diffusion models. Stable Diffusion has been trained on copyrighted data. So when you ask it, as in this case, to create a picture of Mickey Mouse going to the US Congress, here's what it spits out; it looks exactly like Mickey Mouse. That image is not covered by copyright. But I bet you that the moment someone publishes a comic book with these pictures, Disney will not let it pass. So we'll have a test case. It's a peculiarity of AI.

The other thing we learned is that the European Union is already legislating on AI. The AI Act is a very interesting read. It's still a draft and will probably take a while before it gets approved, I don't know how long. It's based on risk, and it has already identified two uses of AI that have to be banned outright because they're just too risky. The first one is systems that rely on subliminal messages.
And I can see in some of your eyes the question: what subliminal messages? My interpretation is that the legislators are concerned about tools that may be able to steer votes, and I think they've seen some recent examples of that happening. The other prohibited use is real-time face recognition, or biometric recognition more generally, except for public safety uses. Here I think the concern might be about walking around, being recognized by a corporation, being associated with an online profile, and automatically getting disturbing notifications on your phone to buy something, probably.

Then there is a set of other AI systems that the Act classifies as high risk, things like autopilots for cars. Here it's interesting to read the legislation, because it requires a lot of testing and validation, and it stresses concepts that are still rudimentary even in the research field. So the feeling is: it's high risk, we recognize it, we need to test it, we need to make sure these things work. But at the same time, not even the research community knows exactly how to test and validate them and make sure they work. So it's an interesting space to watch.

One more thing we learned is that the large data sets that are publicly available, the most recent and visible ones, are taken from the web. They're collected by massively scraping publicly available data, and they're made available to the public under licenses that are a mishmash of everything you find on the web. Imagine the trouble you got into when you downloaded a piece of software, put it into your own software, and someone told you: oh no, you cannot do that, you need to read the license first. Well, now we have machines assembling petabytes of images of dubious provenance. Dubious not only for the legal rights associated with the use of these images, code, and text, but also for the content itself, which includes porn, harassment, racism, all sorts of awful stuff that you find on the web if you're not careful with your filters. And it's data produced by a wealthier and skewed slice of the world's population. So there are many, many issues.

Another very interesting thing came up when I spoke with one of the developers of the hacker community called EleutherAI; there's a podcast episode about it. It's really about the damage that AI can do once unleashed in the world: it can create real problems very, very quickly. Deepfakes were one of the tools we talked about, where you impose the face of one person onto the body of someone else. Many tools are capable of creating realistic fake porn, putting a famous actress's face on a porn actress's body. The quality of these tools, faking voices, faking environments, simulating all sorts of things, is becoming very hard to distinguish from reality, sometimes tricking even trained and skilled professionals.

The other one is the stop button problem, which I was not aware of. There are machines that can be trained to win a game, and they can become so focused on the fact that they need to win that they resist being stopped.
Maybe today that's a science fiction scenario, but if you think of this kind of AI, or of other science fiction AIs that hide or replicate their code in order not to be stopped, this is mathematically a problem that exists, and the research community is aware of it.

The final piece of the puzzle is the hardware. To build these big AI models, and "big" really means big, gigabytes and petabytes of text, which are very hard to put together, and even more for images and video, petabytes and petabytes of data, you need expensive hardware that is not easy to put together. And for this hardware there is no real open source stack. For example, to put together a cluster of parallel GPUs, the leading platform is Nvidia with the CUDA system, and large pieces of CUDA are not open source. There are some newcomers like AMD and Intel; they're doing a good job and releasing open source pieces, but the ecosystem is not mature yet. Eventually I'm sure this piece will be solved, because we've seen the push to open source the lower layers of the stack, down to the hardware, make progress very effectively in the past. But I'm still concerned about the rest of it: the hardware itself is expensive, it's not in the hands of individual hackers, and the knowledge to set up these large clusters is rare.

And I think these are the real issues we find in AI. The amount of data required: finding it, curating it, making sure it carries no bias. The legal conditions for using this raw data are not clear. The models themselves: what's in a model, what goes into it, how do we inspect it, how do we fix it, how do we realize there are issues before we deploy it, how do we stop it in case it gets out of control. The knowledge necessary to set up these clusters and fix these models by retraining. The hardware, which is expensive and relies on a proprietary ecosystem. But also how AI systems are being deployed: they are full of big promises and disappointing results; think of the feature called Autopilot inside one famous car. And the thing that is also not clear to me is the social norms of the AI community. There doesn't seem to be a shared or unanimous understanding of what is acceptable behavior when it comes to scraping the web for images or text, for example. And what is stopping a research group from publishing a paper that describes something more efficient but also has, clearly or not so clearly, a very dangerous use down the line?

I could stop the presentation here, but I'm assuming that many of you want to hear about Copilot, which has been a big topic of conversation for many of you. Full disclosure: GitHub is a sponsor of the OSI and the top sponsor of Deep Dive: AI, the research we're doing. That does not prevent me from speaking badly about Copilot or anything else; I can do whatever I want. The Open Source Initiative is a public charity; we work in the interest of the public, and if we don't, the tax authority of the United States removes our status. So we're very careful: we thank the sponsors, and then we remind them that we work for the public.
So, that said, and I've said it before, having accepted the money from GitHub: I don't care about Copilot. I really don't. It's a piece of the larger picture we've been talking about. Copilot is only interesting to me as part of that larger conversation. The other reason I don't care that much is that there are a lot of other groups looking at Copilot from the functional perspective. Does it work? Does it work well? Does it fulfill the promise? Is it doing a good job, Copilot and the other code-generating tools? There are many others that work in the same fashion. Does it save time for developers? That kind of stuff. Those details are important to me, but they're small compared to the larger picture of AI.

At the OSI we've been thinking about how new technologies affect the world we live in, and how they shape our perception of how our principles apply to them. I want to remind everyone that when the Open Source Definition, and before it the GNU Manifesto, was written, software was a very new thing. In the 80s, software was appearing for the first time as an artifact that could be sold separately from hardware. And it was that pull from the research labs, the MIT AI Lab, funnily enough, that was privatizing what had been shared knowledge inside the lab. The software was being taken away and privatized into corporations. That's what triggered Richard Stallman to think about the moral imperative of keeping that knowledge in the commons, in the science labs and in society, so that we could move faster and accelerate the evolution of computer science, which was a nascent field. He wrote the GNU Manifesto well before the GNU GPL license was written, at least four or five years before. We are at the same place in history now, if you want, with AI. AI is being pulled out of the labs. It's coming to fruition. Some people, some corporations, are privatizing it and immediately deploying it. Something is wrong there, for the reasons I explained: the bias in the tools, the lack of clarity, the lack of transparency. We need to know, we need to understand, how and whether the principles of open source that were codified 25 years ago can be applied to the new field of AI. And that's my presentation. I want to hear from you if you have any questions. I'll repeat the questions.

I'm from the Netherlands, and maybe you've seen the news, but we had some issues with systems ruining people, simply because of the way they were programmed. I can see the same happening with AI. What I noticed is that when an issue is known, it's very hard to switch the system off, because it fills a certain role in the process, and if you remove it, it's not easily replaced. So there's a tendency to keep the system alive until there's something else. How do you see that as an issue?

Okay, so the question is how to replace systems that are buggy, whether they're software or AI or something else. That comes from social norms, in my opinion. What's acceptable to society comes before writing the laws, and then the laws and the legal licenses follow, for example.
So, you know, Dutch society has been accepting a lot of broken systems, but like any other society, you accept a broken system because in the end it brings you some other benefit, or because removing it would create more damage than keeping something that runs, even if it runs badly. So the answer, to me, lies in having conversations with society, assessing the benefits and the doubts, and trying to fix them. It probably goes beyond the technology, that's what I'm trying to say. Yes?

I have a question about licensing. Stable Diffusion was released under an interesting license, which is kind of open source but also tries to prevent harmful applications of the model. What do you think about this licensing approach? I know we would refuse it as an open source license for code, but how do you feel about it?

Yeah, that's an interesting topic. So the question is what my thoughts are on the Responsible AI License, which has been used to distribute the components of Stable Diffusion, one of the famous newer large models. I think it's an interesting experiment, and I look at it with a lot of interest. In fact, one of the authors and main proponents of RAIL, the Responsible AI License, is going to be a guest in the Deep Dive: AI panel we'll have next month. The reason I look at it with interest, and the reason I replied on Twitter not to judge this license through the open source lens, is that that lens is wrong here: the RAIL license applies to things that are not software, and we really don't have a framework yet to judge the openness of such a license. We need to understand it first. That said, the content of the license basically puts a bunch of limitations on downstream use. I think they may be onto something, to be honest, because they know the damage that this system can do. I was not in the room when they decided to release it, and they decided to release everything: the weights, the training sets, everything. The fact that it's out there for anyone with the knowledge to replicate it could create problems. It's like having deepfake software that is completely open source with no boundaries. And at least the deepfake software community has made it clear, socially, that they will not support anyone who comes to their forums asking for help creating fake porn. So in the RAIL license there is something: they know that their systems can be used for damaging purposes, and they want to project the idea that doing good needs to be embedded in the heads of the people doing this research and doing this work. So I find it fascinating. I'm hoping that by the end of the panel discussions next month, we will have a better shared understanding between the classical open source software community, our principles, our needs, why we say in the Open Source Definition that you cannot restrict users, that you cannot restrict who the users are, that you cannot discriminate. Those clauses are in there for good reasons, because they have enabled this ecosystem.
If we remove those, we know we're going to put up barriers to the low-friction sharing of knowledge and information in the form of source code. Do we want the same speed, do we need the same effects as a society, for AI systems? I think we need to have a larger and deeper conversation before we say on Twitter: this is not open source, I don't want to deal with it, don't use it. I don't think we're ready for that.

About issues, bugs, and errors in AI: look at humans. We accept errors, and we look at how we can appeal those errors, then go to trial, have a conversation, and resolve the issue. Are there also movements or conversations in the legal arena to create a framework in which an AI becomes an entity we can appeal to?

That's a very interesting point: whether an AI can be treated like a human, so fallible, and when it fails, you go to court or find some other way of appealing and fixing things afterwards. I haven't seen legislation going in that direction, but there are lots of conversations with ethicists and philosophers of different disciplines looking at this issue of how to treat an AI as an entity, because it will soon become an issue with insurance. There are robots rolling around the sidewalks of San Francisco delivering food; those are autonomous. If you trip over one and fall, whose fault is it? Self-driving cars are another big deal: if they get into an accident, who are you going to sue, who is going to take responsibility? Right now corporations get around this by putting a human on remote control, or maybe physically behind the little cart on the sidewalk, because that will be the agent responsible if anything happens. But in the future we may have to have a larger conversation. Again, this is the very beginning of something new, and if it keeps going at the speed it's been going, we'll have bigger problems, and we need to have these conversations now.

So the question is around fairness: is there any evidence that models developed in open, collaborative environments are better? Honestly, I don't know. I'm not aware of any. I might assume so, but that's one of the things I would like to find out. If it's true, then it would definitely become a recommendation of the Open Source Initiative. One of the guardrails we want to put in place is exactly that: models need to be trained on fair data, because if you have unfair data, the model will be unfair, there's no question about it. But how do you measure "fair"? There is research going into measuring fairness, measuring the amount of bias, technically. I don't know how to do it; it could be snake oil again. But the researchers and practitioners I've talked to describe it as a problem that can be fixed.

Any more questions? Yeah. So the question is whether the model applied today to judge the fairness of AI results relies on humans, and whether that's a scalable model or something that can be automated afterwards. I'm hoping we'll get to the point where we can safely use AI, and by safely I mean that we have predictable, even deterministic, results that can be tested, so that in the pipeline, when you do the testing, you actually know, and you know how to investigate.
And I think that eventually we might get there, because I've read that there are ways to inspect a deep neural network, understand the math inside it, and change its mind to some extent. If the system has been trained to know that the Eiffel Tower is in Paris, researchers have demonstrated that they can change the math inside the network to make it respond that the Eiffel Tower is in Rome (there's a rough sketch of this idea at the end of this transcript). So if we get to the point where that inspectability exists, and the tools for fixing the inside of the model directly are advanced enough, then we may not need the human check. The black box effect will disappear, and the tools used to inspect the model will be reused for that.

Can you repeat the question? I'm not sure I understand. Ah, it's not really a question, just a comment. So the comment is: if we know what's happening inside the black box, then we can fix it, and we have instruments and tools to avoid evil uses. Oh yes, of course. And yes, you can also manipulate it from the outside; then we get into different scenarios, and it's the same as with software: confidential computing, the cloud, and so on.

All right, I've got one minute. Anything from remote? Oh, about the collection of large sets of biometric data. Again, it's such a big field, and these AI systems require lots of data. Collecting data is not something the Open Source Initiative has ever looked at; the data realm belongs more to Creative Commons, Open Knowledge, and other groups and associations. But this is a place where we will be colliding and bringing together a lot of different worlds. That QR code is for the panel discussions. In those panels we will have people from Creative Commons, the EFF, the Wikimedia Foundation, the Mozilla Foundation, and many other groups that have been in the field of open source, open knowledge, and open data, in order to build together a conversation around understanding AI and putting in place guidelines that are safe and good for society, the same way we've done with open source. So thanks everyone.
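As promised above, here is a rough sketch of the "change the math inside the network" idea. This is my own illustration, not code from the talk or from any specific paper: a rank-one update to a single weight matrix that changes what the layer returns for one key vector (a stand-in for the internal representation of "Eiffel Tower") while leaving most other inputs largely untouched. All names and dimensions are made up.

```python
# Illustrative sketch (assumption, not a real model): a rank-one "edit"
# applied directly to one weight matrix of a network.
import torch

torch.manual_seed(0)
d = 16
W = torch.randn(d, d)            # stand-in for a layer mapping key -> value

key = torch.randn(d)             # hypothetical representation of "Eiffel Tower"
old_value = W @ key              # what the layer currently returns ("in Paris")
new_value = torch.randn(d)       # the answer we want instead ("in Rome")

# Rank-one update: after the edit, W_edited @ key equals new_value exactly,
# while inputs nearly orthogonal to `key` are barely affected.
delta = torch.outer(new_value - old_value, key) / (key @ key)
W_edited = W + delta

assert torch.allclose(W_edited @ key, new_value, atol=1e-5)
```

Real model-editing research works on much larger networks and first has to locate which layer stores the "fact", but the core trick of nudging weights directly, rather than retraining from scratch, is the same idea.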