Good morning, SCaLE, and welcome to the Open Source AI and Applied Science track at SCaLE 21x. Our next talk is "What You Need to Know About Responsible Innovation and AI: Ethics for Open Source Projects," presented by Daphne Muller. Daphne Muller is a manager of alliances, ecosystems, and support at Nextcloud. Please give a round of applause for Daphne Muller.

Who in this room is concerned about the ethical risks of AI? That's almost everyone, except one person who is still sleeping. A year ago, the ChatGPT craze took off, and suddenly many open source projects are wondering whether they should join the AI hype or not. That's often an elephant in the room, so I'm pleased to talk about it for a full hour today. We might want to ask: what exactly are the ethical problems with AI? Is there a way to do AI in a fully good way? And can we deliver better products than Big Tech has to offer today?

Good morning. My name is Daphne Muller. I work at Nextcloud, where I'm responsible for many different things, such as customer support, the app ecosystem, community and developer relations, and software partnerships. I also manage a team of eight developers whose primary responsibility is the development of AI features. As you might know, Nextcloud is an alternative to Microsoft 365 or Google Workspace, but fully on-premise and open source, so no data leaves your servers. It's no longer only a platform where you can synchronize your files; it also has groupware solutions, such as calendar, contacts, and mail, as well as office solutions and video conferencing. It's not hard to imagine how AI solutions would fit into that space, but should we be doing it?

In this talk, we will try to answer three key questions, and some of these are really big. The first question is: do technologies like big data or artificial intelligence bring anything positive to humanity? Can we speak of human progress if we innovate in these areas? What are the different problems and concerns, and what are the ways to deal with them? This leads to the second question: can we do AI in a good way? What does that mean, and who does this technology benefit the most? And lastly, we will look at how the European Union and companies like Nextcloud are responding to the AI hype, so hopefully you can take some of that home for your own projects.

My interest in responsible innovation started back at university, when I was doing my master's degree at the Eindhoven University of Technology in the Netherlands. I was studying interaction design with a focus on Internet of Things products, and the university's goal was to make sure that we create social progress, or solve societal problems, through the design of these smart products, services, and systems. But is that even possible or realistic? I found out the answer during one of the projects I did in my master's. It was a project for a real-life client in the health industry. They provided health services for corporates, so it was about the health of employees. The design brief was the following: please design us a service that reduces burnout among office workers, because burnout rates are increasing every year. It has to be a service that connects to an employee's smartwatch.
And then it uses AI to give personalized health advice to these office workers. So, innocent as we were, we went through a normal design process where you do user research, you make prototypes, and so forth. But the elephant in the room is: can a smartwatch reduce stress at the office? And do employees even want their boss to have access to their health metrics? At the end of the project, we had one last conversation with the customer, and we said: listen, if you are genuinely motivated to reduce burnout among office workers, you would have to look at how to fix toxic management instead of smartwatches. This is where I learned that technology is often applied to fields it has nothing to do with, and it's the wrong application.

I also looked at more controversial topics, like crime prevention. Everywhere around the world, it seems widely believed that more surveillance leads to a safer society. Of course, a safer society is a noble goal, but is surveillance the solution? For example, in 2019 I got my hands on a secret vision report of the Dutch police. The goal of the report was to show a vision of how the police would handle increasing crime rates and how we are going to make sure that we live in a safe world in the future. But the entire report was about technology only. It was not about poverty or discrimination or other social issues in the country. The entire report was about how they wanted to create a full network of sensors and algorithms to predict crimes even before the crimes are committed. What could possibly go wrong? It's just like the film Minority Report, where Tom Cruise has to run because of a crime he may commit in the future. And also, my country is not good at doing IT projects on a tight budget. For example, a very simple application like the corona notification app already cost 23 million euros, so I can only begin to imagine the budget for this project.

If I had access to such a budget, I would probably use it for different things. For example, in Norway, prisons have been redesigned. They treat inmates well and give them job training, with the goal of ensuring that they have a good life after prison. And this helped to reduce the recidivism rate drastically: it dropped by 43%. Another project that I would invest in was done in Denmark, where the city of Aarhus managed to reduce the number of traveling jihadists by 97%, which is almost everyone. In their research, they published how they criticized the German and the French model, which is to put heavy punishments on terrorism and increase surveillance. Instead, they say that we have to understand that the root cause of these problems is not a lack of cameras, but things like discrimination, racism, inequality, poverty, and mental health problems. So what they did in the project was invest fully in strengthening the social network of vulnerable Muslims, and to good effect. Long story short: if we want to live in a saner world, we should invest in human sanity rather than artificial intelligence.

So what if you don't follow this advice? Of course, too many of these projects have already been implemented and are being used today. For example, the software from Northpointe, which is widely used by American judges to assign a risk score for how likely a criminal is to reoffend. And this will determine, for example, how long they will be in prison.
The goal of the software is quite noble, namely to remove bias from the judge's decision-making process. So how well is it doing exactly that? The journalism organization ProPublica published research on that. And unfortunately, I deleted the slide that... oh no, I didn't. This is the slide that shows the results of this research. In the small letters below the table, you can see that the overall accuracy rate of this software is 61%. That's not much better than random. You know that you are in trouble when a coin flip has better results than an AI software. And I have seen horoscopes with a better accuracy rate than this algorithm. So I'm not entirely sure that it's a good idea to use this software, especially because they also showed that it's racist: Black people more often get a bad risk score than white people. And it must be said that this is not unique to the United States, because in my country, in Europe, in recent years thousands of families have been wrongly accused of tax fraud because of discriminatory algorithms at our tax authorities. So this is the case everywhere around the globe.

And of course, it's not limited to this specific industry and this specific example. It's also the case for consumer-facing technologies, such as the software of Meta. Out of curiosity, who in this room is still using software from Meta, such as Facebook, WhatsApp, or Instagram? That's a large portion of people who are probably aware of how bad Meta is. Everybody has problems with how bad Meta is: it takes up all our time and it invades our privacy. But also, somehow, everyone is on it. What's going on here?

We have to remember that Facebook started out as a student project, with not the best intentions in mind: a student project to judge your classmates on their looks. So from the start, it was already a successful platform with arguably the wrong values. But somehow, Facebook grew into this worldwide platform where everybody could find each other, but also where elections could be manipulated, where Capitol attacks could be organized, and where genocides could be facilitated everywhere around the globe. And somehow we are still on it, in a sort of collective denial about such software. Of course, we all know that it's bad for privacy, but most people in my social circles still believe that if you simply don't publish too much on Facebook, then you will be fine. They believe that if you don't show everything, you can still be on Facebook.

So what kind of data does this platform actually collect? After the Cambridge Analytica scandal, ProPublica published a very nice piece of research on all the different advertising labels that Facebook assigned to users. And of course, I have this file with me and I would like to show it to you. So what is in there? Well, this is the CSV file that you can get from ProPublica. It's not just your favorite color or your shoe size. But it's taking a bit too long if I scroll like that, so we are speeding it up a little bit. At the bottom, we can see that there's a total of 29,000 different categories of people. And of course, on a rainy Saturday, I went through the entire list to find you the funniest ones. The first one is people in households that are heavy buyers of soup. The next one is people in households that are likely to buy a Mazda in the next 180 days. Users who did Ramadan in 2016. Children of lesbians and gays everywhere.
Information about ethnicity. Information about your job, which is for some people perhaps a sensitive topic. And information about your medical history. What could possibly go wrong? This obviously means that advertisers can easily discriminate, because if they have a job ad, or if they do housing rentals, they can easily exclude people who are Black. And it's not only the Facebook groups that you like yourself, not only the information that you actively give to Facebook; they also make these categories up themselves.

A little bit of insight into how this exactly works can be found in a university study where Facebook and Microsoft collaborated with the University of Cambridge. Doesn't Cambridge have anything better to do, I sometimes wonder? But anyway, they researched the correlation between your likes and your personal attributes. What came out of this important study? First of all, if you like Halloween, you are likely white. If you are a man and you like the actress Kathy Griffin, you are likely gay. And if you like curly fries, you have a high IQ. So if after this talk you don't quit Meta, then at least make sure that Meta knows you really adore curly fries.

But we have a problem, because all of this data needs to be stored somewhere. All the files that you have on the cloud, and all the images of Instagram and ChatGPT, need servers. And servers are called the cloud. The cloud sounds like a magic, fluffy place where the sun always shines, but in reality it's in rainy places like the Netherlands. For those of you who don't know where the Netherlands is, here's a map of Europe. And data centers are, for example, in Wieringermeer. Wieringermeer is, by our standards, very far from Amsterdam: a 45-minute drive by car and too far to go by bike, so nobody wants to live there. The perfect place to put wind parks for green energy. Here in the Wieringermeer is one of the largest wind parks ever built. It has 82 wind turbines, each 180 meters high, and they produce electricity for 370,000 households. But do you know how many of those households they actually deliver electricity to? Zero.

Let me explain. The Netherlands is a perfect place for Big Tech to put their servers, and there are different reasons for that. We almost never have war. We almost never have natural disasters. We have a relatively stable government. We have a large internet exchange point. And of course, we are a tax haven, which is very attractive. It's also important to protect your corporate image, because Big Tech consumes a lot of electricity, and they know that we also have climate change. So for a better world, or to protect their corporate image, they really enjoy having their servers next to wind parks. So even before a single wind turbine was built in the Wieringermeer, it was already clear that there would be no energy left for households, because the mayor was really proud of his deal. He was bragging in the local news that they flew all the way to Microsoft's headquarters in Seattle to do the negotiations. And in the local news, it was announced like this: they were thinking that the Wieringermeer could become the Silicon Valley of Europe. They also said that the Microsoft CEO Steve Ballmer would fly to this airport and visit the area, and of course Bill Gates too. Sure, Big Tech fanboy. I doubt that Bill Gates will visit Wieringermeer. And this is not uncommon; it happens in many different places in the Netherlands.
But of course the municipality is screwed, because they thought that the power would be for them. And all the Dutch citizens are screwed, because a lot of tax money went into this wind park, namely 660 million euros, and in the end nothing is left for the country. It also happened in Eemshaven, where Google bought land for data centers. And of course they did so in a way that they pay almost no taxes: they bought the land and directly sold it to a company in Luxembourg, so they have to pay almost no taxes. At Google, they are really good at googling for attractive tax avoidance strategies. And I have a little riddle for you: which of these people is the mayor who just signed the deal? I think it's the person with the Google t-shirt.

So what can we learn? I think the root cause of the support for surveillance capitalism is people like this. People think that Big Tech has a lot of prestige, that it's something you can be proud of, and it feeds into the narrative that Big Tech has a lot of opportunities for humanity, that the technology can solve all our problems, even if it doesn't.

It reminds me a little bit of my master's thesis, when I published my results on how we can do innovation without surveillance technologies. I made the case in my portfolio that it is possible to do innovation without doing surveillance, and that maybe we would be way better off. Unfortunately, one of the professors who had to assess me was not very happy with my work, because he had already been doing research into the wonderful opportunities of big data for decades. And he was a little bit sad, because he said that I was not giving these new technologies a good chance to prove themselves. He said, literally: you are throwing the baby out with the bathwater. And then I said: yeah, but whether it's a problem to throw the baby out with the bathwater depends on the baby. I mean, perhaps it's a very annoying baby. And I provoked it a bit, because I said: perhaps it's baby Hitler; then we have a different story. And then I made it even worse. I said: but of course, this is not a baby anymore. This is not an idea from last year; we have already been trying to implement this for decades. In reality, this is a very mature adult who is structurally violating human rights.

What we can see is that surveillance capitalism projects are often like terraforming Mars. Terraforming Mars is this idea that Elon Musk has, and we should all be excited about it: add an atmosphere to Mars so that humans can live there and colonize the planet. But if you have the power and the resources and the money to terraform Mars, then you also have the power and the resources and the money to make Earth a better place. And then everybody benefits from it, and we don't need to go to Mars anymore. And this is exactly the same with Big Tech AI and other surveillance technologies. If you have the power and the money to create an AI for crime prevention, you also have the power and the money to do something about the real problems. But unfortunately, because of people like the mayor in the Google t-shirt, AI has a big hype. So it's widely believed that we shouldn't step away from AI, that we shouldn't critically reject AI, but that instead we should find ways to do AI in a good way, whatever that may mean.

So who of you has seen this call to action of Elon Musk to stop AI development because of the risk that AI would take over the world?
I don't know what your perspective on it is, but most scholarly perspectives hold that these existential threats don't really exist, and that they rather take attention away from the real problems with AI, like bias, discrimination, carbon emissions, global power problems, and so forth. It also feeds too neatly into the narrative of Big Tech that AI is powerful and effective in the first place. Because if AI is powerful enough to take over the world, it could perhaps also be powerful enough, in the right hands, to solve our large problems like climate change, even though it's causing climate change, but nobody is discussing that. So Big Tech says: we understand the problem well, so please let us self-regulate the market, and we will take care of this and ensure that the technology you get will be good for you. And of course that's nonsense, we know that, but still this is the narrative that we all believe in.

In the scholarly world, there are mainly people who do understand the real risks of AI. This idea of the existential crisis is not discussed a lot there; instead, the idea is that perhaps we can design our way out of the problems. So they think: what if we give computer science students ethics classes and give them the right frameworks, so they can design values into their systems? I personally declined a PhD in this field, because I think it's too fluffy. The papers I read were way too abstract, and the professors I spoke with were not brave enough to say clearly what type of applications are not acceptable and what type of applications are fine. So I thought: this is like rearranging the deck chairs on the Titanic. It's interesting to do, but it won't actually change the course of the field. So I found it useless. Instead I joined Nextcloud, which was a good choice.

But then Nextcloud had a problem with AI too, because we got ChatGPT to deal with. So I was wondering: did anything change in that field? Can we learn from the work of these scholars in our applications at Nextcloud? So I sat down with a professor in information sciences from the Humboldt University in Berlin, and he did a large literature review of hundreds of papers in this field. The papers ranged from toolkits to frameworks to all kinds of design techniques and more. So I asked him: is there anything in there that I can use? And he shook his head, and then he said: we can go to slide 17, which was this slide. We can skip over all the analysis and the details; this is what you need to know. "All the papers are hopelessly high-level," he started. The details are just missing, and most of the papers have not been validated independently, so they have never been tried out in practice; we don't know if they actually work. Also, it's unclear, and there's no consensus, what's important or not: all the frameworks vary in scope, and all the ideas can be combined, so it's up to you what appeals to you or not. The frameworks mainly focus on Western views of what is ethical, even though AI is applied globally. And all the frameworks focus mostly on the design phase; but what about after the AI has been released on the market? In other words, frameworks for ethical AI were still about as useful as a diet plan for unicorns.

Then there is a development in Europe: the law, which is essentially also an ethical AI framework, a law about what is acceptable or not.
You've maybe heard of it: last Wednesday there was a vote on this, and the law got accepted. So now it will take a few years, and then enforcement will start. Of course, some of the details are still being worked out, but I can give you a sneak peek of what it will look like. The framework of the European Union has a risk-based approach. Applications with a high risk, like those in crime prevention, are subject to more regulation than low-risk applications, like photo manipulation software. The high-risk categories mainly cover products that already have product safety legislation in Europe, such as toys, cars, and medical devices. Some more application areas have been added to this, like interpretation of law, employment management, and border and immigration control. Then there are the low-risk areas, which are applications where users can actively control whether they want to use them or not. And the differences are in how much you have to comply with regulations. If you have a high-risk product, you are going to suffer: you need to write a large risk management report, and this needs to be approved by an official before you can put the product on the market. But if you have a low-risk product, like photo manipulation software, you don't have to do a lot. You only have to inform the user that they are using AI. Whether low risk is genuinely low risk for society could of course still be debated, because it also includes things like deepfake generators, which are in my opinion not low risk at all.

For applications like ChatGPT, there are some additional rules. For example, they need to publish a summary of the copyrighted data that was used to train the algorithm, and they need to ensure that generative AI doesn't create illegal content. For more powerful models like GPT-4, they are still in doubt, but likely they will fall under the high-risk category, which means that, again, a risk management system and an official stamp of approval are needed before they can go on the market.

There is critique of this law, obviously, and it's critique that I don't share. Of course it's a law that's far from perfect, but I'm proud that the European Union is making a law in the first place; my expectations in this field are not very high anymore. The first piece of feedback is that it would reduce or limit innovation. The professor who did the literature review also did research into how this act would influence European companies and European startups. For example, he worked with a robotics startup who make a package delivery robot, and it was a truly advanced robot, because it could also climb stairs. And what do you think: is this a high-risk or a low-risk product? Who thinks high risk, and who thinks low risk? Okay. It is, in my opinion, obviously high risk, because what's the difference between a package delivery robot and a car? But the creators of the robot thought that it was obviously low risk, and the argument they gave was: yeah, but we made sure that it doesn't run into humans, and it also has a large red stop button, although they advised people against pressing it, because the robot would immediately collapse on the floor. But either way, they thought: yeah, we are fine. And then the professor was like: okay, but does it run into pets? Will it run over someone's cat? Will it run into disabled people? It can climb stairs; can it get stuck in a fire escape? Come on, he said. He couldn't believe it.
So yes, some innovation will maybe take longer in Europe to get to market, but it will also stop innovation like dangerous robots, and I believe that's a good thing. Not all innovation is good. The second concern with the law is that it would be bad for open source developers. The argument goes that open source developers, who work for free and don't have financial means, would be forced into doing extensive risk management documentation and so forth. But in my opinion, in the current draft this is clearly off the table. Of course I'm not a lawyer, but my understanding is that it applies to companies who deploy the software on the market, and most open source developers who work for free don't deploy it on the market. And also, I believe most free and open source developers are not delivering products like self-driving cars or medical devices, so it's very unlikely that they will end up in the high-risk category. At least the products that we develop at Nextcloud are largely in the low-risk category. For example, we have things like text generation, image generation, and background blurring in video conferencing. And yes, we will have to make sure that we get the basics in place, but we don't need large financial resources for risk management systems.

The problem that I personally have most with the law is that it falsely assumes that low risk equals ethical. Even with low-risk applications like we have at Nextcloud, I can give you many ethical concerns. For example, I have tested our systems for bias, and it doesn't look good: bosses and doctors are structurally depicted as men, and nurses always come out as women. Also, I know that our software increases the electricity consumption of our servers by a lot. Then there's the problem that we don't have much of a clue what data has been used for training the models. It can well be copyrighted data, trash from the internet, or data from social media where the users haven't actually consented to it being used for training AI. Perhaps you have seen it recently in the news: there was a lot of controversy because Reddit, WordPress, and Tumblr were about to sell their user data for AI training purposes, but of course the users hadn't consented to that. Then there are more ethical problems, because Nextcloud's AI cannot be used by everyone in this world: there are expensive hardware requirements, and it only supports English, and of course not everybody in the world speaks English. And lastly, I know that our AI also generates false information, and even intelligent people don't know how to deal with that. I have caught a Nextcloud support engineer who used the AI for support, but of course ChatGPT has no clue about Nextcloud OCC commands, so the support reply to the customer was entirely false. And I've also heard of cases where academics create peer reviews of papers with ChatGPT, and the literature references didn't exist. Very embarrassing.

So the question in the room is: should Nextcloud be doing AI in the first place? I believe it's a bit of an elephant in the room, because obviously most people at Nextcloud understand really well that AI isn't fully compatible with our values, and that we will introduce some problems that we absolutely don't like. So a year ago we were in trouble, because ChatGPT became big, and I remember that Jules, me, Frank, and some developers gathered in a meeting room in Berlin to watch the announcement of Microsoft.
And of course it was all about AI, and Jules started to get a little bit nervous, because the question was: are we going to lose our position in the market as their competitor if we don't do any of that? And I answered: well, probably not all of the features that Microsoft presented today will become mainstream. But of course we realized during that discussion that some of them will be. Things like text generation or summarization could be useful, and also being able to ask questions about your own files, for example. So we can imagine that if there are genuinely useful applications, they will become mainstream, and our users and customers will start to expect them. So even though Nextcloud tries to be the good alternative to Microsoft, as long as we are their competitor, we are not immune to the trends that they start. And this is similar for many open source projects: we need to have a critical look at how to keep our relevance. Not everything is going to be successful, but some things will be.

So Nextcloud's management came up with its own ethical AI framework, called Nextcloud Ethical AI, and it comes with three core requirements. First, the code has to be open source, which includes the code used for training the model. Second, the model has to be freely available, so that you can run it on your own servers and you don't leak your data. And third, the training data has to be available, so you can study it, see if there's bias, for example, or copyright problems, and also potentially improve it. That is of course the core idea of open source: that you can study, modify, improve, and share the code.

Over the last year, my team tried to implement many features that come close to meeting these requirements, but you could already guess it: we are not succeeding in meeting all three in all our features, because it seems that the definition of what is open source is changing when we talk about AI. Most of the open source models do indeed have the model available, so you can run it on premise, but they lack, for example, the training data, or the algorithm that was used to train the model. So from our perspective, we can't actually speak of a fully open source solution, because there's no way you can fully study it or improve it. And this shows the limitation of our strategy. Nextcloud obviously doesn't want to become an AI company; we don't have the resources for that. So as long as we implement the work of the open source community, and the models are really good, we depend on how open source will take shape around AI in the upcoming years. And at the moment, it looks like most of the AI models in open source are marketed as open but are actually proprietary.

Also, one problem I have with this framework is that it falsely suggests that open source equals ethical, and that's of course not the case. Even with open source software, you can have a lot of biased outputs. Your energy consumption can skyrocket. The data can come from the trash of the internet. It can have language limitations, so not everybody can benefit from it, or hardware requirements that are too expensive for everybody, and it will continue to generate false information. So we have to be mindful that while giving more transparency is probably better than what Big Tech has to offer, we are not creating software that is entirely harm-free. Open source is not the same as ethical, and if you meet all these guidelines, that doesn't mean that it's fully ethical.
So given that this is such a controversial topic, and given that I am also responsible for our community and developer relations, in our recent community survey I decided to measure how much support we have from our community for our AI work. Let me show you some of the results. First I measured how much adoption of AI features there already is among our user base, and there we can see that the majority of the respondents say they are already using AI features. Then we asked how much they agree with our vision on AI. The first question I asked was: is it important for you that data doesn't leave your server and that you can run it on your own device? Almost everybody agreed. Then I asked: is it important for you that AI is open source? Again, almost everybody agreed. And almost everybody also agreed that AI applications can be useful. So that gives the impression that it would be important for our user base that a company or project like Nextcloud starts to develop these features. But the community disagrees. The community is extremely divided about whether Nextcloud should be doing AI, and they give a lot of different reasons for it. Some people answered that they feel Nextcloud was a solution to get away from Big Tech, and AI is a Big Tech invention, so it kind of contradicts that. Others are afraid that it will pull resources away from improving the existing product, and they reference the number of bugs on GitHub. And others fully agree and say that they trust Nextcloud with such an important task.

I must say that Nextcloud is unfortunately not immune to the AI hype. Sometimes we are also like the mayor in the Google t-shirt. Also at Nextcloud, we have been marketing AI features that are just simple regular expressions, features that could have been a regular expression. For example, we have a security feature with machine learning that can predict if a login is from the wrong place. But I could also just have a setting where I say that I don't want anyone to log in from a country where I don't live, and maybe edit that right before I travel. That doesn't need to be machine learning per se. And we have also been implementing AI features that I believe are not necessary. For example, let me ask you a question: how many of you are using the feature in your mail client that prioritizes your important emails? That's not many people; maybe it's just four people in the room. We have implemented this feature too, and I believe it's not necessary for the majority of our users. That means we implemented a resource-intensive feature that increases energy consumption but maybe doesn't return value for most people. And then it's the question whether you should be doing it in the first place, or whether it should be enabled by default.

So to make my colleagues a little bit more immune to the AI hype, I started talking about AI in the form of weasels, these cute animals. Every time we talk about AI, I replace AI with trained weasels and check if it still sounds like a good idea. You also have to keep in mind at all times that the weasels are racist and sexist until proven otherwise. So then you can ask questions like: is it a good idea to train weasels to tag my photos? Maybe unproblematic. Is it a good idea to train weasels to transcribe video calls? Maybe that's fine. Is it a good idea to train weasels to screen the CVs of job applicants? Obviously not. Remember, the weasels are racist and sexist.
Is it a good idea to train weasels to drive heavy cars through busy city streets? No, someone is gonna die.

Let's wrap up this talk. At the beginning, we asked ourselves three questions. Do big data and AI bring progress? Can we do AI in a good way? And how do the European Parliament and Nextcloud respond to the AI hype? For the first question, does it bring human progress: we have to reflect on whether technologies that fundamentally require the violation of human rights, like the right to privacy, the right to non-discrimination, or the right to democracy, can structurally contribute to human progress. I think we can answer that in most cases the answer is no, and that most of these projects are like terraforming Mars: we should be spending the resources elsewhere. Can we do AI in a good way? The only reason we are asking this question is because we believe in the AI hype, because we believe that AI is good. And as I just said, it's probably not. But if we have to answer the question anyway, because Microsoft is not listening to this talk, then I would say: no, there will always be ethical problems with your AI, but we can try to do a slightly better job than Microsoft and Google. We can try to be more transparent. We can be angry at OpenAI for not publishing data sets, and so forth. We can test our algorithms for bias and discrimination. We can look into energy consumption, and we can try not to use AI if it's not absolutely necessary. But AI will never be perfectly fine; I'm sorry to bring the bad news. How do the European Parliament and Nextcloud respond to the AI hype? The European Parliament focuses on the risk-based approach, similar to the weasels: they try to regulate the very dangerous applications. Nextcloud is responding to the AI hype by pushing for it to be fully open source. We have to ensure that we simply don't become like the mayor in the Google t-shirt. We have to stay a little bit immune to the AI hype, and critical rejection is also an option. Open source maybe doesn't have to solve all global power problems, but we can try to do a little bit better than Big Tech. Thank you.

Thank you so much for that presentation. We do have some time for questions. If you do have a question, feel free to raise your hand and I'll bring the microphone over, kind of in order.

There are organizations with whom Nextcloud would seem to be kind of spiritually and ethically aligned, like maybe the Wikimedia Foundation or Creative Commons, and perhaps it would be a waste of resources to work on AI separately in these organizations. So is there any thought of having allied organizations maybe work together to develop principles and some AI applications?

Yes, we've certainly considered that. Unfortunately, the field is still quite recent; I mean, this discussion only became relevant a year ago. So yes, we would be interested in collaborating on those aspects, so that we don't have to reinvent the wheel.

Any more questions? Feel free to raise your hand.

With legislation in mind, and the direction the EU is going, part of what I'm concerned about is potential bias when categorizing high risk or low risk. I was wondering if you have a particular opinion on preventing corruption from impacting what slips through the cracks in terms of low risk, or anything like that.

Yeah, unfortunately I'm not a lawyer, so I don't know exactly how they're going to prevent that.
But they will create a database of high-risk applications, and this database is already partially made up of all the applications that already have to comply with product safety legislation. So I'm not entirely sure what they plan against corruption, but that's the outline of the law.

All right, amazing talk. First, by the way, I think the weasels have already been trained to run for public office, and therefore I think they're driving cars, so that particular boat has passed. All right, so right now you're seeing the word AI used everywhere, right? I mean, there are bicycles with AI; I've been to CES: bicycles and AI, et cetera, et cetera. With the regulations that are passing, particularly what the EU wants, do you think it's going to lead to some of these companies backing off on the hype of using the word AI in toaster ovens and everything else? Or do you think this hype is going to continue for however long? And if a company is starting out, how do they survive without using the word AI to describe a piece of toast, for lack of better words?

So I'm not entirely sure whether they will simply be able to get away with renaming machine learning to something else. I have enough trust in the lawyers in Europe that they made the law in such a way that it will still work. I honestly don't think that the word AI will be removed from the vocabulary. Did I answer your question? No? Well, this is machine learning, sorry, thank you. Yeah, so how do we get away from the hype? I think it starts with education, which is what I'm trying to do here: making people aware that there is no way that AI is going to save humanity. It's rather causing problems. It's like the Northpointe software that I showed: the goal was to reduce bias, but it's increasing bias. Or, I sometimes argue with academics who claim that AI can solve climate change, and I say it's causing climate change. So we have to become a bit more realistic. I'm not sure if it's possible to stop the AI hype, but I'm doing my part.

Are there any more questions from the audience? All right, looks like we don't have any more questions for today. I'd like to let you know that this has been "What You Need to Know About Responsible Innovation and AI: Ethics for Open Source Projects" by Daphne Muller. Thank you. Coming up next is "Training AI on Your Own Data," and that talk will be starting at 1:45 today.

Hello everyone, welcome to SCaLE 21x, and I'm pleased to welcome you to the Open Source AI and Applied Science track. Today we have presenting for you Nuri Halperin, with "Training AI on Your Own Data." If you'd like to give a round of applause and welcome our speaker.

With a nice introduction. One PSA: if you want to read what's on screen, especially during the demo, closer is better. So, up to you. I promise I won't be smashing any pumpkins on the front row. My talk is about training AI on your own data. And the agenda I have for today is first to persuade you that there's no AI, only math, and that it's all in the process.
So as a software architect and a person interested in algorithms, I think this whole labeling that we've been doing on things, giving them nice marketable names, may have made them a little less accessible to those of us who actually want to work with them. So there will be a little myth-dispelling, but not too much. I'm going to focus on AI with your own data, maybe on your own laptop too, so that you can get started, or at least see how I implemented what I was going to implement.

The first thing to ask is: well, what are you going to implement? What do you want to do? RAG, as we will talk about it, is interested in solving this problem: a user asks a question, and then the system, in some magic way, using my own data (meaning it knows my stuff, not just general stuff in the world), using my language (meaning speaking in English and not in binary code, because I don't speak binary), responds like an expert, meaning I want it to give me an answer that I on my own didn't have or know. I want it to extend my ability; otherwise why am I asking it a question, right? And it should be somewhat intelligent. It should infer and understand me in a little more flexible way. I don't want to have to program it for everything I do. I want it to kind of get the gist and kind of roll with the punches, meaning it should be tolerant of humans being imperfect.

So in general, this is what the process is. There's user Bob, who asks the question: where did I put my trousers? And the application at some point wants to return an intelligent answer saying: you are wearing them on your head. This seems awfully silly, but Google can't do that. If you search in Google, it won't do that right now. And it doesn't do it because this is very specific knowledge that it wasn't trained on. The position of the trousers right now is different than where it was in the morning or last night. So that's the problem with training on a data set of any size, as intelligent as it may be: it's not your data, or it's not the data set that will drive the right answer for right now. That's the main motivation behind wanting to take and augment the information that my query is implying, beyond what the model or the backend system was actually trained on.

So why not a search engine? Well, search engines are keyword-based. Keyword-based means that they take your sentences, your text, and break it up into words, and then they try to do things like stemming or synonyms, and map those words so that horse and horses are the same, and radius, radii, and radix are kind of the same, and moose and meese are the same, or whatever they come up with. But what they're doing is taking a word and trying to find some equivalents; that's the extent to which they are tolerant of you. So if I say I want a hamburger, they'll have a synonym for hamburger, and maybe a synonym for want, like desire, or stuff like that. But if you just say "a hamburger would hit the spot," the whole "would hit the spot" will give you all kinds of other things, like Spot the dog, because hamburger is just a little piece of the sentence. So search engines, as great as they are, and I love Lucene and I love Solr and I love Elastic, are not quite cut out for this task. Plus, they suffer from the same issue, which is that they need to be fed very up-to-date information to answer your questions.
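To make that failure mode concrete, here is a minimal sketch of naive keyword matching, with a hand-rolled synonym table standing in for what engines like Lucene do with stemmers and synonym filters. The tables and sentences are made up for illustration; this is not any real engine's code.

```python
# A toy illustration of why keyword search is brittle:
# matching happens word by word, not by meaning.
SYNONYMS = {"hamburger": {"burger"}, "want": {"desire", "crave"}}

def keyword_match(query: str, document: str) -> bool:
    """True if any query word, or one of its listed synonyms, appears in the document."""
    doc_words = set(document.lower().split())
    for word in query.lower().split():
        if word in doc_words or SYNONYMS.get(word, set()) & doc_words:
            return True
    return False

print(keyword_match("I want a hamburger", "Best burger in Pasadena"))       # True: synonym hit
print(keyword_match("hamburger would hit the spot", "Spot is a good dog"))  # also True, but wrong:
# 'spot' matched the dog, not the craving; the sentence's meaning was never considered
```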
So if we look at the matrix of what we want from the system: with a search engine, the user can ask a question, yes, and you can even feed it your own data, because as new information comes in, you can put it into the system, have it re-index, and it'll be there. But it's not really using my own language: it's not getting English expression in a great way, because of the limitations of synonyms and stemming. And it doesn't respond like an expert: it doesn't enrich things from the vast knowledge beyond just that data, and it doesn't sound too intelligent. At best, it has existing information and it'll kick it back to you; that's the most you can hope for from it.

So what is RAG? It's an acronym: retrieval augmented generation. The R, retrieval, means that it's going to use your own data. The augmented essentially boils down to prompt augmentation, and we'll see that in the process. And the generation means that it gives you, from whatever it used, a generated response that is not a regurgitation of an existing item that was already there. That's about it. Are we good with that? Retrieval augmented generation. Very simple idea: take the stuff you have, munge it somehow into information that can be an instruction to the AI, to the model, to generate an answer, and then wait for the AI, for the model, to generate an answer for you.

So the process, in more gory detail, looks like this. For folks in the back of the room not listening to my advice to come into the spray zone: we're still asking, where did I put my trousers? The next step, though, is to take that text and tokenize it, break it apart, and give it to a pre-trained model that knows how to take those tokens and give you back some mathy vector. And I say mathy because there's a lot of math behind building it, but at the end of the day, it's an array of numbers that it gives back to you. So after we tokenize that question, we give it to a subsystem that knows how to make it into a vector. Those vectors are called embeddings. Why aren't they called something else? Yeah, bad marketing. In some places they're treated and given to you as tensor objects, sometimes as plain arrays, and they're all called embeddings, because those words all make total sense, and everybody off the street can understand that they are all one and the same.

Then you take that vector and you create a query into a system that knows how to take vectors and find similar vectors. This is part one of the RAG system, where the system is way more tolerant than a search engine. What effectively is happening here: a vector is a pointer in space. It could be n-dimensional; in two dimensions it's just an arrow, and the length of the arrow and the angle of the arrow have meaning. So the vector search knows how to search en masse: you have a billion documents, and it knows how to scour your own data and find the data by that digest of this little vector, and it does so efficiently and quickly. Have you played with AI, anyone? Yes, you type it in, and it goes tick, tick, tick, hold on, I'm typing. That's because there's a lot of processing happening behind the scenes. But that portion of the show can actually happen really quickly, and it can go over all of your company's documents, all of the knowledge base you accumulated on your own, all of whatever you gave it. If you index the text with those vectors, and we'll see how we index with embeddings, then that portion of the show becomes trivial.
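To make the tokenize-embed-search steps concrete, here is a minimal sketch of embedding a question and comparing it against candidate documents. It assumes the sentence-transformers library and the all-MiniLM-L6-v2 model; any embedding model would do, this one is just small enough for a laptop.

```python
# Text -> tokens -> embedding vector, then cosine similarity against candidate documents.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # tokenizes and embeds in one call

question = "Where did I put my trousers?"
documents = [
    "The trousers are on your head.",
    "Polar bears are losing sea ice.",
]

q_vec = model.encode(question)   # the 'mathy vector': a 384-dimensional array of numbers
d_vecs = model.encode(documents)

# Higher score means nearer in meaning, whatever 'near' means
for doc, score in zip(documents, util.cos_sim(q_vec, d_vecs)[0]):
    print(f"{float(score):.3f}  {doc}")
```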
So now, after your question got transformed into mathy, and that mathy went to a specialized engine that knew how to find documents related to it, where the top ones are the most related, now you have a subset of all knowledge that relates to your question. With that subset of knowledge I can then walk up to (oops, sorry, wrong button) ... with the results of my search on my data, I take those top results and I create a prompt for the engine, for the LLM engine, for the model that I'm going to use to generate the response. In my case, what I will do is take restaurant reviews, see the ones that match the query I originally had, like "where's a good burger joint" or something like that, and then with those top results I'm going to turn around and say: okay, people already reviewed those restaurants, these are relevant to my query, can you tell me where and why I should go? And then I hand over that prompt-engineered query: using the reviews, tell me where I want to go. If you just say "where should I go?", it's like: oh, what do you want? But if I already use the knowledge base I trust, which is the restaurant reviews that I decided are relevant, then it can give me a much better result. So that's the prompt engineering portion of the show, where I take some of the matched records that are real, and I give them to it and say: hey, these are real things, these are my criteria for you to continue and generate a response, please give me a response. And then the pre-trained model that knows how to take text and give you back human-like responses gives me a result, and that finally goes back to the human.

Dear diary, this is the RAG process in gory detail. This is it in English; we'll see it in Python, but this is it in English. Any questions so far? Yes sir.

I'm going to repeat the question for the sake of the recording. In steps one through five, I took the user's input, did some magic, and retrieved records; dude, that's what a search engine used to do, so if this is better, why not ditch search engines altogether? Sure, that could be a view I would support. Is there still a place for search engines? Well, it turns out that the quality and the relevance and accuracy of the system has to do with the embedding you choose, and your ability to affect those results is not as high. So if you've hyper-tuned Elastic and submitted a different query, did static boosting, dynamic boosting, different stemmers, custom parsers: traditional search engines still have a lot more instrumentation for that, and they will yield different results. Different better, different worse; I'm not judging. But yeah, thank you, good question, great observation. Yes?

Thank you. To piggyback on what both of you said: also, if I'm not mistaken, with this you do not have the source. It will not tell you "I learned this information that I'm giving you as an answer from this specific document," unlike a search engine, which could. So it's impossible to trace.

So I want to calibrate something here. When I started talking about your data and how it knows all of that, that part is actually stage seven here: the final engineered prompt I give to the model to expand on and generate. That portion is the portion where, if I just go to ChatGPT, it doesn't know about the trousers' whereabouts right now.
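To ground the prompt-engineering portion just described, here is a minimal sketch of steps six and seven. The client library, model name, and prompt wording are illustrative assumptions, not the talk's demo code (the demo itself uses restaurant reviews on MongoDB Atlas).

```python
# Turn the top retrieved reviews into an augmented prompt and ask a chat model to generate.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def answer_with_reviews(question: str, top_reviews: list[str]) -> str:
    # Prompt augmentation: the retrieved reviews become the model's evidence
    context = "\n".join(f"- {r}" for r in top_reviews)
    prompt = (
        "People already reviewed these restaurants, and the reviews are relevant to my query.\n\n"
        f"Reviews:\n{context}\n\n"
        f"Question: {question}\n"
        "Tell me where I should go, and why."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; any chat-completion model works
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```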
So when I say "my data," I mean actually that first portion of the show, the search engine part, where I keep feeding it the pants' location all day long, or new documents coming in, new Q&As coming from the interwebs, whatever it is. That portion of remaining current has to do with the fact that instead of retraining the model, the text generation model, to say "ooh, you learned another thing, and another thing, and another thing," which is very expensive, I delegate that portion to the vector search, which does a decent job and keeps things relevant. One more question and I'll push on. Yes sir?

What type of storage would be ideal for large language models to index or work from?

So the question is: what type of storage would be ideal for LLMs? The LLMs themselves don't quite, quote unquote, use storage in the same way, because an LLM is a saved model. A saved model is a mesh of polynomials with coefficients. It is an internal binary representation that doesn't map to traditional tables or documents or anything in the back end. It's just a blob, for all intents and purposes, that creates a graph that they search in real time. So no storage there. In terms of the vector search, every major vendor has something, and today I'm going to demonstrate MongoDB Atlas, which has a vector search index that you can apply to existing documents. But every major vendor has their own stuff, and if you're just doing in-memory work in Python, you'd probably use Chroma, because all the samples are written that way. At small scale it doesn't matter; at high scale, there are vendors out there. Blob storage will not be it, because it's not a database: it doesn't have an index, it's not efficient. You don't want to do scans; you want an index. An index makes search fast, yeah.

Okay, I'm going to push on, and we'll have time for questions as much as you want later. So I said models do stuff; what do models do exactly? In the embedding portion of the show, they compute a meaning, or nearness, by applying vector similarity. There is no AI, only math. Vector similarity can be computed. It produces a number. That number is higher when the vectors are near each other, whatever near means, and lower when the vectors are farther away. So "I am famished" and "I'd like to eat" are maybe closer than "I am famished" and "polar bears melting ice." I don't know, just coming up with stuff. But you can see that vectors will be generated, and some of them are closer, some of them are further, and it's just math to do it.

The second portion of the show is a bit of a misnomer, because we were observing ChatGPT and whatever other things are out there in the market. We were observing that you type in a question and it gives you an answer. That's already a dressed-up feature; underlying the whole thing, it's all actually chat completion. What's chat completion? You know how you type into Google, like "Nuri is," and it comes up with a bunch of... oh my God, that's what it thinks about me. Those are all probable things that Google decided just by ranking, by popularity. Like "lunch in Pasadena is best at," and then it comes up with restaurant names, because "lunch" already matched against some restaurant database, and they rank-stacked it by popularity. It's as simple and dumb as that. AI text generation relies on that fact. It says: well, of all the vectors we saw in all the world, what are other vectors of already-parsed knowledge that most closely match this vector?
And then it turns and says: oh, you told me a piece out of a larger conversation I already indexed; of all those large conversations, let me cull out what the likely continuation of that thought is. So that is what's called completion, and underlying it, all of the models have that. Some of them dress it up by already chopping out the part that you said, in order to make it appear as if it replied to you. But in fact, all it ever does is complete your thought. So prompts really are initial vectors, which become the basis for completion, and completion is the model giving you more text that matches; that's all it does, really.

So how are models built? How do they gain that capability? Well, we tokenize the input, and the input could be text or images, right? Because the model doesn't actually know the difference between "famished," "hunger," and "a burger." It just doesn't know any of this. It has tokens, and it has seen these words kind of together, near each other, far from each other, in one document or another that it indexed. Same thing with pixels. If you train it on images, it extracts pixels, the nearness of pixels to each other, meaning their coordinates, their color, their whatever. And it doesn't know a dog from a rabbit. But if you start with a vector that kind of looks like all of the images of rabbits, guess what: the rest of it is also going to be rabbit, because it's closer in n-dimensional space. Closer: that's all it does.

So you tokenize the input: text, images, it doesn't care, it's just tokens. You create vectors. Well, tokens are just kind of a vocabulary: it built a big dictionary of all the elements it saw and assigned each one a number. So you encode those strings, you get some token sequence out of the string, and then you create a vector from it. And then you also have the reverse, which is that from a vector like that, you can track back and regenerate the text that it originally was. That's where the magic is. Because if you have that ability, it means that when I say "I am hungry, where's a burger?", and it saw kind of the hungry and the burger and put them into a vector that lost some information, if it can deflate it back, like "a burger in Pasadena, and it's close by," then it will give me an English-like sentence from all the inputs it learned.

So models are built by doing these operations, building an encoder and a decoder. And that is training the model: seeing a lot of information come in, doing a lot of math to jiggle things so that vectors end up being near each other when the origins are close in meaning, and further apart when the origins are not. And after that is done, all of the parameters and all of the math coefficients used in the formulas are saved in a binary format in the model. That's what models are. So it's an action of mutating all of those input vectors, and re-mutating, and re-mutating, again and again and again, until the output is satisfactory.

Training is extremely expensive: like tens of millions, hundreds of millions of dollars. The more data you train on, the more expensive it is. Is it always an exhaustive loop within a loop, looping over a billion things? No, there are more mathy ways to reduce that sum, but yes, it is extremely, extremely CPU-intensive, data-intensive, resource-intensive. And that means that training a model from scratch is very expensive. Retraining a model (there's something called base models, and that's something you might want to look into) is less expensive, but still pretty expensive.
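The tokenize, encode, decode round trip described above can be seen with a stock tokenizer. A minimal sketch, assuming the tiktoken library (the tokenizer family used by several OpenAI models; any tokenizer with an encode/decode pair shows the same idea):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

tokens = enc.encode("I am hungry, where's a burger?")
print(tokens)              # a list of integer ids from the tokenizer's vocabulary
print(enc.decode(tokens))  # the reverse direction: ids back to the original text
```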
So if you have a base model that you like, it generates good English in response to general things, and you want your data to be part of the model instead of having to search each time, and your data is relatively static, retraining a model is something you can do. You can add layers to the model so that it can answer more relevantly about your own data. That's not what we're doing today, because that still is fairly expensive and I just have a poor laptop here. Training is iterative, expensive to compute, and has some randomness, meaning that if I took the same computer and the same input data and tried to train it again without saving anything from the past, I would get different results. I wanted to say that's why GPT-3 and GPT-4 differ, but they're actually different in more ways. Still, this is why, when you see point-version upgrades of models, Microsoft and OpenAI are very careful to tell you: oh, oh, you're using something else now. Yeah, sure, you can, but test again. Make sure it's safe and good for you, because retraining a model does not necessarily get improvements, and it certainly won't generate the same responses all the time. So capturing a model is still kind of an art. The good news is we don't have to train a model from scratch, and that's what RAG is all about. We can use a model that is kind of generic in general, that's good at English, and then feed it the information that is relevant and new and specific to us, and get the result. So a little about vector comparison; I said "whatever near means". From a math perspective, if we have a vector, let's say the vector V here in green, and we have three other vectors, A, B, and C, you could ask yourself, well, which one is closest to V? Well, A and C are shorter, B is longer; from an angular perspective, from what direction they point, V and B are the closest. V as in Victor, B as in Bravo. A and C are shorter, and there are ways to compare. The easiest one is Euclidean distance. It's the one we don't use, by the way. V to B will be measured just by the distance from the tip of V's arrow to the tip of B's. And V to A will be from the tip of the arrow to there. So you can just imagine drawing a line between them and asking, what's the length of the line? And from that perspective, A would probably be closest, with C being second and B being last. Then there's cosine similarity. Cosine similarity measures angle only. It doesn't care about how long the thing is or where it's positioned in space. It only measures the theta between this vector V and this vector B. So, as we said, B is probably closest cosine-wise. And then there's dot product. Dot product takes into consideration, well, I have another slide that tells us kind of how to think about it. Euclidean is just strict distance. Cosine is the angle and the direction. And dot product takes into account the direction and the magnitude. So it doesn't look at the angle so much, but it looks at the direction and the magnitude. So those are the three. The one I'm gonna use today is cosine. And if you play at home and you use some model that you downloaded from Hugging Face, it's important to look at the model card and see which similarity metric it was trained with, because you'd wanna use that for your embedding. If your embedding is using a model that was trained with cosine, guess what? Your vector search engine better support cosine similarity comparison. If you do it otherwise, you'll get poor results. So what does that vector search do at the end of the day?
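To put toy numbers on those three strategies, here's a minimal sketch in plain NumPy; the vectors are made up for illustration.

```python
import numpy as np

v = np.array([2.0, 2.0])   # the query vector V
a = np.array([1.5, 1.0])   # shorter, similar-ish direction
b = np.array([4.0, 4.5])   # longer, but pointing almost the same way as V
c = np.array([0.5, 1.5])   # shorter, different direction

def euclidean(x, y):
    # Strict distance between the arrow tips: smaller means nearer.
    return np.linalg.norm(x - y)

def cosine(x, y):
    # Angle only, ignoring length: higher means nearer.
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def dot(x, y):
    # Direction and magnitude together: higher means nearer.
    return np.dot(x, y)

# A wins on Euclidean distance, while B wins on cosine similarity.
for name, u in [("A", a), ("B", b), ("C", c)]:
    print(name, euclidean(v, u), cosine(v, u), dot(v, u))
```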
Bottom line, vector search takes those embedding vectors, the ones we got from encoding our text, creates an index on top of them, and enables you to search across millions and billions of items really quickly, using one of those three strategies. How does it do it? Well, the implementation is a little more gory detail. For MongoDB at least, they're using what's called ANN, approximate nearest neighbor. Strictly speaking, if you just did the math in pandas and stuff like that, KNN is good: K nearest neighbors, which is a stricter, more rigorous way of doing things, but more costly. It uses something called HNSW. I don't know how to pronounce that. Hierarchical Navigable Small World. I don't know how to pronounce that either. So this specialized index stores items in kind of a Z-order, which we're familiar with from geo-indexes, and enables quick search through it by storing very coarse distances at the top and then, from every node, being able to dig deeper and deeper into more granular nodes. So you can imagine, if a vector points just straight up, it goes into the straight-up bucket, and then it goes, well, not exactly straight up, between straight up and 45 degrees, find those, and then between 45 and 15, and blah, blah, blah. So it stores all of the vectors in a hierarchical tree, which allows quick retrieval down to the layer where things matter, and then it's much faster. Comparing every vector against the query vector would be inefficient: I have a billion items, I have to do a billion scans each time? No. But if I take the log of it, let's say you have 72 layers in that HNSW, then it's a fraction. You only do like 50 comparisons at most, or something like that, to get to an answer. So that's HNSW. It's an implementation of one of the two algorithms, and then, given a vector as a query, it just gives you a ranked list of things that matched against it. Another difference from a search engine: in a search engine, if you say "yo soy" something, and it never indexed Spanish and has no word for "soy", no document matching "soy", it will return zero results to you. But because this index is just helping you traverse distances, it will always return something to you, because there's stuff there. And if you hear about hallucinations, a major cause of hallucinations is that, underlying the whole thing, when we're doing vector search we're asking, what's the closest? We're not asking, is it within a certain radius? Now, you can tune parameters and say, hey, don't give me matches whose distance is more than 0.5; it's usually normalized to a scale of zero to one. You can do things like that, but then you get a problem where you may hide information you want back. It's not that cut and dried, and it's relative. Meaning that some vectors will be really close in the n-space that it indexed, and they will come back, and some are farther but still close enough to what you wanted; they would represent the right answer. So if you limit it and say, oh, I only want a cosine distance of absolutely less than 0.2, you will shoot yourself in the foot. On the other hand, without such a cutoff, you bring back results that seem a little bonkers sometimes. This is just some vocab, but I think I've impressed on you enough about word similarity to get the gist. Token: that's our tokenization. We take elements and we assign them a number.
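Here's a hedged sketch of that idea using the hnswlib library, my stand-in for illustration; Atlas has its own implementation under the hood, and the data here is random stand-in embeddings.

```python
import hnswlib
import numpy as np

dim = 384                                               # width of our embedding vectors
data = np.random.rand(10_000, dim).astype(np.float32)   # stand-in embeddings

index = hnswlib.Index(space="cosine", dim=dim)          # cosine, to match the embedder
index.init_index(max_elements=len(data), ef_construction=200, M=16)
index.add_items(data, np.arange(len(data)))

query = np.random.rand(dim).astype(np.float32)
labels, distances = index.knn_query(query, k=5)         # approximate nearest neighbors

# The index always returns k "nearest" items, however far they really are.
# A distance cutoff trims the bonkers ones, at the risk of hiding results
# you actually wanted (cosine distance here: smaller is nearer).
for label, dist in zip(labels[0], distances[0]):
    if dist < 0.5:
        print(label, dist)
```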
So rather than saying I need to store each of the words, I just transform them into numbers, and then I have a dictionary saying that number means this word. In all the math, I use the number. If I need the word back, I just look it up in the dictionary and get the word. A tokenizer is a machine that knows how to do that. And an embedder is a thing that knows how to take those tokens, those numbers, and produce a vector. A model knows how to encode the initial text into vectors, do the whole thing, look up whatever, and also knows how to decode back from a vector to the original text. And it provides completions, not exact matches. And training is what we talked about. Questions on theory before I jump into Python? I'll take two questions-ish, go into Python, and at the end we can take more questions. So back when you were showing the training process, you said there was a random element to it, and then you said that they all come with a warning: remember, this was version X and this is a different version. How do you decide which one is the most effective? Is it just whichever appeared first, or? So training can be supervised or unsupervised; it's an attempt to essentially round-trip things, meaning that they take all of the randomness, try to arrange the space of all vectors that describe the whole thing, and then they rank them on their ability to reverse the process. Given X, did I get the encoded, decoded X back? And in supervised training, a human goes and says, oh, this looks good, oh, this looks good. And they take a piece of information that it never trained on, give it to it, and ask, did it match my expectation? I'm not sure I got to the bottom of the question here. Sorry, when you say pre-trained model, it is a model, right? And just, you said. Yeah, so a pre-trained model here is just the notion that instead of me, Nuri, creating a model of all the English language and learning the internet and all of the scholars and all of that, I use somebody else's pre-trained model. Okay. So I don't have to do that work. One more question, and sorry. Hello, I have a question. So can you use a different tokenizer than the LLM that you're using to decode? I'm sorry, again? Can you use a different tokenizer than the LLM you use for decoding? So, can I use different tokenizers than the LLM uses? In essence, no and yes. Typically, when you go and look for libraries, there will be things that are concerned with the length of the text and where to break it, and they will be built in. And in terms of breaking up the initial items before they get fed to the actual tokenizer, yes. But the tokenizer itself is, yeah, it's, huh. Yeah, because in a lot of examples, they say, oh, you don't use the LLM as the tokenizer, you use a thing called a sentence tokenizer instead. Right, so a sentence tokenizer is the generic one, and you're saying, hey, can I write my own sentence tokenizer? Kinda, yes. They're gonna throw away things like squiggly braces and commas and spaces, whether you feel like it or not. And I think there are still places where you might wanna change that. But the underlying mechanism that then takes elements and assigns them numbers actually has meaning, because it's part of the training. For the model to say my vector is fixed, has a hundred spots, let's say, and this token is gonna get the number X and occur in this position, means something to the model, because it means something in terms of the magnitude comparisons it's gonna do later on.
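As a toy sketch of that dictionary idea, here's some illustrative plain Python; real tokenizers use subwords and a trained vocabulary, but the mechanism is the same.

```python
# Every element gets a number; the math uses the numbers; the dictionary
# gets the words back.
vocab = {}

def tokenize(text):
    ids = []
    for word in text.lower().split():
        if word not in vocab:
            vocab[word] = len(vocab)   # assign the next free number
        ids.append(vocab[word])
    return ids

def detokenize(ids):
    reverse = {number: word for word, number in vocab.items()}
    return " ".join(reverse[i] for i in ids)

ids = tokenize("i am hungry where is a burger")
print(ids)              # [0, 1, 2, 3, 4, 5, 6]
print(detokenize(ids))  # i am hungry where is a burger
```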
So that portion of the show is actually the training. All right, I'm gonna show, thank you, I'm gonna show some code. And where are we on time? Okay, and it's Sunday. If you need to go home, go home. If not, I'll stay after. How's that, okay? There's also a keynote, I don't wanna, yeah. Okay, so let's close this. So that's the last slide. You can contact me on LinkedIn. Otherwise, you know, see me after, or forget about me if you really didn't like this. So we will start with a repo, and that repo exists on GitHub; git remote -v. So I call it the AI foodie buddy, rag-mongodb. It's AI, the word I don't like to use, but it's popular; foodie buddy because I love food. And it's a RAG system, and it uses MongoDB Atlas as its vector search. And I do it because it's easy and it works. There are other options to do it, that's fine. I'm also using Google models right now here. So I have two options for the model for generation. One is using Gemma, which was released fairly recently and is a great model, and you can run it locally on your machine. And the other is using their API, which means I make a call to the cloud and say, hey, here's your prompt, give me a response. Which is also a possibility, and it works much faster than my lame non-GPU laptop. So how do I do it all? Aha, let's do a quick read of just the foodie buddy itself. There's a bunch of imports. Where from? Read the docs. The first thing I wanna do is define a few things, and these functions just serve to use my MongoDB stuff and use the models, so you don't have to read too hard into what they do. This one is the one where, once I've got an embedding, I want to go and ask Mongo to do something for me. So that's like step three of RAG, right? And what it does is it goes against MongoDB. It creates an MQL (Mongo query language) pipeline, which is just a set of instructions. The first of them is: create me a vector search. Given an embedding, a vector, right, I wanna create a vector search. So I pass it an embedding of a query, and the query vector becomes embedded into this query. Ta-da, nothing difficult. Well, after it found some documents that match it, most matchy first, least matchy last, do I want all of them? Well, these are restaurant reviews, and if there are restaurant reviews, I suspect my query matched restaurant reviews from many restaurants. So do I count "this was awesome" the same as "this was great"? I don't know. So what I decided to do here, subject to experiment, is to take out from the restaurant review the text itself of the review, the name of the reviewer, just in case I wanna show it, and the GMAP ID; those restaurant reviews had locations, GMAP IDs. And then what I do is group by those locations. So a location that got many reviews matching my query will rank high, and one that got fewer reviews matching my query will rank low. Is this the best thing? I don't know; for my demo, yes. In reality, if you roll it into production, of course it biases against restaurants that have fewer reviews; you'll need to deal with that. I don't. So once I've grouped it and found the restaurant with the most reviews that are very relevant to my query, I just pick the top one, because at the end of the day, I want this to recommend me a restaurant and say, you should go there, and here's why.
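Here's a hedged sketch of what such a pipeline could look like; the index, field, and collection names are my illustrative assumptions, not necessarily the repo's exact code.

```python
def make_pipeline(query_embedding):
    # Stage 1: Atlas $vectorSearch, given the embedding of the user's query.
    # Stage 2: keep just the review text, reviewer name, and location.
    # Stage 3: group by location so restaurants with more matching reviews
    # rank higher; then keep the single top restaurant to recommend.
    return [
        {"$vectorSearch": {
            "index": "reviews_vector_index",   # assumed index name
            "path": "embedding",               # assumed field holding the vector
            "queryVector": list(query_embedding),
            "numCandidates": 200,
            "limit": 50,
        }},
        {"$project": {"text": 1, "name": 1, "gmap_id": 1}},
        {"$group": {
            "_id": "$gmap_id",
            "review_count": {"$sum": 1},
            "reviews": {"$push": {"name": "$name", "text": "$text"}},
        }},
        {"$sort": {"review_count": -1}},
        {"$limit": 1},
    ]
```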
So that "you should go there" is this one restaurant I identify as ranking at the top according to my criteria, and the "why" is: well, take those reviews, engineer a prompt, give it to the LLM, Gemma or Gemini, and let's go. So this is the query portion. This is how I transform stuff. For the embedding portion, I use an embedder. There are different algorithms to do embedding, and for the embedding portion I picked multi-qa-MiniLM-L6-cos-v1. Remember it says cosine here, which means that my vector search should be indexed with cosine similarity. So I'm doing that. But this whole file can be reduced to a sentence transformer, whereby I take some text, the person's query, and I use the model to say, give me a vector. Every embedding model has a certain token length. You can look at the model card on Hugging Face or wherever you got your stuff, or you can measure it really cheaply by saying, hey, embed me one text. It doesn't matter which text; it gives you a fixed-size vector. That fixed size is important to Atlas search, because it wants to know how wide the vector is. It doesn't matter how many words you give it, it always encodes them into a fixed-size vector. So I just keep it and cache that count, because it's easier for me programmatically. But you can also hard-code it if you know which model you're using. So this is roughly it. Once I've decided which embedder I'm gonna use, I load it. There's a sentence transformer that I use, and the transformer has a method called encode. You give it text, and it gives you a vector. That's the deal with an embedder: you give it text, it gives you a vector. That's all the learning you really need here. So I've got me an embedder, and in order to perform a vector search, I need to take the user's query text, "ooh, I need a burger", take the embedder, and encode that into a vector. If I got nothing, well, we're in trouble. Otherwise, I can create myself a Mongo query, that pipeline I just showed you. So I've got the pipeline. I like to save it intermediately, just so I can show people that there's stuff, and during debugging, why not? Might be something you wanna log statistically, if you have a lot of users for your system, to tell what kinds of queries they're running, because when they say, oh, it sucked, it didn't work, you can go and see what they exactly searched. And then I go to the Mongo reviews collection and search with that. That is steps one through five: vectorize into an embedding, go to the vector search, come back, give me results. What does a document in Mongo look like? It looks like a restaurant review with an embedding. An embedding is just an array of numbers. Every document in there got encoded. And if I added another restaurant review, I would take the restaurant review from the user, go to the embedder, create an embedding for it, slap the field on there, and voila. That's that. So that's that portion. Let me start off by running everything below, so hopefully as I talk we'll also get results. So, blinkie blink. We have those definitions. We have a vector search; it does actually the vector search, gives me some documents back. Then I need to define how to package those reviews that I decided are more relevant into a prompt for the LLM, for Gemma or for Gemini or Vertex AI or whatever they name it this month. So I need to take the initial user prompt, because that was their query.
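A minimal sketch of that embedding step, assuming the sentence-transformers package and the model named above; the helper function here is my own illustrative wrapper.

```python
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")

# Measure the fixed vector width cheaply: embed any text at all.
# Atlas needs to know this width when the vector index is defined.
vector_width = len(embedder.encode("any text at all"))
print(vector_width)  # 384 for this model

def embed(text):
    # Text in, fixed-size vector out. That's the whole contract.
    return embedder.encode(text).tolist()

query_vector = embed("Find me the best Spam musubi that locals go to")
```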
And I want to take the results and format them into text that I will give the model to generate, the final G of RAG, the response. So I do it very ceremoniously by doing a JSON dump, because guess what? The tokenizer either throws away braces or makes use of them, so it's kind of useful. LLMs actually like that delimitation; it gives them some clue. Potentially they were trained with those, I don't know. So the prompt engineering is: write a restaurant review. That's my magic prompt. And my mileage will vary once I change that, too. "Write a restaurant review based on the question and reviews provided below." And I give the LLM both the original question from the user and the reviews. So now it's a one-shot. That's a thing in LLMs; it's a one-shot prompt to the LLM. Could I do it some other way? Yeah, there is a way. There are conversational systems, or chatbots, and they just retain context. So one-shot is a way to just say, this is it. Other systems say, hey, how about you prime the query, say, "as an expert in restaurants, please write me a review", blah, blah, blah. That's prompt engineering. Yes, play with this. See what gives you the best results. This is all I could do on a Friday afternoon. So that's that; that just formats a prompt. And then what do I need? I need an LLM predictor. A predictor is different from the embedder, because it uses a different model. The model for embedding is for the vector search. The model for generating is a fully trained, the-best-I-can-run-on-my-machine-or-in-the-cloud thing. So I was playing a little with text-bison and text-unicorn, both of them available, and both of them performed okay. These are not chosen for being the most knowledgeable about restaurants. They're chosen for being most useful for me on a laptop: able to run and generate text, having been trained on the English language but not on restaurants or world, you know, politics. They're just very good at interaction and Q&A type things. So those were trained on Q&A type information, so they're really good at predicting things that sound like Q&A, which is what I want. So I'm choosing a predictor with one of those guys, and that's my LLM, so my LLM predictor. And I'm using the API here, which means it's gonna make a call over the interweb, and it looks like I do have network, that's great. Almost done. So here the whole show is orchestrated. I vector-search the query, I get the top search result from the vector search, then I look back and get the one restaurant, and find from a different collection, hey, which restaurant do they actually mean? Okay, if I can find it. Then I create the LLM prompt, and I take the LLM predictor, one of those, Bison or Unicorn, and I feed it the engineered prompt, get a result, and show it to the user. The user is us, so let's look at the result. I asked, these are the first stages, the user prompt; I generated the query, blah, blah, blah. It gave me some restaurant ID. Create an LLM prompt: so it says, write a restaurant review based on the question and the reviews; that's the fixed template. The query was: find me the best Spam musubi that locals go to, and is a hidden gem. And this again is from the template: continue to answer using these reviews, and here are the reviews that already matched from the vector search. Those are actual user reviews that came from the Big Island or something. And then, finally, it composed text. That's the generate part, and it gave me a response. That's it.
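Pulling the prompt step out, here's a hedged sketch of that formatting; the template wording follows the talk, and the field names and helper are illustrative assumptions.

```python
import json

TEMPLATE = (
    "Write a restaurant review based on the question "
    "and reviews provided below.\n\n"
    "Question: {question}\n\n"
    "Continue to answer using these reviews:\n{reviews}\n"
)

def build_prompt(question, review_docs):
    # JSON-dump the reviews so braces and quotes act as delimiters
    # the model can latch onto.
    reviews = json.dumps(
        [{"reviewer": d.get("name"), "text": d.get("text")} for d in review_docs],
        indent=2,
    )
    return TEMPLATE.format(question=question, reviews=reviews)

prompt = build_prompt(
    "Find me the best Spam musubi that locals go to and is a hidden gem",
    [{"name": "A local", "text": "Best musubi on the island, hands down."}],
)
print(prompt)  # the one-shot prompt handed to the predictor
```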
Yay. I did choose Hawaii restaurant reviews because there were fewer than a million of them and my free account only allows so much storage, but you can go wild and it will perform just as well, because vector search is a specialized index. So we have officially three minutes, so I'll take questions until we're actually out of time, and then I'll volunteer to stay a little after; but let those who wanna leave do so without judgment. Yeah. Thank you once again for that presentation. If anyone has any questions, feel free to raise your hand and I'll bring the mic on over. So, do different vectors give different results? Absolutely. So the question is, do different vectors give different results? Yeah, I mean, if I asked "where is the best musubi" versus "where can I find the best musubi", those will be slightly different vectors, right? And among the vectors of already indexed results, there may be one that's a little closer, just because of the word I used, or the tone, or something. So yeah, absolutely. So if I wanted to cure cancer, which vector would I want to focus on to get the best results? I wouldn't necessarily characterize this action as discovery; it is not a clustering operation in the same sense that medical research uses. So I wouldn't try to go about curing cancer in a RAG way. I would try to go about it by training my model on all of the observed data that got collected, and then I would, quote unquote, ask a question, meaning I would generate a measure of goodness according to some ranking of that. So this is built on language and text; it's not made for that. All right, did we have one more final question? If I'm understanding correctly, vectors encode meaning from text, and that meaning would be subjective to the model or tokenizer which generated that vector. Is there any system or capability for interoperating between tokenizers or models to interpret those in a similar way? So, let's try to get a little more, yeah. Those vectors are the result of a trained model. Even the embedder is a trained model that indexed things and laid them out into the n-space, 384 different dimensions for these vectors. It laid them out in this n-space in a certain way to make close ones close and farther ones farther. The word "meaning" is human. They know nothing about meaning, nothing. They do not understand. A duck, a bear, a carrot: all the same to them. Pasadena, all the same. It's a token, it's a number, that's it. It's this spatial view that we take of this n-space that, for us humans, says, ah, the things that are close have similar meaning. In fact, from a math perspective, the things that are close are close. That's it. So anything else we layer on top of it is us as humans trying to use our language and our intelligence to do this. But from a system perspective, it's no more intelligent. So now you ask, hey, should there be a different tokenizer to make things different? Yeah, absolutely. Is there one that's better than the others? I don't know, ask an AI guy. From my perspective, I'm a user of, I don't know what happened here, I'm a user of those pre-built models and pre-built libraries, and I played with a few. There's LangChain, which I didn't use directly. Definitely the ML classics, and you have torch, those kinds of libraries. They have some treatment that lets you do things at an elemental, do-it-yourself level. I use higher-level libraries that package it for me, so I don't have to deal with it.
And if it yields me good enough results, where would I tune next? Prompt engineering would probably get me more mileage. All right, I wanna let you know this has been Training AI on Your Own Data by Nuri Halperin; if I could get a round of applause for our speaker. And thank you once again for attending Scale 21. We do have in ballroom DE at three o'clock a final keynote, I Love Living in the Future: Half a Century of Computers, Software and Security. So please join us for that. Thank you so much. Okay.