These models were released in open source. So, OpenAI — sometimes people jokingly call it "ClosedAI" — actually released this in open source: they released the model weights and the corresponding inference code on September 21, 2022. So what is Whisper? I asked Bing GPT this question and it said: Whisper is a computer program which can listen to people talking and write down what they say. What that basically means is that Whisper is a speech-to-text system; in technical terms, people call it an automatic speech recognition (ASR) system. Another thing Bing GPT said was that Whisper can understand people speaking different languages and can even translate what they are saying into English. So Whisper is not trained only on English: it is trained on almost 99 languages. It is a robust ASR model, and — the fun part — it is not just for ASR: it can do translation, language identification, and a few more tasks; you can check the Whisper paper to learn more. So these are the Whisper models they released in open source; let's look at some of Whisper's features. Whisper performs phenomenally well on English speech recognition. This is a picture from their paper, and you can see they compared against "Company A", "Company B", and so on, because they can't reveal the company names in the paper; NVIDIA STT was the previous state-of-the-art system in the open source world before Whisper. So you can see it performs phenomenally well on English. Another thing: Whisper is trained on 99 languages, but even so, I won't say it performs well on all 99 — it probably works well on about 57 languages at the moment, because the OpenAI Whisper API they released supports only 57 languages. And because they released this in open source, an awesome developer, Georgi Gerganov, created an open source project called whisper.cpp, which supports a lot of platforms — so Whisper, which was a very large model, can now run even on small devices like the Raspberry Pi and many more. Whisper also has a lot of amazing community plugins, because it was released in open source.
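Since the weights and inference code are public, trying Whisper is a pip install away. A minimal sketch, assuming the openai-whisper package and a local audio file (model size and file name are placeholders):

```python
# A minimal sketch of using OpenAI's open-source Whisper package
# (pip install openai-whisper); model size and file name are illustrative.
import whisper

model = whisper.load_model("small")       # tiny / base / small / medium / large
result = model.transcribe("speech.mp3")   # language is auto-detected
print(result["text"])

# The same model can also translate non-English speech into English:
result = model.transcribe("speech.mp3", task="translate")
print(result["text"])
```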
I'm not going to go through all of these in this talk. Now, I was assuming the audience would know what fine-tuning is — okay, this is a relatively new audience, so I'll try to explain a bit more in depth. If you want to start learning AI, one of the best courses I always recommend is the fast.ai course, which Jeremy Howard teaches. So let's say we have a large model. Most of the time there is some large pre-trained model that has been released into the open source world by companies like Google, and if you want good results on your own smaller subsection of data, you collect that data and train the model on it. In this example, we had a language model trained on WikiText-103, which is a text dataset; IMDB is a smaller dataset on which we want to perform really well, so we collect the IMDB data and train a language model on it. And if we then want it to work very well for a classification task, we can do that too. That is basically what fine-tuning means. I just wanted to say that fine-tuning is the new training, because at the end of the day most of us are not rich enough to train models at OpenAI or Google scale, so fine-tuning is what we can afford most of the time. Fine-tuning has been common practice for a long time in image classification, especially in the computer vision domain; it has become more and more relevant in the NLP world, it is getting more and more relevant in the audio domain, and I think in the future it will be increasingly relevant even for big LLMs like ChatGPT. So you may be wondering: why should we fine-tune Whisper? You should only fine-tune Whisper if you are not getting good results with the model weights released by the community. If you are already getting good results with the open source Whisper weights, there is no need to fine-tune; but if you are getting really bad results on your particular task, fine-tuning can sometimes improve performance. As for how to fine-tune Whisper, I won't be covering it in my talk, but it is well covered in a fantastic article written by Sanchit Gandhi of the Hugging Face team, where he explains all the steps. Once that article was released, the Hugging Face team organized an event called the Whisper event: an event to achieve state-of-the-art results in low-resource languages like Malayalam and other smaller languages — there are a lot of languages where Whisper currently performs really badly — by training new models. They even gave out GPU credits: almost a hundred hours of GPU credits were sponsored by a company called Lambda Labs. So yeah, it was great, and for Malayalam, fortunately, we got a lot of good models out of it: these models were released in Malayalam, and the winning model in the Whisper event was thennal's whisper-medium-ml, and for FLEURS, another dataset, it was parambharat's whisper-small-ml. But personally I was not really convinced by the results of the Whisper event. The reason is that in Malayalam, achieving a 10 percent word error rate would be a very big deal — if that were really happening, we could use it in production. So I wanted to check this hypothesis: can we actually get these results? And we don't have many yardsticks either: no one has done such a benchmarking exercise, so no one knows, for a given dataset, what the current result actually is.
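To make that concrete, a hedged sketch of what one run of such a benchmark might look like — load a fine-tuned Whisper checkpoint from the Hugging Face Hub, transcribe a labeled test set, and score it; the model and dataset names here are illustrative, not the talk's actual tooling:

```python
# A hedged sketch of a benchmarking run: transcribe a labeled Malayalam test
# set with a Whisper checkpoint and score it with jiwer.
from datasets import Audio, load_dataset
from transformers import pipeline
import jiwer

asr = pipeline("automatic-speech-recognition", model="thennal/whisper-medium-ml")

ds = load_dataset("mozilla-foundation/common_voice_11_0", "ml", split="test")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))  # Whisper expects 16 kHz

references, hypotheses = [], []
for sample in ds:
    references.append(sample["sentence"])
    hypotheses.append(
        asr({"raw": sample["audio"]["array"], "sampling_rate": 16_000})["text"]
    )

print("WER:", jiwer.wer(references, hypotheses))
```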
Another thing: Malayalam, unlike English, is a highly morphologically complex language — in this link I have included a research paper that discusses this — so even achieving a 30 percent word error rate is a big thing. And another thing, which I'll show you here, is that the way Hugging Face evaluated the models felt deeply wrong to me. In the model card, someone basically puts in the word error rate: there is a step that automatically generates and calculates a word error rate based on the validation dataset you used. But if you look here — I can't show you the exact commit — this is thennal's whisper-medium model, and the word error rate in the card is 38.6207, while on the leaderboard his result shows as 11.49. Anyone can edit the README and claim good results; that was deeply suspicious to me. So I thought of building something new, and I created a new GitHub project for benchmarking Malayalam ASR. When I tweeted that I had been working on this and had first benchmarked on a very small dataset, the Common Voice dataset, Kavya Manohar — a speech researcher who has been working on Malayalam and doing her PhD for five years — said: no one has done this before, Kurian, thank you. There are so many ASR papers in Malayalam, but no one has ever tried to do a benchmarking exercise to validate these claims. So in this talk I am presenting the results of benchmarking in Malayalam. We took these community models and compared them with the six model versions released by OpenAI. This is how the results come out of my benchmarking tool: for each model I record the model name, the word error rate, the CER, the model size, and the time it took for the model to run on that particular dataset — obviously a smaller model version runs faster, while larger models take more time. And these are the results for OpenAI's Whisper models. The word error rate — which I forgot to explain — is a metric in the ASR world that plays the role accuracy plays elsewhere; roughly, 100 minus the word error rate gives you the accuracy. In Malayalam you see OpenAI's own models getting word error rates around 100 to 150 percent and so on, while thennal's model achieved good performance here: it got around 11 percent. For the Common Voice dataset, on the other metric, character error rate — which is like word error rate but slightly different, at character level — you can see we go from about 180 percent down to 5 percent. Another dataset I benchmarked was the Malayalam Speech Corpus dataset: the model names are the same, and on the Malayalam Speech Corpus data we also go from 139 percent down to 2 percent with fine-tuning, and likewise from 177-200 percent down to 1 percent.
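For reference, the metric itself is just a word-level edit distance: WER = (S + D + I) / N, where S, D, and I are substitutions, deletions, and insertions against a reference of N words. A self-contained sketch (toy strings, not the benchmark data):

```python
# Word error rate from scratch: WER = (S + D + I) / N, i.e. the word-level
# Levenshtein distance divided by the number of reference words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # standard edit-distance dynamic programming over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat mat"))  # 2 errors / 6 words ≈ 0.33
```

Because insertions count as errors, the numerator can exceed N, which is why a badly wrong model can score above 100 percent, as in the OpenAI results just mentioned.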
So I have almost reached the end of my talk. I have shown that fine-tuning in Malayalam can achieve great results — provided such models really hold up, and provided we do more and more benchmarking; this is just the early stage of this project, and I only started working on it maybe two or three months back. You know, we got very good results in Malayalam: the previous state-of-the-art word error rate, with other methods, not just Whisper, was almost 80 percent on the Malayalam Speech Corpus dataset — that was the only number I could find when I tried to research it, and it is very hard, because there are not a lot of researchers working on small languages like Malayalam. So, since we got phenomenal results in Malayalam: if you are working on a low-resource language, if you are a speaker of a small language, you can also get amazing results by fine-tuning. I think that's all. I want to thank all these people for doing phenomenal work and helping me out, and finally I want to end my talk by offering tributes to Areeb Jamal. That's all, thank you.

Thank you so much. I don't think we have much time; maybe we take one question there.

Okay, so in India there are basically two families of languages — Indo-Aryan languages and Dravidian languages, I think; I'm not a linguist, so I may be wrong here. Malayalam is one of the Dravidian languages, which are spoken in the southern part of India: Kerala, Tamil Nadu, Karnataka, Andhra Pradesh, and Telangana (Hyderabad) all speak Dravidian languages. But you may be thinking: okay, since it's a Dravidian language, Kurian, you'll be able to speak Telugu or one of the others — no, I don't know a single word of Telugu; I can speak a bit of Tamil, that's all. All these languages are unique in their own right. In Kerala we have a population of 3.5 crore people — that's the Indian numbering; in English terms, 35 million — so there are a lot of speakers of our language, and it is a fairly large language by population. But unfortunately not a lot of people work on this language, obviously because of profit and all those things; we just have a few volunteers working in our free time to do something for Malayalam, and I guess that's the case for a lot of languages in India. To your question: there's probably a lot to be done in optimizing and introducing these into your language modules.

So thank you very much, and I really appreciated the sharing here. Thank you, goodbye. Right — I see the room is filling up nicely, which is great; just trying to have a look for our second speaker this morning. Good morning everyone.

Okay, sorry for the technical problems. Today I want to talk about open source and explainable AI. I'm a master's student — actually I just finished my thesis — and I did research at DFKI, the German Research Center for Artificial Intelligence. I have a little story from when I was working on my thesis, which was a hard part of my life because there was a lot to do in the research lab. But I got a lot of help from open source software: whenever I was looking for something, I would find it in open source, and I got so much from open source that I feel I need to contribute back to the community. So I just want to share this knowledge with the community, especially the open source community — hence today's topic, open source and explainable AI. Okay, before we go to
explainable AI: we know that since the era of deep learning, AI has had a strong open source tradition — many important models, datasets, and frameworks were done in an open source way, and open source is an important part of the development of AI technology. There are several reasons why open source is important. When we use open source software we have more control: we are able to examine the code and change parts of it, and we can use the software for any purpose we want. We can also study and learn from experts by reading code that is publicly available, we can share knowledge with others in the community, and we can avoid repeating the mistakes that other people have already made — it keeps us learning, and it is a place to train in making better software and AI systems. Open source is also more secure, because anyone can view, modify, and correct errors in the code, and if many people contribute, issues can be fixed and updated quickly. Moreover, the software is more stable because the code is publicly distributed, which makes it fit for long-term projects. And the last point is community collaboration: in open source we usually produce, test, use, and promote software that we love. Open source has a long history, and this is a brief list of memorable milestones; learning the history helps us understand the past and the current state of open source software. Looking back, the concept of free information sharing in a technological ecosystem existed long before the computer itself — it goes back to the automobile industry: in 1911 Henry Ford challenged the Selden patent, which had tried to monopolize the automobile industry, and Ford led the way into an era of open collaboration, especially in that industry. During the 1950s and 60s, the sharing of code was widespread, and most software was produced by academics and labs, which have a long-standing tradition of open sharing and collaboration. In 1969 ARPANET, a precursor of the internet, was launched, making it easier to exchange software code over the network. In 1991, Linus Torvalds released Linux, with freely modifiable code under the GNU project's license. In 2005, the version control system Git was released to the public for software projects. And since the era of deep learning, from 2012 until now, AI has had a strong open source tradition, with many important AI systems, frameworks, models, and datasets done in an open source way. Looking over the last few years, the use of deep learning and artificial intelligence has grown significantly; we have more than 17 billion devices, with trillions of sensors, connected to the internet, continuously generating streams of data. AI systems can use this data to automate decision-making, and AI systems have also improved rapidly in performance — in language understanding, image recognition, speech recognition, and so on. This impacts many aspects of businesses, companies, and organizations, and many of them want to leverage AI systems. But it is not easy to trust a machine to make decisions, especially in domains that need a human expert, such as healthcare,
transportation (like self-driving cars), law, and other sectors. Sometimes the decision itself matters less than the process of reaching the decision. In recent models like deep neural networks, we can only see the input, the model, and the output, but we have no knowledge of the internal process: this is called the black box problem. Let's take a look at it. The black box problem in AI refers to the difficulty of understanding how a machine learning model arrives at a specific decision or prediction. In many cases these models are so complex that it can be difficult or even impossible to determine the reasoning behind an output. For example, here the first image is classified as a cat, not a bottle — but we don't know how the machine decided it is a cat and not a bottle, even though the image contains both a cat and a bottle. The second example is a self-driving car: the machine just stops, but there is no explanation of why — was a pedestrian or a car detected by its lidar sensor, and why did the car choose that specific action? This lack of transparency can be a significant issue in fields such as healthcare, where the stakes are high and decisions need to be justified: in the medical image shown here, is the cancer benign or malignant? The decisions of our models can also lead to errors, bias, and discrimination, since it might not be clear why certain decisions are being made by the AI system. To address this black box problem we need explainable AI, which gives better transparency and explainability in our AI models. Explainable AI refers to techniques that make the decisions of an AI system explainable and understandable. It can help expand the use of AI systems into critical and sensitive domains where several criteria must be met, not just high accuracy, and it differs from the black box setting, where we cannot explain why the model chose a specific decision. When facing real problems, the challenge is not only how to build a complex and sophisticated model, but how the machine can be understood by humans; consequently, the transparency and explainability of our models become critical factors. We need transparency and explainability in our AI systems so that people can understand why and how the model chose a specific decision, not just the accuracy of the model — so that users can understand why the model chose a decision and why not, know when it succeeds and when it fails, know when to trust the machine to make a decision, and know why the machine learning algorithm got something wrong. When we talk about machine learning algorithms, there are several model families; deep neural networks, for example, offer enormous advantages in accuracy, but they lack the ability to explain their results. As for the degree of explainability: in general, the higher the quality of the results, the harder the model is to explain, and the lower the quality of the results, the easier the model is to explain. The graph shows model performance versus model explainability, with the most widely used machine learning algorithms depicted: from simple models like rule-based models, linear models, and decision trees, up to complex deep learning models like GANs, CNNs, and RNNs. The ideal solution is to get high explainability with high performance; however, the easier-to-explain
models — linear models, rule-based models, and decision trees — generally have lower performance, while complex models like deep learning ensembles can achieve higher performance but are hard to explain. Several explainability methods have been proposed for deep learning models; one of them is the saliency map. Saliency maps are an important concept in deep learning and computer vision: saliency methods explain the decision of an algorithm in terms of the input components, assigning values that reflect the importance of each part of the input and its contribution to the decision. Saliency is usually shown as a heat map, highlighting the regions of the image that have a big impact on the prediction for a specific class. Examples of saliency map methods are Class Activation Mapping (CAM), Grad-CAM, LRP, and so on. In healthcare they can be used to analyze medical images; in robotics, for object detection; and they can also be used in self-driving cars to detect cars, the road, and so on. Next we come to the new trend in AI: generative AI. Generative AI is a class of machine learning that learns from context such as text, images, and audio in order to generate new content based on an input prompt. In contrast to common machine learning algorithms, which learn and produce decisions, generative AI produces artifacts as output, which can have a wide range of variety and complexity. Generative AI models have a very large number of parameters, and moreover they are able to process prompts in multiple modalities. Looking deeper, many generative AI systems use models like generative adversarial networks (GANs), variational autoencoders, and transformer models. The image shown here is the transformer model: the transformer is a deep learning model that adopts the mechanism of self-attention, differentially weighting the significance of each part of the input data; it is used primarily in natural language processing and computer vision. The core components of the transformer are the embedding layer, the attention blocks, and the feed-forward blocks; the attention block maps the input into query, key, and value matrices, splits them into an array of heads, and the multiple heads are concatenated to create multi-head attention. So the important question is: how do we get explainability in generative AI based on the transformer? Explainability in generative transformers has become increasingly complex with their large number of parameters and their ability to process multiple input modalities; through their sheer size, transformers are exceptionally challenging for explainability. However, most explainability approaches for transformers focus on the attention in the last layer. Let's take a look: the first image here is an example of explainability based on the cross-entropy score — it shows the probability of the next-word prediction using cross-entropy. The second image is an example of a multimodal prompt: when we prompt with the image here, the generative AI answers with the content of the image — for example, "a lonely cabin on the edge of the lake with a truck nearby" — and alongside the answer it also shows a heat map of the image regions that contributed most to that specific answer.
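As a concrete illustration of the attention-in-the-last-layer idea just mentioned, here is a minimal sketch that pulls attention weights out of a transformer with Hugging Face transformers; the checkpoint is illustrative, and real explainability methods post-process these weights much further:

```python
# A minimal sketch: extract last-layer attention weights from a transformer.
# Attention-based explainability methods start from tensors like these.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("a lonely cabin on the edge of the lake", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

last_layer = outputs.attentions[-1]      # shape: (batch, heads, seq_len, seq_len)
token_heat = last_layer.mean(dim=1)[0]   # average over heads -> token-to-token map
print(token_heat.shape)
```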
And when we use image generation on top of that, it is still an open research problem. So, the last thing in this session: I'd like to leave a takeaway for the community. The community, especially the open source community, has a big role in explainability and in trying to solve the black box problem in machine learning. As a community we can contribute together to make our AI systems safer and more trustworthy, and we can build the foundations of the AI systems of the future — just as the Linux era came from the open source community and became the foundation of modern technology. There is a chance to construct the foundation of the next generation of computing, with AI, with the open source community: building safety, transparency, and reliability for the future of technology. Thank you.

Thanks — any questions? Open to taking a question, but otherwise we'll take that offline. So thank you very much, and yes, thank you. Good, I hope everybody's having a good morning so far. Coming up for 10:30 — I think we had a bit of misalignment in the diary earlier, but the next speaker is definitely on time — we're kicking off with Ms. Archana Vaidheeswaran, who will be talking this morning on unlocking English writing for everyone: building a ChatGPT Chrome extension for improving English writing accessibility. I think you all have the summary, but this is really about going through it in motion and demonstrating practically how they built this extension and how it can benefit users. So Archana, I'll hand it over to you; introduce yourselves.

Thank you so much. Hi everyone, I hope you had a good breakfast — I know it's a bit early in the morning, but I'm really excited for this talk. I don't think this is the first time you're hearing about ChatGPT this week, and as we delve in you'll learn more about it, and also about our progress and what we learned throughout this process. So let me go ahead. I'm Archana; I'm currently a data product manager with Women Who Code — I'm wearing their very cool shirt as well, and if you come to the event hall you can catch our booth with Women Who Code Singapore. Apart from that, I'm also a board member with Women in Machine Learning, which is another non-profit, and I also produce courses on LinkedIn Learning: I have one on learning TinyML, and we have another one coming on MLOps with Vertex AI, maybe out in a month — both of us produced that. And I'll let Soham go ahead.

So, I'm Soham, and I'm currently the machine learning lead at a sleep startup in Singapore. So yeah, as I said, this is probably not the first time you're hearing about ChatGPT this week. This is a short preview of what we made — I hope it makes sense; as we dive in, we can't really walk you through exactly how we built it in 25 minutes, but we can give you access to the GitHub and so on. Today's agenda: what we are building, a peek into GPT itself, most importantly what they don't tell you about building these LLM products, and finally ways to contribute. So let's start with the first thing. We definitely wanted to try our hands at LLMs, but more importantly — like most of us here, even me — I speak three languages, so in my head I'm always translating
between all of them. So one of the things I often face is the English accessibility problem: we convert text to English, in our heads or otherwise, and maybe it doesn't sound right. I thought this was a perfect thing for ChatGPT to help with, so I tried to figure out, on the web, whether there are people similar to me — and it turns out there are. More than half of the world's population speaks at least two languages, and many people tend to think and work in their native language; 75% of consumers prefer to buy products in their native language, highlighting the importance of language accessibility; and 50% of website visitors will leave if they can't read or understand the content, indicating a significant barrier for non-native English speakers. In short, long and complex English text can be difficult for many people, including those with reading or learning disabilities, elderly individuals, and people with limited English proficiency. So this is a great segue into talking about GPT: we're trying to make it easy for bilingual people — people who speak more than one language — to talk in English, write in English, and communicate better in business English, and we're trying to do that with GPT.

I'm sure you've heard of ChatGPT by now. It's a language model created by OpenAI — I think GPT is the most famous one right now, but there are a lot of others that people are working on. It's been trained on data up to September 2021, and a really cool thing about these models is that they have what are called emergent abilities: they can do tasks they have not been trained on, and they can generalize across a lot of different tasks. The way GPT was trained to do this was by taking feedback from humans, learning how humans want it to answer questions, and emulating that. You interact with it through natural language, and it replies in natural language. But the problem with GPT models is that they're very large: anything bigger than GPT-2, which is 1.5 billion parameters, is pretty much out of the realm of startups, open source companies, and even mid-sized companies to train, and if you want to deploy them, you have to deploy them on GPUs — sometimes multiple GPUs — to keep latency down. But they're so good at so many tasks that these models now serve as a kind of baseline or benchmark when you're trying to build new things, and that's why you can't really ignore them right now; you have to use them, and I'll talk more about that in a bit. So we saw this, we got really excited, we saw a lot of other people building products with it, and we thought: okay, let's try building something. It seemed pretty simple; we'd make it open source, deploy it, and hopefully learn a lot along the way. That was the first step. So we started building, and it's really easy with GPT to go from 0 to 1 and build something modestly reliable that works pretty well. You can do that using something called prompt engineering, where you create natural-language text prompts that tell the model
what to do, and it works most of the time — it replies with what you wanted. But it's different when you're building a pet project that you might use from time to time or demo in a few places — GPT works really well for that — versus taking it to production. That's when the other problems come up: problems like cost (how much it takes to run these models), latency (how long it takes to get a response from the model), and how reliable and scalable the solution is. And this is stage four, where you're super impressed by the capability of the model, but it also falls short in a lot of ways: it's not reliable, and it's expensive. That's the point where you spend forever fine-tuning these prompts, trying to keep costs and latency down — it's really difficult to make that work. Before I move on to the problems: there's also another issue, which is that because this field and these kinds of models are so new, there aren't many standards or best practices for how to deploy them, so that's something we're trying to learn while building this. Right now our architecture is really simple: the extension connects to the extension backend, which sends a request to our Python backend, which in turn calls OpenAI's API, and the result is sent back to the front end. Both Archana and I have worked our whole careers in backend and Python; we have very little JavaScript or front-end experience — I can't even center a div in CSS — so the front end, as well as the JavaScript backend for the extension, was actually built using ChatGPT, and it took us about an hour with almost no JavaScript experience.

So with that, I want to talk about some of the challenges we faced while building this. I think the biggest challenge is issues with the API. OpenAI has openly come out and said: we have no SLAs for our API — no latency or uptime SLAs — and they'll happily deprecate models at short notice. When we started building this, we were building on the text-davinci-002 model; we created a few prompts that were working really well and fine-tuned them to work better, and then a few weeks into the project they announced they were deprecating that endpoint and that you should move to the new one. Transitioning to the new endpoint is easy — you just change where and how you call the API — but the prompts don't work as well anymore: they're not as reliable, and sometimes they don't give the kind of outputs we expect. So we had this really well fleshed-out product, but we couldn't transition to the new API without a lot of effort. At the same time, when they released the new endpoint and model, I think they stopped putting resources into the previous model — the one we had built our product on — and our latency increased from a few seconds to minutes. That just makes the whole thing really unreliable and, for someone trying to build products around it, unscalable as well.
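For the record, the Python-backend piece of the simple architecture described above is conceptually tiny. A hedged sketch — the endpoint name, model, and prompt are placeholders, not the project's actual code — using FastAPI and the current OpenAI Python client:

```python
# A minimal sketch of a Python backend the Chrome extension could POST to;
# it forwards the user's text to the OpenAI API with a fixed system prompt.
from fastapi import FastAPI
from openai import OpenAI
from pydantic import BaseModel

app = FastAPI()
client = OpenAI()  # reads OPENAI_API_KEY from the environment

class FormalizeRequest(BaseModel):
    text: str

@app.post("/formalize")
def formalize(req: FormalizeRequest):
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Rewrite the user's text in formal business English."},
            {"role": "user", "content": req.text},
        ],
    )
    return {"result": completion.choices[0].message.content}
```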
I think another issue is with prompts and how you engineer them. Because you're talking to the model in natural language, and natural language is ambiguous — you can infer lots of meanings from it — prompts are ambiguous too. It's very hard to get reproducibility: very hard to get the same result from the model even with the same prompt. The model also tends to hallucinate, which basically means it will make up answers or give you wrong answers, but do so really confidently. That's a problem: how do you even serve the result of the model if you're not sure how confident you should be in it? And as these products get more complex, you chain results from one model into another model, which gives a different result, and so on; you also have agents, which are models that can carry out tasks — maybe query a SQL database, search something on the internet, things like that. These are really inconsistent: they work maybe one third of the time at best, and if they fail, especially with long chains — say the chain fails in the middle — it's very hard to recover from that failure. First, because you can't even tell that the chain has failed; and second — I don't know if you've used ChatGPT this way — if you tell it "hey, you've made a mistake, this is what you should do", it's very hard to get it back on track: it will just say "oh, I'm sorry" and keep making the same mistake. There are also no evaluation metrics for these errors: because it's natural language, you can't easily evaluate whether the output is correct or not. And finally, trust and security. Recently OpenAI brought out a policy where you can opt out of data collection, especially for training, which is great, because that was a big concern for a lot of companies. But the other concern is that we don't actually know what data was used to train their model. Say you're building something that does financial modeling, or accounting, something involving stocks — what if OpenAI trained their model on data from the WallStreetBets subreddit? You don't want that kind of data in your model. Or maybe you're doing some kind of medical diagnosis: you want to make sure your model was not trained on data containing incorrect or malicious information. And then finally, prompt injections and attacks: almost every day people are creating new ways to attack these models and make them act in malicious or adversarial ways, and it's very hard to defend against them.

So these are some of the challenges, but it's not all bad — there are solutions. The first thing I want to talk about is the cost of the model. What OpenAI has done is reduce the barrier to entry for startups and companies to build smart, intelligent products, because previously you would have had to hire a team of data scientists and engineers to build any AI or machine-learning-based product — which is expensive, and you don't even know
if that product will work in the end. What OpenAI has done is make it very easy to write a few prompts and get a product up and running quickly. But as your product grows in complexity and you get new users, your API costs start to increase exponentially — that's what you see in the first third of that chart: as your product grows in complexity, costs skyrocket, and they will make your business unsustainable. So what you have to do is fairly quickly move on and fine-tune a model. This also lets you train the model on more relevant data, so it gives you more relevant outputs, may reduce incorrect outputs, may make the outputs more manageable — and it reduces costs, especially if you can deploy the model on premises. But the costs don't decrease that much, because you still have to deploy the model on a GPU, maybe multiple GPUs, and you have the same problem: even if you're not paying for an API, if your prompts grow and your application's complexity increases, your costs increase a lot as well. At that point you have to start training a custom model. So what I think will happen in the future is that you'll do this process of "buy while you build": buy the API while you build out your own solution. It will help small companies and startups challenge incumbents, because they can build products really quickly. That's the buy-versus-build problem and its solution.

Some of the other solutions: for prompt engineering, what we've seen work is few-shot prompting, where you give some context about the problem and provide example formats for the inputs and outputs you expect from the model; that works really well to make the model's output more organized and easier to parse, and it will work for most applications. Another thing you can do is tell the model to do something called chain of thought: you ask it to walk through how it arrives at an answer, logically, step by step, and that helps the model make fewer errors — but you're generating more output, so your latency and costs increase too; that's the disadvantage. There are other techniques beyond chain-of-thought prompting, but at that point the costs get so high that it doesn't make sense anymore; you have to fine-tune your model after that point. Another thing that's become very popular recently is vector databases. The outputs from these models are vectors — long arrays of numbers — so instead of querying the API every time, you can save the vector, and if someone asks a similar question, you answer from the saved vectors instead of from the API. It works really well when you have documents: say you're doing document question answering, where someone has a large document and asks multiple questions about it. Instead of sending that document to OpenAI multiple times, you send it once, save the embeddings, and then query against the embeddings. That saves costs and latency.
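A minimal sketch of that embedding idea — embed chunks once, then answer repeat queries by cosine similarity. The model name and texts are illustrative, and a real system would use a proper vector database rather than a Python list:

```python
# Embed document chunks once, then retrieve by cosine similarity instead of
# re-sending the document to the API for every question.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

chunks = ["Refunds are processed within 5 business days.",
          "Support is available 9am-5pm on weekdays."]
index = [(chunk, embed(chunk)) for chunk in chunks]   # embed once, store

def retrieve(query: str) -> str:
    q = embed(query)
    scores = [(chunk, float(q @ v) / (np.linalg.norm(q) * np.linalg.norm(v)))
              for chunk, v in index]                  # cosine similarity
    return max(scores, key=lambda s: s[1])[0]

print(retrieve("How long do refunds take?"))
```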
Don't use chains; don't use agents at all — they're not good at the moment. Anything where the agents are completely autonomous will probably not work right now; that's what we've seen even with very simple applications. Another thing people do to address this is to have a watcher model — another LLM that watches the original LLM to make sure it isn't making mistakes. It works sometimes, but when it doesn't, you get the same snowballing effect where errors lead to more errors — and then who watches the watcher? It becomes watchers all the way down. Finally, I want to talk about some best practices we think you should follow. All of the above are best practices too, but another one: the prompts you create while building your product are now your proprietary data — the equivalent of an in-house deep learning model from the old days (like, three months ago). So save those prompts and make sure they don't leak; treat them as carefully as your API keys. Also, version your prompts and test them — treat them like you would any data — and for testing, make sure that with the same prompt, the answers don't drift a lot and the quality of the answers doesn't degrade over time, things like that. These are some general good practices, but I think we need more from the community and from other people building similar products, and as we build out our application and come up with more, we'll share those best practices as well.

Yeah, so I wanted to give a quick demo before we move on, because we've been talking a lot about LLMs and I want to go back to the product and let you see how it works. Here you can see an input — we tried using Chinese — and when it translates, it says "good, I'll take the numbers", which doesn't sound that good in English; so you can take that text again and formalize it. These are the three functionalities the Chrome extension has right now: summarization, formalization, and translation. I also want to show you the GitHub: we've hosted the project there, so you can see the entire project, and we have opened issues you can pick up and contribute to. Again, this is completely open source, and we really want people to contribute, because that's how it will improve; most importantly, both of us have a bit of tunnel vision, looking at one problem, when there's probably a lot more to explore. This is a short roadmap: where we are right now is something functional that does something small, and we feel there's a lot more to figure out — tone, for example: right now it can only formalize, but what if someone wants to make the text fun, or change it entirely? It should work for that as well. And then a better user experience: as I mentioned, neither of us has a front-end background, and it's taking us time to understand what a customer might think, mostly about how we interact with the extension.
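On the version-and-test-your-prompts point above, a hedged sketch of what a prompt regression test might look like — the prompt text, model, and expected format are all illustrative, not the project's actual assets:

```python
# Pin prompts in version control and assert that responses keep the
# properties your product relies on (here: valid JSON with a "text" field).
import json
from openai import OpenAI

client = OpenAI()

FORMALIZE_PROMPT_V2 = (  # versioned alongside the code, like any other asset
    "Rewrite the user's text in formal business English. "
    'Reply only with JSON: {"text": "<rewritten text>"}'
)

def test_formalize_prompt_returns_parseable_json():
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,  # reduce run-to-run drift in the test
        messages=[{"role": "system", "content": FORMALIZE_PROMPT_V2},
                  {"role": "user", "content": "hey can u send me the report"}],
    )
    out = json.loads(resp.choices[0].message.content)  # fails if the format drifts
    assert out["text"]
```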
So if you are someone who has that experience, we would love for you to contribute. And finally, we are definitely dependent on the community to take this forward. We are writing more about our journey in the newsletter, and we'd love for you to do so too. So if this interests you — especially after our LLM rant — here's how you can contribute: there are obviously backend ways to help build the product and the API; there's front-end development, improving the front-end features; there's newsletter content — we'll share the newsletter with you; we're trying to journal this entire Chrome extension journey, so if you love writing you can send us something and we can put it up in the newsletter; and finally infographics — the infographic you saw, I created, and I find making infographics very soothing, so if you'd like to, let me know and contribute to the infographics. Reach out to us. Finally, I also want to show you the newsletter: it's tinyml.substack.com. We actually covered the LLM conference yesterday as well — I don't know how many of you attended, it was late at night — but you can check that out, and Soham's latest post is everything he just told you, in blog format. So yeah, that's pretty much the end, but feel free to ask us questions. I know this topic seems complicated — I feel OpenAI has made a great interface for interacting with GPT, but what happens behind the scenes is quite complicated — and we also started out only a couple of weeks ago, so don't worry about your questions sounding naive; we also started from nothing, and your questions can help us and vice versa. So, does anyone have any questions?

You mentioned vector databases — in a question-answering app, how would you go about matching the query to a saved vector?

Yeah, so you convert the query into a vector as well, and then you search for similarity between the query vector and the vectors in the database, and output the most similar one. I'm pretty sure ChatGPT does something similar: if you go to ChatGPT and ask something like "summarize this arXiv paper for me" and give it an arXiv link — say a link from a few months ago, which it's not trained on — it will find, in its vector space, another arXiv link from its training data. arXiv IDs look like 2104.1234, so it might find something like 2004.123 instead, and output that.

So something like cosine similarity between the vectors? — Yeah, exactly; cosine similarity is one of the ways you can measure that.

Awesome, thanks. Two questions. First: at least as of GPT-3, it leaned heavily towards English content, and I believe — well, I haven't checked how true this is — the same holds for GPT-4, based on observations. Going back to the idea of accessibility: how have you observed its performance in other languages, rather than, say, English and Chinese — say,
Vietnamese, Thai, or Indonesian? — Yeah, that's a good question. Neither of us knows any of the languages you mentioned; we've tried it with Hindi and Bengali, and it works pretty well on those. Also, we've only tried fairly simple business use cases — not really long texts or articles, but the small paragraphs you might write in an email or a Slack message.

And actually, on that note: since the tool right now is primarily for translation, as the demo was showing, what was the consideration between choosing from the existing array of translation APIs — the translation neural network APIs out there — versus using GPT? — So, translation is definitely a part of it, but the idea is also to use the capabilities GPT was built with: the fact that it can formalize and change tones really well. That's what we wanted to capture, because often when you move from your native language to English, a lot gets lost in translation, and we want to make sure such users don't get left out compared to people who usually speak English. So the idea is definitely translation, but an added part is this formalization, and I don't think there's a single API that currently does all of that — and if we want something more on top as well, OpenAI's GPT API gives us that. — Or maybe, just to save costs, the translate function could use a cheap translation API and only the formalization could use GPT? — I'd say cost — I mean, we haven't actually seen the cost yet, because we're still on the free credits; they give you like $18. Only once we scale up and have more users can we actually understand the cost. — Great questions, anyway.

Any other questions from the audience? We've got two experts here who have been playing with GPTs "since the old days", which is a nice quote. Any other questions before we break for tea? Yes, please.

Hello, I'm from China, and as you probably know, we cannot easily access OpenAI products like ChatGPT in China — sometimes we can find an agent. I tried to use image search: I asked ChatGPT to provide some images, but I couldn't get what I wanted. Could you explain the principles of image generation in ChatGPT?

Yeah, thanks. We know that the latest version of GPT can take images — combining images and text — but they haven't released that API yet; they've only shown us demos of it. There are other models that can work with images and extract text from images and so on; we haven't tried those yet. Regarding not being able to access it: right now there are a few open source language models, but the problem is that it's very hard to host those models, because you need a GPU, and it's pretty expensive if you want low latencies. That's something we want to work on eventually — we're just figuring out the logistics.

I think one of the big learnings, as you mentioned, is on cost: it's easy and quick to play with GPT, but as soon as you want to go beyond experimentation, you've got to take these other factors into consideration. Still open for questions — otherwise, of course, the speakers will be around outside over tea. Sorry, yes — we'll be in the event hall, at the Women Who Code booth, somewhere near there, so
yeah, if you have questions, you can always come by. Okay, thank you so much everyone; we'll resume in 15 minutes.

So, today's presentation is going to be on improving efficiency with prompt engineering. First of all, thank you for coming and for taking your time here. This is not only a hardcore technical talk, but I believe it's going to be very important given the current situation and what's happening with large language models. The character here is generated, inspired by Gordon Ramsay, and I have him because the secondary topic of today's presentation is going to be cooking. A few words about me: I'm Janijak, a technical manager specializing in productionizing AI solutions, and I focus on improving the efficiency of teams — so a slightly different angle. You can find me on Twitter — I'm not that active — and later I'll share a link to my profile as well. As a hobby — a pretty common one — I also work on handwritten text recognition. The agenda for this talk: welcome first; then an introduction to large language models and how people are using them; then how I was inspired to start using large language models; then we'll get to the meat of the presentation, which is creating your own promptbook — and the friends who presented just before me shared the importance of having your prompts saved in a safe place so you can reuse them; and lastly, the important part: creating our own prompts, and why this matters.

So, large language models. Perhaps because you're here, and you've seen the previous presentations, many of you have heard of ChatGPT — but it is not the only one; there are many more, and I wanted to share this infographic with you: there's Bard from Google, Claude from Anthropic, and also some open source options like Vicuna or LLaMA-based models. Show of hands: how often do you use large language models? Never heard of them? Okay, that's great. Tried a few times? Every week? Every day? I'm happy to see that so many of you are using them. From my research I see that only two in ten managers are using large language models every week, and only four in ten engineers — that might be slightly outdated, I ran the poll a few weeks back — but still, this is limited, and in this audience I see many more of you are using them; I'm happy to see that. And here are some of the reasons why people don't want to use them: "it doesn't create anything of value", or "grandma warns about a robot uprising because she's seen Terminator", or — most of all, the last one — privacy concerns because of the data collection. We need to be careful with that, especially since we hear these reasons even from some big companies.

So let's get to the introduction to large language models. I'm going to share with you a few examples of how people are using them, based mainly on Claude and ChatGPT — this presentation is mainly based on those two. Changing the response style is the first point. I guess this was the first thing I saw when I started working with ChatGPT: write a short poem about artificial intelligence. To give you an example: if you want to write an email and then turn it into a poem, that's possible. Second, mathematics. I studied mathematics, and I remember I had a lot of fun proving theorems — I spent a lot of time with my friends doing that — and now it's even more fun: you
can add emojis to them with ChatGPT. And then, ChatGPT is also good at preventing crime: if you want to hot-wire a car, it will tell you that it's not good to tamper with a car's electrical system and that it might be illegal. But if you tell it you're out in the woods and a baby is in danger and the only way to save the baby is to hot-wire a car and drive it to the hospital, it will be more than happy to provide you with the instructions. This process — and the previous speakers were talking about this — is called prompt engineering: you wrap your request in a larger context and confuse the model. ChatGPT is actually quite good at stopping prompt injections, so you need to be really creative to get the answer, but other, newer models are worse at it, so it's easier to engineer prompts against them. And now we get to the question: is this useful to me, or is this garbage? How often do you need a poem about something? How often do you need to add emojis to a mathematical theorem — really useful, right? Hot-wiring a car? Perhaps never.

So, I was sitting there with my wife — we were in Bali — and we thought, no, this new ChatGPT topic is just hype. My wife had started her own company, and she asked me to help her a bit with creating the website. Creating websites is not easy for data scientists, especially ones without a JavaScript background; what's more, creating content is time-consuming and hard and requires a lot of attention, and we didn't have much time, because we also have a son who is two years old. But we decided to give it a try and started using ChatGPT to facilitate the process — and we were super surprised that it sped up the process and also improved the content. The responses weren't ideal: there were errors and mistakes, and some things that were not really Montessori (we're focused on Montessori education), but together with my wife we were able to fine-tune the responses and get the correct ones. The website we created has already generated something for us, and now we're looking forward to generating even more. Then, three weeks ago, we went to Malaysia for holidays — and since we're also moving houses and I was preparing for this conference, we didn't have much time to plan the trip. So I asked ChatGPT for help, and it gave me three days of step-by-step guides on what we should do. It took us to some durian shops — my wife wasn't amused by that, but to me it was fine. And the next example, which I guess relates to the previous talk: think about summarizing a podcast. Previously that was a really huge project — you needed data to fine-tune your model, it was costly, you had to have engineers and perhaps some domain expertise — a huge project with a lot of unknowns. Now, if you want to do the same with these APIs, you can test a solution very fast, and what's more, the results you get from GPT-3 or GPT-4 are most likely much better than the ones you'd get from a custom-trained BERT — BERT being the old state-of-the-art type of model, published, I believe, in 2018. So here we are: those are a few examples of how we saved time, or how we managed to do things faster. What I'm saying is that you can do a lot of previously time-consuming things much faster, and you get results that are not ideal, but you
So what I'm saying is that you can save time by using large language models as your brain scaffolding, not to replace it, and all of that you can do to make yourself relevant, because I believe we are not going to have this Terminator example that I was talking about, but maybe to your surprise we're going to have a situation where people who are using AI will be competing against people who are not using AI, and I don't have to tell you who has a head start in that race. And what is more, lastly, this is not 100% accurate but it's fairly accurate: introducing ChatGPT, to me, is a very similar event to introducing the iPhone or smartphones, because it enables a lot of previously impossible things a lot faster. After the iPhone release we had companies such as Instagram or Square that were built on top of it, so now I see a completely new universe of applications that is going to be enabled, and already we see the trend of how Microsoft is using Copilot for PowerPoint presentations or Word documents; we see the same for Notion, and even other offices are using ChatGPT within their software. I guess this gives you good motivation to start your own prompt book, because that might be relevant. But you might still question why you need a prompt book, and I gave you a hint just before, but basically I'm suggesting it so that you can reuse some of the prompts that you use often and you don't have to reinvent them multiple times. How to use this presentation from now on, this is important: I'm going to share with you some prompts and some answers, and they're going to be truncated, so I'm not going to share everything with you, because I want you to focus on how things are being made and not on each word step by step, and I'll share the presentation after the talk so you can copy-paste some of the prompts that I'm sharing with you. Some of the prompts were ones that I found on the internet, found useful, and kept for myself, and some of them were invented by me. All right, so the power of prompt engineering; this is also another answer to the question of why you need prompt engineering. I'm asking questions to the chat (this is Claude, by the way, from Anthropic) and everything is fine. I ask it to complete the sentence "life is like", and the answer is "it's like a box of chocolates, you never know what you're gonna get". This is a good answer, it's the answer from Forrest Gump; I remember this movie from when I was small and I watched it multiple times, but it's boring, everyone knows it, there's nothing invented, and I could create it myself, you don't have to think much to create this answer. So how about if we ask the model to behave like a Michelin star chef and complete that sentence? Then it would say "life is like a pot of risotto: too many people don't stir, don't pay enough attention to the details, and end up with a mushy, flavorless mess", and I like this answer much more, especially if you like cooking yourself. And I'm sharing why cooking is the secondary topic of this presentation: cooking is much like prompting. You have your cookbook where you save your recipes for later and you can reuse them, and you can either use some recipes that someone else has created or you can create your own, and I'm going to talk about both of these things. First I'm going to show you a few pre-made prompts, so a cookbook, or a prompt-starter pack.
Okay, and here I'm going to show you only three prompts. One of the things that I started to do more when I became an adult is answering emails, and my wife faces this even more often as she works as a teacher. If you use ChatGPT, it can give you a boilerplate message for how to reply to some email. This one is asking about AI resources; here's the full answer and the truncated answer, and it provided me, or the person who requested it, a lot of options for them to learn AI. It's very basic, you would still have to edit it, but the answer is very reasonable: you don't need to think about the boilerplate, and you can think about the deeper insights instead. And here is the template you can copy-paste. Second thing: learning. Many of us want to learn new things, and previously that used to be a big undertaking, and now I guess ChatGPT can accelerate that process as well. So here's the example: I want to learn about artificial intelligence; show me the 20% of the topic that yields 80% of the results. Here's the full answer; the full answer was really long, it was eight sections covering a number of things, but the main thing that I wanted to share with you is that it suggested machine learning basics, and you can see there's Machine Learning by Andrew Ng, so it suggested lots of relevant sources as well to learn in depth. This is the full prompt for learning; it's a modified prompt that I found online, and I'll share the presentation, so feel free to take photos. And the last one I wanted to share with you is the travel prompt. This one, I guess, was shared by someone else: make a day-by-day, three-day itinerary for a trip to Singapore. You can also ask for an hourly itinerary, and I asked for multiple options. It's not really great at providing food options, but I was okay with that; here's the full answer and the truncated answer. That's really cool if you're visiting Singapore; I also recommend you go to Kranji, you can see a lot of bird species and alligators there, and it's worth it. This is the full travel prompt. And now the last part: what is prompting, and why do I need to do anything like prompt engineering? Prompting is simply asking questions, and prompt engineering is the art of asking the right questions to the right people to get the right answer, and the right answer is whatever you need. So, a framework: there is no single framework that is mainstream yet, because prompt engineering is fairly new, but I decided to provide you one option for using it, and here it is. Basically, with this framework I want you to focus on providing a lot of context to ChatGPT or to other language models, because if you provide context, it will be able to answer you with a relevant response; if you don't provide enough context, or the context is wrong, then it's more likely to give you wrong results. So the context that I suggest is: first, the task, so whatever you need done; second, the role, so who should answer or to whom the answer is addressed; third, constraints, so what it should or shouldn't do; and the last one is chain-of-thought examples, which my friends covered: you provide even more context by showing how the chat should behave. All right, let's get to the example: I'm hungry, it's late, what should I do? When I was starting to use ChatGPT I would write "what should I eat for dinner?" and it was fine, it created a few ideas and I liked them, they're already good. Salmon, I would eat that; chicken I wouldn't, because I'm vegetarian... not vegetarian,
pescatarian. Chickpea curry, reasonable; I don't like chickpeas that much, but that's still fine. And now we're getting to the last example: what if you provide more context? I'm hungry, it's late, what should I do, and my girlfriend is there, and both of us have dietary restrictions, and I want to make it a moment that matters. So instead of asking what I should eat, I would ask it to design a three-course menu that I can make at home, and I would also ask Gordon Ramsay, or some other Michelin star chef, to design that menu for me. And now about constraints: I have lactose intolerance and my girlfriend doesn't eat meat at all. All right, and then: think about this step by step, and this is really important, you should say "don't answer if you're not sure". This reduces the so-called hallucination by a lot, so this is influencing the thought process. And here's the answer that I got, and I was amazed: "As a Michelin star chef, I would be delighted to create a customized menu for you and your girlfriend", and it provided us with a three-course menu, with ingredients and a step-by-step guide on how to create the dishes. So I guess this is a fairly convincing example of why engineering your prompt is better than just prompting at random.
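Here is a minimal sketch of that task / role / constraints / examples framework applied to the dinner example; the exact wording of the prompt pieces is my paraphrase of the talk, not the speaker's actual prompt, and the model name is an assumption.

```python
# A minimal sketch of the task / role / constraints / examples framework,
# applied to the dinner-menu example from the talk. Prompt wording is a
# paraphrase, not the speaker's exact prompt.
from openai import OpenAI

client = OpenAI()

role = "You are a Michelin-star chef designing menus for home cooks."
task = "Design a three-course menu I can make at home tonight."
constraints = (
    "I am lactose intolerant and my girlfriend doesn't eat meat at all. "
    "Think about this step by step. If you are not sure, say so instead of guessing."
)
examples = "Answer in this style: 'Starter: ... Main: ... Dessert: ...'"

resp = client.chat.completions.create(
    model="gpt-4",  # assumption: any capable chat model works here
    messages=[
        {"role": "system", "content": role},
        {"role": "user", "content": f"{task}\n\nConstraints: {constraints}\n\n{examples}"},
    ],
)
print(resp.choices[0].message.content)
```

The design choice mirrors the framework: the role goes into the system message, and the task, constraints, and example format are stacked into the user message so the model has all the context in one place.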
And then the premium all-in-one example: this is inception. It's not really a pre-cooked example, and it's not really an engineered example, but you can ask ChatGPT to help you engineer your future prompts, and I've been using that a lot. I asked ChatGPT how to amaze people during a presentation on prompt engineering, and it helped in creating this presentation as well. So let's get to the conclusion: I'm suggesting you start your prompt book and use the methods that I covered here to automate some of the tasks that you're working on, and in effect, to stay relevant. Thank you, it was a pleasure to talk to you. And now scan this: this is the QR code to my LinkedIn post, and in the comments you have the presentation link. For now the presentation is still closed, because I didn't want anyone to see it before the talk, but I will open it in two minutes. Okay, wonderful, so we have someone who's played extensively with prompts, and here's your opportunity to ask lots of questions on this. Oh, we still have time for questions, awesome. So, as noted in the last presentation, there have been changes even within the model releases of GPT-3 and 4, so when we're going to scale, what does that look like? Since they don't explicitly reveal version numbers or versioning of GPT-4, how are you versioning or controlling these, and so on? So when it comes to using this, I'm trying to be very practical, and I'm trying to automate especially tasks that are time consuming where I need starter code or a starter template that I can later edit and adjust to my current needs, and for that use case it's not really problematic whether the model results are consistent or not. If I get a result that is from the early stages of GPT, when prompt engineering, or rather not prompt engineering but prompt injection, works, that might be the only problem. But for email replies, for learning examples, for designing trips, for boilerplate code, for debugging code, for user story creation, for being creative and finding two concepts that are unrelated at the beginning and connecting them together (all of those examples are in the presentation, in the appendix), for all of those, the responses between the models wouldn't change much: they would be good, they were good, and I expect them only to get better in the future, so it wouldn't block any of those use cases that are relevant. Does that answer the question? Next question: on creating prompt books, there are a lot of online resources already, for example playground.ai has a list of prompts, right? So how do you recommend creating prompt books? Awesome, really good question, I'm happy that you asked. So there are tons of resources, and you would be overwhelmed by them, and that's why I'm suggesting you create a prompt book: there are lots of cookbooks and you don't use them, but my wife always had a small book where she saved all the recipes that she used often. So I'm suggesting you create your own prompt book with the prompts that you are using, and not rely on the generic resources, because if you use generic resources you still need to search. I want you to save the time of searching for the prompts that you need and use often; that's the reason for creating your own prompt book. And you don't have to have your own prompts, that's an option, but lots of prompts that I use in the presentation are from someone else, because there are lots of people who are working on this and they're creating great things; this is the open source part of it. Thanks for sharing, I have two questions. First, as we know, the whole large language model space is evolving very fast: we have GPT-4 coming up, I mean it's already there, five is coming up, and we have other companies publishing their own models, and I believe you've definitely played with a lot of these models. So do you find the prompts consistent? If you feed the same prompts to different models, do they still perform the same? I like your analogy of the cookbook, but let's say the rice that we're using has not changed for the past thousand years, while models are evolving very fast. So if you have a prompt book built using, say, GPT-3.5, and then 4 comes out, or if you're using other models, do you see any inconsistency in the quality of the results? And my second question is, if you're using the prompts for tasks where facts are very important, how do you do fact-checking? Okay, awesome, two great questions: first consistency, and then fact-checking. Consistency between the models: there's one thing, OpenAI has a big budget and lots of people working on it, and they had a head start, so lots of prompt injections are stopped, but other models don't do that, so for some newer models you don't even have to do prompt injection, you can just ask how to hot-wire a car and it will answer. And again, a similar answer to the previous one: for most of the use cases I have, it's simple tasks that are time-consuming, and even if the answer is slightly inconsistent between two models, it would provide you a boilerplate, so you can reuse that and adjust it. And then fact-checking is the second question; fact-checking and adjustment of the results, I guess they're related topics. When we were creating the website for my wife, this Montessori website, lots of responses from the chat were inaccurate, and we're reading into Montessori, so we could catch that. And when I said that AI won't replace us yet, and that it's not going to be a fight between people and AI but between people who are using AI and people who are not: you still need to have an expert to validate some results.
The prompt that I shared with you, "don't answer if you don't know", works to some extent, but sometimes the chat will make things up anyway, and less powerful chats will make up more things than the others. So just be practical, use it for your needs, and don't join the people who are shouting "oh no, this created something that is incorrect": that's fine, just take that into account and use it to the fullest, and don't lose your time complaining that something is not ideal. And if you don't have expertise, find someone who has expertise and can validate it. What you can do is ask ChatGPT "hey, how should I learn about AI?" and it will give you a boilerplate, and then you ask some friend of yours who is in AI, "hey, I was thinking about learning AI, I have this idea, what do you think, what should I add, what should I take out?". That's much better than asking them "oh, what should I learn about AI?": you're more specific with your questions to the people who are experts, so you're not wasting their time; it's not literally about wasting time, but you're being more proactive, and they can answer your questions more directly. Really valid points about experts, and maybe also a question of how we measure productivity going forward, especially where people are AI-assisted; very interesting question on the prompts for different models. We're still open to the floor, we've got two more minutes before we bring up our next speaker, and of course Jan will be around, so you can take the conversation offline. Yes, please. Okay, this is coming more from the text-to-image side of things, but it applies to ChatGPT as well. What's happening with text-to-image is that there are a lot of copyright issues coming up: when text-to-image tools like, you know, Midjourney generate an image, is it copyrightable, what are the problems with training? Are we seeing the same issues popping up with ChatGPT-type, you know, text-to-text kinds of models? Yes, I believe there are even some legal actions related to that, to slow down OpenAI, but I would need to double-check what the current state of that is. And I see some of the open source large language models would use only open licenses to train their models, whereas for some closed source models you don't know what they are trained on and they are not transparent; for open source you can see that, and open source models would also allow you to opt out, like if you're sharing your code on GitHub or wherever with an open license, you can opt out from model training. So this is for open source, and then for closed source, yeah, I would need to double-check what the current situation is, and I think there might be some legal actions similar to what's happening with text-to-image, but we'll see. Thanks a lot.

Machine learning in search, where we're working, has some unique constraints, because we generally have a very high volume of queries and very strict latency requirements, like real time. A lot of machine learning is run in batches, like once an hour or so, but when you have to return things in real time, with very large numbers of queries, things get a little bit more interesting. So I guess I'll give a little background about what Mercari is and the situation we were in when we introduced our machine learning model.
Mercari is Japan's largest consumer-to-consumer online marketplace, like an online flea market. Our vision is a circular economy, so sustainability: we want to reduce the number of things that go to landfill, and reduce the number of new things we have to manufacture, by allowing people to resell their things instead of throwing them away. Our net sales are about 150 billion yen, a little over a billion US dollars, and within search we have over 20 million monthly active users (so unique visitors to our website every month), hundreds of millions of active listings in our catalogue, and thousands of queries per second, with an all-time peak of over 10,000 queries per second. So that's the landscape we saw when we had our first machine learning model that we wanted to put into production. Our original architecture was a traditional term-based architecture built on Elasticsearch. Elasticsearch is an industry-wide standard, very widely used, but it's term based: that means the keywords you enter in the query are what get matched, so if somebody had not written that exact keyword in the description of their item, you wouldn't find it. That was a big limitation of Elasticsearch, although it was still working very well for us for 10 years. Our challenge was then to take this traditional architecture and add machine learning into it, and so now I'll pass the mic over to Tio. Thanks Ryan. So as Ryan mentioned, my name is Tio, and I'm going to be discussing the next part of our talk, which is the problem. As Ryan mentioned, we have this existing search architecture, but why do we need AI in this system? While this was very effective, and it was Mercari's search backbone for over 10 years, there were a few major shortcomings. Elasticsearch is a keyword-based system, and because of that it falls short on things like this: ambiguous keywords, semantics (say, "cool toys for boys"), and personalization. These are areas where AI can help, because with keyword-based matching you can't resolve any of these issues. Going back to ambiguous keywords: in Japan, "one piece" is a type of women's clothing, but it's also the name of a very popular Japanese manga franchise, so if you're just matching the query to items, maybe both get returned when you only wanted one or the other, very different, kind of item. Then semantics, again "cool toys for boys": maybe this One Piece backpack is a cool toy for boys, but that's not in the title or the description, so it will never come up. And then personalization: there's no real easy way to personalize results to a user based on the query alone. These are all ways that we knew AI-based methods could help improve the search that we had at Mercari. And so the question is, where do we start? We have an idea, let's do AI in search at Mercari, but how do we do it in a world that wasn't designed for it? The search architecture was originally designed without AI in mind at all, so there were no easy ways for us to integrate AI, and because Mercari is, again, Japan's largest consumer marketplace, at the scale we had we had very strict performance requirements: our latency budget was, in our case, just tens of milliseconds, which is really tough for machine learning models. We also didn't have any easy hooks for AI or optimal ways of serving these AI requests; that's not really built into the search architecture. And the final thing that we want to discuss: those were kind of infrastructural constraints,
but the main constraint we had was user search experience at all costs: no matter what we do, the user search experience must be as good or better at each step of the way, so we always want the situation on the right and not the one on the left, and that's exactly what we're trying to do by using AI to improve search results. This was the hardest constraint, but I think it was the one that served us best in delivering the actual business impact of our system, as opposed to just, say, model performance or something like that. So what is the first way that we decided to apply AI in search, within existing search infrastructure that didn't have AI? Enter search re-ranking. This was the original place where we felt AI would add the most value: the highest ROI with the simplest, least risky integration into our existing system. Our colleagues Alex and Norbert actually got into the ML side of this talk on, I think, Thursday morning, the second talk of the day. There are two aspects when you consider adding an ML system into an existing search architecture: the Elasticsearch result, the original result that came back, can be considered the first phase of retrieval, and now we're adding a second phase where we re-rank those results on top. Essentially the goal is: we have this set of results, can we surface the most relevant ones to our users first? Which is even more powerful in an e-commerce context: when you're on Google doing a search, maybe you have a little more patience to go to the second or third page of results, but for an e-commerce platform it's crucial to have the most relevant results first, which ultimately serves our broader purpose of matching users to the items they are looking for. So with the first concrete use case for AI in search in mind, we go to the next step, which is the evolution of this ML system: how do we grow an ML system while, again, running the business and ensuring a really high-quality user search experience? I want to emphasize that the focus is always on least risk and very iterative development, on the highest-ROI areas we see in each iteration, and then going from there. Given our existing infrastructure, the first thing we did was the simplest solution: if we have a machine learning model, we can integrate it directly within the search server. This had a lot of benefits, but there were a few drawbacks; for instance, we tightly coupled model development with development of the search service itself. The code base is massive, we have many developers working on it concurrently, as you would imagine for such a key part of the Mercari platform, and it has a very high release cadence, so we had to be very careful while developing. This whole journey was very measured: we wanted to make sure we could provide major features out of the gate while also preempting any major incidents, and in production we were running very reliably the whole time. Because of that we went very, very slowly at this stage, but we went in the right direction. And I do want to mention that, in keeping with the theme of this talk of open source, for each of these stages of system evolution we wanted to highlight one of the many libraries that brought us from that stage to the next.
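Before getting to the library, here is a minimal sketch of the two-phase flow just described: Elasticsearch does first-phase retrieval, then a learned model re-ranks the candidates. The index name, fields, feature function, and scoring model are all placeholders, not Mercari's actual setup.

```python
# A minimal sketch of two-phase search: first-phase term-based retrieval
# from Elasticsearch, second-phase re-ranking with an ML model.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder address

def featurize(query: str, item: dict) -> list[float]:
    # Placeholder feature vector; a real system uses many behavioural features.
    return [float(len(query)), float(item.get("price", 0))]

def search(query: str, model) -> list[dict]:
    # Phase 1: traditional keyword retrieval.
    hits = es.search(index="listings", query={"match": {"title": query}}, size=100)
    candidates = [h["_source"] for h in hits["hits"]["hits"]]
    # Phase 2: re-rank the retrieved candidates with a learned scorer.
    scores = model.predict([featurize(query, item) for item in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda pair: -pair[0])
    return [item for _, item in ranked]
```

The key property is that the ML model never replaces retrieval; it only reorders a bounded candidate set, which keeps the integration small and the risk low.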
And so in this case, our search server is actually written 100% in Go, so we're very limited in the libraries we can use, but we did find a very useful one called leaves. We developed an internal component that used this library to serve a LightGBM model directly in the Go code in the search server, and it was very simple to implement: as mentioned, it's just a function call to this component, which served the model. With that said, again, there were problems with this solution that we addressed in the next phase of the system, which was characterized by us really going all in on our microservices architecture. Mercari is very heavily invested in microservices, a very Kubernetes-first company, so it was a very natural choice for us to split the ML model out of the search server into a custom Python microservice for model serving, which was again very simple but unlocked huge benefits: now we could develop independently from the rest of the search team and that huge search code base. Originally it was just a function call to the model in the existing search server; now we simply replaced that function call with an RPC, and since it's going over the network, we have a small timeout, and we introduced the notion of a baseline response, which in this case is the Elasticsearch ranking. What's very useful when deploying ML systems to production is having some kind of really simple baseline that you can iterate on top of, and because we already had Elasticsearch results, this was the natural choice for us, and it went back to our "do no harm to users" tenet: by definition of how this re-ranking system was implemented, in the worst case it would do no worse than what was already in place. So again, this key aspect of having a baseline response helped us to iterate quickly in this stage and the next. We also took the time, as we were building out our POCs, to implement basic production features that would help us in the future, in this case metrics: we added more observability to our system, constantly paving the path to a more production-ready system with each iteration. That wasn't strictly necessary, but we knew it was a good thing to do at the time and very natural, and I want to emphasize these key points: we add the complexity when we realize there's an actual need. And in keeping with the OSS theme, because we moved to a Python microservice, this gave us a lot more flexibility, so for re-ranking we went with TensorFlow Ranking. TensorFlow Ranking, for those of you who aren't familiar, is an open source library made by Google; it's been open source for many years now, an industry standard in a lot of ways, high performance, it's been validated, and it has very active development, so again a natural choice for us going forward, and it really helped us get through this next stage successfully.
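Here is a minimal sketch of that baseline-response pattern: call the re-ranking microservice with a small timeout, and on any failure fall back to the original Elasticsearch order. The endpoint URL and response shape are my placeholders, not Mercari's API.

```python
# A minimal sketch of "RPC with a small timeout, fall back to the baseline":
# in the worst case the user sees the original Elasticsearch ranking.
import requests

RERANK_URL = "http://reranker.internal/v1/rerank"  # hypothetical endpoint

def rerank_with_fallback(query: str, es_results: list[dict]) -> list[dict]:
    try:
        resp = requests.post(
            RERANK_URL,
            json={"query": query, "items": es_results},
            timeout=0.05,  # tens of milliseconds, per the latency budget
        )
        resp.raise_for_status()
        order = resp.json()["order"]  # hypothetical response: list of indices
        return [es_results[i] for i in order]
    except requests.RequestException:
        # Fail-safe: do no worse than the existing Elasticsearch ranking.
        return es_results
```

This is what makes the "do no harm" tenet structural rather than aspirational: the re-ranker can only improve the order, never take search down.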
So with this being said, the system was still very basic, and there were still many ways to improve it, and this is the next solution that we settled on. As mentioned previously, with the model baked into the search server, features were computed within the search request, and we realized that a lot of that feature computation was very redundant. So now consider a feature store, like a distributed cache for the inputs to a model. What we iterated on, which is not shown on the slide, I'll go over really quickly, was again very basic next steps: instead of computing features within every search request, we can pre-compute them. In this case we first used Bigtable, which is cloud-managed on-disk storage by Google; that was a lot better, but it did not reliably meet our latency requirements at the time, so we then upgraded to Redis, which is an in-memory key-value store. It served the same purpose, we were able to keep the interface exactly the same, and that was the key to further improving the system's performance to meet the aggressive latency requirements we mentioned earlier. In addition, as mentioned before, we have timeouts on the re-ranking request, and we also have timeouts and fail-safes on this feature store component; we can get into the details a little later, but essentially with each abstraction and each layer of complexity we added, there was always a safeguard to fall back to the previous, simpler layer. And we kept improving our monitoring and observability suite along the way, figuring out which metrics we needed to track, both operationally and for model performance. It's worth emphasizing the major bottleneck we now had: I mentioned we were developing very quickly, and that meant we had to do a lot of A/B tests, which is essentially, you know, we have something in production, we run something new that receives a certain share of the traffic, and we see whether this new feature raises any of the key metrics that are related to the business. Because we were able to develop and test new features quickly, the bottleneck was no longer feature development, it was the A/B tests themselves: each new A/B test required a new model, which needed to be deployed as a new microservice and set up the same way as the original model, multiplied by the number of models in any given A/B test. That was actually taking maybe a full engineer's time per quarter, and even with that it would be one to two A/B tests maximum, so that became the next bottleneck in this phase of our system evolution. Before I go on to the next phase, I do want to highlight the next open source library we used that was very high value: the feature store settled on Redis as the actual backing technology, and so, very simply, go-redis, which is the Redis client for the Go language. There's nothing really special to say, which I guess is the main point: it just integrated out of the box really easily with our cloud-managed Redis instance, and it had really great performance, so again, thanks to open source, we didn't have to reinvent the wheel on this one either.
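Their search server uses go-redis from Go; the following is the same lookup-with-fail-safe pattern sketched in Python with redis-py, to keep one language across these examples. The key scheme, timeout value, and default features are assumptions.

```python
# A minimal sketch of a Redis-backed feature store lookup with a tight
# timeout and a fail-safe default, mirroring the pattern described above.
import json
import redis

r = redis.Redis(host="feature-store.internal", socket_timeout=0.01)  # ~10 ms cap

DEFAULT_FEATURES = {"click_through_rate": 0.0, "likes": 0}  # assumed defaults

def get_features(item_ids: list[str]) -> list[dict]:
    try:
        raw = r.mget([f"features:{i}" for i in item_ids])  # one round trip
        return [json.loads(v) if v else DEFAULT_FEATURES for v in raw]
    except redis.RedisError:
        # Safeguard: fall back to defaults rather than failing the search.
        return [DEFAULT_FEATURES for _ in item_ids]
```

As with the re-ranker timeout, the point is that every added layer has a cheaper layer to fall back to, so the feature store can never take a query down with it.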
And then, finishing up with the system's evolution, this is the final iteration; we have many more planned for the future, but this is where we're at right now, which is adding Seldon Core for serving and Istio for model routing, and together these alleviate the bottleneck we mentioned previously. Instead of having to set up new microservices and Kubernetes configurations for each model we're A/B testing, and then tear them back down again, we can automate all that manual work: the feature flags we mentioned earlier can be sent directly as-is, without us modifying any code, and Seldon, with Istio's help, routes those to the correct endpoints, so we can spin A/B tests up and down hopefully an order of magnitude faster. So again, iterative development: this, I think, will allow us to iterate even more quickly and help us not just develop the features but also A/B test them and release them, to improve the user search experience a lot more. In conclusion, we believe that ML-enhanced search really is worth the effort: if you have a traditional search architecture, ML can add a lot to it, and while I can't get into exact numbers to show how worthwhile the investment is, as Ryan mentioned earlier our annual net sales are on the order of a billion dollars, so even just a 1% improvement in that direction is pretty significant. And this goes back to our very iterative approach to this system: a 1% decrease would also be horrible, but because we went iteratively along the way, we were able to prove our system out, release it to production, and have positive lifts, so it was definitely well worth the investment in our case. And we would urge everyone here to consider a top-down integration of AI: while it would have been possible to totally rewrite this architecture and implement it from scratch, that would have been really dangerous, and I can't feasibly think of a way to do it that wouldn't have negatively affected our users. But with a top-down approach you can slowly integrate AI into search, and over time you have an AI-based search system that is essentially as if you had written it from scratch anyway. In doing this top-down integration, this iterative development, you can really balance the engineering-business trade-off: you can have very rapid iterations, start with the simplest, highest-ROI features first, and at each step lay just the minimum amount of groundwork necessary for the next stage. So don't over-engineer your systems, but do make them at least a little bit extensible, so you can evolve to overcome the next bottleneck that presents itself. And then I wanted to finish off with a quote that I personally really like, from the great John Maxwell: "one is too small a number to achieve greatness". I mention it because at Mercari one of our values is "all for one"; literally, it's all for one, it means teamwork, everyone helping each other. A system at this scale would not have been possible without the help of many, many teams across the organization, so I wanted to say thank you to them. And to give back, in keeping with that spirit, we're also building the system to serve search in general: it started with our smaller team within search, now many more search engineers can be empowered to add machine learning to the overall search system at Mercari, and once that's in place we are aiming for an organization-wide platform so that
AI can really permeate much more easily across the org and reach its way to all of our users. So again, all for one: one is too small a number to achieve greatness. And we wanted to thank the open source community: as we emphasized along the way, this wouldn't have been possible without open source. Search ML at Mercari would not have been possible, at least at this speed and with this quality, without the open source software that we use; it's almost entirely built on open source libraries. In keeping with that, we at Mercari also open source our internal tools and resources back to the community when possible, in addition to, as engineers, contributing PRs and bug fixes to the libraries we use. Mercari was founded on the premise of a circular economy where everyone can buy and sell, and we also believe in a circular development economy where anyone can contribute, and hopefully anyone can build these scalable ML systems without having to be at the scale of Mercari. So we wanted to give back and pay it forward. We believe open source was the key differentiator in our case, and we really hope the information in this presentation was valuable to the open source community present here today. Thank you for listening, and we are excited to answer any questions that everybody may have. Thank you. So, some questions for Ryan and Tio. Thanks for the sharing; I just have one question about the search method that you're implementing. Correct me if I'm wrong, but you're actually using the ML models to re-rank the results based on your original results, so does it mean that it will in most cases be slower than your original method without the ML models in the actual product? Yeah, exactly, 100%. And sorry, what was your name one more time? Billy? Yes, Billy, okay. So that was a great question; the short answer is yes, and that's what led to our strict latency requirements earlier: we did the bookkeeping and said this is the only permissible amount of latency that can be added to the existing system, on top of the current latency metrics that the system without AI was performing at. To add to that, we did some experiments to determine how much extra latency a user would tolerate, and we found it was pretty low, but not zero, so that was our opportunity to add re-ranking on top of the existing system. Would you mind sharing the common metrics you look at when you perform the A/B testing for a new model? That is a good question; in general, we could say our business metric was sales, so the more we sold, the better, and our product people were happy. Hopefully that answers the question. Yeah, I guess that testing, or that validation, you only get once you put it into production, right? So perhaps your question is how do you test that in advance, to know that you're not going to miss the mark once you roll the model out. Okay, I'll just pass it back. Hi, would you be able to share a little bit on the SRE side and all the performance redundancies that you had to put in place? Sure, we'd love to; is there a specific area you're considering? Yeah, I guess you probably have multiple Elasticsearch instances, and you will have multiple models, because if one of the models went down, I mean the containers that contain the model, then how does that look, how do you do the upgrades and things like that? Yeah, great
question. So we actually run everything in Kubernetes, including Elasticsearch, which is maybe a little non-standard these days, but yeah, everything is in Kubernetes. We do canary deployments and rolling updates of the models themselves, so if they break we can quickly roll back, and we use Datadog for monitoring our various metrics, latency, QPS, those kinds of things, and we get alerted, sometimes late at night, when we're not meeting our metrics, so it gives us a good incentive not to get too slow. Does that answer your question? Yes. So our main state is our feature store: we compute features about our listings and we store them, like Tio said, in an in-memory store. We update it regularly, so in addition to caching features, we directly compute them offline and write them to our store, and that is the only stateful part of our system. Thanks for the question. Thanks for sharing your experience of integrating your machine learning model with your search system; you mentioned you use microservices and you use Kubernetes, right? So would you please share more details about how you use Kubernetes to organize your microservices? Thanks. So that's a good question, and maybe a better question for our platform team, so I can only really give you, at a high level, how we use Kubernetes internally; is there a specific question within that that you were thinking of, a specific topic? My question is: you mentioned you use Kubernetes to organize your microservices, so would you please share more details on how you use Kubernetes to organize your microservices, for CI/CD, so you can quickly change your system and its implementation? Sure, I can try to give a high-level answer. At least from the ML productionization side of things, it's a very simple solution where we just create another set of Kubernetes manifests; we have an automated templating system internally that helps us do that. In our case, as mentioned earlier, when A/B testing these models there was a lot of manual setup of these manifests, and honestly a lot of it was copy and paste and just changing a few fields here and there, and that's the problem we were trying to solve. So the short answer, again maybe not at the right level of abstraction, is that for the ML models themselves, to create different microservices, we have a set of manifests, we use Kustomize on top of that, and then we just render a new set of manifests for that model. Thanks. I'll add to it maybe a little bit: like we said, originally the system wasn't designed for ML, so it was a very static system, endpoints don't change often, and most of the endpoints were hard-coded in various configuration files. Our entire Kubernetes configuration is a GitHub repo, and if you want to change the system you make a PR, so if you want to add a microservice, it's a PR. When we first started, we had to change almost every part of the system to manually encode where the endpoints are, and then we would have if-statements in our code to switch between which model to serve, and that wasn't scalable. So what we do now is use Istio for routing: we just put the feature flags into the headers, and with Istio we allow Seldon to route the request to the right model, and that allows us to serve a new model without changing any part of the system. The search server just calls the same endpoint every time, and depending on what the feature-flag headers are, Seldon decides which model is going to serve that request.
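Here is a minimal sketch of what that looks like from the caller's side: the search server always hits one endpoint and a header selects the variant, with the routing layer (Istio plus Seldon in their case) doing the rest. The header name and endpoint are hypothetical.

```python
# A minimal sketch of header-based A/B routing from the client's perspective:
# one fixed endpoint, one header; the mesh routes to the right model variant.
import requests

def rerank(query: str, items: list[dict], variant: str) -> dict:
    resp = requests.post(
        "http://ranking.internal/v1/rerank",        # same endpoint every time
        headers={"x-experiment-variant": variant},  # e.g. "control" or "model-b"
        json={"query": query, "items": items},
        timeout=0.05,
    )
    resp.raise_for_status()
    return resp.json()
```

The design payoff is that launching a new A/B test means deploying a model and a routing rule, not touching the search server or its hard-coded endpoints.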
May I ask, how does this search AI model in general support pagination? Sorry, can you say it one more time? Pagination. That's a good question. Again, we worked with the existing search system, so whatever pagination was used within the search server is what we re-ranked on top of; we can't get into too many details, but essentially we had to re-rank with that pagination in mind, and there was nothing special on top of that. A short answer, but hopefully it answers the question. I just want to ask more regarding the feature store: can you elaborate on how you guys are using it? Is it used to save the results of feature extraction from item descriptions, and do you use it mainly for model training, or where does it come into play, how does data flow in this system? I can answer at a high level: the short answer to those questions is yes to all of them. Primarily, in the very beginning, features of the model were being computed within the search workflow itself, and some of those were already just there; say, let's just pretend, item name: we don't need to do any processing for that, it's just there. But there are other features, say clicks on an item, let's just pretend that's one of the features in the feature store, and those were the things where we realized, oh, that is infeasible to calculate within a search request, but you can easily look it up if you have a feature store. That's how it started, and then, yes to all the other questions you had, we are evolving it to serve many more complicated features that have to do with both items and users. I'll elaborate a little bit: the first features we get back are what we get back from Elasticsearch, things about the item: what's the description, how much does it cost, how many people have clicked or liked it, things like that. And there are some things we can't calculate just from the item itself, like the click-through rate that Tio mentioned, or personalization, like what this user prefers; the things we can't calculate just from Elasticsearch we pre-calculate, and it's the same things that we feed to our model, of course. At query time we look those up in the feature store, take the things we get back from Elasticsearch together with the features that we pre-computed, feed them into the model, and we get a re-ranking. Okay, I think we're up on time, but thank you so much, Ryan and Tio.

All right. Thank you so much. This talk is about how we have built out scalable real-time machine learning models at Episodes. Episodes works in healthcare analytics, and I'm a part of Episodes. In this talk we'll focus on how, specifically for a near-real-time machine learning system, we have built out pre-computation feature pipelines, and on the ML serving platform that we have used. Here's a little about me before we move on: I'm a lead data engineer at Episodes, and so far in my career I have worked on building distributed data processing pipelines and machine learning systems, and on setting up MLOps practices in the team. The agenda of today's talk is that we'll start with the challenges we faced in building one of the recent ML pipelines for our team and for Episodes, and how that led to the idea of building
out a pre-computation feature platform, delving into which ML serving platform we are using, and then ultimately giving a complete overview of the ML model serving pipeline at Episodes. Okay, so this ML pipeline that we are talking about started when the data science team at our organization came up with the problem statement of deploying transformer models, so this talk will specifically focus on how we deploy complex models. When we say complex models: they introduced transformer models, we picked such a model up and deployed it into our then-current ML serving pipeline, which was more of a real-time API system, and what we saw was that we were facing high latencies. When we say high latencies, they were more than a couple of seconds, and if you're in an API type of framework with latencies of more than seconds, that is a problem, and you can't take such a model to production. And when it comes to complex models, an essential ingredient is that you have to use GPUs, and GPUs are really expensive. So to the first point, the high latency: if I am getting high latency, one simple solution is to add more replicas to my API and make the system more scalable, but when we are talking about GPUs, we cannot scale GPUs directly from, say, tens to hundreds just to serve more load, because that would blow the budget of our whole computing spend. So that was the first problem: we were seeing high latencies, and we wanted to scale up to the incoming load and bring the serving time back down into milliseconds. The next problem that came along was that the model sizes were too large; they were in the GBs, so for one model they came up with, the size of the model itself was going up to 4 to 5 GB, and the usual way of deploying machine learning models into production, at least at our organization, is wrapping them in containers. Once your model is wrapped in a container, the whole Docker image size goes up to around 10 GB. So if we are talking about scaling the API replicas: first I have to scale the GPU machines, and if you have worked on cloud, you've seen that even scaling from 1 to 2 GPU machines, the time it takes to get a GPU machine is approximately in the range of 4 to 5 minutes, and then if your Docker container image size is around 10 GB, the time just to bring up the container will take an additional 3 to 4 minutes. That means that for me, the time to bring up one more replica was around 9 minutes. And at Episodes we always believe in scaling up the infrastructure as per the load itself, so we don't over-scale and we always keep architectures cost-optimized; we have to make sure that whenever the load comes in, first we do the calculation and then scale the instances accordingly. So if I am seeing that to bring up one replica I need 9 minutes, that additionally adds to the first problem, which is the high latency for serving the requests. And in the whole NLP engine, there are multiple machine learning models working together to produce the outcome, so we have to make sure that we are not losing data across the models when a request or a piece of data goes from one model to another. One required feature of the
whole machine learning pipeline is data lineage management: whenever a request goes from model 1 to model 2, we should know what the state of the request was, whether it succeeded or failed. So this was the last challenge we had in designing our ML pipeline. Now, the first problem is high latency for inference tasks, so we took a deep dive into what happens in an inference task. This is just a general overview: whenever we have an inference task, we get data; since we are on AWS cloud, I have shown an S3 bucket for that. We get the data in an S3 bucket, we pick up that data, we do data preprocessing, and from that we go to feature generation; we load the model, load the features into the model, do the input processing, and then ultimately the model gives out a prediction, and we do post-processing and upload the prediction into an output bucket. Those were the broad steps of an inference task, and what we noticed is that the first two tasks, data preprocessing and feature generation, were both CPU-based tasks, and still they were happening on a GPU-based machine, taking approximately 30-40% of the total time of the inference task. So the first conclusion we came to after this deep dive was that we had to remove the first two steps from the inference task and make them a pre-step to it, and that gave rise to the idea of creating a pre-computation feature platform: whenever data comes in, first we do the feature computation, and only then send the request to the model and get the prediction. So first let's talk about the benefits of the pre-computation feature platform. Since we have multiple models working together to produce the predictions on a piece of data, there are also cases where these models share a few features, so we have reduced the computation load there: as with the feature store mentioned in the last talk, we first generate all the features at once, store them there, and then each model retrieves whichever particular feature it requires, as and when needed. So it has overall reduced the computation load and the resources needed for each request, and since we are pre-cleaning our data, it has improved accuracy on the model side. The next benefit is increased scalability, because we are now reusing features across the models, not regenerating them every time we send a prediction request to a model. And in terms of inference time, since we have moved this whole process of generating features out of the inference task, we are directly saving 30-40% of inference time, which means that we are actually saving 30-40% of our GPU time and GPU cost. So now we have moved the feature generation part out to a feature platform, and the next step is deciding what our ML serving platform will be. Even with the whole architecture that we'll see at the end, the latency was still coming in at around one second or so for this complex model, not a few milliseconds, so we were very clear that we couldn't go with the real-time paradigm; at this point we were clear that we had to go for an async type of paradigm for our ML serving platform.
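Here is a minimal sketch of that split, with the CPU-bound preprocessing and feature generation pulled out of the GPU inference task; the bucket names, key scheme, and feature logic are placeholders, not Episodes' actual pipeline.

```python
# A minimal sketch of pre-computing features as a CPU-only pre-step, so the
# GPU inference task only loads ready-made features. Names are placeholders.
import json
import boto3

s3 = boto3.client("s3")

def precompute_features(bucket: str, key: str) -> str:
    # CPU-only step: read the raw record, clean it, and generate features.
    raw = json.loads(s3.get_object(Bucket=bucket, Key=key)["Body"].read())
    text = raw.get("text", "")
    features = {"text_len": len(text), "tokens": text.split()}  # toy features
    out_key = f"features/{key}"
    s3.put_object(Bucket=bucket, Key=out_key, Body=json.dumps(features))
    # The GPU inference task now only loads out_key; no preprocessing on GPU.
    return out_key
```

Moving this step off the GPU machine is exactly where the claimed 30-40% of inference time, and hence GPU cost, is recovered.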
So after performing experiments on the open source Cortex, Seldon Core, and AWS SageMaker, we came to the conclusion that AWS SageMaker fit the whole problem area we were trying to solve. We also believe in keeping our team lean, so what we wanted was a managed service, so that our engineers spend less time maintaining the infrastructure and more time innovating on top of it. What we needed in our ML serving platform was, first, support for complex, heavy models: it should be able to store the models outside the container and help us retrieve them from there. SageMaker has its own model registry where you can keep your model outside the container, so the container size, or the Docker image size, no longer includes the model size, which is an advantage here. Second, it should tolerate high-latency inferences: for this we were sure we couldn't do it with real-time APIs; the real-time API of SageMaker was also giving us results, but we always believe in scaling from 0 to 1 whenever there is load, and not keeping resources up when there is no load, and SageMaker asynchronous endpoints help us there as well. With SageMaker asynchronous endpoints, the way it works is that it has its own internal queue: you just invoke the asynchronous endpoint, it saves your request, does the inference, and then saves your output to an S3 bucket; and for scaling out, you just have to define your own auto-scaling parameters. In terms of data lineage management, there was no direct support from the SageMaker side, so we built our own pipeline to support it, which we'll see on the next slide. Scalability, reliability, and security are all guaranteed under AWS service usage, and in terms of cost optimization, specifically with the asynchronous type of endpoints in SageMaker, you have the flexibility of going from 0 to 1: suppose there is no load at this point in time, then you can keep your instances at 0, and whenever there is load, since the star feature of asynchronous endpoints is their managed internal queue, your request first goes into the queue, SageMaker brings up a GPU instance depending on how you have defined your auto-scaling policies, and then your request is served. Even here, the whole GPU provisioning time is around 4 minutes, because that's just how cloud provisioning works, but you have the guarantee that your request is saved until the time it is served, or inferred. Okay, so this is the complete overview of the ML model serving pipeline that we have built at Episodes. We were talking about the feature store and building out a pre-computation feature platform: here you can see that we are using Apache Kafka to save the data event points, we then consume those data event points to compute the features from them, and ultimately we save the feature events to a feature store, or directly to the S3 bucket. Once the features are pre-generated, we have a Lambda integrated to listen to the Kafka topic; it picks up the data points from there and pings the Amazon SageMaker asynchronous endpoints.
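For concreteness, this is roughly the call that Kafka-listening Lambda would make; invoke_endpoint_async is the actual boto3 operation for SageMaker asynchronous endpoints, while the endpoint name and S3 paths are my placeholders.

```python
# A minimal sketch of a Lambda handler invoking a SageMaker asynchronous
# endpoint. Endpoint name and S3 locations are placeholders.
import boto3

runtime = boto3.client("sagemaker-runtime")

def handler(event, context):
    # Async endpoints take an S3 input location rather than an inline body,
    # so the request payload is assumed to be staged in S3 already.
    resp = runtime.invoke_endpoint_async(
        EndpointName="nlp-transformer-async",               # hypothetical
        InputLocation="s3://ml-requests/incoming/req-123.json",  # hypothetical
        ContentType="application/json",
    )
    # SageMaker queues the request internally, writes the inference output to
    # the endpoint's configured S3 output path, and publishes status changes
    # to the configured SNS topics.
    return {"inference_id": resp["InferenceId"], "output": resp["OutputLocation"]}
```

The returned InferenceId is what you can later correlate with the success or failure notifications on the SNS topic, which is how the data lineage tracking described next fits in.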
With the SageMaker async endpoints, you have the flexibility of delivering your result status to SNS topics: say you are sending request X to endpoint A; whenever that request fails or succeeds, whenever its state changes, you get a notification on an SNS topic. So for the data lineage management part, you are getting all the information on that SNS topic, and we can directly consume it, pull out the information, and drop it into a database, so that our machine learning engineers can directly query the database to know what the status of their request was. In terms of near real time: nothing is happening in real time, but what we have kept is that the whole architecture is an event-driven architecture, so as soon as your data is dropped on the event topic, the whole pipeline is triggered for that particular message; that's why it's a near-real-time architecture. And in terms of data storage, SageMaker async itself uploads your inference output to an S3 bucket, so that's how we are dealing with the data. That's the complete overview of how we have built out the ML serving pipeline at Episodes. Thank you. Okay, great, thank you so much. We're just coming up on time, so I think that was a good walkthrough of your technology approach to this challenge. Just before we step out, though