Today I'm going to talk about Wasm for AI infrastructure: using Wasm to run lightweight, fast, and secure AI inference. The GitHub repo shown is our project, a CNCF sandbox project called WasmEdge.

I'm going to cover several topics, but first a little about myself and the project. I actually just flew in from Tokyo, where the Open Source Summit was held last week. I gave a talk there too, and it was one of the highlighted talks, so it drew a big audience — it was quite intense, and we met a lot of developers who have been using our runtime. I mostly organize developer meetups and conferences, I'm one of the CNCF ambassadors, and I sometimes write technical blogs and documentation and do Chinese-English translation. The Japan photo you see is from the talk I just mentioned at the Open Source Summit Japan last week. We also ran meetups about the Rust language and WebAssembly in Singapore: one in late September, and another in Singapore last December.

The WasmEdge runtime brings Wasm to the server side. It is a highly efficient, lightweight sandbox, like a container: you can think of it as Docker, but much lighter, and it can run side by side with Docker or even inside Docker. In some use cases it can replace Linux containers, because it is much more lightweight and has essentially zero startup time. These are some of the partners we have been working with in the ecosystem. The screenshot shows that Oracle is also using us — we didn't know, because we are open source software and they had been using it without telling us. We only found out at KubeCon in Chicago several months ago.

Now, on to what we want to talk about today. When we want to run AI today, we are used to doing it with Python and Docker. Python has been really popular because it is easy to learn and use and has a very big community; I saw some attendees today wearing PyCon t-shirts. And Docker has been established for almost 10 years, with very mature tooling, and it is portable and scalable. That's why everyone has been choosing Python and Docker.

But Python does have its limitations. The screenshot on the right is from a paper published in Science asking what will drive computer performance after Moore's law comes to an end: for the past four decades, people optimized the semiconductors — the hardware — without caring too much about the performance of software. As the benchmark there shows, Python takes far more time than the other languages, so performance has been a real bottleneck. Parallelism is another problem: the GIL ensures that only one thread executes Python bytecode at a time. Memory management can also be difficult, and the many Python dependencies take up a lot of space.
So how about Python plus C and C++? The portability issue can still be big, and there is maintenance cost: Python interacts with native libraries and system-level dependencies differently across environments, so you have to do a lot of configuration whenever you move to another environment. The integration itself is also complicated: when you glue Python to another language like C or C++, there is a lot of management of data types and memory allocation, and a lot of error handling. With tools like pybind11, this gluing process can cause real trouble, and the developer needs a deep understanding of both Python and the other language, which increases development time and risk.

Chris Albon, who leads machine learning at the Wikimedia Foundation, said he has had a hard time just installing Python. Greg Brockman — he quit OpenAI but was then rehired; he is a co-founder of OpenAI — said that machine learning engineering has to tackle this problem: making Python not your bottleneck. The person quoted on the right likewise said he has spent far too much time just installing Python, and it shouldn't be like that. And I'm not sure if you have heard of Mojo, a new language created by Chris Lattner, the creator of LLVM and the Swift programming language; he wants Mojo to replace Python, because Python has been such a headache.

There are also limitations with Linux containers and Docker: a long cold-start time; a big disk footprint, often several gigabytes; lacking hardware-accelerator support; portability problems, since images differ across CPU architectures; and security, because containers rely on the user permissions of the host operating system, which can introduce risks.

So here is what we want to talk about today: we think Rust plus WebAssembly is the right solution to go for. On the right is Elon Musk saying that Rust will be the language of the AGI age. You might dislike him, but that's how Elon thinks: Rust will be really important. Rust gives you performance and memory safety — with Rust, many classes of bugs simply won't compile, so it is very safe. It also has Cargo, a modern package-management tool, and a rapidly growing ecosystem. We host a lot of Rust events in China and in other places, even in Silicon Valley, and we are seeing the younger generation of developers getting really passionate about the Rust programming language.

The image on the right was actually generated by AI, so it's not that accurate — you can see gibberish on the guy's t-shirt — but it makes my point: Python dependencies can be too heavy, and if you are running AI inference, you should use Rust and WebAssembly. On the right is what we did: running a large language model on my own Mac with a 2 MB inference app written in Rust and compiled to Wasm. I will explain that later. First, you can see that an application written for Wasm — a Redis app, for example — can be much smaller, about 100 times smaller, only 0.7 megabytes, than the equivalent Linux-container Redis app, which could easily be 50-plus megabytes. Likewise, the PostgreSQL app on the right, running inside the Wasm container, can be only 0.8 megabytes, and it starts in milliseconds instead of seconds, compared with around 50 megabytes for a Linux container.
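As context for how such tiny binaries are produced, here is a minimal sketch of the typical Rust-to-Wasm workflow, assuming a standard rustup/cargo toolchain of that time and the wasmedge CLI already installed; `my_app` is a placeholder crate name, not from the talk:

```bash
# Add the WASI compilation target to the Rust toolchain
rustup target add wasm32-wasi

# Build the crate as a .wasm binary instead of a native executable
cargo build --target wasm32-wasi --release

# Run the resulting file inside the WasmEdge sandbox
wasmedge target/wasm32-wasi/release/my_app.wasm
```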
At the link on this slide you can see a more comprehensive comparison; I won't get into too much detail here.

We think of the technology path of virtualization like this: the first generation was VMware — the age of hypervisors and micro-VMs — and then we evolved to application containers like Docker. So what's next? We think it will be high-level language virtual machines like V8 and WebAssembly. The sandbox mechanism provides a much safer production environment and protects user data and system resources; bytecode verification prevents malicious code; and execution is isolated in the sandbox, even between modules. So it can be really safe and still very high performance.

If the CNCF annual survey 2022 has just one key finding, it is that WebAssembly is the future — and we think it is the next wave of cloud computing. About three months ago this year, together with CNCF, we also published a dedicated landscape for WebAssembly. It is an ecosystem of many projects with a $58.4 billion valuation: there are languages, runtimes, application frameworks, edge platforms, and more, all for WebAssembly. It's too big for me to display in full.

Another survey, from this year, 2023, found that many developers are choosing WebAssembly for its faster loading time, among other advantages. On the right is WASI, the WebAssembly System Interface, which allows a wasm file to access system resources so that you can use it on the server side, not just in the browser. When you mention WebAssembly, people always assume it is only used in the browser, but we think it will be huge on the server in the future. Already 34% of respondents are using it, and another 34% are considering using it in the next 12 months.

This is the founder of Docker, Solomon Hykes. In 2019 he said that if Wasm plus WASI had existed in 2008, there would have been no need to create Docker. Fast forward several years to 2022, when the news came out that Wasm had been officially integrated into Docker: he saw it launching and maturing, and said again that with Linux containers, Windows containers, and Wasm containers, OCI can package all the needed software — he thinks it has a really bright future. The WasmEdge runtime is officially integrated into Docker Desktop, and if you use Docker to run a wasm file, it will automatically run inside the WasmEdge sandbox.

We think there are specific use cases where WebAssembly is preferred: IoT, microservices, SaaS plug-ins, the browser — and what we want to talk about today, AI inference. With WebAssembly you can run a large language model on your own Mac or on IoT devices, and it is fully portable: you only have to write the code once. These are just a few of the models we support now, including the Llama 2 series. The advantages: automatic GPU detection; the inference app, as I mentioned, is a single 2 MB cross-platform binary; and the result is 20.54 tokens per second on an M1 MacBook, fully self-hosted, without any Python dependency. As I said, it takes only four commands. First, install WasmEdge with LLM support. Second, download an LLM chat app — the app is itself a wasm file, a 2 MB cross-platform binary compiled from Rust, and you can also build your own. Third, download the Llama 2 7B chat model; the model file might be five gigabytes, so your computer needs the space for it. Fourth, chat with the model on the CLI.
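Here is a hedged sketch of those four commands, based on the second-state/llama-utils quick start from around the time of this talk; the exact URLs, file names, and flags may have moved or changed since:

```bash
# 1. Install WasmEdge with the GGML (LLM) plugin
curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install.sh \
  | bash -s -- --plugin wasi_nn-ggml

# 2. Download the portable chat app (a ~2 MB wasm binary compiled from Rust)
curl -LO https://github.com/second-state/llama-utils/raw/main/chat/llama-chat.wasm

# 3. Download the Llama 2 7B chat model in GGUF format (~5 GB)
curl -LO https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q5_K_M.gguf

# 4. Chat with the model on the CLI
wasmedge --dir .:. \
  --nn-preload default:GGML:AUTO:llama-2-7b-chat.Q5_K_M.gguf \
  llama-chat.wasm -p llama-2-chat
```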
It's super simple — some Japanese developers found it on their own and got it running in a very short time. You can just copy the commands, as I've shown, and run them.

You can even create an OpenAI-compatible API service: the API server is completely compatible with the OpenAI API, so in any application you write, you can simply replace the OpenAI API with your own self-hosted large language model — it can even be your own fine-tuned model. There are a few commands to create such an API server; a hedged sketch of them appears at the end of this part of the talk.

Here is a demo. That is the chat wasm file plus the GGUF model file. It loads the model into memory, so the first question usually takes longer than the follow-up questions, because the entire model has to be loaded first. This answer is long because the question was about Chinese cuisine, and you can see the speed is about 20 tokens per second — when humans talk, it's usually about two or three tokens per second. I won't play the entire video.

There is an even simpler method: just one command. You run it in your terminal, it automatically lets you choose the model you want to test, and you have it on your own computer, completely offline. It runs across all different devices — different operating systems, GPUs, and CPUs — and it automatically detects the accelerator and uses it. A community member wrote an article on how to run WasmEdge on RHEL 9 in just a few steps; you can find the details at the link at the bottom.

Beyond language AI: we say Wasm is very good for AI inference, and that is not only about running LLM tasks; it can run vision and audio inference as well. The WasmEdge runtime supports PyTorch, TensorFlow, OpenVINO, and other AI frameworks. Using YoMo, a real-time data-stream framework, this is actually used in production in a factory: the factory uses it to detect in real time, as each product comes down the belt, whether it is a faulty product or not. With Wasm it can be processed really fast, and the AI recognition is much faster than the traditional solution. It also supports the popular OpenCV and FFmpeg libraries for image processing, and you can build vision and audio AI projects using mediapipe-rs, a Rust library for MediaPipe tasks: it rewrites the Python parts in Rust, so it has much better performance. It is very simple to use, with APIs similar to MediaPipe's Python ones; it has really low overhead, with no unnecessary data copies or allocations; and it is really flexible, so users can use custom media bytes as inputs.
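And here is the promised sketch of the OpenAI-compatible API server commands, again following the second-state/llama-utils examples of the time; the URLs, the default port, and the model name are assumptions on my part:

```bash
# Download the OpenAI-compatible API server (also a wasm binary)
curl -LO https://github.com/second-state/llama-utils/raw/main/api-server/llama-api-server.wasm

# Serve the model over an OpenAI-style HTTP API
wasmedge --dir .:. \
  --nn-preload default:GGML:AUTO:llama-2-7b-chat.Q5_K_M.gguf \
  llama-api-server.wasm -p llama-2-chat

# Point any OpenAI client at the self-hosted endpoint instead
curl http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "llama-2-7b-chat", "messages": [{"role": "user", "content": "What is WasmEdge?"}]}'
```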
Now I will show you how to build a serverless AI app in just several minutes with Rust and Wasm. We are going to take advantage of a serverless platform, flows.network, that connects large language models with SaaS tools like GitHub, Discord, Telegram, and Slack. Being serverless means that developers only need to focus on business logic; there is no need to compile or deploy the Rust functions yourself.

This is how it works. We are going to show how to create a "learn Rust" bot — a programming-language learning bot. At the core is an LLM bot, which we can also call an agent. First you ask a question; the agent turns the question into an embedding vector, searches for similar embeddings, and the relevant text is returned to the bot. The bot then uses that relevant text as context when passing the user's question to the connected large language model — ChatGPT, or your own self-hosted one — and finally passes the model's response back to the user. This is an example bot we have created; you can use it to learn the Rust language, and you can try it out. But now we are going to show how it's built.

First, you need to create embeddings for your data — whatever data you want to feed the bot. In this case we fed the app authoritative Rust learning materials: the official documentation and other high-quality Rust-related knowledge. Second, fork the demo repo on the platform we mentioned, to get a webhook that can embed and store our data in a vector database. Third, upload your prepared text chunks to the vector DB and name your data, with one command line — see the hedged sketch below.

The demo on the right shows what you need to do from step two: fork the RAG embeddings demo repo on the platform, so you don't have to write the Rust code yourself; connect your OpenAI API key; and once it is built, you get the webhook URL. Then open a terminal to add the embeddings to the vector database: you can name your embeddings by changing the default name, and use the webhook URL we just got in place of the original one. Now you have successfully created the embeddings.
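The step-three upload can look roughly like this; the webhook URL comes from the platform after you fork the repo, and the collection name and file name here are placeholders I made up for illustration:

```bash
# Webhook URL issued by flows.network after forking the demo repo (placeholder)
WEBHOOK="https://code.flows.network/webhook/<your-webhook-id>"

# Upload the prepared text chunks and name the collection in one command
curl "$WEBHOOK?collection_name=learn_rust" \
  -X POST \
  -H "Content-Type: text/plain" \
  --data-binary @rust_learning_materials.txt
```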
The second step — there are only two steps in total — is to build the agent with the embeddings we just created. It is also really simple: you import the RAG Discord bot demo repo into flows.network for deployment and configure the following five environment variables. Since this is a Discord bot, some of those variables come from Discord. It looks a little small on the slide, but: you give the bot a name, go to the Bot tab, and turn on the three intents; then you go to the OAuth2 tab and its URL generator, and use the generated link to invite the bot to your server. The bot is offline at first, so you go back to the Discord developer portal to get the Discord token and put it into the environment variables: copy and paste the Discord token, then the bot ID. You add the variables one by one by clicking the Add button — the purple one on the right. The collection name here is the one you entered earlier; you can name it yourself. You also need to set up the remaining configuration, like customizing your own prompt and entering it as an environment variable; it is all pretty intuitive. Then you enter your OpenAI API key and click Deploy. It deploys, then builds, and when the status turns to deployed and then running, your bot is online. You now have a RAG-based ChatGPT bot. It's super simple — you can get it done within five minutes — and you can ask this bot questions: it is powered by ChatGPT, but it is also fed the Rust-programming knowledge you gave it by creating the embeddings, all built on the flows.network platform.

Another use case is building a code-review bot. You can likewise connect any large language model and deploy it on your GitHub repo; the slide shows the result. Once you have this bot on your repo, every time a PR or a commit is made, you can see very quickly and clearly what changed, along with a summary. It is also just a few steps: load the bot template; optionally customize the template by tweaking the Rust function, since the logic is written in Rust; give the bot your OpenAI API key; and grant the bot access to GitHub. The very detailed tutorial is at the link at the bottom, on the CNCF blog.

If you want to see the structure, check out these two images. The one on the left is actually from a paper by Lilian Weng of OpenAI. She describes an LLM application — which can also be called an agent — as acting like a brain: you give the brain memory, it does planning, it utilizes the tools it can access, and then it takes action. The memory part is the embeddings we created for the agent, or bot. On the right, you can see that the function written in Rust is the logic you want the bot to carry out — for example, reviewing your new PR or commit. The tools and actions on the left map to the GitHub service and the GitHub-integration Rust crate; the bot also takes advantage of the memory, which is a vector database with its own integration; and the planning is done by utilizing the ChatGPT service to give correct answers to whoever asks the bot questions. The second image shows a similar structure, except that the entire Rust function is compiled to Wasm and runs in the Wasm runtime, so it has very good performance, just as I mentioned in the first part of my talk.

I guess that would be it. You can try it out: run a self-hosted large language model on your own computer, and write a large-language-model app based on the framework and resources I provided. This is the open source project WasmEdge's GitHub link; you can find it here. These are my Twitter and GitHub accounts — you can always talk to me — and you can join our Discord to talk with other community members and join our monthly community meeting. Thank you.