All right, so the title is "Wasm as a Runtime for LLMs," because I submitted the talk with a mild title so that it would get accepted. The real title is the one on this slide. As Liam said, I've been attending Wasm Day for three years now, every single KubeCon, so that's six KubeCons already. People have always asked: what's the killer application of Wasm? When we started, it was web applications, compiling C programs to run in the browser, Adobe Photoshop on the web. Then it was blockchain applications, smart contracts and all that, which has actually been highly successful, but it's a little far from the mainstream. So we're still searching for the killer application of Wasm. What I have come to realize over the past year or so is that Wasm is not a Docker replacement, even though we had the Docker plus Wasm announcement last year at KubeCon in Detroit. It is not a Go replacement. It is not there to wholesale replace microservices or serverless functions; that's not the goal. I have come to a rather strong conviction: it is a Python replacement. And I'll tell you why, and I hope I'll make it entertaining.

Before we even start, there's a big issue in the industry: large language models don't know Wasm. I asked Llama 2, "what is Wasm," and it gave an answer that is largely correct, but mostly geared toward web applications. It's not even at blockchain yet; it's still in the web space. If I ask it a more complex question, it gives an outrageous answer. So, what is WASI-NN? We have heard about WASI-NN today; it's the WebAssembly System Interface for Neural Networks. But the model says it's a "wide-angle spectral image neural network." I actually searched Google to see whether such a thing exists, whether it learned that somewhere. It doesn't exist; the model completely made it up. So it doesn't know Wasm, and it doesn't know WASI-NN.

That left me a little depressed, because every other company, every other subject area, has by now trained its own large language model to answer questions the way they like. So right before the conference, I decided to train a new large language model called Lava — it's Llama for Wasm. Basically, I wrote up fewer than a hundred question-and-answer pairs, instruction-tuned Llama 2 with them, and let it compute for a while. So my first demo is how to do AI inference with this large language model. It's the same size as Llama 2 7B, so about five gigabytes.

Let's go through the steps; they are really easy. The first is to install WasmEdge on your device. It's one line of command, an easy install. It installs on a Mac, on a PC, on a Raspberry Pi — although the Raspberry Pi does inference a little slowly — and I even installed it on a Jetson Nano, which is NVIDIA's IoT device. So it's one line of command; you install it. Then you copy over the inference app. I'm going to get into more details about this, but I think this really achieves what we set out to do all those years ago with Java: write once, run anywhere. With Java, it has only ever been about operating systems and CPUs. With Wasm, as we will see, we are now expanding to GPUs and other accelerators as well.
So I have a single compiled Wasm application. It's written in Rust, and the whole compiled artifact is less than two megabytes. And I deliberately say you don't install the application, you copy it over: you just download it, put it anywhere on your hard drive, and you can run it. And as I'll show you, it can take full advantage of whatever accelerator you have on your device. Once you've copied it over, you download the LLM. Here's a link to download the Wasm-instruction-tuned Llama 2 model; we call it Lava. In fact, if you follow Twitter, since Meta launched Llama 2 about three months ago — I can't believe it's only been three months — there are over 2,000 different fine-tuned Llama models available in the marketplace. You see new ones every single day, multiple times a day, claiming to reach state-of-the-art performance. Here are just some examples we can run with WasmEdge: what people call the truly open-source model, Mistral; AWS's open-source MistralLite; TinyLlama, which is a smaller one; OpenChat, which is comparable with ChatGPT 3.5; all that stuff. So there's a large variety of models, but what I want to demonstrate today is running an instruction-tuned Llama model with Wasm knowledge.

That's all there is to it. If you're sitting here with your laptop, I highly encourage you to just go through those steps, because the longest step is the last one: downloading the large language model, which is five gigabytes. I think the conference Wi-Fi is pretty fast — it goes at about 10 megabytes per second — so within 10 or 15 minutes you'd have a large language model sitting on your own computer, and you'd be able to do the same thing I'm going to do in a minute.

So let's look at the example. I have already downloaded the large language model and I have WasmEdge installed; those are the two prerequisites. And I have the llama-chat.wasm file — I'll show you the source code in a minute, but that's the cross-platform Wasm binary that I put on my Mac. All of this is running on my Mac. Unlike in my previous demos, where I was always nervous about whether the demo would work, this time I'm not, because nothing in this demo is online. Everything is local, all on the same device. It prints out some information about my Mac, and now I ask the same question I asked before: what is Wasm? Just keep in mind this is a five-gigabyte model, so WasmEdge needs to read it into memory, and then it answers the question by doing the inference on the GPU. It can only achieve this speed by using the GPU — you can see the words being typed out faster than I can speak. You can see it first gives a short answer, and the short answer is largely correct: it talks about the virtual machine aspect of Wasm, about the instruction set, and it doesn't really talk about web applications anymore. So while that finishes, I'll go ahead and start the next question: what is WASI-NN? Now it can answer WASI-NN accurately too: it's a system interface for neural networks, and it allows Wasm apps to use and integrate with different neural network frameworks.
And in the demo you're seeing right now — we have integration with TensorFlow, we have integration with PyTorch, we can integrate with OpenVINO, as Andrew just demoed. Here you're seeing integration with a neural network framework called GGML, which is a new inference framework designed for the Llama 2 family of large language models. Then we can ask which runtimes support it. This is just to show that it's conversational: if I only asked the model "which runtimes support it," it would have no idea what I'm asking, because what does "it" refer to? But with the context of this conversation, it knows "it" could refer to Wasm or to WASI-NN, and it picks WASI-NN and says the Wasm runtimes that support it are WasmEdge, Wasmtime, and so on. Obviously I'm biased, because I wrote the hundred questions that trained this. Let me ask one last question, which is slightly longer: how do I run AI workloads in serverless functions? The startup time is a little slow, because every time it needs to load the five-gigabyte large language model into memory, but once it starts to do inference you see the streaming text, which becomes really fast.

So that's it for this example. What I'm trying to say is, you don't have to use this model. This model is sort of a toy; it's what I trained with my own hundred questions over a weekend. Go get one of the state-of-the-art models instead — there are a ton of them, and they're coming out every single day. Some do role-playing, some do medical Q&A, some do education, some do content summarization. There's a lot you can play with on your own computer using Wasm.

So let me continue my presentation. Now we've seen that on the command line I can run a large language model, ask it questions, and chat with it. I want to show you what the code looks like, because you might think something like this has to be very complex — and you say you wrote it in Rust, isn't Rust a difficult language? I would say the core part of the application fits on one PPT slide, in a font large enough that you can read it from the back of the room. This is how simple it is, and that's largely thanks to the WASI-NN API design. Basically, you load the model, create an execution context, pass in the text you want to run inference on, the result comes back in a buffer, you decode the buffer, and you have your answer. So it's a very, very simple application. I also want to point out that on line two the execution target is "auto," which means it automatically searches for whatever accelerator is available on the device. If I run the same application on an NVIDIA GPU device, it's going to go a lot faster than on my Mac; if I run it on an older Mac that doesn't have a GPU, it's going to go a lot slower.
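For reference, here is a minimal sketch of that kind of inference code, assuming the wasi-nn Rust crate's GraphBuilder API with the GGML backend. The model alias ("default", as set by a runtime preload flag such as WasmEdge's --nn-preload), the prompt, and the output buffer size are illustrative assumptions, not the exact code on the slide:

```rust
use wasi_nn::{ExecutionTarget, GraphBuilder, GraphEncoding, TensorType};

fn main() {
    // Load the GGML/GGUF model preloaded by the runtime under the "default" alias,
    // and let WASI-NN pick whatever accelerator the device has (ExecutionTarget::AUTO).
    let graph = GraphBuilder::new(GraphEncoding::Ggml, ExecutionTarget::AUTO)
        .build_from_cache("default")
        .expect("failed to load the model");

    // Create an execution context for this session.
    let mut ctx = graph
        .init_execution_context()
        .expect("failed to create an execution context");

    // Feed the prompt in as a UTF-8 byte tensor.
    let prompt = "What is WASI-NN?";
    ctx.set_input(0, TensorType::U8, &[1], prompt.as_bytes())
        .expect("failed to set the input tensor");

    // Run inference on whatever accelerator AUTO selected (GPU, Apple silicon, or CPU).
    ctx.compute().expect("inference failed");

    // Read the generated text back out of the output buffer and decode it.
    let mut output = vec![0u8; 4096];
    let n = ctx.get_output(0, &mut output).expect("failed to read the output");
    println!("{}", String::from_utf8_lossy(&output[..n]));
}
```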
Across those three scenarios it's roughly an order-of-magnitude difference each way. But the point is, it's the ultimate portability solution. You have a single Wasm file, compiled from a single source code base; you copy it over to different devices, and it automatically figures out whether the device has a GPU, whether it has Apple silicon, what kind of accelerator the device can use, and then uses it.

Now, we created this as a Rust application, but our goal is to soon have a JavaScript API for this as well, running inside QuickJS, by connecting QuickJS to the WASI-NN API. And with that, we can do a lot more by leveraging the Wasm ecosystem. In this next example we created another application — by the way, these are all open source; you can go look at the Rust source code and compile them yourselves. It starts a server that listens on port 8080, and what this server does is mimic the OpenAI API server. That means you can use the OpenAI JSON format to send requests to it, and the server responds in the OpenAI JSON format. Why would we want to do that? Because there's a large ecosystem of software that has been developed around OpenAI and ChatGPT. You see things like LangChain and LlamaIndex — I think they're barely one-year-old projects, and they already have tens of thousands of GitHub stars. Those frameworks are designed for building chatbots or AI agents against the OpenAI interface. So once we use Rust and Wasm to turn the large language model service into an API, into a microservice, we can do a lot more by leveraging those ecosystem tools.

But there's more. I'll skip over this slide and come back to it — I don't know if I have time for the demo, but maybe I do, because it's just one minute; it was only ready today. We built the WasmEdge shim for runwasi, and through runwasi we can integrate into containerd, so it can be driven by Docker, by Podman, by Kubernetes. In this demo we have a new, not-yet-merged version of runwasi, so that we can use runwasi to run the large language model as well. That allows application developers to use Docker to do it. One of the most common criticisms of Docker is that it's not truly cross-platform: for the same workload you need a different Docker image for each CPU architecture, and for GPUs you need a different proprietary format for the NVIDIA driver and all that. With Wasm we can normalize those differences, by having runwasi run the Wasm workload for AI inference. This is not a chat model, so the text just keeps going until it finishes. I asked: what are the life achievements of Oppenheimer? It gives a lot of wrong answers — at one point it says he won the Nobel Prize, which he never did — but you get the idea. It's a base Llama model, and once you can run that, you can run a lot of other things.
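As an aside, going back to the OpenAI-compatible API server mentioned a moment ago: here is a minimal sketch of what a client request might look like, assuming the server exposes the usual OpenAI-style /v1/chat/completions path on port 8080 and assuming the reqwest and serde_json crates on the client side. The endpoint path and model name are assumptions for illustration, not details from the demo:

```rust
use serde_json::json;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Build a request body in the OpenAI chat-completions JSON format.
    let body = json!({
        "model": "lava-7b-chat",  // hypothetical model name
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is WASI-NN?"}
        ]
    });

    // Because the server mimics the OpenAI JSON API, any OpenAI-compatible client
    // (LangChain, LlamaIndex, or plain HTTP like this) can talk to it the same way.
    // reqwest needs its "blocking" and "json" features enabled for this to compile.
    let resp = reqwest::blocking::Client::new()
        .post("http://localhost:8080/v1/chat/completions")
        .json(&body)
        .send()?
        .text()?;

    println!("{resp}");
    Ok(())
}
```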
With the Kubernetes support, you can now run this AI workload in your cluster, leveraging whatever GPU or other AI accelerator you may or may not have on the host. You do not need to manage different versions of the container for different hardware devices and so on. I think that's a nice add-on. Let me continue — I have eight minutes.

By the way, AI is more than just LLMs. I showed you how to run an LLM, how to instruction-tune Llama 2, and how to do AI inference on it. In fact, there's a large number of AI frameworks currently running with WasmEdge. We spend a lot of time on this because we take advantage of the Linux Foundation mentorship program, so we get twelve graduate student interns every year — most of them master's or PhD students — and we have them work on integrating those AI frameworks with our runtime. So we now have OpenCV running, we have FFmpeg running, we have various AI frameworks running. And not only the Llama models from Meta; we also have the audio and video models from MediaPipe running.

So let me show you a simple example of MediaPipe. Again, this is an entire Rust application; it's in an even bigger font and still fits on a PPT slide. It loads the model, creates a context, and sends an image over to have the model recognize what's in the image. Since I'm in Chicago, I decided to give it these two images, and I don't believe this model was specially trained to recognize food — it's trained to recognize objects, and it has a couple thousand objects in its dictionary. This is the result that comes back. You can actually try it yourself: we have it in GitHub Actions in our CI/CD, so we run it daily, and it takes a couple of milliseconds to get a result for those pictures. You can see that on the hot dog it's a little confused about what it is, but it more or less knows it's a hot dog, and it definitely knows the pizza is a pizza. So you can do a lot of this with non-language models — with video and image models and things like that.
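For the curious, here is a rough sketch of what that image-classification app might look like, assuming the mediapipe-rs crate's builder-style API together with the image crate. The type and method names, the model file, and the input image are assumptions from memory, not the exact code shown on the slide:

```rust
use mediapipe_rs::tasks::vision::ImageClassifierBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical paths: a TFLite object-classification model and a test photo.
    let model_path = "efficientnet_lite0.tflite";
    let img_path = "hotdog.jpg";

    // Build the classifier from the model file, then classify the image;
    // the underlying inference runs through WASI-NN on the host.
    let result = ImageClassifierBuilder::new()
        .max_results(3)
        .build_from_file(model_path)?
        .classify(&image::open(img_path)?)?;

    // Print the top categories with their confidence scores.
    println!("{}", result);
    Ok(())
}
```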
So I think I've probably answered this already, but the biggest question about any piece of writing or any technology is the "so what" question: okay, you did all this — so what? The key point of "so what" is: no Python. I really like this photo. Python today is basically a wrapper over C frameworks, and if you look at the PyTorch Docker image on the right, it's three gigabytes to begin with. It's a very expensive wrapper. Just to have a nice language layered on top of C is very, very expensive. And it's not even that nice anymore, because it takes a very long time to install Python and make sure nothing conflicts. I keep a separate machine just for installing Python, because I often need to wipe it and start over.

And just to drive home the point: it wasn't I who said that Wasm and Rust are the languages of AGI — he said that, okay? And yesterday X made an announcement about Grok: they have their large language model, and very specifically in their press release they mention that their whole backend infrastructure is built with Rust — Kubernetes, Rust, and JAX. You have to have some kind of AI framework, and JAX is mostly C/C++ with a little bit of Python, but they talk a lot about the virtues of Rust. And as we have seen — a lot of people here are Rust developers, because the Wasm and Rust communities are very close; that's why we had co-located conferences in Seattle a couple of months ago — it's my belief that Wasm is becoming the go-to runtime for Rust. I think in the near future the majority of Rust applications will be compiled to Wasm. That's something we really look forward to seeing.

So to finalize — I only have a couple more minutes — let me go back to my original point: what's the killer application? It's not to replace Docker, not to replace Go, not to replace today's microservices; it is to replace Python in the new AI applications. This slide shows the comparison with Python. I think it's a no-brainer: you install the whole toolchain, with almost zero dependencies, in four or five minutes; the whole thing is about one percent the size of the Python stack; the application itself is much, much smaller than the Python equivalent; it runs much faster; and it's utterly portable. But the other question people always ask is: how about C++? Why can't I just compile to native? Because in a lot of big firms, when they do AI inference today, they use C++-based runtimes. Here's my answer: so many speakers today have talked about the virtue of isolation, the need for components and all that. Wasm is just a better programming model than the full-native model we have sort of been deprecating since the Java days. So those are my main arguments for Rust plus Wasm and how it compares with the leading competitors in this space. I think that's it. Thank you.

Spicy take, Michael. All right, who's got some questions? I was afraid we were gonna get booed, attacking the Python community. Canceled, Michael. Trying to get Wasm off the ground here.

Yeah, hi. The talk mostly covered interoperability and, you know, bashing Python, but is there an advantage also in terms of performance? Because ultimately it's GGML that's running, right? So does this pattern of building with WebAssembly plus Rust have any performance advantage compared to running Python with GGML as a backend?
Yes. I normally talk about the size and complexity of Python, but if you look at the academic literature, there was a very famous paper published in Science, I think two years ago, titled "There's Plenty of Room at the Top." It's a nod to Richard Feynman's talk from more than sixty years ago, "There's Plenty of Room at the Bottom," and to Moore's law, which said you could keep improving the semiconductors and not worry too much about the software. But we have hit the wall on semiconductors — I know, because most of my colleagues are in Taiwan — so extracting performance from software becomes paramount in this day and age. They compared Python with Rust and C++ and the like, and there is an orders-of-magnitude performance difference between Python and those languages. The only way Python is still in the game for AI and machine learning is the C frameworks underneath. If you're not careful, anything where you actually perform the computation in Python is going to be so slow that you immediately notice it. That's why — I forget who said it — much of today's machine learning engineering is figuring out how to get around Python's performance problem, how to use the C frameworks under Python.

So your quote is: Python broke Moore's law?

No, it's that Moore's law has reached the point where we have to squeeze performance out of the software. So there's a software Moore's law that starts with Python and goes to Rust and goes to...

I love it, Michael. I'm only teasing a little bit. I like the spicy take. Who's next?

How do you feel this relates to ONNX? Does it make it obsolete, irrelevant?

No, I think this is orthogonal to the model format. ONNX is at the same level as TensorFlow, PyTorch, and GGML, which we all support. It's a model format, and through the WASI-NN abstract interface we can call out to the runtimes that run those formats. So I don't think it makes ONNX obsolete at all; I think we'll be able to accommodate the ONNX ecosystem.

So this could actually read ONNX?

Yes, it can read ONNX and do inference on behalf of the app, yeah.

Cool. Okay, we've got time for one more, I think. Up in the front.

I'm not sure if somebody asked this already, but during your demo, one of the things you mentioned was that it kept having to load the five-gigabyte model, and as models get bigger, that obviously becomes a bigger problem. One of the things some of the llama.cpp authors did is mmap the model, which reduces startup time by quite a lot. Do you think that's something that could be replicated with WASI, or maybe WASI-NN?

Yes, yes. The way WASI works is that it's an abstract interface to host functions, so there are lots of things you can do at the host-function level. Because of this need from large language models, we actually updated the WASI-NN spec — I don't know if Andrew is here, but the WASI-NN committee has driven that direction — because for five gigabytes you can't read the model into the Wasm memory and then transfer it out again.
You can't do that, so you have to bypass the Wasm memory altogether when reading the model. There's a lot of optimization we can do here — lots of details — and we'd love to see the community's contributions; we're working very hard on this as well.

Another great Cloud Native Computing Foundation project, WasmEdge. Please join me in giving Michael a huge round of applause. Thank you.