All right, welcome to my talk. My name is Michael Yuan, and I'm a maintainer of the WasmEdge project, which is a CNCF sandbox project. If you want my contact information, it's there in the QR code. The talk is about efficient and cross-platform AI inference with WasmEdge. The first question I have to answer is: compared with what? Why do we call ourselves efficient and cross-platform? "Efficient" is compared with perhaps the most common way to do AI inference today, which is Python and PyTorch. What I'm going to show you is that we are orders of magnitude more efficient than Python. Python is great for training, but you should not use it for inference unless you only want a research or hobby project. That's the first word. "Cross-platform" is compared with C++: in large operations, people often use C++ to develop inference applications, and the problem with a native C++ application is that it is not cross-platform, so it's difficult to use in, say, a Kubernetes or cloud-native environment. So: efficient compared with Python, and cross-platform compared with C++-based frameworks.

That's a big claim to make, so I'll start my talk with a demo. You can follow along, and by the end of the demo, which takes five or ten minutes, you will have a large language model running on your own laptop. That is how we run large language models everywhere: on my laptop, on our team members' laptops, and on our office server. We have a Jetson Orin device, about this big, with an NVIDIA GPU. It costs about $2,000, and we run a large language model on it as a coding assistant that everyone in the office can use. It's just one line on the command line, so let me do the demo first: copy and paste the command into the terminal.

It asks a bunch of questions. First, because we're using the WasmEdge runtime — a WebAssembly runtime — to run the large language model, it asks you to install WasmEdge. That's about a 20-megabyte download, but since I already have WasmEdge installed, I tell it so. Then it gives you a list of large language models to choose from. If you search Hugging Face for Llama-2-based large language models, you can find over 2,000 of them. They are all based on the Llama 2 model that Meta (formerly Facebook) released, but people fine-tune it with different data sets, or pre-train the same architecture on different data. What we list here are the twenty-something most popular ones that we have tested with our setup: some are good at Q&A, some are good at coding. We'll do the simple thing and choose the first one, the original Llama 2 7-billion-parameter model, so I type 1. Then you have two choices: run it as a command-line application, or run it as a web server. We'll show you both, so let's do the first one. Do I need logging? No, I don't.

At this stage it downloads the inference program. The inference program is a Rust program that we wrote and compiled into Wasm, and the entire inference program is 2 megabytes. Why do we say we are efficient compared with Python? Because Python with all its dependencies — the PyTorch Docker image, for instance — is 4 gigabytes. So it's 4 gigabytes versus 2 megabytes.
People say, OK, that's not a fair comparison — what about the dependencies of this Wasm file? The only dependency it has is the Wasm runtime you just installed, that 20-megabyte binary. You can think about it like a Java application: the only dependency is the JVM, and once you install the JVM you can run any Java application. The Wasm application here works the same way. It's a 2-megabyte application that we wrote in Rust and compiled into Wasm — I'm going to show you the source code in a minute — and it performs the same work that the 4-gigabyte PyTorch application would accomplish. The other thing I want you to notice is that it's just llama-chat.wasm. There is no Mac version, Windows version, ARM version, Intel version, or NVIDIA GPU version of it. It's one cross-platform application — not only across different operating systems, but also across different CPUs and GPUs. Through the underlying Wasm runtime, it detects the hardware available on the device and uses it.

So let me ask the first question: what is the capital of Japan? The first question takes some time, because it has to load the entire large language model into memory and then run the inference. It has already answered, but the model is about 5 gigabytes here, so it had to load that from disk. By the way, I'm confident doing this demo because I'm not relying on the network — everything is running on my MacBook. My MacBook is two years old; when I bought it I thought I had no use for the GPU, so I chose the lowest configuration I could find, and now I'm using the GPU all the time. It can respond at that speed because there is a GPU that the WasmEdge runtime detects; it loads the model into the unified memory and then answers the question.

Let me ask a follow-up question: what about the United States? It answers much faster now, because the model has already been loaded. As you can see, the inference produces answers faster than I can talk: the capital of the United States is Washington, DC, the District of Columbia. In this simple conversation you can already see that the model follows the conversation. If I just asked "what about the United States?" on its own, there is no good answer — it's a country. Only with the context of the previous question does it understand that I'm asking about the capital, so it answers that the capital of the United States is Washington, DC. OK, let's ask another follow-up question: please plan a two-day trip to that city. Well, it's faster than I can read. There is some filler, but it actually gives you places to go on day one and day two, and I can have it plan a two-day trip in Tokyo or Kyoto for me. It answers a lot of those questions. Yesterday I was messing around with it: I told it the capital of Japan is Kyoto, and it said no, it's not — Kyoto is the old capital of Japan. Things like that.
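Since I said I'd show the source code, here is roughly what the core of that 2-megabyte Rust program does. This is a minimal sketch assuming the wasmedge_wasi_nn crate and a model that the runtime has preloaded under the alias "default"; the buffer size and prompt handling are illustrative, and the real llama-chat program adds prompt templates, chat history, and streaming output.

```rust
use wasmedge_wasi_nn::{ExecutionTarget, GraphBuilder, GraphEncoding, TensorType};

fn main() {
    // Attach to the llama.cpp/GGML-backed model that the WasmEdge runtime
    // preloaded for us (e.g. via `--nn-preload default:GGML:AUTO:<model>`).
    let graph = GraphBuilder::new(GraphEncoding::Ggml, ExecutionTarget::AUTO)
        .build_from_cache("default")
        .expect("failed to load the preloaded model");
    let mut ctx = graph
        .init_execution_context()
        .expect("failed to create an execution context");

    // The prompt goes in as a byte tensor...
    let prompt = "What is the capital of Japan?";
    ctx.set_input(0, TensorType::U8, &[1], prompt.as_bytes())
        .expect("failed to set the prompt");

    // ...the runtime runs the inference on whatever CPU or GPU it finds...
    ctx.compute().expect("inference failed");

    // ...and the generated text comes back as bytes in an output buffer.
    let mut output = vec![0u8; 4096];
    let n = ctx.get_output(0, &mut output).expect("failed to read the output");
    println!("{}", String::from_utf8_lossy(&output[..n]));
}
```

The point of this shape is that the Rust code never names a GPU vendor or an operating system; it only talks to the WASI-NN style interface, and the runtime decides how to execute it on whatever machine it happens to be running on.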
So, as you can see, on this very low-end Mac we are able to run large language model inference faster than a human can talk. Let me switch gears and do another demo, where we run the same script again but make different choices. We keep my existing WasmEdge runtime, and we choose the first model again — the reason I always choose the first model is that it's already downloaded on my machine, and if I had to download the model over the conference Wi-Fi it would break the demo, because it's a five- or six-gigabyte download. This time I say I want to create an API server. It now downloads a different Wasm application, about 5 megabytes instead of 2 megabytes, and what it does is start an API server.

Why do you need an API server in front of the model? Because a lot of the toolchains in the large language model ecosystem started out working with the OpenAI API, using ChatGPT as the backend. The OpenAI system has its own API format — certain JSON shapes and HTTP conventions they want you to use (I'll show what such a request looks like in a moment). The llama-api-server is another Rust program that we wrote: it starts a web server in Rust — in this particular case we use the hyper crate — and connects it to the large language model we specified. So essentially it's an application that puts a web server in front of the large language model, and the web server takes input and output messages that are compliant with the OpenAI format. That allows all the toolchains out there, like LangChain or LlamaIndex, that were built to work with OpenAI, to work with your own model. It also allows the model to be served beyond your local computer, even though it's running on your local computer. That's how we do it with our own development team: we have a small server — the Jetson Orin device I mentioned, with 64 gigabytes of unified memory — in our office, and all the developers in the office ask programming questions to that server. It's essentially our own Copilot, except we don't have to pay for it, it stays entirely private, we don't contribute any data to outside entities, and it can be trained on the code and documentation that are specific to that team.

As you can see, it has started a server. I can interact with the server using curl and JSON, but I'll show you something even simpler. It's listening on a port, so I can load that port directly in the web browser. You can see the URL here; it's an unsecured URL because it's listening on localhost — the whole thing is running on localhost — although if it's on a local network you can also reach the server by its IP address. So it's not only an API server that responds to JSON queries; it also serves a nice chat UI that you can interact with in your local browser. Let's ask a more complex question. I don't have strong confidence that it will answer correctly, but we'll see: write a function that determines if an input integer is prime, and wrap your code in blocks so that it formats the code nicely. That's a very typical developer question.
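While we wait for that first answer, here is what a request going over the wire looks like. This is a sketch of an OpenAI-style chat completion body; the model name and the exact set of fields the server accepts are assumptions on my part, but the general shape — a messages array with roles — is the OpenAI convention that the toolchains expect.

```rust
use serde_json::json;

fn main() {
    // An OpenAI-style chat completion request body, carrying the same kind
    // of conversation we just had in the CLI demo.
    let request_body = json!({
        "model": "llama-2-7b-chat",
        "messages": [
            { "role": "system", "content": "You are a helpful assistant." },
            { "role": "user",   "content": "What is the capital of Japan?" }
        ]
    });

    // A client POSTs this JSON to the server's chat completions endpoint
    // (on the OpenAI API, that path is /v1/chat/completions).
    println!("{}", serde_json::to_string_pretty(&request_body).unwrap());
}
```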
It takes some time because it's the first question — oh, it came back pretty fast. Let me read it: sure, here is a function that determines if an input integer is prime. The response is nicely formatted, and if you look at the code, it's Python — all these models are well trained on Python. I'll ask later to see whether it can do Rust. The important part is the loop: it starts from two and goes up to the square root of n (n to the 0.5), and it tests those numbers one by one to see whether they evenly divide the input. If none of them divides it evenly, it's a prime number; otherwise it's not. It also includes an explanation. So this is a 7B Llama 2 model running, again, on my very low-end Mac laptop that doesn't even have an external power source connected, so it isn't in a high-power mode.

Now let me try: can you write it in Rust? I don't really like Python, because the whole point of this talk is not to use Python, so I want it in Rust. It produces the same algorithm and even the same comments — the same notes — and it uses Rust to do the floating-point conversion and so on. I can go one step further, because this is one of the things we use such a model for: code review on our own code. We have a GitHub bot set up to channel our GitHub PRs into a model — not the 7B model, a larger one, a Code Llama model — and ask it to review all the community-submitted PRs to our repository. In one of the PRs, an outside developer wrote something very much like this. I looked at it and thought it was fine — I think most of you would think determining a prime number like this is fine, right? But the model came back with one single comment: you do not need to check even numbers over and over again. That blew my mind. How did we not notice that? (I'll sketch what that comment means in code in a moment.) So let me ask this small model whether it can act on that comment: "you do not need to check even numbers over and over again." I don't have high confidence, because this really is a smaller model; it may or may not give a correct answer, and we'll see. I don't think this is correct — it's still doing the same thing; it doesn't step by two. Well, there you have it. A larger model handles problems like that easily. We have seen the larger model detect infinite loops in code, and also minor things like documentation: when we change something and refer to documents that don't exist, it knows that and complains about it.

So that's the demo I wanted to show you. Let me go back to my slides. You can do this today on your own computer, whether it's a Linux laptop, Windows, or Mac. It's a simple command line, it installs all the software for you, and all of that software is very lightweight. No Python required — no complex Python packages to manage, no hundreds of packages and gigabytes of dependencies. It's very simple and very easy: a single binary file. Think of it like Java: a Java bytecode file can run anywhere there is a JVM, and the Wasm bytecode file we compile can run anywhere there is a Wasm runtime, like WasmEdge.
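To spell out the prime-number example from the review: here is my own Rust sketch of the trial-division algorithm the model described, with the shortcut the review bot suggested — check 2 once, then only test odd divisors up to the square root. It's illustrative, not the model's verbatim output.

```rust
/// Returns true if `n` is prime, by trial division from 2 up to sqrt(n).
fn is_prime(n: u64) -> bool {
    if n < 2 {
        return false;
    }
    if n == 2 {
        return true;
    }
    // The review bot's point: after checking 2, there is no need to test
    // even divisors over and over again.
    if n % 2 == 0 {
        return false;
    }
    let limit = (n as f64).sqrt() as u64;
    let mut d = 3;
    while d <= limit {
        if n % d == 0 {
            return false;
        }
        d += 2; // step over the even numbers
    }
    true
}

fn main() {
    for n in [1u64, 2, 17, 18, 97, 100] {
        println!("{n} prime? {}", is_prime(n));
    }
}
```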
These are the slides, just in case the demo didn't work, so let's go through them quickly. Step one is to install WasmEdge with what we call the GGML plugin. The GGML plugin is a community plugin based on llama.cpp. The great thing about Wasm is that it provides a unified interface: from the developer's point of view, you just write to the WebAssembly standard to do AI inference, and the runtime that executes your program can choose any inference framework underneath. If there's no GGML, it could choose PyTorch. I've also just seen that Intel has a new library that takes full advantage of their new CPUs — the latest generation of Intel CPUs has a lot of built-in co-processors, and Intel provides drivers for them — so if you're not on a Mac and llama.cpp doesn't make the best use of your Intel hardware, that kind of library can sit underneath as well.

So: install WasmEdge, and download your favorite model. Like I said, you can find thousands of models in the GGML format on Hugging Face with a casual search, and new ones come out practically every single day. They are all the same architecture but trained with different data: some are good at coding, some at Q&A, some at role-playing, some at math, some at medical knowledge, all kinds of things. That opens up the whole world of large language models for you. Then, for the CLI chat, you first download the compiled Wasm application — again, this is cross-platform, so there is no separate Wasm file for your particular platform — and then use WasmEdge to run it. That was the CLI demo. The web UI demo is a little more involved, because we also have to download the HTML and JavaScript you saw in the browser. That comes from another open source project, called chatbot-ui. You download those HTML files and put them in the same directory as your API server, and when you start the API server it responds to the JSON requests but also serves the HTML and JavaScript assets.

From that point on it becomes fairly easy. Because we now have an API server that responds to OpenAI-like API requests, you just point your tooling from OpenAI's endpoints to your new endpoint on your local computer, or on some computer on your own network (I'll show a small client sketch in a moment). Then you can use RAG applications built with tools like LangChain (langchain.com) or LlamaIndex (llamaindex.ai); people have built a ton of applications that let you chat with your PDFs, chat with your GitHub repository, and basically chat with any documents you have. And with tools like flows.network you can connect it to external systems: the chat API doesn't have to be a web API, it can be a Discord bot, for instance. There was a talk demonstrating this yesterday — she built a community-assistant bot by feeding project documentation and project descriptions into a vector database and then letting people interact with it through GitHub or through Slack. Because this is a fast-evolving field, there are just so many opportunities. So those are large language models. However, the title of the talk is not how to run a large language model with WasmEdge; it's how to run AI inference with WasmEdge, and AI is more than just LLMs.
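Here is the client sketch I mentioned for pointing your own tooling at the local endpoint instead of OpenAI's. I'm assuming the server is listening on localhost port 8080 and exposes an OpenAI-compatible /v1/chat/completions path — adjust the URL to whatever your server printed when it started — and the sketch uses the reqwest and serde_json crates.

```rust
use serde_json::json;

fn main() -> Result<(), reqwest::Error> {
    // Assumed local endpoint; the port and path depend on how the API
    // server was started on your machine.
    let url = "http://localhost:8080/v1/chat/completions";

    let body = json!({
        "model": "llama-2-7b-chat",
        "messages": [
            { "role": "user",
              "content": "Write a Rust function that checks if an integer is prime." }
        ]
    });

    let response: serde_json::Value = reqwest::blocking::Client::new()
        .post(url)
        .json(&body)
        .send()?
        .json()?;

    // In the OpenAI response format, the generated text is found under
    // choices[0].message.content.
    println!("{}", response["choices"][0]["message"]["content"]);
    Ok(())
}
```

Any tool that already speaks the OpenAI API — LangChain, LlamaIndex, or your own scripts — works the same way: you only change the base URL it talks to.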
So let's look at another example. The way WasmEdge works is that we have these native, high-performance inference frameworks plugged into WasmEdge. As an application developer, you just need to learn the WebAssembly API, write your code in Rust, Go, or JavaScript, compile it into WebAssembly, and then you can access the underlying infrastructure. These are some of the AI-related frameworks plugged into the WasmEdge runtime. At the top are what I call the AI runtimes, like PyTorch — PyTorch sounds like Python, but you can use PyTorch without Python, because it has a core C++ library you can call. TensorFlow is the same idea, OpenVINO is Intel's project, and GGML is what we just demonstrated. However, the AI models alone are not enough, because the data you feed into them typically comes from, say, a camera or a video stream, so you need libraries like OpenCV and FFmpeg for pre-processing, and pre-processing can also run on GPUs. At the bottom are the large families of models supported by the technologies at the top. What we have shown is the Meta series of Llama 2 models; MediaPipe is a very popular set of models and tooling from Google for media processing, like image recognition; and YOLOv5 is also a very popular computer vision model.

I won't show a live demo of this, but here's the project. It was built by one of our student interns through the LFX Mentorship program — every year, seven to ten students from all over the world are paid by the Linux Foundation to work on projects like ours. One of those students, I think he's a graduate student in the US now, built the Wasm interface for the MediaPipe libraries. This is his tracking issue for the work: the MediaPipe tasks include object detection, image classification, hand landmark detection, gesture recognition, things like that. And this is the GitHub repository for the project.

Now I want to show you a very simple example, because I keep saying that a developer writes code against the Wasm API, compiles it to Wasm, and then that Wasm application runs across platforms and devices. What do I mean by writing an application to the Wasm API? In this day and age we're talking about applications of less than 10 megabytes, and code that fits on one slide — that's really all there is to this image detection application, and the large language model is a similar setup. The code itself is in Rust, and it does the following (I'll reproduce it below). First, you load the image from disk. Then you load the model from its path. Then you have the object detector builder, which is what we call the Rust API: it wraps the work the developer would otherwise do to instantiate the model and load it into memory. Then all you need to do is call detect and pass in the image. It returns a tensor — the detection result. After the print line comes the post-processing, where you draw a box around the detected object and show the probabilities; that work is done with OpenCV.
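Here is roughly what that slide shows — my reconstruction of the roughly twenty-line Rust program, following the mediapipe-rs-style API the intern built. Treat the crate path, the builder name, and the file names as approximations of the real project rather than exact signatures.

```rust
// Approximate reconstruction of the slide's example; crate path, builder
// name, and file names are illustrative.
use mediapipe_rs::tasks::vision::ObjectDetectorBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // 1. Load the input image from disk.
    let img = image::open("chicago_hot_dog.jpg")?;

    // 2. Build the detector from a model file (e.g. a MobileNet-based
    //    object detection model); the builder wraps instantiating the
    //    model and loading it into memory.
    let detector = ObjectDetectorBuilder::new()
        .max_results(5)
        .build_from_file("object_detection_mobilenet.tflite")?;

    // 3. Run detection; the result contains bounding boxes, categories,
    //    and probabilities.
    let detection_result = detector.detect(&img)?;
    println!("{}", detection_result);

    // 4. Post-processing (drawing the boxes on the image, showing the
    //    probabilities) is done separately, for example with OpenCV.
    Ok(())
}
```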
So the whole thing is less than 20 lines of code, and the large language model is the same: you have a builder to build the model, and then instead of detect you call predict. The parameter it takes is a string — the prompt you give the model — and the return value is an array of bytes that you decode back into natural language. If you run this, it prints the detection result as formatted text. I was specifically using this image; I used it at KubeCon Chicago because it's a typical Chicago-style hot dog, where they don't just have the bun and the sausage, they pile a lot of stuff on it. It really confused the model: it thinks there are two hot dogs lying on top of each other. But that is exactly the argument for open source, private models. This particular model is the MobileNet model that Google ships with MediaPipe, and it's trained on a standard set of images; it doesn't have food images of this complexity — it probably doesn't recognize a lot of Japanese food either. So for your own application and your own specific use case — just like with the large language model, where coding and role-playing are entirely different use cases — it's important for vision models, too, that you can retrain or fine-tune the model so it fits your specific application.

With that, I've shown all my demos and the code. The question people really ask is: so what? You talk about a new way to run inference — OK, I've seen that before, Python does that. The real answer, as I said at the beginning, is that the real strength here is no Python. So let's dive a little into the rationale behind all this. Today, if you want to do anything AI-related, you have to use Python; every developer survey says Python is among the most popular programming languages out there. However, it's not without detractors. Some of you may know Chris Lattner — the father of LLVM and Clang, the most widely used compiler infrastructure, and the inventor of the Swift programming language, the most popular language on iOS. He recently started a company called Modular — they actually had a big announcement yesterday at their conference in the Bay Area — and they built a new language for AI called Mojo. When he's asked why he invented yet another language for AI, the answer is that it can be tens of thousands of times faster than Python — orders of magnitude. That is how slow Python is: compare a native application and a native framework with Python, and the difference is multiple orders of magnitude. When I first gave this presentation with this screenshot, a lot of people didn't know who this is, but I think by now you all do, because of the whole drama at OpenAI between him and Sam Altman. There's a very telling quote from Greg Brockman: much of modern machine learning engineering is making sure Python is not the bottleneck — figuring out where Python is slow and avoiding Python in those places. And that takes a huge amount of engineering work.
The root cause is that we have gone through decades of progress in computer science under Moore's Law, with hardware getting ever more capable, so software efficiency became less and less important. This paper was written in 2020 by well-known computer scientists, and the title is deliberately worded: "There's Plenty of Room at the Top." That's a reference to a famous talk from decades ago titled "There's Plenty of Room at the Bottom," and to the Moore's Law era it pointed toward: semiconductors would improve so fast that software efficiency didn't matter much, so software should focus on developer productivity. And in all those years — through Java, through JavaScript, through Ruby — it was all about making life easier for developers and harder for the hardware; as you can see, the efficiency gap between languages on the same hardware is staggering. The key point of the 2020 paper is that that era is behind us: there is no more Moore's Law on the hardware side. So to squeeze out more performance — to keep Moore's Law-like gains going in the world of AI — the future direction of the whole industry is more efficient software frameworks, which is exactly what we are trying to do: move people away from Python. Not all the way back to native C++ — that's where we were 40 years ago, it's very hard, and people gave it up for good reasons — but to a portable abstraction layer that sits in between, which is WebAssembly. That's where we are trying to go.

There's also a very telling post from Chris Albon, who leads machine learning at the Wikimedia Foundation: what is the right way to install Python on a new M2 MacBook? I dare not install Python on my own Mac laptop, because I know that after two or three months it will get messed up; it reaches a point where you cannot install any package, because anything you install conflicts with something else on your machine. It's a really hard problem — the Python ecosystem has become that complex. And all Python does for inference is sit on top of a C++ inference framework; Python is just a wrapper around it. A wrapper that weighs four gigabytes is why you can't run it on the edge, can't run it on a factory device, can't run it in a car. It's a gigantic waste of resources and it limits what you can do. So we think the future of AGI — the future of machine learning — is not going to be Python. The quote goes: "AGI will be built in Python? Let that sink in." I didn't say that; Elon Musk said that. He thinks the future is Rust, which is pretty much exactly what we are trying to do here as well. So the rationale is that, with Moore's Law gains moving into software, we can't afford all those heavyweight software packages anymore.

I'll end the talk, since I only have about three minutes left, with a comparison chart of why we do this. On one side is the comparison with Python. The single most important benefit of using Wasm, as you saw in my demo, is simplicity: it's a single, self-contained binary application with no supply-chain security issues.
There are no hundreds of packages that can conflict with each other, and deployment is as easy as copying one file across. It's also ready for cloud deployment, because it can be managed by Kubernetes without a container: Docker Desktop, OpenShift, and containerd are all integrated with Wasm today, so you don't need a Linux container — your Kubernetes or container management tools can start a WebAssembly module directly. It's three orders of magnitude smaller, and it can be orders of magnitude faster than doing the same thing through Python. The comparison with C++ is along a different axis: on one side is efficiency, on the other is portability. With C++ you have all the portability issues. Even if you compile a C++ application and put it into a Linux container, you have only solved portability across operating systems: for x86, for ARM, for RISC-V you need different container images, and if you want to use an NVIDIA GPU you need a different container runtime, so you have to set up your Docker installation accordingly and orchestrate different container images across your network. In our case there is none of that — it's just a single binary. I run it on my Mac, I run it on the Jetson device in our office, I run it on a Windows laptop with an NVIDIA 4090, and it runs at top speed on all of them.

So I think that's it. Thank you very much — we have one minute left. Any questions? No questions? Then go try this; you can have it running before the end of today, with one command on your Mac or Windows machine. All right, thank you.

— Can I ask a question? Are you planning to replace the entire Python environment with Rust plus Wasm — I mean, including the training process?

— This is strictly about inference. I do believe Python is needed for training, because training involves a lot of interaction, and the scripting environment and the tools Python has are very important there. But inference is fundamentally a very simple process, as I showed — it's only about 20 lines of code. If you use Python for inference, you are wasting 99% of the code in PyTorch, because it's never used. That's the reason. Thank you. All right, thanks.