Hello, everyone. Welcome to my session. My name is Vivian Hu, and I'm one of the maintainers of the CNCF WasmEdge project. Hello, everyone. My name is Michael Yuan, and I'm also a maintainer of the WasmEdge project. So today, Mike and I will talk about creating lightweight and portable LLM apps. We will use LlamaEdge. It's a lightweight and portable runtime built on top of the WasmEdge runtime. So let's get started. So the current tech stack for AI applications is dominated by Python, apparently. We use PyTorch and Python to do LLM inference. And if we want to build LLM agents, we use LangChain and Python. However, Python has its problems. First, it's heavyweight, because there are lots of complex dependencies in Python. The picture here is the official PyTorch Docker image. From this picture, we can see that the PyTorch Docker image can be four gigabytes. It's almost the same size as a large language model like Llama 2 7B. So it's very huge. One more thing is that Python-based LLM applications are not portable. Different GPU drivers require different Docker images. Also from this picture, we can see that CUDA 11 and CUDA 12 require different Docker images. So this is the problem with Python. So how about we use natively compiled applications, such as native Rust or C applications? Yes, they are much smaller than Python, but those native applications are not portable. If we want to run the same application on a MacBook and on an NVIDIA GPU, then we need to recompile the application on each device. But Kubernetes can only distribute binary artifacts. So that's the problem with Python and native applications. How can we solve these problems? All right, yeah. So I think one of the common threads at this KubeCon is how we make Kubernetes work better with machine learning and large language model workloads. As we have shown, those PyTorch images are measured in gigabytes. That's very heavy, even in a cloud environment. And the native applications, apparently, are not portable. It's not just a problem with Kubernetes. We are going all the way back to the problem where, if I develop something on my laptop, I can never be guaranteed that this thing is going to run on the server or on another device, because they may have a different GPU. It's not just recompiling, because the code may be different. The Mac Metal framework and the NVIDIA CUDA framework require different source code to begin with. So you have to rewrite and recompile. So we thought about how this problem can be solved. It occurred to us about a year ago that we can use the same approach Java used to solve the write-once, run-anywhere problem, but expand it to the GPU setting using a lightweight virtual machine, in our case, WebAssembly. So the idea really is to provide an abstraction layer between the application and the underlying driver and hardware, using the virtual machine to provide that abstraction, so that application developers only need to write towards a single unified API, compile the application to Wasm, and then distribute the Wasm bytecode anywhere the application might run. And then the Wasm runtime takes over at runtime to figure out how to dispatch, how to translate, and how to route those high-level Wasm function calls into low-level, say, CUDA function calls or Metal function calls. So essentially, it provides a virtual instruction set that is abstract for developers so that we can provide the compatibility.
So the key specification that we use here is something called WASI-NN. If you are familiar with WebAssembly, WebAssembly started in the browser, but WASI means WebAssembly System Interface, meaning that if it runs outside of the browser, you need to have system functions available to it, things like networking sockets or the file system and all that stuff. So GPU access can be abstracted the same way. So WASI-NN is a W3C standard that abstracts a set of neural network, or inference, function calls through the WebAssembly bytecode specification. So that allows us to build different language SDKs. For instance, we have a WASI-NN SDK for Rust, a WASI-NN SDK for JavaScript, a WASI-NN SDK for C and C++. For those WebAssembly languages, you write towards the WASI-NN specification, make the function calls, and then compile that to Wasm bytecode. Then, a Wasm runtime like WasmEdge, like I said, translates and routes those Wasm API calls to the underlying AI libraries, drivers, hardware, and all that. So many GPUs, drivers, and inference frameworks are already supported. So in the WasmEdge community, we have over 100 contributors. At any given time, we have four or five graduate students working through the Linux Foundation mentorship program to help build the plumbing and connections between the Wasm runtime and the underlying frameworks. So today, we already support calling, say, PyTorch functions from Wasm — not the Python PyTorch, but the C-based PyTorch — TensorFlow functions from Wasm, OpenVINO functions from Wasm. GGML, which is maybe more commonly known as llama.cpp, is one of the most popular ways to run large language models. And on top of that, we have GPU-accelerated connections to C-based libraries like OpenCV and FFmpeg. And the result is that we can run the whole suite of Llama 2 models, large language models. We can run things like YOLOv5 for object detection on the edge. We can run Google MediaPipe, which is a series of TensorFlow models that do object detection and things like that. So what do we get? We get very lightweight inference applications. The application plus the runtime, in a complete image, is measured in megabytes — 20 megabytes, 30 megabytes. Although one thing I would like to say is that although the PyTorch Docker image is four gigabytes, the PyTorch C library that actually performs the task is only two megabytes. So it's less than one-thousandth of that Docker image. The rest of the stuff is complex Python dependencies, or the native dependencies they have to add to it. By extracting the C library out of the PyTorch environment and hooking it up with the Wasm environment, we produce significant cost savings. So again, we are talking about applications measured in megabytes, not gigabytes. And when the application runs on the device — on a Mac, or on an NVIDIA Jetson device, or on a server — we get native, GPU-accelerated performance. We support many GPUs and hardware accelerators. Again, this is through our mentorship program; a lot of graduate students are working on those things. And you can see that from the diagram. So even within the NVIDIA ecosystem, we support CUDA 11, CUDA 12, TensorRT, and TensorRT-LLM — different ways to accelerate on the hardware. On the Intel side, we are working to support inference on Intel CPUs with a specialized library that comes from Intel. And on the Mac, there's the traditional Metal framework, and there's also the newer MLX framework.
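From the developer's side, an inference call through WASI-NN looks roughly like the sketch below. This is a minimal sketch based on the Rust SDK pattern (the wasmedge-wasi-nn crate); the exact crate and method names may differ from what's shown here, and the model alias "default" is assumed to have been preloaded by the runtime.

```rust
// A minimal sketch of a WASI-NN inference call from Rust, assuming the
// wasmedge-wasi-nn crate and a GGML (llama.cpp) model preloaded by the
// runtime under the alias "default" (e.g. via WasmEdge's --nn-preload flag).
use wasmedge_wasi_nn::{ExecutionTarget, GraphBuilder, GraphEncoding, TensorType};

fn main() {
    let prompt = "What is the capital of France?";

    // Load the preloaded model; the runtime decides whether it runs on
    // CUDA, Metal, or the CPU.
    let graph = GraphBuilder::new(GraphEncoding::Ggml, ExecutionTarget::AUTO)
        .build_from_cache("default")
        .expect("failed to load the preloaded model");
    let mut ctx = graph
        .init_execution_context()
        .expect("failed to create an execution context");

    // Feed the prompt in as a UTF-8 byte tensor and run inference.
    ctx.set_input(0, TensorType::U8, &[1], prompt.as_bytes())
        .expect("failed to set the input tensor");
    ctx.compute().expect("inference failed");

    // Read the generated text back out of the output tensor.
    let mut output = vec![0u8; 8192];
    let size = ctx.get_output(0, &mut output).expect("failed to read output");
    println!("{}", String::from_utf8_lossy(&output[..size]));
}
```

The application only sees this unified API; the Wasm runtime routes the calls to whichever backend is available on the device.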
So there's just a large number of devices and a larger number of inference libraries that we can hook up with through the WasmEdge WASI-NN API. So developers only need to know the WASI-NN API. They don't need to worry about Metal. They don't need to worry about CUDA 11 versus CUDA 12. They just write the application, compile it, and let us figure out how to run it on those devices. And because we support those hardware backends and those runtimes, we support the wide selection of AI models I've just shown, from vision to audio to large language models. So now let's see it running real apps. I will do the first demo. I will build and run an OpenAI-compatible API server using LlamaEdge. So I'll provide the voice, the narration, while Vivian is doing the demo. Yeah. If you have a computer at hand, you can follow along. You can visit the LlamaEdge website. And then we will copy and paste this one command line into our terminal. So this command will download and install all the necessary software that we need to run an OpenAI-compatible API server. Yeah, so what we are showing here — so what is WasmEdge? What is LlamaEdge? So WasmEdge is the Wasm runtime. LlamaEdge is a large language model inference application we build on top of WasmEdge. Because with WasmEdge, you can do a lot of things with WebAssembly, including microservices, serverless functions, and all that stuff. But very specifically, if you want to run an open-source large language model, this is a way to do it, because it's a very compact environment. And it's a fully programmable environment. You can write applications on top of it. And this first application is just one of the applications that we wrote. The source code is completely available; it's under the LlamaEdge GitHub repository. And what it implements is an API server. Now it has started. So, Vivian, do you want to go back and see? Yeah, please. So this command downloads the WasmEdge runtime with the WASI-NN plugin, like Michael just mentioned. And it will also download the Gemma 2B model by default. Gemma 2B was recently released by Google. And then it will download the portable Wasm app. It's called llama-api-server.wasm. This app will create an API server for the Gemma 2B model. And then it will download the user interface, the chatbot web UI. And then finally, it will run this command line to start the API server. So the server has started. Let's go to localhost, port 8080, to chat with the Gemma 2B model. Let's create a new chat. So let's ask a simple question: what is the capital of France? So this is running completely on her laptop. She has a fairly low-end Mac laptop, and it's already running a large language model — the Gemma 2B model released by Google. So I can ask a follow-up question: plan a one-day trip. So yeah, you see it follows the conversation. If you just tell it, plan me a one-day trip there, it will not know where you are talking about. But the previous question asked about Paris, so it knows it needs to plan a one-day trip to Paris. So this is a very small large language model, and you can see it runs really fast on her laptop because it uses the GPU. It uses Apple's — I think it's the M2 device, right? That's the M2 MacBook. So it certainly speaks faster than a human can speak, I think. And it's completely local, right? The web server and the large language model are both running inside Wasm.
So we compiled an application that stands up a web server, serves this HTML web page, and then serves API endpoints that point to the large language model so that you can chat with it here, right? So this entire application — that's llama-api-server.wasm — is an application compiled from Rust. And that's what we demonstrated here. Yeah, so this is the first demo. We just created an OpenAI-compatible API server for the Gemma 2B model. And as Michael mentioned before, with WasmEdge, with LlamaEdge, we can write it once and run it anywhere. So in the next demo I will copy and deploy the llama-api-server.wasm app to an NVIDIA GPU device. So let's go back to my terminal. I will stop the current process. And then I will use the SCP command to transfer llama-api-server.wasm to the NVIDIA device. So you can see we are copying that Wasm file, right? Just a single Wasm file. There's no Docker image. There's no recompiling anything. It's just a binary file, very much like a Java bytecode application, right? To a server that is located halfway across the world in our engineering office in Taipei. And there, we have an NVIDIA Jetson device. It's about this big. It has a fairly large CPU and GPU. It costs about $2,000. We use it in our office as a coding assistant. We run the Code Llama model on that device and hook it up with all the Visual Studio Code IDEs that we have in that office, so that our developers can ask questions about their daily work. One of the most interesting things we ask is — because we port a lot of Python stuff into WebAssembly, oftentimes we need to translate a Python program to Rust. In the past, you needed someone who understands both Python and Rust. But today, we just send the Python program to the Code Llama interface and say, translate this to Rust. And six out of ten times, it can do it very well. So that's how we use it. But as we have just shown, we copied this file to the Jetson Orin device. So now you can see the file on the Jetson Orin device, and the timestamp is today. It's a 5-megabyte program, not a 5-gigabyte program like PyTorch would have you ship. So it's a 5-megabyte program that's been copied to that device. Now, please. So I will move the Wasm file to the model directory, because I have lots of models there. So in the model folder, I have lots of large language models. I already have the Gemma 2B model. And this is the llama-api-server.wasm file I just moved. So I can use the same command line to run the Gemma 2B model. Let's go back to find the command line. So the interesting thing is that all the models she has downloaded on that machine can work with this single Wasm file. That's what we call a truly portable Wasm file. Not only does it run across different GPU architectures, it also runs across different models. If you go to Hugging Face and search for Llama 2 models, there are literally thousands of them. People have fine-tuned thousands of different versions with different expertise and different styles. So yeah, that's OK. So she has started the application server now. And because this is behind the firewall, she's going to do SSH tunneling to this local device, right? No, I will log into the device in another — OK, OK, so I'll log into the device in another terminal. So next, I will use curl to send API requests to the server. So we demonstrate another way of using that API server. It's called an API server, but we showed a chat interface before, right?
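The request Vivian is about to send with curl follows the OpenAI chat-completions wire format. Here's a rough sketch of the same call made from Rust; it assumes the server is listening on localhost port 8080 and exposes a /v1/chat/completions route, and the model name shown is only illustrative.

```rust
// A sketch of an OpenAI-style chat request against the local LlamaEdge API
// server. Assumptions: the server listens on http://localhost:8080 and serves
// the /v1/chat/completions route; the model name is illustrative.
// Requires the reqwest crate (with the "blocking" and "json" features) and serde_json.
use serde_json::json;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let body = json!({
        "model": "gemma-2b",
        "messages": [
            { "role": "system", "content": "You are a helpful assistant." },
            { "role": "user", "content": "What is the capital of France?" }
        ]
    });

    let resp: serde_json::Value = reqwest::blocking::Client::new()
        .post("http://localhost:8080/v1/chat/completions")
        .json(&body)
        .send()?
        .json()?;

    // In the OpenAI format, the generated text is at choices[0].message.content.
    println!("{}", resp["choices"][0]["message"]["content"]);
    Ok(())
}
```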
So it also responds to OpenAI-style REST queries, right? So if you are familiar with the OpenAI stuff, it looks fairly familiar. It's just api.openai.com replaced with localhost, right? And you send it a JSON message with the question you want to ask, yeah? Yeah, the same question: what is the capital of France? I think you typed something wrong, because you shouldn't have that — yeah, it wasn't properly escaped the previous time. But now you can see it says the capital of France is Paris. It comes up with a slightly different answer every time because it has what they call the temperature setting, right? So the model doesn't come up with the same answer every single time, but both are correct, right? So do you want to ask another question? Yeah, before we ask another question, we can conclude that with LlamaEdge, we can create an OpenAI-compatible API server for any open-source language model. It's portable and it's lightweight. Because the API server is compatible with OpenAI — it follows the OpenAI spec — we can make the API server work with any toolchain in the OpenAI ecosystem. For example, LangChain and LlamaIndex. But I want to recommend that you use flows.network, because it's also a serverless platform built on Rust and WebAssembly. You can use flows.network to build agents, Discord bots, and Telegram bots. So this is an OpenAI-compatible API server. But it's not enough. I will ask the model another question. So we'll ask a question that the model is likely to get wrong, and then show you how to fix it. The question is: how do I install WasmEdge on my OS? Yeah, as you can imagine, the model may not have that information, because it's a 2B model. It's a severely compressed knowledge of the internet. It takes longer because it's not doing the streaming return. It has to generate the whole answer before it comes back, which is also what OpenAI does. But it's going to generate a list of steps to install WasmEdge. And because, truth be told, WasmEdge is not as widely known as, say, Linux, it gives you some very generic answers. So it says, yeah, installation steps: download the runtime — where to download the runtime, it doesn't know — and select the latest Mac version of the runtime. In general, those are very generic steps. Yes, if you want to install an app on your MacBook, go to the website and download it from the releases page. That's the general approach, but it's wrong. It doesn't work for WasmEdge. Here is the right answer: you can install WasmEdge with just one command line. But how can we make the language model answer such questions? So that's where, I think, the ability to develop applications around the language model really shines. Because there are lots of choices that give you a model and stand up an API server. We have seen this in the keynote. We have seen this in many, many places — even llama.cpp has one; it's called the llama.cpp server. But a lot of times, you need to do more than that. So for instance, there is a very common technique called RAG, meaning that you vectorize an external data source and put it into a vector database. And with each prompt, you search that vector database for relevant information in your own knowledge base, put that into the prompt, and ask the language model to reply based on that prompt. For things like that, if you just have an API server, you would have to set up another toolchain outside of it, typically in LangChain or LlamaIndex.
And because oftentimes you need multiple models to act together — one model detects what the question is about, the other model actually answers the question — you end up having a fairly complex multi-container setup that has Python everywhere. So it is complex and it is difficult to manage. However, with a programming environment, with a development environment like WasmEdge and LlamaEdge, you are able to build all this logic into a single application, compile it into a single binary, and then distribute that single binary. So there are lots of things you can do with this single binary. By having all the API elements and everything be programmable, you will be able to, say, build the RAG functions directly into the API server, so that from the outside, the user doesn't need to know there's a RAG function. They just ask a question and it magically gives the correct answer. You configure the knowledge base on the back end, instead of having the user have to know where the knowledge comes from. You can do state management in the API server. You can provide answers based on the previous history. And you can do integrated function calling through the LLM's structured responses. That is actually a fairly hot research area, where you use prompting and fine-tuning to make sure that the LLM always replies in a certain JSON format. And because it replies in a certain JSON format, the reply can be consumed by another application. We are seeing a lot of those types of projects on the show floor today, at this KubeCon as well. So a lot of those agent applications use specially fine-tuned and specially prompted LLMs to come up with structured data, and then use another runner to act on the structured data. All of those need different applications to work together. And you have the ability to call external services. For instance, if the prompt, say, refers to some URL, you need to go out and fetch the URL first, get the content down first, and then analyze it with the large language model. You can also run multiple models in a single service, so that maybe some questions are answered by model A, and some questions are answered by model B. Or you can have model A and model B each come up with answers, and have model C summarize the answers or pick one. The architecture is similar to MoE, right? So there are lots of things you can do. And again, today you can do that with Python, or you can do that by having a lot of glue code to glue things together. But with LlamaEdge, you are able to build a single application using Rust, or using JavaScript, or using Go, compile it down to Wasm, and then just distribute that Wasm file onto any GPU device that you may have. So in the next demo, we're going to show one of the RAG applications that one of the end users in our community built. It's very interesting. Yeah, in the next demo, I will show you how to build a single-binary RAG application. It's from our end user. It's called GaiaNet. So I will go back to — this is a very rough demo, because they haven't launched officially yet. So it's very rough. So in this config.json, we need to configure some information. For example, for chat — here we use the Llama 2 7B model to chat with. And we also use a special embedding model to create embeddings. And the document is the data source. This is the WasmEdge documentation for installation. So this is the data source: how to install WasmEdge on different OSes.
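Conceptually, what this single-binary RAG server does with those pieces at query time looks something like the sketch below. The helper functions are hypothetical stand-ins for the real WASI-NN and Qdrant calls; this is a sketch of the flow, not the GaiaNet implementation.

```rust
// A conceptual sketch of the query-time RAG flow described above. embed(),
// search_vector_db(), and run_chat_model() are hypothetical stand-ins for the
// real WASI-NN (embedding/chat model) and Qdrant calls made by the server.

fn embed(_text: &str) -> Vec<f32> {
    // Would call the embedding model (e.g. all-MiniLM) through WASI-NN.
    vec![]
}

fn search_vector_db(_query: &[f32], _top_k: usize) -> Vec<String> {
    // Would query the Qdrant collection for the most similar document chunks.
    vec!["Install WasmEdge with the one-line installer from the docs.".to_string()]
}

fn run_chat_model(prompt: &str) -> String {
    // Would run the chat model (e.g. Llama 2 7B) through WASI-NN.
    format!("(model answer based on prompt: {prompt})")
}

fn answer_with_rag(question: &str) -> String {
    // 1. Embed the user's question.
    let query_vec = embed(question);
    // 2. Retrieve the most relevant chunks from the vector database.
    let context = search_vector_db(&query_vec, 3).join("\n---\n");
    // 3. Put the retrieved context into the prompt and let the model answer.
    let prompt = format!(
        "Answer the question based only on the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
    );
    run_chat_model(&prompt)
}

fn main() {
    println!("{}", answer_with_rag("How do I install WasmEdge?"));
}
```

From the user's point of view, none of this is visible; they just ask the question against the same OpenAI-compatible endpoint.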
So after we have set up the configuration, we can go back and create the embeddings for the document. GaiaNet also provides some shell scripts to kickstart this. So I will just use one shell script — the init script — to create the embeddings. From the log output, you can see that it will — oh, sorry — it will download all the necessary software, like the WasmEdge runtime. And it will install Qdrant. It's a vector DB, written in Rust. And it will also download the chat model, Llama 2 7B. And it will also download the embedding model to compute embeddings; it's called all-MiniLM. And it will also download the API server — it's a portable Wasm file — and the user interface. Then it will create a Qdrant collection to store the data. And then it will start the LlamaEdge API server. So the server will download the document from the link I provided in config.json, and then it will upload the document to the LlamaEdge API server. The API server will chunk the document, so a large article becomes smaller chunks. And then the API server will call the embedding model to compute the embeddings for the chunks. And after the embedding is done, it will upload the embeddings to the Qdrant vector DB. And that is the step that creates the embeddings. So now we have the embeddings. Then we can chat with Llama 2 7B using these embeddings. So I will use another shell script. This shell script will start the Qdrant instance and then start the LlamaEdge API server again. This creates an API server for the Llama 2 7B model. So we can go to localhost to chat with the model. And we will ask the same question: how to install WasmEdge? Because we have given the model the WasmEdge documentation for installation. So this is a 7B model, so it runs slower than the 2B model, especially at startup, because it needs to load all five gigabytes of the model into memory. But now it's come back. Based on the provided context, you can install the WasmEdge runtime on macOS by running the following command. And that's exactly the command that appears in our documentation. The Llama 2 7B model doesn't have this knowledge by itself. It's the documentation that we provided through the vector database. Yeah, so this time the model gave us the right answer. So with LlamaEdge, we can create complex applications. That's all the demos for today, but we still have one more thing. Yeah, so with all this — we're at KubeCon. We have demonstrated command lines. We have not had time to show you the source code. All the source code is available at the end, on the next slide. You can see the Rust source code for the API server, for the RAG API server, and things like that. But all of this runs in Kubernetes. We have been working with Kubernetes distributions in the community for the past two, three years to support Wasm as a first-class container, or artifact, in the Kubernetes cluster. So we can push Wasm applications to Docker Hub and label them as Wasm. So when crun or containerd pulls down that image and sees that it's not x86_64 but Wasm, it knows to use a Wasm runtime like WasmEdge to start it. So with that, we are able to run those large language model applications. We also have demos on our website showing how to distribute those large language model applications in device-agnostic ways through a Kubernetes cluster. As you can imagine, there are lots of things that you can do now, with devices on the edge and devices on the server.
So I'd really encourage you to try it out, although we don't have time to show a full Kubernetes demo. We only had time to show the command-line demos. So those are the resources for the things that we talked about, including all the source code that you've seen — that's for building the regular, vanilla API server and the RAG API server, and how to run those with container tools. So I think that's it. Thank you. Thank you. So I think our time is up, but we're going to stay around here for a couple of minutes. So if you have questions, come up and ask.