So this talk is about AI. I think there are a couple of AI talks in this room; this one and the following two are both AI talks, and we have some exciting stuff to show. So this is getting started with AI and WebAssembly. My name is Michael Yuan, and I am the founder of the WasmEdge project. And I'm Angel, a staff engineer on the Wasm team at VMware.

All right, so I think we should start with a demo, and then we'll get into why we do this and how we do this and all that. This is a demo I showed earlier today at the Rust conference. It is what you see is what you get: this is all there is to the code to do object detection using Rust, running in Wasm. People keep saying that Python is simple and Rust is difficult, but look at this. It's in fairly big fonts and I can fit everything onto a single screen. What it does is also easy to read. It's a main function that takes a file name called model_path and another called image_path, so it takes a model and an image. Then it uses a builder pattern to build up the model, runs the model, gets a result saying what was detected in the image, and draws boxes on the image.

So let's look at the image we gave it. If we give it an image like this, there's a cat and a dog. We compile this with cargo build into WebAssembly and run it in WasmEdge. Here is how you run the Wasm application in WasmEdge: you use the wasmedge command line (you can also use it in many other places, embedded in other SDKs), you run the Wasm application, and you give it an image called example.jpg, which is that cat-and-dog picture, plus an output image, which is what we're going to generate from the application. The application, if you remember, has a draw-boxes step. Once you run this, it runs the model and gives a result; the result says there are two boxes, and we draw those boxes on the original image and write it to the output. So that's how easy it is to use Wasm to do AI inference.

There are several things here that I think are really important. The first is that this has zero Python dependency. There's no need to install Python, figure out how it works, and then wrap a container around it, which in the cases I've seen can easily be one gigabyte. The whole Wasm application here is something like one or two megabytes. And it's completely portable across all the CPU architectures and GPU architectures because of the portability of Wasm, and it can easily be managed by container tools like Docker, Podman, or Kubernetes. So that's the power of doing AI inference in Wasm. From our point of view, Wasm is really the glue layer that ties together all those underlying AI and inference frameworks. If you're interested in the demos, there's a link here so you can run them yourself. So with that, I'll pass it to Angel to talk about machine learning. Thank you.

So, yeah, before jumping into the specifics of AI and WebAssembly, let's do a quick introduction to machine learning. Just in case you are not really familiar with this topic, I think it's interesting to get a taste of how it works.
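To make the object detection demo described above concrete, here is a rough sketch of what that kind of Rust program can look like. This is not the exact demo code: the builder-style API follows the style of the mediapipe-rs crate's examples, and the type names, helper paths, and file names are assumptions for illustration; the real code is in the linked repository.

```rust
// Sketch of an object detection main() in the spirit of the demo above.
// Assumptions: a mediapipe-rs-style ObjectDetectorBuilder API and the `image` crate;
// exact names and paths may differ from the actual demo code.
use mediapipe_rs::postprocess::utils::draw_detection; // assumed helper location
use mediapipe_rs::tasks::vision::ObjectDetectorBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Arguments: <model_path> <image_path> <output_path>, e.g. yolo.tflite example.jpg output.jpg
    let args: Vec<String> = std::env::args().collect();
    let (model_path, image_path, output_path) = (&args[1], &args[2], &args[3]);

    // Load the input image (the cat-and-dog picture).
    let mut img = image::open(image_path)?;

    // Builder pattern: load the model file, then run detection on the image.
    let detections = ObjectDetectorBuilder::new()
        .max_results(5)
        .build_from_file(model_path)?
        .detect(&img)?;

    // Draw the detected bounding boxes back onto the image and save the result.
    draw_detection(&mut img, &detections);
    img.save(output_path)?;
    Ok(())
}

// Build and run (paths and flags are placeholders; adjust to your setup):
//   cargo build --target wasm32-wasi --release
//   wasmedge --dir .:. target/wasm32-wasi/release/detect.wasm yolo.tflite example.jpg output.jpg
```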
So, if we think about machine learning, sometimes it looks like magic: you basically give it something and you get an output. But in reality, under the covers, everything is mathematics. That's what runs underneath the whole machine learning and AI ecosystem. If we follow this model, what we basically have is inputs, which are the data we give to the machine learning model. They can come from many different sources: audio, video, an image, metrics, text, whatever you want. That goes into the neural network, which is the actual model that runs the inference, and it gives you an output, which is a prediction. What that output is depends on the model you're running. It may be the next token or the next piece of text in a large language model, or it could be an image classification, as Michael was showing before, with the different boxes, their probabilities, and their locations.

If we uncover that neural network block a little more, it's basically a graph of interconnected nodes. It takes the inputs, performs several mathematical operations on the data, moves the information through the layers, and then reaches the output layer, where it decides what value it's going to produce, in other words what it predicts.

If we go to the image classification example and simplify it a bit more: suppose we have a model that classifies an image as a dog or as a cat. You have this input, but in reality what happens is that you split that input into data the model understands. These kinds of models usually work with three different channels, so first you get the image in the different channels; you get the pixels. You're looking at the image here, but what the model understands are the pixels. That's what's called the input layer. Then we have the intermediate layers, which are the ones that actually perform all the computation on the numbers from the image you provided. And then you get the output, which in this case is basically two values: whether it's a cat, with a probability, and whether it's a dog, with a probability. That's how it works under the hood.

But there is still a missing piece. We have this example model that takes the dog picture and tells you it's a dog with high probability and not a cat, but how does it actually know what a dog or a cat is? The thing is that a machine learning model doesn't know at the beginning what values go inside those mathematical operations. You need to train it first. You give it a set of labeled examples: this is an image, and this image is a dog; this is an image, and this image is a cat. After looping over this training set, it finds the best numbers to put on those nodes to produce that output. Once you have trained the model, you can take it and start using it to predict from the images you pass to it.

So now that we have those basic concepts about AI, let's continue with AI and WebAssembly with Michael.

All right, thank you, Angel. So the first example I showed runs inference, and as Angel mentioned, once you have the model trained you can run inference easily in WebAssembly. But how exactly do you do it? There are several ways you can run inference.
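To make the cat-or-dog walkthrough concrete, here is a toy, self-contained version of what one step of such a network does: a layer of learned weights turns the input numbers (the pixels) into one score per class, and a softmax turns the scores into probabilities. The numbers below are made up for illustration; in a real model they are the values found during training.

```rust
// Toy "network": one fully connected layer plus softmax, for two classes (cat, dog).
fn softmax(scores: &[f32]) -> Vec<f32> {
    let max = scores.iter().cloned().fold(f32::MIN, f32::max);
    let exps: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

fn main() {
    // Four "pixel" values standing in for the image data the input layer receives.
    let pixels = [0.2f32, 0.7, 0.1, 0.9];
    // One row of (made-up) learned weights per output class, plus a bias each.
    let weights = [[0.5f32, -0.3, 0.8, 0.1], [-0.2, 0.9, 0.4, 0.6]];
    let biases = [0.1f32, -0.1];

    // Each output node is just a weighted sum of its inputs: the "mathematics"
    // hidden inside every node of the graph.
    let scores: Vec<f32> = weights
        .iter()
        .zip(biases.iter())
        .map(|(w, b)| w.iter().zip(pixels.iter()).map(|(wi, xi)| wi * xi).sum::<f32>() + b)
        .collect();

    let probs = softmax(&scores);
    println!("cat: {:.2}  dog: {:.2}", probs[0], probs[1]);
}
```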
A lot of people think of Wasm as a CPU architecture, and the compiler certainly sees it that way: you can compile to x86 or you can compile to Wasm, and if you compile to Wasm, the generated code runs inside a Wasm runtime. However, it can also interact with other systems on the host computer, including a machine learning framework that potentially runs on the GPU. So the Wasm module can talk to a machine learning backend, which in turn talks to the CPU and GPU on the same host machine. And not only CPUs and GPUs: there are also TPUs, Apple Silicon has its own set of neural chips, and Arm also has its own standard for neural network inference. Because the Wasm application lives on one side and the actual inference engine lives on the other, separated by an interface called WASI-NN, which is a WASI standard, the Wasm application can adapt to different backends. That's why I said it's portable. You compile the Wasm application once and run it everywhere; it doesn't need to change at all. Whether it runs on a CPU, a GPU, Apple Silicon, or anything else depends on the host, on where you installed the Wasm runtime. When you install the runtime, it figures out what hardware you have on your system, and through the WASI-NN interface the Wasm application can use all of those underlying hardware features.

The WASI-NN standard, WASI neural network, exposes the underlying AI and machine learning framework as host functions in the Wasm runtime. It started in 2019, right before the pandemic, and this conference is actually the first time I've met the authors of the WASI-NN proposal face to face. We've talked many times and we implemented this standard, but we never met in person. Because it uses host functions, you get the full performance of the underlying hardware: you can use any GPU or accelerator that's available on the host. As a standard it's currently in phase two, meaning it's being implemented and new things are still being added to it, and it's implemented by Wasm runtimes such as Wasmtime and WasmEdge.

So, WASI-NN support in WasmEdge. First of all, we are fully standard compliant, and we contributed to the current Rust API crate, because while the host function runs the workload, inside Wasm you still need an API, whether your application is written in Rust or another language and compiled to Wasm to run inside the runtime. One interesting thing about WASI-NN support in WasmEdge is that we support multiple backends: TensorFlow, TensorFlow Lite, PyTorch, OpenVINO, and GGML. I'm going to talk about GGML in a minute. So essentially all the leading machine learning frameworks are supported inside WasmEdge.
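For reference, the application side of this host-function approach is small. Here is a minimal sketch of the WASI-NN flow from inside a Wasm module, assuming the GraphBuilder-style API of the wasi-nn / wasmedge-wasi-nn Rust crates; the backend, file names, and tensor shapes are placeholders, so check the crate documentation for the exact signatures.

```rust
// Minimal WASI-NN flow: load a graph, create a context, set an input tensor,
// compute, and read the output tensor. The heavy lifting happens in host
// functions, on whatever backend and hardware the runtime found on the host.
use wasmedge_wasi_nn::{ExecutionTarget, GraphBuilder, GraphEncoding, TensorType};

fn main() {
    // Ask the host to load the model (backend and target names are placeholders).
    let graph = GraphBuilder::new(GraphEncoding::TensorflowLite, ExecutionTarget::CPU)
        .build_from_files(["model.tflite"])
        .expect("failed to load model");
    let mut ctx = graph.init_execution_context().expect("failed to create context");

    // An input tensor is just a flat array of numbers plus its dimensions.
    let input = vec![0f32; 224 * 224 * 3];
    ctx.set_input(0, TensorType::F32, &[1, 224, 224, 3], &input)
        .expect("failed to set input");

    // compute() crosses the Wasm boundary into the native inference framework.
    ctx.compute().expect("inference failed");

    // The output comes back as another tensor, ready for post-processing in Rust.
    let mut output = vec![0f32; 1000];
    let _bytes = ctx.get_output(0, &mut output).expect("failed to read output");
    println!("first class score: {}", output[0]);
}
```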
So if you write an application that uses, say, a PyTorch model or a TensorFlow model, you can compile it for WasmEdge, and you know that as long as the underlying framework dependency is installed and available on the host, it's going to run. The other interesting thing is that WasmEdge has a plugin mechanism that is fully component-model ready. We've heard about the component model today; there's the WIT interface. The plugins provide the interface between the Wasm runtime and PyTorch or TensorFlow, and the interface between that and the API inside the Wasm module can be generated with the component tooling. And, like everything in Wasm, we started with Rust, but other languages are also being supported, like JavaScript and Python APIs. The Python API is one of our collaborations with VMware to enable Python support for WASI-NN.

The current support, and I can only speak for the WasmEdge ecosystem, is that we have those inference backends, and then we have data processing frameworks to go with them. For instance, in the image processing scenario you need to resize the image and convert it into a tensor; there are lots of things you need to do with the image. You can do that inside the Wasm module, or you can use other host functions to help. For instance, we are about halfway through porting two libraries, OpenCV and FFmpeg, to be Wasm compatible, so that from a Rust or JavaScript application you'll be able to call those functions to pre-process the image data and turn it into a tensor.

Combining those together, we now support several popular model libraries. One is MediaPipe: Google has a large set of AI models there that do image recognition, gesture recognition, natural language, sentiment analysis, and things like that. And YOLOv5 is a very advanced, dedicated image detection framework, where we use OpenCV on the front end to get the image data, process it with YOLO, and then come back to OpenCV again to draw the boxes on the video or the image. Both of those are mostly image processing. The other thing, which is the demo we're going to show at the very end, is that we now support large language models as well, like the Llama models from Meta. We go all the way up to the largest one: the 70-billion-parameter model can now run inside WasmEdge using the architecture we've just described. The way to do it is through a model framework: the Llama model was originally published in Python, but there's a highly optimized format called GGML, and we convert the model to that format and then run it. At the end, I will give a demo.

Okay, so we saw how you can run inference using the host, as Michael explained. There are many different options, many different backends you can use. But in WebAssembly you also have a second option you can take, which is basically running the inference inside the Wasm module itself.
That approach means you are not using anything from the host; you are running everything on the CPU inside the module. Of course, the performance may vary depending on the model. You cannot run really large language models because of the limitations of the WebAssembly standard, but for the right use cases you can get pretty decent performance and usability. One of the things that I really like about this approach is that you get, I would say, maximum portability: any place where you have a WebAssembly runtime can run this specific model for inference. The main requirement, if you want to follow this approach, is that you need to compile a machine learning engine into the WebAssembly module itself, so you can run it directly from the inside. You don't rely on anything on the host, such as PyTorch or GGML; everything must be done inside the WebAssembly module.

One of the examples that I found, and I was pretty excited about it when I saw it in a WASI-NN community meeting, is the llama2.c project. llama2.c is a project created by Andrej Karpathy, who is now at OpenAI but before that was the head of Tesla's Autopilot. It is basically an inference engine for Llama 2 models, written in a single C file of about 600 lines. Just for learning purposes this is amazing, because you can see the entire architecture and really follow all the different steps you need to run inference with this kind of model. The thing is that after a few weeks, this project had been ported to many different languages: there's Zig support, Rust, Python, JavaScript, all the languages. People started porting it just to demonstrate how you can run it in many different environments. And the good thing is that Andrej also trained a model, basically a small language model that can run on CPUs in a very efficient way. It's called TinyLlamas. It's just a toy model that writes stories for children, but it's pretty cool, because he created a fairly big version, around 400 megabytes, and then started shrinking it into smaller versions. The one I will show you later is only about one megabyte, and it's actually able to write English that makes sense, which I think is amazing. Having the ability to create models that are really specialized and that small opens a new set of possibilities.

Just to show you how portable this is: this is my personal laptop, and I'm running this with Wasmtime just to show that it works in any runtime. Here you can see that when I pass the model in and call the llama2c.wasm file, which is just that project compiled to WebAssembly and WASI, you get all of this directly. You don't need to do anything more; it just runs. I can take the same exact module, put it on a RISC-V board, and it runs exactly the same. I don't need to recompile, I don't need to install any tool; it just runs and works like it did on my macOS. But that's not all. I can take it and put it in the browser. Why not? It's just WebAssembly, and the browser comes with a WebAssembly runtime. These are the kinds of possibilities that this approach of running inference in WebAssembly opens up that I didn't see before. You can go to this URL, inference.wasmlabs.dev, to see how it works and try it for yourself.
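For readers who want to picture what "compiling the engine into the module" means, here is a structural sketch of the loop at the heart of llama2.c and its ports: run the model forward, pick the next token, repeat. The forward pass is stubbed out here (in llama2.c it is the hand-written transformer), so this is only an outline of the control flow, not a working language model.

```rust
// Outline of a pure-CPU, in-module inference loop in the style of llama2.c.
// There are no host calls at all, which is why the same .wasm file runs in
// Wasmtime on a laptop, on a RISC-V board, or in the browser.

// Stub: the real forward pass computes attention and feed-forward layers and
// returns one score (logit) per vocabulary entry.
fn forward(_weights: &[f32], _tokens: &[u32]) -> Vec<f32> {
    vec![0.0; 32_000]
}

fn argmax(logits: &[f32]) -> u32 {
    logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i as u32)
        .unwrap_or(0)
}

fn main() {
    let weights: Vec<f32> = Vec::new(); // llama2.c reads these from the model file
    let mut tokens: Vec<u32> = vec![1]; // the encoded prompt, e.g. "Once upon a time"

    for _ in 0..32 {
        let logits = forward(&weights, &tokens);
        let next = argmax(&logits); // greedy pick; the real code samples with a temperature
        tokens.push(next);
        // ...decode `next` back to text and stream it out as it is generated
    }
    println!("generated {} tokens", tokens.len() - 1);
}
```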
And yeah, so now, what's next for AI and WebAssembly? Thank you, Angel. I thought the demo was very impressive, because it really shows Wasm's capability to run efficiently on the CPU. You can see the text just streaming out. However, with this approach there's a problem that is difficult to solve in Wasm, because the Wasm architecture only allows four gigabytes of memory. If we go beyond the TinyLlamas, I think the smallest Llama 2 model is around seven gigabytes, something like that, and that's just the file; once you put it into memory it's going to be larger. People typically recommend having at least 64 gigabytes on the GPU to run those models. So there are some models, especially general-purpose models, that are difficult to run with this approach of reading the data into Wasm and running it on the CPU.

That leads to the next things that we want. The first is in the WASI-NN space. By the way, WASI-NN is a phase-two specification, and I would really encourage you to participate in this work; Andrew is right there, and he's one of the original authors of WASI-NN at Intel. We are seeing a lot more interest in, and demand for, running inference through WebAssembly, and WASI-NN is a great way to do that. One of the things we do now is named models. What does a named model mean? I can show a demo in a minute. A named model bypasses the Wasm memory space: it copies the model from disk directly into the inference framework, whether that's PyTorch, TensorFlow, or the GGML backend, so it goes straight to the inference application on the host. That really increases efficiency, it allows us to use the GPU, and it gets around the memory constraints that Wasm may have.

Then there's also a lot of ecosystem work to be done in this space. For instance, having inference alone is not enough, because the inference input is a tensor and the output is a tensor. That's it, and a tensor is just a big array of numbers. How do you get to that tensor? That's the million-dollar question. I have an image or a piece of text; how do I turn that into a series of numbers? And after the inference comes back with a series of numbers, how do I turn that into "it's a dog" or "it's a cat", or into the boxes on the image? That pre-processing and post-processing is one of the biggest strengths Python has today, because it has a lot of libraries to do it. As you can see, we have done a lot of work in the Rust ecosystem with Wasm to build those pre-processing and post-processing steps into Rust functions or into host functions like OpenCV and FFmpeg, but it requires more work from the community to really support more models. Right now we support some of the models that we and our customers use, but I think there's a lot more room to grow this ecosystem. Related to that, there's improving the toolchain. As I mentioned earlier, the strength of Wasm is really its multi-language support, so we really want the JavaScript API and Python support.
So, through a QuickJS implementation, we now have a JavaScript runtime in Wasm that hooks into this. When you run JavaScript inside Wasm, you can have an inference function: you can use JavaScript to do the image processing and then do the inference from a JavaScript function, but in reality it calls WASI-NN, which calls TensorFlow, and then the result comes back. So the JavaScript API is done through that. The Python API can be done the same way: VMware has a CPython port to Wasm, and through a joint internship program we have a graduate student who is adding those WASI-NN functions into the Wasm port of CPython.

So, to illustrate how this named-model approach handles a very large language model: again, this is the entire application. I'm not showing you just part of it, and it still has all the print statements in there, because I only got it running last night and haven't removed the debug statements yet. The point I want to convey is that it's really simple. People say Python is easy, but look at how easy this is. It takes a model name, then it takes a prompt. The prompt is a piece of text; like Angel said, it's "once upon a time", or whatever. So you're asking the model a question. The model encoding here is GGML: we took the Llama 2 model and converted it into the GGML format, which is still about 10 gigabytes. We load that model, and inside the Wasm application we now have a handle to it. Then we initialize the context and create the tensor data from the prompt: the prompt is an English sentence, we turn it into bytes, and that becomes the model input. The good thing about GGML is that you don't need any other preprocessing; you just turn the text into a byte stream and that's it. You send that to the model, and the only call that really does the computation is compute. You call compute on the context, you get a result, and you decode the result as UTF-8 back into a sentence.

So let's see how it works. At the top it's wasmedge, and the nn-preload option says: I want to preload a file as the named model for WASI-NN. This file is the Llama 2 13-billion-parameter chat model, a file on my local file system that's about 15 gigabytes. Then the Rust application I just compiled is this wasmedge-ggml-llama.wasm. The whole application is about two megabytes, and it's entirely portable across all the CPU architectures and GPU architectures, depending on what you have installed on the machine. If you have a GPU, you just change CPU to GPU on the command line, and you get Nvidia CUDA acceleration. So then the prompt is: who is Robert Oppenheimer? It thinks for a bit and then it generates the output. I'm showing the output here because it's very self-evident: it's generated by a well-trained large language model with a lot of knowledge. It knows who he is, and it produces completely correct, natural-sounding English sentences.
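The walkthrough above maps onto a short Rust program. Here is a minimal sketch of it, assuming the wasmedge-wasi-nn crate's API and the nn-preload flag syntax used by the WasmEdge GGML examples; the exact function names, tensor indices, and model file name are assumptions, so treat this as the shape of the code rather than the verbatim demo.

```rust
// Sketch of the named-model GGML flow described above. The model file is NOT read
// into Wasm memory; the host preloads it under the name "default", for example:
//   wasmedge --dir .:. \
//     --nn-preload default:GGML:AUTO:llama-2-13b-chat-q5_0.bin \
//     wasmedge-ggml-llama.wasm "Who is Robert Oppenheimer?"
//   (the model file name above is a placeholder)
use wasmedge_wasi_nn::{ExecutionTarget, GraphBuilder, GraphEncoding, TensorType};

fn main() {
    let prompt = std::env::args().nth(1).unwrap_or_else(|| "Once upon a time".to_string());

    // Attach to the preloaded named model instead of loading a file into Wasm memory.
    let graph = GraphBuilder::new(GraphEncoding::Ggml, ExecutionTarget::AUTO)
        .build_from_cache("default")
        .expect("failed to open the named model");
    let mut ctx = graph.init_execution_context().expect("failed to create context");

    // With GGML the prompt goes in as raw UTF-8 bytes; no other preprocessing is needed.
    let input = prompt.as_bytes().to_vec();
    ctx.set_input(0, TensorType::U8, &[1], &input).expect("failed to set input");

    // compute() is the only call that does the heavy work, on the host (CPU or GPU).
    ctx.compute().expect("inference failed");

    // Read the generated bytes back and decode them as UTF-8 into a sentence.
    let mut output = vec![0u8; 4096];
    let size = ctx.get_output(0, &mut output).expect("failed to read output");
    println!("{}", String::from_utf8_lossy(&output[..size]));
}
```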
However, while it sounds fluent, the middle part about his educational background is completely made up. It's entirely hallucinated: he graduated from Harvard, he never went to MIT; his PhD advisor was not Ernest Lawrence, it was Max Born; and he didn't get his degree from the University of California at Berkeley, he went to Cambridge and then to a German university. But it sounds very reasonable, doesn't it? So, just to show you: people take this type of model, like the Llama 2 models, and fine-tune them to add knowledge. That's one of the biggest drivers in AI today. We don't necessarily want to use the OpenAI models or ChatGPT; we want to train our own models so that they speak with our domain knowledge. And Wasm provides an efficient way to build services around those models. You can do it with a GPU, you can do it with a CPU, you can do it with Apple Silicon. Apple has become a really interesting player in this, because right now if I buy an Nvidia machine with the proper GPU to run this, it costs something like $15,000 and has a two-month waiting list. A Mac Studio also runs this for about $6,000; it has become the cheapest option out there. So I wanted to show you how the named model works: it can really scale all the way up to anything the hardware can handle, and it gets around Wasm's limits on pointer size and memory size and all that. In that sense Wasm really becomes like Python, because what Python does is manage these things and delegate to the native libraries that know how to run on the GPU. Wasm now does the same thing for us, but in a way that is cross-platform and cross-language. There are a couple of links here if you're interested in learning more. At the top are all our AI inference articles and tutorials, and the demo you have just seen, including a GitHub Action that uses the GitHub CI to run large language model inference, is in the repository we call the WASI-NN examples. So Angel, do you want to talk about yours?

Yeah, so one thing that we released last week is a project called Wasm Workers Server, which allows you to develop and run serverless applications on top of WebAssembly. You can also mix and match different languages. We thought: if you already have the capability to create one application using Ruby, Python, and JavaScript at the same time, why not take advantage of AI too? Thanks to the WASI-NN proposal, we integrated AI-powered workers into the platform. So now there's an article with a tutorial on how to run your first AI worker, and the documentation has all the information about how to run it; it's just one page. You can also download the example directly and run it in case you are curious. And, as Michael was mentioning before, there are many different opportunities to contribute to WASI-NN and the AI ecosystem, and one of the programs is through the Linux Foundation. There are different open issues you can pick up, if you are interested, on the WasmEdge repo and related repositories.
One example we have is about adding TensorFlow and PyTorch support to the Python version in WasmEdge. So if you want to start contributing to the ecosystem, these are great opportunities and I encourage you to check them out. There is also a label so you can find other opportunities, and it's a good way to get engaged and start working on this. And yeah, I think that's all we have for you. Thank you very much for being here and attending the talk. I think we have one minute for questions, so maybe we can take one question.

Okay, so the first question is: since we already have the option to run AI inside a WebAssembly module, how does it compare in terms of performance with running it on bare metal, in a native way? I think Michael may have something to say here, because we haven't benchmarked Wasm Workers yet.

Yeah, so the bulk of the workload is done in the AI framework itself, say PyTorch or TensorFlow. That part is exactly the same, because we are using PyTorch or GGML or those native frameworks to do the inference. The performance difference comes from the pre-processing and the post-processing. By some estimates, 95% of the CPU cycles are spent on pre-processing and post-processing, because the inference itself is mostly done on the GPU. In that regard, Wasm has a huge benefit over, say, Python, because doing the pre-processing in Python is really terribly slow. That's why there's a famous quote, I believe from Greg Brockman, that modern machine learning programming is finding out where the Python bottleneck is. If you find something slow, it's probably because Python isn't calling a native library for it: it didn't use OpenCV, it tried to do the work inside Python, and that can be 10,000 times slower than passing it to a native framework. In that regard, Wasm doesn't have this problem, because Wasm has near-native performance, especially when you do AOT compilation, so we can write those steps in Rust and have them run in Wasm. I think in a fully optimized environment, where you are just using Python as a thin wrapper and everything is passed down to C++, you're going to get similar performance between Wasm and Python. But if you are not careful and one of your pre-processing functions is actually implemented in Python, then Wasm is going to be a lot faster.

So, the second question. Yeah, we heard Luke Wagner talk about the component model today, which is meant to address exactly this type of issue, so we have really high hopes that cross-language invocation in Wasm will become easier in the future. At this moment, however, it's still mostly Wasm invoking host functions, not Wasm invoking Wasm. If you have a Wasm module and some Python, TensorFlow, or PyTorch on the host side, that's easy. But crossing between two Wasm modules is still a little difficult, and I think the component model aims to solve that problem.

All right, so we're going to hang around for a while, maybe outside the room, so if there's anything you want to discuss with us, feel free. Otherwise, thank you so much.