Hello, everyone. Welcome to Cloud Native WebAssembly Day. In this talk, we're going to talk about AI inference: how to do artificial intelligence with WebAssembly, and where to use it, which is typically on the edge of the cloud, so we call it AI inference on the edge cloud. My name is Michael Yuan, and I'm from a company called Second State. This talk is about how WebAssembly makes it easy to run TensorFlow inference functions in production on edge devices and platforms.

Before we get into the details of WebAssembly, let's spend a couple of minutes talking about AI. Deep learning and AI are hot topics today. When people think about AI, they think about the data, the algorithms, and so on, because everyone loves to train the greatest AI. However, after years of development in deep learning, most people in the industry have realized that training the AI is no longer the difficult part. It's pretty much routine work now; you don't even need to be a programmer. You just need to know how to run a Python script: collect the data, label the data, feed it into a training algorithm, and you get your TensorFlow model, or ONNX model, or whatever. Most of the engineering work is actually on the inference side: once you have a trained AI model and you use it in your production system to recognize speech, or an image, or a face, whatever you trained the model for, that is the bulk of the computing associated with AI today. If you look at this graph, an estimate from 2019, training takes only about 5% of the compute, and inference takes 95%. So in order to deploy your AI or deep learning system in production efficiently, we have to optimize for that 95%, the inference part.

Inference is 95% of the work, but it's also difficult to optimize. In this graph, each blue dot represents a trained AI model for image recognition. On the x-axis is the time required to run the model, and on the y-axis is its accuracy. You can see the accuracy goes from 70% to 85%, about a 20% improvement, but the time needed to run ranges from a couple of milliseconds to over 30 milliseconds. To get a 20% improvement in accuracy, the model has to run seven or eight times as long. That means it's difficult to scale those models: a slightly more accurate model needs a lot more computing power. So saving computing power, making the models run as efficiently as possible, becomes a major issue when deploying them in production. When you deploy those models, you really do need to pay attention to how efficiently you can make them run.

So let's look at several ways people run their TensorFlow models for inference. The title of this slide is "Why not Python?" The official TensorFlow serving application is a Python application, because Python is easy to use.
A lot of people use Python to train the model, so it stands to reason that people use it for inference as well: once you train the model in Python, you can see the model in action in Python, and then perhaps use it in production too. However, Python as a language is very inefficient. There is a famous study published in the journal Science in which the authors took a machine learning algorithm and implemented it in both Python and C. The C version was 60,000 times faster than the Python version: four orders of magnitude, just by switching to a different language. As a result, when you say you're doing machine learning or TensorFlow in Python, the bulk of the work isn't actually done in Python. Under the hood, the Python functions call the TensorFlow libraries written in C, so the workload actually runs in native code. That's how people get reasonable performance out of Python. Python is mostly the interface language that makes the system easy to program and easy to use for developers; when you really need heavy computation, you go down to the native TensorFlow libraries written in C.

However, even then, the data pre-processing and post-processing are still written in Python. If you have an image recognition model, you have to read the image, resize it, and flatten it to fit the shape of the tensor the model expects; then feed that to the TensorFlow model and get the result; and then map the result back to a metadata or label file to figure out the English name for the object in the image. In that process, only about 10% of the work is actually running the TensorFlow model itself in native code, in the C library. The pre-processing and post-processing are still written in Python if you write the entire application in Python. So 90% of the work is slow and could be heavily optimized; as we just saw, there are orders of magnitude of performance to gain simply by using a different language.

And of course, there are other issues when we talk about edge computing, because a very large use case for AI is on the edge. Say you have a smart camera and you have trained it to recognize all the people in your household. When a new person stands in front of your door, it should be able to tell whether that person is a member of the household or someone else, and it can unlock the door, or do things like that. For things like that, you want the AI model to run either on the camera, or on the router in the house, or on a server in the house, something of that nature. You don't want it to go all the way back over the internet. Not only would you see a big performance impact from running it on a remote server, say in New York, but you also don't want the data from inside your house to leave, for privacy reasons.
And of course, another very important use case is self-driving cars. You have cameras and other sensors, and those data need to be processed in real time on the edge, inside the car, not sent to the cloud for processing and then returned over the internet. In many of those cases, Python doesn't run on those devices at all: it doesn't run on very small devices, it doesn't run on very old Linux operating systems, it doesn't run on real-time operating systems. There are lots of issues. And Python offers limited support in web and application frameworks; only certain frameworks work with Python. For instance, it would be difficult to call a Python program from Node.js. It's possible, but it's not easy.

The little table I have here shows the performance characteristics of different programming languages. C++ is, of course, by far the best, with the rest coming in a close second. Java runs fast enough, but it consumes a huge amount of memory, 100 times more than everyone else. Python runs very slowly, about 100 times slower than everyone else, while consuming a reasonable amount of memory. Those performance characteristics make Python suitable for AI training, but not for running AI inference in production.

Now we have established that C++ has the best performance, and Python does not. So what about native applications? Why don't we just write all our applications, or at least the AI inference functions, in C and C++? After all, the TensorFlow library is already written in C, so why can't we write our application around it in C as well? Because to do so, we would basically give up all the advances we've made in the past 40 years of software engineering, whose goal has been to make software safer and more secure, and to make things easier for developers and users alike.

Specifically, if we write AI inference applications in native C code, we have these problems. The software is tied to a specific OS or hardware platform; it's not portable anymore, which is especially a problem for edge computing, where you may have a variety of CPUs and operating systems across different devices. It cannot be managed and orchestrated: it's a native application, you just start it, it could crash, and it's difficult to manage and stop. It's not safe: it can crash from bugs or attacks, and one mistake can bring down the whole system. And it's coarse-grained: access control is at the operating system level, so the application has the same privileges and permissions as the user who runs it. If the root user runs the application, it has access to the entire system, which is generally a bad thing. So there are lots of issues with native applications.
That's why, years ago, Java gained popularity because of the JVM, and after that Docker gained popularity as well: they provided isolated, portable runtimes for developers to run their applications. So we don't want plain native applications. We want something more modern, but with high performance, not slow like Python.

So, WebAssembly to the rescue; that's why we're here at Cloud Native WebAssembly Day. We have two conflicting goals: one is super high performance, and the other is the security, safety, and ease of use, all those nice features a container gives you. The best of both worlds, the compromise, can be reached through something called WebAssembly. The particular project we are discussing here is a new sandbox project that we have contributed to the CNCF. It's called WasmEdge, and it's also called SSVM, pronounced Second State VM; WasmEdge is WebAssembly on the edge.

WebAssembly has some really interesting advantages over both fully managed applications and native applications. The first is high performance: it's much more performant than, say, Python or Node.js or JavaScript, or even Java on the JVM. Second, it has sandbox safety: it's a sandbox that contains the native application. It provides a kind of capability-based security, meaning that when you start a WebAssembly VM, you can give it a set of policies or rules for what kinds of system resources it can access. That allows the WebAssembly runtime to have a different set of security permissions than the user who runs it. Third, it's language-agnostic, in a way: any front-end language that can be compiled through LLVM can run on WebAssembly. Languages that require a large runtime are somewhat impractical; for instance, it's difficult to get Java or Python running on it. But Rust, C, Swift, TinyGo, and languages like that can all be compiled to WebAssembly and run in one virtual machine. And the last thing is that we see a product-community fit here: there is already a large community around WebAssembly, with lots of people contributing, which we think is a crucial factor for the success of the technology.

To sum it up, from our point of view, the Rust programming language on the front end plus WebAssembly on the back end is roughly equivalent to the Java programming language plus the JVM about 20 years ago. We believe this combination can grow to be as important, or as big, as Java. This is the WebAssembly runtime we have donated to the CNCF. If you're interested, star it on GitHub, fork it, or just generally check it out. It is a popular WebAssembly VM specifically optimized for high-performance applications outside the browser. There are a bunch of WebAssembly VMs designed for the browser use case, but this one is designed for the server-side use case, meaning on the server or in the cloud. And it's called WasmEdge: WebAssembly on the edge.
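To make the capability-based security point concrete, here is a minimal sketch. The Rust code itself is ordinary; the point is that, once compiled to the wasm32-wasi target, it can only touch resources the host explicitly grants when starting the VM. The build and run commands in the comments are assumptions for illustration; the exact flag syntax varies by runtime.

```rust
// A minimal sketch of capability-based security under WASI.
// Build (assumed toolchain): cargo build --target wasm32-wasi --release
// Run (assumed flags):       wasmedge --dir .:. app.wasm
// Without a pre-opened directory, the read below fails inside the sandbox,
// no matter which OS user launched the runtime.
use std::fs;

fn main() {
    match fs::read_dir(".") {
        Ok(entries) => println!("sandbox can see {} entries", entries.count()),
        Err(e) => println!("no capability granted: {}", e),
    }
}
```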
In the next couple of slides, I want to use WasmEdge and SSVM as examples to discuss WebAssembly's approaches to TensorFlow inference. How do you do TensorFlow inference in WebAssembly? We all know how to do it in C/C++, and we know how to do it in Python, but how do you do it in WebAssembly, and what are the implications?

The first approach, and the easiest for developers to understand, is to just run WebAssembly interpreted on the CPU. The TensorFlow operators are implemented in Rust or C++ and compiled to WebAssembly; then, on the host machine's CPU, we execute the program, read the TensorFlow model, and execute those operators. We have an example of this approach in our repository; it does image recognition. It takes about 10 minutes to recognize one image on a modern CPU. That's a very long time, so the performance is obviously not acceptable, but it is perhaps the most straightforward way to do it: you provide an implementation of the TensorFlow operators in WebAssembly and use the interpreter to run them. In our benchmark, that takes about 10 minutes.

A much better way is just-in-time compilation. It's the same WebAssembly application, but it runs in a JIT runtime like V8. A JIT runtime no longer interprets WebAssembly instructions step by step; it compiles the code as it runs, so most of the execution happens in the native environment. In that case, we immediately see a 300 to 500 times boost in performance: the same image recognition task now takes two to three seconds. That's essentially how TensorFlow.js does it; the official TensorFlow library for in-browser and mobile applications runs inside a JIT WebAssembly runtime.

We can go one step further. The JIT compiler compiles things as it goes, while the VM executes the program. Instead, we can compile AI models directly into WebAssembly bytecode programs, which saves the time needed to map TensorFlow operations to WebAssembly operations at runtime. A TensorFlow model consists of a series of TensorFlow operations, and what the WebAssembly program does is read the model and execute those operations. If we combine those two steps, the instructions WebAssembly needs to execute each TensorFlow operator are built directly into the WebAssembly application, which makes it run even faster. In that case, the image recognition task takes one to two seconds, which I have to say is more or less acceptable for internet applications: if something on a web page can recognize a face in one to two seconds, I think most people would find that acceptable. Representative examples are the Apache TVM project, which compiles TensorFlow models directly into WebAssembly, and Second State's work on an ONNX compiler for Wasm, which compiles ONNX AI models directly into WebAssembly and runs on Qualcomm hardware.
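To keep the numbers straight, here is the same image recognition benchmark under the three in-Wasm approaches described so far:

- Interpreted WebAssembly on the CPU: about 10 minutes
- JIT WebAssembly runtime (e.g. V8): 2 to 3 seconds, a 300x to 500x improvement
- Model compiled directly to WebAssembly (TVM, ONNX-to-Wasm): 1 to 2 seconds

The next two approaches, calling the native TensorFlow library and adding hardware acceleration, bring this down further.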
Let's take it one step further, and recall how Python does it. Python does not run the TensorFlow library in the interpreter or in a JIT compiler. When the Python program needs to run a TensorFlow model, it drops out of Python into the native C++ environment, runs the TensorFlow model there, gets the result, and goes back to the Python environment. We can do exactly the same with WebAssembly: we can call the native TensorFlow libraries from WebAssembly. We provide a Rust API that compiles down to TensorFlow calls, and your data preparation and post-processing work can still be written in a highly optimized, high-performance language like Rust. With this approach, the image recognition task takes half a second, a 2,000 times improvement over the original interpreted mode. An example of this is SSVM with WASI TensorFlow, a Rust SDK and WebAssembly extension that we wrote for the SSVM we donated to the CNCF.

And now, because the TensorFlow model runs in the native library written in C++, that library can take advantage of the hardware features on the device. We can use ahead-of-time (AOT) compilation to optimize the data preparation tasks in Wasm, and we can run the TensorFlow library on specialized hardware such as AI chips or GPUs. With that, the image recognition task gets another 10-fold improvement: it takes only 0.05 seconds to recognize an image. In that case, we can do 20 image recognitions per second; full video speed is 30 frames per second, so this is close to real-time video. If you have a real-time video feed, we can run a model on it and see what's in each video frame. SSVM on Tencent Serverless shows an example of this. It's a very nice example: a web page that lets you upload a picture and tells you what kind of food item is in the picture. The whole application takes about five minutes to deploy. It's a very nice way to demonstrate the power of this approach.

For the remainder of this talk, I have about five minutes left, so I want to show you some code: how simple it is to run a TensorFlow model in WebAssembly using the approach I just discussed, the WASI TensorFlow approach. There is a function written in Rust called infer. The infer function takes a data array that represents the image and comes back with a string, a text description of what's in the image. That's the method signature. On this screen, you can see it reads the image, resizes it, and flattens it into a predefined format that can be consumed by the TensorFlow model. In the next slide, we read the TensorFlow model and the labels associated with the model. When we trained the model, we gave it labels for what's in each picture: this picture has a cat, this picture has a dog, things like that. Those labels are in this label file. Then we have a really simple API to run a TensorFlow session: we give it the flattened image as input and get the predictions as output from the TensorFlow model. In this step, we drop down to the operating system, run the model, and get the result back.
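Since the slide code goes by quickly, here is a compact sketch of what that infer function looks like. This is a reconstruction, not the slide's exact code: the session calls are modeled on the published ssvm_tensorflow_interface crate examples (exact names may differ by version), and the model file, tensor names, and 224x224 input shape are placeholder assumptions.

```rust
// A sketch of the infer() function walked through above. Pre-processing uses
// the `image` crate; the session API follows ssvm_tensorflow_interface's
// published examples, but treat the names as assumptions.
use ssvm_tensorflow_interface as tf;

fn infer(image_data: &[u8]) -> String {
    // 1. Pre-processing: decode, resize, and flatten the image into the
    //    tensor layout the model expects (here assumed: 1 x 224 x 224 x 3, f32).
    let img = image::load_from_memory(image_data)
        .unwrap()
        .resize_exact(224, 224, image::imageops::FilterType::Triangle);
    let flat_img: Vec<f32> = img
        .to_rgb8()
        .into_raw()
        .iter()
        .map(|&p| p as f32 / 255.0)
        .collect();

    // 2. Load the frozen model and its label file (hypothetical file names).
    let model: &[u8] = include_bytes!("mobilenet_v2.pb");
    let labels = include_str!("labels.txt");

    // 3. Run a TensorFlow session. This call drops out of the sandbox into the
    //    native TensorFlow library on the host, as discussed above.
    let mut session = tf::Session::new(model, tf::ModelType::TensorFlow);
    session
        .add_input("input", &flat_img, &[1, 224, 224, 3])
        .run();
    let probs: Vec<f32> = session.get_output("MobilenetV2/Predictions/Softmax");

    // 4. Post-processing: find the label with the highest probability.
    let (best_idx, best_prob) = probs
        .iter()
        .enumerate()
        .fold((0usize, 0.0f32), |acc, (i, &p)| if p > acc.1 { (i, p) } else { acc });
    let name = labels.lines().nth(best_idx).unwrap_or("unknown");
    format!("It looks like a {} ({:.0}% confidence)", name, best_prob * 100.0)
}
```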
Now we get the result back. The result is an array of probabilities: if the model has 100 associated labels, you get 100 numbers, each corresponding to the probability of that label appearing in the image. If the label corresponding to "cat" has a probability of 0.8, that means there's an 80% chance the picture contains a cat. So here we just sort the returned array to find the index with the largest probability. From there, we go to the labels file, the metadata file, to find out what that largest probability corresponds to, and generate an English description of the result. This whole Rust program is then compiled into WebAssembly and runs in SSVM in the edge cloud, which allows it to detect objects in any image you send to that cloud service. If you're interested, we have the source code, the demo, and tutorials; make sure to check them out.

So what's next? There are lots of things. The field is still young, but I think it solves a real, big problem: how to deploy production-ready AI inference applications in a manner that is secure, safe, and developer-friendly. There are many things we can do to improve what we have done so far. First, we can support the WASI neural network specification, a specification developed by the community to unify different AI model approaches. The approach we took so far is very TensorFlow-specific; it's called WASI TensorFlow. The WASI AI specification can accommodate more AI frameworks, and we are committed to working with the people in that specification group. Second, we can provide more SDKs for front-end languages. We just saw the Rust SDK, whose APIs we call to run TensorFlow inference, but we can do the same for other languages supported on WebAssembly. On the other hand, we can support more AI frameworks on the back end, beyond TensorFlow: ONNX models, PaddlePaddle, MXNet, and frameworks like that. The idea is that you can just plug in a different model, run the exact same code, and utilize that model. And we can support more AI hardware through those AI frameworks.

I think my time is up, so thank you very much for watching. I have another talk later in the afternoon about how WebAssembly can be used as a serverless runtime, and a lot of that use case is already illustrated here: when we develop an AI inference function, we actually deploy it as a serverless function, either on our own infrastructure or on Tencent Serverless or AWS Lambda. That allows application developers to just write code, upload it, and, boom, have the AI inference function in production, ready for everyone to use. So I hope you enjoyed this talk. Go to our GitHub repository at github.com/second-state/SSVM. Thank you very much.