Hello, everyone. Thanks for being here. Today we'll be talking about the work we're doing to bring AI to the edge using WebAssembly and ONNX. Before starting: my name is Francisco Cabrera. I'm a technical program manager on the Azure Edge and Platform team. For the past couple of years I've been working on Kubernetes, Linux, and, since I met Ralph, WebAssembly. And I'm Ralph Squillace. I'm a program manager on the Azure Core Upstream team. Francisco will kick things off, and I'll come in and talk a little bit at the end.

Before getting into the technology and the work we did, let's talk about the kinds of customer use cases we're seeing for edge AI. On the Azure Edge and Platform team, specifically working on Kubernetes on the edge, we're seeing a lot of different use cases for AI on the edge. And every time we talk about "the edge," it's so big that it feels like everything that is not in the cloud is the edge, so we try to divide it into a few buckets.

The first bucket is what we call the highly distributed edge. Here we're talking about PC-class hardware: sensors and small devices running with something like 4 GB of RAM. We see these devices, for example, in industrial scenarios. There we're seeing demos where customers do weld detection, to determine whether a weld is correct or not and, based on that, bring in a human to decide what to do about it. We're seeing data-analysis use cases like predictive maintenance. We're also seeing a lot of vision AI to determine whether somebody is allowed to go through a specific zone and, if not, raise an alarm. We also have customers in the medical vertical, where there's a lot of vision AI using these algorithms to help doctors with diagnosis. And finally, a lot of retail. In retail we see use cases where people use these models to understand customer patterns: how customers move around the shop and how they interact with the different products. We're also now working with some fast-food chains, where you can use edge AI to take the order, for example with generative AI, and then use another vision-AI model to check whether the order was prepared correctly by the human. You don't want to arrive home and find you got the wrong order, so that matters.

The second part of the edge is what we call the on-prem edge, the big edge. Here we have customer data centers and CDNs.
In customer data centers we're seeing, for example, industrial scenarios with GPT-style models trained for specific industrial use: you train the model on the guide for how to use a machine, and if you're an operator on that plant floor you just go and ask, "Why is this blue LED blinking on my machine?" and use it for that. In CDNs, WebAssembly is already there running functions really fast, and those functions could now also get edge AI as part of them. And finally, if you think about it, almost every computer right now is running edge AI in some form with all the copilots: you have Office Copilot and GitHub Copilot. With GitHub Copilot, every time you type a letter it does an inference: it sends what you typed to the cloud and gets back a code suggestion. Today that's done in the cloud, but it's going to come back all the way to the edge.

So those are the different buckets of the edge. In our case, we're working on the highly distributed edge, on what we call utility inferencing: these devices will be doing small inferencing. We're not talking about big models, we're talking about small models, and we're not talking about training on the edge, just doing the inferencing on the edge.

So what are the challenges we see in these edge computing scenarios? The first is resource constraints. A lot of these small devices don't have a GPU; we're talking about an i3 or i5 with 4 GB of RAM, nothing like big on-prem servers. The second is that this is a really heterogeneous environment. When we talk with customers, we see Windows, Linux, ARM32, ARM64, x64, and now RISC-V, so the matrix keeps getting bigger and bigger. Then there's cost, and that's not only the cost of purchasing the devices, which might be a fleet of 20,000, but also the cost of running your solution. If you're running an AI model across all 20,000 devices and that model isn't energy efficient, that has a big impact. The same goes for connectivity: if you're downloading a container that wasn't optimized and is 200 or 300 MB, there's a cost in the connectivity and bandwidth you have. We work with a lot of customers who run edge AI on ships, for example, and the connectivity they have is really limited; sometimes it's just satellite, and sometimes they're disconnected for a couple of months.
And even when they do have a connection, it's really limited. And finally, security. When you're running an AI workload in the cloud, you know where it's running: a specific data center, you know who has access to it, and you have security at the hardware layer and at the software layer. But with these edge AI solutions, the device could be sitting on top of a shelf in any retail shop, and anyone can walk up, plug in a USB device, and try to inject something malicious. So when you're building this kind of solution, there are a lot of these challenges to tackle.

And when you add AI, it gets even harder. The customers we've been talking to who are trying to get AI to the edge need to tackle all the challenges we just talked about, but now they also need to decide what hardware acceleration to use: CPU, GPU, NPU, FPGA? It feels like every day there's new hardware to support. There are also multiple frameworks: Keras, PyTorch, TensorFlow, OpenVINO. There are so many frameworks that developers need to choose from when they're building an AI solution. And finally there are multiple models: with ChatGPT you have GPT-3.5 and GPT-4, there's Midjourney, there's Llama, and even within Llama 2 you have the 7-billion, 13-billion, and 70-billion parameter variants. So the matrix you need to support is really, really big.

Over the past few months we've been working with Ralph's team to develop these edge AI solutions and address most of these edge AI problems by using WebAssembly and ONNX. We believe we can tackle the main edge AI problems this way. First, you get a reduced footprint: instead of a 300 MB container, you can bring it all the way down to a WebAssembly module that's maybe two or three MB, and at the same time you get better memory usage. Second, as we said, this environment is really heterogeneous: ARM64, x64, Windows, Linux. By using WebAssembly you get one unified architecture and compilation; "compile once, deploy everywhere" starts to make sense. Third, improved security: we all know containers have some flaws, and WebAssembly's security posture is much better, with its deny-by-default model, so you get a much better security approach. And finally, by using ONNX you get a unified AI framework. ONNX is an open standard for how you represent these models.
So you could have different models, say in PyTorch or TensorFlow, and a single pipeline that converts them to ONNX, and then you always use WebAssembly plus ONNX to run your AI models.

So what's the work we've been doing? The first thing was to pick a model and convert it to ONNX. For this demo we used a really simple model, SqueezeNet, to do image classification, although the whole toolchain is there, so you can take whatever model you want and convert it to ONNX; that should be pretty straightforward. The second thing is that the ONNX Runtime has no Rust support today, but it does have a C API. There's an open source library called onnxruntime-rs, which was a bit outdated, so we updated it; it now works with the latest ONNX Runtime, we've tested it on macOS, Ubuntu, and Windows, and we'll be upstreaming that in the next couple of weeks. After that, we had to bring ONNX compatibility to WASI-NN. WASI-NN had a backend implementation for OpenVINO, and I think one for PyTorch, but no ONNX support, so we had to implement that in order to run these ONNX models directly through WASI-NN. And the final piece was bringing that support to Wasmtime and runwasi: we needed to add our ONNX WASI-NN implementation to Wasmtime and the runwasi shims. We have a PR open for that as well, and we expect it to be upstreamed in the next couple of weeks.

So now let's do a quick demo of the work we've been doing. I'll start by walking through the process. The first thing is the model. Again, we're using an ONNX model, a really simple SqueezeNet model. If you go to its page you get all the documentation, you can download it from there, and it tells you the steps you need to run it. In our case, the input is an image, 224 by 224. There's some pre-processing: convert to FP32, scale into the 0-1 range, and normalize. The output is the probability for each of the 1,000 classes that ImageNet was trained on. And finally there's post-processing: running a softmax. So it's really simple. Here is our example: we have the model, squeezenet1.1-7.onnx, and we have the labels here.

But before going into the demo itself, let's go over our WASI-NN implementation. This is the WASI-NN implementation in Wasmtime. If we go to the mod file, we can see that we now have an ONNX backend; that's one of the things we added. And here is the ONNX implementation itself. Roughly, the shape of a backend like this is sketched below.
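A backend like this has to provide a way to load a model, a way to create an execution context for it, and a compute path that takes the raw tensor bytes coming from the guest, runs the ONNX session, and hands bytes back. The sketch below is illustrative only: the trait and type names are hypothetical, not the real Wasmtime WASI-NN trait definitions, and the actual call into onnxruntime-rs is elided behind a stub; it's meant to show the byte-to-FP32 round-trip that the compute path has to do.

```rust
// Hypothetical, simplified sketch of a WASI-NN style ONNX backend.
// Trait and type names are illustrative, NOT the real Wasmtime definitions;
// the actual ONNX inference call (via onnxruntime-rs) is stubbed out.

#[derive(Debug)]
enum BackendError {
    InvalidTensor(String),
}

/// The rough shape of the interface a backend implements (illustrative).
trait OnnxLikeBackend {
    /// Load a model from raw bytes (e.g. the contents of an .onnx file).
    fn load(&mut self, model_bytes: &[u8]) -> Result<(), BackendError>;
    /// Run inference: input and output are byte buffers, because that is how
    /// WASI-NN passes tensors across the guest/host boundary.
    fn compute(&mut self, input: &[u8]) -> Result<Vec<u8>, BackendError>;
}

struct OnnxBackend {
    model_bytes: Vec<u8>,
}

impl OnnxLikeBackend for OnnxBackend {
    fn load(&mut self, model_bytes: &[u8]) -> Result<(), BackendError> {
        // In the real backend this is where an ONNX Runtime session would be created.
        self.model_bytes = model_bytes.to_vec();
        Ok(())
    }

    fn compute(&mut self, input: &[u8]) -> Result<Vec<u8>, BackendError> {
        if input.len() % 4 != 0 {
            return Err(BackendError::InvalidTensor("length not a multiple of 4".into()));
        }
        // 1. Reinterpret the guest's byte buffer as FP32 values.
        let input_f32: Vec<f32> = input
            .chunks_exact(4)
            .map(|b| f32::from_le_bytes([b[0], b[1], b[2], b[3]]))
            .collect();

        // 2. Run the ONNX session here (via onnxruntime-rs) -- elided in this sketch;
        //    we just echo the input as a stand-in for the model's FP32 output.
        let output_f32: Vec<f32> = input_f32;

        // 3. Convert the FP32 output back into bytes for the guest.
        let output_bytes: Vec<u8> = output_f32.iter().flat_map(|v| v.to_le_bytes()).collect();
        Ok(output_bytes)
    }
}

fn main() -> Result<(), BackendError> {
    let mut backend = OnnxBackend { model_bytes: Vec::new() };
    backend.load(b"...model bytes would go here...")?;
    let input: Vec<u8> = [0.1f32, 0.2, 0.3].iter().flat_map(|v| v.to_le_bytes()).collect();
    let output = backend.compute(&input)?;
    println!("got {} output bytes back", output.len());
    Ok(())
}
```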
Basically, this implementation implements the WASI-NN interface, the main functions here: the load function, which loads the model; init_execution_context, which creates the graph execution context; and, maybe most importantly, compute. What compute does is take byte vectors as input; that's how WASI-NN is defined, the input is always a vector of bytes. So here you can see we take the bytes, transform them from bytes to FP32, because our onnxruntime-rs bindings work with FP32 vectors, then execute the inference, and finally the output, which is FP32, gets converted back to bytes so that we conform to the WASI-NN interface.

So now let's check the example. As part of this implementation, we added an example of how you can run this. What we're going to do here is really simple inferencing. I have a dog image, and we'll look at our main. First I define the path where I have the image. Then I import the WIT bindings, specifically the WASI-NN modules I need: the graph, the inference context, and the tensor with the different tensor definitions. The first thing we need to do is load a model; in our case we load it locally, under the fixtures directory, squeezenet1.1. Once we have the model, we create the execution graph and get the context. Then we load the labels: the model was trained on 1,000 labels, so we load those so that at the end we can show, for each label, what the probability is. And here is one of the most important parts: we grab the image and prepare the WASI-NN tensor. There's a function called image_to_tensor that does that conversion. If we look at its implementation, it reads the image, crops it to the size this model needs, 224 by 224, grabs the bytes, which are just an RGB matrix, normalizes each pixel, and finally lays it out as a vector of bytes that we can pass as the input for WASI-NN. Once we have that data, we create a WASI-NN tensor with these dimensions, of type FP32, with that data. We set the input tensor and run the inference: you can see the compute call here, which executes the inference directly on the ONNX model. Once that's done, we retrieve the output; again, the output is a vector of bytes. A rough end-to-end sketch of this guest-side flow is shown below.
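To make that flow concrete, here is a self-contained sketch of what such a guest program can look like in Rust. It is an approximation, not the exact example from the PR: it assumes the high-level `wasi-nn` crate bindings (GraphBuilder and friends) and the `image` crate for decoding, and the file paths, normalization (simple 0-1 scaling, no mean/std subtraction), helper names, and exact signatures are assumptions for illustration.

```rust
// Rough guest-side sketch (not the exact upstream example): load an ONNX model
// through WASI-NN, turn an image into a 1x3x224x224 FP32 tensor, run inference,
// and softmax the scores. Assumes the high-level `wasi-nn` crate bindings and
// the `image` crate; names, paths, and signatures are approximate.
use std::fs;
use image::imageops::FilterType;
use wasi_nn::{ExecutionTarget, GraphBuilder, GraphEncoding, TensorType};

/// Decode, resize to 224x224, and scale each RGB channel into 0..1 (NCHW layout).
/// The real pre-processing may also subtract the ImageNet mean/std.
fn image_to_tensor(path: &str) -> Vec<f32> {
    let img = image::open(path)
        .expect("image not found")
        .resize_exact(224, 224, FilterType::Triangle)
        .to_rgb8();
    let mut tensor = vec![0f32; 3 * 224 * 224];
    for (x, y, pixel) in img.enumerate_pixels() {
        for c in 0..3 {
            // channel-major (NCHW) layout expected by SqueezeNet
            tensor[c * 224 * 224 + (y as usize) * 224 + (x as usize)] = pixel[c] as f32 / 255.0;
        }
    }
    tensor
}

/// Post-processing: the model emits raw scores, so apply a softmax ourselves.
fn softmax(scores: &[f32]) -> Vec<f32> {
    let max = scores.iter().cloned().fold(f32::MIN, f32::max);
    let exps: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

fn main() {
    // Paths are mapped into the module at runtime via a preopened directory.
    let model = fs::read("fixture/squeezenet1.1-7.onnx").expect("model not found");
    let labels = fs::read_to_string("fixture/labels.txt").expect("labels not found");

    // Load the ONNX model and create the graph execution context.
    let graph = GraphBuilder::new(GraphEncoding::Onnx, ExecutionTarget::CPU)
        .build_from_bytes([&model])
        .expect("failed to load graph");
    let mut ctx = graph.init_execution_context().expect("failed to init context");

    // Prepare the input tensor and run the inference (the bindings pass raw bytes underneath).
    let input = image_to_tensor("fixture/dog.jpg");
    ctx.set_input(0, TensorType::F32, &[1, 3, 224, 224], &input)
        .expect("failed to set input");
    ctx.compute().expect("inference failed");

    // Retrieve the FP32 scores, apply softmax, and print the best class.
    let mut scores = vec![0f32; 1000];
    ctx.get_output(0, &mut scores).expect("failed to get output");
    let probs = softmax(&scores);
    let (best, p) = probs.iter().enumerate().max_by(|a, b| a.1.total_cmp(b.1)).unwrap();
    let label = labels.lines().nth(best).unwrap_or("unknown");
    println!("{label}: {:.1}%", p * 100.0);
}
```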
Back in the example: we transform that output into FP32 floating point, and as part of the post-processing we apply a softmax layer, because that's not part of the model. Then we sort the probabilities and print the final inference.

So now let's run it. Before running this, because the public Wasmtime build has no ONNX support yet, we have to build our own Wasmtime, so I'll do that here: I build Wasmtime with the specific features we need, the component model and WASI-NN. There it is; I already have that compiled. Next I build the component for this example: you can see we're in the WASI-NN examples, in the ONNX classification component. I build that, and it produces the WebAssembly module. And now I run it: I use my own Wasmtime bits, tell it I need the component model, enable the experimental WASI-NN module, and map one directory from my machine into the module so it can read the images, the labels, and the ONNX model. Finally I run the WebAssembly module I just built, classification-component-onnx.wasm.

So I run it, and there it is. It's pretty fast. You can see it read the model, something like 45 MB, loaded the graph into WASI-NN, created the WASI-NN execution context, executed the graph inference, and we get the result: this is a golden retriever with 99% probability. Now I can do another example; I'll just change the image, say to a panda. It's really quick: I recompile the component with the new image and run it again, now with the panda image, and we get: this is a panda bear, with about 97% probability. So that's the demo: WebAssembly with WASI-NN and an ONNX model doing what we call utility inferencing for these edge scenarios. And now Ralph is going to talk a bit about what's next; this is work we started, and we're working on more for the next couple of months.

Right, so, if you can hear me: apparently I got a little feedback standing in front of the speaker, forgive me. When we talk about what's next, first I want to reset some of the context. WASI-NN, for people who don't know, is a proposal for a WASI interface, a component interface, that would be standard in the component model for things like models. The original implementations, as Francisco said, were things like TensorFlow, I think originally, and OpenVINO. OpenVINO, that's right. And that's a really good approach and implementation as well.
WASI-NN is fine, but one of the aspects of WASI-NN being open source is that it's a proposal, which means that anybody in the community, if this is interesting to you, can go participate in the evolution of that proposal. That was our experience here, and we're going to talk about a couple of variants that are clearly going to be part of the evolution of WASI-NN. By the way, ONNX and WASI-NN, all of this is open source; it runs everywhere, and that's a critical part of what we're doing here. And as Francisco mentioned, this is also going to be supported in the runwasi Kubernetes shims, where you can run essentially everything in Kubernetes, whether it's an edge Kubernetes like k3s or a large cluster where you want to use some big engines, and all of that is open source in the runwasi repository of the containerd project in the CNCF. So all of this is open source.

However, there are two things. Francisco mentioned a couple of them. One is the size of the models. The way the interface is defined today, you load the model into the module and pass the model through the API, and if the model is quite large, you can imagine that really slows things down. So if you're interested in size, if you're in a constrained space, or if your networking isn't that great, regardless of the resources you have, passing a 40 or 50 megabyte model is going to take time, and it's not time you want. You can't pass it on every invocation, for example, and you don't want to embed it in the WebAssembly either, because now your WebAssembly is substantial and you run into the same capacity problems. So for larger models, it's possible that the WASI-NN interface isn't really the final approach. In the repository and the proposal, if you're interested in larger models, you might want to look at something in the issue queue called named models; that name was not my fault. The idea is that you can essentially side-load some big models, or keep them in a cache, sitting next to the module execution environment and merely refer to them. In other words, the larger model is already loaded, and you use the module only for the compute of the inferencing. So that's one area where the WASI-NN API might be expanded or evolved, and if you have opinions about that, I'd love to have you upstream helping us think about how that might work.

Or it's possible there's another proposal for large models, because if you think about it, smaller models for the kind of inferencing Francisco was showing are super useful for inventory things or factory sensors. He mentioned looking at seams for welding, making sure those are okay. We even had an Azure Space scenario: one of the things astronauts do is examine their gloves for faults every time they come in from a spacewalk, and that's prone to human error, so they want to use this kind of inferencing to examine the gloves automatically and get a lower error rate before reusing them. These kinds of inferencing come up quite a bit. So we think of those smaller, focused models as very utility-related; think of that as the 80% case.
When you get to the larger models, there's a bigger panoply of targets, and you almost always need the other thing: hardware support. You probably do want hardware support, especially for a bigger model. Things like GPUs and other kinds of customized chips are going to be really, really important for those kinds of models, and now the question of how you detect a particular GPU and lean into that optimization becomes part of your problem. So for big models and for the hardware optimizations your scenario requires, it's possible that WASI-NN needs some evolution, or that we look at a different approach or a different proposal. That doesn't mean WASI-NN is unsuccessful; so far it's been really successful.

Another evolution we already know is going to happen, because we've been talking with the community and it turns out everybody is thinking about it, is that FP32 issue I mentioned. Having to convert everything into a vector of FP32 and then convert it back after you've done the compute is pure waste, pure computing waste. So one of the evolutions that's clearly going to happen, working with everybody upstream, is some elaboration of the API proposal to take in larger, more complex types, so you can pass those kinds of things without converting them merely to do the work you hoped to do. That's another example of the future.

A couple of other things: go ahead and click that, because I wanted to call out a couple of people here. The Wasmtime implementation will be submitted upstream in a PR to Wasmtime so that people can use it there. We obviously hope to work with other runtimes and get it adopted as a proposal so people can use it anywhere; it's super useful. And the WASI-NN implementation will be proposed, if it hasn't been already; I don't know if the PR is in yet. We'll put the PR in. The gentleman over there, David Justice, whose name is up there, can nod or raise his hand; if you want it immediately in the proposal, go yell at him to submit the PR, because that's our intention as part of the evolution. And this is all part of the component model, so we're really looking forward to Preview 2; that's something we think is really, really important.

But in addition to calling David out to embarrass him, I also want to thank him and Joe specifically for their work updating the component model definition, getting the new runtime compiling to it, recompiling Wasmtime, and getting the PRs ready for upstreaming. And I want to call out Radu Matei, who used to be on the Microsoft team and is now part of the Fermyon team; he's here at the conference. He originally did the work to run ONNX through WASI-NN about a year and a half, two years ago, and his code was so good that David and Joe and everybody were able to just pick it up, update it, and get it running really, really fast. So I want to call out Radu's earlier work that we're building on here. Thanks to everybody, including you two.

So that's about it. Do we have another slide? Are there any other questions? Thank you very much. All right, I don't know if we have a mic. Do we have a mic? It doesn't matter, I'll repeat the question. Go ahead. Yes. So this is interesting. It's actually going to be a big, how shall I say it?
It's going to be an experiential thing as we roll forward with components in WebAssembly. The reason is that WebAssembly has certain constraints; we'll call them constraints, though they can also be spun as benefits depending on what you're doing, with respect to the kind of performance you're going to get. For example, you don't have a really robust threading model there, or async, and things like that. That means that for scale-out computing, what you normally think of as a server process, something pretty heavy that can make use of a lot of async work and pool things, which is how you get throughput in a container workload or a regular workload, WASI right now makes that a little bit difficult. Only when we get threading and async will we be able to say, hey, I can take this big model and really burn some threads on it inside the module as well as outside.

What that means is that right now, you're going to do the computing work for the implementation in the runtime. That's why we're compiling here into Wasmtime, or WAMR, or whatever your runtime might be: that's where most of your heavy host implementation lives, and for now the guest routes the call to the implementation in the runtime. As we go forward, though, once you get all the async power, the threading, and the networking you might need inside WebAssembly, you'll start targeting environments where you're not highly centralized. Specifically, when you're talking about the distributed edge, you're not necessarily going to have high-powered machines, so it's probably easier, and maybe better in some circumstances, to ship all the stuff in a module, and you'll get the kind of performance you need for the utility, small-model work inside the module itself. You can ship the whole thing in the module; that will make sense in certain environments. But even then, if you're thinking about a centrally hosted service, like a CDN service, or, for example, I work in Azure, so say we had a Microsoft Azure ONNX service, which we're not going to have, but if we did, that kind of centralized service would be trying to target and support the whole world. For that kind of optimization, you'd almost never run it in a module; you'd have a custom-built implementation in the runtime part of things. All of these scenarios are merely different use cases. If that answers your question, that's how I see it right now.

The next question is: will this make it possible to run models without knowing whether you have hardware acceleration? Actually, yes, it's possible. Again, you get into one of those utility-versus-optimization scenarios. In some cases, with big models where you know you have a specific chip, because of the resource constraints or because of the size of the model you'll be required to optimize for a particular piece of hardware, because you know it's there. That's a case where you're not talking about utility inferencing; you're talking about something where you really need to dive into the hardware. However, in most cases, for example the ones Francisco was referring to where you don't really know all the hardware you've got, sometimes there's going to be a GPU and sometimes not, you really want a generic setup that can pick up the GPU and use it if it's there; a rough sketch of what that can look like from the guest's side is below.
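A minimal sketch of that "use the GPU if it's there, otherwise fall back to CPU" selection, assuming the high-level `wasi-nn` crate bindings; the exact error type and the failure behavior when a target isn't available are host-dependent, so treat the names here as approximations rather than a definitive implementation.

```rust
// Illustrative sketch: ask WASI-NN for a GPU execution target first, and fall
// back to CPU inferencing if the host can't satisfy it. Assumes the high-level
// `wasi-nn` crate bindings; error type and failure mode are approximations.
use wasi_nn::{ExecutionTarget, Graph, GraphBuilder, GraphEncoding};

fn load_graph_with_fallback(model: &[u8]) -> Result<Graph, wasi_nn::Error> {
    // Try the accelerated target first...
    match GraphBuilder::new(GraphEncoding::Onnx, ExecutionTarget::GPU).build_from_bytes([model]) {
        Ok(graph) => Ok(graph),
        // ...and fall back to plain CPU inferencing if the host has no GPU backend.
        Err(_) => GraphBuilder::new(GraphEncoding::Onnx, ExecutionTarget::CPU)
            .build_from_bytes([model]),
    }
}
```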
And you are going to see implementations like that. Radu's original implementation for us actually checked whether GPU support was available, and if it was, it used it; if it wasn't, it just did CPU inferencing. So that's entirely possible; the answer is yes, you can do that. And again, the choices you make, whether you do everything in an agnostic fashion or lean into your optimizations for hardware or for size or things like that, are going to depend on the scenario you're trying to address. Does that make sense? Any other questions? Obviously we'll hang out, so if you have questions you'd like to ask us afterward, feel free to come on up. Thank you very much. Thank you so much.