Welcome. My name is Andrew; I'm an engineer at Intel. I've been working on WebAssembly and WASI for some time now, optimizing several runtimes and working on some of the specifications.

Hi, I'm Matthew Tamayo-Rios. I'm a staff software engineer at OCTO, Fastly's Office of the CTO.

Okay, so to kick things off: I came across this slide recently. Its original purpose was to say, "hey, we have a hard time telling how hard things are." It's actually from 2014, and at that point in time people thought that checking whether a photo is of a bird would take a research team five years. Today we're going to see it done: it took about ten years to actually make it a reality, and now you can do it in a few seconds, in Wasm. But we're also going to see how AI is going to make everything we care about better. What you have there is a piece of AI-generated imagery, which we'll actually show live, running in Wasmtime. The prompt was "AI will save the world, fire robot," and it gave some interesting imagery.

The reason we really care about generative AI is that it has the potential to transform almost every aspect of knowledge work. Think about text generation: legal contracts, auditing, business summaries. What some companies were trying to do before, like providing an automatic summarization of a business report for an executive, is now actually in the realm of possibility. Obviously, as you saw in the previous session, there's a little bit of hallucination that can sometimes go on, so there's still some work to do. But there's already a ton of acceleration in the code space: if you've used Copilot or other similar tools, you can dramatically increase your productivity as a developer. There's a lot of generative AI for images, and there's starting to be a lot of work in video. There was a paper that just came out showing some of the first realistic generated 3D models; think about game design, where you have to create character models. That's the first example of being able to do that with generative AI, which is just incredible. So I'm super excited about it, because it really is going to transform and impact everything. Obviously there's some hype we have to work through, but we'll get there.

Here's our agenda for today. Since this is a WebAssembly conference, we're going to talk about how we can use machine learning inside WebAssembly, and it's going to dovetail with the previous talk if you were here: we're going to use wasi-nn and its specification, so we'll explain a little bit about wasi-nn. Then we're going to describe how you would build your own product that combines a function-as-a-service platform with machine learning. We'll describe the two key changes we decided to make in order to make this possible, named models and process isolation, which is what we're going to use for performance and security. And we'll show you a demo of all of this working; hopefully we'll have time for Q&A at the end.

So, why do we need wasi-nn? I hope you're familiar with WASI as the system interface for WebAssembly modules. The key piece here is performance. Mingqiu and I (Mingqiu, can you say hi?) started on this several years ago, maybe three years ago.
We gave a talk at a CNCF Wasm Day where we showed that without something like wasi-nn you can lose orders of magnitude of performance when performing machine learning inference, but when you do have a way to get outside of the WebAssembly sandbox, you can get that performance back. Why is that the case? Because special hardware is needed to perform machine learning inference quickly. There's full-width SIMD on certain architectures; there are machine-learning-specific features like AMX or VNNI on x86; and of course there are all types of new hardware you might want to use for running machine learning models, like GPUs, TPUs, NPUs, whatever comes next. Not all of that is available to you within WebAssembly: WebAssembly needs to take into account all the different architectures, and it can't add special instructions. That's why something like wasi-nn exists.

It also exists to scale with the ecosystem, which right now is moving quite quickly. There are new models being developed all the time, new operators being developed within models, new tensor types, new model encodings, and if we tied ourselves too closely to a specific model or framework, we wouldn't be able to keep up with how fast the ecosystem moves. So that's why wasi-nn exists: it's an inference API for WebAssembly that uses the machine learning capabilities of your native system, for any machine learning framework, not just a specific one.

Okay, let me give you a brief history lesson. Mingqiu and I kicked wasi-nn off in 2020 with a proposal, and since then we've had several presentations on it: I showed you a slide from Wasm Day in 2021, and Radu talked about it in 2022. More recently, we proposed the named-model extension with several of the different implementers. Here's roughly when support appeared in the different standalone WebAssembly runtimes: Wasmtime has had support since 2020 or 2021, and WasmEdge and WAMR soon followed.

What does it look like? Well, look at this pseudo-Rust. You need to load a model from some bytes, you need to set up an execution context, and then you compute the inference; those yellow calls there are what you need to do to use wasi-nn. Right now it's all defined in WITX, the interface definition language for the old version of WASI. It's a very high-level API: it covers 80% of the use cases, but you can't fine-tune any of the knobs that you might with a more low-level API. And it's made up of various parts: there's the specification of wasi-nn, there are bindings to various languages, and then there are implementations in the engines, like Wasmtime or WAMR.

That has changed with recent work we've done to the spec (that's the link there if you want to contribute). The specification is now defined in WIT, the new language for WASI, and we're trying to make it simpler: we have some issues out there to boil it down, using named models and just a single call, so it should be a little bit simpler to use in the future.
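To make that flow concrete, here's a minimal sketch of the load, context, compute sequence from the guest's point of view, written against the Rust wasi-nn bindings. The OpenVINO encoding, the file names, the input shape, and the output length are assumptions for illustration (a MobileNet-style classifier), not anything mandated by the spec.

```rust
use wasi_nn::{ExecutionTarget, GraphBuilder, GraphEncoding, TensorType};

fn classify(input: &[f32]) -> Vec<f32> {
    // 1. Load: hand the raw model bytes to the host, which compiles them for
    //    whatever hardware it has. (The file names here are hypothetical;
    //    OpenVINO models come as a topology file plus a weights file.)
    let model = std::fs::read("model.xml").expect("read model topology");
    let weights = std::fs::read("model.bin").expect("read model weights");
    let graph = GraphBuilder::new(GraphEncoding::Openvino, ExecutionTarget::CPU)
        .build_from_bytes([&model, &weights])
        .expect("load graph");

    // 2. Set up an execution context and bind the input tensor.
    let mut ctx = graph.init_execution_context().expect("create context");
    ctx.set_input(0, TensorType::F32, &[1, 3, 224, 224], input)
        .expect("set input");

    // 3. Compute the inference and read back the output probabilities.
    ctx.compute().expect("compute");
    let mut output = vec![0f32; 1001];
    ctx.get_output(0, &mut output).expect("get output");
    output
}
```

Everything interesting happens on the host side of those calls: the guest never needs to know whether the backend ran on a CPU, a GPU, or something else.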
Matthew, you're up.

Thank you. As we started thinking about how you build this function-as-a-service product, we asked what the key components are that you need in that recipe, and a couple of broad areas came up: deployment, user experience, performance, and security. We basically went through these, and the solutions we came up with are aimed at tackling the two big ones, performance and security. Performance in particular was addressed at the spec level, and for security we had to make some design choices in Wasmtime as we did this implementation, to make sure that we're not just protecting the infrastructure; we're also protecting other tenants from models.

One thing that is not covered enough is that a lot of these machine learning frameworks are just gigantic RCE packages. PyTorch still has open bugs along the lines of "we can unpickle files and take over your entire machine," so you shouldn't load any untrusted models. That becomes very challenging when you're trying to build a scalable framework where you want people to be able to bring their own models and run them at the edge.

So, Compute@Edge: what is it? It makes it easy to securely deploy, run, and scale globally distributed WebAssembly modules. It's a serverless, request-based architecture, and it runs really close to the user, so you get low latency. That can sometimes matter less if you're running a model that takes a really long time, but for lower-latency use cases like image classification, those models can complete in around a hundred milliseconds or sometimes faster, and with heavy quantization and a bunch of other tricks you can get that down to around nine milliseconds, even for some of the text and LLM use cases.

So how hard is it to add support for wasi-nn? Well, it took a little bit of work, so let's dive into that. These are some of the things we encountered when trying to get this working. The Wasmtime event loop can be sensitive to expensive guest work: if you run an inference request that takes five seconds while Wasmtime is handling a lot of other requests, there can be issues. Models can exceed compute, memory, and module size limits: Compute@Edge, for example, has (I believe) a 128-megabyte Wasm module limit, and some of these models are eight gigabytes. Sometimes they're bigger: as you probably heard earlier, Llama 2 and some of those can be hundreds of gigabytes, and some models are terabytes in size. How do you actually load and run those at the edge? Tuning for machine learning workloads is also different from traditional edge compute: when you're running a machine learning workload with a bunch of requests coming in, your CPU usage and your GPU or TPU usage look very different from regular edge compute workloads. And there's limited sandboxing when directly linking framework libraries in process. As you saw earlier in the slides, the guest loads some bytes and passes them through to the host, and the host passes them into some other library.
Underneath the covers, what it's actually doing is compiling that model into an execution graph that gets loaded onto the hardware, or the CPU, and it's literally passing XML encoded as bytes. Not all of them are XML, but they're XML-like things, which again are not exactly the safest things to handle when you're running directly in process.

And the stateless server model means each request has to reload the model: a request comes in, and you have to go fetch those exact same bytes again, every time, over and over, and recompile them, because there's just no way to cache within the current spec.

What this all boils down to is that we need to decouple the model lifecycle from the request handling. We have to split them apart so that we can manage how long the model lives in memory and where the execution actually happens, and do that in the best place for each particular model and each particular request. So let's dive into how we did that, how we changed the spec to solve the problem of decoupling the models from the actual request handling.

Here's how it currently works with the old version of the spec. You'd have to fetch all the bytes of the model, which, as you know, could be quite large, from a cache, over HTTP, somehow. You'd have to pass those bytes into the load call, which might do some work, like compiling the model for your CPU or GPU or whatever. And then you'd finally get to do your compute call. If you had to do all of this every time an HTTP request came into your function-as-a-service machine, it's just too high a cost, especially to pay for every single request; it's just going to take too long.

So here's the change. Before instantiation, before that WebAssembly module starts running, the machine learning model gets loaded and configured, and it gets given a name. That way, when the HTTP request comes in, you load the model by name: it's already present in the host, and now you can both load it and compute your inference quickly enough to satisfy these requests without overloading your server. That was one big change we made to the specification: named models.
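From the guest's point of view, the named-model flow drops the byte shuffling entirely. As a sketch: in the Rust wasi-nn bindings, loading by name surfaces as build_from_cache; the registered model name, input shape, and output size below are hypothetical.

```rust
use wasi_nn::{ExecutionTarget, GraphBuilder, GraphEncoding, TensorType};

fn infer_by_name(input: &[f32]) -> Vec<f32> {
    // The host loaded, compiled, and registered this model before the module
    // was instantiated; the guest just references it by name, so no model
    // bytes cross the guest/host boundary on the request path.
    // ("image-classifier" is a hypothetical registered name.)
    let graph = GraphBuilder::new(GraphEncoding::Openvino, ExecutionTarget::CPU)
        .build_from_cache("image-classifier")
        .expect("named model should already be registered with the host");

    let mut ctx = graph.init_execution_context().expect("create context");
    ctx.set_input(0, TensorType::F32, &[1, 3, 224, 224], input)
        .expect("set input");
    ctx.compute().expect("compute");

    let mut output = vec![0f32; 1001];
    ctx.get_output(0, &mut output).expect("get output");
    output
}
```

The expensive work (fetching and compiling the model) happens once, outside the request path, and every subsequent request pays only for the inference itself.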
The second big decision we needed to make was how to limit the blast radius from malicious models. With certain model frameworks you can have operators that do all types of crazy things to your system, like open files or do network I/O, and you probably don't want to let that happen in your FaaS environment. So we looked at two different ways of protecting the environment from a malicious model.

The first was a hardware-based approach: an x86 feature called MPK, memory protection keys. It gives you 16 keys to mark pages of memory as protected with a given key, and the idea would be: let's protect all the pages of the runtime from the pages that the model needs. There are downsides, though. It only protects CPU memory pages, so if you want to run on a GPU, we can't use this feature.

So we started looking at another option, which is process-based isolation. If we move this machine learning backend into a separate process, we can reuse all the OS-provided primitives for process isolation. And (you can talk more about the GPU drivers) the drivers already understand that GPU calls from different processes should have a boundary between them, so it sort of solves the GPU problem too. The downside here is that there's a little bit of overhead to communicate between processes.

I'll say something real quick here. One: memory protection keys also create protection spaces within a single process, so when you want to run multiple threads within a single process, you can actually isolate a thread from accessing other threads' memory. And then, on the process-based approach and the GPU drivers: part of this is usually built into the GPU driver itself. When they're copying memory to and from the machine, they build in checks along the lines of "hey, you can't go grab some other process's memory." That was actually one of our concerns originally, because with a DMA-enabled PCIe device you could just start grabbing stuff from memory, but most GPU drivers do sane things: when they generate a host pointer for you, a GPU thread can't go and just grab random memory. So that's another reason we decided that if we use the process boundary, it matches up with what other things are doing. For example, WebGPU also uses process boundaries, so the isolation lines up with what the drivers are doing, which basically makes us aligned with what the broader ecosystem is doing in that space.

So let me give you a quick overview of what we're going to show you today and how it all glues together. One of the big key pieces is that we want to allow models to run where it makes the most sense. If you have a small model that's only going to use the CPU, you can execute it locally: don't pay for an expensive network call to move it off to another box; just run it locally and return quickly. If you have a larger model, say one of these 400-gigabyte large language models, the tensor going in is tiny and the tensors that come back are tiny, so maybe you want a special box dedicated to that somewhere else, and you ship the tensor over to that box to get the results.

Another thing you can do with this approach is manage the lifecycle of models outside of Wasmtime, which means we can now control the caching behavior. At Fastly, one of the best features we have is a storage and caching layer that's basically globally distributed and just works, and we can tie into where the models are stored and how long they live in memory to optimize that automatically for customers.
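None of that placement logic is part of the wasi-nn spec; it's host policy. Purely as a hypothetical sketch of the kind of decision a host backend could make per model (the names, threshold, and endpoint are all made up for illustration, not Wasmtime or Compute@Edge APIs):

```rust
/// Where a registered model's inference should execute.
enum Placement {
    /// Small CPU-only model: run in a local, isolated worker process.
    LocalProcess,
    /// Large or GPU-bound model: forward the (small) input tensor to a
    /// dedicated inference box and get the (small) output tensor back.
    Remote { endpoint: String },
}

struct RegisteredModel {
    name: String,
    size_bytes: u64,
    needs_gpu: bool,
}

fn place(model: &RegisteredModel) -> Placement {
    // Illustrative threshold: anything too big to keep resident on an edge
    // node, or anything needing a GPU, goes to a dedicated inference server.
    const LOCAL_LIMIT: u64 = 512 * 1024 * 1024; // 512 MiB, made up
    if model.size_bytes > LOCAL_LIMIT || model.needs_gpu {
        Placement::Remote {
            endpoint: format!("https://inference.example.com/models/{}", model.name),
        }
    } else {
        Placement::LocalProcess
    }
}
```

The guest code is identical in both cases: it names the model and computes, and the host decides where that computation actually lands.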
And then there are the async host APIs. Is that actually another backend change? Yes, though this was a change not to the spec but to the actual host APIs inside Wasmtime, making them async so that you can continue handling requests without issues. As incoming requests arrive, you keep serving them, even if you're off doing inference on another box for 20 seconds.

And owning the model out of process makes it way easier to sandbox. If you run the model on another box, you can focus on sandboxing that process there, instead of trying to introduce additional machinery on your Wasm host to protect it and isolate what's running inside.

Oh, one thing I kind of skipped over a little bit was the KServe backend. KServe is a Kubernetes inference protocol, and it's already an established standard. What we're actually contributing to Wasmtime is a KServe backend, which means that any inference server that speaks KServe can now be invoked from Wasmtime to make these calls. I think that's going to be super useful for folks trying to run this. Right now it's HTTP because, long story, gRPC is not in Wasmtime yet; if that ever lands, we'll look at implementing it, but for now we implement the HTTP protocol.

All right, now it's time for the demo. Okay, we'll switch over here and open a whole new tab. You're sure about running a live demo? I might; I have the video in case it doesn't run. Of course, we literally tested it right before this. Good thing we made the video, though. We'll give it one more shot. Okay, we want to do it live. Do you want to switch? I think it might be your internet. Really? Mine just tested fine, and I have internet. Okay, go for it. Hopefully it doesn't do the split screen; I guess you can just drag it. Yes. Okay.

So I'm going to start off with a classification demo. This is sort of your classical example of a smaller model. It's actually using the OpenVINO backend, because one of the tests inside wasi-nn uses MobileNet v2, a very small model intended for mobile devices and classification: very small, very fast. I'm going to try a killer whale picture, so I'll hit upload file... and there it is: 96% probability that the identification is a killer whale. What's actually happening here is that the image is uploaded to a Compute@Edge worker, and that worker takes the image, extracts it into a tensor, calls the wasi-nn code, gets back a 1000-element vector of probabilities, sorts it, reattaches the labels, and returns the top N specified there. All of that is programmatic WebAssembly.
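That post-processing step is simple enough to sketch. Something like the following could do the sort-and-label work in the worker (the label list and the top-N parameter are assumptions, not the demo's actual code):

```rust
/// Pair each class probability with its label, then return the top N.
/// A sketch of the worker's post-processing; `labels` would be the
/// label list (e.g., the ImageNet classes) shipped alongside the model.
fn top_n(probabilities: &[f32], labels: &[&str], n: usize) -> Vec<(String, f32)> {
    let mut scored: Vec<(String, f32)> = probabilities
        .iter()
        .zip(labels)
        .map(|(p, label)| (label.to_string(), *p))
        .collect();
    // Sort descending by probability; NaNs are not expected in model output.
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    scored.truncate(n);
    scored
}

// Example: top_n(&output, &labels, 3) might yield
// [("killer whale", 0.96), ...] for the demo image above.
```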
Next, the prompt I've been using out on Compute@Edge. I'm going to do "heroic kittens in the style of Naruto." This is going to take about five or six seconds to execute. What it's actually doing: this is Stable Diffusion, a standard model, Stable Diffusion 1.0. It takes the prompt, converts it to a byte tensor, and sends it over. It goes through KServe, and on the backend it's basically PyTorch and ONNX; it's actually three different models glued together. With that KServe backend, it's now possible to do more complicated things like Stable Diffusion, which has a scheduler and all this other machinery that has to happen. If anybody has any interesting prompts they want to try... yeah, so that's the demo, and this is basically working right now.

Oh, you have the next slide? Switch, switch again. But if anybody wants to try more, I'm happy to stick around if you want to try more prompts. This is going to be a really exciting slide. Very exciting. If you want to try this out on Compute@Edge, please drop me an email; we are looking for folks to help us test this out and improve it as we're building out this functionality. And if you're interested in participating and working on the spec, you can contact Andrew, or you can just go open issues on the GitHub site. It's not posted on here yet, but we'll also be posting the source code for the Compute@Edge portion of this, thanks to Tyler, who said, hey, I can open source the demo. So thank you so much.

Any questions on anything we've covered so far? Thank you again for coming, and we'll be available to talk afterwards if you want to run some more prompts or look through any of the details.