Hi, yeah. So I'm Aaron. I work on the cloud native research management team, as does Atanas. I've also been working on this with Andrew Brown, who's probably pretty well known here already and who's working in the SDG group.

So, just the agenda: we're going to quickly cover the motivation for why you'd actually want GPU offload in the Wasm ecosystem. We'll quickly cover some of the potential offload APIs; the two main ones we'll focus on are WASI-NN and wasi-webgpu. Then we'll cover some of the AI frameworks with offload backends, and afterwards we'll look at the outlook for the future.

So first, the motivation: why would you want GPU offload in Wasm? The two most viable options would be something like AI as a service or rendering as a service running on the host, so not through the browser or anything like that. We'll get into some of the nitty-gritty of that a little later with WASI-NN. There are also multiple other use cases for GPU offload, but at the moment, with AI being so popular, these are the two big ones.

There are already quite a lot of native APIs for regular GPU offload. For visual libraries you have things like OpenGL, Vulkan, and DirectX 12. In the scientific world there's OpenCL, SYCL, CUDA, and OpenMP, which has its target pragma. And in AI you have things like Triton and TVM; you have AI compilers. So in principle it could be possible to implement Wasm support in all of these, but it requires a lot of work. Many of these are very old legacy libraries with a lot of support behind them, but they're also quite complicated to integrate Wasm into because of that. We actually had some success with OpenMP, just getting SIMD instructions working, but even then there was quite a lot of work involved.

In terms of currently existing offload APIs for Wasm, the two main ones are WASI-NN, which we'll get into right after this, and WebGPU, which is pretty common and probably well known here already. There are also a lot of other offload APIs in the ecosystem.

First we're just going to cover some WebAssembly environments and their use cases. Something we'll see a bit here is standalone, i.e. running your Wasm directly on the host, typically through something like Wasmtime or any other Wasm runtime. This has use cases for standalone desktop apps and for cloud applications, especially in resource-constrained environments where you want to get as much performance as you can out of your application. Another one, which is probably the more commonly known one, is running through the browser. There's also running through Node.js, which is a bit more of a niche use case nowadays, but it still has its uses.

Next we're going to go into Wasm offload APIs. WASI-NN is a framework for importing pre-existing, pre-trained AI models, compiling your application into a Wasm file, and passing the model through a backend. In the example I'll show in a second we use OpenVINO as the backend, but you can also use something like TensorFlow or PyTorch; there's a lot of support for it.

So why would you want to use Wasm for machine learning? Well, machine learning is deployed on a lot of devices and a lot of OSes, so you really want to have hardware support across environments.
Because of that, you want your machine learning applications to be as portable as possible. You're probably thinking: if I want to do that, I could just use a container; that's an easy solution. But in machine learning applications you want as much performance as possible, and the time spent spinning up a container just isn't comparable to something like Wasm.

As well, why would you want to use WASI-NN over just standard pure Wasm? Well, WASI-NN just has much better performance, with support for things like AVX-512. In machine learning you obviously want support for hardware features like SIMD and AVX-512 on your CPU or, more preferably, if you have access to them, something like a GPU or a TPU for AI workloads.

The WASI-NN API we use is in Rust and it's very user-friendly. It simply consists of creating a graph object, into which you can import your pre-existing models. It's then simply a case of setting your input, running a compute function, and receiving your output tensor (there's a condensed sketch of this flow after the demo below).

So I'm going to quickly switch over to a demo. This is courtesy of the WasmEdge WASI-NN examples repo; we have a link to it at the end. It's a very useful repo with a lot of very user-friendly examples. Here I have a Jupyter notebook open. This is running on a remote server, and the actual application is written in Rust, but I'll just quickly run cargo build to build the Rust and generate an output .wasm file. In the code here it's a very simple API: you simply set things like your input and your pre-existing model, in both XML and BIN formats. Next it's just a case of loading those into a graph. After that, you simply set your input, run your compute function, and you get an output. In this example we just get an output tensor that we write out to a file, but if you wanted, you could use it for something else, like passing it to another model.

So here, in WasmEdge, we're going to pass the output Wasm file that we got, and we're also going to pass in the pre-existing model and an example image. Here we just have an image of a road, and we've imported a pre-existing model for road segmentation, which is used for things like autonomous driving. So here we have the original image, and next we're going to visualize the output tensor that we got from running WASI-NN. Here you can see an actual image with the segments of the road clearly marked. Next we're going to do a small bit of data visualization: we can form an overlay from that output image. And lastly, we have our base image, the output tensor that we got from WASI-NN, and, after a bit of data manipulation, a masked image, which is something more akin to what you might see in the camera of an autonomous vehicle.
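As a reference, here's a minimal sketch of that inference cycle against the wasi-nn Rust bindings. The model paths, input shape, and output size are illustrative placeholders for a segmentation model like this one, and exact names can differ between versions of the bindings:

```rust
// Minimal WASI-NN inference sketch (wasi-nn Rust bindings; names may vary by version).
// Model paths, input shape, and output length are illustrative placeholders.
use wasi_nn::{ExecutionTarget, GraphBuilder, GraphEncoding, TensorType};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load a pre-trained OpenVINO model (XML topology + BIN weights) into a graph.
    let graph = GraphBuilder::new(GraphEncoding::Openvino, ExecutionTarget::CPU)
        .build_from_files(["model.xml", "model.bin"])?;

    // Create an execution context, bind the input tensor, and run inference.
    let mut ctx = graph.init_execution_context()?;
    let input: Vec<f32> = vec![0.0; 3 * 512 * 512]; // preprocessed image data
    ctx.set_input(0, TensorType::F32, &[1, 3, 512, 512], &input)?;
    ctx.compute()?;

    // Read back the output tensor (e.g. per-pixel segmentation classes).
    let mut output = vec![0f32; 512 * 512];
    let bytes_written = ctx.get_output(0, &mut output)?;
    println!("got {bytes_written} bytes of output");
    Ok(())
}
```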
And now I'm going to pass it over to my colleague, Atanas, to talk about WebGPU.

Thank you. How many of you are familiar with WebGPU? Does this ring a bell? How many of you did some graphics programming in the past? Right. WebGPU is actually quite an interesting API coming from the W3C. They looked at the existing graphics APIs, like Vulkan and DirectX, and identified that the web community needs something more generic, which you can run platform-independently on many devices, and something simple: if you look into modern APIs like Vulkan, they are very complex.

If you're familiar with the history, there was also a lot of work in the past on WebGL, the predecessor of WebGPU, where there were performance issues. This led to the development of a new standard. WebGPU is an official standard from the W3C, and a lot of browsers today support it; if you have Chrome or some other browser, it will usually support it. But we also want that in the cloud or in the data center: if you have GPUs available remotely, we want to use those.

Why is WebGPU so interesting? This diagram shows it quite nicely. Compared to the old WebGL standard, we were lacking the capability to do compute shaders, and compute shaders are very important for people coming from the data science community: a lot of frameworks can express their computation as compute shaders. You can see the differences from WebGL in the blue and red colors in this diagram. There's also some new stuff: a lot more buffer control, better scattering of work, and better capabilities to control how local memory on GPUs is utilized, and so on. And performance looks far better than WebGL. So we are definitely interested in using it in the WebAssembly world.

So we recently started a new project in the community, the wasi-webgpu project. It's based on the component model, which you've heard about already, and we want to provide a WASI interface for WebGPU. Our target is a very close mapping to the native interface, so that ported applications have minimal work to do. We want to support rendering, compute passes, and memory-mapped buffers. We started prototyping and writing a first runtime which supports exactly this: compute passes, memory-mapped buffers, dispatching. We started with a backend called wgpu, a Rust implementation of WebGPU.

Next I want to go through a single example. We managed to implement a simple compute pass through the wasi-webgpu interface. For people not familiar with what a compute pass is, it's displayed on this diagram: you usually have a pipeline, which is associated with a shader, and in the compute pass you set up the relationship, the so-called bind group, between your buffer (the buffer is basically the user's input) and your shader; the shader does the computation. That's what we're trying to demonstrate today.

Let's look at a small comparison: how does it look to write a wgpu program versus a WebAssembly wasi-webgpu program? They are very close. You usually go through a very well-defined cycle. You need to request a device: in wgpu you have a request-device function where you pass some options, and a very similar function exists in the wasi-webgpu API. One thing to point out: we heard about the WIT interface format for the component model, and all of this is described through component model interfaces. We have similar functions for creating buffers; that's the next step when you're writing an application for graphics or for compute shaders, since you need a buffer where you'll store data or read data from the user. Then you set up a pipeline: very similar on the left in wgpu and on the right in the wasi-webgpu application. You see again that we have a programmable stage, with more or less the same fields; it's very easy to navigate if you start from wgpu and move to wasi-webgpu.

Then we have the bind group, which ties the shader to the actual buffer. Again, very familiar, a little bit of syntactic sugar, but very close to the original code. And to dispatch work, you set the pipeline and tell your device how many workgroups you want to spawn. So this is the classical approach if you want to program a compute pass.
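As a reference point for the wgpu side of that comparison, here's a minimal compute-pass sketch against the Rust wgpu crate (written against a wgpu 0.19-era API; struct fields shift between releases, so treat it as illustrative). WGSL_SOURCE stands for a WGSL shader string; an example is sketched after the demo below.

```rust
// Minimal wgpu compute-pass sketch (wgpu 0.19-era API; fields vary by release).
// Uses the `pollster` crate to block on wgpu's async setup calls.
const WGSL_SOURCE: &str = "..."; // a WGSL compute shader; example sketched below

fn main() {
    // Request an adapter and a device/queue pair.
    let instance = wgpu::Instance::default();
    let adapter =
        pollster::block_on(instance.request_adapter(&wgpu::RequestAdapterOptions::default()))
            .expect("no suitable GPU adapter");
    let (device, queue) =
        pollster::block_on(adapter.request_device(&wgpu::DeviceDescriptor::default(), None))
            .expect("failed to acquire device");

    // Compile the compute shader.
    let shader = device.create_shader_module(wgpu::ShaderModuleDescriptor {
        label: Some("vector-add"),
        source: wgpu::ShaderSource::Wgsl(WGSL_SOURCE.into()),
    });

    // Create a storage buffer for the data the shader will work on.
    let buffer = device.create_buffer(&wgpu::BufferDescriptor {
        label: Some("data"),
        size: 2_000_000 * std::mem::size_of::<f32>() as u64,
        usage: wgpu::BufferUsages::STORAGE | wgpu::BufferUsages::COPY_SRC,
        mapped_at_creation: false,
    });

    // Set up the pipeline around the shader's entry point.
    let pipeline = device.create_compute_pipeline(&wgpu::ComputePipelineDescriptor {
        label: None,
        layout: None, // derive the bind group layout from the shader
        module: &shader,
        entry_point: "main",
    });

    // The bind group ties the buffer to the shader's binding slot
    // (the vector-add demo would bind its other buffers the same way).
    let bind_group = device.create_bind_group(&wgpu::BindGroupDescriptor {
        label: None,
        layout: &pipeline.get_bind_group_layout(0),
        entries: &[wgpu::BindGroupEntry {
            binding: 0,
            resource: buffer.as_entire_binding(),
        }],
    });

    // Record the compute pass and dispatch: two million elements,
    // 256 threads per workgroup.
    let mut encoder =
        device.create_command_encoder(&wgpu::CommandEncoderDescriptor::default());
    {
        let mut pass = encoder.begin_compute_pass(&wgpu::ComputePassDescriptor::default());
        pass.set_pipeline(&pipeline);
        pass.set_bind_group(0, &bind_group, &[]);
        pass.dispatch_workgroups((2_000_000 + 255) / 256, 1, 1);
    }
    queue.submit(Some(encoder.finish()));
}
```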
So this is what I'm going to demonstrate shortly in a demo now. First, just to point out what's on the screen: I'm running on my local system, and I've opened the wgpu website just to show you that the browser can actually run WebGPU, and, I wanted to point out, it's using the local GPU on my laptop. Now we want to avoid that: we want to run a WASI component which will run remotely. So I'll just close that. Here I'm on the remote server, where I have a top-style monitor for the GPUs which will show some utilization, and I'll start a small application, which is my wasi-webgpu runtime, loading two components: a vector-addition component and a matrix-multiplication component. I start those, they open up some sort of server for me, and this should be the server.

The first application I'm going to demonstrate is vector addition. You see the shader, a compute shader; for WebGPU you usually write those in WGSL, very close to the normal shaders you have in graphics (an illustrative WGSL sketch appears after this demo). Here we'll do vector addition for a vector with relatively a lot of elements, two million, and we set a workgroup size of 256 threads. So we execute that, doing one iteration with just one addition. Right, you saw a little bit of load, but not too much, and my local GPU should remain untouched. The result is valid; I do a small comparison against the CPU implementation. Now let's pump up the iterations. Let's do a thousand, make it more interesting. We should see a little bit more happening. You see my remote GPU got something to compute; still not too much. We can play with that further; we can also adjust the workgroup size.

The other example is matrix multiplication. I have 16-by-16 matrices, multiplied with a workgroup size of 4 by 4. Right, so again, normal compute shader code. I execute that; it's really quick. A little bit of load on the integrated graphics: the remote machine which I'm running on the server side also has integrated graphics, basically an Intel GPU. So you see it's really easy to use, and this is a first success story for wasi-webgpu. Right, this was the same application as displayed here.
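Since the shader source itself isn't reproduced here, the following is an illustrative stand-in for what a WGSL vector-add compute shader with the demo's workgroup size of 256 might look like, embedded as the Rust string constant the earlier sketch assumed:

```rust
// Illustrative WGSL vector-add shader (not the demo's exact source), embedded as
// the Rust string constant assumed by the earlier compute-pass sketch.
const WGSL_SOURCE: &str = r#"
@group(0) @binding(0) var<storage, read> a: array<f32>;
@group(0) @binding(1) var<storage, read> b: array<f32>;
@group(0) @binding(2) var<storage, read_write> c: array<f32>;

@compute @workgroup_size(256)
fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
    let i = gid.x;
    // Guard against the final partial workgroup when the element count
    // (e.g. two million) is not a multiple of the workgroup size.
    if (i < arrayLength(&c)) {
        c[i] = a[i] + b[i];
    }
}
"#;
```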
Okay, that's nice, and we saw some nice pictures, but maybe it's also useful for AI. Why can this be interesting? We covered the WASI-NN side, where we have ONNX Runtime support, OpenVINO support, TensorFlow. With WebGPU it's also quite promising: there are a lot of frameworks which are already enabled for WebGPU, for example TensorFlow.js and ONNX Runtime. What does "enabled for WebGPU" mean? They provide compute shaders, similar to what I just executed, not for matrix multiplication but for the actual AI workloads. And to give you an example of something more challenging, there is a framework supporting WebGPU called Apache TVM. Apache TVM is a graph compiler: you provide your graph, and Apache TVM takes this AI representation of the graph and tries to map it to your device.

This is a very difficult problem in terms of optimization: if you're trying to develop a component which is platform-independent, there are a lot of combinations for how you can optimize the code. Apache TVM implements a sort of stochastic search, generating optimized kernels for the dedicated device. So that's one potential benefit if you take WebGPU: in the future you might be able to run Apache TVM as a WASI module.

So, a little bit of outlook, and then we'll finish with questions. There were some other APIs available in the native community, as we heard, which we did not evaluate for suitability for GPU offload. One example is OpenMP offload. This comes naturally from the C/C++ community, and is also usable from Fortran: you have pragmas, you can parallelize loops, and usually you can also offload the loop computation onto the GPU. This might have potential drawbacks, because there are a lot of pointers flying around, so security might be impacted if we look into enabling that kind of approach. The other very interesting project, currently being discussed in the W3C, is the WebNN interface: a web interface for neural networks. WebGPU is closer to actual graphics, but WebNN is really an interface that, looking to the future, is designed to map better to neural networks.

To summarize the talk: these are our sources, so you can find all the details about WASI-NN and the wasi-webgpu project if you look into the references of our presentation. We will also push the examples to the wasi-webgpu project as a branch so that you can play with them. Please reach out to us; we're also at the Intel booth, where we'll be showing this technology. And there are more sources if you're really hungry for information; there is a lot out there. Thank you.

That was a phenomenal talk; I think WebGPU is pretty hot. How about questions? I knew you'd have one. You've got hungry eyes over here.

You had a link on a couple of screenshots to Mendy Berger's proposal for the wasi-webgpu interface. Is there any interest from the Intel side to contribute to that, to add some effort to it? Because I know it's a part of the code base that's kind of in need of...

I'm personally working with Mendy on that; me and Aaron are working together with Mendy to contribute to wasi-webgpu. We'll start by trying to merge our compute pipeline, basically the compute pass coverage. But yeah, the whole wasi-webgpu effort is definitely very interesting for Intel.

That's really good, thanks. Are there questions around the room? If not, I've got a couple for you. I really love how your demo started with a standard that was proposed for browsers and the web, but most of your demos were general-purpose. Does WebGPU have real potential to be this sort of universal ML platform or API that people can build on top of?

Yeah, I think it has big potential, because, as we showed in one of the pictures, WebGPU has the benefit that it's platform-independent. If you run it on a server with, say, Windows, it can detect that you have DirectX 12.
If you are running on a Mac, it will detect another kind of API, another backend. I see that as a big benefit: in the ecosystem you usually have very heterogeneous platforms, and WebGPU definitely opens the door to deploying your GPU applications on heterogeneous platforms with decent performance. Also, performance has been shown to be better than WebGL, and the integration of compute shader capabilities is really great for communities outside of graphics, like the AI world; I think they are really interested in using it. There are also further data formats coming: graphics usually computes most things in 32-bit floating point, but WebGPU is being extended with 16-bit support, and most probably more will come, which will make it more widely usable.

Thank you so much. Any other questions? Please join me in a huge round of applause.