Now that we know the state of the world, pun intended, we can be inspired to contribute to all parts of the project. So thank you, thank you so much, Bailey, for your talk. Our next speaker is Radu Matei. He is the co-founder and CTO of Fermyon. He's actually the person who introduced me to WebAssembly one fine day at a conference. I think we were at KubeCon, sitting outside on the floor, and he was whispering in my ear like, "Hey Michelle, you've got to check out this WebAssembly thing. It's really cool." And here we are today. He is also an open source software engineer, passionate about WebAssembly, distributed systems, and artificial intelligence. He has way too many mechanical keyboards for his health, and he worked at Microsoft Azure on the DeisLabs R&D team, where he was a core maintainer for multiple server-side WebAssembly and distributed systems projects. Today he's going to talk to us about how WebAssembly can power the new wave of serverless AI. Please welcome Radu.

Hello, internet friends and real-world friends. I'm Radu, and over the next few minutes we'll explore this new world of serverless AI and how WebAssembly can have a significant impact on how we run it. When we started Fermyon, our goal was to build a new kind of computing platform to be the ideal foundation for this new wave of full-stack serverless applications. Very early on, even before Fermyon existed, we picked WebAssembly as the underlying runtime, and in the process we built Spin, the open source developer tool for serverless WebAssembly integrated into the cloud native ecosystem, and we built Fermyon Cloud to run this kind of serverless application. Over the last few months we've been busy adding a new kind of data service to Fermyon Cloud to help developers build full-stack applications, things like built-in databases and key-value storage. But there was one type of data service that kept coming up in conversations with users and customers, and that was AI inferencing. People are really interested in seeing how they can inject AI capabilities into their own applications, and in particular this new class of generative AI. The reality is that artificial intelligence is slowly starting to trickle down into what it means to build a full-stack application, and more and more application developers, who are not AI experts but traditional application developers, are now tasked with integrating some artificial intelligence into their software. And this is where they hit one of the biggest issues with inferencing infrastructure: it's really difficult to set up on your own. You either have to pay for an always-running GPU, which can be prohibitively expensive in most cases, or you suffer from cold starts that can take anywhere from 30 seconds to a couple of minutes. And if you dive into this problem, you can trace most of the cold start problems in AI inferencing to using containers as the deployment and execution unit. Before the first line of your application is ever executed, you have to fetch a multi-gigabyte container image that contains things like a CUDA runtime, Python, PyTorch, and then your application; then find a way to move a multi-gigabyte machine learning model close to where that container is, attach a GPU to it, and only then start executing your application code.
By using WebAssembly components, in some scenarios you can compile your inferencing application to a few megabytes that can be fetched over the network just in time, in a few milliseconds, and then started in microseconds. If you come by the Fermyon booth, you will see an example of an inferencing application that is compiled into a 2.3 megabyte Wasm component and then dynamically started just in time. This means we can truly achieve one of the early goals we've had for WebAssembly, which is to move compute close to data. But we can take this one step further: move compute close to data, close to hardware, because one of the most expensive pieces of our infrastructure is the GPU. By moving the WebAssembly module close to where that multi-gigabyte machine learning model lives, close to where you have a GPU available, you can achieve significantly better efficiency.

The other thing here is that instead of assigning a GPU to an always-running container, which might or might not be serving requests, you can assign a fraction of a GPU to a WebAssembly component just in time: fetch the component from the network, execute the inferencing operation, which can take a few seconds, shut the component down, and then assign that fraction of a GPU to a new component that you again execute just in time for its request. This means your incoming requests don't have to wait for a container to start and then for a GPU to be attached to it, which means you can achieve significantly higher efficiency and utilization for your GPU infrastructure. Even more importantly, you can run the same application, the same inferencing workload, regardless of the architecture you're targeting: the same application can run on an M1 chip, on a single CPU, on a CUDA GPU, or even on an entire cluster. This is particularly interesting for the developer experience, which suddenly doesn't have to change between running an application locally and deploying it somewhere, and it's also interesting if you're running infrastructure at scale, where you can move components onto whatever hardware is available very late in the process of executing a request.

So what does it look like? What do you actually write when you're building an inferencing component that runs in WebAssembly? There's a standard for this called wasi-nn, a model loader API that lets you load a machine learning model from WebAssembly onto the host and have it executed with hardware acceleration on the host, or you can use a higher-level interface that can then be defined in terms of wasi-nn. This is a complete WebAssembly component, written in JavaScript, that you can run today in Fermyon Spin. It takes the approach of a slightly higher-level interface that the host implements to talk to a CUDA GPU, an M1 machine, or any other hardware acceleration device. And this is the approach we're taking in Fermyon Cloud as well. This week we introduced the Fermyon Serverless AI beta, which provides the building blocks for integrating these AI capabilities into serverless applications. Using this service you can execute inferencing against large language models like Llama 2 and Code Llama, generate embeddings, store them in databases or key-value storage, and retrieve them later.
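[The slide shown here isn't reproduced in the transcript. The following is a minimal sketch of the kind of component being described, assuming the Spin JavaScript/TypeScript SDK as it looked around this beta; the names `@fermyon/spin-sdk`, `Llm.infer`, `InferencingModels.Llama2Chat`, and the handler signature are taken from that SDK as best recalled and may differ in current releases.]

```typescript
// Hedged sketch of a Spin HTTP component that runs LLM inference.
// Assumes the @fermyon/spin-sdk API from the Serverless AI beta era.
import { HandleRequest, HttpRequest, HttpResponse, Llm, InferencingModels } from "@fermyon/spin-sdk"

const decoder = new TextDecoder()

export const handleRequest: HandleRequest = async function (request: HttpRequest): Promise<HttpResponse> {
  // Use the request body as the prompt for the Llama 2 chat model.
  const prompt = decoder.decode(request.body)

  // Ask the host to run inference. On Fermyon Cloud this is executed on a
  // GPU assigned just in time; locally it can fall back to the CPU.
  const result = Llm.infer(InferencingModels.Llama2Chat, prompt, { maxTokens: 128 })

  return {
    status: 200,
    headers: { "content-type": "text/plain" },
    body: result.text,
  }
}
```

[The embeddings workflow mentioned above follows the same shape: the SDK exposes an embeddings call (roughly `Llm.generateEmbeddings`) whose vectors you can then persist in Spin's key-value or database services, again exact names are assumptions, not verbatim from the talk.]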
You can also run all of these examples with the open source Fermyon Spin project, and you can come by our booth and sign up for the preview. We're really excited to see what you want to build with this. Enjoy the rest of your conference.