Hello, very excited to be here. Last week was the Wasm I/O conference, so we're fresh out of the Wasm deep end, and now we're excited to get to talk about WebAssembly and AI. Wasm, by the way, is a shortened version of WebAssembly. It is not an acronym; it's the file extension that was originally used for WebAssembly files. I'm Matt Butcher, this is Radu; we're the CEO and CTO. About 10 of us left Microsoft a few years ago to start Fermyon because we believed very strongly that WebAssembly held the key to building the next wave of cloud computing. We've been very excited because one of the themes we noticed as we did this was that WebAssembly is excellent for compute workloads, and that means not just CPU workloads but also GPU workloads.

The talk is titled something about serverless AI, so we should start with a quick definition of a couple of terms. Serverless is the first one. This term is much maligned, for the correct reason that it has been used to mean all kinds of things, so I'm gonna give you a very succinct definition of how we at Fermyon use the term serverless. We are talking about a software development pattern in which the developer doesn't write a server, hence it's serverless; instead they write just event handlers or request handlers. The code receives a request when it starts up, it's responsible for handling that request and returning a response, and then it shuts down. So when you're scaling up or down, you're creating multiple instances of the same application. There's no software server, no socket server, anything like that, thus serverless.

The next term to define would be WebAssembly. What is WebAssembly? Well, the really boring answer is that WebAssembly is a bytecode format, along with a specification that explains how to interpret the bytecodes. When you hear those terms, your mind should probably go to something like Java or .NET, both of which compile source code to an intermediate bytecode format that is then executed on a variety of native hosts. But WebAssembly, first of all, had about 20 years of research from those other communities to build upon, and second of all, was designed for a very different use case. What was that use case? Running code in the web browser. You might have heard that Figma makes frequent use of WebAssembly. They write code in C++, compile it to WebAssembly, and then their JavaScript can call in and out of that WebAssembly. Why? Because they could do better vector calculations and things like that in C++ code than they could achieve in JavaScript.

Our view, as we first ran across WebAssembly, was that it has a lot of potential not on the browser side, but on the cloud side of things, on the server side of things. Why? Well, there were four characteristics we were looking for when we were trying to figure out what a next-generation cloud runtime would look like. The first one is that it has to have a really good security sandbox model. You don't want your code to be able to escape a security sandbox and do dastardly things to other code running on the system or to the host operating system. So that's table stakes. It turns out the web browser has a similar security sandboxing model, and the WebAssembly model actually has a stronger version of the security sandbox than even JavaScript does.
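As a minimal sketch of that handler-only pattern, not tied to any particular SDK (the types and function name here are purely illustrative), a serverless application in TypeScript might look something like this:

```typescript
// Purely illustrative types: a serverless app is just a function from
// request to response. There is no server loop and no socket-listening code.
interface Request {
  method: string;
  uri: string;
  body?: Uint8Array;
}

interface Response {
  status: number;
  headers?: Record<string, string>;
  body?: string;
}

// The host delivers one request, the handler returns one response, and the
// instance shuts down; scaling just means running more copies of this module.
export async function handleRequest(req: Request): Promise<Response> {
  return {
    status: 200,
    headers: { "content-type": "text/plain" },
    body: `Handled ${req.method} ${req.uri}`,
  };
}
```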
So we knew right away we were finding a technology that was going to be amenable to plucking out of the browser and plugging in on the server side. But the other three characteristics are what really got us excited. Cross-platform, cross-architecture: you compile once for WebAssembly. It doesn't matter if you built it on Windows on Intel; you can run it on Linux with ARM. You can run it natively on Mac or a variety of other operating systems and architectures. That means that, particularly when you're writing serverless-style workloads, you can drop the workload on whatever the most economical environment is for you to execute, and you don't have to have your developers trained up in how to write code specifically for that target environment. It really is the next-generation version of the compile-once, run-anywhere mantra, but again, with this security sandbox.

The other thing we were really looking for, as we tried to figure out what the next cloud computing format would be, was fast startup time and execution. If you think about what the history of cloud computing has looked like, it's kind of gone like this. We started with the pre-cloud world where we managed everything from the metal up, with a single instance of the entire operating system running on one piece of metal. Virtual machines changed that one-to-one relationship and allowed us to run many operating systems on top of the same piece of hardware. But in order to do this, we had to package up the entire operating system, from kernel and drivers all the way up through all the libraries and the file system, with our code finally dropped on top. Virtual machines have rightly proven to be the powerhouse class of cloud computing. They do everything. They're sitting under all of these cloud services, from Kubernetes to hosted databases, even Lambda functions. But they are slow to start up, and they are a little bit difficult to manage because they're so big. So along came containers to solve this problem by offering an environment in which long-running individual server processes could be packaged up into what you might think of as a slice of the operating system: just the supporting files it needs, no kernel, no drivers. And that's been great for any kind of long-running server.

But what we were looking for was something that would handle the kind of serverless workload we were talking about. Can we find a runtime that makes it easy to fire an event at something and have it start up cold, execute to completion, and shut down, not in minutes like a virtual machine or seconds like a container, but in milliseconds? That's what led us to WebAssembly. It turns out, by the way, that WebAssembly cold starts in under a millisecond. Spin, which we'll talk about here on this next slide, has an average cold start time of about 0.51 milliseconds, so a little over half a millisecond. That's the kind of speed you want if you're gonna get a request, spin something up, and then serve a response back. The cold start time should be absolutely as minimal as you can possibly get. And that's still running in a secure sandbox.

So we built a tool called Spin. Spin is a developer framework and tool for building serverless applications that are then compiled into WebAssembly and can run in a wide variety of environments. Spin's all open source. You can go to, well, I think we've got a link right there.
You can head to that link and download it. We'll have links again at the end of the presentation, so you don't have to capture it now. But Spin is designed to make it really easy for a developer to get going. Our core user story for the whole first year we worked on Spin, in 2022, was: as a developer, I can go from a blinking cursor to a deployed application in two minutes or less. I think Radu was fond of pointing out that by the end of that year, we had it down to 66 seconds to type the commands you needed to build and deploy your first app.

Now, once you've got an app, you need places to run it. I'm just gonna point to a couple, and I'm really only gonna talk about the Kubernetes one in much detail. We built a hosted platform called Fermyon Cloud that allows people to go and deploy and run their applications very easily. It was the platform on which we first introduced our serverless AI workloads, so Radu will talk a little bit about that technology later. But it's the easiest place to go and try that out if you wanna see what high-performance AI inferencing in this environment looks like. The big one, though: last week we announced SpinKube. SpinKube is a tool set that allows you to run these kinds of serverless workloads inside of any old Kubernetes cluster. So whether it's something large like AKS or something tiny like K3s, you can run these kinds of WebAssembly workloads and achieve really impressive density and scalability. I think, Radu, you're gonna demo some of this a little bit later too. And then we also announced an enterprise version of this called Fermyon Platform for Kubernetes last week. The only real difference between the two is that this one is designed for highly scalable workloads and comes with the GPU scheduling software that we use inside of Fermyon Cloud, which allows you to schedule GPU time in very, very fine-grained increments. So I think you'll talk about that a little bit too, right? So I'm gonna hand it over to Radu. Radu's gonna talk specifically about the AI cases. Thank you.

So Matt talked a little bit about Spin, and just to recap a little part of that: Spin is designed to help you build those kinds of stateless, event-driven APIs that you can basically compile to a function and then deploy wherever you might want. It's open source and permissively licensed; you can find it on GitHub. The idea of Spin is to integrate with the language that you're using, whether that's Rust or JavaScript or Go or Python, help you compile that application, that event handler, to WebAssembly, and then integrate with as much of the cloud native ecosystem as possible. So if you're used to distributing your artifacts using OCI registries, or signing your artifacts, you can continue to do all of that with Spin. And once you've built your application as WebAssembly, you can deploy it and run it on your own server with a systemd service, you can deploy it to Kubernetes, you can run it on Nomad, you can run it in all of the places where you can run WebAssembly, which by now is a lot of places. We introduced Spin back in 2022, but over the last couple of months we've been working on adding a whole bunch of integrations that help you build applications. Going back to WebAssembly: WebAssembly is a sandbox.
So anything that you would like to execute from inside the sandbox has to go through a capability-based system, where you can explicitly allow or deny your guest application, your WebAssembly application, the ability to make any sort of outbound HTTP connections, open a file from a file system, or open a TCP connection. The ability to make those connections has to be tightly managed by the host runtime. Spin makes it very, very easy to make connections to databases and key value stores; it comes built in with a key value store based on SQLite, you can connect to key value stores built on top of other technologies, and it integrates with the variables and secrets providers you might use.

One of those integrations, the first on the slide, is serverless AI. While we were working on Spin, the execution model of Spin and the serverless way of building applications was very appealing to a lot of people, many of whom were trying to build applications that do AI inferencing, as well as embeddings. So over the last couple of months, we've been working on making sure that we can allow those developers to use inferencing capabilities from their serverless applications.

But if you think about how we do inferencing today, there's one problem that relates to cold starts. Matt touched a little bit on function-as-a-service platforms that have cold start problems just packaging your HTTP handlers. The problem is significantly worse if you're talking about cold starting a VM or a container that needs to do inferencing with a CUDA GPU: you're effectively packaging a multi-gigabyte container image that contains CUDA and PyTorch and Python and your application into an artifact that is anywhere between two and 10-ish gigabytes in size, which means that if you try to scale that workload onto a new node, you have to fetch 10 gigabytes from the network before you can run. That makes it practically impossible to scale from zero in some scenarios. So the cold start comes first from having to fetch that artifact. And then, secondly, if you're running in a cluster environment, you have to fetch it, then you have to find an available GPU, then attach that GPU to a container, and then start your software, all of which make it in a lot of cases impossible to do real cold starts. Going back to how things are done today: it makes sense if you have enough workloads, and if you have the flexibility, and if you have the need to run an actual container with CUDA and PyTorch and everything else all the time. If you need that flexibility, you'll probably have to do it. But most of the time, what developers want is to execute an inferencing operation with a specific model.

So we wanted to address that, and we wanted to address it with WebAssembly, which is how Spin works. Using WebAssembly, we can assign a fraction of a GPU to a WebAssembly application just in time, as it needs to execute an operation. And then when it completes, we can reassign that fraction of a GPU to another WebAssembly workload from the queue. What that means is that whenever there's an incoming request, users don't have to wait for a VM to be available. They don't have to wait for a 10 gigabyte container to be fetched and started and have a GPU attached to it; the GPU is already there. And two, it means we can achieve significantly higher utilization efficiency.
In particular, if you look at cluster utilization, an average I read a couple of weeks ago places average utilization in Kubernetes at around 30%, which means there's a lot of idle capacity that we could reuse across different applications, and potentially different tenants as well.

So how can we achieve that? First of all, we have to work out how a guest application built with WebAssembly can access that capability from a host, given that, as we talked about, WebAssembly is a sandbox and doesn't really have access to direct syscalls. And second, how does the host eventually execute that inferencing operation? If you're familiar with WebAssembly, this is a simple WebAssembly interface that defines two operations. For most scenarios, when we talked to developers, what they really wanted was to execute an inferencing operation, in particular, a lot of them, with language models. So I need an infer function that takes my inferencing model, a prompt, and then some parameters. And then there's a second function for generating sentence embeddings, so if you're trying to do embedding searches in vector spaces, there's a function built into Spin that lets you do that with a couple of popular models.

What this gives us is a way to build portable guests. Once you've built a WebAssembly application that uses either of these two APIs, you can run that WebAssembly application on pretty much every single system that supports running WebAssembly. It's portable across operating systems, it's portable across CPU architectures, and it's portable regardless of whether or not you have a hardware acceleration device available. What it gets us to is that you need a specific host to be able to execute it, and that host is the part that is platform-specific: compiled specifically for CUDA, or specifically for Metal if it's macOS, or specifically for a given CPU architecture if you're using multithreaded CPU as your acceleration device. So once you've built that WebAssembly component, you can run it on your own machine, and we'll see that in a second using Spin. You can deploy it to a hosted platform, or you can deploy it in your own Kubernetes cluster using SpinKube, which is the project that we just announced.

So let's have a quick look at actually running one of these applications using Spin. You can find a demo on the Fermyon GitHub. In this case, we're gonna look at a Spin application that's built with TypeScript and compiles to a WebAssembly module that we can then run. It has a tiny configuration file, in this case a TOML file, and it has one entry that tells it, whenever there's an incoming request, what the source of the WebAssembly component to execute is; in this case, it's target/spin-http-js.wasm. Then I mentioned the capability-based system for allowing anything your Spin application wants to execute: in this case, this component is not allowed to make any outbound connections. It is using Spin's built-in key value store, which, when I run on my machine, is gonna be backed by SQLite. And I want this component to use a Llama 2 chat model, which is one of the popular models people wanted to use from serverless-style applications. And so that's all the configuration that I need.
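As a rough sketch of the manifest just described, with the caveat that this is illustrative rather than the exact file from the demo and field names vary a little between Spin versions:

```toml
spin_manifest_version = 2

[application]
name = "sentiment-analysis"
version = "0.1.0"

[[trigger.http]]
route = "/api/..."
component = "sentiment-analysis"

[component.sentiment-analysis]
# The compiled WebAssembly module that handles each request.
source = "target/spin-http-js.wasm"
# No allowed outbound hosts are listed, so the component cannot open
# arbitrary network connections.
# Spin's built-in key value store, backed by SQLite when running locally.
key_value_stores = ["default"]
# Grant the component access to the Llama 2 chat model for inferencing.
ai_models = ["llama2-chat"]

[component.sentiment-analysis.build]
command = "npm run build"
```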
Next, I have a little bit of configuration that tells Spin how to build this application using npm run build. And that's about it. If you have a look at the Spin quickstarts, you'll see that in particular for HTTP requests and responses, which is one of the most popular workloads, a Spin application is entirely comprised of that configuration file and a function that takes in a request and returns a response. Spin supports other kinds of events too; you can do queue triggers and MQTT triggers, and they're all open source and available on GitHub. In this case, we'll look at an HTTP application. This function takes in the request body, parses it as JSON, and then tries to look in the local key value store for the contents. But the important thing to look at here is a function call that eventually ends up calling the infer function that we looked at earlier. There's a WebAssembly interface that defines that functionality, and we're calling it from a JavaScript application. If this were a Rust or Go or Python application, it would look very similar: it would call an inferencing function, and it would eventually be compiled to something that ends up calling that interface.

So to run this, you can do a spin build, which, as I mentioned, calls into npm run build. And then I'm gonna run this application locally. Before I do that, I have a local .spin directory that has a Llama 2 chat model; this is a quantized model for Llama 2 7B chat that I'm gonna use when running this application locally. I have just built the WebAssembly application: it's 2.4 megabytes, and everything is fully built into those 2.4 megabytes. So whenever I deploy this application, that's all that I have to fetch, referenced in an OCI manifest. Keep that in mind when we talk about how this will use a CUDA engine as well. So I have this application, I have the model locally; all I need to do is a spin up. This is gonna start a tiny UI for the application. What I didn't mention is that the application does a very quick sentiment analysis: it formats the Llama 2 chat prompt and tells it, okay, you're a sentiment analysis bot, analyze this question and tell me whether it's neutral, positive, or negative. What I'm going to do here is start Activity Monitor, and this is gonna take a while. What's gonna happen is that 2.4 megabyte WebAssembly component is going to end up calling the infer function, and that will use Metal, because I built Spin, the host, using Metal, specifically compiled for this M1 machine. As you can see on the right side, it's loading the machine learning model, the language model, and trying to do the inferencing, the main point here being that it runs locally. Eventually it will finish and it will tell me that, yep, this sentiment is most likely positive.

Once I've built it, I'm not gonna rebuild the application, I'm not gonna rebuild the binary. I'm gonna start the exact same binary, but this time I'm gonna pass a runtime configuration file that tells Spin that instead of using the Metal-based host implementation of the inferencing engine, it should use an A100 GPU that we're hosting. And so without restarting the application, without rebuilding anything, this is not gonna use any of the resources on my machine, but it's gonna come back significantly faster, because it's an entirely portable WebAssembly component that does inferencing.
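As a rough TypeScript sketch of the handler side of that demo, with the caveat that the SDK names used here (Llm, InferencingModels, the handleRequest signature) are assumptions about the Spin JavaScript SDK of this period rather than the demo's actual source, and that the key value caching step is left out for brevity:

```typescript
// Sketch of a sentiment-analysis handler; SDK names are assumptions, and the
// real demo also caches results in Spin's built-in key value store.
import {
  HandleRequest,
  HttpRequest,
  HttpResponse,
  Llm,
  InferencingModels,
} from "@fermyon/spin-sdk";

const PROMPT = `You are a sentiment analysis bot. Classify the sentence below
as positive, negative, or neutral, and answer with a single word.
Sentence: `;

export const handleRequest: HandleRequest = async function (
  request: HttpRequest
): Promise<HttpResponse> {
  // The request body carries the sentence to classify, as JSON.
  const body = request.body ? new TextDecoder().decode(request.body) : "{}";
  const { sentence = "" } = JSON.parse(body);

  // This call is what reaches the host's infer implementation: llama2-chat on
  // Metal locally, or on a CUDA A100 remotely, with no change to the guest.
  const result = Llm.infer(InferencingModels.Llama2Chat, PROMPT + sentence, {
    maxTokens: 6,
  });

  return {
    status: 200,
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ sentiment: result.text.trim() }),
  };
};
```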
I just swapped the actual implementation from Metal on macOS to an A100 running with CUDA somewhere in the cloud, without rebuilding my app, without changing anything about it. And then the last part: the same application I can deploy to Kubernetes using SpinKube, which is a newly open-sourced project that we announced last week that lets you take applications and run them in your existing environments running Kubernetes. You can reuse all of the cloud native tools that you're used to without changing anything about them. You can continue that pattern of portable guests and specific hosts when targeting AI inferencing workloads, and essentially go back to moving around a two megabyte binary as opposed to having to run a multi-gigabyte image. Of course, if you need the flexibility of running CUDA yourself, that's going to continue to be very helpful. But if all you need to do is run simple inferencing operations or embedding operations, then something like this, which generates tiny WebAssembly modules that start in less than a millisecond and are portable across those hardware acceleration devices, is a great fit. Please have a look at Spin and have a look at SpinKube. I think we do have a couple of minutes for questions. Three minutes? Okay.

All right, so if you've got some questions... otherwise I'll tell you the story of how Radu created serverless AI. All right, well, while you're looking for questions: the two of us were on a flight from Frankfurt to Bangalore. Long flight. I had just gotten to Frankfurt from the US. We sat next to each other on the plane, and Radu says, hey, I downloaded some LLM models, I'm gonna work on building serverless AI while you're asleep. This was the first time I'd ever heard him talk about this. And I said, okay, and I went to sleep. About three quarters of the way through the flight, I opened my eyes and Radu was right here, like right in my face, going, oh good, you're awake, I've got a demo to show you. And that was the first pass of serverless AI, written on a plane somewhere over the ocean.

All right, with no questions, the main things we hope you come away with: we really wanted to nail this serverless AI story where you could write just a function, and you could write it in JavaScript, Rust, Go, TypeScript, Python, whatever, and be able to take advantage of whatever the best hardware option is for you in your system today, whether you're running inside of Kubernetes or on Fermyon Cloud or locally or wherever. The developer should be able to easily develop on their local workstation and either use CPU inferencing, which I think is the default, or compile Spin locally with support for whatever their underlying GPU architecture is. Then they can work locally, regardless of what the production environment looks like, and when you deploy out to production, you're running on whatever the best available hardware is. So in this case, Radu used Metal locally; when he deployed, he was using A100s running in Civo's cloud in London, I believe is where those were. And you can see the time difference there. Anything else you wanna add as closing thoughts? Nope, have a look at Spin and SpinKube as the way to run all of this in Kubernetes. Thank you.