Hi, everyone. Thanks very much for being here. I'm Anastassios Nanos, and we're going to talk about interoperable hardware acceleration for serverless computing. A quick overview of the presentation: we'll talk about serverless computing, then about our framework, which we call vAccel and which enables interoperable hardware acceleration for applications. We'll cover the integration with serverless frameworks, and if there's time we can do a short demo.

We are a small research company working on virtualization, library OSes, unikernels, containers, and container runtimes. We have a mixed academic and industry background, and we're based in the UK and Greece. We also have another talk tomorrow about using unikernels for serverless.

Essentially, serverless computing is about infrastructure orchestration managed by the service provider. It offers effortless scaling and lets users focus on business logic and deploy their code without provisioning infrastructure. The code is deployed as a function with all its dependencies, execution is event-driven, and the billing model is what it should be: you get billed for actual resource usage rather than for idle resources. It offers stateless execution, oriented towards microservices and actions that are spawned upon a trigger.

Serverless frameworks are usually deployed on cloud infrastructure; however, this mode of execution seems useful for edge workloads as well. For instance, you can do ML inference, image inference, at the edge on small devices with accelerators.

Serverless frameworks are currently backed by containers, and containers present some multi-tenancy issues, mainly with regards to security and leaking data, either inputs or valuable data like models. The solution from service providers at the moment is either managed, isolated infrastructure, or sandboxing containers inside VMs, that is, isolating the execution of a container using hardware virtualization extensions. The issue with VMs is that it is hard to get hardware acceleration, or device access in general, into the VM.

What is being done about this at the moment: you can assign the whole device to a VM, so only one user, only one VM, can use the device. You can do mediated pass-through, a mix of software and hardware emulation that exposes functions of the device to the VM. You can take the para-virtualization approach, which is fairly easy but requires vendor support both on the host and in the guest. And lastly, you can do API remoting; one good example is rCUDA, where you intercept the CUDA calls in the guest and forward them to the host, and on to the device.

What we figured is this: what is the bare minimum a workload needs in order to access an accelerator? In a serverless function that does image classification, the only things the workload needs to specify are the kind of classification, the model and its parameters, and the input. So why do we need the whole device in the guest? Why do we need to issue CUDA or OpenCL calls? What we built is a framework that decouples the function call from the hardware-specific implementation. vAccel has a static API plus a user-defined, extensible API.
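To make the user-facing side concrete, here is a minimal sketch of what calling the static API can look like from an application. Everything in it is illustrative: the type and function names are assumptions made for the sketch, not the exact vAccel signatures.

```c
/* Minimal caller-side sketch of a static image-inference API.
 * All names below are illustrative assumptions, not the exact
 * vAccel signatures. */
#include <stdio.h>
#include <stddef.h>

/* assumed, simplified API surface */
struct vaccel_session { long id; };
int vaccel_sess_init(struct vaccel_session *sess, unsigned int flags);
int vaccel_sess_free(struct vaccel_session *sess);
int vaccel_image_classify(struct vaccel_session *sess, const char *img,
                          char *out_tag, size_t out_len);

int main(void)
{
    struct vaccel_session sess;
    char tag[256];

    if (vaccel_sess_init(&sess, 0))
        return 1;

    /* the application only states *what* it wants (classify this
     * image); the framework decides at runtime which backend
     * produces the result */
    if (!vaccel_image_classify(&sess, "cat.jpg", tag, sizeof(tag)))
        printf("classified as: %s\n", tag);

    return vaccel_sess_free(&sess);
}
```

The point of the sketch is that nothing in the application names CUDA, OpenCL, or a specific device; that binding happens below the API.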
Underneath that API, vAccel has several plugins, which are glue code from the API calls to the hardware implementations, the actual code that runs on the accelerator; a mapper, a multiplexer, that maps API calls to plugins; and a couple of custom plugins we have designed to enable VM and remote execution.

To see in a bit more detail how vAccel works, we can look at the software stack. The mapper is the core part of the vAccel library. On top we have the static user API; for this example we've chosen the image inference API, with several functions we would like to expose, such as image classification, image segmentation, object detection, and so on. At the bottom of the figure are the plugins, the glue code. The jetson-inference plugin, for instance, implements some or all of these API functions, which means that for a specific call like image classify it provides the actual call into the hardware implementation. The red boxes are outside vAccel: they are the actual software implementations of the functions we have defined. dusty-nv's jetson-inference, for example, is a GitHub repo from NVIDIA implementing several operations like the ones on top. The only thing the jetson-inference plugin does is bridge the API call on top to the hardware implementation at the bottom.

So, for an application that calls the image classify API of vAccel: the call reaches the user API and goes into the multiplexer, which, depending on what the user has chosen, or what can be chosen at runtime, selects the current plugin, here the jetson-inference plugin, and this plugin calls the actual hardware implementation of image classify. Another application could issue the same API call, but at runtime the user or the systems admin could have selected another plugin, say an NPU image inference implementation, and it would call the hardware implementation on an AMLogic chip, an AMLogic NPU. The same holds for another API call, pose estimate, again via the jetson-inference plugin.

So we've seen how the user API works: it's just a simple mapper. The user application calls a function, the function gets mapped to the relevant plugin, and the plugin calls the actual implementation. What is more interesting, though, is porting existing applications to vAccel without having to change their code. What we do is "libify" the application. An example, again with image classification: we have an application that does image classification through a TensorFlow image inference program. We split it into the function call and the actual implementation of the classification operation, so we have a simple function call on top and all the hardware-specific stuff, the TensorFlow graph building, model loading and so on, at the bottom. We port that to vAccel using a generic operation, we call it exec in the framework; it's an API-remoting mechanism. We pack the arguments and the name of the symbol we want to call, transfer them through vAccel, and on the other side we unpack the arguments and call the relevant symbol, with its arguments, in the shared library. For example, consider a vector add operation; a rough sketch of both sides is shown below.
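This is a minimal sketch under stated assumptions: the vaccel_exec prototype and the argument structure are illustrative, and passing raw pointers like this only works while caller and callee share an address space; in the VM or remote case the framework has to copy the underlying buffers across.

```c
/* Caller side: pack the arguments and name the symbol to call. */
#include <stddef.h>

struct vadd_args {
    const float *a, *b;   /* inputs */
    float *c;             /* output */
    size_t n;             /* dimension of the arrays */
};

/* assumed prototype for the generic exec operation */
int vaccel_exec(const char *library, const char *symbol,
                void *args, size_t args_len);

int do_vadd(const float *a, const float *b, float *c, size_t n)
{
    struct vadd_args args = { a, b, c, n };
    /* vAccel serializes the arguments, forwards them (possibly to
     * the host), and the exec plugin invokes the symbol with them */
    return vaccel_exec("libvadd.so", "vadd_unpack", &args, sizeof(args));
}

/* Callee side, compiled into libvadd.so: the original vector add,
 * re-wrapped to match the generic unpack prototype. */
static void vadd(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

void vadd_unpack(void *packed, size_t len)
{
    (void)len;
    struct vadd_args *args = packed;  /* dereference the packed args */
    vadd(args->a, args->b, args->c, args->n);
}
```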
On the caller side we have three arrays, two inputs and one output. We define a structure to move the arguments around, so we essentially serialize the arguments. We name the shared library that holds the actual hardware implementation and the symbol that's going to be called, the vector add, we put the relevant parameters in the structure, and we just call vaccel exec. On the other side, in the shared object that has the actual hardware implementation, we have to unpack these arguments. The original vector add has four arguments: two inputs, one output, and the dimension of the arrays. It now becomes compatible with the unpack prototype, and we dereference the arguments from the packed structure we mentioned. What is left is to actually call the function, which is done through the exec plugin: it dereferences the symbol and calls the function pointer with the relevant arguments.

Now, this is useful in a VM environment or a remote execution environment; on the same host it doesn't make much sense, although it could, for interoperable hardware: the same binary with several shared objects that can be called on different hosts. But what's really interesting is how you run it from a VM. This figure shows the process for VM execution. We have the application, we pack the arguments, we use the vAccel generic operation, and we have the virtio plugin, which implements every API operation: exec, image classify, the whole static API and the user-extensible API. This plugin forwards the call and its arguments to another vAccel instance running on the host, which in turn calls the actual hardware implementation. So the original application that we libified can execute in a VM without direct access to the hardware. The virtio transport is implemented as a kernel module in the guest and a plugin in vAccel.

We have the same implementation over sockets as well, both over the virtio AF_VSOCK transport and over generic TCP sockets. In the vsock case we have a user-space program, a server essentially, that receives the requests and calls the vAccel API, vaccelrt, the core of our framework.

Now, to use this in a serverless environment we need some kind of orchestration framework, and we need to make sure everything is in place: the agent, in the vsock case, or the correct hypervisor with the backend, and a VM rootfs that includes the virtio module, and so on. So we've integrated vAccel into Kata Containers. For the virtio approach we've patched the hypervisor, both QEMU and Firecracker; the vsock approach should be hypervisor-agnostic, but we've only tried Firecracker and QEMU with it. And we've added this downstream Kata implementation to a Kubernetes cluster running OpenFaaS, an open-source, pretty popular serverless framework. The OpenFaaS control plane runs in containers through runc, but the functions execute with a new runtime class, a custom runtime class backed by our downstream Kata Containers implementation, and all the functions on the right are generic containers that can use hardware accelerators through vAccel, either over vsock or directly through the virtio interface. A rough sketch of the vsock agent idea follows.
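The real agent is written in Rust; this C sketch only shows the shape of the idea, an AF_VSOCK server that accepts requests from the guest and would hand them to the runtime. The port number and the request handling are made up for illustration.

```c
/* Illustrative AF_VSOCK agent loop (Linux). The wire format and
 * handle_request() body are assumptions for the sketch. */
#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/vm_sockets.h>

#define AGENT_PORT 2048u  /* illustrative */

static void handle_request(int fd)
{
    char buf[4096];
    ssize_t n = read(fd, buf, sizeof(buf));
    /* a real agent decodes the operation (image classify, exec, ...)
     * plus the packed arguments and calls into the vAccel runtime,
     * which maps the call to the plugin for the local accelerator */
    if (n > 0)
        (void)write(fd, "ok", 2);  /* acknowledge, sketch only */
}

int main(void)
{
    int srv = socket(AF_VSOCK, SOCK_STREAM, 0);
    struct sockaddr_vm addr = {
        .svm_family = AF_VSOCK,
        .svm_cid = VMADDR_CID_ANY,  /* accept from any guest CID */
        .svm_port = AGENT_PORT,
    };

    if (srv < 0 || bind(srv, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        return 1;
    if (listen(srv, 8) < 0)
        return 1;

    for (;;) {
        int fd = accept(srv, NULL, NULL);
        if (fd >= 0) {
            handle_request(fd);
            close(fd);
        }
    }
}
```

The same loop over a plain TCP socket is what makes the remote, hypervisor-agnostic case possible.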
I don't think I have much time, so, in terms of performance: we have several measurements, both micro-benchmarks and end-to-end applications. We've tried image classification with three implementations from the NVIDIA jetson-inference framework, we've implemented a simple inference example on TensorFlow, both in Rust and C, and we've also tried the OpenVINO framework from Intel using the Intel Neural Compute Stick. The setup we used was Firecracker VMs with the virtio and vsock plugins, on a simple home-use GPU, on an Intel Neural Compute Stick attached to an Intel i5, and on a Jetson Nano and a Xavier AGX. In this graph, execution time is normalized to native, so one is native and anything over one is the overhead of the execution. On the left side are the x86 numbers, on the right side the ARM numbers; the light bars are the virtio case, the dark bars the vsock case. With the exception of the ARM vsock case, the overhead is less than 5%. These numbers are for the Firecracker hypervisor; we don't have QEMU plotted in this presentation. What we see is that the overhead is pretty low, which is expected, as the only overhead in the whole execution flow is the data transfer: we need to copy from the guest to the host in order to reach the accelerator.

Regarding programming framework and language support for vAccel: the core framework is written in C. We have plugins in C, C++ and Rust; essentially a plugin can be written in any language, as long as we can compile it as a shared object. We have bindings for the user API in C, Rust and Python. The vAccel agent, the server that supports the vsock operation, is written in Rust. Regarding the user-extensible API, we have wrapper functions that facilitate porting existing applications, and a couple of stubs, both for the user part and the hardware-specific part.

In terms of system support, as mentioned, we have tried Firecracker and QEMU; we have backends for both hypervisors for the virtio case, and the vsock case is hypervisor-agnostic as long as the hypervisor supports vsock. In terms of orchestration and runtimes, we support Kubernetes through Kata Containers, and we have tried OpenFaaS, which works fine. We also have support for unikernels: we have tried Unikraft and Rumprun, and we are in the process of trying OSv.

In terms of hardware framework support, we can use anything, as long as it can be compiled as a shared object and it makes sense to execute it in this kind of use case. We have the jetson-inference operations and the TensorFlow stuff, TensorRT, OpenVINO. We have used some OpenCL examples, for the vector add for instance; we haven't run anything more complicated than that. We could also support CUDA; it is pretty straightforward. We have used datacenter GPUs like the NVIDIA T4; edge GPUs, the Jetson Nano and Xavier AGX; the Google Coral; the Myriad X on the Intel Neural Compute Stick; and a Khadas board with an AMLogic NPU. It seems that as long as you can compile a plugin as a shared object, you can run anything you want.

What we are aiming at with this framework is a single application binary that can use hardware acceleration on several hardware implementations, like an application that could scale out from the edge to the cloud and back. It is currently work in progress; we are planning to extend the API and the hardware support.
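Since the recurring theme is "anything that builds as a shared object works", here is a hedged sketch of how an exec-style plugin can resolve and invoke a symbol with plain dlopen/dlsym. The library and symbol names are illustrative; the unpack prototype matches the vector-add sketch earlier. Build with -ldl.

```c
/* Resolving and calling a function from a shared object at runtime,
 * the way the exec plugin dereferences the requested symbol. */
#include <dlfcn.h>
#include <stdio.h>
#include <stddef.h>

typedef void (*unpack_fn)(void *args, size_t len);

int call_shared_object(const char *lib, const char *sym,
                       void *args, size_t len)
{
    void *handle = dlopen(lib, RTLD_NOW);
    if (!handle) {
        fprintf(stderr, "dlopen: %s\n", dlerror());
        return -1;
    }

    unpack_fn fn = (unpack_fn)dlsym(handle, sym);
    if (!fn) {
        fprintf(stderr, "dlsym: %s\n", dlerror());
        dlclose(handle);
        return -1;
    }

    fn(args, len);  /* jump into the hardware-specific implementation */
    dlclose(handle);
    return 0;
}
```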
Beyond that, we would like to add more intelligence to the core library, and we would like to add some kind of marketplace for models, for inference, or for binary blobs, that the user could just pick up and use on the fly. It is open source; you can check out the code and the plugins. There is also a tutorial-like walkthrough where you can run your own application on vAccel through Firecracker. I would also like to mention that this work is partly funded by EU research funds. Thanks very much.

I think we have some time to show the demo, right? Do I have time? So, it's a pre-recorded demo. We have a Kubernetes cluster with a node that has a GPU installed; it's a home-use GPU, an NVIDIA RTX. We have an OpenFaaS installation, and this is the control plane running on several nodes in the cluster. On the left side there is a simple UI that we created to showcase what we are doing. We have an OpenFaaS function which essentially calls the image classification operation, without direct access to the hardware: it is a Kata container booted in a Firecracker VM, so there is no access to a GPU. We can select an image to classify, and we can see that there is GPU activity; we have nvtop at the bottom. We can check out the logs of these functions. Another example is a dog: we can see movement in the logs of the Kubernetes functions, and the result is a classification tag and the confidence of the model. These are ImageNet models, pretty much the defaults for the jetson-inference implementation. More images and logs from the OpenFaaS functions; trying to resize the nvtop window; more images.

Something else we have built, also work in progress, is unikernel execution as a function. In the OpenFaaS case, the way this works is that there is a watchdog process that gets a request, spawns a process, does whatever the function needs, and returns the result. What we have done for now is that, through the watchdog process, we spawn a unikernel, do the work, and return the result. It's kind of a hack to get a unikernel to implement the serverless function. What we are working on is getting a runtime such as Kata Containers to spawn a unikernel from a container image directly, but that is for another talk. So what we have here is this hack: the OpenFaaS watchdog running a unikernel that supports vAccel, a hacked unikernel. We just change the endpoint that we post the image to, change the picture, we have the beer glass again, and we can see that the process is forked, gets executed, and actually talks to the hardware. So we can do the image classification thing from a unikernel without having to port all the NVIDIA libraries into the unikernel, and it works exactly the same as the other case. We would like to see how fast it is; it should be a lot faster than the generic case with the VM and all the bloat in there.

I believe that's it from our side. Thanks very much for listening. I would be more than happy to answer any questions.

[Audience question, inaudible] Yes; well, no, the hardware implementation should already be there. What the application calls is the upper function call, not the actual hardware implementation, so we abstract the hardware implementation semantically. The image classification, the hardware implementation, is already there.
It's already on the Jetson Nano, on the Khadas, on the AMLogic chip; it's already there. The user just calls image classify with some parameters. They don't implement the function themselves, and we don't do just-in-time compilation for the hardware. Yes, it should already be installed on the system by the admin; it's not the user's responsibility. Think of it like Amazon Lambda: you have a couple of API calls that you know can be accelerated. You would like to do image classification with GoogLeNet, ImageNet and so on, so it's offered by the Lambda-like service; it's not that you have to write the model or the TensorFlow graph yourself. It's the bottom part: the image classification algorithm that you have, you libify it, so you compile it as a shared object, and then you pack the arguments when you call that image classification function. Does that answer your question?

[Question] Do you already have any integration for some of this? No, it's actually pretty early in the project to do something like that. We rely on the fact that the binaries are already there, because there's an NVIDIA card or some other TPU or NPU there. So it's an offering from the infrastructure, and the user just picks whatever they would like to run and runs it as a function.

[Question about remote execution] Correct, yes: there is TCP socket support alongside vsock. The overhead is of course dominated by the network. We have tried that in a use case where we had a Google Coral mini dev board that we couldn't run KVM on: we ran the vaccelrt agent on it and called it from x86 and ARM boards to check the architecture compatibility, and it worked just fine. But the Coral mini board only has Wi-Fi, so it was pretty crap. If there are no more questions, thanks very much.