Hi everyone, my name is Alexander Jung and I'm the CPO at Unikraft. I'm also a PhD student at Lancaster University, focusing primarily on lightweight virtualization. I'm very honored to be able to give you a presentation today at KubeCon on some of the work we've been cooking up at Unikraft. This has been an ongoing effort with Lancaster University, University Politehnica of Bucharest and NEC Laboratories Europe. Today's talk is all about how we can introduce unikernels into the ecosystem.

But let's start with our premise. Achieving higher cluster utilization whilst maintaining performance and security is fundamental to service operators, such as infrastructure-as-a-service providers or DevOps engineers, who wish to decrease operational expenditure. We're actually seeing a lot of great work in this space recently on how to reduce the cost of services running in the cloud. Virtualization strategies such as containers have gained immense popularity thanks to orchestration frameworks such as Kubernetes. Thanks to their ease of use, containers and Kubernetes have been adopted as the de facto industry standard for deploying diverse workloads on heterogeneous infrastructure. Deployments range from big data to web services to edge computing. Some really awesome stuff, and the mix has enabled advancements in scheduling strategies, deployment architectures and workload monitoring.

But let's take a look at what a typical deployment looks like with Kubernetes and dive into the full stack to see what's going on. You can see in this figure what a typical worker node in a cluster might look like. The node is typically provisioned in a cluster as a virtual machine itself, running on top of a hypervisor. The worker node has a fully functional, general-purpose operating system and a container runtime engine such as containerd, which manages a number of namespace-separated processes: both the Kubernetes system services in kube-system as well as the provisioned services and their actual pods.

But what's the problem? Well, there are actually four degrees of virtualization and indirection. There's of course the hypervisor which is hosting the virtual machine; it's keeping the VM safe from other VMs in what is probably the most secure way possible. The OS itself is managing a number of internal system processes along with the container runtime. The container runtime is managing the lifecycle, the namespaces and the isolation of all the services and all the pods. And in some cases, the container pods introduce their own level of virtualization, separating internal services to achieve the functionality of the application.

To quote Anil Madhavapeddy: "Are we really doomed to adding new layers of indirection and abstraction every few years, leaving future generations of programmers to become virtual archaeologists as they dig through hundreds of layers of software emulation to debug even the simplest of applications?" Not only does the amount of indirection between the different levels of virtualization make things difficult to debug, especially for mission-critical or performance-sensitive applications, but it adds performance penalties every step of the way. It's making things slower, and it's adding bloat to the system, meaning that we have to scale horizontally to new workers when the resources of the machine are no longer adequate for the service. Another thing to mention is that the system pods are mingling with actual service pods.
They're only separated by namespacing. There has been some work to make the separation stronger, and I'll get onto this in a second, but it's very soft, meaning that we're relying on the operating system to ensure this for us. We rely on the operating system to be secure and hardened.

Okay, what can we do to try and address this bloat? Over the last few years we've been seeing some great work pop up on creating a full-stack operating system tailored for the Kubernetes experience. These projects have a single goal: tailor the OS for Kubernetes and include its libraries, services, daemons and what have you to make Kubernetes run, and they're pretty good at what they do. We've got k3OS and RancherOS, which create separate daemon processes for system containers and service containers. We also have some tailoring of the Linux kernel from the folks at Docker, who build only the necessary kernel components. The folks at VMware take a similar approach but also include the necessary drivers to work on their virtualization platform. But they still use a monolithic kernel, Linux, and they still run on top of a hypervisor.

So the question I'm really getting at is whether we can reduce the degrees of unnecessary components even further by looking directly at Linux. LinuxKit does this to some extent. I actually really like the way they approach the problem. It's a nice tool and you should check it out if you can. It allows for very bespoke builds of the Linux kernel. It's really cool, but it suffers from the same premise, and that's that the Linux kernel is inherently monolithic. That's totally understandable: it was built organically over the last 30 years. It's been growing and growing, suiting every need that's been thrown at it, whether to increase performance, adjust to new use cases or reuse code. It has very strong interdependencies between its subcomponents. If you've ever built the Linux kernel from scratch, you'll know how many configuration options I'm talking about: thousands, many thousands. But turning one of these off doesn't necessarily mean that the whole component is excluded, or that it's decoupled from all the other components. The Linux APIs for each component are very rich, have grown very organically, and the component separation is often blurred to achieve performance. For instance, sendfile() short-circuits the networking and storage stacks.

We tried to quantify this API complexity in the Linux kernel and analyze the dependencies between the main components. We used a program called cscope to extract all the functions from the sources of the kernel components and then, for each call, checked whether the called function is defined in the same component or in a different one. In the latter case, we recorded a dependency. (I'll show a rough sketch of this counting step in a moment.) In this graph, you can see all the dependencies between the different kernel components.

So, back to our premise. The challenge operators face with containers is the increased desire to utilize the public cloud for its ease of use, despite its negative effect on cost, performance and security. Actually, there's a very interesting article by Sarah Wang and Martin Casado on the paradox of the cloud which goes into more of this problem. In this article, they essentially discuss how ease of use gets developers, operators, companies, everyone onto the public cloud in an attempt to centralize and achieve economies of scale, but it isn't quite scaling as expected.
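Coming back to the kernel-dependency analysis for a second: the counting step itself is simple once you have, for each function call, the file that defines the caller and the file that defines the callee. Here's a rough sketch of that step in Go; the input format (one "caller-file callee-file" pair per line, pre-extracted from cscope's output) and the component list are assumptions for illustration, not our exact tooling:

    package main

    import (
        "bufio"
        "fmt"
        "os"
        "strings"
    )

    // component maps a kernel source path to a coarse subsystem name,
    // e.g. "net/ipv4/tcp.c" -> "net". Paths outside the listed prefixes
    // are ignored in this sketch.
    func component(path string) string {
        for _, c := range []string{"net", "fs", "mm", "kernel", "block", "drivers", "ipc", "security"} {
            if strings.HasPrefix(path, c+"/") {
                return c
            }
        }
        return ""
    }

    func main() {
        // Assumed input on stdin: one call edge per line, "caller_file callee_file".
        edges := map[string]int{}
        sc := bufio.NewScanner(os.Stdin)
        for sc.Scan() {
            f := strings.Fields(sc.Text())
            if len(f) != 2 {
                continue
            }
            from, to := component(f[0]), component(f[1])
            if from == "" || to == "" || from == to {
                continue // only record cross-component dependencies
            }
            edges[from+" -> "+to]++
        }
        for dep, n := range edges {
            fmt.Printf("%s: %d calls\n", dep, n)
        }
    }

Feed it the call edges and the per-pair counts become the edge weights in the dependency graph. Okay, back to the cloud cost story.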
It's very easy to get started with AWS, Google Compute or Microsoft Azure, and it's actually quite cheap for a small project, but as soon as your services start to scale, so does the bill, naturally, and it's not linear. It's not linear because the underlying physical infrastructure isn't specialized for your exact use case.

This is where I get to introduce you to the concept of a unikernel. I'm very excited about this technology. It's been on the radar for a few years now, but largely as proofs of concept, experimentation, research projects, et cetera. But unikernels, in contrast to containers and the full OS stack, are an increasingly relevant model for deploying services, as they offer an uncompromising approach to performance gains by tailoring the OS, specializing it to the application's needs and stripping away unnecessary functionality right to the core. Their focus has largely been targeted at memory safety through a single address space, single-application performance through a specialized language runtime, customization of OS primitives such as the memory allocator and scheduler, and reducing latency and increasing bandwidth by specializing the network stack. However, until now, these efforts have fallen short when considering large-scale heterogeneous deployments, with overheads experienced in real-world contexts such as those of services deployed by Kubernetes.

Okay, first let's cover some basics of how a unikernel is built. We start with the full-stack application, and you can see here that we have your application and its runtime configuration right at the top of the stack. It runs as, for instance, a Linux user-space program. A typical go-to example is, say, NGINX. NGINX itself requires a number of third-party libraries to work, for example OpenSSL. NGINX is also usually dynamically linked, and so it uses shared libraries that are installed on the operating system. The operating system and the kernel provide NGINX with a number of pieces of intrinsic functionality, like scheduling and threading. The kernel also facilitates system calls, like reading and writing to a file or opening a socket. If we look at the operating system and its requirements, it needs a lot of functionality itself to run. It runs daemons and services like SSH and also relies on the kernel for multitasking. The kernel is compiled with diverse drivers which are tailored to the platform it runs on, like KVM, Xen or VMware. And lastly, it has hardware-specific code, machine code, which interfaces with the processor, such as with Intel x86.

The idea of the unikernel, and with Unikraft in particular, is that we librarize all of these components and create a library operating system. All traditional components of the monolithic operating system, the kernel and all third-party libraries are made available in a pick-and-choose fashion. You select them based on your application's needs. You don't need a scheduler because your application runs to completion? No problem. But not only that, we can pick different libraries with the same interface, which interact in the same way but offer different functionality. I'll cover more of this later. Unikraft is built on this library operating system model. So you start with your application and your configuration, and then through a number of steps, such as retrieving the relevant libraries, preparation, compilation and linking, you get a final binary: a full virtual machine image.
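To give you a feel for what that pick-and-choose step looks like in practice, here's a Kconfig-style selection for something like the NGINX example. Treat the option names as illustrative; they're not necessarily Unikraft's exact symbols:

    CONFIG_PLAT_KVM=y        # build for the KVM platform only
    CONFIG_ARCH_X86_64=y     # target x86_64
    CONFIG_LIBUKSCHED=y      # cooperative scheduler; drop it for run-to-completion apps
    CONFIG_LIBUKALLOC=y      # the memory allocator, picked as a library
    CONFIG_LWIP=y            # TCP/IP stack, only because NGINX needs networking
    CONFIG_LIB9PFS=y         # a single filesystem driver instead of a whole storage stack
    # no USB, no iSCSI, no drivers for hardware the VM will never see

Everything left unset simply never gets compiled or linked in.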
The resulting image only has the necessary components for the application to run and nothing more. It's tailored exactly as your application needs it.

So, unikernels. They are a compile-time specialization strategy. You can actually very easily integrate them into a CI/CD pipeline and generate the artifact ready for deployment to the relevant platform and hardware architecture. They are a form of lightweight virtual machine. They boot like any other OS, but unlike traditional operating systems, they boot much faster; I'll cover how much faster a little bit later. The application and the kernel share the same address space, so there is no separation between the two. There's no user space and there's no kernel space. This means that the cost of syscalls is much cheaper: they're actually just function calls now. It also means we can optimize those calls, inlining them when possible through compile-time optimization strategies. The unikernel only has the necessary functionality for the application to run, nothing more, which means that there are no daemons, no system libraries, no background tasks, and really, you know, no SSH, for example. Unikernels target a specific platform and hardware combination. They're tailored not only for the application, but for their runtime environment. With Unikraft, NGINX has only the necessary dependencies for it to run: networking, IPC, file system, memory management and scheduling. We don't need, for instance, USB drivers. We don't need iSCSI if we don't need it. And if we know the file system that we're working with, we only need to compile in those particular drivers, such as ext4, 9pfs, et cetera.

So, okay, let's check out some numbers. We ported NGINX and compared its performance against different solutions, including a full VM running Linux with NGINX inside, other unikernel projects, and, of course, Docker. All versions of NGINX were the same, and we ran a simple throughput test using 14 threads and 30 connections with the same HTML payload size. As you can see, compared to Docker, we had an 82% increase in throughput. When it comes to performance, it's also worth looking at memory usage, and we're seeing much lower memory footprints, too. We also measured the boot time of Unikraft. Here, we tried different virtual machine monitors; more on how we can use these with Kubernetes later. But we can achieve boot times into the running main thread of the program in as little as 3.1 milliseconds. Just for fun, I ran some experiments on my Raspberry Pi and wanted to quickly show you that it also runs directly on hardware. There was an order of magnitude difference in boot time between Raspberry Pi OS, the default OS that ships with Raspberry Pis, and the Unikraft unikernel image. Unikraft's build system can target different platforms and architectures, and here I just had to rebuild the NGINX application, targeting a different platform and hardware architecture.

When it comes to the separation between kernel space and user space, syscalls represent a huge bottleneck. It's understandable: there are a lot of checks which occur to ensure that the user-space application does not make incorrect or potentially malicious calls to the kernel, which could disrupt the runtime of other services. If our VM is running in a single address space, which it is, because we know exactly what we want, we don't need this mitigation.
Here, we demonstrated that a function call is two orders of magnitude faster in terms of CPU cycles with Unikraft.

I'm going to dive into the specifics of how we get to these images in a few slides, but I want to show you that when it comes to packing images into an OCI registry, ready to be retrieved by the relevant worker node, we often try to minimize the file system as much as possible, right? There are some classic techniques, like with a Go program: you often create the final static binary and place it in a scratch image as the last step of the build, hooking it in as the main entry program. In a way, this is almost exactly what we're doing with unikernels, just with the whole OS and kernel. To give you an idea of the difference, however: the official Docker image for NGINX is 4.2 megabytes in size, but our complete unikernel image for NGINX of the same version is 1.3 megabytes. Both OCI images are compressed in the same way and have the same configuration inside.

Having a reduced image size naturally comes with security benefits, including a much smaller attack surface. There's no shell to drop into (actually, this becomes a problem later, and I'll explain), and there's no concept of a user. There's nothing running in the background which could disrupt the runtime of the application either. Each image has a reduced memory size, and we can introduce address space layout randomization at compile time to make each image more unique. And because we're dealing with virtual machines, we're making use of hardware-based isolation at the lowest level of virtualization. The VM is really just the standard unit which large infrastructure-as-a-service providers offer, so it's far more hardened than the OS that they actually provision for you. Finally, I'd like to mention that at Unikraft we have some ongoing work introducing in-unikernel memory protection using hardware acceleration, to separate, for instance, different library components such as the network stack, the scheduler and the application from each other, for instance by using Intel MPK.

Okay, we have unikernels and they seem to provide a battery of different benefits. A stepping stone, however, towards the ultimate use of lightweight virtualization via unikernels is integration into an orchestration framework which maintains the ability to dynamically and quickly provision new services, to schedule and reschedule services based on workload (known as autoscaling), and to co-locate relevant jobs and required resources such as static volumes. Kubernetes provides all of these requirements and as such acts as an ideal tool for testing large-scale deployments of unikernels against the same or comparable workloads. Our main goal is to allow for the seamless use of unikernels through the orchestration platform, Kubernetes. At a high level, we envision dynamic workloads scheduled in the same way as their traditional container-based counterparts. The use of Kubernetes itself is fueled by its widespread adoption as an industry-leading platform for the scheduling and orchestration of services. And two, we want to make it as easy as possible to achieve that first goal. We want to make everything as pluggable as possible and make the experience seamless, attempting to make little to no modifications to existing ecosystem internals which would require operators to make non-trivial modifications to the host.
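To give a sense of what that seamlessness should look like from the operator's side, ideally the only unikernel-specific thing in a deployment is the runtime class on the pod. A minimal sketch, assuming a runtime class and image name along the lines of the demo later (both names are illustrative, not final):

    apiVersion: node.k8s.io/v1
    kind: RuntimeClass
    metadata:
      name: runu            # tells Kubernetes a "runu" handler exists on the node
    handler: runu
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: nginx-unikernel
    spec:
      runtimeClassName: runu                # the only unikernel-specific line
      containers:
      - name: nginx
        image: registry.example.com/unikraft/nginx:latest   # packaged unikernel image (illustrative)

Everything else, scheduling, networking, logs, the dashboard, should look exactly like it does for a regular container.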
But the ecosystem already has some great tools for running virtual machines via Kubernetes. We have KubeVirt, and we have, well, we did have runV, which is now Kata Containers. But when we took a closer look at what these tools did, we noticed one major problem: they were launching new VMs with the same assumptions, a monolithic OS running multiple services, including your application. Actually, some of the ways that we interface with and configure these VM-based containers through Kubernetes were via SSH. And as you saw before, we can't do this with unikernels, because a unikernel boots, starts the application and does nothing more.

Okay, in the next few slides, I'd like to explore with you the necessary steps to make our integration of unikernels into Kubernetes possible. We start by studying how Kubernetes as a whole operates: how Kubernetes talks to a container runtime engine via something called the Container Runtime Interface, the CRI, which is implemented by a tool like containerd, and how the container runtime engine speaks with an OCI (Open Container Initiative) runtime- and image-specification-compliant program, such as runc. Here is a general overview of a multi-node Kubernetes cluster with a controller node. The API server speaks with all worker nodes via protobuf. But on the host, the main daemon for managing the lifecycle of deployed services speaks through the standard interfaces, OCI and CRI. If we want the worker node's host to instantiate unikernel virtual machines, we're going to need to work inside this framework.

Let's take a look at how a request is served from the Kubernetes control plane towards the instantiation of a service. First, a request is made which is passed to the kubelet: we want to create a new pod for our particular service. The kubelet speaks with the container runtime engine, in this case containerd, and makes a request to create a sandbox environment. The sandbox request is passed to a binary shim, which is a daemonized process on the host. This process serves as the interface with, for example, runc; I've put a star here at that interface. The daemon is in charge of creating new processes on Kubernetes' behalf: the sub-processes of the runc shim layer and the runc binary itself. So, at the star, we can swap in our own tool, runu, which instantiates and manages the lifecycle of unikernel virtual machines.

Introducing this shim layer and runtime manager into containerd is actually quite simple and works towards the higher-level goal of being pretty pluggable. We introduce a new plugin to containerd for a program which matches the runtime type, we'll see this here, and set the path to the binary of the unikernel runtime manager. So, our first contribution is a new tool, runu, to run unikernels. To be OCI compliant, this program must essentially have the following sub-commands and arguments available to it: create, delete, kill, spec (which I'll cover in the next few slides), start, and then a method to query the state.

Okay, so that's all well and good, but we're still dealing with OCI formats and images. This means we must disguise unikernels in OCI-compatible images. The ecosystem, a.k.a. containerd, relies on the use of remotely accessible registries which understand the contents of container images, standardized again through the OCI image specification.
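Before we get to the images, let me make the runtime side a little more concrete. At the command-line level, the contract for an OCI-runtime-style tool is roughly the skeleton below. This is only a sketch of the shape of such a tool, not runu's actual code; a real runtime would parse the bundle's config.json and drive a VMM rather than print stubs:

    package main

    import (
        "fmt"
        "os"
    )

    // Minimal outline of an OCI-runtime-style command-line interface.
    func main() {
        if len(os.Args) < 2 {
            fmt.Fprintln(os.Stderr, "usage: runtime <create|start|kill|delete|state|spec> [id]")
            os.Exit(1)
        }
        cmd, id := os.Args[1], ""
        if len(os.Args) > 2 {
            id = os.Args[2]
        }
        switch cmd {
        case "create":
            // read the unpacked OCI bundle, locate the unikernel binary, prepare the VMM
            fmt.Println("create", id)
        case "start":
            // boot the unikernel VM, e.g. through libvirt or directly via a VMM such as QEMU
            fmt.Println("start", id)
        case "kill":
            // forward the signal, i.e. shut down or destroy the VM
            fmt.Println("kill", id)
        case "delete":
            // clean up VM state and the bundle directory
            fmt.Println("delete", id)
        case "state":
            // report status (created/running/stopped) as JSON, per the OCI runtime spec
            fmt.Println(`{"id":"` + id + `","status":"running"}`)
        case "spec":
            // generate a default config.json for the bundle
            fmt.Println("spec")
        default:
            fmt.Fprintln(os.Stderr, "unknown command:", cmd)
            os.Exit(1)
        }
    }

Okay, back to the images.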
The distribution of such images makes the instantiation of services easier, as their contents are remotely accessible and can be transported where necessary. Unikernels are typically in a binary format, a single virtual machine image which can be launched via a virtual machine monitor such as QEMU. Typical standard formats for virtual machine images are ISO, QCOW2 and raw, and these are not compatible with existing container-based infrastructure. To package a unikernel in OCI clothing, we must follow the specification dictated by OCI. To do so, in this example of an NGINX unikernel, we have to create a tarball that contains a manifest, or index.json, file with information about its contents, and a series of, again, tarballs inside the blobs/sha256 directory. These are known as layers, right? These layers are simply file systems which are extracted over one another to create a complete file system.

So we go back to the lifecycle of our OCI image. Once it's retrieved (this is part of the spec command from before), we unpack it and prepare it on the host environment. In all scenarios, each layer is extracted atop the others to create the final file system of the container. In our case, we simply need to extract the binary image of the unikernel. runu will know where these are, in a well-defined location, and from there runu can interface with libvirt, or any other virtual machine monitor that we choose, to instantiate the unikernel.

So, to put it all together, take a look at our previous systems diagram, where our service pods mingled with kube-system pods, which are managed by containerd, which is managed by the host operating system, which is managed by the hypervisor. Our unikernel is instantiated from kube-system, and it's now its own virtual machine sitting on top of the hypervisor only. When it's booted, it will do one thing and one thing only, and that's perform the actions of the application. In our case, libvirt manages interfacing with the hypervisor.

So, enough talk. Let's check out a demo. Here we have a manifest.yaml file. It contains a specification that we'll deploy to Kubernetes, and you can see here that it requires the runu runtime class and pulls the relevant image from our registry containing the packaged unikernel. So when we run it, it launches via Kubernetes; you just do an apply. We can see that we created a new pod and that it is running. Now, if we inspect this pod, we can follow along with the logs. You'll see it boots and it's now running Unikraft. This is running the NGINX instance and it's been allocated an IP address. And when we curl it, we of course get our page with the payload saying "powered by Unikraft". We just opened up the Kubernetes dashboard and you can see here that everything is running and appearing as usual, in the same place that you would expect.

Cool. So, for future work, there's still lots to do, and we're looking forward to getting this released very soon. For now, we want to create a better integration between the OCI image and runu to be able to instantiate unikernel virtual machines in different ways depending on their needs and contexts. There's a lot of metadata associated with the build of a unikernel: how it's built, the things that it contains. If you've ever dived into building the Linux kernel, Kconfig is its main configuration mechanism, and Unikraft uses the same system.
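One natural place to carry that build metadata alongside the packaged unikernel is in the image manifest's annotations, which the OCI image format already supports as free-form key-value pairs. The org.unikraft.* keys below are purely illustrative, not an existing convention:

    "annotations": {
      "org.unikraft.arch": "x86_64",
      "org.unikraft.plat": "kvm",
      "org.unikraft.filesystem": "9pfs",
      "org.unikraft.kconfig.CONFIG_LWIP": "y"
    }

A runtime manager like runu could read these at create time and decide, for instance, which VMM options or volume mounts the unikernel actually needs.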
So, there's a lot of information about how the unikernel operates, before it's even instantiated, that we can make use of as we boot it in different places. If we have a particular file system baked in, for instance, we can use that metadata to locate it in a specific place that is able to facilitate the runtime. We're also starting to experiment with new scheduling techniques for Kubernetes to be able to co-locate unikernels. Unikernels, of course, can adopt the same sidecar model seen in typical deployments, but we can do some really interesting things like network offloading, DMA and inter-VM communication. Another thing that we are working on is expanding our OCI registry with matrices of builds of the same application for different platforms and hardware targets. You already see this in major registries for OCI containers: Docker Hub, for instance, allows you to download a different architecture depending on the host platform that's requesting it. So you can get an ARM image or you can get an x86 image, for instance. And we want to tailor that experience to a much more diverse set of workloads.

I think that's it from me. Thank you so much for having me at KubeCon. I know this is a recorded presentation, so if you have any questions, please feel free to reach out by email or Twitter.