All right, people. So here is Ali Raza and Thomas Unger talking about Unikernel Linux. Over to you.

Can you see my screen? Yes? So let's start. Hi, I'm Ali Raza, and I'll be presenting our work on Unikernel Linux with Tommy Unger. As you can see, there are a lot of people who are part of this team, some from Boston University and others from Red Hat. I'll quickly go over the initial parts of this talk, the motivation and what we've done, and try to get to the later parts where we show the results. Then Tommy, who likes to live dangerously, will show a demo of the project.

The structure of today's general purpose operating system is not suitable for a number of today's key use cases, and the cloud has brought this problem to the forefront. Cloud workloads are typically run inside dedicated virtual machines, and the workloads that do run bare metal are also single-purpose workloads. So a general purpose operating system, designed to multiplex resources among many users and processes, is instead being replicated across many machines, often with a single user and a single process: a general purpose operating system running a single application workload. Also, applications that require high-performance I/O use frameworks like DPDK and SPDK to bypass the kernel and gain unimpeded access to hardware devices.

So clearly, for some cases, general purpose operating systems are not the answer, and researchers from industry and academia have been looking at special purpose operating systems. In response, there has been a resurgence of research exploring the idea of library operating systems, or unikernels.

Very quickly, what is a unikernel? In a unikernel, an application is statically linked with a specialized kernel and deployed directly on hardware, which might be virtual or physical. The application runs in a single address space with the kernel; there is no separation between privilege levels. Unikernels in research and other projects have been shown to be small and lightweight compared to normal kernels like Linux, and they allow application-specific optimizations. There are many examples of different unikernels, some of which you can see on the screen, which have demonstrated significant advantages in boot time, security, resource utilization, I/O performance, and so on.

So the question is: despite these advantages, why are unikernels not widely adopted? One might find the reason in how unikernels are developed. Some are written from scratch, where the entire code base is new, and some are developed by forking an existing operating system like Linux or NetBSD, which is then stripped and modified to such a degree that you end up with a lightweight unikernel. In both approaches the code is essentially new: either it is a clean-slate effort, or you modify the code, add glue code on top of it, and add code underneath to do the orchestration. A lot of the code is new, and this brings us to the problem that people who run workloads in production do not trust new code bases. They want a code base that is well tested, like Linux. These unikernels also mostly run in virtualized environments and do not support accelerators, device drivers, or different architectures.
Also, compared to Linux, where there is a large developer community that keeps updating the project and fixing bugs, these new unikernels do not have that kind of support around them, so they are not really used in production.

So the question we asked ourselves is: can we take the unikernel model and apply it to Linux? What I mean by that is, can we create a unikernel which lives as a set of ifdefs in the Linux code? Can we reuse the Linux code to create this model, whereby we inherit all the good properties of Linux, its battle-tested code base, its huge developer community, its support for different architectures and devices, and also provide application-specific optimizations?

Very quickly, what are the goals for this project? We want to run unmodified applications. We want to target upstream acceptance, because there is no point in having an out-of-tree fork which we have to maintain over and over again. The idea is to keep the changes as minimal as possible so that, through discussions with the community about how they want to take it forward, we can potentially get some upstream acceptance. For that we need, one, minimal code changes, and two, some performance benefit to start with. Once this is part of the upstream kernel, anyone who wants to deploy unikernels can add their own specific optimizations, which will give further benefits. We also want ease of build and use, ease of debugging, and ease of profiling, so that you do not feel you are working in a restricted environment when deploying unikernel workloads, and we want to target not only virtual deployments but also bare-metal deployments.

Very quickly, jumping to the very end: what is the current status of the project? We have been working on it for a few years, and we finally have a fully functional unikernel which runs unmodified applications. We feel that the code changes are minimal, and I'll talk about them at the end as well. We show some performance improvements for unmodified applications, and we'll have results in this talk, and Tommy will talk about examples of deeper optimizations which show further benefit.

Jumping over very quickly to the basic architecture: how this unikernel works and how it differs from a normal unikernel and from normal Linux. On your screen you can see how a normal system works. Applications make function calls into glibc or other such C libraries, and those libraries then make system calls into the kernel. Applications can make system calls directly as well, but let's take the simpler case. In UKL, the application, glibc, and all other application-level libraries, whatever is part of the application package, get statically linked with the kernel binary, so the final vmlinux actually contains everything, statically linked together. The system calls are replaced by function calls, so there are no system calls in UKL.
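As a rough illustration of that idea (a minimal sketch, not the actual UKL patch: the entry function ukl_do_syscall() shown here is a hypothetical stand-in for whatever symbol the unikernel build exposes), a libc-level write() wrapper that would normally execute a syscall instruction could instead compile down to an ordinary C call that is resolved when linking against the kernel:

```c
/* Hypothetical sketch of "system calls become function calls".
 * ukl_do_syscall() is an assumed name standing in for the kernel's
 * syscall dispatch entry; it is not a confirmed UKL symbol. */
#include <sys/syscall.h>   /* SYS_write */
#include <sys/types.h>     /* ssize_t, size_t */

long ukl_do_syscall(long nr, long a1, long a2, long a3,
                    long a4, long a5, long a6);   /* resolved when linking with vmlinux */

ssize_t write(int fd, const void *buf, size_t count)
{
    /* The same arguments glibc would load into registers before a
     * syscall instruction, passed here as a plain function call. */
    return (ssize_t)ukl_do_syscall(SYS_write, fd, (long)buf, (long)count,
                                   0, 0, 0);
}
```

In the unbypassed case described later, such a call would still land on the kernel's existing entry path, so signals, scheduling, and RCU work keep happening at the usual points.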
What about the memory layout? Normally, the kernel lives at the higher end of the address space and the application binary lives at the lower end. Then comes the heap, which grows upward, a stack which grows downward, and space for different VMAs in between. In UKL we follow almost the same memory model, except that because the application is statically linked with the kernel, the application text, data, and other sections live with the kernel binary up top, and everything else is exactly the same.

How do we deal with system calls, the kernel entry and exit path? Normally, when you make a system call, you come to the syscall entry point; in Linux on x86-64 this is entry_SYSCALL_64. You first switch from the user stack to the kernel stack, then run some entry code, which includes RCU and related bookkeeping, and then you reach the underlying kernel functionality. On the way back you run exit code, which includes things like scheduling and signal handling, and finally you switch back to the user stack and return to user space. In UKL, because we live in kernel space, there are no system calls; we use function calls, but we make those function calls to the same syscall entry point. Instead of reinventing the wheel, figuring out how to handle signals, how to do scheduling, and where to do the stack switches, we simply reuse all of the code Linux already has. This also means the application code gets the environment it expects, and the kernel is dealing with application code that is scheduled in and out and hits RCU at clearly well-defined points. We reuse the kernel entry and exit code.

I want to get to the results, but very briefly I want to mention some of the major problems we had to deal with over the years; these are just a small subset. First, the exec mechanism. It was Tommy who realized that if we follow the normal exec path that the Linux kernel uses to execute any application, it helps us create an environment where everything is properly set up: for example, the task struct and the memory-management structures. The only two differences, and these are hidden behind ifdefs and if conditions, are that we stay in kernel space rather than returning to user space, and that we follow the entire exec path without actually having an ELF binary, so we do not run those parts of the code. Otherwise everything is properly set up. This also allows glibc to do all its initializations when the glibc code is called, because it gets an environment which is properly set up.

I'll talk about page faults very quickly. Since we are always in kernel mode, the hardware stack switch to the kernel stack does not happen when faults or interrupts occur. When we run out of user stack, the resulting page fault has no stack to run on, so the original fault cannot be serviced and it escalates to a double fault. There are two mechanisms for dealing with that in UKL, and you choose one of them through a config option. The first is to treat it as a double fault: let it escalate, and handle it on the double fault's dedicated stack. The second is to make all page faults use a dedicated stack through the interrupt stack table (IST) mechanism. I'm happy to talk about these things offline.

Not going into any detail here: the kernel entry and exit assembly code changes.
When you enter or exit the kernel, be it via system calls (or what used to be system calls), interrupts, or faults, there is assembly code which bases its logic on the CS value saved on the stack: it tells you whether you came from user space or from kernel space. Because we are always in kernel space, that CS value on the stack is always the kernel's CS value. So we added mechanisms to keep track of whether we came from application code or from kernel code, and we reuse that entire code path, so any application running as a unikernel experiences the same environment it would have running in user space, only with the performance benefits of running as a unikernel. We also had to make changes to the linker scripts to add sections like TLS, which are normally not part of the kernel binary but are part of the application binary. These are a few of the changes and problems we had to fix; there is plenty of detail there if anyone wants to talk about it.

Now the basic optimizations. We talked earlier about how we reuse the kernel's syscall entry points. We found that we can do an optimization in the entry code: instead of switching to the kernel stack, we stay on the user stack, and on the way back there is likewise no stack switch. That gives a slight performance benefit, and we'll show the results in a bit.

Also, because everything is statically linked together, you do not have to go through the syscall entry points to reach kernel functionality: you can bypass the kernel entry and exit code and directly invoke the underlying kernel function. This has big performance advantages, because it means that when going from application code to kernel code you do not have to be scheduled out; it is a run-to-completion kind of path, with no RCU or other bookkeeping happening. But this should not be done indefinitely, because the kernel has to do that bookkeeping, signal handling, and other work at some point. We also found that if you bypass a very large number of these entry and exit paths, the performance gain diminishes, so there is no point in doing it for a large count, and the tail latency goes up, because whenever the kernel finally gets the chance it does all of its backlog work at that point. So we choose a number between 10 and 20 based on the application; a proper analysis is needed to decide which number is well suited. It is a per-thread setting. All the application developer has to do is call the UKL set-bypass function wherever the performance-critical code starts and call it again when that code ends, or, if you do not even want to do that, call it once at the start of the application, and automatically some number, say 10, of entry and exit paths will be bypassed and then one will go through the normal path so that all of the bookkeeping work gets done. All of this is automatic, because Linux generates function stubs through its SYSCALL_DEFINE macros and glibc calls those stubs through its own macros, so the code changes are extremely minimal and the application developer does not have to do anything there.
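As a concrete picture of that workflow, here is a minimal usage sketch. The function ukl_set_bypass() and its enable/disable argument are assumptions made up for illustration (the talk only says there is a per-thread "set bypass" call); the real UKL interface and the way the 10-to-20 transition budget is configured may well differ.

```c
/* Hypothetical sketch of bracketing a performance-critical region with the
 * per-thread bypass knob described in the talk.  ukl_set_bypass() is an
 * assumed prototype, not a confirmed UKL API. */
#include <stddef.h>
#include <unistd.h>

void ukl_set_bypass(int enable);   /* assumed: 1 = bypass entry/exit work, 0 = normal path */

static void hot_path(int fd, const char *msg, size_t len, int iters)
{
    ukl_set_bypass(1);             /* start skipping entry/exit bookkeeping */

    for (int i = 0; i < iters; i++)
        write(fd, msg, len);       /* in UKL this is a direct function call into the kernel */

    ukl_set_bypass(0);             /* resume the normal path so RCU, signal,
                                      and scheduling work can catch up */
}

int main(void)
{
    hot_path(STDOUT_FILENO, "ping\n", 5, 1000);
    return 0;
}
```

In the "call it once at the start" mode Ali mentions, the first call would be enough: the kernel then lets one transition in every batch of ten or twenty go through the full entry/exit path.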
Also, as you can see here, we do not actually change or take out any of the Linux kernel code; the changes are ifdef'd inside the Linux kernel code. Last I checked we have about 1,200 lines of insertions and 400 deletions, and those deletions are not really deletions, they are just reorganizing code inside different if conditions. So we do not take code out of the Linux kernel, which means all of the Linux kernel functionality is still there, and you can actually run a full sidecar user space while your unikernel runs in kernel space. This is totally optional: you can use it to ssh into your unikernel and manage the system, and we use it for analysis, which Tommy will get to, but production unikernels do not need it; they can be very slim and have no user space at all.

Finally, very quickly: we do require applications to be rebuilt, with the kernel memory model (because the application has to be linked with kernel code) and with no red zone (because on interrupts and faults there is no stack switch happening, so the red zone could be trampled). Those are the two flags we need for recompilation, -mcmodel=kernel and -mno-red-zone. You just compile the application, the libraries, and everything together, do a partial link, and then the kernel's final linking step creates the final vmlinux binary.

Very quickly, the results. You'll see Linux there, which for a getppid call has a latency of 299 nanoseconds. UKL process is basically the same workload running as a sidecar in user space, again 299 nanoseconds, so the UKL changes by themselves do not introduce any regression. Then there are the UKL kernel-stack and user-stack variants: KS means that when you go to kernel code it switches to the kernel stack, US means it stays on the user stack, and you can see the numbers there. The bypass variants are exactly the same cases, except we bypass the entry and exit code, and you can see a big benefit there compared to normal Linux.

We also ran microbenchmarks. This one reads a buffer from memory. I'll just talk about the blue line, which is Linux, and the red line at the very bottom, which is the highly optimized unikernel case where you stay on the user stack and also bypass N entry and exit paths. The smaller graph shows the spread of the different measurements we took, so you can see the variance. There is an offset: we do better than normal Linux. Write shows similar results; apologies for going over these quickly. munmap shows similar results too, and so does page fault; that is where we fault in every page of the buffer, and as we go up the difference increases because it is the accumulation of all the page faults: for a buffer of five pages you have to fault in the first four as well, so it stacks up.

These are memcached results running bare metal, using all the pthread and libevent libraries it wants, with totally unmodified code. For tail latency, at 500,000 QPS you see an 11% improvement. And these are Redis results, where you see a 9% benefit in tail latency and a 21% benefit in throughput. So, very quickly, those were the results, and over to Tommy.

Thanks, Ali. Maybe you could give me a nod if you can hear me. Sweet. All right, let me share my screen here for a second, if it lets me. Ali, just as you helped me debug the kernel, do you want to help me debug this screen-sharing setting? Sure, why don't you come here and we can do a quick... I'm sorry, we're in some kind of matrix now. All right, well, in the time remaining, I assume someone will start playing some music and kick me off this thing eventually.
So I thought it would be fun to do a bit of a demo, the idea being to make it a little more concrete what some of these shortcuts really are and how they work. I'm Tommy Unger; I've been working with Ali for a while now.

Where did we leave off? I think Ali was talking about some of these Redis latency results; you're seeing those, right? Let's talk through three of these cases. Once we've taken our Redis application, completely unmodified, and linked it together with the Linux kernel running as UKL, we have the opportunity to do some pretty cool optimization. The scope of the whole UKL project is huge: you could imagine doing profile-guided optimizations, link-time optimizations, zero-copy I/O on networking paths. But I just want to talk through one option, which is making shortcuts directly from your application into kernel paths. The exciting thing there is that it opens up the API that applications can actually use. Normally that is just the system call interface, but we have the opportunity to open up the entire kernel and turn it into a box of building blocks that you can make use of.

Looking at a couple of these graphs: at the top is a normal Linux system, and what you're looking at is round-trip latencies between two bare-metal servers. A benchmark sends requests to the server, the server sends responses back, and the plot is a histogram: latencies on the x-axis and counts on the curves, with a couple of points picked out, like the 99th-percentile tail latency up here. The next curve is the shortcut that Ali talked about: intercepting the application at the glibc layer and vectoring directly into the system call handlers, skipping a lot of that intermediate code which checks things like pending signals and RCU quiescent points; there is a lot of work going on there. When we skip over that, this is the graph we recover, and the interesting point is that we cut 10% off that 99th-percentile tail, which corresponded to a 20% throughput win on that path. Something I played around with a bit was deeper shortcuts: now that we have the opportunity to call into arbitrary points in these kernel functions, what if we went even further? That got us to a 23% improvement on the 99th-percentile tail and a 33% throughput increase.
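To make the "shortcut" idea a bit more concrete before the flame graphs, here is a rough sketch of what such a call path could look like in a single-address-space build. It is illustrative only: it assumes application code linked with the kernel may call kernel symbols directly; ksys_write() is the in-kernel handler behind the write() system call in recent Linux, but wiring it up this way is not necessarily how UKL actually does it.

```c
/* Rough sketch of a shortcut in a UKL-style single-address-space build.
 * Assumes the application is statically linked with vmlinux and may call
 * kernel functions directly; this is not the actual UKL mechanism. */
#include <sys/types.h>   /* ssize_t, size_t */

/* In-kernel system call handler for write() (declared in the kernel's own
 * headers; repeated here only so the sketch is self-contained). */
ssize_t ksys_write(unsigned int fd, const char *buf, size_t count);

/* A shortcut wrapper: no syscall instruction, no entry_SYSCALL_64, no
 * entry/exit bookkeeping, just an ordinary call into the handler. */
static inline ssize_t shortcut_write(int fd, const char *buf, size_t count)
{
    return ksys_write((unsigned int)fd, buf, count);
}
```

The deeper shortcuts described next would start below ksys_write, past the file-descriptor and VFS lookups, closer to the socket send path, which is where the extra tail-latency and throughput gains come from.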
But that's probably pretty vague, and not something most people have an intuitive picture of, so I thought that in the remaining time I would show you some pictures that might make it a little more compelling. Here we go; let's see how much trouble we can get into. (Guys, we have five minutes left, and we do have some questions as well.) Great, I'll give you the one-minute version if I can.

Let's start up a Redis server and fire up a benchmark to give it some work to do. I guess I've never run a high-throughput, low-latency server while presenting before, and apparently my machine can't pull that off. But I kicked off that server and generated this profile, which just periodically samples where the computation is and performs a backtrace: what function are you currently running, and why are you running it? When you process that data you can generate these flame graphs, which I find really useful for understanding what's going on in those processes. So here's the result of profiling that server: down here is some of the Redis code, then you can see a jump into glibc, Redis calls glibc, and then there's a transfer into kernel mode. When we talk about these shortcuts, the ability to jump around the system call entry and exit paths, what we're really talking about is going directly from this write call into ksys_write, the kernel's system call handler for the write system call. And when we talk about the deeper shortcuts: there is so much work on every single one of these calls, where the kernel takes that opaque file descriptor, a write to file descriptor 7, and says, I have to go through the virtual file system layer; that thing's a socket; what protocol is it using, is it UDP, no, it's TCP; okay, now we're finally at the point where we can start the network protocol work that actually sends this thing off. I'll cut it off there, but the point is that just by building our application together with the kernel into a single binary, we give ourselves a lot of opportunities for optimization, and one of them is exploiting this much wider API: instead of just the system call interface, you can jump into arbitrary kernel paths. So I'll stop there.

Thank you, Thomas and Ali. Let me read out the questions for the recording so that you can address them. We have questions from Wander Costa; there are three sets. The first question is: do real-time embedded kernels like TNKernel and FreeRTOS fall into the category of unikernels?

Yeah, I don't really know TNKernel, but real-time operating systems do not fall under the category of unikernels, because for a unikernel you have to have a single address space, everything linked together, a kind of static linking where you can call any function anywhere. But what we do have in the pipeline here at BU and Red Hat is that we want to take whatever mechanisms real-time operating systems and the real-time patches of Linux have and try them on UKL. Now that you have much more flexibility, can you change the scheduler, can you make other decisions, without being restricted by the syscall interface? So I think real-time operating systems are orthogonal to the unikernel effort, but yes, the efforts can be combined.

All right, another question: what processor families or models do you implement the kernel on?

Right now UKL runs on x86-64 Intel processors. Again, in the pipeline we have students who want to do an ARM port of this, but right now it is only Intel x86-64.

The third and last one: how does the performance compare with the approach of splitting the application into user and kernel parts, either through a kernel module or an eBPF application?

I'll talk about the kernel module part. In UKL you do not have to be restricted by the kernel programming model: you can take your application, the entire thing with all its libraries, nothing ported to kernel code, and run it in the kernel as a unikernel. I think Tommy will talk about eBPF in more detail, but eBPF has a security assumption that there are multiple different processes running on the machine, and eBPF is then the secure way of doing the computation. In UKL you do not have that restriction, because the idea is that these are single-user machines;
they are not general purpose machines. So now you do not have to be restricted by what BPF allows you, what the BPF compiler allows you, whether your code will get through the static checking or not. You have your entire application, you have the entire kernel, you can access different drivers if you know what you're doing; the sky is the limit.

I think it is an in-depth conversation, how these models compare to module programming or BPF. With BPF, if the code you care about can be injected into the prologue or epilogue of a kernel function and that is enough to implement what you need, that's great. But I think UKL opens up a lot of cool spaces that you just can't reach when you aren't jamming those two programming models together. An example would be running profile-guided optimization across your application and the kernel paths at the same time. So they are in some ways complementary, and in some ways they have slightly different research spaces, but we love BPF and use it all the time for profiling and prototyping. One interesting thing is that you can write code in the application model and then run it in ring zero on UKL; that is something that, in the restricted model of module programming, you don't really have full access to, for example having glibc underneath you. If you wanted to put a machine learning algorithm on what used to be a kernel path, you could use these high-level tools; you could run Python code in the middle of what you would normally consider a kernel path. I think they are very different projects, but I'll stop there.

Harry just gave us a comment that the person who gave the BPF talk also said these were in-depth questions, but he would be interested in talking to you about it; Noah, apologies if I've heard his name poorly. Thank you very much, Ali and Thomas, for this wonderful talk, and we will be moving over to our next event, which is the closing ceremony and trivia.