Hi everyone, I'm Ali, and I'll be presenting our work, which is still underway, on Unikernel Linux. You might already know many of the team members: some are from Red Hat, some from Boston University. There's Uli, we have Richard Jones from Red Hat, Larry Woodman, and Daniel, who joined us very recently. I'll start off with why we're doing this, the motivation, then share what we've got so far and what we're trying to do now.

As you already know, operating systems have been designed to multiplex resources between different applications and different users, and they do a really good job of that. But things are changing. Many applications are now deployed as single-application virtual machines, and even as distributed systems where one application is spread across many single-purpose virtual machines. Deployment models are changing, and in some cases, not all, we don't really need the operating system to multiplex resources between different applications and users, so all the logic it carries for that multiplexing is unnecessary. We also see some applications being re-implemented to bypass the kernel with libraries such as DPDK and SPDK, because for some workloads the kernel is just in the way.

Researchers have been trying to find a solution to this, and one of the solutions they've come up with is the unikernel. So, what is a unikernel? This is how things normally run: you have different privilege levels, the kernel runs in ring zero, and you have shared libraries and different processes. In a unikernel there is just one application plus all the functionality it needs. There's a flat address space, with no separation between kernel space and user space, because there's just one application, and all the functionality you need from the shared libraries and the operating system is statically combined into a single binary. That binary is small and lightweight compared to a normal kernel, and, importantly, because everything is statically linked together, you can apply lots of optimizations to the final binary: link-time optimizations, profile-driven optimizations, things like that.

Researchers have shown with many different unikernels that a lot of optimizations are possible. For example, EbbRT, which Jim and Tommy here work on, has shown a more than two-times improvement in memory performance at the 99th-percentile tail latency compared to Linux, and there are many other examples of unikernels outperforming traditional operating systems.

So why have these research unikernels not been adopted in production? The reason lies in how they are developed. Some researchers take the clean-slate approach and develop all the code from scratch. The other approach is to take an existing kernel, for example NetBSD, and strip it down to create a unikernel, which is what Rumprun does. In both cases the resulting unikernels have great optimizations, as we saw, but as with any operating system it would take years to build a community around them that can support them and keep them updated. In the meantime, there's no real adoption of these in production.
So we asked ourselves a question, and mostly it was Uli and my advisor Orran asking the questions: why can't we do this with Linux? We didn't know the answer. Can we make a unikernel based on Linux? And if it is possible, can it live as an upstreamable target? This is extremely important, because if it is an upstreamable target, the entire community can maintain it and support it; we don't just want another huge code base sitting on the side that we have to maintain ourselves. If this is possible, it will have the better-tested code that Linux has, the large community I already mentioned, and also the performance advantages that unikernels have. Now, we didn't know how much. Some unikernels, as I said, have shown great performance advantages. Would a Linux unikernel get those same advantages, or get close to them? And what different optimizations would be possible?

With these questions in mind, we decided to give it a shot. Our requirements from the very start were simple. We had to run unmodified applications and libraries, with minimal changes in Linux and glibc, because if the changes were huge there would be no chance of getting them accepted upstream. And from the get-go we had to demonstrate some performance benefits to get any interest from the community. The most obvious ways to do that were eliminating the ring-transition overheads, since without different rings and different address spaces we can have function calls in place of system calls, and, since everything is statically linked together, cross-layer optimizations.

There are existing approaches that try to intermix kernel and application code. User Mode Linux does not give you a single address space; there is still a separation between the kernel and the application, so you cannot do those optimizations. The Linux Kernel Library, a very stripped-down version of the kernel that runs in user space, and LibOS, which is essentially just the network stack as a library in user space, don't run unmodified applications and libraries, and as I said they are extremely stripped-down versions of Linux, so they still don't meet our goals. Other approaches include implementing applications as Linux kernel modules, which again means huge changes to applications since everything has to be ported to kernel code, and Kernel Mode Linux, which lets the application run in ring zero but keeps the separation between kernel and application, so you cannot do the cross-layer optimizations.

So we decided to build Unikernel Linux, and I'll give you the high-level architecture overview. Normally, the kernel comes up, does the first execve, and brings up user space, which holds the applications, the shared libraries, and so on. Applications make function calls into glibc, and glibc makes system calls into the kernel. In our case, since there is no separation between address spaces, we don't need those system calls. Instead, when the kernel comes up, it calls a symbol, main, which is defined in the application: a plain function call. The application makes function calls into glibc as usual. And for glibc to make function calls into the kernel, we added a very small stub library, which we call the UKL library. All it does is call the required functionality from within the kernel: instead of glibc making the write syscall, it makes a ukl_write function call into the UKL library, which has stubs for the functionality and calls into the kernel.
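As a concrete illustration, here is a minimal sketch of what such a stub could look like. This is not the project's actual code: the name ukl_write comes from the talk, the exact signature is my assumption, and ksys_write() is the kernel-internal helper that the regular write(2) syscall handler also uses.

```c
/* Minimal sketch of a UKL-style syscall stub (illustrative, not the
 * project's actual code). glibc is patched so that, instead of
 * executing a `syscall` instruction for write(2), it makes an
 * ordinary function call to ukl_write(), which forwards to the
 * kernel-internal entry point in the same address space. */
#include <linux/syscalls.h>     /* declares ksys_write() */

ssize_t ukl_write(unsigned int fd, const char *buf, size_t count)
{
        /* No ring transition, no syscall entry code: just a call. */
        return ksys_write(fd, buf, count);
}
```

The point is that the whole syscall path collapses into a direct call that the compiler and linker can see.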
All of this is statically linked together. That's the high-level architecture overview.

I'll talk about our current build process. Normally, when the kernel builds, it produces different archives, the built-in object files in the different directories, plus some library archives, which are then linked together into vmlinux; as many of you already know, this linking stage is basically a few scripts that link everything together. We now have more things that need to be linked in: the application, glibc, the UKL library, and any other libraries you're using, all as archives and object files. So what we did for now is change the kernel linking stage to include all of these, and then we just run a kernel make and the entire thing becomes a unikernel. What we want in the future is for you to be able to use your existing Makefiles: if you're building some application as a unikernel, you just run make and select the compiler as a UKL gcc wrapper or something like that, and your application gets built into a unikernel. That's where we want to go; the kernel-make flow is what we currently do.

Now I'll talk briefly about the different challenges we faced in building this unikernel, and after that I'll share the current status and the results we got.

First, conflicting thread models: the kernel has kthreads, and applications run pthreads. How do we get one consistent thread model without changing a lot? Why can't we simply use both as-is? Pthreads, as you know, use a large set of registers; kthreads don't, and each has its own idiosyncrasies. For example, pthreads use a register for thread-local storage, while the kernel uses a register for its own per-CPU context storage. So how do we merge these two conflicting thread models? Should we implement everything as pthreads? That would mean a huge rewrite of the entire Linux kernel; we don't want that. Should we implement all threads as kthreads? Then we'd lose all the performance optimizations glibc gets from the large register set; we don't want that either. Can we just keep switching between the two thread models? We don't want the extra stack switches. In the end, after spending a lot of time on it, we found that merging the two thread models is actually simple, and most of that is based on luck: the kernel uses the gs register for per-CPU storage, and glibc uses the fs register for thread-local storage. If that were not the case, we would have had to do something else, but as it stands everything works fine.
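To make that register split concrete, here is a small, runnable user-space illustration (my example, not UKL code): on x86-64 Linux, glibc addresses a __thread variable through the %fs segment base, while the kernel keeps its per-CPU data behind %gs, so the two uses never collide.

```c
/* Illustration of the register split the talk relies on. Compile
 * with `gcc -O2 -S` and look at the assembly: the access to
 * `counter` is a %fs-relative load, while the kernel reserves %gs
 * for its per-CPU data. */
#include <stdio.h>

static __thread long counter;   /* one copy per thread, found via %fs */

int main(void)
{
        counter++;
        printf("counter = %ld\n", counter);
        return 0;
}
```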
Here is how it works. The kernel comes up running as a kthread, and that kthread essentially becomes a pthread: it runs the application code. But for this first, primordial thread to run application code, we have to set it up the way glibc expects the first thread to be: the thread-local storage is properly set up, the thread control block right next to it is properly set up, and so on, through all the initialization functions that glibc has. So after the kernel comes up, instead of doing the first execve, we go into glibc's initialization functions. We modified them a bit so they can work together, and when I say modified, I mean the number of arguments a function takes, things like that, nothing major. We set everything up for the first primordial thread, it starts running the code, and after that all the other threads just work, because pthread_create takes care of them. These threads can execute kernel code, and they can execute application and glibc code. So this all worked; in the end it turned out to be easy, but it took a lot of investigation. The interesting thing is that we didn't change kthreads in any way, so all the background kernel tasks still run as kthreads; no changes there.

Yes, the question was how we got all the different pthread functions to work, pthread_create for example. I'll give you the example of the clone call. Clone is special, as you know, because it returns twice, so we need separate stacks, which are normally handled by the syscall entry code. Normally, when you go into the clone call, the syscall entry code takes care of everything: it works on a temporary stack, creates the new thread, moves to another temporary stack, and then returns, and when it returns everything is set up properly. Since we're not actually doing the syscall, we had to modify this slightly: we make a function call into our own function in the same file, and all it does is mimic what the kernel does, except on the same stacks instead of temporary ones. Then the new thread returns with its stack properly set up, and the original thread works just fine, so clone also just worked. We did have to introduce a new UKL clone flag because of the changes we made; we didn't want them to affect the normal do_fork functionality and so on.

Okay, and this is what we were looking at recently: how do we switch between different pthreads? The thing is, scheduling normally happens at the user/kernel boundary. Once you do a syscall, you give up control to the kernel; the kernel does its housekeeping and might switch tasks around. Since we don't have that boundary, the pthreads were not being scheduled properly: if a thread runs, it keeps running, because that's what happens in a non-preemptible kernel. So at first we just used a preemptible kernel, and everything works. Also, Daniel de Oliveira, who recently joined our team (he works at Red Hat and was interested in the project, so he started contributing), has written a patch that mimics this kernel/user boundary. Because we have the UKL library, there is a very clean point where we know we're going into kernel code, so he mimics the boundary there, and now it works on a non-preemptible kernel as well.
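Here is a hedged sketch of the general idea, as I understand it from the talk; this is not Daniel's actual patch. Since every kernel entry now funnels through the UKL stubs, the stub's return path is a natural place to give the scheduler the chance it would normally get on return to user space; cond_resched() is the standard kernel API for such a voluntary preemption point.

```c
/* Hypothetical illustration of mimicking the user/kernel boundary
 * in the UKL stub layer (not the actual patch). On a non-preemptible
 * kernel, a thread that never crosses the syscall boundary never
 * gives the scheduler a chance to run; a voluntary preemption point
 * on the stub's return path restores that behavior. */
#include <linux/sched.h>        /* cond_resched() */
#include <linux/syscalls.h>     /* ksys_write() */

ssize_t ukl_write(unsigned int fd, const char *buf, size_t count)
{
        ssize_t ret = ksys_write(fd, buf, count);

        /* Mimic the return-to-user path: if another task is ready,
         * let it run now, as the syscall exit code would. */
        cond_resched();
        return ret;
}
```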
Next, memory management. As you all know, the application normally lives in the lower range of the address space and the kernel lives in the upper range: the application's text and data are low, then there's the mmap area, and the kernel's text and data are in the higher range, and each of these ranges has its own memory-management primitives. Since everything is now statically linked together, the application's text and data live alongside the kernel's. So what do we do about mmap, and actually, what do we do about malloc? Should we just map malloc to vmalloc? That would break a lot of applications, as Larry always points out, because they rely on bit arithmetic on pointers, what the last bits are, things like that. And we want to keep the general-purpose, performance-optimized functionality of glibc's malloc, not fall back to simple vmalloc. What about mmap: which range do we select for it? The problem we faced was that the first kthread's memory-management structures are not set up to use the lower, user end of memory. So for that first kthread which becomes the pthread, before the glibc initializations we do the memory-struct initializations as well. We initialize everything properly, the thread knows it has access to the lower end of memory, and as soon as it goes into glibc, glibc's malloc calls mmap and we start getting memory from the lower range. Everything works like it should.

This also helps with future optimizations. For example, if in the future we allocate a buffer, say a network buffer, in the kernel address space, then instead of doing copy_to_user and copy_from_user we can just pass the pointer to that buffer along, and we get zero-copy networking. And when it's time to free that buffer, we can just check where it lies, in the user range or the kernel range, and free it accordingly.
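A hedged sketch of that free-by-address-range idea (my illustration, not project code): in a UKL binary, glibc's free() and the kernel's kfree() are linked into the same image, so a helper can pick the right one by looking at where the pointer lives. UKL_USER_TOP is a hypothetical constant marking the end of the user half of the x86-64 address space; a real implementation would use the kernel's own limit, such as TASK_SIZE_MAX.

```c
/* Illustrative only: dispatch a free based on the address range the
 * buffer lives in. Assumes a UKL-style single image where both the
 * glibc and kernel allocators are present. */
#include <stdlib.h>               /* glibc free() */

extern void kfree(const void *);  /* kernel allocator, linked in */

#define UKL_USER_TOP 0x0000800000000000UL  /* hypothetical split point */

void ukl_free_buffer(void *buf)
{
        if ((unsigned long)buf < UKL_USER_TOP)
                free(buf);        /* allocated from the user range */
        else
                kfree(buf);       /* allocated from the kernel range */
}
```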
Another issue we faced was namespace collisions: glibc internally has routines with the same names as kernel routines, memmove and memset for example. How do we fix these? Do we rename them? Do we keep only the kernel versions? As I said, we don't want to give up the performance-optimized glibc versions. Should we keep only the glibc versions? Then a lot of things in the kernel would break and we'd have to fix them. Earlier we just suppressed glibc's versions and used the kernel versions, but now we partially link glibc, the application, and all the libraries separately, link the kernel separately, and then do a final linking step, so everything resolves correctly.

A little bit about the implementation. We added a Linux config option so that all our changes to the kernel are nicely separated in #ifdefs: turn the Unikernel Linux flag off and what you build is a normal Linux kernel. As I said, we changed the kernel linker scripts, because we now have segments that are not usually present in the kernel, the thread-local-storage (TLS) segments for example, so we had to change that a bit. We're using the latest kernel, which is now 5.3; I forgot to update the slide. We have around 500 modified lines in the kernel, and that's a highly inflated number, because a lot of it is just the clone call, which is a lot of assembly; really it's just a couple of hundred changed lines, and most of those are simple #ifdefs and changing a syscall name to a function-call name, for example write to ukl_write. Very minimal, non-invasive changes. In glibc, wherever the code base makes a syscall, we rewrite it as a function call, and all those changes live in a separate subdirectory, nicely separated from the rest of the glibc code. This fulfills our earlier goal of minimal changes in Linux and glibc, so that when this is ready we have a higher chance of getting accepted upstream. People who work with the Linux community a lot, Larry for example, think these changes are minimal and non-invasive enough; I have no experience there myself, so I'm going with what they say.

For the initial evaluation we did a simple TCP echo server. We deployed a simple C program as a normal application in user space, on a Linux kernel running in QEMU, and then we took the same C code, built it as a unikernel, and deployed that inside QEMU as well; same code, single-threaded. Compared to the normal case, where the TCP echo server runs as an application on a normal kernel, the same code running as a unikernel showed less than half the average latency and a 41% lower 99th-percentile tail latency.
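For reference, the workload was along these lines; this is my reconstruction of the kind of program described, not the actual benchmark code, and the port number and buffer size are arbitrary. The same source would be compiled once as a normal Linux process and once into the UKL image.

```c
/* Minimal single-threaded TCP echo server in the spirit of the
 * benchmark described in the talk (a reconstruction; error handling
 * omitted for brevity). */
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
        int srv = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr = {
                .sin_family = AF_INET,
                .sin_port   = htons(7777),
                .sin_addr   = { .s_addr = htonl(INADDR_ANY) },
        };

        bind(srv, (struct sockaddr *)&addr, sizeof(addr));
        listen(srv, 16);

        for (;;) {
                int c = accept(srv, NULL, NULL);
                char buf[4096];
                ssize_t n;

                /* Echo bytes back until the client disconnects. Under
                 * UKL, each read()/write() below becomes a direct
                 * function call into the kernel instead of a syscall. */
                while ((n = read(c, buf, sizeof(buf))) > 0)
                        write(c, buf, n);
                close(c);
        }
}
```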
But we don't know where these numbers are coming from. What is driving these performance gains? Is it just the system-call overhead we're getting rid of, or is there something else? So now we have interns working on getting perf to run inside the unikernel as a simple kthread, and on running perf kvm from the outside, to see what is actually going on. As Orran says all the time, we should not stare at these numbers: the main interesting result is that we have an existence proof that this works. We don't know where the numbers come from yet, and we'll do more experiments and a proper performance evaluation later to find out.

A bit about current status. As I said, we have pthreads working, so we have multi-threaded support; we have a multi-threaded TCP server running as well, and simple multi-threaded things generally all work. What we're doing now is trying to run memcached as a unikernel, and I feel it's very close to completion: when we run it, the different threads spawn and so on, and right now it fails because somewhere in the code there's a syscall we haven't converted yet. So it's very close. The very cool thing, I felt, was that memcached, compared to a simple TCP server, is a bigger code base, so we had memcached, libevent, glibc, and the Linux kernel all linked together perfectly; the kernel boots up, goes into the memcached code, and everything runs. I found that very cool. Soon, when we have memcached working, we'll update our git repo with how you can build and test it.

As for experiences: we feel we spent a lot of time investigating different things, and in the end the changes needed to make something work were just a couple of lines, and that happened all the time. In a way it feels like this is something that is meant to be: there's a huge problem, and then a couple-of-line change solves it.

Limitations: as you all know, fork doesn't work, which means you can have as many threads as you want, but you can't have different processes running, because, as you know, it's a unikernel. We have a few development tasks: as I told you, we're working on memcached, and on getting to the point where you can just use your own Makefiles and type make. When everything is mature enough, we'll think about how to package this so we can send it upstream, or at least begin a discussion upstream. And after that, all the fun begins: we can do all sorts of optimizations. Can we do link-time optimizations? Can we do profile-guided optimizations? How much performance benefit do they give us, and can we show performance close to what the different research unikernels have shown?

Just to conclude: research unikernels have shown that there are advantages to be had, and Linux has always integrated new ideas into its code base. We feel that unikernels can be the next natural step for Linux, and our prototype and the work we've done so far give us some confidence that there are performance benefits we can get with modest changes. Thank you. Questions?

Q: Is this x86-64 only?
A: Yes, right now.

Q: Are there any plans for anyone to work on other architectures?
A: If people want to join us, the changes are minimal.

Q: I'm wondering what sort of changes would be required for something like POWER.
A: If some architecture uses the same register for thread-local storage and per-CPU storage, then we might have to do something fancy.

Q: Have you looked at actually deploying these in a platform like Kubernetes? Have you tried running any of these in Kubernetes?
A: Well, one of the motivations behind this is obviously to make function-as-a-service faster. So one of the targets we're going to look at is getting scripting languages like Node.js up and running, and then we'll do the measurements. Remember Tommy's talk yesterday about caching to cut start-up time? We can do that with this as well: we can partially bring up the interpreter and get extremely fast start-up times. So function-as-a-service is a target.

Q: This is really great work; I've followed it ever since you published the paper. You preempted something I was going to ask about, though: you talked about interpreters, so I was wondering what happens when something wants a file system, or wants to load things from file systems. Is that going to be difficult with this model? What are your thoughts there?
A: (Uli) Let me answer this one, because it's something I've been thinking about for a long time; they're doing the work, so the rest of us are just cheering from the side. One of the start-up mechanisms we've been thinking about, which I want us to implement, is that you need the equivalent of the init scripts in there as well, and we'll make that part of the build system. Instead of having scripts, we'll basically have a kind of compiler for these things, which translates a description of what your init is supposed to look like into code that is linked in and executed before we jump into main. With that you can mount any arbitrary file system; the same functionality is there, it's an init-like function. Where scripts would have set up the network, you now have it hard-coded in some way or form, a compiled version of the same thing. This happens before we get to main; it just has to be generalized.
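To illustrate the kind of generated init code described in that answer, here is a hedged sketch. The function name, the mount, and the environment variable are all hypothetical examples of what a build-time "init compiler" might emit from a declarative description; a real UKL build might call kernel-internal helpers directly rather than the libc wrappers.

```c
/* Hypothetical output of the "init compiler" idea from the Q&A: a
 * build-time tool translates a declarative description of the boot
 * environment into plain code like this, linked into the unikernel
 * and run before the application's main(). */
#include <stdlib.h>
#include <sys/mount.h>

void ukl_generated_init(void)   /* hypothetical, runs before main() */
{
        /* From a description line like "mount a 9p share at /data"
         * (a common way to get host files into a QEMU guest). */
        if (mount("data0", "/data", "9p", 0, "trans=virtio") != 0)
                abort();

        /* From a description line like "set MEMCACHED_PORT=11211". */
        setenv("MEMCACHED_PORT", "11211", 1);
}
```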
Q: By structuring this as a target, does it become incompatible with other targets, such as User Mode Linux? I guess my question is: would it be possible to essentially build a User Mode Linux unikernel?
A: I haven't thought about this, and I very much doubt it would be me doing it; given the lack of interest in UML, it might linger around for a while before someone does the work. So in theory it could work, yes, but UML is literally its own arch as a target today, and we are introducing our changes into the x86 arch, so it would take some restructuring, I guess. Thank you.

Any more questions? Thank you.