Hi everyone, welcome. My name is Martin and this is my collaborator, Ricardo, and we're going to be talking about Solo5, which is something we're calling a retargetable, sandboxed execution environment for unikernels. That's a lot of words, so we'll try to explain some of the concepts if I can switch to the next slide.

So, actually Ricardo told me that in the Unikraft talk just before this, a bunch of people were asking what a unikernel is, so it's good that we have some background on this. There are two separate terms that people use here. They're similar and related to each other, but I think it's important to draw a distinction between them.

A library operating system: these are things that have been around since the early '90s. They actually come from even older ideas; Butler Lampson had some papers, I think in the '70s, that were using similar concepts. A library operating system is essentially just a collection of libraries, except those libraries do traditional OS kinds of things that you'd normally expect to have in your monolithic kernel, in the sacred kernel space: file systems, TCP stacks and so on. With microkernels that's a bit different, but there it's still part of a microkernel component, whereas here it really is just a collection of libraries. Generally, there is no concept of process isolation with a libOS; that's not really its job. And they also use cooperative scheduling. For the non-OS people in the room, that basically means applications are more like event loops, rather than getting preempted internally.

So, you take this collection of libraries, you combine it at compile time, and out pops a unikernel. That's the distinction I'd like to make between these terms: the unikernel is the artifact, which is made up of your library operating system and your application code. Because unikernels are built out of this collection of libraries, we can minimize the code that goes into them. So you have minimal code size, which implies a minimal horizontal attack surface: there's not very much stuff that goes into this thing. You can think of them as single-purpose application operating systems, though then people tend to think, oh, it's actually a kernel, it has to run in kernel space. That's not true. My favorite way of looking at it is: it's just a process that does more. It does more operating-system-type things.

So, what is Solo5? This is the shape of Solo5. The top bit is where your unikernel and library OS go, and the bottom triangle pointing up, which is much bigger in this case, is some host thing. Solo5 occupies the space that's the blue circle in the middle.

Okay, what is this then? There are three important parts to it. One is a minimalist, legacy-free interface. Really, really small. And by interface, as I'll show later, it's basically a bunch of C function calls. Then we have bindings, which are the implementation of that interface; in this picture, the bindings would be roughly the top part of the intersection of the circle with the triangle. We have bindings for microkernels, for separation kernels, for virtio-based hypervisors — so cloud hypervisors, basically — and for existing monolithic kernels. Additionally, on a monolithic kernel we have a third concept, a component we call a tender.
And that's used to strongly sandbox the unikernel, further away from this big bad monolithic kernel with its huge interfaces, as I'll show in a bit. We have two tenders: a hardware-virtualized tender and a sandboxed-process tender.

From the point of view of the library operating system and the application developers, Solo5 is just middleware. Ideally, it's integrated directly into the libOS build system, and you don't actually interact with it directly. An example: for MirageOS it's just a different parameter to mirage configure, where you say which of these targets you want, and out pops a unikernel. That's the way it should be. Other libOSes are not quite at that state yet, but hopefully we'll get there.

So, comparing the Solo5 shape to other common isolation interfaces and units of execution. Again, the bottom side is always some host, and the top part is the unit of execution that we're comparing Solo5 against. What I want to show here is, horizontally, where they meet in the middle, the line represents the relative width of these interfaces. Vertically — very roughly, not entirely to scale — I'm comparing that unikernels are small, containers are generally larger, and a traditional VM can be as big as your host, or bigger.

Now, the width of this interface in the container case: I'm actually showing that the interface is the entire host interface. People may say, well, okay, but it's namespaced and we have cgroups and whatever. Actually, that's just adding more logic to this interface, which is just as wide as it was before. You can switch bits of it off, which some people do with seccomp — we'll talk about that in a bit — but most of the time, in your general container case, you still have access to 350-odd system calls. In the case of a traditional VM talking to some kind of host or hypervisor, the entire interface between the VM and the host is actually quite big. You've got tons of stuff in here: ACPI, device discovery, virtio, PCI, whatever. All of these things are in themselves quite wide and complicated and stateful.

So, the guiding philosophy of Solo5 is that the interface must be minimal — that, I think, is a fairly straightforward term — stateless, which is a word I've appropriated for this purpose and will explain in a moment — and portable. And the implementation should be clearly layered: it should do one thing and do it well. It should be an engine for running unikernels; other things like orchestration, configuration management, monitoring, whatever — those are done by layers higher up the stack. We're at the lowest layer. We try to integrate — or rather, we don't get in the way of other people integrating us into higher layers — but that's not our job.

In terms of minimalism, we want to provide the simplest useful abstraction. Importantly, that means we're not trying to be Linux. This is not an exercise in re-implementing Linux or POSIX; let's try the smallest thing which is actually useful and still lets us get some work done. One thing that pops out of minimalism is that we don't do any kind of device discovery. This leads to a very small implementation size. A typical configuration — by that I mean Solo5 compiled for one target, with a tender and some bindings, some of which go in your unikernel and some of which live on the host — is about 3,000 lines of C.
The whole code base, with all the different combinations of tenders and bindings that we have, is about 12,000 lines of C in total. Minimalism also gives us clarity of implementation: we don't try to do very much, and we prefer to do what we do correctly rather than add more features. And it gives us very fast startup time, comparable to what a normal user process would have. With Solo5 hvt or spt we get something under 50 milliseconds. A traditional QEMU starting a Linux VM will typically take up to a second before you get to, say, the first network packet; a lot of that time is spent precisely doing things like device discovery and then actually bootstrapping the kernel. And for cloud-managed VMs on traditional hypervisors — things like Google Compute Engine in general — it takes seconds.

So, stateless. A minimal interface by itself is just saying, okay, we don't do very much. But that is not necessarily all we want. By stateless, I mean that there's very little state inside this interface. That means there's very little state transferred in any of the calls — you could think of them as being as idempotent as possible, I suppose. And the guest, the unikernel, cannot change host state: there is no dynamic resource allocation. Conversely, the host cannot change guest state: there are no interrupts going across this interface. This gives us a system that's very deterministic and easy to reason about. It's also essentially a static system: you get some resources from the host up front — memory, access to devices, and so on — and that's it; you can never get more. And I think the stateless property is also what very much helps us to strongly isolate this interface, both on existing monolithic kernels and on component-based, high-assurance microkernels. It's this way of thinking about what the interface should do that helps us get to strong isolation.

The third part of the philosophy is that we want to be portable. That includes easily porting existing library OSes to Solo5. We currently have MirageOS, which was the original library OS where Solo5 got started, as a standard way of getting MirageOS to run on other things. Ricardo here has also done support for IncludeOS, which is a C++-based library operating system, and Rumprun, which is a POSIX-like thing using NetBSD rump kernels. Conversely, it should be easy to port Solo5 to new targets. Interestingly, since the project started, we've had OpenBSD vmm support contributed — as I'll show, by a guy who's basically an OpenBSD user, definitely not a Solo5 expert. Adrian from the Muen separation kernel contributed a port that targets Muen native subjects as a Solo5 target. And Emery, who's here in this room, just last year contributed a port that implements the Solo5 interface — bindings to the Solo5 interface — as a Genode component.

So, let's be honest, there are some limitations — or you might want to call this slide non-goals; these are the inherent limitations. As I already said, it's not Linux. We don't intend to become Linux; if you want Linux, you know where to find it. But there are library OSes that let you get quite a lot of the way there: Rumprun, and LKL — the Linux Kernel Library, I think it's called — which is a similar project trying to do what rump kernels did for NetBSD, but with Linux.
So, if you want, you can build your POSIX application against a libOS. Semantically it might be a bit different — you've got to be a bit careful — but generally it'll work.

Keeping the interface stateless — no interrupts — implies it's single core: no SMP. The answer to that is to run more than one instance of your thing, parallelize your work, and pass messages around. It's not intended for interfacing to hardware. That is possibly a bigger limitation, but we see Solo5 as a component in a wider disaggregated system, and then drivers become some other component's problem. You can do that in the microkernel case, where you have other microkernel components providing access to hardware; or in the monolithic kernel case, where the monolithic kernel itself, in a type-2 hypervisor approach, provides access to the hardware — then it's not isolated; or you have a disaggregated system like Qubes, or something based on passing hardware directly through to isolated components.

Portability implies that we have copying semantics in the interface, and also things like one copy per packet. That's not very nice — but actually, we've found it's pretty good. The main thing I'd point out is that it's not intended for scenarios where you're trying to do high-performance computing or push millions of packets per second. If you're doing that, you will generally want hardware pass-through anyway, and trying to design an interface along these goals for something like hardware pass-through is, I think, a fool's errand. Take a typical hardware interface, rip a line down the middle, and you'll end up with bits of the interface sticking out.

So, having shown the philosophy, I'd like to show the actual interface, and I'll leave this up here for a bit for you to read through. Originally I had one slide with all of the interface, because I thought it would be cool to fit the whole interface on one slide, but maybe that's too much text to read in one go, so I've split it up into two groups.

The top call you can see is basically the entry point into the unikernel guest. This also shows the fact that this is a static system: you get given some heap memory with some size, and that's it — you get told, here it is, go for your life. There is an exit call and an abort call. Those are separate calls: the abort call is intended to say, oh, I'm giving up — so the host thing, whatever it is, can, if it's configured to, do things like core dumps, or page someone, or whatever. Something abnormal happened; if you care about that, treat it differently. I don't like overloading exit and status codes for something which I think is conceptually slightly different, so that's why there are two.

There's a console. The console is intended mainly for debugging-type stuff or some simple logging; generally you'd want to do real logging over protocols and networks or somewhere else. So, it's a console — nothing particularly interesting about that. We have monotonic time and wall-clock time. Those are 64-bit types, in nanoseconds; they'll run out 500-something years from now. Too bad. Then there is a yield function. What this does is essentially say: please yield to the host until some deadline arrives. I'll come back to yield when I show the second part — because we actually want to talk to some devices.
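[Editor's note: the slides aren't captured in the transcript, so here is the whole interface in one block, as the speaker originally wanted to show it. This is a paraphrase of that era's solo5.h — treat the exact names and signatures as approximate rather than authoritative:]

```c
/* Sketch of the Solo5 interface, paraphrased from solo5.h (approximate). */
#include <stddef.h>
#include <stdint.h>
#include <stdbool.h>

typedef uint64_t solo5_time_t;      /* nanoseconds */
typedef uint64_t solo5_off_t;

typedef enum {
    SOLO5_R_OK, SOLO5_R_AGAIN, SOLO5_R_EINVAL, SOLO5_R_EUNSPEC
} solo5_result_t;

struct solo5_start_info {
    const char *cmdline;            /* command line from the host */
    uintptr_t heap_start;           /* all the memory you will ever get */
    size_t heap_size;
};

/* Entry point into the unikernel guest. */
int solo5_app_main(const struct solo5_start_info *si);

void solo5_exit(int status) __attribute__((noreturn));   /* normal termination */
void solo5_abort(void) __attribute__((noreturn));        /* "something abnormal happened" */

void solo5_console_write(const char *buf, size_t size);  /* debugging/logging only */

solo5_time_t solo5_clock_monotonic(void);
solo5_time_t solo5_clock_wall(void);

/* Yield to the host until 'deadline'; true means a device has work pending. */
bool solo5_yield(solo5_time_t deadline);

/* Layer-2 network: read is non-blocking, packets may be dropped. */
struct solo5_net_info { uint8_t mac_address[6]; size_t mtu; };
void solo5_net_info(struct solo5_net_info *info);
solo5_result_t solo5_net_read(uint8_t *buf, size_t size, size_t *read_size);
solo5_result_t solo5_net_write(const uint8_t *buf, size_t size);

/* Block I/O: every call carries an explicit offset, so calls are idempotent
 * in themselves; I/O happens in multiples of block_size. */
struct solo5_block_info { solo5_off_t capacity; solo5_off_t block_size; };
void solo5_block_info(struct solo5_block_info *info);
solo5_result_t solo5_block_read(solo5_off_t offset, uint8_t *buf, size_t size);
solo5_result_t solo5_block_write(solo5_off_t offset, const uint8_t *buf, size_t size);
```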
So, what we have here is a network API, which is a very simple layer-2 network with layer-2 semantics: you push packets, and packets can get dropped. Read is always non-blocking, so it will return "again" if there isn't anything to read. And this ties into solo5_yield, which gives you back a bool that says: do I have work to do, or is there nothing to do and the deadline was simply reached?

Then we have a block interface, which is basically block I/O. Again, calls are idempotent in themselves, so there is always an offset and a buffer together. The interface is intended to read and write only in multiples of some minimum block size that you get out of block info. So you will either get back an okay — and what you asked to write was written — or not.

There are a couple of things missing here, which observant people will be asking themselves about right now. One is: where do you say which of these devices you want to talk to? And, more obviously, in the block API there is no flush. That is just missing; that is something we need to add. So there will likely be a flush and a discard. As for multiple devices: at the moment the implementation is rather naive, and you can only have zero or one of each of these. This is something we want to change — I'll go into a bit more detail later about why it hasn't happened yet — but fundamentally this interface will not change hugely even in the presence of multiple devices. The difference will probably be that block info and net info will give you back some kind of handle: you pass in an identifier, some name of the device, you get a handle out, and then read and write accept a handle. Yield, interestingly, will not change much. It might give you back a mask of bits saying which of the handles have work for them. But, again, in keeping with not changing host state, you will always ask: is there any work at all? You don't go saying, oh, I'm no longer interested in net0 — no, you asked for it in the first place, at the beginning, so if there's any work then I'll tell you there's work on it.

So, I'm going to hijack the presentation for a couple of minutes for a concrete view. — I need to give you the mic. — Well, no one will hear you on the stream otherwise, so I can just hold it like this. I can do it. Before that, I just wanted to add something to what Martin was saying, which is that even though the interface looks very, very minimalistic, it's surprisingly enough to run a lot of stuff. For example, with Rumprun we can run Python, Node, a lot of things — which, at least to me, is pretty amazing for such a simple interface. So, now — can I make the font size smaller? Sure. There you go. That's probably just about right.

Okay. Let's say you want to implement a small web server as a unikernel that just returns something like a pretty camel — it's OCaml, of course — and it stores a counter in persistent storage, just to count the number of requests. So you need storage, you need a file system with a key-value store, and you need an HTTP server, basically. One way of implementing this would be to use something like Mirage — and because you want to print a camel, you would use something like OCaml, right? Makes sense.
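[Editor's note: before the demo, it may help to see how the yield/read semantics fit together in practice. This is an illustrative guest main loop against the interface sketched above, not code from the talk; handle_packet and the 100 ms timeout are made up for the example:]

```c
/* Illustrative guest main loop: sleep with solo5_yield, then drain the
 * network with the non-blocking solo5_net_read. Assumes the declarations
 * from the interface sketch above. */
extern void handle_packet(const uint8_t *pkt, size_t len);  /* hypothetical */

static uint8_t buf[1514];   /* one packet buffer: copying semantics, one copy per packet */

int solo5_app_main(const struct solo5_start_info *si)
{
    (void)si;   /* heap_start/heap_size would seed the allocator here */
    for (;;) {
        /* Yield until a packet arrives or 100 ms pass, whichever is first. */
        solo5_time_t deadline = solo5_clock_monotonic() + 100000000ULL;
        if (!solo5_yield(deadline))
            continue;                       /* deadline reached, nothing to do */
        size_t len;
        while (solo5_net_read(buf, sizeof buf, &len) == SOLO5_R_OK && len > 0)
            handle_packet(buf, len);        /* application work goes here */
    }
}
```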
So, you would go into Mirage, you would write a bit of OCaml, and you would use an HTTP library, a network stack library, and something like a key-value store as a library. Then you tell the Mirage build system that you want to construct a unikernel using some target — let's say a Solo5 target, and more specifically the Solo5 hvt target. So you configure, you make, and you get something like this: the unikernel, which is this .hvt file — actually, it's not as small as I thought — and also the tender, which is the equivalent of QEMU, for example, if you were running a typical VM. And the way you run this is, just like with QEMU, you say solo5-hvt — and because we need a disk to store the key-value data, we specify the disk — so: solo5-hvt with the disk image, a net tap, and the https .hvt unikernel. Yeah, well, it's huge. Okay. And then, just to show you that it actually works... there we go. Cool.

Another possible target is the process one, which is called spt. So you would again tell the Mirage build system to build another unikernel — you tell it, now I want spt — and it will output something like this, a .spt file, which is also pretty large. Yeah, makes a lot of sense. Really, size will give a different number. Yeah, still pretty large. Anyway. You would run the spt one something like this — exactly the same. And just to show you that this works too: ta-da. 195 camels, so the counter is working.

And now, to give you a more in-depth view of what this thing is doing, let's take a look at the strace. For example, for the hvt one: the hvt tender is implementing the Solo5 interface by using system calls, so it maps a Solo5 API call — for example, solo5_yield — into a ppoll on specific file descriptors on the host. Okay, sorry — so let's do an strace. Yeah, well, it just boots, and now it's just polling — it's just an event loop — and let's take a look at what happens when we actually curl it. We see a bunch of KVM stuff, a read on the tap device, and ppolls. I mean, it's really very, very minimal: it's basically reads and writes on this file descriptor, and reads and writes on the tap file descriptor.

And now, to compare against the spt one, let's do the same on spt. Yeah, that's booting, don't worry. So that's pretty much exactly the same, except that it doesn't have the KVM ioctls. Again, that's just to give you a rough idea of the minimalism, and that we have an even more minimal target, called spt, which is pretty much like the KVM one but without doing all the KVM stuff. And now, back to Martin to tell you more about how these work.

So, Ricardo showed a couple of the Solo5 targets that use tenders to run on monolithic kernels. What do these actually do? There's hvt, the hardware-virtualized tender, which — obviously, from the name — uses hardware virtualization. It runs on KVM on Linux, on bhyve on FreeBSD, and on the vmm API on OpenBSD. This is not a traditional VMM, not even close: it has about 10 hypercalls, and the code base itself is not modular. Again, a typical configuration — solo5-hvt on Linux on x86_64 — would be about one and a half thousand lines of code. And again, we're talking about the tender, which is this part: the intersection of the bottom triangle with the blue circle.
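[Editor's note: the yield-to-ppoll mapping Ricardo pointed out in the strace is simple enough to sketch. This is illustrative only — made-up names, not the actual Solo5 tender source:]

```c
/* Sketch: how a tender might service a guest's solo5_yield with ppoll(2).
 * net_fd is the host tap device backing the guest's network; all names
 * here are illustrative. */
#define _GNU_SOURCE
#include <poll.h>
#include <time.h>
#include <stdbool.h>
#include <stdint.h>

static bool tender_yield(int net_fd, uint64_t now_ns, uint64_t deadline_ns)
{
    struct pollfd pfd = { .fd = net_fd, .events = POLLIN };
    uint64_t wait_ns = deadline_ns > now_ns ? deadline_ns - now_ns : 0;
    struct timespec ts = {
        .tv_sec  = wait_ns / 1000000000ULL,
        .tv_nsec = wait_ns % 1000000000ULL,
    };
    /* true: the network device has work for the guest; false: deadline hit.
     * This single host call is essentially all the strace shows while idle. */
    return ppoll(&pfd, 1, &ts, NULL) > 0;
}
```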
Comparing to something like QEMU — well, I tried to be nice to them, so I ignored the one and a half million lines of code that is just random blobs of firmware and stuff, and that still leaves roughly another million lines of code. Something like crosvm — this may be a little unfair, because sloccount doesn't understand Rust — but I came up with some figure, and it looks like it's an order of magnitude smaller than QEMU. hvt was also the first tender we implemented, so it's been around since 2015: it worked on KVM, then FreeBSD support got added, and then OpenBSD was contributed later as well. It was formerly known as ukvm, a name some people here may be familiar with.

So, what is the tender responsible for? Well, it loads the unikernel artifact into the VCPU's memory. It sets up access to the host resources the unikernel is asking for — the network tap device, or the disk block device. It does some VCPU setup, and sets up page tables for the guest, which are very simple, just one-to-one page tables — so no virtual memory is being used at the guest level. And then it needs to hang around to handle guest hypercalls — or rather, handle VM exits — to actually perform some I/O.

So, hypercalls: what do they actually look like? What we do is transfer a 32-bit pointer to a struct which contains the hypercall's parameters. The reason for this hybrid approach is that we actually want to run on existing, deployed KVM systems. I think Martin Děcký — I don't know if he's here — mentioned VMFUNC; there is also an instruction called VMCALL which you could use on x86 for this, except that KVM steals that for its own purposes, so too bad. So we figured out this approach, which on x86 basically does an out to some port, which causes the VM exit, and on arm64 we do an MMIO write to some region, which also causes a VM exit. At that point the tender just fishes the data out of the guest's memory and operates on it directly.

The hvt bindings, then, implement the Solo5 interface using these hypercalls to talk to the tender. Apart from that, they don't do very much: there's some handling of VCPU trap vectors, which basically just says "oh, we got a trap — abort, sorry", and the only other thing they do is provide monotonic time using the timestamp counter or an equivalent mechanism. So here you can see an entire implementation of solo5_net_write for hvt, which is basically just putting stuff into a structure and then invoking the hypercall. The mucking about with return codes here is because these two interfaces actually need to be aligned, but we haven't done that yet.

So, Ricardo showed you hvt, but we also have something called spt, which we're calling a sandboxed process tender. The interesting thing about this: notice that the blue circle suddenly got much, much smaller. What spt does is, instead of using hardware virtualization — which is great if you have it, but you don't always have access to it; for example, you might be running in a cloud where you're already virtualized, and while nested virtualization exists, it's not something I would actually use to isolate things with — we had this realization: we already know exactly the calls we need to implement this interface. So why not use a strict system call filter and turn off all the APIs the host kernel is providing? It turns out that this is exactly what spt does.
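[Editor's note: for the record, the net-write slide mentioned a moment ago looks roughly like this — a sketch modeled on the hvt bindings, with approximate struct and constant names, and assuming the interface declarations shown earlier:]

```c
/* Sketch of the guest-side hypercall path in the hvt bindings (names are
 * approximate, not the verbatim Solo5 source). On x86_64, an 'out' to a
 * magic port carries a 32-bit guest pointer to the argument struct and
 * causes the VM exit; the tender then reads the struct directly. */
#define HVT_HYPERCALL_PIO_BASE 0x500
#define HVT_HYPERCALL_NETWRITE 8            /* illustrative number */

struct hvt_netwrite {
    uint64_t data;      /* guest address of the packet */
    uint64_t len;
    int64_t  ret;       /* filled in by the tender: 0 on success */
};

static inline void hvt_do_hypercall(int n, volatile void *arg)
{
    __asm__ __volatile__("outl %0, %1"
        : /* no outputs */
        : "a" ((uint32_t)(uintptr_t)arg),
          "d" ((uint16_t)(HVT_HYPERCALL_PIO_BASE + n))
        : "memory");
}

solo5_result_t solo5_net_write(const uint8_t *buf, size_t size)
{
    volatile struct hvt_netwrite wr;
    wr.data = (uint64_t)(uintptr_t)buf;     /* 1:1 page tables: VA == PA */
    wr.len  = size;
    wr.ret  = -1;
    hvt_do_hypercall(HVT_HYPERCALL_NETWRITE, &wr);
    /* The "mucking about with return codes" mentioned above. */
    return wr.ret == 0 ? SOLO5_R_OK : SOLO5_R_EUNSPEC;
}
```

[And the spt filter idea is simple enough to sketch too. This uses libseccomp for brevity; the exact rule set here is approximate, not the actual spt source:]

```c
/* Sketch: the spt idea as a seccomp whitelist. Anything outside the
 * whitelist kills the process; the syscall list is approximate. */
#include <seccomp.h>

static void spt_apply_sandbox(int net_fd, int disk_fd)
{
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL);   /* default: kill */

    /* Roughly the handful of calls the bindings need: console/net I/O,
     * yield, time, exit. */
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(ppoll), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(clock_gettime), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);
    /* Block I/O with explicit offsets maps naturally onto pread/pwrite,
     * restricted to the one pre-opened disk file descriptor. */
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(pread64), 1,
                     SCMP_A0(SCMP_CMP_EQ, (scmp_datum_t)disk_fd));
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(pwrite64), 1,
                     SCMP_A0(SCMP_CMP_EQ, (scmp_datum_t)disk_fd));
    (void)net_fd;   /* a real filter would pin read/write to net_fd too */

    seccomp_load(ctx);
    /* ...then jump into the unikernel; from here on there's no way back out. */
}
```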
Currently we only have this implemented for Linux, using seccomp-BPF, but it should be possible to port it to FreeBSD and OpenBSD. There's also a link to our paper from last year, which talks a bit more about quantifying the difference in the TCB. We actually make some claims there: using something like hvt, which touches hardware virtualization and KVM, actually ends up exercising more of the host kernel code than the approach where we switch off all but these seven system calls.

So, similarly to hvt, spt loads the unikernel into its process address space, sets up access to the host resources, applies the sandbox, and then basically gets out of the way. The blue circle should actually disappear at this point, but I don't have animations in there. So what this does is: you've now taken your unikernel, recompiled the same code against a different target, and you're now running as essentially a user process. That's coming back to the original slide where I said unikernels are just processes that do more — and here we're basically treating the monolithic kernel as a kind of hypervisor: switch off everything.

Some more about the bindings side, for those who are interested: the bindings basically end up directly invoking system calls, but they are now inside the sandbox, which they cannot change. There's no libc involved in there. We have it for x86_64 and arm64; for spt it's pretty trivial to add basically anything the host supports. For hvt — I'm not sure if I mentioned that we also have these two architectures — it's a little bit more work, but possible. (What just happened? Hmm — oh, I clicked on something.)

And, as I mentioned, we have targets for microkernels as well. Muen is rather interesting — technically, it's called a separation kernel. For people not familiar with the distinction: a separation kernel gives you much stronger guarantees about how your components can communicate — about the data flow between the things in the system. The original term is also used as a way of saying that the componentized system should look exactly as if it were a distributed system. You can look up the term; there's some more discussion about it out there. Muen provides isolation using hardware virtualization, it's implemented in Ada/SPARK, and it's formally proven to contain no runtime errors at the source code level. The shape is now a bit different: the blue circle has moved more or less into the unikernel, and Muen is much, much smaller. There aren't any hypercalls; communication is done via shared memory.

virtio was the first Solo5 implementation. The shape doesn't really match up, but we still have it there because, well, people find it useful — I mean, it runs on Google Compute Engine, that's great — but there's something like two and a half thousand lines of code in there just to do virtio stuff. Does anyone want to implement virtio-scsi? Not really.

Debugging — this is the special slide. There is this myth that unikernels are undebuggable. Just use GDB: you can do it with a unikernel hidden inside an hvt VM, or with spt as well.
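[Editor's note: on the "no libc" point about the spt bindings above — invoking a system call directly is only a few lines of inline assembly. An illustrative x86_64 wrapper in the spirit of the spt bindings, not the actual Solo5 source:]

```c
/* Illustrative raw x86_64 syscall wrapper: no libc, just the kernel's
 * syscall ABI. write(2) is syscall number 1 on x86_64; the kernel
 * returns a negative errno on failure. */
static long spt_sys_write(long fd, const void *buf, unsigned long len)
{
    long ret;
    __asm__ __volatile__("syscall"
        : "=a"(ret)
        : "a"(1L), "D"(fd), "S"(buf), "d"(len)
        : "rcx", "r11", "memory");
    return ret;
}
```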
Some lessons learned — okay, this is a bit complicated and I have only five minutes. Originally, with ukvm, we wanted to specialize the tender to the unikernel, and we did that at compile time. It turns out that's not practical: the supply chain doesn't make sense — you're not going to be supplying the tender and the unikernel at the same time. It's gone now for spt; we still haven't removed it for hvt, but it needs to go. But we still want to enforce a contract between the tender and the unikernel: the unikernel wants some resources, and the tender should match those and not accidentally provide something else. A possible solution is to embed a manifest into the unikernel artifact. But can we still specialize the tender?

Some plans and challenges. Security: we're gradually implementing more best practices, like ASLR, stack-smashing protection, W^X. But again, with stack-smashing protection, the compiler ABIs are completely undocumented: you have to go and read the compiler source to find out that, oh, on x86 it always wants %fs:0x28. Why isn't that specified somewhere? There's also some hypervisor support lacking on FreeBSD and OpenBSD, where if you do an mprotect on memory that's owned by the guest, the hypervisor does not update the EPT tables — it's just a no-op. KVM actually does this right. And some defense in depth. Portability: can we do dynamic linking safely and get a tender-independent unikernel binary — none of this ahead-of-time choosing of whether I want this or that isolation mechanism? That's a very good question. Can we define an interface that allows asynchronicity but is consistent with our philosophy? If someone can define one for me on the back of a beer mat later this evening, then you are definitely a god — please contribute. We would like more languages and libOSes ported, and also more targets: hvt should run on macOS; you should be able to build an SGX enclave out of this, keeping the same interface; WebAssembly?

Okay, related work — how much time do I have, two minutes? This should be on a spectrum. Solo5 is somewhere over here, like the most radical thing; QEMU is somewhere over there; Drawbridge is somewhere here; crosvm and Firecracker are there, sort of going towards the right. gVisor is just weird — it's up there, trying to re-implement the Linux syscall interface in Go. I mean, yeah, it's strange.

So: we can do a lot with minimalism. We have a tiny API, it's legacy-free, it's ISC-licensed, and you can run 1,500 hvt instances on your three-year-old laptop. We want to see unikernels applied everywhere — on monolithic systems, because there are a lot of them around, where you should be able to trivially recompile your code and run it; and run it on Genode, run it on Muen — and at the same time, no compromises: you still get strong isolation across the spectrum of targets. Any questions?

Okay, so what you're saying is: should we sandbox hvt even more? I mean, one of the things that we will do is extend the seccomp sandbox to the hvt tender itself, right? So I think that's kind of what you're touching on. No, we haven't thought about that — it's not recursive, sorry. Any other questions? Yes — yes, that was the real code you will get. So — nope, there's no reason you should get EINTR, because there aren't any signals being handled, especially not in the spt case, because we don't even have sigreturn in the seccomp filter. So if you get a signal, then you can't handle it — sorry, your
process just dies. I haven't actually seen — so, I haven't really seen EINTR happen in any other case in practice. I mean, if you're not handling signals, then why should you get that? But good point, yeah. Any other questions?

You're not the first person that's asked about whether this can be used for sandboxing desktop applications with an actual UI. The question is what that interface would be, because then you run into the problem of, well, if you want performance, you want at least some kind of frame buffer, but then that's very low level. So it's like: how do you decide on that abstraction, right? How would this run across this interface? You could, of course, certainly run an X11 client inside this thing and have it talk to an X server somewhere else, so that'd be one way of accomplishing it — though possibly not the cleanest approach. I know there are people that did things with — I mean, even the Genode folks have been doing things with virtualizing graphics hardware, but that's really quite a different area which I don't know that much about. But yeah, you could do the X11 thing.