Yeah, it's my pleasure to give the last talk in the microkernel devroom this year, and we will talk about "Meltra and Speck down". This is actually a deliberate misspelling: in early 2018, during a conference call with a customer, the sentence came up, "Okay, and we need to talk about Meltra and Speck down", and at that point nobody really noticed the funny mispronunciation. So it stuck with us whenever we talk about these vulnerabilities. A better title for this talk would perhaps be "The impact of the Meltdown and Spectre vulnerabilities on the L4Re microkernel system", because everything I will show you today relates to L4Re and Fiasco. So this is anecdotal evidence rather than a broad view of how multiple different microkernel systems deal with these kinds of vulnerabilities.

When I prepared this talk, I basically started with a couple of questions I wanted to answer. The first one is: were we somehow prepared for these kinds of vulnerabilities, so how hard did they hit us in the end? Did some of the microkernel design principles help or protect us from some of the vulnerabilities? And at the end we of course had to implement mitigations, so what is their impact, basically the performance impact of these mitigations?

To give you a quick spoiler: no, we were not prepared in any way. The time was ripe for these kinds of vulnerabilities to be discovered, as we have seen throughout the last year, but we were not prepared. The design principles helped us a little bit, and the impact of the implemented mitigations we will see later.

Let me first quickly recap what Meltdown and Spectre are. Basically, it is a set of vulnerabilities in modern CPUs related to out-of-order and speculative execution. These mechanisms are implemented in modern CPUs to increase the utilization of the computation units inside the CPU and to gain performance. Architecturally nothing was wrong with these CPUs, but unfortunately out-of-order and speculative execution cause observable side effects, and these side effects are the actual vulnerabilities. So we had to implement mitigations.

I want to start with Meltdown. If we look at the classic virtual address space layout, and I picked the 32-bit layout here just as an example, so the actual layout may differ: you have a virtual address space for the user ranging from zero to three gigabytes, and in the upper gigabyte the kernel is usually visible in the address space.
Of course, the kernel memory is protected by setting a supervisor bit, an extra bit in the page table entries, so that the user cannot directly access kernel memory. The advantage is that when you context switch into the kernel for a system call or something else, you do not need to switch address spaces. If we look a little bit further, we see that a lot of kernels also map the complete physical memory of the system into the kernel, for example for DMA; that is the case in Linux, for instance.

The Meltdown vulnerability basically was that a user could start a short instruction sequence which the CPU executed without correctly checking the access privileges, reading memory that was actually protected by the kernel. The content of this memory then ended up in the first- or second-level cache, and using a side-channel analysis the user could infer the data that was supposed to be under the protection of the kernel. The unfortunate thing was that this was, and is, possible at a very high bandwidth; we are speaking about megabytes per second, up to 500 megabytes per second.

If we look at L4Re, and especially Fiasco: Fiasco only reserves a fixed amount of the physical memory for itself, for storing kernel data structures such as page tables and the mapping database. It reserves a certain percentage of the total amount of memory and maps that, of course, into the kernel. So the first observation is that not all physical memory is mapped into the kernel. But in order to save on page tables and also TLB entries, the kernel tries to use pages as huge as it can get on the particular architecture; on x86, for example, it tries to use one-gigabyte pages where available. So there may be an overlap between the memory the kernel is actually using and physical memory that is used by a user process. We then ran this Meltdown example, and we were indeed able to observe some of the data covered by this big kernel mapping. So that is the picture: there is user memory mapped into the kernel, but not the complete physical memory.

It was clear we had to implement a mitigation, and the obvious mitigation is to give the kernel a separate address space. In the Linux community this is called kernel page table isolation, so we also call it page table isolation, PTI. Basically, we move the kernel into its own address space, and we also decided to make that kernel address space CPU-local, so each CPU now has its own kernel-local address space. Of course, we still need to map some parts of the kernel into the user address space so that a user process has a chance to enter and exit the kernel: we need some basic data structures like the GDT, the TSS, the entry and exit stacks, and of course the UTCBs must be visible there.

So what is the impact of these measures? Because now, for every kernel entry and exit, we basically have two address-space switches where previously there was none.
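To illustrate what these extra switches amount to, here is a minimal, hypothetical sketch, not the actual Fiasco code: with page table isolation, every kernel entry loads the CPU-local kernel page directory into CR3 and every kernel exit loads the user one again, and each CR3 load discards non-global TLB entries.

    #include <stdint.h>

    /* Hypothetical per-CPU data; the real kernel keeps equivalent state. */
    struct cpu_local {
      uintptr_t kernel_pgdir;   /* root of the CPU-local kernel address space */
      uintptr_t user_pgdir;     /* root of the current user address space */
    };

    static inline void write_cr3(uintptr_t pgdir)
    {
      /* Loading CR3 switches the address space and flushes non-global TLB entries. */
      asm volatile("mov %0, %%cr3" : : "r"(pgdir) : "memory");
    }

    void kernel_entry(struct cpu_local *cpu)
    {
      write_cr3(cpu->kernel_pgdir);   /* switch to the kernel's own address space */
      /* ... handle the system call, interrupt, or exception ... */
    }

    void kernel_exit(struct cpu_local *cpu)
    {
      write_cr3(cpu->user_pgdir);     /* switch back before returning to user mode */
    }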
Before I show you the numbers, I want to share some of the meta information. As the baseline for all the benchmarks I did, I took the Fiasco from January 1st, 2018, so from last year, before we started implementing all of these countermeasures, and I compared it to the version of January 7th of this year. You can check out these commits on our GitHub if you are interested and verify or confirm these measurements. I used Clang 6 to compile the kernel; the userland was compiled with GCC 7.3. As a platform I used a Core i7 machine, a quad-core machine with hyper-threading, usually running at 2.6 gigahertz. If you are interested, I can send you the raw data I gathered; there are more numbers than I present here.

I then ran a couple of micro-benchmarks to quantify the overhead incurred. But micro-benchmarks do not model real-world workloads very well, so I also measured two scenarios. In the first one, I ran two L4Linux instances on this machine, each with two CPUs assigned exclusively, so they run on different CPUs. I have two fibre network cards in this machine, attached via PCI, each assigned to one of the L4Linux instances; I connected them directly using fibre optics and measured the throughput between those two VMs. To create a sort of worst-case scenario, I then set up these two L4Linux VMs with one virtual CPU each, put them on the same physical CPU, and connected them using a virtual network link, using the virtio net component that Jakub presented earlier today, also running on the same CPU. So when we run this benchmark, there are a lot of context switches and virtual interrupts going on on this one CPU, and we should see the overhead clearly when there is real stress on a single CPU.

Okay, now to the micro-benchmarks. I used the ping-pong utility, which has been around on L4 systems for ages, I would say, and I picked three particular micro-benchmarks: first, sending an IPC message between two threads running in different tasks, so that a context switch is involved; then the context switch itself; and then the switch to a thread within the same task, that is, the same address space. Here we can see that IPC performance dropped quite dramatically, by 54 percent, so the cost is basically more than double what it was before page table isolation. The context switch suffered 30 percent, and the thread switch 57 percent.
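Just to make the methodology a bit more concrete, here is a minimal sketch, with purely hypothetical names, of how a cycle-counting micro-benchmark in the style of ping-pong obtains such numbers: read the time-stamp counter around many iterations of the operation under test and report the average. The real utility measures a complete IPC round trip between two threads; the measured operation below is only a placeholder.

    #include <stdint.h>
    #include <stdio.h>

    static inline uint64_t rdtsc(void)
    {
      uint32_t lo, hi;
      asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
      return ((uint64_t)hi << 32) | lo;
    }

    int main(void)
    {
      enum { ROUNDS = 100000 };
      volatile int sink = 0;

      uint64_t start = rdtsc();
      for (int i = 0; i < ROUNDS; ++i)
        sink += i;                       /* placeholder for one IPC round trip */
      uint64_t end = rdtsc();

      printf("avg cycles per round: %llu\n",
             (unsigned long long)((end - start) / ROUNDS));
      return 0;
    }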
That is quite a huge overhead, so how does it show up in more real-world workloads? I measured the network throughput using iperf. The baseline performance here was not exactly 10 gigabits per second but 9.37, and with kernel page table isolation it dropped slightly, but it stays within one percent, so I would consider this a measuring error, or within the usual limits of these cards; it is not dramatic. The picture looks a little different in the worst-case scenario where everything runs on one CPU: the baseline performance is around 5 gigabits per second and now drops to 3.17 gigabits per second, which is a 1.6-fold increase in the time needed to transfer the data. So that is quite heavy.

All right, moving on to Spectre, the next guy in the house. Just a brief and very high-level description of what Spectre is: the Spectre vulnerabilities are usually characterized by an indirect branch misprediction which then speculatively accesses data it should not access, and these data accesses cause side effects that can again be observed using side-channel analysis. One example of such a misprediction is a piece of code where an index is checked against the upper limit of an array; if the index is within this limit, it is used to fetch a value from the array, and that value is in turn used to access another array (see the code sketch below). If this is mispredicted, so the actual index is larger than the array size but the CPU executes this code path speculatively, the value from the second array may end up in the caches, where it can then be observed.

For dramaturgical reasons, I now drop out of the chronological order of things and first want to talk about Spectre-NG, which was discovered by Julian together with a colleague. They made the discovery that the CPU also speculatively accesses FPU state while the current thread, the current context running on the CPU, is not the owner of the FPU. That is bad, because Fiasco uses lazy FPU switching, which means that only threads that need access to the FPU become the owner of the FPU: the first access causes a trap, which is then used to restore the original FPU state for that thread, and the thread can continue. This is a performance optimization, because at least in earlier times not all threads needed the FPU, so the expensive FPU switching was avoided.

The obvious mitigation is of course to implement eager FPU switching on x86, and the question is whether that incurs any performance loss. So we benchmarked that, and we can see that we add another overhead in the micro-benchmarks, basically 300 to 350 cycles extra, which is the number of cycles you need to save the whole FPU state on this particular CPU. Going back to the more real-world workloads, we now get a measurable performance loss; let me have a look into my sheet... it is roughly... no, sorry, I was in the wrong line here: it is actually four percent between PTI alone and PTI plus the eager FPU mitigation. So now we are getting to the point where, for some high-performance applications, this might already start to hurt. The picture is the same for the worst-case scenario; it also drops a little.
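To make the bounds-check-bypass gadget described above concrete, this is the classic Spectre variant 1 pattern as a sketch; the array names and sizes are purely illustrative and not taken from any real code base.

    #include <stddef.h>
    #include <stdint.h>

    uint8_t array1[16];
    uint8_t array2[256 * 512];
    size_t  array1_size = 16;

    /* If the branch is mispredicted with an out-of-bounds x, the transient
     * load from array2 leaves a cache footprint that depends on whatever
     * byte lies beyond array1, and that footprint can later be recovered
     * with a cache side channel. */
    uint8_t victim_function(size_t x)
    {
      if (x < array1_size)                  /* bounds check, predicted taken */
        return array2[array1[x] * 512];     /* transient, secret-dependent load */
      return 0;
    }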
As you may know or remember, there is a lot more to the Spectre vulnerabilities; there is a very long list of variants by now. We analyzed all of them and found that most variants do not work across process boundaries, so they do not directly pose a threat to our system; those applications then have to implement countermeasures themselves. We also found that you usually need code execution to train the predictor in the CPU, so that it actually starts mispredicting your code, before you can start an attack on the CPU.

But anyway, we implemented the mitigations provided by a microcode update from Intel. They introduced new instructions to prevent indirect branch prediction, basically putting in barriers. We implemented an indirect branch prediction barrier on kernel entry, and at a context switch we issue the full prediction barrier. For that we also had to implement microcode loading functionality in the kernel, because at that time not all BIOSes provided these microcode updates, and some of those BIOSes have not been updated to this day, so it was a good decision to put that into the kernel as well.

We benchmarked that too, and the results are not good. You can see that the bar is now even higher: instead of an IPC round trip of initially roughly 1,500 cycles, we are now at 16,600 cycles, which is more than a tenfold increase in IPC round-trip time. As you can expect from these numbers, this now also dramatically hurts real-world workloads. For scenario one we have a 20 percent loss in performance, just for the raw throughput of packets between two virtual machines that are not doing anything else. The picture is even darker for the worst-case scenario: there we have a 75 percent loss, so we end up with a quarter of the performance we had initially, and that is quite bad.
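For reference, the indirect branch prediction barrier we issue on kernel entry and at context switch boils down to a write to a new MSR exposed by the microcode update; here is a minimal sketch, with the MSR number and bit as documented by Intel, and of course not the actual Fiasco code.

    #include <stdint.h>

    #define MSR_IA32_PRED_CMD 0x49U   /* exposed by the microcode update */
    #define PRED_CMD_IBPB     0x1ULL  /* bit 0: indirect branch prediction barrier */

    static inline void wrmsr(uint32_t msr, uint64_t val)
    {
      asm volatile("wrmsr" : : "c"(msr), "a"((uint32_t)val), "d"((uint32_t)(val >> 32)));
    }

    /* Issued on kernel entry and on context switch, so that one context
     * cannot train the indirect branch predictor for another. */
    static inline void ibpb(void)
    {
      wrmsr(MSR_IA32_PRED_CMD, PRED_CMD_IBPB);
    }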
Before coming to my conclusion, I also want to share some information on the Foreshadow attack, also known as the L1 Terminal Fault. Basically, the CPU speculatively used invalid entries from the page tables to access data. This is bad because Linux, for example, uses invalid page table entries to store information about paged-out memory pages: it invalidates the entry and then puts some meta information into it. The data in such an entry was then used to speculatively access memory. Intel acknowledged that this bug affects the operating system, the System Management Mode, the virtualization extensions, and also SGX. SGX is not supported in L4Re, so we did not have to care about that, which was good for us, and we also decided that the System Management Mode has to protect itself, for example by flushing the L1 data cache when returning to normal mode. But we had to analyze the impact on the operating system and on virtualization. On the operating system side we were lucky: we were not vulnerable, because when we invalidate a page table entry we also zero it out, so there is no address or any data left in it which the CPU could use for speculation.

But VT-x, especially nested paging, is nasty. The problem here is that the guest can provide an invalid page table entry in its own page tables, and the CPU starts translating it in hardware, but it does not translate it down to a host-physical address; instead it uses the guest-physical address to speculatively continue execution rather than raising a fault. The fault of course occurs later, but at that point we have already lost, because there are already observable side effects. So what we have to do here is load another microcode update into the CPU, which provides new MSRs where we can read out the CPU capabilities and a new MSR command for flushing the L1 data cache, which we then have to execute on every VM resume. So every time the kernel resumes a virtual machine, we have to flush the L1 data cache. Sorry, no benchmarks for that one; I did not make it in time to measure some numbers here.
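A minimal sketch of that last mitigation, again with the MSR numbers as documented by Intel rather than the actual Fiasco code: the new IA32_ARCH_CAPABILITIES MSR can be read to see whether the CPU needs it, and writing the L1D_FLUSH bit to IA32_FLUSH_CMD flushes the L1 data cache right before the VM is resumed.

    #include <stdint.h>

    #define MSR_IA32_FLUSH_CMD 0x10BU  /* new MSR provided by the microcode update */
    #define L1D_FLUSH          0x1ULL  /* bit 0: flush the L1 data cache */

    static inline void wrmsr(uint32_t msr, uint64_t val)
    {
      asm volatile("wrmsr" : : "c"(msr), "a"((uint32_t)val), "d"((uint32_t)(val >> 32)));
    }

    /* Called right before every VM resume so that no host or other-guest data
     * is left in the L1 data cache for a malicious guest to leak via L1TF. */
    static inline void l1d_flush_before_vm_resume(void)
    {
      wrmsr(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
    }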
So, time for the conclusion, but there is one more thing I want to do before that. As you may have guessed already, we implemented all of these features as configurable options in Fiasco, so you can still go back to roughly the original state we had at the beginning of 2018: we can turn off kernel page table isolation, we can turn off eager FPU switching and go back to the lazy scheme, and we can also remove the support for the branch prediction barriers. So the interesting question for me at that point was: how do we compare, one year later, to the 2018 baseline performance? That is the full picture, so let me move it around a little bit; the blue and the green bars are now the interesting ones. You can see that we slightly improved performance in the micro-benchmarks; I will tell you in the conclusions what I suspect the reason is. As you may also guess when you compare the numbers, we do not expect much of an influence in the real-world benchmarks, where we put real load onto the machines. And indeed you can see that the baseline performance is basically unchanged, again within this one-percent window, and also for the worst-case scenario we get basically the same performance. So if at some point in time we get fixed CPUs, we can go back and have essentially the same performance as before.

Okay, the conclusion, to finish the talk. Well, Fiasco is still not the fastest microkernel in the world, and it probably never will be, but we saw that we lost quite a lot of performance due to the implemented mitigations. I would say some of the bugs did not hit us as hard as they hit other operating systems. For Foreshadow, for example, we were quite safe; especially our customers not using the hardware virtualization extensions were safe in the first place. Also for Meltdown we had to do something, but it was not as bad as for other operating systems. Part of the reason is that missing features helped us: we did not have them, so we did not need to implement countermeasures, as for SGX, for example.

We saw a dramatic performance impact, especially for the branch prediction barriers, so we should consider alternatives to the measures implemented via microcode. While doing all this, we had to reconsider some of the legacy implementations within Fiasco, so it helped us clean up some of the code, and in the micro-benchmarks we could see that we were also able to gain some minor performance improvements from that. For example, the dedicated I/O page fault support: we ditched it and moved that over to the general exception handling, which is fine.

We still have to think about what we expect from the future. We are constantly asked, also by our customers, how we can proactively protect against such vulnerabilities, because there will be more. One new field is the use of the FPU, and the related question of which optimization level the compiler uses, because higher optimization levels usually try to employ more FPU registers.