Yeah, so hello and welcome to our talk, Spectre and Meltdown versus Real-Time: How Much Do Mitigations Cost? So why are we doing all this? At the beginning of this year, people started to speculate on speculative execution of CPUs, and that literally hits the core of our systems, or the cores of our systems. Many CPU bugs have been disclosed since then, and there are already many benchmarks out there that investigate the effect of the mitigations for desktop systems and for data center appliances. There are many benchmarks that focus on web servers like nginx and Apache, on database servers, or on compilation, which is a popular benchmark. The same for encoding, and apparently Git is also used as a benchmark; it does not only produce CPU load, it also produces disk I/O load. And there are many other benchmarks. But to the best of our knowledge, none of them focused on the effect of those mitigations on real-time systems, where determinism and low latencies matter.

So what are we focusing on in this talk? We will only look at variants 1 to 4 — there are many other variants, but they will not be part of this talk — and the L1 Terminal Fault, L1TF, which was disclosed in, I think it was August this year. The commonality of those bugs is that, in a certain way, they exploit speculative execution and branch prediction units together with the behavior of caches and timing-based side-channel attacks. With this, attackers are able to leak sensitive information, and this breaks guarantees that are given by memory management units, such as memory protection. This is violated by those bugs, and attackers may be able to read not only memory of other processes but also memory from kernel space, and under certain conditions even arbitrary physical memory.

Nevertheless, we need to keep in mind that these are local vectors only, so an attacker needs to be able to execute arbitrary code on an affected system. If an affected system runs a browser which executes JavaScript, this is already the case, but we need to keep in mind that this is a local vector only. There are other attacks like NetSpectre that also form a remote vector, so a surface for a remote attack, but even under laboratory conditions they were only able to leak a few bytes per hour, so this will not be a focus of this talk. And we have to keep in mind that all of these attacks have a rather high complexity.

On the mitigations, I will particularly focus on x86 and on ARM64 platforms; those are the platforms that we measured during our analysis. I do not want to go into great detail on how those attacks work, because they are very complex — we want to focus on the mitigations only. For Spectre variant 1, the bounds check bypass, we have the so-called user pointer sanitization: when we enter kernel space, we need to protect inputs that might be forwarded by the user, such as, for example, the system call number. The system call number is a value that is used for an indirect array access in kernel space, and we need to prevent speculation on this value, because the user has control over it. Therefore, on both platforms, the so-called nospec accessors are used to protect against this attack; a minimal sketch of the idea follows below. Luckily, the cost per mitigation call on those systems is rather low, so it comes at a low cost.
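To make this a bit more concrete — the following is a minimal sketch of the idea, not the kernel's actual system call dispatcher; the table and the lookup() function are made up for illustration — the kernel's array_index_nospec() helper clamps an attacker-controlled index so that even a mispredicted bounds check cannot be used to read out of bounds:

```c
#include <linux/nospec.h>    /* array_index_nospec() */
#include <linux/errno.h>

#define NR_ENTRIES 64

/* Hypothetical kernel table indexed by a user-controlled value
 * (illustrative only, not the real syscall dispatcher). */
extern long table[NR_ENTRIES];

long lookup(unsigned long idx)   /* idx comes from user space */
{
	if (idx >= NR_ENTRIES)
		return -EINVAL;

	/* Clamp idx to [0, NR_ENTRIES) even under speculation, so a
	 * mispredicted bounds check cannot index beyond the table. */
	idx = array_index_nospec(idx, NR_ENTRIES);

	return table[idx];
}
```

The clamp is a handful of arithmetic instructions, which is why the cost of this particular mitigation is so low.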
For Spectre variant 2, it looks a bit different. On x86 as well as on ARM64, we have the CONFIG_RETPOLINE option in Linux, the return trampoline, which converts indirect calls and indirect jumps into returns, and in this way, again, prevents speculation. In addition to that, on x86 as well as on ARM64, we need the possibility to control speculative execution. On x86, this is done by the speculation control extensions, which come with microcode updates for Intel and AMD CPUs. On ARM64, we apply those mitigations in the firmware, in the secure monitor: to mitigate variant 2, we need at least ARM Trusted Firmware version 1.6, which implements the SMC Calling Convention version 1.1. And we also need to fill the return stack buffers on context switches. In sum, this leads to a very high cost per mitigation call on those systems, especially on ARM64, where we need to switch the privilege level to apply the mitigation.

To the other variants, variant 3 and 4, Meltdown and Speculative Store Bypass. For Meltdown, we now have the so-called page table isolation in the kernel. Initially, page table isolation was implemented to further protect kernel address space layout randomization — KAISER, as it was initially named by its authors — and now this protection is also used to mitigate the Meltdown attack. On x86, it's simply called page table isolation; on ARM64, it's called kernel page table isolation. It introduces different page directories for kernel space and for user space. Before page table isolation, kernel space and user space shared the same page tables; now we have to unmap the kernel pages from user space, which requires us, on each context switch or each switch into kernel space, to apply the mitigation and switch the page tables.

For variant 4, again we need microcode updates, and on ARM64 we again need an update of the ARM Trusted Firmware. The authors say that variant 4 comes with a notable performance impact. The mitigation is applied on a per-process basis using prctl and seccomp, and the cost per mitigation call, again, is of course very high. Variant 4 will not be a topic of this talk, because we did not activate it in our measurements.

Then there is a slightly different attack, the L1 Terminal Fault, also called Foreshadow. On affected systems, we either want to disable symmetric multithreading or use other mitigations to deal with this issue. This is not a particular problem for real-time systems, because we have deactivated symmetric multithreading ever since: running real-time payloads on SMT siblings causes undesired side effects due to contention and other effects that come with SMT. So this is nothing new for the real-time folks — we disable SMT anyway. The second mitigation is the page table entry inversion, where we simply flip the upper bits of non-present page table entries. This comes at a negligible cost, because we only apply a simple bit mask to the affected entries; a small sketch of the idea follows below. So the cost per mitigation call for native, usual real-time workloads is rather low, whereas for data center appliances and for virtualized environments, the cost of the L1 Terminal Fault mitigation is very, very high.
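To illustrate that last point, here is a conceptual sketch of the PTE-inversion idea — this is not the kernel's actual code, and the PTE layout constants are assumed for illustration only: for an entry that is not present, the physical-address bits are inverted, so that a speculative L1TF load through such an entry cannot hit valid, cached physical memory:

```c
#include <stdint.h>

#define PTE_PRESENT   0x1ULL
/* Assumed mask of the physical-address bits of a PTE (illustrative value). */
#define PTE_ADDR_MASK 0x000ffffffffff000ULL

/* Invert the address bits of a not-present entry (e.g. a swap entry), so a
 * speculative L1TF load through it points at nothing useful.  The cost is a
 * single bit operation per affected entry. */
static inline uint64_t pte_mask_if_not_present(uint64_t pte)
{
	if (!(pte & PTE_PRESENT))
		pte ^= PTE_ADDR_MASK;
	return pte;
}
```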
Okay, so that's it on the mitigations — and which CPUs are actually affected? Basically, all CPUs are affected to a certain degree, Intel leading: they are vulnerable to Spectre variants 1, 2, 3 and 4 and to the L1 Terminal Fault, whereas AMD is not affected by variant 3, Meltdown, and not affected by the L1 Terminal Fault. There are also some non-speculative Intel parts, like older Atom or Quark processors, that are not affected by those attacks at all. On the ARM side, it looks a bit different: there are some CPUs that are affected by variant 1, some that are affected by variant 2, and some that are affected by variant 4. No ARM CPUs are affected by the L1 Terminal Fault, and only the Cortex-A75 is vulnerable to Meltdown. Please look it up on the pages provided by the vendors; there you will get exact details on which CPUs are affected to what degree.

Okay, mitigations versus real-time: what do those mitigations cost? Before I explain the measurements that we did, I would like to refresh what real-time actually means. On real-time systems, in contrast to desktop or server applications, determinism matters: a deterministic response to certain, maybe external, stimuli. We might have time-sensitive, recurring, cyclic execution where we must give a response neither too early nor too late. So we have bounded latencies, a time window within which we must give an answer, and we must be sure that on every event we are able to give this answer within that time window. And in contrast to desktop systems and data centers, we optimize for the worst case, not for the average case. For these targets, determinism matters more than the actual throughput.

And then there are the mitigations. There are the cheap ones that I explained — the page table entry inversion, the user pointer sanitization — but then there are the expensive ones: at some points they might flush the caches or the translation look-aside buffers; we have the page table isolation, the return stack buffer fills, the return trampolines, the speculation-control calls on context switches; and we have to replace the microcode to introduce new instructions to control speculative execution. So we have a lot of potential sources of additional overhead in our systems, and those mitigations might, of course, increase the latencies and change the behavior of those systems. And this is what we want to measure with our tests.

The basic idea of our test is to run real-time payloads on multiple cores at once. We do this because we also want to measure side effects across cores, like shared caches or inter-process communication across cores. Then we measure the responsiveness of the system, and we repeat those measurements with and without the mitigations enabled, and with and without additional non-real-time workload on top, to also capture additional side effects that might occur. This is a standard problem of software product lines: we get a combinatorial explosion of variants that we might want to measure, so we confine ourselves to certain configurations. One important configuration is the system without any mitigation applied — the state at the end of 2017. Then we activate those mitigations that are necessary to protect against the vulnerabilities of the particular system: for instance, on an AMD system we apply the variant 2 mitigations, whereas on an Intel x86 system we also apply the page table isolation. And then, for a more fine-grained analysis, we measure the page table isolation only, to get its isolated impact, as well as variant 2 only.
And then we apply proper statistical analysis to strengthen the outcome, the trustworthiness of our results. The tools that we use for this are standard real-time tests: our real-time payload is the well-known cyclictest. Cyclictest cyclically arms a timer and then measures the jitter of the timer interrupt in user space. If we want additional non-real-time payload as a stressor on those CPUs, we use the stress-ng tool to produce additional load. We repeat these measurements in the hypervisor that we develop, our virtualized environment Jailhouse, because we want to know how those mitigations impact the overall behavior of the system when there is a virtualization layer underneath our operating system. Please keep in mind that complex real-time payloads may behave differently from the tests that we made — we have no inter-process communication within the real-time context — and these measurements shall just give you an overview, a tendency, of how those mitigations might affect your system.

So what does our measurement do? In user space, cyclictest starts and simply wants to sleep for one millisecond. It calls the kernel to set up the timer; the kernel says, okay, I'll do that, and goes to sleep because it has nothing else to do. Later, after a millisecond or so, it gets an interrupt from the timer and dispatches it. It sees that cyclictest requested the timer, forwards it to user space, and user space then measures the jitter: the time delta between when the timer actually arrived in user space and when it was expected to occur.

So how do those mitigations affect this path? We have certain points where the mitigations now take place. For Intel x86 systems, or for the Cortex-A57 for instance, we have to apply the page table isolation mitigation when we switch from user space to kernel space, as well as when we leave kernel space again. For the system call, we have to apply the mitigation for variant 1, the nospec array accessors, because the system call is dispatched via an indirect array access. Anyway, whenever we come to a point where we have an indirect jump, like a function pointer or something like that, we have to apply the return trampoline mitigation, and at some other points we might also use the microcode extensions, the new instructions, for the variant 2 protections. So there are several points that might affect the latencies.

In the second test, where we put additional load on those systems, we have a second user-space process running with a lower priority than cyclictest. Again, cyclictest wants to sleep for one millisecond and causes the kernel to go to sleep. The kernel sees, oh, okay, I have another process that I could now activate, and schedules stress-ng. Stress-ng runs; we do not know exactly what stress-ng wants to do on our system, so there might be further system calls in between. When the interrupt occurs, stress-ng will be preempted and the kernel will schedule back to our cyclictest. A minimal sketch of this kind of measurement loop is shown below.
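As promised, here is a minimal, self-contained sketch of such a measurement loop — a simplification of what cyclictest does, not its actual source:

```c
/* Minimal cyclictest-style loop (a sketch, not cyclictest itself):
 * sleep until an absolute deadline, then record how late we woke up. */
#define _GNU_SOURCE
#include <stdio.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000L
#define INTERVAL_NS  1000000L          /* 1 ms period, as in the talk */

static void timespec_add_ns(struct timespec *t, long ns)
{
	t->tv_nsec += ns;
	while (t->tv_nsec >= NSEC_PER_SEC) {
		t->tv_nsec -= NSEC_PER_SEC;
		t->tv_sec++;
	}
}

int main(void)
{
	struct timespec next, now;
	long lat_ns, max_ns = 0;

	clock_gettime(CLOCK_MONOTONIC, &next);

	for (int i = 0; i < 10000; i++) {
		timespec_add_ns(&next, INTERVAL_NS);
		/* Sleep until the absolute deadline... */
		clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
		/* ...and see how late we actually are: that is the jitter. */
		clock_gettime(CLOCK_MONOTONIC, &now);

		lat_ns = (now.tv_sec - next.tv_sec) * NSEC_PER_SEC
		       + (now.tv_nsec - next.tv_nsec);
		if (lat_ns > max_ns)
			max_ns = lat_ns;
	}
	printf("max latency: %ld us\n", max_ns / 1000);
	return 0;
}
```

A real measurement, as cyclictest does, would additionally lock its memory with mlockall(), run under SCHED_FIFO at high priority, and pin itself to the isolated CPUs.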
So now we have more mitigation points: on every switch of stress-ng down to kernel space, we have an additional page table isolation call. And let's say we are in the middle of the page table switch when the interrupt occurs — then we have to finish it before we can handle the interrupt, so this introduces additional overhead that is relevant for the jitter of the system.

So this is the explanation of the measurements we were doing. Our target systems are an Intel x86-64 server system with a fourth-generation Xeon E5 processor, an AMD x86 midrange system with a Ryzen 2700X, and for ARM64 we did the measurements on an NVIDIA Jetson TX1, which is equipped with four Cortex-A57 cores.

First of all, we somehow had to get our systems under control, and this took more time than expected, because we had undesirably high latencies on our systems. There were many tweaks and tunings that we had to do to get a system on which we could actually run our measurements. This was a lot of work, but it was nothing special; you can look at this wiki page if you want to know how to make your system more real-time capable. In addition to that, we also wanted to reduce the noise of the system — we want to get rid of interrupts on the cores that we are measuring, because we want to measure the effects of the mitigations and not the system as such. So we isolated the CPUs on which we run our measurements, we rebound the SMP affinity of some IRQs to the non-real-time CPUs, we rebound the network device queue affinities to the non-real-time CPUs, we disabled machine check polling, and some other tweaks. We also realized that if we run across separate NUMA nodes, we get undesirably high latencies, so we decided not to leave a NUMA node if the system has more than one socket.

It turned out that the taskset parsing in cyclictest is currently broken — this is already reported upstream — so for the first few days we measured on random CPUs instead of the CPUs we actually wanted to measure on. So we had some bugs there. Then, with stress-ng, we had a problem with its behavior together with the FIFO scheduler: alarm timers under real-time load are somewhat broken, because alarm timers will never be propagated to user space if the CPU is at 100% load, as they run in soft-IRQ context. These are fine details, but we are currently working on these issues, and we have workarounds so we could run our measurements anyway.

And the fun didn't end there; we had more issues to deal with. For the different mitigations, we had to exchange the microcode on x86: if you want to run the no-protection variant, you have to use old microcode, whereas if you want to apply the variant 2 mitigations, you have to switch the microcode to newer versions. So, a lot of reboots, and a lot of research into which microcode versions actually do what. On the Jetson TX1, the ARM64 platform, we were missing official support for an ARM Trusted Firmware that implements the SMCCC version 1.1 interface, so we had to build it on our own — compiling was easy, but getting it running on the target was not that easy. And we have different kernel variants for the different measurements: for instance, for no protection we want to deactivate the return trampoline, and so on.
We also had some issues with our virtualized environment, with the Jailhouse hypervisor: we hit numerous traps that we hadn't seen before, caused by highly parallelized load on many CPUs. On ARM64, we were actually missing proper Spectre variant 2 mitigations in our hypervisor; we have a proof-of-concept implementation running, and proper patches for mainline are currently in the works. So we had obstacles across the whole system stack, from bugs in user space to misbehavior in kernel space, issues in the hypervisor, and also in the firmware.

Now to the results of our measurements; let's look at the Xeon E5 system. On the left side of this plot you see the behavior of the system when it is not under load, the no-stress variant; on the right side you see the result of cyclictest when the system is under additional load. On the x-axis you see the latency in microseconds, and on the y-axis the number of occurrences of each particular latency. The vertical line denotes the maximum latency that occurred in the measurement. So without load we are somewhere under 5 microseconds, and if the system is under stress we are somewhere under 30 microseconds, which is pretty okay for such a system.

Now what happens if we turn on all mitigations for this system? All mitigations required for this system means we especially activate the mitigations against Spectre variant 2 and against Meltdown. Then the graph looks like this. It almost looks as if this were a constant shift to the right; we will see later that this is not the whole truth. But it looks consistent: we do have additional overhead, and it doesn't matter whether we look at the no-stress case or the stress case, we introduce additional latency on the system. And what does it look like if we only activate the page table isolation mitigation, without Spectre variant 2? The results are again pretty consistent: with PTI only, we are somewhere between no protection and default protection. It turns out that, at least for our workload, the variant 2 mitigation of Spectre costs more than the page table isolation.

But can we be sure? Can we really trust these results? This is where we need proper statistical analysis, and where I would like to hand over to Wolfgang.

Yeah, that's true. I can assure you that Ralf, after doing all these measurements and all these complicated setups, really started looking like Doc Brown in the third picture. So we decided to take some of the load off his shoulders — that's where the two of us come in. The question is: what do the data actually tell us? Can we learn more about the system behavior than just the maximum latency and the distribution? And the second question, of course: can you trust our measurements, can we trust our measurements? That's where the statistics come in, but fear not, I'm only going to take a few minutes to convince you that this is right, without going too far into the details. In a traditional real-time system, in a traditional hard real-time system, you would mostly be concerned about the maximum latencies, because if you exceed these maximum latencies, people may die if you are in a safety-critical system.
For PREEMPT_RT, the kernel that we have been looking at, the situation is usually not that bad: people use it for audio processing, for video processing, and you may be annoyed if you skip a frame or hear a glitch in your audio recording, but you are most likely not going to die. So it is not just the maximum, the extreme values, that are important, but also the behavior of the distribution. And by precisely checking the behavior of the distributions, we can actually also handle the second question, namely whether you can trust our measurements. The answer to whether you can trust us or not is in these graphs.

What we did is use the so-called shift function method. We would basically like to learn: do the distributions differ by a constant shift, does the spread differ, do they differ in more properties? To learn about that, we split the values into portions: we take the first 10% of the values of each distribution and compare them with each other — that is one point in these diagrams — then the next 10%, the next 10%, and so on, and compare these portions. From panel A to panel D we have an increasing amount of measured data points. What you see from these graphs is that the difference between the case with all mitigations on and the case with no mitigations is not just a shift by a constant increase in latency: since we have an upward slope, the latency increase gets worse and worse the larger the latencies become. So if you are already in trouble, then you are even more in trouble with the Spectre mitigations.

Now, since we compute the comparison between portions of the distributions, we can use proper statistics — a so-called bootstrap resampling — to estimate confidence intervals. I am not going to discuss in detail what a confidence interval says, but it is indicated by the vertical bars, and, very roughly speaking, the length of the vertical bars tells you something about the quality of the estimation procedure, that is, how well we learned the true difference between the distributions. You see the quality is not so good in case A, where we do not measure sufficiently many data points, but it becomes better and better towards D. If we looked at the final data, these bars would disappear, because we have very precise results that you can statistically rely upon. I am not going to bore you with further details; we can do that for the other measurements as well, with the same result. If you want a key message to take away with you — Bayesians would maybe talk about credibility — then, repeating Ralf's words, what we are telling you here is the truth, the truth and nothing but the truth, and statistics shows we are right.

Thank you, Wolfgang.
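If you want to play with the idea yourself, here is a minimal sketch of the decile-by-decile comparison behind the shift function — not the analysis code used for the talk, which additionally bootstraps confidence intervals around each point, and with made-up latency values purely for illustration:

```c
#include <stdio.h>
#include <stdlib.h>

static int cmp_double(const void *a, const void *b)
{
	double x = *(const double *)a, y = *(const double *)b;
	return (x > y) - (x < y);
}

/* Simple nearest-rank quantile of an already sorted sample. */
static double quantile(const double *sorted, size_t n, double q)
{
	size_t idx = (size_t)(q * (double)(n - 1) + 0.5);
	return sorted[idx];
}

/* Compare two latency samples decile by decile (the shift function). */
static void shift_function(double *base, size_t n1, double *mitig, size_t n2)
{
	qsort(base,  n1, sizeof(*base),  cmp_double);
	qsort(mitig, n2, sizeof(*mitig), cmp_double);

	for (int d = 1; d <= 9; d++) {
		double q = d / 10.0;
		printf("decile %d0%%: %+.2f us\n", d,
		       quantile(mitig, n2, q) - quantile(base, n1, q));
	}
}

int main(void)
{
	/* Made-up latencies in microseconds, for illustration only. */
	double no_mitigations[]  = { 2.1, 2.3, 2.4, 2.6, 2.8, 3.0, 3.3, 3.9, 4.5, 5.0 };
	double all_mitigations[] = { 2.5, 2.8, 3.0, 3.3, 3.6, 4.0, 4.6, 5.4, 6.5, 8.0 };

	shift_function(no_mitigations, 10, all_mitigations, 10);
	return 0;
}
```

An upward trend in the printed differences corresponds to the upward slope in the plots: the mitigations hurt the tail of the latency distribution more than its bulk.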
Okay, let's continue with the AMD system, the Ryzen 2700X. The distribution of the latencies is pretty similar to the Intel processor. In this graph you again see the distribution of the latencies with no protection activated. No protection means just that, and default protection in the AMD case means we have a new microcode version for the AMD processor and the Spectre variant 2 mitigation activated; we do not need to protect against Meltdown, because the system is not affected by Meltdown. So if we again activate the default mitigations, we get the same consistent picture as for the Intel processor, but it does not affect the overall behavior of the system, or of the workload that we are measuring, that much.

Okay, let's have a look at the NVIDIA Jetson TX1, the Cortex-A57 system. Again, these are the results without any protection enabled. Now let's activate the default protection. Default protection here means an update of the ARM Trusted Firmware together with the mitigations enabled in kernel space, so that the kernel actually uses those firmware parts. And this is what it looks like on ARM64. We see a drop of latencies somewhere in this window; we do not exactly know what causes this decrease in the number of occurrences, but we assume that in this window we take the path through the secure monitor, which eliminates a certain range of latencies.

These were all the measurements without a virtualized environment, and now we will also have a look at how those systems behave when virtualization enters the game — and this is where I want to hand over to Jan.

Yeah, thank you. So, virtualization is considered for two reasons. First of all, as Ralf already mentioned: if we are running in a virtualized environment, what impact does virtualization have? And second, we may wonder whether we can possibly also use virtualization as a mitigation. First of all, we picked the one hypervisor that we know best, which of course does not represent the whole ecosystem. This hypervisor, Jailhouse, is a specifically tiny and small one — there are separate talks about it. Just remember that it targets static partitioning only: it does no scheduling, it just does a one-to-one resource assignment, and it is specifically designed for hard real-time, which makes it a good candidate in this case as well. It has some specific properties here. In general, static partitioning allows you to reduce or avoid a certain set of attacks, simply by separating things via the virtualization, via the partitioning approach. That is a positive aspect.
Furthermore, Jailhouse has some non-static mitigations built in against the known issues, simply by avoiding what these attacks usually exploit: the confused-deputy scenario, where a kernel or a higher-privileged part of the system is tricked into leaking information about things that should not be leaked. That has been avoided in Jailhouse by now; it is a similar, just simpler, approach to what Hyper-V is doing as well — you cannot leak what you cannot see, simply speaking. And this of course also raises the question: can we use static partitioning instead of these not-exactly-cheap mitigation methods?

So, first of all, collect ground truth. In the upper bar you see basically the numbers you have already seen on the Xeon system, the multicore system, running our benchmark bare metal, and below you see — surprise, surprise — that virtualization is not for free: you see an increase of the maximum latency of the same real-time workload just from running it in a virtualized environment. One aspect you see is the hardware-induced virtualization overhead, but that is tiny — it can be a few nanoseconds up to a few microseconds. The bigger part you see here is the unavoidable interception of the multicore workload by the virtualization. The numbers may look better if you are not running a multicore real-time workload but a single-core workload, but anyway, this is what we got on this 8-CPU setup.

Now let's add the mitigations. The upper bar you know already, so there is an overhead, and below you now see, under the virtualized environment, the mitigations on Intel specifically. Well, they scale — it becomes worse. The overhead that virtualization introduces magnifies, so to say, the slowdown that the mitigations cause.

To confirm this, we also looked at the ARM system. Well, first of all, it is even worse. The primary reason, in this ground-truth comparison, why it is so much higher in the virtualized environment is that on x86, on Intel, we can exploit direct interrupt injection into the guest system with this static partitioning approach, and we cannot do this on this ARM system, so the overhead of any kind of interrupt is even higher on ARM than it is on x86. So we have a significant increase between bare metal and virtualized, even without any of the mitigations applied. With the mitigations enabled in the virtualized case, we have a further increase, of course, so the numbers get worse, and you see here, just like we saw before on x86, that there is a noticeable increase in the overhead that virtualization adds to the mitigations. And the overall overhead of virtualization compared to a fully mitigated bare-metal system is worse — so simply switching to virtualization is not really the solution in this case; it still comes with a cost.

Still, virtualization and partitioning also offer a kind of mitigation that we do not get from the individual mitigations: there is the potential that, by splitting the workload statically, you can mitigate what we do not even know about today, so it may bring a bit of future-proofness as well. That would be a benefit to consider for a real-time system, too. Furthermore, real-time systems are often systems that you do not like to touch in the field anymore, and if you can use partitioning as a mechanism to split the real-time part from
the potentially more exposed non-real-time part, then there is a potential that you can keep your real-time part unpatched and unmodified. And if you recall the backports of all these mitigations to older kernels, they were not without glitches and effort either, and you may not want to apply them on your running systems. So this is a potential alternative path as well. What you have to keep in mind regarding virtualization: as I said, Jailhouse is a very lightweight approach here; it is possibly the best case you can get architecturally regarding the impact of the mitigations. To be fair, we have not measured other hypervisors — this is just derived from design aspects — but you can easily imagine: there is, for example, no dynamic L1TF mitigation applied in Jailhouse, because by architecture it is not needed; at the point where you have a hypervisor design which does need it, for example because it does scheduling, the impact will be much higher.

Yeah, thank you, Jan. So let's come to the conclusion of this talk. I would like to repeat that the histograms that we showed you only investigated one specific workload and must not be generalized to arbitrary real-time workloads, but they should give you a tendency of how your system might behave when those mitigations are applied. So go out and measure your own real-time workload when you want to apply those mitigations. And as I said at the beginning of this talk, all those vulnerabilities only have a local attack surface. Usually you do not execute arbitrary code from untrusted parties on a real-time system, on a hard real-time system, so you have to raise the question of whether you are actually affected by these local vectors. If no one is able to execute arbitrary code on your system — for instance, if your real-time system has no network connection — then you can also turn off those mitigations. We still have some open issues: as I said, we have problems with the PREEMPT_RT alarm timers, we have problems with cyclictest's broken taskset parsing, and we are still working on proper Spectre variant 2 mitigations for ARM64. So thank you very much for your attention, and I think we still have a couple of minutes left to answer some questions.

Thank you. Did you measure the impact of the microcode only, when you have the old microcode and the new microcode?

So the question is whether we measured the impact of the microcode only, without applying the mitigations — the microcode brings new instructions. We did not measure the microcode only, but — and now the question is what the microcode actually does — that would be an interesting measurement. Actually, I would say that it only makes a difference if you really execute those new instructions, but you would have to measure it to exclude that it does any other voodoo magic, of course. So no, we did not do this measurement, but with our measurement setup we could; if you are interested in this, we can repeat it.

Any further questions? Okay, so again, thank you very much.