This week we've been talking about a lot of different technologies: artificial intelligence, big data, cloud computing. And, you know, everybody, when they introduce a new technology, uses the same phrase: Hadoop is the Linux of big data, or TensorFlow is the Linux of machine learning, or Kubernetes is the Linux of the cloud. But you know what all those things have in common? They run on Linux. It turns out Linux is very, very important. We have Greg Kroah-Hartman here today. He is a fellow at the Linux Foundation. He is a key kernel maintainer, maintains the long-term support kernel, and is a prolific developer, and he's going to discuss an important issue today around cyber security and give an overview of the Spectre and Meltdown vulnerabilities and how they have been dealt with in the open-source kernel community. Please welcome to the stage Greg Kroah-Hartman.
I'm Greg. I got to do my title in icons. It turns out every major security problem these days gets a little logo. So of course Spectre has a logo, Meltdown has a logo, and Linux has a logo. I'm going to go over how this all came to be and what it is at a very high level. And I want to say at a very high level because this is a very, very technical issue. I'm going to be very general, very vague. There are full notes with all the resources behind this. There was a really good talk yesterday, in Chinese, from a Canonical kernel developer who also gave some good information about this. See my notes, see his presentation. He did a good job.
So let's try and do this at a very high level. Spectre and Meltdown were announced in January of this year, and it was a big deal because of a number of different things. This is really a hardware bug. This is a bug, or an issue, in hardware, and the goal of an operating system, of a kernel, is to make the bugs in hardware go away so you can't see them anymore. We deal with that all the time on devices and peripherals.
That's the job of a kernel engineer: we work around hardware problems. This was unique in that the bug was in the CPU. CPUs normally do not have bugs. The bug is that valid code, code that is written correctly and properly, where everything is good about it, can trick the CPU into doing bad things. So valid code that we write gets executed by a CPU to do bad things. It's a new type of security research. Jann Horn from Google found this last July. It took six months for it to come out publicly. Two other researchers also found this at the same time, because this is an active area of research.
It exploits how CPUs work. Modern CPUs have to look into the future in order to go really fast. They have to guess what is going on in order to make your workload quicker. So they guess, and if they guess wrong they can unwind time, go back, and fix things up. But it's in that guessing wrong that the CPUs are actually affecting parts of the system in ways we never realized.
There are many, many different ways this is coming out, many different variants, as they call them. I'll go through a number of them. They all trick the CPU into doing different things, and you can do it in different ways. That's why there are different variants. And they're going to keep coming. It's been publicly said that there are going to be at least eight or nine of these, and a lot more are coming over time. This is just an area of research that's going to go on for a while. It's going to take the CPU vendors a long time to fix this because their pipelines are very long, four or five years. So it's going to be around for a while.
So let's talk about the variants. These are the abbreviations.
I'm going to go into them very roughly. The first three were announced back in January. The first two I'll go through later. 3a and 4 were announced back in March. Number five was leaked a couple of weeks ago, and the full details on this one will actually come out today, in the US time zone. I thought I could talk more about it, but I realized I was off. These are the numbers that we call them. If you look at the Wikipedia page, it goes into the details. The Wikipedia table has a lot of blank entries, and it will continue to grow.
So let's talk about the first one. I'm going to go into how this works a little bit to try and give you an idea of what is going on here. All these issues just trick the CPU into reading memory. The issue is that you can read memory that you don't normally have access to. You can read the memory of other processes, so you can go from one tab in a browser to read another tab's memory, or you can jump to another program's memory. With some of them you can read another virtual machine's memory, and that's where the cloud people got really, really nervous. You can go and read SSH keys from another virtual machine. You can go read Bitcoin wallet numbers from another machine. You can do bad things that way. If you have machines where you only run trusted code, where you know what it is and you don't have other virtual machines, you don't have to worry about these issues. But if you don't, if you run in the cloud, you have to worry about them.
We had to fix the core kernel for a number of these. For variant one we fixed the core kernel, and then we had to go through and fix all the drivers, and I'll tell you about that in a while. There's a lot more work to be done here.
We're lucky in Linux. We have access to all our driver code. Other operating systems are not so lucky, so they've had to use a much bigger hammer in the core, which slows things down. All of these will slow down your processor. People are seeing workloads dive; depending on your workload, some are not affected at all and some are hit very heavily. Do your own testing. I'll talk about that more in a minute.
So let's show this. I get away with showing code in a keynote. This is a normal, valid function in the kernel. What we do is take a value from user space, check to see if it's in the proper range, and read some values based on it. Normal, provably correct code. When the CPU sees this, it checks the value and reads the data, and it learns over time. If you give this function a perfectly valid value a couple of thousand times, it'll guess. It says: hey, I know this is going to be good, let's go read this value and have it ready for when the code actually makes it to this point. So it'll speculate; it'll speculate that you're going to get it right.
But what happens if you make it right a large number of times and then you give it a really bad value? What happens is that with this bad value the CPU will actually go and read some really weird portion of memory that it shouldn't, and have it ready for you. Then it does the check, says I shouldn't be doing this, and unwinds itself. But that additional work of going and reading some part of memory that it shouldn't have is observable. Other programs can see that happen, and by seeing that happen you can determine what that memory was. You can do this in a very slow way; you can read about 2,000 bytes a second, which isn't a lot, but it's enough. You can read the whole of another virtual machine's memory, dump it somewhere, and then scan it for SSH keys.
Normally this is perfectly valid code, and it still is perfectly valid code if you just do it like this. But if you happen to do two of these in a row, so you load from an array and then you do another array load based on that value, the CPU can go way off into the weeds, and then it'll come back and unwind. When it goes way off into the weeds and comes back, that is when bad things happen. And these two accesses, a speculation based on another speculation, don't have to be right next to each other. They can be large distances apart in your code. CPU look-ahead windows are huge these days. We thought we could get away with, oh, a hundred and some bytes of difference. No, there is no known range for how far apart these can be, so it can be at any point. You do something based on one user-space value, you jump again, and that is where the magic happens. That's what Jann Horn discovered. He verified it with Intel, and it became a big, big problem.
So how do we stop this? How do we solve this? Because this is valid code. This is good code that the CPU should be running. There are some core kernel changes, but then we have to go back and touch every single driver, every single time we make an access based on a user-space-controllable pointer or value to read the memory. We have to say stop.
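The slide itself is not part of this transcript, so the following is only a minimal user-space sketch of the pattern described above: a bounds check on a caller-supplied index followed by two dependent array loads. All names here (`table`, `probe`, `lookup`, `init_tables`) and the 64-byte stride are hypothetical and chosen for illustration; this is not the kernel function shown in the talk.

```c
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE   16u
#define PROBE_STRIDE 64u  /* assumed cache-line size in bytes */

static uint8_t table[TABLE_SIZE];           /* data the caller may index  */
static uint8_t probe[256u * PROBE_STRIDE];  /* indexed by the loaded byte */

/* Fill both arrays with known values so the lookups below are predictable. */
void init_tables(void)
{
    for (size_t i = 0; i < TABLE_SIZE; i++)
        table[i] = (uint8_t)(i * 3u);
    for (size_t v = 0; v < 256u; v++)
        probe[v * PROBE_STRIDE] = (uint8_t)(v + 1u);
}

/*
 * Architecturally this is correct: an out-of-range index never reaches
 * the loads, and 0 is returned instead.
 *
 * The concern described in the talk is what may happen speculatively.
 * After many calls with in-range values the CPU may predict that the
 * check passes and run both loads early for an out-of-range index.
 * The results are thrown away, but which part of `probe` was touched
 * can remain observable through timing.
 */
uint8_t lookup(size_t user_index)
{
    if (user_index < TABLE_SIZE) {
        uint8_t value = table[user_index];           /* first load     */
        return probe[(size_t)value * PROBE_STRIDE];  /* dependent load */
    }
    return 0;
}
```

Calling `init_tables()` and then `lookup(5)` returns 16 with the values above, and any index of 16 or more returns 0. Nothing in the source looks wrong, which is the point the talk is making.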
We have to stop the CPU by saying: we're going to index this array, and don't speculate beyond this point. So we have to manually find these locations in the code, and that's a hard thing to do. We have 20 million lines of code in the kernel. People started running static analysis tools on this. We have some good attempts at finding these, but it's actually really hard to determine where to put them. We had one tool that found a hundred different spots in the kernel; it turned out only five were valid. People are working on making these tools better. We have somebody at the Linux Foundation, funded by the Core Infrastructure Initiative, going through the kernel and fixing these bugs right now. He's doing a great job, but it's going to take a while. At every such point we have to say stop to the CPU.
So what does this look like? Let's see some more code. There we go. Now I get more C code. So this function, array_index_nospec: really, we're just determining a mask, returning what range this value should be in. We do some magic GCC macros and we do some masking at the end. But then the interesting thing is this array_index_mask_nospec, and that function is specific to each different type of processor. This issue is actually there for all types of processors: Intel, AMD, ARM, PowerPC, s390, MIPS, everything. That's the other unique thing about this bug. All modern processors have this problem. This is how it's fixed for x86-64: all we have to do is these two magic assembly language instructions, and the processor gets confused and stops speculating.
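The masking idea described above can be sketched in portable C. This is a simplified user-space approximation modelled on the kernel's generic fallback for array_index_mask_nospec, not the x86-64 version from the slide, which as far as I know computes the same mask with a `cmp` followed by an `sbb`. The names `index_mask_nospec`, `clamp_index_nospec` and `read_entry` are hypothetical. It assumes the size fits in a `long` and that the compiler shifts negative values arithmetically, as GCC and Clang do.

```c
#include <limits.h>
#include <stddef.h>

#define BITS_PER_LONG (sizeof(long) * CHAR_BIT)

/*
 * Returns all ones when index < size and zero otherwise, using only
 * arithmetic so there is no branch for the CPU to predict.
 * Assumes size <= LONG_MAX and an arithmetic right shift of negative
 * values (implementation-defined in C, true for GCC and Clang).
 */
static unsigned long index_mask_nospec(unsigned long index, unsigned long size)
{
    return (unsigned long)(~(long)(index | (size - 1UL - index))
                           >> (BITS_PER_LONG - 1));
}

/* In-range indexes pass through unchanged; out-of-range ones become 0. */
unsigned long clamp_index_nospec(unsigned long index, unsigned long size)
{
    return index & index_mask_nospec(index, size);
}

/* Example caller: the usual bounds check, then the clamp, then the load. */
int read_entry(const int *array, unsigned long size, unsigned long user_index)
{
    if (user_index >= size)
        return -1;
    /* Even if the check above is mispredicted, the index used for the
     * load is forced to 0, so the load stays inside the array. */
    user_index = clamp_index_nospec(user_index, size);
    return array[user_index];
}
```

The design choice is that the clamp is a data dependency, not a second branch: a mispredicted check cannot skip it, so a speculative load can only ever see an in-range index.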
I took a long time trying to figure out why these two magic ones are there and what they do, and I could not figure it out. There's a really, really good email thread on the Linux kernel mailing list between Linus and a number of other kernel developers figuring out how to do this. When we fixed this, we happened to do it all in public, so you can see how we did it, you can see how we tuned these functions, and you can see how we made things go faster, which is really interesting from a development point of view. We ended up with something very nice, very simple, and very fast compared to what other people have come up with in the past. So, assembly language in a keynote. Yay.
So let's go back. In every single function where we do this, we have to add this array_index_nospec. When we do that, the CPU stops; it won't speculate beyond that point. We read the value and go on, great. But what this means is that now the CPU is slower, and we know this. Every time you hit one of these functions, the CPU cannot look ahead into the future. So we will affect your workloads.
The interesting thing is that newer kernels go faster than older kernels. It's been proven. We do better; we figure things out better. On some workloads this fix can cost you five to six percent, maybe ten percent, but newer kernels are faster. Facebook published numbers saying that moving from a 4.9 kernel to a 4.14 kernel gave them a 5% or 6% increase in speed just by doing that, but when these mitigations were added they lost 4% of speed. So they came out ahead. They said: with newer kernels we're actually 1% faster even with these fixes. Older kernels got slower. Some operating systems, some enterprise kernels, had to stay at older versions, and they got even slower. And then they implemented things differently. I'll go into that in a minute.
So when did we fix this? It was publicly announced January 2nd or 3rd. We've been worried about it.
And some of us have known about this for a long time. It kind of ruined our Christmas vacations. x86 we fixed in January. For ARM, we found a fix in February, for just some kernel versions. And I say fix: all these dates I will be talking about are the first time this was fixed. It was not the final fix. We got better over time. We found better ways to do things. We sped things up. Those original fixes were slow, and we've made them faster, as you can see from what we have today. That's proof that the fixes are going to keep coming.
These fixes are not only needed in the operating system. You have to fix your processor. Processors have microcode or BIOS updates, and you have to take those BIOS updates. That's the only way these are going to get fixed. Some of these variants can only be fixed by the BIOS, and some of these variants can only be fixed in the operating system. You have to keep updating your microcode. Intel and ARM have kept releasing newer versions of microcode. Different ARM vendors are implementing things in different ways. Qualcomm is infamously fixing this in a different way than other SoC vendors, so they have different microcode updates and different kernel changes. But keep updating, because we keep making them better.
As proof of this, we fixed it again in May. And again, on ARM we fixed it in 4.16 and 4.9. I realized when preparing this talk that we never went back and fixed those patches in the 4.14 kernel. I'm going to go talk to the ARM developers. And we didn't fix them in anything older than that. You'll notice a lot of 4.4 kernels are not fixed here, and I don't know if they ever will be. So if you're using those kernels, move to a newer one or use an enterprise kernel. SUSE is actually really good. The SUSE developers helped out a lot with this. They have a 4.4-based kernel and they fixed these problems in there, but they fixed them in a way that we couldn't take upstream. Which is fine, but it is a little fork in the way things go. Red Hat fixed them in a different way as well, and their patches didn't apply to upstream either. But I will call out the SUSE developers, who are doing a really, really good job of helping us. They went above and beyond.
So let's talk about variant number two. Variant number two is actually really weird. CPUs try to figure out which way the code goes. The CPU guesses which way you're going to jump, and if it gets it right, you're great. If it gets it wrong, it has to unwind things. Well, if you jump one way all the time and then tell it to do it again, you can make it jump off into the weeds, you can observe that, and again you can read data from the kernel. You can do that, or you can read data from another virtual machine. You can abuse this.
This one we had to fix in the compiler, we had to fix in the kernel, and we had to fix in the BIOS. It had to be fixed in all three places. Matt Linton from Google came up with something called retpoline. It's like a trampoline, and what it does is protect function pointers from being abused. They did that by modifying the compiler, so we can rebuild the whole kernel and all the drivers with this. You do that and everything gets fixed. There's a little tiny performance decrease, but it gets fixed. He published the paper.
There's a link to it there. It's a really interesting read. I'm really happy that he did this work, researched it, published it at the same time that all this came out, and gave it away to the public. It's a really nice piece of work. Unfortunately, other operating systems could not implement this because they can't rebuild the world. Microsoft, and, well, Apple didn't fix any of this. Microsoft fixed this in a different way because they couldn't rebuild all the drivers, so they had to do some other fun things, but they also modified their compiler. They wrote a really interesting blog post about how they modified Visual C to fix these problems. But again, three things had to be fixed here. Again, a very unique security issue, where different teams had to talk together and work together.
Let's talk about Meltdown, maybe. Oh, we fixed it. Here are the dates we fixed this: back in March. We finally got around to it. It took us a while because we had to coordinate with the compiler people. And then again we fixed it more and more in newer kernels; we got things set up better. So if you have an older kernel, based on those dates, pick a newer one. Your computer will go faster.
So, Meltdown. Meltdown happened at the same time. It's kind of related. It was found by some different people, and we call it Spectre variant number three. This one just lets you read what's in the kernel from user space. You can't really cross virtual machines, but it's still a bad thing. Many years ago, some researchers in Austria published a paper, the KAISER paper, and implemented it. It was to try and solve some problems where you could guess what the kernel addresses were. They proposed some patches for it, called the KAISER patches. It turned out you could solve Meltdown the same way. What you have to do happens every time you enter the kernel and every time you exit the kernel, and it slows things down by a chunk.
It moves the memory around in different ways, so you physically can't see things. This was a huge, huge change to the kernel. These changes are what made people realize something was coming. We were doing this work in public. One news organization famously said: all these changes are happening really late in the development cycle and Linus isn't yelling at anybody; something's going on. So they had a hint that something big was happening. It took us about 200 different patches to implement this properly for upstream. For older kernels we couldn't take those 250 patches. We had to take the KAISER patches and add them to the kernel. The different older distributions each did this in their own way.
This one really, really will slow down your machine. It really affects certain workloads. Different distributions and different kernel versions implement this in different ways, so much so that it's obvious in benchmarks. Kurt Garloff from T-Mobile published a paper showing, for the same kernel version across different distributions and the kernel.org version, how the speed is affected. Luckily the kernel we released in the community was the fastest. One of the major enterprise distros was the slowest by far, because they were trying to be very, very safe. Some other enterprise distros were in the middle. So how fast or how bad this is going to be depends on what you're running. Benchmark your workloads; the impact of this one is going to be very, very specific to your workload. This one really hurt a lot of people in the virtual machine space. It's going to make things slow. We're sorry. We don't know how to do it any better.
This one we had ready to go by the time it was announced. On January 2nd, when it leaked, and it leaked about a week early, we had fixes ready. Boom.
They went out. They were fixed in that kernel version; the other stable kernels got them a few days later because I had to have some review. ARM fixed them much later. The backported patches also don't fix things the same way. Andy, one of the core x86 kernel people, and I have said publicly that there are some holes in the old backports. We don't know if those holes are exploitable, but there are holes. We know where they are, and we'll tell you where they are, but we don't know if they're exploitable. To be absolutely sure you're safe from this, use the latest 4.14 kernel or newer. So if you're really worried about this, use a newer kernel and you'll be fine.
Variant 3a came out in March. So now we start getting the research from other people who see this area and ask: what other parts of the processor can be abused? With this one you abuse the way the system registers are read in a processor. You can, again, read data from the kernel or another virtual machine. But it's solved the same way we solved number three. So if you have the Meltdown fix implemented for your system, you're safe. It's a nice piece of research, kind of interesting, and it was already solved. So if you weren't patched for that one because you didn't worry about it, here's another reason to worry about it.
So let's talk about four. Processors not only can jump to different places and read memory in different places, they can execute and read in other places too. We're seeing that CPUs are very, very odd black boxes. They can do things that we don't really realize they can do. With this one, again, you can read data from the kernel or from another virtual machine. This one was a lot easier to fix. There are some minor code changes in the kernel and some major code changes in the microcode. The kernel is fixed. The microcode has not been released. I will lean on Intel again and publicly say we're ready.
We're ready for you. Intel has some beta microcode out there. If you're really worried about this, talk to them. They should get on it. They knew about this. They had plenty of time. We fixed this on x86. I do not think this is a problem on ARM, but I'm not sure. It might be. Talk to your ARM vendor if you're really worried. ARM hasn't really said anything about this. Again, we only fixed it back to 4.9, 4.14, and 4.16.
Number five: floating point. This one gets interesting in that we fixed this problem in 2016, accidentally. If you take newer kernels, again, as I said, you get things done faster; newer kernels run faster. We also fix problems we didn't know were there. We didn't know about this being a security issue. We knew it was a bad idea, because CPUs actually ran slower. So we ripped this code out way back in 4.6, in 2016. It uses the way the floating point registers can be restored. The BSD developers had this code still in their kernel. I think Microsoft had this code still in their kernel; they've ripped it out and replaced it since then. This leaked a little bit early because the BSD developers were poking around, found this problem, and talked about it publicly last week or two weeks ago. So it kind of leaked. The full details about this are going to be published today. The Xen developers also published a really good report on it. The researchers in Germany are going to publish a report. There's a redacted report; the full one will come out later today on how this works.
But again, we fixed this a long time ago, back in May of 2016 in the 4.6 kernel. This type of fix was a major architecture change. We never backported it because it didn't look like it was a big issue, but I did for the 4.4 kernel. So if you're using these kernels, you're wonderful. You're safe. Here's another reason: if you're using a newer kernel, you didn't even have to worry about this one. All was good.
If you were using an old kernel, here's another really good reason why you should never use an old kernel. We do fix things.
So why is this a big deal? Again, these are CPU bugs. This is taking code that we thought was correct, that was formally proven, and abusing the CPU in ways that make it incorrect. That's a really radical change in people's mindset. All operating systems were affected. Everybody was hit by this. It wasn't just Linux. It wasn't just Windows. It was OS X, embedded people, hypervisors. Everybody got hit.
The good thing about this is that now all the kernel developers from the major operating systems are talking. The Linux kernel developers and the Microsoft Windows kernel developers have a back channel. We talk to each other now. We have found problems in each other's kernels at times, and we point things out. So it's a nice side effect. It has brought the community of kernel developers together, in a way, because we are all worried about how CPUs work.
Performance is going to decrease. That's a big deal for a lot of people. A lot of cloud, a lot of virtual machine benchmarks are suddenly taking a dive. You can claw some of it back by moving to newer versions of the kernel, but it will affect performance. There's nothing we can do about this.
This is a totally new class of vulnerabilities. These come around every couple of years, maybe every ten years. This is a whole new class of research. It's going to be worked on for a long time from now. We're going to be living with this for a very, very long time, and we're going to be fixing these for a very long time. Keep updating your kernels.
We are going to keep finding these problems, we're going to have to keep updating for them, and you need to update your BIOS.
So when this first came out, a lot of you saw the rants from Linus in public. I have complained about this in public. The way this was released was very, very unusual, and it was handled, as far as I am concerned, very badly. We were notified very, very late in the game. Jann found this in July. Kernel developers, some of us, were notified about this in October. It came out at the end of December, early January. The way they notified us is they notified the companies. A traditional company, when it finds a problem, talks to other companies, because that is what it is used to doing. It turns out the majority of the world does not run company-based kernels. The majority of the world runs kernel.org kernels or community-based kernels such as Debian. One major cloud provider in the top three told me that less than 10 percent of their workload is enterprise kernels. Over 90 percent is Debian or kernel.org kernels. The world has changed in the past five years, and companies have not realized this.
When Intel was notified of this problem, they started dealing with the companies. They dealt with SUSE individually. They dealt with Red Hat individually. They dealt with other companies individually, and they didn't let the different groups work together. All the developers at those companies are actually part of the kernel community. Normally they work together, they find out the best way to solve things, and they solve them together. We were not allowed to do that for this. Because of that, you'll notice the enterprise distributions' implementations are radically different. Red Hat's solution for this is radically different from Oracle's solution, radically different from Canonical's, radically different from SUSE's. SUSE is actually close to the upstream kernel.
Again, those developers did a good job. So that was a big, big deal that made us really mad, because the community and the kernel.org kernels were vulnerable until really, really late in the game. The major cloud providers who base their systems on the kernel.org kernels were upset. They were caught flat-footed and they needed to change really fast. Intel realizes this. They've worked with us on this, and for the future ones that are coming out, they've changed. They now notify us. They're allowing us to talk to each other about this. It's getting better. It's not perfect. We still have some minor complaints, but it is getting better. They have learned. This is why you saw a lot of us really, really grumpy and really, really upset in the beginning: we weren't allowed to work on this together.
So how do you keep up to date? How do you keep a secure system? Do this: take all the releases I make. Take all the security patches, all the stable updates, and do not cherry-pick your kernels. Don't say, oh look, I'll take this fix from stable, and this fix from stable, and this fix from stable, because everybody gets it wrong. I've audited almost everybody's major kernel these days, and the people who try to cherry-pick fixes miss things. Just take the whole thing. The community supports the whole batch of patches together. The community will not support you if you cherry-pick different fixes. We know this whole thing works together; take it.
There's no reason not to. Using Git, you can merge things easily. Just take the whole thing.
Enable the hardening features in your kernel. Every new kernel release adds new security features, new ways to protect yourself. Some of these Spectre issues were not affected by these hardening features, but other security bugs are. I've seen loads of very, very famous brand-new phones with new versions of kernels that do not enable our security features, ones that cause no performance issues, and so they're open to a whole class of vulnerabilities. Enable those options. They're there for a reason.
Keep updating your major kernel version. If you can move from 4.4 to 4.14, do it. If your SoC keeps you on an older kernel version, work with the vendor and complain. But move to newer kernel versions, because they're faster. It turns out we speed things up. We make things work better. We're not doing these releases just for the heck of it. We make things better. So use a newer kernel version. Again, Facebook publicly said: newer kernel version, no performance loss, because they went to a newer kernel version.
And update your microcode and your BIOS. Those updates are being released for a good reason. Again, it cost the vendors a lot of money to do this. Take those updates. They are fixing problems. Again, variant 4 needs a microcode update. You're going to have to update your microcode. You're going to have to update your BIOS. Do it.
So I don't mean to be all doom and gloom. It's really easy. Just keep updating and everything will be fine. Thank you very much.
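As a companion to the advice above about keeping kernels and microcode updated, here is a small sketch of one way to see what a running Linux kernel reports about these issues. It is not something shown in the talk. It assumes a kernel new enough to expose the `/sys/devices/system/cpu/vulnerabilities` directory, and the function names `read_vuln_status` and `print_vuln_report` are made up for this example.

```c
#include <stdio.h>
#include <string.h>

#define VULN_DIR "/sys/devices/system/cpu/vulnerabilities"

/*
 * Reads the one-line status for a single entry (for example "meltdown")
 * into buf. Returns 0 on success and -1 if the entry cannot be read,
 * which is what happens on kernels that do not expose it.
 */
int read_vuln_status(const char *name, char *buf, size_t len)
{
    char path[256];
    FILE *f;
    int n;

    if (len == 0)
        return -1;
    n = snprintf(path, sizeof path, "%s/%s", VULN_DIR, name);
    if (n < 0 || (size_t)n >= sizeof path)
        return -1;

    f = fopen(path, "r");
    if (!f)
        return -1;
    if (!fgets(buf, (int)len, f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';  /* strip the trailing newline */
    return 0;
}

/* Prints the status of the entries matching variants 1, 2, 3 and 4. */
void print_vuln_report(void)
{
    static const char *const names[] = {
        "spectre_v1", "spectre_v2", "meltdown", "spec_store_bypass"
    };
    char line[256];

    for (size_t i = 0; i < sizeof names / sizeof names[0]; i++) {
        if (read_vuln_status(names[i], line, sizeof line) == 0)
            printf("%-18s %s\n", names[i], line);
        else
            printf("%-18s (not reported by this kernel)\n", names[i]);
    }
}
```

Calling `print_vuln_report()` from a `main` prints one line per entry. What each line says depends entirely on the kernel version, the CPU, and the microcode in use, so no sample output is given here.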