Please welcome Colin Percival; he'll be talking about 23 years of software side channel attacks.

Okay, this isn't my first time speaking at EuroBSDCon, but it's been a while since I was last here. For those of you who don't know me, I've been a FreeBSD developer since 2004. My first major contributions to FreeBSD were freebsd-update and portsnap. These days, my main contribution is maintaining the FreeBSD EC2 platform, so when Amazon changes things, which they do on a fairly frequent basis, I make sure that newer versions of FreeBSD will support new devices, new features, and so on. I spent seven years as FreeBSD security officer, and, relevant to this talk, I am an occasional cryptographer. You'll see later on some of the work I did, but also scrypt, which I published in 2009, is a key derivation function. My day job is as the author of Tarsnap. This actually came out of my work on FreeBSD: I was concerned as FreeBSD security officer that I had a lot of very sensitive information on my laptop, and if somebody got access to my backups, they might find out about a large number of remote root vulnerabilities that we were in the process of fixing. I couldn't find any good solution for backups, so I decided to make my own. Tarsnap is my day job, and it's what's paying for me to be here.

So I'm going to talk about software side channel attacks. Just so you have some idea what I'm talking about here: black boxes. We think of code as being a black box. Inputs go in, something happens, and then output comes out. But in fact, these black boxes are not very black. They leak information via electromagnetic radiation. They leak information via power consumption. Sound is actually more of an issue than you might think: depending on how much power a system is using, it will heat up, and as things heat up, they expand, so you can actually hear the power consumption of devices as well as measuring it over the wire. Also, of course, information comes in, something happens, and information comes out, but how long it takes before that output comes out reveals information. And then in some cases, these black boxes have internal state, and you might not leak information during the process of performing an operation, but you might be able to extract it later on through another operation.

If you're leaking information deliberately, we call this a covert channel. This is relevant in situations like mandatory access control. In the 1970s, the CIA had shared computers, which some people with top-secret clearance used and some people with lower clearance used. You want to make sure that the people with top-secret clearance cannot deliberately leak information to the people with lower clearance, because the people with lower clearance aren't going to be searched when they walk out the door with it. But relevant to this talk, if you're leaking information accidentally, then that's a side channel. By software side channels, I mean simply those that you don't need any special hardware or physical access to exploit. If you're looking at something over the internet, you won't be able to measure electromagnetic radiation or power consumption, at least not with the granularity that you need. But you can measure how long an operation takes, and in some cases, internal state may leak at a later point. And if you can get anything that we care about through a side channel, then we say you have a side channel attack. Typically, when we're looking at leaking secrets through side channels, we're talking about cryptographic secrets.
There's two reasons for this. First, historically, information that was being dealt with cryptographically, whether it's being encrypted or decrypted or the keys themselves, was important information, because why would you bother using cryptography on anything that doesn't matter? Of course, these days you're probably using TLS when you watch cat videos, so it's not so much of a correlation anymore, but historically it was. And second, side channels are inherently low bandwidth, so if you're going to be using an attack like this, you're going to want to apply it to a situation where there's a small amount of data which is very valuable. Cryptographic keys might be 128, maybe 256 bits. It's a lot easier to leak that much data than it is to leak a file of gigabytes, for instance.

Now, I said I'm going to talk about 23 years of side channel attacks, but to set the stage for this, I need to go back a little bit further: 42 years. In 1977, Rivest, Shamir, and Adleman published the RSA cryptosystem. This was the first published asymmetric cryptosystem. I say published because, in fact, asymmetric cryptography was discovered by a mathematician at GCHQ about four years earlier, but it wasn't published. At this point, it was essentially a mathematical curiosity. For those of you who went to Warner's talk about the early days of UNIX yesterday, just a back-of-the-envelope calculation: if you were on a PDP-7 and you wanted to run RSA with typical key sizes, the box was big enough, but it would take you about five minutes to do a single cryptographic operation, a private key operation, and it would take you about two weeks to generate your keys in the first place.

But, of course, computers get faster over time, and they get bigger, and people write more code. So, June 1991, Phil Zimmermann released PGP. Originally, he intended to release it only to the US. He didn't realize that the US-only tag on Usenet messages wasn't actually enforced, so what he thought he was publishing to the States ended up being published around the world. The US government was not very happy with him; there was a five-year saga where they tried to throw him in jail for exporting munitions illegally, because as far as the US government is concerned, cryptography is a weapon of war. This isn't where the story of side channel attacks starts, though, because it's really difficult to target PGP with a side channel attack. Somebody downloads an encrypted e-mail message. They type in their passphrase. It decrypts it. They read the e-mail. At some point later, they write a response. You can't really measure what's going on on their system, unless you're actually on the system, of course. For common uses of PGP, there wasn't much opportunity for attackers to get close enough to the cryptographic operations.

February 1995, however, SSL 2.0 was released, and this completely changes things as far as side channel attacks are concerned, because now RSA is being used interactively. You've got a web server which is connected to the internet. A message comes in which is encrypted with RSA, and that web server is immediately doing this cryptographic operation; something like 400 milliseconds later, it's finished doing the arithmetic and it sends a response back. It's not going to sit around and wait five minutes for you to read an e-mail, because there's somebody trying to load a web page. And so, at this point, there's suddenly an opening for timing attacks. And as you might expect, it didn't take very long.
So in 1996, Kocher published this paper, "Timing Attacks on Implementations of Diffie-Hellman, RSA, DSS, and Other Systems". There's a lot of cryptosystems mentioned there, but what they all have in common is that they use large-integer arithmetic, and the standard way of doing that arithmetic at the time used non-constant-time modular multiplication routines. It was fairly straightforward to look at the operation being performed and figure out: is this multiplication going to take slightly more time than normal, or slightly less? And if you can do that, then you can feed particular inputs to this web server, the RSA oracle that's doing these calculations for you, and by measuring how long it takes to get a response back, you can get one or two bits for each message that you send it. So with around a thousand RSA operations, around a thousand connections to the server, you can steal its private key.

This was a theoretical attack; it really wasn't very practical. It took around 400 milliseconds to perform the entire RSA operation that you would be measuring, and the difference between a fast modular multiplication and a slow one was around 20 microseconds. So you would have to measure a difference of one part in 20,000 in order to distinguish whether a particular bit of the cryptography is a zero or a one. And to make it worse, consider the networking of the time: Fast Ethernet had just been released, and a 1500-byte packet took 120 microseconds to transmit. So your signal could be absolutely dwarfed by a packet getting in the way over the network.

But over time, attacks always get better. In 2003, Boneh and Brumley published this paper, "Remote Timing Attacks are Practical". They attacked RSA in a slightly different way: they were looking at a timing channel in Montgomery reduction, since RSA implementations were now done slightly differently, slightly more efficiently, using the Chinese Remainder Theorem. But the key observation they made was: if you can't measure how long it takes to perform one operation with enough accuracy, you can measure how long it takes to perform lots of operations and calculate the average. And if you average enough, your noise level goes down. It's like public opinion polling, where the margin of error shrinks with the square root of the sample size: if you poll four times as many people, the margin of error on your opinion poll is half the size. So rather than timing around a thousand RSA operations, they're now timing about two million; they're making two million connections to a secure web server. They say in their paper that it takes about two hours to do this. So the joke at the time was: if you suddenly notice that your mostly idle web server has had its CPU pegged at 100% for two hours, somebody's trying to steal your RSA key. Of course, that's if somebody's being really obvious about it. You can do it in two hours, but you could spread the connections over a longer period of time and then not have as obvious a signal in terms of the amount of traffic.

But this attack, which had been considered purely theoretical against RSA, at this point suddenly became real. You can steal people's cryptographic keys over the network by measuring how long it takes for them to respond to you. So you need to fix your web servers, your OpenSSL stack or whichever RSA stack you're using. Fortunately, there's an easy way to fix it for these particular attacks. These attacks make use of chosen inputs: you're connecting to the server, and you're deciding what value to send to it for it to perform an RSA operation on. This is a very common theme in cryptography.
It is always easier to attack cryptosystems if you can make use of chosen inputs, whether it's chosen plaintext or chosen ciphertext. And in fact, it's common practice in designing cryptographic protocols that if you want a protocol to be secure even when some of your components end up being not as strong as you hoped, you design it to avoid allowing attackers to choose which values will be operated on.

So the defense they provided here is the blinding approach: rather than taking the value x that you've been given and raising it to the power d, you pick a random value r, raise it to the power e, which is a small number, the RSA encryption exponent, and multiply that into x. Now you have a different value, which you raise to the power d, your RSA decryption exponent, and then you divide out the value r at the end. Because of the way RSA works, r to the e to the d is just r, so it cancels out and you get the same answer: (x * r^e)^d = x^d * r (mod n), and dividing by r gives back x^d. But you're not performing the exponentiation to the power d on a value that was provided by the attacker. Since the public exponent e is much smaller than the private exponent, calculating r to the power e and r inverse are both very fast; in the Boneh-Brumley paper, they say applying this defense typically costs 2-5% in performance. And as long as a new random value r is chosen for every operation, there's no way for the attacker to predict what value you're going to be raising to the power d, and so there's no way that a chosen-input attack can reveal anything to them. They will see that some operations are faster than others, but it will be completely random, as long as it's a different random value r for each operation.

The next year, though, Bernstein published the first cache timing attack. Now we're moving away from pure timing attacks, where you can look at the C code and say, well, in this case we're doing more operations than in this other case, and we start looking at how the details of particular CPUs matter. AES is the Advanced Encryption Standard. It's the standard symmetric cipher which is used in pretty much every system out there; well, now some people are switching to ChaCha, but for about two decades it was the standard way that we performed symmetric cryptography. A straightforward implementation uses S-box table lookups: at many points in the AES computation, you take a single byte value and look it up in a table of 256 entries. Depending on the implementation, you're looking up one byte or four bytes, but the point is, you are looking up a value in a table. And the first set of values that you look up are always the key bytes XORed with the corresponding bytes of the block of input to your cipher. If some table entries take longer to access than others, you'll find that certain input values take longer to compute than others, and which input values take longer tells you something about the key, which is being XORed with the input before the table lookup. And it turns out that, yes, in fact, there are many reasons that certain offsets will take longer to look up than others. There's whether the value is in the cache at the time. There are load-store conflicts, where you're trying to load and store to the same part of the cache at the same time with different instructions. There are cache-bank conflicts, if you're trying to load two different values from different locations, but locations that are part of the same bank of the cache.
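To make the leaky pattern concrete, here's a minimal sketch of what such a table lookup looks like in C. This is a simplified illustration, not real AES and not OpenSSL's code; the table name and contents are hypothetical:

```c
#include <stdint.h>

/* Hypothetical 256-entry lookup table, standing in for an AES S-box
 * table. (Real table contents omitted.) */
static const uint32_t sbox_table[256] = { 0 };

/* The table index is plaintext XOR key, so which entry -- and hence
 * which cache line -- gets touched depends directly on a secret key
 * byte. Any timing difference between table entries leaks key bits. */
uint32_t first_round_lookup(uint8_t plaintext_byte, uint8_t key_byte)
{
    return sbox_table[plaintext_byte ^ key_byte];
}
```

An attacker who can choose plaintext bytes and measure timing learns which indices are slow, and from that, the key byte that was XORed in.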
And it turns out that with about one billion random inputs to AES, Bernstein was able to steal AES keys. Now, the defense against this is pretty straightforward: don't do AES in software. These days, pretty much every CPU out there has AES hardware circuits. They don't have any table lookups, they just have circuits, and they don't have any side channels, at least no side channels of note. If you do need to run AES in software, it is possible to do it without having any table lookups; Bernstein wrote at great length about ways to do this, but all of the ways of doing AES in software while avoiding side channels are very slow. So you really do want to use the hardware circuits whenever possible.

Moving along to my contribution to this story: at BSDCan 2005, I published this paper, "Cache Missing for Fun and Profit". This was the first published attack on Intel hyperthreading. The attack here takes advantage of the fact that there are two threads running on the CPU and they're sharing the same L1 cache. What you do is, in the thread that you control, you fill the cache with your data. You access a whole bunch of different locations in memory, and the cache will obligingly load those values for you; it assumes you're going to be using them again. Then a moment later, you go back and try to access all those same locations again, but this time you measure how long each access takes. How long it took reveals to you whether those values are still in the cache or not. If the other thread hasn't touched any memory, everything will be fast. If the other thread has touched some locations in memory, then some of your data will have been displaced from the cache. And which values, which memory locations, have been displaced tells you something about which particular memory addresses the other thread used.

It's important to note here that in this attack, you're never measuring how long a cryptographic operation takes. The only thing you're measuring is how long it takes for you to access your own memory. So people have sometimes referred to my attack as being a timing attack, but it isn't. It is, in fact, a microarchitectural attack, and this is an entirely new family of attacks. I mentioned before that with side channels, sometimes you have information which stays inside that black box and can be extracted at a later point. Here, the cryptographic operation is leaving some state behind, in the form of whether it loaded values into the cache, and then I am extracting that information by measuring my own operations to see whether data was loaded into particular cache lines or not.

And this allows a much higher bandwidth. With a standard timing attack, an operation is being performed, you measure how long it takes, and you might get one bit of information: was it fast or was it slow? Maybe, you know, was it very fast or slightly fast or slightly slow or very slow, so you might have two bits of information. More often, from a timing attack, you get something like 0.1 bits of information, because you need to perform the operation many times and average things to get even a single bit. But here, because you can be measuring the state of the cache while the operation is going on, you can get enough information to steal the entire RSA key just by watching a single operation.

And this is what it looks like. This is a slide from my talk in 2005. The x-axis here is the cache congruence class. On the CPU I was looking at, I think it was a Pentium 4, I can't remember exactly.
On the CPU I was looking at, there were 32 cache congruence classes: 32 different places that data might get loaded into when you access it from memory and it comes into the cache. And the y-axis is time, in cycles, as we go through the measurements. All I did here was repeatedly measure how long it took to access the same locations in memory; I believe the black cells are the ones that took longer, and the white cells took less time. And you can see certain patterns there. There's a repeating pattern on the right, about four or five blocks high, which corresponds to every time OpenSSL is squaring a value, and then a different pattern, maybe seven blocks high, which corresponds to a modular multiplication. You can also, if you look very closely, see some cells which are darker, because those are the values being loaded out of memory for the particular values that OpenSSL is multiplying by. And you can see three horizontal lines, which I think are clock interrupts, where everything gets displaced because the kernel is doing something in the interrupt handler. As you can see, there's a lot of information here, and this is less than 1% of the RSA operation.

About two months after I published my paper, the team of Osvik, Shamir, and Tromer published their work. They had been investigating the same issue with hyperthreading concurrently with me; they were a few weeks behind me all the way through. But instead of looking at RSA, they were looking at AES. They used the same approach of looking at how long it takes to access your own data: you load it into the cache and then measure how long it takes to access the same values again, in order to steal AES keys. As before, because the L1 cache is shared, you can measure what's happening with somebody else using the same L1 cache.

They also demonstrated stealing AES keys after the fact. They looked at Linux's dm-crypt, which is fairly similar to FreeBSD's GELI: an encrypted disk layer. What they did was, with the file system mounted, an unprivileged user would write something to disk. After the kernel returned to userland, so the cryptography has happened, the disk access has happened, maybe 10 milliseconds later, they measured the state of the L1 cache. And even at that point, there was enough information left in the L1 cache that they could steal the AES key that dm-crypt was using inside the kernel. What they showed was that, depending on the CPU and whether they measured the operations while they were taking place or after the fact, they needed to observe between 100 and a million AES operations in order to steal the AES key. Compare this to Bernstein's attack: Bernstein was measuring one billion AES operations; here, it's between 100 and a million. You have a lot more bandwidth when you're performing a microarchitectural attack rather than simply looking at timing.

Again, we know how to defend against these attacks. In my 2005 paper, I explained exactly what you need to do. You need to avoid having your secrets leak through conditional branches, that is, through which particular instructions you end up executing. So don't put your secrets into if statements or selection operators or for or while loops. And don't put any of them into array indexing.
So make sure that the exact sequence of memory locations you access does not depend on anything secret. In some cases, this will mean your code is slower. Instead of having a conditional, a selection operator where depending on the condition you either do foo or you do bar, what you need to do is compute both foo and bar and then, based on the condition, combine them and get the right value. Execute both sides instead of just one; there's a concrete sketch of this pattern below. Typically in cryptography, this isn't a big deal. If you were writing a database, it would be very difficult to write code like this, but in cryptography you're dealing with a fairly small amount of data and there's a fairly small number of different things you might be doing, so it's not too difficult to write your code to avoid vulnerabilities like this. And there's a side benefit: in addition to preventing any microarchitectural side channels, this also gives you a guarantee against timing side channels. It does not mean that your code will always take the same amount of time to run; depending on the state of the cache, it may take longer, because you need to load your data into the cache. But whether it takes longer or not will not reveal anything about your secrets, because you're loading the same values into the cache regardless of the secret values you care about.

Over the years after 2005, we got more attacks. Intel in 2005 claimed they had an attack against RSA exploiting the shared L2 cache rather than the L1 cache. I say claimed; they never published this. This claim came up when they were trying to convince me that this attack on hyperthreading really wasn't a big deal and really I didn't need to give this conference talk. Later on, when I said I really did think I needed to give this conference talk, they moved on to trying to tell Yahoo to fire me, which was funny, because I didn't work for Yahoo, but somehow they thought that I did. So I take Intel's claims here with a grain of salt; but whether they did it with the L2 cache or not, the attacks kept coming. The next year, a team showed that you could use the state of the CPU branch predictors to leak information. The year after that, the state of the L1 instruction cache. In 2015, it was shown that you could leak information through the L3 cache. So now we've gone from a 32 kilobyte L1 cache attached to two hyperthreads on a single core, up to, I think, a 24 megabyte cache attached to 16 cores; but there was enough information there for them to steal keys. In 2018, the translation lookaside buffer; of course, at this point we needed to have flashy names for vulnerabilities, so they referred to it as a "translation leak-aside buffer" and called this the TLBleed attack. The same year, just last year, we had an attack making use of CPU execution ports, this one going back to hyperthreading, where each core has a certain number of execution ports which are shared between the two threads; they called this the PortSmash attack. And I'm sure there are many other attacks that I've forgotten and haven't put onto the slide. But the point is, over time, attacks always get better. Where in 2005 you say, well, here's a way to exploit this using one particular bit of shared state, and maybe you can do it with other bits of shared state, it turns out that, yeah, pretty much any time you have a resource which is shared, you'll be able to exploit it.
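Here's a minimal sketch of that "compute both and combine" pattern in C; this is a generic illustration, not code from any particular crypto library:

```c
#include <stdint.h>

/* Branching version: leaks the secret bit through the branch predictor
 * and through which code path gets executed. */
uint32_t select_leaky(int secret_bit, uint32_t foo, uint32_t bar)
{
    if (secret_bit)
        return foo;
    return bar;
}

/* Constant-time version: turn the secret bit into a mask of all-ones
 * or all-zeroes, compute both sides, and combine. The same instructions
 * run and the same memory is touched regardless of the secret. */
uint32_t select_ct(int secret_bit, uint32_t foo, uint32_t bar)
{
    uint32_t mask = (uint32_t)-(uint32_t)(secret_bit & 1); /* 0xFFFFFFFF or 0 */
    return (foo & mask) | (bar & ~mask);
}
```

The same trick extends to memory: rather than indexing a table with a secret value, you read every entry and mask out everything except the one you want, so the sequence of memory accesses is always identical.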
But the good news here is, if you write your code according to the guidelines that I gave you in 2005, you don't need to worry about all of these, because all of these attacks exploit either different memory locations being accessed or different code paths being executed.

Before I move on to further attacks, I want to go back and talk a little bit about CPU architecture. Those of you who took a CPU architecture course in university can fall asleep for five minutes. In 1961, IBM released the IBM Stretch, which was the first major system that used CPU pipelining. The idea here is that if you can start processing one instruction before you finish processing the previous one, you can make things faster. In the classic RISC pipeline, the first thing you do is fetch the instruction from memory; then you decode it, then you execute it, then you do any memory accesses that you need, and finally you commit the operation, writing to the register file on the CPU. So the classic RISC pipeline has five stages, which typically means five clock cycles for each instruction making its way through the processor. Modern x86 pipelines typically have around 15 stages; in fact, it's not just x86, pretty much all modern CPUs have pipelines of around that depth.

In 1990, IBM brought us out-of-order execution with the POWER1. The phrase "out-of-order execution" is important, because the start of the pipeline, fetching instructions and decoding them, is still in order; you can't reorder instructions until you know what they are. And the end of the pipeline, where you commit the instructions, is also still in order. It's not immediately obvious why the commit stage needs to be in order, but in fact, if you want to deal with exceptions, it's important to have the commits happening in order. If you're going to throw, say, a division-by-zero error at some point, you want whatever exception handler you have, maybe your exception handler says, if you try to divide by zero, just insert a zero there instead, to be able to resume from that point, and the only sensible way to resume is to say: everything before this instruction has completed, and we continue from the next instruction. If you're committing instructions out of order, then you're lost; there's no way to deal with exceptions.

Out-of-order execution turns out to be very important on x86, because it has very few registers. These days we have 16, but back in the i386 days, we had just eight general-purpose registers. If you've looked at the assembly code you get from unrolling loops, you'll find the same registers being used over and over again, because you don't have enough to use different registers each time through the loop. Out-of-order execution allows the instructions to refer to the same architectural register while internally being remapped to different physical registers.

Really, the past 40 years of CPU design has been built around the idea that the instructions must flow. It is very easy to add another ALU to your CPU. It's very easy to add another multiplier. Even trigonometric instructions: these days, the amount of die area you need in a floating-point unit for those is very, very small compared to the amount of die area dedicated to decoding instructions and reordering them. And of course, these days most of the die area on the CPU is just cache.
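You can actually observe out-of-order execution from ordinary C. Here's a toy microbenchmark, a sketch assuming a GCC or Clang toolchain, compiled with plain -O2 and without -ffast-math so the compiler can't reorder the floating-point additions itself:

```c
#include <stdio.h>
#include <time.h>

#define N 200000000L

/* One dependency chain: each addition must wait for the previous one. */
static double one_chain(void)
{
    double a = 0.0;
    for (long i = 0; i < N; i++)
        a += 1.000000001;
    return a;
}

/* Two independent chains doing the same total work: an out-of-order
 * CPU can execute additions from both chains in parallel. */
static double two_chains(void)
{
    double a = 0.0, b = 0.0;
    for (long i = 0; i < N; i += 2) {
        a += 1.000000001;
        b += 1.000000001;
    }
    return a + b;
}

static double timed(double (*f)(void), double *out)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    *out = f();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    double r;
    printf("one chain:  %.3f s\n", timed(one_chain, &r));
    printf("two chains: %.3f s\n", timed(two_chains, &r));
    return 0;
}
```

On a typical out-of-order CPU the two-chain version runs close to twice as fast despite doing the same number of additions, because the hardware overlaps the two independent dependency chains; exact numbers will vary by CPU and compiler.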
So these features give us a lot of improved performance, but of course they bring risks with them. I'm sure you've been waiting to see the words "speculative execution". On modern CPUs, you start handling instruction N+1 before instruction N has completed. There's an exception here: if you insert something called a serializing instruction, that tells the CPU, stop here, wait until everything is finished, then proceed. But normally you don't insert those, because, well, that slows everything down.

When you're dealing with out-of-order execution, or even just deep pipelines, sometimes things don't go the way the CPU expects, so we have these things called pipeline flushes. There's branch misprediction: you're running a loop, and the CPU says, oh, you've been going around this loop a thousand times, let's just assume you keep going around; then it says, oh, I didn't think you were going to do that, okay, let's stop here, throw away everything we've done for the next time around the loop that you ended up deciding not to do, and go back and continue where we're supposed to go. There's indirect branch target misprediction: if you have function pointers, you're usually calling the same function many times, and occasionally you call a different one; to speed things up, the CPU tries to predict where the code is going to go next, and sometimes it gets it wrong, so it needs to do a pipeline flush. There's exception handling: I mentioned divide by zero. If the CPU stopped every time it saw a divide instruction and said, let's wait here until we make sure we're not dividing by zero, it would slow things down. So the CPU just hopes that you're not dividing by zero and keeps going; if it turns out there's an exception, it flushes the pipeline and continues from that point. There are data hazards. And, my personal favorite, self-modifying code: if you write to a memory location which is the location you're fetching instructions from, the CPU will say, oh, okay, that instruction I fetched isn't the one you want me to execute; flush the pipeline, reload the new values that you stored into memory with your self-modifying code.

Every time a pipeline flush happens, the speculatively executed state is not committed; it gets thrown away. And so the architectural state, the values of the registers, is not changed, but the microarchitectural state may have changed. And this is a problem, because all those speculatively executed instructions may have affected things like caches.

So, the Meltdown attack. This was referred to by some people as Spectre variant 3, but I think it's actually the sensible place to start. Try to read from a location in memory that you can't actually access. Do something with the value you just read. Hope that you do that before the trap, which you know is going to happen, gets to the end of the pipeline and is committed. Intel CPUs handle traps at the time of instruction commit, not when they first deal with the instruction. As a result, even though the pipeline is flushed, you can see the state in your cache that was affected by the speculated instructions. There's a similar attack called the rogue system register read: the RDMSR instruction, if you are not privileged, will trap, but it traps when it gets to the end of the pipeline, and you can do something with the value before it gets to that point and the pipeline is flushed. And then there's the lazy floating-point state switching attack.
With lazy switching, the floating-point unit is marked as not present, which is what we used to do when we didn't want to copy FPU state in and out on every context switch, because some processes never use the floating-point unit. You could access something on the FPU; it would trap, but meanwhile, before the trap actually happens, you can do something with the value you just accessed. Swap, yes, same thing: memory that's been swapped out will trap when you touch it, and again the trap is only handled at the end of the pipeline. In all these cases, because the exception is handled at the end of the pipeline, you can speculate through the faulting instruction.

As far as I know, this is just an Intel issue. AMD and other non-Intel CPUs are at least mostly not affected, because when faults happen, they are dealt with sooner, rather than waiting until the instruction gets to the end of the pipeline. They do something earlier; I don't know the details, because I haven't seen the internals of their CPUs, but from everything I understand, they identify these as faulting instructions and then do not allow the results of those instructions to be used by other instructions.

There are more CPU design issues. Speculative store bypass affects pretty much every modern CPU. You have an instruction that writes to a location in memory, and another instruction that reads from the same location. If the CPU realizes that it's the same location, it will say, let's not run this read until that write has finished. If the CPU doesn't realize it's the same location, which can happen with out-of-order execution, then it will read the old value from memory, and you may go ahead and do something with that value; only later, when it realizes, oh, you are writing to that location, will it say, whoops, that read shouldn't have happened, flush the pipeline, let the write happen, and then go back and proceed with the instructions the code actually set out. But you may have done something with the stale value that was read from memory and leaked information, even though, if you look at the code one instruction at a time, you would say there's no way you should be able to read that value, because you've just written to that same location in memory.

Then there's microarchitectural data sampling. There are several of these attacks on Intel CPUs; none have been published for other CPUs, but I suspect other CPUs have the same issues. In many cases, it makes sense to buffer data. You've got store buffers, where you have a whole bunch of values being written to memory; they will go into the L1 cache eventually, but you don't want to hold things up waiting for the L1 cache to be ready, so you just put them into a queue and let them happen later. Same thing on the read side: data may be being read into the L1 cache, and you want to be able to access it before the L1 cache is finished dealing with it. In some cases, in order to speed things up, the processor forwards data from one place to another, but it does this optimistically, and it realizes too late that it shouldn't have forwarded that data. So what does it do? It flushes the pipeline. But if you've already used that data in a way that leaks state, then it's too late. In some of these cases, the leak only happens between hyperthreads, but I believe there are other cases where it can happen even between cores.
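All of these attacks end the same way: the attacker times accesses to their own memory to see what's in the cache. Here's a minimal sketch of that measurement step, my own illustration assuming x86 and the GCC/Clang intrinsics in x86intrin.h, not code from any of these papers:

```c
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>   /* __rdtscp, _mm_clflush */

/* Time a single read of *p, in cycles. */
static uint64_t time_access(volatile char *p)
{
    unsigned aux;
    uint64_t t0 = __rdtscp(&aux);
    (void)*p;                       /* the load being timed */
    uint64_t t1 = __rdtscp(&aux);
    return t1 - t0;
}

int main(void)
{
    static char buf[4096];

    buf[0] = 1;                     /* touch it: the line is now cached */
    printf("cached:   %llu cycles\n",
        (unsigned long long)time_access(buf));

    _mm_clflush(buf);               /* evict the line from the cache */
    printf("uncached: %llu cycles\n",
        (unsigned long long)time_access(buf));
    return 0;
}
```

A large difference between those two timings is exactly the signal that prime-and-probe, Meltdown-style, and Spectre-style attacks use to turn cache state back into data.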
As I say, these have only been demonstrated on Intel CPUs, but the nature of the design elements that make them possible makes me think other CPUs probably have the same issues, because these design elements are very useful for making CPUs faster, and people care about performance.

Now, of course, Spectre is the broad category of speculative execution attacks. The first one that was demonstrated was a bounds check bypass. The CPU mispredicts a branch: usually, in this code, your bounds check says, oh yes, everything's fine, you're accessing a value inside the buffer, so the CPU learns to predict that this will happen. Well, if you then access something outside of the buffer, the CPU still predicts that it's inside the buffer, and by the time it realizes it wasn't, it's too late, because you've now used that value from outside the buffer and leaked the information. Branch target injection: the CPU, again to make things faster, is predicting where your function pointer is going, but you may be able to trick it into speculatively calling the wrong code. It realizes eventually, but by that point it's too late; you've already leaked information. And this is a general issue with speculative execution: if the processor mis-speculates, you may run code that you were not planning on running, and every modern CPU is vulnerable to this. Branch mispredictions happen even in good times. If you want to exploit this, of course, you will deliberately make the CPU mispredict, and it's very, very easy to do that; but even without doing it deliberately, a good branch predictor on a CPU gets maybe 98% success, so 2% of the time, the CPU is mispredicting where your code is going next.

But the good news here is that this does not bypass operating-system-level privilege boundaries. So sandboxes are your friend. If you can put the information that you care about, your cryptographic information, for instance, into a different process, then straightforward Spectre attacks will not be able to leak that information, because the CPU is speculatively executing something that that process could have executed; it just wasn't what you were trying to execute. And there are many possible exploit paths here. One that I haven't seen anybody demonstrate yet, but I'm sure is out there, is p-code machines. You've got a switch statement, and depending on the p-code operation, it's doing a different thing with its operand. This is exactly the sort of place where you would expect to see branch mispredictions. So your p-code machines are probably speculatively running the wrong code with your opcode, and if one of your opcodes is "jump to this location and run the raw machine code", then it is very likely that you can be tricked into speculatively running something dangerous.

I'm just about out of time, but before I go, I want to point out that almost all the attacks demonstrated so far have been leaking information through the L1 data cache. This is the way I demonstrated in 2005 to leak information through a microarchitectural side channel, and it's the easiest way to do it. But it is not the only way to do it. As I mentioned on the slide showing all the different attacks people have come up with since 2005, you can leak information through the instruction cache, through the branch predictors, through the translation lookaside buffer, through CPU execution port contention.
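Coming back to the bounds check bypass for a moment, the vulnerable pattern is only a few lines. Here's a sketch in the style of the published Spectre variant 1 examples; the array names and sizes here are hypothetical:

```c
#include <stddef.h>
#include <stdint.h>

uint8_t array1[16];
size_t  array1_size = 16;
uint8_t probe[256 * 4096];  /* one page per possible byte value */
uint8_t temp;               /* keeps the compiler from dropping the read */

void victim(size_t x)
{
    if (x < array1_size) {
        /* Trained with in-bounds x, the branch predictor assumes the
         * check passes. Called with an out-of-bounds x, the CPU
         * speculatively reads array1[x] and touches the probe[] line
         * selected by that secret byte; the cache remembers which line
         * was touched even after the pipeline is flushed. */
        temp &= probe[array1[x] * 4096];
    }
}
```

The attacker then times accesses to each page of probe[] to see which one is cached, exactly the measurement step sketched earlier, and recovers the byte.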
So, even though all of the attacks demonstrated so far have been in one particular category, that does not mean that's the only way to attack systems; it was just the easiest way. And, in fact, you don't even need to execute instructions in order to leak information. There's an instruction pre-decode leak: loading code brings data into the L1 cache, and when that happens, of course, it leaves a footprint behind there; but in addition, the speed of pre-decoding depends on the bytes being decoded. On modern Intel CPUs, if there's a 0x66 or 0x67 byte, these are prefixes that modify the operand and address sizes being used, they don't come up very often, but when they do, the pre-decoder slows down dramatically; it takes an extra three cycles to handle each of them. Normally the pre-decoder parses a block of instruction bytes every clock cycle, but if one of those prefix bytes comes up, it slows down. So if you can trick the processor into speculatively loading code and then measure how far it got, how much data it loaded into the L1 cache, that tells you something about the bytes it was trying to pre-decode. So even without actually executing instructions, the simple act of looking at them and decoding them reveals information. There are a lot of ways to leak information from CPUs.

Any questions? So, yeah.

As far as I understand, the CPUs perform all this magic, speculative and out-of-order instruction execution, so we as programmers, being lazy, can keep the programming model simple, in-order instructions, pretending our modern machines are still a PDP-11 or something, and still gain this vast performance boost that we now have. Has there historically ever been an active discussion about the programming model, how we could scale CPUs with strict in-order execution while changing the programming model? Or has this stuff just happened because faster CPUs are successful in the marketplace?

I think this has just happened because people cared about performance and weren't thinking about the implications of speculative execution. I do not think the solution here is to get rid of out-of-order execution and pipelining, because, I mean, you might do all right staying with in-order execution, but if you get rid of pipelining, then you're looking at a huge performance cut. I think you can design CPUs which do not have these vulnerabilities, but you need to redesign all the components. For instance, you can design your caches so that data gets loaded, but it goes into some sort of temporary location and doesn't actually go into the cache until you tell it, yes, that load actually happened. But I think you're looking at a decade before you have CPUs that fix this.

Okay, so besides shadow registers, you also need shadow caches and shadow everything for everything you do speculatively.

Something like that, yes.

Thanks. Is there anything in RISC-V or any other sort of on-the-horizon technology that addresses any of this?

I'm not an expert on RISC-V. As far as I'm aware, there is nothing published that addresses this. I strongly suspect that Intel has teams that are looking at doing what I just described in terms of redesigning caches and branch predictors and translation lookaside buffers so that things can happen speculatively but do not affect the visible state until the instructions are committed. But I'm not aware of anything published so far.
These days, from the outside, it looks as if Intel is actually sitting on a lot more leaks and not disclosing them. What do you feel about that?

Sorry, can you repeat that?

I mean, every few months we get a new leak coming from Intel, and the timing suggests, in my opinion, that Intel actually knows about lots more of these attacks and is just not telling the public yet. What do you think about that?

I'm sure there are many more attacks out there, and, given the conversations I've had with Intel, they're trying to do their best to get things fixed. We can maybe argue about the exact timing: do you wait until everybody's got a fix, or do you publish something and hope that most people get things fixed in time? I'm no longer FreeBSD security officer, so I'm not involved in those conversations. I'm inclined to say that Intel is trying to do the right thing here. They have come a long way since 2005, when they just wanted to sweep things under the rug. Whether they have people who understand how to deal with the open source community, they have some, whether they're in the right positions, I don't know; but my general feeling is that they are trying to do the right thing, and nobody gets everything right all the time.

We need to stop here, we need to get ready for the next talk, but thank you very much for your talk.