So this talk is about something which is, in a way, not about cryptography. I wrote a complete TLS library which does constant-time cryptography and does not leak secrets, but that's only part of the picture, and this is the second half: when all the crypto has been done, what do you do to process your secret data safely? So here are the key ideas of what I am going to explain. One is a working definition of what constant-time coding means, and it's actually not about things running in constant time, so it's a misnomer. It's about resistance to a class of attacks known as timing attacks. And I want to stress that it actually matters; it's not just an academic toy, and it is becoming more and more relevant, especially when considering the new trend for security enclaves, such as SGX on new Intel processors, or ARM TrustZone. And it is not specific to cryptography; it is a larger matter. So, timing attacks. Timing attacks are a specific kind of side-channel attack. A side-channel attack is something that lives between the abstract description of computing, in a mathematical sense, and the physical world. This means that you are running your code, and things hopefully happen as they should with regard to the description of the language and the compiler and so on, but they really run on a physical machine, and that has an impact in many ways. The machine does the computing you ask it to do, but it does other things as well; in particular, it will consume more or less power depending on what it's doing.
It will emit various electromagnetic emissions, on the wires and in the air around the machine. It will even emit acoustic things, that is, it makes noise, which may depend on what it's doing, and there have been successful attacks that work on the noise a processor makes. And one specific side channel, which we are going to talk a lot about, is the difference in time between various operations: you ask the computer to do something and it will do it, but the time it takes to do that thing may depend on data which you would prefer to be kept secret. Among all side-channel attacks, timing attacks are a bit special because they don't need physical proximity. If you want to listen to the noise made by a computer, you have to be somehow in the same room; demonstrations have worked up to about 12 meters. But if you want to simply measure the time taken by some operation, it can be done from the other side of the planet. That is, while most side channels can be mitigated by having a protected physical environment, so basically you don't let the attacker enter the room where your server is, timing attacks can be enacted from anywhere in the world, which makes them a lot more concerning in many contexts. So timing attacks can be split into direct timing attacks, in which you simply measure the time that the computer takes for the specific operations that manipulate secret data. This was first published in 1996 by Paul Kocher, who demonstrated how to recover an RSA key by simply talking to an SSL server and measuring the time it took to respond to connection requests. The difference was very minute, but do it sufficiently many times and you get, through a lot of statistics, the actual key information.
And then there are indirect timing attacks, in which you don't really measure the time taken by an operation, but you obtain some information about the modification of the state of the machine, in particular cache contents, and you obtain that information afterwards using a timing attack. So it's been about 13 years since cache attacks were first demonstrated in the lab, and now we have a lot of variants, and the basic demonstration shows that an attacker, in some more or less contrived context, can recover a cryptographic key from an ongoing operation. It's been demonstrated across processes, and across virtual machines: an attacker running a virtual machine on the same hardware, on the same core or a neighbor core as the victim, could recover the key. Recently there has been an attack demonstrating an attacker who is really just JavaScript code, normal plain JavaScript from some website, which could recover the secret key for the disk encryption of the local machine. So the attacker is always near the target machine, but not physically near: it can be just some code which runs not too far from the target system. A lot of this is about caches. So this diagram is some sort of illustration of how cache memory works. The usual description of caches is that the cache remembers the last memory locations you accessed, and it's a wrong description. It does not work that way, because it would be awfully difficult to make an efficient cache in hardware that would work that way, and the whole point of having a cache is that it's faster than memory. So instead, when you have one cache, you actually have a lot of small caches which are called cache sets, and whenever you access memory, a byte in RAM, it may be cached in exactly one of these sets. So the cache is organized as lines, usually 32 or 64 bytes, which are grouped in sets.
On the picture I put four lines in each set, but actually on this computer it's eight lines, and whenever the code reads some data which is not already in the cache, it reads a whole line, which will end up somewhere in one specific set that depends on the address of the line in memory. And of course, when it comes into the cache, it pushes out whatever was there before. So the important parameters of a cache are, one, the set selection, that is, from the address that you access, in which set the line will end up; and then there is the eviction policy, which is the way the cache decides which line in the set will be used to store the new data. And these parameters are absolutely not documented; each processor vendor does its own thing, but they have in general been reverse-engineered. In the level-one cache, which is the cache that must be the fastest, the mapping from address to cache set and the eviction policy must be relatively simple, because the cache must respond in a few clock cycles. And so on this machine, which is a fairly standard MacBook Pro with a Skylake Intel CPU, the eviction policy for the level-one cache is least-recently-used, so that when code is accessing some piece of data, it's fairly straightforward which line will be evicted to make room for the new one. And the selection of the set just uses a specific range of bits from the address, which on this machine will be bits 6 to 11. On other cache levels, which are further from the CPU and which have a lot more time to act, things are more complicated. In particular, in level-three caches there are some sort of undocumented hash functions which decide which set is used, and there are several eviction policies, and the machine will dynamically switch from one to another. So from the point of view of the attacker who wants to leverage things from the cache, it's easier to work with the level-one cache if you can, because then things work in a relatively straightforward way.
So there is some terminology for cache attacks which splits them into three categories, and each of them is about using the addresses of memory bytes accessed by the target code to infer information about secret data. So it's not about what data is read from memory, but from where it is read and where it is written. And this is the important point we have to remember, because the defense against that kind of attack is to never read or write data at a secret address. You can read secret data, but the address itself should be considered public. The first one is called evict+time. It means that you measure the execution time of the target code, which supposes that the attacker can trigger it in some way; for instance, if he's trying to attack a server, he's just connecting to the server and making it do its stuff. And then the attacker must have the ability to evict some of the elements of the cache, some lines, and then try again, just to see if it's as fast as previously, or slower, because if it's slower, this means that the lines which were evicted were actually useful for the computation. So then the attacker learns that these specific lines have been read, so he gets some information about the addresses of the memory slots which were accessed. Prime+probe is another variant, in which the target code is executed once, but then the attacker tries to work out which parts of the data cache have been evicted, because they would be evicted to make room for whatever the target code loaded. And we saw that when the code loads things, it will evict different lines from different sets depending on the addresses that have been accessed. And flush+reload is another variant on that one: the former is about filling the cache with your own data, then seeing what was evicted; the latter is about removing data from the cache and seeing what was reloaded. Here is an example.
This one is cryptographic, because most of the science on this comes from the cryptography area. So this is an excerpt from a very classical AES implementation. In this specific case it's in C#, but it could be in about any language. A classical implementation of AES will just consist of a number of rounds, and at each round there will be about 16 array accesses, at addresses which are computed from bytes of the intermediate results. And the point of the cache attack is to work out which of these bytes have been accessed, because each time, that yields a minute piece of information, about two bits, on secret intermediate values which can be mathematically linked to the key. So after a few hundred observations, the key is revealed. And the beauty of the thing is that the attacker does not even have to look at the plaintext data or the ciphertext; he just has to know that at some specific point an encryption is running. And it has been demonstrated across distinct virtual machines. So if you are in a cloud, you've got your own virtual machine, and the attacker is just renting another virtual machine which happens to run on the same CPU, on a distinct core, but that's not an issue for the attacker, because the level-three cache is shared between cores on classic Intel CPUs. So these kinds of attacks have been demonstrated repeatedly in lab conditions. They were never observed in the wild; attackers, malware, apparently don't do that. They don't do that because it's complicated; it's tough to run. In lab conditions, you can repeat the experiment, and you can try it for several hours and nobody will mind, except the PhD student who does it, but it will be done. In a real attack setup, there are usually simpler ways to get into a machine, and the state of software security is such that you usually don't need to leverage that kind of attack to get things done. However, it's predictable that these attacks will become relevant in the near future, because of SGX.
The setup of SGX is that there is some code that must handle secret data, and it runs on the attacker's hardware, and it may communicate with the external world only through the attacker's CPU. The attacker has full control of the CPU, kernel, hypervisor, whatever. He can monitor memory accesses, he can often look at individual cache accesses, and he can get a lot of information, and still he must not be able to access the secret data which is processed within the enclave. So that's exactly the dream setup for running timing attacks and cache attacks. Right now it seems that there are people who are really intent on doing useful security stuff with enclaves; I've been professionally involved in auditing two distinct projects already which were using that kind of enclave. So it begins to matter. It was never really official, but the point of SGX, when you read the spec, is that it should allow for DRM for video and music. Everything in it was meant so that the enclave runs on the client system, so that the content distributor knows that he can send encrypted data, because it will be decrypted only by the enclave, which controls what becomes of the data and prevents siphoning it out into a file to give to other people. It seems that the industry is not doing that. Instead, they're trying to use SGX in servers, for various reasons. There's one project which is public, which is from Whisper Systems, the people who make the Signal application. They wanted a system which could match your contacts' phone numbers with the set of phone numbers known to belong to Signal users, so that you could be put on an easier path to using Signal with your friends. But they wanted to do that without revealing your contacts. That is, they operate the server, and they define an attack model where they themselves are the attackers, because it all runs on their server.
And they solved that with an SGX enclave, so that the client knows that it is talking and sending encrypted things to an SGX enclave, and the code in the SGX enclave will do funky things to match the phone numbers without leaking the information. So they really worked hard to make it with a sort of hash table which is constant-time, which does not leak information through cache timing attacks, and it can be demonstrated that it does not leak information, by basically analyzing the code. So such things can work. Constant-time code is about code which will not leak secret information through timing attacks. Its execution time need not be constant, but it should not be correlated to secret data, both directly and in its indirect effects on the machine, in particular cache contents. So the execution time may vary, but it should not vary in a way which can be linked to the secret data. So if you want to write code that works that way, then you have to be very wary of some operations. First, you know that you're writing in some more or less high-level language, be it C, Rust, Go, C#, JavaScript, whatever, but at some point it will become assembly opcodes, and the CPU will run the assembly opcodes. So you have to know which opcodes will result from your code, and then, some of them have data-dependent execution time. It's well known for the division opcodes, division and remainder; I'm talking about integer divisions here. For floating point, the situation is a bit more complicated, and it appears that recent Intel CPUs have constant-time floating-point operations, including divisions; but that would not have been true about 10 years ago. So just be wary of that. Less known is that multiplication is not always constant-time. It depends on the kind of hardware. On this one, it's constant-time.
On some older or smaller systems and microcontrollers, it's not; PowerPC cores especially tend to have a multiplication whose execution time varies depending on the data and on other parameters which are usually not documented. Shifts and rotations may have a data-dependent execution time, not depending on the data which is shifted, but on the shift count or the rotation count. So the shift count or rotation count should not be secret. Okay. Floating point depends on a lot of things and is always hard to analyze. Then, you must absolutely avoid data accesses at secret-dependent addresses, which was the issue with classic AES table-based implementations. A variant of that is that you must not do a conditional jump which depends on secret data, because when code is executing, these are accesses to code, which is in memory: if you're doing a conditional jump, you're going to access some code bytes or some others, and which ones will depend on the jump condition. So don't do that. Similarly, indirect jumps should not go to a secret-dependent target address. And of course, any library function which uses either of these operations should be out too; you must be very wary of what you call. So let's take an example. These lines of code are supposed to generate a random 32-bit integer which is uniform in a given range, a range whose size is not a power of two. So what it's doing is that it's generating 32-bit integers, then it's doing a remainder operation, just to see what the value is modulo n, and it's excluding the top of the range. That is, n is not necessarily a divisor of 2 to the 32, so you have to exclude some of the range, because if you kept it, the generation would not be uniform. So that's what this code does. And this code is not constant-time. But it's interesting to see exactly why it is not constant-time.
The first thing you see is that there is a conditional jump on data, including the data which is returned. And it turns out that that one is actually okay. Because from the outside, the attacker may know that at some point the loop will exit; but when the jump is not taken, all the attacker knows is that a value was rejected, and as such, it was not secret, because it was not used. So all the attacker learns from observing the exact pattern of these conditional jumps is that the code worked as it is supposed to: it rejected values which were not usable to generate the random output. That one is okay. But it's not a simple matter, because if you use a generic tool which looks at your code and simply tags values as "that value is secret, so it should not be involved in any conditional jump", it will report that conditional jump and say: you're doing a conditional jump on secret data. And the programmer has to do the actual intellectual effort to demonstrate that, in fact, it's correct in that specific case. Now, there is another part which is not correct, and that's the remainder operation, because depending on the platform, it may have non-constant execution time, which may depend on the modulus, which is n, which is not considered secret, but also on the value. And as such, for instance, if the value is very small, the division may be faster, and the attacker will then know that the returned value is a small value. It's important to look at the produced assembly. So you compile your code and you look at the assembly, and indeed, if you look at the assembly, there is a divl opcode, which is an integer division, and that's the bad one. So when you want to do the constant-time version, it must look like this. And what it is doing is that it's masking the high bits, so that the modulus operation is now replaced with just a comparison.
So for instance, if you use a modulus n which is 920, this will compute a mask which will just keep 10 bits, so the masked value will be between 0 and 1023. And if it's between 0 and 919, the code will keep it; otherwise, it will simply reject it. So there's no division anymore. And we can see it in the assembly code, which is here in two columns because it's more complicated: there's no trace of a division. And you can follow this assembly code to see that it really is the translation of the C code. Ideally, if you want to write constant-time code, you should be able to do that in your head, that is, to compile C code into assembly as if you were the compiler. So it's not easy. Another example here, which is a comparison of two binary values. If you use the standard C function for that, which is memcmp, it's not constant-time, because it will read bytes only up to the first differing byte. And this has been used to crack passwords, actually, to recover hashed passwords from authentication servers, by simply submitting candidate values and observing how far the comparison could go before the server decided that the submitted password, once hashed, did not match the stored hash. So that one has been demonstrated on a real server. Still lab conditions, by academics, but less academic than the other ones. The constant-time version works by always reading all the bytes, in a strict sequence. So it reads all the bytes; the XOR operation will return a nonzero value if two bytes differ; a bitwise OR will accumulate everything. So at the end, the values are equal if and only if the accumulator is still zero, because that means that all the XORs yielded zero. And you can see that in the assembly code; if you read assembly, you can see that it's obviously correct.
So now that I've explained that constant-time coding is important, that it matters, and that it's hard, now comes the part where I'm helping. And I've called it CTTK; it stands for Constant-Time Toolkit. It's a C library which I've published as open source; it's on GitHub, it's MIT-licensed, you can reuse it, you are encouraged to reuse it, and it offers a number of primitives for constant-time coding. Among the primitives are, of course, basic boolean types, and also comparisons of small integers and larger integers. There is a complete big-integer implementation which guarantees constant-time operations. It also has no undefined behavior: if you right-shift a negative value, you know what you get. And it also checks for overflow: on overflow, it will replace the value with a NaN, which will propagate throughout the computation. So it's meant to remove all the usual traps that you have in C code. There are also a hexadecimal encoder/decoder and a Base64 encoder/decoder, which do not use lookup tables, because that would leak information. And a conditional copy, and so on. So it's growing; I'm occasionally adding code to it, and I have other things to add. So this is an example of how you use it to do comparisons on 32-bit integers. You've got a specific boolean type, which is not actually a C boolean, and you get the normal operations on it, such as AND, OR and so on. And at the end, you can convert it back to an integer to do a test, once you have done all your constant-time stuff. Internally, it looks like this. The boolean type is actually a structure with a 32-bit integer in it, and that 32-bit integer will contain zero or one. And I am very careful to prevent the compiler from actually understanding that: I don't want the compiler to come to the conclusion that the value v is always zero or one, because then it could be tempted to use conditional jumps, and I don't want that.
So instead, I'm using 32-bit integers which just happen, mathematically, to be in the right range. And you see the evaluation of the greater-than comparison, which is a sort of game with XORs and ANDs and shifts; you can try to analyze it and see that it actually works, and in the real library there are comments, so you can read them and understand why it works. But the whole point of using that library is that you don't have to worry about that: the library will do the comparison for you, and it's still an inline function, so it's rather efficient; it will be inlined in your code. Last slide, so I won't run over too much. It's an example of how it does operations, for instance, on big integers, which I define here to be exactly 132 bits, just like that. It's good up to a million bits or so, so you have room. And there are functions for setting values from normal integers; there is little-endian and big-endian decoding and encoding, signed and unsigned; and there are multiplication, addition, division, modulo, and re-encoding. It's a bit barbaric, because there's no operator overloading in C, so you have explicit function names, but it's meant to be usable in a lot of contexts, including embedded systems, because none of it uses dynamic memory allocation, and you don't have to pull in some big-integer library such as GMP. There's no free operation, because these values are allocated on the stack, so you can just return from a function and forget about them. It's meant to be usable. So that's about it, and I kept to my half hour.