Welcome everyone to the first session of the last day. We will continue our discussion. There was one lecture that I gave on some research directions and some of the work that we have done, namely elliptic curve cryptography optimizations, minimizing the number of curve additions, and we called that work optimizing elliptic curve scalar multiplication with near factorization. We now look at another thing that we have done. There were many questions about the possible attacks on elliptic curve cryptography, on AES, and so on. One class of attacks that has become particularly important and well explored in the recent past is what are called side channel attacks. Side channel attacks can be of various kinds: they could be timing-based side channel attacks, power-based side channel attacks, cache-based side channel attacks, and so on. We have looked at many of these here; my students have been working on this for about 2 years now, and one of the things we looked at quite extensively was cache-based side channel attacks, in particular those that target AES. So, with that little background let us now move on to how exactly we can attack AES. But before we attack AES, let us recall all its different steps. There are two very important things before we begin the attack: the first is to recall the different steps in AES, and the second is how it is actually implemented in software such as OpenSSL, for example. The kind of software that you use, maybe in your browser or elsewhere, to do encryption and decryption will actually implement it in a very interesting and very efficient way. Of course, you can have both software-based implementations and hardware-based implementations; many modern processors actually implement some of these crypto algorithms in hardware.
You will see that modern processors implement AES in hardware, for example, but it is also implemented in software; OpenSSL is one example of a software implementation. So, we need to look at the details of this implementation and some basics of cache memories before we see a cache-based side channel attack. I will be going through this a little slowly and repeating certain things just to make it clear. This is one of the more advanced, research-oriented lectures, because this is an audience of faculty from around the country, and as faculty some of you have done a PhD, some of you are perhaps planning to do one. You are supervising B.E. students, some of you have M.Tech programs and are supervising M.Tech students, and a few colleges may even have PhD programs where students are looking for areas to research in. That is the reason for this lecture and one or two of the previous ones: Professor Virendra Singh had a lecture on hardware implementations for detecting malware in packet payloads, and I had an earlier lecture on elliptic curve optimization. This is another lecture, now on side channel attacks targeting AES. I will talk about the basics of AES first, recall some of it, then from the basics we will go to how it is implemented, and then the attack proper. The implementation, as you will see, is extremely interesting. This is a special kind of attack; as mentioned before, there are mathematical attacks on the foundations of certain crypto algorithms. For example, RSA depends on the hardness of factorizing a large number that is the product of two large primes; that is the hard mathematical problem in the case of RSA. In the case of Diffie-Hellman and elliptic curve cryptography, the hard problem is the discrete log problem.
So, there is the discrete log problem, which I have already talked about, and the elliptic curve discrete log problem. Now, in addition to these hard mathematical problems there are attacks based on side channels: what information can you get from a side channel? Basically, the algorithm is being implemented on a server or a laptop or even a smart card, and you are trying to get timing information, or power information, how much power is consumed as a function of time, and you are using that information to reveal certain bits of the key. That is the whole idea here. Now, a summary of AES. I do not know whether it is very clear to the audience what is going on on the slide, because it is a very busy slide. We said there were 10 rounds or stages, and each round had 4 steps. There is a sort of round 0, the initial round, which only involves a round key operation. Recall that a round key is something obtained from the original key; there is something called a key schedule, and you can look at the text for the details of how to obtain each of these round keys from the original key. So, before the actual rounds begin there is a simple add round key operation, and then the first round and the subsequent rounds. There are 9 such rounds and then one final round that has one missing step. The 4 steps in each round are byte substitution, shift rows, mix columns and add round key; these go on for 9 rounds, and then in the final round there is only byte substitution, shift rows and add round key; there is no mix columns operation in the last round. A more formal definition of each of these steps now follows. Byte substitution is the first step. We use a 16 by 16 S-box where the (i, j)th entry S[i][j] is, mathematically, the following: you take the value of i.
So, as we had seen before, let me just draw this picture to recall. You represent the plaintext, which is 128 bits, by a 4 by 4 matrix, and each element in this matrix is a byte; a byte is of course 2 hex characters. Why is it a byte? Because each element belongs to the field GF(2^8), and each element is represented as 8 bits. For example, in hexadecimal this is the digit a, which is 10, and this is b, which is 11; so 'ab', and so on. To clarify, the input to every round is a matrix like this. In fact, the input to every step in a round, every operation in a round, is a matrix just like this. It is a matrix of bytes, and each byte is represented as 2 hexadecimal characters. Now, what you do in the substitution step is that you take each of these bytes and substitute some other value for it, using something called the S-box. The S-box is a 16 by 16 table, and again each entry in the table is a byte, 2 hex characters. Now, how do I use this S-box? I take this 'ab' and use the first hex character as the row index. So, a is 10, for example, so I look at the 10th row, counting 0, 1, 2, and so on up to f; and then I look at the bth column, again counting 0, 1, 2, up to f, to find the bth column, and that is the element I look at. I substitute that element for 'ab'. So, this is the byte substitution step. Now, how exactly did I get this table? Just a little bit of detail on that; it is not a terribly important detail, but if you are getting into research then you might want to know how those entries come about. For the (i, j)th entry S[i][j], you take i and then you take j; i is a hex character, 4 bits, and j is another 4 bits.
So, you get 8 bits, which is a field element; you take the multiplicative inverse of that field element and then you exclusive-or it with 63h (63h is 0110 0011). That is the element that sits there. So, ij is the concatenation of i and j represented as a binary string, and (ij)^(-1) is the multiplicative inverse of ij in this particular field. Operations in this field are defined modulo the irreducible polynomial x^8 + x^4 + x^3 + x + 1, and because 00 does not have an inverse, the entry in row 0, column 0 does not use this formula but is simply the element 63. So, let the (i, j)th entry in the state matrix be xy, just like we had ab in the picture I just showed you, where x and y are hex digits; then this step substitutes the (i, j)th entry with the element in row x and column y of this substitution table. The next step is the row shift. Each element in the ith row, i lying between 0 and 3, of the state array undergoes a left circular shift of i positions. So, in row 0 nothing is shifted; in row 1 everything is shifted left by one position. Recall these things; they will be important when we write down the implementation. The third row is shifted by two positions and the last row by three positions to the left, and it is a circular shift. The row shift causes bytes in a column to be diffused amongst the other columns. There are these two principles of confusion and diffusion, and the permutations help in diffusing things. Now, the next step is column mixing. The state is pre-multiplied by this matrix: the first column is 02, 01, 01, 03, and so on. The rows are all shifts of one another, as you can see: each row is the previous row shifted right by one position.
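To make the S-box construction concrete, here is an illustrative Python sketch (function names are my own). One note of caution: the full FIPS-197 definition does a little more than a plain exclusive-or with 63h; it applies a bitwise affine transform, XORing the inverse with four of its left rotations, before adding the constant 63h, and the sketch includes that step.

```python
AES_POLY = 0x11B  # x^8 + x^4 + x^3 + x + 1

def gf_mul(a, b):
    # Multiply two elements of GF(2^8) modulo the AES polynomial.
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= AES_POLY
        b >>= 1
    return r

def gf_inv(a):
    # Brute-force multiplicative inverse; fine for a one-off table build.
    # 0 has no inverse; by convention it is mapped through as 0.
    if a == 0:
        return 0
    return next(x for x in range(1, 256) if gf_mul(a, x) == 1)

def sbox_entry(v):
    # Inverse in GF(2^8), then the affine transform: XOR the inverse with
    # four of its left rotations, then with the constant 63h.
    b = gf_inv(v)
    r = b
    for _ in range(4):
        b = ((b << 1) | (b >> 7)) & 0xFF  # rotate left by 1 bit
        r ^= b
    return r ^ 0x63

SBOX = [sbox_entry(v) for v in range(256)]
```

Note that the row 0, column 0 entry comes out as 63h automatically, since the inverse term contributes nothing for input 00.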
So, this is the matrix; the important thing to note is that these entries are not ordinary integers but field elements. The entry 02, for example, is the field element 0000 0010, and so on. You multiply that so-called state matrix, the 4 by 4 matrix I told you about, by this one. The state matrix is initially populated by the 128-bit plaintext and keeps getting transformed as it goes from stage to stage, and also from one step to the next within a particular stage. So, initially it starts out with something, then each element gets substituted, then there is a row shift, and now we come to the third step, where that matrix is pre-multiplied by this one, and the product becomes the output of this step called column mixing. Now, the interesting thing, and it is something you should observe very carefully, it will take you some time to notice, is that after a few rounds, maybe two, every output will depend on all the inputs. There is a tremendous amount of diffusion, making the attacker's job difficult. Then finally, the fourth step is round key addition. Each round has a separate key obtained from the original key using a key expansion algorithm. That key expansion or key schedule is all defined in the text; we will not deal with it too much here. Each round key, obtained from the original key, is exclusive-ored with the current state to obtain the next state. So, this is a simple transformation where you just represent the round key as a 4 by 4 matrix and exclusive-or it with the input to this fourth step. So, now to understand our attack.
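The row shift and column mixing steps just described can be written out directly. This is an illustrative Python sketch (names are my own, not from any particular library), with the state held as a 4 by 4 list of rows:

```python
AES_POLY = 0x11B  # x^8 + x^4 + x^3 + x + 1

def gf_mul(a, b):
    # Field multiplication in GF(2^8) modulo the AES polynomial.
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= AES_POLY
        b >>= 1
    return r

def shift_rows(state):
    # Row i of the 4x4 state undergoes a left circular shift of i positions.
    return [row[i:] + row[:i] for i, row in enumerate(state)]

# The fixed MixColumns matrix; the entries are field elements, not integers.
MIX = [[2, 3, 1, 1],
       [1, 2, 3, 1],
       [1, 1, 2, 3],
       [3, 1, 1, 2]]

def mix_column(col):
    # Multiply one state column by the MixColumns matrix over GF(2^8);
    # "addition" of the products is exclusive-or.
    out = []
    for r in range(4):
        v = 0
        for c in range(4):
            v ^= gf_mul(MIX[r][c], col[c])
        out.append(v)
    return out
```

As a sanity check, the column (db, 13, 53, 45) should mix to (8e, 4d, a1, bc), which is the worked example given in FIPS-197.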
So, we have actually implemented this, and fairly successfully. It has taken us a long time, at least about a year and a half, because these attacks are extremely complicated, and the newer the machine, the worse the complication. We started off with the Intel dual core, but we have also looked at the i3, i5 and i7. So, these are attempts at designing and implementing side channel attacks: we have implemented this cache-based side channel attack, and these are some of the machines we have targeted. We will also be targeting the Atom and some more machines from SPARC and so on. Right now we have results for the Intel dual core and the Intel i3, i5 and i7, and these machines have 4 cores. Now, the actual implementation of AES. It is important to really understand how this has been implemented in software, in a very efficient fashion, to obviate the need for expensive field operations. The AES operations I just described use field arithmetic, and if we know how to do field multiplication and so on, we can easily see how to implement it using field operations in software, but that is not very efficient. So, although the AES operations use field arithmetic, the actual software implementation makes extensive use of table lookups. It turns out that there are 4 tables, we will call them T0, T1, T2, T3, and each of these is 1 kilobyte; that corresponds to 256 entries of 4 bytes each. There is some information here that is relevant to the attack, because these are the targeted machines. The Intel dual core has a unified L1 cache. What does this mean? The Intel dual core has only 2 levels of cache, an L1 cache and an L2 cache, and the first one is 32 kilobytes.
It is unified, that is to say, data and instructions are in the same cache, not in separate data and instruction caches. Then there is an L2 cache, which is quite big, 2 megabytes, and it is 8-way set associative. I will explain some of these things because they are important to really appreciate the attack: what exactly is meant by saying it is 8-way set associative, and so on. Then the Intel i3, which is a 4-core machine, actually has 3 levels of cache: a separate I-cache and a separate D-cache, each of which is 32 kilobytes; then an L2 cache of 256 kilobytes shared between P0 and P2, with another shared between P1 and P3, each 8-way set associative; and then an L3 cache of 3 megabytes shared by all 4 cores. The other important parameter of these caches is that all of them have a block size of 64 bytes. So, let us try to understand exactly how a cache is organized. We will get back to one picture here: this is main memory and this is your cache, much smaller than main memory. Main memory might be, say, 4 GB, and the cache, if we take the L2 cache, is, say, 256 kilobytes, as in the case of the Intel i3. Now visualize the cache as being made up of lines; each of these is referred to as a cache block or a cache line. We start with the first block of main memory: it gets mapped to this location, the second one gets mapped to the next location, and so on, and then you continue in round robin fashion, so this block gets mapped to the first line of cache again, and so on. So, there will be many blocks of main memory that map to the same line in cache. This is an example of a direct mapped cache. Now, one of the things is the block size: it is 64 bytes.
Now, in an 8-way set associative cache, instead of this picture, there will be several lines inside a set; there will be 8 lines inside a set. So, let me draw a picture of an 8-way cache. The previous one would be called a direct mapped cache, which is one-way set associative. The one that we are targeting is actually an 8-way set associative cache. How does it differ? Now I have sets, and each set has 8 lines. This line of main memory does not map to a line; it maps to a set, and it can be anywhere inside the set. The second line maps to the second set, and so on. Notice that a line maps to a set, not to a particular line as in the previous diagram. So, if this line is in cache it will be somewhere inside this set; it cannot be in any other set, but it can be in any one of those 8 positions. The next line has to be in one of the 8 positions of the next set, and so on. So, this is 8-way set associative; once again the block size here is 64 bytes, and the question is, in this case, how many different sets exist? The number of sets is the cache size divided by the block size times the associativity: if C is the cache size, B the block size and A the associativity, the number of sets is C / (B × A). For the Intel i3, I just substitute the numbers: the cache size is 256 kilobytes, which is 2^18, divided by the block size 64, which is 2^6, and the associativity 8, which is 2^3. So, the number of sets turns out to be 2^9, which is 512. Let us keep these numbers in mind, and how we got them. We will look at the maximum time to access a line in each set as we go on.
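The set-count formula can be checked in a couple of lines; a trivial sketch, using the figures quoted for these machines:

```python
def num_sets(cache_bytes, block_bytes, ways):
    # Number of sets = C / (B * A).
    return cache_bytes // (block_bytes * ways)

# Intel i3 L2: C = 256 KB, B = 64 bytes, A = 8
# 2^18 / (2^6 * 2^3) = 2^9 = 512 sets
print(num_sets(256 * 1024, 64, 8))
```

The same formula applied to the dual core's 2 MB, 8-way L2 with 64-byte blocks gives 4096 sets.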
So, we now see that the number of sets in the L2 cache of the Intel i3 machine is 512. With this background let us continue; let us just recap our targets. We are trying to attack 2 different machines: the Intel dual core with the configuration I described, and the Intel i3, which has 4 cores, a separate I-cache and D-cache of 32 kilobytes each at level 1, and, the one I am most interested in, an L2 cache of 256 kilobytes. That is one number we should keep at the back of our minds. The next number is the block size. The block is basically the granularity of transfer between cache and main memory, or between L1 cache and L2 cache: you never transfer just half a line, you always transfer a full line, or you may transfer 2 lines, but you never transfer one fourth of a line or half a line. The other important thing is that at least these 2 caches are both 8-way set associative, which gives us the number of sets, as I just showed you on that picture. Now, we need to understand something about these AES tables; they are going to be put in cache. We are going to assume here that you have an attacker and a victim, and the victim is implementing AES. He keeps running AES again and again. What is he doing with AES? Think of the victim as a database service provider or an archival storage provider. He takes things from his clients, his customers, things like certificates, say your school leaving certificate or the degree certificate of your B.Tech degree, and you want to store them some place securely. So, he takes all those certificates, transcripts, whatever you want stored securely, and he stores them, and there is no reason why he should store your certificates with a different key from somebody else's; only he can access that storage.
Whoever gives him certificates and other documents to store, he stores in that same storage with one special key, and that is the AES key. So, the victim, once again, is the one implementing AES, and the attacker essentially has a very large array that he initializes and reads from and writes to. Now, it turns out that to implement AES you need a total of 4 kilobytes of tables, T0, T1, T2, T3; we will see exactly what those tables are and what they contain. These tables are placed back to back in virtual memory. Now, if they are 4 kilobytes, then from the information we already have we can see that they will occupy a total of 64 cache blocks. So, there are 4 tables, T0, T1, T2, T3, occupying a total of 4 kilobytes, which also happens to match the page size on these machines; we will see the significance of that in a while. Why do they occupy 64 blocks? Because each block is itself 64 bytes: 64 blocks of 64 bytes each is 2^6 multiplied by 2^6, which is 2^12, that is, 4 kilobytes. There are 16 blocks per table: there are 4 tables and 64 cache blocks, so each table has 16 blocks. Then, within each block (I use the words block and line interchangeably), each block or line of cache holds 16 table elements. Why? Because each table element is 4 bytes long; each element of these tables T0, T1, T2, T3 is actually 4 field elements, that is, 4 bytes, and because the cache block size is 64 bytes, there are 16 elements on each line or block of cache.
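The table-layout arithmetic just described is worth writing out once; a trivial sketch, with the constant names my own:

```python
ENTRY_BYTES = 4            # each table element is 4 bytes (4 field elements)
ENTRIES_PER_TABLE = 256    # one entry per possible byte index
BLOCK_BYTES = 64           # cache block (line) size on the targeted machines

table_bytes = ENTRIES_PER_TABLE * ENTRY_BYTES   # 1 KB per table
total_bytes = 4 * table_bytes                   # 4 KB: one page on these machines
total_blocks = total_bytes // BLOCK_BYTES       # 64 cache blocks in all
blocks_per_table = table_bytes // BLOCK_BYTES   # 16 blocks per table
elems_per_block = BLOCK_BYTES // ENTRY_BYTES    # 16 elements per block
```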
Now, for each block of plaintext to be encrypted, all 4 tables need to be accessed, but it turns out that not every block of a particular table is accessed. Now, what exactly are those tables T0, T1, T2, T3? We will demonstrate next. To attack the software implementation of AES we first have to know how it is actually implemented. So, what is going on? Recall there is one step where we use field multiplication of two matrices; I just write down those values from a previous slide. This matrix has to be multiplied by the matrix which is the input to the column mixing step. Now, what is the input? Briefly, we recall there were 4 steps: we started with byte substitution, then row shifting, column mixing and round key addition. We can interchange those first 2 steps. So, let us say we start with row shifting first; there is no problem, you can convince yourself that I can interchange the order of the 2 steps, starting with row shifting and then moving to byte substitution. Now, if I have shifted, what will the matrix look like? It is convenient to write the state in column major fashion: the first column is x0, x1, x2, x3, the second is x4, x5, x6, x7, and so on, so the first row reads x0, x4, x8, xc. Now, if I shift the second row to the left by one position, then this entry becomes x5, everybody sees that, right? I shifted it left by one position, so the x5 from here comes there. This is the output of the row shift: I get x5 here, and I am just shifting all these indices to the left, so I get 5, then 9, then d, and then 1. The first row still reads x0, x4, x8 and xc, because I am writing this down not in row major but in column major fashion. Then the next row is shifted by 2 positions.
So, what am I doing now? This is a particular round, and I have finished the first operation. I cheated a little bit: I started with shifting rather than substitution, and there is a reason for this. Before, I told you it is first substitution, then shift, then column mixing, then round key addition. What I am doing here is the shift first. Once I do the shift, this matrix, whose rows were originally x0, x4, x8, xc, then x1, x5, and so on, now looks like this: the first row is x0, x4, x8, xc, the second row is x5, x9, xd, x1, and so on. That is what the matrix looks like after the shift, and now I want to get the new matrix. What do I have to do for that? And this is the interesting thing. Instead of actually multiplying it out, I would have to take this element x0, which is a field element, 8 bits or 2 hex characters, look it up in that S-box, and replace the value here; similarly replace this one, and so on; and once I have replaced all those elements, I would do the field multiplication of these two matrices. So, let me repeat: the first step I did here was row shifting, and after row shifting I got this matrix. Now I have to do byte substitution, which involves taking this element and substituting it with another element from that 16 by 16 S-box. But I am going to play another game; I am going to be as efficient as I can, and what I do is use this element as an index. I take x0 and use it as an index into a table, let us call it table T0, and that gives me the value. So, there is this table T0 which has a total of 256 elements, and each element is 4 bytes, and guess what those 4 bytes are.
For example, if I take x0 and use it as an index into this table, what I get, these 4 bytes, are actually x0 multiplied by 02 (again, field multiplication, not normal multiplication), x0 multiplied by 01, x0 multiplied by 01, and x0 multiplied by 03. So, that 01 occurs twice. Once again, in this table T0 the entry in the x0th row is the concatenation of 4 bytes: the first byte is x0 multiplied by 02, then x0 multiplied by 01, x0 multiplied by 01, and x0 multiplied by 03, and that sits in the row indexed by x0. So, this is what I get. I am going to do a table lookup in table T0 instead of doing the substitution and the multiplication; I am going to make it very efficient. For each entry, say x0, I will look at table T0, and from the x0th row of that table I will get x0 multiplied by 02, x0 multiplied by 01, x0 multiplied by 01, and x0 multiplied by 03. This table has 256 rows, and each row has 32 bits. So, I can write a very nice assembly language instruction to pick out those 32 bits and send them through my data bus into the CPU, and those 32 bits, 4 bytes, are exactly the products I just mentioned. That is the first thing I get: with one table lookup I am picking up all 4 entries in one shot. Then the next thing I do is look at x5, and guess what I do now: I use x5 as an index into table T1, a very similar looking table of 256 elements, but in this table, guess what I have: x5 multiplied by 03 (look at the second column now), x5 multiplied by 02, x5 multiplied by 01, x5 multiplied by 01.
So, once again, one assembly language instruction where I use x5 as an index into table T1, and I pick out these 4 bytes. Then I use xa as an index into another table, T2, and what does that table contain? You can guess: if I take xa, that table contains xa multiplied by 01, xa multiplied by 03, xa multiplied by 02 and xa multiplied by 01; I am looking at the third column now. So, the 0th row of T2 will be 00 multiplied by 01, 00 multiplied by 03, 00 multiplied by 02 and 00 multiplied by 01, and the xath element of T2 will be xa multiplied by 01, xa multiplied by 03, xa multiplied by 02 and xa multiplied by 01. And the final thing is I use xf as an index into a fourth table, T3, whose elements follow the fourth column: the 0th element is 00 multiplied by 01, 00 multiplied by 01, 00 multiplied by 03 and 00 multiplied by 02. So, I use xf as an index into table T3 to pull out, in one assembly language instruction, xf multiplied by 01, xf multiplied by 01, xf multiplied by 03 and xf multiplied by 02, all concatenated in one word of the processor. I pull out all of that, and lo and behold I have xf multiplied by all of these. One slight correction: it is not actually xf multiplied by these; it is xf after you do the substitution. Do not forget that all of this subsumes two operations; you are killing two birds with one stone: in one shot you have done the substitution as well as the multiplication.
So, in the first access, the x0th row of T0 already reflects a substitution performed on x0 using the S-box; the table takes care of all of that. That is: take x0, perform the substitution on x0 using the S-box, and then multiply by 02, 01, 01, 03; all of that is contained in one single row, the x0th row of T0. It is important now to count how many table accesses I have done: for each element, one table access. As you might guess, x0 is used as an index into table T0, x5 into table T1, xa into table T2 and xf into table T3, and what do I do when I get all those words from memory, from cache memory? I add them, and when I say add I mean the field addition, which is an exclusive-or. So, this gives the output matrix of this round. The first step was the row shift; we interchanged it with substitution without any problem, so the row shift came first, and that is reflected in these indices. Then the substitution and then the multiplication, this matrix multiplication over field elements, have both been subsumed in just table lookups. So, for each element I do one table lookup: for x0 I look into table T0, for x5 into table T1, for xa into table T2, for xf into table T3, and I simply add those entries, exclusive-or those concatenations of 4 bytes, whatever came out of the tables: I take the word from T0 and exclusive-or it with the word from T1, then with the word from T2, then with the word from T3. And to complete the round I need to exclusive-or further with the round key.
So, in just 4 table lookups and a few exclusive-or operations, guess what, I have obtained this entire column. I do exactly the same thing now with x4: I use x4 as an index into T0, x9 as an index into T1, this element as an index into T2 and this one as an index into T3, and once I pull all of these out in 4 assembly language instructions, the next thing I do is exclusive-or all of them, and then exclusive-or with the round key, and I get the next column, and the next, and the next. So, basically I am doing 16 memory accesses per round, and almost certainly those accesses are going to hit in the cache. So, 16 accesses in cache. I now have tables T0, T1, T2, T3; let us see how they would be organized in cache, but before that, just to understand the notation, we will very quickly look at one slide where all of this appears together. I hope this is visible to all the participants; otherwise I will just read it out. It is a nice summary slide, from my student's M.Tech project presentation. There are 4 tables, each taking an 8-bit input and giving a 4-byte output. What are those tables called? T0, T1, T2, T3. What do you provide as input to a table? One of those elements x0 or x1 or xa or whatever; those are elements of the state array, one element at a time. You provide one of those elements to table T0, as I have shown. That is an 8-bit input, which means there are 2^8 possible indices, from 0 through 255. So, x0 serves as an index into table T0, and the element below it would serve as an index into table T1. That is why we say 4 tables, each taking an 8-bit input, and do not forget the output is an interesting 4-byte quantity: in one single shot you are getting the 4 bytes you needed, a 32-bit word.
So, this word can very nicely come into your CPU through one assembly-language instruction. The size of each table is 1 kilobyte. Why? Because there are 2^8 = 256 entries in the table, and 256 multiplied by 4 bytes gives you 1 kilobyte. When you write your program and define those 4 tables T0, T1, T2, T3, they will almost certainly be adjacent to each other in virtual memory, and the 4 tables together — 1 KB, 1 KB, 1 KB, 1 KB, 4 KB total — would almost certainly occupy one 4-kilobyte page; the default page size on most of these machines is 4 kilobytes. The contents of the tables are known to all: we can derive those constants very easily, because each table is derived from the substitution box followed by the field operations. But that is already done for you — the 4 tables are populated once and for all, by looking at the substitution box and doing the field operations. Every time you run AES you simply use those tables; you do not need to look at the substitution box, and you do not have to do the field multiplications by 02 and 03 and so on again. Now, this slide very nicely summarizes a particular column in a round — the first column of the output of that round. So, what is that column? If you remember the notation, I look at x0 and use it as an index into table T0. Just imagine the beauty of this implementation: x0 is used as an index into table T0, and in just one assembly-language instruction I pull out a 32-bit quantity — each element in these tables is 32 bits. Then I look at x5 — why x5 and not x4? because the row has been shifted — and use it as an index into table T1, and I get a 32-bit quantity again.
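How a table like T0 is derived can be sketched in a few lines of Python. This is my own reconstruction, not the lecture's code: I build the S-box from first principles (multiplicative inverse in GF(2^8) followed by the FIPS-197 affine transform) and then pack each T0 entry as the 32-bit word (02·s, s, s, 03·s):

```python
def gmul(a: int, b: int) -> int:
    """Multiply in GF(2^8) modulo the AES polynomial 0x11B."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        a = ((a << 1) ^ 0x11B) if a & 0x80 else (a << 1)
        b >>= 1
    return r

def rotl8(x: int, n: int) -> int:
    return ((x << n) | (x >> (8 - n))) & 0xFF

# S-box entry: multiplicative inverse, then the affine transform of FIPS-197
SBOX = []
for a in range(256):
    inv = 0 if a == 0 else next(x for x in range(1, 256) if gmul(a, x) == 1)
    SBOX.append(inv ^ rotl8(inv, 1) ^ rotl8(inv, 2)
                ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63)

def xtime(a: int) -> int:
    return ((a << 1) ^ 0x11B) & 0xFF if a & 0x80 else (a << 1)

# T0[a] packs (02*s, s, s, 03*s), s = SBOX[a], into one 32-bit word:
# substitution and MixColumns coefficients precomputed into a single look-up.
T0 = [(xtime(s) << 24) | (s << 16) | (s << 8) | (xtime(s) ^ s) for s in SBOX]
```

Note the sizes come out exactly as stated: 256 entries of 4 bytes is 1 KB per table, and 4 such tables fill one 4 KB page.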
So, one 32-bit quantity, then another; then I look at table T2 — the row shifted twice gives the value x10, which I use as an index into T2 — and then the row shifted 3 times gives x15, and that is an index into T3. All of these are 32-bit words; I exclusive-or all of them, in a couple of assembly-language instructions, together with the corresponding 32 bits of the round key. K0 stands for the 0th column of the round key: the round key can be thought of as a 128-bit entity, and its first column is 32 bits. So each of these is a 32-bit word; they are all exclusive-ored, and lo and behold, I have got the entire first column of the output matrix of this round — e0, which is 32 bits, or 4 bytes, or 4 field elements. And if you can read the slide, it gets into a little more detail: the superscript r+1 stands for round r+1, so x0, x1, x2, x3 of round r+1 form the first column — the expansion of that word in terms of the actual elements. x0 of round r+1 is the first element of the output matrix, that is to say the output of round r, or the input to round r+1; these are the inputs to round r+1 and those are the inputs to round r. So from the inputs to round r you get the inputs to round r+1, exactly by this computation. This line is for the first column of the matrix; the others are for the second, third, and fourth columns. This is the standard implementation used almost anywhere AES is implemented in software — for example, the OpenSSL library. To recap: there are 4 AES tables, each 1 kilobyte, for a total of 4 kilobytes; they are placed back to back in virtual memory; and they occupy a total of 64 cache blocks — 16 blocks per table, with 16 elements per block.
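The column formula can be checked end to end in Python. Below is my own self-contained sketch (again not the lecture's code): it rebuilds the S-box and T0, derives T1, T2, T3 as byte rotations of T0, and then verifies that the four-look-up formula T0[x0] ^ T1[x5] ^ T2[x10] ^ T3[x15] ^ K0 equals SubBytes followed by MixColumns followed by AddRoundKey computed the long way:

```python
def gmul(a: int, b: int) -> int:
    """Multiply in GF(2^8) modulo the AES polynomial 0x11B."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        a = ((a << 1) ^ 0x11B) if a & 0x80 else (a << 1)
        b >>= 1
    return r

def rotl8(x, n):
    return ((x << n) | (x >> (8 - n))) & 0xFF

SBOX = []
for a in range(256):
    inv = 0 if a == 0 else next(x for x in range(1, 256) if gmul(a, x) == 1)
    SBOX.append(inv ^ rotl8(inv, 1) ^ rotl8(inv, 2)
                ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63)

def ror32(w, n):
    return ((w >> n) | (w << (32 - n))) & 0xFFFFFFFF

T0 = [(gmul(2, s) << 24) | (s << 16) | (s << 8) | gmul(3, s) for s in SBOX]
T1 = [ror32(w, 8)  for w in T0]   # packs (03s, 02s, s, s)
T2 = [ror32(w, 16) for w in T0]   # packs (s, 03s, 02s, s)
T3 = [ror32(w, 24) for w in T0]   # packs (s, s, 03s, 02s)

def column_tables(x0, x5, x10, x15, k_col):
    """First output column: 4 look-ups and XORs, plus the round-key column."""
    return T0[x0] ^ T1[x5] ^ T2[x10] ^ T3[x15] ^ k_col

def column_longhand(x0, x5, x10, x15, k_col):
    """Same column via SubBytes (ShiftRows folded into the indices),
    then the MixColumns matrix, then AddRoundKey."""
    s = [SBOX[x0], SBOX[x5], SBOX[x10], SBOX[x15]]
    e0 = gmul(2, s[0]) ^ gmul(3, s[1]) ^ s[2] ^ s[3]
    e1 = s[0] ^ gmul(2, s[1]) ^ gmul(3, s[2]) ^ s[3]
    e2 = s[0] ^ s[1] ^ gmul(2, s[2]) ^ gmul(3, s[3])
    e3 = gmul(3, s[0]) ^ s[1] ^ s[2] ^ gmul(2, s[3])
    return ((e0 << 24) | (e1 << 16) | (e2 << 8) | e3) ^ k_col
```

The two functions agree on every input, which is exactly why the software implementation can replace the per-round field arithmetic with look-ups.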
So, let us just clarify all this. Each element of a table is 4 bytes, and each cache block is 64 bytes on most machines; therefore there must be 16 elements per block. With 16 elements per block and a total of 256 elements per table, there must be 16 cache blocks for table T0, 16 for T1, 16 for T2, and 16 for T3, for a total of 64 cache blocks. For each block of plaintext to be encrypted, all 4 tables need to be accessed. If we focus on one particular round of AES, we require 16 accesses; the entire encryption has 10 rounds, so 16 multiplied by 10 gives 160 accesses to these tables. Now, the interesting thing is that there is no guarantee that every single block of a particular table is accessed. Take table T0: how many times is it accessed in each round? Just recall — in each round T0 is accessed 4 times, guaranteed, and likewise T1 4 times, T2 4 times, T3 4 times. But over all the rounds — 4 multiplied by 10, that is 40 accesses to T0 — you cannot guarantee that every one of the 16 blocks of T0 is accessed. There are 16 blocks in T0 (out of the total of 64 cache blocks), you access T0 only 40 times, and that does not guarantee that each of those 16 blocks is touched. And that is the heart of the attack: to figure out which blocks are not accessed. Most of those 16 blocks of T0 will be accessed, but based on what is not accessed you can try to get valuable hints as to what the key is.
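A quick back-of-the-envelope check of the "a few blocks stay untouched" claim. Assuming, as a simple model, that each of the 40 look-ups into a table hits a uniformly random one of its 16 blocks (real accesses are not perfectly uniform, so this is only an estimate):

```python
# Each table gets 4 look-ups/round * 10 rounds = 40 look-ups over 16 blocks.
# P(a given block is never touched) = (15/16)**40.
p_untouched = (15 / 16) ** 40

# Expected number of untouched blocks across all 4 tables (64 blocks total).
expected_untouched = 64 * p_untouched
print(p_untouched, expected_untouched)
```

Under this model roughly 7–8% of blocks escape, i.e. around five untouched blocks per encryption — the same order as the 3 to 4 blocks quoted in the lecture.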
So, not every block of a table is accessed — and how do I figure out which? By cache accesses: as the attacker, I access the cache myself and see which blocks take me more time to access and which take less. If I can figure out which blocks of the tables are not accessed — out of those 64 blocks, statistically speaking, about 3 or 4 will not be — that gives me valuable clues as to what the encryption key is. This idea comes from a paper by Shamir and others. So, this is the main idea: by identifying which blocks of the tables are not accessed, valuable information about certain bits of the AES key can be deduced, and an attacker can bring down the complexity of an attack on AES — from around 2^128 for a brute-force attack, since the AES key is 128 bits, to around 2^48 — by using a cache-based side channel attack. So, what is the actual attack? Finding the location of the AES tables, approach 1: access S × W blocks, using the attacker's list of pointers, so that the cache is completely occupied. What the attacker tries to do is target the L2 cache and access every block — not every byte, but every block — inside it. In this notation, S is the number of sets and W is the associativity. That done, he hands control to the victim. As I said before, the victim is simply doing encryption: the victim performs an AES encryption, which brings almost all of those 64 table blocks into the cache.
So, those blocks are brought into cache, and the next step is: for each set, access the W blocks which map to it. If there are S sets inside the cache, you iterate over the sets; for each set the attacker, in this step, again accesses the W blocks that map to it and records the maximum access time — amongst those W blocks belonging to a given set, he looks at the access time of each, computes the maximum, and that is what we plot. Now, this was done on the Intel i3 with 4 cores; I just went through the calculation: the L2 cache is 256 kilobytes, each block is 64 bytes, and the associativity is 8, so the total number of sets is 2^18 / 2^6 / 2^3 = 2^9 = 512 sets. For each set I looked at the maximum time to access a block in that set and plotted it, and as per the attack I was hoping to find those 64 table blocks nicely on top, so that I could at least identify where the AES tables are in cache. Unfortunately, I was unsuccessful: I cannot see any plateau of 64 sets. Since 64 is one-eighth of 512, one-eighth of this graph should have shown a plateau of that sort, more or less contiguous, because almost all of those blocks are accessed — but I do not see any of this. The reason is that modern processors typically use something called prefetching: when there is a cache miss on a particular block, the next block is also fetched in anticipation of its use. That basically dashed my hopes — the first step of the attack was to find out where the AES tables are in cache, and at least from this picture I cannot. So, let us step back a little to what I am trying to do here.
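The cache geometry calculation from the lecture, spelled out:

```python
cache_size = 256 * 1024   # 256 KB L2 on the Intel i3 used in the lecture
block_size = 64           # bytes per cache block
ways = 8                  # associativity W

# number of sets S = cache size / (block size * associativity)
sets = cache_size // (block_size * ways)
print(sets)               # 2^18 / 2^6 / 2^3 = 2^9 = 512

# The 64 AES-table blocks span 64 sets: exactly one eighth of the cache.
fraction = 64 / sets
```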
So, when the victim starts to execute AES, the elements of T0, T1, T2, and T3 are brought from main memory into cache: the moment he starts accessing them and running the rounds, they are brought in — and I am talking about a complete encryption, all 10 rounds. Now, visualize the picture: each horizontal line is a set, and each set has 8 blocks — this is the L2 cache, with 8 blocks per set. What happens is that I bring in the first block of those tables — again, recall there are 4 tables T0, T1, T2, T3, each of 16 blocks, for a total of 64 blocks. So the first block of T0 lands somewhere inside one set — it can go anywhere within that set; the block is of size 64 bytes and carries 16 elements of T0. Then the next block of T0 goes to the next set, and so on — not necessarily in this exact order. In this way a total of 64 blocks are brought in, occupying 64 sets: 16 sets for T0, 16 for T1, 16 for T2, and 16 for T3. So, looking at the attack in pictures: in the first step, the attacker, who has initialized an array about the size of this L2 cache — it actually turns out to be larger than that, for reasons I will not explain here — accesses every single block inside the L2 cache: this block, that block, and so on. Each block of the cache is going to be accessed; that is step 1 of the attack.
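Why 64 consecutive table blocks land in 64 consecutive sets follows from the standard set-index calculation. A small sketch (the base address here is hypothetical, just for illustration):

```python
BLOCK, SETS = 64, 512   # Intel i3 L2 figures from the lecture

def set_index(addr: int) -> int:
    """Set a physical address maps to: block number modulo number of sets."""
    return (addr // BLOCK) % SETS

base = 0x1000   # hypothetical, page-aligned base of the 4 KB T0..T3 region
occupied = {set_index(base + BLOCK * i) for i in range(64)}

# The 64 back-to-back table blocks occupy 64 consecutive sets.
print(sorted(occupied)[0], sorted(occupied)[-1])
```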
Step 2: the victim begins to execute AES, and in so doing he needs to bring all 4 tables into cache — a total of 64 blocks. The first block could go here, for example; the second block goes into the next set — if you understand how caches work, it will be in the next set, but it need not be exactly below the first one, it can be anywhere inside that set — and so on. He brings in those blocks, not necessarily in this order but according to whichever x0, x1, … values he needs to access. So he brings in the 64 blocks — though, as I said before, statistically about 3 of them will not be accessed — and most of them end up sitting in 64 contiguous sets. Then the third step: the attacker again accesses each block of his array, looks at the maximum time to access a block within each set, and plots that maximum — access each block in the set, take the maximum of the access times, plot it for that set, then go to the next set, take the maximum for that set, plot it, and so on. Now, when he comes to a set holding a table block, he finds something very interesting: because the victim has done a full execution of AES — all 10 rounds — he must have brought that block in, and as a result, when the attacker goes back to access his own array, he finds that accessing his block in that set takes more time. The maximum of the access times is elevated, and we will see that in the graph: for all of those 64 contiguous sets you should see some elevation.
So, the first step for the attacker is to figure out where in main memory and in cache the tables T0, T1, T2, T3 sit. We did this experiment, and the results of the plot are in this slide: what we got reveals nothing. The reason, as I said before, is that modern processors have a feature called hardware prefetching: if a block B is requested from main memory, B+1 is also prefetched in anticipation of its use. So, even without requesting it, the processor prefetches: suppose there was a miss on block B; the processor brings B, and also brings B+1 in anticipation that it might be required in the very near future. This is hardware prefetching, done by the processor to ensure better bus utilization. (There is also something called software prefetching, done by the compiler, but we are not going to talk about it here.) Unfortunately, this greatly complicates the attacker's task: when we tried the attack, as you saw in the previous slide, we could not find any elevated region corresponding to the 4 AES tables. That was bad news; we had to keep working at it, and we modified our attack as follows. Now we iterate over each cache set — how many are there? we just calculated that for the Intel i3 there are 512 — and for each cache set we do the following: access the W blocks, from the attacker's list of pointers, that map to this cache set. In the picture, each horizontal row had 8 of these, so I access each of those W blocks corresponding to the same cache set. That is the first step; then I allow the victim to go on.
So, the victim performs an AES encryption, bringing into the cache whichever of the 64 table blocks the encryption uses, and then control goes back to the attacker. The attacker accesses the W blocks which map to this particular set s′ — W is the associativity, equal to 8 in the case of the i3 — and records the maximum time taken to access those W blocks. We keep doing this experiment for all the sets inside the L2 cache — all 512 of them, as we calculated. After doing this, we found the graph shown: with this modified attack there was a slight plateau — almost all sets in a 64-set range showed elevated access times, while the others were generally lower. There are some other points that are a little high, but those false positives are because of the operating system: the OS and other processes are executing, and they bring in their own blocks. And do not forget which cache this is — on the Intel i3 there are separate i-cache and d-cache at the first level, but L2, which is what we are talking about, is actually a unified cache, so it holds both data and instructions: the instructions and data of operating-system processes and of whatever other processes are running. So there is some amount of noise, but fortunately, if you turn off the browser and other applications to the extent you can, you will be able to see this: this is the place where the four tables T0, T1, T2, T3 are nicely sitting. Once we know where the tables are, the next step, as you would guess, is to find out which of those sets is not accessed.
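The prime → victim → probe sequence can be made concrete with a toy simulation. This is purely my own illustration: a tiny LRU set-associative cache model in Python, not real timing code (a real attack would measure access latencies with a cycle counter such as rdtsc). The attacker fills every way of every set, the victim's table accesses evict attacker lines in the sets they touch, and the probe phase flags exactly those sets where a probe access misses:

```python
from collections import deque

SETS, WAYS, BLOCK = 512, 8, 64   # Intel i3 L2 figures from the lecture

def simulate(victim_addrs):
    """Return the set indices where the attacker's probe misses (LRU model)."""
    cache = [deque(maxlen=WAYS) for _ in range(SETS)]

    def access(tag, s):
        line = cache[s]
        if tag in line:                 # hit: refresh LRU position
            line.remove(tag)
            line.append(tag)
            return True
        line.append(tag)                # miss: LRU way silently evicted
        return False

    # 1. Prime: attacker occupies every way of every set.
    for s in range(SETS):
        for w in range(WAYS):
            access(('A', w), s)

    # 2. Victim: AES encryption pulls its table blocks into cache.
    for addr in victim_addrs:
        access(('V', addr // BLOCK), (addr // BLOCK) % SETS)

    # 3. Probe: re-access attacker lines; any miss marks a disturbed set.
    disturbed = set()
    for s in range(SETS):
        for w in range(WAYS):
            if not access(('A', w), s):
                disturbed.add(s)
    return disturbed

# Hypothetical table placement: 64 back-to-back blocks starting at 0x1000.
noisy_sets = simulate([0x1000 + BLOCK * i for i in range(64)])
```

In this idealized model the elevated ("disturbed") sets are precisely the 64 contiguous sets holding the tables; on real hardware, OS noise and prefetching blur this picture exactly as described in the lecture.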
Once we know which sets are not accessed, then by some extremely complicated mathematics, which is in Shamir's paper, we can figure out which keys could not be potential keys; in other words, we reduce the space from 2^128 to around 2^48. So, this was the first step on the Intel i3: finding out where exactly those four tables are in cache, purely by experiment. We cannot see the victim's program, we cannot figure out where its tables are — this is a very indirect way, just measuring access times, of figuring out where the tables sit in cache. We also did this experiment on an Intel dual core; there, if you do the calculation from the L2 cache size and so on, it turns out there are about 4096 sets. We again found the location of the AES tables — those 64 sets — and strangely enough, while most of the time they were contiguous, sometimes the 64 sets were discontinuous. If you look at the x-coordinates of the two stretches, the number of sets in the first plus the number in the second totals 64. We tried to understand why, in about 10 percent of the cases, T0, T1, T2, T3 were not contiguous in cache but split in this fashion, and the reason is actually reasonably straightforward. If you look at main memory, T0, T1, T2, T3 will almost always be contiguous — no problem with that; each one is 1 kilobyte, 4 kilobytes total, because each table has 256 entries of 4 bytes, and 256 multiplied by 4 is 1024, i.e. 1 kilobyte. So, 1 kilobyte four times over; in general they would occupy one page, and very often they would be aligned on a page boundary, so they would sit in one page.
However, sometimes this is not the case: the tables are split between 2 pages — the page boundary falls in the middle, and the tables straddle it. Now, we know how virtual memory is mapped to physical memory: this is virtual space, and it is mapped to physical memory using a page table, so one virtual page might go to one particular page in main memory. And there is no guarantee that the next virtual page maps to the adjacent physical page — the way page tables work, it could map anywhere else altogether. So one page maps here while the next page maps somewhere else. That is why you see the tables not contiguous but discontinuous: some part of T0, T1, T2, T3 is at the end of one physical page and some part is at the beginning of another. In physical main memory, T0 might begin near the end of one page, for example, T1 continue, T2 be partly here and partly there, and T3 be on the other page. That is the reason the two stretches appear discontinuous: because of paging. So, this is the first part of the attack — figuring out where the AES tables lie. Once we have figured out where they lie, out of those 64 blocks of T0, T1, T2, T3 we figure out which blocks were not accessed — approximately 3 to 4 blocks will not be — and figuring out which blocks those are, via some very complicated mathematics that takes about 3 pages, gives me valuable clues as to what may not be a candidate key.
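The page-split arithmetic behind this explanation is simple to write down. A small sketch (the virtual base addresses below are hypothetical):

```python
PAGE = 4096          # default page size
TABLES = 4 * 1024    # T0..T3 back to back: 4 KB
BLOCK = 64

def split(vbase: int):
    """Bytes of the 4 KB table region on the first and second virtual page."""
    on_first = min(TABLES, PAGE - (vbase % PAGE))
    return on_first, TABLES - on_first

aligned = split(0x5000)    # page-aligned base: everything on one page
skewed = split(0x5300)     # base 0x300 into a page: region straddles two pages

# When the region straddles two pages, the two physical frames need not be
# adjacent, so the 64 table sets in cache appear as two stretches whose
# lengths still sum to 64 blocks.
blocks = (skewed[0] // BLOCK, skewed[1] // BLOCK)
```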
So, I can eliminate some keys from that 2^128 space and bring it down to around 2^48, and once I have brought it down to 2^48, then on a laptop, for example, I can spend a day or so obtaining the rest of the key. This was a known-plaintext attack. I have also talked about that last time, so I will not repeat it. These are some of the projects that we are doing, and since some of the participants had specifically requested that we talk about research directions and what is going on here, in addition to the basic course we spent about 2 or 3 sessions on some of the research directions we are pursuing. With that, I conclude this discussion.