Hello everyone, today I am going to present our paper, RASSLE: Return Address Stack based Side-channel LEakage. My name is Anirban Chakraborty and I am a PhD student at the Indian Institute of Technology, Kharagpur. This work has been done in collaboration with Dr. Sarani Bhattacharya from KU Leuven, Manaar Alam from the Indian Institute of Technology, Kharagpur, Dr. Sikhar Patranabis from ETH Zurich, and Dr. Debdeep Mukhopadhyay from the Indian Institute of Technology, Kharagpur. Before going into the details of our work, I would like to talk about the deadline scheduler and try to understand its security implications. Schedulers, as we know, handle the removal of processes from, and the selection of processes to, the CPU. There are many types of schedulers, the common ones being CFS, FIFO, and the deadline scheduler. In particular, we focus on the deadline scheduler, which automatically preempts a process from the CPU after its requested runtime expires. This gives it quite wide applicability, mainly in real-time operating systems and embedded systems, but also in general-purpose servers. However, the security implications of deadline schedulers have never been studied before. So, in this paper, we will talk about the deadline scheduler and try to exploit its security implications. To use the deadline scheduler, the user must have the CAP_SYS_NICE capability in order to adjust the scheduling parameters. Normal users can efficiently utilize system resources for better performance using the deadline scheduler, which is common in real-time execution environments. To adjust the scheduling parameters, we can use the chrt command along with the three parameters sched_runtime, sched_deadline, and sched_period, and the name of the executable. In this presentation, we will show how, by utilizing the deadline scheduler, an attacker can effectively achieve synchronization with the victim process. So, here is the outline of our talk. First, we will discuss the return address stack, or RAS.
Then we'll see how to reverse engineer the RAS for undocumented processors. We propose a novel attack, RASSLE, by establishing a covert channel through the RAS. Then we move on to a case study on OpenSSL ECC scalar multiplication, and we will see how to use the deadline scheduler to achieve synchronization. Finally, we see a case study on the ECDSA signature generation algorithm. Return instructions are a special type of indirect branch instruction that might get executed from different program locations, with different target addresses for each call site. For example, take the printf function, a common GNU C library routine that can be called from different locations inside a source code. Every time, the same library subroutine will be invoked and the same set of instructions will be executed, but since the calls come from different program locations, the return address for each of the corresponding function calls will be different. Now, to reverse engineer the size of the stack, we devise a simple experiment. We start with an arbitrary number of nested function calls, say 17, where the main function calls function f17, which in turn calls function f16, and so on down to the deepest function, which is f1 here. As per the working principle of the RAS, the return address of the main function will be pushed onto the RAS first, followed by the return address of function f17. Now, suppose the RAS can hold 16 entries. Proceeding this way, when function f2 is called, its return address will also be put at the top of the stack, and at this point the stack will be completely full of valid entries. When function f1 is called, its return address will also be pushed onto the stack, but as the stack was already full, this leads to an overflow condition, and thus the return address for function f17 will be pushed out of the stack.
One must note that the RAS helps keep return addresses close to the processor, and thus reduces access latency. So, for any return address that is not present in the RAS, the processor will take more time to complete the execution. We use this increase in execution time to reverse engineer the stack size. As already discussed, we start with n nested function calls and check the difference in execution time between n calls and n-1 calls. In order to account for system noise, we repeat the operation multiple times and calculate the mean difference in execution times. Then we repeat it for function depth n-1, and check the mean difference between the (n-1)-th and (n-2)-th call depths. So, we reduce the function depth by one every time and keep a log of the differences in execution time. This plot shows the difference in execution time for consecutive function call depths. We can observe that the difference increases significantly after 16 function calls. The reason is that when the depth of nested function calls is less than 16, the processor gets all return addresses from the RAS and therefore takes considerably less time to complete the execution. So we can conclude that, on the target system where we ran this experiment, the RAS can hold up to 16 entries. This experiment is generic and can be applied to any processor to find out the depth of its RAS. We further validate our observation using hardware performance counters. We know that, in a speculative execution environment, the target address of a return instruction is predicted by referring to the RAS, and is matched with the actual value stored in main memory much later in the pipeline. Therefore, any wrongly predicted address, or an underflow or overflow condition in the RAS, will result in a branch miss event.
Branch miss events can be measured fairly accurately using the perf event tool and the event PERF_COUNT_HW_BRANCH_MISSES. We observe that for the inner 16 functions the number of branch misses is 2, whereas for the 17th function it becomes 3, and it increments by 1 for every increment in the depth of the function call. This validates that the size of the RAS is 16, because beyond that, as we increase the depth of the nested calls, return addresses overflow the RAS and, as a result, the branch miss count increments. We exploit the fact that an overflowing RAS results in an increase in execution time, and this timing difference can be observed by a co-located process to establish a covert channel between two processes. Now, consider the scenario where two processes, A and B, are running simultaneously on the same logical core. Process A executes a series of n nested function calls, where function f1 calls f2, which in turn calls f3, and so on. Therefore, for each function call, an entry is inserted at the top of the stack. We choose n such that the entire stack gets filled with return addresses. Inside its innermost function, A yields the CPU before executing the return instruction. Therefore, the entire RAS is filled with the return addresses of process A, and at this point it yields control of the CPU to another process, B. Now, B executes m functions, also in a nested fashion. As both processes A and B share the same RAS, the return addresses of process B will push out some of the return addresses of A; more specifically, m of the n return addresses of process A will be pushed out of the stack. When control goes back to process A, it can easily determine that some of its return addresses have been pushed out of the stack by measuring the execution time of its different function depths.
Now, we will use the scenario just described to create a covert channel between a sender and a receiver. The receiver makes 16 nested function calls and yields the CPU in the deepest function without executing the return statement. As a result, at this point in time the entire RAS is filled with return addresses of the receiver process, and control now goes to the sender process. The sender processes a bit stream of 0s and 1s, as shown in the adjoining sample program. On processing a bit with value 1, it makes a function call; on processing a 0, it does nothing. Suppose it processes a 1: it executes a call to a function, and as a result the return address of this function gets pushed onto the RAS. Since the RAS was filled with the receiver's return addresses, one of the receiver's return addresses will be pushed out of the stack. Control then goes back to the receiver, which, after getting the CPU back, starts executing its unfinished return instructions by referencing the addresses stored in the stack. To infer the message transmitted across the covert channel, the receiver measures the timing latency of its outermost function call. On receiving a 1, the receiver will encounter a stack underflow situation and will thereby see an increase in the execution time, whereas on receiving a 0 no extra latency will be observed. The adjoining figure shows the timing values as observed by the receiver. The threshold is empirically selected: timing values above the threshold denote a bit 1, and values below the threshold denote a bit 0. We have conducted the experiment on multiple systems, where we reverse engineered the size of the RAS and performed our covert channel experiment to observe the average bandwidth. Now, to demonstrate RASSLE in a real-world setting, we target the scalar multiplication operation over the P-384 curve from the OpenSSL library.
Elliptic curve cryptography (ECC) is one of the most widely used asymmetric key algorithm families, based on the algebraic properties of elliptic curves over finite fields. Scalar multiplication is a fundamental, security-critical operation in ECC which computes Q = kP, where k is an n-bit secret scalar and Q and P are points on the elliptic curve. The security of ECC rests on the hardness of determining the scalar k given both the points and the curve parameters. The scalar multiplication in OpenSSL for this curve is implemented using a Montgomery ladder with conditional swaps, with the scalar represented in non-adjacent form. The scalar k is transformed into its corresponding wNAF representation, and based on this representation a series of double and add operations is executed to perform the multiplication. These operations are in turn implemented by a series of bn_add and bn_sub functions. We now proceed to perform a template attack on ECC scalar multiplication. The victim and the attacker share the same logical core, and thereby share the same RAS. The attacker first fills up the RAS and yields the CPU to the victim. The victim performs the ECC multiplication through the Montgomery ladder and yields the CPU after each iteration of the ladder. Control then comes back to the adversary, which now measures the timing of its function calls. More specifically, the spy first fills up the entire RAS with the return addresses of its N functions; the value N can easily be determined by reverse engineering the size of the RAS. It then yields control of the CPU without executing any return statement. At this point, the entire RAS is filled with the return addresses of the spy. Control now goes back to the victim, which is executing the multiplication operation. The victim yields the CPU after every iteration of the ladder.
Control now comes back to the spy, which measures its own execution time to check whether any of its return addresses have been pushed out of the stack. We perform the attack in an iterative manner. At a particular instance, the adversary targets the i-th bit of the secret scalar, under the assumption that the adversary already knows the first i-1 bits. The template attack works in two phases: template building and template matching. During the template building phase, the attacker simulates the number of bn_add and bn_sub function calls for each bit. As the number of bn_add and bn_sub function calls depends on the particular bit being processed and the affine coordinates of the curve points, the attacker builds templates for each bit based on the total number of these function calls executed for a fixed set of inputs. For any particular bit, say the i-th bit, the attacker performs point multiplication using a set of unique inputs, fixing the i-th bit to be both 0 and 1. Next, for each input, the attacker estimates the total number of addition and subtraction function calls made by the ECC program for the i-th bit, assuming its value to be both 0 and 1, and simultaneously uses the spy process to measure the execution time using RASSLE. Now that we have the overall strategy, let us look at the process in detail. We introduce an encoding scheme to represent the total number of bn_add and bn_sub function calls as unique classes. Suppose that, for a particular input and the i-th bit, the Montgomery ladder executes X bn_sub and Y bn_add function calls; we represent this as the class XY. Based on these classes, we segregate the corresponding inputs and associated timing values by creating hypothetical bins corresponding to each class. So, by fixing the i-th bit to be 0 we get one set of bins, and by fixing its value to be 1 we get another set of bins. In our experiments, we found that the classes mostly lie in the range 81 to 85, then 90 to 96, and then again 100 to 106.
So the attacker selects a pair of bins which contain a relatively high number of inputs; in this example, the classes 83 and 95. The takeaway is that for each bit position there will be 4 template bins: 2 for when the value of the bit is 1, and 2 for when it is 0. Once we have selected our bins, we proceed to the template matching phase. In the template matching phase, the goal of the attacker is to predict the correct value of the i-th bit. The attacker again observes the encryption process using inputs associated with the 4 selected bins, and also observes the timing values using RASSLE. The idea is that the timing values for the correct i-th bit should match either of the 2 sets of templates, but not both. Here is the result from our experiments. This is the template for the 350th bit with value 0; we can see the classes selected are 83 and 95. For the correct estimate, the distribution quite clearly matches the correct template, whereas for the wrong estimate it does not. The attack on ECC scalar multiplication demonstrates how RASSLE can be utilized to leak information about the control flow of another process, but the challenge was that it requires the victim to yield control of the CPU on every iteration. This yielding of the CPU can be made possible using deadline schedulers, which impose a deadline on operations to prevent starvation of processes. The victim and the spy can then be executed with suitable sched_runtime parameters. In this case, we assume that the system scheduler supports the deadline scheduling class, and the user obviously needs the CAP_SYS_NICE capability to launch the attack from user space; however, no synchronization mechanism is required inside the victim code. Before moving on to our next attack, we will provide a brief background on ECDSA.
ECDSA typically uses an elliptic curve with a base point P of prime order q, and consists of two parts. One is a signing operation, where a nonce k is sampled uniformly in the range 1 to q-1, and which outputs the signature (r, s); the other is a verifier which, given a signature (r, s) and a message m, computes the hash and finally outputs 1 if the x-coordinate of Z modulo q equals r, and 0 otherwise. We perform the attack on ECDSA in two parts: an online phase, where we perform a targeted recovery of a fraction of the MSBs, or most significant bits, of the nonce k sampled by the ECDSA signing algorithm with the help of RASSLE; and an offline phase, where we combine the partial nonce information with lattice-based cryptanalytic techniques to retrieve the final signing key. For the online part of the attack, we again resort to a template attack on randomly selected nonces, but this time we consider building templates for a window of unknown bits instead of a bit-by-bit iterative approach. For template formation, the adversary executes a spy and a dummy victim process performing ECDSA scalar multiplication simultaneously, using the deadline scheduler. For the L MSB positions of the nonces, there are 2^L possible bit sequences, and we build templates for each of these 2^L combinations. For each of the 2^L bit sequences, the dummy victim process performs ECDSA scalar multiplications using 2^L nonces, changing the L most significant bits while keeping the other bits the same, while the spy, running in parallel, continuously fills up the RAS and probes it to observe the timing values through RASSLE. As the adversary needs to retrieve only the L MSBs of the nonces, the spy considers only those L timing observations that correspond to the L MSBs. So now the adversary has timing samples for the L MSBs of each of the 2^L bit sequences of the nonces.
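For reference, the textbook ECDSA signing and verification equations, matching the description above, are:

```latex
% Curve with base point P of prime order q, signing key d,
% public key Q_{pub} = dP, and message hash h = H(m).
\begin{align*}
\textbf{Sign:}\quad & k \xleftarrow{\$} \{1,\dots,q-1\},\qquad
  r = x\text{-coord}(kP) \bmod q,\\
& s = k^{-1}\,(h + r\,d) \bmod q,\qquad \text{output } (r,s).\\[4pt]
\textbf{Verify:}\quad & Z = (h\,s^{-1})\,P + (r\,s^{-1})\,Q_{\mathrm{pub}},\\
& \text{output } 1 \text{ iff } x\text{-coord}(Z) \bmod q = r.
\end{align*}
```

The dependence of r and s on the per-signature nonce k is what makes partial nonce leakage so dangerous, as the offline phase below exploits.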
Next, the adversary selects medians from each of these timing distributions as representative templates for a particular bit position of a particular sequence. From the figure, it is apparent that the distribution of timing samples for each bit position can be subdivided into 3 regions. The most intuitive explanation of this observation is that the spy tries to achieve synchronization with the help of the deadline scheduler, without explicit handles inside the victim code. Due to the absence of a perfect synchronization mechanism, there is mutual overlap between the timing samples of any 2 adjacent trace points. We therefore define 3 separate regions in the timing distribution, a lower region, a middle region and an upper region, and we select medians from each of these regions. Therefore, the adversary will have 3 × L × 2^L templates for the L MSB positions of the 2^L nonce candidates. Now, in the template matching phase, we choose 500 nonce-signature pairs for a randomly chosen ECDSA signing key. The attacker tries to extract the 6 MSBs of each of these 500 nonces. For a particular bit position, we have 2^6 templates, each having 3 regions. Next, we select the median which has the least difference from the actual observed timing value, and finally we apply a least-squares-error method to determine the top 5 templates, which represent the possible combinations for the 6 MSBs. Therefore, for 500 nonces we have 500 × 5 candidate combinations of the 6 MSBs. The table shows the ordering of candidate nonce combinations of the 6 MSBs achieved by least-squares error. Now, given the noisy leakage samples on the partial nonces used by the ECDSA signing algorithm, we aim to recover the ECDSA signing key using a combination of a lattice reduction algorithm and statistical mixing and matching of leakage samples.
We adopt a trial-and-error approach where we randomly select 200 candidate partial nonces to create a hidden number problem (HNP) instance. We convert them into the lattice and the target vector for a CVP (closest vector problem) instance, and finally solve the CVP instance using the fplll solver to arrive at a guess for the secret key. If the guess is correct, the attack outputs the recovered secret key; otherwise it repeats the same process for a different randomly selected set of instances. We note that the attack can trivially identify when the correct key has been recovered, by checking whether it yields the correct public key, which is available for verification. This check involves a single deterministic scalar multiplication, so when the attack terminates by outputting a secret key, we can be sure that the correct secret key has been recovered, as opposed to merely guessing it. The table shows the time taken for ECDSA key retrieval from partially leaked nonces, where we try with 500 signatures; the key is retrieved in each of the cases, with the online phase taking somewhere around 5 to 7 seconds and the offline phase taking a little more than 1 hour. For our attack, we used the following setup. For the online phase, we used a system with an Intel Xeon E5-2609 CPU, which supports the deadline scheduler by default, running Red Hat Linux Server 7.7; the deadline scheduler parameters we set were sched_runtime = 3600, sched_deadline = 3700, and sched_period = 7200. For the offline phase, more specifically for the lattice reduction part, we used a cluster of 260 nodes, where each node had 128 AMD EPYC 7742 processors with 2.25 GHz nominal and 3.4 GHz peak clock speed, and 512 GB of 3200 MHz RAM.
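The reduction from leaked nonce MSBs to a hidden number problem is standard; for reference, it can be written as follows (textbook formulation, with n the bit length of q and ℓ the number of leaked MSBs):

```latex
% Each signature (r_i, s_i) on hash h_i satisfies
% s_i = k_i^{-1} (h_i + r_i d) mod q, so
\begin{align*}
k_i &\equiv s_i^{-1} h_i + s_i^{-1} r_i\, d \pmod{q}.\\
\intertext{Writing the nonce as $k_i = 2^{\,n-\ell} a_i + b_i$, with the
$\ell$ leaked MSBs $a_i$ known and $0 \le b_i < 2^{\,n-\ell}$ unknown,
each signature yields one HNP relation in the signing key $d$:}
b_i &\equiv t_i\, d + u_i \pmod{q},\qquad
  t_i = s_i^{-1} r_i,\quad u_i = s_i^{-1} h_i - 2^{\,n-\ell} a_i .
\end{align*}
% The b_i are small, so d is the hidden number, recoverable by
% solving CVP on the lattice built from these relations.
```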
To conclude: in this presentation, we discussed the return address stack, a core component of the speculative execution subsystem. We proposed a generic methodology to reverse engineer the RAS for undocumented processors. We proposed a novel attack, RASSLE, in which a covert channel is created between two co-located processes through the RAS. We demonstrated an exploit on ECC scalar multiplication over the P-384 curve. We demonstrated asynchronous execution by utilizing the deadline scheduler, and finally we showed an exploit breaking the ECDSA signature generation algorithm over the curve P-256. Thank you for your attention.