Hi everyone. I'm going to introduce our work, power analysis on NTRU Prime. I'm Wei-Lun Huang, and the other two authors are Dr. Jiun-Peng Chen and Dr. Bo-Yin Yang. We come from Academia Sinica, Taiwan. I'll first introduce post-quantum cryptography and NTRU Prime. Then, after a brief overview of the experiment settings and the power analysis methods in our paper, I'll talk about the features of our three single-trace attacks. Finally, I'll list some other implementations that are potentially subject to our methods.

NTRU Prime. Shor's algorithm can efficiently solve integer factorization and discrete logarithms, and functional quantum computers are estimated to arrive in 10 to 20 years. Therefore, NIST initiated its post-quantum cryptography standardization project in 2016. The submissions include key encapsulation mechanisms and digital signatures. We can categorize the hardness assumptions behind them into lattices, error-correcting codes, multivariate quadratic equations, supersingular isogenies, hash functions, and some others. Here, we target NTRU Prime, a lattice-based KEM. It contains two KEM algorithms, Streamlined NTRU Prime and NTRU LPRime. Each has three parameter sets, characterized by the polynomial size: 653, 761, and 857 coefficients respectively.

In this slide, we use Streamlined NTRU Prime 761 decapsulation as our target implementation. R here refers to the integer polynomial ring modulo x^p - x - 1. In R/q and R/3, the coefficients come from GF(q) and GF(3) respectively. A polynomial is small if it is ternary, and is furthermore short if it has exactly w non-zero coefficients. Streamlined NTRU Prime is characterized by these three parameters: p, q, and w. The public key h and the ciphertext c are two general polynomials of p coefficients from GF(q). In contrast, the session key r and the private key f are short, so they have only w non-zero coefficients, 1 or -1. We are interested in the multiplications between a public polynomial and a short polynomial. The first input can be the ciphertext or the public key, so we know it. The second input can be the session key or the private key, so we want it. The target implementation uses product scanning for its multiplications, so the store operations to the output array are minimized, as we compute one output coefficient at a time. It is clear in this diagram that the part with the most multiply-accumulate operations is the calculation of the middle output coefficient. Such multiplications over R/q appear not only in decapsulation, but also in encapsulation and in the key generation of NTRU LPRime.

A brief overview. We run our target implementations, in C and ARM assembly, on a Cortex-M4. The ChipWhisperer-Lite two-part version helps generate random inputs and measure the target's power consumption. Then we run our statistical analysis programs, in Python 3.6 and C++, to analyze the power traces. Here is a photo of the measurement setup. Here are the four power analysis methods in our paper and their features. Three of them are single-trace attacks.

Correlation power analysis: vertical versus horizontal in-depth. Vertical CPA observes only the most operation-heavy part of the multiplication, and that's the calculation of the middle output coefficient. The target implementation loads input coefficient pairs from the lowest degree to the highest degree of the coefficients we know, say the ciphertext coefficients. So we reveal the secret coefficients, say the private-key coefficients, from the highest degree to the lowest degree.
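To make the target concrete, here is a minimal Python sketch, assuming schoolbook product scanning and a Hamming-weight leakage model; it is an illustration, not our measurement code. The parameters follow Streamlined NTRU Prime 761.

```python
# Minimal model of the multiply-accumulate sequence targeted by the
# CPA attacks: the middle output coefficient u_{p-1} of c * f under
# product scanning. Illustrative sketch, not the authors' code.

P, Q = 761, 4591  # Streamlined NTRU Prime 761 parameters

def middle_coefficient_states(c, f):
    """Intermediate accumulator states of u_{p-1} = sum_i c_i * f_{p-1-i}.

    Pairs (c_i, f_{p-1-i}) are loaded from the lowest degree i = 0
    upward, so for a known non-zero c_i the state changes exactly
    when f_{p-1-i} is non-zero.
    """
    acc, states = 0, []
    for i in range(P):
        acc = (acc + c[i] * f[P - 1 - i]) % Q
        states.append(acc)
    return states

def ideal_samples(c, f):
    """Ideal power samples under a Hamming-weight leakage model: one
    value per multiply-accumulate, to be correlated with measurements."""
    return [bin(s).count("1") for s in middle_coefficient_states(c, f)]
```

Vertical CPA correlates one such ideal sample across many executions; in-depth CPA, which comes next, correlates the whole sequence within one single execution.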
Between intermediate states, the value of the middle output coefficient changes only if the current private-key coefficient is non-zero. So we look for such changes throughout the calculation, reveal the positions and values of only the non-zero private-key coefficients, and thus recover the entire private key. It is worth noting that the non-zero private-key coefficients of the two highest degrees should be revealed together. If we tried to reveal that of the highest degree alone, we probably could not discern the change in the output coefficient value from the load operation of the corresponding ciphertext coefficient. As for the other non-zero private-key coefficients, we just reveal them one at a time.

In vertical CPA, we compute the ideal power sample sequence from the same intermediate variable across different inputs, and then we compare it with its real-world counterpart at the same timing across different executions. So, can we squeeze more information out of one single short trace? The answer is yes. In in-depth CPA, we compute the ideal power sample sequence from different intermediate variables of one single input. Now its real-world counterpart comes from different timings in one single execution. In both scenarios, one timing corresponds to one intermediate variable, and one intermediate variable similarly corresponds to one private-key coefficient to recover.

So why would in-depth CPA make a difference? Each intermediate state of the middle output coefficient depends on both the current input coefficient pair and all the previously loaded input coefficient pairs. So one intermediate state does more than reveal one new private-key coefficient: it can meanwhile verify the correctness of all the previously revealed private-key coefficients. To exploit this dependence, we use the extend-and-prune framework.

Here is an example. We would like to reveal 67 private-key coefficients at a time, so we recursively generate each 67-coefficient hypothesis. A private-key coefficient can only be 1, 0, or -1, so the number of candidates steadily triples. Whenever the current hypothesis has 6 new coefficients, we calculate the correlation between the ideal and real power sample sequences. If the correlation falls below the fixed threshold, then we discard the current hypothesis right away. That's how we get the vertical drops at the multiples of 6 and slow down the exponential increase of the candidates to test.

Unfortunately, some private-key coefficients at the end of a block may go wrong. In the current block recovery, they are related to very few intermediate states, so wrong values can still lead to better correlation due to noise. In the next block recovery, the initial intermediate state must then be wrong, so no hypothesis, not even the correct one, can lead to a good correlation. No survivors: an epic fail. Luckily, having no survivors in the current block recovery also signals the tail errors in the last block recovery, and we can roll back by half a block to correct the tail errors. They are then right in the middle of the new block.

Here is a toy example. Each block contains 5 coefficients, with candidate pruning every 2 new coefficients. So we recursively generate the first coefficient and the second, then prune the candidates; the third and the fourth, then prune the candidates; and then the fifth. Here, we choose the best 5-coefficient survivor as our optimal guess. However, if we make a mistake at the fifth, then the second block recovery yields no survivors. That triggers the rollback mechanism, and now the third block recovery is meant to correct the tail error.
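Here is a sketch of that extend-and-prune loop in Python. The helper `ideal_samples_for` is a hypothetical stand-in for computing the ideal power samples of a partial hypothesis, `measured` holds the corresponding real samples, and the pruning interval and threshold are illustrative.

```python
# Extend-and-prune sketch: extend each surviving hypothesis by one
# ternary coefficient, and prune by Pearson correlation every
# PRUNE_EVERY new coefficients. No survivors at the end signals tail
# errors in the previous block, triggering the half-block rollback.
import numpy as np

PRUNE_EVERY = 6  # prune whenever a hypothesis has this many new coefficients
THRESHOLD = 0.8  # illustrative correlation threshold

def extend_and_prune(block_len, measured, ideal_samples_for):
    survivors = [()]  # start from the empty partial hypothesis
    for depth in range(1, block_len + 1):
        # Extend: each candidate coefficient can only be 1, 0, or -1,
        # so the number of candidates steadily triples.
        extended = [h + (coef,) for h in survivors for coef in (1, 0, -1)]
        if depth % PRUNE_EVERY == 0:
            # Prune: keep a hypothesis only if its ideal samples
            # correlate well with the measured samples so far.
            extended = [
                h for h in extended
                if np.corrcoef(ideal_samples_for(h), measured[:depth])[0, 1]
                   >= THRESHOLD
            ]
        survivors = extended
    return survivors  # an empty list here is the rollback signal
```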
What if we observe more than one output coefficient calculation? With only one, the candidate pruning in in-depth CPA can be ineffective, because each m-coefficient block corresponds to only m power samples. Many wrong hypotheses, or their prefixes, can still fit the few power samples quite well. So overall, in-depth CPA can be inefficient and even inaccurate. But we can learn from horizontal attacks and observe some output coefficient calculations near the middle one. These output coefficients have almost as many intermediate states as the middle output coefficient. Therefore, if we observe l - 1 more output coefficient calculations, we have almost l times as much data. Here's a real-world example: we observe 5 output coefficient calculations rather than 1. The candidate pruning is now effective: only 42 hypotheses of 67 coefficients survive in the top block recovery. Unfortunately, there is one tail error, so the middle block recovery has no survivors, and the rollback mechanism gets triggered. The third, bottom block recovery is meant to correct the tail error, and it succeeds.

Online template attacks. If the power characteristics of the target device fail to fit simple power models well, we need some template traces to profile those characteristics. These template traces come from a fully controlled device similar to the target device. Classical template attacks need many template traces and heavy computational power to compute the multivariate Gaussian power model. So, can we complete the profiling stage with just a few template traces? The answer is yes. In online template attacks, we acquire a single target trace first. We then partition this target trace into pieces; each piece corresponds to one multiply-and-accumulate. Then we generate three template traces for the first private-key coefficient recovery, corresponding to the private-key coefficient being 0, 1, and -1. Online template attacks then compare the first piece of the target trace with each of the three template traces. The closest template trace in terms of Euclidean distance gives us our optimal guess. We then add this guess to our knowledge and, according to this knowledge, we generate the next three template traces for the second private-key coefficient recovery, and so on.

So, can we build the templates we need with even fewer template traces? The answer is yes, and here is the chosen-ciphertext variant of online template attacks. We set the input ciphertext to have p identical coefficients. Now each intermediate state is a multiple of c_0 modulo q. In this case, although all w non-zero private-key coefficients, 1 and -1, are randomly distributed, we need much fewer template traces to mount the attack. Here is an example; it needs only 60 template traces. The blue clusters are the cases where the private-key coefficient is 0; the green ones on the right, where the private-key coefficient is 1; the red ones on the left, where the private-key coefficient is -1. Each red, blue, and green cluster sits at a multiple of c_0 modulo q. If the template generator further accepts illegitimate private keys, then we can set f* to be 0, 1, 0, 1, 0, 1, and so on, or 0, -1, 0, -1, 0, -1, and so on. We then set the template ciphertext c* to the input ciphertext, except that c*_1 can be c_0 or c_0 times -w modulo q. Now we only need four template-generator executions to build all the templates we need.
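Here is a minimal Python sketch of the matching loop, with `generate_template` as a hypothetical stand-in for profiling the controlled device under the chosen inputs; traces are NumPy arrays of equal length.

```python
# Online template attack sketch: match each piece of the single target
# trace against three freshly generated templates (f_i = 0, 1, -1) by
# Euclidean distance, then feed the guess into the next templates.
import numpy as np

def nearest_coefficient(target_piece, templates):
    """templates maps each coefficient in {0, 1, -1} to a template
    trace; return the coefficient whose template is closest."""
    return min(
        templates,
        key=lambda coef: np.linalg.norm(target_piece - templates[coef]),
    )

def online_template_attack(target_pieces, generate_template):
    known = []  # private-key coefficients recovered so far
    for piece in target_pieces:
        # Build the three candidate templates from current knowledge.
        templates = {c: generate_template(known, c) for c in (0, 1, -1)}
        guess = nearest_coefficient(piece, templates)
        known.append(guess)  # update knowledge before the next piece
    return known
```

The chosen-ciphertext variant changes only how cheaply `generate_template` can be realized, since the intermediate states then collapse onto a few multiples of c_0 and many pieces can share the same templates.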
Chosen-input simple power analysis. There are two commonly used countermeasures for lattice-based cryptosystems. The first is to randomly initialize each and every output coefficient; then, at the end of each output coefficient calculation, we subtract this random offset from the result. The second is to randomly access the input coefficient pairs during each output coefficient calculation. Unfortunately, both countermeasures are ineffective when the adversaries can choose the input ciphertext.

A simple example: we start with countermeasure 2. Here, we choose the input ciphertext to be the constant c_0 and observe the output coefficient calculations from degree 0 to degree p - 1. First, we partition this long single power trace into p pieces. Then, if the private-key coefficient of degree i is non-zero, the corresponding i-th piece of the long power trace will be discontinuous. There are only two kinds of discontinuities, because there are only two possible values of a non-zero private-key coefficient, 1 and -1.

As for the chosen-input SPA on countermeasure 1, it is a bit more complicated. Its first stage resembles the introductory example, but here we only care whether each piece is continuous or discontinuous. After the first stage, we know the degrees of all the non-zero private-key coefficients. Now we can start the second stage, which goes back to the calculation of the middle output coefficient. Suppose we now know that f_{j1} and f_{j2} are non-zero, and we would like to know whether f_{j1} and f_{j2} are equal. We can set the ciphertext c such that only c_{p-1-j1} and c_{p-1-j2} are non-zero, and they are identical. So now we expect two discontinuities and three patterns in this short trace. If the first pattern and the last pattern are the same, then f_{j1} is -f_{j2}; otherwise f_{j1} equals f_{j2}. At the end of the chosen-input SPA, whether on countermeasure 1 or countermeasure 2, we need to exploit the error-detection mechanism of NTRU Prime to choose between the two final hypotheses.

Finally, we have also experimented with our power analysis methods on the optimized product scanning. In this optimized version, we do not call modular reductions until the very end of each output coefficient calculation. Also, we use a SIMD instruction, smladx, to replace smlabb; this instruction completes two multiply-accumulates at a time. Now online template attacks fail, because they rely on the amount of exploitable leakage available. In contrast, chosen-input SPA and horizontal in-depth CPA remain effective, because they only target certain leaky power samples. We recommend first-order masking with both inputs masked as the countermeasure. If we do not mask the ciphertexts, then the multiplication is directly subject to horizontal correlation power analysis. If we do not mask the private keys, then the multiplication is potentially subject to SPA or other profiling attacks.

To summarize, we propose three single-trace power analysis methods against product scanning. We apply them to the reference, optimized, and protected implementations. Here, we use Streamlined NTRU Prime decapsulation as the concrete target, but overall, NTRU Prime decapsulation and encapsulation, as well as the key generation of NTRU LPRime, contain the operation of interest. Our methods may work for other ideal-lattice-based cryptosystems if their secret coefficients also come from a small set of possibilities. As for other optimized or advanced multiplications, our methods may apply to multi-level Karatsuba ending with product scanning. Thank you for listening.