Hello and welcome to the talk on our paper Single-Trace Attacks on Keccak. This is joint work by Matthias Kannwischer of Radboud University Nijmegen, Robert Primas of Graz University of Technology, and myself, Peter Pessl. At the time of writing I was also with Graz University of Technology. Now, with this title, Single-Trace Attacks on Keccak, you might wonder: Keccak, as in SHA-3 or SHAKE, is a hash function. Do we even care about the side-channel security of hash functions? Well, there are certain applications and uses of hashes which do involve some keys; you might, for instance, think of an HMAC. But HMAC is a classic DPA setting anyway: there we have a fixed secret key that is combined with lots of known and varying input data, and we know that side-channel attacks work quite well in this setting, so the threat is already quite obvious. Now, Keccak has found a lot of different uses in the last couple of years which involve some sort of secret, and many of these uses are in the context of post-quantum crypto. In PQC it is frequently used to derive the final shared secret in a key encapsulation mechanism. Keccak, as in SHAKE, is also frequently used to expand a short secret seed into a longer, random, and still secret bitstream; it's used this way in both KEMs and signatures. And of course you can also use Keccak in hash-based signatures. Now, what's interesting about these applications is not only that they involve some secret, but also that the secret is an ephemeral one: it's used once and then tossed out. This also means that a side-channel attacker is limited to observing a single execution and has to try to recover the secret using just this single measurement. Or, more correctly speaking: in some of these settings the side-channel attacker can get more traces of a single key, but since all the other data is then constant, he can do at most averaging, but still no DPA.
So in this setting one has to wonder: well, are attacks even possible here anymore? Are countermeasures still needed, or can we assume that an attacker doesn't have any chance of recovering the secret from this limited information anyway? In this paper we answer these questions and show that yes, attacks are still very much possible and quite realistic, and thus yes, countermeasures are still needed. We show all that by presenting a practical single-trace attack on Keccak software implementations. Our attack uses the framework of soft analytical side-channel attacks, or SASCA, which involves two steps. The first step is template matching, which means we retrieve leakage probabilities of all intermediates throughout the execution of the cryptographic primitive. The second step is belief propagation, which is an efficient way to combine the probabilities of all these intermediates and allows us to infer the most likely key used. Now, soft analytical side-channel attacks are not new, but thus far they were mainly used in the context of the AES, which is structurally quite different from Keccak, and they were also used with 8-bit leakage, whereas we extend this to 16- or 32-bit leakage. Our attack can recover the key in a large array of settings, and thus we show that countermeasures cannot be omitted. We also show several factors that influence the success rate of the attack. Some are obvious, like the key size and the word width of the device; some are less obvious, like the structure of the input. I'll come back to that later on. Now, before I can explain our attack, I have to briefly explain the inner workings of Keccak. Keccak uses a cryptographic sponge with a 1,600-bit state, meaning that this state is split into two parts, the first part of size r, the second part of size c. It then absorbs the input by XORing a block of the message onto the first part and then calling an f permutation.
After the whole message is absorbed, the hash is squeezed out; and if you need a longer hash, then again this is interleaved with calls to the f function. As its f function, Keccak uses the Keccak-f permutation, which interprets the 1,600-bit state as a 3D array, a cuboid of size 5 by 5 by 64, and then applies five transformations to this 1,600-bit state. The first step is theta, which computes the parity of each column and XORs it onto neighboring columns. The second step is rho, which rotates each lane, so these long sections along the z-axis of size 64, by some given offset. The third operation is pi, which reorders the lanes. Then we have chi, the non-linear operation, an S-box built from simple logic gates. And finally we have iota, which is the addition of a round constant. Now, what does it actually mean to attack Keccak? Well, we assume that we work with an unprotected software implementation on a microcontroller: unprotected, since we want to find out if an unprotected implementation can still be attacked with a single trace. And of course, to attack something, we need some part of the input to be secret. Unlike the AES or any other block cipher, where we have a dedicated input for the key and a dedicated input for the plaintext or ciphertext, Keccak, or any hash, just has one input. So we require that some part of this input is secret, and also that this part is used only once. And then, of course, we have a side-channel attacker, who, for instance, makes a power measurement of the single execution. Since there's only a single execution, he can't really do a differential side-channel attack like in a classic DPA, so he has to resort to some sort of template attack. Here, I'd like to go on a bit of a tangent.
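To make the five steps concrete, here is a small Python sketch of one Keccak-f[1600] round on a 5-by-5 array of 64-bit lanes. The step formulas and rotation offsets follow the public Keccak specification; this is an illustrative sketch, not the attack code from the paper.

```python
MASK = (1 << 64) - 1  # lanes are 64-bit words

def rotl(v, n):
    """Rotate a 64-bit lane left by n positions."""
    n %= 64
    return ((v << n) | (v >> (64 - n))) & MASK

# rho rotation offsets, indexed [x][y] as in the Keccak reference
RHO = [
    [0, 36, 3, 41, 18],
    [1, 44, 10, 45, 2],
    [62, 6, 43, 15, 61],
    [28, 55, 25, 21, 56],
    [27, 20, 39, 8, 14],
]

def keccak_round(A, rc):
    """One round of Keccak-f on state A[x][y]; rc is the round constant."""
    # theta: XOR each lane with the parities of the two adjacent columns
    C = [A[x][0] ^ A[x][1] ^ A[x][2] ^ A[x][3] ^ A[x][4] for x in range(5)]
    D = [C[(x - 1) % 5] ^ rotl(C[(x + 1) % 5], 1) for x in range(5)]
    A = [[A[x][y] ^ D[x] for y in range(5)] for x in range(5)]
    # rho and pi: rotate each lane, then reorder the lanes
    B = [[0] * 5 for _ in range(5)]
    for x in range(5):
        for y in range(5):
            B[y][(2 * x + 3 * y) % 5] = rotl(A[x][y], RHO[x][y])
    # chi: the non-linear step, a row-wise S-box from simple logic gates
    A = [[B[x][y] ^ ((B[(x + 1) % 5][y] ^ MASK) & B[(x + 2) % 5][y])
          for y in range(5)] for x in range(5)]
    # iota: add the round constant to lane (0, 0)
    A[0][0] ^= rc
    return A
```

On the all-zero state, theta, rho, pi, and chi leave everything at zero, so only iota changes the state, which is a handy sanity check.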
Template attacks are typically considered to be quite strong, but also quite restrictive, because to build the templates you usually need some profiling device which you have already broken, so you know its key, or a device which is fully open, where you can set the key, and which is very similar to the attacked device. And then you also have the problem of portability of templates between devices: when you build your profile on one device and want to attack a different one, it's quite tricky. But the situation is a little bit different for Keccak, because in, for instance, many public-key schemes we often have multiple calls to the hash function, and some of these calls work with entirely known data. For instance, the first thing you do in a signature operation is hash the message, which is known. Also, many post-quantum key encapsulation schemes require some re-encryption step, and if the attacker is the one who generated the ciphertext, then, well, he knows all the input to the re-encryption, as well as to the encryption. What I want to say with all this is that in many scenarios it is actually possible to build the profiles directly on the target device, by exploiting these calls to Keccak that use entirely known data. So you don't need a separate profiling device, and you don't have any portability problems. Now, how does the attack actually work? The first step in the attack is template matching. Typically, software implementations of Keccak work along the lanes of the state, so along the 64-bit parts, and the lanes are split up into bytes or 16-bit or 32-bit words. For templating, we target all loads and stores from and to SRAM, and the template attack assigns a probability vector to each processor word that is loaded or stored. Now we have all these distributions for each processor word, and we need to combine all this information to find the one most likely key.
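As a toy illustration of this matching step (my own minimal sketch, not the templates used in the paper), one noisy Hamming-weight observation of an 8-bit word can be turned into a probability vector over all 256 candidate values, assuming Gaussian noise and a flat prior:

```python
import math

def hw(v):
    """Hamming weight of an integer."""
    return bin(v).count("1")

def template_match(leak, sigma=0.5):
    """Posterior over the 256 values of an 8-bit word, given one noisy
    Hamming-weight observation with Gaussian noise of std. dev. sigma."""
    probs = [math.exp(-(leak - hw(v)) ** 2 / (2 * sigma ** 2))
             for v in range(256)]
    total = sum(probs)
    return [p / total for p in probs]

# An observation near 0 makes the single value with Hamming weight 0
# (0x00) the most likely candidate:
p = template_match(0.1, sigma=0.5)
```

For 16- or 32-bit words the same idea applies, just with correspondingly longer probability vectors.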
One very efficient method that can achieve this are soft analytical side-channel attacks, proposed by Veyrat-Charvillon et al. at ASIACRYPT 2014. Such a soft analytical side-channel attack works as follows. First, we need to build a factor graph, which is a graphical model of the implementation of the cryptographic primitive. This graph consists of two sets of nodes. The first are variable nodes, which model the intermediates that occur during the execution of the algorithm. The second set are the factor nodes, which describe how the variables are connected to each other, how they interact with each other. For instance, in the factor graph on the right we have three variable nodes, X, Y, and C, and they are connected by an XOR factor, so that X XOR Y equals C. In the next step, we incorporate the leakage information that we got from the template attack into the graph; so assume that we have leakage information on X, Y, and C. And then we run belief propagation on this graph. Belief propagation is quite a versatile algorithm with lots and lots of uses, and its generic goal is to find the marginals of the variables inside this graphical model. It does so by using the message-passing principle. For instance, it takes the information on X and Y, combines it according to the rules of the XOR, and sends this information over to C. In its simplest form, this combination is done by simply enumerating all possible combinations of the inputs X and Y. We can then do the same in a different direction, and here it's important to avoid circular reasoning: the information sent in the direction of X must not depend on the previous information that X sent to the other nodes, which means that for C, we take the original distribution, not the updated one. And finally, we can do the same in the Y direction. We then have updated probabilities, which we can send out to further nodes. Now, how might this work for Keccak?
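One such message-passing step can be sketched as follows (a toy example with the variable names X, Y, C from above; not our implementation). The message an XOR factor sends towards C is computed by enumerating all combinations of the incoming beliefs on X and Y:

```python
def xor_message(p_x, p_y):
    """Message the XOR factor sends towards C, given incoming beliefs
    on X and Y: mu[c] = sum of p_x[x] * p_y[y] over all x ^ y == c."""
    n = len(p_x)
    mu = [0.0] * n
    for x in range(n):
        for y in range(n):
            mu[x ^ y] += p_x[x] * p_y[y]
    return mu

# If X is certainly 1 and Y is uniform over {0, 1}, then the message
# towards C is uniform over {0, 1}:
p_c = xor_message([0.0, 1.0, 0.0, 0.0], [0.5, 0.5, 0.0, 0.0])
```

The message towards X is computed the same way from the beliefs on Y and C, using the original distribution of C rather than the just-updated one, which is exactly the circular-reasoning rule mentioned above.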
Well, Keccak uses bitwise operations, so it's quite natural to use a bitwise factor graph: for each bit, after each of the five steps of Keccak, we instantiate a variable node. As it turns out, however, this doesn't really work all that well, and that is because we get leakage not on single bits, but on full processor words, say 16-bit words. If we now split this leakage information up into bits, then we lose a lot of the joint information that is in the leakage. We solve this problem by introducing clustering: we cluster multiple bits into a single variable node, and since we have leakage along the lanes, it's also natural to cluster the bits along the lanes. Ideally, we set the clusters such that no side-channel information is spread over multiple clusters; ideally, we would like to have the clusters as large as possible. But there we have runtime and memory considerations, because these are exponential in the cluster size, which means we support 8- and 16-bit clusters. With this clustering, we also run into the problem of misalignment of clusters. Previous works applied SASCA to the AES, and the AES is completely byte-oriented, so you never have any alignment problems. In Keccak, however, the operations are not necessarily aligned if you have clusters smaller than 64 bits. For instance, consider an operation where a word A is XORed with a rotated version of a word B. There, we have to XOR one cluster of A with parts of two clusters of B. What we have to do here is split the clusters of B: we extract the marginals of the upper part of one cluster and the lower part of the other cluster and combine them. We lose some information on the joint distribution there, but we have to live with that. There are a couple of other considerations for soft analytical side-channel attacks on Keccak.
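A sketch of what such a cluster split could look like (my own illustrative helpers, assuming cluster beliefs are stored as plain probability lists): we marginalize out the unwanted bits of each cluster and rebuild a distribution for the recombined word as a product of the two marginals, which is precisely where the joint information is lost.

```python
def marginal_high(p, bits, k):
    """Marginal of the top k bits of a `bits`-bit cluster distribution."""
    out = [0.0] * (1 << k)
    for v, pv in enumerate(p):
        out[v >> (bits - k)] += pv
    return out

def marginal_low(p, bits, k):
    """Marginal of the bottom k bits of a `bits`-bit cluster distribution."""
    out = [0.0] * (1 << k)
    for v, pv in enumerate(p):
        out[v & ((1 << k) - 1)] += pv
    return out

def recombine(p_hi, p_lo, k_lo):
    """Distribution of (hi << k_lo) | lo, treating the two parts as
    independent; any correlation between them is discarded."""
    out = [0.0] * (len(p_hi) * len(p_lo))
    for hi, ph in enumerate(p_hi):
        for lo, pl in enumerate(p_lo):
            out[(hi << k_lo) | lo] += ph * pl
    return out
```

For a point distribution the round trip is lossless; for a correlated distribution the recombined word only keeps the two per-part marginals.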
For instance, the very first computation that Keccak does is the computation of the column parity: you take five words and XOR them together. You could model this with a series of two-input XORs, but that doesn't really work all that well with belief propagation; propagation is slow. So we instead instantiate a five-input XOR node. Here we have the problem that, as I said earlier, the simplest way of combining all this information is to enumerate all the inputs. With 8-bit clusters and five inputs, that's 2^40 possible input values, which is not really practical to enumerate. Here, however, there is a solution: a fast convolution of the distributions using a Walsh-Hadamard transform. There are a couple of other things we have to deal with to make the attack efficient. For instance, for chi, we have to break clusters up into single bits again to deal with invertibility problems. And for 32-bit leakage, we have to combine multiple 8-bit clusters into 32-bit words and combine them with the 32-bit leakage information; we also found an efficient method for this, which again uses convolution instead of enumeration. Okay. We implemented all this in Python, and we put the source code on GitHub. In all our attacks, we restrict ourselves to the first two rounds of Keccak-f. This is just to keep the runtime a bit lower, and we found that leakage from the later rounds of Keccak doesn't really propagate back to the input anymore, where we have our secret. Belief propagation is an iterative algorithm, and each iteration updates each distribution once. If we have 8-bit clusters, then one such iteration takes a couple of seconds on a single core. With 16-bit clusters, the runtime of one BP iteration rises to a minute, and it also uses all 44 cores on our cluster, but it can still be considered practical.
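The fast convolution can be sketched as follows (an illustrative version, not the optimized attack code): distributions of independent variables combine under XOR by pointwise multiplication in the Walsh-Hadamard domain, which brings the five-input XOR down from enumerating 2^40 combinations to a few transforms of length 256.

```python
def wht(a):
    """Walsh-Hadamard transform of a list whose length is a power of two."""
    a = list(a)
    h = 1
    while h < len(a):
        for i in range(0, len(a), 2 * h):
            for j in range(i, i + h):
                x, y = a[j], a[j + h]
                a[j], a[j + h] = x + y, x - y
        h *= 2
    return a

def xor_convolve(dists):
    """Distribution of the XOR of independent variables: transform each
    input, multiply pointwise, transform back (the WHT is its own
    inverse up to a factor of n)."""
    n = len(dists[0])
    spec = [1.0] * n
    for d in dists:
        for i, v in enumerate(wht(d)):
            spec[i] *= v
    return [v / n for v in wht(spec)]
```

For k inputs of size n this costs O(k n log n) instead of the O(n^k) of naive enumeration, which is what makes the five-input theta node tractable.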
Thankfully, though, 8-bit clusters are sufficient in most of the cases anyway. And since BP is an iterative algorithm, we keep repeating these iterations until we reach a point of convergence; if the attack succeeds, this typically happens in less than 10 iterations. Okay, now we evaluate this attack. Just to recall, we want to recover some secret input of Keccak-f, of the permutation. As an evaluation tool we use leakage simulations, meaning that we generate noisy Hamming-weight leakage of loads and stores at the typical locations where software implementations load and store from SRAM. We evaluate 8-, 16-, and 32-bit implementations, and we sweep across the noise parameter sigma; for each choice of sigma, we compute a success rate. We also analyze the impact of the key size, so we evaluate 128- and 256-bit keys. Apart from these more obvious factors that influence the success rate, there's another factor that is less obvious, and that is the content of the public input. Remember, in the Keccak-f input, some part is secret and some part is known, and as it turns out, the content of the public part quite drastically impacts the success rate of our attack. Here, we have a look at the two most extreme scenarios. The first we call all-zero public input. This can happen, for instance, if we just hash a secret seed: the seed is in the first bits of M0, which is XORed onto the all-zero state, and the rest of the state is still zero before it goes into Keccak-f. (There are a couple of padding bits, but we can neglect those.) The second scenario is called random public input. This happens, for instance, if M0 consists of a message, and the entire state is then sent through the f permutation; we then have a random state, and we XOR M1, which might contain a key, onto it. So the first bits are our secret, and all the other bits are something pseudo-random but known.
As it turns out, attacks in the second scenario, the random public input, work a lot better. Why is that? One potential reason can be found in theta, where we have the so-called theta effect T, which is XORed onto five bits, or five words, of the state. The observation here is that knowing T, the theta effect, allows key recovery in our setting. And as it turns out, in our setting, four of the inputs of such an XOR with the theta effect T are known, these being public input, and one of the inputs is secret. With the all-zero input, we add this theta effect T onto four times zero; that's the same operation four times, and the best we can do is averaging, so we don't get a lot of information from this XOR. With the random public input, we add this T to four different values, so this is kind of like a DPA which uses four traces, and that gives us a lot more information than just averaging. This is one reason why the random public input scenario works a lot better. Now, we evaluated our attack for multiple concrete scenarios, and this is the result for simulated leakage on an 8-bit device. As it turns out, yes, attacks which want to recover a 128-bit key work a lot better, so up to higher noise levels, than those targeting a 256-bit key. We can also see that attacks in the random public input scenario work better than in the all-zero public input scenario. And just to give a bit of context, we measured the real noise level on an 8-bit device and found it to be roughly 0.5, which is well within the noise levels that our attack can handle, with quite a lot of margin. So the attack works perfectly on an 8-bit device in all scenarios. For 16-bit Hamming-weight leakage, we didn't have a reference device, so we don't have a real sigma, but we can see that the attack still works, especially with the random public input. With the all-zero public input it only works with 128-bit secrets, not with 256-bit secrets.
And the success rate never quite reaches one there, actually. For 32-bit Hamming-weight leakage, the real noise parameter varies quite a lot, so there we would need to look into more concrete applications. In the 128-bit scenario with random public input, the attack also works; with the all-zero input we didn't get it to work, and also not with 256-bit keys. So, what does this all tell us? It tells us that single-trace attacks are still a considerable threat, especially for 8- and 16-bit implementations. And while the situation might be less clear for 32-bit devices, here we can say that, yeah, we used a simple leakage model, namely Hamming-weight simulations and univariate Hamming-weight templates, and it's likely that more sophisticated attacks will fare better; think of attackers using localized EM measurements. And it is not all that unrealistic that such an attacker exists because, remember, in many cases the profiling can be done directly on the target device. So this tells us that we must always include some basic countermeasures; we can't rely on the fact that the attacker only gets a single trace. Countermeasures might include hiding, like shuffling and dummy operations; these tend to be very effective at mitigating these kinds of analytical attacks at relatively low cost. Masking is, of course, also an option, but it comes with some restrictions. On smaller devices, masking alone might not be enough, because you might still be able to get enough information if you combine the two shares using the factor graph; and on larger devices, masking might be a bit of an overkill. So the first step to mitigate this attack is probably hiding. That's it from my side. Thank you for watching.