That was off. Okay, now it's on. Sorry. Everybody woke up.

Welcome, everybody, to the first Asiacrypt session, on side-channel analysis and leakage resilience. The first talk of this session will be given by Amir Moradi; the co-author of this paper is Tobias Schneider. The title of this work is "Side-Channel Analysis Protection and Low-Latency in Action: Case Study of PRINCE and Midori".

Okay, thank you. Let's see why I'm here. To motivate the topic a bit, I need to remind you about low-latency encryption and decryption and why it is needed. It was actually initiated a couple of years ago by the industry sector, who said: we want low-latency ciphers, low-latency encryption and decryption at the same time, to support memory encryption. Just suppose that you have a system on chip, a complete system, with a microcontroller or microprocessor that wants to write to or read from the memory, but the data that is written into the memory should be encrypted. If you stay with, let's say, standard ciphers or known ciphers, for instance AES, and you have a pretty normal implementation of AES, which is a round-based architecture, it needs a minimum of 10 clock cycles. Then every time you want to read from the memory or write into the memory, you need to wait 10 clock cycles.
Of course, that's not optimal. You prefer to have a lower latency, meaning that the cipher implementation reacts very fast, in a way that it is somehow synchronized with the clock cycle of the microcontroller or microprocessor. Then in every clock cycle in which you can write to or read from the memory, you can at the same time encrypt and decrypt the data, so that you write the encrypted data into the memory and afterwards read the decrypted data from the memory.

But why is memory encryption actually required? Of course, in scenarios where the chip is in the hands of the attacker, because the attacker may be able to open the chip and, let's say, read or dump the memory content. To protect against such an attacker, you go for memory encryption. Otherwise there is, from my point of view, no reason to have memory encryption, right? And if we have such a scenario, where we think the device is in the hands of the attacker, should we not care about the side-channel leakage of the implementation? Or do we just say: we don't care, it doesn't matter that the device leaks information, as long as the memory is encrypted?

With these things in mind, we tried to see what the options are when you want a low-latency implementation of a cipher (the best-known, or the first, proposal was PRINCE) and you also want side-channel protection at the same time. For instance, we look at PRINCE. Here is the design of PRINCE.
It has five forward rounds and five backward rounds. I won't go into too much detail, but of course it has S-boxes, you have the MixColumns, ShiftRows and key addition, and also round constants, which are added to the state. The lowest latency can be achieved by a fully unrolled architecture, meaning that you implement this cipher, this design, as a fully combinatorial circuit: only gates, no register at all. Once you put the input here, either plaintext or ciphertext, and of course set the parameters, like the key and whether you want encryption or decryption, then after a particular time the ciphertext is ready. This time of course depends on how large the circuit is, how many gates you have; it is actually defined by the latency of this whole combinatorial circuit.

From this graph you can get an idea of how fast or how slow this implementation is. If you want a very fast implementation, the minimum for PRINCE in the technology that we considered is nine nanoseconds, and then you end up with roughly more than 17 thousand gate equivalents. Or if you want a smaller implementation, let's say around nine thousand gate equivalents, you end up at around 13 nanoseconds. Of course, this is for one particular library and technology; if you change the technology, these figures change.

So the idea was a single-cycle implementation, a fully unrolled architecture, that sits completely between the processor and the memory. Then you don't need to care about anything: you just give the plaintext, and the ciphertext is ready after, let's say, at most 13 nanoseconds or nine nanoseconds, and the other way around for decryption.

Good. Now, if you want side-channel protection, we have a couple of different schemes. Of course, you have heuristic and ad hoc schemes like noise addition, shuffling, temporal
randomization and so on. For instance, you can certainly add noise to such an implementation, to a fully unrolled architecture. It increases the noise, and then the attacker of course has a bit more difficulty performing the attack, depending on the level of noise that you add. Shuffling is most likely not possible here, because you don't have any register and you don't have any order of computations. This is just a fully combinatorial circuit, so you have no chance to change the temporal order of the operations. The same holds for temporal randomization.

On the other side, we can go for theoretically sound schemes like masking, as we call it in this area. Masking is actually just secret sharing and multi-party computation; it is the same concept, but since we are using a very simple form of secret sharing and multi-party computation, namely Boolean secret sharing, we usually call it masking or Boolean masking. And we have threshold implementation, in short TI, which is a correct way of implementing Boolean masking in hardware.

Now I will shortly review TI and the concept of TI, and then we come back to PRINCE and see what happens. The minimum number of shares that you need to implement the secret-sharing scheme depends on the algebraic degree of the S-box, of the non-linear functions that you have. For instance, in the case of PRINCE, or let's say for most of the ciphers that use four-bit bijections, the S-box is a cubic function, that is, algebraic degree three, and for first-order security this directly maps to a minimum of four shares. Or you can decompose this cubic function into two quadratic functions: this is one quadratic function and this is another quadratic function. Then instead of four shares you can implement it with three shares, and then you
need to put a register in between. If you don't put a register between these two stages, then of course you again have a big combinatorial circuit which works with three shares, and the algebraic degree of the function that it realizes is again three. With three shares you cannot implement this, and it definitely leaks. So you have to put the register between the quadratic parts if you go for the minimum number of shares. But putting a register between these decomposed functions, between the non-linear functions, already contradicts the single-cycle fashion that you require for fully unrolled architectures, that is, for low-latency implementations.

A bit more about this; we come back to this concept a bit later. Just imagine that this implements the S-box, where X is the input of the S-box and Z is the output of the S-box. I represent X with three shares in such a way that if I XOR all three I get X, and if I XOR all three of these I get Z. So this is a Boolean sharing of the input, and this is another Boolean sharing of the output. There are a couple of properties that a TI should fulfill for the implementation to be correct. We come back to this later.

But I have heard a couple of times, mainly from industry: let's have some heuristic, ad hoc architecture, meaning that instead of a fully protected implementation, you just protect the first round and the last round.
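
As an aside to the share counting and the three-share Boolean sharing described above, the following is a minimal Python sketch, not the speaker's implementation. It assumes the PRINCE S-box table as recalled from the published specification (check it before relying on it), computes the algebraic degree from the algebraic normal form, and shows a three-share sharing together with a commonly cited three-share AND in which each output share leaves out one input share. The function names (`algebraic_degree`, `share`, `unshare`, `shared_and`) are hypothetical.

```python
import secrets

# PRINCE S-box table as recalled from the published specification
# (an assumption of this sketch; verify against the specification).
PRINCE_SBOX = [0xB, 0xF, 0x3, 0x2, 0xA, 0xC, 0x9, 0x1,
               0x6, 0x7, 0x8, 0x0, 0xE, 0x5, 0xD, 0x4]


def algebraic_degree(sbox, n_bits=4):
    """Highest monomial degree in the algebraic normal form of any output bit."""
    degree = 0
    for mask in range(1 << n_bits):
        # ANF coefficient of the monomial `mask`: XOR of S(x) over all x
        # whose set bits are a subset of `mask`.
        coeff = 0
        for x in range(1 << n_bits):
            if (x & mask) == x:
                coeff ^= sbox[x]
        if coeff:
            degree = max(degree, bin(mask).count("1"))
    return degree


def share(x, n_shares=3, n_bits=4):
    """Split x into n_shares random values whose XOR equals x."""
    shares = [secrets.randbelow(1 << n_bits) for _ in range(n_shares - 1)]
    last = x
    for s in shares:
        last ^= s
    return shares + [last]


def unshare(shares):
    """Recombine a Boolean sharing by XORing all shares."""
    out = 0
    for s in shares:
        out ^= s
    return out


def shared_and(xs, ys):
    """Three-share bitwise AND; output share i never touches input share i."""
    x1, x2, x3 = xs
    y1, y2, y3 = ys
    z1 = (x2 & y2) ^ (x2 & y3) ^ (x3 & y2)  # uses shares 2 and 3 only
    z2 = (x3 & y3) ^ (x1 & y3) ^ (x3 & y1)  # uses shares 1 and 3 only
    z3 = (x1 & y1) ^ (x1 & y2) ^ (x2 & y1)  # uses shares 1 and 2 only
    return [z1, z2, z3]


if __name__ == "__main__":
    d = algebraic_degree(PRINCE_SBOX)
    print("algebraic degree:", d, "-> shares without decomposition:", d + 1)
    x, y = 0xA, 0x6
    print(unshare(shared_and(share(x), share(y))) == (x & y))
```

The AND here stands in for a single quadratic building block; a full S-box sharing would chain several such functions with registers between the stages. Correctness and the "leave one share out" property are only two of the TI requirements; uniformity of the output sharing is a separate condition that this particular AND sharing is generally reported not to satisfy on its own.
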
I mean, those are the things that the attacker can predict. Of course, that is if the plaintexts or ciphertexts are not controlled; if they are controlled, you can extend the attack to the second round. But just imagine this situation, where the first round and the last round of the cipher should be protected and you still want to stay with the unrolled architecture. It means that you share the input in four shares; as I said, this is a cubic function and you need a minimum of four shares. Then you have a shared representation of the cubic function with four shares. After the first round is done, you XOR the shares together again, and these rounds are processed in an unshared way. Before the last round, you again make a new sharing of the intermediate value, and the last round is again a shared implementation.

This is not a correct realization of TI. Of course, this will leak. I mean, we know this, this is clear. First of all, you don't have registers here. And after that, this part is already leaking data which depends on the output of the first round, and these values are already predictable. Moreover, this is a fully unrolled architecture, meaning that you don't control how these parts are implemented. Of course, the parts which are gray should be controlled in a way that the shares are not mixed together. So this is not a correct implementation. But because I have heard this a couple of times, let's take a crazy idea like this, which is not the correct way, and see what you achieve if you implement it and then measure the power traces.
This is an FPGA-based evaluation. On the left side is the unprotected unrolled implementation and on the right side is the TI-protected one, which is of course still only heuristically protected and unrolled. The power consumption is of course increased, roughly three times, because the circuit is much larger. The evaluations that we have done are based first on the t-test. If you aren't familiar with the t-test: it gives you a kind of leakage assessment, whether the design or the implementation has leakage depending on the intermediate values. We performed a random-versus-fixed t-test, with one million traces for the unprotected version and 100 million traces for the protected unrolled version. It is clear that this one should leak, since it is unprotected, and the other one still has leakage, which, as I said, is expected, because only the first round and the last round are protected, and the XOR that makes the value unshared again leaks.

You can also look at the SNR, the signal-to-noise ratio, computed based on the plaintext nibbles, because the S-box is four-bit. Because the S-box is a bijection and the key addition is linear, you can compute the SNR based on the plaintext nibble, which directly maps to what you would get if you calculated the SNR based on, let's say, the S-box input or the S-box output. In this case the SNR decreases: it is pretty small, about 10 to the minus 3 here and 10 to the minus 5 here. The SNR is pretty small, meaning that the leakage is really reduced. But you cannot say that this implementation is provably secure. Of course, this is ad hoc, but it really reduces the leakage.

Then the question is: what are the other solutions if you want a provably secure design? Then you have to stay with the round-based architecture.
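
To make the fixed-versus-random t-test mentioned above concrete, here is a small Python sketch on simulated data. It is not the measurement setup from the talk: the one-sample Hamming-weight leakage model, the noise level, the two-share masking and all function names are assumptions chosen only for illustration. The threshold of 4.5 on the absolute t value is the convention commonly used in this kind of leakage assessment.

```python
import math
import random


def hw(v):
    """Hamming weight of a small non-negative integer."""
    return bin(v).count("1")


def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((v - ma) ** 2 for v in a) / (na - 1)
    vb = sum((v - mb) ** 2 for v in b) / (nb - 1)
    return (ma - mb) / math.sqrt(va / na + vb / nb)


def leakage(value, rng, sigma, masked):
    """Hypothetical one-sample trace for a 4-bit intermediate value.

    Unmasked: Hamming weight of the value plus Gaussian noise.
    Masked:   summed Hamming weights of two Boolean shares plus noise.
    """
    if masked:
        m = rng.randrange(16)
        signal = hw(m) + hw(value ^ m)
    else:
        signal = hw(value)
    return signal + rng.gauss(0.0, sigma)


def fixed_vs_random_t(n_per_class, fixed_value=0xF, sigma=1.0,
                      masked=False, seed=2024):
    """t statistic between traces for a fixed value and for random values."""
    rng = random.Random(seed)
    fixed = [leakage(fixed_value, rng, sigma, masked)
             for _ in range(n_per_class)]
    rand = [leakage(rng.randrange(16), rng, sigma, masked)
            for _ in range(n_per_class)]
    return welch_t(fixed, rand)


if __name__ == "__main__":
    print("unmasked |t|:", abs(fixed_vs_random_t(5000)))
    print("masked   |t|:", abs(fixed_vs_random_t(5000, masked=True)))
```

In this toy model the unmasked class means differ, so the statistic is far above the threshold, while the masked class means coincide and only higher-order statistics (for example the variance) still depend on the value. A real evaluation would compute the statistic per sample point over full traces and interleave fixed and random inputs randomly during acquisition.
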
There is no way around it: you have to put in the registers. The upper figure is the original design of the round-based architecture. You have the register here, and for PRINCE, for encryption and decryption, it has two paths, S-box and S-box inverse, because it has forward rounds and backward rounds. If you want to make the TI or Boolean-masked version of this, you have to implement S and S inverse, which are pretty large circuits because they are the non-linear parts. We observed that the S-box and the S-box inverse are actually affine equivalent in PRINCE. So you can have just one instance of them, no matter whether S-box or S-box inverse (we checked, and the S-box inverse was a bit more optimized than the S-box), together with two other affine functions. If you go this way you implement the S-box, and if you go this way you implement the S-box inverse. Then you have one S-box instance instead of two, to reduce the complexity or the size of the circuit when you implement it with TI. This design is not necessarily smaller than that one in an unprotected scenario, but in the protected scenario, where you want to implement the masked version of the S-box, the lower one should be smaller than the upper one.

The first problem is how to correctly implement a TI of the S-box with three shares. Unfortunately, it has to be decomposed into three stages, because one property of TI is uniformity. I come back to the slide that we discussed before. Oops, it's reacting pretty slowly. Look at this implementation again. If you see this as a function which receives three inputs and gives three outputs, so as a function from three times the input size to three times the output size, it should also be a bijection. Otherwise the output that you have here is of course a correct sharing of the output of this part, or also the correct
sharing of the S-box output, but it is not a uniform sharing. And when it is not a uniform sharing, then in the next rounds, when this goes as input to other non-linear functions, it means that you have shared the input with masks, with random data, that are biased; they are not uniformly distributed. So in the TI scenario you have to achieve uniformity of the implementation as well.

Now, to achieve uniformity for the PRINCE S-box, you have to decompose it into three stages, which is actually known; this is not our finding. There are different ways. These names correspond to definitions that have been published before, namely quadratic function classes from a classification by researchers at KU Leuven. When you decompose it in this way, there are affine transformations between the quadratic functions, and you implement it in three stages. That means you have to put one register after the first quadratic function, one register after the second quadratic function, and of course after the third quadratic function one register here as the state register. So you have three stages and three registers, which directly affects the latency. If you implement this, say the number of clock cycles is ten, then you have to wait ten times three cycles until the output is available, right? If you had another S-box which could be decomposed into two stages, then of course it would reduce the latency for you. There are other S-box families offered, suggested, for PRINCE, but all of them unfortunately need to be shared in three stages, so it makes no difference if you change the S-box to one of those suggested for PRINCE.

If you now measure this, again an FPGA implementation, the power consumption is reduced, but of course you have more rounds: instead of one
big peak that we have seen here, there are clock cycles. This design was running at a pretty low frequency, three megahertz I believe. The t-test shows that with a hundred million traces you don't have first-order leakage, which was expected; I mean, that was our goal, of course, because this is a first-order secure implementation. Second-order and third-order leakage are detectable, for sure. But to mitigate this leakage you can easily add noise; the noise addition plus the first-order secure implementation will make it practically hard to detect or to exploit.

Now, the objection was: yes, we have seen this, but we don't want a round-based implementation, because we would have to have a fast clock in the design. Industry was saying: no, if you need a fast clock to achieve that low latency, then we need to add another clock source, which, let's say, works at a hundred megahertz; it consumes energy, and even when the encryption function isn't working it still consumes energy, and this is not possible, we don't want this. Or sometimes you need to go for, let's say, a clock frequency of 400 megahertz to achieve the lowest latency for the round-based architecture, but 400 megahertz is not possible on all platforms.

So the idea was: let's use asynchronous logic. Probably you have not heard about asynchronous logic.
I don't know; if you are an electrical engineer, you have probably heard that asynchronous logic is pretty different from synchronous logic. Every gate has a kind of acknowledgement and request: when a gate has finished its evaluation, it sends the data to the next gate, and the next gate evaluates, so there is a kind of evaluation wave which cascades through the whole circuit. The gates then do not have any glitches, so in theory you don't need to put registers in your design, since all the gates are completely glitch-free. You can implement the whole design in a loop, and when the computation is done, the design gives you a signal that the computation is done. So the idea was to use asynchronous logic to avoid the fast clock in the circuit and then see what we achieve. If you are familiar with side-channel logic styles, it is very close, very similar, to the concept of WDDL, wave dynamic differential logic, proposed in 2004 as a side-channel countermeasure.
Such logic styles were actually reducing the side-channel leakage. Now, this is an asynchronous round-based TI, but we again have to put in the registers. You see the registers here again. The problem is that if you remove the registers between the stages, this will again be a fully combinatorial circuit; these two parts work on three shares, but the algebraic degree of the whole is again more than two, and then you definitely have leakage. So you have to put in the registers. But the point is that you can have another part of the circuit which triggers these registers, completely independent of any clock: when the computation here is done, it tells the register to save, and when the computation of the next part is done, it again says, okay, save, and so on. This circuit then needs only one signal to say "start", and when it is done, the control logic gives you a "done" signal, without any external clock.

Implementing this on an FPGA was pretty hard, because asynchronous circuits are not made to be implemented on an FPGA, or rather, FPGAs are not designed or fabricated to realize asynchronous logic. You see the small clocks here; the clocks are actually generated internally by the system itself. These are the activations of the registers. But unfortunately you again see first-order leakage. Of course, it is very small, with a hundred million traces. But if you want to stay with provable security, this is not what you wanted. If you look at this and say the leakage is small, you could have stayed with the first design, as I said, where just the first round and the last round are protected, and your result would be similar to this.

The reason for that actually comes from the asynchronous concept. If you look more carefully here, these registers are triggered by one function here.
The function tells when the computation of all three of these is done and says, okay, save. Or even: the first one is done, the second one is done, the third one is done, then save into the register. The problem is that this is already, sorry, a non-linear function over these shares. It means that when you say "when all of them are done, then save", you are already leaking information about these three shares. That is the source of the leakage.

And the good thing, I mean, if you come back to the TI concept: TI was initially proposed to be secure in the presence of glitches. But it actually also has a synchronous design, which avoids this timing dependence. It means that, unintentionally, the design of TI was synchronizing all these stages, so that the second stage starts completely independently of the first stage, and it avoids the kind of leakage that we see with asynchronous logic. So one conclusion here is that TI on an asynchronous circuit is actually not a correct construction; I mean, you cannot easily combine TI and asynchronous circuits.

Then, just very shortly, about Midori. What happens with Midori is again a round-based architecture. As you probably know, Midori64 is broken, but Midori has been designed to have the shortest latency in a round-based architecture, which actually fits very well here. And the S-boxes of Midori, compared to PRINCE, can be decomposed into two stages to have a correct TI. So in theory the latency of Midori, if implemented in this fashion, is better, meaning that the latency is shorter. But now the question was: how do we deal with the fast clock?
You can generate the fast clock; this is the engineering part, probably completely out of the scope of crypto. You can generate the clock internally in the system and have just some control signals that say when this ring oscillator should start and when it should stop working. Then, without any external clock which is always running and consuming energy, you can implement and run the system.

As the last slide, since I have already run out of time: this is, for instance, the synchronous round-based TI with a fast internal clock. Again you see here the small clocks, which are generated by the internal oscillator, and this large peak is because a lot of clock cycles and a lot of rounds are running right after each other. The problem that you have seen with the asynchronous circuit is gone. We have first-order security, and the second-order leakage, as I said, is expected. I think it is just a matter of chance that this time the third-order leakage has not been seen.

That was not the last slide, sorry. Here is a very short comparison. The circuit with the asynchronous design is much larger compared to the round-based TI. For instance, compare these two; the upper number is for the smallest design and this number is for the shortest latency. These are much larger, let's say around three times larger, and the gain is actually very low. The circuit is larger compared to this one because it has two phases, evaluation and precharge. But if you look at Midori, again, the latency is shorter and the design is also smaller, because of its S-box. Of course, as I said, Midori is broken, so it probably makes sense to design something similar to Midori, but without the known weakness that Midori has.

Okay, thank you so much, and sorry for taking a lot of time.

Okay, thanks.
Thank you for a very interesting presentation. So, if you have a fast internal clock, which was one of the solutions you were proposing near the end, does it not really defeat the purpose of low-latency crypto to begin with? You know what I mean? Because you do not necessarily need to go towards symmetric crypto designs that can be very fast inside one clock cycle, since you're using more clock cycles anyway. Did you understand my question?

I probably did not understand your question correctly.

So, a design like PRINCE is specifically optimized to be implemented very efficiently, I mean with a very low latency, assuming that you need to be able to execute the entire block cipher call inside one clock cycle, fully unrolled. And if this restriction is not really there anymore, because we have a fast internal clock anyway, then maybe we're not really going towards designs like PRINCE, and other designs could be very suitable as well.

I wouldn't say that. For instance, if you go for a fast clock, Midori is also better than PRINCE in this case. Even if you completely remove all the registers and go fully unrolled again, Midori should still be good compared to PRINCE; Midori is not far away from PRINCE in this case. But I cannot say that other designs would probably work much better. The problem is that in a fully unrolled design, many gates are combined into the cells that are available in the library, and then just by trial and check you can see how big the design is or what its latency is. I would say it is not contradicting the main goal of having a low-latency cipher. If you go for a round-based architecture, of course, what is important is to have a very low number of rounds, which you can also see in PRINCE. I mean, PRINCE has a pretty low number of rounds compared to other ciphers which have
four-bit S-boxes.

It doesn't matter if you have a higher clock frequency.

No, it matters. No, no, it matters. Sorry, wait a second. Here, the latency is the number of clock cycles multiplied by the critical path delay, right? That means this is 31 multiplied by four; this will be the latency, not the frequency itself. The latency is the time between when you give the plaintext and when the ciphertext is ready, which is the number of clock cycles that you need multiplied by the critical path delay. And look here: this is 40 by four and this is 31 by four. Because you have a smaller number of rounds, or let's say because of the stages of the S-box, the latency in total is shorter.

Just one more short question. Your results are given for ASIC, but your evaluation is done on FPGA.

Yes, correct. That is just the evaluation, the side-channel evaluation, on FPGAs, not how big or how small or how fast the implementation is on the FPGA. That is just because actually making the chip for this design would definitely have taken me one more year. Yes, this is clear, but the security analysis, I would say, we can trust on the FPGA design, because we were actually misusing the FPGA elements. In particular, for the case of the asynchronous circuit, we were using every LUT as a single gate to realize everything correctly. So I would say the leakage we saw should not be much different from the ASIC one.

You mean doing a simulation? Okay, no. You mean if you use a simulation instead of practice on the FPGA? The problem here, if you do a simulation, is the time. If you want a good simulation, you need to have the simulation in the transistor domain, and then taking every trace, depending on what you have, may take
around, let's say, in the fastest way, a couple of seconds, and then having one million traces or a hundred million traces would not be feasible.

Okay. Let's thank the speaker again. Thank you.
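
The latency relation stated in the discussion (number of clock cycles multiplied by the critical-path delay) can be written out as a short Python sketch. The cycle counts 40 and 31 and the delay of 4 are the figures quoted in the answer; the time unit (nanoseconds) and the function name `latency` are assumptions of this sketch, and the transcript does not say which design each count belongs to.

```python
def latency(clock_cycles, critical_path_delay):
    """Total latency = number of clock cycles x critical-path delay."""
    return clock_cycles * critical_path_delay


# Figures quoted in the discussion; the unit (ns) is assumed here.
slower = latency(40, 4)
faster = latency(31, 4)
print(slower, faster, slower - faster)
```

With the same critical-path delay, the difference comes only from the cycle count, which in turn depends on the number of rounds and the number of register stages per S-box.
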