Hello, I want to present Low-Latency Keccak at any Arbitrary Order, which is joint work with Aein Rezaei Shahmirzadi, Hadi Soleimany, Raziye Salarifard and Amir Moradi. Before getting into details, I would like to give an overview of what low-latency Keccak at any arbitrary order is. It is a realization of Keccak that is protected against side-channel analysis and has three main qualities. It is low latency, meaning that it has only one register stage per round. It is compact, meaning that it attains low area consumption compared to the state of the art. And it is readily extendable to higher security orders by means of an algorithm that we provide for it.

First, we will have an introduction including a brief review of masking schemes and masked implementations of Keccak. Then we turn to the other low-latency implementation of Keccak, which is the first-order unrolled threshold implementation. And finally, we go through the generic single-register-per-round realization, followed by its evaluation results.

Masking schemes are provably secure countermeasures against power analysis. They are based on multi-party computation and secret sharing. Two of the most famous hardware masking schemes are threshold implementation and domain-oriented masking. The threshold implementation scheme was first introduced for first-order security and was then extended to higher orders. However, it was later shown that the proposed extension only provides univariate security and fails to provide multivariate security due to the lack of fresh randomness. The key property of a d-th order threshold implementation is d-th order non-completeness, which necessitates td + 1 as the minimum number of input shares, where t is the algebraic degree of the shared function and d is the security order. Here we have an example, which is the first-order threshold implementation of an AND gate.
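As a rough illustration of the AND gate example, the following Python sketch models a direct three-share first-order sharing of c = a AND b at the bit level. The function names (ti_and, sharings, output_distribution, is_correct) and the assignment of product terms to output shares are assumptions made for this sketch, not taken from the slides; it only models the logic, not glitches or hardware timing.

```python
from collections import Counter
from itertools import product


def ti_and(a, b):
    """Direct 3-share first-order sharing of c = a AND b (all shares are bits)."""
    a1, a2, a3 = a
    b1, b2, b3 = b
    # Non-completeness: each output share leaves out one share index entirely.
    c1 = (a2 & b2) ^ (a2 & b3) ^ (a3 & b2)  # never touches index 1
    c2 = (a3 & b3) ^ (a1 & b3) ^ (a3 & b1)  # never touches index 2
    c3 = (a1 & b1) ^ (a1 & b2) ^ (a2 & b1)  # never touches index 3
    return (c1, c2, c3)


def sharings(x):
    """All 3-share Boolean sharings of the bit x."""
    return [s for s in product((0, 1), repeat=3) if s[0] ^ s[1] ^ s[2] == x]


def output_distribution(a, b):
    """Count each output sharing over all input sharings of the bits a and b."""
    return Counter(ti_and(sa, sb) for sa in sharings(a) for sb in sharings(b))


def is_correct():
    """Check that the output shares always recombine to a AND b."""
    for sa in product((0, 1), repeat=3):
        for sb in product((0, 1), repeat=3):
            a = sa[0] ^ sa[1] ^ sa[2]
            b = sb[0] ^ sb[1] ^ sb[2]
            c = ti_and(sa, sb)
            if c[0] ^ c[1] ^ c[2] != (a & b):
                return False
    return True
```

Calling output_distribution(0, 0) counts the 16 possible input sharings. A uniform sharing would produce each of the four valid output sharings exactly 4 times, but here the all-zero output sharing shows up more often than that, so the sharing is correct and non-complete without being uniform.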
According to the above formula for the minimum number of input shares, each input is split into three shares. The main challenge of threshold implementation is achieving uniformity of the shared function. A simple example is the direct sharing of the mentioned AND gate, which successfully attains the non-completeness property but fails to produce uniform output shares.

Domain-oriented masking is the other well-known hardware masking scheme, and it is configurable to any arbitrary order. In the DOM scheme, input shares are divided into independent sets called domains, and this independence is maintained throughout the whole design. It includes a refreshing layer that adds fresh random bits to intermediate variables before compressing them to reduce the number of output shares. Applying the refreshing layer followed by the compression layer requires a register layer in between to avoid glitch propagation. This refreshing layer, together with the independence property, leads to d + 1 as the minimum number of input shares rather than td + 1. An example is the first-order domain-oriented masking implementation of an AND gate, in which all a_i b_j multiplication terms are called component functions. This is important, since later in our design we will refer to component functions frequently.

The other topic that we would like to go over is a brief history of the masked implementations of Keccak that have been proposed so far. The majority of these implementations are first-order ones. The first masked implementation of Keccak was introduced by its designers. It was a three-share threshold implementation. However, it was later discovered that it suffers from non-uniform output shares. The second, third and fourth proposed designs aim to fix this non-uniformity. The second one employed plenty of fresh randomness to achieve the goal. The third one increased the number of input shares to four, which led to an area increase.
The fourth one followed a different technique and proposed the so-called changing of the guards methodology, which constructs uniform nonlinear layers instead of uniform individual S-boxes. In the higher-order group, there is only a d-th order DOM implementation of Keccak. Compared to the td + 1 scheme, this design is compact. However, it requires a notable number of fresh random bits per χ instance. For a second-order implementation, for example, this amounts to 15 bits per χ instance.

We assign low-latency implementations to a third group, where there is only a two-round unrolled threshold implementation. This design successfully halves the number of clock cycles, while requiring six input shares at first order. It is theoretically extendable to higher orders. However, in practice it confronts significant challenges. Let us see what happens when we want to extend it to higher-order implementations.

This is the design of a two-round unrolled threshold implementation of Keccak. If the entire design is supposed to have security order d, then according to the designers the first round should have security order 2d. This requirement is a necessary rule which ensures non-completeness. However, it implies 2td + 1 as the minimum number of input shares, that is, 4d + 1 for the quadratic χ function. Two implementations are proposed for the first-order realization by the designers, one with five input shares and the other with six input shares. The results show that the second one leads to a more compact implementation.

Now consider second- and third-order security. According to the mentioned rule, the minimum number of input shares in these cases would be 9 and 13, respectively. It is not a trivial task at all to find efficient and non-complete sharings for the 5-bit χ function with 9 and 13 input shares. And this is before we even consider the uniformity challenge. If we consider uniform implementations, the problem gets even worse.
Furthermore, even if we suppose that there are implementations which map 9 and 13 input shares to 9 and 13 output shares, such implementations would impose significant costs, which is not desirable, especially when considering real-world devices. So if we take another look at the map of masked implementations of Keccak, we can now see the need for a compact low-latency implementation of Keccak which is practically extendable to higher orders. To satisfy this need, we propose our single-register-per-round DOM implementation of Keccak. Our design keeps the latency of an unprotected implementation at any arbitrary order.

To this end, we revise the round function of the DOM implementation of Keccak. We relocate the compression layer in such a way that the round function needs only one register stage per round. In fact, we move the compression layer after the linear layer θ, which results in (d + 1)^2 instances of θ rather than d + 1 instances. Of course, we will take this into account when evaluating our design. But more importantly, we are interested in investigating what happens in the absence of a register layer after the nonlinear layer χ and before the linear layer θ as its subsequent operation.

Here we have a simple view of how the χ function operates on its five input bits. It replaces each bit with the result of its XOR with a multiplication term. The multiplication term is the product of the complement of the next bit and the bit after that. The outputs of χ are then delivered to θ, and θ also replaces each bit with the result of its XOR with a parity bit. The parity is made up of 10 different bits, where one of these 10 bits is exactly the one previous to the bit being processed. To keep it simple and easy to follow, a′ will be added to b′, b′ to c′, and so on.
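To make the description of χ concrete, here is a minimal unmasked Python sketch of χ on one 5-bit row, together with the simplified "previous bit" XOR that the talk uses as a stand-in for θ. The names chi_row and adjacent_xor are assumptions for this sketch, and adjacent_xor is only the simplification used for the explanation, not the real θ, which XORs a parity of 10 bits.

```python
def chi_row(x):
    """Unmasked chi on one 5-bit row: out[i] = x[i] ^ ((NOT x[i+1]) AND x[i+2])."""
    n = len(x)
    return [x[i] ^ ((x[(i + 1) % n] ^ 1) & x[(i + 2) % n]) for i in range(n)]


def adjacent_xor(y):
    """Simplified stand-in for theta: each bit is XORed with the previous bit.

    With y = [a', b', c', d', e'], a' is added to b', b' to c', and so on
    (e' wraps around to a'). The real theta adds a 10-bit column parity.
    """
    n = len(y)
    return [y[i] ^ y[(i - 1) % n] for i in range(n)]
```

For example, chi_row([1, 0, 0, 0, 0]) returns [1, 0, 0, 1, 0]. In the masked setting discussed next, the interesting question is which input shares meet when two adjacent χ outputs are XORed without a register in between.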
Let's check what happens for the first-order implementation, where we have two-share input bits and four-share output bits. θ is applied to the outputs of the χ function, and if we keep an eye on output share one, for example, we can see that XORing a′ and b′ leads to leakage on all shares of the input bit c. The same leakage exists when we XOR the second, the third shares of c′ and d′. And speaking more generally, XORing every two adjacent bits of output shares one and two results in a dangerous non-completeness failure.

How can we fix this issue, which results from omitting the register between χ and θ? Our further investigations show that the leakage originates from the fact that the alignment of the component functions in the DOM scheme is such that every output share contains more than one share of each input. The same holds for output share two. So our solution is to introduce a new alignment that is compatible with the absence of the in-between register. The proposed realigned component functions for first order are shown here. It can be checked that at most one share of each input variable shows up in each of the output shares. This is true for output share one and for the rest of the outputs.

A good question would be: what about extension to higher orders? We are glad to announce that our proposed realignment is not just a heuristic that works for first order. In fact, the share indices of the input variables are assigned to the component functions according to a transparent rule which is extendable to any arbitrary order. Once again, we refer to the first-order solution to describe the algorithm. The share indices are assigned to component functions according to a table called the index configuration table. This table can be generated for any arbitrary order according to an algorithm that we have provided in detail in our paper. The structure of this algorithm is based on three steps.
The first is to give each output share only one share of each input variable. The second is to check the correctness of the masking, meaning that the alignment should ensure that every two adjacent columns cover all combinations of the share indices. And the final one is to place the linear monomials. There is freedom to place them in output shares with the same share index as that variable.

Here is our final design architecture for the low-latency DOM implementation of Keccak. We have instantiated another multiplexer, identified as multiplexer 2, which bypasses the θ step and allows us to take the outputs of the last round directly from the compression layer. This turns the number of output shares back to d + 1 instead of the (d + 1)^2 produced by the χ function.

There are some further notes that we would like to discuss about our design. The first is about the refreshing layer. Our design follows the same refreshing rule as the DOM scheme. According to this rule, all a_i b_j component functions where i is not equal to j are blinded with fresh random bits. In fact, the realignment does not change the refreshing layer, and our design requires the same number of fresh random bits as the original DOM scheme. However, the compression layer differs in some details. Let's suppose that we have the first-order realization to which the random bits are applied. According to the compression rule of the DOM scheme, output shares 0 and 1 and output shares 2 and 3 are allowed to be summed up. This rule works for all output bits except for d′, where it results in a dangerous and invalid compression. So we have to rearrange the output shares of d′ right before the compression layer and, of course, after the application of θ. After that, we can perform an ordinary DOM compression safely.
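For reference, the ordinary DOM refreshing and compression mentioned here can be sketched in Python for a single AND gate at any order d. This is the standard DOM arrangement, not the realigned one proposed in the talk, and the names dom_and and unshare as well as the dictionary layout of the fresh bits z are assumptions for this sketch. It models the logic only; the register that DOM places between refreshing and compression is not represented.

```python
import secrets
from functools import reduce
from operator import xor


def dom_and(a, b, z=None):
    """Standard DOM-style AND on (d+1)-share inputs; returns d+1 output shares.

    z maps each pair (i, j) with i < j to one fresh random bit, so an order-d
    gate uses d*(d+1)/2 fresh bits. If z is omitted, the bits are drawn here.
    """
    n = len(a)
    if z is None:
        z = {(i, j): secrets.randbits(1) for i in range(n) for j in range(i + 1, n)}
    out = []
    for i in range(n):
        acc = a[i] & b[i]  # inner-domain component function, not refreshed
        for j in range(n):
            if j != i:
                fresh = z[(min(i, j), max(i, j))]
                # Cross-domain component function a_i b_j: refresh, then compress.
                acc ^= (a[i] & b[j]) ^ fresh
        out.append(acc)
    return out


def unshare(shares):
    """Recombine Boolean shares into the unmasked bit."""
    return reduce(xor, shares)
```

At first order this gives the four component functions a_0 b_0, a_0 b_1 + z, a_1 b_0 + z and a_1 b_1, where the first two and the last two are summed, which matches the "0 and 1, 2 and 3" compression rule. Each fresh bit is used exactly twice, so it cancels when the output shares are recombined.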
In order to assess the performance of our design and to compare it with the state of the art, we have used Synopsys Design Compiler and focused on the implementation of one of the small Keccak permutations, with a state size of 200 bits. The results show that, compared to the equivalent original DOM implementation, our implementation achieves around 26% and 31% lower delay at first- and second-order security, respectively. Our design has an almost constant delay at higher orders. Compared to the first-order low-latency TI implementation of Keccak, our design has 74.5% lower area consumption, while the two have almost equal delays. Here we show the detailed results of implementing our design with security orders 1 to 5, along with the results of the related works.

We have also performed some experimental analysis, including the verification of the security of our constructed masked χ function using the SILVER verification tool, and an FPGA-based evaluation using the SAKURA-G platform. We have used 100 million traces to perform fixed-versus-random t-tests for the first-, second- and third-order implementations. For the first-order implementation, we observed no leakage in the first-order t-test. However, the univariate, bivariate and multivariate second- and third-order t-tests show leakage, which validates our setup. For second- and third-order security, no detectable leakage is observed, since there is high noise caused by the large number of fresh random bits updated at every clock cycle in this design.

In summary, about low-latency Keccak at any arbitrary order: the first note is that this is a design with implementation costs that are within feasible bounds. From a latency point of view, it has a single register per round regardless of its security order, and from an area point of view, it has a lower area overhead compared to the state of the art.
The second note is that the challenge raised by register omission in this design is addressed with the help of Keccak's specification. And there are some closing points about the design. The first is that our solution works not only for θ, but also for any linear function that is applied right after the χ function. The second is that our solution is dedicated to the χ function and is not extendable to any other function. And last but not least, we make no claim that the proposed index configuration is unique. There may be other solutions which distribute the share indices to component functions and achieve the same goal.