Hello everybody, I'm Ayin, and I'm going to talk about our paper entitled "New First-Order Secure AES Performance Records", which is a joint work with Lucian and Amir. ASICs and FPGAs are the two main hardware platforms: in an ASIC, a circuit is realized by logic gates at the transistor level. FPGAs, however, are completely different platforms, and a design has to be realized with their building blocks, such as lookup tables, MUXes, block RAMs, and so on. This work focuses on AES, the most widely used block cipher in the literature and in industry. We know that in the context of network security we need high-throughput AES implementations: a high-throughput design to support high data rates, and an efficient AES implementation, since AES plays a key role in most network-security protocols. To answer this demand, many companies use FPGAs to handle such high data rates. There is a considerable body of work on side-channel-secure AES for ASIC platforms, which is not necessarily an optimized solution for FPGAs. Some studies do consider FPGAs, but their proposed designs either have very low throughput or are not highly secure. So our goal is to introduce a high-throughput, side-channel-secure AES that is optimized for FPGA platforms. FPGAs are designed to be reprogrammable, which makes them a perfect choice if a system has to be updated when necessary. They also meet critical timing and performance requirements through parallel processing. In this talk I focus on Xilinx FPGAs, mainly the Spartan-6 family, even though our constructions are general and can be implemented on any FPGA, as FPGAs of other brands also have the same basic building blocks, like BRAMs, or block RAMs. As you can see in the figure, a Xilinx FPGA contains a matrix of configurable logic blocks, called CLBs, whose number depends on the device size and model.
Each CLB consists of lookup tables, flip-flops, and MUXes, and synthesis tools utilize these logic blocks to implement the desired functions if not specifically configured to use certain internal hardware blocks. Such FPGAs additionally provide other sorts of built-in system-level blocks, including 18-kilobit BRAMs, or block RAMs, each of which can be used as two independent 9-kilobit memory blocks. The input of a BRAM is always registered, which means that read and write operations are fully synchronous, and an optional output register can be configured to improve the timing of the circuit. BRAMs also feature TDP, true dual port, meaning that there exist two completely independent address and data ports, which can be used simultaneously to access the content of the BRAM. Interestingly, the sizes of the address and data ports can be adjusted by the user based on their needs. For example, a 9-kilobit block can be configured to have a 13-bit address port and a 1-bit data port, or a 10-bit address port and an 8-bit output port. So the combinations of address and output sizes are quite limited, and if you want an 8-bit output port, only a 10-bit address port is available. Bear these numbers in mind, as we are going to use block RAMs in the constructions of this paper.

Masking with d+1 shares is of particular interest because it uses the minimum number of input shares, which can potentially lead to lower area overhead and other implementation costs. It is also independent of the algebraic degree of the target function. Here I brought a simple example of d+1 sharing, in which a two-input AND gate is masked as a first-order secure design. In d+1 sharing, the masked variant is split into two parts, which are usually divided by a register layer; you can see the register layer as the red dashed line. So we have component functions here, and the result of each component function should be stored in a register to avoid the propagation of glitches.
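This two-share AND-gate example can be sanity-checked with a short Python sketch (my own illustration, not the paper's HDL): four non-complete component functions are computed, and XORing them pairwise recovers the unmasked product for every possible sharing of the inputs.

```python
import itertools

def masked_and(a0, a1, b0, b1):
    # Four non-complete component functions: each one sees only one
    # share of each input variable (non-completeness).
    f0 = a0 & b0
    f1 = a0 & b1
    f2 = a1 & b0
    f3 = a1 & b1
    # Compression: the four terms are XORed pairwise into two output
    # shares. (In hardware, f0..f3 are registered first so that
    # glitches cannot propagate into the compression layer.)
    return f0 ^ f1, f2 ^ f3

# Correctness: the XOR of the output shares equals a AND b
# for every sharing of every input combination.
for a0, a1, b0, b1 in itertools.product((0, 1), repeat=4):
    c0, c1 = masked_and(a0, a1, b0, b1)
    assert c0 ^ c1 == (a0 ^ a1) & (b0 ^ b1)
print("masked AND correct for all sharings")
```

This only checks correctness; the security argument (non-completeness plus the register layer) is what the talk covers next.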
Each component function should be non-complete, meaning that it receives only one share of each input variable. Then the results of the component functions are compressed into the two output shares, as you can see here; I refer to this as the compression layer, or simply compression, in this talk.

So, what problem are we solving? If an attacker has physical access to the target device, they can monitor the power consumption or collect the electromagnetic radiation of the target device to retrieve secret information. The solution that our community came up with is to mask our designs: in masking, we randomize key-dependent variables during the execution of the cipher. To evaluate a masked design, the probing model was proposed; in this model, a design is d-th-order secure if no combination of d circuit wires reveals anything about the secret. However, this model has to be adjusted for hardware platforms due to a phenomenon called glitches. Glitches are unwanted signal transitions at the output of a combinatorial circuit, mostly caused by unbalanced delays at the gates' inputs. To realize a masked version of a function, we should split each sensitive variable into at least d+1 shares. Boolean masking is one of the most popular masking schemes, in which the XOR of the shares reveals the original value.

We know that the AES S-box consists of an inversion and an affine function, and the inversion can be written as x^254 in the Galois field GF(2^8). We would like to decompose this function into two cubic functions: that is, we would like to decompose the inversion into functions F and G such that each coordinate function of F and each coordinate function of G is at most cubic. There are many solutions for that, and if you look at the ANF, the algebraic normal form, of the coordinate functions, you will see that for both F and G we have all possible cubic monomials, so there is no difference between the solutions.
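The "many solutions" can actually be enumerated in a few lines (a sketch of my own, not the paper's tooling): the algebraic degree of a power map x^e over GF(2^8) equals the Hamming weight of e, and composing x^a and x^b gives x^(a·b mod 255), so cubic decompositions of the inversion are exactly the pairs with wt(a) ≤ 3, wt(b) ≤ 3, and a·b ≡ 254 (mod 255).

```python
# Enumerate decompositions x^254 = (x^a)^b into two at-most-cubic
# power maps: deg(x^e) over GF(2^8) is the Hamming weight of e,
# and exponents compose multiplicatively modulo 255.
pairs = [(a, b)
         for a in range(1, 255)
         for b in range(1, 255)
         if bin(a).count("1") <= 3
         and bin(b).count("1") <= 3
         and (a * b) % 255 == 254]
print(len(pairs), "candidate (a, b) pairs")
```

Any pair from this list would do; the exponents the talk settles on are among them.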
So, for no particular reason among these, we chose the exponents 26 and 49. We now decompose the S-box into G and F, where you can see the definitions of F and G here. As I said, each coordinate function of F and G is cubic, and now the question is: given an 8-bit cubic coordinate function of F or G, what is the minimum number of component functions needed to realize a two-share masked form of it?

I would like to explain this problem with a simplified example. You can see a four-input coordinate function F, and I would like to have a two-share masked realization of it. Of course, as I said, each component function should be non-complete, and we know that for a simple function, four component functions are enough. If we take the monomial ab: a0b0 goes to the first component function, a0b1 to the second one, a1b0 to the third one, and a1b1 to the last. We can do the same for the other monomials without any problem. So I can summarize these four component functions in a table: as you can see, f0 takes a0, b0, c0, and d0 as its input list, and the rest of the component functions are also shown here. But when the function F gets complicated and has more monomials, the question is whether four component functions are still enough, and in this case there is no solution. What we should do is add another component function to make sure that every shared monomial fits into some component function. With five component functions, you can realize a two-share masked F, where, for example, f0 and f1 are compressed into the first output share, and f2, f3, and f4 into the second one. This problem is called the set covering problem, which is a discrete optimization problem, and there are methods to solve this kind of problem.
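To give a feeling for the covering step, here is a generic greedy sketch (an illustration only: the paper computes the true optimum with an exact solver, and the toy element and set names below are made up). The elements are the shared monomials to be covered, and each candidate set holds the monomials one non-complete component function could host.

```python
def greedy_set_cover(universe, sets):
    # Classic greedy heuristic for set covering: repeatedly pick the
    # candidate set covering the most still-uncovered elements.
    # (A heuristic, so it may use more sets than the true optimum.)
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = max(sets, key=lambda s: len(uncovered & s))
        if not uncovered & best:
            raise ValueError("instance is infeasible")
        chosen.append(best)
        uncovered -= best
    return chosen

# Toy instance: elements are shared monomials; each candidate set
# lists the monomials one non-complete component function may host.
universe = {"a0b0", "a0b1", "a1b0", "a1b1"}
candidates = [{"a0b0", "a0b1"}, {"a1b0"}, {"a1b1"}, {"a0b0", "a1b1"}]
cover = greedy_set_cover(universe, candidates)
print(len(cover))  # 3 component functions suffice for this toy instance
```

For the real 8-bit cubic coordinate functions, an exact method is needed to certify the minimum; the greedy version only upper-bounds it.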
And fortunately there are tools for this: we wrote a program and found a solution, and I'm not going into the details, but the optimal solution is 12 output shares for an 8-bit-to-1-bit cubic function. This solution is for the worst-case scenario, that is, a very complex ANF with 8-bit input and algebraic degree 3; if you have a simpler function, you can of course share it with a smaller number of output shares. In our case, however, namely for F and G, we have a very complex ANF, so we have to use the 12 output shares.

So, as I said, we decompose the AES S-box, and basically the inversion. For example, take F(x) = x^26, and let F_j be the j-th coordinate function of F. For each coordinate function of F, I can realize a two-share masked form with 12 component functions, which I denote here by f_{i,j}, where i is the index of the component function and j is the index of the coordinate function. Each f_{i,j} receives an 8-bit input x'_i, which is a mixture of the input shares and, of course, is non-complete. This means that, regardless of the ANF of the target function, the input shares of each component function are fixed: it doesn't matter which coordinate function we take, the input list of the i-th component function is always the same. If I denote the output of the component functions by v_{i,j}, I can build this matrix, in which each column represents a two-share masked realization of one coordinate function. If I instead look at each row, I can view it as a function with 8-bit input and 8-bit output. I can do the same for the second row, and, as I said, the inputs within a row are all the same, so I can denote the input by x'_0 and call the function F'_0, and do the same for the next one, and of course the last one. To achieve glitch-extended probing security, I have to add some fresh randomness, so I take six fresh random bits and add them to the upper half of this matrix, and, to also achieve
correctness and, of course, glitch-extended probing security, I add the same bits to the lower half. As I said, the outputs of these XORs should be stored in registers, and then we have the compression layer, which forms the two 8-bit output shares. Adding this fresh randomness is enough for glitch-extended probing security; however, the output is not uniform, so to achieve uniformity we add two more fresh bits in the compression layer. Of course, the result of the compression layer should be stored in registers before being given to the next function.

Here you can see the general structure of the masked realization of the inversion. As I said, we decompose the inversion into the two cubic functions F and G, and then we have F'_0, F'_1, up to F'_11, which represent 8-bit-to-8-bit functions, each of which was already shown on the last slide as a row of the matrix. As I mentioned earlier, each row takes only one bit of fresh randomness: if I go back here, you can see each row is represented by F'_0, F'_1, up to F'_11, and each row takes only one fresh bit, so only r0 here, then only r1, only r2, up to only r5 for this row, and the same for the other half. Then, of course, comes the register stage, then the compression layer with the extra fresh bits added to achieve uniformity. We have the same structure for G: 8-bit-to-8-bit functions, one fresh bit for each of them, the second half in which the fresh bits are repeated, then the two extra bits in the compression layer, and the result is registered here and given to the next function. Because the fresh bits are supposed to be updated every clock cycle, we can connect the randomness of F and G to the same source, as they are two clock cycles apart; every clock cycle the fresh bits are updated, so when the data in the pipeline is refreshed here, different fresh bits are used there, and we see no leakage in practice, and indeed it has no leakage. We can do the same for the S-box inverse: we can decompose it into two cubic functions as well and do
the same for them, and the general structure is pretty similar to the inversion. So now we have realized the AES S-box and its inverse, and if we merge them into one construction, we get what you see here. Basically, the F' and W' blocks are 8-bit-to-8-bit functions; we have one bit of fresh randomness for each of them, which is basically shared between them, and we have MUXes with a control signal, meaning that we can choose to apply the masked AES S-box or its inverse. Of course, we have the same for the second layer, with its own 8-bit-to-8-bit functions, one fresh bit each, and the same control signal selecting the S-box or its inverse. As you can see, each such block can be seen as a 10-bit-to-8-bit function: basically an 8-bit input plus one bit of fresh randomness plus one control bit. As I said earlier, a BRAM can be configured with a 10-bit address port and an 8-bit output port, so each of these functions fits perfectly into one BRAM. We can therefore realize this layer with 12 BRAMs, and the second layer here with another 12. Moreover, BRAMs feature TDP, true dual port, so with these 24 BRAMs we have implemented not just one S-box but two.

Here is the general structure of our FPGA-specific masked AES. We have a control signal, so we can decide whether to perform AES encryption or decryption, and depending on it, MixColumns or its inverse and ShiftRows or its inverse are applied.

Here you can see the synthesis results of our constructions. As you can see, our designs support both encryption and decryption, the throughput is similar to other designs, the number of fresh random bits is considerably lower than in these designs, apart from this one, and, at the cost of employing BRAMs, the number of slices is lower than the state of the art. If you drop the decryption, the circuit is simplified and we need roughly 1000 slices on a Spartan-6, which is a modern, cost-optimized Xilinx FPGA. For smaller FPGAs we also provide half- and quarter-pipelined variants, which are basically two-
column-based and one-column-based designs. You can see the throughput is still good compared to the byte-serial implementations, the number of BRAMs is quite low, and, as you can see, fewer than 300 slices are needed to implement our column-based encryption design. As a matter of fact, we also provide a byte-serial implementation. Just bear in mind that the smallest FPGA in the Spartan-6 family has only 600 slices, so some designs, like this one, do not fit into this FPGA, while our design uses only 200 slices at the cost of 12 BRAMs, which are already included in these FPGAs, and its throughput is the highest compared to the state of the art. Of course, if you don't need that much throughput, you can use this design, which is tiny with a small area overhead; however, its throughput is pretty low.

All nonlinear parts of our construction, meaning the S-box constructions, are formally verified by SILVER under the glitch-extended probing model, which is dedicated to hardware platforms. SILVER is a verification tool that does not simplify anything, so it has no false positives or false negatives, and both our S-box and inverse-S-box constructions have been confirmed by SILVER. Because it is still not possible to evaluate the entire design this way, we evaluated our designs in practice by implementing them on an FPGA, collecting 100 million traces, and performing a t-test; as you can see, it is always below the threshold.

So here is the last slide. First, we presented a methodology to find the optimum number of component functions for sharing; we covered 8-bit-to-1-bit cubic functions, but this methodology can generally be applied to any Boolean function with any number of inputs. We also minimized the amount of fresh randomness compared to the state of the art, and constructed a wide range of designs, namely round-based ones supporting both encryption and decryption as well as encryption only, two-column-based and one-column-based ones, and also a byte-serial implementation, to cover a
wide range of applications and devices. We outperform the state of the art, none of which can efficiently use BRAMs; all of them have to use slices. We realized first-order secure encryption and decryption together, and we conducted both verification-based evaluation using SILVER and experimental analysis. Thank you very much for your attention and for watching this video. Please do not hesitate to contact me if you have any questions or suggestions.
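As a closing technical note on the practical evaluation mentioned above: the leakage test referred to is the fixed-vs-random t-test with the customary |t| < 4.5 threshold. A minimal Welch's t-test sketch on toy data (my own illustration, not the paper's measurement setup):

```python
import math
import random

def welch_t(x, y):
    # Welch's t-statistic between two trace sets (fixed vs. random):
    # difference of means normalized by the pooled standard error.
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    return (mx - my) / math.sqrt(vx / nx + vy / ny)

# Toy demo: two sets drawn from the same distribution should stay
# below the |t| < 4.5 threshold used in leakage assessment.
rng = random.Random(0)
fixed = [rng.gauss(0.0, 1.0) for _ in range(100_000)]
rand = [rng.gauss(0.0, 1.0) for _ in range(100_000)]
print(abs(welch_t(fixed, rand)) < 4.5)
```

In a real evaluation this statistic is computed per sample point over millions of traces; any point with |t| above the threshold flags potential first-order leakage.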