Hi everyone, my name is Alexandre Adomnicai, and in this video I will present our paper entitled "Fixslicing: A New GIFT Representation". This is joint work with Zakaria Najm and Thomas Peyrin from NTU Singapore.

Some context first. Lightweight crypto has been a very hot topic in the past decade. More than a hundred ciphers claiming to be lightweight have been published, and it appears that no single algorithm is more efficient than all others on every possible platform. This comes from the fact that it is very hard to achieve outstanding performance in both hardware and software at the same time; usually you have to choose a side when designing a cipher. In our paper we tried to answer the question: how efficient can hardware-oriented ciphers be in software? We believe this is an important question for the ongoing NIST LWC standardization project.

We focused on the GIFT family of block ciphers, which was introduced at CHES 2017 and is composed of two members, namely GIFT-64 and GIFT-128, where the number refers to the block size. The GIFT block ciphers are substitution-bit-permutation networks, which means that the linear layer only consists of a bit permutation.
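To make the notion of a linear layer that is only a bit permutation concrete, here is a minimal Python sketch that applies the GIFT-64 permutation to a 64-bit state one bit at a time. The closed-form index map in `p64` is recalled from the GIFT specification rather than stated in this talk, and the function names are made up for illustration, so both are assumptions to be checked against the paper. In hardware each loop iteration is just a wire, whereas in software it costs instructions.

```python
def p64(i):
    """Destination index of bit i under the GIFT-64 bit permutation (assumed form)."""
    return 4 * (i // 16) + 16 * ((3 * ((i % 16) // 4) + i % 4) % 4) + i % 4


def permute_bits(state):
    """Move every bit of a 64-bit integer to its destination, one bit at a time."""
    out = 0
    for i in range(64):
        out |= ((state >> i) & 1) << p64(i)
    return out


# The index map is a bijection on 0..63.
assert sorted(p64(i) for i in range(64)) == list(range(64))

# Every bit keeps its position within a 4-bit S-box (same index modulo 4).
assert all(p64(i) % 4 == i % 4 for i in range(64))

# Applying the permutation four times gives back the original state.
x = 0x0123456789ABCDEF
y = x
for _ in range(4):
    y = permute_bits(y)
assert y == x
```

The bit-by-bit loop is deliberately naive: it is only meant to show what the permutation does, not how an optimized implementation would compute it.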
Relying on a bit permutation alone makes GIFT a hardware-oriented design, since the linear layer is typically free in hardware: it is just bit wiring, and no logic gate is required. GIFT is a direct improvement of the 64-bit cipher PRESENT, which is an ISO standard. It provides smaller area, better resistance against linear cryptanalysis, better performance, and an extension to a 128-bit block size. Actually, GIFT-128 is used in several NIST LWC round-2 candidates.

Now let's have a look at the GIFT building blocks. Both GIFT-64 and GIFT-128 share the same 4-bit S-box. As you can see, it is quite lightweight: it is composed of only 11 gates, and it has nice properties regarding the integration of side-channel countermeasures, since it has an algebraic degree of 3 and only four nonlinear gates.

Since both GIFT algorithms share the same S-box, they only differ by the bit permutation. Here is the bit permutation used in GIFT-64. The figure illustrates two rounds of GIFT-64: first we have the nonlinear layer, which consists of the parallel application of 4-bit S-boxes, then the bit permutation, and the XORs refer to the addition of the round keys and the round constants. The bit permutation has the special property that each bit always arrives at the same position within a 4-bit S-box. Here, for instance, are the S-boxes' first input bits during the first round; they are still S-box first input bits in the next round, and in the one after, and so on. This property holds for every bit position: the first one, the second one, the third one. In fact, the pattern red, yellow, green, blue, red, yellow, green, blue repeats at each round.

So what does this mean from a software implementation point of view? In this presentation we only consider constant-time implementations, thanks to the bitsliced implementation strategy. In the case of GIFT, the internal state is composed of four slices, since it uses a 4-bit S-box, and for GIFT-64 each slice is composed of 16 bits. This bit permutation property means that each bit located in a
slice remains in the same one through the bit permutation, so a different permutation has to be applied to each slice independently: P0 has to be applied to slice S0, P1 to S1, and so on. But how do we do this concretely? Well, the implementation is not very elegant: we have to deal with a lot of masks, shifts and bitwise ORs. And these operations are just to apply P0 to S0; then we have to apply P1 to S1, and then P2 and P3, which can be computed in a similar way using masks, shifts and ORs. In the end, the entire linear layer requires about 100 cycles per round on ARM Cortex-M processors. This highlights why it is usually believed that ciphers using bit permutations are bad candidates for software implementations, although they are extremely efficient in hardware.

Still, in the case of GIFT, one can take advantage of larger architectures, for instance by processing two blocks at once on 32-bit platforms. We did some naive bitsliced implementations to get an idea of how GIFT performs in software. Here are our constant-time implementation results on ARM Cortex-M3 and M4. As I just said, for GIFT-64 we take advantage of the 32-bit architecture to process two blocks in parallel; GIFT-128 perfectly fits 32-bit registers, so we just process one block at a time. The speed is expressed in cycles per block, and in the end the GIFT block ciphers run at 268 and 540 cycles per byte. To give some insight into how this compares with other ciphers, AES-128 runs at 100 cycles per byte on the same platform. So clearly GIFT is not a perfect candidate for optimized software implementations on microcontrollers, and this is mainly due to the bit permutation, which is very costly.

So let's have a look at this building block and see if we can improve that. Here is the bitsliced representation of GIFT-64. We have the four slices, each cell within a slice referring to a bit. Then, during
the first round we apply the linear layer, so each slice is transformed independently using P0, P1, P2 and P3, and the same happens during the second, third and fourth rounds. One can remark that after four rounds we have a kind of synchronization of the slices: the bits are back at their original positions within the slices. This means that the permutation order is four, and it led us to think that following an alternative representation for a few rounds, namely four, might help to improve performance. Actually, the decomposition of the PRESENT permutation over two rounds allows significant performance improvements for that algorithm. In the case of GIFT, we asked ourselves: what if we just completely omit the permutation for a given slice, since its bits will be back at their original positions after four rounds anyway?

So that's what we looked at. Here is, once again, the sliced representation of the internal state. During the first round, we do not apply P0 to S0, so this slice remains fixed during the entire algorithm. However, we cannot just apply P1, P2 and P3 to the other slices, because we need the bits to be properly aligned for the S-box to be computed. By properly aligned I mean the following. If I go back to the classical representation, in the second round, for instance, bits b0, b5, b10 and b15 are involved in the same S-box computation, and we want the same bits to be involved in an S-box computation in our alternative representation, otherwise the result will be erroneous. So instead of applying P1, P2 and P3, we adjust the slices according to our fixed one, so that the bits are properly aligned for the S-box.

Actually, this can be done in a very simple way. For the first round we just have to rotate the columns of the slices: by one column to the left for the first slice, by two for the second and by three for the third. For the second round we do not rotate columns but rows. For the third round we have to
rotate columns again, but to the right this time, and for the fourth round we have to rotate rows again, but towards the bottom. After four rounds we are once again synchronized with the classical representation.

To put it in a nutshell, our new representation consists in fixing one slice so that it never moves, and adjusting the others accordingly so that the bits are correctly aligned for the S-box. We call our technique fixslicing. For GIFT-64, the slice adjustments are very simple and consist of row-wise and column-wise rotations, depending on the round number. By processing two blocks at a time on 32-bit architectures, they can be computed by means of word-wise and byte-wise rotations, respectively. Since word-wise rotations can be computed for free on ARM thanks to the inline barrel shifter, the linear layer is free every two rounds on those processors, which is quite an improvement compared to the naive bitsliced version.

So what about GIFT-128? It is more tricky in this case, because the permutation order is not 4 anymore: it is 31 for P0 and P2, 10 for P1 and 5 for P3. We are interested in the lowest order, since it allows us to define a more compact alternative representation. We therefore suggest fixing S3 so that it never moves, which gives an alternative representation that is synchronized with the classical one after five rounds. The slice adjustments are similar to GIFT-64 for the first two rounds, where we just have to compute word-wise and byte-wise rotations, but they are slightly more costly for the last three rounds. Still, they are way more efficient than in the naive version. We do not detail the slice adjustments in this presentation, but if you are interested you can have a look at the paper, where this is well detailed.

Here are our implementation results for the fixsliced implementations. All of our GIFT implementations were written in assembly to reach the best performance
and these results are for fully unrolled implementations, again for speed optimization. The GIFTb algorithm here refers to GIFT expecting the data to arrive already in bitsliced form, so we do not spend cycles at the beginning of the cipher packing the input data and at the end unpacking the output data. This is just a matter of representation: it does not affect the security of the algorithm, and it is actually used by some LWC candidates.

Thanks to fixslicing, we see that there is quite an improvement. GIFT-64 is only outperformed by SPECK, which is known for its outstanding performance thanks to its ARX structure, and GIFT-128 now outperforms AES on microcontrollers. So fixslicing allows significant performance improvements compared to naive bitslicing: for instance, GIFT-64 runs five times faster and GIFT-128 six times faster on ARM Cortex-M. Our fixsliced representation perfectly fits the ARM architecture thanks to the inline barrel shifter, and we expect slightly lower but still impressive improvement factors on platforms without such instructions, such as RISC-V.

We also looked into masked implementations. There were no results for masked GIFT in software, so we filled the gap by integrating first-order masking. Indeed, lightweight cryptographic algorithms should be able to integrate such countermeasures at a reasonable cost, since embedded devices are typical targets for side-channel attacks. For our masking scheme we used nonlinear gates that do not require additional randomness generation, following the scheme published by Biryukov et al. at CARDIS 2017. We can see that the gap with AES is even larger in the masked domain. Still, one has to note that this AES implementation uses nonlinear gates that require additional randomness generation, so these results can probably be improved and the gap with
GIFT can be reduced, but we still expect GIFT to show outstanding performance compared to AES in the masked domain.

We also had a look at the benefits of fixslicing in the context of the LWC standardization project, by integrating our fixsliced GIFT implementation into the GIFT-COFB authenticated cipher. We compare it to Ascon, which is part of the CAESAR portfolio, is another NIST LWC algorithm, and is actually one of the fastest in software. We see that GIFT-COFB is competing with Ascon: not as fast, but not really far from it. And looking at the NIST LWC benchmarks, GIFT-COFB ranks among the top five on most of the microcontrollers, so this is quite an improvement.

To conclude, we introduced a new alternative representation of GIFT that we call fixslicing. It allows a constant-time and software-friendly implementation of the bit permutation. It makes GIFT extremely efficient in software, placing GIFT-COFB among the fastest NIST LWC round-2 candidates on microcontrollers, and we highlighted that GIFT is well suited to side-channel countermeasures by reporting first-order masked results. All our implementations are publicly available online, so you can have a look if you want to run benchmarks. Also, we did not challenge the security of our masked GIFT implementation, so if you are interested, please feel free to contribute.

As for perspectives, the fixslicing implementation strategy tends to be generic, and it might be of interest for other designs, not only substitution-bit-permutation networks. Indeed, we applied it to AES, and it led to new bitsliced speed records on ARM Cortex-M processors and RISC-V. If you are interested, the paper will soon appear on ePrint, and we will also soon make the source code available on GitHub, so you can check it out.

Okay, thanks for your attention. I hope you enjoyed this talk, and if you have any questions, please feel free to reach us by email; we will be happy to answer them.
Thank you, bye.
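The fixslicing idea described in the talk (never move one slice, adjust the others so that the S-box inputs stay aligned, and resynchronize after four rounds) can be checked on a toy Python model of GIFT-64. This is only a sketch under several assumptions: the S-box table and the closed form in `p64` are recalled from the GIFT specification rather than given in the talk, slice 0 is arbitrarily taken as the least significant S-box bit, round keys and constants are left out, and the adjustments are built as generic composed permutations instead of the row and column rotations used by the authors. All function names are hypothetical.

```python
import random

# GIFT 4-bit S-box as recalled from the specification (assumption); any
# 4-bit S-box would work for this consistency check.
SBOX = [0x1, 0xA, 0x4, 0xC, 0x6, 0xF, 0x3, 0x9,
        0x2, 0xD, 0xB, 0x7, 0x5, 0x0, 0x8, 0xE]


def p64(i):
    """GIFT-64 bit permutation (assumed closed form)."""
    return 4 * (i // 16) + 16 * ((3 * ((i % 16) // 4) + i % 4) % 4) + i % 4


# P[c][j]: destination, inside slice c, of the bit at position j of slice c.
P = [[p64(4 * j + c) // 4 for j in range(16)] for c in range(4)]
IDENTITY = list(range(16))


def compose(q, p):
    """Permutation that applies p first, then q."""
    return [q[p[j]] for j in range(16)]


def inverse(p):
    inv = [0] * 16
    for j, dest in enumerate(p):
        inv[dest] = j
    return inv


def move(perm, bits):
    """Send the bit at position j to position perm[j]."""
    out = [0] * 16
    for j, b in enumerate(bits):
        out[perm[j]] = b
    return out


def sbox_layer(slices):
    """Apply the S-box to each column (one bit taken from every slice)."""
    out = [[0] * 16 for _ in range(4)]
    for j in range(16):
        x = sum(slices[c][j] << c for c in range(4))
        for c in range(4):
            out[c][j] = (SBOX[x] >> c) & 1
    return out


def classical(slices, rounds):
    """Reference rounds: S-box layer, then P[c] applied to every slice c."""
    for _ in range(rounds):
        u = sbox_layer(slices)
        slices = [move(P[c], u[c]) for c in range(4)]
    return slices


def fixsliced(slices, rounds):
    """Same rounds, but slice 0 is never permuted."""
    q = IDENTITY  # maps classical positions to fixsliced positions
    p0_inv = inverse(P[0])
    for _ in range(rounds):
        q_next = compose(p0_inv, q)
        u = sbox_layer(slices)
        adjusted = [move(compose(q_next, compose(P[c], inverse(q))), u[c])
                    for c in (1, 2, 3)]
        slices = [u[0]] + adjusted  # slice 0 is used as it is
        q = q_next
    return slices, q


random.seed(0)
state = [[random.randint(0, 1) for _ in range(16)] for _ in range(4)]
fixed, q = fixsliced(state, 4)
assert q == IDENTITY                 # back in sync after four rounds
assert fixed == classical(state, 4)  # both representations agree
```

Between synchronization points the two states differ only by the relabelling `q`, which is why a real implementation also has to rearrange the round keys accordingly. That step is omitted here.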