Welcome to our presentation of the Subterranean 2.0 cipher suite. My name is Joan Daemen and this is joint work with Pedro Maat Costa Massolino, Alireza Mehrdad and Yann Rotella. I will be doing the first part of the presentation, the historical part. This history starts in 1992 with the publication of the original Subterranean. I published it as a stream/hash module, and it is something that looks like this. You can use it in two ways. You can use it as a hash function, so you can compute the hash over a variable-length message and get a fixed-length hash. And you can use it to compute keystream from a fixed-length key and a fixed-length diversifier. The way this works is very close to what you would do with a sponge. We absorb here in blocks of 32 bits: each cycle we shift 32 bits into the shift register, and they shift out at the back here again. So a 32-bit block stays in the machine for eight cycles. While doing that, we also update a state that is initialized to zero, and we update it each iteration by applying this round function that I will explain on the next slide. After everything has been absorbed, we do a number of blank rounds to get some non-linearity and mixing. From then on we start squeezing. Every cycle we can squeeze 16 bits, so if for instance you want a hash of 256 bits, we have to iterate 16 times. So that's the way it worked.

Let's now take a closer look at this round function. Our shift register has 256 bits and our state has 257 bits. The round function is depicted here. You can see for instance here this bit 92 of the state and how it depends on the bits of the state one cycle earlier, so that's time t. You see that it is actually a function of nine bits. This is implemented in two layers: a mixing layer here, where each output bit depends on three input bits, and a non-linear layer where we have something that closely resembles chi from Keccak-f.
The only difference is that here we operate on a cycle of 257 bits rather than on cycles of five bits, and also that we have a complement, but this is a detail. For the rest, we see that there is also buffer injection here. The shift register bits are injected here, with the exception of here; that accounts for the difference in size between A and B. And we flip the bit at position zero to get rid of the symmetry properties. So you can see that the computation of this bit depends on these bits. If you then go one cycle up, we move this here and we see that these nine bits will be put in positions 12 bits apart, so these nine consecutive bits will then again depend on non-overlapping bits. So you get quite a good spreading of information.

I published this in 1992, but there was never much reaction, and I still think it was a nice design. So we were thinking: why not try to get some attention for Subterranean by submitting it to the NIST lightweight competition? It was not really meant as lightweight at the time. For instance, the chaining value in the hash mode is a 250-bit chaining value, while the then-current hash functions like MD4 and MD5 had a 128-bit chaining value. So it was not meant to be lightweight, it was meant to be more secure than existing solutions. Also, this round function is really hardware oriented and not suited for software. But that would not be a problem, because we would go for low energy, and low energy implies ASIC anyway. So would it really be low energy? Well, the round function takes four XOR gates, one NAND gate and one NOT gate per bit, and it is shallow. So that's quite good for low energy: there are few operations and not too high a gate delay. If you look at absorbing, we absorb 32 bits per round, and taking into account the size of the state, this amounts to 32 XORs, 8 NANDs and 8 NOT operations per bit absorbed.
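As a rough check of the per-bit figures quoted here, the arithmetic can be sketched in Python. This is only an illustrative calculation under stated assumptions: the helper name `gates_per_processed_bit` and the framing (scaling the per-state-bit gate counts by state size over rate) are hypothetical and not taken from the talk; the 4 XOR, 1 NAND, 1 NOT counts, the 257-bit state, the 32-bit absorb rate and the 16-bit squeeze rate are the values the speaker gives for the original design.

```python
# Per-round gate counts for one state bit, as quoted in the talk.
GATES_PER_STATE_BIT = {"xor": 4, "nand": 1, "not": 1}
STATE_BITS = 257  # state size of the original Subterranean


def gates_per_processed_bit(rate_bits, rounds_per_block=1):
    """Gate evaluations spent per absorbed or squeezed bit (hypothetical helper).

    One round evaluates the per-bit gates once for every state bit, so a
    block of `rate_bits` bits costs STATE_BITS * rounds_per_block of each.
    """
    scale = STATE_BITS * rounds_per_block / rate_bits
    return {gate: count * scale for gate, count in GATES_PER_STATE_BIT.items()}


absorb_cost = gates_per_processed_bit(rate_bits=32)   # 32 bits absorbed per round
squeeze_cost = gates_per_processed_bit(rate_bits=16)  # 16 bits squeezed per round

for name, cost in (("absorb", absorb_cost), ("squeeze", squeeze_cost)):
    print(name, {gate: round(value, 2) for gate, value in cost.items()})
```

Rounded to whole gates, the absorb figures come out at about 32 XORs and 8 NANDs and NOTs per bit, and halving the rate doubles every figure.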
Squeezing is twice as expensive, but still reasonably cheap if you compare it to the so-called lightweight designs that are now being proposed. So we thought, let's give it a shot.

So we decided to refurbish Subterranean into Subterranean 2.0. First we got rid of Subhash and Substream, and we aimed for three different primitives that are more modern. An XOF like the SHAKE functions, so variable-length input and variable-length output. A deck function, which is basically a keyed XOF that is more efficient than an XOF and that also allows you to build a lot of different, very interesting modes. And then a dedicated SAE mode. There is nothing we can do with SAE that we cannot do with a deck mode, but SAE is much more compact. So you can build a more compact solution, and that was very interesting in the context of the NIST lightweight competition. We built these primitives by refactoring Subterranean in two levels: Subterranean 2.0 has a duplex level and a mode level. At the duplex level we harmonized the rates to 32 bits in squeezing and keyed absorbing. We still have a different rate for unkeyed absorbing, where we had to reduce to eight bits per two rounds, so eight times slower. We had to do that because the difference between the state size and the required security was too small, and we wanted to achieve the 112 bits of security that NIST required. We got rid of the shift register B altogether, and we just absorb into the state A or squeeze from the state A in an intelligent way, but Yann will tell you about that in a moment. At the mode level we always have eight blank rounds between absorbing and squeezing, except for the encryption and decryption operations in SAE, and there we rely on nonce uniqueness. Yann will now go on about the rationale behind Subterranean 2.0 and also the details of the design.

As already highlighted by Joan, Subterranean 2.0 comes with an XOF, a deck function and a session authenticated encryption scheme.
Let's now describe those three modes. The XOF can be used for hashing and works as follows. The absorbing phase splits the possibly multiple strings that we want to absorb into blocks of one byte. Between absorbed message blocks we apply the round function twice, and once the strings are absorbed we apply eight blank rounds. The squeezing phase then produces a string of arbitrary length, where every output block has a length of at most 32 bits.

For the keyed modes, we let the absorbing rate be the same as the squeezing rate, four bytes at most, and a padding rule allows us to process strings of arbitrary length. Also, we apply only one round between absorptions. The Subterranean deck function then works as follows: cut the key into blocks of at most 32 bits, absorb the key, cut the public string, represented here by M, also into blocks of four bytes or shorter and absorb them, apply eight blank rounds, and then produce the output, which can serve for instance as a MAC.

Subterranean 2.0 also comes with a session authenticated encryption scheme that works as follows. All input and output strings are split into blocks of at most 32 bits. The scheme is duplex-like. We first absorb the key and the nonce. For initialization, we apply the round function eight times. Then we absorb the associated data. Before absorbing any plaintext block, 32 bits or fewer are squeezed, and those serve as keystream to produce the ciphertext by adding the plaintext P to the corresponding keystream Z. After applying eight blank rounds again, the tag can be produced in the same way as in the XOF or deck mode. Then a new message can be processed in the same session.

The round function R looks as follows. It is composed of four mappings: the chi mapping for nonlinearity, a bit complementation at position zero to break the symmetries, a theta function, which is a mixing layer that is here to provide diffusion, followed by a pi mapping for dispersion, which is a bit shuffle. More precisely, in this pi mapping, output bit zero
is input bit zero, output bit one is input bit 12, output bit two is input bit 24, and so on, taking all indices modulo 257. Note here that 12 is a generator of the multiplicative group (Z/257Z)*.

Let's now see how bits are absorbed and squeezed in the XOF, deck and SAE modes. The main point here is that we do not take consecutive state bits to define the outer part of the state. The factor 12 used in the pi mapping is a generator of the multiplicative group (Z/257Z)*, and hence its powers cover all 256 nonzero bit indices. Hence the powers of 12 to the power 4 cover 64 of these positions; this defines a multiplicative subgroup G64. The squeeze operation then outputs 32 bits that are each computed as the XOR of two state bits, whose indices are inverses of each other in the subgroup G64. So every keystream bit is the sum of two particular state bits. Finally, the absorb operation is simpler: the input bits are absorbed into the state bits at the positions defined by the first elements of G64, where the elements of G64 are ordered by consecutive powers of 176, that is, 12 to the power 4.

We chose non-consecutive bits in the absorbing and squeezing phases: we squeeze from non-consecutive bits and we absorb into non-consecutive bits. If we had chosen consecutive state bits to absorb and squeeze, then an attacker could partially compute the chi mapping with the knowledge of the keystream. This effect was observed by Thomas Fuhr, María Naya-Plasencia and myself in 2018 on the first version of Ketje Jr, and the choice has been made with this in mind. The fact that the keystream is defined as a sum of two state bits also frustrates cryptanalysis. This particular choice is also consistent with the shuffle layer and the chi mapping.

Another important design choice is the number of rounds: we chose eight rounds for separation. For the unkeyed mode, a security analysis presented in the paper, which I will not detail here, explains why we chose two rounds between absorbed blocks. A padding rule
allows bit strings of arbitrary length, and this holds for all Subterranean modes. Hence the absorbing rate is 8 bits plus 1 bit of padding, and so we want security in the case where an attacker has exactly 9 bits of freedom per input block. The same holds for the keyed mode, where the effective absorbing rate can be considered to be 33 bits, but where there is in this case only one round between blocks.

Up to now, two third-party cryptanalyses of Subterranean 2.0 have been published. The first one comes from Fukang Liu, Takanori Isobe and Willi Meier, who looked into Subterranean SAE in the nonce-misuse scenario, even though Subterranean is not designed to be nonce-misuse resistant. Moreover, they were able to attack Subterranean using a cube-based approach when the separator is reduced to four rounds: the key recovery has a complexity of 2 to the power 122 calls to Subterranean, and the distinguisher has a complexity of 2 to the power 32 calls to Subterranean. The second cryptanalysis has been done by Ling Song, Yi Tu, Danping Shi and Lei Hu, and it is available on ePrint. In their work, the authors propose size-reduced versions to help further investigations. They also prove that there are no observable linear biases in the keystream when taking four consecutive output blocks, and they also improve the complexity of the nonce-misuse cryptanalysis. To conclude my part, I would say that we believe that the security of Subterranean is still strong, but naturally we strongly encourage all cryptanalysts to target Subterranean, so more cryptanalysis is welcome. But now let me introduce Alireza, who will talk about differential trail analysis of Subterranean.

Now I want to talk about the resistance of Subterranean against differential attacks. The main idea of differential cryptanalysis is to find pairs of inputs, like M0 and M1, with a specific difference delta 0 that leads to a specific difference delta r at the output with high probability. Then we can say that the security of this algorithm against
differential attack is at most equal to the maximum differential probability of delta 0 going to delta r, but this is hard to determine. We believe that for Subterranean the maximum DP is approximately equal to the maximum DP of Qr, where Qr is a differential trail, and you can show this trail like this, where the bi's are intermediate differences. If you want to show this trail on the figure, you can show it this way. To make it easier, we use the weight of a trail instead of its DP, but these can be easily converted to each other: the weight is equal to the negative log2 of the DP. This is what we have so far.

To calculate the weight of this trail, we need to define differential trail cores. Each round function of Subterranean consists of two layers, linear and nonlinear. We call the linear layer lambda, and chi is the nonlinear layer, so we can divide each round into these two layers. We denote by ai the difference after each chi layer. Since the lambdas are linear layers, once we have the value of ai we can compute the value of bi with probability equal to 1. So the weight of this trail only depends on the weight of passing chi, which is equal to the weight of delta 0 going to a1 here plus the weights of the bi's going to ai+1. Or we can simply show it this way: the minimum reverse weight of a1 plus the weights of the bi's. Knowing the input difference is enough to compute the weight, so we can show this part like this. But why did we use the minimum reverse weight for a1? Well, the output difference a1 is compatible with a lot of differences delta 0 at the input, and since we are looking for a lower bound on the weight, we use the minimum reverse weight here.

Now we want to know what the lower bound on the weight of differential trails is. It is hard to determine this lower bound for long trails, like 8 or 7 rounds, but what we can do is start with shorter trails. It is obvious that the minimum weight is 2 for 1 round and 8 for 2 rounds, but it is not trivial for 3 rounds or more. To find the lower bound on the weight of 3-round trail cores, we generated all
of these trail cores up to weight 39. For this purpose we used the same method as introduced by Mella, Daemen and Van Assche in 2016. This is the list of all 3-round trail cores up to weight 39. The numbers are given modulo rotation. This means that for this particular weight, since the length of the state of Subterranean is 257, we have 1 times 257 trails with weight 25. So the lower bound on the weight of 3-round trail cores is 25, and this is what it looks like: you have 1 active bit at A1, 3 active bits at B1 and 9 active bits at B2. I also listed the positions of the active bits here.

To find the lower bound on the weight of 4-round differential trail cores, we searched the space of all trails up to weight 48, but there were no 4-round trails up to this weight. The best 4-round trail core we found costs 58, so the minimum weight of 4-round trail cores should be somewhere between 49 and 58. This is the 4-round trail core with weight 58 that we found: we have 9 active bits at A1, 5 active bits at B1, 6 active bits at B2 and 15 active bits at B3.

So far we have found the lower bounds on the weight of 1-, 2- and 3-round trail cores, and we know that the lower bound on the weight of 4-round trail cores is somewhere between 49 and 58. But what can we do to find a lower bound on the weight of 8-round trail cores? Well, we know that each 8-round trail core Q8 can be divided into two 4-round trail cores. If this trail is to have weight smaller than or equal to 97, then one of these 4-round trail cores must cost 48 or less. Since there is no 4-round trail core up to this weight, we can conclude that the weight of 8-round trail cores is at least 98, which is 97 plus 1. We used different methods to find the lower bounds on the weight of other trail cores, which I will not go into in detail, but you can see the list of these lower bounds in this table.

Now let's talk about implementation and performance. Since Subterranean 2.0 is a cipher optimized for hardware, we only talk about
the hardware results; we won't talk about the software results nor the software implementation. So here we have the full circuit that is compatible with the hardware LWC API. Our goal for this architecture was high throughput, which is one of the outstanding features of the Subterranean permutation. To obtain high throughput, we made a circuit called the Subterranean stream, which can perform AEAD or hashing as long as data arrives in a specific order and is sent in full blocks of 32 bits, except for the last block, which also has to be flagged as being last. Around this streaming circuit we built the circuit that is compliant with the LWC API. It is the same strategy used by the LWC API people who did the framework and the proposal; however, our solution is tuned and optimized for Subterranean itself. To control this entire circuit, we made a big state machine that handles all the LWC API related messages; however, the Subterranean stream itself has its own internal state machine.

I could do my own benchmarking, but instead I am going to show the results of third-party benchmarks. The people who proposed the framework and the API also did the benchmarking on FPGA. For FPGAs, specifically the Artix-7 from Xilinx, Subterranean takes first place in terms of encryption throughput and sixth place in terms of hashing throughput. This is not very surprising, because by construction, so from the theoretical part of the Subterranean proposal, Subterranean can process 32 bits of data per round during encryption, while other ciphers like AES process 128 bits of data every 10 rounds and Xoodyak processes 16 bits of data per round. You can see here that this is a lot of data per round compared to other ciphers. Of course, to do a full, better comparison I would need to compare how much a round costs. However, I can easily say that a Subterranean round is very cheap, and therefore most likely that is the main reason why
we get first place and also really high throughput. The same thing applies for hashing: Subterranean processes 4 bits of data per round in hashing mode, therefore you can easily see the difference between the 6 gigabits per second and the 744 megabits per second. However, both these solutions are in the same hardware, so the same hardware supports hashing and encryption, and therefore we get a very low number of lookup tables, only 115. In terms of ASIC, we also get first place for throughput and first place for energy. While throughput was already explained before, energy is a little bit more special, because we already claimed that we have a very low-energy cipher. This can be explained by two things. Subterranean is a cipher that can process lots of data in a very small number of rounds, so for a big set of data I need fewer rounds, so less time. And our rounds are also very cheap, so you need fewer resources for each round. Fewer resources means less power, and less power and less time means much less energy. So that's the reason why we are a very cheap solution in terms of energy.

So to conclude, as I already said, Subterranean targets a security of 128 bits for deck and SAE and 112 bits for XOF. We have a very comfortable safety margin, as seen from the third-party cryptanalysis, but we need you to do more cryptanalysis. So please do cryptanalysis of Subterranean. It is a very good cipher, and it is also a lightweight cipher, as I already said, with a very small state size and a very low number of rounds needed to absorb, and all of this has been confirmed by the benchmarking. Also, in terms of masking, the non-linear part is only an AND gate, therefore it should be easy to mask: just one AND gate. And thank you very much for your attention.
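The index arithmetic from the design part of the talk (the pi shuffle with factor 12 and the subgroup G64 used for absorbing and squeezing) can be checked with a short Python sketch. This is a minimal illustration, not the reference implementation. Two readings are assumptions on my part and are flagged in the comments: that "inverse" indices means additive inverses modulo 257, and that a 32-bit block plus its padding bit occupies the first 33 elements of G64. All variable and function names are hypothetical.

```python
P = 257          # number of state bits; all indices are taken modulo 257
PI_FACTOR = 12   # factor used by the pi bit shuffle


def multiplicative_order(g, p):
    """Smallest k >= 1 with g**k == 1 (mod p), found by brute force."""
    x, k = g % p, 1
    while x != 1:
        x = (x * g) % p
        k += 1
    return k


# pi: output bit i is input bit 12*i mod 257.
pi_source = [(PI_FACTOR * i) % P for i in range(P)]

# G64: the subgroup generated by 12**4 mod 257, ordered by consecutive powers.
g64_generator = pow(PI_FACTOR, 4, P)
G64 = [pow(g64_generator, i, P) for i in range(64)]

# Assumption: "inverse" is read as the additive inverse modulo 257, so
# squeezed bit i would be state[G64[i]] XOR state[257 - G64[i]].
squeeze_pairs = [(g, P - g) for g in G64[:32]]

# Assumption: 32 data bits plus one padding bit use the first 33 elements.
absorb_positions = G64[:33]

print("order of 12 modulo 257:", multiplicative_order(PI_FACTOR, P))
print("12**4 modulo 257:", g64_generator)
print("distinct elements in G64:", len(set(G64)))
print("first absorb positions:", absorb_positions[:4])
print("first squeeze pairs:", squeeze_pairs[:2])
```

Under the additive-inverse reading, the 32 squeeze pairs use each of the 64 positions of G64 exactly once, which is consistent with squeezing 32 bits per round from non-consecutive state bits.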