Today I'm going to tell you about how we optimized BIKE for Intel Haswell as well as the ARM Cortex-M4. This is joint work with Ming-Shing Chen and Markus Krausz. BIKE is one of the third-round alternate candidates in the NIST PQC standardization process. It is a code-based KEM based on so-called moderate-density parity-check (MDPC) codes, which are essentially the same as low-density parity-check codes. The public keys of BIKE are pretty small: roughly 1.5 kilobytes, 3 kilobytes, and 5 kilobytes for the level-1, level-3, and level-5 parameter sets. The reason why BIKE can have small keys is that the construction is NTRU-like; some ring structure has been built into BIKE's construction. BIKE is also supported in Amazon's AWS Key Management Service.

So this sounds good, but unfortunately BIKE is a relatively slow scheme in the standardization process. For example, if you compare BIKE level 1, the level-1 parameter set of BIKE, to the corresponding parameter set of Classic McEliece, you can see that both encapsulation and decapsulation are much slower, and there hadn't been any optimized code for embedded systems written by the BIKE team. So here's what we did. We wrote a Haswell implementation and also an M4 implementation, and both implementations are constant-time. As you can see, our Haswell implementation is faster than the AVX2 implementation by the BIKE team; in particular, our decapsulation is much faster, which is important because decapsulation is the slowest operation. Our M4 implementation is also much faster than the portable implementation written by the BIKE team, and here I have to mention that the portable implementation is not fully constant-time.

Now we can look at BIKE's specification. The operations that I highlighted in red are operations in the ring R, which is F2[x]/(x^r - 1). You can see there are three multiplications here, and, although this is not particularly important for us, each multiplication involves one low-weight operand: e1, h0, and h1 are all low-weight elements of R. And when we compute the inverse of h0 using the Itoh-Tsujii algorithm, there are going to be even more multiplications in R. So what I want to say is that there are many multiplications in R that we have to perform.

You might wonder why BIKE is called a code-based scheme, because this doesn't look code-based. In fact, s here is a syndrome, which means that it is the product of a parity-check matrix and an error vector, and the parity-check matrix is a low-weight matrix. Just by using this low-weight structure, we can already find the error positions, at least with a certain probability. So here is what you can do. You set an error vector e' to 0. Then, for each position i, you count the number of unsatisfied parity checks; if the number of unsatisfied parity checks for position i is greater than some threshold T, you flip e'_i. If the syndrome of e' is equal to s, you return e'; otherwise, you go back to step 2. This is not exactly the decoder specified in BIKE's specification, but the decoder used by the BIKE team is essentially a variant of this simple bit-flipping algorithm. Actually, as shown in the CHES 2016 paper "QcBits: Constant-Time Small-Footprint Code-Based Cryptography", the operation of counting the numbers of unsatisfied parity checks can be viewed as a multiplication in another ring, R_T, which is Z[y]/(y^r - 1). The operands of each multiplication in R_T must have coefficients in {0, 1}, and one of the two operands must be low-weight.
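To make the bit-flipping idea concrete, here is a minimal, unoptimized C sketch. This is not BIKE's actual decoder and it is not constant-time; the dense bit arrays, the fixed threshold T, when the syndrome gets updated, and names such as bitflip_decode and col_rows are all illustrative assumptions.

```c
#include <stdint.h>
#include <string.h>

/* Minimal bit-flipping decoder sketch.  H is an r x n sparse parity-check
 * matrix given, for each column j, as the w row indices where that column
 * has a one.  syn initially holds the syndrome s; e is the output error
 * vector (one byte per bit).  Returns 0 on success, -1 on failure. */
int bitflip_decode(uint8_t *e, uint8_t *syn, const int *col_rows,
                   int n, int r, int w, int T, int max_iters)
{
    memset(e, 0, n);
    for (int it = 0; it < max_iters; it++) {
        int weight = 0;                           /* weight of s + H*e'   */
        for (int k = 0; k < r; k++) weight += syn[k];
        if (weight == 0) return 0;                /* H*e' equals s: done  */

        for (int j = 0; j < n; j++) {
            int upc = 0;                          /* unsatisfied checks   */
            for (int k = 0; k < w; k++)
                upc += syn[col_rows[j * w + k]];
            if (upc > T) {                        /* threshold exceeded   */
                e[j] ^= 1;                        /* flip position j      */
                for (int k = 0; k < w; k++)       /* keep syn = s + H*e'  */
                    syn[col_rows[j * w + k]] ^= 1;
            }
        }
    }
    return -1;                                    /* decoding failure     */
}
```

Counting upc for every position j in this way is exactly the operation that, as just mentioned, can be rewritten as a multiplication with 0/1 coefficients.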
So essentially the multiplications in R_T are similar to the multiplications in R that I highlighted in red on the previous slide. Now that we know there are multiplications in R_T and also in R, we can take a look at how we perform these multiplications from a high-level point of view. For multiplications in R_T, if the multiplication is between g and f, where g is the low-weight operand, then we consider g as the sum of several y^i's. We compute y^i * f for each i, and then we compute the sum of the different y^i * f's. In our implementation these two steps are interleaved, but conceptually you can consider them as two different steps. And how do we compute y^i * f? Due to the structure of R_T, this is essentially a circular shift of f by i bits. Following previous papers, we don't actually perform circular shifts: we convert f into f', which is essentially a duplicated form of the vector f, and then perform logical shifts on f'. Regarding multiplications in R, what we do is simpler: we basically do a polynomial multiplication and then reduce modulo x^r - 1.

So here are our optimizations for the multiplications in R_T. We use the SEL instruction on the M4 to perform constant-time logical shifts, and on Haswell we make use of matrix transposition to perform constant-time logical shifts. On both Haswell and the M4, when we want to add the different y^i * f's, we use an algorithm described by Boyar and Peralta. Regarding multiplications in R, on Haswell we use an algorithm proposed by Bernstein, and on the M4 we use the so-called Frobenius additive FFT.

Now we can talk about how we use the SEL instruction for the multiplications in R_T. The task here is to shift f' by s bits, where s is smaller than r. We consider s as the sum of s1 and s0, where s0 is s mod 32 and s1 is a multiple of 32. To shift f' by s1 bits, we use a barrel shifter, which means that we conditionally shift f' by 2^k bits, then by 2^(k-1) bits, and so on, until we conditionally shift f' by 32 bits. Because each of these shift amounts is a multiple of 32, each conditional shift is simply a sequence of conditional moves of 32-bit words, and this is easy to implement. For example, the portable implementation by the BIKE team uses a piece of code like this. But on the M4 you can do better, because you can just use the SEL instruction to select one of two words, so you don't have to do multiple logical operations. If you use the SEL instruction directly, you will see that there are different chains of conditional moves, and within each chain, if the shift amount is rather large, you will be accessing 32-bit words that are pretty far away from each other, which is not very good in terms of performance. So we combine multiple chains of conditional moves: we carry out several chains like this in parallel in order to minimize the cost of memory operations. It would actually be very easy to unroll the whole loop, but eventually we decided to only slightly unroll it in order to balance speed and code size. Finally, when you want to shift the vector by s0 bits, this is very easy to do just using logical instructions, but a better way is to use multiply and multiply-accumulate instructions such as UMLAL. We don't talk about this here, but it is described in our paper.
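As a rough illustration of the word-granular barrel shifter just described, here is a portable C sketch. The real M4 code replaces these mask-based conditional moves with the SEL instruction and interleaves several chains; the array size, the starting width of 256 words, and the function name are illustrative assumptions.

```c
#include <stdint.h>

/* Number of 32-bit words per r-bit block; 386 words would fit r = 12323. */
#define NWORDS 386

/* Constant-time shift of f' by a whole number of words, i.e. by
 * s1 = 32 * word_shift bits, with word_shift < NWORDS.  fp holds the
 * duplicated form of the block (2 * NWORDS words), so that after the
 * shift the rotated block sits in the first NWORDS words. */
static void shift_by_words_ct(uint32_t fp[2 * NWORDS], uint32_t word_shift)
{
    /* Barrel shifter: conditionally shift by 256, 128, ..., 1 words. */
    for (int b = 8; b >= 0; b--) {
        uint32_t w    = (uint32_t)1 << b;
        uint32_t mask = -((word_shift >> b) & 1u);   /* all ones or zero */
        for (uint32_t i = 0; i + w < 2 * NWORDS; i++)
            /* 32-bit conditional move; this is where SEL helps on the M4 */
            fp[i] ^= (fp[i] ^ fp[i + w]) & mask;
    }
}
```

The remaining shift by s0 = s mod 32 bits would then be handled separately, as described above.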
Okay, now consider Haswell. Our approach for Haswell is a bit different. We still consider the shift amount s as the sum of s1 and s0, where s0 is s modulo some b, and we now view f' as a matrix like this. Shifting f' by s0 bits is then approximately the same as shifting the rows: you still have to combine the results of the shifts, but basically what you do is shift the rows, and to be able to shift the rows it's better to make the matrix row-major. Our main observation is that when you want to shift f' by s1 bits, instead of using a barrel shifter, you can consider the operation as shifting the columns. So, to be able to shift by s1 bits, we store f' in a row-major way; then, for each shift amount s, we shift the columns by s1 bits, transpose the matrix, and then shift f' by s0 bits. The matrix transposition here is pretty fast: you can partition the matrix into four pieces and recursively swap the upper-right part and the bottom-left part, and all of this can be done conveniently using logical operations. For our Haswell implementation, we set b to 256. We also tried the same approach on the M4, but unfortunately it doesn't seem to work very well there; we guess this is because it's not very easy to shift by a large number of bits on the M4.

Okay, now we know how to compute y^i * f for any i, and the next task is to compute the sum of the y^i * f's. If you think about it, this task is essentially the same as computing, for each coefficient position, the Hamming weight of a vector whose length is the number of i's. In the QcBits paper, the author proposes to use bitslicing to do this: what you do is essentially mimic in software a hardware circuit that computes the Hamming weight. For example, you can prepare several bits as a counter, and then you just keep adding one bit into the counter until you have processed all the bits. In this way you will only be using half adders. But it is actually much more efficient to make use of full adders, as Boyar and Peralta have shown in their paper "The Exact Multiplicative Complexity of the Hamming Weight Function". The idea is quite simple: you just perform additions of two operands of value at most 2^k - 1 together with one extra bit, whenever possible. Just by doing this, we can already save lots of logical operations, as shown in this table. One thing I should mention is that the drawback of this method is that it uses more memory, so it's more like a memory-time trade-off. Here is a picture that illustrates the algorithm. For example, if you want to add 15 bits, that is, compute the Hamming weight of a length-15 vector, you first add bits 1 to 6 and bits 8 to 13 using full adders. Then you add bits 1 to 7 using 2 full adders and bits 8 to 14 using 2 full adders. Finally, you add bits 1 to 15 using 3 full adders.

Now we can talk about how we perform the multiplications in the ring R. On Haswell, one simple thing you can do is to use Karatsuba. In Karatsuba, we write each polynomial f as f0 plus f1 times z, and we evaluate f at 0, 1, and infinity. We do the same for g, then we do pointwise multiplications, and then we do some interpolation in order to get the product. The BIKE team sets z to x^k, where k is a power of 2, and for polynomials of length 64 they just use PCLMULQDQ. But for each of BIKE's parameter sets, r is always approximately 3 times a power of 2, so the problem here is that the evaluations of f at 0, 1, and infinity are not balanced in terms of their lengths.
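As a small worked example of this splitting, here is one Karatsuba level in C for 128-bit binary polynomials: split f as f0 + f1*z with z = x^64, evaluate at 0, 1, and infinity, multiply pointwise, and interpolate. The names clmul64 and karatsuba_128 and the 128-bit operand size are illustrative; a real implementation would use PCLMULQDQ (or bitslicing on the M4) for the 64-bit carry-less multiplications.

```c
#include <stdint.h>

typedef struct { uint64_t lo, hi; } poly128;     /* 128-bit product */

/* Portable carry-less multiplication of two 64-bit binary polynomials
 * (stand-in for PCLMULQDQ). */
static poly128 clmul64(uint64_t a, uint64_t b)
{
    poly128 r = {0, 0};
    for (int i = 0; i < 64; i++) {
        uint64_t mask = -((b >> i) & 1u);        /* select bit i of b */
        r.lo ^= (a << i) & mask;
        if (i > 0)
            r.hi ^= (a >> (64 - i)) & mask;
    }
    return r;
}

/* One Karatsuba level: f = f[0] + f[1]*z, g = g[0] + g[1]*z, z = x^64.
 * The product h has degree < 256 and is returned as 4 words. */
static void karatsuba_128(uint64_t h[4], const uint64_t f[2], const uint64_t g[2])
{
    poly128 lo  = clmul64(f[0], g[0]);                /* f(0)   * g(0)   */
    poly128 hi  = clmul64(f[1], g[1]);                /* f(inf) * g(inf) */
    poly128 mid = clmul64(f[0] ^ f[1], g[0] ^ g[1]);  /* f(1)   * g(1)   */
    mid.lo ^= lo.lo ^ hi.lo;                          /* interpolation:  */
    mid.hi ^= lo.hi ^ hi.hi;                          /* middle term     */
    h[0] = lo.lo;
    h[1] = lo.hi ^ mid.lo;
    h[2] = hi.lo ^ mid.hi;
    h[3] = hi.hi;
}
```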
We also wondered whether a lower-complexity algorithm could perform better, because the lengths of the polynomials here are not so short. So we tried another algorithm, proposed by Bernstein in his paper "Batch Binary Edwards". This algorithm has a lower complexity: it writes every polynomial f as f0 plus f1 times z plus f2 times z squared, evaluates the polynomial at 0, 1, x, x+1, and infinity, and then does some pointwise multiplications and some interpolation. The implementation is quite straightforward: we implement Bernstein's algorithm for the top level of the recursion, and for the lower levels we just use Karatsuba. One thing worth mentioning is that we have to divide polynomials by x^2 + x. If you do this naively, copying each coefficient one by one, it is very slow; a much faster approach is to just add the top half of the polynomial to the bottom half, and you do this recursively.

Finally, we can talk about how we do multiplications in R on the M4. The basic idea is that we want to use additive FFTs to do the polynomial multiplications, and the operands here are polynomials with binary coefficients. In an additive FFT, you write a polynomial f(x) as f0(x^2 + x) + x * f1(x^2 + x). By doing this, you will see that there is a very big overlap between the computation of f(alpha) and f(alpha + 1), and this is how the recursion works. If you do this naively, the resulting recursion is not so efficient. So there is a special variant called the Frobenius additive FFT, which tells you that you can reduce the number of evaluation points by making use of the Frobenius map. The concept is quite simple: because we are dealing with binary polynomials, f(alpha^2) must be equal to the square of f(alpha) for any alpha in F_{2^m}. We are not the first to implement the Frobenius additive FFT, but the previous implementations just use PCLMULQDQ, while on the M4 you don't have such a convenient instruction for carry-less multiplications, so we have to use bitslicing. For the representation of the field: the biggest field we have to use is F_{2^32}, and it is built on top of a sequence of smaller fields; we build it from F_2, then F_4, then F_16, and so on. Each multiplication in the FFT is of the form alpha times beta, where beta is v plus w; here w is always a subfield element and v is always a constant. To optimize this multiplication, we compute alpha times w using Karatsuba, and we compute alpha times v using some sort of circuit generation, because multiplication by the constant v is essentially an F2-linear map. Finally, we add the two results together to obtain alpha times beta. We found that doing the two smaller multiplications and then adding the results together is much faster than doing the multiplication directly. And finally, we found that Bernstein's algorithm still performs better on Haswell: we tried the same approach on Haswell, but Bernstein's algorithm is still a bit better.

You can find our source code online. Our Haswell implementation is already available in SUPERCOP, our M4 implementation is available in pqm4, and both implementations are available in the artifact archive. So that's all of my talk. Thank you for listening.