Hello and welcome to my talk for CHES 2021. My name is Ming-Shing Chen. This talk is about our paper on Classic McEliece on the ARM Cortex-M4. This is joint work with my colleague Tung Chou.

The first page is an introduction to the Classic McEliece cryptosystem. Classic McEliece is a key encapsulation mechanism. It is used for establishing a shared key for communication. It is based on a code-based public-key encryption scheme, and it is a finalist in NIST's Post-Quantum Cryptography Standardization Project. It has three core operations. The key generation generates a public key and secret key pair. The encapsulation first generates a random message and then encrypts the message. The decapsulation decrypts the received ciphertext and then derives the shared key. I list the parameters for the different security levels in the table; you can pause the video if you want to read the details.

A common criticism of Classic McEliece is its big public key. This problem is more severe on embedded systems because they usually don't have that much storage space.

We use the ARM Cortex-M4 as our optimization target, because NIST chose this processor as its primary platform for embedded systems. The Cortex-M4 has 14 usable 32-bit registers. If the processor has a floating-point unit, which is an option, then we have 32 extra floating-point registers. We use the floating-point registers as extra space, which helps to relieve the register pressure. A memory access usually costs two clock cycles, but reading or writing contiguous memory is faster: for example, it takes N+1 cycles to read or write N contiguous words.

Our device for programming is the STM32F4 Discovery board from STMicroelectronics. This is a commonly used board for benchmarking cryptosystems. The board has 192 kilobytes of RAM. More importantly, it has one megabyte of flash memory. The flash memory is usually used for storing the programs.
But in our work, we use the flash memory to store the public key. This more or less solves the problem of the large public keys.

Our first optimization is for the key generation. Basically, you have a large rectangular matrix H, and you want to turn its first part, an mt x mt square, into an identity matrix. Then you multiply the remaining part by the inverse of that square matrix, and the product is the public key T.

There were two implementations before ours. The big picture of the two implementations is the same. They both first use an in-place LUP decomposition to decompose the square matrix. The decomposition must be in place because we usually don't have much spare space on an embedded system. Then, with the decomposed matrix, they generate the T part of the public key. Because the public key is larger than the memory, they can only generate it step by step. There are two differences between the two implementations. First, the results of the LUP decomposition are different, because they use different algorithms for the decomposition. Second, after the decomposition, the RKK implementation computes an explicit inverse matrix and generates the public key with it, while the implementation from the NIST submission does not generate the inverse matrix: it generates the public key directly from the decomposed data.

This page shows our implementation. We more or less mixed the two previous implementations. For the LUP decomposition, we use the algorithm from the RKK implementation. After the decomposition, we generate the public key directly from the decomposed data, just as the submission's implementation does. There are two kinds of matrices to handle after the decomposition. For the permutation matrix P, we use a sorting network to permute the rows of the matrix. After the permutation is done, we have to multiply by the inverses of the upper triangular matrix U and the lower triangular matrix L. As the figure in the middle shows, multiplying by the inverse of L or U can be done by row operations performed in the correct order.
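To illustrate this point, here is a minimal Python sketch (toy dimensions and 0/1 lists, not our actual bit-packed implementation) of applying the inverse of U to a block by row operations in back-substitution order, without ever forming the inverse:

```python
# Apply U^{-1} to a block T by row operations (back substitution) over
# GF(2), instead of computing U^{-1} explicitly. Over GF(2), an
# invertible triangular matrix has an all-ones diagonal, so no divisions
# are needed and every row addition is just an XOR.

def apply_upper_inverse(U, T):
    """Overwrite T with U^{-1} * T, where U is upper triangular over GF(2)
    with an all-ones diagonal. Processing rows bottom-up, row j already
    holds the solved row when it is XORed into row i < j."""
    n = len(U)
    for i in range(n - 1, -1, -1):
        for j in range(i + 1, n):
            if U[i][j]:
                # row_i(T) += row_j(T)  (addition over GF(2) is XOR)
                T[i] = [a ^ b for a, b in zip(T[i], T[j])]
    return T
```

An analogous forward substitution handles the inverse of the lower triangular matrix L. Because every row addition is an XOR, this maps directly onto word-wide XORs in a bit-packed representation.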
So we don't have to compute the inverse matrices of L or U explicitly. In our implementation, we use blocked matrix multiplication for all the matrix multiplications, for better performance.

Then we proceed to the second operation of Classic McEliece, the key encapsulation. This page shows the details of the encapsulation operation. It first generates a uniform random message with a fixed weight t. Then it performs a matrix-vector multiplication; this is the most time-consuming operation in the encapsulation. It multiplies the public-key matrix by the random message generated in the previous step. We will talk about these two steps in detail. After the random message is generated and the matrix-vector multiplication is done, we can produce the ciphertext and the shared key with the hash functions.

When generating the random message, the spec requires it to be a uniform random vector of length n and weight t. We don't have a PRNG that can generate such a message directly, so the typical method is to generate the indices of the one entries, rejecting any index whose value is beyond the length. Then we still have to check for repetitions among the indices. We check for repetitions by sorting the indices and checking whether there are two adjacent elements with the same value. We claim that the sorting here does not need to be a constant-time sorting algorithm. When a non-constant-time sorting algorithm is used, for example quicksort, the sorting may leak information about the order of two indices under comparison, but it will not leak their real values. So we can use a faster algorithm for checking the repetitions. This method can also be useful for other code-based cryptosystems, for example BIKE or HQC.

Then comes the matrix-vector multiplication. Here we want to reduce the number of times the vector e is reloaded during the multiplication.
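Going back to the message generation for a moment: the rejection-and-sort procedure described above can be sketched in Python as follows (toy parameters; the function name and the `index_bits` parameter are illustrative, not from the spec):

```python
# Sketch of fixed-weight sampling: draw candidate indices, reject
# out-of-range values, then detect repeated indices by sorting and
# comparing adjacent elements. The sort need not be constant-time:
# comparisons on uniform same-length indices leak only their order,
# not their values.
import secrets

def sample_fixed_weight(n, t, index_bits):
    """Return t distinct uniform indices in [0, n), i.e. the support of a
    weight-t, length-n vector. Resamples whenever a duplicate is found."""
    while True:
        # Rejection sampling: draw candidates, discard values >= n.
        idx = []
        while len(idx) < t:
            r = secrets.randbits(index_bits)
            if r < n:
                idx.append(r)
        # Repetition check: sort, then look for equal adjacent elements.
        s = sorted(idx)
        if all(s[i] != s[i + 1] for i in range(t - 1)):
            return idx
```

For example, `sample_fixed_weight(64, 8, 6)` returns 8 distinct indices below 64, describing where the ones of the message vector go.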
The reloading occurs because the public-key matrix is a row-major matrix, so the multiplication is performed as an inner product of each row with the vector e. In other words, we have to reload the vector into the registers whenever a new row is processed. Our strategy for reducing the memory accesses is to process many rows of the matrix together. But we still want to read as many elements of one row as possible, because contiguous reading is faster. So we end up processing the public matrix in a manner of block submatrices: we divide the public matrix into 4 x 96 blocks, and in each block, the corresponding part of the vector e is loaded only once.

Then we talk about the last part, the decapsulation. Besides deriving a shared key, the main computation in the decapsulation is to decode the error vector from the received ciphertext. The decoding algorithm takes two inputs: one is the received ciphertext, and the other is the secret Goppa code from the secret key. In the table, we list the four most important components of the decapsulation and their optimization methods. We optimize the bitsliced multiplication in the FFT components. We use a new method to implement the finite field multiplication in the Berlekamp-Massey algorithm. And last, we optimize the Beneš network by combining the computation of many layers together.

Because the previous implementation uses bitsliced multiplication, we optimize the bitsliced multiplication for the Cortex-M4 here. The tricky part is that we only have 14 registers, but we need to multiply polynomials of 12 terms. One solution is to store the intermediate terms in the floating-point registers when register spilling occurs. But moving data between normal registers and floating-point registers still costs a lot. So we have to find a way of scanning the input operands such that not so many register spills occur. We end up scanning the input operands as in the figure here. In each block of the figure, we have eight terms of input.
Four of the terms come from the polynomial A, and the other four terms from the polynomial B. When moving to the next block, we only need to load new operands from either polynomial A or polynomial B, and some of the intermediate results can be shared between two adjacent blocks. The other point of the figure is that we compute the terms from high degree to low degree. This way, when we compute the low-degree blocks, we can fold the already-computed high-degree terms into them, so we don't need an extra pass to reduce the computed high-degree terms to low degrees.

And here comes the Berlekamp-Massey algorithm. We list the algorithm on this page. It looks complicated, but the actual computation only occurs in line 6, an inner product, and in line 8, a vector multiplied by a scalar. There is one difference from the previous implementation from the NIST submission: the submission uses an inversion-free algorithm, but we compute the inverse of delta in line 8. We think the inverse can be computed fast enough that its cost is small compared to the vector-scalar multiplication here.

Here we show a new implementation of the finite field multiplication. We call this the radix-16 multiplication. Here we have a polynomial A of degree 7, and we can store the polynomial as a 32-bit integer: the constant term is stored at bit 0, the next coefficient at bit 4, then bit 8, and so on. If we store the polynomials in this way, then an integer multiplication performs the binary polynomial multiplication, as the equation on this page shows. The maximum coefficient of the product is 8, when all input coefficients are 1, and 8 is less than 16, so the multiplication won't overflow the radix-16 format. So we implement the finite field multiplication with integer multiplications on data stored in radix-16 form.

We show our results for the Berlekamp-Massey algorithm with various settings in these tables.
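The radix-16 packing described above can be sketched in Python (the nibble layout follows the slide; the helper names are illustrative, and the Cortex-M4 details such as the actual multiply instructions are not modeled here):

```python
# Radix-16 representation: each bit coefficient of a degree-<=7 binary
# polynomial occupies 4 bits of an integer. A plain integer
# multiplication then computes the polynomial product: every product
# coefficient is at most 8 < 16, so the nibbles never carry into each
# other, and reducing each nibble mod 2 recovers the GF(2)[x] product.

def to_radix16(poly_bits):
    """Pack bit coefficients, placing coefficient i at bit 4*i."""
    x = 0
    for i, b in enumerate(poly_bits):
        x |= (b & 1) << (4 * i)
    return x

def radix16_polymul(a_bits, b_bits):
    """Multiply two binary polynomials (degree <= 7) with one integer
    multiplication, reading each product coefficient as nibble mod 2."""
    p = to_radix16(a_bits) * to_radix16(b_bits)
    return [(p >> (4 * i)) & 1 for i in range(15)]
```

Note that the product of two such packed 32-bit values needs twice the width, so on a 32-bit core this is a 32x32-to-64-bit multiplication; Python's unbounded integers hide that detail. The field multiplication then finishes with a reduction modulo the field polynomial, which is where the lazy-reduction saving discussed later comes in.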
We tried every setting to find the fastest implementation, because we did not know in advance which combination would be faster. In the left table, you can see that the plain bitsliced multiplication is actually faster than the radix-16 multiplication. But on the right-hand side, we have the opposite result: Berlekamp-Massey with radix-16 multiplication is faster than the one with bitsliced multiplication. The other result is that Berlekamp-Massey with inversion is indeed faster than the inversion-free version.

So we analyzed the reasons why the radix-16 Berlekamp-Massey is faster. When computing the inner product, the radix-16 multiplication uses a lazy reduction: it accumulates the products of the binary polynomial multiplications and reduces only once, after all the polynomial multiplications are done. The bitsliced multiplication, on the other hand, cannot do the lazy reduction, because the unreduced bitsliced data is larger than what the registers can hold. When computing the vector times scalar, the radix-16 form can do the multiplication with the exact length of the polynomials, but the vector length in the bitsliced form can only be 32, 64, and so on, so the bitsliced multiplication wastes computing power on multiplying unnecessary terms. When raising the degree of the polynomials, the radix-16 form can do this just by adjusting a pointer into the polynomial, but the bitsliced form has to do real logical shifts across registers. We think these are the reasons why the radix-16 Berlekamp-Massey is faster than the bitsliced implementation.

Our last optimization for the decapsulation is about the Beneš network. This technique is actually quite common for this kind of multi-layer structure; for example, the FFT algorithm also has a multi-layer structure. When computing the Beneš network, we can combine the computation of several layers to save memory accesses. For example, in the figure on the slide, we can combine the computation in the middle three layers.
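Returning to the Berlekamp-Massey algorithm for a moment: as a point of reference, a textbook variant over GF(2) looks like the following. (Our implementation works over a larger binary field and differs in the details discussed above; this sketch only shows the overall shape of the algorithm.)

```python
# Textbook Berlekamp-Massey over GF(2): find the shortest LFSR that
# generates the bit sequence s. The two hot spots are the discrepancy
# computation (an inner product) and the update of the connection
# polynomial (a vector update; over GF(2) the scalar factor is always 1,
# so it degenerates to an XOR of shifted coefficients).

def berlekamp_massey_gf2(s):
    """Return (C, L): the connection polynomial C (bit list, C[0] = 1)
    and the LFSR length L for the bit sequence s."""
    C = [1]; B = [1]; L = 0; m = 1
    for n in range(len(s)):
        # Discrepancy d = s[n] + sum_{i=1..L} C[i]*s[n-i]  (over GF(2)).
        d = s[n]
        for i in range(1, L + 1):
            d ^= C[i] & s[n - i]
        if d == 0:
            m += 1
        elif 2 * L <= n:
            T = C[:]
            C += [0] * (len(B) + m - len(C))   # make room for x^m * B(x)
            for i, b in enumerate(B):
                C[i + m] ^= b                  # C(x) += x^m * B(x)
            L = n + 1 - L
            B = T
            m = 1
        else:
            C += [0] * (len(B) + m - len(C))
            for i, b in enumerate(B):
                C[i + m] ^= b
            m += 1
    return C, L
```

For the sequence 1, 1, 0, 1, 1, 0 this finds the length-2 recurrence s[n] = s[n-1] + s[n-2], i.e. C(x) = 1 + x + x^2.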
This avoids loading and writing back all the data for every layer.

Finally, we show our performance results and the conclusion. The numbers in the table include the reading and writing of the flash memory, because we store the public key in the flash memory. There are still some numbers that we cannot show here, because those public keys are larger than the flash memory. We think all the optimization techniques can also be applied to the situation where the public key is streamed in through the network. And there are actually some boards on the market with enough storage space for Classic McEliece.

Comparing to the lattice-based KEM schemes, our encapsulation is about the same speed as the lattice-based finalists, but our decapsulation is about four to seven times slower.

And that's it for my talk. Thank you for listening.