Hello everyone, I'm Vincent Hwang. In this video, I'm going to talk about our paper "NTT Multiplication for NTT-unfriendly Rings". This is joint work with Chi-Ming Marvin Chung, Matthias Kannwischer, Gregor Seiler, Cheng-Jhih Shih, and Bo-Yin Yang. Let's first recall some useful properties of number-theoretic transforms (NTTs). Suppose the integer n is invertible in the ring R. Given an invertible element zeta and a principal n-th root of unity omega, the size-n NTT is an isomorphism from the ring R[x]/(x^n - zeta^n) to a product ring, where the i-th factor is R[x]/(x - zeta * omega^i). If zeta^n is 1, then we call it a cyclic NTT. If zeta^n is -1, then we call it a negacyclic NTT. Additionally, if zeta is a power of omega, then we can use this definition of the NTT to explain each splitting step of an NTT. This will be shown on the next slide. As a ring isomorphism, the NTT provides alternative approaches for multiplying polynomials and for accumulating products. For multiplying a and b, we can first apply the NTT to a and b, multiply them pointwise, and apply the inverse NTT. This multiplicative property is also called the convolution theorem. Since the NTT can be computed efficiently, we can also compute the product efficiently. For accumulating products, we can proceed with NTTs in a similar fashion, although it is not yet clear what benefit we draw from the NTTs there. This property is actually crucial for this talk, as we will see in the next few slides. Fast Fourier transforms are a family of algorithms for computing NTTs efficiently. In our paper, we make use of three FFT tricks. The first one is the Cooley-Tukey FFT. After applying the cyclic NTT, for the second isomorphism we apply the observation from the previous slide by setting zeta to omega^i. The second trick we use is the twisted FFT, attributed to Gentleman and Sande.
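Before going further, the size-n NTT and the convolution theorem described above can be made concrete with a toy Python sketch. This is my own illustration with made-up small parameters (q = 257, n = 4, zeta = 4, omega = 16), not code from the paper:

```python
# Toy negacyclic NTT over Z_q with q = 257, n = 4 (illustrative parameters).
# zeta = 4 satisfies zeta^4 = -1 mod 257, and omega = zeta^2 = 16 is a
# principal 4th root of unity, so the NTT maps Z_q[x]/(x^4 + 1) to the
# product of the four rings Z_q[x]/(x - zeta * omega^i).

q, n = 257, 4
zeta, omega = 4, 16

def ntt(a):
    # evaluate a at the points zeta * omega^i for i = 0, ..., n-1
    return [sum(a[j] * pow(zeta * pow(omega, i, q) % q, j, q) for j in range(n)) % q
            for i in range(n)]

def intt(ah):
    # interpolation: a_j = n^{-1} * zeta^{-j} * sum_i ah_i * omega^{-i*j}
    ninv = pow(n, -1, q)
    return [ninv * pow(zeta, -j, q) * sum(ah[i] * pow(omega, -i * j, q)
                                          for i in range(n)) % q
            for j in range(n)]

def negacyclic_mul(a, b):
    # reference: schoolbook multiplication modulo x^n - zeta^n (zeta^n = -1 here)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            sign = 1 if i + j < n else pow(zeta, n, q)  # wraparound picks up zeta^n
            c[(i + j) % n] = (c[(i + j) % n] + sign * a[i] * b[j]) % q
    return c

a, b = [1, 2, 3, 4], [5, 6, 7, 8]
via_ntt = intt([x * y % q for x, y in zip(ntt(a), ntt(b))])
assert via_ntt == negacyclic_mul(a, b)  # the convolution theorem
```

Here zeta^n = -1, so this is a negacyclic NTT; setting zeta = 1 in the same code gives a cyclic NTT.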
The twisted FFT twists the polynomial rings into cyclic ones whenever possible. So, for the second isomorphism, we transform the polynomial ring R[x]/(x^{n1} - omega^{n1*i}) into the polynomial ring R[x]/(x^{n1} - 1). We then continue with cyclic NTTs. The third trick is Good's trick. Good's trick transforms a one-dimensional cyclic convolution into a multi-dimensional cyclic convolution. After the transformation, we can apply the Cooley-Tukey FFT in each of the dimensions for computing the cyclic convolutions. In our paper, we use a two-dimensional transformation: we apply the Cooley-Tukey FFT in the dimension of y and compute a cyclic convolution in the dimension of z. The most important operations for implementing NTTs are modular multiplications. On Cortex-M4, we implement 32-bit modular multiplication by utilizing the long-multiplication instructions. For AVX2, we implement 16-bit modular multiplication by utilizing the multiplication instructions returning the high product. The main reason for this is that there are no 32-bit multiplication instructions returning the high product in AVX2. Saber is a lattice-based KEM based on Module Learning With Rounding. The polynomial ring chosen by Saber is R_q = Z_8192[x]/(x^256 + 1). The security of Saber is specified by the module dimension l, and the most time-consuming operation is the matrix-vector multiplication. The matrix A is an l-by-l matrix where each of the components is a polynomial in R_q. For the polynomials in the vector, all of the coefficients are within plus or minus mu/2. We now take a closer look at how NTT-friendly and how NTT-unfriendly Saber is. First, the polynomial modulus x^256 + 1 is negacyclic. This implies that if we can define negacyclic NTTs, then we don't need to expand the polynomial degree. Next, we look at the coefficient ring Z_8192. In this ring, we cannot define NTTs as ring isomorphisms.
Therefore, the coefficient ring is regarded as NTT-unfriendly. A straightforward solution for this is to choose a large modulus, or several moduli, bounding the maximum value of one polynomial multiplication. But NTTs actually give us much more than multiplying two polynomials. In particular, for a well-defined NTT, a summation of several products of polynomials can be computed with the aid of NTTs as follows. First, we apply NTTs to all the polynomials. Then, we accumulate the pointwise products, and finally, we apply the inverse NTT once. This characterization suggests that the structure of the matrix-vector multiplication is actually NTT-friendly, because we only need to transform the vector once, and at the end we only need to compute the inverse NTTs for a single vector. For the NTTs for the matrix-vector multiplication, we start by computing the result as if Z were the coefficient ring. The maximum value is then bounded by n * q * (mu/2) * l. On Cortex-M4, we choose a 32-bit prime q' bounding the maximum value. Then, we compute NTTs over Z_{q'}. For AVX2, we choose two 16-bit primes, p0 and p1, whose product bounds the maximum value. Then, we compute NTTs over Z_{p0} and Z_{p1}. After computing the entire product modulo these two primes, we apply the CRT to derive the result as if Z were the coefficient ring. For computing the entire product with NTTs modulo a given prime, we need l + l^2 NTTs for transforming the vector and the matrix, l^2 pointwise multiplications, and l inverse NTTs. We made the following decisions for our Cortex-M4 implementation. First, we compute incomplete NTTs, stopping at 4-coefficient polynomials. This is because there are only 14 general-purpose registers on Cortex-M4, and we find that computing 3 layers of NTTs at a time is the most economical choice. Therefore, in total we compute 6 layers of NTTs, giving 4-coefficient polynomials.
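The "transform the vector once, accumulate pointwise, invert once per output entry" structure of the matrix-vector multiplication can be sketched as follows. The parameters q = 257, n = 4, l = 2, omega = 16 are toy values of my own choosing for illustration, not Saber's:

```python
# Toy sketch of NTT-based matrix-vector multiplication b = A*s over
# Z_q[x]/(x^n - 1), with illustrative parameters q = 257, n = 4, l = 2;
# omega = 16 is a principal 4th root of unity mod 257.

q, n, l = 257, 4, 2
omega = 16

def ntt(f):
    # cyclic NTT: evaluate f at omega^i for i = 0, ..., n-1
    return [sum(f[j] * pow(omega, i * j, q) for j in range(n)) % q for i in range(n)]

def intt(fh):
    # inverse cyclic NTT (interpolation)
    ninv = pow(n, -1, q)
    return [ninv * sum(fh[i] * pow(omega, -i * j, q) for i in range(n)) % q
            for j in range(n)]

def cyclic_mul(f, g):
    # reference: schoolbook multiplication modulo x^n - 1
    c = [0] * n
    for i in range(n):
        for j in range(n):
            c[(i + j) % n] = (c[(i + j) % n] + f[i] * g[j]) % q
    return c

def matvec_ntt(A, s):
    s_hat = [ntt(sj) for sj in s]                 # l forward NTTs
    b = []
    for i in range(l):
        acc = [0] * n
        for j in range(l):
            a_hat = ntt(A[i][j])                  # l^2 forward NTTs in total
            acc = [(x + y * z) % q for x, y, z in zip(acc, a_hat, s_hat[j])]
        b.append(intt(acc))                       # only l inverse NTTs
    return b

A = [[[1, 2, 3, 4], [2, 0, 1, 5]],
     [[7, 1, 0, 3], [2, 2, 2, 2]]]
s = [[1, 0, 2, 1], [3, 1, 0, 1]]

# cross-check against the schoolbook matrix-vector multiplication
ref = []
for i in range(l):
    acc = [0] * n
    for j in range(l):
        acc = [(x + y) % q for x, y in zip(acc, cyclic_mul(A[i][j], s[j]))]
    ref.append(acc)
assert matvec_ntt(A, s) == ref
```

Counting the calls, this performs l + l^2 forward NTTs and only l inverse NTTs, matching the operation count above.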
The second decision is how to accumulate the products of 4-coefficient polynomials. For example, suppose we are going to compute h as a summation of products p_i * q_i, and we focus on the constant term of the result. Then, we see that there are at least two approaches for computing the constant term. The first approach is to compute the 64-bit value of each summand, reduce it to 32 bits immediately, and finally accumulate the 32-bit results. The second approach is to accumulate the 64-bit values of the constant term, so we only need to reduce to a 32-bit value once. On Cortex-M4, due to register pressure, it is not obvious which approach is faster. Our experiments show that the second approach is slightly faster than the first approach. Furthermore, we expect the second approach to be a lot faster than the first approach if many more registers are available. Now, we take a look at multiplying polynomials in NTRU. NTRU has six parameter sets in total. In our paper, we implement four of the parameter sets; the largest two parameter sets were introduced after we submitted our paper. The polynomial ring for polynomial multiplication in NTRU is Z_q[x]/(x^n - 1), where q is a power of 2 and n is a prime. We are targeting the polynomial multiplications where one of the multiplicands is ternary, that is, all of its coefficients are within plus or minus 1. In this figure, I only focus on the differences between the multiplications in NTRU and in NTRU Prime. In NTRU Prime, we are given two primes specifying the polynomial ring: the degree p is prime, and the coefficient modulus q is prime, so the coefficient ring is a field. I'll compare the parameter set 761 in NTRU Prime with the parameter sets 677 and 701 in NTRU. The two implementations are almost the same except for the reductions to the target polynomial rings, and in both implementations, the reduction by the polynomial modulus is performed before the reduction of the coefficient ring.
First of all, looking at the polynomial modulus x^n - 1 in NTRU, we find that reducing to a coefficient requires one addition. This also implies that the maximum value, if we regard Z as the coefficient ring, is n times q. On the other hand, the polynomial modulus in NTRU Prime is x^p - x - 1, so reducing to a single entry requires two additions. This also implies that the maximum value of the result in Z is about (2p - 1) times q. So already because of the polynomial modulus, NTRU Prime needs one more addition on average, and we also need to choose a larger prime for computing the result in Z. Next, we look at the reductions of the coefficient rings. For NTRU, the coefficient modulus is a power of 2. Therefore, we can pack two halfwords before the reduction, and after packing into a register, we can reduce two coefficients at a time with a logical AND instruction. So on average, each coefficient needs only 0.5 cycles for the reduction. For NTRU Prime, because the coefficient ring is a prime field, we need 32-bit arithmetic and reduce with a two-cycle Barrett reduction; after the reduction, we pack two halfwords into a word. Therefore, on average, we need two cycles for reducing each entry. Counting the polynomial-modulus reduction and the coefficient-ring reduction together, NTRU saves about 2.5 cycles for reducing each coefficient. This accounts for about 1750 cycles in the cycle count. In fact, the measured gap is slightly greater, and this is because the polynomials in NTRU are shorter than in NTRU Prime. I'll now talk a little bit about some implementation considerations with AVX2. For the AVX2 implementation, since we implement two 16-bit NTTs, we have to compute a CRT for obtaining the result in Z. For the CRT, we implement the divided-difference form of the computation, and we find that this approach is more favorable when there are only two products to combine for the CRT.
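The divided-difference form of a two-prime CRT can be sketched as follows. The primes 7681 and 12289 are NTT-friendly 16-bit examples of my own choosing, not necessarily the primes used in the paper:

```python
# Divided-difference (Garner-style) form of a two-prime CRT: reconstruct
# v mod (p0*p1) from r0 = v mod p0 and r1 = v mod p1 as v = r0 + p0*t,
# with t = (r1 - r0) * p0^{-1} mod p1. Only one modular reduction (mod p1)
# is needed, which is why this form is attractive for just two moduli.
# p0 and p1 are example 16-bit primes, for illustration only.

p0, p1 = 7681, 12289
P0_INV_MOD_P1 = pow(p0, -1, p1)   # precomputed once

def crt2(r0, r1):
    t = (r1 - r0) * P0_INV_MOD_P1 % p1
    return r0 + p0 * t

v = 12345678                      # any 0 <= v < p0 * p1
assert crt2(v % p0, v % p1) == v
```

The same idea extends to more moduli, but with only two products the single mixed-radix step above is all that is needed.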
Another important consideration is the magnitude of the intermediate results. We compute the worst-case intervals for the given input intervals. For the computation of the intervals, we follow precisely the arithmetic instructions and track precisely the resulting ranges. We now look at the results of our implementations. The most time-consuming operation in Saber is the matrix-vector multiplication. On Cortex-M4, we obtain a reduction of 61% of the cycle count, and we obtain a reduction of 32% on Skylake. Another important piece of polynomial arithmetic in Saber is the inner product. On Cortex-M4, we reduce the cycle count by 42%, and on Skylake, we reduce it by 55%. On Cortex-M4, the reduction ratio for the inner product is smaller than the ratio for the matrix-vector multiplication. This is expected, because the structure of the matrix-vector multiplication is more NTT-friendly than the structure of the inner product. What may look a little odd is that on Skylake, we obtain a greater reduction ratio for the inner product. The main reason is that for the inner product on Skylake, we compute it by assuming one of the vectors is already transformed by NTTs, so the cycle count excludes three forward NTTs. On Cortex-M4, we do not do so, because we want to keep the implementation compatible with the reference implementation. These are the results for the full scheme of Saber. On Cortex-M4, we obtain a reduction from 22% to 26%. On Skylake, we obtain a reduction from 5% to 10%. These figures are for the CCA-secure scheme; we also have the results for the CPA-secure version in our paper. These are the results for multiplying a polynomial with a ternary polynomial in NTRU. On Cortex-M4, we obtain a reduction for all of the parameter sets, even for the smallest parameter set, and this sets a new record for applying NTTs on Cortex-M4.
On Skylake, we are not able to obtain a speedup for the smallest parameter set, but we still achieve a reduction between 7% and 15% for all other parameter sets. This implies that as the security level becomes larger and larger, we believe that on both Cortex-M4 and Skylake with AVX2, NTTs become more favorable for NTRU. These are the results for the full scheme of NTRU. We ignore the cycle counts of key generation, because key generation is dominated by the inversion of polynomials, and we are not targeting the inversion here. For the encapsulation and decapsulation of the parameter sets 677 and 701, on Cortex-M4 we obtain a reduction between 3% and 6%, and on Skylake we obtain a reduction between 1% and 2%. We also implement NTTs for the round-2 submission LAC. The polynomial ring chosen by LAC is Z_q[x]/(x^n + 1), where q is 251 and n is a power of 2, namely 512 or 1024. In LAC, the polynomial multiplications are also of the form where one of the multiplicands is ternary. While implementing the NTTs, we found that the approach in the previous implementation takes roughly quadratic time in n, and we believe this is the main reason why we are able to obtain a vast speedup for multiplying two polynomials. Due to this vast speedup, we obtain a reduction of the cycle counts of the full LAC scheme between 67% and 79% on Cortex-M4, and between 20% and 61% on Skylake. In conclusion, we find that even though the coefficient rings of Saber, NTRU, and LAC are NTT-unfriendly, we can still benefit from NTTs. In particular, for Saber, the polynomial modulus is NTT-friendly, and the structure of the matrix-vector multiplication is also NTT-friendly. For NTRU, since the degrees of the polynomials are large, we can still benefit from NTTs. And for LAC, both the polynomial modulus and the degrees of the polynomials are such that NTTs are very useful.
For computing the result as in Z, on Cortex-M4 we choose a 32-bit prime bounding the maximum result, and for AVX2 we choose two 16-bit primes whose product bounds the maximum result. There are some related works worth noting here. In our paper, we optimize Saber on Cortex-M4 only for speed, and we think there is much room left for integrating stack optimizations of Saber. For even more stack-optimized Saber on Cortex-M4 and more about NTTs, there is the paper on multi-moduli NTTs for Saber on Cortex-M3 and Cortex-M4. For NTTs with 64-bit Neon on Armv8-A, there is the paper "Neon NTT: Faster Dilithium, Kyber, and Saber on Cortex-A72 and Apple M1". That paper also introduces a three-instruction single-width modular multiplication, and explains how to multiply polynomials in Z_q[x]/(x^2 - zeta) or Z_q[x]/(x^4 - zeta) without requiring the existence of a square root or a fourth root of zeta. Thank you for your attention.