Hello, everyone. I'm Vincent Huang. Currently, I'm a master's student at National Taiwan University. In this video, I'll talk about our paper on polynomial multiplication in NTRU Prime. This is joint work with several authors. NTRU Prime is an alternate candidate in the third round of the NIST post-quantum cryptography standardization. There are in total six parameter sets. In our paper, we focus on the parameter set where p is equal to 761. In this video, I'll also show our results for all the other parameter sets on Cortex-M4. I'll show the numbers for multiplying polynomials and for the full schemes.

Given two primes p and q, NTRU Prime selects the polynomial ring Z_q[x]/(x^p - x - 1), which is a large finite field. For a polynomial, if all of its coefficients are in {-1, 0, 1}, then we call it small. If there are exactly w non-zero coefficients, we call it weight-w. If it is small and weight-w at the same time, then we call it a short polynomial. We focus on the case where one of the multiplicands is small and make no assumption on the other multiplicand. If both of the multiplicands are small, then there are faster computations for this case. If you are interested in this case, you can check the reference; I'm not going to go into the details in this video.

In our paper, we propose two different approaches for multiplying polynomials in NTRU Prime. Our first approach is based on Good's trick, mapping a one-dimensional convolution into a two-dimensional convolution. Our second approach is the mixed-radix approach. For the mixed-radix approach, we propose two different implementations. The first implementation utilizes several small radices. The second implementation handles the large radix with Rader's trick.

Now, I'm going to talk about convolutions and how to apply a convolution for multiplying polynomials in NTRU Prime. For a polynomial ring, if the polynomial modulus f(x) is equal to x^n - 1 for some n, then we call multiplication in this polynomial ring a convolution.
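
To make the definitions above concrete, here is a minimal Python sketch. It checks whether a coefficient list is small, weight-w, or short, and it multiplies two polynomials modulo x^n - 1, which is the operation the talk calls a convolution. The helper names and the toy inputs are assumptions made for this illustration and are not taken from the paper.

```python
def is_small(poly):
    """All coefficients lie in {-1, 0, 1}."""
    return all(c in (-1, 0, 1) for c in poly)


def weight(poly):
    """Number of non-zero coefficients."""
    return sum(1 for c in poly if c != 0)


def is_short(poly, w):
    """Small and weight-w at the same time."""
    return is_small(poly) and weight(poly) == w


def cyclic_convolution(a, b, n):
    """Multiply a and b in Z[x]/(x^n - 1): exponents wrap around modulo n."""
    result = [0] * n
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            result[(i + j) % n] += ai * bj
    return result


# Toy inputs (hypothetical): (1 + 2x + 3x^2) * (-1 + x^2) modulo x^3 - 1.
a = [1, 2, 3]
b = [-1, 0, 1]
print(is_short(b, 2))               # True
print(cyclic_convolution(a, b, 3))  # [1, 1, -2]
```

The schoolbook double loop costs n^2 multiplications, which is the baseline the rest of the talk improves on.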
If we look at the polynomial modulus of NTRU Prime, x^p - x - 1, then multiplications in this ring are clearly not convolutions. We observe that the degree of the product of A and B is bounded by 2p - 2. Therefore, we can choose an n greater than 2p - 2 and compute the product as a size-n convolution. After computing the product, we then reduce modulo the polynomial modulus x^p - x - 1.

Our first approach is based on Good's trick. Good's trick converts a one-dimensional convolution into a multi-dimensional convolution by permuting the coefficients. Suppose we are given two coprime integers q0 and q1. We can permute the coefficients according to the map sending x to yz, under the agreement that y behaves as a q0-th root of unity and z as a q1-th root of unity (y^q0 = 1 and z^q1 = 1). After applying the permutation, we compute size-q0 NTTs on the dimension of y.

Number-theoretic transforms. A number-theoretic transform converts a convolution into pointwise multiplications. For a polynomial A, the number-theoretic transform of A is the n-tuple derived by replacing x with the powers of ζ, where ζ is an n-th root of unity. The convolution theorem tells us that the product of A and B can be computed by first applying the NTT to A and B, multiplying them pointwise, and applying the inverse NTT. Efficient algorithms for computing NTTs are also called FFTs.

Now we return to Good's trick and focus on the number of multiplications before and after the isomorphism. On the left-hand side, a straightforward approach for computing the convolution requires about q0^2 q1^2 multiplications. On the right-hand side, we first apply the permutation to have a two-dimensional convolution. Then we apply NTTs on the dimension of y. In total, we only need on the order of q0 q1^2 + q0^2 q1 multiplications. In the appendix, I give an example showing that Good's trick is already fast for computing small convolutions such as multiplication modulo x^6 - 1.

The Cooley-Tukey FFT is the most commonly seen fast Fourier transform.
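
The permutation behind Good's trick can be sketched in a few lines of Python. The example uses the size-6 case with q0 = 2 and q1 = 3: index i of the one-dimensional vector moves to position (i mod q0, i mod q1), which is what sending x to yz amounts to. The function names and the test vectors are assumptions made for this illustration; the two-dimensional product is computed by schoolbook multiplication here only to check the isomorphism, not to be fast.

```python
def cyclic_convolution(a, b, n):
    """Schoolbook multiplication in Z[x]/(x^n - 1)."""
    result = [0] * n
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            result[(i + j) % n] += ai * bj
    return result


def good_permute(a, q0, q1):
    """Send x^i to y^(i mod q0) * z^(i mod q1); a bijection because gcd(q0, q1) = 1."""
    grid = [[0] * q1 for _ in range(q0)]
    for i, coeff in enumerate(a):
        grid[i % q0][i % q1] = coeff
    return grid


def good_unpermute(grid, q0, q1):
    """Inverse of good_permute."""
    return [grid[i % q0][i % q1] for i in range(q0 * q1)]


def cyclic_convolution_2d(A, B, q0, q1):
    """Schoolbook multiplication in Z[y, z]/(y^q0 - 1, z^q1 - 1)."""
    C = [[0] * q1 for _ in range(q0)]
    for i0 in range(q0):
        for i1 in range(q1):
            for j0 in range(q0):
                for j1 in range(q1):
                    C[(i0 + j0) % q0][(i1 + j1) % q1] += A[i0][i1] * B[j0][j1]
    return C


# Toy check (hypothetical inputs) for x^6 - 1 with q0 = 2, q1 = 3.
q0, q1 = 2, 3
a = [1, 2, 3, 4, 5, 6]
b = [1, 0, -1, 0, 1, 1]
direct = cyclic_convolution(a, b, q0 * q1)
via_2d = good_unpermute(
    cyclic_convolution_2d(good_permute(a, q0, q1), good_permute(b, q0, q1), q0, q1),
    q0, q1)
print(direct == via_2d)  # True
```

The saving described in the talk comes from replacing the loops over the y dimension with an NTT, so that only the z dimension is multiplied out.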
Given an invertible element θ, we have the isomorphism from the ring on the left-hand side, with polynomial modulus x^n - θ^n, to the product ring on the right-hand side, where the polynomial modulus of each ring is x - θ ζ^i. We apply this isomorphism by observing that roots of unity are just invertible elements. Suppose we are going to compute a size-N0·N1 NTT. For applying the Cooley-Tukey FFT, we first compute size-N0 NTTs to have a product ring. For the second isomorphism, we apply the isomorphism on the previous slide by setting θ to ζ^i. So eventually, we have a product ring where each ring is a product ring. If N0 and N1 are powers of 2, then the Cooley-Tukey FFT is very fast. If N0 and N1 do not share the same radix, then we call such a Cooley-Tukey FFT a mixed-radix computation.

For our first approach, if we want to apply a size-512 NTT on the dimension of y, then 512 has to divide q - 1, which is not the case. So how can we overcome this? We compute as if Z is the coefficient ring, and after computing the product, we then reduce the coefficient ring to Z_q. On Cortex-M4 with powerful 32-bit arithmetic, we can choose a large prime q' that bounds the maximum value of the result and compute the NTT modulo q'. For the size-512 NTT, we apply the Cooley-Tukey FFT by observing that 512 is just 2^9. So eventually, the result is a bit-reversal of the result from the straightforward application of the size-512 NTT. Notice that the goal is simply computing the product. So as long as we design the inverse NTT by assuming the input is already in bit-reversed order, we can compute the product in normal order. After applying the inverse NTT, we have to multiply each coefficient by the inverse of 512, reduce the coefficient ring to Z_q, and finally reduce modulo the polynomial modulus x^761 - x - 1. We can instead first reduce modulo the polynomial modulus to halve the number of reductions to Z_q. When we reduce the polynomial modulus first, the maximum value of the result is bounded by q(2p - 1).
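
The idea of computing as if Z were the coefficient ring can be sketched with toy numbers. Everything specific below is an assumption made for the illustration: p = 7 and q = 31 are stand-ins (not an NTRU Prime parameter set), the convolution size is 16, and q' = 257 is a prime with 16 | q' - 1 that exceeds p(q - 1), so the integer product is recovered exactly after centering. The NTT is a textbook recursive radix-2 Cooley-Tukey transform in natural order, not the in-place bit-reversed variant described in the talk.

```python
P, Q = 7, 31     # toy stand-ins for p and q (hypothetical values)
N = 16           # power-of-two convolution size with N > 2P - 2
Q_PRIME = 257    # prime with N | Q_PRIME - 1 and Q_PRIME > P * (Q - 1)
OMEGA = pow(3, (Q_PRIME - 1) // N, Q_PRIME)  # 3 generates Z_257^*, so OMEGA has order N


def ntt(a, omega, mod):
    """Recursive radix-2 Cooley-Tukey: returns [a(omega^k) for k in range(len(a))]."""
    n = len(a)
    if n == 1:
        return list(a)
    even = ntt(a[0::2], omega * omega % mod, mod)
    odd = ntt(a[1::2], omega * omega % mod, mod)
    out = [0] * n
    w = 1
    for k in range(n // 2):
        t = w * odd[k] % mod
        out[k] = (even[k] + t) % mod
        out[k + n // 2] = (even[k] - t) % mod
        w = w * omega % mod
    return out


def center(x, mod):
    """Representative of x modulo mod that is closest to zero."""
    x %= mod
    return x - mod if x > mod // 2 else x


def reduce_ring(c):
    """Reduce modulo x^P - x - 1 (using x^P = x + 1, top down), then modulo Q."""
    c = list(c)
    for i in range(len(c) - 1, P - 1, -1):
        c[i - P] += c[i]
        c[i - P + 1] += c[i]
        c[i] = 0
    return [x % Q for x in c[:P]]


def multiply(big, small):
    """big * small in Z_Q[x]/(x^P - x - 1), via a size-N NTT modulo Q_PRIME."""
    fa = ntt([c % Q_PRIME for c in big] + [0] * (N - P), OMEGA, Q_PRIME)
    fb = ntt([c % Q_PRIME for c in small] + [0] * (N - P), OMEGA, Q_PRIME)
    fc = [x * y % Q_PRIME for x, y in zip(fa, fb)]
    inv_omega = pow(OMEGA, Q_PRIME - 2, Q_PRIME)
    inv_n = pow(N, Q_PRIME - 2, Q_PRIME)
    # Centering recovers the exact integer product because |coefficient| <= P*(Q-1)/2.
    c = [center(x * inv_n, Q_PRIME) for x in ntt(fc, inv_omega, Q_PRIME)]
    return reduce_ring(c)


def schoolbook(big, small):
    """Reference: integer schoolbook product followed by the same reductions."""
    c = [0] * (2 * P - 1)
    for i, x in enumerate(big):
        for j, y in enumerate(small):
            c[i + j] += x * y
    return reduce_ring(c)


big = [15, -15, 7, -3, 0, 12, -9]   # coefficients centered around zero modulo Q
small = [1, -1, 0, 1, 1, -1, 0]     # a small polynomial
print(multiply(big, small) == schoolbook(big, small))  # True
```

The sketch reduces modulo the polynomial first and modulo q second, the order the talk suggests for halving the number of coefficient reductions. Python integers do not overflow, so the tighter q' bound required for that order on 32-bit hardware is not exercised here.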
Additionally, for the small polynomial, if it is actually a short polynomial, then we can replace p with w in the above conditions.

Our second approach is the mixed-radix approach. We propose two different implementations. For the first implementation, we look at the small prime factors 2, 3, and 5 of q - 1 for defining NTTs. After applying one layer of radix-2 NTT, three layers of radix-3 NTT, and one layer of radix-5 NTT, the convolution is transformed into several multiplications of 6-coefficient polynomials. For the second implementation, we look at the prime factors 17 and 3. First of all, we apply a size-17 NTT, and then we apply a size-9 NTT. For the size-17 NTT, we apply Rader's trick for efficient computation.

Rader's trick converts part of the computation of a size-p NTT into a size-(p - 1) convolution by permuting the coefficients. For a prime p, there is always a generator g giving a bijection of size p - 1 that sends i to g^i. After permuting the coefficients, we see that in the second equality, the exponents of g sum to a fixed j. This is exactly the pattern of a convolution. Here is a small example for a size-5 NTT. After permuting the coefficients as indicated by sending i to g^i, we see that this part of the computation is just a size-4 convolution.

I'm going to talk about our implementations using some unique features of Cortex-M4. Cortex-M4 implements the Armv7-M architecture with the DSP and single-precision floating-point extensions. The DSP extension is crucial for arithmetic in Z_q. For the floating-point extension, we are not using floating-point arithmetic but using the floating-point registers as temporary storage. In the DSP extension, there are instructions multiplying specific halves of the operands. There are also instructions that multiply two halfwords at a time. The second product is then added to or subtracted from the first product. One could also choose to accumulate the result into an accumulator. Another useful instruction is SMMULR, which is useful in implementing 32-bit Barrett reduction.
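
Rader's trick as described above can be sketched in Python as follows. The sketch assumes q = 4591 (the NTRU Prime modulus that goes with p = 761) so that both 5 and 17 divide q - 1. The helper names, the brute-force searches for a root of unity and a generator, and the indexing convention (inputs permuted by g^s, outputs by g^(-t)) are choices made for this illustration and may differ from the slides. The size-(p - 1) convolution is done by schoolbook multiplication for clarity.

```python
def find_root_of_unity(p, q):
    """Return an element of order p in Z_q^* (p prime, p dividing q - 1)."""
    for x in range(2, q):
        w = pow(x, (q - 1) // p, q)
        if w != 1:
            return w
    raise ValueError("no element of order p found")


def find_generator(p):
    """Return a generator g of the multiplicative group modulo the prime p."""
    for g in range(2, p):
        if len({pow(g, s, p) for s in range(p - 1)}) == p - 1:
            return g
    raise ValueError("no generator found")


def naive_ntt(a, omega, q):
    """Reference size-p NTT straight from the definition."""
    p = len(a)
    return [sum(a[i] * pow(omega, i * k, q) for i in range(p)) % q
            for k in range(p)]


def rader_ntt(a, omega, q):
    """Size-p NTT for prime p, with the bulk done as a size-(p - 1) cyclic convolution."""
    p = len(a)
    m = p - 1
    g = find_generator(p)
    # Permute the inputs by i = g^s and the twiddle factors by g^(-s).
    u = [a[pow(g, s, p)] for s in range(m)]
    v = [pow(omega, pow(g, (m - s) % m, p), q) for s in range(m)]
    # conv[t] = sum over s of u[s] * v[t - s]: the index pairs sum to a fixed t.
    conv = [0] * m
    for s in range(m):
        for t in range(m):
            conv[(s + t) % m] = (conv[(s + t) % m] + u[s] * v[t]) % q
    out = [0] * p
    out[0] = sum(a) % q
    for t in range(m):
        k = pow(g, (m - t) % m, p)   # output index g^(-t)
        out[k] = (a[0] + conv[t]) % q
    return out


q = 4591
for size in (5, 17):
    omega = find_root_of_unity(size, q)
    a = list(range(1, size + 1))     # toy input (hypothetical)
    print(size, rader_ntt(a, omega, q) == naive_ntt(a, omega, q))
```

In a real implementation the inner convolution would itself be computed with a fast method, which is where the benefit for the size-17 NTT would come from.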
The instruction first multiplies a with the precomputed rounded inverse of q and extracts the upper 32 bits of the result with rounding. We then apply MLS to reduce the value in a. Another important feature of Cortex-M4 is that the long multiplications and their accumulating variants are single-cycle. After acquiring the 64-bit product of a and b, we multiply the low register with the Montgomery factor. Finally, we multiply t with the modulus and accumulate the result into the 64-bit accumulator. The result of the Montgomery multiplication is then in the high register.

In our implementations of the mixed-radix approaches, we commonly have to compute butterflies whose radices are not powers of two. These are implementations of radix-3 butterflies. If we look at the computation of t1, then we see that t1 can be computed by the instruction SMLAD. After computing three double-size products, we then apply three Barrett reductions to reduce them to Z_q. For our Good's trick approach, we commonly compute three layers of radix-2 butterflies. First of all, we move the twiddle factor from the floating-point register to the general-purpose register r1. We then perform four Montgomery multiplications followed by four add-sub pairs. We can also group the add instructions together to save code size. Next, we compute another two layers of radix-2 butterflies. Another important idea for implementing NTTs is that we can design special butterflies utilizing the fact that half of the entries are zeros initially and permuting the coefficients on the fly. These are the figures for the data flow. The details are shown in our paper.

These two tables are our results for multiplying polynomials in NTRU Prime. The first table is in our paper. We find that NTT-based multiplications, whether using Good's trick, Rader's trick, or several small radices, are all far faster than [unclear]. The second table contains the results from our follow-up work. We find that for the smallest two parameter sets, Rader's trick is the fastest.
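
The Barrett reduction and Montgomery multiplication described above can be modeled in Python to see why they work. This is a sketch of the arithmetic only, not the authors' assembly. It assumes q = 4591 as the modulus and Python 3.8 or later for the modular inverse. The instruction names SMULL, MUL, and SMLAL in the comments are my reading of "long multiplication" and its accumulating variant, and the constant names are hypothetical.

```python
Q = 4591                                   # modulus assumed for this sketch
Q_INV_ROUNDED = (2**32 + Q // 2) // Q      # round(2^32 / Q), a precomputed constant
Q_NEG_INV = (-pow(Q, -1, 2**32)) % 2**32   # -Q^(-1) mod 2^32, the "Montgomery factor"


def to_signed32(x):
    """Interpret the low 32 bits of x as a signed 32-bit integer."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x >= (1 << 31) else x


def smmulr(a, b):
    """Model of SMMULR: upper 32 bits of the signed 64-bit product, with rounding."""
    return (a * b + (1 << 31)) >> 32


def barrett_reduce(a):
    """Return a - round(a / Q) * Q for a signed 32-bit a."""
    t = smmulr(a, Q_INV_ROUNDED)           # t is approximately round(a / Q)
    return a - t * Q                       # MLS: subtract t * Q from a


def montgomery_mul(a, b):
    """Return a value congruent to a * b * 2^(-32) modulo Q."""
    prod = a * b                           # SMULL: 64-bit signed product in (lo, hi)
    t = to_signed32(prod * Q_NEG_INV)      # MUL: lo * (-Q^(-1)) modulo 2^32
    acc = prod + t * Q                     # SMLAL: accumulate t * Q into (lo, hi)
    return acc >> 32                       # the low word is zero, the result sits in hi


# A twiddle factor stored in Montgomery form (b * 2^32 mod Q) gives a * b back.
a, b = 1234, 4321
b_mont = (b << 32) % Q
print(montgomery_mul(a, b_mont) % Q == (a * b) % Q)      # True
print((barrett_reduce(123456789) - 123456789) % Q == 0)  # True
```

Both routines return signed representatives rather than values in [0, q), which is why the butterflies can chain several of them before a final normalization.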
On the other hand, as the polynomial degree gets larger and larger, we find that switching the coefficient ring to a larger coefficient ring is the most beneficial approach, for which Good's trick is useful. These are the cycle counts of the full schemes shown in our paper. And this table shows the results from our follow-up work. All these implementations are already merged into pqm4. For Streamlined NTRU Prime, the key generation now uses NTTs for computing jump steps.

In summary, for multiplying polynomials in NTRU Prime, we first have to decide whether we want to compute the result as if the coefficient ring is Z or not. If we decide to compute as in Z, then we can freely choose an n for fast computation with NTTs. We recommend choosing n as a product of a power of 2, a small power of 3, and possibly a 5. If n is divisible by 3 or 5, then Good's trick is very useful. For deciding the large modulus q' to utilize the powerful 32-bit arithmetic on Cortex-M4, we first have to decide whether we want to reduce the coefficient ring before reducing the polynomial modulus or not. If we decide to reduce the coefficient ring first, then the maximum value is bounded by qp. If we reduce the polynomial modulus first, then the maximum value is bounded by q(2p - 1). Additionally, we can also look at the small polynomial. If the small polynomial is in fact a short polynomial, we can replace p with w in the above conditions for determining the large modulus q'.

On the other hand, if we decide to compute as in Z_q, then the size of the NTT is restricted to a divisor d of q - 1. We then choose an n for computing the convolution. If the quotient n/d is small, then after applying the size-d NTT, multiplying (n/d)-coefficient polynomials can be implemented efficiently. For the size-d NTT, we employ the Cooley-Tukey FFT for efficient computation. For small-radix computations, the butterflies can be implemented efficiently. For the large-radix computations, we employ Rader's trick for efficient computation.
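
The rule above for picking the large modulus q' can be turned into a small Python helper. The bounds q·p and q·(2p - 1), and the replacement of p by w for a short polynomial, follow the summary as stated in the talk. The function names, the trial-division primality test, and the search for the smallest suitable prime are assumptions made for this sketch; the paper may pick q' by other criteria. The weight w = 286 is assumed here as the Streamlined NTRU Prime weight for p = 761.

```python
def value_bound(p, q, w=None, polynomial_modulus_first=False):
    """Bound from the talk: q*p, or q*(2p - 1) if x^p - x - 1 is reduced first.

    If the small polynomial is short with weight w, p is replaced by w.
    """
    terms = p if w is None else w
    return q * (2 * terms - 1) if polynomial_modulus_first else q * terms


def is_prime(n):
    """Trial division; adequate for the 32-bit range used here."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def smallest_ntt_prime(bound, d):
    """Smallest prime q' > bound with d | q' - 1, so a size-d NTT exists modulo q'."""
    candidate = bound - (bound % d) + 1
    if candidate <= bound:
        candidate += d
    while not is_prime(candidate):
        candidate += d
    return candidate


p, q = 761, 4591
print(value_bound(p, q))                                 # 3493751
print(value_bound(p, q, polynomial_modulus_first=True))  # 6982911
# w = 286 is an assumed weight for the p = 761 parameter set.
print(value_bound(p, q, w=286, polynomial_modulus_first=True))

# A candidate q' for a size-512 NTT when the polynomial modulus is reduced first.
q_prime = smallest_ntt_prime(value_bound(p, q, polynomial_modulus_first=True), 512)
print(q_prime, q_prime < 2**31)
```

Any prime found this way fits comfortably in a signed 32-bit word, which is the point of using the 32-bit multipliers on Cortex-M4.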
During the implementation of the butterflies, we find that the DSP extension is very useful. In particular, there are instructions multiplying specific halfwords, and there are instructions performing two halfword multiplications at a time. We find that both of the approaches can be implemented efficiently on Cortex-M4. Thank you for your attention.