Hi everyone, I'm Neng Zhang from Tsinghua University. I'm presenting a highly efficient architecture of NewHope-NIST on FPGA using low-complexity NTT and INTT. This presentation includes a brief introduction, the low-complexity NTT and INTT, the hardware architecture, and the implementation results.

PQC algorithms are required to resist quantum attacks. NewHope-NIST is such a lattice-based PQC algorithm for key encapsulation mechanisms. NewHope-NIST is a variant of NewHope-USENIX and NewHope-Simple. In the following, I will use NewHope to refer to NewHope-NIST. NIST started a competition for PQC standardization, and NewHope survived the second round but was not selected for the third round a few months ago. However, our proposed low-complexity NTT and INTT can be utilized by some other candidates in the PQC competition, and by some fully homomorphic encryption schemes, such as CRYSTALS-Dilithium, qTESLA, Falcon, LTV, and BFV.

The main mathematical objects in NewHope are polynomials over the ring. The modulus q and the order N are chosen to be special parameters, so that the primitive N-th root of unity and its square root exist. In this way, the polynomial multiplication in NewHope can be evaluated with the NTT and INTT, which are the most time-consuming operations in NewHope.

NewHope is an encryption-based KEM. There are three main functions, key generation, encryption and decryption, as shown in the table below. Each function consists of several NTT and INTT operations.

Let's see the multiplication over the ring when the modulo polynomial is arbitrary. The multiplication can be evaluated with the convolution theorem: the sizes of the NTT and INTT are doubled, and after the INTT a reduction by the modulo polynomial is required. When f(x) equals x to the power of N plus 1, the multiplication over the ring can instead be evaluated with the negative wrapped convolution.
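To make the negative wrapped convolution concrete, here is a small Python sketch of my own (not the speaker's code): it multiplies two polynomials in Z_q[x]/(x^N + 1) with NewHope's q = 12289, once by schoolbook negacyclic convolution and once via the negative wrapped convolution with explicit pre-processing (powers of gamma) and post-processing (powers of gamma inverse and N inverse). The toy size N = 8, the function names, and the naive O(N^2) transform are my illustration choices, not the paper's.

```python
# Sketch (not the paper's code): multiplication in Z_q[x]/(x^N + 1)
# for NewHope's q = 12289, two ways:
#   1) schoolbook negacyclic convolution
#   2) negative wrapped convolution: pre-scale by powers of gamma,
#      N-point NTT/INTT, post-scale by powers of gamma^-1 and N^-1
q = 12289
N = 8  # toy size; NewHope uses N = 512 or 1024

def find_root(order):
    # find an element of exact multiplicative order `order` (a power of 2)
    for g in range(2, q):
        cand = pow(g, (q - 1) // order, q)
        if pow(cand, order // 2, q) != 1:
            return cand

gamma = find_root(2 * N)     # primitive 2N-th root of unity
omega = gamma * gamma % q    # primitive N-th root of unity

def dft(x, w):
    # naive O(N^2) number-theoretic transform with root w
    return [sum(x[j] * pow(w, i * j, q) for j in range(N)) % q
            for i in range(N)]

def negacyclic_schoolbook(a, b):
    c = [0] * N
    for i in range(N):
        for j in range(N):
            if i + j < N:
                c[i + j] = (c[i + j] + a[i] * b[j]) % q
            else:  # x^N = -1 wraps around with a sign flip
                c[i + j - N] = (c[i + j - N] - a[i] * b[j]) % q
    return c

def negacyclic_nwc(a, b):
    at = dft([a[i] * pow(gamma, i, q) % q for i in range(N)], omega)
    bt = dft([b[i] * pow(gamma, i, q) % q for i in range(N)], omega)
    ct = dft([x * y % q for x, y in zip(at, bt)], pow(omega, q - 2, q))
    n_inv, g_inv = pow(N, q - 2, q), pow(gamma, q - 2, q)
    return [c * n_inv % q * pow(g_inv, i, q) % q for i, c in enumerate(ct)]
```

Note that the pre-processing costs N extra modular multiplications per transform and the post-processing costs 2N; these are exactly the terms the low-complexity NTT and INTT eliminate.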
This method avoids doubling the sizes of the NTT and INTT and the explicit reduction, but it requires a pre-processing in the NTT and a post-processing in the INTT. The pre-processing denotes the modular multiplications by the powers of gamma before the FFT. The post-processing denotes the scaling by the inverse of the point size N and the modular multiplications by the powers of gamma inverse after the FFT.

Area and speed are two important criteria for a hardware design. Area on FPGA means the consumption of LUTs, registers, DSPs, and block RAMs, and speed refers to clock cycles and frequency. From a hardware designer's perspective, low area and high speed are usually contradictory for a specific application: high speed usually requires a larger area. Normally, designers have to make trade-offs between area and speed. Our solution explores a new design point, where high speed and low area are both achieved to some extent. This is achieved by reducing the computational complexity of the NTT and INTT.

Now let's have a look at the low-complexity NTT. In the NTT, the FFT requires (N/2)·log N modular multiplications, and the pre-processing requires N modular multiplications. For the point size N being 1024, the modular multiplications of the pre-processing account for 17%. The smaller the value of N, the higher the proportion of the pre-processing; when N is 16, the ratio can be up to 33%. It can be seen that the cost of the pre-processing is considerable when N is not big enough. As a result, we looked for ways to eliminate the pre-processing to reduce the complexity of the NTT.

A low-complexity NTT with twiddle factors computed on the fly was proposed by Roy et al. We follow that work, but with the twiddle factors pre-computed. This method merges the pre-processing into the DIT FFT by merely changing the values of the pre-computed twiddle factors. The derivation of the low-complexity NTT is inspired by the strategy of the Cooley-Tukey FFT: we follow the divide-and-conquer method of the FFT that divides in the time domain.
First, the pre-processing and the FFT are written together as a summation of N terms. Second, the summation is split into two groups according to the parity of the index of a. Third, the equation is grouped into two parts according to the size of the index i. Note that A-hat-0 and A-hat-1 are the half-N-point NTTs of the even and odd coefficients a_2j and a_2j+1. In this way, an N-point NTT can be resolved with two half-N-point NTTs, and the same decimation process can be applied recursively down to the two-point NTT.

The left figure is the data flow of an eight-point low-complexity NTT; the pre-processing is not required anymore. The right figure is the butterfly of the low-complexity NTT. The differences between the low-complexity NTT and the classic DIT FFT are the twiddle factors: the blue parts are the same as the FFT, and the red parts are the differences. Because gamma_2m is the square root of omega_m, we have this equation, so we don't need to evaluate the products of the powers of omega and gamma on the fly. Only the N powers of gamma need to be pre-computed and stored.

This is the final low-complexity NTT algorithm without pre-processing. The differences from the classic DIT FFT are shown in blue. This method eliminates the modular multiplications of the pre-processing: the computational complexity of the NTT is reduced from (N/2)·log N + N to (N/2)·log N, with no additional timing cost or hardware resource cost.

Now let's have a look at our low-complexity INTT. In the INTT, the FFT still requires (N/2)·log N modular multiplications, but the post-processing requires 2N modular multiplications, whereas the pre-processing requires only N. So the cost of the post-processing is even greater than that of the pre-processing for the same size N. When N is 1024, the modular multiplications of the post-processing account for 29%; when N is 16, the ratio can be up to 50%.
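The merged DIT NTT can be illustrated with a short recursive Python sketch (my own illustration, not the paper's iterative hardware algorithm): the only change from a classic Cooley-Tukey DIT recursion is that the twiddle factor omega^i is replaced by gamma^(2i+1), which absorbs the pre-processing multiplications by the powers of gamma. The function names and the toy size N = 8 are mine.

```python
# Sketch of the merged DIT NTT (my recursive illustration, not the
# paper's hardware algorithm).  The classic DIT twiddle w^i becomes
# gamma^(2i+1), absorbing the pre-processing by the powers of gamma.
q = 12289
N = 8  # toy size

def find_root(order):
    # find an element of exact multiplicative order `order` (a power of 2)
    for g in range(2, q):
        cand = pow(g, (q - 1) // order, q)
        if pow(cand, order // 2, q) != 1:
            return cand

gamma = find_root(2 * N)   # primitive 2N-th root, gamma^2 = omega
omega = gamma * gamma % q

def lc_ntt(a, g):
    # computes the NTT of (gamma^i * a_i) with NO explicit pre-processing
    n = len(a)
    if n == 1:
        return list(a)
    even = lc_ntt(a[0::2], g * g % q)  # half-size levels use gamma^2
    odd = lc_ntt(a[1::2], g * g % q)
    out = [0] * n
    for i in range(n // 2):
        t = pow(g, 2 * i + 1, q) * odd[i] % q  # twiddle gamma^(2i+1)
        out[i] = (even[i] + t) % q
        out[i + n // 2] = (even[i] - t) % q
    return out

def ntt_with_preprocessing(a):
    # reference: explicit pre-scaling followed by a naive DFT
    pre = [a[i] * pow(gamma, i, q) % q for i in range(N)]
    return [sum(pre[j] * pow(omega, i * j, q) for j in range(N)) % q
            for i in range(N)]
```

The N extra multiplications of the pre-processing disappear; only the table of pre-computed powers of gamma changes relative to a classic NTT.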
Pöppelmann et al. reduced the complexity of the INTT by merging the scaling by the powers of gamma inverse into the FFT. We further merge the scaling by N inverse into the FFT. This is achieved by changing the values of the pre-computed twiddle factors of the INTT and slightly modifying the butterfly unit of the DIF FFT. The derivation of the low-complexity INTT is inspired by the strategy of the Gentleman-Sande FFT: we follow the divide-and-conquer method of the FFT that divides in the frequency domain.

First, the post-processing and the FFT are written together as a summation of N terms. Second, the summation is split into two groups according to the size of the index of a-hat. Third, the equation is grouped into two parts according to the parity of the output index. Note that a_2i and a_2i+1 correspond to half-N-point INTTs. In this way, an N-point INTT can be resolved with two half-N-point INTTs, and the same decimation process can be applied recursively down to the two-point INTT.

The left figure is the data flow of an eight-point low-complexity INTT; the post-processing is not required anymore. The right figure is the butterfly of the low-complexity INTT. The differences between the low-complexity INTT and the DIF FFT are the twiddle factors and the multiplication by one half after each butterfly. The blue parts are the same as the FFT, the red parts are the differences in the twiddle factors, and the yellow parts are the differences in the multiplications by one half. Similar to the low-complexity NTT, we have this equation, so we don't need to evaluate the products of the powers of omega inverse and gamma inverse on the fly. Only the N powers of gamma inverse need to be pre-computed and stored.

This is the final low-complexity INTT algorithm without post-processing. The differences from the DIF FFT are shown in blue. This method eliminates the modular multiplications of the post-processing: the computational complexity of the INTT is reduced from (N/2)·log N + 2N to (N/2)·log N.
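The merged DIF INTT can likewise be sketched recursively in Python (my own illustration, not the paper's hardware algorithm): both post-processing steps, the scaling by N inverse and by the powers of gamma inverse, are folded into the stages, with each butterfly multiplying by the modular inverse of two and using the twiddle gamma^-(2i+1) instead of omega^-i. The toy size and function names are mine.

```python
# Sketch of the merged DIF INTT (my recursive illustration).  Both
# post-processing steps -- scaling by N^-1 and by the powers of
# gamma^-1 -- are folded in: every butterfly multiplies by 2^-1 and
# the DIF twiddle w^-i is replaced by gamma^-(2i+1).
q = 12289
N = 8  # toy size

def find_root(order):
    for g in range(2, q):
        cand = pow(g, (q - 1) // order, q)
        if pow(cand, order // 2, q) != 1:
            return cand

gamma = find_root(2 * N)
omega = gamma * gamma % q
HALF = pow(2, q - 2, q)  # modular inverse of 2

def lc_intt(ah, g):
    # returns a_j = N^-1 * gamma^-j * sum_i ah_i * omega^-ij
    # with NO explicit post-processing pass
    n = len(ah)
    if n == 1:
        return list(ah)
    g_inv = pow(g, q - 2, q)
    top, bot = [], []
    for i in range(n // 2):
        u = (ah[i] + ah[i + n // 2]) * HALF % q
        v = (ah[i] - ah[i + n // 2]) * HALF % q
        top.append(u)                                   # even outputs
        bot.append(v * pow(g_inv, 2 * i + 1, q) % q)    # gamma^-(2i+1)
    out = [0] * n
    out[0::2] = lc_intt(top, g * g % q)
    out[1::2] = lc_intt(bot, g * g % q)
    return out

def forward_reference(a):
    # the matching forward transform: NTT of the gamma-pre-scaled input
    pre = [a[i] * pow(gamma, i, q) % q for i in range(N)]
    return [sum(pre[j] * pow(omega, i * j, q) for j in range(N)) % q
            for i in range(N)]
```

The round trip through the forward transform and this INTT returns the original coefficients, which is exactly the behavior of INTT plus post-processing, at (N/2)·log N multiplications instead of (N/2)·log N + 2N.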
This method has a low additional timing cost in our architecture; it just needs a slight modification of the butterfly unit.

Now let's have a look at the architecture of the low-complexity NTT and INTT. It consists of a coefficient memory, a twiddle-factor memory, two butterfly units (BFUs), and a control unit. Two butterfly units are used to increase the throughput, and they work in pipeline mode. The coefficient RAM is designed as a multi-bank memory to meet the bandwidth requirements of the two BFUs. We follow the address generator proposed in prior work. It works well when log N is even, but there are address conflicts when log N is odd. We rearrange the execution order of the innermost loop to avoid the address conflicts.

Because the DIT and DIF decimation methods are used for the NTT and INTT respectively, two different butterfly structures are required. We propose a configurable butterfly unit to support the two kinds of butterflies. It consists of one modular multiplier, one modular adder, two modular subtractors, two modular multipliers by one half, and some muxes. The modular multiplication by one half doesn't need a real multiplication: when x is even, it just needs a right shift; when x is odd, it needs a right shift and an addition.

We propose a low-complexity modular multiplication for the modulus 12289. This modulus has the property that 2 to the power of 14 is congruent with 2 to the power of 12 minus 1. This means that a datum with more than 14 bits can be reduced by about two bits with a shift and a subtraction. We recursively use this property to reduce the 28-bit product. Finally, the product is reduced with a few additions and subtractions of data no wider than 14 bits. This is the architecture of the modular multiplication. Only one multiplication is required to generate the product, and no additional multiplication is required for the reduction. Because there is no if-else or case statement, our method is constant-time.
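The two multiplier-free tricks can be sketched at the bit level in Python (my own illustration of the arithmetic, not the RTL): the reduction folds the bits above bit 13 back down using 2^14 ≡ 2^12 − 1 (mod 12289), and the halving uses the fact that adding the odd modulus q to an odd value makes it even. The software loop below is written for clarity; the hardware unrolls the fixed number of folding steps, which is why it contains no data-dependent branches.

```python
# Sketch (my illustration) of the multiplier-free arithmetic for q = 12289.
q = 12289  # 2^14 = 16384 = q + 4095, hence 2^14 ≡ 2^12 - 1 (mod q)

def reduce12289(x):
    # fold the high bits down: hi*2^14 + lo ≡ hi*(2^12 - 1) + lo (mod q)
    # a 28-bit product shrinks by roughly two bits per pass
    while x >> 14:
        hi, lo = x >> 14, x & 0x3FFF
        x = (hi << 12) - hi + lo
    return x - q if x >= q else x  # final conditional subtraction

def halve(x):
    # modular multiplication by 2^-1 without a multiplier:
    # even x: right shift; odd x: add q (making it even), then right shift
    return x >> 1 if x & 1 == 0 else (x + q) >> 1

def modmul(a, b):
    # one integer multiplication, then shift-and-add reduction only
    return reduce12289(a * b)
```

In hardware each folding pass is just a 12-bit left shift, a subtraction, and an addition, so the whole reduction costs no DSP multiplier.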
With the proposed low-complexity NTT and INTT, we designed this architecture for NewHope-NIST. It supports all the functions: key generation, encryption, and decryption. The RAM and BFU blocks follow the architectures of the NTT and INTT. There are two additional RAMs to store intermediate data, as the two BFUs can deal with two point-wise operations. The two RAMs and most other blocks are designed to process two data points every cycle to match the doubled bandwidth.

To further reduce clock cycles, time hiding is achieved by simultaneously performing operations without resource conflicts or data dependencies. This is the algorithm for the encryption in NewHope. The operations in the same line are performed simultaneously. In our architecture, a RAM may be read and written by operations in the same line, such as R2 in lines 3 and 5. The reason is that the operations access the RAM sequentially, so the operation that writes the RAM can be executed as soon as the data at the same address have been read out by the other operation. As a result, although data dependencies exist, the operations can still be performed simultaneously at the operation level.

Let's have a look at the implementation results. The low-complexity NTT, INTT, and NewHope-NIST are implemented on a Xilinx Artix-7 FPGA, which is recommended by NIST and widely adopted in the evaluations. The hardware resources and the highest frequency are obtained from Vivado, with the default strategy for synthesis and implementation. We compare the implementation results of the low-complexity NTT with other designs with the same point size and modulus. The area-time products (ATPs) are measured with LUT, FF, DSP, and BRAM respectively. As shown in this figure, our design is the fastest and has the smallest ATPs. The implementation results of the NewHope-NIST designs are compared in the figures. Our design is at least 2.5 times faster than other designs on similar devices.
The consumed hardware resources are also small, especially the LUTs and the DSPs. The ATPs are at least 4.9 times smaller than those of other designs.

Okay, this is the conclusion. We presented the low-complexity NTT and INTT, which eliminate the pre-processing and the post-processing respectively. With them, a highly efficient architecture of NewHope-NIST is proposed. The implementation results show that the low-complexity NTT and INTT and the architecture of NewHope-NIST have a clear advantage in both speed and ATP. Furthermore, the low-complexity NTT and INTT can benefit other NTT-based algorithms. That's all. Thanks for listening.