Hi all, I am Yufei Xing, corresponding author of the paper entitled "Compact Hardware Implementation of CCA-Secure Key Exchange Mechanism CRYSTALS-Kyber on FPGA". This work was done in 2020 when I was pursuing a doctoral degree at Tsinghua University in China, and Shuguo Li was my supervisor. It is about a full hardware implementation of Kyber, aiming at a compact design with decent performance, using a limited number of butterfly units. The background of our work is the serious threat that powerful quantum computers pose to the public-key cryptosystems currently used in our daily life: the mathematically hard problems, integer factorization and discrete logarithm, which are considered infeasible to solve on classic computing architectures, can now be broken in polynomial time by Shor's algorithm. As a result, NIST launched a contest for new post-quantum cryptographic schemes and called for proposals from all over the world in December 2016, and seven finalists and eight alternate candidates remained in the third round of the contest. In NIST's current view, structured lattices appear to be the most promising direction. The motivation of our work is the urgent need for full hardware reference implementations of the finalists in the PQC contest to demonstrate their potential strengths and intrinsic properties. From such works, we can find the bottlenecks on the way towards high-speed or low-cost designs on hardware platforms. Many previous works presented results of hardware-software co-designs, in which the most computationally intensive procedures are offloaded to a co-processor to speed up computation. NTT and Keccak cores are typical modules of this kind, while the other parts of the protocol, mainly tasks not convenient to handle in hardware, are left to a processor, such as an ARM Cortex series or RISC-V core. Such NTT and Keccak cores are commonly reused across lattice-based cryptography implementations.
These core designs can be developed in a relatively short period of time, and another important benefit is that they are not sensitive to minor changes of protocol parameters. However, the flexibility comes at the cost of low performance, and dedicated hardware designs are much preferable in pursuit of high performance. We selected Kyber as our research target. It is an MLWE-based key exchange mechanism proposed by the CRYSTALS team, and is one of the seven finalists in the PQC contest. Here is the algorithm description of the key generation phase of Kyber. The main procedures include sampling and the NTT-related calculations in polynomial multiplication. Several auxiliary procedures are listed as well, applied to random bits or intermediate results. Our starting point is to come up with a proper way to arrange the order of the sampling and NTT-related procedures to save cycles. Besides, the NTT calculation in Kyber is different from that in other lattice-based cryptography protocols, so there is some room for optimization. Together with a proper Fujisaki-Okamoto transform implementation, a compact design with decent performance can be expected. As there is no 512-th primitive root of unity in the field Z_q, the cyclotomic polynomial modulus does not factor into linear polynomials, but into quadratic polynomials. Kyber adopts the NTT in polynomial multiplication, so both polynomial multiplicands are evaluated at specially selected points, and two facts follow. The first fact is that the evaluation process actually divides into two parts, the part with even-indexed coefficients and the part with odd-indexed coefficients. They can be conducted concurrently, and the twiddle factors are shared between the two parts. The second fact is that the point-wise multiplication is between two linear polynomials, not two numbers, and five multiplications in Z_q would be involved in a straightforward implementation.
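To make the root-of-unity argument concrete, here is a small Python sanity check I added as an illustration; the modulus q = 3329 and the root zeta = 17 are the standard Kyber round-3 parameters, not figures stated in the talk:

```python
# Sanity checks on the Kyber NTT facts above (an illustrative sketch).
Q = 3329    # Kyber modulus
ZETA = 17   # Kyber's primitive 256-th root of unity mod Q

# The multiplicative group of Z_q has order q - 1 = 3328 = 2^8 * 13, so it
# contains 256-th roots of unity but no primitive 512-th root of unity:
assert (Q - 1) % 256 == 0
assert (Q - 1) % 512 != 0

# zeta = 17 is a primitive 256-th root of unity: zeta^128 = -1 mod q.
assert pow(ZETA, 128, Q) == Q - 1
assert pow(ZETA, 256, Q) == 1

# Hence X^256 + 1 factors only down to 128 quadratic terms X^2 - zeta^(2i+1)
# (the spec lists the exponents in bit-reversed order), and the NTT-domain
# representation is 128 coefficient pairs rather than 256 scalars.
```

This is why the evaluation splits into an even-indexed half and an odd-indexed half, and why the point-wise stage multiplies linear polynomials instead of numbers.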
Noticing that the constant-term and linear-term coefficients are of a separable nature, the Karatsuba method can be exploited to decrease the total number of small multiplications. In such a way, the linear term of the result is rearranged to make use of intermediates from the constant-term calculation, and one multiplication in Z_q can be eliminated, potentially saving cycles. From the two facts above, a natural idea is to deploy two butterfly units, each responsible for the even or odd part of a polynomial during the NTT or inverse NTT. During point-wise multiplication, they cooperate with each other, such that the four multiplications involved can be done in two cycles. With a rearrangement of the NTT, the pre-scaling and post-scaling procedures can be absorbed into the NTT calculation, and the twiddle factors are then replaced with the product of the original twiddle factor and part of the scaling factor. Accounting for the two auxiliary procedures, a unified butterfly structure is adopted. With an adder and a subtractor both ahead of and behind the multiplier, several procedures can be adapted to this structure. In addition to normal use in the NTT and inverse NTT, it can complete one point-wise multiplication in two cycles, denoted as PWM0 and PWM1. The compress and decompress procedures are supported as well. Several procedures can be rearranged and merged to save cycles. For example, the addition of e2 in the calculation of v can be deferred into the adjacent calculation of c2, so that both calculations can be supported by the unified butterfly structure. It should be noted that the sign of each element in e2 should then be negated, which is easy, as we can simply negate the sign of each sample when constructing the centered-binomial-distributed samples. In addition to the noise samples drawn from a centered binomial distribution, the public matrix A should be sampled uniformly over Z_q. Both come from pseudorandom bits produced by the Keccak core with different instantiation parameters.
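The Karatsuba saving in the point-wise stage can be sketched in a few lines of Python; the function names are my own, and the modulus is Kyber's standard q = 3329:

```python
import random

Q = 3329  # Kyber modulus (from the spec, not stated in the talk)

def basemul_schoolbook(a, b, zeta):
    """(a0 + a1*X)(b0 + b1*X) mod (X^2 - zeta): 5 multiplications in Z_q."""
    c0 = (a[0] * b[0] + a[1] * b[1] % Q * zeta) % Q   # a0*b0, a1*b1, (a1*b1)*zeta
    c1 = (a[0] * b[1] + a[1] * b[0]) % Q              # a0*b1, a1*b0
    return (c0, c1)

def basemul_karatsuba(a, b, zeta):
    """Same product with 4 multiplications: the linear term reuses a0*b0, a1*b1."""
    a0b0 = a[0] * b[0] % Q
    a1b1 = a[1] * b[1] % Q
    c0 = (a0b0 + a1b1 * zeta) % Q
    c1 = ((a[0] + a[1]) * (b[0] + b[1]) - a0b0 - a1b1) % Q
    return (c0, c1)

# The two variants agree on random inputs.
for _ in range(1000):
    a = (random.randrange(Q), random.randrange(Q))
    b = (random.randrange(Q), random.randrange(Q))
    zeta = random.randrange(1, Q)
    assert basemul_schoolbook(a, b, zeta) == basemul_karatsuba(a, b, zeta)
```

With two butterfly units sharing the work, the four remaining multiplications fit into two cycles, which is the PWM0/PWM1 pairing mentioned above.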
The sponge-structured SHA-3 standard was selected from candidates in a contest, much like the PQC contest, so there are many hardware implementations that can be referenced. In our design, a plain implementation of the Keccak core is adapted, with the input-output width adjusted to 32 bits and an absorb function added. It completes one round of the Keccak-f function per cycle, and 50 cycles are needed to shift the 1600-bit inner state out. In total, 79 cycles are needed for one XOF, KDF, PRF or G hash function in Kyber. Four XOF invocations are needed to generate one element polynomial of the public matrix with overwhelming probability, in total 316 cycles. One PRF invocation is needed to generate one noise polynomial when eta equals 2, and two PRF invocations are needed when eta equals 3. In this MLWE-based protocol, the multiplication between the public matrix and a noise polynomial vector, as well as between two polynomial vectors, divides naturally into small polynomial multiplications. Each 256-term polynomial is transformed into point-value representation, consuming 448 cycles using the two butterfly units. The input data comes from the Keccak core and is written into a RAM block in 64 cycles, so in total 512 cycles are needed to complete one forward NTT. Similarly, 576 cycles are needed for the inverse direction, of which 448 cycles are for the INTT calculation and the other 128 cycles are for the compress function. The point-wise multiplication between two polynomials takes 256 cycles. We then arrange the order of sampling and NTT calculation on the basis of these figures: the cycle counts of the different calculations are compared, and the execution order is adjusted accordingly. The main principle is to generate necessary data strictly before they are required by the NTT core, and the NTT calculations should be arranged properly, consistent with the execution order of the protocol. With all the efforts above, we give a manually adjusted, concrete execution order of sampling and NTT calculation.
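The cycle figures above are internally consistent, and the arithmetic can be reproduced directly; this is only a restatement of the numbers given in the talk, not new measurements:

```python
# Cycle-budget arithmetic for the figures quoted above (a sketch).
STATE_BITS, IO_WIDTH = 1600, 32
shift_cycles = STATE_BITS // IO_WIDTH      # shifting the full inner state out
assert shift_cycles == 50

keccak_invocation = 79                     # one full XOF/PRF/KDF/G call
assert 4 * keccak_invocation == 316        # one public-matrix element polynomial

# Forward NTT with two butterfly units: 7 layers x 128 butterflies / 2 units,
# plus 64 cycles to write the incoming coefficients into the RAM block.
ntt_compute = 7 * 128 // 2
assert ntt_compute == 448
assert ntt_compute + 64 == 512             # full forward NTT
assert ntt_compute + 128 == 576            # INTT plus the compress function
```

These budgets are what make the comparison between sampling cost and NTT cost, and hence the execution-order schedule, possible.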
The total cycles consumed are determined by the butterfly units, and sampling is basically hidden behind the NTT calculation. When more butterfly units are exploited, a higher input-output rate of the Keccak core is expected, and the execution order can be adjusted in a similar way. Noticing that a manually adjusted sampling order in a regular, conventional state machine would involve complex control logic, a predefined order table is most suitable here to store the sampling order, and whether a noise polynomial or a public-matrix element polynomial should be generated currently is determined by the entry fetched from the order table. Within the table, there are four kinds of entries: public-matrix element polynomial sampling is represented with four consecutive zeroes, noise polynomial sampling when eta equals 2 is represented with the entry 1, and noise polynomial sampling when eta equals 3 is represented with a pair of entries, 2 and 3, implying two PRF functions should be invoked sequentially. Our Keccak core works in an autonomous way: whenever the FIFO is not empty and the last hash has finished, data is fetched from the FIFO until enough data has been collected to begin the next hash process. Different instantiations of the SHA-3 standard have different rates r and different capacities c, determining the number of bits that should be appended. A two-bit mode signal determines the number of 32-bit zero data chunks appended. For the XOF function, the mode signal is 0; concatenated with the 3-bit value 7, it makes the end value of the capacity counter 7. For the PRF function, the mode is 1; concatenated with the 3-bit value 7, it makes the end value of the capacity counter 15. The appending process begins the cycle after a one-bit pulse signal is active, counting from zero until the capacity counter equals the end value determined by the mode signal. Then the Keccak core begins to process the data and generates output bits after 24 rounds of the Keccak-f function.
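The two control tricks just described can be sketched in Python; the function names are my own, and the word counts assume the standard SHAKE-128/SHAKE-256 capacities (256 and 512 bits, i.e. 8 and 16 words of 32 bits):

```python
# 1) Capacity-counter end value: the 2-bit mode is concatenated with 3'b111,
#    so mode 0 (XOF) gives end = 0b00111 = 7 and mode 1 (PRF) gives
#    end = 0b01111 = 15 appended zero words.
def counter_end(mode):
    return (mode << 3) | 0b111

assert counter_end(0) == 7    # XOF: SHAKE-128 capacity = 8 words
assert counter_end(1) == 15   # PRF: SHAKE-256 capacity = 16 words

# 2) Order-table decode: entry 0 -> matrix element sampling (four zeroes per
#    element), entry 1 -> one PRF call (eta = 2), entries 2 and 3 -> the two
#    sequential PRF calls of one eta = 3 noise polynomial.
def decode(entries):
    actions = []
    for e in entries:
        if e == 0:
            actions.append("sample matrix element (XOF)")
        elif e == 1:
            actions.append("sample noise, eta=2 (1 PRF)")
        elif e in (2, 3):
            actions.append("sample noise, eta=3 (PRF half)")
    return actions
```

Storing the schedule as such a table keeps the state machine regular: the controller only fetches and decodes entries instead of hard-wiring the manually tuned order.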
The absorb signal is active when the Keccak core should work in sponge mode, or in other words, when the output data from the last invocation should be XORed with the input data of the current invocation. At the top level of our design, a set of registers is deployed, storing intermediate 256-bit values. Part of them are collected from the Keccak core, while the recovered message is collected from the NTT core. These registered values are sent to the Keccak core at the appropriate time. The components in the dashed rectangle represent the hash-related hardware modules. A FIFO is inserted ahead of the Keccak core to buffer 32-bit input data chunks, and the output of the Keccak core can either be sent to the register set as aforementioned, or regulated and sent to the NTT core as pseudorandom input bits. The pseudorandom data is processed in the NTT core, and the generated output is sent to the encoder, regulated into 32-bit data chunks, and transmitted to the other participant of the protocol. In the other direction, 32-bit data chunks from the other participant are first buffered into a FIFO, decoded into values of the proper length, and then transmitted to the NTT core to take part in the calculation. Several FIFOs are exploited in our design to buffer data or serve as data storage, and the smallest configuration should be determined to decrease the overall resource consumption. For a data buffer, the depth is determined by the largest leftover when data is continuously fed in at a higher rate than it is read out. For data storage, it is determined by the largest amount of data it must hold at any time. The Fujisaki-Okamoto transform contains re-encryption and comparison procedures on the server side. Noticing that the polynomial vector t in the public key is required in re-encryption, and that the ciphertexts c1 and c2 are both required in the comparison after they participate in the decryption process, they should not be thrown away before the transform is conducted. Actually, there is no need to resort to extra storage resources.
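The FIFO-sizing rule for the buffering case can be illustrated with a toy occupancy simulation of my own; it is not the sizing procedure from the paper, just the principle that the minimum depth equals the peak leftover when the producer temporarily outpaces the consumer:

```python
# Toy model: producer/consumer word counts per cycle, simultaneous read/write.
def min_depth(produced_per_cycle, consumed_per_cycle):
    """Return the peak occupancy reached, i.e. the minimum FIFO depth needed."""
    occupancy, peak = 0, 0
    for p, c in zip(produced_per_cycle, consumed_per_cycle):
        occupancy += p
        occupancy -= min(c, occupancy)   # cannot consume more than is buffered
        peak = max(peak, occupancy)
    return peak

# Producer bursts 2 words/cycle for 4 cycles while the consumer drains
# 1 word/cycle: 4 words are left over at the peak, so depth 4 suffices.
assert min_depth([2, 2, 2, 2, 0, 0, 0, 0], [1] * 8) == 4
```

The data-storage case is simpler: the depth is just the largest data set the FIFO must hold in full, such as a whole polynomial awaiting the transform.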
For the polynomial vector t, it is redirected to the input port of its FIFO when transmitted to the client. For the ciphertexts c1 and c2, they are redirected to the input ports of their respective FIFOs when they are fetched by the NTT core to participate in the decryption process. With all the methods above, a compact design with decent performance is achieved. Compared with related hardware designs at that point in time, our work is not the fastest. The work in [DFA+20] is more than two times faster than ours, mainly because more computation units were exploited; its LUT, FF, BRAM and DSP usage is more than 1.5 times, 2 times, 4 times and 5 times ours, respectively. Compared with [HHLW20], our design is more than 10 times faster, while our LUT and FF consumption is more than 10 times and 40 times less than theirs. Compared with hardware-software co-designs, our work is hundreds of times faster. These referenced works are all built on RISC-V soft cores with different configurations, and their resource consumption varies over a large range, from only a quarter of ours to more than three times ours. It can be concluded that hardware-software co-designs are not good choices when performance is the primary concern. Saber is another finalist in the contest, and its parameters are very close to those of Kyber. Compared with the related hardware design of Saber in [RB20], the timing performance of our design is pretty close to theirs, while our LUT and FF consumption is more than three times and two times less than the Saber design. The compactness comes mainly from three aspects. The first is storage reuse, as aforementioned in the Fujisaki-Okamoto transform, together with the proper arrangement of the execution order in Kyber, such that most intermediate results are consumed just after they are generated, and pseudorandom bit generation follows a just-in-time strategy. The second is that two multipurpose butterfly units are adopted and can support all NTT-related calculations in Kyber, making full use of the multipliers.
The third is that the input-output process of the Keccak core is conducted through shifting of the inner state, without resort to a dedicated ring-shift register. As a result, the Keccak core is not ready for the next hash before the output bits are fetched and the input bits are fed in completely. A full invocation of the Keccak core takes 79 cycles, much more than the theoretical 24 cycles, but this does not cause any trouble in our design, as the generation rate of random bits still satisfies the needs of the NTT core. The decent performance of this architecture comes mainly from two aspects. The first is that we arrange the execution order properly, such that the sampling procedures are almost entirely hidden behind the NTT-related calculations. The second is that several procedures are rearranged, merged and allocated to the unified butterfly structure rather than conducted separately, saving cycles. In pursuit of a higher-performance design, more butterfly units should be exploited, and the Keccak core should keep up with the random data requirements of the butterfly units. Ideally, an adjustment of the input-output width would be adequate; otherwise a higher-speed Keccak core is needed, implying that the resource consumption would increase significantly. In our design, the adapted Keccak core consumes 40% of the LUTs and 35% of the FFs of the whole design. If a low-cost design is expected, the first thing is to select a low-cost Keccak core. Our design has been verified against Kyber's known-answer test files, and the whole project has been uploaded to GitHub. That's all for my presentation. Thanks for listening.