Hello everybody, I'm Hao, and I'm going to present our paper entitled "Batching CSIDH Group Actions using AVX-512". This is joint work with Georgios, Johann, Peter, and Peter. Commutative Supersingular Isogeny Diffie-Hellman, or CSIDH for short, is a recently proposed post-quantum key establishment scheme that belongs to the family of isogeny-based cryptosystems. It comes with highly attractive features like efficient validation of public keys, making it suitable for non-interactive key exchange protocols. In fact, CSIDH has the potential to serve as a drop-in post-quantum replacement for classical Diffie-Hellman key exchange. The CSIDH protocol is based on the action of an ideal class group on a set of supersingular elliptic curves. Unfortunately, the execution time of CSIDH is prohibitively high for many real-world applications, mainly due to the enormous computational cost of the underlying group action. In this work, we explore how to use powerful vector instructions, namely Intel AVX-512, to accelerate the computation of the CSIDH group action.

The CSIDH group action works over a finite field F_p, where p is a large prime of the special form p = 4 * l_1 * l_2 * ... * l_n - 1, with the l_i being small odd primes. It uses supersingular elliptic curves E_A that are defined over this prime field and represented in Montgomery form. In CSIDH, we are interested in computing the action of an ideal of the form l_1^{e_1} * ... * l_n^{e_n}, where the l_i are prime ideals and the e_i are small exponents chosen uniformly from some interval. The CSIDH group action amounts to computing the curve E' as the image of the curve E under an isogeny of degree l_1^{e_1} * ... * l_n^{e_n}. The entire isogeny computation can be broken into many smaller isogeny computations, and each isogeny is computed using Vélu's formulas.
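To make the parameter shape concrete, here is a small Python sketch (my own, not part of the talk's software) that builds a prime of the CSIDH form p = 4 * l_1 * ... * l_n - 1; the choice of the 73 smallest odd primes plus 587 matches the published CSIDH-512 parameters, while the helper names are illustrative.

```python
import math

def first_odd_primes(n):
    """Return the n smallest odd primes by trial division."""
    primes, cand = [], 3
    while len(primes) < n:
        if all(cand % p != 0 for p in primes):
            primes.append(cand)
        cand += 2
    return primes

# CSIDH-512: the 73 smallest odd primes (3 ... 373) together with 587
ells = first_odd_primes(73) + [587]
p = 4 * math.prod(ells) - 1

# p has the required shape: p = 3 mod 4, and p = -1 mod every l_i,
# so each l_i divides p + 1 (giving points of order l_i to work with)
assert p % 4 == 3
assert all(p % ell == ell - 1 for ell in ells)
```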
As for the CSIDH key exchange protocol, Alice and Bob first generate their private keys, which are vectors of secret exponents, and then they generate their public keys by applying the CSIDH group action to their private key and the starting curve. Alice sends her public key, the curve E_A, to Bob, and Bob sends back his public key E_B. Alice and Bob then check whether the public key they received is valid. If it is valid, they compute the shared secrets K_A and K_B, again using the CSIDH group action. The obtained curves K_A and K_B are isomorphic.

The original CSIDH paper conjectured that CSIDH-512 would achieve NIST post-quantum security level 1. However, the concrete security of CSIDH is under debate; for example, Peikert estimates that the prime p should be significantly larger in order to meet NIST security level 1. In addition, real-world applications need constant-time implementations, and many papers have researched efficient constant-time implementations of CSIDH. Some important optimization techniques are the use of Elligator for sampling random points on the curve, the SIMBA technique, and the two-point approach. Among these works, there are three main variants of the CSIDH group action evaluation, namely the MCR style, the OAYT style, and the dummy-free style. Notably, the dummy-free style is even resistant against fault-injection attacks. Most recently, a new constant-time CSIDH group action algorithm named CTIDH was proposed, which uses a new key space; you can also watch that talk at CHES 2021.

In this work, we present the first vectorized implementations of constant-time CSIDH. Our software contains two types of implementations: a high-throughput one and a low-latency one. The throughput-optimized one is a batched implementation that performs eight group action instances in parallel, which is designed to speed up server-side TLS processing.
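The commutative key exchange flow just described can be illustrated with a toy Python sketch. I use modular exponentiation as a stand-in for the group action: it commutes in the same way, but it has nothing to do with isogenies and is not post-quantum secure, and all constants here are made up for illustration.

```python
P = 2**127 - 1        # a Mersenne prime serving as a toy modulus (hypothetical)
E0 = 5                # public starting element, standing in for the curve E0

a = 0x1234abcdef      # Alice's secret (stands in for her exponent vector)
b = 0x9876fedcba      # Bob's secret

E_A = pow(E0, a, P)   # Alice's public key: her secret applied to E0
E_B = pow(E0, b, P)   # Bob's public key

K_A = pow(E_B, a, P)  # Alice applies her secret to Bob's public key
K_B = pow(E_A, b, P)  # Bob applies his secret to Alice's public key

# Commutativity of the action makes the two shared secrets agree
assert K_A == K_B
```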
In order to correctly and efficiently batch the CSIDH group action, we present several different hybrid batching methods. Besides, these batching methods are also beneficial for minimizing the latency of CSIDH-based signatures, such as CSI-FiSh and SeaSign, in which multiple independent group actions are computed in the key generation, signing, and verification processes. The latency-optimized implementation is developed to accelerate CSIDH on the TLS client side, and each of the high-throughput and low-latency implementations comes in two different versions: AVX-512F and AVX-512IFMA. The target x64 software that we are optimizing is the LATINCRYPT 2019 implementation. In this presentation, we will focus on the OAYT-style implementation.

AVX-512 is the latest incarnation of Intel's Advanced Vector Extensions, which augments the execution environment of x64 with 512-bit registers and various new instructions. AVX-512 consists of multiple extensions, and a specific processor may support some, but not all, of them. On the other hand, all processors equipped with AVX-512 support the core extension named AVX-512 Foundation (AVX-512F), which has a 32-bit vector multiplier. In our software, we work in a SIMD fashion that splits a 512-bit vector into eight 64-bit elements. A simple example of a vector multiplication instruction of AVX-512F is shown here: one instruction performs eight element-wise 32x32-bit multiplications and produces eight 64-bit products.

Among the extensions of AVX-512, IFMA is very attractive for public-key cryptosystems whose underlying arithmetic is large-integer arithmetic. Intel described IFMA as two new instructions for big-number multiplication, for accelerating vectorized RSA software and the performance of other crypto algorithms. Specifically, IFMA (Integer Fused Multiply-Add) first multiplies the packed 52-bit integers from two registers A and B to produce a 104-bit intermediate product T.
Then it adds either the lower or the upper 52 bits of the product T to the packed 64-bit integers from register C and stores the final result in the destination register R. IFMA was first supported by Intel's Cannon Lake and continues to be equipped in its successors, such as the Ice Lake, Tiger Lake, and Rocket Lake processors. In this work, we target the Intel Ice Lake processor.

Okay, so now I will introduce our batched high-throughput implementation. Let's first look at the original CSIDH group action. Here is a variant of the original CSIDH group action. From the two highlighted lines, we can see that the algorithm is not constant-time, because the number of isogenies to be computed depends on the value of the secret exponent, which makes the execution time of the group action depend on secret information. Therefore, it is vulnerable to timing attacks. In contrast, the OAYT-style CSIDH group action considered in this work computes b_i isogenies instead of e_i for each prime l_i: it computes e_i real isogenies and b_i - e_i dummy isogenies. The total number of isogeny computations is therefore always the sum of the b_i, which is constant and equals 404 for CSIDH-512. This is the OAYT-style group action; the differences from the original CSIDH are highlighted. Apart from the different way of computing isogenies, it also adds a constant-time equality test to check whether the corresponding e_i is zero or not. The curve and exponent values are updated according to the result of this equality test.

So now we have obtained constant-time CSIDH. However, this is not enough for batching. To be specific, consider our batched software, where eight OAYT-style group action instances are to be computed simultaneously by AVX-512 instructions. Each instance is computed in a 64-bit lane, and the instances are independent of each other.
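The semantics of the two IFMA instructions described above (vpmadd52luq and vpmadd52huq) can be modeled in plain Python as follows; this is only a behavioral sketch of the per-lane arithmetic, not the talk's actual intrinsics code.

```python
MASK52 = (1 << 52) - 1
MASK64 = (1 << 64) - 1

def vpmadd52luq(c, a, b):
    """Per 64-bit lane: c += low 52 bits of (a[51:0] * b[51:0]), mod 2^64."""
    return [(ci + (((ai & MASK52) * (bi & MASK52)) & MASK52)) & MASK64
            for ci, ai, bi in zip(c, a, b)]

def vpmadd52huq(c, a, b):
    """Per 64-bit lane: c += bits 103..52 of (a[51:0] * b[51:0]), mod 2^64."""
    return [(ci + (((ai & MASK52) * (bi & MASK52)) >> 52)) & MASK64
            for ci, ai, bi in zip(c, a, b)]

# Starting from zero accumulators, the low and high halves together
# reconstruct the full 104-bit product in every lane
a = [3, 1 << 50, 123456789, 7, 11, 13, 17, 19]
b = [5, 1 << 50, 987654321, 2, 4, 6, 8, 10]
lo = vpmadd52luq([0] * 8, a, b)
hi = vpmadd52huq([0] * 8, a, b)
assert all((h << 52) + l == x * y for h, l, x, y in zip(hi, lo, a, b))
```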
The problem is that SIMD processing requires the parallel instances to execute the same operation sequence, which is a stricter requirement than constant running time. The operation sequence in the OAYT-style group action also depends on whether the kernel generator R is the point at infinity or not, which depends only on the randomness. A simple example is shown here: in the first instance the generator R is not the point at infinity, while in the second instance it is. Instances 1 and 2 will later perform different operations. This causes a mismatch between instances 1 and 2, which is a problem for SIMD. In particular, the probability for a point of order l_i to be infinity is 1/l_i, which is considerably high when l_i is small, for example 3, 5, or 7 in CSIDH. In order to obtain a batching-friendly and, of course, constant-time CSIDH group action, we need to mitigate this mismatch problem. In the following, we present three different methods to handle this conditional statement regarding the kernel generator R.

Our first method aims at making the group action independent of all inputs as well as all randomness. In brief, we remove the if-clause checking whether R is infinity, at the cost of extra dummy isogeny computations. This idea has been proposed and used in previous implementations. To apply it, we add a new constant-time infinity test and update the curve and other variables according to the result of this infinity test as well. Meanwhile, we accordingly add a new constant c_i for each b_i, so the total number of isogeny computations increases to the sum of b_i + c_i. However, this method has several problems. There always exists a probability of failing to compute the correct codomain curve: when too many infinity cases happen, we can miss the computation of some real isogenies. Therefore, a large number of extra dummy isogeny computations is required to make this probability negligible.
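The mismatch probability mentioned above grows quickly across a batch. A quick back-of-the-envelope Python check (my own, with eight lanes as in the talk's batched software):

```python
def p_any_infinity(ell, lanes=8):
    """Probability that at least one of `lanes` independent instances samples
    a kernel point at infinity, each with per-instance probability 1/ell."""
    return 1 - (1 - 1 / ell) ** lanes

# For the smallest CSIDH prime a mismatch is almost certain in every batch,
# while for the largest prime (587) it is rare
assert p_any_infinity(3) > 0.96      # 1 - (2/3)^8, about 0.961
assert p_any_infinity(587) < 0.014   # roughly 8/587
```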
For example, around 900 isogenies need to be computed to push this failure probability below 2^-32, whereas before it was 404 isogeny computations. Hence, this greatly reduces the efficiency of the algorithm, while the probability of failure still exists. Based on the above discussion, we looked for a way to significantly reduce the number of extra dummy isogenies and eliminate the probability of failure, while retaining the batching-friendly fashion of the group action. We found a solution, which is a hybrid mode.

Hybrid mode means that the entire batched software is composed of two different types of group action implementations, namely the batched component and the un-batched component. The batched component is an incomplete implementation that performs eight instances simultaneously. The un-batched component is a latency-optimized implementation accelerating a single class group action evaluation. The key idea is to first take advantage of the batched component to compute the main bulk of the CSIDH group action for all instances, and then use the un-batched component eight times in sequence to handle the remaining computations needed in each instance. To apply the hybrid mode to the extra-dummy method, we remove the c_i and create a new bound list b-hat. The list b-hat records the infinity cases that happened in the batched component and is then used as the bound list in the un-batched component. Our experiments indicate that for each instance there are often only around 10 isogenies remaining to be computed in the un-batched component. As a result, the total number of isogeny computations is just slightly larger than before. Moreover, since the un-batched component has no failure probability, our extra-dummy method has no failure probability either.

Our second batching method is quite straightforward: we make all the instances always agree on the same branch, and therefore execute the same operations.
If the kernel generator R is infinity in at least one of the parallel instances, then we force all instances to skip the if-branch and execute the else-branch. In this else-branch, there is a new scalar multiplication for T0. This was not needed before, because in the infinity case the l_i-torsion part of the point T0 has already vanished; but in our approach we force all instances to proceed as if all kernel generators were infinity, even though the l_i-torsion parts of some points T0 have not vanished. In particular, we define a new variable working as follows. When this variable equals 1, the above idea introduces some extra infinity-related computations, which in principle are not needed by every instance. These infinity-related computations are shown as listed. In this batching method, each instance still computes 404 isogenies, but more infinity-related computations. For this reason, we refer to this method as the extra-infinity method.

Again, the probability of this variable being equal to 1 is quite high when l_i is small, for example 3, 5, or 7. As a result, an increased number of infinity-related computations is expected, which affects the efficiency of the extra-infinity method. We mitigate this problem by considering the hybrid mode again. More precisely, we divide the primes l_i into two subsets, one for the batched component and the other for the un-batched component. L_unbatched contains only the smaller primes, whereas L_batched includes the remaining primes. In the same way, the bound list b and the secret exponent list e of each instance are split into two subsets as well. In the extra-infinity method, we first execute the batched component for the eight parallel instances to compute the isogenies for the larger primes with the corresponding subsets, and the batched component outputs a resulting curve E-hat for each instance. Then we execute the un-batched component sequentially in order to obtain the correct codomain curve for each instance.
In this way, far fewer infinity-related computations need to be performed than without the hybrid mode.

Okay, now let's turn to the third approach. Before introducing it, let's look at a few more details of the extra-dummy and extra-infinity methods. Consider an example where, in an iteration of the inner for-loop, n of the eight kernel points are infinity. The extra-dummy method will complete the computations of this iteration and later compute a compensatory isogeny with the un-batched component. On the other hand, the extra-infinity method will enter its else-branch to perform the computation for all eight instances, and it may later perform other infinity-related computations which in theory are not needed by any instance. Based on the operations that are carried out in each method, we observe that the extra-dummy method handles infinity cases more efficiently when n is small, while, on the other hand, when n is close to eight, the extra-infinity method is more efficient. Based on both observations, our idea is to combine the two approaches, aiming at a more efficient method. To do this, we set a new variable as listed and add an if-else statement to check whether this variable is within a predefined threshold. If it is not larger than the threshold, we perform the extra-dummy method; otherwise, we go to the extra-infinity method. From our experiments, the threshold value for our OAYT-style implementation is three. We call this third method the combined method.

So, in terms of our high-throughput implementation: for the class group action layer, we take advantage of the three different batching methods. The curve arithmetic is implemented according to the existing formulas, with minor optimizations. And for the prime-field operations, we developed an (8x1)-way implementation based on the limb-slicing technique.
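The per-iteration decision of the combined method can be sketched as follows; the threshold of three is the value reported in the talk for the OAYT-style implementation, while the function and label names here are my own.

```python
THRESHOLD = 3  # experimentally determined for the OAYT-style implementation

def choose_strategy(infinity_lanes):
    """Pick how one batched iteration handles its infinity cases.

    infinity_lanes: list of 8 booleans, True where the kernel generator R
    of that instance is the point at infinity.
    """
    n = sum(infinity_lanes)
    if n == 0:
        return "normal"          # no mismatch: all lanes take the real branch
    if n <= THRESHOLD:
        return "extra-dummy"     # few lanes affected: patch them later
                                 # in the un-batched component
    return "extra-infinity"      # many lanes affected: force all lanes
                                 # into the else-branch

assert choose_strategy([False] * 8) == "normal"
assert choose_strategy([True] + [False] * 7) == "extra-dummy"
assert choose_strategy([True] * 4 + [False] * 4) == "extra-infinity"
```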
For the field multiplication, we developed many different variants and finally selected the fastest one among them. We also developed a dedicated squaring based on the classical optimization technique that computes each repeated partial product only once. Since the vector multipliers of AVX-512F and IFMA are different, the implementations of the field operations in these two versions are, of course, quite different.

On the other hand, the low-latency implementation can also serve as the un-batched component in the hybrid mode of the high-throughput implementation. The class group action layer is just the same as the OAYT-style group action. The curve arithmetic can easily be parallelized in a 2-way fashion, and the number of needed 2-way multiplications and squarings is just half the number of the original 1-way multiplications and squarings. As for the prime-field operations, we developed a (2x4)-way implementation based on the implementation of Orisaka, Aranha, and López, which was originally designed for the field multiplication of SIDH. (2x4)-way means it performs two field operations in parallel, where each operation uses four elements of the vector. We slightly optimized the field multiplication by interleaving the integer multiplication with the Montgomery reduction, and, based on the same classical optimization technique, we developed a dedicated (2x4)-way squaring as well.

In order to figure out the real improvement of our work, we benchmarked our software on the CSIDH group action evaluation, for both the OAYT-style and the dummy-free-style implementations, on the same Ice Lake CPU. The speed-up ratio is defined by comparing the CPU cycles divided by the number of instances between the baseline and the specific implementation, which can be understood as a normalized throughput.
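The squaring optimization mentioned above, computing each repeated partial product only once and doubling it, can be sketched in Python for radix-2^52 limbs (the limb size matches IFMA; the helper itself is illustrative, not the paper's code):

```python
RADIX = 52
MASK = (1 << RADIX) - 1

def limb_square(a):
    """Schoolbook squaring that computes each cross product a[i]*a[j]
    (i < j) only once and doubles it, instead of computing it twice."""
    n = len(a)
    t = [0] * (2 * n)
    for i in range(n):
        t[2 * i] += a[i] * a[i]          # diagonal terms a_i^2
        for j in range(i + 1, n):
            t[i + j] += 2 * a[i] * a[j]  # doubled cross terms
    out, carry = [], 0                   # carry propagation to 52-bit limbs
    for limb in t:
        limb += carry
        out.append(limb & MASK)
        carry = limb >> RADIX
    return out

# Check against plain integer squaring
x = 0xfedcba9876543210fedcba9876543210
limbs = [(x >> (RADIX * i)) & MASK for i in range(3)]
assert sum(l << (RADIX * i) for i, l in enumerate(limb_square(limbs))) == x * x
```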
We use our target x64 implementation as the baseline, because in this way we know precisely how much our vector-processing techniques improve the result, and this x64 implementation also served as the baseline in other papers. As shown in the table, our 2-way low-latency IFMA implementation has roughly the same latency as the original non-constant-time implementation, and it is about 1.5 times faster than the baseline; our (8x1)-way IFMA implementation, when applying the combined batching method, achieves a 3.64 times higher throughput compared to the baseline. An analysis of the execution times of our high-throughput software shows that all the IFMA implementations are nearly 1.9 times faster than the corresponding AVX-512F implementations, which confirms that the IFMA extension indeed significantly accelerates CSIDH compared to general AVX-512F. The benchmarking results of the dummy-free-style implementations are summarized in this table. These results show that our proposed batching methods still work efficiently when applied to the dummy-free-style CSIDH group action, yielding up to 3.63 times higher throughput compared to the baseline.

Though AVX-512 can work on eight 64-bit elements simultaneously with a single instruction, the theoretical maximum speedup of an AVX-512 implementation over x64 is actually far from 8. The main reason is the multiplier. Take 512-bit integer multiplication using the schoolbook method as an example: the x64 implementation needs 64 multiplication instructions for one instance, while AVX-512F needs at least 256 vectorized multiplication instructions for eight instances, and IFMA requires 200 instructions for eight instances. Compared to the x64 implementation, the approximate speedup of AVX-512F is thus 2.0 and that of IFMA is 2.56. Taking this analysis into account, our throughput-optimized AVX-512 implementations achieve the expected speedups.
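The instruction-count argument above can be checked with simple arithmetic, using the figures given in the talk:

```python
# Multiplication-instruction counts for 512-bit schoolbook multiplication
x64_one_instance = 8 * 8                 # 8x8 products of 64-bit limbs
x64_eight = 8 * x64_one_instance         # eight independent instances: 512
avx512f_eight = 16 * 16                  # 16x16 products of 32-bit limbs,
                                         # eight instances packed in the lanes
ifma_eight = 10 * 10 * 2                 # 10x10 products of 52-bit limbs,
                                         # low and high halves: 200

assert x64_eight / avx512f_eight == 2.0  # expected AVX-512F speedup
assert x64_eight / ifma_eight == 2.56    # expected IFMA speedup
```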
As for the latency-optimized implementation, our 2-way IFMA latency-optimized implementation of CSIDH is 1.72 times faster than the x64 assembly implementation. We can thus conclude that our 2-way IFMA low-latency implementation also delivers the expected acceleration. There are several reasons that make the 2-way latency-optimized implementation less efficient than the throughput-optimized one: there are overheads caused by aligning and blending AVX-512 vectors in the 2-way curve and isogeny operations; some point operations cannot be parallelized in an ideal 2-way fashion due to the dependencies of the internal field operations; some computations in the field operations, for example the complete carry propagation, cannot be parallelized in an ideal (2x4)-way fashion due to the sequential dependencies of the instructions; and the instruction-level parallelism of the (2x4)-way arithmetic is lower than that of the (8x1)-way arithmetic, since four limbs are stored in one vector.

In this work, we have shown that vector engines like AVX-512 offer great potential to optimize CSIDH. We presented the first vectorized implementation of CSIDH; by developing efficient batching methods for the class group action and combining them with highly optimized field arithmetic, we were able to achieve a 3.6-fold gain in throughput compared to a state-of-the-art x64 implementation. The correct parameterization of CSIDH to achieve NIST security level 1 is currently still a topic of debate, but our proposed vectorization methods can also be used for larger primes, and certain parts of our source code can be reused. That's it. Thank you for your attention.