Hi, I'm Wonkyung Jung from Seoul National University, and this is a talk about over 100 times faster bootstrapping in fully homomorphic encryption through memory-centric optimization with GPUs. So let's begin.

So what is homomorphic encryption, or HE? HE is a cryptographic scheme that enables computation on encrypted messages. Here, a user encrypts messages 1 and 2 into ciphertexts. Then she sends the ciphertexts to a service provider, who can compute on the ciphertexts without ever decrypting them. After that, the provider sends back ciphertext 3, which is the output of the homomorphic multiplication between ciphertexts 1 and 2. After decryption, the user gets message 3, which is the same as the product of the two messages, message 1 and message 2.

There are several popular HE schemes, and each of them supports different data types and different HE operations. Among them, we choose CKKS as our target in this paper. It supports fixed-point numbers and real-number computation, so it is widely used in many real applications such as machine learning tasks.

The main problem of HE is its extremely high cost. First, the ciphertexts are extremely large. The size of a single ciphertext in CKKS reaches even hundreds of megabytes, because each ciphertext consists of polynomials with many coefficients and a high degree. Because of that, the HE operations are heavy. Compared to native integer or floating-point operations, they are slower by 200 times for addition, or by even up to 10,000 times for multiplication. Moreover, bootstrapping is the heaviest operation in HE. Bootstrapping, a concept introduced by Gentry in 2009, is an operation composed of other HE operations. After each HE operation on a ciphertext, noise accumulates in the ciphertext, and this limits the number of operations we can apply to it. By applying bootstrapping before the noise grows too large, however, we can apply any number of HE operations, and this is called fully homomorphic encryption. But its cost is extremely high: it takes even dozens of seconds in a recent single-threaded environment.

This is where hardware accelerators can do their job. First, there is massive parallelism in HE. HE uses the number-theoretic transform, or NTT, for its polynomial multiplications. It is an integer version of the discrete Fourier transform, and as with the DFT, it converts the O(N^2) complexity of polynomial multiplication, which is a convolution, into O(N log N) complexity. The second thing is the residue number system. It uses the ancient Chinese remainder theorem, representing a big number as a set of residues, r_0 to r_L, obtained by modular reduction with a given set of primes, q_0 to q_L. It then turns the multiplication between two big integers into element-wise modular multiplications. These two algorithms expose large parallelism in HE, and as you know, GPUs can exploit it.

So why do we choose GPUs over CPUs? They both have dozens of cores, but GPUs have much higher integer operation throughput per core per cycle than CPUs. GPUs also outperform CPUs by around 10 times both in total integer throughput and in main memory bandwidth. For this reason, more and more HE research papers are exploiting GPUs.
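To make the two sources of parallelism just mentioned concrete, here is a small sketch in my own notation (not taken from the slides). The NTT turns the coefficient-wise convolution of polynomial multiplication into an element-wise product, and the RNS turns big-integer arithmetic into independent per-modulus arithmetic:

```latex
% Sketch (my notation). NTT: polynomial multiplication becomes element-wise,
% turning the O(N^2) convolution into O(N log N):
\[
  \mathrm{NTT}(a \cdot b) \;=\; \mathrm{NTT}(a) \odot \mathrm{NTT}(b)
\]
% RNS/CRT: a big integer x modulo Q = q_0 q_1 \cdots q_L is stored as residues,
% and multiplication becomes independent element-wise modular multiplications:
\[
  [x]_{\mathrm{RNS}} = (x \bmod q_0,\, \dots,\, x \bmod q_L), \qquad
  [x \cdot y]_{\mathrm{RNS}} = (x_0 y_0 \bmod q_0,\, \dots,\, x_L y_L \bmod q_L)
\]
```

Both the element-wise products over the N coefficients and the independent moduli map naturally onto thousands of GPU threads.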
So in this work, we present the first GPU implementation of bootstrapping for a recent CKKS scheme. We also analyzed the severe memory bandwidth bottleneck in the baseline GPU implementation, and we applied software-based memory-centric optimizations to mitigate it, which results in bootstrapping around 200 times faster than a single-threaded CPU case.

So let me first describe the recent full-RNS CKKS scheme, which appeared in HK20. These are some parameters. The multiplicative level of the parameter set is denoted L, while the primes are denoted q_0 to q_L and p_1 to p_k. The big moduli used in this parameter set, the products of these primes, are called Q and P. There is a secret polynomial s(X) and an error polynomial e(X), whose coefficients are small. We represent each plaintext as m(X) or M. The plaintext in coefficient form is written m(X), which lives in a polynomial ring called R_Q, whose coefficients are integers modulo Q and whose degree is less than N. In NTT form, we represent the polynomial m(X) as M. A plaintext is actually an encoded vector of up to N/2 complex numbers. Each ciphertext is represented as a pair of polynomials b(X) and a(X) in coefficient form, or B and A in NTT form. From a ciphertext, we get the plaintext by decryption, which is the dot product between the ciphertext ct and (1, s), where s is the secret key; the result is the plaintext M plus a small error e.

I will explain some key operations in CKKS. There is a plaintext multiplication, PMult, which is just an element-wise multiplication between a plaintext M and a ciphertext ct. The ciphertext multiplication, on the other hand, is much more complicated. It first computes a tensor product between the input ciphertexts 1 and 2, and from the three output polynomials d_0 to d_2, we perform key switching on d_2 with a key called the multiplication key. Then we add the output of the key switching to (d_0, d_1). There is another operation called ciphertext rotation, which does a circular shift on the ciphertext's message vector by a rotation index r. It performs an automorphism that maps X to X^(5^r) on both a(X) and b(X). Then, as in multiplication, it applies key switching, but with its own rotation key for r.

The difference between prior works and HK20 is that HK20 introduces a new key switching method called generalized key switching, or hybrid key switching. So let's understand the key switching in HK20. There is a polynomial A with modulus Q, and we do key switching with a switching key swk. First, we split the residue set of the polynomial A into dnum parts, from Q_0 to Q_{dnum-1}, for a given dnum parameter, where dnum is short for decomposition number. Each residue set after decomposition has alpha moduli. Then, ModUp raises the modulus of each residue set from Q_i to PQ. Here, the parameter P is set to be larger than any Q_i. The third step is the inner product. The key swk is a vector of ciphertexts of size dnum with modulus PQ. We multiply each part of the input with the corresponding element of the key and sum them up. Finally, we reduce the modulus of the output back to Q and multiply by 1/P; this step is called ModDown. The point here is that we decompose the polynomial into multiple small polynomials.
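To summarize the four steps just described in formulas (a sketch in my own notation; each swk_i is itself a ciphertext, i.e., a pair of polynomials, which the sketch glosses over; see HK20 and the paper for the exact definition):

```latex
% Step 1 (decompose): split the residues of a into dnum groups of alpha moduli each
\[ a \;\longmapsto\; (a_0, a_1, \dots, a_{\mathrm{dnum}-1}), \qquad a_i \in R_{Q_i} \]
% Step 2 (ModUp): raise the modulus of each part from Q_i to PQ
\[ \tilde{a}_i \;=\; \mathrm{ModUp}(a_i) \in R_{PQ} \]
% Step 3 (inner product): multiply-accumulate with the switching key, modulo PQ
\[ \tilde{c} \;=\; \sum_{i=0}^{\mathrm{dnum}-1} \tilde{a}_i \cdot \mathrm{swk}_i \pmod{PQ} \]
% Step 4 (ModDown): scale by 1/P and reduce the modulus back to Q
\[ c \;=\; \big\lfloor P^{-1}\,\tilde{c} \big\rceil \pmod{Q} \]
```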
Then how do we set the parameters? One thing to know is that a large PQ value lowers security. The dnum parameter also affects security, as well as the computational complexity and key size. For a fixed Q, when dnum is 1, which is the minimum, the value P must be almost as large as Q, increasing PQ and therefore decreasing security. However, it reduces the total size of the key switching key, which then has only a single element. On the other hand, we can increase the dnum value up to L, which is the maximum. In this case, P only needs to be about Q^(1/L), which increases security, but we need a key with L elements. Meanwhile, the computational complexity is minimized somewhere between dnum = 1 and dnum = L. So the previous work chooses the value that minimizes the number of modular multiplications. However, our observation is that on GPUs we are more memory-bottlenecked.

These are the last-level cache sizes of modern CPUs and GPUs. Compared to CPUs, GPUs have only several megabytes of cache, which can hardly accommodate ciphertexts whose sizes reach dozens of megabytes. This makes HE operations running on the GPU more memory-bottlenecked. So this is the latency of CKKS multiplication over different dnum values, both on a CPU with eight threads and on a GPU. As you can see, the CPU implementation performs best with the dnum that minimizes the number of modular multiplications. The GPU, on the other hand, performs best with the dnum value that minimizes the number of total memory accesses, especially main memory accesses.

So we made a GPU roofline plot of the functions comprising an HE multiplication. If a point is close to the sloping roof, it has low arithmetic intensity, implying the function is bottlenecked by main memory bandwidth rather than by the arithmetic units. On the other hand, if a point is under the flat roof, it is compute-bound. As you can see from the figure, most of the functions are bounded by main memory bandwidth on the GPU. This motivated us to focus on memory-centric optimizations in the GPU implementation.

First, let me give you a brief introduction to the contemporary GPU programming model. A GPU has many streaming multiprocessors, called SMs, and each SM runs groups of threads in parallel. A function executed on a GPU is called a kernel, and GPU threads run the same kernel instructions in parallel on the SMs; this is what NVIDIA calls the SIMT architecture. Each kernel is configured with the number of threads it uses and the amount of shared memory it uses. Shared memory is a user-managed scratchpad memory, which is extremely fast but also small.

We built our baseline GPU implementation based on prior works that implement HE schemes, adopting, for example, a cache-friendly data layout, how to launch threads in a kernel, fast NTT implementations, and so on. Our key contribution is that we applied kernel fusion techniques on top of the baseline GPU implementation. Kernel fusion, or operation fusion, is a common technique that fuses multiple GPU kernels into a single kernel. Here we have kernels A and B. Each kernel reads data from the GPU's main memory, which is also called global memory, shown as gmem here. Then the kernel computes on the data and writes back to global memory after computing. After kernel fusion, the two kernels are fused into a single kernel.

There are two advantages to the kernel fusion technique. First, it saves some amount of global memory accesses. Here, as we fuse the two kernels, the global memory reads and writes between them are converted into register reads and writes. This is because we can reuse the data in registers or shared memory, which are much faster than DRAM. This is especially good for kernels with low operations per byte, because they are mostly bottlenecked by main memory bandwidth. Second, we can reduce the kernel launch overhead if the kernels are small. Each kernel has its own launch overhead, which is not negligible if the kernel is small enough. We can fuse such small kernels to avoid the launch overhead between them.
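As a concrete illustration of the general idea (a minimal sketch with made-up element-wise modular operations and sizes, not our actual HE kernels), the unfused path below launches two kernels and makes a global-memory round trip for the intermediate array, while the fused kernel keeps the intermediate value in a register:

```cuda
// Minimal kernel-fusion sketch (illustrative only; not the HE kernels from the talk).
// Unfused: tmp = a*b mod q, then out = tmp+c mod q -> two launches, tmp round-trips through gmem.
// Fused:   out = (a*b + c) mod q                   -> one launch, intermediate stays in a register.
#include <cstdio>
#include <cstdint>
#include <cuda_runtime.h>

__global__ void mul_mod(const uint64_t* a, const uint64_t* b, uint64_t* tmp,
                        uint64_t q, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tmp[i] = (a[i] * b[i]) % q;              // intermediate written to global memory
}

__global__ void add_mod(const uint64_t* tmp, const uint64_t* c, uint64_t* out,
                        uint64_t q, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = (tmp[i] + c[i]) % q;            // intermediate read back from global memory
}

__global__ void fused_mul_add_mod(const uint64_t* a, const uint64_t* b, const uint64_t* c,
                                  uint64_t* out, uint64_t q, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        uint64_t t = (a[i] * b[i]) % q;                 // intermediate lives in a register
        out[i] = (t + c[i]) % q;
    }
}

int main() {
    const int n = 1 << 20;
    const uint64_t q = (1ULL << 30) - 35;               // toy 30-bit modulus, so a*b fits in 64 bits
    uint64_t *a, *b, *c, *tmp, *out1, *out2;
    cudaMallocManaged(&a, n * sizeof(uint64_t));
    cudaMallocManaged(&b, n * sizeof(uint64_t));
    cudaMallocManaged(&c, n * sizeof(uint64_t));
    cudaMallocManaged(&tmp, n * sizeof(uint64_t));
    cudaMallocManaged(&out1, n * sizeof(uint64_t));
    cudaMallocManaged(&out2, n * sizeof(uint64_t));
    for (int i = 0; i < n; ++i) { a[i] = i % q; b[i] = (3ULL * i) % q; c[i] = (7ULL * i) % q; }

    int threads = 256, blocks = (n + threads - 1) / threads;
    mul_mod<<<blocks, threads>>>(a, b, tmp, q, n);                // unfused: two kernel launches
    add_mod<<<blocks, threads>>>(tmp, c, out1, q, n);
    fused_mul_add_mod<<<blocks, threads>>>(a, b, c, out2, q, n);  // fused: one launch
    cudaDeviceSynchronize();

    printf("results match: %s\n", out1[n - 1] == out2[n - 1] ? "yes" : "no");
    return 0;
}
```

In the fusions described next, the fused kernels are of course much larger, but the benefit is the same: intermediate polynomials stay in registers or shared memory instead of round-tripping through global memory.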
In this work, we found many kernel fusion opportunities, both within a single HE operation and across HE operations. We first introduce the intra-HE-operation fusions. The first one is ModUp fusion. ModUp actually consists of multiple functions, so let's look inside it. This is the computational graph of ModUp, including INTTs, base conversions, and NTTs. We show each single kernel as a gray rounded square. First, because the decomposed inputs are all in the NTT domain, we perform INTT first, and then we perform base conversions, which can raise or reduce the modulus; in this case, we raise the modulus of each input. After that, we apply NTT to enable multiplication with the switching key. Here, the base conversion itself also consists of two functions, scaling and matrix multiplication. For more details, please see the paper.

What we do here is fuse multiple small INTT kernels into a large one. Because the modulus of each decomposed part is small, Q_0 to Q_{dnum-1}, their kernels are small. By fusing the small INTT kernels, we reduce the kernel launch overheads and increase SM utilization on the GPU. Second, we also fuse the scaling kernels, which are just element-wise operations. Before saving the INTT outputs to main memory, we perform the scaling operations, saving the memory reads and writes of the memory-bound scaling kernel.

The second one is inner-product fusion. You can see in this figure that, in the baseline implementation, there are multiplication kernels and addition kernels for the multiplication with a key switching key. In this fusion, we perform all of these kernels in a single big kernel.

The last one is ModDown fusion. Let's look at ModDown first. In ModDown, we first split the residue set of the input polynomial into two sets, the Q-part and the P-part. This is the computational graph after the split. For the P-part, we apply INTT, base conversion, and NTT, and subtract the result from the Q-part. Then we scale the output by 1/P. In ModDown fusion, we fuse the last three kernels like this: we fuse the subtraction kernel and the scaling kernel with the preceding NTT kernel. Because the subtraction and the scaling are both element-wise operations and they are all memory-bound, they benefit from kernel fusion.

We evaluated the intra-HE-operation fusions in a single-GPU environment. This shows the latency breakdown of homomorphic multiplication and rotation after applying our techniques. After applying all the fusions, up to ModDown fusion (MDF), we get almost two times speedup with the maximum dnum. Most of the speedup comes from the inner-product fusion. This is because the key sizes are extremely large with such a large dnum, dominating the overall multiplication and rotation times. And what about smaller dnums? The kernel fusions become less effective because the portion of the inner product decreases significantly, although they still give around 1.5 times speedup. We can compare our performance results with those of a prior GPU implementation, which uses the maximum dnum. After applying all the fusions and using a smaller dnum, we get 7 times speedup, where the prior work reported around 50 milliseconds of multiplication time. So these were the results of the intra-HE-operation fusions. Then what about inter-HE-operation fusions?
We applied inter-HE-operation fusion in bootstrapping. Before we get into it, let me first explain bootstrapping in the recent CKKS scheme. Bootstrapping itself is an HE circuit made up of many HE operations; for the detailed algorithm, please see the paper. We first show the breakdown of a single bootstrapping latency on a GPU. It shows that most of the time is spent in the functions called slot-to-coefficient and coefficient-to-slot, which are homomorphic linear transformations taking up around 60% of the time.

Then how do we compute the homomorphic linear transformations in bootstrapping? A linear transformation in bootstrapping is represented as a matrix-vector multiplication. The vector operand is a ciphertext whose message is a vector of complex numbers, and the matrix operand is a sparse diagonal matrix where each diagonal is a plaintext. Then how do we compute this matrix-vector multiplication? An algorithm called baby-step giant-step (BSGS) is used here. The multiplication between the sparse diagonal matrix M and the vector v is shown in this equation: each i-th diagonal of M, which is precomputed, is multiplied element-wise with the input vector rotated by i, which is a ciphertext, and the results are summed. The baby-step giant-step algorithm turns this equation into another one with two loops, with loop variables l and k. Here, the rotated vector is a ciphertext, and the corresponding multiplicand is a precomputed plaintext. So in BSGS, by setting both l and k to around the square root of n, the number of expensive HE rotations is reduced from O(n) to O(sqrt(n)).

So this is the inter-HE-operation fusion, which we call mult-and-add batching. A naive sequence of multiplications and additions requires multiple memory accesses to the temporary ciphertext, as shown in the left blue box. However, we can fuse all the kernels in the blue box into a single kernel, shown at the bottom, saving most of the global memory reads and writes to the temporary ciphertext.

We also applied an optimization called hoisting. In baby-step giant-step, we rotate a single ciphertext by multiple rotation indices, as shown in the blue box on the right side. That is, we apply different automorphisms and then do multiple key switchings. The hoisting technique changes the operation order to do ModUp first on the ciphertext, and only after that does it perform the different automorphisms. This optimization saves multiple ModUps.

So this is the bootstrapping latency on a GPU after incrementally applying the optimizations. Even before we apply any optimization, we already get a bootstrapping latency under a second. After applying all the optimizations, we get over 200 times speedup compared to a single-threaded CPU case. We also evaluated the effectiveness of our implementation in a real application: in training a binary classification model, we get 40 times speedup compared to an 8-threaded CPU implementation.

Finally, we propose a metric called amortized multiplication time. This is the multiplication time that also accounts for the bootstrapping cost. The metric is the bootstrapping time plus the multiplication times, divided by the number of multiplications available after bootstrapping. This is the latency of amortized multiplication time and its bootstrapping overhead, both in the optimized version and in our baseline implementation. Our implementation largely mitigated the memory bottleneck, especially at large dnum values. However, we can still see that most of the time is spent on bootstrapping.
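For reference, the amortized multiplication time metric used in this evaluation can be written as follows (my notation):

```latex
\[
  T_{\text{amortized mult}}
    \;=\;
  \frac{T_{\text{bootstrapping}} \;+\; \sum_{i} T_{\text{mult},i}}
       {\#\,\text{multiplications available after bootstrapping}}
\]
```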
Our future work will be designing custom hardware that reduces the bootstrapping cost. So this is the conclusion. We demonstrated a GPU implementation of a recent CKKS scheme and the first GPU implementation of its bootstrapping. We found that the memory bottleneck is the key obstacle in the GPU implementation, so we applied memory-centric optimizations, leading to a large speedup compared to CPU implementations. These are the references used in these slides. Please refer to the paper. Thank you.