Hi, I'm Wessel van Woerden, and thank you for joining Eurocrypt. In this video I will be discussing our recent work on advanced lattice sieving on GPUs with tensor cores. This is joint work with Léo Ducas and Marc Stevens.

We'll start with a quick overview of our work. Most of the NIST post-quantum crypto finalists are based on hard lattice problems, and practical cryptanalysis is important to understand the concrete security of these schemes and to pick concrete parameters for them. Lattice sieving algorithms currently give the best practical runtimes for solving these hard lattice problems, and our question was: how fit are the different sieving algorithms for specialized hardware, in our case GPUs? This includes the more advanced sieving techniques that give large practical gains.

So, our contributions. We present the first GPU implementation that uses all state-of-the-art sieving techniques. It improves both the runtime and the energy efficiency by two orders of magnitude compared to a CPU-only approach, and it significantly improves several lattice problem records, meaning we are able to solve these problems in much higher dimensions. Additionally, we present the first optimized implementation of the asymptotically best known sieve, known as BDGL, both for CPU and GPU.

So what is a lattice? A lattice is a discrete group that is generated by some basis, and of course a lattice can be generated by multiple bases. Given any basis, the shortest vector problem asks us to find a shortest non-zero vector of the lattice, and in high dimensions this becomes a hard problem. A related problem is the bounded distance decoding (BDD) problem, where we are given a target that lies close to some lattice vector, and the goal is to return that lattice vector.

To give an indication of the concrete hardness of the shortest vector problem, we have the TU Darmstadt lattice challenges. In such a challenge we are given a random d-dimensional lattice, and the goal is to find a vector that is at most 5% longer than the expected minimum length of a shortest vector. We want to solve this problem using lattice sieving.

So what is lattice sieving? The idea is to start with a big list of exponentially many lattice vectors. Inside this list we try to find pairs of vectors that are close to each other, such that their difference gives a short lattice vector. We then replace a long vector in our list by this new shorter vector, and we repeat this until only short vectors are left.

We would like to execute these lattice sieving algorithms on specialized hardware, in our case graphics processing units, better known as GPUs. On a small part of a GPU you already have 64 floating point cores, 64 integer cores, and 8 tensor cores, about which I will say a bit more later. In total a GPU has thousands of cores, which is much more than the dozen or so you have on a CPU. However, on the CPU each core can work independently and execute its own instructions, while on the GPU the cores have to work in batches of at least 32 that execute a single instruction on all of those cores at the same time. Also, each of these cores has only a very limited number of registers and a small amount of local memory.

So what are these tensor cores? Originally they were developed for machine learning, and they can do one thing very well: efficient low-precision matrix multiplication. So what do I mean by low precision?
Well, I mean 16-bit floating point precision, whereas a normal float would be 32 bits and a double 64 bits. This 16-bit precision turns out to be good enough for our purposes. And by efficient, I mean that on the older models we used, a single GPU can reach up to 108 16-bit teraflops, and the newer models can even reach more than 300 16-bit teraflops. If we compare this to a current best CPU with, say, 64 cores, that would reach at most about five teraflops, so you can already see the two orders of magnitude of improvement that these GPUs offer.

So let's discuss the pros and cons of these GPUs, starting with the cons. First, as I already mentioned, these devices are not very flexible: 32 cores have to execute the same instruction, while on a CPU every core can do what it wants. Secondly, a GPU is an external device, which means there is a slow connection between the main memory and the memory of the GPU. In our server, for example, this connection goes over a PCIe bus and is limited to only 16 gigabytes per second. Now you can imagine that if you have a GPU that can reach a performance of 100 teraflops, you have to execute a lot of floating point operations per byte transferred to the GPU, because otherwise you become heavily memory-bottlenecked. Also inside the GPU, memory bottlenecks often limit the actual performance. And lastly, it is hard to adapt algorithms: for many algorithms that are in principle serial, or that can at best be parallelized over multiple CPU cores, it can still be a whole research project to adapt them to GPUs, such as our work. But if everything works together, the pros are clear: you can get incredible performance, hundreds of teraflops instead of just a few teraflops on a CPU. Additionally, these devices are very energy efficient, so you get this incredible performance at about the same energy cost, or maybe a factor of two more.

We didn't build our implementation from scratch; we extended the so-called General Sieve Kernel, from a work from 2019. This is an open-source sieving framework and implementation that combines all the advanced sieving techniques, including all the implementation tricks you can think of, and it is fully parallelized over the CPU cores of a single machine. It solved the SVP record at dimension 155 in about two weeks, running on only 72 CPU cores and 256 gigabytes of memory. Compare this to the old way of solving the shortest vector problem with enumeration, which is an asymptotically inferior method but uses only polynomial memory and hence is trivial to parallelize: the old enumeration record was at dimension 152, but it used more than 800,000 core hours. So it is clear that the sieving methods, both asymptotically and in practice, significantly improve on enumeration.

Now it is time to discuss these advanced sieving methods and how we adapted them to GPUs. In general, our sieving process goes as follows. We have our big database of lattice vectors, and then we have three phases that repeat. First we have a bucketing phase, where the database is partitioned into multiple buckets. Then we have a reduction phase, where inside each bucket we check all pairs for close vectors whose difference gives a shorter vector. And then we insert these shorter vectors back into the database, and we repeat.
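To make this three-phase loop concrete, here is a minimal Python sketch of one bucket-reduce-insert pass. This is an illustration under simplified assumptions, not our actual GPU code; the function name and the bucketing and replacement rules here are hypothetical.

```python
import numpy as np

def sieve_pass(db, centers):
    """One toy bucket -> reduce -> insert pass over a database of lattice
    vectors (the rows of db). A simplified sketch, not the real pipeline."""
    norms = np.einsum('ij,ij->i', db, db)
    # Bucketing: send each vector to the center it is most aligned with.
    assignment = np.abs(db @ centers.T).argmax(axis=1)
    for j in range(centers.shape[0]):
        bucket = np.flatnonzero(assignment == j)
        # Reduction: check every pair inside the bucket.
        for a_idx, a in enumerate(bucket):
            for b in bucket[a_idx + 1:]:
                w = db[a] - db[b]
                wn = w @ w
                # Insertion: replace the longer of the two vectors if the
                # difference is shorter (and non-zero).
                longer = a if norms[a] >= norms[b] else b
                if 0 < wn < norms[longer]:
                    db[longer], norms[longer] = w, wn
    return db
```

In the real implementation each of these phases is heavily batched and runs on the GPU, and sums u + v are considered as well as differences; the sketch only shows the overall data flow.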
So we start by discussing the bucketing, then the reduction part, and then some additional advanced techniques.

First, the bucketing. What we want to do is partition the sphere, and thereby partition our database. Suppose we have some bucket center c; then we want to find all the vectors in our database that point somewhat in the direction of c. This gives us a bucket, and we then only check pairs within each bucket for being close to each other. Because these vectors already point somewhat in the same direction, this heavily increases the reduction probability per pair.

So let's discuss some bucketing techniques. First, as a benchmark, suppose we have no bucketing at all. Then we have to check all pairs, so with a database of n vectors we have to check n^2 pairs, and this gives a time complexity of 2^(0.415d) in the dimension d.

The first bucketing method I want to discuss is the BGJ1 bucketing method. Here random spherical cones form the buckets: we have about sqrt(n) buckets, each of size about sqrt(n). If you then do the computation, then because of this improved probability of finding reductions, the time complexity goes down to 2^(0.349d). Additionally, you can replace these random directions by random lattice vectors. Because the bucket centers are then lattice vectors themselves, we get two extra properties. First, you can immediately also find reductions with the bucket center, because you are already computing those inner products anyway. Secondly, the bucket center can be used to check not only pairs but also triples that might give a short vector. And note that to see in which bucket a vector belongs, we need to compute inner products with lattice vectors; if you write this out for the whole database, you actually get a matrix product, and that is exactly what tensor cores are good at. A small sketch of this follows below.

Now let's discuss another sieve, the BDGL sieve. This one uses structured spherical cones, where the structure comes from a product code. Using this structure we can find an appropriate bucket for each vector much more quickly, and as a result, for some parameter k, we can have many more buckets of a much smaller size. In the end, depending on this parameter k, we can get a lot of small buckets very cheaply, and this results in a time complexity of 2^(0.292d) if you take k small, say logarithmic in the dimension.

For our implementation we use some additional tricks: instead of using explicit product codes and explicit vectors, we use implicit directions. We take a lattice vector, apply permutations and Hadamard transforms, and then just read off the coefficients of the result to do the bucketing; you can read more about this in the paper. The nice thing about this method is that it is very suitable for vectorized CPU instructions and also for GPUs. In fact, we have an AVX CPU implementation, and we merged it upstream into the General Sieve Kernel, the open-source implementation. It is currently the fastest CPU sieve there, and it beats all the other sieves starting around dimension 80 or 90. I also want to mention that the BGJ1 sieve was the one used by the General Sieve Kernel to set the old record at dimension 155.
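As promised above, here is a hedged numpy sketch of bucketing as a single matrix product: scoring the whole database against all bucket centers at once is exactly the tensor-core-shaped workload. The fp16 cast mimics tensor-core precision, and the top-`bucket_size` selection rule is a simplifying assumption, not the exact rule from the paper.

```python
import numpy as np

def bucket_by_matmul(db, centers, bucket_size):
    """Toy BGJ1-style bucketing. All |<v_i, c_j>| scores come from one
    matrix product, which is the operation tensor cores accelerate."""
    x = db.astype(np.float16)            # 16-bit, as on the tensor cores
    c = centers.astype(np.float16)
    scores = np.abs(x @ c.T).astype(np.float32)  # n x (#buckets) scores
    buckets = []
    for j in range(c.shape[0]):
        # Keep the vectors pointing most in the direction of center j
        # (for simplicity we do not normalize by the vector lengths).
        buckets.append(np.argsort(-scores[:, j])[:bucket_size])
    return buckets
```

If the centers are themselves lattice vectors, the same score matrix also exposes the reductions with the bucket centers for free, as mentioned above.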
So we implemented these different bucketing techniques on the GPU, and they show the following practical quality, by which I mean the number of reductions we find inside each bucket compared to what we would optimally expect in theory. The black bars here indicate the idealized performance if all buckets were perfectly random. The blue bars show our BGJ1 implementation, and the red and orange ones show the BDGL implementation for k equal to 1 and k equal to 2. What we see is that BGJ1 and BDGL with k equal to 1 are very performant and come close to the idealized performance. For the k equal to 2 BDGL version we see a slight deterioration in quality, but note that computing these buckets is much, much faster.

So let's now discuss the reduction part. Suppose we have a bucket with a lot of vectors, and for each pair we want to check if they are somewhat close to each other. If we write out this distance and precompute all the lengths, the only thing we actually need to compute is the inner product between the two vectors. So we need to compute pairwise inner products, and this is exactly what tensor cores are good at. Additionally, we need sparse output that only returns the successful pairs; if we don't do this, we get a clear memory bottleneck. But this means that we can't use any standard methods that have already been implemented: we need to implement our own highly optimized matrix product (a small sketch of the underlying check follows below). Note also that the number of computations needed to obtain all these pairwise inner products is quadratic in the bucket size B, while the data you need to send to the GPU is only linear in B. So the ratio between the amount of computation and the amount of data transfer improves for larger bucket sizes.

So what does this mean for the performance? Well, if we exclude the overhead of transferring the data to the GPU, then we already get good performance at a bucket size of about 4,000: roughly 65 teraflops. This compares very well to the theoretical limit of 108 teraflops, because we did not count any of the instructions that move data within the GPU. However, if we include the overhead of sending the data to the GPU, we see that the performance drops significantly at these bucket sizes, and only at bucket sizes of about 16,000 or higher do we get reasonable performance. So in short, for small buckets, sending the data to the GPU is the actual bottleneck, and we need large buckets to reach optimal performance. However, large buckets go against the whole idea of the BDGL bucketing scheme.

So what does this mean for our implementations? In a CPU-only implementation we do not need these very large buckets, and if we compare the BGJ1 sieve, indicated by the red line here, with our CPU BDGL implementation with k equal to 3, then the BDGL implementation becomes much faster in higher dimensions, with a crossover already at dimension 90. In fact, in the implementation that we later merged into the General Sieve Kernel, this crossover lies even a few dimensions lower. For the GPU, however, we have this minimum bucket size for optimal performance, and as a result the BGJ1 implementation is actually faster than the BDGL implementation in all dimensions in which we could sieve. We estimate that the crossover point lies only around sieving dimension 150 or 160.
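Here is the small sketch of that reduction check promised above, again as a hedged illustration rather than our CUDA kernel. It uses the identity ||u - v||^2 = ||u||^2 + ||v||^2 - 2<u, v>: with precomputed norms, a single fp16 Gram matrix yields all pairwise distances, and only the sparse list of successful pairs is returned.

```python
import numpy as np

def reducible_pairs(bucket):
    """Return the sparse list of pairs (i, j) in a bucket whose difference
    is shorter than the longer of the two vectors, via one Gram matrix."""
    x = bucket.astype(np.float16)              # 16-bit, as on tensor cores
    gram = (x @ x.T).astype(np.float32)        # all pairwise inner products
    norms = np.diag(gram)                      # ||v_i||^2 on the diagonal
    # ||v_i - v_j||^2 from the norms and the inner products only.
    dist_sq = norms[:, None] + norms[None, :] - 2 * gram
    good = dist_sq < np.maximum(norms[:, None], norms[None, :])
    i, j = np.nonzero(np.triu(good, k=1))      # sparse output: pairs only
    return list(zip(i.tolist(), j.tolist()))
```

A real sieve would also test the sums u + v, but the computation-to-data ratio is the same either way: B^2 inner products for only B transferred vectors, which is why large buckets pay off.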
So here we see a significant difference between the CPU implementation and the GPU implementation: we get different trade-offs and a different crossover point for when the asymptotically best sieve actually takes over from the more practical but asymptotically worse sieve.

So we have discussed the core parts of our implementation; now let's discuss some of the more advanced parts. One technique that is very important in practice is the so-called dimensions for free. Instead of sieving in the full lattice, we only sieve in a projected sublattice, projected away from the first, say, ℓ basis vectors. After sieving we have a big list of short vectors in this projected context, and we lift them back to the full lattice using Babai lifting. Because we have this many vectors, one can show that we can still find the shortest vector of the original lattice for ℓ up to about d over log d. We can combine this with so-called progressive sieving: instead of fixing ℓ a priori, we start with an ℓ that is pretty large, and thus a sieving dimension that is pretty small, and then slowly, step by step, decrease ℓ until the lifting actually finds the shortest vector of the full lattice. And if we additionally do on-the-fly lifting, meaning we lift any short vector that we encounter, then we are much more likely to find this shortest vector.

This gives us an extra problem: can we efficiently detect, for any pair of vectors within a bucket, whether their difference might lift to a short vector in the full lattice? What this actually gives us is a BDD problem in the small context between indices 1 and ℓ. One way to solve this BDD problem is Babai lifting, as mentioned before, but that has a running time quadratic in the lifting dimension. And that is why we introduce the so-called dual hash.

Given a lattice, you can define the dual lattice as all vectors in the span that have an integer inner product with all vectors of the primal lattice. If you pick one of the dual vectors, you can partition the lattice into hyperplanes, where each hyperplane contains all the lattice points that have a certain integer inner product with this dual vector. Now, if a target lies somewhat close to the lattice, then as a result its inner product with this dual vector is also expected to be close to an integer. Our dual hash does not consist of just a single dual vector but of multiple dual vectors c_1, ..., c_k, and the dual hash of a target consists of all inner products between the target and these dual vectors. If the distance of the target to our lattice is small, then we can also expect the distance of this dual hash to the integer lattice Z^k to be small.

Note that to compute the first distance we need Babai lifting, which is quadratic in the lifting dimension, while the second is clearly linear in k once we have computed the dual hash. And we do not apply this to the full dimension, only to a small lifting dimension of, say, 16, and then using 32 or 48 dual vectors already seems enough for a very strong correlation. Also note that we can efficiently compute the dual hash of the difference of two targets, because this function is linear. And last but not least, this is very suitable for GPUs: while Babai lifting is a very serial computation, this one is easily parallelized.
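To illustrate, here is a minimal numpy sketch of such a dual hash filter. The dimensions, the threshold value, and the function names are assumptions made for the illustration; the real implementation works on large batches on the GPU.

```python
import numpy as np

def dual_hash(D, t):
    """Dual hash of a target t: its inner products with k dual vectors
    (the rows of D). This map is linear in t."""
    return D @ t

def passes_filter(h, threshold):
    """Accept if the hash lies close to the integer lattice Z^k, i.e. all
    inner products are near-integer; a cheap proxy for a full Babai lift."""
    frac = h - np.round(h)                 # per-coordinate distance to Z
    return float(frac @ frac) < threshold

# Usage sketch: filter the difference of two targets before lifting.
rng = np.random.default_rng(0)
D = rng.standard_normal((32, 16))          # say 32 dual vectors, rank 16
h_u = dual_hash(D, rng.standard_normal(16))
h_v = dual_hash(D, rng.standard_normal(16))
# By linearity, the hash of u - v is just h_u - h_v: per pair this costs
# only k subtractions, instead of a quadratic Babai lift.
if passes_filter(h_u - h_v, threshold=2.0):
    pass  # only now run the expensive Babai lifting on u - v
```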
So what is the correlation that we get? On the left we have the value of our dual hash, and on the bottom we have the distance of our target to the lattice. If you take a random target, then most of these targets lie very far from the lattice, and only a few lie close enough to be interesting to us. If we set our filter threshold appropriately, then we do not have to compute the Babai lift for all the far-away pairs, but only for the few pairs that fall below the threshold, and that saves a lot of computation.

So last but not least, we have to discuss memory. Lattice sieving algorithms are very memory-intensive, and in our runs we had to store up to half a billion vectors, so every byte we can save per vector saves us tens or hundreds of gigabytes in the end. However, the General Sieve Kernel stores lots of information per vector: the coefficients x in the basis representation, the so-called Gram-Schmidt representation, the length of the vector, a unique identifier, a precomputed lift target, the precomputed dual hash if we use it, but also a popcount value that is used to quickly decide if two vectors are somewhat close to each other. We remove all of this except the x, the length, and the unique identifier, and this reduces the memory by more than 60%. So why can we do this? Well, this is actually an additional benefit of the fact that the buckets must have a minimum size of, say, 16,000. The overhead of computing all these things on the fly inside a single bucket is only linear in the bucket size, times, say, d or d^2, while the reduction check that follows is quadratic in the bucket size. So for these large buckets, the overhead of recomputing everything on the fly on the GPU is negligible.

So what does this GPU implementation give us in terms of the TU Darmstadt lattice challenges? Well, we got a new record at dimension 180. Here on the bottom we show the dimension of these challenges, and on the left the wall time, which is just the real elapsed time of these experiments. The purple points represent the enumeration records, which are clearly very slow, and note that they are this slow while still running on a supercomputer. The newer records by the General Sieve Kernel, up to dimension 155, are already significantly faster, and these ran on a single machine. And now in our work we see a speedup of about two orders of magnitude over those results, while still executing all of this on a single machine with a few GPUs, and with even fewer CPU cores than were available for the General Sieve Kernel record. We stopped at dimension 180, which ran in about 51 days, and we didn't stop because we were unwilling to run longer: we stopped because we reached our maximum RAM size of one and a half terabytes. To go any higher, we would need more RAM, or we would have to save even more memory.

So to conclude: lattice sieving algorithms can be implemented efficiently on GPUs, including all the more advanced techniques. The memory bottlenecks disappear when the buckets are large enough, and this has the extra benefit of saving lots of memory at negligible overhead. However, it is important to note that while BDGL beats BGJ1 in practice on CPUs, the crossover on GPUs lies much, much higher. So thank you for watching this video, and here are some references if you want to know more.