Good morning, everyone; there is no coffee break right now. Welcome to the seventh session, on homomorphic encryption. The first talk is entitled "High-Performance FV Somewhat Homomorphic Encryption on GPUs: An Implementation Using CUDA". The authors are Ahmad Al Badawi, Bharadwaj Veeravalli, Chan Fook Mun and Khin Mi Mi Aung, and Ahmad will give the talk.

Thank you, thank you very much. Good morning everyone, and thank you for being here at this hour to attend the homomorphic encryption session. In this talk I will share with you my experience developing and implementing FV, or as some people call it BFV, a somewhat homomorphic encryption scheme, on GPUs. Homomorphic encryption, as you may all know, is what people call the "Holy Grail" of cryptography, and the reason is that it allows you to compute on encrypted data. As you can see, the client sends the encrypted data to the cloud. The cloud can still evaluate useful functions on that data and return an encrypted result to the client, who holds the secret key and can decrypt and see the result in the clear. In theory we can evaluate any function this way, but in practice we are quite limited, and perhaps the main challenge of homomorphic encryption is its enormous computational overhead.

As for how this problem, the performance issue of homomorphic encryption, is being handled, I would say there are two directions. One is trying to find new schemes or techniques to improve the performance of FHE, such as encoding schemes, approximate computation, perhaps squashing the target function you need, and even optimizing the circuit you want to evaluate. The other direction accepts the FHE scheme as it is and tries to speed up what we already have: speeding up the basic FHE primitives, such as key generation, encryption, decryption and the homomorphic operations like homomorphic multiplication, by using better modular-arithmetic algorithms, parallel implementations, and even hardware implementations such as GPUs, FPGAs and ASICs. This work falls under the GPU category of hardware execution platforms. So the contributions are: implementing FV on the GPU, which is the main contribution of this work; a set of code optimizations and other algebraic tools; and benchmarking the implementation against existing FV implementations.

One question is: why GPUs for FHE? Well, GPUs are widely available, they include many computing cores, and they have proven to be strong on parallel problems, or problems that contain some level of parallelism. It happens that FHE, homomorphic encryption in general, has a level of parallelism that can be exploited by GPUs, and that is why the two, the platform and the problem, seem to be a good match. Here is the FV scheme. I will not go through the basic primitives, but I would like you to know that in this scheme we deal with polynomials, and these polynomials can be very long, with multi-precision coefficients on the order of thousands of bits per coefficient.
Among the operations, the one we face most often, besides being computationally intensive, is homomorphic multiplication, and especially in the FV scheme there is one heavy step inside homomorphic multiplication, the scale and round, which is not very compatible with the representations we use for the polynomials. There is also another operation, the base decomposition: how do we do that in the RNS or NTT representation? These polynomials, as I said, are long; the degree can be a few thousands. We stick to power-of-two cyclotomics because they have nice properties for the complexity of addition and multiplication. The coefficients, as we said, are multi-precision, a few thousand bits, so we use residue arithmetic such as the RNS to decompose them, and we use the techniques by Bajard et al. to do the scale and round over Q and the base decomposition.

First, let's see how the polynomials are represented. Q, which is the coefficient modulus, is chosen as a smooth number: the product of many smaller primes, and these primes should fit in the machine word size of the GPU or whatever execution platform you use. The polynomial is then represented as a matrix, what we call the RNS representation. It is k by n: n is the degree of the polynomial and k is a parameter we can control, which is log2(Q) over log2(p). So by choosing the size of p you can either increase k or reduce it, and it is actually important to minimize k as much as you can so that you do not do too many transforms. In this representation you can do an effectively unlimited number of additions or subtractions using only component-wise operations. To do multiplication, you apply the transform to each row of the matrix, and once you are in the NTT domain you can do addition, subtraction and multiplication using component-wise operations as well.

Now a question that may arise is: which transform to use? We have so many transforms: there is the standard FFT, there is the NTT, there is the DWT, and what we use in this implementation is the DGT, the discrete Galois transform. In this table we summarize, from our perspective, the pros and cons of these transforms. The FFT is well established; there are several libraries you can just use in your implementation. The problem is that the floating-point errors grow as n increases, and to mitigate this you will either increase the precision used to compute the DFT, which hurts performance, or reduce the size of the primes, the CRT primes you use, in which case you get a taller matrix and have to do a higher number of transforms.
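To make the component-wise picture above concrete, here is a minimal CUDA sketch (my own illustration, not the authors' code; the kernel and helper names are invented) of multiplying two polynomials that are already in the RNS-plus-transform domain, stored as k-by-n residue matrices:

```cuda
#include <cstdint>

// Modular multiplication of two 32-bit residues via a 64-bit intermediate.
// A production implementation would use 64-bit words with Barrett or
// Montgomery reduction instead of the % operator.
__device__ uint32_t mulmod(uint32_t a, uint32_t b, uint32_t p) {
    return (uint32_t)(((uint64_t)a * b) % p);
}

// Component-wise multiplication of two polynomials held in RNS (and NTT/DGT)
// form as k-by-n matrices: row i contains the residues modulo primes[i].
// One thread per cell, so all k*n residues are processed independently.
__global__ void rns_pointwise_mul(uint32_t* out, const uint32_t* a,
                                  const uint32_t* b, const uint32_t* primes,
                                  int k, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= k * n) return;
    uint32_t p = primes[idx / n];          // CRT prime for this row
    out[idx] = mulmod(a[idx], b[idx], p);  // no interaction with other cells
}
```

Addition and subtraction look identical with the multiplication swapped for a modular add or subtract, which is why this representation maps so naturally onto thousands of GPU threads.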
The other transforms, which are extensions of the DFT over finite fields, namely the NTT, the DWT and the DGT, have the nice property that they are exact. The difference between them is the transform length: with the NTT you need a length-2n transform, with the DWT you can do it with length n, and with the DGT you can do it with length n/2. This is quite interesting, because as we will see later, to implement these transforms efficiently on a GPU you need to store the powers of the primitive roots of unity, and on a GPU we are quite limited in fast memory, so you want to shrink these lookup tables as much as you can. That is why the DGT turns out to be quite useful for this problem. There is one issue with the DGT, though: it requires Gaussian arithmetic and involves a larger number of multiplications. Our hypothesis was: well, the GPU is good at computation and probably not that fast at memory operations, so let's see what happens if we use the DGT; and eventually we did use the DGT in our implementation. These are the usual equations of the transforms; as I said, you need to store the primitive roots in lookup tables.

The second question is how to compute the number theoretic transform: in GF(p̂), or in GF(p_i), the same prime fields you use for the CRT? We found that doing it in GF(p_i) is better. Let's look at each solution. You can choose a nice prime, call it p̂, of a usable size, say a 64-bit prime, so that it fits in the machine word size. The problem with this approach is that you have to be careful not to wrap around beyond p̂: there is a maximum value the convolution can reach, and you need to ensure that the operands you start with stay below that bound, and this holds only for one multiplication. In this table you can see, for different values of n, the size of the primes you can use, up to 24-bit primes, which means that if n is large your matrix will become taller and you will have to do more transforms. Another thing: we said the CRT primes would be, say, 32-bit primes; when you choose the 64-bit prime p̂, the storage doubles when you do the NTT, since you started with 32 bits and are now working with 64. Whereas if you choose the other solution, computing the number theoretic transform in each of the prime fields separately, you do not have these problems: you work with the same bit precision, 64-bit primes, there is no size doubling, and you can also support an unlimited number of operations in this domain.

Is the number theoretic transform important? Yes, it is very critical for homomorphic encryption. In this diagram we see the basic operations of the FV scheme; as you can see, the NTT is almost 50%. These are just different parameter settings of the problem, and the NTT is almost 50% in all of them. We also have the RNS tools for the base decomposition and the scale and round, which consume about 30% or 35%, so we need to pay attention to these particular operations.

For the CRT, we also include in the paper an implementation of the CRT on the GPU using Garner's algorithm. We found that Garner's algorithm is better than the classic algorithm for computing the CRT, and the reason is shown in this table. First, the lookup table size: you need on the order of k-squared entries for the classic computation and somewhat less than that with Garner, not much less but a little less. Second, thread divergence, which is an important concept in GPU programming: with the classic algorithm you will probably end up with thread divergence, which limits performance, whereas with Garner's algorithm there is no thread divergence.
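Since Garner's algorithm is what gets compared against the classic CRT here, a minimal host-side C++ sketch may help (again my own illustration with toy primes, not the paper's GPU code; a GPU version would precompute the p_j^{-1} mod p_i inverses into the lookup table discussed above instead of recomputing them):

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Modular exponentiation, used here only to get p^{-1} mod q via Fermat's
// little theorem; a real implementation would read the inverse from a table.
static uint64_t powmod(uint64_t a, uint64_t e, uint64_t m) {
    uint64_t r = 1;
    a %= m;
    while (e) {
        if (e & 1) r = (unsigned __int128)r * a % m;
        a = (unsigned __int128)a * a % m;
        e >>= 1;
    }
    return r;
}

// Garner's algorithm: rebuild x from its residues r[i] = x mod p[i]
// by computing mixed-radix digits v[i], then evaluating
// x = v0 + p0*(v1 + p1*(v2 + ...)).
static uint64_t garner(const std::vector<uint64_t>& r,
                       const std::vector<uint64_t>& p) {
    size_t k = p.size();
    std::vector<uint64_t> v(k);  // mixed-radix digits
    for (size_t i = 0; i < k; ++i) {
        uint64_t t = r[i] % p[i];
        for (size_t j = 0; j < i; ++j) {
            uint64_t inv  = powmod(p[j], p[i] - 2, p[i]);      // p[j]^{-1} mod p[i]
            uint64_t diff = (t + p[i] - v[j] % p[i]) % p[i];
            t = (unsigned __int128)diff * inv % p[i];
        }
        v[i] = t;
    }
    uint64_t x = 0;
    for (size_t i = k; i-- > 0;) x = x * p[i] + v[i];  // Horner, innermost first
    return x;
}

int main() {
    std::vector<uint64_t> p = {1031, 1033, 1039};  // toy CRT primes
    uint64_t secret = 123456789;                   // fits below p0*p1*p2
    std::vector<uint64_t> r = {secret % p[0], secret % p[1], secret % p[2]};
    std::printf("reconstructed: %llu\n", (unsigned long long)garner(r, p));
    return 0;
}
```

The reconstruction walks through the mixed-radix digits in a fixed order with no data-dependent branching, which is consistent with the point made above about Garner avoiding thread divergence when each GPU thread reconstructs one coefficient.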
But is the CRT critical to performance? No. It only appears at the input and output of the problem, so it is not that critical. We also adopted some of the tools from the paper by Bajard, Eynard, Hasan and Zucca for adapting the FV scheme to be RNS compatible, so that we can do the scale and round and the base decomposition in the RNS domain.

In this flow diagram we show how homomorphic multiplication is done for this scheme. As you can see, we start from the ciphertexts; each ciphertext is two matrices, and with the RNS tools these matrices double in size. Then we apply the DGT row-wise to each matrix, we do the tensor product, we go back to the RNS domain to do the scale and round, then we do the base extension and go back to the DGT to do the relinearization. What I want to point out here is that we keep going back and forth between the NTT and RNS representations. It would be very interesting to do these operations, I mean the scale and round and the base decomposition, directly in the NTT domain; whether they can be done there is still an open problem.

For the benchmarking, we compare the implementation with SEAL, the latest version of SEAL, which implements the RNS variant of the FV scheme, and we also compare with NFLlib-FV, for the basic primitives: key generation, decryption, and homomorphic multiplication plus relinearization. The GPU can get us, I would say, a 30 to 40x speed-up against these implementations; if you want the exact numbers, you can refer to the paper. I actually thought ahead that this question would come from the audience: which RNS variant of FV should I choose or implement? In the literature there are actually two variants of this scheme, the first due to Bajard, Eynard, Hasan and Zucca, and the other due to Halevi, Polyakov and Shoup. The answer can be found in this other paper, where we analyzed both variants in terms of performance and noise; you can find the answer to this question there. Thank you very much, I'll be happy to answer any question.

Thank you, Ahmad. Is there any question? There is no question, so I have one. You presented on slide 11 many different transforms for the NTT; have you tried to test which is the best, or did you just implement one? We tested the DGT and the DWT. In terms of runtime performance I would say they are comparable; sometimes the DGT is even faster, but not by much, just a little. The most important thing, as I said, is that the DGT uses a length n/2 transform, which means you need smaller lookup tables, and that is very effective for GPU development. OK, thank you. Since there are no more questions, we will move to the next talk.