Hello. My name is Kimmo Järvinen. I will present our work, Speed Reading in the Dark: Accelerating Functional Encryption for Quadratic Functions with Reprogrammable Hardware. This is joint work between Milad Bahadori and myself from the University of Helsinki, and Tilen Marc and Miha Stopar from XLAB. The work was done in the EU project FENTEC. We will go through the architecture of the first hardware accelerator for functional encryption for quadratic functions. It is a hardware-software codesign, designed particularly for Xilinx Zynq UltraScale+ MPSoC devices, but it is also usable for other such SoCs. The accelerator is optimized particularly for the decryption operation of this functional encryption scheme, and this decryption operation requires pairings and discrete logarithm computations. One of the central contributions of our work is a parallel version of Shanks' famous baby-step giant-step discrete logarithm algorithm, in which we utilize precomputation and parallel processing. We show that our implementation provides large speed-ups compared to software-only implementations: we get results tens or even hundreds of times faster than a software library called GoFE designed for the same operation. We also showcase two practical use cases in the domain of machine learning, in particular image classification, and show that these speed-ups translate to practical use cases as well.

So what is functional encryption? If you think about traditional encryption, it is all or nothing, in the sense that whoever has the decryption key gets access to the full plaintext and sees it entirely after decrypting the ciphertext. Then again, if you don't have the decryption key, you get nothing at all; maybe you learn something about the length of the message, but the actual contents are completely hidden from you. Functional encryption, then again, provides more fine-grained control.
So it is possible to hand out decryption keys that allow computing a specific function of the plaintext when given the ciphertext. Such a decryption key allows computing this function f(x), but gives no other information about x. Whenever somebody talks about computations on ciphertexts, one of course thinks about homomorphic encryption. But in this sense, homomorphic encryption is more like traditional encryption: whenever you decrypt, you get the entire plaintext, and if you cannot decrypt, you get nothing at all. Functional encryption, on the other hand, actually lets you compute something without revealing the entire contents of the message.

So then the question is, of course, what kinds of functions we can compute. General functions are possible in theory, but not in practice, so in a practical sense we are limited to very basic functions. There is functional encryption for inner products; those schemes allow you to compute inner products, so you can compute means, weighted averages, and these kinds of very basic statistics over a vector of data. Functional encryption for quadratic functions lets you compute quadratic functions already, and can lead to much more powerful practical applications. The focus of this work is FEQF, and as we will show, we can actually do some simple machine learning tasks even with these kinds of schemes.

Here is a small example of the kind of practical problem that could be solved with functional encryption. We have a setup with a patient and, in this case, two different doctors. They might be from different specialties, for example a cardiologist and an oncologist. Now keys are given to the patient to encrypt data, and two different decryption keys are given to the doctors, so that the cardiologist can compute the function f and the oncologist the function g.
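The functionality just described, a key that reveals one function of the plaintext and nothing else, can be sketched as an interface. This is an insecure mock that only shows the API shape (all names are illustrative); real schemes such as the GoFE library realize the same interface with actual cryptography:

```python
class MockInnerProductFE:
    """Functionality sketch only: no cryptography, just the FE interface."""

    def keygen(self, y):
        # A decryption key bound to one specific function: x -> <x, y>.
        return ("sk", tuple(y))

    def encrypt(self, x):
        # Stands in for a ciphertext; a real scheme would hide x here.
        return ("ct", tuple(x))

    def decrypt(self, sk, ct):
        # Reveals only the inner product, never the vector x itself.
        (_, y), (_, x) = sk, ct
        return sum(a * b for a, b in zip(x, y))

fe = MockInnerProductFE()
ct = fe.encrypt([3, 1, 4])        # the patient's data
sk_sum = fe.keygen([1, 1, 1])     # a key for the plain sum of the entries
print(fe.decrypt(sk_sum, ct))     # -> 8
```

A quadratic scheme has the same shape, except that keygen binds a quadratic form and decryption returns its value on the encrypted vector.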
And now the patient may encrypt her data, genomic data for example, and send it to a server. The cardiologist can then query this genomic data and see, for example, whether there is some heart condition based on the patient's genetic data, and the oncologist can search for similar things but for other diseases, in this case, for example, some cancer. And nobody, not the doctors and not even the server, has full access to the patient's genomic data.

The central computation in this kind of setup is the decryption function of the functional encryption scheme, and that is the motivation why this work focuses mostly on the decryption operation. The keys need to be generated quite seldom, so there is probably quite little need for accelerating the key generation function. The patient encrypts her data maybe once, or very seldom in any case, and it is also very likely that patients won't have access to hardware accelerators. So the decryption operation, which is both frequent and done on a centralized server, is a very good candidate for hardware acceleration.

Okay, so let's take a look at the actual decryption algorithm that we focus on in this work. It comes from a scheme described in a paper titled Reading in the Dark, which is also the origin of the title of our paper, Speed Reading in the Dark, because we do the same thing faster with our hardware acceleration. The decryption algorithm takes as input a ciphertext in a specific format and a decryption key for a specific quadratic function f. It then computes a number of pairings; the number of pairings required in the decryption depends on the length of the ciphertext vector, that is, how many values are encrypted in it. Some finite field arithmetic operations are also required, but those are rather insignificant if you look at the big picture.
Actually, the performance of this algorithm is mostly determined by the pairings and, importantly, by the final discrete logarithm that needs to be computed at the end. This discrete logarithm returns an integer value as output, and that is the output of the quadratic function we want to compute with this algorithm. The novelties of our work are concentrated on this discrete logarithm computation, so in this presentation I will focus mostly on that; the details of how we compute the pairings are available in the paper.

A discrete logarithm is the problem of finding x when given alpha and beta in some cyclic group, where beta = alpha^x. A famous way to solve this problem is Shanks' baby-step giant-step algorithm, which is based on the equality alpha^j = beta * alpha^(-im), writing x = im + j. Once a match is found where some power of alpha equals this right-hand side, we know what x is. The algorithm splits into two phases: the baby-step phase, which computes powers of alpha and stores them in a table T, and then the actual giant-step phase, where we try to find the matches.

The discrete logarithms that need to be computed in this decryption algorithm are actually quite special, in the sense that we know that x lies within a specific bound, which is very small compared to the size of the cyclic group. In most cases we know that the function we evaluate can return output values in the interval from -B to B. It is also the case that alpha is fixed: it is a domain parameter, so we can precompute the baby-step table T. And although the size of the cyclic group is huge, which would prevent us from computing the whole of T, we can still compute T up to some predefined bound, which in this work we denote by B_P, and that allows us to evaluate all functions that satisfy this inequality.
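This bounded, precomputable baby-step giant-step search can be sketched in a few lines. This is a toy over the multiplicative group mod a small prime with illustrative parameters, not the paper's implementation; the three-argument pow with a negative exponent needs Python 3.8+:

```python
import math

def bsgs(alpha, beta, p, B):
    """Find x in [-B, B] with pow(alpha, x, p) == beta, or None.

    Toy version; assumes 2B + 1 is well below the group order,
    so the answer is unique in the interval.
    """
    beta = beta * pow(alpha, B, p) % p        # shift the range to [0, 2B]
    N = 2 * B + 1
    m = math.isqrt(N - 1) + 1
    # Baby steps: alpha^j for j < m. Since alpha is a fixed domain
    # parameter, this table can be precomputed once and reused.
    table = {}
    aj = 1
    for j in range(m):
        table[aj] = j
        aj = aj * alpha % p
    # Giant steps: look for beta * alpha^(-i*m) in the table.
    step = pow(alpha, -m, p)                  # modular inverse (Python 3.8+)
    gamma = beta
    for i in range(m + 1):
        j = table.get(gamma)
        if j is not None:
            return i * m + j - B              # undo the shift
        gamma = gamma * step % p
    return None

print(bsgs(2, pow(2, -17, 101), 101, 40))     # -> -17
```

With alpha fixed, only the giant-step loop runs at decryption time, which is exactly what precomputing the table T buys.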
It is also notable that we don't have to store the entire value alpha^j; it suffices to store only some bits of that value, as long as all entries in T remain unique. This helps us save a lot of space, which is very central in this work because we are actually computing huge tables. The architecture that I will present in the following slides is a parallel architecture, so it is of interest to parallelize this algorithm. Both of the steps can be fully parallelized, in the sense that with n cores we get almost an n-times speed-up. The resulting algorithm that takes all of this into account is actually quite complex, about one page long, so I will spare you from it; you can take a look in the paper if you are interested.

Okay, so here is the high-level architecture of our accelerator. As I mentioned, it is a hardware-software codesign: there is a software side, this ARM core here, and a hardware side, which is implemented in the FPGA fabric. The most important components in the hardware domain are the n parallel processing cores. In our implementation n is 16; basically, another value of n could be used as well, but we filled the FPGA, and that meant we could fit 16 parallel cores. These cores are optimized for speed and area, which allows us to optimally use the resources we have available. The actual architecture of these cores is based on a pairing coprocessor architecture designed by Milad and myself and published at FPL last year. It is optimized for pairing computations, but as it ultimately relies on efficient finite field arithmetic, it can also be used to efficiently compute the other parts of the decryption algorithm, including the discrete logarithm computation.
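The two ideas just mentioned, truncated table entries and an n-way split of the giant-step search across parallel cores, can be sketched together as follows. This is a toy with threads standing in for the 16 FPGA cores; all names and parameters are illustrative, not the paper's code:

```python
import math
from concurrent.futures import ThreadPoolExecutor

MASK = (1 << 32) - 1   # keep only 32 low bits per entry to shrink the table

def build_table(alpha, p, m):
    # Baby steps, truncated; assumes the truncated entries stay unique.
    table, aj = {}, 1
    for j in range(m):
        table[aj & MASK] = j
        aj = aj * alpha % p
    return table

def scan_chunk(alpha, beta, p, m, table, i_lo, i_hi):
    # One core walks a contiguous range of giant-step indices.
    gamma = beta * pow(alpha, -i_lo * m, p) % p
    step = pow(alpha, -m, p)
    for i in range(i_lo, i_hi):
        j = table.get(gamma & MASK)
        # A truncated hit is only a candidate: confirm with a full check.
        if j is not None and pow(alpha, i * m + j, p) == beta:
            return i * m + j
        gamma = gamma * step % p
    return None

def parallel_bsgs(alpha, beta, p, N, n_workers=4):
    """Find x in [0, N) with pow(alpha, x, p) == beta."""
    m = math.isqrt(N - 1) + 1
    table = build_table(alpha, p, m)
    n_giant = (N + m - 1) // m
    bounds = [n_giant * k // n_workers for k in range(n_workers + 1)]
    with ThreadPoolExecutor(n_workers) as pool:
        results = list(pool.map(
            lambda k: scan_chunk(alpha, beta, p, m, table,
                                 bounds[k], bounds[k + 1]),
            range(n_workers)))
    return next((x for x in results if x is not None), None)

print(parallel_bsgs(2, pow(2, 57, 101), 101, 81))   # -> 57
```

Each chunk starts from its own offset beta * alpha^(-i_lo * m), so the workers never need to communicate until one of them finds a verified match, which is what makes the near-linear speed-up possible.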
On the hardware side we also have a hierarchy of local memory: there are local memories in the CP cores, and there is also a shared memory, which allows the CP cores to exchange data with each other. The main purpose of this memory hierarchy is to reduce the amount of data that needs to be transferred between the software and hardware domains, because that communication would easily become a bottleneck if not implemented with smart local memory on the hardware side. On the software side, the ARM core takes care of general control and flow management, but also of certain auxiliary processing, as I call it, meaning small operations that need to be computed but are either not supported by the hardware domain or do not pay off to delegate to the hardware at that point. An important component here is also the DDR memory, which is not actually part of the chip itself but is on the same board. This large DDR memory is used for storing the precomputed table T; we used almost 2 gigabytes of that memory, so it is a very central component in this computation. Let's take a look at the results next.
We compared our work against the GoFE library, also a result of the FENTEC project, which implements the same FEQF scheme. In addition to the original GoFE, we also compared against an optimized GoFE, which is the same GoFE library where we have implemented the discrete logarithms with our precomputation-based discrete logarithm algorithm. We can see from here that this algorithm has a very significant impact on the software performance as well: with the original GoFE the discrete logarithms quickly start to dominate, and the decryption times grow to impractical levels quite quickly, even though all of the samples here are rather small in a practical sense; in some cases there is only one value in the ciphertext. As soon as the size of the result grows, the GoFE library gets slow. With the new discrete logarithm algorithm we get significant speed-ups, but by using the hardware accelerator we get even much better results: speed-ups of over 1000 times compared to the original GoFE, and almost 20 times even against the optimized one. And the bigger the functions are, the more benefit you get from the hardware accelerator, which is a good showcase of the importance of hardware acceleration for practical use cases, because in those cases the functions are usually on the larger end.

At the beginning I said that we also tested our system with practical-like use cases, and we did so with two different image classification tasks. In the first case we used the famous MNIST database of images of handwritten digits. Normally the task is to give an image to a computer, which should then say which digit is in the image. In our case the task is much more difficult, because we actually encrypt the image, so it is not even possible to look at it visually and tell which digit is there. But with functional encryption it is still possible to do this machine learning computation by giving out 10 different decryption keys: one that gives the likelihood that the digit is zero, one that it is one, and so on, all the way up to nine. In this example we see that decryption with the key that gives the likelihood for the digit eight gives the largest output, which means that the digit in that encrypted image is likely to be an eight. The Fashion-MNIST database is similar, but instead of handwritten digits it contains images of different kinds of clothes, t-shirts, trousers and so on, and the task is to find out which one is there; there are also 10 different classes in that case. In both cases the images are 28 x 28 pixels with 10 different classes, but because the task of classifying handwritten digits is much easier, the computation required there is much simpler: we can use N = 40 as a parameter in that case, we get on average 29-bit outputs from the computation, and this kind of model gives 97% accuracy. In the case of Fashion-MNIST we have to use bigger parameters, N = 128, with on average 37-bit outputs, and still we get less than 90% accuracy.

If we look at the literature, the original GoFE was reported to do this MNIST case in less than 20 seconds. When we do it with the optimized GoFE, we get on average 1.3 seconds for the MNIST computation, and Fashion-MNIST is done in 5.2 seconds. With our accelerator we get 0.09 seconds for the MNIST case and about 0.4 seconds for Fashion-MNIST, so we get roughly a 15-times speed-up, which is something that already has a lot of practical significance.

As a conclusion, what we showed is that functional encryption for quadratic functions is well suited for hardware acceleration. Although we focus on one specific scheme in this work, the other FEQF schemes that have been proposed are actually very similar in structure, they also use pairings and discrete logarithms, so we expect that our results generalize to them rather easily, and our accelerator can be used for different FEQF schemes with only very minor modifications. If you are interested in the details about the architecture and algorithms, or in the discussion of side-channel attacks, please see the paper. Okay, thank you very much. Questions can be asked in the online session or emailed directly to me at kimmo.u.järvi. Thank you very much for your attendance.