So, hey everyone, welcome to today's Protocol Labs Research Seminar. Today we are joined by Alexander Viand, who is both a doctoral student and a research assistant at ETH Zurich. Alexander works on secure computation technologies, including FHE, or fully homomorphic encryption, trying to make these techniques more accessible to non-experts. Today, Alexander will be talking about HECO, automatic code optimizations for fully homomorphic encryption. So, Alexander, I will let you take it from here.

Cool. Thank you for the introduction. So, yeah. Hi, I'm Alex. And today I'm going to talk a bit about our work on compiler design for fully homomorphic encryption. And of course, as is usual with most of this, we do a lot of work with a lot of collaborators, including some lovely students, Patrick and Miro at ETH, and also my supervisor Anwar, who is also here today.

So what is fully homomorphic encryption? What is FHE? FHE enables computation on encrypted data. Specifically, it allows us to delegate the processing of data to a third party, but without also having to give away access to the underlying data to that party. And so that obviously has a huge potential to transform privacy as we know it today, because it will enable us to provide end-to-end security for a much wider set of applications than what we can do today. And the good news is that thanks to a series of dramatic performance improvements, FHE is finally fast enough to be useful. And in fact, we're already starting to see it used in practice. For example, Microsoft has recently deployed fully homomorphic encryption as part of the Microsoft Edge browser's Password Monitor service, which checks whether or not your credentials have been found in one of the breach databases that Microsoft knows about, without you having to submit your credentials, and without them having to give you access to the breach databases they know about.
But we're also seeing smaller startups increasingly looking into how to commercialize FHE, mostly around the area of privacy-preserving machine learning, where you have a server that owns a machine learning model, and you as a client can send an encrypted input and get back an encrypted inference result on that model.

While there is a lot of promise in FHE, there is one big caveat. And that is that right now developing FHE applications is notoriously hard. It just isn't generally usable yet. And there's a lot of complexity in trying to make things work. And so for the moment, we're mostly seeing things like this password monitoring service being developed by a whole team of cryptographic experts. But of course, if we want to see wider use of this technology, then we must work on making it more accessible. And so today I want to briefly touch on what exactly it is that makes developing FHE applications so hard, and then go into how compilers, and tools like compilers, can help to address these complexities.

The reason why it is so complex to write FHE applications is that good performance tends to require fairly extreme optimizations. How we express our programs and how we optimize them is crucial for the performance. The trivial, straightforward implementations will in general be exceedingly slow in the context of FHE. And even fairly small changes in the application can result in drastic differences in how you should implement it and then how well it will run. And so FHE imposes a lot of constraints on developers. And that in turn results in a fairly unique programming paradigm for FHE. As an example, data independence is sort of built into FHE because by definition, computations in FHE are a black box from which no information must leak. And that includes things like bits about whether or not we branch. And so we cannot do data-dependent branching. That means no jumping, no dynamic loops, and not even real if-else statements.
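To make the restriction concrete, here is a minimal plaintext sketch in Python of how a data-dependent branch can be replaced by arithmetic that always evaluates both sides. The function names are hypothetical and plain integers stand in for ciphertexts; the condition bit is assumed to be available already as a 0/1 value, which in real FHE is itself a non-trivial computation.

```python
def select(cond: int, if_true: int, if_false: int) -> int:
    """Branch-free select; cond must be 0 or 1.

    Only additions and multiplications are used, which is what
    FHE schemes natively support.
    """
    return cond * if_true + (1 - cond) * if_false


def clamp_branching(x: int, limit: int) -> int:
    """Ordinary version: the branch taken leaks information about x."""
    if x > limit:
        return limit
    return x


def clamp_oblivious(x_gt_limit: int, x: int, limit: int) -> int:
    """Data-independent version: both outcomes are always computed."""
    return select(x_gt_limit, limit, x)


if __name__ == "__main__":
    for x in range(10):
        # The comparison is done in the clear here purely for illustration.
        print(x, clamp_branching(x, 5), clamp_oblivious(int(x > 5), x, 5))
```

Because both arguments of `select` are always evaluated, the cost is that of both branches combined, regardless of which one is logically taken.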
We actually can emulate things like if-else statements by doing this multiplexing-like thing that you can see here on the slide. But when we do that, we actually have to evaluate both sides of the branch, right? And so in general, data independence means that the runtime of your secure computation is always at least the worst-case runtime of the original algorithm. But in practice, it's frequently much, much worse, even before you consider the actual overhead from the crypto and the math.

Obviously that's maybe not the greatest news, but there is good news as well in the world of FHE. For example, a lot of the major FHE schemes right now offer a very powerful, if restricted, form of data parallelism. So you can encode many thousands of integers into a single ciphertext rather than just a single one and then operate on them in a single-instruction-multiple-data (SIMD) fashion. That can obviously give you a massive amount of speedup. But the catch is that you're very limited in terms of data movement. So you cannot treat this like an AVX-512 vector instruction. It's much, much more restricted. And so using this in order to enable actual improvements in the latency of applications, rather than just parallelizing and improving throughput, is actually a fairly significant challenge. But it's worth tackling this challenge because doing so tends to give you several orders of magnitude in speedup. So it's not unusual to see 10x, 100x or even higher amounts of speedup from applying SIMD parallelism to a problem.

And of course, I'm only touching on a few highlights here. In general, the FHE design space is relatively complex. For example, most of the schemes only natively support additions and multiplications. And while that obviously allows you to compute any polynomial function over the integers, that isn't necessarily the kind of thing you want to compute. And so you can use polynomial approximation to sort of get around that.
But it turns out that in practice, with the restrictions of FHE, it's very hard to find the balance between the accuracy of your polynomial approximation and the performance of that approximation. And so alternatively, if you need non-polynomial evaluation, you can look to emulating binary circuits, right? So if you imagine addition and multiplication in Z2, that's just XOR and AND. And so with that, you can do arbitrary computation, but the catch is that now every single operation in your input program is getting turned into dozens, if not hundreds, of actual FHE operations that compose these, let's say, binary adder or binary multiplier circuits that you are now having to emulate. So at the very least, when you're using this approach, you're seeing about an order-of-magnitude slowdown compared to doing things directly over integers. And in practice, it tends to be much worse, actually, because of compounding factors.

If that wasn't enough, then you also pretty much always, in both of these settings, have to consider things like parameter selection. This requires balancing security, performance, and accuracy. And so it's not just a matter of looking up a standard. It's really per application. You need to figure out what the correct parameters are for this specific task. And you also have to deal with the very unique cost model of FHE. It's somewhat like MPC in that additions are much cheaper than multiplications. But in fact, in FHE, most of the runtime is dominated by scheme-specific operations that don't actually modify the message at the level of the application logic, but are just required to maintain the ciphertext as the computation goes on. And so this is also something that is just hard to wrap your head around.

In summary, developing FHE applications is hard because it requires looking beyond the overhead of the underlying FHE operations, and really the application design has a significant impact on the performance.
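As a brief aside on the binary-circuit approach mentioned above, the following Python sketch illustrates why a single integer operation expands into many gate-level operations. It is a plaintext model under stated assumptions: the `GateCounter` class and helper names are hypothetical, a textbook ripple-carry adder is used purely for illustration, and each gate is computed as addition or multiplication modulo 2, which in a real setting would each be a homomorphic operation.

```python
class GateCounter:
    """Evaluates XOR/AND as arithmetic in Z2 and counts the gates used."""

    def __init__(self):
        self.xor_count = 0
        self.and_count = 0

    def xor(self, a, b):
        self.xor_count += 1
        return (a + b) % 2  # addition in Z2

    def and_(self, a, b):
        self.and_count += 1
        return (a * b) % 2  # multiplication in Z2


def to_bits(value, width):
    """Little-endian bit decomposition."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    return sum(b << i for i, b in enumerate(bits))


def ripple_carry_add(gates, x_bits, y_bits):
    """Adds two little-endian bit vectors of equal length, gate by gate."""
    carry = 0
    out = []
    for a, b in zip(x_bits, y_bits):
        a_xor_b = gates.xor(a, b)
        out.append(gates.xor(a_xor_b, carry))
        # carry-out = (a AND b) XOR (carry AND (a XOR b))
        carry = gates.xor(gates.and_(a, b), gates.and_(carry, a_xor_b))
    out.append(carry)
    return out


if __name__ == "__main__":
    gates = GateCounter()
    total = from_bits(ripple_carry_add(gates, to_bits(200, 8), to_bits(100, 8)))
    print(total, gates.xor_count, gates.and_count)
```

In this model, one 8-bit addition already costs 24 XOR and 16 AND evaluations, where the integer-based approach would use a single homomorphic addition.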
And so you need to spend a lot of time on that. As a result, we see a massive performance gap between naive implementations done by non-experts and first-time users and those that were carefully designed and fine-tuned by experts with maybe a PhD or even a faculty position in this field. And so the design space, obviously, is very complex. There's a variety of different approaches, and they all have non-obvious trade-offs, so that even for experts, it's not entirely clear what would be the best approach for every problem. But for non-experts, it's usually completely unclear where to even begin with your implementation.

As I said, I want to keep it fairly brief on the challenges today because we actually have a paper about this from last year where we go into a lot more detail about the challenges and what they mean for everyday developers. If you're interested in learning more about that, you can either check out the paper I linked below, or we also have a recorded talk about this that's available on fhe.org in the list of previous talks there.

So yeah, but today I don't want to focus on that. Instead, I really want to focus this talk on asking: how can compilers help in addressing these complexities? And that basically means asking ourselves, how can we make FHE accessible to non-experts? And we propose that this requires combining concepts from the crypto side on the one hand and programming language design on the other, in order to be able to bridge that gap and to develop tools and abstractions that facilitate FHE development. And I personally happen to believe that compilers are the key to democratizing FHE, because delivering on the promise of FHE requires tools that allow non-experts to actually turn their existing code into efficient FHE solutions with ideally minimal changes.
And in that world, the compiler should be taking care of most of the complexity, allowing the developers to express the computation tasks that they have in a high-level language, and then the compiler automatically generates and optimizes secure and efficient FHE code from that.

And towards this vision, we have focused primarily on two challenges in our research. The first one is automatic translation. Specifically, we're focusing on high-level imperative code, which doesn't really mesh well with the programming paradigm of FHE. How do we map that to efficient batched FHE solutions automatically, without requiring user input? Well, with minimal user input; for example, we do require the user to indicate which values are secret. On the other hand, we're also bringing in a new architecture for FHE compilers that we believe much better reflects the different optimization opportunities that the FHE stack allows for.

And so I want to have a closer look at what we actually mean by an end-to-end development toolchain for FHE. Traditionally, we might think of an FHE compiler, and this was definitely true, let's say, two or three years ago, as a tool that converts a program in some more or less high-level language into an arithmetic circuit. And the arithmetic circuit is sort of the natural and traditional mathematical model of an FHE computation. If you then want to have an optimizing compiler, well, that probably means rewrites and simplifications on the circuit happening before you're able to run it. But in practice, we see that FHE applications come from a whole range of domains which all have different paradigms used to express the programs. And so when we directly translate these various high-level paradigms into circuits, what happens is that we lose a significant amount of information, making it much, much harder to actually work with and optimize the program.
And so one of the first things we do is we propose to extend the notion of an FHE compiler outwards towards the front end by introducing program optimizations that make use of the high-level information that is present in the input program while it's still a program and not a circuit. And on the other end of the system, we acknowledge that really what the end user wants is not an arithmetic circuit. What they want is instead a secure and efficient FHE implementation, right? And so that can mean having an executable that works on CPUs, but also increasingly GPUs and FPGAs, where we've seen work for FHE, and then, looking towards the future, FHE-specific ASIC accelerators that are currently being developed as part of the DARPA DPRIVE program by a couple of competitors, including, for example, Intel. So in order to support all of this, we also need to extend the compiler on the other side, to the right here, with the ability to target hardware, first by lowering FHE operations to the underlying math, and then going further and scheduling these low-level operations appropriately for the target hardware.

Okay, so that is sort of the zoomed-out view of what we think an end-to-end FHE development toolchain needs to be. So how do we actually go about building such a compiler? For that, I want to take a very brief look at the evolution of compiler architectures. So the traditional compiler is this monolithic tool that takes in, let's say, C or C++ and then, through a number of transformations, crunches it down into assembly for a specific target. And this is basically the architecture of GCC. But people soon realized that there are significant advantages to introducing additional intermediate representations, like the LLVM IR that is used to enable the Clang compiler to retarget to a variety of different backends from the same starting point.
And this means that each of the translation steps that they now have to do is significantly smaller than the large gap that something like GCC has to cross. But it turns out that for high-level languages, especially higher-level languages than C/C++, this is still a pretty large gap. And so we see compilers for Swift, for example, introduce their own high-level intermediate representation but then lower, for example, to LLVM. And we're seeing this trend towards an increasing number of abstraction levels. For example, with Rust, we're now seeing a high-level and a mid-level intermediate representation before they target LLVM. And this trend really goes to its conclusion with domain-specific languages, where we have a large number of intermediate representations, and that means that the individual lowering steps between them are much, much easier to express, to the point where you can fairly easily express them as a domain expert rather than as a compiler expert.

And so this is exactly what we're also using for our compiler, HECO, which is, well, the Homomorphic Encryption Compiler; sorry, we don't have the most creative acronym there. And in our case, this looks like this, right? So from the input program, we first go into a high-level intermediate representation, and here we still have our control flow and our high-level operations, like linear algebra operations, right? So even higher-level, potentially, than C. This is then, eventually, after optimization, lowered into a scheme-specific intermediate representation. And this is basically representing the native operations of an FHE scheme.
Most importantly, this means not having all of these convenient functions like equality, comparison, etc., but being sort of stuck with multiply, add, etc. And then when we want to go lower and actually target hardware, well, then we translate to the polynomial intermediate representation, which is there to represent the underlying math of these schemes. And then there's another step, which is the RNS, or Residue Number System, intermediate representation, which represents an abstraction of how to efficiently work with large-degree and large-bit-width polynomials on fixed-width hardware. So all of these obviously are designed to translate and lower down to LLVM and then inherit from that the existing targets. But we also envision that in the future HECO will target the instruction set architectures of these upcoming DARPA DPRIVE accelerators specifically built for FHE.

I think I said the words "intermediate representation" a whole bunch in the last minute or so, so let's actually look at what an intermediate representation is. And specifically, I'm saying this in the context of the MLIR compiler framework, which is what we're using to realize HECO. So here, an intermediate representation is basically a description of, let's say, an API. You have an operation that belongs to a dialect, so standard.add; it has operands, it has results. It also has types, which are very important for the actual implementation of things, but I want to completely ignore that for this presentation. And then this basically defines the syntax, in some sense, of the intermediate representation. But whether or not this actually means anything isn't defined here, right? We can expect that standard.add probably adds two numbers, but this actually isn't defined as part of this. It's defined through lowerings. So to give you a bit more of an interesting example that isn't just an addition, let's have a look at the BFV multiplication operation.
So this is quite a bit less trivial than, let's say, an integer multiplication, because it translates to, first of all, a series of polynomial multiplications between the polynomial elements of the ciphertext. And then each of these polynomial multiplications translates even further into, let's say, an NTT, an element-wise multiplication and then an inverse NTT, which is a way of performing large-degree polynomial multiplications efficiently. And in fact, each one of those will then translate even further into the RNS-level IR, but that gets way too verbose for this sort of overview.

So I've zoomed in here a bit, so let's zoom out a bit. Let's have a look at what's actually happening here in terms of architecture. So we have a Python front end, and we also currently have a C-like domain-specific language front end, and we're very open to adding more front ends if people can show us why they would prefer one. All of these front ends translate the input program into our high-level intermediate representation. This is made up of our custom dialect, or set of operations, around FHE, but it also includes, and this is the advantage of using something like the MLIR framework, built-in dialects and operations like the tensor and vector dialects, ones for basic arithmetic, for control flow, for defining functions and function calls, and all of these things that we get to inherit. And by cleverly connecting our FHE dialects with the built-in systems, we actually get to benefit from a lot of the built-in optimizations, and we basically avoid reinventing the wheel for some of the compiler 101 stuff. So as I said, and just showed earlier, this then gets translated to the scheme-specific IR.
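The NTT, element-wise multiplication, inverse NTT pattern described above can be sketched in a few lines of Python. This is a toy illustration under simplifying assumptions, not what HECO emits: the parameters are tiny (modulus 17, four coefficients), the transform is evaluated directly in quadratic time rather than with a fast butterfly network, and the product is cyclic (modulo x^N - 1), whereas the ring-based schemes typically work modulo x^N + 1.

```python
P = 17      # small prime modulus (toy parameter)
N = 4       # number of coefficients
OMEGA = 4   # primitive N-th root of unity mod P: 4**2 % 17 == 16, 4**4 % 17 == 1


def ntt(coeffs, root):
    """Direct number-theoretic transform: evaluate the polynomial at powers of root."""
    return [
        sum(c * pow(root, i * k, P) for i, c in enumerate(coeffs)) % P
        for k in range(N)
    ]


def intt(values):
    """Inverse transform: same sum with the inverse root, scaled by 1/N."""
    inv_n = pow(N, -1, P)        # modular inverse, requires Python 3.8+
    inv_root = pow(OMEGA, -1, P)
    return [(inv_n * v) % P for v in ntt(values, inv_root)]


def poly_mul_ntt(a, b):
    """Cyclic polynomial product via NTT, element-wise multiply, inverse NTT."""
    fa, fb = ntt(a, OMEGA), ntt(b, OMEGA)
    return intt([(x * y) % P for x, y in zip(fa, fb)])


def poly_mul_schoolbook(a, b):
    """Reference: quadratic-time cyclic product modulo x^N - 1 and P."""
    out = [0] * N
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[(i + j) % N] = (out[(i + j) % N] + x * y) % P
    return out


if __name__ == "__main__":
    a, b = [1, 2, 3, 4], [5, 6, 7, 8]
    print(poly_mul_ntt(a, b), poly_mul_schoolbook(a, b))
```

The point of the transform is that, once both inputs are in the transformed domain, the multiplication itself is a cheap element-wise operation.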
So for example, this BFV.mult operation that I just showed is from the BFV dialect, and we currently have dialects for BFV, BGV and CKKS, which are the ring-based FHE schemes that have this SIMD capability that I mentioned earlier. So when we then want to translate that into an actual computation, the easiest way would be to just translate it into calls into an FHE library like SEAL, which implements all of the crypto by itself. But increasingly, we're seeing that there's interest in and demand for being able to actually lower to hardware directly. And this is when we do what we just saw, which is lower into polynomial operations. Those, and that was the thing that didn't fit on the slides, get lowered into RNS, Residue Number System, operations. And then finally, this gets lowered to LLVM if we want to target something like x86, or in the future, it would get dispatched to a DPRIVE accelerator.

I've already given you an example of what it looks like to take something from, let's say, BFV through to poly, but now I want to have a closer look at the high-level transformations that I mentioned earlier. And in order to be able to do this, we need a quick recap of this SIMD-like parallelism which I already mentioned. Obviously, it's amazing; it allows us to do things like replace many operations with a single SIMD operation, right? Great. But the catch is that it doesn't give us data movement. So with normal vector instruction sets like AVX-512, which are very common on the server side, we can actually fairly efficiently permute or scatter and gather, and this is very heavily relied on in optimizations targeting them. In FHE, this is not an option. The only thing we can do in terms of data movement natively is cyclic rotation. So we can take a ciphertext and, at a significant actual runtime cost, rotate it cyclically.
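A minimal plaintext model of this restricted data movement, with Python lists standing in for the slots of a ciphertext, might look as follows. The helper names are hypothetical, and the rotation direction (slot i receives the old slot i + k) is an assumption made for this sketch. The second function shows one way a single-element access can be emulated with a rotation and a multiplication by a plaintext mask.

```python
def rotate(slots, k):
    """Cyclic rotation: slot i of the result holds the old slot (i + k) mod n."""
    k %= len(slots)
    return slots[k:] + slots[:k]


def mul_plain(slots, mask):
    """Element-wise multiplication with a plaintext vector."""
    return [s * m for s, m in zip(slots, mask)]


def extract_to_slot0(slots, i):
    """Emulated index access: move element i to slot 0 and zero everything else."""
    mask = [1] + [0] * (len(slots) - 1)
    return mul_plain(rotate(slots, i), mask)


if __name__ == "__main__":
    v = [10, 20, 30, 40]
    print(rotate(v, 1))            # whole-vector cyclic shift
    print(extract_to_slot0(v, 2))  # one rotation plus one multiplication per access
```

Every emulated access costs a rotation and a multiplication on the whole vector, which is why programs full of random index accesses fit this model so poorly.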
And because of this restricted data movement, it doesn't really work the same way as computation paradigms like AVX-512 or other vector processors. And so how to use this to get latency improvements, right, so how to accelerate a single instance of a program, is very, very tricky and sometimes not very obvious.

For example, this is a fairly simple program that computes a simple image sharpening, right? It takes a kernel and iterates it over the image, and while it may not look like it, it is actually very FHE-friendly. Currently it's in a very unfriendly form, because we have all of these random accesses into the image pixels, which is very, very bad. FHE cannot do this sort of data movement, or extraction of a single element from a vector, efficiently. But this does actually have an efficient FHE implementation. The problem is that it's a rather dramatic transformation, and I hope I'm not presuming here when I say that it's basically unrecognizable from the original program. For example, this inner loop here, or this inner loop nest even, has been turned into a single instruction in the FHE program, and these outer two loops, well, they just completely disappeared, because what we're doing with the FHE approach is we're no longer iterating a kernel over an image. We're actually duplicating the image a few times, rotating it appropriately and then pushing it through the kernel in one big SIMD operation at the end.

And so this is an example of the kind of transformation that we need to do in order to get decent performance out of FHE, and it's also an example of how drastic these transformations can be. So it might be natural to resort to things like synthesis or other heavyweight tools to achieve this, but our goal here really was to do this with simple and efficient translation rules that you can actually use in everyday program development.
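The shape of this transformation can be sketched in plaintext Python. This is an illustrative reconstruction under stated assumptions, not the program from the slides: the image is a flattened n-by-n list with wrap-around at the borders, the 3x3 weights are a hypothetical sharpening kernel, and lists stand in for ciphertext slots. The first function uses per-pixel index accesses; the second uses only whole-image rotations and element-wise operations.

```python
def rotate(slots, k):
    """Cyclic rotation: slot p of the result holds the old slot (p + k) mod len."""
    k %= len(slots)
    return slots[k:] + slots[:k]


def apply_kernel_naive(img, n, weights):
    """Loop-nest version: one random index access per pixel and kernel tap."""
    out = [0] * (n * n)
    for x in range(n):
        for y in range(n):
            acc = 0
            for i in (-1, 0, 1):
                for j in (-1, 0, 1):
                    acc += weights[i + 1][j + 1] * img[((x + i) * n + (y + j)) % (n * n)]
            out[x * n + y] = acc
    return out


def apply_kernel_batched(img, n, weights):
    """Batched version: no loops over pixels, only rotated copies of the whole image."""
    acc = [0] * (n * n)
    for i in (-1, 0, 1):
        for j in (-1, 0, 1):
            rotated = rotate(img, i * n + j)  # aligns neighbour (i, j) with each pixel
            w = weights[i + 1][j + 1]         # plaintext kernel weight
            acc = [a + w * r for a, r in zip(acc, rotated)]
    return acc


if __name__ == "__main__":
    n = 4
    image = list(range(n * n))
    kernel = [[0, -1, 0], [-1, 5, -1], [0, -1, 0]]  # hypothetical sharpening weights
    print(apply_kernel_naive(image, n, kernel) == apply_kernel_batched(image, n, kernel))
```

In the batched version the work no longer depends on the number of pixels per operation: nine rotations and nine element-wise steps cover the entire image at once.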
So in order to show you how HECO would tackle something like this, I'm going to drop to a slightly smaller and simpler example that will actually fit on the slides, and that's the Hamming distance. Very simple task: calculate in how many positions two vectors disagree, right? So the Hamming distance of this example is two. So this is the C++ representation of the program. This is what it would look like in the high-level intermediate representation. I've omitted all the typing information here just because it gets very messy otherwise.

So right now, this has a lot of these random index accesses, and as I mentioned earlier, these are very bad for FHE. We can emulate them by rotations, multiplications with masks and all these kinds of things, but basically they're just prohibitively expensive. At that point, we actually show it's much better to not even bother with batching and put every data element into its own ciphertext, but that's not exactly a great solution. And so what we want to do is try and minimize these operations.

But before I can show you how we do that, I want to quickly unroll this program, just for exposition; it makes it a lot easier to follow along. So let's give this a few unrolls, there we go. Now we have a program for the Hamming distance of size four, maybe not the most practical program, but it fits on the slide, and it currently has a cost of eight index accesses and, sorry, that should be four multiplications. And we want to make this better, and for the first thing we're going to do, it's maybe not going to be super obvious why we're doing it this way, but we'll get back to it later. The first step in what HECO does is that it combines what we call sequential operations. For example, all of these additions here that we see, they keep adding to the result, right? But it's one series of additions, and we can combine these into a single big addition operation, and we can obviously drop the plus zero because that's not relevant.
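For reference, the starting point of this walkthrough can be modelled in plaintext Python as below. The squared-difference formulation is an assumption: the transcript does not show the code, but it is consistent with the stated cost of two index accesses and one multiplication per element, and for 0/1 vectors the squared difference is 1 exactly where the inputs disagree. The cost counters are hypothetical instrumentation added for this sketch.

```python
def hamming_distance_naive(x, y):
    """Element-by-element Hamming distance for 0/1 vectors, with cost counters.

    Returns (distance, index_accesses, multiplications).
    """
    total = 0
    index_accesses = 0
    multiplications = 0
    for i in range(len(x)):
        diff = x[i] - y[i]    # two index accesses into the (encrypted) vectors
        index_accesses += 2
        total += diff * diff  # one multiplication per element
        multiplications += 1
    return total, index_accesses, multiplications


if __name__ == "__main__":
    print(hamming_distance_naive([1, 0, 1, 1], [1, 1, 0, 1]))
```

For vectors of size four this gives eight index accesses and four multiplications, matching the cost quoted for the unrolled program.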
And now instead of having binary addition operations, we have an n-ary addition. And this is actually quite important later on, but for now I just hope you will believe me that this will become relevant.

Okay, so this is the next step. Here it's much easier to see how this relates to batching, because what we do now is we look through the operations and we try to apply an operation to the entire vector instead of to individual vector elements. Well, that's great. We can get rid of all the other ones because they've essentially been pre-computed by this operation. But the catch is that now we need to add an index access every time we use this result, right? So instead of actually removing index accesses, we merely move them around in the program. The good news is that if you do this replacement in a, let's say, clever way and you continue doing it, for example doing it to the multiply here, then we're starting to see the first improvements, right? Now we only have four index accesses rather than the eight that we had in the beginning, and we've also removed three of the four multiplies. In practice, in real-world programs, it takes a bit more than just hand-waving to make this always work. And so in the compiler we actually have a good amount of what we call target slot logic, which decides how and where to vectorize these things so that this works out in as many cases as possible.

Okay, so having done this, we've made the program, I guess, not just 50% but actually more than 50% faster. That's really nice. And in the world of, like, plaintext computing, that would be a best paper or whatever. But in the world of FHE, that's not even table stakes. 50% faster is nowhere near sufficient in a world where the performance gap between naive and expert is many orders of magnitude in the worst case. And so I promised you we'd get back to this interesting addition operation, and that's what we're going to do now. First of all, we need to translate these index accesses into rotations.
So that means that now, instead of fully emulating a scalar, we're just moving the value that we want to the same slot in each of the ciphertexts that we have. Rotations obviously are a part of why an index access is so expensive, but they're still cheaper than fully emulating an index access. And then we can get rid of the last index access in this program by just saying, look, if we're trying to return a scalar, we'll just look into the first slot, the zeroth slot, of your ciphertext. And that gets us to a program that now has no more index accesses, a single multiply and a few rotations. This is much, much faster than the input program was, but most importantly, it's still linear; it's still O(n) in the size of the input.

And we can actually do significantly better here. And that's by exploiting that all of these inputs here have the same origin. So all the things being added originate from the same ciphertext; they're just different rotations of the same ciphertext. And here we are the first, as far as I know, to exploit, automate and generalize this FHE folklore technique that is used to compute the sum of the elements in a ciphertext. And the way we do this is fairly straightforward: we take a copy of the ciphertext, we rotate it by half, we add them back together to get the partial sums, and then we rinse and repeat this for a logarithmic number of steps until we have the sum in every slot. And obviously in the concrete example here we only went from three rotations to two rotations, because at size four the difference isn't significant, but you can, I think, clearly see how for larger sizes, and in FHE we're talking 8,000 to 64K elements per ciphertext, going from O(n) to O(log n) can make a significant difference in runtime. And I will have some evaluation results on this later. Okay, so this is sort of a toy walkthrough of what HECO does in terms of compilation.
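The end result of the walkthrough, a fully batched Hamming distance using the rotate-and-sum technique, can be modelled in plaintext Python as follows. This is a sketch under assumptions rather than HECO's actual output: lists stand in for ciphertext slots, the vector length is assumed to be a power of two, the squared-difference formulation is assumed for 0/1 inputs, and all helper names are hypothetical.

```python
def rotate(slots, k):
    """Cyclic rotation: slot i of the result holds the old slot (i + k) mod n."""
    k %= len(slots)
    return slots[k:] + slots[:k]


def add(a, b):
    return [x + y for x, y in zip(a, b)]


def sub(a, b):
    return [x - y for x, y in zip(a, b)]


def mul(a, b):
    return [x * y for x, y in zip(a, b)]


def fold_sum(slots):
    """Rotate-and-sum: after log2(n) steps every slot holds the total.

    Returns (slots, number_of_rotations). Assumes n is a power of two.
    """
    n = len(slots)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    rotations = 0
    step = n // 2
    while step >= 1:
        slots = add(slots, rotate(slots, step))  # add a copy rotated by half the span
        rotations += 1
        step //= 2
    return slots, rotations


def hamming_distance_batched(x, y):
    """One SIMD subtraction, one SIMD multiply, then a logarithmic fold."""
    diff = sub(x, y)
    squared = mul(diff, diff)
    summed, rotations = fold_sum(squared)
    return summed[0], rotations  # the result is read from slot 0


if __name__ == "__main__":
    print(hamming_distance_batched([1, 0, 1, 1], [1, 1, 0, 1]))
```

At size four the fold needs two rotations; at 8,192 slots it needs 13, compared with 8,191 when each element is rotated into place individually.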
When we zoom out a bit, the compilation pipeline looks like this. We take in an AST for the program from the front end, and we then convert it into the SSA high-level intermediate representation, SSA here being static single assignment; it basically means we don't have variables that are overwritten again. It's a very standard form for compilers and makes it much easier to write optimizations. We also then apply a bunch of standard out-of-the-box simplifications: constant folding, common subexpression elimination, standard techniques that just help us get the program size down a bit. Then we do what we call type separation, where we split the vectors and tensors that appear in the program into the ones that are irrelevant, because they're operating on plaintext, and the ones that we actually care about, which operate on secret values. That allows us to then do all of our optimizations only on the things that we should actually be optimizing, because it would not be good to apply FHE-style optimizations to standard plaintext number crunching.

After this we do another round of simplifications, and then we do the vectorization pass, which is what I spent most of my time just walking you through. Because the vectorization pass could unlock more redundancies, we do another set of simplifications before finally doing this ciphertext folding pass, which exploits the folklore technique that I just showed you. And then from there, continuing this train, we go to the scheme-specific intermediate representation, and that would be the point where we do what's called noise management, which is inserting ciphertext maintenance operations, and all these other slightly more technical, crypto-related aspects of the compilation. Okay, so that's what HECO does. Well, how does it do?
Well, if we look at some actual evaluation results of the optimization, we can see that when we compare HECO solutions against naive non-batched solutions, we see orders of magnitude of speedup. So that's a log axis, right? And we see that in pretty much all of the cases we get at least an order of magnitude of speedup. And that's great in itself, I think, as a result, but in order to show that we're not just getting some speedup but are actually finding what the experts would have written, we're also comparing, where possible, to a tool called Porcupine. So Porcupine is also an FHE compiler, but it's a synthesis-based tool. So it really only works for toy sizes, and even then it takes up to 20-plus minutes and sometimes just fails outright. But where we could evaluate it, we could show that HECO and Porcupine actually produce essentially the same output. There are some situations where Porcupine is ever so slightly better than our tool, but it's basically within the margin of error on this kind of scale.

And so what we're showing with HECO is that you can in fact achieve near-expert batching solutions for a lot of different programs, the same way that the synthesis-based tools have been doing it, but actually do it in less than a second of compile time, making it actually practical to give it to non-expert developers and ask them to integrate this tool into their develop-compile-debug cycle, because obviously you cannot wait for a 20-minute synthesis compilation every time you want to do something. Of course, HECO also scales to more real-world example sizes, and in the paper, which is currently under submission but for which we have a preprint up on arXiv, you can see more about how HECO scales to larger, real-world-sized problem instances.
This is me nearly finished, but I want to do one quick diversion before I conclude, which is that I want to briefly talk about FHE standardization, but maybe not in the way you're thinking about, which is scheme standardization. That is actually going amazingly well. For several years now we've had a great de facto community standard which specifies what levels of security are considered acceptable, what parameter sets are considered safe and so on, and as a result of this being very, very stable over the last few years, it is now being turned into an official ISO standard. It's being driven by Microsoft and Intel primarily right now, but obviously as an ISO standard it has a huge community around it now working on it, and I'm very optimistic that, within the sometimes glacial pace of an ISO process, this will come out fairly soon and then we'll actually have a very fixed and nice description of all the standard schemes.

What we don't currently have, and what I obviously as a compiler person care about much more, is standardization in terms of intermediate representations. We did have a draft API that came out around the same time as the original community standard for the schemes, but I think it was just a bit too early at that point, so what we ended up with instead is just a de facto conceptual API. For example, SEAL has an evaluator object that then offers add, multiply, et cetera functions corresponding to the scheme operations, and pretty much all the other libraries follow a similar pattern. The problem is that, well, first of all they're different, but that's just an engineering question; but also conceptually it's not quite sufficient to really think about and argue about.
So this is just one layer in HECO, whereas all the other layers don't really have an obvious standard representation. And so now that compiler efforts are accelerating, and I don't just mean our compiler, but there's also the transpiler by Google, Zama has an internal compiler, and there are various other parties and stakeholders actively building compilers for FHE, I believe it's high time to revisit the standardization and expand it beyond just this one level. And so we actually have a round table, an informal round table on unifying FHE abstractions, that I'm moderating. We meet roughly monthly, and it includes stakeholders from Intel, Microsoft, Google, Zama and a variety of other people and companies working on FHE compilers. And so if you happen to be watching this and are interested in this area, please reach out to us or to me, because we'd love to talk and hear your ideas, even if you're not necessarily working on FHE compilers but maybe on other kinds of advanced cryptography compilers, like for zero-knowledge proofs or MPC.

Okay, now it's my actual final slide. I just want to refer back to the paper once more and also mention that you can actually find HECO in a more or less ready-to-run state on our GitHub at github.com/MarbleHE/HECO. One caveat: currently you need to go to the dev branch, not the main branch, in order to find the compiler. Thank you.