 All right, good morning, everyone. Thank you for coming to my talk. The topic of my presentation today is homomorphic encryption in the open source world. All of us have a mobile phone, a laptop, and internet access, and lots and lots of data is being collected. And I don't know what it's being used for. That's a valid concern that most of us have. But wouldn't it help to know that all of this data could be anonymized, so that none of your private information gets leaked? At the same time, you would still get good inferences, predictions, and a good recommendation system. Well, who wouldn't like that? And that's why I'm here to talk more about homomorphic encryption. Before we dive into the technicalities, I would like to introduce myself. My name is Akanksha Dughal. I'm a senior data scientist in the emerging technologies team in the office of the CTO at Red Hat. I'm based in Boston. And there's my GitHub, LinkedIn, and Twitter, whichever one you want to ping me on. If you'd like to have a chat and discuss this topic further, I'm happy to do that. So let's get back to homomorphic encryption. Starting with what encryption is: most of us already know, but it's a way of scrambling data so that only the authorized parties get access to the exact information. It's a process of converting human-readable text into an incomprehensible version, known as ciphertext, that only certain parties understand. And homomorphic encryption is a kind of encryption that allows you to perform computations on this encrypted data. It aims to make performing operations on encrypted data as smooth and simple as performing them on any normal data set. As companies continue to develop machine learning models for different tasks, these models can be a private asset, something that you cannot share with the public for several reasons.
And on the other end, the user, who also has some private information, does not want the model or that company to know their data. Homomorphic encryption lets you encrypt this data in a way that only the client side knows what the data looks like. The encrypted data is sent to the model, and the result of the model comes back in encrypted form to the client, who is the only party authorized to decrypt it and get the final result out of it. Homomorphic encryption has numerous applications that range from health care to smart electric grids, from education to machine learning as a service, and all sectors where input privacy is paramount. Making use of data in these sectors is usually already very, very complex due to regulations, the significance of the data, or related security concerns. One of the most important use cases of homomorphic encryption, in my opinion, is going to be the health care industry, where precision medicine requires intensive computation along with privacy and protection against breaches. The agencies and the hospitals must ensure compliance with relevant laws, such as HIPAA. And the pharmaceutical companies are also concerned about protecting their IP here. Violating these regulations comes with disastrous outcomes for both the organizations and their patients. Homomorphic encryption is one solution to some of these trade-offs, and it comes at a minimal cost compared to the outcomes of violating these regulations. Coming back to the topic of this presentation, which is homomorphic encryption in the open-source world: where does this come into the picture? How can it be used in the open-source world and help the communities become a better place? Starting with the benefits, both homomorphic encryption and open source promote transparency, allowing verification of algorithms and implementations, and thus aim to enhance security at each point.
It also encourages a collaborative environment for researchers, developers, and contributors while ensuring the privacy of their data. Now some of the use cases. Starting with protecting sensitive information: open-source communities oftentimes hold sensitive information, such as user data, passwords, or financial transactions. Homomorphic encryption can be used to encrypt this data, protecting it from unauthorized access, while still allowing authorized parties to perform computations on the data. Secure collaboration: open-source projects oftentimes involve collaboration between multiple parties, often competitors, including developers, contributors, and users. Homomorphic encryption can allow them to securely share their data and code between these parties without revealing any sort of sensitive information, while they all continue to work together on the same project. Protecting intellectual property: most projects rely on open collaboration, but it's important to protect intellectual property, such as code and algorithms that could be private to a particular user or contributor. Homomorphic encryption allows you to protect that property by encrypting the code and algorithms, while still allowing authorized parties to execute computations using your code. Also, data privacy: lots and lots of data is being collected on a daily basis, be it by open-source projects or by simple websites that you're often not even aware are collecting your data. Homomorphic encryption allows you to encrypt this data and ensure user privacy at all points, and it also allows data scientists like myself to make inferences from the data without actually knowing what the data is or knowing any personal information about the person whose data it is.
Also, facilitating secure voting: if you've been a part of an open-source community, you know there's oftentimes a voting procedure, and sometimes, under pressure, people are not able to cast their vote freely. Homomorphic encryption would allow you to keep your votes private while you still get to give your opinion on any decision related to your open-source project. Well, every coin has two sides, so homomorphic encryption has a lot of advantages but also comes with certain disadvantages. On the plus side, it can perform inference on encrypted data, the model and the data never see each other, it does not require any interactivity between the client and the business at any point, and it allows you to outsource your data storage and processing. However, it comes at a very expensive cost in terms of computation, most computational workloads are not designed in a homomorphic-encryption-friendly way, and it is restricted to certain kinds of calculations. Division and comparison of two values, which are crucial, are not possible, and neither is data filtering, because you cannot compare two values. In some open-source projects you cannot even subtract or divide values. That is a very big limitation, especially for the data scientists working on this. As part of our team at Red Hat, we did research to get a better understanding of what homomorphic encryption is and how we can use it in our projects. We started from the start: we looked up the first time somebody talked about homomorphic encryption, which was in the 80s and 90s, when the Paillier cryptosystem and the most famous one, the RSA cryptosystem, were developed.
So starting with partial homomorphic encryption: this type of scheme was the first of its kind and was developed to evaluate circuits composed of a single type of gate, either addition or multiplication, but never both at the same time. It didn't restrict the size or depth of the circuit, but it was suitable for performing only addition or only multiplication on the encrypted data. The RSA cryptosystem is the biggest example here, and the Paillier cryptosystem is another example of partial homomorphic encryption. Then, a few years later, with more research, came somewhat HE, which supports two types of gates. A circuit can be composed of both addition and multiplication, but with a restriction on the depth of the calculation that you can perform. SHE is useful for evaluating low-degree polynomials. However, sometimes we need to evaluate circuits of arbitrary depth, so SHE wouldn't be the best bet there. The BFV scheme is also a type of SHE. I will talk a little bit about the scheme later but won't go into too much technical depth, because these involve a lot of calculations and mathematics, and that would probably be some other talk. Then comes fully homomorphic encryption. This kind of scheme evaluates circuits that are composed of both addition and multiplication gates and, in contrast to somewhat HE, FHE has unlimited circuit depth, which makes it suitable even for deep learning applications. All fully homomorphic encryption schemes use a specific type of post-quantum cryptography called lattice cryptography, and most of these libraries require deep expertise in the underlying cryptographic scheme to use them correctly and efficiently. The CKKS and BGV schemes come under the category of fully homomorphic encryption. CKKS uses approximate arithmetic, as compared to the BFV scheme in somewhat HE, which uses exact computation.
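The single-gate property of partially homomorphic schemes can be made concrete with textbook RSA, which is multiplicatively homomorphic. The following is a minimal sketch, assuming tiny hard-coded primes and no padding (both chosen here purely for illustration and unsafe in practice); the helper names `encrypt` and `decrypt` are hypothetical and not taken from any library mentioned in the talk.

```python
# Toy textbook RSA (no padding, tiny primes): illustration only, not secure.
p, q = 61, 53                # assumed demo primes
n = p * q                    # public modulus
phi = (p - 1) * (q - 1)
e = 17                       # public exponent, coprime with phi
d = pow(e, -1, phi)          # private exponent (modular inverse, Python 3.8+)

def encrypt(m):
    """Encrypt an integer 0 <= m < n with the public key (e, n)."""
    return pow(m, e, n)

def decrypt(c):
    """Decrypt a ciphertext with the private key (d, n)."""
    return pow(c, d, n)

a, b = 6, 7
# Multiplying two ciphertexts yields a ciphertext of the product,
# because (a^e * b^e) mod n == (a * b)^e mod n.
c_product = (encrypt(a) * encrypt(b)) % n
product = decrypt(c_product)
print(product)               # equals a * b as long as a * b < n
```

There is no matching operation on RSA ciphertexts that yields the encrypted sum, which is what "one gate type only" means for this scheme.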
To give you an example of exact versus approximate computation: say you have two numbers, two and three, and you encrypt both of them. BFV is just going to add two and three and make it a five. However, CKKS does approximate arithmetic, so you can do decimal and approximate calculations as well. What it effectively does is treat your two and three as something like 1.99 and 3.01 and then perform the addition on that. So sometimes you get an error of, say, 0.1%, but that is usually acceptable, and when it comes to machine learning and deep learning calculations, the CKKS and BGV schemes are more suitable for complex tasks because they don't rely on exact addition or multiplication for the calculations. We also performed a comparative study of various open source tools that already exist and tried them out to get a better understanding of where we should start working on this. First things first: when you Google FHE or homomorphic encryption, there's just one leader in the market. It's Microsoft SEAL, by far the best library available to use. The only limitation for me as a data scientist is that it's written in C++, and for most of our applications and development work we use Python workloads, and Python is better suited for machine learning calculations. That put us in a situation where we wanted to look for libraries that are better suited to Python and can be used directly. So we tried to look for wrapper functions or libraries that use Microsoft SEAL in the background but can still be used from Python and imported directly into our notebooks.
So starting with Python Paillier. The Paillier cryptosystem was developed by Pascal Paillier in 1999. It's a PHE, or partial homomorphic encryption, scheme that allows you to do addition of two ciphertexts and multiplication of a ciphertext by a plaintext, and this implementation is easily available as the Python Paillier library. You can pip install it and start using it. Even though this is not the latest technology, I would recommend that anybody who's trying to get their hands on homomorphic encryption start with the Paillier library first to get a better understanding of how encryption works, how decryption works, and how basic calculations are performed. I will also show a small demo of this in a bit, once I cover all the libraries. Then comes PySEAL. It was an active library when we were looking at it, but it was soon archived. It also uses C++ in the background. It is a wrapper but does not do justice to the original work. It's not at all easy to use and lacks documentation, and even to build this library you need to build it inside the repository, and there are tons of config files, and I'm sure you don't want to deal with that. So we decided not to move further with this library. Then comes PyFHE. It was developed a few years ago by some folks at MIT as a part of their thesis.
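The two operations Paillier supports, ciphertext plus ciphertext and ciphertext times plaintext, can be sketched from scratch so the mechanics are visible. This is a toy implementation under stated assumptions, not the Python Paillier library's API: the function names (`keygen`, `encrypt`, `decrypt`, `add`, `mul_plain`) are hypothetical, the generator is fixed to `g = n + 1`, and the two Mersenne primes are far too small for real security.

```python
import math
import secrets

def keygen(p=2**31 - 1, q=2**61 - 1):
    """Toy Paillier key generation; p and q are assumed demo primes."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)   # lcm(p-1, q-1)
    mu = pow(lam, -1, n)      # valid shortcut because g is fixed to n + 1
    return n, (lam, mu)       # public key n, private key (lam, mu)

def encrypt(n, m):
    """Encrypt integer m under public key n with fresh randomness r."""
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m % n, n2) * pow(r, n, n2)) % n2

def decrypt(n, priv, c):
    """Recover the plaintext using L(u) = (u - 1) // n."""
    lam, mu = priv
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n

def add(n, c1, c2):
    """Ciphertext + ciphertext: multiply the ciphertexts mod n^2."""
    return (c1 * c2) % (n * n)

def mul_plain(n, c, k):
    """Ciphertext * plaintext constant: raise the ciphertext to the power k."""
    return pow(c, k % n, n * n)

n, priv = keygen()
c2, c3 = encrypt(n, 2), encrypt(n, 3)
total = decrypt(n, priv, add(n, c2, c3))         # 2 + 3
scaled = decrypt(n, priv, mul_plain(n, c2, 3))   # 2 * 3
print(total, scaled)
```

Note that there is no function here for multiplying two ciphertexts together; that operation is simply not available in this scheme.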
It's an open, fully homomorphic encryption library that supports all the available schemes, like BFV, BGV, and CKKS, and tons of operations like addition, multiplication, even subtraction, relinearization, rescale, modulus, rotate, conjugate, matrix multiply, and whatnot. But it ended when their school ended: they stopped maintaining this library right after their thesis was over, and there is very little documentation left, so we couldn't use it. Another library we found was Pyfhel. It implements the functionality of homomorphic encryption libraries, allows you to do all sorts of calculations, and is really good for machine learning demos as well. It's a great implementation, and we tried it out as part of a proof of concept, and it's pretty good to use. But while we were researching we also found a much better library, called TenSEAL. It's developed by OpenMined and it allows you to perform operations on tensors as well. It's built on top of Microsoft SEAL, so it preserves efficiency by implementing most of its operations in C++. It has excellent features enabling encryption and decryption of vectors using the BFV and CKKS schemes. It allows you to perform vector multiplication, dot products, or anything else you can think of using vectors, and coming from a data science point of view, I think it's really important to deal with vectors, and something that allows you to do seamless calculations is the need of the hour. It also allows you to run more complex machine learning algorithms like logistic regression or CNNs, and we did a proof of concept using TenSEAL. So without any further ado, I'm going to take you all to the demo. I'm going to just change my screen. Sorry, can you all see it fine? Perfect. So this is the home for all the code that we've put together so far.
We've put together all the research in documentation, everything we've found so far, with all the references and our takeaways, so you can go check it out. Besides that, we did a lot of proofs of concept, tried one library at a time, and did basic hello-world computations to see how they work and how easy they are to use for somebody who has no idea what homomorphic encryption is. So starting with the Paillier notebook. As I said, this is the easiest way to start getting an understanding of homomorphic encryption. It's just a pip install, and once you have two numbers you can quickly encrypt them and add them, and that simply gives you the result. If you talk about the time taken to perform this calculation, homomorphic addition took 92.8 microseconds and the plain vector addition took 102 nanoseconds. So we can obviously see that the plain addition was much faster and the homomorphic addition took a fair amount of time. It took 1,200 times longer than the normal addition. So, standing by my earlier point, it is expensive and time consuming to perform encryption and perform operations on encrypted data. But this is from back in the 90s, and we've come a long way from there, which I'll come to later. And when you try to multiply two encrypted numbers, it's not possible. I'm giving this context to show how far we've come: today we're able to run even complex machine learning algorithms, whereas we started from here, where we didn't have any sort of multiplication facility. We also tried PySEAL, which is no longer available. It's complex to use and lacks documentation, so we are not going to go over PySEAL. Then there's Pyfhel, which is also pretty simple. You can directly start using the BFV scheme and encrypt your data with it. You can also create a key here and encrypt your data, and you can do integer array encoding and encryption.
So it's pretty simple to use, but I'm going to jump to TenSEAL, because according to us it is one of the finest libraries available right now for performing any sort of operation. To start, we convert our normal plaintext to ciphertext and start doing addition, multiplication, and subtraction on it. And we also try to see how long it takes. Okay, refresh it. Cool. So here the plain addition is 14.59 times faster than the homomorphic addition, whereas with the Paillier cryptosystem the homomorphic addition took 1,200 times longer than the plain one. So it's definitely much, much faster now, and also a lot easier to use, with all the research backing, research papers, and lots and lots of open source tools to explore. That makes things easier, and there's a lot of active work going on in the field of homomorphic encryption. There's also a subtraction facility, which wasn't there for the longest time; now we can perform subtraction, but it's also much slower than plain subtraction. There's also multiplication that you can perform on your data, and using the CKKS scheme you can perform more operations, like addition, subtraction, negation, power, exponentials, dot product, and matrix multiplication. However, the more complex the operation gets, the longer it takes and the more memory it uses in your backend. For example, for matrix multiplication, the plain vector version is 3,600 times faster than the homomorphic matrix multiplication. That's how expensive it gets as you increase the intensity of the operation. Now I'll give you a quick overview of the proof of concept that we performed using TenSEAL. It's a logistic regression. I got a data set from Kaggle. It's a heart disease data set for making predictions about the risk associated with heart disease, and I tried to just put together a basic logistic regression model. So literally five lines of code: I assigned a classifier here, tried to make a prediction, and got some results.
It's super easy to do logistic regression when it comes to plain data. For the encrypted version, I first moved to a PyTorch model, pre-processed the data, checked the shape of the data to ensure that everything was in place, and then trained a logistic regression model. After that, I also tried an encrypted evaluation on this data. So I trained the model on plain data to get plain parameters, and only the evaluation was performed on an encrypted data set, to see how long it takes and what accuracy we get. That also performed pretty well. I'm trying to see where that accuracy is. So yeah, the difference between the plain and the encrypted accuracy was 0.27, which is not too bad. Given that you don't see the new data, it's slightly difficult to make a prediction on it, and it still seems okay. But then I also tried to train an encrypted logistic regression model on encrypted data. Starting from an encrypted data set, it takes a lot of time to train, because your model is not aware of the exact values and you also take time to encrypt the values and then feed them into the model. So I will quickly go towards the end. I did try to see how long it takes, so I did a lot of trial and error and parameter tuning to find the sweet spot beyond which my kernel would break. I found that after three epochs I was able to get a decent enough model and it was making good predictions, and to my surprise we got a better accuracy when we trained on the encrypted data set. So that was a win, but it could be a fluke. Don't quote me on that, but for this particular data set we got a better accuracy when training on the encrypted data set, and we were easily able to make predictions; even though we had encrypted data and encrypted evaluations, it was working absolutely fine.
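The "plain parameters, encrypted evaluation" setup described above can be sketched as follows. This is not the TenSEAL/CKKS code from the demo: it swaps in a toy Paillier scheme so the example runs without dependencies, and the weights, bias, feature values, `SCALE` factor, and all function names are made-up assumptions for illustration. The server computes only the linear part on ciphertexts; the client decrypts and applies the sigmoid itself.

```python
import math
import secrets

# --- toy Paillier (assumed demo primes, g = n + 1; not secure) ---
P, Q = 2**31 - 1, 2**61 - 1
N = P * Q
N2 = N * N
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)
MU = pow(LAM, -1, N)

def encrypt(m):
    while True:
        r = secrets.randbelow(N - 1) + 1
        if math.gcd(r, N) == 1:
            break
    return (pow(N + 1, m % N, N2) * pow(r, N, N2)) % N2

def decrypt(c):
    m = ((pow(c, LAM, N2) - 1) // N * MU) % N
    return m if m <= N // 2 else m - N   # map large residues back to negatives

# --- fixed-point encoding, since Paillier works on integers ---
SCALE = 100
def encode(x):
    return round(x * SCALE)

# Client side: encrypt the feature vector (hypothetical values).
features = [1.5, 0.4, 2.0]
enc_features = [encrypt(encode(v)) for v in features]

# Server side: plain model parameters (hypothetical), encrypted input.
weights, bias = [0.8, -1.2, 0.5], -0.3
acc = encrypt(encode(bias) * SCALE)                 # bias at scale SCALE^2
for c, w in zip(enc_features, weights):
    acc = (acc * pow(c, encode(w) % N, N2)) % N2    # acc += w * x, on ciphertexts

# Client side: decrypt the linear score, then apply the sigmoid in the clear.
z = decrypt(acc) / SCALE**2
probability = 1 / (1 + math.exp(-z))
print(z, probability)
```

The server never sees the features or the score in the clear, and the client never sees the weights directly. With CKKS, as in the talk's demo, the fixed-point encoding step is handled by the scheme itself.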
So it's a foolproof technique that you can use and still ensure your data privacy at the same time. That's all about the demo. Now, homomorphic encryption versus federated learning. I'm sure this question must be in your heads, because a lot of folks are working on federated learning as well. So I will take a quick minute to tell you how these are similar to or different from each other. Both have a similar concern: they want to ensure security and privacy and allow you to work in a collaborative environment while preserving your privacy. But homomorphic encryption and federated learning differ in what is being protected. With HE, the computations are performed on encrypted data, whereas federated learning allows you to train machine learning models on decentralized devices. In federated learning, I give you my model in an encrypted format or in a secure location where you can access it, so that I never get to see your data set. Homomorphic encryption, however, involves you sending me your data in an encrypted format, and I send you back a prediction, which is also encrypted, that only you get to decrypt. So HE is obviously more complex, due to the encryption and decryption that you have to do before sending data to the model, whereas federated learning does not require you to do any sort of encryption or decryption, because you get the model and can just throw in your data right away. That's the difference. The main point here is that HE, in my opinion, is more privacy focused, which I think is an important concern, whereas federated learning is about data distributed across many devices. So coming back to why we are doing this: since a lot of websites are collecting our data, we definitely don't want them to see our data; we want it anonymized before they make predictions.
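For contrast with the encrypted-data flow, the federated side of the comparison can be sketched with a bare-bones federated averaging loop. This is a generic illustration, not something shown in the talk: the one-parameter model `y ≈ w * x`, the client data, the learning rate, and the names `local_update` and `clients` are all assumptions. The point is only that raw data stays on each client and just model parameters travel to the server.

```python
def local_update(w, data, lr=0.1):
    """One gradient-descent step for y ≈ w * x, run on a client's own data."""
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    return w - lr * grad

# Each client keeps its (x, y) pairs locally; the server never sees them.
clients = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0)],
    [(0.5, 1.0), (1.5, 3.0)],
]

w = 0.0  # global model parameter held by the server
for _ in range(50):
    # Server sends w out; every client returns an updated parameter.
    local_weights = [local_update(w, data) for data in clients]
    sizes = [len(data) for data in clients]
    # Server aggregates with a data-size-weighted average.
    w = sum(lw * s for lw, s in zip(local_weights, sizes)) / sum(sizes)

print(w)  # approaches 2.0, the slope shared by all client data
```

Nothing in this loop is encrypted, which is the structural difference from homomorphic encryption: privacy here comes from where the data lives, not from computing on ciphertexts.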
So in my opinion homomorphic encryption is the way to go, because with homomorphic encryption you don't get to see the data and can still make predictions, whereas in federated learning you are technically just importing a model and then sending your un-anonymized data through it, and that would be a privacy concern in my opinion, so that's not the way to go when it comes to somebody's private data. Similar to this is confidential computing as well. Whereas homomorphic encryption does not require any specialized hardware, confidential computing relies on a lot of hardware: you have to put together a whole confidential computing setup and store your data there, making sure nobody gets to see it. Even though the goal is nice and it allows you to ensure the privacy of your data, it comes with the heavy cost of maintaining hardware, not only the purchasing cost but also the storage cost, since it requires sizable spaces to house such hardware. And that's it. I think I've convinced you all to use homomorphic encryption on your data sets, so please let me know if you have any questions. I'm happy to take them. Just a quick question. Again, I'm a newbie to encryption and things like that, so the question is: do homomorphic algorithms support logical operators? Can I compare two encrypted values and determine if one is greater than or equal to the other, or... Yeah, there is still a bit of a situation on that front. You cannot compare two values. I think I listed that in the set of disadvantages: you cannot tell which of two values is bigger, basically, so I think that's one of the biggest limitations when it comes to homomorphic encryption. Thank you. Quick question, what is the impact on model performance? Model performance, according to the small proof of...
Response time and stuff like that. Yeah, response time is much, much, much higher. It takes pretty long to do a simple logistic regression. I think I'm going to share the time that it took: the average time per epoch was 350 seconds when trained for three epochs. However, when you run a basic classifier, a plain logistic regression model, it takes barely a second; you just run the cell and it gives you an output. And this is also a very small data set that I used here, I think 4,000 rows, and still it took that long. So it does come at a cost, but we are also trying to work on distributing our algorithms and making the computation faster, and we're working with some hardware research teams to see if it's possible to accelerate homomorphic operations. But as of now, this is the beginning of the research, so it takes a lot of time to do any sort of calculation. Thank you. At the risk of embarrassing myself, you will find from my question that I really didn't understand most of it. In that example of logistic regression, could you explain what part is known to the model owner? Say in the unencrypted data we know age, that kind of information. So as a model owner, what do I know, and when I provide a prediction, what do I know about what I'm predicting? Am I making sense? Do I know what properties are there, or do I just not know, and age is something like one million which decrypts to 35? What does that encryption look like?
Yeah, so I might be wrong here, but what your model knows is the type of data that you're receiving and some sample data to get a better understanding of what the data looks like. So with this heart disease data set, the model was aware of whatever columns were present, because column names are not encrypted here. But when it comes to predictions, I'm not sure how the model makes sense of the data or what it needs to get a better understanding of it. As far as I know, it knows the column names and just a few numbers inside them, but I'm really not sure how it works on the back end and how the model understands everything. But that was a really good question, thank you. All right then, thank you. Thank you everyone for joining.