Good morning, everyone. Thank you for coming to my talk. I believe everyone sitting in this room has a mobile phone and a laptop, and you'll know that when you have access to the internet, lots and lots of data is being collected from your machines, sometimes with your consent and sometimes without it, and it's really hard to track who's taking your information and what they are doing with it. But would it help to know that there's a way for all these websites to collect your data and give you amazing recommendations and inferences out of it, while at the same time ensuring that your data remains private and nobody gets to know your personal information? That's why I'm here introducing homomorphic encryption, which ensures privacy all around us. The main focus here is going to be how we try to ensure privacy in the open-source world using homomorphic encryption.

Before we dive deep into this, I would like to take a moment and introduce myself. My name is Akanksha Duggal. I am a senior data scientist on the emerging technologies data science team at Red Hat, in the Office of the CTO. I originally come from Boston, United States of America, and I have my GitHub, LinkedIn, and Twitter all linked here. If you have questions or concerns about this topic, I'm happy to chat about that.

So let's move towards homomorphic encryption. I would assume that most of us know what encryption is, but I would still like to take a moment to explain it. It's basically a way of scrambling data so that only certain authorized parties have access to what that data actually means. And what is homomorphic encryption?
It's basically the process by which we can perform computation on this encrypted data. As most companies continue to develop machine learning models, sometimes these models are a key asset to the company and therefore cannot be directly shared with a client who wants to use the model on their data. At the same time, this data is also confidential to the client, who doesn't want to share it with the company providing machine learning and AI services to them. That's where homomorphic encryption comes into the picture. It lets you apply the company's machine learning model to the client's data without either of them getting to know the details of the private information the other holds.

Talking about applications of homomorphic encryption, there are tons and tons of them, from health care to smart electric grids, from education to machine learning as a service. You name it: homomorphic encryption can be applied to literally any industry where input privacy is the paramount concern.
In my opinion, the most important use case of homomorphic encryption lies in the health care industry, where precision medicine involves a lot of privacy-related rules and regulations. A lot of companies in the pharmaceutical industry are creating medicines, and they need important data to predict what sort of medicines we want to have for certain kinds of diseases. It's really, really important to also ensure the privacy of all the data concerning the patient. That's where homomorphic encryption is super important, because if you breach any of the rules and regulations associated with the privacy of patients, it comes at a huge cost. Homomorphic encryption lets you avoid that extra cost that you usually pay if you breach these rules and regulations, and it still helps you make good predictions using all the past data that we've collected over the years.

Now, talking about how homomorphic encryption can be really vital in the open-source world: as we all know, the benefit of homomorphic encryption is to ensure a private, secure, collaborative environment, and that's the common ground with the open-source world. Transparency is the most important thing we would like to ensure when it comes to open-source communities and open-source projects, and at the same time we would like to ensure some sort of privacy and a secure environment for our contributors as well.

Starting off with protecting sensitive information: open-source communities oftentimes deal with a lot of sensitive information. For example, there is user data, there are passwords, some financial transactions, and we would like to ensure that this data is not open to everybody even though it's a part of the open-source project. Homomorphic encryption ensures that only authorized parties get to see the private information, while we still try to ensure that the data is put in the right places, encrypted, and only used for the right purposes.
It also helps in secure collaboration. For example, there are sometimes multiple companies collaborating on an open-source project. They want to contribute their technical abilities and technical knowledge to this project; however, they do not want their proprietary information to be shared with the competitor company they are working with on this project. So homomorphic encryption ensures that there is secure collaboration between different companies and developers while they contribute to open-source projects.

Protecting intellectual property: as we said, sometimes the models and the data that these companies bring together are an asset to these companies. However much they would like to contribute to the open-source world, sometimes a few things cannot be made public. For example, take an automated driving system. Tesla is like the only leader in the market at this point. Suppose other companies were to join hands and put together machine learning models and data sets, but they cannot share the data sets with the whole wide world. What they can do is put together their data and encrypt it, while all of them use it for the same model, ensuring that all the data they've collected stays private while still making good predictions for the automated driving systems.

Finally, ensuring data privacy: it is also very important to ensure that any sort of personal data is protected from getting leaked. Also, a lot of open-source projects have secure voting. Oftentimes we have to vote on decisions that we would like to take in open-source communities: do we want to go further with this project, do we want to give extra brownie points to this particular project? And sometimes people don't have the privacy to cast their vote freely. So homomorphic encryption is something that could be brought into the picture to ensure that people have the right to vote and also the privacy to vote for
whichever project they want to vote for.

So, having said that, homomorphic encryption does have a lot of advantages, starting with performing inference on encrypted data just as it would on the plain data. There is no interaction involved between the data set holder and the model holder, and it also helps with outsourcing data storage.

But everything comes with some disadvantages. Since the computation is performed on encrypted data, it is computationally very, very expensive. Normal computers are usually not designed for homomorphic encryption workloads, so it comes at a huge cost, and even the smallest calculations, just addition or subtraction, take a lot of resources. Besides that, it also has some limitations in terms of the calculations that you can perform on the data. The major thing we do as data scientists in any of our projects is data filtering, data cleaning, data comparison. But homomorphic encryption specifically lacks this use case: you cannot compare two values. It basically cannot tell you which value is less than the other, and that also makes data filtering or data division an impossible task. So anything that does not involve division or comparison can be performed using homomorphic encryption, and this is like the only limitation that we have so far.

So, talking about different types of homomorphic encryption: back in the 80s and 90s, when we first started (not me, but when the world started to research homomorphic encryption), the first type of homomorphic encryption was developed. It's called partial homomorphic encryption. It started with the Paillier cryptosystem. It allows only basic operations, where you can perform addition and multiplication. This also has a couple of limitations, which I'll go over in the demo as well: you can add two encrypted numbers, but you still cannot do multiplication
between two encrypted numbers. You can only multiply one encrypted number by one plain number. So this does have some limitations, but if you consider the era this was developed in, I think they were doing a pretty good job at that time.

Then comes somewhat HE, which came up with some more advanced abilities. It supports only two operations, but it has more depth, if you like to get into the technicalities. You can perform algebraic equations, and the BFV scheme is one of the things that comes under somewhat HE. I will not go into too much detail about these schemes, because I think that's a good topic for another talk, but just to give you an overview, the BFV scheme mainly comes under the somewhat HE category.

And then comes fully HE. As of now, most researchers and companies are using fully homomorphic encryption techniques. It allows you to perform any number of complex operations at any depth: exponentials, matrix multiplication, you name it, and you can perform it using fully homomorphic encryption at this point. Even machine learning models like linear regression, logistic regression, CNN models, image detection, all of it is possible only because of the fully homomorphic encryption scheme. It uses an approximation of real numbers to make predictions and perform calculations on encrypted data.

One small difference between somewhat HE and fully HE: for example, say we have two numbers, two and three, which add up to five. Somewhat HE would approach this problem as: encryption of two plus encryption of three gives exactly five once decrypted. Whereas fully HE gives you an approximate estimate of what this number could be. It does not directly add two and three.
It would be somewhere around, say, 1.99 to 3.99, or it would say 1.99 plus 2.99, something with an error of, say, plus or minus ten percent, so it would not be an exact calculation. But if you look at long-running, complex calculations, this is actually much more accurate than just assuming the values to be integers, which would cause some sort of errors at some point if you're working only with whole numbers. So when it comes to complex mathematical operations, it makes much more sense to have the approximate feature, where you can go to decimal points and perform the calculations.

We also performed a comparison study of various open-source tools that are available, and did some research on that. Talking about the kind of work I do at Red Hat: our team is mainly focused on research-based projects, mainly in Python, where we come up with machine learning and AI solutions for any sort of problems within and outside Red Hat. As a part of that, most of our code is written in Python, so we thought that if we have to integrate this feature into any of our projects, it would make much more sense to have the tool written in Python. Whereas when you first Google homomorphic encryption, you will see Microsoft SEAL, which is written completely in C++. It is not something that we can directly use for any of our projects. So the first challenge we wanted to tackle here was that we need open-source libraries that are specifically written in Python and that can be seamlessly integrated with the codebase we already have.

So, starting with the Paillier cryptosystem, which is the first technique that was developed.
It was developed in 1999. It's a partial homomorphic encryption scheme. As I said, it does have limitations on the operations it can perform: we can only do addition, and multiplication of one plaintext and one encrypted number. So this is kind of a shortcoming of this encryption system, that it cannot perform complex operations. But for anybody who's just starting to get into cryptosystems and encrypting numbers, I think the Paillier cryptosystem is the best way to understand the basics of this concept and to get an understanding of how and why we are doing all of this.

Then comes PySEAL, which is a wrapper for Microsoft SEAL, but it has a lot of limitations when trying to import it into our Python code. It has to be built with a certain number of config files, and there are a lot of dependencies that are oftentimes hard to manage. When it comes to open-source projects, where we want seamless integration with new people and new projects, it's kind of difficult to manage something like this.

So we thought we'd research this more, and then we found py-fhe. It was developed a couple of years ago at MIT as part of a PhD project. It's a fully homomorphic encryption library written in Python, and it includes the BFV and CKKS schemes. It has almost all the operations that you would like to perform on your data. But as soon as these people graduated, they stopped maintaining this project. However much we would like to use it, it definitely lacks the documentation needed to be used on a long-term basis.

And then we found Pyfhel, which is also an open-source homomorphic encryption library. It has tons and tons of operations available. It has a very similar syntax to normal arithmetic.
It's super easy to use, and it uses C and C++ in the back end. It's perfect for any sort of homomorphic encryption operations. But if you know data science, most of our data is in the form of vectors, and when it comes to complex operations, we would also like to use vector arrays and things that operate on tensors.

Finally, we more or less ended our research at TenSEAL. It is an open-source library developed by OpenMined. It is built on top of Microsoft SEAL and is super easy to use. It's literally a pip install tenseal, and you can start using it right away in your scripts and in your Jupyter notebooks. It's super seamless to use, and with the ability to perform these operations on tensors, we can use PyTorch models; we can perform any sort of machine learning operations using TenSEAL. We've also done a proof of concept using the TenSEAL library. It's super easy to perform logistic regression and make predictions on any data that you have. So this is by far the most awesome library that we found.

OpenMined is also currently working on another open-source Python library, which is very similar to PySEAL. Basically, they're reviving that old library, which went dead a couple of years ago, and making it a pip install that you can use for any sort of data that you have. But TenSEAL is more focused on tensors, vectors, and complex machine learning algorithms.

So I think I'm going to move to the demo and give you an overview of what we've done so far and how we got to this point in the research. It looks super easy, like we just did a comparison study, but it's months and months of effort to get to understand what we've done so far. I would also like you to go to the repository and check it out. If this is something that interests you, you can go to the repository, where we have put together all the documentation and all the pain points that we went through. There's issues.
There's documentation that you could go through.

So, starting off with the notebooks that we've put together. As I said, we started with Python Paillier. This is the most basic way of getting into cryptosystems. You can go through the notebook and try to understand how and why we are doing this. Starting off, we import the Paillier library and just assign two random numbers, and then we try to encrypt them and perform addition on them. Once we add them, we get the result, which is exactly the number that we were expecting. We also try the same thing with a scalar and an encrypted number; again, we get a correct result. But the most important thing to notice here is how long it takes to perform this computation. I agree, this is an accurate result for the addition, but it takes a lot longer than expected. If you look at it, the homomorphic addition took 92.8 microseconds, whereas the vector addition, which is the normal addition, took 102 nanoseconds. So if you compare the two, the vector addition is roughly a thousand times faster than the homomorphic addition. So this comes at a huge cost of time and resources.

But if you try to do a multiplication between two encrypted numbers using the Paillier library, what you see is that it gives you an error which says "good luck with that", because it doesn't have the ability to multiply two encrypted numbers. So this was the first set of things that people started to work on in terms of homomorphic encryption, and slowly and gradually it improves, and that's what I'm going to show. I'll go over to TenSEAL, which is the final library, and make a comparative study of how it is better and faster than the Paillier cryptosystem.
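To make the Paillier steps described above concrete, here is a minimal, self-contained sketch in plain Python. It is a toy implementation with tiny hard-coded primes, not the notebook's code and not the Python Paillier library's API; the function names (`generate_keypair`, `encrypt`, `decrypt`, `add_encrypted`, `mul_plain`) are made up for illustration, and a key this small is not secure.

```python
import math
import random

def generate_keypair(p=1009, q=1013):
    """Toy Paillier key generation from two small primes (insecure, demo only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    g = n + 1  # common simple choice of generator
    # With g = n + 1, L(g^lam mod n^2) equals lam mod n, so mu is its inverse mod n.
    mu = pow(lam, -1, n)
    return (n, g), (lam, mu)

def encrypt(pub, m):
    """Encrypt an integer 0 <= m < n with fresh randomness r."""
    n, g = pub
    n2 = n * n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(pub, priv, c):
    """Recover the plaintext: m = L(c^lam mod n^2) * mu mod n, with L(u) = (u - 1) // n."""
    n, _ = pub
    lam, mu = priv
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n

def add_encrypted(pub, c1, c2):
    """Ciphertext + ciphertext: multiplying ciphertexts adds the plaintexts."""
    n, _ = pub
    return (c1 * c2) % (n * n)

def mul_plain(pub, c, k):
    """Ciphertext * plain scalar: raising a ciphertext to k multiplies the plaintext by k."""
    n, _ = pub
    return pow(c, k, n * n)

pub, priv = generate_keypair()
a, b = 15, 27
ca, cb = encrypt(pub, a), encrypt(pub, b)

enc_sum = add_encrypted(pub, ca, cb)   # encrypted a + b
enc_scaled = mul_plain(pub, ca, 3)     # encrypted 3 * a

print(decrypt(pub, priv, enc_sum))     # a + b
print(decrypt(pub, priv, enc_scaled))  # 3 * a
```

The sketch deliberately has no function for multiplying two ciphertexts: the scheme offers no such operation, which is the limitation the multiplication error in the demo points at.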
So TenSEAL, as I said, is super easy to import: just import tenseal and initialize a context with a couple of values and parameters that you would like to specify. TenSEAL also has great documentation on how you should parameterize your encryption keys, etc. Then we take a couple of vectors and try to perform addition. We first perform ciphertext-to-plaintext operations, which is one plain value and one encrypted value, and we do the same for subtraction, multiplication, etc. But let's move to the ciphertext-to-ciphertext calculation and try to track the timing and memory that it takes.

If you look at the addition, homomorphic addition took 24.9 microseconds, whereas vector addition took 1.71 microseconds. The funny thing to notice here is how far we've come. In the last notebook we saw, it was nanoseconds against microseconds, and now they both at least have the same unit. Earlier the plain addition was about 1000 times faster; at this point it's just about 15 times faster, which is still doable with the resources that are available right now. I think this is still a reasonable amount of resources for homomorphic encryption to require at this point. Then if you look at subtraction, that is also around 15 times faster. Multiplication, however, takes much, much longer: vector multiplication is 3000 times faster than homomorphic multiplication.

But if you have done any sort of computation on data and machine learning, you would know that it's no piece of cake. It does not require just addition. It's much more complex than you would imagine; there are derivatives, etc. So it takes much, much more resources when it comes to real problems, and that's why we performed a proof of concept to see how long and how expensive it is going to be to actually perform homomorphic encryption in the real world. So we did a logistic regression proof of concept.
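As a quick check on the addition timings quoted in the two notebooks, the slowdown factors can be worked out directly. This is only arithmetic on the figures as spoken; the TenSEAL encrypted-addition figure is garbled in the transcript and is read here as 24.9 microseconds, which is an assumption, and the variable and function names are illustrative.

```python
# Timings as quoted in the talk, converted to seconds.
PAILLIER_ENCRYPTED_ADD = 92.8e-6  # 92.8 microseconds
PAILLIER_PLAIN_ADD = 102e-9       # 102 nanoseconds
TENSEAL_ENCRYPTED_ADD = 24.9e-6   # ASSUMPTION: garbled figure read as 24.9 microseconds
TENSEAL_PLAIN_ADD = 1.71e-6       # 1.71 microseconds

def slowdown(encrypted_seconds, plain_seconds):
    """How many times slower the encrypted operation is than the plain one."""
    return encrypted_seconds / plain_seconds

paillier_ratio = slowdown(PAILLIER_ENCRYPTED_ADD, PAILLIER_PLAIN_ADD)
tenseal_ratio = slowdown(TENSEAL_ENCRYPTED_ADD, TENSEAL_PLAIN_ADD)

print(f"Paillier notebook: plain addition is about {paillier_ratio:.0f}x faster")
print(f"TenSEAL notebook:  plain addition is about {tenseal_ratio:.1f}x faster")
```

Under these figures the first ratio comes out a little over 900 and the second close to 15, in line with the rough "thousand times" and "15 times" comparisons made in the talk.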
We took a data set from Kaggle. This is a heart disease data set, and we wanted to predict the overall risk using all the data that was available. So we have this data available here; we just inspect what the data has, try to clean it up a little, remove the NA values, and then drop a couple of columns that looked irrelevant. Then we quickly put together a logistic regression model without any encryption on this data. It's just like five lines of code: you define a classifier, make a prediction, and finally calculate the accuracy, and this is the final report that we get after doing basic logistic regression. And mind you, this takes literally a second.

These are Jupyter notebooks, for those who don't know what this is. Jupyter notebooks are an interactive way of using Python code. You can run each cell at your own pace, you can run the cells in any order, and you get an output right away. The moment you press enter on these cells, it takes just a couple of seconds to show the result. So that's how fast creating models on small data sets, or just non-encrypted data sets, is.

Now we move forward. We create a torch model and initialize the logistic regression model here, which gave us a good accuracy on this data set. And then finally we thought it would be interesting to do an evaluation on a model and data that is encrypted and see how long it actually takes.
So when we perform encrypted evaluation on this data set, let me just quickly go to the final part. We try to do a couple of epochs. When I was doing it on my system initially, I tried five epochs, because that's just a number I chose to start with, and my JupyterHub hung up on me because this was eating up so many resources. Then I brought it down to a smaller number of epochs; I think I went with three, and I thought that was a sweet spot where it wasn't breaking for me. The average time to train each epoch here was 350 seconds, whereas the normal logistic regression would take barely a second to run. Here it took 350 into 4; that's super long for doing any sort of basic calculation on a very small data set. This data set has only 4,000 rows, and still it takes this long to make a prediction on it.

But if we talk about the accuracy, it was much better than the one we got with the basic logistic regression model. It might be a fluke, but I would like to believe that it took its sweet time using the encrypted data but still came up to the level of the plain logistic regression model, which is awesome. So one thing we are sure of here is that, if anything, this is an accurate model that makes accurate predictions. It just comes with the shortcoming that it's sometimes expensive. But we are also working with a couple of teams within and outside Red Hat to ensure that we can have some sort of accelerators and distribute our workloads, so that it is not as computationally expensive as it's getting. So hopefully next time we will have a much more optimized proof of concept where we are doing all of the same things, but in a much faster and less expensive fashion.

So that's about my demo. I will go back to the slides. I don't think we have the time, but I think... do we have time?
Yeah, I can quickly go over the most frequently asked questions. A lot of people ask, how is homomorphic encryption related to federated learning? These two go hand in hand, and a lot of people oftentimes confuse one for the other. Even though both of them ensure security, privacy, distributed workloads, and a collaborative environment, they are distinctly different from each other. Homomorphic encryption is basically: I have the model, and you send your encrypted data to me. Whereas federated learning is training machine learning models on decentralized devices. It basically means that there is a model somewhere, and I encrypt the model and send it to the people who have data that they cannot share. For example, if there's a pharmaceutical company that has sensitive patient data and wants to use my model, they'll tell me, oh, I have huge data sets, I can't possibly encrypt them and send them to you. So what I do as a data scientist is encrypt my model and send it to them, and they can use it at their own expense and try to make predictions on their data.

But coming back to the initial statement I made when I started my talk: lots and lots of websites are collecting data, and I don't want them to see my data. I don't want them to openly use my data to make predictions. Homomorphic encryption ensures the privacy of the data while making predictions, whereas with federated learning, that company would still have access to your data; the model, which is proprietary information of a different company, is the encrypted part here. So I still think that when it comes to ensuring the privacy of contributors' data or customers' data, homomorphic encryption is the way to go. But federated learning is not bad; I think it still ensures that your data stays within the company and doesn't go out into the open world. So that's the comparative study between homomorphic encryption and federated learning.

Another more
frequently asked question is, how is it related to confidential computing? Confidential computing involves a lot of hardware, and you need to ensure that you put your data in a private spot. For example, Amazon, Microsoft, all of these big players are putting up storage centers where they ensure the privacy of your data by putting it in a remote location, which is again very expensive. Not only does it require storage space, it actually requires a physical space to store all of this data somewhere in a private environment. So that's the main difference between homomorphic encryption and confidential computing. There are a couple of blogs out there that you can check out to understand the differences between the two, and this is just my study of how these two differ from one another.

So that's it from my side. This is the GitHub repository, and my email along with that of my colleague who worked on this project. Please feel free to ask any questions or raise concerns, and feel free to raise issues on the repository if you're interested in contributing; I would be happy to help you out with that.

There's a question. Yeah, so all the tools that I mentioned in the presentation were based in Python. Thank you for your question.

Yeah, I'm going to repeat the question. So he asked: since we put all our data in the database, and it oftentimes involves a lot of operations to encrypt the data, shouldn't we do it before putting it somewhere, so nobody has access to the exact information? Is that right?
Yeah, all right, that's a very good question. As a part of this project, we have just done a proof of concept, and the whole point of using Jupyter notebooks is to see the results the moment we perform any sort of calculation. But if we were to talk about the real world, where real data exists, we would try to create a pipeline for this data. The moment the data is being collected from our customers, it should be in a pipeline where a script runs on this data and ensures that it is encrypted before it reaches the database. So once it's in your database, it's already going to be encrypted, and from this database you can choose any sort of calculation or algorithm that you want to perform on this data, ensuring that the data is still private. So I think this is all customizable. I just threw everything together in one notebook. It's super easy to put the steps in different scripts and in whatever order of the pipeline you would like to perform this operation in.

Yes? Yeah. Yeah, so fully homomorphic encryption has both the CKKS and BGV schemes. Yeah, when I say, like... yeah, yeah, yeah. So most of the libraries that support CKKS also support the BGV scheme. Yeah. Yeah. Right.

Essentially, confidentially, something like, you know, actually industry grade, you know, let's say standard image recognition, you know, what's your expectation?
Okay. On any reasonable hardware, if I'm right, it's several years probably, because this is like two orders of magnitude. Yeah. I mean, if this ever became the industry standard, then it would actually push everybody out of machine learning except for like two companies, because nobody else could afford it, right? Yeah, doing this.

That's a great question, and this is the fear that a lot of companies, even us, thought would be a really big con in terms of homomorphic encryption. It's super hard to do extensive calculations. But trust me, there's a lot of research going on all over the world where people are working towards this. Personally, my team is currently working with Boston University, where we are trying to come up with some sort of FPGA that helps us accelerate and distribute our workloads, along with some not so very expensive hardware that would still allow us to do the same sort of operations, but at a lower cost and much faster. Besides that, there are also different ways we could write our code, probably just distribute our code in a way that makes it easier to perform calculations, so that it doesn't eat up a lot of resources. The biggest resource it's eating up right now is memory: it creates the keys, then encrypts the data, stores it somewhere, and that just eats up a lot of memory. But if we have a very distributed way of performing all of this meticulously, I think the day is not far off when we will be able to do all of this seamlessly.

And about image detection, there's a lot of research going on. There's the MNIST data set, if you're aware of it, with the digits 0 to 9, and there are pictures of these digits where the machine learning model tries to recognize the digit just by looking at the picture of it. There's a CNN model, which a lot of people have been exploring on Kaggle using homomorphic encryption as well. There's a lot of open research going on there, and it's somewhat accurate at this
point. It's just that, since this is an image, it takes longer. You know, even in normal machine learning, image detection is much harder than working with numbers, and it's like five seconds, exactly. But we do have to give the benefit of the doubt to homomorphic encryption, because 1999 was the first time an actual homomorphic encryption system was developed, and we've come so far. And since we are starting with logistic regression and CNN models, I think it's only the beginning of this era for homomorphic encryption. In no time, probably in a couple of years (this is the ChatGPT era, we never know), we may just reach the point where we are able to do all of this seamlessly. Soon.