Right. Thank you all for attending my talk this morning. My name is Naoise Holohan. I'm a research scientist at IBM Research in our Dublin lab here in Ireland, and on the AI privacy and security team I work on differential privacy. What I'm going to talk to you about today is our work in the area of privacy-preserving machine learning, using the great power of scikit-learn and Python to help us on our way.

To begin with, the area of data privacy as we now know it can trace its origins back to about the 1960s. A lot of tools and algorithms have been developed since then that are still in use to this day, and they are still very powerful for protecting data. But the ecosystem in which those techniques reside has changed a lot in the last 20 years, to the point where we now face a real risk of anonymized data sets being linked with external data sources, re-identifying individuals in those data sets and exposing all the sensitive attributes we wanted to protect in the first place.

There are many examples of these kinds of attacks happening out in the wild, and I've listed a couple of examples here. In the first example, Netflix published a data set of anonymized movie ratings for their Netflix Prize competition. They wanted researchers to improve their recommendation algorithm, and they gave them some data to help them. But researchers were able to link it with the publicly available Internet Movie Database, re-identify individuals in the Netflix data set, and expose the more sensitive ratings they had given to Netflix that they might not have given to a public service like IMDb.

In the AOL case, the company published a data set of anonymized internet search histories, and reporters at the New York Times were able to dig into the data, re-identify one individual in the data set, and expose her entire search history, which I'm sure for a lot of us would be quite a sensitive matter.

And finally, the New York City Taxi and Limousine Commission published a data set of anonymized taxi trip records a few years ago, and a blogger was able to attack the data set, link it with photographs of celebrities getting into taxis in New York City, join those records together, and find out where the celebrities were traveling to and from and how much they tipped their drivers. I think in this case it's Bradley Cooper getting into a taxi in Manhattan.

So all of these are examples of data sets that were anonymized using traditional methods, published in the wild, and subsequently attacked. But the privacy risks of data extend far beyond these kinds of examples. They extend to publishing simple statistics on databases, where database reconstruction attacks can be run on those statistics to essentially rebuild the original data set. And they extend to machine learning models, more pertinent to this talk, where you can reverse-engineer a model to find out information about its original training data.

So, following on from these failings of traditional anonymization in the modern world, the idea of differential privacy was first conceived in 2006. The key idea of differential privacy is that we introduce random noise to blur the data in such a way that we preserve the privacy of individuals in the data set, but still allow population trends to be accurately observed.
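To make that idea concrete, here is an illustrative sketch, not from the talk itself, of noise being added to a simple count query using the Laplace mechanism, one of the mechanisms diffprivlib uses under the hood; the data set and numbers are made up:

```python
from diffprivlib.mechanisms import Laplace

# A counting query has sensitivity 1: adding or removing any one person
# changes the true count by at most 1.
true_count = 1234  # hypothetical count over a sensitive data set

mech = Laplace(epsilon=0.5, sensitivity=1)
noisy_count = mech.randomise(true_count)  # e.g. 1236.7 -- blurred, but close
```

An observer can't tell from `noisy_count` whether any particular individual was in the data, yet the population-level figure remains accurate to within a few counts.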
And the unique selling point of differential privacy is that it is future-proof. By future-proof, I mean that there's no data set that can be published tomorrow, or in ten years' time, that can undo the anonymization we've just applied; there's no data set that can be published to undo the random noise we've added. That makes differential privacy the strongest privacy guarantee we have at the moment, and that's why it's such a desirable area of research.

Differential privacy also introduced the idea of a privacy budget being attached to queries. We always talk about epsilon in differential privacy, and this epsilon is linked to the amount of privacy leakage we get when we ask a query. Typically, when you have a data set, you want to limit the amount of privacy budget that can be spent, or the amount of privacy that can be leaked, and that is encapsulated in this privacy budget. It's very easy when you're running queries to keep track of the epsilons, add them up, and give a total privacy budget spend at the end of a series of queries.

To look at a simple schematic of this: we have a sensitive data set from which we're looking to extract some kind of knowledge or information in a privacy-preserving way, which we then want to pass on to a data analyst without posing a threat to the individuals in the data set. We can extend this use case to a machine learning setting, where we have a sensitive data set being fed into a machine learning or AI algorithm. For privacy's sake, we inject differential privacy in some way into the training process, and then pass the trained algorithm on to the data analyst. Because of the guarantees of differential privacy, we know there are mathematical bounds on the amount the data analyst can infer about individuals in the sensitive data set. That gives us great comfort in being able to pass this information on to an external party over whom we don't have any control.

At IBM Research, what we've done is build diffprivlib, which does all the important stuff here in the middle: it does the machine learning with differential privacy built in. We can train machine learning models with differential privacy on sensitive data sets, and then pass them outside any secure enclave to external parties.

Our approach in building diffprivlib was, obviously, to use Python, which is a very popular programming language for machine learning and data analytics. We wanted to use the de facto standards in data analytics and machine learning, namely NumPy and scikit-learn, and build upon them to add our differential privacy capabilities. A core pillar of our work on diffprivlib was to ensure an almost identical user experience to that of NumPy and scikit-learn. That extended to having a lot of default parameter settings for the privacy aspects of these functions, to ensure that anybody who was familiar with NumPy and scikit-learn would automatically be familiar with diffprivlib before they even started using it. And I think by and large we've achieved that goal, as you'll see shortly, I'm sure.

So here we have a quick code snippet which, again, if you're familiar with scikit-learn, should be fairly familiar to you too (a minimal reconstruction of it follows below). In a nutshell, diffprivlib is a library for doing machine learning with differential privacy built right in.
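The snippet itself isn't captured in the transcript, but a minimal equivalent, assuming the iris data set from the demo that follows, might look like this:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from diffprivlib.models import GaussianNB

# Load a toy data set and split it, exactly as you would with scikit-learn.
X_train, X_test, y_train, y_test = train_test_split(*load_iris(return_X_y=True))

clf = GaussianNB()         # epsilon defaults to 1.0
clf.fit(X_train, y_train)  # trains with differential privacy built in
print(clf.score(X_test, y_test))
```

The only thing that changes relative to scikit-learn is the import; the fit/predict/score workflow is identical.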
There's no expertise required: the user doesn't need to know anything about differential privacy, or even data privacy, to get up and running, thanks to all the default parameter settings and the very similar user experience. It's open source, it's up on GitHub, and it's free to use and modify to your heart's content. It's easy to install with pip. It's integrated, as I said, with scikit-learn and NumPy to get you up and running quickly. And finally, it's easily integrated with any existing scripts: typically, if you have a script running a bit of machine learning or data analytics code, in one or two lines of code you should be able to replace the scikit-learn or NumPy functions with their diffprivlib equivalents, and the script should run as it did before, but with the added confidence of satisfying differential privacy.

Before I dig into some code, let's have a quick look at the four main modules of diffprivlib.

The first of those is the mechanisms module. Mechanisms are the basic building blocks of differential privacy: they're the pieces of code that actually add the random noise to the data. Typically, a user of diffprivlib won't come into contact with any mechanisms directly, because they're all used under the hood in the tools and the models to achieve differential privacy. In essence, a mechanism is just a probability distribution from which we sample to add noise to the data.

The next module is the models module, which is the scikit-learn part of diffprivlib. We have a number of machine learning models from scikit-learn that we have implemented with differential privacy, including logistic regression, linear regression, PCA and k-means. Importantly, each of our models inherits its scikit-learn equivalent as its parent class, which gives us access to a lot of scikit-learn functionality for free and makes the library much easier to use. Now, we do push additional warnings from diffprivlib occasionally; one of those is the privacy leak warning that you see here, and I'll explain more about that when we move on to the notebooks shortly.

The next module is the tools module, which is the NumPy part of diffprivlib. This is a collection of simple functions for data analytics tasks, including mean, standard deviation and some count queries, and, importantly, the histogram function as well, which is a very important function for differential privacy: you can plot things like distributions and get counts from data sets in an efficient manner from a privacy perspective.

And finally, we have the accountant module, an accountant that keeps track of the privacy budget spend I mentioned earlier. In the snippet of code on the slide, we have three queries run on a data set, each with an epsilon of 0.1, and at the end they add up to a total of 0.3, which is what you'd expect (a minimal reconstruction of that snippet follows below). We also have the capability to use advanced composition techniques: if we allow a little bit of slack in the guarantee that differential privacy provides, we can get a big benefit in terms of the privacy budget we spend. In the plot here we see that over 30 queries, using what we call naive composition, without any slack in our guarantee, we spend more than 0.4 of our privacy budget; but allowing a little slack in that guarantee reduces the spend to just over 0.2.
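Here is that reconstruction, a minimal sketch with stand-in data and assumed bounds:

```python
import numpy as np
from diffprivlib import BudgetAccountant
from diffprivlib.tools import mean, std, var

acc = BudgetAccountant()
ages = np.random.randint(18, 90, size=10_000)  # stand-in sensitive data

# Three queries at epsilon = 0.1 each; bounds are supplied explicitly so
# the noise can be calibrated without peeking at the data.
mean(ages, epsilon=0.1, bounds=(18, 90), accountant=acc)
std(ages, epsilon=0.1, bounds=(18, 90), accountant=acc)
var(ages, epsilon=0.1, bounds=(18, 90), accountant=acc)

print(acc.total())  # Budget(epsilon=0.3, delta=0) under naive composition
```

Passing a `slack` value to the `BudgetAccountant` constructor switches it to the advanced composition accounting shown in the plot.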
So in essence, this allows us to extract more knowledge for the same privacy guarantee.

I'm going to move on to some notebook demos now, for the next ten minutes or so. The first of those is a quick 30-second introduction to diffprivlib: we're simply going to train a Gaussian naive Bayes classifier with differential privacy. We begin by importing the iris data set using scikit-learn; we then import our Gaussian naive Bayes classifier from diffprivlib, initialize it, and train it using the fit method. As we saw before, we get a privacy leak warning here, and that's because we haven't specified the bounds hyperparameter. This bounds hyperparameter ensures that the model is calibrated correctly; without it, the model is going to read that information from the data set itself, which constitutes a privacy leak above what we would expect from differential privacy. That's why we have a warning here, and typically you wouldn't want this to appear in your scripts, so we'll fix that at the end of this notebook. As I also said before, each of our models inherits the scikit-learn class as its parent class, and you can see that here. Now that the model is trained, we can classify unseen examples, so we use the test data set for that with the predict method. We have our classifications, and we can then use the score function, again from scikit-learn, to test the accuracy on the test data set. We're at about 77%, which is pretty good for the size of the data set.

In this particular cell, we're going to run our classifier for various epsilon values, and in order to suppress the privacy leak warning, we specify the bounds parameter upon initialization. What the bounds do in this case is specify the range in which the values in each column lie, so that the model can correctly calibrate the noise it's going to add. Obviously, we have to add a lot more noise to data that's spread over a wider range, and this parameter sorts that out in our model. We then plot the accuracy of this model across various epsilons on a log scale, from 10 to the minus 2 to 10 to the 2. Because of the randomness involved, there's always going to be some fluctuation in the values, so these only give a single snapshot for each epsilon, but it's still a good test to see how we're doing. We initialize the classifier with these parameters, the bounds and the epsilon value, then fit it and test the accuracy of the model. And as we can see here, for small epsilon, which is again a small privacy budget, we get quite poor accuracy and a very jagged curve; but as we increase epsilon, as we increase the privacy budget, we approach 100% accuracy at the top. That's exactly what we would expect; a condensed sketch of this sweep follows below.
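Concretely, the sweep might look like this, with the iris feature bounds assumed known a priori rather than read from the data:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from diffprivlib.models import GaussianNB

X_train, X_test, y_train, y_test = train_test_split(
    *load_iris(return_X_y=True), test_size=0.2)

# Per-feature (min, max) bounds assumed from domain knowledge; reading
# them from the data itself would constitute a privacy leak.
bounds = ([4.3, 2.0, 1.0, 0.1], [7.9, 4.4, 6.9, 2.5])

accuracies = []
epsilons = np.logspace(-2, 2, 50)  # 10^-2 up to 10^2
for epsilon in epsilons:
    clf = GaussianNB(epsilon=epsilon, bounds=bounds).fit(X_train, y_train)
    accuracies.append(clf.score(X_test, y_test))
```

Plotting `accuracies` against `epsilons` on a log scale reproduces the jagged-to-flat curve described above.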
In the next notebook, we're going to run through a differentially private machine learning pipeline. This time we're going to use the adult data set from the UCI Machine Learning Repository, which we import using NumPy. There we go. In order to establish a baseline, we first train a pipeline using scikit-learn, without any privacy. The pipeline is composed of a standard scaler, which scales each column to have zero mean and unit variance; we then reduce the dimensionality of the data set to just two columns; and we feed the resulting data into a logistic regression, which does our classification for us. In this case, the adult data set is a binary classification task. And because diffprivlib is limited to using the LBFGS solver, we use that same solver for this non-private version, so that we're comparing like with like. We initialize the pipeline, then fit it and score it using our training and test data sets, and we see here that we have a baseline accuracy of about 80%.

What we're now going to do is turn this non-private pipeline into a differentially private pipeline and parameterize it accordingly. So if we go down here: we import the models module from diffprivlib, and we have a standard scaler, correctly parameterized with the bounds; we have PCA, that's principal component analysis, reducing the dimensionality down to two; and we have the logistic regression classifier, again parameterized accordingly. For each of these we set the epsilon value to one third, which means the epsilon value for the entire pipeline, when we add it up, is going to be one. We initialize that, then fit and score it, and as we can see here, we get an accuracy of almost 81%. That compares quite favorably to the non-private pipeline, which was only 80.3%. It's not uncommon to see higher accuracy from a differentially private model compared to a non-private one: the noise that has been added reduces overfitting, as one consequence, which can improve accuracy as a byproduct. The next cell simply does a similar task to the previous notebook, running our model over various epsilon values. I'm not going to run that because it takes a bit too much time, but we can look at the results down here, extracting them from a pickle. You can see again that for small epsilon values, for a small privacy budget, we have very noisy results, but things start approaching our baseline accuracy at epsilon equals 0.1, which is a very good result in this particular case. (A sketch of this pipeline follows below.)
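The pipeline isn't reproduced verbatim in the transcript; this sketch uses synthetic stand-in data in place of the adult data set, with illustrative bounds and `data_norm` values that in practice should come from domain knowledge:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from diffprivlib import models

# Synthetic stand-in for the adult data set used in the talk.
X, y = make_classification(n_samples=5_000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Illustrative assumptions, not values read from the sensitive data.
bounds = ([-5] * 10, [5] * 10)

pipe = Pipeline([
    ('scaler', models.StandardScaler(epsilon=1/3, bounds=bounds)),
    # The scaler has already centered the data, hence centered=True.
    ('pca', models.PCA(n_components=2, epsilon=1/3, centered=True, data_norm=10)),
    ('clf', models.LogisticRegression(epsilon=1/3, data_norm=10)),
])

pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))  # total budget spent: 3 x 1/3 = 1
```

Because each diffprivlib model inherits from its scikit-learn parent, the stages drop straight into a standard scikit-learn `Pipeline`.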
And finally, the last notebook I want to share is a quick data exploration workflow with diffprivlib. I'm going to run all of this code. We begin by importing NumPy, diffprivlib and matplotlib for all of our plotting. In this case, we're going to use our budget accountant to keep track of our budget spend across the script, and we're going to use an epsilon of 0.04 for each of our queries. We have initialized our budget accountant with an epsilon value of up to one. Here we're using the covertype data set from scikit-learn, and we do a little bit of pre-processing on the data set. Then we have our column names and the ranges of each column, which we'll use later on when we're specifying bounds for each query.

As I said before, the histogram function is very important in differential privacy, and very efficient, so we can use it here to have a look at the distribution of the labels in this data set. We can see that approximately 50% of the examples in this data set are associated with label number two, and about 35% with label number one. Again, all of these queries have a little bit of random noise because of the differential privacy guarantees that we have, but for the size of the data set, these results are quite accurate and reliable.

We can extend our analysis of the distributions to the features of the data set as well. This is all of the columns of the data set, again using the histogram function to access the data, and from the results we get a good visualization of the type of data we're dealing with. Now is probably a good time to have a look at our accountant to see how much privacy budget we've spent so far: we have a total spend of 0.52, and using the len function we see that we've executed 13 queries, which correspond to the 12 queries here and the one query previously.

We can also use two-dimensional histograms to get even more insight from the data. In this case, we're plotting the distribution of each of three features against the distribution of the labels, using the histogram2d function, again similar to the NumPy equivalent. We can extend that to plotting features against other features, again using the two-dimensional histogram function: here we're plotting the horizontal distance to hydrology on the x-axis against the horizontal distance to roadways. Another way to compare features in two dimensions is to use color maps, which we again compute with the two-dimensional histogram. And finally, we have our simpler queries if we're looking to hone in on specific features of the data, using our mean, variance and count_nonzero functions as well. At the end, we can examine our total privacy loss, and if we have any other queries to execute, we can use the remaining method to find out how much privacy budget we have left to spend. (A condensed sketch of this workflow follows below.)
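Here is that sketch, with made-up stand-in data and illustrative ranges in place of the real covertype columns:

```python
import numpy as np
from diffprivlib import BudgetAccountant
from diffprivlib.tools import histogram, histogram2d, mean

acc = BudgetAccountant(epsilon=1)  # hard cap for the whole session
eps = 0.04                         # per-query budget

# Stand-in sensitive data (two hypothetical feature columns).
hydrology = np.random.randint(0, 1400, size=100_000)
roadways = np.random.randint(0, 7000, size=100_000)

# range= plays the same calibration role here that bounds= plays for models.
hist, edges = histogram(hydrology, epsilon=eps, bins=10,
                        range=(0, 1400), accountant=acc)
hist2d, xe, ye = histogram2d(hydrology, roadways, epsilon=eps,
                             range=[(0, 1400), (0, 7000)], accountant=acc)
avg = mean(hydrology, epsilon=eps, bounds=(0, 1400), accountant=acc)

print(len(acc))          # number of queries answered so far
print(acc.total())       # total privacy budget spent
print(acc.remaining())   # what's left before the cap of epsilon = 1
```

Once the cap is reached, the accountant refuses further queries rather than silently overspending the budget.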
So, before I finish, here are some additional resources for you if you want to learn more. All of our code is available on our GitHub repository under the IBM organization, and that includes all of the notebooks I presented here today, plus other notebooks as well. Our documentation is hosted on Read the Docs. And, I guess most importantly, if you want to get started with diffprivlib, it's one command in your terminal, pip install diffprivlib, and you'll be away. So that's all I have for you today. Hopefully we'll have some time for some questions, if there are any going.

Yeah, we do, we have four minutes for questions. Great. Okay, I don't think we have any questions; everything must have been perfectly clear. Oh, we have a question. Great. All right: is it possible to set up differential privacy so that you could reverse it in the future if you needed to, or is it a one-way process?

So typically what you do is keep the raw data set in a secure environment, and you can publish differentially private statistics on it. There are other techniques: for example, Apple uses what we call local differential privacy, which means the differential privacy is applied before the data even leaves your device. That means the data controller only ever sees data that has been randomized, and in that case you can't reverse-engineer it. But typically, if the data is valuable, you would keep a raw copy of it in a secure environment and only release queries on it using differential privacy. Okay, that's awesome.

Also, just a little help for me: could you share your screen? Yes, I can. Okay, thank you. All right, so this is the next question: what happens if the budget accountant runs out? In an ideal scenario, the idea is that you have a fixed privacy budget for a data set, and once the budget runs out, the data is destroyed. Clearly that's not realistic in today's world, so a lot of the time the idea is that you would give a single data analyst a fixed privacy budget to spend on a data set, and once that budget has been spent, their access to the data set is revoked. That would be the modern interpretation of it. Okay, awesome.

So the next question: you're inheriting from scikit-learn's classes; if they change, how do you guarantee that your code maintains compatibility? That requires updating the code, as simple as that. Typically, for the last two scikit-learn releases, I think 0.22 and 0.23, we've had to push out patches the following day for changes that scikit-learn made. So it's just an ongoing process of keeping it up to date. Awesome.

What is the point of differential privacy? Could you be more specific and suggest a few usage scenarios? So, as I mentioned in the talk, differential privacy was conceived because of the failings of traditional anonymization methods. Differential privacy isn't perfect, but it works very well when you have a lot of data, and a lot of sensitive data, and, sorry, I've lost my train of thought. So there are very specific circumstances where differential privacy is very useful, but we still need to use our traditional anonymization methods to safeguard against data risk. Okay, that's awesome.

I think right now we're out of time. Thank you for your talk. You can continue in the breakout room, we have a few more questions you can answer there. And also, this is for you.