So I'm going to be talking about efficiency for machine learning models. Before I begin, just to get a sense of the audience: who works with data in any form for their day job? Who builds machine learning models as part of their day job? Who builds deep learning models? So this talk is mostly about how to improve the efficiency of your machine learning models. A little bit about me: my name is Shankar. I work as a data scientist for Manulife, mostly working with financial data. Previously I did data science for Trade Gecko, and before that I spent about six years at NUS Research, mostly working on self-driving technology and autonomous drone systems. That's my background. You can get in touch with me on Twitter at @mailshanks, and that's my email address. So let's begin the talk proper. For a little more context: when I say efficiency in machine learning models in this talk, I mean statistical efficiency, not computational efficiency. In a typical deployment scenario you obviously have to worry about computational efficiency, by which I mean things like how much memory your model consumes at run time and at training time, how much disk space it needs, and the time and space complexity of your model. This talk is not about those things. Instead I will talk about statistical efficiency, which is how efficiently you use the data at your disposal. These topics are relevant to you if you work in a domain where data is expensive to acquire or limited in some sense, or if the outputs of your model feed into a critical business decision that you cannot afford to get wrong; think of trading and so on. What I mean by statistical efficiency is this: take any measure of performance for a model, and there are many different measures of performance. A statistically efficient model gives you the same level of performance with less data compared to a less efficient model. Maybe you're thinking, "But I work with big data, so I don't need to be statistically efficient; I have enough data to be highly inefficient and still get by." I claim that you never have quote-unquote big data in this context, because if you do, you can slice your data and ask more questions, and then you're back in the domain of having to worry about statistical efficiency. I'm also going to assume that you're familiar with the more standard textbook methods: cross-validation, stacking and ensembling your models, and all of that. You can read about those things in textbooks, so there isn't a lot of utility in me repeating them. Now, because we're running very late, I'm not sure I'll have time to cover everything; I'll probably not have time for the last part, but this is an overview of the talk. The first thing I'm going to talk about is how you heuristically figure out whether your data has any signal; that's part one. The second part has two subparts, and the theme there is Y-sensitive pre-processing. What I mean by that is that when you pre-process your data, if you allow the pre-processing to be influenced by the target variable, that gains you some amount of efficiency, and I'll go into the details of how exactly you do that. The third part is adding noise and differential privacy.
It turns out that if you add noise to your data set, you can actually increase statistical efficiency. That sounds a little crazy when I say it, but if you add the noise in a principled way, you can actually be more efficient. I'm not sure I'll have time to delve into the details of that third part, but we'll see how we're doing on time later. So: does your data have signal? Let's say you've built a preliminary model and checked its performance on a holdout data set, and you're in a scenario where you have a large number of predictors, maybe 50 or 100 predictors or features. In general, if some of your features are noisy or unrelated to the target you're trying to predict, and you include them in your model, your model will perform worse. So you obviously want to discard them, but it's difficult to tell which features are noisy and which carry signal. So how do you identify the features which do have signal? Here is a heuristic, or rather empirical, technique to separate features which are signal-rich from those which are not. Let's say your original data is in this form: you've got X1, X2, X3, which is just a symbolic representation of your rows, and each of those corresponds to Y1, Y2, Y3, which is the target you want to predict. Here is the basic idea of how you check whether a feature has signal. Instead of your large, complicated model, you fit a much simpler one-variable model that captures the relationship between a given feature and the outcome. That's step one. Step two: you take the data set, keep the feature you're examining, and scramble Y. So you keep your rows X1, X2, X3, but instead of Y1, Y2, Y3 in their original order, you permute them. The idea is this: if X has signal, then the one-variable model built with the original data will perform much better than one built on the scrambled data set, whereas if X is mostly noise, its performance won't be very different from the scrambled case. So you build a one-variable model on the original data, you build a one-variable model on the scrambled version, and you compare performance; you would expect that for features which are informative, model performance deteriorates quite a lot when you scramble the data. You can repeat the scrambling process many times; it's free to do because you don't need new data. For each scrambled version you build a new model and measure its performance, and what you end up with is an ensemble of scores: these are the scores your model produces on noise. You can then compare your original model, the one built with the original data, against this distribution of scores, and if your data has signal, it should look something like this. What this shows is deviance; deviance is one way of measuring model performance. This shows the distribution of deviance for the scrambled, noisy data sets. So this is noise, and the deviance values cluster somewhere around 1390; most models measured on noise give values around 1390. This is on a simulated data set that looks something like this. And this is the deviance for a feature which is actually related to the output: you can see that the deviance for the signal is much lower than all of the scores that the noisy models generate.
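A minimal sketch of this scrambling test, assuming a numeric target, a Gaussian GLM, and deviance as the performance measure (all illustrative choices, not prescribed in the talk; swap in whatever family and score fit your problem):

```python
# Scrambled-Y test: does a single feature carry signal?
import numpy as np
import statsmodels.api as sm

def one_var_deviance(x, y):
    """Deviance of a one-variable GLM relating a single feature to the target."""
    X = sm.add_constant(x)
    return sm.GLM(y, X, family=sm.families.Gaussian()).fit().deviance

def signal_check(x, y, n_scrambles=200, seed=0):
    """Compare the real deviance against deviances obtained on scrambled targets."""
    rng = np.random.default_rng(seed)
    real = one_var_deviance(x, y)
    noise_scores = np.array([
        one_var_deviance(x, rng.permutation(y)) for _ in range(n_scrambles)
    ])
    # Fraction of scrambled fits that do at least as well (lower deviance is better);
    # a small value suggests the feature carries signal.
    p_value = np.mean(noise_scores <= real)
    return real, noise_scores, p_value
```

Running `signal_check` over each candidate feature gives you the kind of "score against a distribution of noise scores" comparison described above.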
The simulated data set is here. N1 is noise; it's not related to Y. And this graph shows the deviance for N1: you can see that its deviance is well within the range of what you get if you just build models on noise. You can extend this: if you've got 100 variables or 100 features, you can compute each of their scores against models built on scrambled data and compare them one by one. But there is a caveat: if you do that, you run into what's called the multiple comparisons problem. Think of how a statistical significance test works: if your significance threshold, your p-value, is 0.05 and you run 100 tests, you expect about 5 false positives. The same phenomenon occurs here. So if you've got 100 features and you set your threshold at 0.05, you will end up with about 5 features which are actually noise but which fool you into thinking they carry signal. To get around this you can use things like F-scores or chi-square tests with the appropriate corrections, but the basic idea holds. Also, this will not work with highly non-linear methods like random forests, which essentially memorize your training data: they will find a relationship even if you fit a feature against noise, or noise against the output, so there is no way to compare deviance or precision or anything else against models built on noise. These graphs were generated with linear models. Bear in mind that we are doing this to separate features which are noisy from features which are signal-rich; once you've identified them, you can go and build a complicated non-linear model with the features that you are confident carry signal. Question? The question was whether we are basically generating a random sample of the data: we scramble the output column, so now it's a new data set, but why scramble, since it's supposed to be random anyway, and since the results come from a linear model, what is the added value compared to doing a statistical test on the linear correlation, or a chi-square test, for the relationship between the two variables? Statistically they are equivalent. But this is empirical, right? You don't depend on a particular test; you don't depend on a t-test or any particular significance test. It comes directly from the data set you're working with: this distribution of noise scores is, in a sense, generated from the actual data on hand. So you don't have to worry about correcting for the assumptions baked into the hundred different significance tests that are available. Is it kind of non-parametric? Yeah. Questions about this part? Anyone else? Another question: can you do this with random forests or that kind of model? No; this comparison is hard to do with non-linear methods like random forests. So which models should you use? Linear models; just GLMs. And bear in mind that the reason you're doing this is just to identify features. Once you've identified the features you want, you're free to use non-linear models or any other models. The second family of methods is Y-sensitive pre-processing. The general idea is that when you pre-process your data, you allow the target variable to influence the predictors.
Within this, I'll illustrate the idea with two specific scenarios. The first scenario is dimensionality reduction using PCA. The other one is what's called impact coding, which is essentially a way to encode categorical variables in your data. So let's look at part one, which is Y-sensitive dimensionality reduction. Is everyone familiar with PCA? Who is not familiar with PCA? Okay. PCA is just a way to reduce the dimensionality of your data set. If you've got a data set with 100 features, you can run PCA on it, and what it does is generate 100 transformed variables, each of which is a linear combination of the original variables. After you've done the PCA transformation, instead of keeping all 100, you can drop a few; you can drop maybe 50 or 75. There are mathematical properties and structure that are preserved by this transformation. Essentially, the transformation looks for directions that explain the maximum amount of variance in your data set. That is why it is useful: you could ask why PCA rather than any random transformation, and the answer is that PCA preserves this particular structure. It finds new directions which are orthogonal to each other and which explain the maximum amount of variance in the original data set. So generally you use PCA to reduce dimensions and decrease noise. The standard way of doing PCA ignores the target variable: you just take your data set, run PCA on it, reduce dimensions, and then go build your models. The idea here is that instead of ignoring Y, your target vector, you allow it to influence your PCA transform, and I'll talk about exactly how you do that. So here's the problem setup. Let's say Y is composed of two variables, YA and YB, plus epsilon, where epsilon is Gaussian noise. You've got signal variables, which I denote X, so X1 through X5 are signal variables, and the noise variables are noise_01, noise_02, and so on. YA is an affine transform of X1, X3, and X5, so all the odd-numbered X, and YB is a transform of the even-numbered X, so X2 and X4 (there's a typo on the slide; that's supposed to be X4). Y is what you observe, and Y in turn is a mixture of YA, YB, and noise: YA mixes the odd ones, X1, X3, and X5, and YB mixes the even ones, X2 and X4. So this is my data set. It's synthetic, and the reason it's synthetic is that it shows the properties of PCA, and of the alteration to PCA, quite nicely. There are five signal variables and 45 noise variables. The noise variables are correlated with each other, and the data is highly unscaled: each variable lives in its own range. Y is between -5 and 5, X2 is between -9 and 10, and the noise variables span ranges like -4 to 4 or -30 to 26; they're all on different scales. Because of the way the data is generated, about 33% of the variance in Y is pure noise, so the best any model can do is explain roughly 100 minus 33 percent of it. Let's look at the ideal case first; the reason I'm showing you this is so you can compare it with what follows. In the ideal case you know which variables are noise, and if you know that, you should prune them. If you do that, if you get rid of all the noise variables and then run PCA, these are the singular values you get.
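Here is a minimal sketch of a data set with this shape. Only the structure follows the description above (five signal variables, 45 correlated noise variables, roughly a third of the variance in Y being irreducible noise); the mixing coefficients, ranges, and correlation pattern are made up for illustration.

```python
# Synthetic data roughly in the spirit of the talk's example, then the
# "ideal case": drop the noise columns and run PCA on the signal variables only.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
n = 1000

# Five signal variables on deliberately different scales.
X = pd.DataFrame({
    f"x{i}": rng.uniform(-scale, scale, n)
    for i, scale in zip(range(1, 6), [4, 9, 5, 7, 3])
})

# y_a mixes the odd-numbered signal variables, y_b the even-numbered ones.
y_a = 2.0 * X["x1"] - 1.5 * X["x3"] + 0.8 * X["x5"]
y_b = -1.2 * X["x2"] + 2.5 * X["x4"]
signal = y_a + y_b
# Noise variance is half the signal variance, so ~1/3 of the variance of y is noise.
y = signal + rng.normal(0.0, np.sqrt(0.5) * signal.std(), n)

# 45 correlated noise variables, unrelated to y.
base = rng.normal(size=(n, 5))
noise = pd.DataFrame(
    base @ rng.normal(size=(5, 45)) + rng.normal(0.0, 1.0, size=(n, 45)),
    columns=[f"noise_{j:02d}" for j in range(1, 46)],
)
data = pd.concat([X, noise], axis=1)   # the full 50-column data set

# Ideal case: PCA on the signal columns only (sklearn centers but does not scale).
pca = PCA().fit(X)
print(pca.singular_values_)   # five singular values
print(pca.components_)        # loadings: how each PC mixes x1..x5
```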
These are the first, second, third, fourth, and fifth singular values. A singular value in PCA corresponds to the proportion of variance that is explained, so what this tells you is that the first and second singular values, which correspond to the first and second principal components, account for this much and then this much of the total variance. Just as a sanity check: we got rid of all the noise variables, so there are only signal variables in these graphs, and because there are only five signal variables, you can have only five singular values. If you keep all five principal components, you're back to square one; it's like using the original, untransformed data set. So these are the singular values in the ideal case, where you've managed to throw out the noise. Now let's look at the principal component loadings. A principal component loading is the set of weights that transforms your original data into a principal component. For example, PC1 here stands for principal component one, and the first principal component is roughly 0.45 times X1, plus 0 times X2, plus some amount of X3, plus 0 times X4, plus some amount of X5. So the first principal component is a linear combination of the X variables, and the same goes for the second, third, fourth, and fifth. This is what PCA does: the loadings tell you how much of your original data makes up each of your transformed axes. What's interesting about this graph is the first and second principal components. The first principal component has non-zero weights attached to the odd-numbered variables, X1, X3, and X5, while the even ones have zero weight. For the second principal component it's the inverse: the even ones have weight and the odd ones are all zero. What that tells you is that components one and two, taken together, represent a good proportion of your original data; between the two of them you have weights for each of the axes. So this is the ideal case. Here is another visualization. The original data is projected into 2D, where the axes are principal component one and principal component two, and the color of the dots corresponds to the range of Y: this is the maximum region of Y, this is the middle region, and this is the lowest region. When you project the data onto the first and second principal components, there is a rough trend you can see: higher values of Y trend toward the upper right corner and lower values toward the lower left corner. So there is a pattern, actually a well-distinguished pattern, between lower and higher values of Y just in this 2D space. What this tells you is that PCA is quite good at uncovering structure in the absence of noise. The challenge is: can we uncover this kind of structure when there is noise in the data? Of course, in real scenarios we never know which variables are noisy, so we don't know which ones to throw out. So let's take a first cut at the PCA analysis. This is the bad way to do PCA: PCA without scaling. PCA as a method is sensitive to the scale of your data, so if you're measuring things in different units, that will mess up your PCA, which is why best practice for PCA is to always center and scale your data.
But let's see what happens if you don't do that. If you don't center and scale your data — remember, we've got 50 variables — these are the singular values of the PCA, and you can see there's no discernible trend. As you include more and more principal components you cover more and more of the data space, which is what you would expect, but ideally you want to see the first few principal components, the first few singular values, accounting for a large proportion of the variance, and you don't see that here. The reason is that noise is masking everything. And if you look at the loading matrix, look at principal component one: most of the weight sits on just three variables, noise one, three, and five, and the same goes for each of the other principal components. Most of the weight is distributed among the noisy variables; it did not manage to pick up any of the signal variables. The reason all of this happens is that we've not scaled our data. So a better version is PCA on scaled X. Let's scale, and by scale I mean center and standardize: subtract the mean from each feature and make sure its variance is one. If you do that, these are the ranges of the scaled variables, just a visualization of the values of each feature. Now if you run PCA, you can see interesting structure: there's a knee in the graph at this point, and the first 20 singular values are much higher than the rest. The loading matrix is also interesting: though the noise variables do have weight, the signal variables have non-zero weight as well. So in this case, if you use the PCA-transformed data set, you will actually capture some of the signal, though you will also capture noise. If you build a linear model with the first 20 principal components, so now you're building a model on the PCA-transformed data set using only 20 features instead of the original 50 you started with, you get this: a model with an R squared of 47.8%. Bear in mind that 33% of the variance is just noise, so you can't do better than about 67%. An R squared of 47.8% is not bad; at least it's not 2%. Now the question is: can we do better? This graph tells you that the 47.8% R squared is actually not that big a deal. Remember the projection onto 2D space: this graph is the projection of Y onto the plane spanned by PC1 and PC2, but for the ideal case, where I played God and got rid of the noise features. There you see the trend, with the reds well separated from the blues. Here, in the more realistic scenario where I can't identify the noise variables and instead take the first 20 principal components and do the same projection onto principal components one and two, the reds and the blues are thoroughly intermixed. So you've lost structure; you cannot recover that kind of separation even if you follow so-called well-principled PCA, with your data properly scaled. That brings us to Y-sensitive PCA, which is PCA that is influenced by the values of your target variable.
What you do is make sure that your features are on the same scale as your output, so that they can be compared in terms of their effect on Y. With regular scaling, each feature, each axis, is unit-scaled. Here you don't do that: instead of unit-scaling your features, you set them up so that they share a common scale with the target output. When you do that, a unit change in X corresponds to a unit change in Y. To be more precise: if Y is roughly m·X + b for a single feature X, the transformation is X' = m·X − mean(m·X). In the original units, a one-unit change in X corresponds to an m-unit change in Y; you're just correcting for that, rescaling so that a unit change in the transformed X corresponds to a unit change in Y. It's a very simple transformation, but by doing it you are allowing the values of the target variable to influence your PCA: the PCA is now going to look for structure not just in the X matrix but, in a certain sense, in the joint X and Y matrix. When you do that transformation, these are the values you get. You can see that most of the noise variables end up squeezed into very small ranges, close to zero, while the ranges of the X variables are much larger. When you run PCA on this Y-transformed, funny-scaled data set, this is the graph of singular values: the first five singular values are much higher than the rest, which should give you a hint that your data intrinsically has only five degrees of freedom. And this is the loadings matrix, which is really interesting, because now essentially every noise variable has a weight of zero and only the signal variables are picked up. Furthermore, the first principal component has weight attached to X1, X3, and X5, and the second principal component has weight on X2 and X4, so between PC1 and PC2 you've actually recovered a good part of your input structure. And if you do the 2D projection again — this graph is the ideal case where I played God and removed the noise variables, and this one is the Y-scaled training data, the plot you get from the Y-sensitive PCA transformation — you see that the reds and the blues are almost as well separated as in the ideal case. So the takeaway message is that if you bring your input data onto the same scale as the target variable you're interested in predicting, PCA manages to find much better structure in your data. A general question that always comes up with PCA is how to decide the number of components. The textbook answer, the one that gets bandied around, is to look for a knee. In this example there is a knee, because it's synthetic data that I constructed to have a sharp drop-off, but often it won't be so obvious, and it's a subjective rule of thumb in any case. To resolve this question you can use the same kind of scrambling or permutation test that we went over a few slides ago, and I'll walk through that right after this sketch of the Y-aware scaling step.
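A minimal sketch of the Y-aware scaling just described (x' = m·x − mean(m·x), with m the slope of a one-variable fit of y on x), followed by PCA. It assumes the `data` frame and target `y` from the earlier synthetic sketch; everything else is an illustrative choice.

```python
# Y-aware scaling followed by PCA.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

def y_aware_scale(data: pd.DataFrame, y) -> pd.DataFrame:
    """For each feature x, fit y ~ m*x + b and return m*x - mean(m*x),
    so a unit change in the scaled feature corresponds to a unit change in y."""
    scaled = {}
    for col in data.columns:
        x = data[col].to_numpy()
        m = np.polyfit(x, y, 1)[0]   # slope of the one-variable fit
        xm = m * x
        scaled[col] = xm - xm.mean()
    return pd.DataFrame(scaled, index=data.index)

scaled = y_aware_scale(data, y)
pca = PCA().fit(scaled)
print(pca.singular_values_)          # look for the dominant components / knee
print(pca.components_[:2])           # noise features should get near-zero weight

# Project onto the first two Y-aware components for a 2D look at the structure.
proj = PCA(n_components=2).fit_transform(scaled)
```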
The idea is this: you scramble Y. You keep X, your input data set, the same, but you scramble Y, and then you see what happens to your PCA output, to your singular values. By doing that you preserve the structure in X, but you disconnect the X matrix from Y; you destroy the relationship of X to Y, obviously, while X itself stays intact. Then you compare the original singular values, without scrambling, to the singular values you get after scrambling. The question you want to ask is: does a component appear more significant in the metric space induced by the real Y than it does for just any random Y? If you do that scrambling over and over and plot the singular values, you get a distribution. This is the distribution for this data set for the first singular value: most of the scrambled singular values fall in this range, and the red line is the actual singular value for the non-scrambled data. You can see it's obviously much higher than the values you get on scrambled data, so you know that the first principal component is, quote-unquote, statistically significant. You can of course do this for all the singular values, and this is what happens when you do. The dashed line is the average value over the simulations: for example, this dashed line for the first principal component is the average of this distribution, the next dashed line is the average of the distribution for the second singular value, and so on. The light blue curve is the 98th percentile. You can see that the first five singular values are well above the 98th percentile, but the rest are not. So again, this gives you quite a strong hint that your underlying data really contains only five independent measurements. That concludes the Y-sensitive PCA part. Continuing with the same theme, another kind of pre-processing you can make Y-sensitive, or target-sensitive, is what you do when you encounter data with a lot of categorical values. The question often comes up: how do you encode a categorical variable with a large number of distinct values? Zip codes are a typical example. An example data set is one you can find on San Francisco's police website, a database of reported crimes. This is a visualization of the locations of crimes for one month, and you can see there are hot spots in the city. What we want to do is predict crime before it occurs. Obviously the location is a source of significant signal, so you want to encode location data in your predictive model; the question is how. If you download one month of data, say July, a typical month has about 8,000 rows but about 4,000 unique zip codes. So the ratio of the number of unique values in the categorical feature to the total number of rows is not good at all: 4,000 unique values and only 8,000 rows. If you just do a one-hot encoding, that would be very difficult for any algorithm to model. If instead you say, let's not use zip codes, let's use block-level data, that ends up overfitting, and if you go to the other extreme and use the district, which is a very broad area, it's not predictive enough.
So how do you handle categorical data with a large number of values? One thing you can do is what's called impact coding. You replace the categorical feature with the output of a submodel, and this submodel gives the probability of crime for a given location, for a given category value. In a mathematical sense it's basically the equivalent of naive Bayes: you look at a location and count the fraction of days on which crime was reported in that location, relative to all other locations. If you encounter a new location at prediction time, which is quite likely, you smooth your model, and what I mean by smoothing is that you impute the global average, the global crime rate. This process is actually equivalent to encoding a Bayesian prior if you were building a Bayesian model. So that's impact coding, or effects coding; again, the idea is that you allow the target variable to influence the encoding of your categorical variable. This is the last part. How are we doing on time? Do you want me to stop? How much time do we have? I guess this part would take about 15 minutes, so we can either add it or leave it at this. The third family of techniques is adding noise, slash, differential privacy. Who has heard of differential privacy here? It turns out that adding noise to your data set can actually help you model your data well. That sounds like a crazy idea, because you're always fighting against noise, but the key is to add noise in a principled way. The ideas here come from the field of differential privacy. Differential privacy is a sub-discipline that deals with how you do data mining on sensitive data sets where the privacy of individuals is very important. Think of healthcare: you want to mine healthcare data, but you do not want to be able to identify any particular individual. So differential privacy deals with methods that make joint guarantees about data mining procedures and privacy. Now, a little bit of context. We spoke about impact coding, right? The most common scenario is this. You've got your original training data: these are the independent variables and this is the outcome. You use this data to learn the categorical encoding, and then you transform your original data, replacing the categorical feature with the output of your submodel. Now you've got this transformed data set, you build a model on it, and you've got a model. At runtime you take your incoming data, apply the same transformation, push the transformed data through the model, and you get predictions. The problem with all of this is the following, and it lies at this step: you're learning the categorical encoding on the same rows that you then use to build your model. This process of learning the encoding is leaking data, because at runtime you won't be able to do this; you'll have to take your old encoding and apply it to new rows. Here, the rows used to learn the encoding are the same rows used to build the model.
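Here is a minimal sketch of that common pipeline: impact coding with smoothing toward the global rate, learned on the full training set. The column names (`zip`, `crime`) and the smoothing weight `k` are illustrative assumptions; note that the usage shown is exactly the naive version whose leakage problem is discussed next.

```python
# Impact coding of a high-cardinality categorical feature.
import pandas as pd

def fit_impact_code(df: pd.DataFrame, cat_col: str, target_col: str, k: float = 20.0):
    """Per-level target rate, shrunk toward the global rate for rare levels."""
    global_rate = df[target_col].mean()
    stats = df.groupby(cat_col)[target_col].agg(["sum", "count"])
    codes = (stats["sum"] + k * global_rate) / (stats["count"] + k)
    return codes.to_dict(), global_rate

def apply_impact_code(series: pd.Series, codes: dict, global_rate: float) -> pd.Series:
    # Unseen levels fall back to the global rate.
    return series.map(codes).fillna(global_rate)

# Usage sketch (hypothetical frames `train` and `test`):
# codes, g = fit_impact_code(train, "zip", "crime")
# train["zip_impact"] = apply_impact_code(train["zip"], codes, g)  # same rows as the model: leaks
# test["zip_impact"]  = apply_impact_code(test["zip"], codes, g)
```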
So that is data leakage, and it leads to overfitting, which leads to an overly optimistic estimate of how well you're actually doing. When you take this model and deploy it, it's not going to perform as well, basically because you reused this data. And this doesn't just apply to impact coding; it applies to any form of preprocessing. For any form of preprocessing, if you learn the preprocessing on a subset of your training data and then use the same data to build the model, you're going to overfit. One solution is to split your training data: you learn the categorical encoding, or any other form of preprocessing, on one part, you reserve the rest to learn the model, you learn the model on those fresh rows, and then you deploy. That solves the original data leakage problem, but now you're inefficient: you're not using all of the data to learn your model. So how do you resolve this tension? This is where differential privacy comes in. Differential privacy gives you a way to do this process without having to split your training data into one subset for preprocessing and a different subset for learning; you can use everything for preprocessing and for learning, provided you use what's called in the literature a differentially private method. So let me explain what differential privacy is, changing tracks a little bit for one slide. Differential privacy is the following. You want to compute a summary statistic about a data set; call that function A. It could be the mean, the median, the 90th percentile, or any other statistical function of the data. So a learner wants to implement a summary statistic A. Now you've got an adversary. The adversary proposes two data sets, S and S', which differ in only a single row; that's important, there's only one row of difference, everything else is identical. The adversary also proposes a query set, an interval Q. Differential privacy is a definition that applies to algorithms or methods: the algorithm A is said to be epsilon-differentially private if the following condition is met. Take the probability that the output of A applied to data set S falls in the interval Q, divide it by the probability that the output of A applied to S' falls in the same interval, and take the log; that log-ratio must be at most epsilon. In symbols: log( P[A(S) ∈ Q] / P[A(S') ∈ Q] ) ≤ epsilon. That's the technical definition of differential privacy. How does that help us in machine learning? To help illustrate the definition, here's an example. Say S is a data set with 100 zeros; it's all zeros. S' is another data set, identical except that it has a single one. The function A we want to compute is the mean. If you don't know about differential privacy, then when the function is applied to S you return the value 0, and when it's applied to S' you return the value 0.01.
If you have an adversary, the adversary can trivially implement a decision boundary midway between 0 and 0.01: if you output 0.01 he knows you're looking at S', and if you output 0 he knows you're looking at S. Think back to what I said about the origins of differential privacy: S and S' could be data sets of medical records. In a practical sense, this matters if, say, you're thinking of participating in a medical survey and the survey asks sensitive questions. You want to know whether, by answering the survey, you will divulge your identity. If the set of people is small enough, it's easy to imagine a scenario where participating truthfully in the survey essentially gives away your identity. That's where this whole field started: differential privacy aims to allow people to participate in these kinds of surveys without divulging their identities. So, to illustrate: without differential privacy, you just give the answer; you compute the mean of S or S' as asked. The adversary can hand you either S or S' and is trying to figure out which one you're operating on, and if you answer truthfully, he will know. Now, what happens if you add noise? Instead of answering truthfully, you take your answer and return a noisy version of it: your answer plus Laplace noise, in this case. Then there is some chance that you fool your adversary. The adversary asks you to compute the mean of S, the all-zeros set; the true mean is 0, but after adding noise your answer might come out at, say, 0.002, and the adversary could think, maybe he's got S'. But even then, most of the distribution of noisy answers for S sits to the left and most of the distribution for S' sits to the right, so the adversary can still guess your data set correctly in most cases, and therefore you need to add more noise. As you add more and more noise, your responses for S and S' become more and more indistinguishable from each other, and at a certain point the adversary's advantage in guessing correctly is bounded in terms of epsilon; roughly speaking, when that condition is met, the algorithm is epsilon-differentially private. So how does this help us in learning models? The intuition is that adding noise in this way helps hide which data set was used for pre-processing versus modeling: your algorithm will find it very difficult to overfit, because it can't learn the idiosyncrasies of the pre-processing data set versus the training data set. That's the basic intuition behind using differential privacy for learning models. So here's a simulation: 10 signal variables, 100 noise variables, and an outcome Y you want to predict. It's a simulation of stepwise regression, which I'll walk through right after this small sketch of the Laplace-noise idea.
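A minimal sketch of answering a mean query with Laplace noise. The epsilon value is illustrative; the noise scale follows the standard Laplace mechanism, where the sensitivity of the mean over 100 records with values in [0, 1] is 1/100.

```python
# Laplace mechanism for a mean query on S and S', which differ in one row.
import numpy as np

rng = np.random.default_rng(0)

S = np.zeros(100)            # 100 zeros
S_prime = np.zeros(100)
S_prime[0] = 1.0             # identical except for one row

epsilon = 0.1
sensitivity = 1.0 / len(S)   # changing one row in [0, 1] moves the mean by at most 1/n

def noisy_mean(data):
    return data.mean() + rng.laplace(scale=sensitivity / epsilon)

# Individual noisy answers for S and S' overlap heavily, so a single answer
# tells the adversary little about which data set was used.
print(noisy_mean(S), noisy_mean(S_prime))
```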
So what happens in the simulation is this: you take, say, feature one and build a model, then you build a second model by adding one of the remaining features, whichever performs best when added to the earlier model, and in this way you sequentially add more and more features depending on their performance on a holdout set. There are three different data sets here. This line corresponds to the training data set, this line is the holdout or test data set, and this is what I call the fresh data set: the training and holdout sets have 1,000 rows each, and the fresh set is 10,000 rows generated from the same data-generating mechanism. The fresh-data curve is the true performance of your model, what would actually happen if you deployed it, but you obviously don't have access to it in a real scenario; you only have access to the training curve and the holdout curve. So what you do is train on the training set, then when you want to add a new feature you test its performance on the holdout set, and when you see that it really improves holdout performance you decide to add it, and you go on. This is horrible, because you are being completely fooled by the process: you're looking at holdout scores, and your holdout scores keep looking good, while in reality you're not doing well at all. The reason is that you've contaminated your holdout set: you're using the output from the holdout set to inform your model, and the more often you do that, the more you peek at it, the more your model adapts to that particular data set. That's exactly what has happened here: the model has adapted itself to the holdout set and therefore cannot generalize to really new values, which is the blue curve. This is a typical scenario when you're building machine learning models: even if you have a holdout set, you tend to peek at it, because you want to tune hyperparameters or try a new pre-processing routine, and the more often you peek, the more you contaminate the holdout set and the greater the risk of overfitting. The basic idea behind using differential privacy here is that instead of peeking at your holdout set directly, you query the holdout set and are returned the answer plus noise. You never look at your holdout set; you only ever look at a noisy version of it. This is formalized in what's called the Thresholdout algorithm. There is a Science magazine article on this by Dwork et al., which you can look at; I'll have the references at the end. The Thresholdout algorithm formalizes exactly how you add the noise. Not all the details are on this slide; for those, look at the paper.
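As a rough sketch of the mechanism the paper formalizes (the exact constants, budget accounting, and noise distributions are in Dwork et al.; this only follows the verbal description given next):

```python
# Thresholdout-style holdout query: if training and holdout scores agree to
# within a (noisy) threshold, answer with a noisy training score; otherwise
# answer with a noisy holdout score. All constants here are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def thresholdout_query(train_score: float, holdout_score: float,
                       threshold: float = 0.02, sigma: float = 0.01) -> float:
    if abs(train_score - holdout_score) < threshold + rng.laplace(scale=sigma):
        return train_score + rng.laplace(scale=sigma)
    return holdout_score + rng.laplace(scale=sigma)
```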
But the basic idea is this: you're learning a model and you've got a training set and a test set, where the training set is used to learn the model and the test set is used to tune hyperparameters, and you're doing something like stepwise regression. You train a model on your training set and then you query the test set for model performance. The Thresholdout algorithm dictates the following: if the model performance on the training set and on the test set are close enough, within a certain threshold, you return the training score plus some noise; if they are not close enough, you return the test score plus some noise. The details of the noise parameters are not on this slide; the aim here is to give you the intuition for why this works. It works because you're no longer peeking at your test set: you only ever have access to a noisy version of it. If you apply the Thresholdout algorithm with the right noise parameters, this is what happens. This is the training error as you add more and more predictors, this is the test set error, and the blue curve is the ground truth of what would happen if you deployed the model. What's interesting about this graph is that the blue and the orange track each other quite well as you add more and more features. Now of course the questions are: how much noise do you add, what distribution do you draw the noise from, and so on. Those are more involved questions. In the literature you mostly add Laplace noise, that's what you'll find in most papers, and the parameters of the Laplace noise depend on the number of rows in your data and the kinds of queries you're making on it. Look up the Science paper, "Preserving Statistical Validity in Adaptive Data Analysis", and also "The Ladder" paper; those are the two canonical papers that sparked the conversation in the machine learning community about differential privacy and machine learning, and then follow their references. So, to summarize, these are methods that are not so commonly found in textbooks. If you want to increase your model performance: check whether your data has signal; allow your target variable to influence data pre-processing and dimensionality reduction; and you can actually add noise to your data set in a principled way, using differential privacy, to reduce the risk of overfitting. That's it. A question from the audience: is there a negative side to this noise? It clearly helps in building the model, but is there a downside? Yes: if you add too much noise, you're back to being inefficient, so the trick is to find the right amount of noise to add; that's where the subtlety lies. A second question, from someone who hasn't worked with statistics in a while: if the model skews off because it has adapted to its original training data, couldn't you incorporate new data as it arrives and keep improving the model along the way? Well, I didn't talk about the online or streaming setting here.
This whole talk is under the assumption that you have finite data. If you have infinite data, you don't really need any of this, because you can learn the underlying distributions directly, and asymptotically your models will be Bayes optimal. The questioner's example was customer data: what should we offer these customers, what would happen to them, and couldn't you learn from the outcomes and incorporate new models all the time? Yes, but this talk deals with a much simpler setting, where you have static data sets and you want to build the best model out of those. If you want to do some kind of online learning, you'd have to adapt all of these techniques to the online setting, which might not be very straightforward, or it might be; I don't know. Okay, so we will host it online; I suppose it's a resource we can make available online. Okay, so before we end, we have the lucky draw.