The following program is brought to you by Caltech.

Welcome back. Last time, we talked about radial basis functions. The functional form of the hypothesis in that model is a superposition of a bunch of Gaussians centered around mu_k. And we had two versions of that model. In one of them, the centers are fewer than the number of data points, which is the most common case, and there we need to come up with the values of the centers mu_k and learn the values of w_k. It turned out to be a very simple algorithm in that case: you use unsupervised learning to get the mu_k as the centers, by clustering the input points without reference to the labels they have. And after you do that, it becomes a very simple linear model where you get the w_k as the parameters using the usual pseudo-inverse (sketched in code below). In the other case, where we used as many centers as there are data points and the centers were the data points themselves, there was obviously no first step, and in that case, in order to get the w_k, we actually used the real inverse rather than the pseudo-inverse.

Radial basis functions are very popular functions to use in machine learning, but one of the most important features about them is how they relate to so many aspects of machine learning. So I'd like to go through this, because it's actually very instructive, and it puts together some of the notions we have had. So let me magnify this a bit. Radial basis functions have this as the building block, the Gaussian. And they are related to nearest neighbor. In the case of nearest neighbor, you have a data point, one of your points in the training set, and it influences the region around it. So everything in the region around it in the input space inherits the label of that point, until you get to a point which is closer to another data point, and then you switch to that point. So you can think of RBF as a soft version of that. The point affects the points around it, but it's not black and white. It's not full effect and zero effect. It's a gradually diminishing effect. It's also related to neural networks, thinking of this as the activation in the hidden layer, as we saw last time; the activation for neural networks in the hidden layer was a sigmoid. And the main conceptual difference between the two in this case is that this is local. It takes care of one region of the space at a time, whereas the sigmoid is global. It affects points regardless of the value of the signal, and you get the effect of a function by taking differences between these different sigmoids. Then we had the relationship to SVM, which is very easy, because in the case of SVM we had an outright RBF kernel. So there was simply a very easy way to compare them, because they use the same kernel, except that there were many interesting differences. For example, when we used the RBF model, we clustered the points; we determined the centers according to unsupervised learning criteria. And in the case of SVM, the centers, if you are going to call them that, happen to be the support vectors, where the output is very much consulted in deciding what those support vectors are. And the support vectors happen to be around the separating boundary, whereas the centers here happen to be all over the input space, in order to represent different clusters of the inputs. The two remaining relations, as far as RBFs are concerned, are regularization and unsupervised learning.
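Before turning to those two relations, here is a minimal code sketch of the two-step procedure just recapped: unsupervised clustering to pick the centers, then a plain pseudo-inverse for the weights. This is an illustration, not the lecture's code; the helper names, the toy data set, the number of centers K = 9, and the width parameter gamma are all assumptions made for the example.

```python
import numpy as np

def lloyd_centers(X, K, iters=50, seed=0):
    """Plain Lloyd's algorithm: cluster the inputs X without using any labels."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), K, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest center, then recompute the centers
        d = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        nearest = d.argmin(axis=1)
        for k in range(K):
            if np.any(nearest == k):
                mu[k] = X[nearest == k].mean(axis=0)
    return mu

def rbf_features(X, mu, gamma):
    """Phi[n, k] = exp(-gamma * ||x_n - mu_k||^2), plus a bias column."""
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    return np.hstack([np.ones((len(X), 1)), np.exp(-gamma * d2)])

# toy data: the label depends on the distance from the origin
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(100, 2))
y = np.sign(0.6 - np.linalg.norm(X, axis=1))

mu = lloyd_centers(X, K=9)        # step 1: unsupervised, the labels y are never used
Phi = rbf_features(X, mu, gamma=2.0)
w = np.linalg.pinv(Phi) @ y       # step 2: linear model, weights via the pseudo-inverse
print("in-sample error:", np.mean(np.sign(Phi @ w) != y))
```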
Unsupervised learning is easy, because that is the utility we used in order to cluster the points and find the centers. So you look at the points, and then you try to find a representative center for them, such that when you put a radial basis function around that center, it captures the contribution of those points and then more or less dies out, or at least is not as effective, as you go far away. And this is another center that does the same. The interesting aspect was regularization, because, at face value, it's a completely different concept. RBF is a model; regularization is a method that we apply on top of any model. But it turns out that RBFs were derived in the first place in function approximation, using just a consideration of regularization. So you have a bunch of points. You want to interpolate and extrapolate them, and you don't want the curve to be too wiggly. So you capture a smoothness criterion using a function of the derivatives, and then, when you solve, you find that the interpolation is done by Gaussians, which gives you the RBFs. So this is what this model does.

Today, we're going to switch gears completely, and in a very pleasant way. If you think about it, we have gone through lots of math and lots of algorithms and lots of homeworks and all of that. And I think we have paid our dues, and we have earned the ability to do some philosophy, if you will. So we are going to look at learning principles without very strong appeal to math, because we already have a very strong mathematical foundation to stand on. And we will try to understand the concepts and relate these concepts as they appear in machine learning, because they also appear in other fields, in science in general. They are fascinating concepts in their own right, and when we put them in the context of machine learning, they assume a real meaning and a real understanding that will help us understand the principles in general. The three principles, by their usual labels, are Occam's razor, sampling bias, and data snooping. You may be familiar with some of them, and we have already alluded to data snooping in one of the lectures. And if you look at them, Occam's razor relates to the model, while the other two relate to the data: one of them has to do with collecting the data, and the other has to do with handling the data. And we'll take them one at a time and see what they are about, how they apply to machine learning, and so on.

So let's start with Occam's razor. There is a recurring theme in machine learning, and in science, and in life in general, that less is more, simpler is better, and so on. And there are so many manifestations of that. I just chose one of the most famous quotes. I put "quote" in quotation marks, because it's not really a quote; he didn't say it in so many words, but at least that's what people keep quoting Einstein as saying. And it says the following about an explanation of the data. You are running an experiment, you collect the data, and you want to make an explanation of the data. The explanation could be E equals mc squared, or something else. So you are trying to find an explanation of the data, and here is a condition on what the explanation should be like: it should be as simple as possible, but no simpler. Very wise words. As simple as possible: that's the Occam's razor part. But no simpler, because then you are violating the data; you have to be able to explain the data. So this is the rule. And that quote, in one manifestation or another, has recurred throughout history.
Isaac Newton has something that is similar and a bunch of them, but I'm going to quote the one that sort of survived the test of time, which is Occam's razor. So let's first explain what the razor is. Well, a razor is this, and you have to write Occam on it in order to become Occam's razor. And the idea here is symbolic. So the notion of the razor is the following. You have an explanation of the data, and you have your razor. So what you do, you keep trimming the explanation to the bare minimum that is still consistent with the data. And when you arrive at that, then you have the best possible explanation. And it's attributed to William of Occam in the 14th century, so it goes back quite a bit. So what we would like to do, we'd like to state the principle of Occam's razor, and then zoom in in order to make it concrete, rather than just a nice thing to have. We'd like to really understand what is going on. So let's look at the statement. The statement in English, not in mathematics, says that the simplest model that fits the data is also the most plausible. And we put it in a box, because it's important. So first thing to realize about this statement is that it is neither precise nor self-evident. It's not precise because I really don't know what simplest means. We need to pin that down. And well, I know that the simplest model is nice, but I'm saying something more than just nice. I'm saying it's most plausible. It is the most likely to be true for explaining the data. That is a statement. And you actually need to argue why this is true. It's not wishful thinking that the simpler and things will be fine. There is something said here. So there are two questions to answer in order to make this concrete. So the two questions are, the first one is, what does it mean for a model to be simple? It turns out to be a complex question, but we will see that it's actually manageable in very concrete terms. The second question is, how do we know that this is the case? How do we know that simpler is better, in terms of performance? So we'll take one question at a time and address it. First question, simple means exactly what? Now you look at the literature, and complexity is all over the place, and it's a very appealing concept with a big variety of definitions. But the definitions basically belong to two categories. When you measure the complexity, there are basically two types of measures of complexity. And my goal here is to be able to convince you that they actually are talking about more or less the same thing, in spite of being inherently different conceptually. The first one is a complexity of an object. In our case, a hypothesis H, or the final hypothesis G, that is one object, and we can say that this is a complex hypothesis, or a simple hypothesis. The other set of definitions have to do with the complexity of a set of objects. In our case, the hypothesis set, we say that this is a complex hypothesis set, complex model, et cetera, and so on. And we did have concretely a measure of complexity of small H, and a measure of complexity of script H, big H. And if you remember, we actually used the same symbol for them. It was capital omega. Capital omega here was the penalty for model complexity when we did the VC analysis. And capital omega here was the regularization term. This is the one we add in the augmented error in order to capture the complexity of what we end up with. So we already have a feel that there is some kind of correspondence. 
And if you look at the different definitions out there, there are many definitions of the complexity of an object. And I'm going to give you two, from different worlds. One of them is MDL, which stands for Minimum Description Length. And the other one, which is simple, is the order of a polynomial. Let me take the minimum description length first. The idea is that I give you an object, and you try to specify the object, and you try to specify it with as few bits as possible. The fewer the bits you can get away with, the simpler the object in your mind. So the measure of complexity here is how few bits I can get away with in specifying that object. And let's take an example in order to be able to relate to that. Let's say I'm looking at integers that happen to be a million bits long, huge numbers, any numbers. Now I'm trying to find the complexity of individual numbers of that length. They will have different complexities. So let me give you one number, which is, let's say, 2 to the million minus 1. Now 2 to the million minus 1, written out in binary, is 111111..., a million ones. In spite of the fact that this is a million bits long, it is a simple object, because you were able to describe it as 2 to the million minus 1. That is not a very long description. And therefore, because you managed to get a short description, the object is simple in your mind. This is very much related to Kolmogorov complexity. The only difference between Kolmogorov complexity and minimum description length is that minimum description length is more friendly; it doesn't depend on computability and other issues. But this is the notion. And you can see that when we describe the complexity of an object this way, that complexity is an intrinsic property of the object. The order of a polynomial is simpler to understand. I tell you there is a 17th-order polynomial versus a 100th-order polynomial, and you already can see that the object is more complex when you have a higher order. And indeed, this was our definition of the complexity of the target, if you recall, when we were running the experiments on deterministic noise. In those cases, we needed to generate target functions of different complexity, and the way we did it, we just increased the order of the polynomial as our measure of the complexity of that object.

Now we come to the complexity of a class of objects. Well, there are notions running around that actually define that, and I'm going to quote two of them, very famous. The entropy is one, and the one we are most familiar with is the VC dimension. Now these apply to a set of objects. For example, the entropy: you run an experiment, you consider all possible outcomes of the experiment and the probabilities that go with them, and you find one collective function that captures the probabilities, the sum of p log 1 over p. And that becomes your entropy. And that describes the disorder, the complexity, whatever you want to call it, of the class of objects, each outcome being one object. In the case of the VC dimension, it applies directly to the notion we are most familiar with. It applies to a hypothesis set. And it looks at the hypothesis set as a whole, and produces one number that describes the diversity of that hypothesis set. And that diversity, in that case, we measure as the complexity. So if you look at one object from that set, and you look at this measure of complexity, that measure of complexity is extrinsic with respect to that object.
It depends on what other guys belong to the same category. That's how I measure the complexity of it. Whereas in the first one, I didn't care about being a member of anything; I just looked at that object, and tried to find an intrinsic property of that object that captures the complexity. So these are the two categories you will find in the literature. Now when we think of simple, as far as Occam's razor and the different quotes are concerned, we are thinking of a single object. I tell you, E equals mc squared, or I look at the board, pv equals nRT. And that is a simple statement. You don't look at what other alternatives were there to explain the data. You just look at that object intrinsically, and that is what you think of as the measure of complexity. When you do the math in order to prove Occam's razor in one version or another, the complexity you are using is actually the complexity of a set of objects. And we have seen that already. We looked at the VC dimension, for example, in order to prove something of an Occam nature in this course already. And that captures the complexity of a set of objects. So this is a little bit worrying, because the intuitive concept is one thing, and the mathematical proofs deal with another. But the good news is that the complexity of an object and the complexity of a set of objects, as we described in this slide, are very much related, almost identical. And here is the link between them: counting. Could it be simpler?

So here is the idea. Let's say we are using the minimum description length, which is very popular and versatile. So it takes small l bits to specify a particular object h. I'm taking the objects here to be h, because in machine learning the objects are hypotheses. So I use that. Now the measure of complexity in this case is that the complexity of this fellow is l bits, because that is my definition. Now, this implies something. This implies that if I look at all the guys that are similar to this object in terms of complexity, they also happen to have l bits' worth of minimum description. How many of them are there? Well, 2 to the l, right? And now you can look at the set of all similar objects, and you call it capital H. So an object described by l bits is one of 2 to the l such objects, and you can take that 2 to the l as a description of the complexity of the set. So now we are establishing something in our mind. Something is complex in its own right when it's one of many. Something is simple in its own right when it's one of few. That is the link that makes us able to use this side for the proofs, and make a claim on the other side. It is not an exact correspondence, but it is an overwhelmingly valid correspondence. Now these are bits, so the counting is exact; how about real-valued parameters? Let's look at our 17th-order polynomial. You can look at a 17th-order polynomial, and you can see that because it's 17th order, it goes up and down and up and down, and that looks complex. But also, because it's a 17th-order polynomial, it's one of many, many, in the realm of infinity in this case, because having 17 parameters to choose makes me able to choose a whole bunch of guys that belong to the same category. So the class of 17th-order polynomials is big, and therefore it's not only that the individual is complex; the set is also complex. There are exceptions to this rule, and one notable exception was deliberate on our part.
And we wanted something that looks complex, so that it does our job of fitting and whatnot, but is one of few. And therefore we're not going to pay the full price for it being complex, and that was our good old friend, the SVM. Remember this fellow? This looks complex all right, but it actually is not really complex, because it's defined by only very few support vectors, and therefore, in spite of the fact that it looks complex, it's really one of few, and that is what we achieve with support vector machines. Now, let us take with us the idea that we are going to treat the complexity of an object as the same as the complexity of the set of objects that the object naturally belongs to, and we will see some ramifications.

So now I'm going to give you the first puzzle of the lecture. There are five puzzles in this lecture, so you need to pay attention, and each puzzle makes a point; the first one has to do with this complexity. So let's look at the puzzle. The puzzle has to do with a football oracle, someone who can predict football games perfectly. You watch Monday Night Football, you want to know the result, and something happens Monday morning. You get a letter in the mail. You open the letter: hi, today the home team will win, or the home team will lose. You don't make much of it. It's not a big deal. You watch the game, and it's a good call. Interesting, but it's 50 percent; lucky. Next Monday, another letter, another prediction. And the funny thing is that he predicted either that the home team will win or that it won't, and it was very long odds; everybody thought it would go the other way. And at the end of the game, the guy was right. And the guy was right for five weeks in a row. Now you are really very curious, and you are eagerly waiting, on the morning of the sixth Monday, to see whether the letter arrives. He has a perfect record, and now comes the letter. The letter says: you want more predictions? Pay me 50 bucks. Very simple question. Should you pay?

The question is easily answered nowadays, because the scams are so many that I just don't take anything like this at face value; there must be something to it. But I really want to pin down what it is, because that is the message we are carrying away. So the idea here is that no, you shouldn't pay, and the guy is really not predicting anything, and the reason for that is the following. He's not sending letters to you only. He's sending letters to 32 people. In the first game, to half of them he said that the home team will lose, and to the other half he said the home team will win. Now, because he did that, he's sure that some of the guys will get the correct answer. So the game is played, and the home team loses. So in the second week, he goes for the guys he got right, and tells half of them that the home team will lose, and the other half that the home team will win. Now, he had plans to send the other guys something similar as well, except that it's hopeless now, because he was already wrong with them, so they're not going to pay him the 50 bucks. So just for the record, this is what would have been sent. There are no letters sent here, but he would have gone 0, 1, 0, 1. And he waits for the game, and out comes the result: the home team won. So you can see who he is going to send letters to now. And the other guys are lost; this is what would have been sent to them, but that's OK. And he waits, and what happened this time? The home team lost. And therefore, here's your next letter. The home team won; here's your next letter. Only two people are surviving from this thing.
And here is the result: the home team won. Now, at that point, the guy sent how many letters? 32 plus 16 plus 8 plus 4 plus 2, plus this final letter. So about 64; 63, to be exact. The postage on that, writing the letters, he probably spent 30 bucks on it. And he's charging you, the lucky guy out of the 32, 50 bucks. That's a money-making proposition. Very nice, and it's understood, and illegal, by the way. But the interesting thing here is to understand why this is related to what we've just talked about. You thought the prediction ability was great, because you only saw your letters. There is one hypothesis, and it got it right perfectly. The problem is that, actually, the hypothesis set is very complex. And therefore, the prediction value is meaningless. You just didn't know. You didn't see the hypothesis set.

So now we understand what the complexity of an object is. Now we go to the second question, which is: why is simpler better? So the first thing to understand is that we are not saying that simpler is more elegant. Well, simpler is more elegant, but this is not the statement of Occam's razor. Occam's razor is stating that simpler will have better out-of-sample performance. That's a concrete statement. In all honesty, if Occam said that you take the more complex guy and it will give you better out-of-sample error, I would take the more complex one, thank you. I am after performance. I'm not after elegance here. It's nice that the elegant guy happens also to be better, but we need to establish that it is actually better. And there is a basic argument. It manifests itself in many ways, and we have already run one in this course during the theory: you put in some assumptions, and there is a formal proof, under idealized conditions, of the following. Instead of going through any of the formal proofs, of which there are quite a variety, I'm extracting the crux of the proof, the point being made, and I'm going to relate it to the proof that we ourselves ran. So here are the steps. There are fewer simple hypotheses than complex ones. That is what we established from the definition of complexity. And in our case, that was captured by the growth function. You have probably forgotten what this is long ago. This was taking N points, and finding what your hypothesis set can generate in terms of different patterns on those N points, which we call dichotomies. So if it can generate everything, like the postal guy, then it's a huge hypothesis set. If it can generate few of them, then it's a simple hypothesis set, and it's measured by that growth function, and that resulted in the VC dimension. Remember all of that. So now, fine, fewer simple hypotheses than complex ones. Then what? The next thing is, because there are fewer of them, a simple hypothesis set is less likely to fit a given data set. That is, you have N points, and you're going to generate labels; let's say you generate them at random. And you ask yourself, what are the chances that my hypothesis set will fit? Well, if it has few of those dichotomies, obviously that goes down. And the probability, if you take the labels uniformly, is simply bounded by the growth function divided by 2 to the N (written out explicitly below). If my growth function is polynomial, then very quickly the probability of fitting a given data set is very small. Fine, I can buy that. So now, that's nice, but you want to convince me that simpler is better, and here you told me that I cannot fit. So what is the point? The punchline in all of these proofs is that if something is less likely, then when it does happen, it's more significant. And there are many manifestations of this.
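To write that fit probability down explicitly (a reconstruction in the course's notation, not a formula taken from the slide): if the N labels are generated uniformly at random, then

\[
\Pr\big[\,\text{some } h \in \mathcal{H} \text{ fits all } N \text{ labels perfectly}\,\big] \;\le\; \frac{m_{\mathcal{H}}(N)}{2^{N}} .
\]

For the postal scam with N = 5 games, the recipient's view is that the growth function is 1, so the bound is 1/32 and a perfect record looks very significant; the scammer's view is that the growth function is 2^5, so the bound is 1, the perfect record is guaranteed to happen to someone, and it means nothing.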
Even when you define the entropy that I alluded to, a probability of an event is p. What is the information associated with that particular point? The smaller the probability, the bigger the information, the bigger the surprise when it happens. And indeed, you define the term as being log 1 over p. So p is very small, tons of bits of information. Something is, half the time will happen, half the time will not happen, it's just one bit. It's not a big deal. And we, looking back at the postal scam, the only difference between someone believing in the scam and someone having the big picture is the fact that the growth function from your point of view when you receive the letters was one. You thought you were the only person. He's one hypothesis, and you got it right. And you gave a lot of value for that, because this is unlikely to happen. On the other hand, the reality of it is that the growth function is actually 2 to the n. And this is certain to happen. So when it happens, it's meaningless. Let's look at a scientific experiment, where a fit is meaningless. So you are running an experiment, or you ask people to run an experiment to establish whether conductivity of a particular metal is linear in the temperature. I can design an experiment for that. So you go and you ask two scientists to conduct the experiments. And they go and they come back with the following results. Here is the first scientist. Took the metal, but they had a dinner appointment. So they were in a hurry. So they got two points and drew the line and gave you this. The second guy had a supper appointment, so had more time to do it. So they did it three times, and then the line. I have a very specific question, which is, what evidence do they provide for the hypothesis that conductivity is indeed linear in the temperature? What is clear without thinking too much is that this guy provided more evidence than this guy. The interesting thing is to realize that this guy provided nil, none, nada. Why is that? Because obviously, two points can always be connected by a line. So the notion that goes with this is called falsifiability. If your data has no chance of falsifying your assertion, then by the same token, it does not provide any evidence for that assertion. You have to have a chance of falsifying your assertion in order to be able to draw the evidence. This is called the axiom of non-falsifiability, and in some sense, it's equivalent to the arguments we have done so far. And in our terms, the linear model is just way too complex for the size of the data set, which is two, to be able to generalize at all. And therefore, there is no evidence here. So in this case, this guy could have been falsified if the red point came here. Therefore, he actually provides an evidence. This is the point. This guy could not have been falsified. So now we go to the next notion, which is sampling bias. A very interesting notion, and it's tricky. But by the way, if you look at all of these principles, it's not like they're just concepts and nice and relate to other fields. They also provide you with red flags when you are doing machine learning. For example, when you use Occam's razor, what does it mean? It means that beware of fitting the data with complex models, because it looks great in sample, and you are very encouraged. And when you go out of sample, you know what happens. You know all too well by the theory we have done. 
Similarly, when we talk about sampling bias and, later, data snooping, there are traps that we need to avoid when we practice machine learning. So let's look at sampling bias. And we start with a puzzle. So here's the puzzle. It has to do with the presidential election. Not this one, but the one in 1948. This was the first presidential election after World War II, which was a big deal. And the two people who ran were Truman, who was the sitting president, and Dewey, who ran against him. And it was very close in terms of the opinion polls and whatnot, and it was not clear who was going to win. So one newspaper ran a phone poll, and what they did is ask people how they actually voted. So this is not before the election, asking what do you think; this is the night of the election, after the polls closed. They actually called people, picked at random, at their homes, and asked: who did you vote for? Black and white, Dewey or Truman, et cetera. They collected the answers, applied some statistical analysis, Hoeffding or some other inequality, and came to the conclusion that Dewey had won decisively. Decisively doesn't mean he won by 60 percent. Decisively means that he won above the error bar; the probability that the opposite is true is diminishingly small. And the result was so obvious that they decided to be the first to break the news, and they printed their newspaper declaring that Dewey had won. Great. So what happens when someone wins an election? They have a victory rally. So let's look at the victory rally. One problem: it's the victory rally of Truman. And you can see the big smile on the guy's face.

So what happened? Well, polls are polls, and there is always a probability of this and that. No, that's not the issue here, and that's the key: don't blame delta for it. What was delta again? I mean, we've been doing techniques for a while; I forgot all about the theory. So let me remind you what delta was. We were talking about the discrepancy between in sample, the poll, and out of sample, the general population, the actual vote, and we were asking ourselves: what is the probability that this discrepancy will be bigger than some amount, such that the result is flipped? You thought it was Dewey winning, and it turned out to be Truman. And that probability turned out to be less than or equal to delta, and delta is expressed in terms of epsilon, capital N, and whatnot (written out below for reference). So in principle, it is possible, although not very probable, that the newspaper was just incredibly unlucky. Now the statement here is very interesting: no, the newspaper was not unlucky. If they did the poll again, and again, and again, with 10 times the sample, or 100 times the sample, they would get exactly the same thing. So what is the problem? The problem is the following. There is a bias in the poll they conducted, and it is because of a rather laughable reason. In 1948, phones were expensive. That means that households that had phones tended to be richer. And richer people, at that time, favored Dewey more than the general population did. So there was a sampling bias. It was simply the case that the population they were sampling from actually favored Dewey. The sample was very reflective of that mini general population; the problem is that that mini general population is not the overall general population. And that brings us to the statement of the sampling bias principle. It says that if the data is sampled in a biased way, then learning will produce a similarly biased outcome. Learning is not an oracle, not like the football oracle.
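For reference, the bound being recalled here is the simple Hoeffding inequality from the early lectures, with E_in the poll frequency and E_out the frequency in the general population (this is the single-hypothesis form, written in the course's usual notation):

\[
\Pr\big[\,|E_{\text{in}} - E_{\text{out}}| > \epsilon\,\big] \;\le\; 2\,e^{-2\epsilon^{2} N} \;=\; \delta .
\]

The catch, as the next point makes clear, is that this bound assumes the sample is drawn from the same distribution that the claim is about; a biased sample violates that assumption no matter how large N is.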
Learning sees the world through the data you give it. I'm a learning algorithm; here's the data. You give me skewed data, I'm going to give you a skewed hypothesis. I'm doing my job: I'm trying to fit the data. So this is always the case, and then you realize that there is always a problem of making sure that the data is actually representative of what you want. So again, we put this in a box. That's the second principle, so it's important.

And let's look at a practical example in learning. In financial forecasting, people use machine learning a lot. And sometimes when you look at the markets, the markets are really completely crazy. A rumor comes out, and the market goes this way, et cetera. And you're a technical person; you're trying to find an intrinsic pattern in the time series. So you decide, OK, I'm going to use the normal conditions of the market. So I'm going to take periods where the market was normal, and then there is actually a pattern, that when people buy, buy, buy, and sell, sell, sell, this happens, or whatever you're going to discover using your linear regression or other learning algorithm. And you do this, and then you deploy it. And when you test it, well, you test it in the real market, and you realize that now there is a sampling bias. In spite of the fact that you are very happy in sample, you actually forgot a part of the market. And you don't know whether that part will be terrible for you, great for you, or neutral for you. You just don't know. That's what sampling bias does. The newspaper could have done this poll and, by sheer luck, the general population thinks the same of Truman and Dewey as the small sample they talked to, in which case the result would have come out right, and they would never have discovered that they made a mistake. So sampling bias makes you vulnerable, at the mercy of the part that you didn't touch. And in this case, you didn't touch the market in certain conditions. And if those conditions do happen, all bets are off.

OK. So one way to deal with sampling bias is matching the distributions. It's a very interesting technique, and it's actually applied in practice; I'm going to mention that. So what is the idea? The idea is that you have a distribution on the input space in your mind. And there was one assumption in Hoeffding and the VC inequality and all of that. They didn't make too many assumptions, but one assumption they certainly made is that you pick the points for training from the same distribution you pick them from for testing. That was the only thing that they required. So when you have sampling bias, that is violated. And therefore you say: OK, I don't have the same distribution. I have data picked from some distribution, and the people I'm going to deliver the hypothesis to, the customers, are going to test it under other conditions. What do I do? So what you do, you try to match the distributions. Not that you literally take the distributions and match them; you do something that will effectively make them match. And let's say that this is the training distribution, and this other distribution is off a little bit. This is a probability density function. Both of them are Gaussian; one of them is shifted and has a different sigma (a small code sketch of the reweighting idea follows below).
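Here is a minimal sketch of what "effectively making them match" can look like, under the unrealistic assumption that both densities are known one-dimensional Gaussians. The importance weights p_test(x) / p_train(x), the resampling option, and the effective-sample-size computation are standard devices used here for illustration, not something taken from the lecture slides.

```python
import numpy as np

def gaussian_pdf(x, mean, sigma):
    return np.exp(-0.5 * ((x - mean) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
x_train = rng.normal(loc=0.0, scale=1.0, size=100)   # drawn from the training density

# weight = test density / training density, then normalize
w = gaussian_pdf(x_train, 0.5, 1.2) / gaussian_pdf(x_train, 0.0, 1.0)
w /= w.sum()

# Option 1: pass the weights w to a learning algorithm that accepts example weights.
# Option 2: resample the training set according to w:
x_matched = rng.choice(x_train, size=len(x_train), replace=True, p=w)

# The price paid: the reweighted sample behaves like a smaller i.i.d. sample.
n_effective = 1.0 / np.sum(w ** 2)
print("effective sample size:", round(n_effective, 1))
```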
So what you do, if you have access to those, if someone tells you what the distributions are and then gives you a sample, is the following: there is a way, by either giving different weights to the training data or resampling the training data, to get another set which behaves as if it was pulled from the other distribution. It's a fairly simple method. It is very seldom that you actually have explicit knowledge of the probability distributions, so it's not that useful in practice in this form, but in principle, you can see that it can be done. And the price you pay for it is that you had 100 examples, and when you are done with this weighting and resampling, or whatever method you use, the effective size is now 90. So you lose a little bit in terms of the independence of the points, and therefore you effectively get a smaller sample because of it. But at least you deal with the sampling bias that you wanted to deal with. Now, this method works. And even if you don't know the distributions, there are ways to try to infer them, and the method still works. But it doesn't work if there is a region in the input space where the probability is zero for training. Nothing will be sampled from that part, but you are going to test on it; there is a probability of getting a point there. Very much like the guys without a phone, who happen to have zero probability in the sample, but don't have zero probability in the general population. And in that case, there is nothing that can be done in terms of matching, because obviously you don't know what happened in that part. On the other hand, in many other cases, there is a simple procedure which is actually very useful in practice. Look at, for example, the Netflix competition. One of the things you realize is: I have the data set, it's a huge data set, 100 million points, and then they are going to test your hypothesis on the final guys, the final ratings, which is a much smaller set. And the interesting aspect about it is that if you look at the distribution of the general ratings, the 100 million, it really is different from the distribution of those final guys. Therefore, the question came up: can I do something during the training such that I make the 100 million look as if they were pulled from the distribution of the last guys? Very interesting question, and it has a very concrete answer. And the 100 million effectively become 10 million; not that you are throwing away points, but you are weighting them such that, when you are done, they look like a smaller set. But then you are actually matched to that distribution, and you can get a dividend in performance. So there is a cure for sampling bias in certain cases, and there is no cure in other cases, in which all you can do is admit that you don't know how your system will perform in the parts that were not sampled. That would be fatal if you are doing a presidential poll, but may not be as fatal when you are doing machine learning, because all you are going to do is warn against using the system within that particular sub-domain.

Third puzzle. Try to detect the sampling bias here. Credit approval. Oh, we have seen that before. That's a running example in the course. So let me remind you what that was. OK, so the bank wants to approve credit automatically. It goes to the historical records of customers who applied before and were given credit cards.
So you have the benefit of, let's say, three or four years' worth of credit behavior, and you look back at their inputs, and the inputs in those cases were simply the information they provided at the time they applied for credit, because this is the information that will be available from a new customer. And you get something like that; this is the application. You also have the output, which is simply: you go back and see whatever the credit behavior was, and you ask yourself, did they make money for me? Because it's not only creditworthiness, that you are a reliable person. It's also that some people who are sort of flirting with disaster are very profitable for the bank, because they max out, and they pay these ridiculous percentages, so the bank makes a lot of money off them. As long as they don't default; once they default, it's a problem. So there is a question of simply: did you make a profit or not? That's the question. And I'm going to approve future customers if I expect that they will make a profit for me. That's the deal.

Where is the sampling bias? We have probably alluded to it in one form or another. The problem is that you're using historical data of customers you approved, because these are the only ones you actually have credit behavior on. So the guys who applied and were rejected are not part of this sample. And when you are done, you're going to have a system that applies to a new applicant. You don't know a priori whether that applicant would have been approved or not according to your old criteria, so they could belong to the population that was never part of your training sample. Now, this is one case where the sampling bias is not that terrible in terms of its effect, even though it is exactly that in terms of characterizing what is going on. You have a part of the population, and they have zero probability in terms of training, and non-zero probability in terms of testing. It's good old-fashioned sampling bias. But the point is that banks tend to be a bit aggressive in providing credit, because, as I mentioned, the borderline guys are very profitable. So you don't want to just be conservative and cut them off, because you are going to be losing revenue. Because of this, the boundary that you're talking about is pretty much represented by the guys you already accepted; you already made mistakes in what you accepted. So when you get that boundary, the chances are the guys you missed out on will be deep on one side. You've got all the support vectors, if you want, so the interior points don't matter. They matter a little bit, but actually that system, with the sampling bias, does pretty well on future applicants. By the way, if you reject someone, how do you know whether it was good that you rejected them? They apply somewhere else, and they make the other bank lose money, so you realize that your decision was good. So you can't verify, unless you have a consortium of banks, whether the sampling bias here has an impact or doesn't have an impact.

Final topic: data snooping, the sweetest of all. Well, it's the sweetest because it is so tricky, and it manifests itself in so many ways. Let me first state the principle. The principle says: if a data set has affected any step of the learning process, then the ability of the same data set to assess the outcome has been compromised. Very simply stated. The principle doesn't forbid you from doing anything; you can do whatever you want.
Just realize that if you use a particular data set, whether it's the whole thing or a subset or whatever, and you use it to navigate, to say OK, I'm going to do this, I'm going to choose this model, I'm going to choose this lambda, I'm going to reject this, whatever it is, you made a decision. Then, when you have an outcome from the learning process, and you use the same data set that affected that choice, the ability to fairly assess the performance of the outcome has been compromised by the fact that the outcome was chosen according to the data set. I think this is completely understood by us, having gone through the course. We put it in a box, and then we make the statement that this is the most common trap for practitioners. By and large, I've dealt with Wall Street firms quite a bit in my career, and there are lots of people there who are using machine learning, and it is rather incredible how they manage to data snoop. And there is a good reason for it, because when you data snoop, you end up with better performance, you think. Because, that's not really snooping, is it? I just looked at the data, and I chose a better model. The other guy didn't look at the data, and they are struggling with the model, and they are not getting the same in-sample error, and I'm ahead. It looks very tempting to do. And it's not just looking at the data; the problem is that there are many ways to fall into the trap, and they are all happy ways. So if you think of it as landmines, they are actually happy landmines. You very cheerfully step on the mine, because you think you are doing well. So you need to be very careful, and because it has different manifestations, what I'm going to do now is go through examples of data snooping, some of which we have seen before and some we haven't. And then you will get the idea: what should I avoid, and what kind of discipline or compensation should I have in order not to suffer from the consequences of data snooping?

So the first way of data snooping we have seen before is looking at the data. I'm borrowing something from our experience. Remember the nonlinear transform? So you have a data set like this. And let's say you didn't even look at it, and you decided: I am going to use a second-order transform. So you take this as the transform; you take the full second order. You apply it, and you look at the outcome. And you look: OK, this is good, I managed to get zero in-sample error. What is the price I'm paying for generalization? 1, 2, 3, 4, 5, 6. That's an estimate for the VC dimension. So that's the compromise between these 6 and however many points you have, et cetera. So you realize: OK, well, I fit the data well, but I don't like the fact that it's 6. I don't have too many points, so my handle on generalization is not good. So let me try to do better, at least in your mind. So what you do is say: OK, wait. I didn't need all of these guys. I could have gone with just this one, knowing that the center is at the origin. All you need is just x1 squared and x2 squared; this is just a circle centered at the origin. Why do I need the other funny stuff? That would be if I were going for something more elaborate. So now: 1, 2, 3. Now I have a VC dimension of 3, so I am better. Of course, we know better, but I'm just playing along. And then you get carried away and say: OK, I can even do this. It's not an ellipse, it's a circle, so I can just add up x1 squared and x2 squared as one coordinate, and then I have 2. And see what the problem is.
And the problem is what we mentioned before. What you are really doing is that you are a learning algorithm in your own right, but free of charge. That's the problem. You're looking at the data, and you're zooming in, and you're zooming in. You're learning. You're narrowing down the hypothesis set, and then leaving the final learning algorithm just to get you the radius. Yeah, big deal. Well, the problem is that you are now charging for a VC dimension of 2, which is what the last part of the learning cost you, namely choosing the coefficients here. But you didn't charge for the fact that you were a learning algorithm, that you took the data into consideration, and that you kept zooming in from a bigger hypothesis set. You didn't charge for the full VC dimension of that. Now, it is very important to realize that the problem here involves the data set. Because what happens when you look at the data set is that you are vulnerable to designing your model, or your choices in the learning, according to the idiosyncrasies of the data set. And therefore you may be doing well on that data set, but you don't know whether you'll be doing well on another independently generated data set from the same distribution, which would be your out-of-sample. So that's the key. On the other hand, you are completely allowed, encouraged, indeed ordered, to look at all other information related to your target function and input space, except for the realization of the data set that you're going to use for training, unless you are going to charge accordingly. So here is the deal. Someone comes in, and I ask: how many inputs do you have? What is the range of the inputs? How did you measure the inputs? Are they physically correlated? Do you know of any properties that I can apply? Is it monotonic in this variable or that variable? All of this is completely valid and completely important for you in order to zoom in correctly. Because right now you are not using the data, so you are not subject to overfitting the data. You are using properties of the target function and the input space proper, and therefore improving your chances of picking a correct model. The problem starts when you look at the data set and do not charge accordingly, very specifically.

Here's another puzzle. This one is financial forecasting, fittingly enough. So now there will be data snooping somewhere here, and you need to look out for it. In this case, this is a real situation with real data. You are predicting the exchange rate between the US dollar and the British pound. You have eight years' worth of daily trading data, and you simply take the change from day to day. And eight years would be about 2,000 points; there are about 250 trading days per year, at least when the data was collected. And what you are planning to do is the following. You look here; let me magnify it. This is your input for the prediction, and this is your output. So r is the rate. You don't look at the rate in the absolute; you look at delta r, the difference between the rate today and the rate yesterday. That's what you're trying to predict. So you're asking yourself whether it is going up or down every day, and by how much. So you get delta r. And you get delta r for the 20 days before, hoping that a particular pattern of ups and downs in the exchange rate will tell you something about today's change, which hasn't happened yet; you are deciding either to buy or to sell at the open, depending on whether this change will be positive or negative, and by how much.
So if you make a certain prediction, then you can obviously capitalize on that, and trade according to it. And if you are right more often than not, you would be making money, because you are losing less often than winning, if you have the right objective function. So this is the case. So what happens here is that you have the 2,000 points; for every day, there is a change, delta r. And what you do first, you normalize the data to zero mean and unit variance. And then, after that, you have this array of 2,000 points, and you create a training set and a test set. For the training set, in this case, you take 1,500 points, 1,500 days. For every such day, you take the day and the previous 20 days as the input; that becomes one training example. And for the test set, you picked the points at random, not just the last stretch, to make sure that there is no funny stuff where things changed over time, or this or that. You just want to see if something is inherent. So, just to be on the safe side, you did it randomly. And then you take 500 points to test on. So right now, out of the array of 2,000 points, you have a big array of 20-point input, one output; 20-point input, one output; 1,500 of those. And on the other side, for the test, 20-point input, one output; 500 of those. That's the game. So you go on with the training. You train your system on the training set. And to make sure, because you have heard of data snooping and whatnot, the test points are under lock and key. You didn't look at the data at any point; you did all of this automatically. And then, when you are done and you have frozen the final hypothesis, you open the safe, you get the test data, and you see how you did. And this is how you did: you train only on D_train, you test on D_test, and this is what you get. I'm not showing how often you got it right; I'm saying that you put on a trade according to the prediction, and I'm asking you how much money you made. So for the 500 points, sometimes you win, sometimes you lose, but you win more often than you lose, which is good. And at the end of two years' worth, which is what 500 days would be, you would have made a respectable 22 percent, unleveraged. So that's pretty, pretty good. So you're very happy, and having done that, you go to the bank and tell them: I have this great prediction system; here is the system; I'm going to sell it to you, and I guarantee it will perform, and you do the error bars and whatever. And they go live, and they lose money, and they sue you, and all of that.

So you ask yourself, what went wrong? What went wrong is that there is snooping, and the interesting question is: where exactly is the snooping? There are several candidates: the random split, the fact that I use inputs that happen to be outputs for other examples. No, that's legitimate; I'm just getting the pattern. You go around it, and it is really remarkably subtle, to the level where you can fall into it very, very easily. And here is where the snooping happened. The snooping happened when you normalized. What? I had the daily rates, right? 2,000 of them. I have the changes. All of that is legitimate. Now, I sort of slipped a fast one by you; I hope I did. When I told you: OK, the first thing you do is normalize this to zero mean and unit variance.
It looked like an innocent step, because you get the values into a nice numerical range, and some methods will actually ask you to please normalize the data, because they are sensitive to the dynamic range of the data. The problem is that I did this before I separated the training from the testing. So I took into consideration the mean and variance of the test set. That is an extremely slight snooping into what is supposed to be the test set; it's just a mean and a variance, how could it possibly make a difference? Well, say you didn't do that: you split the data first, you took the training set only, and you did the normalization. Whatever mu and sigma squared did the normalization for the training set, you took them, frozen, and applied them to the test set, so that the values live in the same range. And you did the training and the test, now without any snooping. Under those conditions, this is what you would have gotten. So no wonder you lost money. All the money you made is because you snooped on the average of the out-of-sample set. And the average matters, because, if you think about it, let's say that the US dollar had a trend of going up; that will affect the mean. But you don't know that, at least you don't know it for the out-of-sample period, unless you got something from the out-of-sample set. So I'm not saying normalization is a bad idea. Normalization is a super idea. Just make sure that whatever parameters you use for normalization are extracted exclusively from what you call the training set, and then you are safe. Otherwise, you will be getting something that you are not entitled to get. It is easy to think about if you are actually thinking: I'm going to deploy this system, and I don't have the test set. If you don't have the test set, you cannot possibly use those points in order to normalize. So use only things that you will actually be able to use when you deploy the system. In this case, you have only the training set (a short code sketch of this order of operations appears below, after this paragraph).

Now the third manifestation of data snooping comes from the reuse of a data set. That is also very common. So what you do: I give you a homework problem. I am very excited about neural networks; let me try neural networks. Oops, they didn't work. I heard support vector machines are better; let me try them. Yeah, I did, but it was the wrong kernel; let me use the RBF kernel. Oh, maybe I'm just using models that are too sophisticated; let me go back to the linear models and just use a nonlinear transformation. And eventually, using the same data set, you will succeed. And the best way to describe it is a very nice quote in machine learning. It says: if you torture the data long enough, it will confess. But in exactly the same way, the confession means nothing in this case. So the problem here is that when you do this, you are increasing the VC dimension without realizing it. I used neural networks and they didn't work, and then I used a support vector machine with this and that. Guess what the final model you used in order to learn is? The union of all of the above; it's just that some of them happened to be rejected by the learning algorithm. That's fine, but this is the resource you had. So you think of the VC dimension, and the VC dimension is that of the total learning model. So again, as we will see, there will be remedies for data snooping. And as for the question of whether this means I try one system and, when I fail, I just quit: no, that's not what is being said. It's just asking you to account for what you have been doing.
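Here is the normalization sketch promised above, before continuing with the reuse issue. It is a minimal illustration, not the lecture's code: the synthetic delta_r series, the split sizes, and the specific numbers are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
delta_r = rng.normal(0.0002, 0.007, size=2000)   # stand-in for 2,000 daily changes

# split FIRST: 1,500 points for training, 500 for testing
idx = rng.permutation(len(delta_r))
train, test = delta_r[idx[:1500]], delta_r[idx[1500:]]

# estimate the normalization parameters from the training set ONLY
mu, sigma = train.mean(), train.std()
train_n = (train - mu) / sigma
test_n = (test - mu) / sigma        # same frozen mu and sigma; no peeking at the test set

# The snooping version would have been:
#   mu, sigma = delta_r.mean(), delta_r.std()    # uses the test points too
print(train_n.mean(), test_n.mean())             # the test mean need not come out exactly zero
```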
Don't be fooled into thinking that I can do whatever I want, and then the final guy that I used, a very simple model chosen after all the wisdom that I accumulated from the data, is the only VC dimension that I'm going to charge for. That just doesn't work.

Now the interesting thing is that this could happen not because you used the data, but because others used the data. Oh my God, it's really terrible here. Here's the deal. You decide to try your methods on some data set. So you go to one of the data sets available on the internet, let's say for heart attacks or something. And you say: OK, I'm very aware of data snooping. I'm not going to look at the data. I'm not going to normalize using the data. I'm going to get the data, put it in a safe, and close the safe. And I will just do my homework before I even touch the data. And your homework is in the form of reading papers about other people who used the data set. You want to get the wisdom. So you read this and that, and you find that people realized that Boltzmann machines don't work in this case, that the best kernel for the SVM happens to be the polynomial kernel of order 3, and what have you. So you collect all of this, and you look at it. And then you have your own arsenal of things. So, as a starting point, you put up a baseline based on the experience you got, and you say: I'm now going to modify it. Now you open the safe and get the data. Now you realize what happened. You didn't look at the data, but you used something that was affected by the data, through the work of others. So in that case, don't be surprised if, when all you did was determine a couple of parameters, that being the only thing you added to the deal, and you got a great performance, you say: I have two parameters, a mere two, and I have 7,000 points; I must be doing great out-of-sample. And you go out-of-sample, and it doesn't happen. Why doesn't it happen? Because actually, it's not just the two parameters. It's all the decisions that led to that model. And the key problem, in all of these, is always to remember that you are matching a particular data set too well. You are now married to that data set. You keep trying things, et cetera, and after a while, you know exactly what to do with this data set. If someone comes and generates another data set from the same distribution, and you look at it, it will look completely foreign to you. What happened? It used to be that whenever these two points are close, there is always a point on the same line far away. That's obviously an idiosyncrasy of that particular data set. Now you give me a data set that doesn't have that; it must be generated from a different distribution. No, it's generated from the same distribution. You just got too far into this data set, to the level where you are starting to fit funny stuff, fitting the noise.

There are two remedies for data snooping, and I'm going to do this, then give you the final puzzle, and call it a day. You avoid it, or you account for it. That's it. So avoiding it is interesting. It really requires strict discipline. I'll tell you a story from my own lab. We were working on a problem, and performance was very critical, and we were very excited about what we were getting: all the ingredients that make you go for data snooping, where you just want to push it a little bit. Because we realized that this was the case, we had a discipline. We would take the data, and the first thing we did, we sampled points at random and put them in the safe, and then the rest of the points you can use for your training, validation, whatever you want.
There are two remedies for data snooping. I'm going to go through them, then give you the final puzzle and call it a day. You either avoid it, or you account for it. That's it. Avoiding it is interesting, because it really requires strict discipline. I'll tell you a story from my own lab. We were working on a problem where performance was very critical, and we were very excited about what we were getting: all the ingredients that tempt you into data snooping were there; you just want to push it a little bit further. Because we realized that this was the situation, we had a discipline for handling the data. The first thing we did was sample points at random and put them in a safe; the rest you could use for training, validation, whatever you want. At some point, one of my colleagues who was working on the problem declared that he already had the final hypothesis ready; it was a neural network at that point. Now, I was the safe keeper, so I was supposed to give him the test points in order to see what the performance was like. I smelled a rat. So I asked him: could you please send me the weights of the final hypothesis before I send you the test set? That was the requirement. Because now it's completely clear: he is committed to one final hypothesis. If I send him the test set and he reports that it performed great, I can verify that, because he has already sent me the weights. It's a question of causality in this case. And the problem is that it is not that difficult to slip. Here is the test set; what you really had was one candidate, but three other guys were still in the running, and once you look at the data you may decide, well, maybe I'll take one of those instead. It takes very little. Financial applications in particular are extremely vulnerable, because the data is so noisy. When you fit the noise even a little bit, you get much better apparent performance than you will ever get from the actual pattern. So you had better be extremely careful, and therefore you need a discipline that is completely watertight, so that you genuinely did not data snoop. Accounting for data snooping is not that bad, because we already have a theory. When we have a finite number of hypotheses we're choosing from, as in validation, we know the level of contamination. Even if it's an infinite number, we have the VC dimension. We have very nice guidelines that tell us how much contamination happened. The most vulnerable case is looking at the data, because it's very difficult to model yourself and say: what is the hypothesis set that I effectively explored in order to come up with that model by looking at the data? Because the accounting there is so difficult, that is why I keep raising a flag about looking at the data. But if you can account for it, by all means do; that is all you need. Look at the data all you want; just charge accordingly, and you will be completely safe as far as machine learning is concerned.
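For the finite case, the "level of contamination" has a standard form; here is a hedged restatement, in notation consistent with the course rather than quoted from the lecture (E_reused is my label for the error measured on the reused points). If the final hypothesis g was chosen by comparing M candidates on the same K reused data points, then Hoeffding's inequality with a union bound over the M comparisons gives, with probability at least 1 - \delta,

E_{\text{out}}(g) \le E_{\text{reused}}(g) + \sqrt{\frac{1}{2K}\,\ln\frac{2M}{\delta}}

The charge enters through M, the number of things the data was used to decide among, including decisions inherited from other people's use of the data, not through the parameter count of the final model.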
Final puzzle, and then we call it a day. We are still in data snooping, so maybe this has to do with data snooping; but it also has to do with sampling bias. So it's an interesting puzzle. This is a case where you are testing the long-term performance of a famous strategy in trading called buy and hold. What does it mean? You buy, and you hold. You don't keep saying, I'm going to sell today because it's going down; you just buy, sit back, and forget about it, like a pension plan or something, and years later you look at it and see what happened. So you want to see how much money you make out of this. What you do is decide to use 50 years' worth of data. That is about the span of a professional life, so it will cover how much money you make from the time you start contributing until the time you retire. Here's the way you do the test. You want the test to be as broad as possible, so you go for the S&P 500. You take all currently traded stocks, the 500 of them, and you go back and assume that you strictly applied buy and hold to all of them. Don't be tempted to say, I'm going to modify it because this one crashed at some point, and if I had sold and bought back in I would have made more. No, no, no. It's buy and hold we are testing; that is frozen. So you do this, and you compute, and you find that you make a fantastic profit. You figure: I'm young in my career, I'll apply this, and by the time I retire I will have a couple of yachts. Wonderful. Can you see the problem? You are very well trained now, so you can detect it. The problem is that there is a sampling bias, formally speaking, because you looked only at currently traded stocks. That obviously excludes the companies that were there along the way and took a dive, and that obviously gives you a very unfair advantage. The interesting thing is that people tend to treat this not as sampling bias but as data snooping, in spite of the fact that it doesn't fit our definition of data snooping. It does fit the everyday notion of snooping, because you looked into the future: it's as if, standing 50 years ago, someone told you which stocks would still be traded today. That's not allowed. Nonetheless, some people will call it data snooping. In our terminology, this is formally just sampling bias, one that happens to be created by a form of snooping. I will stop here, and we will take questions after a short break.

Let's start the Q&A. In the last homework, people were using LibSVM, which emphasizes that the data should be scaled; why was this not discussed in the course? There are many things I did not discuss in the course. I had a budget of 18 lectures, and I chose what I considered to be the most important. There is a question of input preprocessing, not only normalization but also decorrelation of the inputs and so on, which is a practical matter. The fact that I did not cover something doesn't mean it's not important. It just means that, under the constraints, this is a constrained optimization problem, and I had to settle for a feasible solution; this is the one I have. I think we have an in-house question. Professor, you mentioned that reusing the same data set to compare different models is a form of data snooping; so which part of it is the snooping, and how do we choose a model legitimately? The part of it which is formally data snooping is where you use the failure of the previous model to direct you to the choice of the new model without accounting for the VC dimension of having done that. Effectively, it's not you who looked at the data; the previous model looked at the data and made a decision, and you didn't charge for it. That is the data snooping aspect of it. If you did this as a formal hierarchy (you start out saying: here's the data set, I don't look at it, I'm going to start with support vector machines with the RBF kernel, and if that fails I'm going to do this, et cetera), then, given that this is your declared hierarchy, using the effective VC dimension of the hierarchy is completely legitimate. The snooping part is using the data for something without accounting for it; in this case, using the data to reject certain models and direct yourself to other models. So by accounting for the data snooping, do you mean considering the effective VC dimension of the entire model, and using a correspondingly larger data set? Yes. If the VC dimension is so big that the amount of data you have won't give you any generalization, the conclusion is that you won't be able to generalize unless you get more data, which is what you're suggesting.
So the basic thing is this: you are going to learn, and you are finally going to hand a hypothesis to someone. What do you expect in terms of performance? Data snooping makes you much more optimistic than you should be, because you didn't charge for things that you should have charged for. That is the only statement being made. Is there a possibility that data snooping will make you pessimistic, that is, more conservative? I can probably construct deliberate scenarios under which that is the case. But in all the problems I have seen, people are always eager to get good performance. That is the key; that is the inherent bias, and that is what directs you toward something optimistic: you do something that gets you a smaller in-sample error, and you think that in-sample error is relevant, but you didn't account for what it cost you to get it. So it is almost always in the optimistic direction. Thank you. OK: assuming that there is sampling bias, can you discuss how to get around it? We discussed it a little bit. If you know the distributions (let me go back to this slide), say I give you these two distributions. What this means is that you generated the training data according to the blue curve, so you get some data here. What is clear, for example, is that the data corresponding to the center of the red curve, which is the test distribution, are underrepresented in the training set. On the other hand, the data over here are overrepresented: the blue curve is much bigger there, so it will give you samples in a region from which the test distribution will hardly ever produce a point. So what you do is devise a way of scaling, or giving importance to, each example (not scaling the y value, just scaling the emphasis of the example) such that you compensate for this discrepancy, as if the data had come from the test distribution; and there are resampling methods that achieve the same effect. So this is one approach. The other approach, in the absence of known distributions, is to look at the input space in terms of coordinates. Say, in the case of the Netflix data, you look at, for example, how many movies each user rated: some are heavy users and some are light users. So you take the number of movies a user rated and check that the training set and the test set have equivalent distributions as far as the number of ratings is concerned. Then you look at another coordinate, and a third coordinate, and try to match those coordinates. This is an attempt to take a peek at the real distribution, which we don't know, through its realization along coordinates we can relate to. So the methods for doing that basically compensate by doing something to the training set you have, to make it look more as if it came from the real distribution, the test distribution.
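A minimal sketch of the first approach, importance weighting (my own illustration, not code from the lecture; the two Gaussian densities are assumed stand-ins for the blue training curve and the red test curve): each training example is weighted by the ratio of test density to training density at its input, and a weighted fit then behaves more as if the data had been sampled from the test distribution.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

# Hypothetical 1-D setting: training inputs follow the "blue" density,
# while the test/deployment inputs follow the "red" density.
p_train = norm(loc=-1.0, scale=1.0)
p_test = norm(loc=1.0, scale=0.7)

x = p_train.rvs(size=500, random_state=3)        # training inputs (blue curve)
y = np.sin(x) + 0.1 * rng.normal(size=x.size)    # some target values to fit

# Importance weights: emphasize training points that are typical under the test density.
w = p_test.pdf(x) / p_train.pdf(x)

# Compare an ordinary least-squares line with a weighted one.
X = np.column_stack([np.ones_like(x), x])
unweighted = np.linalg.lstsq(X, y, rcond=None)[0]
sw = np.sqrt(w)
weighted = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]

print("unweighted fit (intercept, slope):", unweighted)
print("weighted fit   (intercept, slope):", weighted)
# The weighted coefficients approximate what a fit on test-distribution inputs would give;
# resampling the training points with probabilities proportional to w has the same intent.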
Is there any counter-example to Occam's razor? Is there one or not? Well, statistically speaking, it depends. I can construct a case where I break the link between the complexity of an object and the complexity of the set it belongs to. I can take one hypothesis that is extremely sophisticated in terms of minimum description length, or the order of the polynomial, but that happens to be the only hypothesis in my hypothesis set. Now, if this happens to be close to your target function, you will do great in spite of the fact that it's complex and whatnot. So I can construct situations that violate the premise in that way. But in the absence of further knowledge, and in very concrete statistical terms, Occam's razor holds. The idea is that when you use something simpler, on average you will get better performance; that is the conclusion here. Speaking specifically about applications in computer vision, where sampling bias comes to mind: is there any particular method used there to correct it, or just the same ones we discussed? I think it's the same as we discussed, just applied to the domain. Sometimes the method becomes very particular when you look at what type of features you extract in a given domain, and it gets modified accordingly. But the principle is that you take the data points in your sample and either weight them differently or resample them, such that you replicate what would have happened if you were sampling from the test distribution. OK, I think that's it. So we'll see you on Thursday.