Welcome back. Last time we talked about three learning principles. The first one was Occam's razor, which says that the simplest model that fits the data is also the most plausible, so when you have a choice you should apply the razor and go for the simpler hypothesis. Now, in making this a specific statement that is justified, we had simply two arguments that are interesting in their own right. One of them is the fact that the complexity of an object corresponds to the complexity of a class of objects; that correspondence was one. The other one is that if you have an unlikely event, then when it does happen it is more significant than if it had been a likely event to begin with. When we put them together, we get the proofs of Occam's razor under different assumptions. The second principle had to do with sampling bias, which reminds you of the fact that we said the training data comes from the same distribution as the test data; that was the basic assumption in all of our theoretical analysis. When it doesn't hold, there is a bias, and since your learning algorithm learns only from the training data, it is going to inherit whatever distribution is in the training data, and therefore the result will be accordingly biased. If the mismatch is nice and continuous, at least non-zero for all the points, then there is a way to compensate for the sampling bias by trying to make the sample look as if it was coming from the other distribution. But if the training data doesn't represent a particular part of the space, so that part has probability zero as far as training is concerned but positive probability for the test, then there is really nothing that can be done to replicate the behavior of the target function over that part of the space, and you get something that is inherently biased. The last principle was data snooping, which is the most important in terms of being a trap that you fall into. The idea here is that when you use a data set in the training in any capacity, even a very light capacity, and we saw an example where the only way the data was used was to derive normalization constants for the inputs, something very light, nonetheless the fact that you used the data means that you cannot call it a test set after that and trust the performance that is suggested by that data set. Indeed, we took a case where we allowed snooping, and we ended up with a very optimistic view when in reality the performance was very poor. And of course, if you go for the real out-of-sample, you hand the system to your customer, they test it, and they will see the real out-of-sample, not the optimistic performance that you had. Today's lecture is the final lecture, and I am going to use it to give the big picture of machine learning, to try to fit the material that we covered within that big picture, and then to tie up a couple of loose ends that are relevant to it. So here is the outline. First I'm going to talk about the map of machine learning, because machine learning is pretty diverse, as you will see, and we will see what we covered, how you can pursue it further, and what topics I would recommend that you read about. Then we will take two topics; I will explain why we picked these two topics, and talk about them in some detail, not in the very technical detail with which we covered the topics
of this course but at least to give you some background about where these topics stand as far as the machine learning is concerned so that if you decide to pursue them you have a head start and finally I'm going to acknowledge the people who have contributed greatly to this course okay well when it comes to machine learning it's a jungle out there and it's interesting that if you buy two books in machine learning and you look at them you will feel that you are reading about two completely different subjects if one of them is theoretical and particular theories there are a bunch of theories and one of them is practical and one of them is emphasizing a particular technique they just have nothing in common not even the jargon so if you go out on your own and just look at what happens in machine learning pretty much this is the picture you will get not a pretty picture okay and you can see buzzwords galore and you know people will get excited about one thing and tell you that this is you know God's gift to humanity and you know the other thing you know people will be very opinionated it just all over the place so I'm not going to attempt to be complete okay because being complete here is fatal trying to cover everything so that everybody is happy that you covered the results they got I don't think this is a good strategy I sort of I preached the Occam's razor last time remember Occam's razor okay you should have a razor and then you should trim trim trim until you get the essential part this is pretty much what I try to do here okay because I believe that if you understand the fundamentals inside out you can pursue things completely on your own okay from then on you are not going to be intimidated by grandiose statements of one nature or another you will know where things lie and what not so my task was to get the foundation right and in the course of doing that I had to omit many many topics so now I'm going to give you the map of the whole thing what we covered and what can be pursued in order just to have a good outlook on the situation okay so here is the map there is theory okay and theory means that you mathematically model what happens in reality and then try to do mathematical derivation in order to arrive at results that are not otherwise obvious that's what theory is in general okay and there are usually two aspects when you look at a theory what assumptions they made and then what the derivation is in order to get to the results I hardly ever saw a situation where there is a problem with the second part people are very competent mathematicians they are not going to make a mistake in derivation so the chances are they will when they make a statement mathematically they actually mean it and they proved it so that is not our concern the biggest pitfall in theory is that people make assumptions that make what they are solving really divorced from the practice that you are going to see when you use machine learning okay and when I picked the theory I picked it with a view to relevance to practice I wanted to get something it has to be some mathematics and has to be proved and all of that but then when you see the result you can use it and I will go through other alternatives that have succeeded in that to different degrees then there are techniques and that is really the bulk of machine learning okay we covered some techniques but I'm going to categorize techniques into two sets and give you samples and then you will understand from what we have done where it lies and how you can pursue 
it further and finally there are paradigms and paradigms meaning different assumptions about the learning situation not mathematical assumptions but different assumptions that deal with different learning situations like for example supervised learning versus reinforcement learning and whatnot and when you make these assumptions the problems you are solving are sufficiently different that you end up with really a different body of knowledge that you have to study and therefore we call them different paradigms okay so these are basically the categories so let me start with the paradigms first and then go to the other ones okay so we covered supervised learning that was almost the exclusive topic of the course and it is by far the most popular and the most useful form of machine learning so if you cover just that you are already very much ahead the other topics are interesting and they have applications and they should be studied but definitely not in the league of supervised learning in terms of impact on practice we touched on unsupervised learning but at least we got the idea that clustering is the key and indeed clustering is the key and with unsupervised there are also variations of that there are semi-supervised and there are everything I say here has a bunch of variations already there so I'm just giving you the center of mass of these paradigms then there is reinforcement learning that I described in the first lecture very briefly but we didn't cover at all and the reason is justified because the main problem in supervised learning was the question of information do I have enough information in the data in order to get the target function and generalize when you go to reinforcement learning remember what reinforcement learning was you don't have the target value on the examples you just take an action which is an output not necessarily the target output and then something comes tells you that you did well or you didn't do well so the sort of reinforcement of good actions and elimination of bad actions will make you eventually converge to a good solution and we said that it applies to games let's say you're playing you know trying to to learn backgammon and what you do you just play against yourself generating at will examples as you want here's the situation what do I do I'll do something I can generate that the only question is after you do that how do you take the feedback of winning and losing and go back and adjust your strategy such that you converge to a good strategy so the issue here is to provide learning it's not a question of information it's a question of the algorithm that will take all of these tons of examples that you can generate at will and produce a way to converge to a solution from one strategy to a better strategy to a better strategy so it's a completely different paradigm and if there is one topic in this entire view graph that I would encourage you to pursue would be read about reinforcement sometimes active learning so active learning it could be active reinforcement or active supervised active learning means that instead of someone giving you the dataset you query about the value at a particular point so you give me the input and you ask for the output if it's supervised or you give me the input and expect a reward or punishment if it's reinforcement learning so it's an adjustment and there are some interesting results there and the other sort of mini paradigm is online learning and this is purely so take any form of learning and instead of giving you the 
full dataset and allowing you to work with it any way you want, I am streaming the data set to you. So you take one example and try to modify your current hypothesis, and then you take the next one, and you cannot store everything; if you could store everything, you would have the whole data set. So there are limitations on storage and computation, and under those constraints you ask yourself how you can learn. These are the most famous paradigms; there are other paradigms, and I am not trying to be exhaustive. Now let's go for the theory. The main theory in machine learning is the Vapnik-Chervonenkis theory, and it is the one that I covered in great detail in this course, as you realized. The reason is very straightforward: it's relevant. You do the math, you go through the proofs, and then you get the VC dimension, you equate it to the number of parameters in some cases, you go to practice, and even though you are taking bounds and treating them as if they were equalities, that leap of faith works very well in practice. So it's not that the theory was there and we decided that this is a good one; the theory was there, and then we tried to take wisdom from the theory and apply it in practice, and it worked. This is the value added by choosing a topic and putting it here: you know that there is a reason why it's here, and the reason is that it is actually relevant to practice. Then there is bias-variance. Bias-variance is sort of a sweet little theory, and it gave us some intuition; indeed it was included, it was low cost to include, and it does lead to some understanding, like the learning curves and whatnot. There are theories that I didn't describe, although they are very substantial in the literature. One of them is based on computational complexity. It basically treats machine learning as a branch of computational complexity, with an emphasis on asymptotic results: can I do this in polynomial time or not? It's a very respectable body of work, and the only question for including it or not is whether these particular results correspond to something that I face in practice. So when I look at a learning problem, should I worry about the computational complexity part of it or the generalization part of it? The generalization part, hands down, because it's the one that is the bottleneck when I practice. And finally there is the famous Bayesian approach. This treats machine learning as a branch of probability. So you have a problem; we can always put a probability distribution on it, and by the time we put down the full joint probability distribution, we can answer all questions. It's a very sweet theory, because you can ask any question you want and you will find a very concrete, rigorous mathematical answer to that question, given that probability distribution. Okay, so now let's go for the techniques. I mean, there are other theories; again, I am just giving you the biggest players. When you look at techniques, you should separate models, as in hypothesis sets and the algorithms that go with them, that's one category, from the high-level methods, like regularization for example, that don't restrict themselves to a particular model but are superimposed on it. So we look at the models. Linear models we emphasized a lot. They are not usually emphasized in regular machine learning courses, which usually go for other models; they are emphasized very much in statistics, for example, and I find them to be very underrepresented in machine learning. The linear model is a very important model; with the nonlinear transform you can cover a lot of
territory and it's very low cost and it should be tried in many learning problems then we went on to neural networks and support vector machines and the kernel methods we cover quite a bit of territory nearest neighbors I alluded to very quickly when I talked about RBF it's a very standard method not much to say about it except that it's a good benchmark if you have a data set why don't you categorize everything according to the nearest neighbor and this would be give you a performance and then you can compare other methods to that it's not that difficult to implement we used to look at RBF to many things in machine learning and then there are Gaussian processes which some people are completely fond of which is great and it really has the same spirit of Bayesian it's a full probability distribution so a process here means a random process a random process is nothing but a random function if a random variable is a random number a random process is a random function so we have probability distribution over different functions and the assumption here is that they are Gaussian which means if you have a finite number of points the probability distribution of the y coordinate is jointly Gaussian for those guys so if you have a full description of that probability distribution you can solve anything you want because you can say if I have this data point then I'm conditioning on that Gaussian variable being equal to that and I'm asking myself what is now the conditional distribution of the other guys and for Gaussian this is completely solved and you know we have nice matrices to just multiply out which is good to use and if you are modeling something that happens to be a Gaussian process then obviously you win greatly because you are actually matching that there is SVD which is the singular value decomposition used figuratively in this case this is the factor analysis we used in the Netflix problem where we represented the user as a bunch of factors and the movie as a bunch of factors and we tried to match when you put this you find that as if you are the entire rating matrix into two matrices and this will be similar to the singular value decomposition in mathematics so we have seen part of that finally there is graphical models and graphical models is almost a different paradigm in its own right what are graphical models they are a model for where the target is a joint probability distribution that's what you are trying to learn and the key here is that the joint probability distribution between a very large number of variables becomes very difficult to manage computationally because there is the number of possibilities would be exponential in the number of variables so the bulk of work in graphical models is trying to find a simple way or an efficient way in order to get answers about that joint probability distribution and to learn it so it is mostly computational and it's based on graph algorithms and the main aspect of putting it in a graph is to use the properties that happen to be conditional independence as a way to simplify the graph so if you look at the things I showed so far probably there would be a full course in graphical models which is completely justified if you are in the business of modeling joint probability distribution and computation is a consideration this is the thing to learn there is no question about that it's specialized but it's very helpful in that case and the other one I mentioned is that when we talk together with active learning because there are a lot of 
commonalities. Now we go for the methods, and the methods are very important because they cover a lot of territory regardless of the model you have. We used regularization, and we used validation; those were the methods we covered. The last one, which we didn't cover, is input processing. This is something you do regardless of the model you are going to use, and I find that input processing is best taught within a projects course. It's a very practical matter, and when you teach a projects course and people have to deal with real data, it's a good thing to start by telling them, okay, here is principal component analysis, here is how you prepare the inputs, and so on. There isn't that much intellectual value to input processing in the abstract; it's a practical matter, and therefore it is best taught when you are teaching a practical course. Okay, so now, from all of this, I am going to talk about two topics today. One of them is Bayesian learning, and the other one is aggregation. I'm not going to talk about them in depth; I'm actually going to try to make a point, particularly about the Bayesian one. And you may ask: after preaching that you should trim things down to the minimal possible, why am I now adding stuff? There is a good reason. The Bayesian approach is the elephant in the room. If I don't talk about it, you will hear about it a lot, and you will wonder why in the world I didn't talk about it, because it looks great when you look at the results. So I need to put it in perspective, and that's what I'm going to try to do. I'm not going to cover the full scope of the Bayesian approach; I'll make a point about when it is valid, when you can use it, and what the drawbacks are. The other one is aggregation. I would say that aggregation was a runner-up topic in this course; I would have included it if I had more time, and I had a natural position for it, because it's a fairly simple technique that covers a lot of territory and has been successful. So I'm going to try to cover it at a level of detail that will let you read about it further and understand what you are reading and where it lies. So this is the plan. With that, let's go to the two topics: Bayesian learning first, and then aggregation. Bayesian learning is trying to take a full probabilistic approach to learning. So the first thing to do is to remind you of the learning diagram; let me magnify it a little bit. We are now going to concentrate on the probabilistic aspects. There are several probabilistic components. One of them is inherent, which is the fact that the target could be noisy, and therefore we model the target not as a function but as a probability distribution. Think of the case we dealt with, for example, of trying to predict heart attacks. Getting a heart attack or not getting a heart attack has a probabilistic aspect: two people can have essentially the same attributes, and one of them gets a heart attack while the other one doesn't, so the best you can do is capture the probability of a heart attack given the attributes. The other probabilistic component is the input probability distribution, and that one is not inherent to the problem; we introduced it ourselves just to be able to carry out the probabilistic analysis. In spite of the fact that it is an assumption, it's a very benign assumption, for the very simple reason of the word unknown. I wanted a probability distribution just to get the machinery of probability going, and I'm making no assumptions about what the probability distribution is. You can pick any one you want, and I don't even want to know it. So this is very light, as assumptions go.
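In symbols, the two probabilistic components just described amount to roughly the following sketch, in the notation used elsewhere in the course:

```latex
% The learning diagram's two probabilistic components (sketch).
% Each example is generated by the input distribution and the noisy target:
%   x_n ~ P(x),   y_n ~ P(y | x_n),   D = {(x_1, y_1), ..., (x_N, y_N)}
\[
  P(x, y) = P(x)\,P(y \mid x),
  \qquad
  f(x) = \mathbb{E}[\,y \mid x\,],
  \qquad
  \text{noise} = y - f(x).
\]
% P(y | x) is the noisy target, inherent to the problem;
% P(x) is the benign assumption introduced only to enable the probabilistic analysis.
```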
Now, when it comes to the Bayesian approach, what you want to do is you want to extend the probabilistic role completely, so that everything is just a big joint probability distribution of all the notions involved. And if you get that going, then you will obviously be able to derive anything in terms of that joint probability distribution. So when you do this, let's think of something. In the prediction of the heart attack case, remember that we did use probability in order to derive the algorithm for picking a hypothesis. What did we do? We said you have a data set. What is the data set? A bunch of patients with their attributes, and whether or not they got a heart attack within a year of getting these measurements. That was the data set. And you were trying to say that I am going to pick the hypothesis, that if this hypothesis truly reflected the probability of getting a heart attack within that timeframe, then the probability of getting that data set which actually took place would be higher. That was our approach. And we call this the likelihood, remember? So it's not we are picking the highest probability hypothesis. We don't have the luxury to do that for a reason that will become clear. We are picking the hypothesis that will make the data that actually happened the highest possible probability, maximum likelihood. So we already used the probability approach here. Now the only difference when you go to the Bayesian approach is that you actually go for the real quantity. The data already happened. Why are we maximizing the probability? Well, if we maximize the probability, if what happened is likely given a scenario, then that scenario is likely. That's why you call it likelihood. But a more principled approach would be to actually try to use the probability that this is the correct hypothesis given the data. That is the bottom line. I'll give you the data. This is given. And you have a bunch of hypotheses. You ask yourself, is it this hypothesis or this hypothesis or this hypothesis that reflects the target function? Well, you look for which one is the most probable to be and you declare that. And that would be the Bayesian approach. If you go to statistics, there is always a school that love Bayesian and there is a school that hates Bayesian. And there is an ongoing struggle between them. And it's funny because you think this is mathematics. People shouldn't have just tests like that. But the problem is that Bayesian depends on something that I will describe here. And the controversy all comes from that assumption. But it came to the level in statistics where they describe a person as oh, this is a Bayesian person. I'm not a Bayesian. It's almost like it was a religion or something. But that's the reality of the field. And you will understand why it evolved into that when I described the components. Okay. The main component that raises the controversy is the prior. So let's look at what that is. Okay. We want the probability of a hypothesis being the correct target function given the data. Okay. And if you want to compute that and even if you have the model for the noise and the model for the input probability distribution all of the stuff that we had in the learning diagram is already taken for granted. You still need one more probability distribution in order to complete this. Okay. And the way to discover it is just let's write it down. This probability and you apply its Bayesian rule hence the word Bayesian in order to get this from the quantities we know. Okay. 
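Written out, the contrast between the two choices is simply this (a sketch; the "MAP" label is the standard name for the Bayesian pick, not a term used on the slides):

```latex
% Maximum likelihood (what we did, e.g., in logistic regression) versus the Bayesian choice:
\[
  h_{\text{ML}} = \arg\max_{h \in \mathcal{H}} P(D \mid h)
  \qquad \text{versus} \qquad
  h_{\text{MAP}} = \arg\max_{h \in \mathcal{H}} P(h \mid D),
\]
\[
  \text{where } P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)} \;\propto\; P(D \mid h)\,P(h).
\]
% The extra ingredient P(h) is the prior, which is where the controversy comes in.
```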
So we know this one. We know the probability of the data given that if this hypothesis was faithful description of how people got heart attack then the probability of getting that data is you just compute how much noise in each point according to this being F and you get it which we got in logistic regression and resulted in the error measure over there. So that is that was given. Okay. The part that we need that is not given is this one. Okay. When you multiply them you get the joint probability distribution and when you divide by the probability of D you get the conditional the other direction which you want. Okay. Now there is no sweating getting this because if I have the joint probability distribution I just integrate out whatever I don't want and I end up with the marginal. So there is no difficulty in this and in fact if your job is to just pick H according to this criteria then this fellow doesn't matter because it doesn't depend on H it scales all of them up or down. So if you are picking between two hypothesis according to this probability you might as well forget about this and take the numerator as your indicator and pick the one that gives you the bigger numerator. Okay. So you think if this is proportional to this and this is what I'm going to use. Okay. Now this is the mystery quantity so let's put it down and describe it. Okay. What does that mean? Okay. It's not conditioned on anything. I'm asking you here's the hypothesis set. Okay. It's a perceptron. It has a bunch of weights. Could you please tell me what is the probability that this particular combination of weights will actually give you the target function? Okay. How in the world are you are you going to know that? Okay. So what you are doing here is assuming that there is a probability distribution for that. You are going to put a probability distribution over the hypothesis set the last component in the diagram that didn't really have a hypothesis. The probability reflecting that the statement that this hypothesis is indeed the target function. Okay. And any discrepancy between the dataset and the hypothesis which is supposed to be the target function is attributed to the fact that the data is noisy. The data does not reflect the target function. The data deviates by added noise. Okay. So this one is called the prior. Okay. Prior because it is your belief about the hypothesis set before you got any data. Before. After you got the data you can modify this and you get the probability of each given the data. Okay. That's now more informed. You get the specific data from the the target function. And then you are going to zoom in and make a better choice among the hypothesis that you have for which one qualifies as the target. Okay. And this four this one is called the posterior. Happens after the fact. Okay. Now if you are given the prior let's say that I actually you know give you a problem I don't know the target function but I know quite a bit about it in terms of probabilities and here is the way I'm going to formalize that I'm going to give you a full probability distribution over the entire hypothesis set that tells you the relative probability of different hypothesis being the correct target function. That's my prior. If you have that you have the joint probability distribution and if you have the joint probability distribution you can answer any question. So that's a very attractive route to follow. Okay. You can get anything. Okay. 
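To see the bookkeeping in action, here is a tiny made-up numerical sketch, nothing more: a finite hypothesis set of one-dimensional thresholds, a 10% label-flip noise model, and a uniform prior, with the posterior computed as likelihood times prior.

```python
import numpy as np

# Toy sketch of the Bayesian bookkeeping: P(h | D) is proportional to P(D | h) P(h).
# Hypotheses: 1-D thresholds h_t(x) = sign(x - t); noise flips each label w.p. 0.1.
# (Illustration only -- the thresholds, data, and noise level are made up.)

rng = np.random.default_rng(0)
thresholds = np.linspace(-1, 1, 21)                   # finite hypothesis set
prior = np.ones(len(thresholds)) / len(thresholds)    # uniform prior P(h)

# Generate data from the "true" threshold 0.3 with 10% flipped labels.
X = rng.uniform(-1, 1, size=50)
y = np.sign(X - 0.3)
flip = rng.random(50) < 0.1
y[flip] *= -1

# Likelihood P(D | h): each point agrees with h w.p. 0.9, disagrees w.p. 0.1.
def likelihood(t):
    agree = np.sign(X - t) == y
    return np.prod(np.where(agree, 0.9, 0.1))

lik = np.array([likelihood(t) for t in thresholds])
posterior = lik * prior
posterior /= posterior.sum()                          # dividing by P(D)

print("most probable threshold given the data:", thresholds[np.argmax(posterior)])
```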
And because of that it's important to look at the prior that is the the center of the idea of a Bayesian approach. Okay. So the main point I'm making the only point I'm making in fact in this particular section is the fact that prior is an assumption. So before I get to that let me give you an example of a prior to make it concrete the one I refer to. You are having a perceptron. So your perceptron model what is a perceptron model? Okay. It's hyper planes in some space d dimensional so I have weights W0 up to Wd that tell me the you know the slope and the offset and this will tell me what is the separating plane and I'm going to use this as a hypothesis in order to separate some data points generated from a target of some of the guys according to some noise I can say 10% of the points are flipped so this is your contribution of the noise just to have something concrete in your mind to imagine so you have a perceptron H is specified by the weights perceptron is you give it an H because it's a full function but in reality you just tell me what the parameters are and I know what is the function because I know the weights are more likely than others and what I'm going to try to do I'm going to try to make the assumption as benign as possible because I really don't know I mean when I say which weights are more likely than possible I'm not saying which weights are more likely to come out or I'm asking which weights are likely to actually reflect what the target function is the target function that we said is making a big commitment so knowing that for the perceptrons the magnitude of the weights doesn't matter if you scale all the Ws by any positive number up or down you get the same surface right because it just you know classifying plus one or minus one you care about the signal being positive or negative so I'm going to take Ws in a limited range and I'm my hope in putting that prior is that I didn't make a big assumption okay hope is the operative word here okay so what does that mean well it means that okay so if I get the probability distribution over all the weights then I can see which weights contribute to a particular hypothesis and then I have the the prior over H which is the one I want okay so I have P about which hypothesis is the target function according to this which seems to be completely uniform and then take the data and the data will tell me that one some guys are more possible than others if I you know if I pick something that will require my interpretation of the noise to say that you know 90% of the points had to flip in order for this to be the correct target then this is very unlikely because I okay and you compute the probability of the data given H and then you multiply them and you get what you want which is the the posterior and the posterior is the product that is proportional to the product of the two okay so this is a concrete example if you want to apply this okay now we make the main statement about the doesn't have to do with learning in particular to make the point let's say that I take the most neutral prior to describe something unknown so you have something that is unknown okay like a target function in this case it will be a number so I have an unknown number and all I know about it is that it's between minus 1 and plus 1 okay and someone decide that it will be convenient to have a probability that x is repeatedly x is just a number sitting out there that I don't know I don't know in not in a probabilistic sense that it's a random variable I don't know 
it's an unknown parameter okay so you ask yourself would this be equivalent to x being random in some setting can you model this with probability and invariably on face value this looks completely innocent and credible okay because here you didn't make any commitment it is as likely to be this as this as this this one is unknown so it seems that you captured what the meaning is okay I would like to argue is that it doesn't it actually makes a huge assumption okay now you are not saying that you don't know x you saying that I know that x came from this so if you know that here are bunch of stuff of stuff that I know that actually I didn't know here if you generate a bunch of x's and take their average the chances are you'll be around zero okay if you look at a bunch of x's here and average I have no clue what you are going to get if you do this I can tell you even not only that the average will be close to zero I can tell you how close it is in terms of variance here I could be all over the place so you realize that innocent as this maybe this is a different problem than this one okay and if you insist on modeling x as a probability and you want to capture the statement here exactly without adding any assumptions okay you can definitely do that although it looks much less attractive than this one that you want the true equivalent in terms of probability this fellow this is what you are going to have okay so here x is unknown here x is random and the probability density function of that happens to be a delta function centered around a point A that I don't know that would be strictly modeling this this does not model it okay and the fact that people take the liberty of doing that results in many cases in really a huge building based on a false premise okay in some cases you get away with it but in some cases you don't okay so this is the key point is that when you put a prior you are actually making a huge assumption think of it this way we took great pains to say the target function is unknown they say they want to learn a function I really don't know what the function is okay so now you go to your hypothesis set which you picked out of your hat I'm going to do neural networks or I'm going to use linear whatever it is okay and then you say now here is the probability is that for each of these points I'm going to specify very specifically the probability that this is exactly I'm not saying don't do that I'm just making the the statement realize that you're making an assumption and it's it's a big assumption okay so let's see the ramifications of that if you actually knew the prior you would be in fantastic ship why is that because you can compute the posterior for every point in your hypothesis set this is h equal f over the entire space that is this is really out of sample I am taking d which is in sample and I'm making a statement about the probability for every hypothesis to be out of sample I don't worry about regularization and vc bounds and this is it you know this you know the prior you can pick the most probable hypothesis okay and without without any dispute this is the most probable hypothesis you can even go further well I pick the hypothesis set and this is the probability that each of them is the target function and now that I have the evidence from the data I have a better picture of it which is the posterior I can now actually ask myself I can drive the expected value of h because I have for every h the highest probability and sticking with it why don't I get the benefit of the 
entire probability distribution? Well, the target function could be this hypothesis, or that one, or that one, with these posterior probabilities. If you want a good estimate of the value of the target function at any point x, why don't you take the value of h(x) for every hypothesis in your set and put them together as an expected value, since you have the probabilities? And you can even get an error bar to go with the estimate: this is my estimate, and I can tell you what the chances are that I'm wrong. Think of the possibilities. I'm predicting the stock market, I learned using this, I have the inputs for today, and I want to predict the output, which is what will happen tomorrow. Now I can tell you the expected value of the price movement, which is what I care about, and I can also tell you the error bar. So if I tell you that the expected value is positive and the error bar is small, then the chances are overwhelming that the price will move up, and I will be putting my money in that direction. If the error bar is huge, then I'm not so sure, and it may not be worthwhile betting on it. So that's a good situation to be in. And you can derive anything you can imagine: you have a joint probability, you constitute the event you care about, and you get its probability; you have an answer for everything. Now let me make a statement about this approach compared to what we did so far. We have been struggling with VC bounds, and the bounds are loose, and then we have to use regularization, and the regularizer is a heuristic, and then we have to set aside a validation set and worry about whether the cross-validation estimates are really independent. The Bayesian approach doesn't struggle with any of that: all you need to do is plug in the quantities and get the answer, and you know the answer is correct, so you don't worry about any of that stuff. So the way I think about it is the following. When you are following this ideology, it's as if you want to have a good life, and this is your approach to getting the good life: first you rob a bank, then you live righteously ever after. Well, you can live righteously ever after; you can afford it. The other guys are struggling with this and that, with regularization and whatnot. The problem, obviously, is the first step, and the first step here is sugar-coated greatly: it's a benign prior, it's just uniform, we didn't do anything. It is obviously very attractive afterwards, but the outstanding question is: is this actually justified? If you do it in general without any justification, then it's a nice theory built on an assumption that doesn't necessarily hold. On the other hand, it can be very valuable, and it can be justified, in basically two cases: one of them is that the prior is valid, and the other is that the prior is irrelevant. What do I mean? The prior being valid means that, for some reason, this is indeed the probability that a particular hypothesis is equal to the target function; that is a fact. In that case I concede: you are doing better than me. I can go with all my approximations and heuristics and this and that, and I'm not going to do nearly as well as you do. In this case this trumps all the other approaches, so if there are cases where the assumption is valid, I highly recommend that you follow this approach. It may be computationally expensive, because, for example, you have an expected value with respect to the posterior in a high-dimensional space, and that's not an easy task; on the other hand, since this is the ultimate performance, it may be worth the effort.
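The posterior-based prediction and error bar described above are, in symbols, just this (a sketch; the sum becomes an integral for a continuous hypothesis set, which is where the computational expense comes from):

```latex
% Prediction at a point x using the whole posterior, not just the single most probable h:
\[
  \hat{y}(x) = \mathbb{E}\big[h(x) \mid D\big]
             = \sum_{h \in \mathcal{H}} h(x)\,P(h \mid D),
  \qquad
  \text{error bar} \approx \sqrt{\operatorname{Var}\big[h(x) \mid D\big]}.
\]
```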
The prior being irrelevant is a more interesting case. The idea is the following: when you put down a prior, if you get more and more data and you look at the posterior, you realize that the posterior is affected more and more by the data set and less and less by the prior. If you start from another prior, and another, and another, as long as you don't take extremes that put zero probability at certain points and you just pick something reasonable, the prior basically gets factored out as you get more and more data. Because of that, if you have enough data that the prior doesn't matter, then you can think of the prior not as a conceptual addition but just as a catalyst for the computation. There is a particular approach where you pick a prior purely because of its analytic properties: I have no reason whatsoever to believe that this is a valid prior, but it happens that when I have this prior and you give me a data point and I compute the posterior, that computation is easy. These are called conjugate priors, where you don't have to recompute the posterior over the entire function; you parameterize it, and all you need to do is update the parameters when you get new data points. This is completely valid if you are going to use enough data that, by the time you arrive at the answer, it didn't matter what you started with: what you are really doing is putting something in your computation such that you arrive at the correct result regardless of what your assumption was. That's all I'm going to say. You can take a full course on Bayesian learning, and the techniques are really wonderful; I'm not doubting any of that. Just be careful where to apply them, because there is an assumption, and the assumption is stronger than it seems to the uninitiated. Okay, let's move to aggregation methods. I am talking about aggregation methods, as I mentioned, because they are really useful and they are not that difficult to understand; I'll give you the big picture, and then you can pursue the different algorithms. First, what is aggregation? It is a method that applies to all models, as we said, and the idea is that you combine different solutions. Let's say that I give a homework problem to the class that requires you to develop machine learning: you develop the learning algorithm, you get a final hypothesis, everyone gets a final hypothesis, and now I want to take the final hypothesis of each of you and combine them into one solution, hopefully a better one because it has the wisdom of everybody in it. That is the idea. So what you have is a bunch of hypotheses. I would have called them g, final hypotheses, because that's what they are; each of them is the outcome of full training on the entire data set D with certain specifications. But I'm still calling them h, because with aggregation the final hypothesis will really be a combination of those guys, so they keep the h notation rather than the final-hypothesis notation. The picture that goes with it: here is the system that you got, here is the system that the person next to you got, et cetera; I have all of those, and now I want to put them together and get one solution. It's a very easy concept. So that is one example: people have already solved the problem, and I want to combine their solutions.
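As a side note, here is what "putting them together" typically looks like in code, a minimal sketch that assumes the trained solutions are simply available as functions; the averaging and voting choices are the ones discussed right below.

```python
import numpy as np

# Minimal sketch of aggregation "after the fact": combine already-trained solutions.
# h_list is assumed to be a list of functions, each mapping an input array to predictions.

def aggregate_regression(h_list, X, weights=None):
    """Combine real-valued predictions by a (possibly weighted) average."""
    preds = np.array([h(X) for h in h_list])          # shape: (number of solutions, N)
    return np.average(preds, axis=0, weights=weights)

def aggregate_classification(h_list, X, weights=None):
    """Combine +1/-1 predictions by a (possibly weighted) vote."""
    preds = np.array([h(X) for h in h_list])          # each row in {-1, +1}
    return np.sign(np.average(preds, axis=0, weights=weights))
```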
Another example is particularly interesting for computer vision. In many cases, let's say that you want to detect a face. This is a very complicated task, so you could do very simple detections that are related to being a face: you can try to detect whether there is an eye, whether there is a nose, whether the positions are right relative to each other, whether the lighting is consistent, whatever; I'm just putting things down, and they don't have to have that great a meaning. You can just have simple masks that look at the picture and extract a feature that you think is related to being a face. If you take any single feature and try to decide whether this is a face or not, you will do horribly, because the error will be huge. But if you combine a large number of these weak detections, then the decision all of a sudden becomes reliable. This is important in computer vision because computation is a big deal there; you need to do things quickly, because you are trying to be real time or close to real time. So using very simple features, as they are called in this case, and then combining them, is a good application for aggregation. Combining is very simple. There are many ways of doing it, but the most common is that, if it's a regression problem where these guys give you a real number, you take an average, and if you are doing classification, where everybody decides yes or no, you take a vote. It could be a weighted average or a weighted vote, but that's basically the essence of it, and there are lots of methods that belong to that category. Now let me make the point that doing aggregation is different from just doing two-layer learning. What do I mean by that? If you have a two-layer model, there are a bunch of features followed by an output; you have seen that already, like a neural network with neurons in the hidden layer feeding into the output, and you are trying to learn. There, the learning is joint: you learn all of the units at once. Let me magnify the picture. These are your units, and what the learning does is take the data and simultaneously adjust all of these guys in order to get the right overall solution; this could be backpropagation, for example. In that case, any one unit is not necessarily trying to replicate the target function; it's just trying to contribute positively to the final output. In the final layer you could be taking the difference between two units, and that difference is the important thing; the units themselves are not trying to get the target right, they are just trying to be good soldiers, good features, in the combination. The reason is that you are doing it all at once and trying to minimize the overall error, so whatever combination happens, happens. That is what we have done before. In the case of aggregation, by contrast, the units learn independently: each one learns as if it were the only unit. Look at this picture. Here the learning algorithm looks at one unit at a time, maybe with a different learning algorithm for each, but at least each is considered on its own, and this guy is actually trying to replicate the target function, and this one is also trying to replicate the target function, et cetera. Finally, when you have all of these guys that are trying to replicate the function, you combine them and you get the output. So, for lack of a better word, these are really different categories, and I wanted to emphasize this point. One flavor of aggregation I call after the fact. What does that mean? It means that you already have the solutions. For example, remember the Netflix competition: you have crowdsourcing, you let the problem out, everybody tries their hardest and then gets a solution with a
view to performance individually nobody was thinking about putting them together okay if you are thinking about putting them together you may have other consideration for example you may decide okay this was going to go into a blend which is the word that goes with it okay and therefore I'd better get something that is different from what the other guys the boosting approach is I'm calling it before the fact to contrast which is the fact that you are developing the solutions with a view to the fact that they will be blended okay so you get one guy and then when you go to the other guy you are trying to develop a guy that will blend well with the first guy etc okay so you are not trying to get it to perform well in its own right you are trying to make it a good part of a blend okay and we saw an example of that one of the Q&A sessions which was a question of bagging where I give you that the dataset and let's say that I want to give the class the problem but I want to combine at the end okay and they are going to work independently so I want to do something to make sure that they are fairly independent so what I do I re-sample D and give everybody a different sample from D a bootstrap sample so because of that I introduce some independence in the information you are getting and when I put them together I will get the benefit of that independence okay so in this case this is what I am doing for the case of bagging I am actually giving everyone of them a different dataset but it is independent from the other guys and then I am going to combine what they have okay now this leads to the boosting algorithm which are very successful algorithms and the idea here is that instead of leaving the decorrelation to chance I am going to enforce it okay so I am building one hypothesis and the next one and I am making sure that whatever I am getting in the new hypothesis is novel was not covered by the previous guys okay and that obviously improves my chances of getting a good mix okay so what you do you create the hypothesis sequentially so you do one, two, three, four and then given the four that you have so far what is the best fifth that I can add to the mix okay and you make it good by making it decorrelated with the other guys so the picture here is rather interesting so it's not quite independent but it's not as bad as doing all of them at once what you do you have already done those so this is a recursive procedure okay now you read off from these and you realize how they perform and based on that you do something such that the data set you pass on to the new guy makes it develop something that is fairly independent and now this is frozen and it contributes here and then this guy is trying to be independent of all the previous guys so every time you have one you get something new okay therefore by the time you put them in the mix you get something interesting okay so this is the idea and the way to do the the the independent is rather interesting let's say that so far I have 60% correct and 40% wrong that's what I have using the first few guys okay so when I put them together that's what I get 60% and 40% means that 60% of the examples I got right if it's classification okay and 40% I got wrong okay so here is an idea to make the new guy fairly independent of those guys let's say that I emphasize the guys that I did badly on I give them bigger weights in the training and deemphasize the guys that I got right okay because the new distribution is concerned the new emphasis I have it looks that what I 
have so far is 50-50, it's random. Now, what I actually have is not random, because it deals with that training set as it is; but if I give the examples different weights and ask what the weighted error is now, and the weighted error is 50%, it means that, as far as the previous guys are concerned, the new guy will be adding something genuinely new to what I had before. This is the general principle, and when you plug it in you get a very specific algorithm. You do this recursively, and you get not only how to emphasize the examples given the old hypotheses, you also derive the different weights in the final mix; some of the guys will be successful in training and some will not, and you give them weights accordingly. The most famous algorithm, the specific prescription for how you do the emphasis and how you do the weighting, is called adaptive boosting, AdaBoost, and it is the one used in the computer vision example that I gave. Indeed, in that case, instead of working with just the error and trying to make it 50-50, you are actually working with something similar to the margin that we had before. Remember, with support vector machines we weren't settling for getting it right; we wanted to get it right with a margin of safety. The AdaBoost algorithm defines a cost function that has to do with violations of a margin and then tries to improve that margin as it goes, and the weights, both for emphasizing the examples and for picking the combination of the hypotheses you have, are chosen with a view to maximizing that margin. It's a very successful algorithm in practice.
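For reference, here is a compact sketch of the emphasis-and-weighting prescription in its standard AdaBoost form; it assumes scikit-learn decision stumps as the weak learners and labels in {-1, +1}, and it is an illustration rather than the exact variant used in the computer vision systems mentioned above.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, T=50):
    """Sketch of AdaBoost; y is assumed to be in {-1, +1}."""
    N = len(y)
    w = np.ones(N) / N                      # example weights: the "emphasis"
    stumps, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        eps = np.sum(w[pred != y])          # weighted error of the new hypothesis
        if eps <= 0 or eps >= 0.5:
            break                           # weak learner perfect or no better than chance
        alpha = 0.5 * np.log((1 - eps) / eps)
        # Emphasize the examples this hypothesis got wrong, de-emphasize the rest.
        # After renormalizing, the hypothesis just added has weighted error exactly 1/2.
        w *= np.exp(-alpha * y * pred)
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)

    def H(Xq):                              # final hypothesis: weighted vote
        votes = sum(a * s.predict(Xq) for a, s in zip(alphas, stumps))
        return np.sign(votes)
    return H
```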
Now let me use the final technical slide to talk about blending after the fact, because it was applied to Netflix with great success, and I want your attention because I'm going to give you a puzzle, the last puzzle of the course, I guess. So here is the deal. Now I don't have the benefit of decorrelation or anything; I don't have a choice. I just gave the data to people, they came up with their solutions, and now we want to put them together. Think of the finalists in the last month of Netflix, or other teams that looked promising: they have good solutions, they already have the solutions, and you are not going to ask them to redo the whole thing with a view to blending; there is no time. So what you want to do is take their solutions, put them together, and get the answer. Your plan is the following. It's a regression problem, you are trying to get a rating, and therefore I'm going to combine the solutions I have with coefficients, the best combination of the solutions, so that I get good performance. You can think of this as very simple training at a higher level: you take the solutions as if they were the inputs, and you are trying to predict the output, which is the same output as before. So you are going to have a principled choice of the alphas, using now not a training set, not a test set, not a validation set, but an aggregation set: you set aside some points that were not used in the development of these solutions, and you use those just to decide the best coefficients for the solutions, such that the error on that aggregation set is the smallest possible. Now, if you do this, and you use mean squared error, guess what your algorithm for choosing the alphas will be. It's the good old linear regression. Notice that if you did this on the training set that these guys used, then the guys that got the best training error would get a big weight, and obviously that's a problem, because we know that a small training error is not indicative; but since these guys are frozen and I take a fresh set and then get the combination on it, it's completely valid, and then I take those coefficients and I have my blended solution. At the time, when the competition was still alive, we had the Netflix data, and people came up with solutions with a view to aggregation. So people came up with solutions, and now I want to combine them. Here is the puzzle. You worked on this, you have your solution, and you hope your alpha is big, because then your solution is affecting the output a lot. Now can you imagine what happens if your alpha is negative? You worked hard, you got a solution, and then when your solution is put together with the others, the best possible outcome is obtained by subtracting your solution from the mix. You might lose your self-esteem because of this, but the size of the weight, or the sign of the weight, is not the criterion for your solution being valuable. It could be that you are so correlated with other solutions that what the system is trying to do is combine all of them in order to get the signal part right and eliminate the noise, and depending on the noise in your solution, your weight can come out negative. You might say, my god, if I had done nothing I would have gotten a weight of zero, and now I get a negative weight? But that's not how to measure your contribution. So the question is: how do you evaluate which of those solutions is the most valuable? I had that practical problem myself, because I wanted to reward people according to their contribution; when I give this problem, everybody is trying to get into the blend, and if everybody does the same thing, everybody should get only a small share, so I want an objective criterion for that. What would you do in order to evaluate it? This was actually used in the competition as well: eventually there was one team that decided to have open membership and split the prize according to how much each member contributed, and that team actually ended up in second place, so it's not a bad idea. The idea here is that if you really want to know the value of a particular solution, here is what you do: you take it out. You evaluate the blend without your solution and get a performance, then you evaluate it with your solution, and the difference is your contribution. One of the ramifications is that if two people are doing identically the same thing, then each of them is useless by this measure, because when you take one out, the other one will hold the fort. But consider the people who did something completely fringe. Say the blend as a whole gets 8% toward the 10% needed. I ask the fringe guy, what is your individual performance? Oh, I got 3.5%. But if I take that solution out, the 8% drops to 7.5%, so it contributed half a percent. And if I take the guy who individually got 6% and take that solution out, the 8% goes to 7.95%; who cares. So this was the way to be able to reward your actual contribution to the mix.
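Here is a sketch of the two steps just described: choosing the alphas by least squares on the aggregation set (the mean-squared-error answer), and valuing a solution by removing it and refitting. The function and variable names are made up for illustration; this is not the actual Netflix procedure.

```python
import numpy as np

def fit_blend(preds_agg, y_agg):
    """Least-squares blending coefficients (alphas) for frozen solutions.
    preds_agg: (N, K) predictions of the K solutions on the aggregation set."""
    alphas, *_ = np.linalg.lstsq(preds_agg, y_agg, rcond=None)
    return alphas

def blend_error(preds, y, alphas):
    """Mean squared error of the blended prediction."""
    return np.mean((preds @ alphas - y) ** 2)

def contribution(preds_agg, y_agg, preds_eval, y_eval, k):
    """Value of solution k: how much the evaluation error rises when it is removed."""
    full = blend_error(preds_eval, y_eval, fit_blend(preds_agg, y_agg))
    keep = [j for j in range(preds_agg.shape[1]) if j != k]
    without = blend_error(preds_eval[:, keep], y_eval,
                          fit_blend(preds_agg[:, keep], y_agg))
    return without - full
```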
Okay, now there will be no Q&A today; I am going to use the time for acknowledgments of the people who contributed to this course, before I close on a personal note. The first acknowledgment goes to my colleagues, Prof. Malik Magdon-Ismail and Prof. Hsuan-Tien Lin. All I can say is that they are as responsible for the content of this course as I am. Enough said. Next I'd like to acknowledge the course staff. There are many people who helped out, but the people who contributed the most, Carlos, Ron, Costis, and Doris, contributed to everything you can imagine about the course, from suggestions about the format, to getting the right slide system, to designing the homeworks, to writing the registration and online system from scratch, everything you can imagine, and they filled in wherever it was needed, and they ended up working far more than they are paid for, so I am really grateful to them. The head TA, who is Carlos, happens to be familiar to you: you heard his voice in every lecture, he is the voice of the Q&A sessions, and I suspect that people are curious to see what the guy looks like, since they have heard his voice. So let me ask Carlos to come to the podium and say hello to the online audience. "You know my voice, now you know my face. Okay, so I am exempt from the Q&A session today? Fair enough. Thank you so much; it's an honor to work with Yaser, it's really great to work with him, so thank you, Yaser." Thank you, very good, thank you. Now, most of the people who will see this course, and everybody from now on, will see it through the tapes, and the medium that resulted in those tapes is the AMT staff, Leslie Maxwell being the director. I would like to acknowledge them in very passionate terms: they have done an enormous job, it was a great amount of work for them, they are all here, obviously, with the cameras and the rest, and I'm really grateful for their contribution. I'm also grateful to the computing support staff, and to Rich in particular, not only for providing the infrastructure for this course, but also for supporting the idea of the course very early on, before we even raised the money, so we had a head start: we started talking to people and doing things before we were even sure it would happen, and I appreciate the confidence that this might be a good idea to begin with. Now, this course cost money, and I was insistent, and Caltech was insistent, that it will be perpetually free for everyone. That was the whole idea: we wanted to give a Caltech-quality course to anybody who has the discipline to follow such a course, without costing them a penny, and we succeeded in that. In order to succeed in that, you actually need the money; if we didn't raise the money, this would not have happened. So I'd like to acknowledge the sources of the money and the people who drove the fundraising: the Information Science and Technology initiative, the Engineering and Applied Science division, and the provost's office. Matthew in particular took the lead on raising the money and getting the publicity; he took an interest in this and was invaluable in getting it going. At the division, both Aris and Manny were very helpful and supportive, and believe me, if you do something this intensive you need the support of everybody, otherwise you lose heart in the middle, so that was very instrumental in getting this going. And at the provost's office, both Ed and Manny reached for the educational funds and got the money, so we didn't have to worry about it. So it's not only financial support, it's also moral support and confidence that this will be something good. Now, there are many people
that I will have forgotten to acknowledge so I'm going to give a general acknowledgement and I apologize for any particular person that contributed and I forgot to give them their sort of due reward the TAS other than the ones I mentioned have contributed greatly and they took a load off my agenda by answering questions to the students and taking care of the homeworks and there are many staff members at the division level and at the department level who helped in all kinds of ways the Caltech alumni were absolutely great I got an incredible support from the Caltech alumni who are genuinely a family spread around the world and the alumni association that helped get the word out was very instrumental in getting the publicity right and I had incredible support from colleagues all over the world okay you know I usually don't like my email because 70% is spam and 20% are scams and then 10% are relevant okay but it was worth going through all of this in order to hear the wonderful words of support that are getting from the four corners of the world now on a personal note allow me to dedicate the course to the best friend I have ever had well I learned a lot from her and learning is precious and my hope is that this course was a positive learning experience for everyone okay and in particular I thank my Caltech students for really putting up with the inconvenience of cameras regimented lecture format and whatnot in order to share this learning experience with the whole world thank you