All right. Welcome everybody to the wonderful world of machine learning and data mining, the most fascinating subject you will ever see. My name is Pedro Domingos and I will be your instructor. Please feel free to ask questions and share your thoughts at any time. One of the funnest things that happens in this class is when people share the experiences they've had doing data mining in their jobs and whatnot, and ask questions and so on, so I hope some of that happens here. Let's get some logistics out of the way first. My email is pedrod at cs. Please feel free to email me at any point with any questions that you have. I also have office hours every week. I've set them from 5:30 to 6:20, right before class, which is the most convenient for you. My office is CSE 648, so please stop by anytime. If you happen to be on campus at some other time and you want to see me, we can do that by appointment as well, provided I'm in. Your TA is Rob Gens, back there. He's in charge of making your life hell, so everything bad that happens in this class is his fault; everything good is mine, maybe. His email is rcg. Again, feel free to send him an email. He has office hours at the same time. Not ideal to have them at the same time, but again, it's probably the most convenient for you. His office hours are in room 216. Okay? You don't have to write any of this information down, because it's all on the class web page at www.cs.washington.edu/csep546. That's the short-form URL; there's also a long form, and you can also just find it by following pointers from the department site and whatnot. There's also a mailing list that we highly encourage you to subscribe to. You can do this by following the link from the class web page; it's csep546 at cs. We will be posting important information about the projects, clarifications, and whatnot there. Okay?
The decision whether to send an email to the list or to us is a question of whether you think the question is only of interest to you or of interest to everybody. If you think it's of interest to everybody, please send it to the list. We reserve the right to embarrass you by publicly replying to a question that you sent us privately. Okay? So beware of that. There's also a discussion forum that you can join to, again, discuss the class, and that can actually be a very, very productive thing to do, with your colleagues and with us. So we highly recommend that. Okay. Evaluation, our favorite subject. How are we going to give you a lot of useless long hours of work where you learn nothing is the problem we're trying to solve, and here's how we solve it this time. We don't have a midterm or a final. There are four assignments, each worth 25% of the grade. We'll hand them out in weeks two, four, six, and eight, and they're each due two weeks later. Okay? And they roughly cover the corresponding two weeks of the course; more or less, not exactly. Okay? The assignments are a mix of three things. One is implementing machine learning algorithms. One of the main things that you're going to do here is learn various machine learning algorithms, and there's no better way to learn them than to actually implement them and work with them yourself. The second part, without which a course like this would not be complete, is applying those algorithms to real data sets and seeing what you find. Of course, you have limited time to do this, right? Some people do this as a full-time job, and even then it's not enough. But nevertheless, we hope to give you a flavor of what it's like to really do data mining and machine learning in practice. We haven't decided yet, but some of the data sets that we might be using are things like clickstream mining.
Can you predict what people are going to do on your e-commerce website, for example? Recommender systems: you might build a system that recommends movies to you, for example, based on how much you liked the movies that you've seen. Spam filtering: you can build a spam classifier that tests whether something is spam or not, and then maybe you can use it as your own spam filter for your own email and do better than the generic spam filters from the search engines. And so on. Some of it will be exercises, math-style exercises, because for some things that's the best. Some of the questions, maybe related to the programming part, maybe not, test your understanding and also help you to dig deeper and whatnot. Some of the projects will have more open-ended questions that those of you with a ton of free time on your hands (I know that's everybody) can use to go to town and go further. Just as an example, in this class a few years ago one of the data sets that we used was the famous Netflix collaborative filtering data set, the one with the million-dollar prize. And we had a great time implementing algorithms for it. One of the students, who had never done machine learning before, went on to be a member of one of the two top teams, and they only lost because they turned in their solution an hour later than the winning team. So the next one could be you, okay? And if you end up making millions of dollars with one of these algorithms, like some of these people have, don't forget your old friends, please. Okay. So, any questions on the assignments, evaluation, and so on? You have no questions now, but I guarantee you'll have plenty of questions later. Source materials. What are we going to be using in this class? Some of you probably already have these.
If you don't, I encourage you to get them as soon as possible. There are many good textbooks in this area, some of them more advanced, some of them more applied, et cetera. The one that I recommend as the first book to study is Machine Learning by Tom Mitchell. It's a fairly short book as they go, fairly simple and accessible. It's a very nice, efficient way to get the basics of machine learning. It's clear, it's not too math-heavy, and it has a broad view of machine learning, so in many ways it's an ideal textbook. The only problem with it, and it's a big one, is that it's outdated. It's about 10 years old or more, so there are many things that it doesn't cover, and some of the things that it does cover, it doesn't quite cover to the depth that we're going to need here. To complement it, we have Pattern Classification, which is actually a new edition of the earliest classic text in this area. So I highly recommend that you get both of these books if you haven't already. And then I highly recommend that you do the following: skim the relevant chapters before you come to class. On the website we've actually posted a rough plan for the whole course. It may change, but roughly that's where it's going to be. There are also draft slides up there, and whatnot; again, those may change. But you'll make the most of the class if you already have an idea, coming in, of what we're going to be talking about, and literally you can do that by just skimming through the slides and the book in 15 minutes. If you don't do that, then you spend a lot of your cognitive effort in class just getting your bearings. If you do, you'll actually make much better use of your time here. And then, after you've been to class, you can go back to the textbooks and study things more deeply, try to do the assignments and whatnot, and then ask us questions if you have them, and so forth.
We will also be complementing the textbooks as needed with various papers and whatnot. For example, there's a very good paper on the clickstream mining project that we're going to do, and so forth. We'll do this on a week-by-week basis. Questions about any of this? Because, of course, I want to get to the exciting part. So, number one: why should you bother with this? What is the point of studying machine learning and data mining? Now, I could tell you that it's fascinating and that it happens to be what I do for a living, but I'm biased, right? So why should you believe me? So what I say is: don't believe me, believe these people. Here are a few quotes, and there are many more like them, that should at least tell you that there's something here that you should learn. You're probably already aware that machine learning is a very hot area in computer science these days, but you may not be aware of just how hot it is and why it's important. So let's spend a little bit of this first class on that. Here's something from a guy you might know. His name is Bill Gates, and he says a breakthrough in machine learning would be worth 10 Microsofts. That would be the world's first multi-trillion-dollar company. I think he's guilty of understatement; we'll see why shortly. But hey, if it's worth 10 Microsofts, there might be something in it. DARPA is the famous agency that funded the development of the internet and a lot of other great things in computer science and other areas. Tony Tether, who was until a few years ago the director of DARPA, says machine learning is the next internet. If it's the next internet, maybe it's worth knowing about. John Hennessy, who is actually a well-known computer scientist but these days is the president of Stanford, says machine learning is the hot new thing in computer science. And then there's Prabhakar Raghavan, who was until recently the head of research at Yahoo.
Since I made this slide it's already outdated, because he's been hired away by, I think, Microsoft or Google, one of those. He says web rankings today are mostly a matter of machine learning. You might not know this, but every time you do a search and get results, you're using a machine learning algorithm. Greg Papadopoulos is another famous guy in information technology; he used to be the CTO of Sun, and now he's a venture capitalist, among other things. He says machine learning is going to result in a real revolution. These are people who are not necessarily very likely to engage in hyperbole. Jerry Yang, founder of Yahoo: machine learning is today's discontinuity. Technology and civilization advance not as a linear progression, but by making sudden huge leaps, and he says machine learning is the one that's happening today. Steve Ballmer, another guy you might have heard of, says machine learning today is one of the hottest aspects of computer science. Clearly, there's a lot of ballyhoo about machine learning. What is the ballyhoo about? What is this thing called machine learning? Why is it so important? It's traditional to start a class by defining the subject, and I'm going to do the same thing here. At the end of the day, you will know much better than you can say in one sentence what machine learning is, but for now let's just try to do that. And I'm going to do this definition in two words. It's not one sentence, it's two words: machine learning is automating automation. What we as computer scientists do is automate everybody's job and make them unemployed. Yay. Oops, no, that's not what I meant. What we do as computer scientists is make things possible that weren't possible before, because they would take a million people to do and we don't have those millions of people. With computers, we can actually do the work of a million people with one CPU, and so all sorts of things that were completely impossible before become possible.
Like computer games: imagine if somebody actually had to paint each of those pixels by hand. So the power is in automating things. The more we automate things, the cheaper they become, the more people can have them, and so on and so forth. But here's the question: who automates the automation? As we automate more and more things, perversely, the bottleneck becomes us: the computer scientists, the software engineering, the information integration. That's what's slowing us down. So this is what we want to do: we want to make the computer science, the programming, the software engineering process itself be automated. Right? So here's the picture: computer scientists put everybody else out of jobs, and machine learning people put the other computer scientists out of jobs. That might be a good reason to learn machine learning, right? Then you'll still have a job when no one else is left standing. Okay? And it's also very ethical to steal from the thieves, right? The two minuses cancel. And we're very nice people. Okay, so, joking aside, how are we going to do this? This sounds very good, but how do we automate automation? We have to get computers to program themselves. This is the idea in machine learning: we're not going to program the computers, we're going to get the computers to program themselves. So I always say to my students: if you're lazy and stupid, come work with me. If you're smart and hardworking, you can do systems, software, programming languages, networking, security, graphics; all of that stuff is for the smart people. We machine learning people are dumb and lazy. We want to push a button, have the algorithms do the stuff for us, and then go to the beach. And we're still making a lot of money and whatnot, right? So that's the agenda: getting computers to program themselves. So how are we going to do that? Think of all the smarts that you put into writing software. Right?
Can we actually get a computer to do that for you? That's the agenda. We want to remove this bottleneck of writing software, where all the person-months are now going, the thing that's keeping us from going further, and have the computers do it themselves. And then maybe things can take another leap and happen even faster and on a larger scale and so forth. So how are we going to do that? This sounds like a beautiful dream, sure, but how is it going to happen? We have one secret weapon to make it happen, and it's because we have that weapon that all of this is possible now and wasn't before. That weapon is data. We're going to let the data do the work for us. And as you've heard, this is the era of big data. There was that McKinsey report some months ago saying we are now in the era of big data, we need 150,000 people who have expertise in the analysis of big data, and what we have right now is like 1% of that. We're going to take that data, and it's by sucking all the juice out of it that we're going to have the computer automatically write programs that would otherwise take a lot of work on our part. Or even better: we might not even know how to write those programs, but by running machine learning programs on data, we'll actually have them. We'll see some examples of that. Okay? So that's the idea: we're going to take the data and turn it into programs. Now, even with that, you're probably still thinking, well, this sounds too good to be true. Right? So, first of all, we can do this in some cases but not others, and as the research progresses, we'll be able to do it in more cases. But even with the relatively simple programs that we can create today in this way, it's amazing what you can already do with them. And we'll see some examples of that shortly.
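To make "take the data and turn it into programs" concrete, here is a toy sketch of my own (not anything from the course materials): the "program" we induce is nothing but a single threshold, learned from examples of inputs paired with the outputs we want.

```python
# Toy illustration of "data in, program out" (a hypothetical example,
# not from the course): instead of hand-writing a rule that labels
# numbers as "big" or "small", we induce the rule from examples.

def learn_threshold(examples):
    """Induce a classifier from (input, desired_output) pairs.

    The learned "program" is just a threshold: the midpoint between
    the largest "small" input and the smallest "big" input in the data.
    """
    smalls = [x for x, label in examples if label == "small"]
    bigs = [x for x, label in examples if label == "big"]
    threshold = (max(smalls) + min(bigs)) / 2
    # Return the induced program as a function.
    return lambda x: "big" if x > threshold else "small"

# Training data: inputs paired with the outputs we want.
data = [(1, "small"), (2, "small"), (8, "big"), (9, "big")]
classify = learn_threshold(data)

print(classify(3))  # small
print(classify(7))  # big
```

Notice that nobody wrote the classification rule by hand; it was induced from the input-output pairs, which is the whole point of the flipped diagram. Real learners induce far richer programs, but the shape of the process is the same.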
But nevertheless, you might still be justified in having some skepticism. Okay? So let's see a little bit of how this is done before we dive into the details. Here, first, is the picture of traditional programming. This is what we all know and love; this is what we do for a living. There's a computer; the data and the program go in, and out comes the output that you want. Our job is to write the programs, these programs that go inside here and combine with the data to produce the results. Okay? So this is what we all do. Machine learning turns that around. What machine learning does is switch two little things in this diagram: the output is now an input, and the program, which was an input, now becomes the output. You see what this is doing? The machine learning program takes in the data, the raw data that the program is going to have to operate on, and the desired output: we just give the computer examples of what we want it to do. And then hopefully, on a good day, if the machine learning works, and you can get it to work, then from the examples of the input and the output that you want, you can induce the program that should be running on the computer. Okay? And then, once you have that program, guess what: you put it in there, and now the same old thing happens. Okay? But the beautiful thing is that you didn't have to write the program yourself. The long, hard exercise of writing the program became the much easier exercise of just controlling the machine learning to make sure that it does what you want. Okay? Here's a metaphor for what happens. Not everyone likes this metaphor, because it's very low-tech, but I actually think it's a very instructive metaphor.
And the metaphor is this: machine learning is a lot like gardening, or agriculture. What could be lower-tech than agriculture? Why? Well, if you think about it, all the other artifacts that we have, cars, TVs, computers themselves, we have to put together piece by piece. We have to create the nuts and bolts from the raw materials, assemble the nuts and bolts into subsystems, and those into systems. This is a ton of work, and it doesn't scale, because the more complex the system becomes, the harder it becomes to actually put it all together without making mistakes and whatnot. Okay? Compare this with what happens in gardening. Gardeners have a very easy life; nature does most of the work for them. The gardener just provides the seeds and the nutrients and a little bit of care, and from that you get flowers, plants, trees; or in agriculture, your wheat and your corn and your rice and so forth. So this is a very smart thing to do if you can: something else does most of the work for you, and you just provide the elements and a little bit of tender loving care. And this is what we actually do in machine learning. Think about it: in your beautiful plant or flower, all the information is already in the seed. All that then happens is that the seed sucks up nutrients and grows into the beautiful thing. So our seeds are going to be the algorithms. Of course, the seed by itself is not a plant: you need to feed it, it needs water, it needs nutrients. This is where the data comes in. The data is our nutrients. We're going to feed that seed with a lot of data, a lot of nutrients, and then the programs will grow.
Your job is to manage the whole process: make sure the plant doesn't die, or if it dies, plant another one, and so on and so forth. But it's much easier to do that than to assemble a whole plant by yourself, molecule by molecule. That would be a very hard thing to do, and yet this is actually what's happening while the gardener is asleep, okay? So keep this metaphor in mind as we go forward; I think it illustrates well why machine learning is potentially so powerful and such a great thing for lazy people like us. Okay, so that was just a short overview of what machine learning is. Let's illustrate it more concretely with a number of applications of machine learning, just so you can see more specifically some of the things that get done with it. Some of these you might be familiar with, some of them maybe not; in any case, they're just a sample of the many things. I mean, there are more things being done with machine learning today than anybody can keep track of. I'll start with some of the ones that are more current and commercially important, and then gradually move on to the ones that are more speculative, though even the more speculative ones are not that speculative these days. So, of course, a very good place to start, which I already mentioned, is web search. These things are closely guarded secrets, right? But what seems to be the case, from what people like Prabhakar say, is that the single most important ingredient in designing a search engine is the following. When the engine gives you a set of pages, and there are summaries, you click on some but not others. You click on the one that looks most interesting to you, and if that's the one you want, that satisfies your needs, then you're done. Otherwise, you might go back and click on others, or even refine the query, right?
Now, if you think of this as input to a machine learning algorithm, this is a beautiful thing to have, right? Because what this input-output pair says is: if I put up these web pages for you, then you'll click on these ones. So what the machine learning algorithm can learn to do is predict which pages you will click on, and then it just puts those up. It puts up the ones that you're most likely to click on, instead of making you wade through all the rest. And the other thing is that search engines have a lot of data: they get billions of queries every day, so you have a lot of data to do this with. So at the end of the day, by using machine learning, you end up with a much better search engine than you had in the days when machine learning wasn't used for search. Here's another, very different example: computational biology. Very big area for machine learning. There are many different applications of machine learning in computational biology, but think of just this one: drug design. In the old days, companies designed drugs by trying things out in the lab. It's a slow, painful process, very trial-and-error, very random. For a while this worked, but these days they have largely run out of what they can do with that. So what they do now is what's called rational drug design. One of the big components of rational drug design is that you try to design the drugs and be fairly sure that they're going to work before you actually try them in the lab. For the cost and time of trying one drug in the lab, you can try millions of them on the computer. And this is where the machine learning comes in. I know something about molecular biology; I have examples of molecules and what they do, like, for example, how they dock to viruses and inhibit them and whatnot. And now what I can do is try to predict, for other molecules, by generalizing from the ones that I saw, how those ones are going to behave.
The input is the molecule and the output is what I want: how will it dock, and whatnot. So this is one very important application in computational biology, but there are many others. Here's another great one: finance. Finance was actually one of the early major applications of machine learning, even before the web was around (and everything on the web uses machine learning these days). Think of a bank. One of the things that all banks do is send you credit card offers, and they use machine learning to decide whom to send which offers. They have data on you, and from that they try to figure out, using a machine learning program, whether you are a good credit risk or not; and if you're a good credit risk, they will send the offer to you. If you just go to the bank, as a company or an individual, and make a loan application, chances are your application will be evaluated by a machine learning program. At least for loans below a certain size, they don't use human evaluators much anymore; they use machine learning programs, trained on past applications and whether the outcome was a default or payment on time and whatnot. So you can use machine learning for that. And then they have all this money from you and they need to decide how to invest it. Machine learning comes in again: you can use it to try to predict the stock market, currency fluctuations, which companies are going to make money, et cetera, et cetera. So machine learning gets used very extensively for this type of thing. Next, they're worried about whether they're going to lose you or not. So they can use machine learning, and they do, to predict whether you are likely to leave, to drop them for another bank or another credit card, and make an offer to you at that point. They can even use it to decide that you're not really worth keeping and subtly try to discourage you. I'm not going to say whether they do this or not, but they might.
Finally, they're also worried about fraud. There's a lot of fraud with credit cards, and machine learning is in fact a big weapon in fighting fraud; actually, not just in finance, but in everything. Because, again, what you do is figure out what the signs are that a credit card has been stolen, and then they suspend the credit card. You've probably had the experience of getting a call or an email from a bank saying: your transactions look suspicious, can you confirm that these are yours? And there's more, but this is just a sample of the things that machine learning gets used for in finance. Okay, to take a totally different application: space exploration. Here are two examples from space exploration. You send a probe out there to Mars, and the probe is on its own. Light takes, whatever, 10 minutes to travel between Mars and Earth. If the robot has to be instructed by people, that slows it down by orders of magnitude. So the more autonomous the robot is, the more productive it will be, and it needs learning algorithms to figure out what to do on its own. Another example: astronomy and planetary science are among the big producers of masses and masses of data. You send a probe to Venus and it takes pictures of the entire surface; this is a real example. And now the planetary scientists want to figure out where the volcanoes are on Venus, because that allows them to test theories of planet formation. In the old days, you would have to pay someone to look at these plates and say: volcano, not volcano, volcano, not volcano. And of course that only scales so far. These days you get a machine learning program to figure out what is a volcano and what is not. It learns from a much smaller number of examples that were labeled by people, but then it goes off on its own. And in the same way you can form catalogs of stars and galaxies and other objects in space and whatnot. Robotics is another example, and at this point it should be obvious, right? Robots: the more autonomous and flexible they are, the better.
There are actually lots of robots in industry today, but they are very inflexible. They repeat one movement or sequence in a very controlled environment. They're not flexible; they can't react to failures. You want to make them more flexible without having to program all of this yourself. In many cases you don't even know how to, right? You want the robot to grasp this or that or some different object; you can use machine learning for that, and that's what people do. A very famous example of this is the self-driving car. These days we're at the point of having cars that will actually be your taxi and not actually have a driver. And the people who built these cars will tell you there's machine learning in every nook and cranny of what they're doing. Information extraction: how do we make computers smarter? It's one thing to use a search engine, type in a few keywords, and get back a bunch of documents, but that's not really what you'd like to have, right? What you'd like to have is the ability to ask questions, and have this thing answer your questions by reasoning over databases that are collected from the web. This is a very big agenda for a lot of companies, and it would definitely be good for us as users. Again, the way to make that work is to use a lot of machine learning to do the information extraction automatically. Here's one that used to be somewhat futuristic, but no longer: social networks. Ten years ago machine learning was already kind of exploding, but social networks were still on nobody's radar. Now, of course, social networks are one of the biggest things on earth. You've probably heard that Facebook is valued at something like $100 billion. Why are they worth $100 billion?
If you read the stories on this, the number one reason they're worth $100 billion is that they have this mother lode of data about how people relate to each other, live their lives, do things with their friends, what they prefer, what they don't. Their ability to mine that data and then use it to select, for example, which banner ads to show, among other things, is what's valuable, okay? So first you have a company that gathers a huge amount of social information, like the social graph and lots of other things. But then the question is: what value can you extract out of that data? The better your machine learning, the more value you can extract. And if Facebook is already worth $100 billion with the machine learning that they have today, imagine what it would be worth with machine learning that was much better, okay? So it's coming. Okay, so let me give, as a last example, one that's completely different from all of these. Machine learning is now becoming popular for basically every problem in computer science and beyond, but in particular for things like systems and software problems. You can use machine learning to try to optimize the behavior of your data center, to make it faster to respond, to make it consume less power. You can use machine learning for things like debugging. We all know that debugging is one of the most annoying and most labor-intensive processes in software development; wouldn't it be great if we could partly automate it? Notice it doesn't have to be the case that the program will find and fix the bug all by itself. Even if the only thing the program does is tell you the 10 most likely places where the bug is, that will already be very useful, right? It will potentially save you from looking for a needle in a haystack. And again, you can use machine learning programs for this. You can use traces of the program running.
You can use traces of how people debugged their programs in the past, and again try to predict, from the program and from what happened then (was there a crash or not, what did the user do, what did the programmer do, what did the compiler produce), where the bugs are. These are just a few examples, but what I'd really like you to think about is: what is your domain of interest? What problem are you interested in? And how would you apply machine learning there? Maybe it's a problem where machine learning has already been applied, but chances are you can come up with a different angle on it. Maybe it's a problem where machine learning has never been applied, and there's a huge opportunity for you there, okay? And if you do decide to do that, and have questions and thoughts, please talk to us. We'll be very excited to find out what you're up to. Okay? All right. By the way, this part of the class is kind of cocktail-party style. The rest of the class is not going to be like this; it's going to be more concrete and more problem-solving. But enjoy this part while it lasts. And also, I think it's useful to have this sort of overview of what machine learning is and what it's good for before we dive into the details. Any questions? Not so far. Okay, let's keep going. So, hopefully, in these few minutes I've at least persuaded you that machine learning is intriguing and potentially worth studying. So now let's start studying it. What is machine learning? How do you do this amazing thing of getting a computer to program itself?
And if you're interested in machine learning, if you've just decided, for example, that you might want to use machine learning on your problem of interest, in a way it's kind of hard to get your bearings, because many of the textbooks are pretty impenetrable for someone who doesn't already have a lot of mathematical background. And then if you look at the literature, what you see is tens of thousands of different machine learning algorithms. This is, of course, a very active area of research, and there's thousands more coming out every year. So the number of machine learning algorithms seems to be growing exponentially, and now how do you get your bearings, right? Do you have to learn each one of these new things, or face doing the wrong thing or being out of date? Actually, you don't. And this is one of the first things that I think you can take away from this class: machine learning algorithms are essentially just combinations of certain components. If you understand what those components are, then this combinatorial space of machine learning algorithms reduces to a much smaller space of components that you can combine in different ways. So you want to know what those components are, and you want to know how they go together. And then as you hear things about machine learning and you see new things coming out, you have a way to keep things organized and not get overwhelmed. So what are those components? There are basically three. Almost all machine learning algorithms have these components in one form or another. They are representation, evaluation, and optimization. Okay? So let's look at each one of these and see what it is and what examples of it are. So, representation. 
If your machine learning program is going to create its own program, the first decision that needs to be made is what language that program is going to be in. You as a programmer might decide, well, I want to use Java, or I want to use Perl, or whatever. A similar decision has to be made by the machine learning algorithm, or by you for the machine learning algorithm. Of course, the languages in this case are not going to be things like Java, which are designed for humans. They're going to be languages that hopefully are good for the machine learning program to use. So what are some of those languages? Well, here's one: decision trees. Decision trees are the first one we'll study, and they are, according to surveys, by a good margin the most widely used representation. Okay? And by the way, roughly speaking, the way we're going to organize this class is by representation. So we look at each of these representations in turn and then see how we do the evaluation and the optimization with them. Also, as a rule of thumb, new representations come up much less often than new evaluation functions, which in turn come up less often than new forms of optimization. So in a way, the representation is the most stable core of machine learning. This is not to say that there aren't new representations to be invented; I'm sure there are, and I look forward to them. So here's another one that's also a very natural one: sets of rules. You can think of a decision tree as being a set of nested if-then statements; that's at some level what a decision tree is, just a basic programming construct. Sets of rules are also if-then statements, except they're not nested. They're just a sequence, right? The first one that fires wins, or else you do some combination of the ones that fire. 
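To make the "nested if-then" picture concrete, here is a tiny hand-written decision tree for spam classification. The features and thresholds are made up purely for illustration; in practice a learning algorithm would induce them from data.

```python
# A decision tree is essentially nested if-then statements. This hand-written
# tree (hypothetical features and thresholds, for illustration only) classifies
# an email as spam or not spam.
def classify(email):
    if email["num_exclamations"] > 3:
        if email["all_caps_words"] > 5:
            return "spam"
        else:
            return "not spam"
    else:
        if email["sender_known"]:
            return "not spam"
        else:
            return "spam"

print(classify({"num_exclamations": 5, "all_caps_words": 10, "sender_known": False}))  # spam
```

Each path from the root to a leaf is one conjunction of tests; the tree as a whole partitions the space of emails into regions with a single predicted class each.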
Sets of rules can be very powerful because, in particular, if the rules are in first-order logic, in which case they're also called logic programs, you can genuinely write any program in this way. So it's a very powerful representation. You don't necessarily want to write a program in this way, but in principle there are languages like Prolog, for example, that can do anything under the sun, and we will see how to learn such programs from data. Another one, and in fact one of the oldest ones, is instances. This is the ultimate in laziness as an approach to machine learning. The instances approach is: you just remember the cases that you saw. And then when a new test case comes up, you just try to find the nearest case and you apply it. So if you're doing medical diagnosis, you just remember all your past patients. And when a new patient comes in, you find your old patient that's most similar to her, and you apply the same diagnosis. Of course, there are many variations on this, and we will spend a whole lecture on it, but the basic idea is delightfully simple and surprisingly powerful. So this is the use of the instances themselves as the representation for what you want to learn. Then, getting on towards more sophisticated things, there are a lot of approaches in machine learning that use probability. Probability is a very powerful tool. It lets us deal with uncertainty, it lets us weigh evidence, and it means we don't have to always be making black-and-white decisions. The ones we'll be concerned with, graphical models, are a very large family of probabilistic models where you use a graph to represent the dependencies in your distribution. So we will also look at some of those; in particular, there are two main families, called Bayes nets and Markov nets, and we will see what some of them are. As some of you might know, the Turing Award this year was won by Judea Pearl, a great computer scientist. 
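The nearest-neighbor idea just described can be sketched in a few lines: store the training cases, and classify a new case by the label of its closest stored case. The toy "patients" and their two features (temperature, heart rate) are hypothetical, just to show the remember-and-match mechanism.

```python
import math

# Minimal sketch of instance-based (nearest-neighbor) learning: "learning" is
# just storing the cases; prediction finds the closest stored case.
def nearest_neighbor(train, query):
    # train: list of (feature_vector, label); query: feature_vector
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(train, key=lambda ex: dist(ex[0], query))[1]

# Remembered patients: (temperature, heart rate) -> diagnosis (toy data).
patients = [((37.0, 120), "healthy"), ((39.5, 95), "sick")]
print(nearest_neighbor(patients, (39.0, 100)))  # sick
```

Note that all the work happens at prediction time, which is why this family is often called "lazy" learning.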
His main contribution was developing Bayesian networks. Neural networks. This is a very fun one. The idea in neural networks is: let's reverse engineer the competition. If you're going into the market with a new product and the competition is way ahead of you, it's a very good tactic to just look under the hood, right? See how they're building their cars and see if you can make yours the same. While you're behind, this is a good thing to do. And make no mistake about it: when it comes to learning, we are way behind ourselves. What does that mean? We as the producers of machine learning programs are way behind ourselves as humans. The human brain is the most amazing machine learning system ever invented. And animal brains too, for that matter. The human brain is really doing something extraordinary, and we haven't figured out what it is yet. If we did, it wouldn't be worth 10 Microsofts; it would be worth, I don't know, 1,000 Microsofts, or potentially the whole world economy. So the idea in neural networks, and of course the human brain is a network of neurons, hence the name, is: let's look at what the human brain does, and then let's see if we can turn that into an algorithm. A very seductive idea. And again, we will spend a week seeing how we can do this and the state-of-the-art things that come out of it. Even more recent: all of these approaches have been around for decades, and they continue to be developed, so new algorithms in any of these families continue to appear. But here's one that became popular more recently and is also very, very powerful. And this is support vector machines. Support vector machines, in a way, are a more powerful version of instance-based learning. They are different from some of the other methods in that they use much more sophisticated mathematics. 
And many people just use them as a black box because often they work very well, but it's a pity to have to use them as a black box. In this class we don't want to use anything as a black box. We want to understand how things work so that we can change them, make them fit our needs, and then, when new things come along tomorrow, be able to understand those as well. So what we will do when we study support vector machines is cover just enough of the mathematical background that you can see what support vector machines are doing and why. And then, one thing that people have also found in recent years is that the best thing you can do when it comes to learning is not to learn one program or one model, but to learn a whole bunch of them and combine them. This is the wisdom of crowds applied to algorithms: the wisdom of algorithms, right? Often, when you have many people vote on something, they come up with a better estimate than any individual person. The same thing is true of machine learning algorithms. If you run a hundred algorithms, each under a hundred different conditions, and combine the results, you will usually get better results than if you just run one of them. This is very powerful. For example, both the winner and the runner-up of the Netflix Prize were systems that combined a lot of different algorithms. In fact, what both of those teams were doing was competing to see who could aggregate the most people the quickest. They all started out as small groups competing with each other, and then they realized that by aggregating they could do better. So they wound up with larger numbers of models being combined than ever before. I think in the future we're going to see more of this. So being able to create model ensembles is a very important topic. It's often a very easy way to get a large extra payoff after you've put all the work into developing algorithms and trying different things. Instead of picking one, combine them. 
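As a minimal sketch of the ensemble idea, here is majority voting over several classifiers. The three "models" are trivial stand-ins, invented purely for illustration; in practice each would be a learned model.

```python
from collections import Counter

# Sketch of a model ensemble: combine the predictions of several classifiers
# by majority vote ("wisdom of algorithms").
def majority_vote(models, x):
    votes = [m(x) for m in models]
    return Counter(votes).most_common(1)[0][0]

# Three toy classifiers, each looking at a different (hypothetical) feature.
models = [
    lambda x: "spam" if x["exclamations"] > 3 else "ham",
    lambda x: "spam" if x["caps_ratio"] > 0.5 else "ham",
    lambda x: "spam" if not x["sender_known"] else "ham",
]
email = {"exclamations": 5, "caps_ratio": 0.2, "sender_known": False}
print(majority_vote(models, email))  # spam
```

Real ensembles (bagging, boosting, stacking) weight and train the members more carefully, but the combination step is essentially this.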
And we will also see how to do that. So these are the main things that we will cover here, though not all of them. But this at least gives you a flavor of the kinds of representations that get used in machine learning. Questions so far? All right. But of course, once you've picked a representation for your program, your work has just begun. Right? Sure, you've picked a language. Now what? Well, this is where the next thing comes in, and the next thing is evaluation. What the machine learning system is going to do is generate candidate programs. From the data, it's going, according to some algorithm, to generate a candidate program. Is this what you want? These were the inputs and outputs that you gave me; I have just created a program that seems to transform these inputs into those outputs pretty well. Is this what you want? Now, if you could personally inspect every program that the machine learning system generates, fine, but of course you don't want to do that. It's going to generate easily millions of candidates. So what we need is an automatic way to evaluate the programs. So the next big choice is the choice of your evaluation measure. How am I going to gauge that version A is better than version B, and therefore select version A? And then maybe try variations of it, C and D, and select one of those. So here are some of the measures that people use. The first measure is accuracy. Very simple: out of those examples, how many did you get right? Let's say I'm an email provider and I have lots of examples of emails that are spam and not spam. I know which ones are spam because the users conveniently mark them as spam. And now I create a program that predicts whether something is spam or not, and accuracy is just the fraction of the spams that it correctly marks as spam, plus the non-spams that it correctly marks as non-spam, out of the total. 
So this is like the fraction of diagnoses that were correct, or the fraction of the credit cards you issued whose holders really did not go bankrupt, and so forth. So this is the simplest measure. Two other measures that are often used, for example in information retrieval applications, are precision and recall. Precision and recall are a lot like accuracy. Precision is: out of all the things that you predicted were yes, what fraction really were yes? So what fraction of your predicted spams really were spams? What fraction of the pages that you said were interesting really are interesting? And then recall is: out of all the ones that really were spams, how many did you identify as spams? So notice that these two measures are complementary, and you really want to do well on both of them. The reason these measures are useful compared to something like accuracy is that in many domains there is a very large number of true negatives. Most of the things are negative: they're not interesting, you don't want to mark them as interesting, you won't, and you want to ignore those. Accuracy doesn't ignore those; precision and recall do. Now suppose what you're trying to predict, instead of just being yes/no, like spam/not spam or interesting/not interesting, is something like a rating. For example, if I send my catalog to this consumer, how much will they buy? In this case you're predicting a numeric quantity, so you want a measure that varies with how much of an error you make. Again, there are many things that you can use here, but the most popular one is squared error: the sum over all your examples of the square of the difference between what you predicted and what the true value was. Okay? When you're building probabilistic models, what you're trying to do is predict the probability of something. For example, what is the probability that this page is interesting? 
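The measures just discussed can be written down directly. A rough sketch on toy data, with True meaning "spam":

```python
# Accuracy: fraction of predictions that match the truth.
def accuracy(pred, actual):
    return sum(p == a for p, a in zip(pred, actual)) / len(actual)

# Precision: of everything predicted yes, what fraction really was yes?
def precision(pred, actual):
    predicted_yes = [a for p, a in zip(pred, actual) if p]
    return sum(predicted_yes) / len(predicted_yes)

# Recall: of everything that really was yes, what fraction did we predict yes?
def recall(pred, actual):
    actual_yes = [p for p, a in zip(pred, actual) if a]
    return sum(actual_yes) / len(actual_yes)

# Squared error, for numeric predictions such as ratings.
def squared_error(pred, actual):
    return sum((p - a) ** 2 for p, a in zip(pred, actual))

pred   = [True, True, False, False]
actual = [True, False, True, False]
print(accuracy(pred, actual))   # 0.5
print(precision(pred, actual))  # 0.5
print(recall(pred, actual))     # 0.5
```

Notice that on a domain that is 99% true negatives, always predicting "no" gets 99% accuracy but zero recall, which is exactly why precision and recall are preferred there.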
Then the standard measure to use is something called likelihood, which you may remember from statistics 101, but even if you don't, we'll go over it here. Likelihood is basically: how likely is what you're seeing, according to your model? If what you're seeing is unlikely according to your model, your model is probably not a very good fit to the real world. If what you're seeing is very likely according to your model, then your model, to the best of your knowledge, probably does fit the real world. Okay? So that's the idea of likelihood. Likelihood is often good when your learner is not very powerful, but when your learner is powerful, you can get a very high likelihood just by memorizing your examples, and then you don't generalize. But you want to generalize. One way to encourage generalization is, instead of using the likelihood, to use what is called the posterior probability. Again, we will look at the details of this when the time comes, but the posterior probability is a combination of two things. One of them is the likelihood, and the other is the prior probability. The prior probability is how likely you think something is a priori. This is where you get to put your knowledge, your hypotheses, your assumptions into the system: you can put them into your prior. Before you've seen any data, you believe that certain things are more likely than others. That's your prior distribution. And then when you combine that with the likelihood, you get your posterior, and it's the posterior that you use to evaluate your models. So when you're using posterior probability, you're actually evaluating your candidate programs using a combination of the data and your own beliefs about the problem. So this can be very powerful. It can also be complicated, but we will see how to do it. So these are all things that people have used for a while. 
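A small sketch of the likelihood idea, assuming the simplest possible model: a single parameter p, the probability that any given email is spam. The log-likelihood is used instead of the raw likelihood, a standard trick to avoid numerical underflow on large datasets.

```python
import math

# Log-likelihood of the observations under a Bernoulli model with parameter p.
# observations: list of booleans (True = spam).
def log_likelihood(p, observations):
    return sum(math.log(p if obs else 1 - p) for obs in observations)

data = [True, True, False, True]  # three spams, one non-spam (toy data)
# The data favor the model p = 0.75 (the observed fraction) over p = 0.5:
print(log_likelihood(0.75, data) > log_likelihood(0.5, data))  # True
```

A posterior-based score would add a log-prior term to this quantity, penalizing parameter values you consider implausible before seeing any data.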
But here's something that people discovered when they started applying machine learning in practice, on a big scale. They often found that the machine learning algorithms were not really producing what they needed. A very common reason was this: if you look at something like accuracy, what percentage of my spams or non-spams did I predict correctly, you could actually be very accurate and still do a very poor job. The reason is that the cost of a false positive, saying that something is spam when it isn't, can be very different from the cost of a false negative, saying that something isn't spam when it is. If something is spam and you say it isn't, well, then it just gets through and annoys the user a little bit. But suppose something is a message from your boss saying "I need this by tomorrow, and you're fired if you don't do it," and it gets tagged as spam because it has too many exclamation marks and too many words in capitals. This is a realistic thing. Then you could lose your job because of that. So the moral of the story is you want to be conservative: you want to err on the side of predicting that something is not spam. And as it turns out, this is a very, very common situation. For example, if you have a test for a disease like cancer, you want to err on the side of telling people, "yes, you may have cancer, do another test," as opposed to letting them walk away with the cancer undetected. So the moral of the story is that we want to take cost into account when making decisions, and ideally we will do that right inside the learning algorithm itself. So this is another type of measure that you could use. More generally, you could use what in decision theory is called utility. Different decisions, different outcomes, have different utilities. 
This is what being rational means: you want to make the decision that maximizes your expected utility, your average utility over the possible outcomes. And again, you could apply that after the fact, or you could use it inside the learning algorithm. More recent methods use other measures still. For example, in support vector machines, we use the margin. The margin roughly has to do with how far your boundary is from the examples. When you draw, for example, a boundary between spam on this side and not-spam on the other side, you would actually like your boundary to be as far from any given example as possible, because if it runs very close to an example, maybe that's not quite where the boundary should be. So that's another measure you can use. And then there are the ever-popular measures that come from the area of information theory. Information theory has lots of interesting notions, like entropy, the information content, and KL divergence, also known as the relative entropy between distributions. We can also use these to guide our learning, and indeed we will see examples of that. And there are many more. Again, people come up with new measures every year. These are the more popular ones; we're not going to cover everything, but it's good to have a notion of what some of the main ones are. Often, an exercise you can do is to take an algorithm that uses one of these measures and change it to use another measure, and see what has to change in order to do that. For example, you might try to turn an algorithm that is not cost-sensitive, that just maximizes accuracy, into one that is cost-sensitive. All right. So we've chosen a representation, we've chosen an evaluation measure, but in some sense the most important part is yet to come. The most important part is: okay, so now how do I actually generate candidate programs? That part is the optimization or search part. 
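As a sketch of how cost enters the decision: instead of predicting the most probable class, pick the action with the lowest expected cost. The cost numbers here are invented for illustration, encoding the asymmetry above: filtering a real message (false positive) is assumed far more costly than delivering a spam.

```python
# Cost-sensitive decision making: choose the action minimizing expected cost.
def best_action(p_spam, costs):
    # costs[action][true_class] = cost of taking `action` when truth is `true_class`
    expected = {
        action: p_spam * costs[action]["spam"] + (1 - p_spam) * costs[action]["ham"]
        for action in costs
    }
    return min(expected, key=expected.get)

costs = {
    "filter":  {"spam": 0, "ham": 100},  # filtering a real message is very costly
    "deliver": {"spam": 1, "ham": 0},    # delivering a spam is mildly annoying
}
print(best_action(0.9, costs))    # deliver: 90% sure it's spam is not sure enough
print(best_action(0.999, costs))  # filter
```

This is exactly "err on the side of predicting not spam": with these costs the classifier only filters when it is extremely confident.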
Now you need an algorithm that will actually generate lots of candidates for you that you can evaluate, and then finally you pick one of them, return it to the user, and that's the program that you're going to use. Now, of course, if you had unbounded resources you could just try every program on Earth, but you probably don't want to do that, because you don't have the resources. And in fact, even if you had the resources, you might not want to try that, because you might overfit; you can actually get better results by not searching exhaustively, as we will see later. So that's why we're talking about picking which programs to generate. And of course, this should be guided by the evaluation measure. This is why we have the evaluation measure inside the system, not just outside it. My search algorithm is going to be evaluating candidates, picking some of them, generating variations of those, and so forth. Now again, there are many different types of search or optimization algorithms that you can use. People in operations research call it optimization; in computer science we call it search; but it's really all the same thing. The type of optimization that you want to do naturally depends on what type of representation you're using and what type of evaluation measure you have. In particular, if your representation is discrete, what you need to do is combinatorial optimization. This is the process of combining pieces into larger pieces, seeing how good they are, taking a piece, modifying it, seeing how good that is, and so forth. The simplest type of combinatorial optimization is just greedy search; it's what I've been using as an example. You generate a candidate, you see how good it is; you generate 10 candidates, you pick the best one; then from that candidate you generate 10 more, and you keep going. That's greedy search, but there are many other types of combinatorial optimization that you can use. 
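The greedy search loop just described might look like this. The toy objective and neighborhood function are hypothetical stand-ins for a real evaluation measure and a real candidate generator.

```python
# Greedy (hill-climbing) search over candidates: generate variations of the
# current candidate, keep the best one by the evaluation measure, repeat.
def greedy_search(initial, evaluate, neighbors, steps=100):
    current = initial
    for _ in range(steps):
        candidates = neighbors(current) + [current]
        current = max(candidates, key=evaluate)
    return current

# Toy problem: maximize -(x - 3)^2 over the integers.
evaluate = lambda x: -(x - 3) ** 2
neighbors = lambda x: [x - 1, x + 1]
print(greedy_search(0, evaluate, neighbors))  # 3
```

Including the current candidate among the options means the search simply stays put once no neighbor improves the score, which also illustrates greedy search's main weakness: it can get stuck at a local optimum.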
If you have numerical parameters to optimize, which many if not most machine learning algorithms do, then what you use is what's called convex optimization. It's really just numeric or continuous optimization; convex is a technical term whose meaning we will see later on. The simplest form of convex optimization is gradient descent, the continuous analog of greedy search. Gradient descent is: if you want to find the bottom of the valley, just let the ball roll down the hill. Wherever it's steepest, that's where you go. This is good for a lot of things, but it's also pretty crude. It can fail in many ways: it can be very, very slow, and it can get stuck, so we'll see some better ways of doing convex optimization. And finally, most recently, people in machine learning have started using a lot of constrained optimization. So what is constrained optimization? Constrained optimization, in a way, is a combination of combinatorial and continuous optimization. It's continuous optimization where you are subject to constraints. I have this surface, and I'm not just trying to find the optimum on that surface; I'm trying to find the optimum subject to staying within a certain box. And that box is what the constraints define. The famous example of this is linear programming. I have this linear function that I'm trying to optimize, say the profits from my factory, but I have these constraints on how many machines I have, how many components I can produce per hour, and whatnot. So I'm trying to optimize a function subject to constraints. If the function and the constraints are linear, then what we have is linear programming, but they might be nonlinear as well, and some of those techniques get used in machine learning. So this is just a very brief overview of the space, but I think it's very important to have a map of the territory before we dive in. 
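And the continuous analog, gradient descent, on a toy one-dimensional objective. The assumptions here are a known derivative and a hand-picked learning rate, both chosen for illustration.

```python
# Gradient descent: repeatedly step in the direction of steepest descent.
# Toy objective: (x - 3)^2, whose minimum is at x = 3.
def gradient_descent(gradient, x0, rate=0.1, steps=200):
    x = x0
    for _ in range(steps):
        x -= rate * gradient(x)
    return x

gradient = lambda x: 2 * (x - 3)  # derivative of (x - 3)^2
x = gradient_descent(gradient, x0=0.0)
print(round(x, 4))  # 3.0
```

The crudeness mentioned above is visible even here: pick the rate too small and convergence is very slow; too large and the iterates overshoot and diverge.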
What we're going to be doing most weeks of this class is looking at specific representations, optimization algorithms for them, and the issues that arise; but at this point we at least have a general map of the territory. Questions so far? Now, there's more than one type of machine learning, and again, it's good to be aware of what the main types are. The examples that I've been giving so far, and that little diagram that I had with the box and the computer and the program, these are really examples of what is called supervised learning, also known as inductive learning. This is the largest subfield of machine learning; it's the most mature subfield, and it's the one that's most widely used, so it's the one we will spend most of our time on. So what is it? This is learning where you have supervision, which is why we call it supervised: the learner is given examples of the desired output. The learner is told: this is spam, this is not spam; this person has AIDS according to this test, this person doesn't; this is the direction in which you should drive in order to stay on the road. Obviously, learning with supervision is much, much easier than learning without supervision. It's still pretty hard, and we'll see some of the issues that arise, but it's certainly a much more solid technology at this point, and easier to deploy. The problem with it, of course, is that supervision is hard to come by. Data is all over the place; generating reams and reams of data can be done very easily in many domains. But labeling the data still has to be done by humans. There are cases where the labels for the data are generated automatically, for example: I predicted that this stock would go up tomorrow, and tomorrow it did go up. 
Those domains are great, because you can really kill them with supervised learning. But in many, many domains you don't have supervision, so then what you do is learn from data where you yourself have to figure out what the output is. An example of this is clustering. For example, a very popular exercise in marketing is to cluster your customers. Like: Nike's two main types of customers are teenage boys and middle-aged women. True fact. It's very important to realize that these two clusters exist, because they have very different tastes; a marketing campaign geared to teenage boys is unlikely to appeal to middle-aged women, and vice versa. So clustering can be really, really important. But no marketing journal will tell you in advance what those clusters are; you have to figure them out. Another example of this is clustering text. I have millions and millions of documents in my company; how do I organize them and find them? Well, maybe I can form a hierarchical clustering of them. Things like Yahoo and DMOZ are hierarchies of documents or web objects. In the early days these were mostly done by hand, but pretty soon they were done using machine learning: use machine learning to figure out what the hierarchy is. And then, once you have the hierarchy, when a new document comes along you can say, oh, this one is about sports, or more specifically it's about baseball, or even more specifically it's about this or that. So once you figure out what the outputs are, then you can figure out how to turn the inputs into the outputs, and then in the future you produce those outputs in response to the inputs. Now, unsupervised learning is also a longstanding field of machine learning, but obviously it's much harder. It's much harder because you have less information to work with. It's also much harder because it's much harder to tell what is a good result and what isn't. In supervised learning, if it was spam and you predicted it was spam, that's a success. 
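As a minimal sketch of clustering, here is one-dimensional k-means with k = 2, a standard clustering algorithm (not one singled out in this lecture): no labels are given, and the algorithm discovers the two groups itself. The "customer ages" are invented, loosely echoing the two-cluster example above.

```python
# One-dimensional k-means with k = 2: alternate between assigning each point
# to its nearest center and moving each center to the mean of its points.
def kmeans_1d(points, c1, c2, iters=10):
    for _ in range(iters):
        g1 = [p for p in points if abs(p - c1) <= abs(p - c2)]
        g2 = [p for p in points if abs(p - c1) > abs(p - c2)]
        c1, c2 = sum(g1) / len(g1), sum(g2) / len(g2)
    return sorted([c1, c2])

ages = [14, 15, 16, 17, 44, 46, 48, 50]
print(kmeans_1d(ages, 20.0, 30.0))  # [15.5, 47.0]
```

The two centers it converges to are the means of the "teenage" and "middle-aged" groups; no one told the algorithm those groups existed.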
You gave the credit card to this customer and she paid her bills on time. (In fact, the best credit card customers are obviously not the ones who pay their bills on time; they're the ones who are always a little bit late, paying fees and whatnot. The ones who don't pay at all are the bad ones.) You have the supervision to tell you that. So what do we do? Well, this is where semi-supervised learning comes in. It is often the case that you can't afford to label all the data, but you can afford to label some of it, or you somehow obtain some subset of the data that does come with labels. And so what you want to do is start with that small subset of the data that is supervised, and then bootstrap from there to label the ones that aren't supervised. So: use a small amount of supervision to, at the end of the day, do a large amount of unsupervised learning. And this can be amazingly effective, because a little bit of supervision can go a long way. In fact, you could say that this is how most people learn most things, right? It's semi-supervised. They get a little help from their parents or from their teachers, but then they're on their own. The parents and teachers can't be on top of them all the time saying wrong, right, wrong, right. Now, we're not going to talk about semi-supervised learning here, just because there isn't time for everything, but it's good to be aware that it exists, because it might be just the perfect thing for your problem. Finally, there's reinforcement learning. Reinforcement learning is probably the most ambitious type of machine learning. AI types tend to like reinforcement learning a lot. It's also in some ways the hardest. The idea in reinforcement learning is that you're not learning to make point-by-point decisions, for example, is this page interesting or not, does this patient have this disease or not. In reinforcement learning you're like a real agent in the real world: you're making a sequence of decisions. 
You're playing a game, for example, or you're doing something over time. And it's only after you've made a bunch of these decisions that it becomes clear whether you were doing the right thing or not. At the end of the day, you win the game of chess or you lose, and then you have to somehow propagate that information back to the decisions that you made early in the game, when it was not obvious at all whether they were going to be good or bad. So, in a way, reinforcement learning is a kind of semi-supervised learning, because you only have a small amount of supervision, but it's distinguished by this sequential nature: you're learning to propagate the payoffs, when you finally have them, back to the decisions that you made earlier. Again, we're not going to be covering this topic, because there's no time for everything, but it's a very interesting and large area of machine learning. It is by far the least used of all these types of learning, so if you're looking for algorithms to use in practice tomorrow, reinforcement learning may not be your pick. But you should at least be aware that it exists, because it might be the right thing for your problem. And in the future there will probably be better reinforcement learning algorithms that will be easier to use. Yes? I have a question about semi-supervised learning. I'm confused: if it is just learning with a smaller training data set, isn't it the same as supervised learning, just with smaller training data? No, no. Let me clarify this. So here's the idea. Suppose that you have a bunch of emails, some of which are labeled as spam and some of which are labeled as not spam. This is supervised learning. Suppose none of them are labeled. Then you just want to cluster them into good and bad emails, and maybe this will coincide with being spam and not spam. This is very hard; you have no labels. Semi-supervised learning is when you have three types of emails. 
Some are labeled as spam, some are labeled as not spam, and some are not labeled at all. And what typically happens in semi-supervised learning is that you only have labels for a small fraction. So suppose, for example, that you're an email provider, and this is a real example, in many cases this is exactly what happens: you have a billion emails, but you only have 100,000 of them labeled as spam and 100,000 that for some reason you know for sure are not spam. This is the semi-supervised learning setting. And the idea there is that you start out doing supervised learning, looking only at the ones that are labeled. But then from those you can bootstrap. You go like, oh, these emails here are not labeled, but they're very similar to the ones that were labeled as spam, so maybe those are spam; and maybe those others are not spam. And now I have a larger training set. Until finally, hopefully, I've labeled the whole training set. When this works, it's extremely powerful. Of course, it's harder to make it work than fully supervised learning, but it's easier to make it work than fully unsupervised learning. More questions? Okay. So these are the main types of learning. In this class, for the most part, we're going to focus on supervised learning, also known as inductive learning, since it's doing what philosophers call induction. In some ways it's the oldest known type of learning. Towards the end of the class we will talk some about unsupervised learning, but for most of our time here we're going to be talking about inductive learning, so let's start focusing on that. So what is inductive learning? What is the problem? Why is it hard, and how can we solve it? Here's a very simple definition of inductive learning. What I have is examples of a function. Each example has two parts: x, the input, and f(x), the output. x is your email, or maybe it's just the list of the words that appear in your email. 
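The bootstrapping idea above can be sketched as self-training: train on the few labeled examples, label the unlabeled ones you're most confident about, add them to the training set, and repeat. Here the "learner" is nearest-neighbor on one-dimensional points and "confidence" is just closeness to a labeled point, purely for illustration.

```python
# Self-training sketch: labeled = list of (point, label); unlabeled = list of
# points. Confidently labeled points are added to the labeled set each pass.
def self_train(labeled, unlabeled, threshold=2.0):
    labeled = list(labeled)
    unlabeled = list(unlabeled)
    changed = True
    while changed and unlabeled:
        changed = False
        for x in list(unlabeled):
            nearest = min(labeled, key=lambda ex: abs(ex[0] - x))
            if abs(nearest[0] - x) <= threshold:  # confident enough: adopt its label
                labeled.append((x, nearest[1]))
                unlabeled.remove(x)
                changed = True
    return labeled

labeled = [(0.0, "ham"), (10.0, "spam")]
unlabeled = [1.5, 3.0, 8.5, 7.0]
result = dict(self_train(labeled, unlabeled))
print(result[3.0], result[7.0])  # ham spam
```

Note how the point at 3.0 is too far from the original labeled point at 0.0 to label directly, but becomes labelable once 1.5 has been labeled: that chaining is the bootstrap.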
This is a common thing to do. And then f of X is: this is spam, or this is not spam. Or it could be something like automatically picking which folder you want to refile the email into, or automatically deciding which tags you want to label it with. In all cases, the key thing that defines inductive learning is that this is the test: when I see an example X that I have not seen before, one that was not in the data I used to train my learning algorithm, my training data, can I classify it? Along comes a new email: can I predict this one correctly? Can I, for a new example X, correctly predict the value of the function? This is the problem and this is the challenge. I have to guess what the function's values on unseen examples are. This is very different from almost everything else in computer science. In the rest of computer science you're doing things that are more like deductive reasoning: I have these inputs and I know how to turn the crank to get this result. The crank might be very complicated, might be millions of lines of code, but it only has to work on inputs that you already know the nature of. Here I basically have to make a leap from the examples that I've seen to the ones that I haven't seen. A new patient comes along; her combination of symptoms is different from any that I've seen before. Can I still correctly predict what she suffers from? People are good at this. Traditional programs are not necessarily very good at this, but this is what inductive learning is all about. Now, there are three main types of inductive learning. If the function that you're trying to predict is discrete, for example spam versus not spam, or what topic this piece of text is on, then the problem is called classification, and each of the discrete outcomes is called a class, simply enough.
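The "test" just described, predicting unseen examples correctly, is usually measured by holding part of the data out. Here is a minimal sketch in Python; the function name, the learner interface, and the 80/20 split are placeholders for illustration, not anything prescribed in the lecture:

```python
def train_test_accuracy(examples, learn, split=0.8):
    """Fit on one part of the data, then measure accuracy only on
    (x, f(x)) pairs the learner has never seen."""
    cut = int(len(examples) * split)
    train, test = examples[:cut], examples[cut:]
    f_hat = learn(train)  # learned approximation of f
    correct = sum(1 for x, y in test if f_hat(x) == y)
    return correct / len(test)
```

Any learner that maps a training set to a prediction function can be plugged in; the point is only that accuracy is computed on the held-out examples, never on the training data.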
If what you're trying to predict is continuous, for example, there's this company here in Seattle that you might have heard of that predicts house prices, which is a very nice thing to be able to predict, then it's a regression problem. House prices are a continuous variable. Regression is what most statisticians do for a living. People in computer science and machine learning tend to be more interested in discrete things, so we will focus mostly on classification in this class. Nevertheless, for most classes of algorithms there's a classification version and there's a regression version, so if you know one, usually with a little bit of tweaking, changing the evaluation measure and a few other things, you can see how to get the other from the first. And we will touch on regression occasionally, because we want to have some coverage, but again, we don't have time for everything, so mainly we will focus on classification. The final type of inductive learning is when what you're trying to predict is the probability of an instance. How likely is this thing to show up? Or how likely is this page to be interesting? In this case what you're trying to compute is a probability, and this is called probability estimation. You could think of probability estimation as a kind of regression, since a probability is a continuous variable, but it's a very special kind of regression, because remember, the probabilities of all the outcomes have to add up to one. So you're not free to just predict any old thing that you want; there's a normalization condition that you have to satisfy. So at the end of the day, the techniques that you use for probability estimation are often very different from the ones that you use for regression. And if you're trying to predict the probabilities of discrete outcomes, classification and probability estimation often go hand in hand.
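One common way they go hand in hand is to threshold a probability estimate to get a hard class. A tiny sketch, with a made-up function name; the 0.5 and 0.8 values are just the usual default and a more conservative choice:

```python
def classify(prob_spam, threshold=0.5):
    """Turn a spam-probability estimate into a hard classification.
    Raising the threshold makes the classifier more conservative."""
    return "spam" if prob_spam > threshold else "not spam"
```

The probability estimator itself could be any model; the thresholding step is what converts its continuous output into a classification.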
So you can build a classifier by building a probability estimator and then thresholding the probability. I build something that predicts the probability that an email is spam, and then I say: if the probability is above 0.5, I predict that it's spam. Or, if I want to be more conservative, I will predict that it's spam only if the probability is above, say, 0.8. So these tasks are different, but there are also relations between them. Questions about any of this? Okay, so that's the definition of inductive learning. Here's a preview of what we're going to do in this class. As I said, we'll be spending most of the class on supervised learning. I believe it's very good to get your hands dirty right away: start working with real algorithms and applying them to real things right away. Many machine learning algorithms are joyfully simple, at least the basic versions of them, and the basic versions get you very far. People often get intimidated when they see very mathematical machine learning algorithms and machine learning textbooks with literally hundreds of equations in every chapter. There's equation number 11.647. But the truth is that all of that is icing on the cake. You can get a lot out of machine learning with much simpler things, so let's start studying some of them right away. We will look at decision tree induction, rule induction, instance-based learning, basically one of these every week, Bayesian learning, or more generally statistical learning, neural networks, support vector machines, and how to build ensembles of these things. We will spend a week on learning theory. There's a large body of learning theory, and some people like to start with the theory. I actually don't think that's the best idea, because you don't know the point of the theory in the beginning, and it's very easy to be alienated.
What we're going to do instead is look at the learning theory after we've looked at a bunch of these algorithms and seen what the common questions are that arise. One of the things you'll notice is that there's a set of representations and evaluation functions and whatnot, but there's also a set of recurring problems that keep coming up, like overfitting and the curse of dimensionality, independent of the representation. And once you've seen that, hopefully we'll be ready to ask: can we come up with fundamental notions of how to deal with these? That's when we will dive into some theory, and we will focus on the theory that's useful. There's a lot of gratuitous theory out there that's not going to be useful to you in deciding what to use and how to optimize it, and that's not what we're going to do. And then, after that, we will look at support vector machines, because support vector machines are a machine learning approach that comes out of this theory. A lot of the approaches that we're going to see predate the theory that we're going to look at, but support vector machines actually come out of that theory, so that makes for a very nice story. Some model ensemble methods also came out of that theory. And then finally, unsupervised learning is a huge field; we could easily spend the whole quarter on it. We don't have time to do that, but we do have time to look at the two main types of unsupervised learning and a few of the key techniques in each one. So even though we're not going to spend that much time on it in class, you will learn enough to go and apply this tomorrow if you want. The two main types of unsupervised learning are clustering, which I already mentioned, and dimensionality reduction. The idea behind dimensionality reduction is that there are two ways in which big data can be big, to oversimplify things a little bit.
One is that you have a lot of examples. I'm Facebook, I have information on a billion people. Wow, that's big data. That's one dimension of big data: in machine learning terms, it means I have a lot of examples to learn from. But the other dimension is that I might know a lot about each one of these people. In fact, you could even argue that that's the big advantage Facebook has: it actually has more information per person than Google or Amazon or any of these companies. In technical terms, we say that you have many dimensions for each example. I might have not a thousand variables but hundreds of thousands or even millions of variables. And this really is a brave new world. If you look at traditional statistics, what's considered high-dimensional is like 10 dimensions. In fact, in the old days, anything above three dimensions, which is what our world has, was considered high. And today, what is a hundred dimensions? A thousand dimensions is starting to get interesting; a hundred thousand is big data. But if you have all those dimensions, things get very problematic. As we'll see, this is going to make life very hard for the machine learning algorithms, and it's going to make life very hard for you. How do you look at a hundred million pieces of information about somebody? It's going to take too long. So the very natural idea in dimensionality reduction, as the name implies, is that we want to reduce the number of dimensions. I don't want to be looking at a hundred thousand pieces of information; I want to be looking at a dozen. Or maybe even three, so that I can plot it in 3D, or even two. Or four, if time is one of the dimensions. So dimensionality reduction is the study of how to take examples with many, many dimensions and reduce them to the most informative dimensions. Okay?
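As a crude illustration of reducing dimensions, here is a sketch that simply keeps the coordinates with the highest variance. This is really feature selection rather than a full dimensionality reduction method like PCA, which builds new combined axes, but it conveys the "keep the informative dimensions" idea; the function name and behavior are invented for illustration:

```python
def top_variance_dims(rows, k):
    """Keep the k coordinates with the highest variance across the
    dataset (a crude stand-in for dimensionality reduction)."""
    n = len(rows)
    dims = len(rows[0])
    scored = []
    for j in range(dims):
        col = [r[j] for r in rows]
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / n
        scored.append((var, j))
    scored.sort(reverse=True)                    # highest variance first
    chosen = sorted(j for _, j in scored[:k])    # keep original order
    return [[r[j] for j in chosen] for r in rows]
```

A constant column carries no information about how examples differ, so it is the first thing this sketch throws away.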
And again, this is a very active area of research these days; many fascinating algorithms are being invented. We'll have time to cover the main ones and to give you a sense of the power of this type of machine learning. All right. So, before we go into all of that, however, there's something very, very important to remember, which is that what we're going to spend 99% of this class on, or let's say 90-plus percent, is actually only 10% of what people doing analytics and machine learning in practice spend their time on. And it's very important to be aware of what the other 90% is, because if you become a data analyst, chances are this will be what you spend most of your time on. So what are those things? Well, let's say you have a problem. You're in a company, or you're a scientist; you have some phenomenon that you're trying to understand, some prediction that you're trying to make. You want to design a vaccine for AIDS, say, which is of course a very real example. What is the first thing that you have to do? It's not to get out your machine learning suite. The first thing you have to do is understand the problem. Understand the domain. Learn the biology. Learn the marketing. Learn natural language if you're going to do document processing. This is where talking to experts is important. Usually a machine learning project involves at least two people: the machine learning expert and the domain expert. In some cases these two are the same person, but that's a very high burden for one person to bear. You might be working in collaboration with a doctor who has the expertise about AIDS, or with a marketer, or with someone who's an expert on finance. And if they're going to talk with you, you have to learn some finance or medicine or whatever in order to talk with them.
And you also have to understand what the goals are, because there's no point in understanding the domain if most of it is irrelevant to the task at hand. Sometimes the goals are very clear. This is a hedge fund; you're a quant; you have $10 million invested. Learn to predict better than chance: if you do that successfully you'll be rich, and if you don't, you'll be fired. In that case, the goal is clear. More often, actually, the goal is extremely unclear. The goal is like this: I'm the CEO of a company and I know that we have a lot of data. We have tons and tons of data, our customers come to our website every day, and I have the feeling that there's enormous value in this data. My competition is ahead of me in data mining and we've got to catch up with them. Please do something. "Do something" is not a very clear goal. In some ways this is a harder problem, and in other ways it's actually a great opportunity for you. Go to town. Have fun. Figure out what there is to do. When I talk to people who do machine learning in industry, the number one thing they always say is that there's more to do than they can possibly get to. They only have so many thousands of people in the company, and even in companies where a good chunk of the engineers are people with expertise in machine learning, they have too many things to try. So go and discover where in the company the value is to be created. Either way, it's very important to understand the problem, the domain, the goals, and the constraints that you're going to be operating under. Certain things might be very good machine learning but might not be deployable, in which case you should think about that. So this is step one. Step two sounds very boring and banal, but it is often the single most time-consuming step: you've got to get your data.
Remember, the reason we can be lazy is that the data is going to do our work for us. No data, no learning. Big data, big learning: that's where the opportunity is. The reason machine learning is taking off like gangbusters is that we have more data than ever. Does this mean that our life is easy? No, life is harder. Dealing with more data is harder than dealing with less data. At the end of the day you might be able to learn a much better model, but you have to work for it. And the thing is, the bigger the data is, the crappier it is. This is like Murphy's law of machine learning: the more data you have, the more it sucks. In the old days, somebody would do an experiment in the lab, very well controlled, and would have a thousand data points with ten dimensions each, but it was sparkling clean data. These days you have vastly more data points, and each one of them has ten thousand dimensions, but half of it is garbage. Half of it is people who put in their birth date as January 1st, 1111. Half of it is incompatible formats that got munged together. So you've got to take all that: you have to find your data, you have to clean it, you have to integrate it. You often have multiple sources of data; in all large projects this is the case. You don't have a single source of data, you have multiple sources, often from very different origins, and you have to put them all together. And if you put them together wrong, garbage in, garbage out: your learning program will learn the stupid things that happened because of what you put in the data. For example, I know of a case where the program learned that men get pregnant, because somebody merged the fields wrong. That kind of stuff happens. So cleaning the data, preprocessing it, integrating it, selecting what's most important, is a very big chunk of all of this.
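As a flavor of the cleaning step, here is a toy pass that drops records with the kind of bogus birth dates just mentioned. The record format, the field name, and the rules are all invented for illustration; real cleaning is always domain-specific:

```python
from datetime import date

def clean_birthdates(records):
    """Toy cleaning pass: drop records whose birth date is missing,
    looks like a form default (e.g. the classic 1/1/1111), or is
    impossible."""
    cleaned = []
    for r in records:
        b = r.get("birthdate")
        if b is None:
            continue  # missing field
        if (b.month, b.day) == (1, 1) and b.year <= 1900:
            continue  # likely a placeholder the user never changed
        if b > date.today():
            continue  # born in the future?
        cleaned.append(r)
    return cleaned
```

Real pipelines chain many such passes, and deciding which rules are safe to apply is exactly where the domain knowledge from step one comes in.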
Finally, you get to what's usually considered the fun part, and if you talk to the people who do this, this is usually the part that they really like, which is learning the model. This is where the fun begins. You do discovery; you're like a scientist, either really a scientist or the equivalent of a scientist in industry. This is when you get to discover things, follow hints, be a detective. And this is where we have a lot of great technology for you to use. This part is very mature. Part of the reason it's very mature, and part of why we're going to spend almost all our class on it, is that we can come up with very general techniques that you can learn in this class and then apply to a hundred different things. And tomorrow there'll be a hundred new things you can apply them to that today we haven't even thought of. So here we have very general tools, and they're pretty mature; this stuff works. Part of the reason this often doesn't take most of your time is that the technology is pretty good at this point, so using it doesn't have to take that long. It depends on how accurate you want to get; you could spend a long time getting ever more accurate at predicting what things people are going to click on and whatnot. But in many ways, this is the most efficient part of the whole process. After you've learned your models, you are not done yet. You have to understand them. There are some areas where, if your model works, people are willing to close their eyes and let you use it. One of these areas is finance, in some cases at least. Nobody really, at the end of the day, understands how the markets work. If your algorithm, no matter how opaque it is, reliably predicts something better than chance, well, it's making money; we'll take it.
We'd prefer to have something we can understand, but if we can't understand it and it's making money, we'll live with it. For most problems, however, if people don't understand what your algorithm produces, they will refuse to use it. You don't want to tell somebody: well, you need surgery because you have breast cancer. Why? Because this complicated nonlinear function with these coefficients is outputting plus 0.5, which is above threshold. This is not a very persuasive thing to tell a patient. Or to tell the CEO of a company, if you want to persuade them that there's a different market segment they need to address. Remember, you're probably going to be in a room with experts who are very convinced of their expertise, and often your learning algorithm will tell them: I'm sorry, no, we know better now. And they're going to say: this box of junk is telling me that I'm wrong? No. I know of many examples where people had very accurate machine learning algorithms that were not used because of this. Just to give one example, you can use machine learning algorithms for wing design. You'd think that if the wing flies and has the right envelope of characteristics, people would be satisfied with that, but hell no. They won't be convinced that the thing is safe; it's just some black-box neural network that produced the design, and they're not going to take it. The second reason why interpreting results is very important, more subtle but equally important, is that this whole thing is not a one-shot process. It's a cycle. And you need to understand what your algorithm is doing in order to then go and improve it. The machine learning algorithm is doing a very large search; that's what the computer is for. But the outer loop involves you. It involves you tweaking the algorithm and trying different things. And again, this is what people spend a lot of time on.
But it's very hard for you to tweak the algorithm, or try another algorithm, or maybe bring in a new source of data because that's what was missing, if you don't understand what the algorithm is doing. So it's very important to understand the results in order to make that loop converge to the point where it's doing something really good. This whole thing is really not a linear sequence of steps; it's loops within loops. From interpreting results you might want to go back to learning models, or you might realize that you need better data. When learning the models you might realize that you need better data. This whole process might lead you to a better understanding, and based on that understanding you're going to do a whole bunch of different things. So this is how things actually work in practice. Finally, let's suppose you came up with a crackerjack model. This thing works amazingly well. Are you done? You're not done. I don't have a firm statistic on this, but I'm willing to bet that the majority of machine learning projects that were successful in the lab, if you will, actually wind up not being used in practice, because, as with anything else, between your project and people actually using it there are all these things that have to happen. There are all these stakeholders who say yes or no, who might help it or hinder it. Things might have changed on the ground while you were learning your model. These things are all getting better, so I think the hit rate is actually improving. But the key thing to remember is that your job is not done. You can't just go to your boss and say: here's the model, now use it. You actually have to make sure that the people who have to use the model will be willing to use it, and that the thing gets deployed in the right form. Here's an example: Donnelley, the largest printing company in North America. At one point, the result of a machine learning project there was a decision tree.
What they actually created was a piece of paper that got pasted onto the machine, saying: if this reading is this and that reading is that, do this. That was the deployment. Of course, often the deployment is that you have an algorithm on a website predicting: oh, if you bought this book, you're going to buy that book as well. But it has to work at that level, number one, and then, number two, it has to keep on working. The world keeps changing. The phenomenon that you're modeling might have changed. Also, the data might have changed: your algorithm assumes that the data is coming in this form, but now the data has changed to that form, and things start to go haywire. So, like any piece of software, a machine learning system has a life cycle, and you have to keep attending to it, making sure that it stays relevant, if necessary going back and redoing it, or, even better, and as is increasingly the case, making sure that the learning is happening online: every day you're adapting the model, or every week, or every month. So the key point to remember here is that machine learning in practice is a cycle with many steps. We're now going to zoom in on the one step in the middle that's very well understood, but bear in mind that the others are there as well. The other reason, of course, why we can teach learning models but not the other parts so much is that for learning models there's a very general set of techniques, while many of these other parts are very domain-specific. There are some things you can say, for example, about how to deploy systems, but really it's going to depend on your particular domain. The same goes for things like data cleaning and data formats. The bang for the buck of trying to teach you stuff like that is much lower than for the other things. But it's very important to remember that those things are there as well, and you're going to have to deal with them if you want to do successful machine learning in practice.
Questions about any of this? All right, very good. Let's take a break. It's 7:54; let's take a 10-minute break and reconvene at 8:05. We'll generally be doing this every week: a 10-minute break roughly in the middle, and then the second part. So see you back here at 8:05. All right, let's keep going. On to round 2. Round 2, exactly. Round 1.2; round 2 is next week. What we're going to do for the rest of today is dive into the meat of things. But before we start studying specific algorithms and specific representations, let us just try to understand what the inductive learning problem is, why it's hard, how we can solve it, and what the important dimensions of the problem are. So, what is inductive learning? Just as a quick recap: we are given training examples of some unknown function f. A training example is a pair of two things: the input X, which could be a vector, or a piece of text, or lots of things, and the output corresponding to that input, f of X. And our goal is to find a good approximation of f. Ideally, we would be able, from those examples, to discover exactly what f is; most of the time that's too hard, but maybe what we can find is a good approximation of f. Good enough that when we apply our approximation to tomorrow's examples, most of the time we will get the right result, or at least a useful result. So, here are a few concrete examples of this to keep in mind as we go along. I've already mentioned some of them, but let's make them a little more concrete. Number one: credit risk assessment. A very important application. In this case, your X is the properties of your customer: their job, how long they've been at the same house, their income, the size of the mortgage, their debts, all the stuff you have to write on the form that they send you.
And also, if you're asking for money to buy a specific thing, the properties of the thing that you're trying to buy: if this is an investment, it's one thing; if it's consumer credit, it's another, et cetera. The output that you want, in the simplest version, is either approve or deny the credit request. A simple binary decision. Here's another example: disease diagnosis. Again, a very important one. The input is the patient: their age, their sex, the things that they've suffered from in the past, and also the results of the tests that you've run on them, both the symptoms that they have and blood tests, x-rays, et cetera. And the output is now potentially a prediction with many different classes: what do they suffer from? Which disease is generating these symptoms? And potentially even what medicine they should be taking, or what combination of medicines. So it could just be a diagnosis, or it could actually be a therapy. Here's another example, a very popular one since the early days of machine learning, and even more popular these days because there are cell phones with cameras and whatnot: face recognition. In some ways, this is one of the quintessential machine learning problems. Here, the input is just a bitmap of the person's face. I give you a picture of somebody; it's a bunch of pixels. And the goal is to say who it is. This is you. This is your grandmother. This is Barack Obama. This is whoever. Final example, again a very famous one these days: autonomous driving. You want your program to learn to drive a car. What is the simplest way to formulate this problem? Not necessarily a complete solution, but at least the heart of it is like this: the input is, again, a bitmap image captured by a camera in front of the car.
It's a picture of what's in front of the car. And the output is the degree to which you should turn your steering wheel. Let's suppose that we're going at a constant speed, or that the person is controlling the speed; the computer now just has to figure out which way to turn. And this is now a continuous problem, except that you could quantize it and say: well, there are only 10 different directions, from hard left to straight to hard right. And your goal is, given a picture, to predict which way the wheel should be turned. So keep these examples in mind as we go forward. This is a nice variety of the kinds of things that you might want to do with supervised learning. But here's an interesting question. These are all good classic examples of supervised learning, and there are others, but there are also problems for which supervised learning is not a good idea. And as with any technology, it's very important to have a sense of what problems machine learning is good for and what problems it's not good for. In particular, what is inductive learning good for? This question actually has some very interesting answers. So here are four different situations where inductive learning may be a good idea. The first situation is where nobody knows the answer. Always think of this comparison: I could have this program be learned by a machine learning algorithm from data, or I could have a human write the program. If it's better to have a human write the program, then don't bother with the machine learning. So the question is: under what circumstances is it better to use supervised learning? Here's the first one; in a way, it's an obvious one. If people don't know the answer, then they can't program it. In particular, let's say you're trying to find something that binds to the HIV protease molecule.
So that it inhibits the AIDS virus, keeps it from multiplying, and so forth. It would be great to have an answer to this question; nobody has it. The amazing thing is that there are actually many problems where machine learning can find answers that people don't know. So the machine learning is really doing honest-to-God discovery. Why would that be the case? Well, people and machines have different strengths. People have a lot of varied knowledge and a lot of intelligence, a lot of problem-solving tactics. Machine learning algorithms, on the other hand, have a lot of raw power that you can't bring to bear with your brain on that problem. Your brain has a lot of hardware for doing vision and language, but not for other things. If your problem is to design a molecule, well, in the lab you can only try so many molecules. But if you have a good simulator of these molecules and you've learned to predict whether they're good or bad for your purpose, you can have your cluster simulate millions of them all night long at very low cost. So the learning can have an advantage when it can use more computing power, et cetera, than the human has. In many ways, these are very exciting applications of machine learning, because they are learning things that weren't even known before. However, even when we're not in that situation, and many or even most applications of machine learning are not, we may have a problem that humans can actually do well. If you could get a human to do it and the human was cheap enough, maybe you could just let the human do it. But you want to program it. And there's a very big difference between somebody knowing how to do something and somebody knowing how to program it.
We spend a lot of time programming things, so we almost forget that this difference exists, but consider the following. Suppose, for example, that what you're trying to do is character recognition. We humans are really, really good at character recognition: we can recognize semi-garbled handwriting; pharmacists can recognize doctors' handwriting, which is almost superhuman; et cetera. But nobody knows how to write a good program to recognize characters. Nobody. People tried that in the beginning. They said: well, what is an R? It has a stroke like this and a stroke like this. And what is a Q? And they programmed all these rules, and the rules broke all the time; it was terribly inaccurate. So we're in this interesting, paradoxical situation where we know how to do something, but we don't know how we know. Another great example of this is just riding a bike. You know how to ride a bike. Can you program a computer to ride a bike? If you figure that out, tell me, because I want to steal your idea and publish it under my own name. So here's the paradox: it's much easier to ride a bike than it is to program a computer to ride a bike, because your brain is very powerful. Deep in the lower layers of your brain, and particularly in the case of things like riding a bike, it's actually your brainstem that's doing a lot of the work. Your conscious, high-level self that writes programs with language has no idea how that works, but your brain knows how to do it. Driving a car is another example. So there are many applications where people know how to do something, but we don't know how to program it. In many ways, this is an ideal problem for supervised learning: we generate our supervision by having people do it.
In fact, with a lot of the early cars that drove themselves, this is exactly what happened. You just had a car with a camera pointed at the road, and a person driving the car. And this is the supervision, right? At the end of the day, we have a video of the road and we have a record of which way the person turned the wheel. And this is our supervision. So if people know how to do it, they can give us supervision. Same thing with character recognition. There's a lot of these character recognition datasets from the post office, right? They're obviously big consumers of this technology. And by the way, if it wasn't for machine learning, sending a letter would be a lot more expensive, because there would have to be people figuring out what the address is. But the way these algorithms were trained was by actually having characters labeled by people. That wasn't very hard to do. And then the machine learning actually figures out how to do it. So that's the second class of applications for which supervised learning can be good. But that's not the end of it. Here's another case, which also occurs very often. Suppose that, yes, people know how to do it, and yes, they know how to program it. The problem is that it's not cost-effective. Because, for example, the thing that you're trying to learn is changing every day. So if you're going to hire someone to program it, they'd have to be reprogramming it every day. And in fact, it might take them more than a day to program it, at which point, you know, it's already out of date. People have all sorts of rules about how to predict the stock market, but they don't really work. The models that work really change, in some cases, depending on the type of investment, on a day-by-day basis, or a week-by-week basis, or whatnot. In this case, again, it might be a good idea to use machine learning.
Finally, and in some sense this is another version of the same thing, let's suppose things do not change very rapidly. They're fairly stable. Like, for example, your taste in movies, or your taste in books, right? The problem, however, is that for a recommendation system to be useful, think of what Amazon has, or what every e-commerce merchant worth their salt has. They have a recommendation engine that says, oh, people who bought these books also bought these. They need one of those for every customer, for every user out there. Okay? Your tastes are stable, but there's seven billion of us on this planet. There's, I don't know, 500 million people or whatever who use Amazon. You can't possibly go hire someone to write a program to model your tastes. That would cost more than the money they'll make from selling you the books. And by the way, it really contributes to Amazon's bottom line; I forget what the statistic is, but a large percentage of their sales come from the recommendation engine. So this is definitely a really useful technology, but if they had to hire someone to go and model your tastes, it would be a big money loser. So the lesson here is, even when people know how to do it, if you need to do it in a very large number of cases, like, for example, one for every person, then it's probably not cost-effective to do it manually, but machine learning can do it, right? Amazon has your clickstream on their website, or, you know, Walmart, another big consumer of these things, has your record of the things that you bought every time you went to Walmart, and from that, they can try to figure out what you like to buy and what they should put in the aisles and whatnot. Questions? Okay. So bear this in mind when you're considering machine learning for an application, right?
If your application doesn't fall into one of these categories, and if it doesn't have the characteristic that you don't always have to be completely right, then machine learning is probably not a good idea. For example, you probably don't want to use machine learning to design a network protocol. That's a very complicated and intricate thing, and if you get one step wrong, the system breaks. On the other hand, machine learning keeps being used for things that I didn't think it was good for, so I wouldn't be surprised if tomorrow there was a network protocol designed by machine learning, or maybe there is one already. But in any case, it's good to bear these things in mind. Okay. All right. So what is the inductive learning problem? Why is it hard, and how can we solve it? Before we dive into the details of different algorithms and their different strategies and whatnot, let's look at this problem in the simplest form that we can. So here it is. I'm trying to learn a Boolean function, y, and it's a function of four Boolean inputs, x1, x2, x3, and x4. So this black box here is the Boolean function. My goal is to figure out what that function is. If I figure out what that function is, then tomorrow when a new set of x1 to x4 inputs comes along, I apply the Boolean function, and I get, hopefully, the same result as the real function. So this is the goal. Now what do I have? What I have is examples of the function. So here's a table with examples of my function. For example, my first example says that when x1 is zero, and x2 is zero, and x3 is one, and x4 is zero, then y is zero. So I have seven examples. And now, of course, my job is to induce what the value of the function is for the other examples. All right. Over to you. How would you solve this problem? Let's see your suggestions. Maybe you'll come up, right here, right now, with a new machine learning algorithm that no one has thought of.
So how would you solve this problem? Don't worry about it being a sophisticated solution. We're not going to look at sophisticated solutions to anything today. So can you think of a simple way to solve this problem? If-else? Sure. If-else based on what? Conditions on variable values? Absolutely. In fact, this is what we're going to be looking at very shortly. But be a little bit more specific about that. So that's your language, right? But now how would you learn statements in that language? How would I learn them automatically? Yeah, exactly. By building decision rules? Okay. So here's the thing that's going to happen. Take each example. So for example, I could say: if X1 is false and X2 is false and X3 is true and X4 is false, then Y is false. This is an if-then rule. Then I could have one of these if-then rules for each of these examples. These seven rules get all the examples correct. Is this a good solution? It might be good enough. But this is the whole point: this is not a good solution. Yeah. So I was going to say, wouldn't that kind of be overfitting the data if you keep all of them? This would actually be maximally overfitting the data; you could say it's the opposite of what we want. The problem is that such a solution would only tell me what I already know. I don't need those rules for that. I have the examples to tell me that. Remember, the goal in learning is to learn a function that I can apply to new examples that I haven't seen before and get those right. What would happen with this solution on a new example? It wouldn't match any of the rules, right? I only have rules for the examples that are already there. This would be completely useless on new examples, yeah. So what if we try to find the closest to a new item? Say, I already know seven of these and I get a new one, and I try to gauge some kind of distance between what I already know and the new one.
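A minimal sketch of the memorizing strategy in Python. The seven training rows below are hypothetical stand-ins for the slide's table (only the first row's values appear in the lecture); the point is just that a rule-per-example "learner" is perfect on what it has seen and silent on everything else:

```python
from itertools import product

# Hypothetical stand-in for the slide's seven training examples
# (only the first row's values are given in the lecture):
# (x1, x2, x3, x4) -> y
train = {
    (0, 0, 1, 0): 0,
    (0, 1, 0, 0): 0,
    (1, 0, 0, 1): 1,
    (1, 1, 1, 0): 0,
    (1, 0, 1, 1): 1,
    (0, 1, 1, 1): 0,
    (1, 0, 1, 0): 1,
}

def memorizer(x):
    """One if-then rule per training example: pure memorization."""
    return train.get(x)  # None on anything we haven't seen

# Perfect on the training set...
assert all(memorizer(x) == y for x, y in train.items())

# ...but useless on the other 9 of the 16 possible inputs.
unseen = [x for x in product([0, 1], repeat=4) if x not in train]
print(len(unseen))  # inputs the memorizer knows nothing about
```

The student's nearest-neighbor suggestion at the end of the paragraph is exactly what you would bolt on here: instead of returning None for an unseen input, return the label of the closest stored example.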
So this is the type of strategy that we can use. Another suggestion: instead of that, we could just do a bunch of random operations on these things and keep whichever one actually fits all this data. This is a good strawman, actually. Why not just do something random? We could. The thing is that something random might not be that good. In a way, what we're trying to do here is better than random. Chances are random will basically be as good as flipping a coin, meaning you classify half of your spams and non-spams correctly and so forth; you're right about half the time. Now, if being right half the time is good enough, then great. But you probably want to be right more than half the time. By the way, notice the following. Let's say you have a problem with 10 classes, each of which occurs 10% of the time. Being right half the time is a brilliant achievement, because by guessing randomly, you're only right 10% of the time. So sometimes being right half the time is amazing. By the way, on the stock market, when you're trying to predict whether a stock is going up or down, being right half the time is not good enough, but being right 51% of the time is great. If you can reliably be right 51% of the time, you can make money, because you can do many trades, even per second. I'm oversimplifying a little bit, but if somebody says that they're right 60% of the time, they're lying or they're doing something wrong. In the stock market, 60% is out of reach. But 52%? Hey, you could be on the way to great riches with 52%. Yeah. I have a question here. I was just looking at, for example, whenever X2 is 1, the output is 0. So while coming up with the model, do I need to have some intuition about what the relationships between them are? Because that may be a good relationship. Very good. So first of all, this is where having any kind of knowledge about the problem can help, right?
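To see why a 51% hit rate matters, here's the arithmetic as a sketch, under the lecture's own oversimplification that each trade symmetrically wins or loses one unit:

```python
def expected_profit(p_right, stake=1.0):
    """Expected profit per trade for a symmetric win/lose bet."""
    return p_right * stake - (1 - p_right) * stake

print(expected_profit(0.50))  # coin flip: break even
print(expected_profit(0.51))  # a small but real edge per trade
```

A 0.02-unit edge per trade is tiny, but compounded over a very large number of trades it adds up; that's why 51% can be lucrative while a claimed 60% is implausible.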
You bring that to bear and you start to see. The other point is that that is exactly the kind of rule we're going to be looking for. Aha, look, X2 being 1 predicts the class being 0. It's not the whole answer, right? But we're not going to try to come up with the whole answer all at once. We're going to try to build it out of these pieces. Yeah. Here's an approach to how you do this. You have a weight or something assigned to each of the inputs. So, the probability that, if this input is 1, the output y is 1, and then you somehow take some function of all of those, and if the total is above some threshold, then you decide that y is 1. Absolutely. See, you have just invented the Naive Bayes classifier. I've done a bit of machine learning before. There you go. So you didn't invent it. Ah, darn it. Okay, so Naive Bayes is actually the foundation of spam filtering, right? At least as far as we know, because they don't say it for obvious reasons. But at least in the beginning, most spam filters were Naive Bayes, and most of them still are Naive Bayes with a lot of things on top of it. But yeah, this is the basic idea of the Naive Bayes algorithm: for each of these x's, you compute how good of a predictor of y it is, and then you combine them all. And we will see in more detail how to do this. Very simple strategy, but actually very effective for a lot of things. Okay, so here's another way to look at this problem, or, you know, a continuation of the same one. Now let's look at it. We're never going to be able to do this again; that's why we're doing it in a small example here. I can actually look at the table of the whole function, right? Because it's only four Boolean inputs, so it doesn't have that many states. So now what I've done here is that this table has the same data that I had on the previous slide. Those examples are still there, right?
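Here's a minimal sketch of that combine-the-predictors idea: a tiny Naive Bayes classifier for Boolean features with add-one smoothing. The training rows are hypothetical stand-ins for the slide's table:

```python
import math
from collections import defaultdict

# Hypothetical training data: ((x1, x2, x3, x4), y)
train = [((0, 0, 1, 0), 0), ((0, 1, 0, 0), 0), ((1, 0, 0, 1), 1),
         ((1, 1, 1, 0), 0), ((1, 0, 1, 1), 1), ((0, 1, 1, 1), 0),
         ((1, 0, 1, 0), 1)]

def fit(data):
    class_count = defaultdict(int)   # y -> count
    feat_count = defaultdict(int)    # (y, feature index, value) -> count
    for x, y in data:
        class_count[y] += 1
        for i, v in enumerate(x):
            feat_count[(y, i, v)] += 1
    return class_count, feat_count

def predict(model, x):
    class_count, feat_count = model
    total = sum(class_count.values())
    best, best_score = None, float("-inf")
    for y, cy in class_count.items():
        # log P(y) + sum_i log P(x_i | y), with add-one smoothing
        score = math.log(cy / total)
        for i, v in enumerate(x):
            score += math.log((feat_count[(y, i, v)] + 1) / (cy + 2))
        if score > best_score:
            best, best_score = y, score
    return best

model = fit(train)
print([predict(model, x) for x, _ in train])
```

Each input votes through P(x_i | y); multiplying the votes (adding their logs) is the "naive" independence assumption, and the threshold the student described falls out of comparing the two class scores.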
But in addition, I've put in here all the other possible examples, except that for the ones that are not in my training set, I have a question mark. So for example, I've never observed 0, 0, 0, 0, so I have that with a question mark. And the problem that we're trying to solve is just this: I want to fill in those question marks. This is my problem. What do I put where those question marks are? And here's the first important thing to realize. It's a really basic thing, but it's very important, and even machine learning experts can get it wrong sometimes. You can put any darn thing that you want in there, right? If you don't know anything about your problem, just the data doesn't get you there. Machine learning is all about the data, but the first lesson is: the data is not enough. You need something else. With just the data, I know the values for the examples that I've seen, like this one and this one. But, like somebody said, you could do it randomly, flip a coin for each one of them, right? How do you know that that's going to be worse than anything else? You have your super-sophisticated classifier technology; why should that beat random guessing? In practice, we do have sophisticated classifier technology that beats random guessing most of the time. But that's because it embodies something about the problem that actually turns out to be right for the problem, okay? So the first lesson is that if you're in a state of complete ignorance, you can't do better than random guessing. You have to bring in something else. Going back to the gardening analogy, complete ignorance is like having all the nutrients but no seed. You can have all the water and nitrogen and whatnot in the world that you want: no seed, no plant, okay? This is the first important thing to remember.
The next thing is that in this example we're going to be able to do another thing that we're never going to be able to do again. This example is so small that, for the question I asked, how would we come up with those if-then rules, we can actually just try all of them. In general, we're never going to be able to do that, because there are way too many, but this is a small enough case that we can. And there's another lesson here. In this case, there are 16 possible examples: four variables, 16 possible examples. I know seven of them; there are only nine that I don't know. So when somebody said, well, the ones that you know, that's not a bad thing to have: in this case, it's not a bad thing to have. It's almost half. But in real examples, say you've seen a billion email messages. How many possible email messages are there? Ten to the who-knows-what, right? Those billion are infinitesimal. It would be like knowing only one example in this table, and even that is an overstatement. You will have seen 1% of 1% of 1% of the things that you could see. So just remembering what you know is really not going to get you off the ground. You'll be lucky, actually, if you ever see one of those exact examples ever again, in many cases. So we need something better. And now notice the other interesting thing here, which is: I have four variables, so I have two to the four states, i.e., 16 states, right? So how many functions are there? How many possible functions are there over those 16 states? Anyone? Two to the sixteenth. So generally, if you have, let's say, n dimensions, you have two to the n possible states, right?
One for every combination. And then each of those states you could label with true or false, right? So you have two to that. Everybody got that? The first two to the n is the number of states, and then the number of functions is two to the number of states, okay? So look at this shocking thing that I just put here. This is two to the two to the n. This is a doubly exponential function. Now, you've all heard how the easy problems in computer science are polynomial and the hard ones are exponential, right? We in machine learning laugh at those people who say NP-hard problems are hard. We're not just dealing with an exponential, we're dealing with a double exponential, right? When you're writing a program, if you have four inputs, you have to choose among 16 things. If you're trying to learn the function, you have to deal with two to the two to the four. It's an exponentially harder problem. So yes, there's a reason machine learning wasn't cracked in the early days of computer science. This is the problem that we're faced with: this double exponential. Even if your number of dimensions was small, this would be really tough. Imagine now when your number of dimensions, instead of being four, is, say, a thousand. Now you have two to the two to the thousand. The total number of subatomic particles in the universe is an infinitesimal thing compared to that, okay? So the bottom line is that brute force is never going to work, and this is why generalization is important: because the part that you've seen, compared to what you're going to see tomorrow, to what you could see, is infinitesimal. Questions? These are simple things, but this is really what underlies all the algorithms that we're going to be dealing with. This is the nature of the problem, right?
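The counting argument is easy to check directly; a quick sketch:

```python
# Over n Boolean variables there are 2^n possible inputs (states),
# and each state can be labeled true or false independently, so there
# are 2^(2^n) possible Boolean functions: a double exponential.
def n_states(n):
    return 2 ** n

def n_functions(n):
    return 2 ** n_states(n)

print(n_states(4), n_functions(4))   # 16 states, 65536 functions
print(len(str(n_functions(10))))     # number of digits for n = 10
```

Already at n = 10 the number of candidate functions has over 300 decimal digits, which is why enumerating them all is hopeless outside toy examples like this one.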
This is also, in many ways, very different from other things in computer science. Okay, so let's do the following. The space is too large, right? So let us just pick a language, a type of rule, and then, in this case, try them all and see what happens. Again, this problem is small enough that we can try that. We're not going to try all possible Boolean functions; even that is a lot, although in this case we could. To make this half realistic, let's pick a restricted class of rules and use those, okay? So let's take your suggestion and just use if-then rules of the form: if x-this and x-this and x-this, then the class. So, for example, this first rule says: if nothing, then the class, meaning that you don't test anything, every example is true, right? And now what we're going to do is what's called generate and test. We're going to create all possible rules of this form, and we're going to test every one against the training set. If there's a counter-example, we know it can't be the right rule, and we throw it away. And we're going to do this until we find one rule that really does hold for every example in the dataset, and then we can declare victory. So we do that. We try, for example: if x1 is true, then y is true, meaning that y is true exactly when x1 is true and false exactly when x1 is false. Is there a counter-example to that? Well, unfortunately, yes, there is. So this one fails. We try all of those; they fail. Now we try combinations of two. Let's say: if x1 and x4, then y. Well, there's also a counter-example to this one. And then we try combinations of three. All of those have counter-examples.
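The generate-and-test loop over conjunctive rules can be sketched like this. The data is again a hypothetical stand-in for the slide's, but the outcome matches the lecture: every conjunction of un-negated variables has a counter-example.

```python
from itertools import combinations

# Hypothetical training data: ((x1, x2, x3, x4), y)
train = [((0, 0, 1, 0), 0), ((0, 1, 0, 0), 0), ((1, 0, 0, 1), 1),
         ((1, 1, 1, 0), 0), ((1, 0, 1, 1), 1), ((0, 1, 1, 1), 0),
         ((1, 0, 1, 0), 1)]

def conjunction(indices):
    """Rule: y = 1 iff every listed variable is 1."""
    return lambda x: int(all(x[i] == 1 for i in indices))

survivors = []
for k in range(0, 5):                     # rules testing 0 to 4 variables
    for idxs in combinations(range(4), k):
        rule = conjunction(idxs)
        if all(rule(x) == y for x, y in train):   # no counter-example?
            survivors.append(idxs)

print(survivors)  # empty: this hypothesis class is too small
```

Note that k = 0 is the "if nothing, then the class" rule, which predicts true on every example and fails on the first negative one.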
Finally, we try all four of them: if x1 and x2 and x3 and x4, then y. There's also a counter-example to this, okay? Moral of the story: I picked the wrong hypothesis class. My hypothesis class was nice because it was small. On the other hand, it was so small that it didn't have the right answer. So this class doesn't work. Well, let's try another. Here's a very popular and useful class. People use it a lot in medical diagnosis, for example; real human doctors use rules of this type. These are the so-called M-of-N rules. An M-of-N rule says: there are these five symptoms, and if you have at least three of them, then you have pneumonia, okay? Because medical diagnosis isn't cut and dried, right? What doctors do is weigh a balance of evidence. There are these things that are indicators of pneumonia. Often you don't have all of them, but if you have at least three of them, then you probably do have it, okay? And the same thing can be used for lots of other problems. These rules also have the advantage that they're simple, and they're more powerful than simple if-then rules that just test the values of specific variables. So let's try those M-of-N rules. First of all, let's try the rule "one of X1." With a single variable you can only do one-of, of course. Is there a counter-example? Yes, there is. To one-of for every single variable, there's a counter-example, so all of these fail. Now let's try one out of two: one of X1, X2, meaning that Y is true if either X1 is true or X2 is true or both, okay? Again, these all fail; there are counter-examples to all of them. Now let's try two out of two, which is actually just a conjunction. Those we already saw; they all fail. But now let's try, for example, one of three. This is new: Y is true whenever one of these three is true. We try those; they all fail. Now let's try two of three.
This is getting more interesting: I pick three variables, and if at least two of those three are true, then Y is true. For example, if X1 and X2 are true, then Y is true, or X1 and X3, or X2 and X3, et cetera. Well, lo and behold, one of these actually succeeds. There is no counter-example to it in the data. And now we can try all the others: three out of three, those fail, and again, that's basically a case that we had before. One, two, or three out of four, those also fail. Okay, so at this point we can declare victory, right? We postulated that this was an M-of-N rule, and there's only one that works, so that has to be the answer. So at this point we can declare victory and go home. But maybe you should be a little suspicious. What we just did here in microcosm is actually what people do when they run machine learning algorithms on stuff, with many more subtleties that we'll see later, but at heart, this is what's going on. And what can we learn from this? Let's take a step back and look at what happened here, and at what is maybe not quite right and should be done differently or better. One way to look at what just happened is the following. In the beginning, I have complete ignorance: I don't know anything about my problem. And then you give me a data set. Well, this is good to know, right? It's real information about the domain. But as we just saw, that data set doesn't tell me anything about examples different from the ones that I've seen. So now I need one more thing. I need a piece of knowledge. Suppose that we really knew, for real, that the concept was an M-of-N rule. Then we could run the procedure that we just did, and we would have gotten the right answer. So what happened here is that each of these things removed some of our uncertainty.
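The same generate-and-test loop over M-of-N rules might look like this. Two hedges: the training values are hypothetical stand-ins, and I allow literals to be negated (the lecture's spoken examples use only un-negated variables), so the surviving rule below, "2 of {x1, not x2, x4}," is illustrative rather than the slide's actual answer:

```python
from itertools import combinations

# Hypothetical training data: ((x1, x2, x3, x4), y)
train = [((0, 0, 1, 0), 0), ((0, 1, 0, 0), 0), ((1, 0, 0, 1), 1),
         ((1, 1, 1, 0), 0), ((1, 0, 1, 1), 1), ((0, 1, 1, 1), 0),
         ((1, 0, 1, 0), 1)]

# A literal is (variable index, required value); value 0 means negated.
literals = [(i, v) for i in range(4) for v in (0, 1)]

def m_of_n(m, lits):
    """Rule: y = 1 iff at least m of the literals are satisfied."""
    return lambda x: int(sum(x[i] == v for i, v in lits) >= m)

survivors = []
for n in range(1, 5):
    for lits in combinations(literals, n):
        if len({i for i, _ in lits}) < n:   # skip using x_i twice
            continue
        for m in range(1, n + 1):
            rule = m_of_n(m, lits)
            if all(rule(x) == y for x, y in train):
                survivors.append((m, lits))

print(survivors)  # includes 2-of-{x1, not x2, x4} on this data
```

Exactly as in the lecture, the search only becomes feasible because four Boolean variables give a tiny rule space; the loop structure, generate every rule and keep the ones with no counter-example, is the part that generalizes.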
Having my data removed some of my uncertainty. But as we saw, it usually removes only a tiny fraction of the uncertainty, because most of the examples remain unseen. And then I bring in my assumption: I assume, or I believe, or I know that this is an M-of-N rule. That removes a lot of uncertainty, right? It excludes all the other possible functions that the concept could be. So that really reduces your uncertainty a lot. It carries a lot of information, but not complete information, because there's still a whole bunch of M-of-N rules, and you don't know which one it is. But now, when you combine that assumption with the data, on a good day, lo and behold, the right answer comes out, okay? So one way of looking at this is to say: what is learning? It's removing your uncertainty. The data itself is not enough. Say you start with the knowledge that it's an M-of-N rule; that by itself is not enough either. But that, plus the data, on a good day, will remove your uncertainty, and then you'll be left with the right concept. This is one way to look at learning. Another way to look at it is the following. Well, gee, you really got lucky here, right? We have this random Boolean data set, we didn't know most of the examples, and we postulated that this is an M-of-N rule. Great, right? What are the chances of that actually being the case? Well, notice this loaded word here: learning requires guessing a good, small hypothesis class. There is always guesswork involved in learning. Some people are not comfortable with that. They like the certainty of just building a deterministic program and knowing that it does what it needs to do. To do machine learning, you have to be comfortable with the notion that it's guesswork and you're not always going to be right. If in your application you always have to be right, it's unlikely that machine learning is a good solution.
On the other hand, if your application is a search engine and people will keep using you even if half the stuff that you return is crap, this could be a good problem for machine learning, okay? This also means that your ability to make a good guess is very important. If I can guess a small class that actually contains the right answer, then the data will do the rest of the job. So one way to look at learning is that it's like Clint Eastwood, right? You know, the Dirty Harry movies, where he goes up to the guy and says, go ahead, make my day. He's been chasing this guy throughout this warehouse, they've been shooting at each other, and finally the guy is down on the floor, his gun is a yard away, and he's trying to decide whether to lunge for the gun or not. And Clint Eastwood comes up to him, points his gun at him, and says, well, do you feel lucky today or don't you? Because if he feels lucky, he can lunge for the gun: maybe Clint Eastwood has run out of bullets; nobody's been counting how many bullets are left. So if he feels lucky, maybe he can kill Dirty Harry. On the other hand, if Eastwood does still have bullets, then he gets shot. In the movie, this being Hollywood, of course, the guy lunges for the gun, Clint Eastwood still has a bullet, and kills him. But the key thing is that there's an element of gambling. It's informed gambling, right? It's not completely random, but you have to do it. That element is always present in machine learning. It would be nice to have certainty, but if you think about it, if you already knew the domain with certainty, well, then you wouldn't need data, right? You'd just program it, okay?
So when you're doing machine learning, it's almost always the case that you know things about the domain: you suspect that maybe this is the case, maybe that is not, and then you have a lot of data. But at the end of the day, you need an element of luck. When we do the theory part of this class, you'll actually see that just because you need luck doesn't mean you have to be completely in the dark. You can play the odds. You can be smart about how likely you are to be right and whatnot. For the moment, we'll pursue a more empirical approach to this, and the theory also has its limitations, but nevertheless, you're not completely taking a shot in the dark. However, the main lesson is: you're going to have to make some guesses, and your ability to make those guesses well is going to be crucial to success. This is part of why machine learning is fun, because machine learning is not just turning the crank on a large dataset. It's using your understanding, your instincts, your intuitions to try to find an answer. This is a very important component. Quick question, real quick question on that. I'm going to try to phrase this so it makes sense: the fact that you need this kind of domain knowledge to guess the hypothesis class, does that mean that we should have some skepticism about whether machines can ever initiate machine learning on their own? Like, the human gardener will always have to be there at the end of the day? Well, this is a great question, right? So notice the following: in machine learning as practiced today, the humans are indeed there doing some of the work. The machines have not taken over yet, right? Ray Kurzweil says that that's going to happen soon, but most people don't believe it. So yes, we are bringing something to bear. But just think about the following for a second.
Evolution, which produced us, is a learning process, right? Evolution started out knowing nothing. We were all amoebas, or we were all bacteria or whatever, right? So somehow that learning process actually produced us. Evolution is a big data problem: the big data is all these creatures exploring the world. So there has to be something, some regularity in the world, that we can exploit, and what you want to do is hopefully encode that into your algorithm so that you don't have to put it in there yourself. The other side of this is that, well, you could be wrong, right? You could be wrong even if things seem to work out. Look at this here: hey, I postulated that it was an M-of-N concept and I found only one that fit. Does that mean I should go home happy? Not really, because if you think about it, there are other things that would work here. So for example, and you can check this yourself at home, the rule "X1 and not X2 implies Y" also works. So which one is it? If this is the true one and we guessed the M-of-N one, we'll get our predictions wrong. So we could be wrong. In some cases, you can have high confidence that you were right, but you can never be sure that you were right, okay? You could always have been wrong, and you have to be comfortable with that. However, notice the following: there's a trade-off in the size of the hypothesis class. The smaller I make the hypothesis class, the more information is encoded in it, the lower the burden the data has to bear, and the easier the learning problem becomes. On the other hand, the more things I throw out, the less likely it is that the right hypothesis will actually be in that class. So in a way, you want to make the hypothesis class very small to make the learning easy.
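The point that consistency with the training data doesn't settle the question can be shown directly. On the same hypothetical stand-in data used above, both "x1 and not x2" and "2 of {x1, not x2, x4}" fit every training example, yet they disagree on unseen inputs:

```python
from itertools import product

# Hypothetical training data: ((x1, x2, x3, x4), y)
train = [((0, 0, 1, 0), 0), ((0, 1, 0, 0), 0), ((1, 0, 0, 1), 1),
         ((1, 1, 1, 0), 0), ((1, 0, 1, 1), 1), ((0, 1, 1, 1), 0),
         ((1, 0, 1, 0), 1)]

h1 = lambda x: int(x[0] == 1 and x[1] == 0)         # x1 and not x2
h2 = lambda x: int(x[0] + (1 - x[1]) + x[3] >= 2)   # 2 of {x1, not x2, x4}

# Both hypotheses are consistent with every training example...
consistent = all(h1(x) == y and h2(x) == y for x, y in train)

# ...but they disagree on inputs we haven't seen.
seen = {x for x, _ in train}
disagreements = [x for x in product([0, 1], repeat=4)
                 if x not in seen and h1(x) != h2(x)]
print(consistent, disagreements)
```

If the true concept is h1 and we committed to h2, every input in `disagreements` gets misclassified, and nothing in the training data could have warned us.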
On the other hand, you want to make it very large to make sure that the right answer is in there, and there is an inevitable tension between these two things, okay? In practice, how do people deal with this tension? By trying successively larger sets of hypotheses. I start out trying something very small. If that works, things are looking pretty good. If that doesn't work, then I try a larger set, and a larger one, just like we did with if-then rules and then M-of-N concepts, okay? The larger the set I had to try, the less sure I'm going to be that I found the right answer, because in a large enough class, something will fit the data just by chance. Nevertheless, it's very good to have that ability to try successively larger hypothesis sets. And this is exactly what many algorithms do. In particular, the decision tree learners work that way. As we will see next week, they start out trying very small hypotheses, very small trees. If those don't fit the data well, then they try larger trees, and larger and larger, until they either fit the data well or they decide to stop early to avoid overfitting. Questions about this? Okay. So, out of these two observations come two strategies for machine learning, and we're going to see them in action in the coming weeks in a bunch of different ways. One of them is: since it's so important to express and use prior knowledge to make our learning easy, one of the things that we should have is languages in which it's easy to express the kind of prior knowledge that we have. And in fact, for example, something that is popular in some quarters is: I'm only going to allow rules with a particular syntax in the antecedents. I'm not going to allow arbitrary antecedents; I'm going to allow antecedents that obey these restrictions. For example, I might know that certain variables cannot appear negated, because they're positive indicators of the class. This is a common thing, right? These are useful restrictions to have.
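The grow-the-class-as-needed strategy can be sketched as a search that tries small rule classes first and stops at the first size that yields a consistent hypothesis. The data is hypothetical as before, and conjunctions of possibly negated literals stand in for the successively larger classes:

```python
from itertools import combinations

# Hypothetical training data: ((x1, x2, x3, x4), y)
train = [((0, 0, 1, 0), 0), ((0, 1, 0, 0), 0), ((1, 0, 0, 1), 1),
         ((1, 1, 1, 0), 0), ((1, 0, 1, 1), 1), ((0, 1, 1, 1), 0),
         ((1, 0, 1, 0), 1)]

# A literal is (variable index, required value); value 0 means negated.
literals = [(i, v) for i in range(4) for v in (0, 1)]

def fits(lits):
    """Does the conjunction of these literals match every example?"""
    rule = lambda x: int(all(x[i] == v for i, v in lits))
    return all(rule(x) == y for x, y in train)

found = None
for size in range(1, 5):                      # smallest class first
    consistent = [lits for lits in combinations(literals, size)
                  if len({i for i, _ in lits}) == size and fits(lits)]
    if consistent:
        found = (size, consistent)
        break                                 # stop as soon as something fits

print(found)  # on this data: size 2, including (x1 and not x2)
```

Stopping at the smallest class that fits is the same Occam-style preference that decision tree learners apply when they grow a tree only as deep as the data demands.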
We want to have languages in which it's easy to say those things. And in fact, one of the main criteria for choosing a representation is which representation makes it easy to encode the knowledge that you have. I've never seen this in the textbooks, unfortunately, but it's actually crucial. Part of the choice of, say, whether to use a set of rules or instance-based learning is the stuff that you know. Is it easy to write it down as a set of rules? Or is it easier to say what makes two examples similar? If writing down rules for diagnosis is natural — like, if you have this temperature and this x-ray, then you have TB — then use rule sets. If it's easier to say, well, I know how to compare patients, I can compute the Hamming distance between them — this is of course a naive example — then maybe you want to use instance-based learning, okay? So the choice of language should be very much influenced by what kind of knowledge you have and which language makes that knowledge easy to apply. You shouldn't, like many people, have this one language that you always use — "I just always use decision trees." Well, no, sometimes that will do the wrong thing, okay? So this is one notion. The other one, which we just touched on, is that you want to have flexible hypothesis spaces. We're going to look at some simple machine learning algorithms, like Naive Bayes, that have a fixed hypothesis space. Those are nice and simple, they're often very efficient, and when they work, life is good. But a lot of the time, maybe even most of the time, that is not how you get the best results. And then you want algorithms, like, say, decision tree learners, that are flexible in the hypothesis space they consider; in particular, they can start out with a small hypothesis space and only extend it as needed, okay?
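As a toy illustration of the instance-based side of this choice, here is a minimal Python sketch (with made-up patient feature vectors and labels) of classifying by Hamming distance to stored examples:

```python
# Minimal instance-based (lazy) learner sketch: classify a new case by
# Hamming distance to stored examples. The data here is invented.

def hamming(a, b):
    """Number of positions where two equal-length binary vectors differ."""
    return sum(ai != bi for ai, bi in zip(a, b))

def nearest_neighbor(query, examples):
    """examples: list of (feature_tuple, label). Returns the closest label."""
    return min(examples, key=lambda ex: hamming(query, ex[0]))[1]

# Hypothetical patients: (fever, cough, positive_xray, night_sweats) -> label
patients = [((1, 0, 1, 1), "TB"),
            ((0, 0, 1, 0), "healthy"),
            ((1, 1, 1, 1), "TB")]

print(nearest_neighbor((1, 0, 1, 0), patients))  # -> TB
```

Notice there is no training step at all: all the work happens at query time, which is exactly the "lazy" timing discussed later in the lecture.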
With either of these strategies, at the end of the day, you still need to develop algorithms that will find the hypothesis in that space, using that language, that fits the data, okay? Before we move on, let us get down some terminology that is going to be with us for most of this class. We've already touched on a bunch of these things, but let's get it all defined clearly; it's definitely worth that investment. First of all, a training example is one of the examples that we're going to learn from. It's a pair of an example and its class, okay? It's sometimes called a training instance, and statisticians like to call it a sample. A set of training examples is often called the training set, so there are different names for this. The target function is the true function that you're trying to learn. If the function is Boolean, we often call it a concept. For example, I'm trying to learn the concept of chair, and I have examples of chairs and, say, examples of other pieces of furniture. In that case, the function is a concept. A hypothesis is a candidate function. The true function is only accessible to the gods; we don't know it, we just have examples. What we're going to be generating is hypotheses. Of course, our goal is to generate a hypothesis that is as close to the true function as possible. If the hypothesis is Boolean, as I just said, we call it a concept, and the examples for which the hypothesis is true are called the positive examples. So if I'm trying to learn the concept of a chair, a picture of a chair is a positive example. A picture of a table, on the other hand, is a negative example. We're going to be using this terminology a lot: in our little examples, we're going to have positive and negative examples of a concept, and we're going to see how we learn to separate them.
The function that we learn, if we're in a discrete problem — which, as I said, we're going to be most of the time — is called the classifier. So the classifier is actually what the learner outputs. Remember, a learning program is a program whose output is a program; it's a meta-program, in some sense. And that output program is called the classifier. A classifier is something that takes an input and produces a class. People often abuse language by calling the learner a classifier, so you have to be a little bit careful when you hear "classifier": sometimes what people really mean is the learner that learns the classifier. Strictly speaking, the classifier is the function that comes out. Once you have the classifier, you can throw away the learning algorithm and just keep that thing that you put on the machine to tell it what to do. That's the classifier. The values that the classifier outputs are called the classes. For example, I might be trying to classify text into topics, in which case the topics are the classes. They're also often called labels or class labels. My goal is to label my objects: oh yes, this is a chair, this is a table, this is a dog, etc. The hypothesis space is the set of all hypotheses that can be output by your algorithm. If something is outside the hypothesis space, then your algorithm can't output it, no matter what it does. If all you have is a decision tree learner, it can only output a decision tree; it cannot output a neural network. The version space is a more subtle but very interesting notion. The version space is a subset of the hypothesis space: it's the subset of the hypotheses that are consistent with the data that I've seen. Like we did in our little M-of-N example: I start out with all my hypotheses. Before I've seen any examples, it could be any M-of-N rule. But then, as I look at each example, some of those hypotheses get thrown out.
They don't agree with the example, so they can't be the right one. The version space is the set of hypotheses that remain after I've seen the examples. When I see more examples, when I increase the size of my training set, the version space can only shrink. Once something is out, it's out forever. On a bad day, what happens is that my version space shrinks to zero, like it did in our earlier example. What do you know in that case, when your version space shrinks to zero? Your hypothesis space was wrong; you need to try another one. Another thing that can happen is that more than one hypothesis remains. In our case, there was only one M-of-N concept, but there might have been a bunch. In that case, you have a number of options. You can return all of them. You can pick one of them at random. Probably the best option is to average them: you take all of them, you let them vote, and the majority wins. Could you explain the classifier once more? For example, a spam filter is a classifier. It's a program that takes in an email and outputs spam or not spam. A learner is, say, Naive Bayes: it's the system that produces the classifier. A learner takes as input a bunch of emails labeled yes spam, no spam, and outputs a classifier. A classifier is what you then apply to new inputs. A classifier is what, at the end of the day, replaces the human: it replaces the credit evaluator in making the decision, or replaces the doctor, or helps the doctor to diagnose. So are a classifier and a machine learning system really truly distinct? Aren't there cases where the two blur together? They are different. The one case where the boundary becomes fuzzy is instance-based learning, because there your learning just consists of remembering the instances. But even then you can, and should, make a distinction between the classifier and the learner. The classifier is what takes your new instance, finds the closest stored instance, and outputs its value.
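The shrink-and-vote idea can be sketched in a few lines of Python. Everything here is a toy: three Boolean features, the full M-of-N hypothesis space enumerated by brute force, made-up training data, and a majority vote among the surviving hypotheses.

```python
from itertools import combinations

# Version space sketch over M-of-N hypotheses on 3 Boolean features.
# Hypotheses inconsistent with the (invented) training data get thrown
# out; the survivors vote on a new query point.

FEATURES = [0, 1, 2]  # indices of X1, X2, X3

def all_m_of_n():
    """Every 'M of some subset of the features' hypothesis."""
    for n in range(1, len(FEATURES) + 1):
        for subset in combinations(FEATURES, n):
            for m in range(1, n + 1):
                yield (m, subset)

def predict(h, x):
    m, subset = h
    return sum(x[i] for i in subset) >= m

training = [((1, 1, 0), True), ((0, 1, 0), False), ((0, 0, 1), False)]

# The version space: hypotheses consistent with every training example.
version_space = [h for h in all_m_of_n()
                 if all(predict(h, x) == y for x, y in training)]

query = (1, 0, 1)
votes = sum(predict(h, query) for h in version_space)
print(len(version_space), votes > len(version_space) / 2)  # -> 3 True
```

Adding more training examples can only remove entries from `version_space`, never add them, which is the shrinking behavior described above.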
The learner, in that sense, is in a way not doing anything. But the distinction between the learner and the classifier is the same as the distinction between a program and its output. The program and its output are not the same. In machine learning, the program is the learner and the output is the classifier. The reason it's easy to get the two confused is, first of all, that people abuse language and call learners classifiers. The other reason is that it's easy to forget that a learner is a program that produces a program, and then you confuse those two programs — but they are different. Remember, the classifier is just one point in a space that's exponential in size, but the learner is solving a problem in a doubly exponential space. So it's a much harder problem. More questions? Okay. So, again, before we start looking at specific representations, algorithms and so on, let's look at some of the key issues that we're going to be thinking about as we look at all of these different technologies. They are very much the same issues that recur. Some things are technology-specific, but many of them are common to all of them. It's very important to understand what they are, because no matter what algorithm you're using, they will arise. And also, this is the kind of knowledge that will still be applicable even if tomorrow you're using an algorithm that you've never seen before. So the first, and very obvious, one is: what are good hypothesis spaces? This is the first decision that you have to make. How do you pick a hypothesis space? How should you do that? Well, one thing to do is to learn from the past. What hypothesis spaces have people used in the past that worked for them? Why do we use decision trees today? Because they've worked very well for a lot of problems in the past, so with any luck they might work for your problem as well, okay?
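The program-that-outputs-a-program distinction is easy to see in code. This is a deliberately trivial sketch: the "learner" below just learns the majority class, but the shape — learner in, classifier out — is the point.

```python
# The learner/classifier distinction: the learner is a function whose
# return value is itself a function (the classifier). Toy majority-class
# learner, purely for illustration.

def learn(training_examples):
    """Learner: runs once over labeled data, returns a classifier."""
    labels = [y for _, y in training_examples]
    majority = max(set(labels), key=labels.count)

    def classify(x):
        """Classifier: the 'program output by the program'."""
        return majority

    return classify

spam_filter = learn([("mail1", "spam"), ("mail2", "spam"), ("mail3", "ham")])
print(spam_filter("some new mail"))  # -> spam
```

Once `spam_filter` exists, `learn` can be thrown away entirely, exactly as described above: you deploy only the classifier.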
Are there general principles that we can come up with for designing and choosing hypothesis spaces? There are some, and we'll look at them. At the end of the day, however, there's a lot of black magic — art, heuristics, intuition, whatever you want to call it. It's not a pure science; there's an art side to this. The second question, of course, is: given these spaces — say, decision trees, which are very widely used — if that is the hypothesis space that I'm going to use, what algorithms work well with that space? And it's not the case that one algorithm works well with all spaces. For different spaces you will have to pick different algorithms, and we would also like to come up with not just specific algorithms but also general design principles for matching algorithms to spaces. The next question is, in some ways, the quintessential question of machine learning: what can I do to optimize my accuracy on the points that I haven't seen? Machine learning is very different from really any other field that I know in this respect. Lots of things are optimization: operations research is optimization; search algorithms in computer science and AI are optimization. But in all of those cases, I actually know what I'm optimizing. The problem is that there's a large space to search, and searching efficiently is difficult, but I know that if I were able to perfectly optimize that function, I would have my answer by definition. In machine learning it's different. My objective function is only an approximation to what I want. For example, accuracy on my training data is not what I really want. What I really want is accuracy on the data that I haven't seen. So in machine learning I'm actually optimizing a function that I don't know. And that's what makes it tricky.
What I'm going to have to do, of course, is optimize some other function that's a surrogate for the function that I don't know. And now the question becomes how to do that. There's a lot of interesting theory about how to do it, there are also some very simple heuristics that we can use, and of course we will see some of both. The next question is what you might call the statistical question. This is what academic statisticians worry about for a living, and practical statisticians and practical machine learning people also need to worry about it if they want to be believed. It's this: okay, you've just produced a great spam filter. You say it's great. How can I be sure that it's great? If you tell me that it's great because it fits all the training data perfectly, I'm going to be very suspicious. In fact, if you're ever in the position of hiring somebody else to give you a classifier — if you're the boss, or the client — the first thing that you should do is take some of the data and keep it to yourself. Let's say you're hiring somebody to design a spam classifier for you. Don't give them all the examples of spam and not spam; keep some to yourself. Then, if what they've done is overfit the data that you gave them, when you try it on your data, you'll see that it's a failure. So when they're doing their learning, they have to worry about how well their learning is going to do on the data that you have and they haven't seen. Okay? And now the question for them — and also for you, once you've finally settled on a classifier and are going to deploy it in the real world — is: how can you be confident that this thing is still going to be accurate on future data? Can you have measures of that uncertainty?
And can you minimize that uncertainty? There's a lot of good stuff from statistics that we can apply here, okay? The next question is what you might call the computational question, and this is the usual question. Sure, let's say you have a very good method for learning things and you can be very confident of the results. If it takes exponential time in the size of the data, we all know we can't use it. And unfortunately, we're going to run into things where the naive approach would take exponential time in the size of the data almost all the time. There are some algorithms, like, say, Naive Bayes, that don't, and that's a large part of the reason why they're used a lot. But for a lot of the more interesting algorithms, the NP-hardness really comes back to bite you, okay? So now the question becomes: how can we design an algorithm that maybe sacrifices some guarantees in order to actually return something in practice? This is the computational question. It's not enough to have something that's statistically sound; it also has to scale. On the other hand, it's not enough to have something that scales, either. I unfortunately see people out there in the real world doing a lot of stuff with Hadoop and big data clusters, mining one dataset after another and returning a pile of junk. There's no point in having something that scales if what it does is return you a pile of junk. So you really have to worry about both the statistical and the computational question. Finally, you have to worry about what you might call the engineering question. The engineering question is the following. We've just defined this problem, inductive learning: I want to learn a classifier from training examples. Most of the time, the problem that you have to solve in the real world does not come cleanly expressed in those terms. You have to figure out a way to turn it into such a thing.
These days, everybody uses classifiers for spam filtering, but it was once just an idea in somebody's mind — in particular, some guys at Microsoft Research who thought: hey, we can turn the problem of spam filtering into a classification problem and apply machine learning techniques. So this process of figuring out how to turn your problem into a classification problem, into a supervised learning problem, can actually be a very subtle one that requires a lot of insight and creativity on your part. In some cases, it's very straightforward. In most cases, what happens is that once somebody figures out that it can be done, a whole industry develops around it; but this can itself be a very hard problem, and obviously it's important, because if you can't do this part, then you can't do anything. Also, what can often happen is that you do this part, but in doing it, you shoehorn the problem into a form that neglects a lot of its important aspects, and then you fail because of that. In fact, this is what happens a lot with spam filters. You shoehorn your spam filtering problem into building a Naive Bayes classifier, and then guess what the spammers do: they immediately set about exploiting the assumptions that your Naive Bayes classifier made, and making sure they can defeat it that way. So this part is also very, very important. Questions? Yeah. So how about unlearning as you proceed? Where does that come in? Like, what if you have learned something but it's no longer true? Yeah, exactly. We're actually going to touch on that in just a little bit. To jump ahead a little: there are two types of learning that you can do. There's what's called batch learning, where you assume that you're given all your data — all the data you're ever going to see — and you learn on that.
And then there's online learning, where your learner is always deployed and you're changing the learner as the data changes. In the early days of data mining, people did mostly batch learning. You would do data mining projects like: you would come to me with data, I would spend six months on it, and then I would give you a classifier. These days, more and more, what you have is online algorithms. If you look at companies like Google and Amazon and Facebook and whatnot, they have online algorithms; they're always learning. Every time you go to Amazon and do something, it updates your model. It's more challenging, but at the end of the day, it's what you want to do most of the time. We're going to focus mostly on batch learning because it's simpler and easier, but usually, for every algorithm, there's a batch version (or several) and an online version (or several). Good question. More questions? A question about terminology: a hypothesis would be something like the M-of-N example that you presented? Yeah. The hypothesis space is the set of hypotheses that you allow. The hypothesis space is the set of M-of-N concepts; a particular M-of-N concept is something like two out of X1, X2, and X3. If-then rules are a hypothesis space; the rule "if this email contains the word Viagra and three exclamation marks, it's spam" is a particular hypothesis. So we choose a set of hypotheses, and then the learner produces one specific hypothesis, or a small number of them. And generally it's tied to one representation — we decide the hypothesis space, and that maps to, say, a decision tree? Yeah, exactly. In a typical machine learning algorithm, you pick the hypothesis space and say, well, now I'm going to use decision trees, so let's use a learner for decision trees. In the early days, you would have these single-algorithm systems.
These days, what any good system has — like the ones from companies such as IBM and Microsoft — is a whole suite of these algorithms. And you could even have a meta-level procedure to try to figure out: should this be a decision tree, or should it be a neural network, or should it be a decision tree with neural networks hanging from the leaves? That tends to get very expensive and to risk a lot of overfitting, but it's possible as well. More questions? Okay. So, a few more things. As we look at hypothesis spaces — and we're going to go through a bunch of them in this class — here are a few dimensions that we want to consider, because they're important; many things differ depending on these dimensions. The first one, which we've already talked about a bunch, is size. Does it have a fixed size, like Naive Bayes, or does it have a variable size, like decision trees? Learning is going to be quite different in these two cases. Another very important one is randomness. Is my hypothesis space deterministic or stochastic? Deterministic is: I tell you that this is spam. Stochastic is: well, maybe this is spam with probability 0.75. Again, these will imply very different things. And then there's the question of whether your parameters are discrete or continuous. Am I just making discrete choices, like in a decision tree — what should be in my nodes and whatnot? Or do I have continuous parameters to adjust, as in a neural network, where I'm basically adjusting the weights on the neurons and there's a large number of real numbers that I need to optimize? Or do I have a combination of both? In many algorithms I can actually have a combination of both. For example, in graphical models, what I have is a graph, which is a discrete structure that says what the dependencies are, but then I need actual tables of parameters to say exactly what the probabilities are.
And finally, as we look at our algorithms, there are a number of key dimensions along which they differ, beyond the hypothesis-space dimensions we were just talking about. Again, we're not going to explore this space comprehensively, because it would take too long, but it's good to be aware of what it is. And also, when you come upon a new learning algorithm, the first mental exercise you should do is see where it fits along these dimensions; once you know that, you already know a lot about how it's going to behave. So the first one is the search procedure. The simplest kind of search procedure you could have is just direct computation. For example, with something like Naive Bayes, you don't need to do search; you just compute the parameters in closed form. If you want to run things on very, very large datasets, Naive Bayes is great: you just need to run through the data once, updating your parameters, and you're done. I'm told that in the early days at Google, Naive Bayes was very widely used, and a large part of the reason is exactly that it's very scalable — there's no search. These days I don't know; I suspect that's not the case anymore. But nonetheless, direct computation is great when you can have it. The price is that your learner is probably going to have to be very limited. If direct computation doesn't do the job, then one of the things you can try next is local search. Local search is what happens when you start with a complete hypothesis and then you modify it. For example, when learning a neural network, you usually start with a complete neural network. It's like your brain at birth: your brain is already there; the problem now is that you need to vary the weights. So you take little steps in weight space and hopefully end up with something that fits the data very well. Another approach is what's called constructive search.
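Here is what "direct computation, no search" looks like for a Bernoulli-style Naive Bayes: one pass over the data, counting, then closed-form parameters. The tiny dataset and the Laplace smoothing constant are illustrative choices, not anything specific from the lecture.

```python
from collections import defaultdict

# One-pass, closed-form parameter estimation for Bernoulli Naive Bayes:
# no search, just counting. Data and smoothing are illustrative.

def train_naive_bayes(examples, n_features, alpha=1.0):
    """examples: list of (binary feature tuple, class). Single pass."""
    class_counts = defaultdict(int)
    feature_counts = defaultdict(lambda: [0.0] * n_features)
    for x, y in examples:                    # the one pass over the data
        class_counts[y] += 1
        for i, xi in enumerate(x):
            feature_counts[y][i] += xi
    total = sum(class_counts.values())
    priors = {y: c / total for y, c in class_counts.items()}
    # P(feature_i = 1 | class), with Laplace smoothing
    likelihoods = {y: [(feature_counts[y][i] + alpha)
                       / (class_counts[y] + 2 * alpha)
                       for i in range(n_features)]
                   for y in class_counts}
    return priors, likelihoods

data = [((1, 1), "spam"), ((1, 0), "spam"), ((0, 0), "ham"), ((0, 1), "ham")]
priors, likelihoods = train_naive_bayes(data, n_features=2)
print(priors["spam"], likelihoods["spam"][0])  # 0.5 0.75
```

Every parameter falls out of the counts directly, which is why a learner like this scales to one linear scan over arbitrarily large datasets.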
This is where you start out with one small piece, not a complete solution, and then you assemble the solution piece by piece. This, for example, is what you typically do when you're learning decision trees. You start out with a small decision tree with just one node, then you attach one more node, and then another node to that one, and you keep doing this until hopefully you have a good model. Timing. We've touched briefly on this, but it's another important dimension. You could be eager or lazy, and these are very expressive terms. Eager is like the good student: you don't leave for tomorrow what you could do today. You go home, you study the whole book, and the following day you already know all the material. This is eager learning: you do a lot of work at learning time, and then at test time, when the exam comes, you ace it — you can do it with your eyes closed. A lot of learning algorithms, probably most learning algorithms, are eager. But there's also the lazy option, which your mother told you was a bad one — but actually, procrastination can be a great strategy. The lazy strategy is: nah, I'm not going to do anything now. I'm going to relax, drink beer and watch TV. When the test comes, I'll just scramble; I'll improvise. Some people are good at that. And unlike in human affairs, in machine learning the lazy version of an algorithm is often more powerful than the eager one. The reason it's more powerful is that when you're doing eager learning, you're doing it ahead of time. You don't know what you're going to be applying it to yet, so it has to be good for all things, which is very hard. Lazy learning is like: oh, now I have this patient to diagnose, and I didn't really pay attention in med school. You know, like that movie about the guy who faked being a doctor? And he turned out to actually be a really good doctor, right? So he didn't need to go to Harvard.
He faked that he'd gone to Harvard, but then, when faced with the patient, he figures out: what does this guy have? What do I do? For humans, this doesn't sound like a good strategy, but for machine learning algorithms it can actually be good, because now I apply all my power to this particular case, and I don't have to be very good in general. So: lazy versus eager learning. Some representations naturally lend themselves to one or the other. Instance-based learning is the quintessential lazy algorithm: I just remember the examples, and when a new one comes along, I find similar ones. But there are really lazy and eager versions of any algorithm that you might want to think of — there are lazy versions of decision tree learning, for example. And a final dimension, which we already touched on, is online versus batch. In online learning, you only see one example at a time. Predicting the stock market is a quintessential online learning problem: I predict whether the stock is going to go up or down, and the next day, or the next minute, it goes up or down; then I incorporate that into my learning, and I keep learning as I see more things. Batch learning is: I've got a supercomputer, I've got a big data center, I'm going to throw a ton of cycles at this thing, and then I give you a product. For some things, that's appropriate. For example, if your domain is not changing a lot, and you only need so much data, batch learning is appropriate. Most of the time, however, at the end of the day, you want to do online learning. There's also a spectrum: you could do mini-batches, where you relearn, say, every day — this is common: every day you rerun your learning, rather than on every new example. So there are variations of this, but it's important to be aware of both. Mostly we'll look at batch learning, but it's important to realize that there are also online versions.
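The batch/online contrast can be shown in the simplest possible setting, estimating a mean: the batch version needs all the data up front, while the online version updates one example at a time and could run forever. (A toy sketch; real online learners update model parameters in the same one-example-at-a-time fashion.)

```python
# Batch vs online in the simplest setting: estimating a mean.

def batch_mean(data):
    """Batch: requires the whole dataset at once."""
    return sum(data) / len(data)

class OnlineMean:
    """Online: incorporate one new example at a time, as an always-deployed
    learner would; never needs to store the past examples."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def update(self, x):
        self.n += 1
        self.mean += (x - self.mean) / self.n  # incremental correction
        return self.mean

stream = [4.0, 8.0, 6.0, 2.0]
online = OnlineMean()
for x in stream:
    online.update(x)
print(batch_mean(stream), online.mean)  # 5.0 5.0
```

Both arrive at the same estimate here, but only the online version could keep absorbing tomorrow's examples without refitting from scratch.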
In a lot of domains these days, particularly things to do with the web and whatnot, online is more and more what people do. Okay? Questions? What do these batch problems look like? It seems that most problems are online. Well, actually, there are many, many cases where batch is the right thing. For example, take drug design: you give me a big batch of molecules and how they dock, and if I find one that inhibits the virus, I have a cure for the disease. So whenever the problem is not changing, and you only need so much data, there's no advantage in doing it online. Batch is more expensive, but you can do more things. The power of batch is that I can use the examples in any order; in online learning, of course, I can't use the examples that I haven't seen yet. So each one has its advantages and disadvantages. If you look at all the applications of machine learning in the world, I would say it's still the case that the great majority of them are batch. In the domain of things like the web and whatnot, more and more, online is the case. And I think what's going to happen in the future is that every organization is going to be doing online learning, because that's how you beat the competition: you react faster than they do. But even in science, it's very interesting: if you look at things like large sky surveys, or particle physics — particle accelerators — they generate gigabytes of data every second. In those cases, online learning might also make sense, because basically what you want to do is figure out very quickly how much of this information you can ignore, and then you just learn from the rest. More questions? So what is the need for the lazy version of an algorithm? There are at least two reasons for the lazy version.
Eager does more work at training time; lazy does more work at test time. So if you're in an application domain where you have a lot of time at test time but not a lot at training time, you're better off using lazy. But the other reason — the bigger reason, I would say — is this thing that lazy learning is actually more powerful. A lazy decision tree learner is more powerful than an eager decision tree learner. Here's a different example that may be easier to see. Let's say you're doing linear regression, the statistician's basic tool. You have a bunch of data points, and now you want to fit a line through them. If you do this eager, what it means is that you have to find one straight line that fits all your data. Well, on a good day, your phenomenon is mostly linear, but most of the time your data is going to be a curve, so your linear regression is not going to give good results. If you do it lazy, however, the following amazing thing happens. I have my query point now — how much money is this person going to spend? — and all I do is take a neighborhood of that person and fit a linear model to that neighborhood. So in each neighborhood I can have a different model, and my overall model can be extremely non-linear, meaning vastly more powerful than a single linear regression. So lazy learning can be a really smart thing to do. It's not a good idea when you need reactions every second: if you're doing high-frequency trading, forget about lazy learning. If you're learning how to control a robot, for example, lazy learning is not a good idea either, because that robot needs to decide what to do many times per second.
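That neighborhood idea is essentially local (lazy) linear regression. Here is a sketch with made-up data from a curve (y = x²), fitting an ordinary least-squares line only to the k points nearest the query, at prediction time:

```python
# Lazy linear regression sketch: instead of fitting one global line
# (eager), fit a least-squares line to the k points nearest the query,
# at query time. The dataset is an invented nonlinear curve.

def fit_line(points):
    """Ordinary least squares for y = a*x + b on (x, y) pairs."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

def lazy_predict(query_x, data, k=3):
    """All the work happens here, at query time: pick neighbors, fit, predict."""
    neighbors = sorted(data, key=lambda p: abs(p[0] - query_x))[:k]
    a, b = fit_line(neighbors)
    return a * query_x + b

data = [(x, x * x) for x in range(-5, 6)]  # globally very non-linear
print(lazy_predict(4.0, data))  # close to 4^2 = 16
```

A single global line through this parabola would be badly wrong almost everywhere; the local fit tracks the curve because each query gets its own little linear model.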
Actually, in many ways lazy learning is a very good thing for robots, because of the type of model and whatnot — but then what you have to do is maybe take your lazy model and find some way of making it run faster. And for each of these approaches, there are ways to try to ameliorate its disadvantages. More questions? All right, very good. So, to recap: you're going to get your first project next week. The biggest part of it is going to be to implement a decision tree learner and apply it to clickstream mining. It will also test some of the stuff that we're going to do the week after that, which is rule induction. Hopefully today you've gotten the big picture of machine learning: why it's important and why people are excited about it, what it consists of, what the main types of machine learning are, and then, in particular for supervised learning, what makes the problem hard, how we can go about solving it, and what the main dimensions are, okay? See you next week, and do keep in touch by email, using the forum, and so forth. All right, welcome back. Before we get started: somebody asked a very good question on the forum, which I don't know if you've all been following. They asked, what is the relationship between artificial intelligence, machine learning, and data mining? Very good question. And the answer is, number one, machine learning is a subfield of artificial intelligence. Artificial intelligence is concerned with getting machines to do all sorts of things that humans do and machines don't do very well — problem solving, knowledge representation, reasoning, planning, vision, natural language understanding, speech, multi-agent systems — and machine learning is one of these subfields.
These days it's probably the biggest of those subfields, and probably also the one with the most impact in industry, so in many ways it's become its own field, but it's really a sub-discipline of AI. As to the relationship between machine learning and data mining, it depends on who you ask. For many people, they're really about the same thing, and everything that we're going to cover in this class fits equally well under machine learning and data mining; you could really use either name. However, there are certain things that people do in machine learning, like, say, reinforcement learning, that you don't see a lot of in data mining, and conversely there's some stuff, like visualization or database techniques, that people sometimes call data mining but that we probably wouldn't call machine learning. The difference is really largely historical. Machine learning is the older term; the field has existed for 50 years. Data mining is a term that emerged in industry around the late 80s, early 90s. In fact, there are other terms, like predictive analytics and business intelligence and data science and whatnot, all with slightly different slants. I think for us, the main message is that there's no real usefulness in making a lot of distinctions between machine learning and data mining, and what we're going to be covering here is very much both. Number two, as you've hopefully seen, the first assignment is up. As advertised, it consists mainly of implementing a decision tree learner and then applying it to some real clickstream data from an e-commerce company. We await your insights with breathless excitement. That's most of it. Of course, this is a small-scale version of what you would be doing in the real world, but nevertheless it should be useful. There are also some questions about rule induction, which we're going to cover next week. Anybody have any questions or thoughts before we get into things? All right, let's get going then.
Today, we're going to talk about learning decision trees. Decision trees are, according to surveys, consistently the most widely used machine learning and data mining algorithm in practice. It's not hard to understand why that's the case. They have a rare combination of properties: they're fairly simple to understand and implement, their output is quite understandable, and they're not too hard to use; you don't have to tweak a lot of parameters before they return something useful. They're also pretty scalable. That combination of things is pretty attractive. For any given application, there might be another method that, if you really put your effort into it, will give you better results, and certainly there are many other methods that you can use. But decision trees are, I think, a very good place to start. We will also use them to illustrate some issues that actually cut across all these methods, like missing data and overfitting, which are things that you are almost guaranteed to encounter when you do machine learning in practice. First of all, let's look at what decision trees are and what kind of learning algorithms decision tree learners are. To begin with, decision trees are a variable-size hypothesis space: the size of your decision tree grows with the amount of data that you have. They are, at least in the most basic version, a deterministic representation: a decision tree is something that says, for each example, you're positive or you're negative. And decision trees actually have both discrete and continuous parameters. The discrete parameters are the choice of attribute to test at each node. The continuous parameters can be, first of all, thresholds on continuous attributes, if you're testing those and have to decide what to do with them, like 'is the temperature greater than something?'; and you could also have, for example, probabilities of one class versus another at the leaf nodes.
The learning algorithms that we have for decision trees are almost all constructive search algorithms, meaning it's like Lego: we start with one piece, then we attach another and another, and we gradually build a potentially very complex structure one piece at a time. They're typically eager. Again, there are some lazy learning algorithms out there for decision trees, but in the vast majority of cases, people learn decision trees in batch mode and eagerly, meaning you put a lot of effort into learning your model on a fixed amount of data, and then you deploy it. So that covers both of those points. So what is a decision tree, first of all? What we're generally going to do here is first understand what the representation is and what it can and cannot represent; once we have figured that out, we go on to what evaluation measures we use, and then finally what kind of search algorithms we have to optimize those measures and find, in this case, the best decision tree. The decision tree is a very simple and very old idea. You can almost guess what it's going to be just from the name. A decision tree is a tree where each internal node tests the value of one feature. For example, you might test whether this patient has a high temperature, or whether they have been in contact with people with this or that, and so on. Each internal node tests one of those things, and then the leaves make the class prediction: yes, this patient probably has malaria, or no, they don't. So here's a small example that we're going to use to illustrate decision tree learning. Suppose you like to play tennis, and there's a friend of yours whom you like to invite to play tennis. But sometimes she wants to play, and sometimes she doesn't.
And you're very shy and afraid of rejection, so what you'd like to do is predict in advance whether she's going to say yes or no. If you predict she'll say yes, then you invite her, she says yes, and you play. Otherwise, you save yourself the investment. Now, the problem, of course, is that you don't know what goes on in her mind. What you do have, though, is the past: a bunch of days on which you asked her to play tennis and she said yes or no. So our goal is, from these examples, to figure out the general rule that decides whether she wants to play tennis or not. And let's say we have four simple attributes. The outlook, which could be sunny, overcast, or rainy; clearly this affects whether you're interested in playing tennis or not. The humidity, which could be high or normal. The temperature, which, let's say, could be high, medium, or low. And the wind: is it strong or weak? So now let's say that we took our examples and ran our great decision tree learning algorithm on them. This is actually a small enough problem that you could even do it by hand. And let's say you get a decision tree like this one here. So how does this work? Say the attribute at the root is outlook. What that means is that the first thing you do, when you wake up today and want to decide whether to call your friend or not, is check the outlook. For example, what happens if the outlook is overcast? Looking at this tree. Exactly: the outlook being overcast decides it. That's enough for your friend to want to play tennis, so we're done; things end right there. On the other hand, let's say that the outlook is sunny. Then what happens? Check humidity. You have to check the humidity, because guess what: if today is sunny and the humidity is high, it's kind of uncomfortable for playing tennis.
On the other hand, if it's sunny but the humidity is low, well then, sure, we can still play tennis. If it's rainy, then maybe you need to check the wind. If it's rainy but not windy, okay. If it's rainy and windy, it's going to be too unpleasant. Notice that we actually ignore temperature. It turns out, and it often will in real examples, that some of the features you have collected are actually not necessary. It could be because they really are irrelevant, or it could be because they're redundant; for example, what the temperature would tell me is already contained in the outlook and the humidity. So you might wind up with a decision tree that doesn't use most of the features. That's a good thing, because the simpler the decision tree is, the easier it is for you to understand, and the better it predicts. Any questions about the basic representation? Which feature should you prefer to start with? Notice that the order of features matters. You want to start with the most informative feature. In fact, one of the nice things about outlook is that if it's overcast, you're done. So you want to test the features that most quickly lead to a conclusion. Because the other thing that happens, and we're going to see more of this in a little bit, is that as you start testing a lot of features, the amount of data that you have for each combination of values becomes exponentially smaller. So pretty soon you're not really able to tell whether the next feature is relevant or not. So one of our preoccupations is going to be figuring out which attribute to test next, at the root or at any other node, and how we should pick it. More questions? Okay, so next question. Actually, let me not even show you this first. The example that we showed here only has discrete attributes; each attribute only has a small number of discrete values.
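A decision tree like this is really just nested if-statements. Here is a hypothetical hand-coding of the tree described above, not a learned model; the function name and the string encodings are my own for illustration. Note that temperature never appears, exactly as discussed:

```python
# The tennis tree from the example, written directly as code:
# outlook at the root, humidity under sunny, wind under rainy.

def play_tennis(outlook, humidity, wind):
    if outlook == "overcast":
        return "yes"                               # overcast alone decides it
    if outlook == "sunny":
        return "yes" if humidity == "normal" else "no"
    # outlook == "rainy": playable only if the wind is weak
    return "yes" if wind == "weak" else "no"

print(play_tennis("overcast", "high", "strong"))   # yes
print(play_tennis("sunny", "high", "weak"))        # no
print(play_tennis("rainy", "normal", "strong"))    # no
```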
But in the real world, a lot of attributes have continuous values. Temperature, for example, unless you discretize it, is a continuous value. So the next question is: what if temperature were relevant here? How should I incorporate continuous values into a decision tree? Any suggestions? Obviously, we can't test all of their values, because there are too many; we can't have a branch for every possible value of the temperature. That would be crazy. So what can we do? You can use a range, exactly. That's one option. Or you can do something even simpler than using a range, which is kind of like using half a range: just use a threshold. So for example, here's the decision tree that we had before, but with a threshold on the humidity. If the humidity is above 75%, then we don't play; if it's below 75%, then we do play. And of course, if you compose these, you can get things like ranges. You could later test, for example, whether the humidity is below 25%, because maybe that's a bad case again. There are other schemes you can use, but the threshold is the simplest and by far the most common one. So the next question is: if this is what a decision tree is, then what can it not represent? For example, if you handle continuous attributes this way, can you learn any frontier between positive and negative examples? Here's a little diagram. Let's suppose that my positive examples are like this, here's a bunch of positive examples, and here's a bunch of negative examples. Can you learn this with a decision tree the way I just described? Yes or no? Yes, because? Because you would cover the universe of values. I'm not sure what you mean by covering the universe. Tell me what the frontier would look like here. Actually, before we even think about decision trees: what is the frontier that you would put here to separate the positive from the negative examples? Yeah, exactly.
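As a tiny sketch of what thresholding, and composing thresholds into a range, looks like in code: the 75% cutoff is from the example above, while the 25% lower cutoff is a hypothetical second split deeper in the tree, added just to show how a range emerges from two thresholds.

```python
# The sunny branch with a numeric humidity attribute.

def sunny_branch(humidity_pct):
    if humidity_pct > 75:
        return "no"    # too humid (the threshold from the example)
    if humidity_pct < 25:
        return "no"    # hypothetical second threshold: too dry
    return "yes"       # the two thresholds compose into the range [25, 75]

print(sunny_branch(80))  # no
print(sunny_branch(50))  # yes
print(sunny_branch(10))  # no
```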
It's just a line like this, right? Hey, we can do this by eye. As one famous person in data mining said, if people could see in high dimensions, we wouldn't need data mining. So here, we don't need an algorithm to figure this one out. So here's the straight line. The next question is: can I learn this straight line with a decision tree? You could approximate it if you had a lot of branches. Very good. I can approximate it, but I cannot learn it. Why can I not learn it exactly? You could not put it into an equation or something? Well, because a decision tree partitions the space with horizontal and vertical lines. Exactly. Let's say this is x and this is y. All I'm allowing myself are tests of the form x greater than something and y greater than something. x greater than something gives splits like this; y greater than something gives splits like this. So you can only form axis-parallel frontiers, and you can't learn this diagonal line, of course. Given enough data, you could approximate it with a tiny little staircase, but that's not very satisfying, because now you have a really complicated decision tree. In fact, its complexity goes off to infinity as you get more data, when in fact a much simpler thing would suffice. So in principle, you can approximate anything with decision trees, but in practice there are many things you cannot approximate very well. However, you can still learn good decision trees in a lot of domains, because we're almost never in this situation: the space is very high-dimensional, not just two-dimensional, and the data is very, very sparse. So we have no idea exactly where the frontiers lie, and axis-parallel cuts will actually be good enough most of the time. And they can also be computed much more efficiently than, say, fitting a big, complicated equation at each node of the decision tree. Yep? Is there any way you could learn the threshold values in a decision tree? Yes, that's exactly what we want to do.
We don't know where the thresholds are going to be; we need to figure out where to put them. So here's an example: a decision tree that tests two continuous variables, X1 and X2. If X2 is less than 3, then we test X1, and if X1 is greater than 4, then we predict one. So what is the region where X2 is less than 3 and X1 is greater than 4? It's this rectangle over here, and in this rectangle we predict one. Everybody agree? And so on. If, on the other hand, X2 was greater than 3, then we're in this other sector. And so you see that each of these rectangles corresponds to one of the leaves in the decision tree that we have here. And with a lot of data, we can draw a very fine set of rectangles; this is like one of those Mondrian paintings. We can have very fine little rectangles, but they're always going to be rectangles, or hyper-rectangles, which is the technical term, because this is going to be in high dimensions. Any questions? Okay, so next question. That was for continuous variables: we saw that decision trees can approximate things, but not necessarily learn them exactly. What about discrete functions? To simplify, let's suppose we're just trying to learn a Boolean function of Boolean inputs. Can we learn any Boolean function of Boolean inputs, or can we only learn some of them? What do you think? Are there some Boolean functions that we cannot represent with a decision tree? And if so, can you give an example? What is your feeling? Do you feel lucky today? It feels like you should be able to, right? Why does it feel like you should be able to? Because in the tree, you could basically describe the same Boolean function? Right, right. So think about this: you can describe any Boolean function using a truth table, right?
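In code, a two-threshold tree is again just nested comparisons; each return corresponds to one leaf, and hence to one axis-parallel rectangle of the plane. The leaf labels below are illustrative guesses, since the example only states the label for one of the four regions:

```python
# A two-threshold decision tree over continuous X1, X2.
# Each leaf is one axis-parallel rectangle of the plane.

def predict(x1, x2):
    if x2 < 3:
        # lower half of the plane, split at x1 = 4
        return 1 if x1 > 4 else 0
    # upper half of the plane (labels here are illustrative)
    return 0 if x1 > 4 else 1

print(predict(5, 2))  # 1: the rectangle where x2 < 3 and x1 > 4
print(predict(1, 2))  # 0
```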
That's what a Boolean function is, at bottom: a truth table. So now the question is, if I give you a truth table, any truth table in the world, can you turn it into a decision tree, or not? Yes. Yes, because? Somebody at Microsoft said yes. Worst case, you've got as many branch points as you've got rows in your truth table. Exactly. I don't actually have that example here, but let's draw it. Here's my truth table. Let's say I'm testing variables x1, x2, et cetera, and here I have my answer, my y: for each combination of the input values, I have a value of y. So how about we do this. Let's not worry about how large the decision tree is going to be; we're just going to worry about finding one. First, we test x1: x1 is 0, 0, 0, 0, and then 1, 1, 1, 1. So if x1 is 0, we know we're in the upper half of the table; if x1 is 1, we're in the lower half. And now you can already see what's going to happen. Now we test x2: 0, 0, 1, 1. Well, this divides things further. So in the worst case, what has to happen is that I test every variable along each branch, and I'm going to have two to the n leaves, but that's okay: it's a decision tree. So we can represent any discrete function with a decision tree. Whether or not we can represent it efficiently is another question, a very important one, but we'll defer that until later. And we'll also ask ourselves: are there more compact ways of doing this than decision trees? Questions? Okay, very good. So decision trees are actually a pretty general method. And clearly, they're also a variable-size hypothesis space. Think, for example, of decision stumps: these are decision trees that just test one variable.
And if the class depends on just one variable, I can represent it with a decision tree of depth one. If I go to depth two, that now includes all Boolean functions over two variables, and also some Boolean functions over three variables. For example, I could represent a function like (x1 and x2) or (not x1 and not x3) as a decision tree with two levels: at the first level I test x1, and at the second level, in one case I test x2 and in the other case I test x3. Okay? Very good. So this is what decision trees can represent. So now that we know what they are, the next question is: how shall we learn them? Anybody want to give any suggestions, right out of the gate, about how to learn a decision tree? Let me give you a hint: it's going to be a recursive algorithm, because trees and recursion were made for each other. So the natural way to learn a decision tree is probably to do something recursive. How might we go about it? Here's a training set. What's the first thing that you want to do? Let me put this another way. Suppose that I give you a training set of, say, spam, where every single email is spam. Then what should your decision tree be? The most common class? Yes: your decision tree should just be 'yes'. All your training data is spam, so you're done. Conversely, if all your training data was not spam, you could just say 'no', and that's your decision tree. Now, this in general is not an interesting case, except that it's where our recursion ends: if at some point I get to a data set that is all the same class, well, at that point I can stop and say, this is the class. What if that is not the case? What if there's a mix of positive and negative examples in my training set? Then what should I do?
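That depth-two example is easy to verify directly. Here's a sketch with the tree written as nested conditionals: when x1 is true, the whole expression (x1 and x2) or (not x1 and not x3) reduces to x2, and when x1 is false, it reduces to not x3, which is exactly why a two-level tree suffices for this three-variable function.

```python
# Depth-2 decision tree for (x1 and x2) or (not x1 and not x3).

def tree(x1, x2, x3):
    if x1:
        return x2        # left subtree: a single test of x2
    return not x3        # right subtree: a single test of x3

# Check it against the Boolean formula on all 8 inputs.
for x1 in (False, True):
    for x2 in (False, True):
        for x3 in (False, True):
            formula = (x1 and x2) or (not x1 and not x3)
            assert tree(x1, x2, x3) == formula
print("tree matches formula on all inputs")
```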
Remember, our goal is to produce something like this: I need to pick an attribute, and then another attribute. So the first thing I need to do is pick the root. How should I pick the root? Pick the attribute that's most relevant to the data? Yeah, absolutely. So we need some kind of criterion to decide how relevant, how important, how predictive an attribute is. But actually, let us punt on that question for a moment. Let's just suppose that we have a little black box, a subroutine, that returns a score saying how good each attribute is. So I run each attribute through that box, and then what do I do? Right: I take the attribute with the highest score, and I make that the root. So now I have the root, testing on this attribute; let's say that it's Boolean. What happens once I test on this attribute? Some of the data goes to the right, and some of the data goes to the left. So now what do I do? Now you do it again, twice. Exactly. Now I already have my whole algorithm. What I have now is two new, smaller datasets, and I just do the same thing on each of them: on one dataset I pick the best attribute, and on the other dataset, the other child of the root, I pick the best attribute for that one. So every time I split, I get new datasets, on each of which I can find the best attribute as if it were the root, because it is the root of that subtree. And finally, once things are pure, I stop and predict the class. It could happen, in the worst case, that I run out of attributes to test and things are still not pure. So then what do I do in that case? I actually already suggested it. Let's say I've run out of attributes to test, and 90% of my emails are spam, which is optimistic, and 10% of them are not. What should I do? Return the probability? You could return the probability; indeed, we can do that.
For the moment, let's just suppose that you're returning class predictions; then you should just return 'spam', because if you return spam, you'll be right 90% of the time and wrong 10%. And there you have it. This is our whole algorithm for learning decision trees. You can look at the details of the pseudocode later, but we've basically gone over it. It's a joyfully simple algorithm. It actually goes back to the 60s. There was this guy called Buzz Hunt, a professor at UW, who, with some collaborators, initially developed decision trees. He's a cognitive psychologist; he's in the Department of Psychology here, and they were interested in modeling human decision making. So here's one example of something in machine learning and AI that comes from another field, in this case psychology. The person who really made decision trees a widespread, mature data mining method was a student of Buzz Hunt's called Ross Quinlan, a PhD student here at UW. So this is not widely known, but decision trees, the most widely used machine learning method, were invented at UW. We are the kings. Let's bear that in mind. Okay, so any questions about the basic algorithm? Won't this cause all the best attributes to end up in the left nodes? Because as it recurses, it picks the best attribute for the left side first, and then when it comes back up, the rest of the tree gets the worse ones? No, no; okay, let me clarify that. Look at this part of the pseudocode here. First of all, if the data is pure, then we stop right there. If the data is not pure, then I'm going to choose the best attribute, using my test, which we haven't figured out yet, but we have a test.
And then what happens with that attribute is that I form the subset that has value true for it and the subset that has value false for it, and now the recursion happens. And the answer to your question is that I recurse in both directions, and it really doesn't matter which one I do first, because the result will be the same. So what I do is create a new node for the tree that tests this attribute, and that node has as its left child a recursive call to the tree-growing algorithm on the subset where the attribute was false, and as its right child the decision tree I get by calling the algorithm on the subset where the attribute was true. Does that answer your question? Silence. Kind of? Okay. If you're not clear about this, we can come back to it as we go on. Yeah? Are we assuming discrete values at each node? Yes. This is just the basic decision tree algorithm, the simplest thing you can imagine, for Boolean or discrete attributes. There's a host of issues that arise, one of which is how to handle numeric attributes, and we're going to deal with each of them; that's when things start to get more interesting. So S0 and S1 are subsets? Sorry? Are they subsets of S? What I'm wondering is, once you choose an attribute, do you actually take some of the data out of your training set? Yes. Let's look at the continuous example that we had. At the root, somehow (we don't know how yet), we decided to pick as our test 'is x2 smaller than 3?', and that, if you look at it, is this line here. So now what happens is that I divide my dataset in two. There's the dataset where x2 is smaller than 3, and now I'm going to learn a decision tree on that. In that branch, I never see the other data again.
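The recursion just described can be sketched as follows. This is my own minimal reconstruction for Boolean attributes, not the course pseudocode: the attribute-scoring black box is passed in as a parameter (here called choose_best), since we haven't defined it yet, and a tree is either a class label (a leaf) or a triple of (attribute, false-subtree, true-subtree).

```python
# Recursive tree growing for Boolean attributes (minimal sketch).
# Examples are (features_dict, label) pairs.
from collections import Counter

def majority(S):
    """Most frequent class label in S."""
    return Counter(y for _, y in S).most_common(1)[0][0]

def grow_tree(S, attrs, choose_best):
    labels = {y for _, y in S}
    if len(labels) == 1:            # pure: stop, predict that class
        return labels.pop()
    if not attrs:                   # out of attributes: majority class
        return majority(S)
    a = choose_best(S, attrs)       # the black-box scoring subroutine
    S0 = [(x, y) for x, y in S if not x[a]]   # attribute false
    S1 = [(x, y) for x, y in S if x[a]]       # attribute true
    if not S0 or not S1:            # split separates nothing: stop
        return majority(S)
    rest = [b for b in attrs if b != a]
    # Internal node: recurse on both subsets; order doesn't matter.
    return (a,
            grow_tree(S0, rest, choose_best),
            grow_tree(S1, rest, choose_best))

# Tiny usage example with a trivial scorer that just takes the first attribute.
data = [({'a': False}, 0), ({'a': True}, 1)]
print(grow_tree(data, ['a'], lambda S, attrs: attrs[0]))  # ('a', 0, 1)
```

Any scorer with this signature can be plugged in, which is exactly the point of punting on the choice of evaluation measure.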
And the same thing for the part of the dataset that's above the threshold. This just keeps going. Hey, recursion is a great thing: it gives you powerful results with simple algorithms, and this is a nice example of that. More questions? Okay, but of course, there's still a very big question that we haven't answered, which is: how do we find the best attribute? Everything depends on that, if you think about it. If that procedure is good, then I'll get a great decision tree quickly; if that evaluation measure is bad, then this is all going to be a mess. So what might I use? Let's not try to do something fancy. Actually, a lot of the time in machine learning, it's the simplest things that work best. So what would be a very simple criterion for picking the best attribute? It evenly divides the training examples. Evenly in terms of the number of training examples, or in terms of the number of examples of each class? For each training example, you're going to be able to answer yes or no; you pick the attribute that gives a more or less equal number of yeses and noes. I agree with the first part of what you said; I'm not sure about the second. So here's the thing. Let's think about using just one attribute. Suppose the decision tree was going to end here: I only get to use one attribute. Then you'd rather have mostly yeses or mostly noes on each side, right? Yeah, exactly. So let's suppose that in the beginning I have a training set that's half spam and half not spam. The best thing that could happen to me is to have one attribute that perfectly separates spam from not spam. Let's say the presence of the word Viagra: I test whether the email contains the word Viagra or not. If it contains the word Viagra, they're all spam; if it doesn't, none of them are spam. And that's actually not that far from reality, right?
There's obviously a little bit more to it, which is why there are a lot of people who work on spam detection, but as a simple example, you get the idea. Now, most of the time that perfect attribute isn't going to exist. But suppose I have one attribute that, when I split on it, gives me 90% yes and 10% no on one side, and 90% no and 10% yes on the other, versus another attribute that gives 60/40 splits. Which one would you prefer? The one that gives you 90/10, because at the end of the day, with that one you're only going to make 10% errors, whereas with the other one you're still making 40%. And since this is going to be a greedy procedure, because we can't afford an exhaustive search, probably what you should do is pick the attribute that looks best by itself. So this is a very simple idea, but it's actually a very reasonable thing to do: you just pick the attribute with the lowest error rate. Lowest error rate means: if I split on that attribute and then predict the majority class in each branch, how many errors do I make? Say I had 100 emails; 50 go over here, and 10 of those are misclassified, and 50 go over there, and 10 of those are misclassified. Then I made 20 errors out of 100, or 20%. Well, that's my measure. So I just pick the attribute that has the best accuracy if used alone. There are some details of how you do that here, but the idea should be fairly clear at this point. Now, how good is this? You can implement it and try it out; it's very easy, and as part of your project, you can actually try this one out. And it's not terrible. It's certainly better than, say, random guessing, even though random guessing sometimes is not that bad if you combine it with good pruning, as we will talk about later. But while this is not bad, it's actually far from ideal. Let's see how it works, and why it has some shortcomings. So here's a very simple training set. It has eight examples, with three Boolean attributes, X1, X2, X3.
So let's evaluate each of these attributes. Let's take X1. You see here X1 is 0, 0, 0, 0, then 1, 1, 1, 1. So here's what happens for that attribute. Before I split, how many positives and negatives do I have in my data set? Four of each. Now, if I split on X1, what happens when X1 is 0? I have how many positives and how many negatives? Right: three positives, one negative. Therefore, which one do I predict? Positive, because then you'll be wrong what fraction of the time? 25%: you get one wrong and three right. Agreed? And on the other side, when X1 is true, what happens? Now it's reversed: I have three negatives and one positive, so I predict negative, and I make one error. So in total, I made two errors. Now if you do this for the other two, X2 and X3 (a nice exercise; I won't go through it here, but you can do it), you get four errors for each. Not so great. So which one do you pick? X1, right? So notice that at this point, we have a complete algorithm for learning decision trees. It's not going to be a great algorithm, mainly because of the problem of overfitting, which we will talk about later, but short of that, this is already not a bad algorithm in many ways. Now, here's a question, though: is this really the best we can do? Here's an example that shows why picking the most accurate attribute is not necessarily so great. I start out with a training set with 20 positives and 10 negatives, and I have this attribute X1 that creates the following split. When it's true, I get 12 and 8. So how many errors? Eight. And on the other side, I get 8 and 2; how many errors? Two. So what's my total number of errors? Ten. And how many errors was I making before the split? Ten. So, this attribute is no improvement, right?
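The error-rate criterion and the X1 computation above can be sketched directly. The dataset below is a reconstruction that reproduces the counts from the example (X1 = 0 gives three positives and one negative, X1 = 1 the reverse); the function splits on an attribute, predicts the majority class in each branch, and counts the mistakes:

```python
# Error-rate scoring of a Boolean attribute: split, predict the
# majority class in each branch, count the errors.
from collections import Counter

def split_errors(S, a):
    errors = 0
    for val in (False, True):
        branch = [y for x, y in S if x[a] == val]
        if branch:
            majority_count = Counter(branch).most_common(1)[0][1]
            errors += len(branch) - majority_count
    return errors

# Reconstructed 8-example set: splitting on x1 gives a 3-1 / 1-3
# class mix, so predicting the majority in each branch makes 2 errors.
S = ([({'x1': False}, '+')] * 3 + [({'x1': False}, '-')]
     + [({'x1': True}, '-')] * 3 + [({'x1': True}, '+')])
print(split_errors(S, 'x1'))  # 2
```

An attribute whose split gave four errors, like X2 or X3 in the example, would score worse, so X1 wins under this measure.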
I might as well not bother: I basically have the same mix of positive and negative that I had before. So by our accuracy evaluation measure, this attribute is not helping. However, let's say that we kept going; we just picked this attribute anyway, because we're stubborn or something. And then we try X2, and when we try X2, we get this: 12 and 0 on one side, and 0 and 8 on the other. So after splitting on X1, splitting on X2 actually gives me zero errors on this side. And on the other side, I get 8, 0 and 0, 2. Of course, this example is a little exaggerated, but you get the idea: we were actually making progress when we picked attribute X1; we just didn't know it. So the question for us is: what is going wrong here, and could we do something about it? All right. We're not going to be able to solve this in all cases short of trying everything; it's actually an NP-complete problem to find the smallest decision tree consistent with the data. So we're going to have to be heuristic. But the question for us is: is there a better heuristic than just the error rate? And of course the answer is yes, or we wouldn't be talking about it. And that takes us into a little digression into information theory. How many people here are familiar with the notions of entropy, information gain, information theory? Raise your hand if you are. Okay, very good. Not everybody, which is why we're going to do a flash introduction to information theory. Now, as a navigational warning: we're doing this because we're going to introduce this other evaluation measure called information gain, or mutual information, and we need this very brief background in information theory in order to understand it. But at the end of the day, the information theory is really not going to be that important; we will see what it is that's important.
People who use decision trees often get very fixated on the information theoretic aspects and, really, there's no need. The particulars of the information theoretic measure are actually not that important. Nevertheless, we do want to understand what it is, and why not learn some information theory in five minutes? So here we go. Let's suppose that we have a random variable. It's a Boolean variable. You can think of it as flipping a coin. And the coin is biased. So you have some probability of the coin coming out heads; let's say that is v equals 0. And some probability of it coming out tails; let's say that is v equals 1. And the probability that v equals 0 is 0.2, and the probability that v equals 1 is 0.8. Okay? The simplest distribution you can imagine. This is called the Bernoulli distribution in statistics. Two outcomes. Now what we're going to define, and this was sort of Claude Shannon's insight when he invented information theory, which then took off like wildfire, is: let us try to quantify how surprised you are when you find out the value of this variable. Okay? So I tell you, right, this is the distribution. Now the question is: if I tell you that the outcome was a 0, how surprised are you? And if I tell you that the outcome was a 1, how surprised are you? First of all, in which case are you more surprised? When it's 0, right? Because that's the less likely outcome. So first of all, our measure should obviously have a higher value when the probability is lower, right? Makes sense, right? Number two, what if there's no uncertainty at all? How big should the surprise be in that case? Well, 0, right? There's no surprise. On the other hand, here's a more exciting case. Let's suppose that the probability of 1 is 1 and 0 cannot happen, right? And then you see a 0. 
How surprised should you be in that case? Infinitely surprised, exactly. You're just infinitely surprised because that couldn't happen, okay? So we want a measure that satisfies these desiderata, right? And what Shannon suggested was: let us make it minus the log of the probability, very simple. There are many ways to justify this, but the intuitive one that I just gave is actually, I think, the best one, right? This is why we have the log. And, you know, we want it to be a smooth function, right? We don't want to be doing random things. It should decrease smoothly as the probability increases and whatnot. But basically what minus log of p does is exactly what we want. If the probability is 0, it's infinity. If the probability is 1, it's 0, etc., etc. Okay? So this is the definition of surprise, okay? Now, the reason this is important for communication systems and information theory and whatnot is that the more surprised you are, the more bits you need to send that thing if you're trying to be as efficient as possible, right? You should have the shorter representations for the things that you're sending or recording more often, right? That's the basic intuition, okay? For us, that is not really that important. What's important is, of course, the next thing, which is entropy, right? Even if you've never heard of information theory, you've probably heard of entropy because, you know, the world is full of entropy, right? It's a popular metaphor, even outside of any technical field. So what is the entropy? The entropy is just the average surprise. Nothing else. If I see outcomes of my coin flips and my coin has a particular bias, like, say, 0.2, 0.8, how surprised am I on average, okay? And this is often represented by the letter H, so H represents entropy, right? 
It's just going to be the sum over the possible outcomes, which in this case are just two, but there could be 100 or a million, of the probability of that outcome times minus the log of the probability, right? Minus the log is the surprise; the probability is there because, what? Why am I multiplying by the probability? Taking the average. Yeah, exactly, it's just the average, right? Things that occur more often should contribute more because I will be surprised by them more often, okay? So the entropy is just the sum over all the outcomes of the probability of the outcome times minus the log of the probability, okay? Very good. That was our flash introduction to information theory. It didn't take long, okay? Now, if you're an electrical engineer, this stuff is really important to you, or even if you're trying to design efficient compression schemes and whatnot. For machine learning purposes, however, here's the important thing: let us draw a graph of the entropy, right? Here's the graph, and in case you don't see this, here's zero, here's one. This is the probability that V is equal to zero, and this is my entropy, H of V, right? And here's one. And the graph looks like this, right? Let's look at this for just a second. What is the entropy when the probability of heads is zero? It's zero, right? You always get tails and that's what you expected. Same thing when it's one, okay? But now let's look at where the entropy is maximum. Where do I get the maximum of the entropy? When am I most surprised on average? Exactly. When the probability is the same for both. When I have a fair coin. When I have a fair coin, I really have no idea how it's going to come out, right? It could be either way. So I'm always as surprised as I could be, right? Notice that what's happening here is the trade-off between something being probable, so P is high, and something being improbable. 
So the surprise is high, okay? And the value of the entropy in that case is one, right? That is, supposing that you're using log base two, okay? So it's one bit, right? I'll leave it to you as an exercise to figure out why that should be the case. But you can probably see it already. Nevertheless, for our purposes, the important thing is that we have a curve with this shape. It's a smooth curve that starts at zero, rises more and more slowly as it approaches the maximum, is flat just at that point, and then comes back down, getting steeper again, okay? And in fact, if you took entropy and replaced it by some other function that looks the same, you would probably get about the same results. In fact, people in computer science like to use entropy because of information theory and whatnot, but people in stats often use this other thing called the Gini coefficient, which has nothing to do with information theory. It just has a shape that's very similar. And, you know, empirically, information gain tends to work better than Gini, which is why we're covering it here. But it's not a big deal. The big deal is that the shape of the curve is this. Any questions so far? Okay, so the question is: sure, I'm telling you this, but why should you believe me, right? Why is this shape of the curve important? And why is it that having a curve like this helps us do better than just picking the attribute with the lowest error, okay? That's the key question. Oh, sorry, before we go on to that: what we typically use is not just the entropy, but what is called the mutual information. What is mutual information? The mutual information between two variables is the reduction in the entropy of one of the variables that we get from knowing the other, okay? 
In particular, we're going to be interested in the reduction in the entropy of the class that we get from knowing the value of a particular attribute. So the attribute with the greatest mutual information with respect to the class, or the attribute with the greatest information gain, is going to be the one that we pick, okay? Because it's the one that most reduces the uncertainty about the class. In particular, if an attribute reduces the uncertainty to zero, we get maximal information gain, and we can pick that, okay? So more precisely, what is the mutual information between two variables, A and B, denoted by the letter I? It's the entropy (actually, there's a typo here) of A. This actually works both ways; it's symmetric, so pick one of the variables, let's say A. It's the entropy of A minus the entropy of A after you know B. And again, this has to be averaged over how often different values of B occur. So this is going to be the sum over the values of B of the probability with which that value comes up times the entropy of A conditioned on B: if I select only the cases where B has that value, what is the entropy of A now, okay? So to compute the information gain, I need my entropy before I split on the attribute and my entropy in each of the branches, okay? I average those entropies, weighted by how many examples went in each direction, and the difference between the two is my information gain. Notice that, actually, if you didn't want to bother with H of A, you don't have to. Why is that? What I'm saying is that even though this is the definition of mutual information, if I just took this out and considered only the entropy of A given B, I would be okay. Well, just because you would choose the min or the max? Yes. And since that term is in all of them, it cancels out. Exactly. I am not interested in the particular value. 
I'm not interested in picking the one that has the maximum value, and this contribution is equally there for all of them, right? The entropy of A is independent of the attribute that I'm testing, so it's going to be the same for all of them. It's going to affect the exact value, but it's not going to affect who comes out on top. So if you actually look at the code of decision tree learners, they actually don't bother with that part. But it's good to understand that effectively, what we're computing is the mutual information. And now you know how to compute it. For example, you could apply this to the example that we saw before. Here's my initial training set with 20 and 10 examples, 20 positive, 10 negative. Here's my initial entropy. And then here's my entropy after I split, for X1 equals 1, and this is for X1 equals 0, right? Given these values, I can compute my probability and therefore my entropy, right? I also know that two-thirds of the examples go this way and one-third go this way. So I have all the information that I need, right? This here, minus this times this plus this times this, gives me this value. And lo and behold, it's greater than 0. This is the important thing. Remember, for accuracy, the difference between the error rate before and after was 0. I was making 10 errors before, I was making 10 errors after, so my gain was 0. But here my gain is not 0. My gain is positive. So error rate couldn't tell that we were making progress, but information gain can tell that we're making progress. So that's the case in this example, but why would this be the case in general? That's the next question for us, right? Does this happen in general or did we just get lucky this time? In fact, it happens in general. It happens in a very broad range of cases. There are cases where we're making progress where information gain also can't tell, and we'll see one in a little bit. But there are many cases where it can detect progress where error rate wouldn't. 
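Here's a minimal sketch of that computation in Python (the counts are the ones from the lecture's 20/10 example; the function names are mine):

```python
from math import log2

def H(pos, neg):
    """Binary entropy of a (positive, negative) count pair."""
    total = pos + neg
    return sum(-c / total * log2(c / total) for c in (pos, neg) if c > 0)

def info_gain(before, branches):
    """Entropy before the split minus the size-weighted entropy of each branch."""
    n = sum(before)
    after = sum((p + q) / n * H(p, q) for p, q in branches)
    return H(*before) - after

# The lecture's example: 20+/10- overall; splitting gives (12+, 8-) and (8+, 2-).
gain = info_gain((20, 10), [(12, 8), (8, 2)])
print(round(gain, 4))  # about 0.03 bits: positive, where error rate saw no gain
```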
So now let's see why that is the case. So here's my probability of the class, spam or not spam or whatever, right? Here I have from 0.5 to 1, because I've kind of zoomed in, and here I have from 0 to 1, right? So this is the absolute error, and here's what the curve of the absolute error looks like, right? It starts at 0: obviously, if everything is of one class there's no error. And then it's a straight line, because it just goes up proportionally to the probability, right? Until it gets to 0.5. And then it flips, because now the other class is in the majority, so now I start predicting the other class, and as the probability keeps increasing my error rate goes back down, okay? Or my accuracy goes up, depending on how I look at it, right? Either the error rate is going down or the accuracy is going up, okay? So I have my biggest error here in the middle when it's 0.5, 0.5, just like entropy, right? When it's as good as a coin flip, right? And it goes down to 0 on both sides. So this is very similar to entropy, right? The big difference is that this here is a straight line, and with the entropy, if you recall, what I have is a curve like this, right? Why is the curve better than the straight line? Let's zoom in on this curve and see what happens, right? So here I've zoomed in on one part of it, right? And here are two values of the probability, 0.6 and 0.8, right? And let me draw the chord between these two points. Now here's the thing: in the beginning, before I split on the attribute, I was somewhere in here. Let's say, on my whole dataset, my probability was 0.7, right? And therefore my entropy is close to 0.9, let's say, okay? The exact values obviously don't matter, right? And now what happened is that I split my dataset into two, right? In one of those datasets, the probability of the class has gone up, and in the other one it's gone down, right? 
I concentrated the spams on this side and the non-spams on that side, okay? So now what I have is two datasets with two values of the probability and therefore two values of the entropy. One is higher and the other one is lower, okay? Everyone with me? We start out in the middle; when we split, we get one lower and one higher, okay? And now my new entropy is going to be the average of those, right? And the average of those is going to lie on this straight line. Where exactly is it going to lie on the straight line? It depends on what fraction of the examples went one way and the other, right? If it was 50-50, I would pick this point here in the middle. If it was one-third, two-thirds, it would be somewhere over here, right? But here's the key point: irrespective of where that split actually happens, this straight line is always below the curve. The curve of the entropy is always above the line of the average of the entropies. The technical term for this is that the entropy is a concave function, okay? And conversely, the mutual information is a convex function. Sometimes people abuse terminology and just call both concave and convex functions convex, right? But that's the key: the line is always below the curve. With the error rate, the line and the curve lie right on top of each other, and so I can't detect the progress. In this case, I can detect the progress. And now you also see why the details of the entropy don't really matter. The Gini coefficient has the same shape and therefore also works, okay? Questions? I know this was a little bit of a quick sequence of things. It's good to sit down with this after class and go through it again, but hopefully at this point you at least have the gist of the idea, okay? So any questions before we continue? Okay, very good. Let's see how we're doing on time. We're doing pretty well on time. So let's keep going. 
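The chord-below-the-curve argument can be checked numerically. This sketch (my own, using the 0.6/0.8 branch probabilities and the 0.7 starting point from the example) verifies that the entropy of the mixture exceeds the mixture of the entropies:

```python
from math import log2

def H(p):
    """Binary entropy of probability p."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

# Hypothetical branch probabilities and branch weight from the example.
p_low, p_high, w = 0.6, 0.8, 0.5
p_mix = w * p_low + (1 - w) * p_high            # probability before the split (0.7)
avg_entropy = w * H(p_low) + (1 - w) * H(p_high)  # point on the chord
print(H(p_mix) > avg_entropy)  # True: entropy is concave, the curve lies above its chords
```

With the straight-line error measure, the analogous comparison comes out equal, which is exactly why the error rate can't detect this kind of progress.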
So now we have our basic decision tree learning algorithm. We know how to do the recursion. We know how to stop. We have at least two good candidates for an evaluation measure: the error rate, which is quick and simple, and the information gain, which is a little bit better, okay? But now we need to start getting closer to the real world, right? The algorithm that we've seen so far by itself is still not that good for most purposes. When Quinlan did his decision tree work initially, he was just trying to predict the outcomes of chess games. And he did a very good job of that. In chess games, there's no noise; it's all discrete data, right? It was very nice. And the stuff that he did, motivated by playing chess, is these days used to mine everything in the world, right? So this is how science happens. You develop something for one thing, and then it gets used for completely different things, like, say, clickstream mining, which is what you're going to do in your project, okay? So now let's start looking at generalizations of this. And the first one is: well, what if my attributes are not Boolean, right? So let's say, first, that my attributes are still discrete, but have multiple values. So, for example, I could have the color of my object with three or four different values. Or here's one that's very important for marketers: I could have your zip code, right? A zip code is an extremely predictive feature of the things that you buy. And there's a lot of them, right? A zip code doesn't just have two or three values. It has thousands or tens of thousands, okay? So, what might we do then, right? The algorithm that we have so far is for Boolean attributes. What do we do with attributes with multiple values that are still discrete? Any suggestions? 
I have a suggestion, which is you can read the slide, but I have a better suggestion, which is don't read the slide and come up with it by yourself. So, what might you do? Pardon me? Multiple children? Yeah, exactly. So, the most obvious idea is: well, let me just split on every value, right? I compute the information gain: I compute the new entropy at each value, I do the average, and that's my split, right? So, if I have an attribute with ten values, that creates a decision tree node with ten branches. This is a perfectly good thing to use a lot of the time, particularly if your attribute doesn't have a lot of values. Let's say color has three values. What color of car do you like? Well, there aren't that many colors. But zip codes might be kind of tough, for two reasons. One is that, well, there are going to be a lot of splits there. Also, it runs the danger of overfitting, as we'll look at a little more shortly. So, that's certainly one option: you just construct a multi-way split. But what's something else that you could do? One of the problems with doing this is that maybe the distinction between some of those values is important, but the other ones aren't. Let's say you're trying to sell yachts, right? You probably want the zip codes with affluent people. But whether you're in a middle class, lower middle class, or working class zip code, that doesn't matter; none of those people are going to buy yachts anyway. So, when you split on all values of the zip code, you're making much finer distinctions than you want, okay? So, what else might you do? This is also one that you really want to know about, because often it's what you really want to use. No, of course, you can choose a different attribute, good answer, right? But as far as this attribute goes, let's suppose that all your attributes have a lot of values, right? What would you do, right? 
Or, what else could you do with this one attribute, right? You can split on all the values. But what is something else that you can do? You can group them. Yeah, exactly. You can group the values. I can say, well, let me group all the wealthy zip codes together and all the others together and split on that, right? This is actually the third option here. This can work very well. You have to figure out what the groups are. If the attribute doesn't have a lot of values, you could try them all, but this is going to be exponential for something like zip code; that would not work very well, okay? You would need some other way to try to pick one of the good groupings, okay? And there's another strategy, which is to just pick one value and test that against the others, right? Are you in, I don't know, Beverly Hills? If you're in Beverly Hills, we want to sell you a yacht. If you're anywhere else, for the moment, we don't. So we split on that value, right? If you think about it, having an attribute with 100 values is really just like having 100 Boolean attributes: is it equal to the first value? Is it equal to the second value? So, effectively, what I can do is convert the multi-valued attribute into a bunch of Boolean attributes and then test on those, okay? And so I split off Beverly Hills, and maybe next time I go to Medina, right? Medina's a good zip code for selling yachts, and so on. That's probably better than splitting on them all at once or trying to somehow guess what the wealthy zip codes are, okay? Questions about this? Okay, so that's features with multiple discrete values. Now, what do you do if you have real-valued features? And again, there's a lot of different things that you could do here, but what's a simple one? One that we, in fact, already touched on. Actually, let me give you a hint, right? 
One of the ways we just saw of handling multi-valued features was to convert them into a bunch of Boolean features, right? Can we do something similar with numeric features? How might we do that? Let's say I have my temperature scale, right? The temperature could be anywhere from below freezing to scorching hot, right? So what can I do? Yeah, I can form a range. Or, as we saw before, I can do this one piece at a time, which is to look at thresholds. I can say: is my temperature greater than 60 or not, okay? Temperature greater than 60 is a Boolean attribute, right? So what I can do is form a threshold at every value that I have in the data, or, if I have a lot of data, at some subset of the values; it doesn't have to be too fine. And now I'm back in the situation where I have a bunch of Boolean attributes. And then maybe I can decide: well, if the temperature is higher than 60, we will play tennis. If it's not, then let me test some other things, okay? So this is the simplest way of handling continuous attributes. You know, navigational warning: if you do this on large data sets, it will be very, very slow, because there are a lot of values. So you might want to do something more refined. But on data sets that are not too large, just doing this will actually be fine. Okay? Questions? So basically we reduce everything to Boolean tests and can come back to the same attribute as we recurse on the data? We repeat, right? What we're doing here is reducing these more complicated cases to the simple one that we already have. So, for example, in the case of a continuous attribute, I can test the temperature at one threshold; it's a Boolean test. Later I can come back to it and say: well, if the humidity is very high, then maybe at this point I care about the temperature again. In fact, this is how you build the little staircase that approximates the diagonal frontier, right? 
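Both reductions, one Boolean attribute per discrete value and one threshold test per observed numeric value, can be sketched as follows (my own sketch; the zip codes and temperatures are made-up illustrations):

```python
def one_vs_rest(values):
    """One Boolean attribute per value of a multi-valued discrete attribute."""
    return {v: [x == v for x in values] for v in sorted(set(values))}

def candidate_thresholds(values):
    """One Boolean test per threshold: midpoints between consecutive observed values."""
    vs = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(vs, vs[1:])]

zips = ["90210", "98039", "90210", "10001"]   # hypothetical zip codes
print(one_vs_rest(zips)["90210"])             # [True, False, True, False]

temps = [48, 60, 72, 60, 55]                  # hypothetical temperatures
print(candidate_thresholds(temps))            # [51.5, 57.5, 66.0]
```

Each resulting Boolean attribute ("is the zip code 90210?", "is the temperature above 57.5?") can then be scored with the same error rate or information gain machinery as before.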
You keep alternating between testing X and testing Y. That's actually how you wind up with that staircase. More questions? Okay. So, everything is looking great so far, but let me give you an example of where decision trees fail. Or at least not decision trees per se, but the ways that we currently have of learning decision trees, right? What is it that we cannot do with this kind of algorithm? And this brings in a very important distinction, which is going to recur throughout all of our machine learning methods: the distinction between the things that we can represent and the things that we can learn. We cannot necessarily learn all the things that we can represent. People often make this confusion. For example, we just saw that we can't learn any Boolean function. Sorry, that we can represent any Boolean function with a decision tree, right? But the question is, could we learn it with this algorithm? It's no use being able to represent it if then you can't learn it from the data. So let me give you a famous pathological example of what you cannot learn. In fact, if you want to annoy a machine learning person, bring up this example. And this example is learning the parity function. The parity function is just: how many of your bits are one? Is it even or is it odd? Very simple function, right? I just defined it in one sentence. For decision trees, this is a hellishly difficult thing to learn. Can anybody guess why that might be the case? What does parity have that makes it so problematic? I think it flips as you add every bit. Yeah, exactly. It keeps flipping, right? So if you think about it, with parity, let's say there are n attributes: if I know n minus one of them, I still don't know what the class is. I need every last piece of information. And another way to think about this is in terms of instance space, right? What makes learning easy is having large blocks where it's either all one thing or all the other. 
In parity, they are maximally interleaved. All your neighbors are of the opposite class. So this is about as bad as it gets, which is why I picked that example. So for example, let's suppose that I have a Boolean function y of two inputs, x1 and x2, and x3 is just noise, right? Again, when we're gathering the attributes, we often gather attributes that just have nothing to do with the class. We don't know in advance, right? The whole purpose of learning the decision tree or anything else is to figure out what the relevant ones are, and then how the class depends on those, right? And if you look at it, y is the parity of x1 and x2, right? So for example, when x1 and x2 are both zero, y is zero. When only one of them is one, y is one. And when they're both one, it goes back to being zero, right? So y is a very simple function of x1 and x2, right? But let's do this with the error rate, because it's simpler than the information gain; you can check at home that the same thing would happen with information gain, right? What is the error that I get if I split on x1? Remember, splitting on x1 means I'm going to look at these cases here and then at these cases. So what happens when x1 is equal to zero? Right: two positives and two negatives, so my error rate is still 50%, no better than random guessing, right? What happens when x1 is equal to one? Same thing, right? So x1 doesn't look like it can do anything. Same thing happens with x2, right? And now, here's x3. x3 is pure noise; x3 is just random. Of course, for x3, it's also still 50-50. The problem is that I cannot tell x1 and x2 apart from x3. Now, this should be easy, right? y is a deterministic function of just two variables, right? What could be easier? And yet, it's very hard to see. 
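The failure can be checked directly. This sketch (mine, not from the lecture) enumerates the truth table for y = x1 XOR x2 with a noise bit x3, and shows that every single-attribute split looks equally useless to the error rate:

```python
from itertools import product

# y = parity (XOR) of x1 and x2; x3 is an irrelevant noise bit.
rows = [(x1, x2, x3, x1 ^ x2) for x1, x2, x3 in product([0, 1], repeat=3)]

def split_errors(rows, attr):
    """Majority-vote errors after splitting on the attribute at index attr."""
    errors = 0
    for value in (0, 1):
        branch = [r[3] for r in rows if r[attr] == value]
        errors += len(branch) - max(branch.count(0), branch.count(1))
    return errors

# All three attributes leave 4 errors out of 8: the relevant bits are
# indistinguishable from pure noise.
print([split_errors(rows, a) for a in (0, 1, 2)])  # [4, 4, 4]
```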
Instead of one irrelevant variable, I could have a thousand irrelevant variables, and I still wouldn't be able to pick out x1 and x2 from them, right? So this is a serious problem. Now, we're going to see something that can handle this problem up to a point, or at least that's better than what we're doing now. But at the end of the day, this gets at a very deep problem in machine learning, and there is no magic bullet. If I knew that this was going to be parity, I could very easily recode the examples to make this problem totally trivial, right? If I had an attribute that was the number of bits set to one, then as a function of that attribute this would be a no-brainer. But I don't. Okay? So if I knew it was a Boolean function, right, I could throw in a bunch of things. Maybe certain Boolean operations? Yeah. And if those things are a possibility and the learner tries XORs, then it might come up with... Yes, that's a good point. Let's suppose that it really was just an XOR of two things, like here. Well, in that case, if I try an XOR of two things, I'll actually get the right answer, right? But this is why I'm looking at parity in general, right? That's the problem, okay? So, as usual, if I know something about the problem, if you tell me this is going to be XORs of at most two or three bits, then I put that in and I'm in good shape. I blew up the number of variables, but in this case it's still okay. But if it's the XOR of a thousand bits and I don't know that, I'm hosed, okay? Questions? Okay, so now here's another problem that we've already touched on, but we haven't seen how to solve, right? You can think of it as the zip code problem: the zip code might look very good as a predictor for the class, even though it's not. 
In fact, let's suppose that you have a dataset with just a thousand examples, right? And as it happens, everybody there comes from a different zip code. That could happen, right? I claim that you would always pick the zip code as your best attribute; in the worst case, it would tie with some others, right? But why? Why would it be impossible to beat? Right? Every example has a different zip code. So what's the information gain of the zip code going to be? Or, what's the error rate after I split on the zip code? Zero. Exactly, the error is going to be zero, right? Because the zip code picks out one specific example that has one class, so with the zip code I can always predict correctly. Does that mean that the zip code is a good thing to split on? No. In fact, one of the beginner traps is something like this. You get your dataset, every example has an ID, like your social security number, or just a number from the web log or whatever, right? And you forget to take that out. And then, of course, you can predict the class (is this person going to leave my website, or whatever) perfectly, using that attribute. Did you get some generalization out of that? You got zero generalization out of that. Okay? So, nevertheless, we don't want to throw away multi-valued attributes, because they could be very useful; for example, the zip code for marketing is indeed one of the most predictive attributes you can get. Okay? So what might we do? Well, there are actually lots of things that we could do, but let me just give you one simple, easy-to-code heuristic, and it's the following. If you think about it, the problem with an attribute with a lot of values is that there's a lot of information in just splitting on that attribute, even before you think about the class, right? You have a set with a lot of values, right? 
So there's a lot of information just in telling you which member of the set I am, right? If I tell you which zip code I live in, that tells you a lot, even before you find out how much money I make or whether I like to buy yachts or not, okay? That's the intuition here. We're really only interested in the extra information about the class that I get from splitting on that attribute, beyond the information that I unavoidably get just from the entropy of the attribute by itself, okay? So what I can do is use what's called the gain ratio. The gain ratio is just the ratio between the information gain, as we saw before (how much information does this attribute have about the class?), and the split information. The split information is just the entropy of the attribute itself, right? I have a zip code; some of them have more people, some of them have fewer. What is the entropy of my zip code membership, if zip code is the variable, right? And then I take the ratio of the two, which means that if you have an attribute with a lot of values that are widely distributed, then you have a high burden to bear: you have to prove to me that your information gain is actually coming from really predicting the class, not just from happening to have a lot of values. Quick question: why use the entropy and not something simpler, like just the number of values? Why don't I just divide by the number of values, right? That would be easier, right? If there are 10,000 zip codes, I divide by 10,000. Why use the split information instead of that? Let me give you sort of a hint, right? Let's suppose that we actually were in a situation where each zip code had exactly the same number of people, right? Would it make any difference whether we were normalizing by the number of values or by the entropy? Not really. But now let's look at the opposite case. 
Let's suppose that we have a lot of zip codes, or maybe an easier example would be country, but some of them have many, many more people than others. So, two attributes with the same number of values: would you rather split on the one where the values are very evenly spread, or the one where they're very concentrated? Concentrated, why? Because that gives us a better filtering, a better gain. Yeah. Think about it in terms of country. Over 10% of the people in this world are in China, and over 10% of them are in India. And then there are all these tiny countries that are probably barely represented in your dataset. So to a first-order approximation, without offending anybody, you could model the world as just India and China, and that would be an attribute with two values. So here's the key: an attribute with a lot of values but a very concentrated distribution is almost like an attribute with fewer values. Now, in practice, what you might want to do is aggregate, for example, all the little countries into larger regions, like Europe and North America and so on. That's an intermediate case where you get a smaller number of values but probably keep the same predictiveness, okay? This is why you want to use entropy. One good way to think of entropy is that at one extreme, entropy is just the log of the number of values that could occur. But at the other extreme, suppose that only two values actually occur. Then this is the same thing as a binary split. If you have a dataset with 100,000 possible values where only two of them occur, this is effectively the same as only having two values.
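You can see this "effective number of values" behavior directly by computing the entropy of a uniform distribution versus a concentrated one. The country counts below are invented for illustration:

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy, in bits, of a list of values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

# Four values, evenly spread: entropy hits the maximum, log2(4) = 2 bits.
uniform = ["A", "B", "C", "D"] * 250

# Many values, but two dominate: entropy is close to that of a binary attribute.
concentrated = ["China"] * 495 + ["India"] * 495 + ["Monaco"] * 5 + ["Andorra"] * 5

print(entropy(uniform))       # 2.0 bits
print(entropy(concentrated))  # about 1.08 bits, close to a two-valued attribute
```

Dividing by the raw number of values would penalize both attributes equally; dividing by the entropy penalizes the concentrated one much less, which is the behavior we want.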
If you have a dataset with two values that each occur a billion times, and then a thousand other values that each occur only once, well, that's a dataset with a thousand and two values, but effectively it's almost just like splitting on those two values, okay? Questions? All right. Next one. This is a big one: unknown attribute values. There are a couple of respects in which everything we've been doing so far is very unrealistic. The first one is that we've assumed you have all attribute values for all examples. The second one has to do with noise, and we're going to spend a large chunk of the second half of this class on it. But let's just look at the first one for now. Most of the time, you do not know all the values of all the attributes. For example, you're doing medical diagnosis, and you have a battery of a thousand tests that you could run on a patient. But God forbid that you actually inflicted all of those thousand tests on every patient. You probably only have some of them for each patient, because different ones are relevant for different patients, but for most of them, you're missing the information. And that's a case where the values aren't even missing at random; in other cases, they really are missing completely at random, and you just don't know the values of some things for some people. And then what do you do? The algorithm that we have so far always assumed that I'm able to evaluate every attribute at every point, because I know its value for all the examples in the dataset. And most of the time that doesn't happen. And again, Murphy's law of data mining: the bigger the dataset, the crappier it is. So on a small dataset of a thousand examples that somebody painstakingly gathered, sure, maybe you have all the values. On a very large one, you probably don't. For example, here's a very typical case.
In clickstream data, you actually have three kinds of data. You have the actual clickstream of the person: they clicked on this and then they clicked on that. This you have for everybody, because the server logs it at every moment. Then, for some people, but only a small subset, you have a registration form that they actually filled out on the site. So you know a lot about them, because they took the trouble of filling that out. Precious information, but you only have it for a fraction of the people. And then there's another fraction of the people for whom you somehow got some demographic data from one of these companies that sell it. That can also be very useful, but again, it's only a small fraction of the people. The people for whom you have all three, the clickstream, the registration information, and the demographic information, are a small subset. You don't want to ignore the demographic and registration information for the people that you do have it for. But at the same time, you have to know what to do with the people that you don't have it for. Okay? So how might you handle that? What would you do? If you had just built your decision tree learner, the project was due five minutes later, and you had to come up with a method to handle missing data, what would you do? [Student answer is inaudible.] Okay. So let's look, for example, at the root. We're at the root, and we're trying to pick an attribute for it. And let's say for some of these attributes, I have the value for all the examples. So for those, I'm okay.
But now there are some examples for which I don't know some of the attributes. So now I'm looking at an example, trying to evaluate attribute A, say, the color. For some of these people, I don't know the color of the car they last bought. So what do I do? "Can I impute it from other variables?" Yes, impute it. That's actually the statistical term: impute it, meaning I'm going to make up a value for it. It doesn't have a value? Well, darn it, I'll force it to have a value, because that way I can keep going. Okay? So if you're going to impute a value, what value should you impute? The most common one. So say 80% of the people bought black cars, 10% white, and 10% red. Well, in the absence of the information, if I'm going to have to pretend that I know the color of this car, I'm going to pretend that it's black. Okay? So this is indeed the first strategy. It's quick, and it's better than nothing. But there's another strategy that is almost as quick and is way better. Can anyone guess what it might be? Here's a hint. Remember, why am I doing this? I'm looking at this attribute value because I want to predict the class. That's the purpose of the exercise: to come out with a class prediction at the end of the day. So what I really want to do is judge whether you're going to want to buy this car or not, depending on its color. So how might I incorporate the class into this process? Remember, with this example, I'm trying to decide which way in the tree it goes if I split on this attribute. And I do know the class for this example; I do know whether you bought the car or not. So what can I do? "Look at other examples that have the same class?" Yep, exactly. That is strategy number two.
I assign the most common value among examples of the same class, because that's what I care about. Among all the cars that customers actually bought, how many were black? I'm trying to correlate things with the class, so I shouldn't throw the class away when I'm imputing the value of this attribute. This is barely any more work than the previous strategy, but it often works way better, so I'd recommend it. However, you can do even better. Here's what you can do that's even better. You don't know what the value of this attribute is, right? But if you think about it, when you impute just one value, you're doing something that's probably quite unrealistic. You know that in reality, these people didn't all want to buy a black car; some of them wanted a red one. So instead of imputing a single value all the way, you can impute fractionally. You say: among the people who did buy, 80% of the cars were black, 10% red, and 10% white, so I'm going to split this example into three. One fractional example with a weight of 0.8, so it counts as 0.8 of an example, where you bought the car and it was black. Then 0.1 of an example where the color was red, and 0.1 of an example where the color was white. So now the example becomes three fractional examples. Think of it as dividing a cake: you don't send the whole cake down one branch. You divide it into pieces proportional to the empirical probabilities, and send the pieces down. And if you then come upon another missing value, you do the same thing again: you split the cake further, into smaller and smaller crumbs.
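Here is a minimal sketch of strategies one and three (the car dataset, the attribute names, and the class-conditional counts are all invented for illustration; strategy two is the same mode computation restricted to examples of the same class):

```python
from collections import Counter

def impute_most_common(examples, attr):
    """Strategy 1: fill every missing value with the overall most common value."""
    mode = Counter(ex[attr] for ex in examples if ex[attr] is not None).most_common(1)[0][0]
    return [dict(ex, **{attr: ex[attr] if ex[attr] is not None else mode})
            for ex in examples]

def split_fractionally(example, attr, examples, label):
    """Strategy 3: split one weighted example into fractional copies, in
    proportion to the attribute's distribution among examples of its own class."""
    same_class = [ex for ex in examples
                  if ex[label] == example[label] and ex[attr] is not None]
    counts = Counter(ex[attr] for ex in same_class)
    total = sum(counts.values())
    return [(dict(example, **{attr: value}), example.get("weight", 1.0) * c / total)
            for value, c in counts.items()]

cars = [
    {"color": "black", "bought": "yes"},
    {"color": "black", "bought": "yes"},
    {"color": "black", "bought": "yes"},
    {"color": "black", "bought": "yes"},
    {"color": "red",   "bought": "yes"},
    {"color": None,    "bought": "yes"},   # color unknown
]

# The missing example becomes 0.8 of a "black" example and 0.2 of a "red" one.
for frac, weight in split_fractionally(cars[5], "color", cars, "bought"):
    print(frac["color"], weight)
```

Entropy and error computations then just use these weights in place of whole-example counts.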
And everything that we've been doing in terms of computing entropies and errors and whatnot works just as well with fractional examples as with whole examples, so there's no problem there, okay? Questions about this? So that's at learning time. That's how I compute all my information gains, and I'm happy. But now what happens at classification time? Before, when we had no missing data, I had my example with its attributes; at each node, I always had the value, and I would route the example down one branch. But now let's say I have my decision tree, I'm trying to classify a new car, I come upon color, and I don't know the color. So what do I do? I can do the same thing that I did at training time: I can break the car up into 0.8 and 0.1 and 0.1 and send it down three branches. And then the pieces keep splitting and trickle down to the leaves. No problem there; it's the same thing at training and at test time. But now what do I predict? In the complete-data scenario, each example at test time always went down to exactly one leaf, and so I predicted the class at that leaf. Now an example basically rains down on a whole bunch of leaves. So which class do I predict? Any suggestions? "The one with the highest weight?" Yeah, the highest weight, that's the answer. The answer is democracy: let them vote. Which leaves picked up the most breadcrumbs? You have a little scale at each leaf, you weigh the amount of cake that fell there, and whichever class gets the biggest total weight of cake wins. This is the first of many instances where voting is a great thing to do in machine learning.
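Here is a minimal sketch of that test-time routing; the tree structure, the "value_probs" field, and the color distribution are all invented for illustration:

```python
def classify(tree, example, weight=1.0):
    """Route a (possibly incomplete) example down a small decision tree,
    splitting its weight across branches when an attribute value is missing.
    Returns a dict mapping class -> total weight that reached leaves of that class."""
    if "leaf" in tree:                       # leaf node: all remaining weight votes here
        return {tree["leaf"]: weight}
    attr, branches = tree["attr"], tree["branches"]
    value = example.get(attr)
    if value is not None:
        targets = [(branches[value], 1.0)]
    else:
        # Missing value: send a fraction down every branch, in proportion
        # to the value distribution observed at training time.
        dist = tree["value_probs"]
        targets = [(branches[v], dist[v]) for v in branches]
    votes = {}
    for subtree, frac in targets:
        for cls, w in classify(subtree, example, weight * frac).items():
            votes[cls] = votes.get(cls, 0.0) + w
    return votes

# A hypothetical one-level tree over a "color" attribute.
tree = {
    "attr": "color",
    "value_probs": {"black": 0.8, "red": 0.1, "white": 0.1},
    "branches": {
        "black": {"leaf": "buy"},
        "red":   {"leaf": "buy"},
        "white": {"leaf": "no"},
    },
}

votes = classify(tree, {})            # color unknown: "buy" gets ~0.9, "no" gets ~0.1
print(max(votes, key=votes.get))      # buy
```

The final prediction is just the class with the largest accumulated weight, exactly the weighing-the-cake picture.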
Whenever you're trying to handle situations where you're not sure what to do, one good idea is: well, let them vote. Let the crowd make the decision. This is like crowdsourcing, the wisdom of crowds, except in this case it's the wisdom of the algorithms, but the principle is the same. In fact, we'll have a whole class, as I mentioned last time, about how to combine models, and you can see this as a very simple example of that. In an interesting twist, people initially developed the ability to handle fractional examples because of this, because of missing data, and that came in very handy later when people developed algorithms for voting among whole ensembles of models. But that's getting a little ahead of our story. Any questions about this? So notice that we saw the basic algorithm and then a bunch of extensions to it, each of which is fairly straightforward. We haven't seen anything really difficult or complicated to implement yet, but we're now beginning to have an algorithm that you can throw real-world data at, and it might do okay. There is one huge issue that we haven't dealt with yet, which is overfitting, and we will talk at length about that after the break. So it's 7:52; let's start again at 8:02. All right, let's get going again. The second half of today's class could well be the most important half-class of the whole quarter, so it's well worth staying awake for. We're going to talk about overfitting in decision trees, but overfitting is not just a problem in decision trees; it's a problem in any powerful machine learning approach. So some of the issues here, and some of the ways to deal with them, are going to be quite general. This is the first of many times we're going to encounter them.
So first of all, let's see what the overfitting problem is, and then let's see how we can try to solve it, or at least address it, in the context of decision trees. Going back to our example of playing tennis: let's suppose there was this one time, a beautiful Sunday morning, let's say. It was sunny, not too hot, not too humid, not windy, not rainy. A perfect day for playing tennis. But your friend had been out partying the night before and has a huge hangover. So now she has a huge headache and she's in a bad mood. And when you call her up to ask if she wants to play tennis, she says: no, absolutely not. What does that do to your decision tree? That's the question. Here's a more concrete version. Suppose we had an example like this: it was sunny, so we go down this branch. It was hot, but we don't care about the temperature. And the humidity was normal, which means the answer should have been yes. Your friend should have said: yeah, let's go play tennis. But your friend said no. So now what happens to the decision tree? Exactly: I need to keep splitting here. This was my leaf here, and now I'm going to need to split on something more, and I can keep splitting until I get the answer that I want, which is no. So now your decision tree learner is happy. But what's the problem with this? What went wrong here? The problem is that it wasn't because it was a bad day for playing tennis that your friend said no. It was because of a completely random thing that you have no power over and no ability to predict. So what I should have done is held off. This example looks like an outlier; that's the technical term, outlier. It really shouldn't be here.
Instead, by trying to fit a little branch of the decision tree to this one example, in the process I now get a whole bunch of others wrong: a bunch of examples that should have been yes have now become no. So let me hold off on that. The moral of the story is: if I try to fit my training dataset perfectly, I will end up with something that may look very good but is actually very bad. Everybody got the general idea? This is a crucial concept in machine learning. So let's generalize from this example. What is overfitting? Here's one way to define it. We're really concerned with two kinds of error. We're always going to be talking about these two kinds of error: when we do our experiments, when you do your projects, you need to distinguish between them, and when we do the theory, it's all going to be about them. The first is the error on the training data. This is what you can see and optimize. How many of the training examples am I getting wrong? How many users am I making the wrong prediction for on this clickstream data? Let's call this error_train(H). The other one is what you really care about: the error on the entire distribution of the data. It's the error you would get if you could somehow see everybody, if you could get all the data you're ever going to see in the future, in arbitrarily large amounts, and you knew exactly what the distribution was. That's the error you want to optimize. Let's call it the error with subscript curly D, where D is the distribution of the data. The whole problem of machine learning is that the second one is what we want, but the first one is what we have. If the two were the same, there would still be a complicated computational optimization problem, but it would be no different from the problems in many other areas of computer science.
The thing that's pretty unusual about machine learning is this disconnect between what you have and what you would like to have. We're going to say that a hypothesis overfits when it looks like the best one on the training data, but on the true distribution, the data you're going to see in the future, it's not the best hypothesis. So H overfits if there's some other hypothesis, H prime, such that the training error of H is lower than the training error of H prime, but on the true distribution, the opposite happens: the error of H is greater than the error of H prime. So I should really have picked the other one. But you don't actually have the test data; that's the problem. So the whole game, which may seem almost impossible at first, is: I need to guess, from the training data alone, which hypothesis is going to do better on the test data. And certainly the training data is precious; that's all I have to go on. So I need to be a good friend of the training data, that's one way to look at it, but not too friendly. If you get too friendly with the training data, you start forgetting about what you really care about, which is the test data. The training data is a guide to what happens on test data, so you very much need to take it into account. But it's an imperfect guide, so you also need to hold off a little bit. It's a little like a golf swing: there's the first part of the swing, where you're accelerating, but then there's the part where you have to stop, otherwise you won't have a good swing. We need to fit the data; that's the first part of the swing. And then we need to hold off; that's the second part. Both parts are essential for a good swing, and if you have just one of them, you don't get anywhere. Questions? Okay, so that's the theoretical definition.
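Written out, the definition just stated looks like this, where $h$ is a hypothesis, $\mathrm{error}_{\mathrm{train}}$ is the error on the training sample, and $\mathrm{error}_{\mathcal{D}}$ is the expected error over the true distribution $\mathcal{D}$:

```latex
h \text{ overfits if } \exists\, h' :\quad
\mathrm{error}_{\mathrm{train}}(h) < \mathrm{error}_{\mathrm{train}}(h')
\;\text{ but }\;
\mathrm{error}_{\mathcal{D}}(h) > \mathrm{error}_{\mathcal{D}}(h').
```

In words: $h$ wins on the sample you can see, but some other hypothesis $h'$ wins on the distribution you actually care about.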
Let's look at overfitting in practice. These are the kinds of curves that people who do machine learning in practice look at all the time, and they're actually quite typical. Can everybody see these two curves here? Okay. This is a real problem: diagnosing diabetes. The dataset is publicly available; you can download it from the web and play with it. We're building a decision tree that, based on a number of tests and symptoms, decides whether somebody has diabetes or not. And here's what happens as you increase the size of the tree, as you add more nodes to the decision tree, learning a bigger and bigger tree. First of all, when you just predict the most frequent class, which here is "this person does not have diabetes," you get 65% accuracy. This axis here is accuracy. So just predicting the most frequent class, that's your sanity-check strawman: what happens if I just predict the most frequent class, how well do I do? That gets you 65% accuracy, which is better than nothing, but not very impressive. But now notice what happens as I start to grow my decision tree. The solid line is my training set accuracy, my accuracy on the data that I'm seeing, and not surprisingly, it just goes up and up and up. It never goes down. Why does it never go down? By design, right? Whether my measure is error rate or information gain, if I'm optimizing it, the training error can never go up. Think of error rate: I picked this attribute because it made the error rate smaller, and if nothing makes the error rate smaller, I stop. Okay. So if you only looked at the training data, you'd make the mistake that rookies make: you'd say, great.
And I may be laughing at the rookies, but I've had the experience in this class of somebody saying: hey, we mined the clickstream dataset and we got 100% accuracy, so why don't we have a better grade? And I'm like: you don't have a better grade because you got 100% accuracy. It was 100% accuracy on the training data. But that's okay; these are beginners, these are students. You're here to make mistakes; make them all here so you don't make them out there in the world. That's what education is for. But I remember, when I was a grad student, sitting in my advisor's office while he was consulting for various companies and institutions that needed machine learning. And I remember an exchange of this type. He asked them: so what are you using right now? I think it was a health problem, but it doesn't really matter. And they said: we're using a decision tree. But they wanted his help, right, for some reason. And then he asked them: so what is the accuracy of that decision tree? And they said: oh, it's 100%. We're doing very well in that respect. Not a problem there, right? So this happens. Here's another example. I was an intern in a research lab once where they did a lot of speech work. These were serious people, excellent researchers; they knew what they were doing. Neural networks were hot at the time, and they were using neural networks, a very natural fit. But these were the early days, and people didn't understand these issues very well. They trained a neural network, and it was just perfect. Actually, what they were doing was speech synthesis, I'm sorry, speech synthesis, not recognition.
So you feed in the text, and your neural network outputs what the system should be saying. This is a famous neural network application, because neural networks are very good at it. So they trained it on a whole bunch of sentences: here's the text, here's the pronunciation, the spoken sentence versus the text sentence. They trained it on a sliding window; this is often done. The accuracy on the training data was perfect. And then they went and tried it on new data that they got from a telecoms company or something. And you know what came out when they ran the speech synthesizer, with the neural network inside it, on these new examples? For all of the new examples, what came out was white noise. That is overfitting for you. They fed in the text sentence, and out came: shhh. And then they went and figured out what the heck had happened. What had happened? The neural network had exactly memorized all the training examples. It may not be obvious how a neural network can do that; I can actually tell you in more detail, but I won't here. If you give it enough neurons, enough weights, it has enough power to just memorize the training data. It was massively overfitting. So the same thing is happening here. On the training data, things get better and better and better. But let's look at what happens on holdout data. This is what you should always do when you do machine learning: don't do all your trying-things-out on all the data, because if you do, you'll never really know whether you have a good model or you overfit. Leave some data aside that you only look at at the end. The learning algorithm does not have access to this data, but then you can evaluate on it what the algorithm did. So here's what happens.
As you grow the tree, first the accuracy goes up, then it plateaus, and then, look at this, it starts to go down and down and down. Big problem. You thought you were here, but in reality you were here. Where should you have stopped growing the decision tree? Assuming you can read these numbers: 20. At 20, right? Or somewhere between 10 and 20. Actually, I would stop at 10, because given two equally accurate decision trees, I'd rather take the smaller one. But as far as overfitting goes, yes, anything between 10 and 20 would have been okay. After that, my improvements are an illusion. This is the overfitting problem: you think you're doing better and better, but you're actually doing worse and worse. And these curves, learning curves, which is what they're called, come in all shapes and sizes, but this one is actually very typical, which is not coincidentally why it's on these slides. Notice that initially your accuracy gets better very quickly. This is your big win: you still have a simple decision tree, and it's already predicting quite well. But you want to predict even better. Then things plateau. And then you get this region where you think you're getting slowly better, but you're actually getting slowly worse and worse. So, moral of the story: I want to stop here. How do I do that? I need to figure out where to stop without looking at the holdout data, because I don't have it. Now, this is hard. But the good news is that precisely because the early part of the curve is steep and the later part is gradual, you have a lot of leeway. You could stop anywhere between 10 and 20, and those are all fine. And if you stop at 30, sure, you didn't quite get to 75%, but it's still, whatever, 73 or 74, not the end of the world. So you don't necessarily need to nail down the exact point at which overfitting begins.
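You can reproduce the gap between the two curves with a tiny simulation. This is a sketch, not the diabetes data: the rule, the noise rate, and the two model families are all invented to make the point that a model which memorizes the training set looks great on it and loses on fresh data.

```python
import random

random.seed(0)

def make_data(n):
    """Noisy binary data: the true rule is x < 50, but 20% of labels are flipped."""
    data = []
    for _ in range(n):
        x = random.randrange(100)
        label = (x < 50) != (random.random() < 0.2)   # flip the label with prob 0.2
        data.append((x, label))
    return data

train, test = make_data(200), make_data(10000)

# High-capacity "model": memorize every training point exactly,
# like a decision tree grown all the way down.
table = {x: y for x, y in train}
def memorizer(x):
    return table.get(x, False)

# Low-capacity model: the simple rule the data actually came from,
# like the small tree you should have stopped at.
def acc(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

def threshold(x):
    return x < 50

# The memorizer looks better on the data it fit, but on fresh data the
# simple rule wins: the extra training "accuracy" was just fitting noise.
print("train:", acc(memorizer, train), acc(threshold, train))
print("test: ", acc(memorizer, test), acc(threshold, test))
```

The training curve for the memorizer is the solid line going up and up; the test numbers are the dashed line coming back down.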
But you had better do some overfitting avoidance, or you could get a disaster like the speech synthesis one I just talked about. Questions? How can we do this? Any suggestions? "Periodically re-evaluate on the holdout data?" Yes, but here's the thing. I'm getting ahead of things a little bit, but think of this as pruning the tree. That's the analogy with actual gardening, if you will: I grow the whole tree, and then I prune it back. But based on what data? If you do the pruning based on the same data you were using to decide which test was best, using information gain, then you're just going to make the same decisions again. So you have to do something different from what you were doing before, okay? And there's more than one thing you could do, but, any suggestions? You have a suggestion? "Can you split the training data into two separate sets?" Absolutely. This is a very simple thing to do, and it's golden. Let your algorithm do what you yourself should be doing, which is: don't look at all the data at training time. When you give the data to the algorithm, the first thing it does, randomly, very importantly, randomly, is take some of the data for training and save some of it for later. It learns on the training part, and it ends up with a very big tree that's probably overfitting. Then it sets the training data aside and looks at the held-out data, and based on that data, it starts cutting the tree back. As long as the held-out data says the cutting is helping, you keep cutting back, and when it says no more, you stop, okay? Simple strategy.
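A minimal sketch of that strategy, usually called reduced-error pruning, might look like the following; the dictionary tree representation, the "majority" field, and the tiny tennis-flavored tree and validation set are all invented for illustration:

```python
# A decision tree node is either a leaf, {"leaf": cls}, or an internal node,
# {"attr": name, "branches": {value: subtree}, "majority": cls}.

def predict(tree, example):
    while "leaf" not in tree:
        subtree = tree["branches"].get(example.get(tree["attr"]))
        if subtree is None:
            return tree["majority"]       # unseen value: fall back on the majority class
        tree = subtree
    return tree["leaf"]

def accuracy(tree, data):
    return sum(predict(tree, ex) == ex["class"] for ex in data) / len(data)

def prune(node, root, validation):
    """Bottom-up reduced-error pruning: collapse a subtree into its majority-class
    leaf whenever that does not hurt accuracy on the held-out validation set."""
    if "leaf" in node:
        return node
    for value, subtree in list(node["branches"].items()):
        node["branches"][value] = prune(subtree, root, validation)
    before = accuracy(root, validation)
    saved = dict(node)
    node.clear()
    node["leaf"] = saved["majority"]      # tentatively collapse this node...
    if accuracy(root, validation) < before:
        node.clear()
        node.update(saved)                # ...and undo if validation accuracy drops
    return node

# The "headache" split below fit one noisy training example; on held-out
# data, collapsing it back to the majority answer "yes" is strictly better.
tree = {
    "attr": "humidity", "majority": "yes",
    "branches": {
        "normal": {"attr": "headache", "majority": "yes",
                   "branches": {"yes": {"leaf": "no"}, "no": {"leaf": "yes"}}},
        "high":   {"leaf": "no"},
    },
}
val = [
    {"humidity": "normal", "headache": "yes", "class": "yes"},
    {"humidity": "normal", "headache": "no",  "class": "yes"},
    {"humidity": "high",   "headache": "no",  "class": "no"},
]

prune(tree, tree, val)
print(tree["branches"]["normal"])   # {'leaf': 'yes'} -- the noisy split is gone
```

Note the root split on humidity survives, because collapsing it would actually hurt validation accuracy.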
There are a couple of disadvantages to this strategy. What might they be? "You have less training data." That is the killer one. If you have 100 examples and you took 50 of them for pruning, now you only have 50 to learn on. Ouch. You could actually do worse this way. Sure, you'll be doing better pruning, but you learned the tree from 50 examples instead of 100, and remember that pruning is just finding the best subtree of your tree. So if the tree is bad, the result will be bad too. That's one disadvantage, but there's also another. So: is this disadvantage going to be particularly bad when you have a lot of data, or little data? Little data, right. In these days of big data, this actually sounds like the method of choice: I have data up the wazoo; I can split it ten ways and still have enough to learn on. So in that case, not having enough data to learn from is not a concern. As a little parenthesis: just because I have big data doesn't mean life is easy, because for the particular decision that I'm making right now, maybe I don't have a lot of data. But suppose I really do have a lot of data, which in many cases I do. Then splitting some of it off for validation, as it's called, creating a validation set or a pruning set, is okay. But then what's the problem in the big data regime? "The trees get pretty large if you have a lot of noise." Exactly, and this is the killer problem: efficiency. Look back at this example. The decision tree that I want to learn only has ten nodes. Beautifully simple, and beautifully quick to learn and apply, too. Remember, the bigger the tree, the more cycles it takes to learn and apply, and the more memory to store. But I actually grew it all the way up to 200 nodes, thinking I was improving.
And guess what? Suppose I have a dataset of 300 million examples, or say all the users on Facebook, which is pretty close to a billion these days. That's a lot of data. But let's say that even just 10% of it is noise. That's 100 million noisy data points, so I'm basically going to grow a decision tree whose size is on the order of 100 million nodes. And suppose the decision tree that I actually wanted, to decide whether to put this banner in front of you or not, only has ten nodes. Notice what happened: the tree I needed had ten nodes, but I accelerated past it to a tree with, let's say, 100 million nodes, and then I had to cut back. Let's say my pruning works really great; now I have a good decision tree, but the resources I had to use to get it were orders of magnitude greater. And remember, the cost of learning a decision tree, at least for many of the algorithms, is superlinear in the size of the data. Even if it's just quadratic, going to 100 million nodes versus going to, let's say, a hundred is just prohibitive. So pruning like this often works better, but it has these two problems. If you're in the small data regime, it could actually hurt you more than it helps. If you're in the large data regime, it might be a wonderful thing, but you just don't have the cycles to do it, or you're not willing to wait that long. So what other methods can we use? "You could prune as you go. Each time you add a node, you could ask: does my held-out data say this is a bad idea?" Precisely. So the obvious alternative to growing the whole tree and then shrinking it is: don't grow it in the first place. I need to decide, as I go, that I want to stop growing here. Right?
But notice, now I really cannot do this on the validation set, because if I do this on the validation set, it's not the validation set anymore. What makes the validation set powerful is that I'm only choosing among a small number of choices there: the different subtrees of the tree that I could have. At learning time, I'm actually choosing among an exponential number of choices, so I would consume the data really quickly. So we have that problem; we have to do something different there. So what might we do? What's something that we might do, if you can recall something from your Stats 101? This is actually something from Stats 101 that we can plug in here. Any ideas? Yeah. Of course, it's always that way. Okay, so tell me what's up there. I guess you could have some kind of test of whether there is a significant improvement in the next layer of the tree or not. Statisticians are very fond of this, right? Significance tests. You may not remember this, or may never have learned it, but there are these things called contingency tables, where I compare two distributions and try to decide whether they are the same or not. If they were exactly the same then things would be easy, but there's always noise, right? So the question is: is this distribution just a noisy version of that distribution, or is it really a different distribution? And this is exactly the situation that we're in here. We can use accuracy or information gain to pick the best attribute, but then we can apply one of these statistical tests, like chi-square, for example. There are others, but let's say chi-square; this is the one that you're going to have in your project. I use a chi-square test to decide: is the class distribution that I'm getting now significantly different from the class distribution that I had before?
If it is significantly different, then I'm not overfitting, right? Because I'm really modeling the class distribution better when I have this attribute than when I don't. On the other hand, if they look close enough (and the less data you have, the harder it is to tell, of course), maybe I should stop, because I'm not confident that I really have a different distribution, and at that point I really might be overfitting. Okay? Everybody follow this? So this is option number two. In fact, historically, this was the option that people started out using first. Quinlan's first system was called ID3, and he used pre-pruning with a chi-square test. You actually have a pointer to his original, very famous paper where he does this. It's actually fairly simple and straightforward, and it makes a difference. But then he started finding problems with this kind of pre-pruning, as it's called. If you think about it from the gardening point of view, pre-pruning is a very odd notion, right? How can you prune a plant before you grow it? But in decision tree terms, you can talk about pre-pruning. He actually found that pre-pruning was very bad a lot of the time, and in fact, if you do post-pruning, which then became the norm, you could do much better. And then he had this system called C4.5 (both of these are available on the web, by the way) where he did post-pruning. And then the data mining explosion happened, and he created a startup and got rich and retired. That's a good outcome, I believe. And just the other day he won the SIGKDD Innovation Award, which goes to the very best data miners in the world, and he very deservedly won. What is an example of a problem where this kind of pre-pruning fails miserably? We've already seen it, and not long ago. Do you want more hints?
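As a sketch of what a chi-square pre-pruning test might look like (illustrative only, not the project's specified implementation), assuming a contingency table of branch counts versus class counts, and the usual critical value of 3.841 for one degree of freedom at the 5% level:

```python
def chi_square_statistic(table):
    """Chi-square statistic for a contingency table:
    rows = branches of the candidate split, columns = class labels."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat

# Critical value 3.841 is for a 2x2 table (df = 1) at alpha = 0.05;
# in general df = (rows - 1) * (cols - 1) and the threshold changes.
def split_is_significant(table, critical_value=3.841):
    return chi_square_statistic(table) > critical_value

informative = [[40, 10], [10, 40]]    # branches strongly predict the class
uninformative = [[25, 25], [25, 25]]  # same class distribution in both branches
```

Under the pre-pruning rule above, you would keep the first split and stop before making the second one.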
I can give you a visual hint. This is not a trivial question, right? So don't be ashamed of needing hints. If you thought of the test, you assumed some distribution of the data, right? No, I agree, I agree. But let's try to think of a very simple, let's say, boolean example where pre-pruning fails and post-pruning succeeds. The previous levels don't really show much improvement, but later on, the second or third level will improve the accuracy. Yep, yep, yep. Remember this example? Why does pre-pruning fail and post-pruning succeed here? Well, first of all, does pre-pruning fail here? Parity, right? What is the problem with parity for pre-pruning? Is the distribution of the class significantly different when you have split on X1 and when you haven't? Heck, no, it's exactly the same. Clearly not significant. So clearly pre-pruning fails here. What about post-pruning? Post-pruning is more interesting. What happens with post-pruning here? So post-pruning, to recap, is: I just barrel on. I keep growing the tree with whatever attributes I have. If there's a tie, I pick something. If they all have zero gain, I just pick one and hope to get lucky in the future. So what would post-pruning do here? I'm at the root; I need to pick one attribute to split on. What's going to happen? One of three things is going to happen. I might have X1 as the thing that I split on. Is this good or bad? This is good, right? That's what I want. Or X2, which is still good. On the other hand, I could also have X3. They all look the same, right? But in this case, I actually have a two-thirds chance of picking the right thing. Right? And if I pick...
On the other hand, I do run the risk of picking the wrong thing, and imagine if there were, say, 100 irrelevant attributes; then I almost surely would pick one of the wrong ones. Which is why even post-pruning does not entirely solve this problem, but it does do something useful, and depending on the problem, it actually may be enough. Why? So first of all, let's suppose that I had split on X1. Now what happens? What do I split on next? Let's take the branch where X1 is true. Among the examples where X1 is true, so that's these ones here, I am trying to choose between splitting on X2 and splitting on X3. What do you think I should do? X2. X2, right? X2 is perfect and X3 is noise. So I had to get lucky the first time, but the second time there is no luck involved. Same thing if I had picked X2 first and then X1. Okay, but of course the tough case for us is: what if I had split on X3? Now what happens? What do I split on next? This is not the hardest question you'll have in this class. After I split, unless you're like, what do they call it, Buridan's ass, which starved between two bales of hay because it couldn't decide which one to pick. So we split on X3, right? So what are we going to do next? Exactly, it's going to be one of two things: either pick X1 or pick X2. Are those things good or bad? They're both good, right? And then after I split on one of those, what happens next? I split on the other, and then we're done. So, moral of the story: in all cases here, I wound up learning the right concept, so things were not so bad. However, if you look at the decision trees produced in each of these cases, which one would you prefer? The one that started with the split on X1, or the one that started with the split on X3? The one where you split on X1, right? Because the other one has garbage in it.
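You can check the parity argument numerically. Here's a hedged toy sketch (not from the course materials): with y = X1 XOR X2 and X3 irrelevant, the information gain at the root is zero for every attribute, so a gain or significance threshold stops immediately, yet after splitting on X1, X2 suddenly has perfect gain.

```python
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def information_gain(examples, labels, attribute):
    """Entropy reduction from splitting on `attribute`;
    examples are dicts mapping attribute names to values."""
    gain = entropy(labels)
    for value in set(e[attribute] for e in examples):
        subset = [y for e, y in zip(examples, labels) if e[attribute] == value]
        gain -= len(subset) / len(labels) * entropy(subset)
    return gain

# Parity: y = x1 XOR x2, with x3 completely irrelevant.
examples = [{'x1': a, 'x2': b, 'x3': c}
            for a in (0, 1) for b in (0, 1) for c in (0, 1)]
labels = [e['x1'] ^ e['x2'] for e in examples]

# At the root, every attribute looks equally useless (zero gain)...
root_gains = [information_gain(examples, labels, x) for x in ('x1', 'x2', 'x3')]

# ...but among examples with x1 = 1, splitting on x2 is perfect.
sub = [(e, y) for e, y in zip(examples, labels) if e['x1'] == 1]
sub_gain = information_gain([e for e, _ in sub], [y for _, y in sub], 'x2')
```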
The other one isn't going to give you any insight; in fact, it's going to mislead you into thinking there's structure where there isn't. And of course this is a toy example, but the bigger tree is going to cost you more to store, more to apply, et cetera. So it still would have been better to find one of the others, but hey, with post-pruning, we actually found the right concept, okay? However, this is a somewhat optimistic situation, because what's going to happen in the real world? Suppose that you have two relevant attributes, or even ten, but then you have a hundred irrelevant ones. What's going to happen there? Probably the first thing I do is split on a bad attribute. And then, what do I split on next? Another bad attribute. And I keep splitting on bad attributes. And what is the problem with splitting on bad attributes? After each split, I have less and less data. I get exponentially less data. This is called the splintering problem. As I grow my decision tree, I'm exponentially peeling away parts of the space, and I'm zeroing in on small parts that eventually have no data in them. So I actually wind up with pure noise; I never got around to my good attributes. Now, what will happen on a good day is that I will get some mix of the good and bad attributes in there. And this actually often happens in practice: I learn a very big decision tree, most of which is garbage, but it's got some good stuff in there. So for prediction purposes, it still does better than chance, and that might be good enough for what I need. It depends on the ratio of good to bad attributes, and on how difficult it is to figure out how the function depends on the good attributes. So the moral of the story is: post-pruning is much better than pre-pruning in this respect. Pre-pruning always fails here; post-pruning has a chance.
But nevertheless, it's still not perfect. Okay, so any questions so far? Okay, there is one more method that is also quite popular that we want to touch on. We've seen two methods so far: stop growing the tree when the split is not statistically significant, and grow the full tree and then post-prune. There is a third method, which you also use when you don't have a lot of data. When you don't have a lot of data, and say you don't like this whole notion of statistical tests and whatnot, what else might you do? Any ideas? This is probably not obvious, but it's a good example of the type of strategy that people use in machine learning all the time. And the strategy goes like this. Actually, I don't know if I have that here, but... yeah. So we've already looked at these two cases, and now we're going to look at this third case. We have an evaluation function, say the information gain or the accuracy, that doesn't quite do what we want. So let's change it. Let's change it to an evaluation measure that is more likely to give us what we want. This evaluation measure, of course, is not going to do as well at just optimizing the training error; hopefully, it moves us away from training error and towards what we would really like to have. And people are very creative in coming up with measures of this type. Use your creativity to add something to the error or the information gain that will help you induce a good tree. Can you think of what such a thing might be? Again, there are many options, but let's try to think of a simple one. The size of the tree? Right. Usually the trees that are overfitted are very big. So you know, even without looking at any validation data, that as your tree is getting larger, you should start to worry.
You don't know exactly how much, exactly when, but you have this general feeling that the bigger the tree is, the thinner the ice that you're on. And so you could have a heuristic based on that: take your error measure and add to that a penalty on the complexity of the tree, say the number of nodes. Again, you could do very fancy things there, but a lot of the time you don't need to; let's just say the number of nodes. Okay? And so now what has to happen is that when I grow the tree, it has to be improving my accuracy by more than the complexity penalty is costing me. So I will allow extra nodes in my tree, but they have to pay for themselves by more than a certain increase in accuracy. Okay, but how do I weigh these two things? Well, maybe for that you can use held-out data. You can use held-out data to pick the coefficient on your complexity penalty that trades it off against accuracy. If you don't have a lot of data, this could be a good thing to do. Some people prefer significance tests, some people prefer complexity penalties, but those are both viable options. If you have a lot of data and you have a lot of cycles, growing the full tree and then pruning is unlikely to be beaten. If you have a lot of data but you don't have a lot of cycles, you might find yourself doing one of these two things, because you can't really afford to do the full growth and then the full pruning. Questions? So this method of growing the full tree and then pruning, by the way, is often called reduced error pruning. Reduced error pruning is where you prune by the reduction in error that you're getting as you prune.
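The penalized measure is easy to write down. A minimal sketch, where the coefficient alpha is made up for illustration; as said above, you would really pick it using held-out data:

```python
def penalized_score(error_rate, num_nodes, alpha):
    """Error plus a complexity penalty; lower is better. Each extra node
    must 'pay for itself' by reducing error by more than alpha."""
    return error_rate + alpha * num_nodes

alpha = 0.01
# Growing from 5 to 9 nodes cuts error 0.20 -> 0.17, but the 4 extra
# nodes cost 4 * 0.01 = 0.04, so the bigger tree scores worse overall:
small_tree = penalized_score(0.20, 5, alpha)
big_tree = penalized_score(0.17, 9, alpha)
```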
You split your data into a training and a validation set, and you learn your tree on the training set. Then what you do is you look at pruning each possible node, see how much that improves your error on the validation set, pick the best one, and as long as you keep getting an improvement, you keep pruning. When every possible pruning left that you could do only hurts the accuracy on the validation set, that's when you stop. And in fact, unlike the growing stage, where things are basically greedy, if you do this, you can actually prove that you get the best decision tree in terms of validation error. Questions? Okay, now, before we go on to that: here's what happens on the diabetes dataset when you do reduced error pruning, and again, this is fairly typical. So I grew my whole decision tree, remember, all the way to here, and now I'm here. Very bad tree. And guess what? I start getting better and better and better until I finish here. Brilliant. Actually, in this case, the post-pruning pretty much aced it. We could have gotten a slightly smaller tree, but we got just about the best accuracy that we could. So it worked in this case. Quick question: notice that the curve that I get on the way back is above the curve that I got on the way there. Is that an accident, or is there a reason for that? If I'm asking, it's probably because there's a reason, right? So that was more of a rhetorical question. What might the reason be? Because you only prune if you get better. Yeah, and notice, here's the key thing: the nodes do not get pruned in exactly the reverse order to the order in which they got added. If the first node that I pruned was always the last one that got added, the two curves would be exactly the same. But they aren't.
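Here's a hedged sketch of the whole reduced error pruning loop. The representation is mine, not the project's: a tree is either a class label or an (attribute, branches) tuple. We repeatedly collapse whichever single node most helps validation accuracy, stopping when every remaining collapse hurts (ties go to the smaller tree, one reasonable reading of the procedure described above).

```python
def predict(tree, example):
    """A tree is a class label, or a tuple (attribute, {value: subtree})."""
    while isinstance(tree, tuple):
        attribute, branches = tree
        tree = branches[example[attribute]]
    return tree

def accuracy(tree, examples, labels):
    return sum(predict(tree, e) == y for e, y in zip(examples, labels)) / len(labels)

def majority(labels):
    return max(set(labels), key=labels.count)

def candidate_prunes(tree, examples, labels):
    """Yield every tree obtainable by collapsing one internal node to a
    leaf labeled with the majority class of the training data reaching it."""
    if not isinstance(tree, tuple):
        return
    yield majority(labels)                      # collapse this node itself
    attribute, branches = tree
    for value, subtree in branches.items():
        reaches = [i for i, e in enumerate(examples) if e[attribute] == value]
        sub_ex = [examples[i] for i in reaches]
        sub_y = [labels[i] for i in reaches]
        for pruned in candidate_prunes(subtree, sub_ex, sub_y):
            new_branches = dict(branches)
            new_branches[value] = pruned
            yield (attribute, new_branches)

def reduced_error_prune(tree, train_ex, train_y, val_ex, val_y):
    """Greedily apply the best single prune while validation accuracy
    does not decrease."""
    best_acc = accuracy(tree, val_ex, val_y)
    while True:
        best = None
        for cand in candidate_prunes(tree, train_ex, train_y):
            acc = accuracy(cand, val_ex, val_y)
            if acc >= best_acc:
                best_acc, best = acc, cand
        if best is None:
            return tree
        tree = best

# Overfit tree: the true concept is just y = x1, but noise grew a split on x2.
tree = ('x1', {0: ('x2', {0: 0, 1: 1}), 1: 1})
train_ex = [{'x1': 0, 'x2': 0}, {'x1': 0, 'x2': 1}, {'x1': 0, 'x2': 0},
            {'x1': 1, 'x2': 0}, {'x1': 1, 'x2': 1}]
train_y = [0, 1, 0, 1, 1]                       # one noisy label
val_ex = [{'x1': a, 'x2': b} for a in (0, 1) for b in (0, 1)]
val_y = [e['x1'] for e in val_ex]               # noise-free validation set
pruned = reduced_error_prune(tree, train_ex, train_y, val_ex, val_y)
```

On this toy data the noisy split on x2 gets collapsed and the pruned tree predicts the true concept exactly.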
When I start pruning, I actually have the choice of pruning any node anywhere that I want, and I pick the best one. So, in general, it's going to be better to prune that one than the one that I happened to add last. I added one over here, but there's actually another one that I'm better off pruning. So reduced error pruning can be very good, provided you have the cycles and the data. Questions? Very good. That's actually a very interesting question, and it gets us talking about the stuff that we're going to be talking about next week, which is rule post-pruning. This is actually something that Quinlan added to C4.5, and it wound up being what is used most of the time. It's a kind of reduced error pruning, so it's a kind of post-pruning, but the interesting thing is that you prune the tree not into a tree, but into a set of rules. That's why it's called rule post-pruning. So what's going to happen is that I'm going to induce the tree not as the end goal, but as a way to finally get a good set of rules out. Next week, we're going to talk about inducing rules from scratch, without worrying about decision trees, and then we'll compare the two. But you can also induce rules by first inducing a decision tree and then pruning it to rules. So let's look at three things in turn. First, what is the correspondence between decision trees and sets of rules? Second, how do you prune a decision tree into a set of rules? And finally, why might that be a better idea? So first of all, here's our good old friend, the decision tree for playing tennis. And the first question is: if I were to just convert this into a set of rules (don't even worry about pruning for just a second), what would the set of rules look like? Remember, a decision tree is really a set of nested if-then statements, right?
You know, if outlook is sunny then, if humidity is high, then no, and so on and so forth, right? So it's a tree of if-then statements. A rule is just a statement of the form: if this and this and this, then play tennis. And then there's another rule that says: if that and that and that, then don't play tennis. And so forth, okay? So for every decision tree, there's an exactly equivalent set of rules. Okay? So what is that equivalent set of rules? This is probably a very easy question, but nevertheless, let's go through it. So here's my decision tree. Give me one rule, sometimes called a decision rule, that you would get from it. If outlook equals overcast, then yes. If outlook equals overcast, then yes. Brilliant. This is a really short rule, right? What's another one? If outlook equals sunny and humidity equals normal, then yes. Yup, that's another one. So if the tree has n leaves, how many rules are we going to have? n. Right, there's going to be one rule per leaf, or per path in the decision tree. And the rule corresponding to a particular leaf, what are its antecedents going to be testing? The attributes on the corresponding path of the decision tree. No-brainer. And then the consequent is just the class prediction at that leaf, okay? So for every decision tree there's an equivalent set of rules, and I can turn one into the other completely mechanically. Did everybody follow this? Yeah, straightforward, okay? So the first thing that I do when I do rule post-pruning is I just take my final full, big, noisy, overblown, overfit decision tree and convert it to the equivalent set of rules. Okay? So now, what do you do? You have your set of rules, you have your validation set over there that you've never looked at, and you want to come up with a better set of rules, right? So how would you do that?
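The conversion is completely mechanical, as said. A sketch using a nested-tuple tree (my own toy representation, and the play-tennis tree as reconstructed from the lecture): one rule per leaf, whose antecedents are the attribute tests on the path to it.

```python
def tree_to_rules(tree, path=()):
    """Convert a decision tree into the equivalent list of rules.
    Each rule is (antecedents, prediction): one rule per leaf,
    one antecedent per test on the path to that leaf."""
    if not isinstance(tree, tuple):              # reached a leaf
        return [(list(path), tree)]
    attribute, branches = tree
    rules = []
    for value, subtree in branches.items():
        rules += tree_to_rules(subtree, path + ((attribute, value),))
    return rules

tennis_tree = ('outlook', {
    'sunny':    ('humidity', {'high': 'no', 'normal': 'yes'}),
    'overcast': 'yes',
    'rain':     ('wind', {'strong': 'no', 'weak': 'yes'}),
})
rules = tree_to_rules(tennis_tree)
# Five leaves, so five rules, e.g. ([('outlook', 'overcast')], 'yes')
```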
It's probably going to be somewhat similar to the post-pruning that I had for decision trees, right? And again, there's more than one way to do this, but what's one way? Let's think about one rule in particular. A decision tree is one big, complicated, interrelated structure, but the rules, at some level, are all independent. So let's look at one rule. I have this rule here that has a whole bunch of antecedents, and probably some of them are noise. So how do I go about figuring out which ones to drop and which ones to keep? Remember, in the decision tree, what I did was I looked at each node and tried pruning it, and then I pruned the one that gave me the biggest improvement. So in the case of a rule, what should I do? Exactly: I look at dropping each one of the antecedents, and whichever one I get the most increase in held-out accuracy from, I drop that one, and now I have a rule that's shorter. And then what do I do next? I look at the remaining ones and drop the best one. And when do I stop? Same as with the tree: when accuracy does not go up. Exactly. When I try dropping every one of the antecedents in my rule, and every single one of them, if dropped, decreases my accuracy on the validation set, then all those antecedents seem to be doing something useful, so I'm done, okay? So that's step two. Are we done? This is actually where things suddenly get a little more interesting. So far this was a fairly straightforward analogy to decision tree pruning, given this idea that we're going to produce a set of rules. But in the case of the decision tree, at this point we were done. In the case of the set of rules, are we done at this point? Is this a rhetorical question? Yeah, this is a rhetorical question. So why are we not done at this point? The order of the rules.
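A sketch of pruning a single rule this way (toy code; the stopping rule, drop while held-out accuracy does not decrease, is one reasonable reading of the procedure described above):

```python
def rule_matches(antecedents, example):
    return all(example[a] == v for a, v in antecedents)

def rule_accuracy(antecedents, prediction, examples, labels):
    matched = [y for e, y in zip(examples, labels) if rule_matches(antecedents, e)]
    if not matched:
        return 0.0
    return sum(y == prediction for y in matched) / len(matched)

def prune_rule(antecedents, prediction, val_ex, val_y):
    """Greedily drop the antecedent whose removal most increases held-out
    accuracy; stop once every possible drop would decrease it."""
    antecedents = list(antecedents)
    best = rule_accuracy(antecedents, prediction, val_ex, val_y)
    while antecedents:
        acc, i = max((rule_accuracy(antecedents[:i] + antecedents[i + 1:],
                                    prediction, val_ex, val_y), i)
                     for i in range(len(antecedents)))
        if acc < best:        # every drop hurts: all antecedents earn their keep
            break
        best = acc
        antecedents = antecedents[:i] + antecedents[i + 1:]
    return antecedents

# True concept: yes iff x1 = 1 and x2 = 1; x3 is a noise antecedent.
val_ex = [{'x1': a, 'x2': b, 'x3': c}
          for a in (0, 1) for b in (0, 1) for c in (0, 1)]
val_y = ['yes' if e['x1'] and e['x2'] else 'no' for e in val_ex]
pruned = prune_rule([('x1', 1), ('x2', 1), ('x3', 1)], 'yes', val_ex, val_y)
```

Dropping the noise antecedent on x3 keeps accuracy perfect while widening coverage, so it goes; dropping either genuine antecedent would halve accuracy, so they stay.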
Why does the order of the rules matter? Which one should be evaluated first? I have a bunch of rules that say yes, spam, and I have a bunch of rules that say no, not spam. And now how do I decide? One more time? Performance? Performance, sure, but more specifically. Some of the rules might take longer to be evaluated? Sure, sometimes that's a consideration, but usually the main consideration is accuracy. Some rules are more accurate than others. So what's the natural way to order the rules? The most accurate one first, right? So I order my rules by accuracy, and then the first rule that fires wins. It's just an if-then statement with a lot of elses: if blah, blah, then spam; else if blah, blah... And some of these might be spam, some of them might be not spam. But as a heuristic, it's hard to disagree with the notion that I should put the most accurate rules first. There's, in fact, another thing that we can do there that often works even better (again, we're going to talk about some of that later), which is what? You could see which class the majority of the matching rules predict. Exactly: majority. We vote. Remember, learning algorithms like democracy. They like the wisdom of crowds. I have a bunch of rules; some of them say yes, some of them say no. Let them all vote, and the one with the most votes wins. Should all rules have the same weight, or should some rules have more weight than others? The simpler one should have heavier weight. Yes, certainly, if you believe in simplicity as a heuristic, that's something that you could use. But what's another thing that you should use for the weight? It's even more obvious, and in fact, you could combine the two. Remember, when we were ordering the rules, we put the more accurate ones first. In a way, that's like saying the first rule has more weight than all the others put together.
So what's something that I could weight my rules by when I'm voting? The accuracy. In fact, what people typically use is a combination of the accuracy and, not the simplicity, but the coverage of the rule. The coverage of a rule is how many examples in the training set it accounts for. What might the reason for that be? Given two rules with the same accuracy, why would I prefer the rule with more coverage? Let's say these two rules both have 100% accuracy. This rule here is 100% accurate on two examples, and this one here is 100% accurate on 2,000 examples. Which one would you prefer? The one on 2,000 examples, because it generalizes. It generalizes, right. If you got two examples right, for all I know, you just got lucky. If you got 2,000 examples right, it probably wasn't luck. Again, there are many combinations of these things that you could use, but those are two ingredients to use. I still don't get how we prune each rule independently. We have each path as a rule. How are we pruning it? We are removing an antecedent and changing it. I did not get that part. Very good, and that's another point that I was about to bring up. Let me actually backtrack for just a second, and then we'll get back to this. Why did we have to do this for the rules but not for the decision tree? In the decision tree, we didn't have to pick an ordering or vote among the leaves or any of that. Why was that? Because the rules that you get by directly translating the decision tree have a very, very unusual property. The set of rules that I got from this tree has a property that most rule sets don't have, which is (this is not immediate to see, so let me just say it) that only one rule ever fires. The property that the decision tree has is that the paths to the leaves are all mutually exclusive.
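A sketch of the voting scheme (my own minimal version, not C4.5's actual formula): each matching rule's vote is weighted by its accuracy times its coverage, both measured on the training set.

```python
def classify_by_vote(rules, example, train_ex, train_y, default='no'):
    """Every rule that matches the example votes for its prediction,
    weighted by (accuracy on matched training examples) * (coverage)."""
    votes = {}
    for antecedents, prediction in rules:
        if not all(example[a] == v for a, v in antecedents):
            continue                             # rule doesn't fire
        matched = [y for e, y in zip(train_ex, train_y)
                   if all(e[a] == v for a, v in antecedents)]
        if not matched:
            continue
        acc = sum(y == prediction for y in matched) / len(matched)
        votes[prediction] = votes.get(prediction, 0.0) + acc * len(matched)
    return max(votes, key=votes.get) if votes else default

rules = [([('x1', 1)], 'yes'), ([('x2', 1)], 'no')]
train_ex = [{'x1': 1, 'x2': 0}] * 3 + [{'x1': 0, 'x2': 1}] * 2 + [{'x1': 1, 'x2': 1}]
train_y = ['yes'] * 3 + ['no'] * 2 + ['yes']
# Both rules fire on this example; 'yes' wins on accuracy * coverage (4.0 vs 2.0).
verdict = classify_by_vote(rules, {'x1': 1, 'x2': 1}, train_ex, train_y)
```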
Assuming full data, that is, no missing values, you can only go down one path, which means that if this rule matches, then that one for sure cannot. And when I prune the decision tree, all I'm doing is shrinking it towards the root. What is the big difference between that and what happens when I'm pruning to rules? This is a subtle point, but it's actually crucial. If I was pruning the decision tree from the leaves to the root, I would consider pruning humidity, and I would consider pruning wind. And if I had actually pruned those, then I might consider pruning outlook and end up having no tree at all. That would be bad, but it could happen. But here's something that would never happen: I would never prune outlook without pruning humidity and wind. You agree? There's just no way to do that, because outlook is what actually splits me into these disjoint sets, and then I get more splits, and so on and so forth. So decision trees have the advantage that applying them to a test example is actually faster: it takes essentially logarithmic time to just fall down to a leaf, whereas with rules, you have to keep matching them until one fires. So decision trees are faster to apply. However, pruning to rules tends to give a more accurate model than pruning the decision tree. Why? Maybe you discover that when the humidity is normal, the outlook doesn't matter. You get rid of noise. Yeah, keep going. Anything else? You also prune individual antecedents instead of whole subtrees. Very good. So let's try to synthesize this. Notice what happens when I'm pruning to rules; let's think about a particular one of these paths. When I'm pruning a decision tree, what I can do is either prune humidity or, by the end of the process, have pruned both humidity and outlook.
So all I can do is shrink the path from the end to the beginning. What do you mean by pruning humidity? Is it like removing that node? Yeah, I'm removing this, and then once outlook is sunny, I just predict, let's say, yes. Okay. So I get rid of that. So when I'm pruning the decision tree, I'm very constrained in the things that I can prune. When I'm pruning to a rule set, given one rule, I can prune any antecedent that I want. In particular, take this rule: if outlook is sunny and humidity is high, then play tennis is no. If I'm pruning the decision tree, I can prune humidity, or I can prune both, but I cannot prune outlook by itself. When I'm pruning the rule, sure, I can prune just outlook. So I can prune more things in the case of the rule than in the case of the decision tree. And now, why might this be a good idea? Notice the following: outlook is the thing that we're stuck with here. Outlook is tested in every single rule. Boy, that outlook attribute has to do a lot of work. Suppose that outlook is good for some rules, but not others. In the case of the decision tree, well, I'm stuck with it; I have to take the good with the bad. In the case of the rule set, no: I can prune it from some rules but not others. So you tend to get a better model when you prune to a rule set. A more accurate model. In addition, and sometimes this is actually the biggest reason, you also tend to get a much simpler model. You can do this experiment yourself: run C4.5, do the decision tree, and do the post-pruning to rules, and then decide whether you'd rather be looking at the decision tree or the rules.
The set of rules, if you give a cost of one to each antecedent in each rule, and a cost of one to each node in the decision tree, is going to tend to be much, much smaller than the decision tree. Why is that? Remember, in the decision tree, this lower part of the tree, I'm always going to have it there. With the rules, I can actually get these short rules, each of which uses the two or three attributes that are really good. In the decision tree, I'm stuck with using outlook for everybody, and then, when outlook is, say, sunny, I'm stuck using the entire next subtree for everybody. Okay? So by pruning to rules, I can get, A, more accuracy, and B, a much smaller and easier-to-understand model. So it's no wonder that people often prefer to have a rule set. It does have some disadvantages. It'll typically be more expensive to learn, and when you're on a large data set, that is an important consideration. It'll be more expensive to apply, and in some domains, applying things very quickly matters a lot; in others, it doesn't. And decision tree pruning is fairly straightforward, whereas pruning to rule sets is a little more involved; there are more things that can go wrong. Okay? So there are pros and cons, but it's definitely worth knowing about. So once you get to rules and prune, you can't go back very simply, right? Yeah, that is a good question. In fact, we're going to talk about this question next week, so I propose that you think about it, and then you can tell us what you found. It's definitely a very interesting question. Think about this: if I have a set of rules, can I convert it back to a decision tree? Yeah. I think, since the rule post-pruning process is more complicated, you have more choices to make.
Is there a risk of overfitting the validation data? Exactly. Very good. Yeah, this is machine learning thinking, right? Sure, it's more powerful, I can learn more things, but you should also be thinking: well, now I have to make many more choices. So if I have a large rule set and I do a lot of search, I should be more careful. If you're not careful, you could do rule pruning and try so many things that you're now overfitting your validation data. By the way, here's a quick question. When you're splitting your training data into training and validation sets, whether you're going to use decision trees or rules, what relative sizes should they be? 50-50? More for training? More for validation? What is your feeling? 90-10 training. Sorry, you said more for training? 90-10. Okay, so 90 for training and 10 for validation is what you're saying? Yes. Okay. Anybody disagree? I disagree: less for training and more for validation. Less for training and more for validation. Why? Because that will reduce the complexity and give us more of a chance to prune. So you're arguing for less pruning there or more pruning there? I'm arguing for more validation data and less training data. Okay, because? Because that way our complexity is less, we get a chance to prune, and we have more data to validate with. Do you agree? Yeah. Okay, so what's the argument for more training data? You get a better model to start with. All of these arguments have some merit. Two to one right now. Who else wants to weigh in? Who feels lucky today? 50-50. Very excellent; now we have all the options on the table. 50-50: what's the argument for 50-50? That's a totally reasonable thing to suggest, right? What's the argument for 50-50? Both arguments have merit. But do they have equal merit?
And more importantly, are there bigger arguments that have not been made yet? You know, what you just stated is the famous principle of indifference, formulated by Monsieur Laplace back in the 19th century: when you don't know, just go 50-50. Right, you're indifferent. So if you have to decide in a short time, that's a reasonable thing. But now let us think about what might be better. Anybody else want to weigh in? Well, I mean, it seems like there's also a question of how much data you have. If you only have a small set of data to work with, then maybe you want to go 90% for training. But if you've got a huge set of data to work with, maybe you only need a small fraction for training. Another very good point, so let me synthesize that as follows. The trade-off is not necessarily independent of the data set size. In fact, it might change as the data set changes. If I have a ton of data, what should this split be? I don't care; I have a ton of data either way. As long as it's not epsilon versus one minus epsilon, with a lot of data it doesn't really matter. So this is really a crucial question when I have very little data. This may be way out on a limb, but could you start with a small amount of training data and a lot of validation data and see if the model works well, and if it doesn't, then gradually increase the amount of training data until you find a model that fits well? You could, but that's a little dangerous, because you're starting to use the validation data as more training data. The thing about validation data is that you have to use it sparingly. A few uses might be fine, but it's a slippery slope. Any other thoughts? It seems to me, like, selecting the training data set: if you're selecting a small piece, depending on which piece you select, that could really change your whole decision tree. So how could you compare...
So you vote for more training? I vote for more validation, actually. Oh, okay. So, three to one. Anybody else want to say something? But your argument is to decide later... I'm enjoying this. The other argument is to decide later. So you first train and then decide later. You should go into politics; I see you rising high up the ranks of management. But we're just engineers here, so let's try not to go that route. I mean, you can get good accuracy with a small training set, and maybe that's what you're saying about deciding later. And maybe that's good enough. Actually, let me just put a thought into the air. Let me ask you this question. Which one is harder: learning the initial tree from data, or pruning a tree down to the best subtree? Which one do you think is harder? Pruning is harder. Is it? Because we have so many paths. Let me actually jump ahead. If pruning is harder, then we should use more data for pruning, because we want to use more data for the harder thing. Does everybody agree? Whichever is harder is what we should reserve most of the data for, because that's where it's needed. But which one is harder? Is it pruning or learning? How do we decide this? Let's think about it this way. This is a search problem; it's an optimization problem. I have a search space of decision trees. What is the search space for learning, and what is the search space for pruning? What is the search space for learning? It's every decision tree in the universe. Which space is larger: the space of decision trees when you're learning, or the space of decision trees when you're pruning? When you're learning. And by a lot; it's way larger. Anyone care to revise their votes? Is that your final answer? This is the kind of thinking that you have to do when you apply machine learning. Once these things become a pattern, one day they just become part of a class that you get to take; for now, you have to acquire this experience yourself.
People started out doing splits of various kinds, and indeed, let me jump right to the punchline: what do people usually do? They always hold out a sizable percentage of the data for validation. The percentage varies, and one of the ways in which it varies is with how much data you have. If you don't have a lot of data, then you're really stuck: you're in the early part of the learning curve, so you lose a lot by reducing your training set; on the other hand, your validation estimate is going to be very noisy too. As you get more and more data, things become easy in both directions. People typically use something like two-thirds, one-third. This is empirical; there's also a lot of theory about this, but empirical splits are what people tend to use. Two-thirds, one-third, or three-quarters, one-quarter, or 90-10. 90 for what? 90 for training and 10 for pruning. Now, when you're really, really short of data, there's something you can do called cross-validation. Has anybody heard of cross-validation? It's where you pick one set of data for training and one set for validation, and then you pick a different set for training and a different set for validation. Yes, roughly speaking; it's a very clever idea. Let me say it a little more precisely. It's a very clever idea where you actually get to use all the data for both training and testing. Isn't that amazing? Here's what you do. Let's say you do 10-fold cross-validation; this is probably the most popular. You divide your data randomly into 10 pieces. Then you train on 9 of them and test on the 10th. You see where this is going next? Then you pick another piece to hold out and train on the other 9. So they all rotate through: you form 10 different training sets, each of which contains nine tenths of the data, and each time you test on the remainder. So at the end of the day, basically every data point got used for both training and testing.
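The 10-fold procedure just described can be sketched in a few lines; the fold-slicing scheme below is one simple choice among many.

```python
import random

def k_fold_splits(data, k=10, seed=0):
    """Shuffle the data and cut it into k folds; yield (train, test)
    pairs in which each fold is held out exactly once, so every example
    gets used for training k-1 times and for testing once."""
    data = list(data)
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

# Demo: 100 examples, 10 folds of 10.
splits = list(k_fold_splits(list(range(100)), k=10))
```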
So you average over those ten runs, each of which had a 90-10 split. This is good: everybody got to be training data nine times and test data one time. The disadvantage of this is what? There are two. You're doing the work 10 times over. Yes, it's expensive; you just increased your running time by an order of magnitude. Ouch. The other problem is that, at the end of the day, what is the decision tree that you produce? You just learned ten of them. What people typically use cross-validation for is things like picking the value of that complexity penalty. How much should I weigh the complexity versus the accuracy? Let's use cross-validation for that: let's see which coefficient gives me the optimal trade-off. I can find that using cross-validation, and then once I've found that coefficient, I run on all the data. This is a very common machine learning strategy. Questions about any of this? There's one more important thing to touch on before we're done today. We're covering, as you've probably noticed, a lot of very important stuff today. This is a very high-utility class. The final thing that I'll touch on is the problem of scaling up. We're in the days of big data. How about running any of the algorithms that we've seen so far on big data? Here's what's going to happen. You start it running, and then you die; and then one day, hopefully before the entropy death of the universe, the answer comes back. We need to worry about scaling up our algorithms. People sometimes say that data mining is machine learning scaled up. I think that's a bit of an understatement, but certainly, in practice, a lot of what people like me did in the early days of data mining was take machine learning algorithms that were cubic or quartic or whatever and try to make them linear or log-linear, or maybe even better; try to make them scale to large data sets. This was not easy.
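The recipe of "use cross-validation to pick the complexity-penalty coefficient, then retrain on all the data" can be sketched as follows. Here `learn_with_penalty` is a placeholder for whatever learner you are tuning; the majority-class learner at the bottom is only a toy stand-in so the sketch runs end to end.

```python
import random

def cross_val_score(learn, data, k=5, seed=0):
    """Average held-out accuracy over k folds. `learn(train)` must
    return a model: a function from example to predicted label."""
    data = list(data)
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]
    total = 0.0
    for i in range(k):
        test = folds[i]
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        model = learn(train)
        total += sum(model(x) == y for x, y in test) / len(test)
    return total / k

def pick_penalty(learn_with_penalty, data, candidates):
    """Cross-validate each candidate complexity penalty, then retrain
    on ALL the data with the winning value."""
    best = max(candidates,
               key=lambda lam: cross_val_score(
                   lambda train: learn_with_penalty(train, lam), data))
    return best, learn_with_penalty(data, best)

# Toy stand-in learner: ignores the penalty, predicts the majority label.
def majority_learner(train, penalty):
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return lambda example: majority

toy_data = [(i, 1) for i in range(8)] + [(100 + i, 0) for i in range(2)]
best_penalty, final_model = pick_penalty(majority_learner, toy_data,
                                         [0.01, 0.1, 1.0])
```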
There's been a progression of different algorithms and whatnot for very, very large data sets. What I would like to do is just, in five minutes, briefly touch on what some of these things are, so that you're aware of them. First of all, what is the biggest reason why the algorithms we've seen will not scale to large data? The biggest problem with algorithms like ID3, which does pre-pruning, and C4.5, which does post-pruning, is that they all assume that your data fits in main memory. They do random access to the data: I put my examples in main memory, and then I go collect statistics of different attributes in random order, depending on how the tree is growing. Right? But big data doesn't fit in main memory. Now, main memory keeps growing, but big data grows faster. And you know, Moore's law is great, but the Moore's law of disks actually far outpaces the Moore's law of computing. A lot of what drives big data is exactly this: our ability to store data greatly outpaces our ability to process it. If the processing exponential moved faster than the corresponding law for disks, well, big data we would just shove into the CPU and run. Right? The problem we have is companies complaining, oh, I have terabytes of data, but what do I do with it? If we mine big data, we're going to be mining it from disk. And number one, reading data from disk is, roughly speaking, a thousand times slower than reading it from main memory. So already there we have a problem, but we could just suffer the factor of a thousand. The other big thing, of course, is what? What is the big difference between disk and main memory? Disk access needs to be sequential. Right? So if you take an algorithm that does random access and try to run it on disk, it's never going to finish unless the data set is very small.
I have a data set of a billion points, which is not unusual these days, or even a hundred million, and if I'm randomly accessing it, the disk is spinning, the head is seeking its place, and this never finishes. So the number one thing people worried about when scaling up became a big issue, which was circa, I don't know, 1995 let's say, was algorithms that only do sequential scans of the data. I want an algorithm that reads the disk once and maybe picks the root, and then maybe reads the disk again and finds the entire next level all at once, and then keeps on doing this. This is a natural idea, but making it work and actually gaining efficiency is surprisingly tricky. There was a series of algorithms that did this, which you should at least know exist. IBM actually had a data mining package at one point; they were one of the first companies really out of the gate with this, and this was basically one of the main algorithms they used there. Now, this works up to millions of examples. The reason it doesn't work beyond that is that even just reading a million examples from disk takes a while, particularly if they have a lot of attributes. If you're doing multiple scans of the data, this thing may take days or even weeks just to learn one model. But of course, in the real learning process you don't learn one model: you're playing with the data, learning a model in 20 different ways, doing cross-validation here, tweaking a parameter there, changing the data, and if your cycle for this is a week, or even days, you're stuck. That's what's going to require what's called working in data stream mode. Data stream mode, and this gets us back to that distinction between batch and online learning. Data stream mode means I only look at each data point once. I see it and it's gone; I see the next one and it's gone. I can choose not to use it, but if I want to use it, I have to use it right then, and I never see it again.
This could be because I can't afford the disk storage, or because I don't want to scan the disk more than one time, because it would take too long. You said you don't store it anywhere; you just see it once? That is one option: the data is never stored. For a lot of very high-volume applications, like, say, network monitoring if you're an ISP, your routers are spewing out data like crazy; you don't even want to store it. In some cases, however, you can afford the storage, because storage is getting cheaper and cheaper, but just scanning through it takes too long. So the third generation of algorithms are algorithms that work in stream mode. They do at most one scan of the data; ideally, they don't even scan all the data. There's a number of different algorithms for that, but the best one that I know of is something called VFDT, the very fast decision tree learner, which was actually invented here at UW. You should be suspicious, because it's me telling you about my own work, so you should probably get a second opinion on this one; you can go to Google Scholar and whatnot. Nevertheless, it's good to know that such algorithms exist. Without getting into all the details of how this works, we have algorithms now that can actually grow a decision tree in one scan of the data that is basically as good as the decision tree that you would learn with random access. In fact, the best part is that you might need much less than one scan of the data: you can actually be scanning the data and decide at some point, hey, I've already grown as much as I'm going to grow, I don't need to see any more data, I'm done, which is a very nice thing to have. Again, we're not going to go into the details of these algorithms here, but you want to be aware that they exist, because if you find yourself mining a very large data set, or even a large data stream, these are probably the kinds of things that you want to look at.
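For a flavor of how a one-pass learner like VFDT decides that it has seen enough examples to commit to a split, here is a sketch of the Hoeffding-bound test at its core. The function names and the default delta here are illustrative, not taken from the paper; get the real details from the VFDT literature.

```python
import math

def hoeffding_bound(value_range, delta, n):
    """Epsilon such that, with probability 1 - delta, the true mean of a
    variable with the given range lies within epsilon of the mean
    observed over n independent samples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def can_split(best_gain, second_gain, n, n_classes=2, delta=1e-7):
    """VFDT-style decision: split once the observed gap between the best
    and second-best attribute's gain exceeds the Hoeffding bound, so the
    winner is very likely the true winner."""
    gain_range = math.log2(n_classes)  # information gain lies in [0, log2(c)]
    return best_gain - second_gain > hoeffding_bound(gain_range, delta, n)
```

The bound shrinks as the square root of the number of examples, which is why, with a small gap between the top two attributes, the learner simply waits for more of the stream before splitting.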
You want the basic algorithms first, and that's what we spent our time on here, but you don't want to have the experience, like people I know have had, of starting with ID3 and spending six months basically rediscovering how to learn from disk. Questions? All right, very good. So now your fun begins. Now you're going to do your project, where you implement a decision tree learner and apply it to clickstream data. Start it early, because it might not finish running before the due date. This is not like some of your other projects here. I've been warned: do it this weekend and let it run until the due date runs out. Well, if that needs to happen, then you're doing something wrong; we carefully selected subsets of the data that can be mined in a reasonable amount of time. Nevertheless, we hope you have a lot of fun. All right, see you next week. All right, let's go then. Welcome to week three. Are you having fun learning decision trees and mining clickstreams? Yes, I see from your smiles that you are. So maybe the next project needs to be harder, so you know what data mining is really like. In the meantime, today we're going to talk about rule induction, another major area of machine learning. In fact, we already talked a little bit about rule induction last week, because the last thing that we saw was how to turn decision trees into sets of rules, and how nice sets of rules are, because they can be very compact and easy to understand. But here's a question that may have occurred to some of you. If what we're going to learn at the end of the day is a set of rules, why are we doing it by way of learning a decision tree? Couldn't we just learn the rules directly? Wouldn't that be a better way? Well, the reason we cover both is that they have their pros and cons. But let's look today at how we can actually learn rules directly. And in fact, we're not going to stop at that.
We're also going to see how we can learn rules that are much more powerful than the decision trees we saw last time. When we started talking about machine learning, I said that machine learning is learning programs from data. But the programs that we've seen so far are very limited, right? The decision tree is just a set of basically nested if-then-else statements. Today we're actually going to see how you can learn full-blown programs. So last week we got down to basics, but this week we're really going to climb into the stratosphere. So fasten your seat belts, okay? All right. So, as usual, let us first think about what rules are and what kind of learning algorithms we use for learning rules. First of all, rules are very popular, and for a reason that's easy to understand: rules are probably the most easily comprehensible thing that you could mine. They're even simpler than a decision tree. It's just a bunch of statements that you can actually read in English, and they make sense, right? In fact, I don't know if you've had the experience of having, like, a credit application rejected, and then there's a piece of text that says your application was rejected because you haven't lived long enough at your address and with your employer, blah, blah, right? The thing that you should hate there is the rule that made that prediction, right? It should know better. But nevertheless, it's actually something you can put down in writing that people can just read without any expertise, okay? So this is one of the main advantages of rule sets: they're very understandable and often very compact. Okay? Like decision trees, rule sets are a variable-size representation. If I have more data, I can learn more rules, and they can be longer; they can be more detailed; they can have more preconditions. They're also deterministic, and they can have both discrete and continuous parameters.
The discrete parameters are the choices of tests that you put in the rule, and the continuous parameters can be things like the thresholds for tests on continuous attributes, or things like the probability of the class given the attributes, right? If this email has these words in it, then it is spam with some probability, for example, okay? Learning algorithms for rule sets, like those for decision trees, also do constructive search. How do we learn a set of rules? Well, the natural way to do it is to learn one rule at a time, right? And the natural way to learn one rule is to learn one antecedent at a time, and that's what we're going to do: we're going to construct the rule set step by step. Most rule induction algorithms are eager and batch, meaning that we have all the data at training time, and we do as much as we can with that data at training time; we learn a model that then, hopefully, is quick to apply, okay? Again, there are exceptions to this: there are online learning algorithms, there are lazy algorithms, but for the most part, rule induction is an eager and batch process, okay? Any questions so far? All right, so first of all, what is a set of rules? What is the hypothesis space that we're going to be working with in this class? Well, a rule is just what you probably already think it is. It's a conjunction of tests, each of the form: this variable is equal to this value, if the variable is discrete, like, for example, the color of the car is red; or this variable is less than or equal to some value, or greater than or equal to some value, if you're talking about a numeric variable, like, you know, the temperature today is above 60 degrees, for example, okay? These X's and V's are the attributes and values that appear in the training set, okay? And it's out of those that we're going to build up the rules, okay?
And then what a rule says is things like, for example: if the variable X1 has the value sunny, and the variable X2 has a value less than 75, then the class is yes, meaning we are going to play tennis, okay? So this is one rule. A set of rules is just a disjunction of rules, okay? So let us assume for the moment that we're just going to learn a concept, right? We're just trying to learn something binary: is this a car that I want to buy or is it not? Does this patient have pneumonia or does she not? A little bit later we'll talk about learning with multiple classes. But when we're learning rules for one class, the simplest thing to do is to have the model for the whole class just be the disjunction of the individual rules, right? So the email is spam if it meets these preconditions or it meets those, okay? So at the end, what you have is a DNF, a disjunctive normal form, defining the concept. So, for example, we could have these three rules. If X1 is sunny and X2 is less than 75, then yes. If X1 is overcast, then yes. If X1 is rain and X3 is less than 20, then yes, okay? Notice there are no rules here saying no. The reason is that the answer is no if none of the rules match, okay? So I have only rules for the positive class, and if none of the rules fire, then by default I just assume that the class is no. Okay? There are other ways to do things, but this is the simplest one, and it's the one that we're going to focus on for the moment, okay? So, question number one: how do rules relate to decision trees? We touched on this last time. What is the relationship between a decision tree and a set of rules? Does anybody remember? A different representation of the same answer. Exactly. You can build rules from a decision tree. So I can take a decision tree and convert it to a set of rules. How do I do that? Every leaf is a rule.
Every leaf is a rule, and the antecedents of that rule are the conditions on the path to that leaf. So, in particular, the rules that we just saw are what you get from this decision tree, which is the one that we've been playing with all along, or, you know, something close to it. So going from decision trees to rules is easy. But now the next question is: what about going from rules to decision trees? Actually, let me hide this slide for just a moment, because temptation can be hard to resist, right? First of all, can we always translate a decision tree into an equivalent set of rules? Is the inverse also true? Can I always translate a set of rules, let's say in a Boolean domain, where the variables are Boolean and the class is Boolean, into an equivalent decision tree? Yes or no? Yes? Anybody want to say no? No. Okay, so we have yeses and noes. Let's hear the case for yes. If it represents a Boolean function, then there must be a way to represent it as a decision tree. Right, remember we saw, all those weeks ago, that you can represent any Boolean function with a decision tree. So what do the noes have to say? Somebody remotely was saying no. So what if you can't really find something to be the root of the tree? Like, you have multiple rules. I don't know how to describe it too well, but I think the idea is, like, for instance, we have Outlook as the root in the decision tree; if you created a set of rules, then you could actually minimize it to the point where something like Outlook is no longer a root. Very good, actually. So this is a real problem. So what you're saying is that you can't see how you would do it, right? You want to say something? Yeah, and also, I remember we talked about pruning, and when you prune, you could actually prune a rule in a way that does not correspond to removing a node.
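The leaf-to-rule direction is easy enough to make concrete. Here is a small sketch, using a made-up nested-dictionary encoding of the weather tree we have been playing with: each internal node is a dict from one attribute to its branches, and each leaf is a class label.

```python
def tree_to_rules(tree, path=()):
    """Tree is either a leaf label (str) or a dict
    {attribute: {value: subtree, ...}} with one attribute per node.
    Returns one (antecedents, label) rule per leaf, where the
    antecedents are the (attribute, value) tests on the path."""
    if isinstance(tree, str):              # leaf: the path so far is a rule
        return [(list(path), tree)]
    rules = []
    for attr, branches in tree.items():
        for value, subtree in branches.items():
            rules += tree_to_rules(subtree, path + ((attr, value),))
    return rules

# The weather tree from the lecture, in this illustrative encoding:
weather = {"outlook": {
    "sunny": {"humidity": {"high": "no", "normal": "yes"}},
    "overcast": "yes",
    "rain": {"wind": {"strong": "no", "weak": "yes"}},
}}
extracted = tree_to_rules(weather)
```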
No, I agree, but notice, for the moment we're not talking about learning algorithms, right? We're just talking about the representation. So my point is, you could represent rules like that which you cannot represent directly as that decision tree. Yeah, so at heart these problems are the same, right? So let's think about this. At some level, we know that it must be possible; what's not obvious is how. We can figure out how, but this is going to have consequences, right? So the natural question is: which attribute is going to be the root of the tree? Does one of the yes votes have a suggestion? You said it's possible, right? In theory, there's this notion of a constructive and a non-constructive proof, right? So you could have a non-constructive proof, but can we have a constructive one? Yeah. I mean, assuming these are all, like, disjoint variables, that you have A, B, and C to pivot on, you can pick a random one as the root, say A. At the two nodes on the next level, you pivot on B in both subtrees, and one of those is going to be redundant, depending on the rule. And then at the third level, you have, like, a total of four C's. They're all redundant, but you're covering every rule that you were worried about making a root. Yeah, very good. In fact, this general idea works even if the rules do have overlapping attributes. The overlapping attributes kind of look like they complicate things, but all that happens is that I've already tested that value before, so I don't need to test it again. So let's pursue this idea, and here's a concrete example. Here I have three rules, very simple ones. If X1 and X2, then Y. If X3 and X4, then Y. And if X5 and X6, then Y, right? Now the question is: how do I build a decision tree out of this? Well, let's just pick one of them; actually, pick any one. Clearly the decision tree is going to depend on X1, X2, X3, X4, X5 and X6, right?
It's going to depend on all of those. Which one of these shall we put at the root? It's not irrelevant which one you put at the root, right? Because you could get a smaller or larger decision tree. But for the moment, let's just worry about doing this any way that we can, okay? So let's just say that we pick the first antecedent in the first rule; why not? So the first thing that we test is X1. And notice that if X1 is true, we're going this way, and then obviously we need to test whether X2 is true, because that's the other antecedent in that rule. And if X2 is also true, then what do we know? Then it's a yes, right? Because that rule says it's a yes, and that's that leaf over here. On the other hand, let's go back to the root. This is the interesting part. What happens if X1 is false? If X1 is false, then this rule doesn't fire. So then where are we? What do we need to do now? Does that mean that the example is false too? No, right? Because it could still be true because of one of the other rules. Everybody agree? The first rule failed to say yes, but there's still the others, and those might say yes, okay? So now what do we need to do? As always with decision trees, this is a recursive process. The first rule has failed, but now we need to represent the other rules on both sides of the tree. Exactly. So we need to test the second rule and then the third rule, and two people have already alluded to what the problem is going to be. So let's see what the problem is. When X1 is false, I now need to test the second rule, which means I test X3 and then X4. And if X3 and X4 are both true, then I get the answer true. And, to keep going, if this rule is also false, then I have to test the last one, which means first I test X5 and then X6, and if they're both true, then the result is true, okay?
Everybody with me so far? Now, we've captured all the yes cases, right? But we're not out of the woods yet. In fact, woods is actually an appropriate metaphor here: we're in a big jungle. Why are we in a big jungle? Because notice, I don't just have to test for the second rule over here; I also have to test for it over here, right? Because over here, this first rule has also failed, right? So in that case, again, I'm going to have to test the second rule. So what do we get here? We get a repetition of the same subtree that we had over there. Notice that the subtree rooted at this node and the subtree rooted at this node are the same, right? You see what's happening here? I have to test that subtree in every case where the first rule fails. If there were 10 antecedents in the first rule, I'd have to do this 10 times. And then what happens when the second rule fails? In every one of those cases, I'm going to have to test the third rule. And if I had 100 rules, this would be really bad news, right? Because this means that the size of the tree is going to grow how with the size of the rule set? That's right: worse than quadratic; exponentially. In the worst case, I need to look at all combinations of truth values of the rules, and I'm going to have a subtree for true-false-false, et cetera, et cetera, yeah. You could share pointers, though, right, across the tree? No, very good point, you cannot. Although you could learn a different model, called a decision graph. In a decision graph, you can have pointers to the same thing. Okay? So with a decision graph, we would not have this problem, or at least most of this problem would be solved. It's not necessarily the case that the decision graph would be as compact as the set of rules, but yes, the decision graph would avoid this problem. So is that just because of the definition of a tree? Yeah.
The problem with the decision graph is that it's more complicated to learn a decision graph than a decision tree. So even though they're a very natural idea, for the most part people never use decision graphs; they were tried at one point, but people use decision trees. Decision trees, though, do have this huge problem. It's called the replication problem, because you get more and more replications of the same things as you go down. Notice that, for example, by the time you get to the end of this, you've tested X6 here, and here, and here, and here. There were four tests of X6. And if this rule came at the end of a whole bunch of other rules, instead of four tests there would be eight, or 16, or two to the who-knows-what. And notice, there's nothing particularly pathological about this set of rules; it's a perfectly simple and natural set of rules. So decision trees have this very sticky problem, at least compared to rules, that the same concept can be exponentially larger expressed as a decision tree than expressed as a set of rules. Not always, but it can happen. So this is one significant advantage of rules over trees. And this is not just a representational problem, by the way, because the other thing that we saw last time was that as I grow the decision tree, I get less and less data, right? Notice that once I split on X1, I only have half the data points, roughly speaking, on each side. So when I finally get to inducing these subtrees, they're being induced on a fraction of the data that I had before. The data dwindles for learning decision trees in a way that it doesn't dwindle for learning rules, which means that we might not be able to learn something as a decision tree when we would be able to learn it as a set of rules. Okay, any questions? Okay, but if rules are so great and trees are so bad, then why do we bother with learning decision trees?
Why are decision trees the most popular data mining method, and not rules? What are the potential disadvantages or complications of learning rules? Sorry? At runtime, yes. At runtime, decision trees are blindingly fast; decision trees are just about the fastest thing you could have at runtime. Right? You just go down one branch; it's all there for you. With rules, you have to match all the rules, and if you have a lot of rules, that could take a long time. That could be too slow. What's another potential complication? We also saw this last time, when we looked at rule post-pruning. With decision trees, you always know who's the winner, because there's only one: you only go down one branch. With rules, well, sure, I have all these rules, and they all fire, so what do I do? In the case that I have here, it doesn't happen, because I only have rules for the positive class. But, jumping ahead, if I had rules for multiple classes, then I have a non-obvious decision to make. Okay? So these are two problems, but there's another one which is more subtle, but perhaps more important. Can anyone guess what it might be? Just use your gut feeling. By the way, here's a good rule of thumb: if somebody ever asks you what the problem with some machine learning algorithm is and you have no clue, just say overfitting. You will be right more than half the time. Okay. Now that you've noticed that, let's go back to our question. What might be the other problem with rule induction relative to decision tree induction? Yeah, exactly. Why would rule induction be more prone to overfitting than decision tree induction? This is not an obvious question, but think about it for a moment. Why would we have more chances to overfit when learning rules than when learning decision trees, and therefore have to be correspondingly more careful? Yeah. I think you're always looking at all the attributes.
That's one thing. Yeah, so there is one aspect, which is that, remember, when we're learning a decision tree, we're computing information gain over all values of an attribute. On one hand, this is unfortunate because it ties them together, but on the other hand, it's fortunate because it actually avoids making highly overfit decisions, at least better than with rules, right? The upshot is that with rules, you run a very high risk of overfitting very badly, right? And, you know, the other way to think about this is that rules are a more flexible representation, right? And therefore, since it's more flexible, on the one hand, that's good because you can learn more things, but on the other hand, it's bad because you run a bigger danger of overfitting, of finding some set of rules that look good just by chance. And this is a trade-off that is always going to be present in machine learning. And so you might as well get used to it, you know, right at the outset, okay? Questions? Very good. So, let's look... Now we know the representation, right? In many ways, it's similar to decision trees, so this is not a big leap from decision trees. Now let's look at how we learn rules, right? And we don't have to do it from scratch. We already know how to learn decision trees. Naturally, learning rules should be able to use a lot of the same ideas, right? So can anybody suggest how we might learn a set of rules by analogy with decision trees or just, you know, using whatever comes to your mind? How might we do that? And remember, we said it's a constructive algorithm, right, where we build one rule at a time and so on and so forth. So how might we go about it? Just learn a decision tree and convert it to rules. Yeah, I couldn't agree with you more, right? So what's method number two? That is a perfectly good method, right? That's what's in C4.5, right? You can do that. 
You know, when Quinlan did his data mining company and became a multimillionaire, actually, I don't know if he became a multimillionaire, but that's what he had, right? He was learning rules via decision trees, right? But let's suppose that we want to, you know, learn rules directly, right? How might we do that? Just start picking examples and make a very narrow rule, and then as you find more examples, broaden the rules, if it makes sense. This is actually one option. We're not going to talk about it today, but this is often called the bottom-up or specific-to-general approach to learning rules. I actually did my PhD thesis on that, so it must be good. But anyway, right? Setting aside that method, which is a good one, but, you know, it's not the most widely used one, what might we do? I guess you could start with a rule that uses the least number of attributes and covers most examples. Yeah, exactly. So this is the analog of initially, you know, having a tree that is empty and picking the root. What we're going to do is start with a rule that just has the consequent and has no antecedents, meaning it matches everything. Everybody's a good credit risk. You give credit to everybody. And then you figure out, well, let me try each attribute in turn and add that as a condition on the rule. Right? Very natural thing to do. Let me test that and see if it gives me a more accurate rule. There's a good chance that it will. So I try adding each possible antecedent and I pick the best one. And then what do I do? Rinse and repeat. Right? I now have a rule with this antecedent and the consequent. So I add each of the other possible antecedents that I have in addition to that one and I find the best rule with two antecedents. And I keep going until when? Until when? Sorry? Until you don't see an improvement anymore? Yeah, until you get no improvement. Very good. And what is the most natural case in which I get no improvement? 
In which, in fact, in the case of learning a rule, I know I can always get to that point. Right? Remember, in the basic decision tree algorithm, when did we stop, you know, growing the decision tree? Correct, when everything in my training set was of the same class, so I don't need any more conditions. Right? Same thing here, right? Initially I have a big mixture of positive and negative examples, right? I start adding antecedents, and notice what happens is that each antecedent I add hopefully throws out a lot of negative examples and very few positive ones, right? So as I grow the rule, I have fewer and fewer negative examples until hopefully I come to a point where I have no negative examples, right? And my rule only covers positive examples, at which point it makes no sense to try to refine it more, and that's where I stop, okay? So that is our algorithm for growing one rule. Very simple idea. Okay? Yeah. So what happens if you have two examples with exact same values but different classes? Very good. That can happen. That is actually the only failure case. Right? Notice that as long as that doesn't happen, right, if the examples have different values, I can guarantee that I'll always be able to reach a pure rule, right? Because in the worst case, I just add every antecedent corresponding to every value in that example and then I've got it. If I have, you know, say three examples like that and two are the positive class and one is negative, well, the positive wins, you know, with probability two-thirds. If I have one of each, which can happen, right? I have two patients with exactly the same symptoms and tests. One has tuberculosis, one doesn't. Well, you can flip a coin or maybe try something else. We can't tell. But in rules, you can't represent that probability, right? You just have to choose one or the other. Well, like I said, you could, right? A basic rule doesn't have that, but I can add it, and this is often done, right? 
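To make the grow-one-rule procedure concrete, here is a minimal Python sketch of it. The representation is my own assumption, not the lecture's: each example is a pair of an attribute dictionary and a True/False label, a rule is a dictionary of antecedents, and plain accuracy is used as the quality measure.

```python
def covers(rule, attrs):
    """A rule is a dict of antecedents {attribute: value}; it fires when
    every antecedent matches the example's attributes."""
    return all(attrs.get(a) == v for a, v in rule.items())

def accuracy(rule, examples):
    """Fraction of the examples covered by the rule that are positive."""
    covered = [label for attrs, label in examples if covers(rule, attrs)]
    if not covered:
        return 0.0
    return sum(covered) / len(covered)

def grow_rule(examples, attributes):
    """Greedily grow one rule: start with the empty rule (matches
    everything), repeatedly add the single best antecedent, stop when
    there is no improvement or the rule covers only positives."""
    rule = {}
    best = accuracy(rule, examples)
    while True:
        candidates = [
            dict(rule, **{a: v})
            for a in attributes if a not in rule
            for v in {attrs[a] for attrs, _ in examples}
        ]
        if not candidates:
            return rule
        cand = max(candidates, key=lambda r: accuracy(r, examples))
        score = accuracy(cand, examples)
        if score <= best:        # no improvement: stop
            return rule
        rule, best = cand, score
        if best == 1.0:          # pure rule: covers no negatives
            return rule
```

On the credit example from the lecture, the empty rule says everybody is a good risk; each pass through the loop then adds the one antecedent that most improves accuracy.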
I attach to each rule the probability with which the consequent holds given the antecedents. For example, if I'm going to vote among rules, that's probably a very useful thing to have. Okay, so as before, we have the problem that, you know, we need a criterion to decide how good a test is, right? Before, for the decision tree, we looked at accuracy and information gain, right? And again, here in the case of rule induction, I have lots of possible quality measures for a rule and people come up with new ones all the time. But again, just the accuracy of the rule is actually a very decent thing to use. Okay? It's actually used more for rules than for decision trees. Right? Because when learning a rule, we don't have this problem that we're trying to balance what happens between different branches. We just want that rule to be accurate. Okay? So I could just use the accuracy of the rule, right? Of the fraction of examples that the rule says are positive, how many really are positive? That's my accuracy, right? I could just use that. So, you know, if the rule covers M0 negatives and M1 positives, that would just be using M1 over M0 plus M1 as the criterion. There's other things that you can do. In particular, you can use the analog of the information gain. But notice that before, right, I was looking at all the values of the variable. At this point, actually, we can go back behind the notion of gain and entropy and just think back to the surprise, right? I'm actually now growing a rule to predict a particular class. Right? So what I don't want to have is nasty surprises. Right? I don't want that when my rule says it's spam, it's actually not spam. Okay? So what I can do here is actually not maximize, you know, gain as such, but just minimize my surprise. So here's my old surprise and here's my new surprise. Okay? So remember, surprise was minus log p, right? And the expected surprise, minus p log p, right? 
For this particular case, right, this is actually what I'm trying to improve on. Right? So the difference between the old one and the new one is my reduction in expected surprise. Okay? In this situation, when my rule is predicting that the class is positive. Okay? Now, notice that this is only one part of what I have here. I've also put this M1 prime here. M1 prime is the new number of positive examples covered. And M0 prime is the new number of negative examples covered. And I've multiplied my, you know, reduction in expected surprise by M1 prime. Why would I do that? If your rule covers one hundred-thousandth of your data set really well, do you really care? Precisely. It's really easy to induce a rule that looks really good. Namely, say, a rule that covers only one example, right? It's 100% accurate. You know, you've got the maximum possible reduction in surprise. And yet that rule sucks. Why does this problem happen here when it didn't happen in decision trees? Remember, in decision trees, when we branch on an attribute, we were looking at both sides. Right? And we're balancing what happens on both sides. When we're learning a rule, no. We're only looking at this one side. So we just throw out all the negative examples that we can, and then we can end up totally overfitting. What is a sign that we might be overfitting? M1 prime gets quite small. Precisely, right? The fewer examples the rule covers, the more suspicious we should get. If the rule covers 100,000 examples and they're all positives, we're pretty confident that, you know, that region that the rule covers is just positive. But if the rule covers two or three examples and they're, you know, all positive, well, you know, you flip three coins and they could all come up heads. So a very simple thing to use, you know, as a fix for this is to actually multiply the gain by the number of positive examples that the rule still covers. We want that to be large. 
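The criterion just described, the reduction in surprise weighted by the positives the refined rule still covers, can be written in a few lines. This is a sketch under the lecture's definitions (m1, m0 before adding the antecedent; m1', m0' after); the same quantity appears elsewhere as FOIL's information gain.

```python
from math import log2

def surprise_gain(m1, m0, m1p, m0p):
    """Reduction in expected surprise from refining a rule, weighted by
    m1p, the positives the refined rule still covers.
    m1/m0: positives/negatives covered before the new antecedent;
    m1p/m0p: positives/negatives covered after."""
    old_surprise = -log2(m1 / (m1 + m0))     # -log p(positive | old rule)
    new_surprise = -log2(m1p / (m1p + m0p))  # -log p(positive | new rule)
    return m1p * (old_surprise - new_surprise)
```

Note how the m1p factor does its job: a perfectly pure antecedent that keeps only one example gets a small gain, while one that keeps many positives while discarding negatives gets a large one.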
And if the surprise reduction is big at the expense of making this small, then this rule is dubious. This is just a heuristic, but it's very important to use something like that. We will see many different ways of combating overfitting, but this, in a simple way, is one way of combating overfitting. Questions? Okay, so at this point we know how to learn a single rule. Okay? So with this algorithm as a subroutine, what should our algorithm be for learning the whole rule set? We're almost there, but we need one more step. So I have a black box that returns the best rule that we can find. What should I wrap around that black box to induce a set of rules, not just a single rule? Here's one way to think about this. Okay, I just learned my best rule. Okay, I have a rule now. So what do I do next? Get rid of all instances that satisfy that rule and then pick the next best one. Yeah, exactly. I now need to learn my second rule, right? And I don't want to go and learn the same rule again, right? My goal with the second rule and the remaining ones is to account for the instances that the first rule didn't account for, right? The positive instances that were correctly classified by rule one, well, they're taken care of, they're covered. Okay? So what I do is I take those away, and now I have a new, smaller training set, and now I go find the next best rule on that training set. Okay? And I keep on doing this until when? Again, don't worry about overfitting for just a second. When should I stop? Until every positive example is covered by some rule. Precisely. Once I've accounted for all my positive examples, there's nothing left to do. We're done. Okay? And this is our algorithm. Also known as separate-and-conquer, right? By analogy with decision trees. Decision trees are divide-and-conquer, right? I divide my set into smaller and smaller pieces and I conquer them. In rule induction, what I do is I conquer one piece, I separate that out, and then I go conquer the rest. Okay? 
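The separate-and-conquer loop can be sketched directly, with the single-rule learner treated as the black box it is in the lecture. The rule and example representations here (antecedent dictionaries, True/False labels) are my assumptions for illustration.

```python
def covers(rule, attrs):
    """A rule fires when every antecedent {attribute: value} matches."""
    return all(attrs.get(a) == v for a, v in rule.items())

def learn_rule_set(examples, attributes, grow_rule):
    """Separate-and-conquer: grow the best rule, remove the positive
    examples it covers, and repeat until no positives remain.
    `grow_rule(examples, attributes)` is the black-box single-rule learner."""
    rules = []
    remaining = list(examples)
    while any(label for _, label in remaining):   # positives left?
        rule = grow_rule(remaining, attributes)
        rules.append(rule)
        # separate: drop the positives this rule covers; negatives stay,
        # so later rules are still penalized for covering them
        remaining = [(attrs, label) for attrs, label in remaining
                     if not (label and covers(rule, attrs))]
    return rules
```

Keeping the negatives in `remaining` is the point of the "separate" step: only the conquered positives are taken away.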
And that's, you know, a complete rule induction algorithm. Any questions? Yeah? So you have just positive and negative classes right now. Can this be extended to, like, multiple classes? Absolutely. And we're going to talk about that in a little bit. But, you know, as a background process, you can figure it out in the meantime. But you're not allowed to look ahead in the slides. Lookahead is sometimes used in machine learning, but it's very expensive. Any other questions? Okay. Now, in the case of decision tree learning, greedy search works fine, and it's what people almost always use. In the case of rule induction, that is not quite the case. Here, greedy search actually can give very bad results because it can very quickly get stuck in a very bad local optimum. Okay? So for rule induction, there's a couple of other things that people often do that are worth knowing about. Okay? So let me just briefly mention those. The first one is what's called round-robin replacement. And this has to do with the fact that, if you think about it, the first rule has an unfair advantage. It's not competing with anybody. It can account for everything that it wants, and it leaves very little data for the succeeding rules. Right? If there's some data that's covered by two rules, right, once the first rule took care of it, now the second one doesn't have access to it. So I've unnecessarily restricted the data that the second rule has access to. Okay? So what can I do about that? Well, what I can do is induce a bunch of rules and then throw away the first rule and say, well, let me now induce something in the presence of the others and see what that would be. If it's the same rule again, then fine. But typically, it won't be. Because now the other rules have accounted for part of what it would have accounted for, and so on and so on. Now this one can be different. Okay? So this is called round-robin replacement. 
I can also do this process as I go: as I add each rule, I try removing rules. There's many variations of this, but some of the best rule induction methods do something of this flavor. Okay? But the most important thing to know about is this last one, beam search. Does anybody here know what beam search is? Beam search is a less greedy greedy search. In greedy search, what happens is that at each point I pick the best antecedent and then I run with that and I never look back, and you can get into a lot of trouble that way. In beam search, what you do is the following. At each point I pick the K best antecedents and I form K candidate rules. K is called the width of the beam, right? It's called a beam because it's like, you know, having a flashlight with a wider beam as opposed to just a pencil of light pointing in that one direction, okay? And then what I do is I try adding each new antecedent to each one of those, right? So now instead of 10, I have, like, say, you know, 100 candidates. And now, again, I find the best 10 of those, okay? And I keep going in the usual way. And then finally, of the 10 best that I have at the very end when there's, you know, no improvement to be had, I pick the best one. Okay? So what is the relationship between beam search and greedy search? Greedy search is a beam of what? Exactly. Greedy search is just what you get when the beam is one, right? You only keep the best one. What is the advantage of beam search over greedy search? Right? When will beam search succeed where greedy search fails? Think of the second best, right? The one that greedy search threw away. It could be that that second best, conjoined with the next attribute, now becomes the best. Okay? In fact, that can happen a lot. And that's why beam search can be a very worthwhile thing to do. It's more expensive, right, by roughly, you know, the order of the size of the beam, but you can get much better results that way. 
So the typical method that people use in rule induction is actually beam search, not greedy search. If you just use greedy search, you could wind up with a very bad rule set. And usually what happens is that you get stuck right away. You don't actually find any good rules because, you know, the concept is pretty complicated and then you don't know what else to do. Okay? Questions? All right, so. Next problem. We're going to meet here one of the biggest problems in all of machine learning, and even in probability and statistics. This is a problem that occurs all over the map. In fact, it already appeared in decision trees, but it wasn't as bad there, so we didn't focus on it. But we're going to deal with this problem in full for rule induction because there it's a huge problem, but, you know, this problem will occur again. In fact, there are whole subfields of machine learning dedicated to dealing with this problem. And in areas like, for example, speech recognition, a good chunk of the whole field is about dealing with this problem. Same with natural language processing, for example. So what is that problem and what can we do about it? It's the problem of estimating probabilities from a small sample. If my rule has a probability of, you know, 80% based on a thousand examples, I'm confident. If it has a probability, you know, of 80% based on ten examples, I am not so confident. And so as I induce the rule, as it grows longer and longer and covers fewer and fewer examples, my decisions become more and more random, until at some point the antecedents that I'm adding are just garbage. So last time we saw, you know, some ways of fighting this, right? What's one way that we used for decision trees that we could also use here? Something from stats 101. Yeah, I can do a significance test, right? This is one option. And it works fairly well. 
But there's actually a much simpler option that typically works just as well or even better. Shockingly simple. So let's see what that option is. So let's see, you know, this is my estimate of the accuracy, right? P is the probability of, you know, the example being positive. M1 is the number of positive examples covered by the rule. M0 is the number of negative examples. And P is just M1 over M0 plus M1. Right? Clearly this becomes very unreliable when M1 and M0 are small. So how might we fix this? Any ideas? You're allowed to read this slide. Yeah. Just say it in your own words. Here's a really, really simple idea. Remember when we mentioned before, you know, Monsieur de Laplace and his principle of indifference? His principle of indifference was inspired by the following deep philosophical question. See, philosophy is relevant. We're going to prove that right here, right? It's relevant to computer science. How do I know that the sun will rise tomorrow? You didn't know coming in here today, at 6.30 p.m. on a Thursday, that we'd be talking about such esoteric things, right? But bear with me for a second. How do I know? I don't know, right? It's always risen so far, right? But maybe tomorrow, you know, the sun will go out, or it will die. Right? How do I know? So Laplace's answer to this question, which you could say is a shallow answer but is actually a deep answer, is the following. In the beginning, right, I haven't seen the sun rise on any day yet. I don't know, right? So I just say that the sun has a 50-50 chance of rising. Right? That's his principle of indifference. Okay? So before I've seen any example, it's 50-50. But then, you know, the dawn of the first day, right, that glorious moment, the sun rises for the first time. Should that change your probability of the sun rising? Well, yeah, sure. Otherwise you're ignoring the data. But should you suddenly believe that because the sun rose one day, it's going to rise every day? 
That seems a little too much. Right? So the question is, what can you do that's between 50-50 and, you know, 100-0? What might you do? You know, by the way, let's play this forward, right? Once the sun has risen for two days, right, I should get even more confident that the sun is going to rise, right? And once the sun has risen for, you know, five billion years, right, at that point I should be pretty darn confident, but not 100% confident, because one day the sun will not rise. So what might we do? Let's say day two. So day zero, or, you know, before the sun rises for the first time, your estimate is 50-50. Or, you know, one in two, right? After the sun has risen one day, what might your new estimate be? 1.5 over 2? Yeah, something like that, right? So how about the following? Before, I had one and one, right? And now what I do is I just pretend that I started out seeing one of each. So now what happens is that I've seen two of the sun rising and one of the sun not rising. This is really the Laplace idea, right? So my new estimate is going to be two-thirds, right? It's actually going up substantially, but definitely not to 100%. Did everybody follow this idea? So we get the following thing that's often called the Laplace estimate, which can take a lot of forms, but here's one, the simplest one. I'm just going to add half to the numerator and one to the denominator. I could also add one and two or, you know, two and four. We'll look at that in a little bit. So this is my Laplace estimate. So first of all, what is my Laplace estimate if I haven't seen any data? One half, right? Very good. What is my Laplace estimate if I've seen infinite data? Let's suppose that I have seen infinite examples where the sun rises and no examples where the sun does not rise. Now what is my probability that the sun rises? It's 1, as it should be, right? 
So what happens with this very simple thing is that as I see more data, it asymptotes to perfect confidence, or to what the probability really should be, right? Let's say that the sun, you know, on some planet rises, you know, two-thirds of the time, right? Then after a while what I have is two-thirds, you know, plus or minus epsilon. This is exactly what we want, okay? So with this very, very simple change, right? From this to this. All I did was add 0.5 to the numerator and 1 to the denominator. And I actually have something that can handle small amounts of data. This one little change that will take you about 10 seconds to make in your code is the difference between a rule induction system that fails miserably and one that is a great success. This is the greatest bang for the buck that you've probably ever seen in any area of computer science, right? You know, type these 10 characters and, you know, quintuple your results. Not bad, okay? But now you could ask yourselves, well, why should that be 0.5 and 1? Right? Maybe I have some prior reason to believe that, you know, the sun rises some percentage of the time. Let's say I'm building a rule for spam and, you know, 90% of my email is spam, right? Why should I start my rule with the assumption that it's going to be 50-50? That seems silly, right? We know it's not going to be 50-50, right? So what might I do instead? Right? Here we're looking at a generalization of the Laplace estimate, right? What could I have here instead of 0.5 and 1? Mm-hmm. Put in the percentage of the time you expect it. Exactly, right? In fact, the short answer to the question of what I could have here is any two numbers you want, as long as the number on top is not larger than the number on the bottom, okay? But typically what you will do is, like, you will have the ratio of these two numbers be your prior estimate of the probability, right? 
If a priori I think it's going to be true with probability 0.9, then that's what I do, I put 0.9 and 1, okay? But why put 0.9 and 1 and not, you know, 9 and 10? What is the difference between those two? In fact, what is the difference between, you know, putting down 0.5 and 1, or putting 5 and 10? Or for that matter, you know, 5,000 and 10,000? What is the difference between these cases, right? Your prior estimate in all of them is 0.5. So what difference does it make? One skews it more; if you have 5,000 and 10,000, you're going to skew it far more towards, you know, your earlier estimate than towards the data you have. Precisely. The larger I make those numbers, the stronger my prior belief. If I make those numbers small, my prior belief is weak and will be overridden by the data very quickly. If I believe in those numbers very strongly because I have very concrete, you know, prior information that they're true, then I need a lot of data to override them, and so I make them large numbers. Again, a very simple thing to play with that has a very large effect on the results. In fact, you know, every good learning algorithm has one good overfitting control parameter, and this is such a parameter. Turning this parameter goes all the way from overfitting wildly, right, if I make it 0, to basically ignoring the data, right, if I make those numbers much larger than the size of the data, the data has no effect. And in fact, it's often useful in practice, in areas like, for example, speech and language, to make these numbers very, very small. They can actually be smaller than 1, too. Those numbers are just there as a placeholder until I see any data. As soon as I see some data, I start to believe the data. But what I don't want to have is 0 over 0, right? Because that would be a lot of trouble. Okay? So these numbers are very easy to play with. 
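The whole family of estimates just discussed fits in one line of code. This is a sketch following the lecture's formulas: `prior` is the prior probability (the ratio of the two added numbers) and `m` is the denominator term that sets how strongly the prior resists the data. With prior = 0.5 and m = 1 it is the lecture's Laplace correction; with m = 2 it is the "pretend you saw one of each" sunrise version.

```python
def m_estimate(m1, m0, prior=0.5, m=1.0):
    """Smoothed probability estimate: (m1 + m * prior) / (m1 + m0 + m).
    m1/m0: positive/negative examples covered. With the defaults this is
    the Laplace correction (m1 + 0.5) / (m1 + m0 + 1); larger m means a
    stronger prior that needs more data to override."""
    return (m1 + m * prior) / (m1 + m0 + m)
```

Note how the estimate behaves at the extremes: with no data it returns the prior itself (never 0/0), and as the data grows it asymptotes to the raw frequency.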
You know, this is sort of what you would call the general prior estimate, or the m-estimate, because you call that number m. But these are a couple of numbers that are really worth putting in there and, you know, really playing with when you're applying rule induction to some problem. Okay? So you're wondering why you're dealing with chi-square and all of that stuff that we're forcing you to do in decision tree induction. We just picked the most painful method because we want you to suffer. And as I mentioned, this is not just a problem in rule induction. What you just learned in the last five minutes is something that you can apply anywhere that you're doing probability estimates. In natural language and speech, for example, people estimate, you know, all these probabilities. Like, you know, the famous Google machine translation model? It's just a bunch of n-grams, right? N-grams are, you know, sequences of n words and how often they occur. The biggest problem when learning large n-gram models is exactly this problem. I make the n-grams longer and then the probability estimates start to get very unreliable, so the smoothing becomes really, really important. By the way, this is often called smoothing because what it does is it makes my probability estimates smoother. They don't jump all over the place anymore. They move very smoothly from your prior to the data. Questions? All right. Moving on. Somebody asked a question, and again, let me hide that. Somebody asked the question, what if we had multiple classes? See, the strategy in these classes is to be on the lookout for that tenth of a second when the slide shows up and use your iconic memory, and then you have all the answers. Everything that we saw so far was learning with one class. What if I have half a dozen? What should I do then? Any suggestions? We're lazy and we're stupid, so we like to do something simple and easy. What is a simple, easy thing to do here? Each class one at a time? 
Precisely. Always try to use what you already have as a black box. This is a golden rule in all of computer science. Let's say I already have a black box for learning one class. How do I bootstrap that black box into another one for learning multiple classes? Just learn rules for each class in turn against all the others. No change. So I have classes A, B, C, and D. I have a bunch of rules that say if this rule fires it's A, if this rule fires it's B, and so on. Are we done or do we need something else? I think those rules could overlap, right? Yep. You have to solve that problem. Exactly. So if the rules overlap, what do we do, right? A rule for class A fires, and a rule for class B fires. So what do I do now? Could you build the rules such that they're mutually exclusive? I can contrive the rules such that they're mutually exclusive. Isn't that a devious idea? And what do you get in that case? You get a decision tree. Historically, actually, rules precede decision trees. But then there was this problem. I'm abusing history a little bit here, but this is not entirely incorrect. People learned rules, which at least for people coming from an AI background were very natural, but there was always this very thorny problem of how you combine the rules. If you contrive the rules to have mutually exclusive antecedents, you never have that problem, and that makes decision trees kind of attractive in some ways. But, you know, today our agenda is, you know, we're going to do this with rules. So what might you do? Sort them by confidence. Precisely, right? We already looked at this last time. We can sort them by confidence or by some measure of quality, and the first rule that fires wins, right? We call that a decision list, which is really a kind of case statement, right? Natural idea. And this is an often-used method. But we also saw last time that this is typically not the best method in terms of accuracy. What is the best method? 
Voting, right? Democracy. We like democracy. Right? The rules all get to vote. Right? Because suppose that the most accurate rule says class A. But then there's ten rules saying class B. Would you be really stubborn in that case and still favor class A? No. At that point, maybe the balance of evidence favors class B. Right? I have one rule firing saying this is a chair and I have ten rules saying it's a table. Well, you know, I might go with table. Okay? So what you do is you vote. You have a set of rules. Each one has some sort of coefficient, like, you know, coverage, accuracy and whatnot, and then they vote according to these coefficients, and whichever class gets the most weighted votes wins. There's other ways to combine them, some of them quite interesting, you know, probabilistic ones. We will look at some of these ideas later, but, you know, this actually works pretty well and is easy to implement. Okay? It has the disadvantage that it's the slowest, right? Because I need to run through all the rules to see if they all match and then vote them, as opposed to stopping as soon as one of them fires. And remember, typically the rules at the top cover a lot of examples, so there's a good chance that one of the early rules will fire. So voting is the slowest method, but it tends to be the best one. Okay? Questions? So this problem doesn't really exist in decision trees, right? Yeah. Decision trees are immune to this problem because I always go into exactly one leaf and that leaf predicts exactly one class. So that's one advantage of decision trees. One of the things that you should be able to do when you're done with this class is, when faced with a new problem that's a candidate for machine learning, think about these various algorithms and representations and see what is good for this problem versus what is not. Which is why it's important for you to see a good sample of these things. And of course, it's not a random sample. It's the most widely used ones, and it's a varied set. 
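The one-vs-rest rule sets plus weighted voting combine into a short prediction routine. This is a sketch under my illustrative representation: each class maps to a list of (rule, weight) pairs, where the weight might be, say, the rule's smoothed accuracy times its coverage.

```python
def predict_by_voting(rule_sets, attrs):
    """Multi-class prediction by weighted voting. `rule_sets` maps each
    class to a list of (rule, weight) pairs; every rule that fires votes
    with its weight, and the class with the most weighted votes wins.
    Returns None when no rule fires (caller falls back to a default class
    or a reject option)."""
    def covers(rule, attrs):
        return all(attrs.get(a) == v for a, v in rule.items())
    votes = {}
    for cls, rules in rule_sets.items():
        for rule, weight in rules:
            if covers(rule, attrs):
                votes[cls] = votes.get(cls, 0.0) + weight
    if not votes:
        return None
    return max(votes, key=votes.get)
```

A decision list would instead return the class of the first rule to fire in a fixed quality ordering; voting is slower, since every rule is matched, but tends to be more accurate.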
But you want to have that understanding of what the options are and their strengths and weaknesses and so forth. All right. Somebody had a question? What do you do if none of them apply? Oh, absolutely. Great question. Darn it. I forgot that. What do you do when none of the rules fire? You do what? It's a negative example. But what is a negative example? You have multiple classes and I've learned rules for each of the classes. There are no negative examples anymore. Couldn't you just define sort of a default class? Yeah, precisely. You can have a default prediction. What is the single most frequent class overall? Or, what is the single most frequent class among examples that none of the rules cover? Right. Remember, in reality, we're going to prune. So the rules aren't going to cover all the examples. So I could see what is the most frequent remaining class and then I predict that. Or you could just tell the user, hey, I don't know. Some machine learning systems have what's called a reject option, which is, I'm not confident. This is actually what the post office uses, or used to use, for things like, you know, figuring out the zip code on letters. If the character recognition system was confident that it knew the zip code, it would get sorted automatically. Otherwise, there's a reject option that says, like, a human had better look at this. So you also have that option. By the way, even for the case where I only have two classes, the best performing method is to learn rules for one class and for the other class, and then vote them or order them. It's typically more robust. So in the case of two classes, I don't have to do that. I could just learn rules for one class, but I'm usually better off actually having rules for both of them and then having some ordering procedure. Any questions? Very good. So this was the straightforward part of today's class. 
Straightforward because, if you think about what we did in this hour, it was really just transposing to rule induction what we knew about decision tree induction. It wasn't that different at the end of the day. You might even wonder if it was worth all the trouble, because if you already know about decision trees, then figuring out how rule learning works is not that big of a deal. But now we're going to take a big leap into something much more powerful than anything that we saw last time. And a large part of the reason we've looked at rule induction the way we did so far is that it's going to be the basis for learning rules of a much more powerful form. That's what we're going to spend the rest of today's lecture on: learning first-order rules. I don't know if you noticed, but everything that we've seen so far was a propositional representation, because it's at the level of propositional logic. In propositional logic, all that you have is propositions; a Boolean symbol is a proposition. The Boolean symbol A might represent "it's sunny today." You can do a lot with propositional logic, but it's also a very impoverished language. To really do programming in general, you need something at the level of first-order logic. First-order logic is when you can have functions and arguments to the functions, and predicates with arguments, and so on and so forth. We want to learn things like that. This of course is much, much harder, but actually at this point, step by step, we're already halfway there. So let's see what else we have to do and what this new, more powerful representation is. There's a lot of formal background that you could give here about first-order logic and whatnot. I'm going to mostly avoid that and keep things at a more or less intuitive level. However, if at any point you're not clear on what we're saying or have any questions, just let me know and we can look into things a bit more carefully.
So I'm going to be relying on examples a lot to show what happens here. Here's a classic example of the type of thing, very simple, that you can learn with rules in first-order logic that you couldn't learn with propositional rules or with decision trees the way we saw before. This is the example of the concept of ancestor. Ancestor is not a property of a person; it's a property of a pair of people. This is what makes it interesting. Whether a patient has tuberculosis or not is a property of that patient. Whether Anna is an ancestor of Bob, well, that's a property of the pair. So things now get a lot more interesting. What we would like to do is learn the definition of ancestor from data. You give me a database of family relationships, and I want to infer from that what ancestor means. This is a very real example. Suppose I have a big information extraction system, in fact we're going to see an example from information extraction shortly, and I want to extract these relationships, figure out how one relates to the others, and so on. So here's an example of a couple of logical rules that define ancestor. One is: X is an ancestor of Y if what? What's the simplest case in which X is an ancestor of Y? Think of this as a recursive definition and think of the base case. What is the base case of the recursion? Parent. Parent, right? X is an ancestor of Y if X is a parent of Y, so that's our first rule. Easy street. What is the more interesting case, the recursive case? X is an ancestor of Y if, what is the general case? If X is a parent of an ancestor of Y. Exactly. If you are a parent of an ancestor of somebody, then you are also an ancestor of that somebody. Everybody agree? And that's what the second rule here is doing: X is an ancestor of Y if there's a Z somewhere such that X is a parent of Z and Z is an ancestor of Y. Okay?
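The two ancestor rules can be evaluated over a toy parent relation directly. Here is a small sketch mirroring the base case and the recursive case; the family names are invented, and the parent relation is assumed acyclic (as family trees are), so the recursion terminates.

```python
# Hedged sketch of the two rules above, over invented facts:
#   ancestor(X, Y) if parent(X, Y)                    (base case)
#   ancestor(X, Y) if parent(X, Z) and ancestor(Z, Y) (recursive case)

parent = {("Anna", "Bob"), ("Bob", "Charles"), ("Charles", "Dana")}

def ancestor(x, y):
    if (x, y) in parent:                 # base case: x is a parent of y
        return True
    # recursive case: x is a parent of some z who is an ancestor of y
    return any(ancestor(z, y) for (p, z) in parent if p == x)

print(ancestor("Anna", "Dana"))   # True, by way of Bob and Charles
print(ancestor("Dana", "Anna"))   # False
```

This is essentially what a Datalog engine does when it evaluates the two rules against a database of parent facts.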
Look at this carefully for just a second, because this is the kind of interplay of arguments that we're going to be using all along here. Some of you might recognize that this is actually a couple of statements in Datalog. SQL, at some level of abstraction, is equivalent to Datalog, and most database queries, or at least the simple database queries that you issue, can be formulated in this form. Datalog is also a special case of the Prolog programming language. In Prolog you can write whole programs just by writing down rules like this in logic. The difference between Datalog and Prolog is that in Prolog you can have functions inside here; I could have ancestor(f(X), Y). We're not going to deal with that case here. We're just going to look at the Datalog case, because it's complicated enough and powerful enough for our purposes. Okay? So what we would like to do is learn a couple of rules like this from data. And by the way, the rule does look a little weird because the arrow is pointing in the opposite of the usual direction: we have the consequent first and then the antecedents. But this is what people do in Prolog and Datalog, so we're just going to stick with that. You can read the rule any way that you want, but in a rule like this, the consequent is always in front of the arrow and the antecedents are behind the arrow. Okay? So this is the type of representation that we would like to be able to learn. Much more powerful than what we had before. Now, of course, this is a trivial little example, just family relationships. What's a more realistic example of this? Why do we care? Here's actually a real example of the type of first-order rule that you can learn. Let's suppose that you want to do the ever-popular thing of classifying web pages. You might want to classify web pages by what topic they're on, by what kind of object they represent.
You can see how this would be interesting to, say, Amazon and Google and Microsoft and whatnot. You want to organize the web, right? So here's an example of such a rule from a real system, and we're going to see how that system works shortly. They applied this system to the web pages of a bunch of computer science departments, and they wanted to classify the web pages: this is a course page, this is a faculty page, this is a student page, this is a project page, and so on. Let's figure out if a web page is the web page of a course or of something else. And now what we're going to do is rule induction. We already know how to do rule induction; we're going to try to form a rule that predicts that something is a course page using properties of the page. This is the single most accurate rule for predicting courses that came out of the system, so let's see what it has. It says that A, a variable, is a course, and you can read the rule like this. What does this first antecedent say if you translate it into English? Web page A has the word "instructor." So if the web page contains the word "instructor," that makes it more likely to be a course page. Pretty obvious, right? And what does the second one say? By the way, this symbol here, in case you're not familiar with it, is the negation sign; it means "not." So what does this one say? The page does not contain the word "good." Let's keep moving and not focus too much on what that part of the rule means. I don't know. A course page is a course page if it doesn't contain the word "good" in it. Well, this is just an empirical fact. Let's not overinterpret it. But anyway, these two antecedents we actually could have done with what we had before. In fact, people often do this: I represent a web page as the bag of words that appear in it. It contains the word "instructor," it does not contain the word "good," it contains this word and not that other one, right? And then I could just have the rule that says: if instructor and not good, then course.
So far, propositional would be good enough. Where does the extra power come in? Well, let's see the next pair of antecedents. These two. What are these two saying? This is the really interesting part of this rule. There's a link from page A to page B, and page B contains the word "assign." Yes. Right, and by the way, something people often do in text processing is stemming. Stemming means that I cut off the suffixes of words, so this "assign" really comes from the word "assignment." So what these two antecedents are saying together is that something is a course page if it points to another page that contains the word "assignment." Well, of course. Isn't that brilliant? The web pages for courses contain links to pages that contain the word "assignment" because they are the assignments. In fact, the web page of this very class is a positive example of this rule. We really contrived it to make that happen, but hey, we have that positive example. It does contain the word "instructor," it does not contain the word "good," and it contains a link to a page that says "assignment." See? This is the power of machine learning. Isn't it amazing? Especially when you contrive the examples. No, just kidding. Okay, so see why this is different from what we could do before? Because now we're talking about a property of something else. We're no longer talking only about properties of the object that we're classifying, this web page. We're talking about properties of a related object. And now we have all this power in our hands. And you can see why on the web you'd be able to go hog wild with this, because links are all over the map. This page is about something because there's another page that points to it that's about something else, etc., etc. Here's another powerful example of this: I can predict that you will go see this movie if the friends that you trust went to see that movie. Right?
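The course-page rule just described can be checked mechanically against a toy web graph. This sketch is an invented illustration: the page names, word sets, and links are all made up, and the has_word and link_from relations are represented as plain Python collections.

```python
# Hedged sketch of the course-page rule over an invented mini-web:
#   course(A) if has_word(A, instructor), not has_word(A, good),
#               link_from(A, B), has_word(B, assign)

pages = {                                    # has_word as page -> word set
    "cse546":    {"instructor", "machine", "learning"},
    "cse546/hw": {"assign", "due"},
    "blog":      {"good", "instructor"},
}
links = {("cse546", "cse546/hw")}            # link_from as a set of pairs

def is_course(a):
    if "instructor" not in pages[a] or "good" in pages[a]:
        return False
    # there exists a page B linked from A that contains the word "assign"
    return any(x == a and "assign" in pages[b] for (x, b) in links)

print(is_course("cse546"))  # True
print(is_course("blog"))    # False
```

The `any(...)` over the links set is where the first-order power shows up: the rule reaches out to a related object, page B, rather than only testing properties of A itself.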
Now I can use properties of related objects to predict properties of the objects of interest. I couldn't have done this before. Or maybe I could have done it by putting into the record for an object all the properties of all the related objects, but that's insane, right? It would mean that for every page that I'm trying to classify, I would need all the properties of all the pages on the web that are connected to it, which is probably all the pages on the web, or almost all of them. So that would be a non-starter. So we get a lot of power from being able to do this. Now, by the way, what is this last one here? What does this mean? It has a link. This is actually the least obvious one, but it also makes sense. So this here says not link_from(B, C). What is this saying in English? Nothing links to page B. Actually, the other way around: B links to nothing. Remember, link_from(A, B) means A links to B. So what this says is that B links to nothing. So what does this mean in terms of course pages and assignment pages? So the assignments are on the assignment page itself? Right, exactly. Again, this makes sense. The course page has a link to the assignment page, but then the assignment page doesn't have links to anything else. Our assignment page violates this one. Darn it, we failed there. I'm going to fire the TA. He'll be grateful. So this actually looks a little odd at first, but it makes a lot of sense: assignment pages typically do not have pointers to other things. At least in the data set that they were learning this from, that was the case. And by the way, this rule was completely accurate on the training set and on the test set. These were both pretty small. This was back a long time ago, in the early days of the web, when doing this was a new thing. But this is actually a surprisingly accurate rule. Is it a good thing that it had 100% accuracy on the training set? A good thing that what?
That it had 100% accuracy on the training set. It's a good thing, but it's also a suspicious thing. A question that you want to ask yourself as a newly minted machine learning pro is: how much do I believe that 31 out of 31? And the answer is, well, I would believe it plus or minus 3 or 4, which, guess what, is exactly what happened. You would have to be pretty lucky to get 31 out of 31 if the rule wasn't accurate; that's very unlikely. So before looking at the test set, and hindsight is always 20/20, I would trust this rule to be accurate, but I would not trust it to be 100% accurate. And that turns out to be the case: this rule is not 100% accurate on the test set, but it's still pretty accurate. It's a valid rule. Was the test set just the training set plus a few more sites? What they did in this problem, as you'll do in your project, is split the data into a training set and a test set. I don't remember exactly what the split was, but they were of comparable size, and it's not like the two were totally different kinds of pages. Questions so far? Okay. Now we know, at least at an intuitive level, what first-order rules are. Hopefully I've persuaded you that this is a powerful thing that we would like to be able to use if we can, but it's probably not going to be that easy; it sounds a lot more complicated than what we had before. But of course, what we're going to do here, as always, is go step by step. We already know how to build rules in the propositional setting. Now what we have to do is extend that to the first-order setting. Hopefully some things will stay the same, and we don't need to mess with those. Some things will change, and then we need to deal with those, and then we'll have a full algorithm. And this is exactly what we're going to do here.
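The "how much do I believe 31 out of 31" intuition can be made quantitative with one line of arithmetic. The 90% true-accuracy figure below is an assumption chosen purely for illustration; the point is just that a perfect score on 31 independent examples is unlikely unless the rule really is quite accurate.

```python
# Sanity check of the 31/31 intuition: if the rule's true accuracy were
# only 90% (an assumed figure), how likely is a perfect training score?

true_accuracy = 0.90
n = 31
p_perfect = true_accuracy ** n  # probability all n examples come out right
print(f"P(31/31 correct | accuracy=0.90) = {p_perfect:.3f}")  # about 0.038
```

So even a 90%-accurate rule gets a perfect 31/31 less than 4% of the time, which is why the rule deserves real trust, just not trust that it is literally 100% accurate.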
In particular, we're going to look at FOIL. It was one of the first first-order rule induction systems, but it's still very popular, and it encapsulates very neatly many of the basic ideas of at least one of the main approaches to learning first-order rules. We're actually going to see two main approaches; this is the first one. It's the one that builds directly on what we already know about rule induction. And FOIL, by the way, was again from Ross Quinlan, this great UW alumnus who is one of the grand old people in machine learning, the same person behind the decision tree learners we saw. Pretty amazing guy. And by the way, FOIL stands for First Order Inductive Learner; it's one of those acronyms that really means something. I have heard people propose calling their algorithms things like "shrink wrap" and other silly names; fortunately, that did not come to pass, but I did hear it proposed. So what is FOIL going to be? It's going to be the same as our propositional separate-and-conquer method, but with two key differences. What are those two key differences? Well, think about it. What we did when we were doing propositional induction was look at all the possible antecedents, try adding each one, and see which one works best. We can still do that here. What has changed, though, is the language of antecedents that I can add. The antecedents that I had before would just test things like it's sunny today or it's not sunny, the humidity is above 70% or it's not. Now I have a much richer language of things that I can add. In particular, I have predicates like has_word and link_from and whatnot. So the basic idea of constructing rules one antecedent at a time can stay the same. What has changed is that now I have a much more powerful language of antecedents to add. Everybody got this? Okay, so that's difference number one: I'm going to have to look at different candidate specializations of my rules.
In particular, my specializations are going to be what are called literals in logic. A literal is either a predicate with arguments, like this, or a negated predicate with arguments, like this. So these are the kinds of things that we're going to be looking at adding. That's difference number one. Difference number two is that, obviously, I'm going to need a different evaluation function now. In any learning algorithm, the choice of evaluation function is always one of the key things. The type of evaluation function that I had before doesn't really work here anymore. Let's suppose the evaluation function was just the accuracy: the fraction of the covered examples that are classified correctly. What is the accuracy in this case? Actually, what is an example in this case? It's not even clear what an example is anymore. Before, an example was like a patient and their 10 symptoms. But now what I have is this big network. It's the web. There are links. There's substructure in the pages. There are tags. What is an example? It's not actually clear what an example is, or whether it's the same before and after I add an antecedent to the rule. Clearly, as much as we still want to maximize accuracy or information gain or coverage or any of those things, we're going to have to adapt them. These are the two changes that we're going to have to make. Let's see how we're going to do that. First of all, it's going to be easier than it seems. That's the good news. On the other hand, fair warning: I'm going to gloss over a lot of subtle details here. You're welcome to look at them in more depth in the book and, for example, download FOIL and play with it and whatnot. At a high level, it's actually not going to be that complicated, at least compared to what's coming later. First of all, how do we specialize rules in FOIL? Well, let's say we have this current rule. Now, notice that, for a start, the head of my rule is no longer just a class.
It's an actual predicate, like ancestor(X, Y). So it's going to be a predicate symbol P with some set of arguments, X1, X2, and so forth. Let's say that so far, as antecedents, I have a bunch of literals, L1 to Ln. Each Li is really standing for a predicate symbol with a bunch of arguments in parentheses, like parent(X, Y) or link_from(A, B) and whatnot. So I have this rule so far. Now, what am I going to add to it? Over to you. What should I consider adding? Oh, and by the way, what is the training set now? This is a very important point. The training set used to just be a table of my examples: for each patient, their symptoms; or for each email, the words that appear in it. What is our training set now? It's actually, again, a much more general thing than we had before. What is the training set for a first-order rule learner? A graph? That's one way to look at it. It can be a graph, and we're going to see an example like that in a little bit, but it's actually even more general than a graph. The training set is going to have instances of this predicate, right? Because we're trying to learn it. It's going to have instances of: oh yes, Anna is a parent of Bob, and Bob is a parent of Charles, and so on and so forth. So it's going to have a bunch of stuff like that, but it's also going to have a bunch of stuff like this: the other relations, like has_word and links_to. So what is this looking like? Declarative knowledge. Yes, it is. It's a bunch of facts. But more specifically, this is a relational database. Our training set now is a relational database. Each predicate is a relation: there's the links_to relation, there's the has_word relation. Those only had two arguments, but I could have a relation here with any number of arguments that I want. Okay?
So when we're doing this kind of learning, often called relational learning, what we have is not just a table anymore; it's a whole bunch of tables, meaning a database, okay? Which is nice, considering that most of the world's information systems run on databases, and we would like to be able to learn from them, as we're about to see. So my data is a database, and I have different relations in that database, like the has_word relation, the links_to relation, et cetera. And I'm trying to build a rule out of those relations. Let's suppose, for example, that I'm trying to predict p(x, y), and in my database I also have r(x, y), s(x, y), and t(x, y). So my database consists of four binary relations. Again, they could have many arguments, but for simplicity they just have two. And I'm trying to learn to predict p as a function of r, s, and t. So let's say that in my rule I already have something very trivial, like p(x, y) if r(x, y). What can I add now? What are the first obvious choices? Just give me one suggestion of something I might add to the rule. Negated r? Yeah, sure. That's one thing: I could have something like not r(x, y). Actually, I probably want to allow both r(x, y) and not r(x, y) as candidates. So certainly we want to allow negations of these things; that was, in fact, my last point. But for the moment, let's ignore negation. I have p(x, y) if r(x, y), and I'm going to try to further refine this rule. How might I refine it? Yeah, exactly. I could add one of the remaining predicates. I could now make it p(x, y) if r(x, y) and s(x, y), or p(x, y) if r(x, y) and t(x, y). And let's suppose that I just did that, and now I can try the remaining one.
So now I have p(x, y) if r(x, y) and s(x, y) and t(x, y). And now comes the really interesting part. Am I done with the things that I can add here? Certainly, if this was propositional rule induction, I would be done, right? I have now used up all the attributes that I have. But in this case, am I done? Think back to the web example. Am I done once I have added my has_word and my link_from? Clearly not. I can keep adding more of those. What is the thing that can happen here that couldn't happen in propositional rule induction? I can have the same predicate repeated, right? But what changes from one occurrence to the other? The arguments. The arguments, right? That's the key: the thing that makes this powerful is that my predicates now have arguments, and they can share arguments in very interesting and complicated ways. I could add the same predicate multiple times. I can say: this page links to a page that says this, and to another page that is linked to by yet another page, and I can even follow another link from one of these pages and look at the words that are two steps away. There are all these things I can do. So the moral of the story is: I don't just have a choice of which predicates to add. I have a choice of which predicates to add with which arguments. So another question is: should I just allow any addition of arguments, any way that I want? That's the natural thing to say, right? That's my language: add any predicate with any arguments. So, for example, I could have p(x, y) if r(x, y) and s(y, z). This is a perfectly legitimate rule, and a good thing to allow. But let me give you another example that hopefully will make you think. What about the following: p(x, y) if s(u, v)? Do you like this one? It is surprising. Yes, this rule is surprising, right?
But more to the point, indeed. More to the point: is it a good idea to add s(u, v) to this rule? That's right, it fails the common sense test. I'm trying to classify x and y; why am I talking about the relationship between u and v, which have no relationship to x or y? It doesn't make any sense. So we probably want to allow the former, but not this one. On the other hand, notice the following interesting fact: in the rule that we agreed was reasonable, I did have a variable that never appeared before. I'm allowing the rule to involve objects that I never talked about before. So it's not that I only want to allow variables that already appear in the rule; that would be too restrictive. And again, notice that in the web page example, my rule there talks about pages B and C that did not show up in the head. And also, remember, in our ancestor example there's a key role being played by Z: Z is this person in the middle that X is a parent of and who is then an ancestor of Y. So I do want to allow that. So can you induce from these examples what the general rule for what we want to allow is? Allow adding any antecedent with any arguments, provided that... exactly: at least one argument already appears in the rule. It doesn't have to be in the head, but it has to be in the rule. If an argument already appears in the rule, that means that there's a connection between what I'm now saying and what I was saying before. I can bring in a new page because it has a link from a page that was related to what I want to classify. And then, once I have that page, I can go look for links to that page. What I probably don't want to do is just bring in random pages that have no link to any of the pages that I'm talking about already, okay? So from a relational database standpoint, it's just, like, doing a join? Precisely. In fact, this is exactly what happens.
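The "at least one argument must already appear in the rule" constraint can be sketched as a candidate generator. This is an invented illustration, not FOIL's actual implementation: predicate names, arities, and the single-new-variable convention are assumptions made to keep the sketch short.

```python
# Hedged sketch of generating candidate literals: every predicate with
# every combination of variables, kept only if at least one variable
# already appears in the rule (so the new literal joins with the rule).

from itertools import product

def candidate_literals(predicates, rule_vars, new_var="Z"):
    """predicates: dict of name -> arity; rule_vars: variables already in the rule."""
    pool = sorted(rule_vars) + [new_var]   # allow one brand-new variable
    candidates = []
    for name, arity in predicates.items():
        for args in product(pool, repeat=arity):
            if any(a in rule_vars for a in args):  # the join condition
                candidates.append((name, args))
    return candidates

# rule so far: p(X, Y) if r(X, Y); consider adding the binary predicate s
cands = candidate_literals({"s": 2}, rule_vars={"X", "Y"})
print(len(cands))  # 8: of the 9 argument tuples, only s(Z, Z) is excluded
```

The excluded candidate s(Z, Z) is exactly the disconnected s(u, v) case from the discussion: it shares no variable with the rule, so it cannot join with it.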
If you want to do this kind of thing on a large scale, you're going to do it with database joins. And that's all that's going on here; this is really what a Datalog rule is. If you think of the rule as a database query, then here there's a non-trivial join between these two literals, because they share an argument, whereas here these literals don't join: they have no arguments in common, they're talking about different things. Okay? So this is important point number one: I will allow in my language any antecedent, any literal, with any arguments, with the condition that at least one argument has to have appeared before; otherwise it doesn't make sense. That's number one. Number two is something that might seem a little odd at first sight, but is actually quite important, which is why it's there. I'm going to allow the special predicate equals. equal(x_j, x_k) means that these two variables are the same, so I could actually just replace them both by, say, y. Okay? Now, why would I want to allow that? If x_j and x_k are both going to be y, why don't I just call them y to start with? Why would I want to have this equality predicate? And let me give you a hint: this is not because of the representation. It has to do with the fact that we're going to learn these things using search that is necessarily greedy. So what might be the advantage of having this in there? Let me put it this way. Suppose that we already have this rule, and the literals in it all have different arguments. What could be the advantage of adding an antecedent like equal(x_i, x_j)? What is it doing? It's basically saying that these two arguments that I had over here are actually the same. Why might I want to do that?
When I added those literals, it might not have been apparent at that point that, at the end of the day, the arguments should be the same. And this, as I said, is going to happen very often: I'm adding a bunch of things, and then I realize, well, actually this part here really should be talking about the same thing as that part there. And if I don't allow this equality at that point, I'm hosed. You could also say that instead of allowing equality, you should allow, at some point, replacing different arguments by the same one. But the point is that you often want to do this late in the search. In the beginning, when you were adding things greedily, the new literals were just tied to the existing rule by one argument; other connections among their arguments may only become apparent once I've added the other things around them. So it can be very important to allow equality. It can also be very important to allow the negation of equality, just as we want to allow the other predicates to be negated; often the most interesting rule is one that excludes a trivial special case. So I also want to allow not-equals. Questions so far? We just covered a lot of ground here. So now we know how to build a very powerful class of rules. But we still don't know how to evaluate our candidates. Okay, let's go back to what we had before. We had this heuristic that maximized the reduction in my surprise inside the rule, and then we also multiplied by the coverage of the rule because we didn't want it to over-specialize. Let's do that here, see what breaks, and then see how to fix it. So let's say that my rule so far is R, and L is the literal that I'm considering adding to the rule. And now let P0 be the number of positive bindings of the rule. What is this thing called a binding? What is a binding of a rule? We didn't have this before.
This has to do with the fact that what an example is, is no longer very well defined. Think of the data as a graph. I have a very big graph of things, and now what exactly is the subgraph that I'm going to use for this particular prediction? Well, that can vary. What determines the relevant subgraph is the rule: it's the things that the rule binds to. Binding is actually a technical term. I have a rule with these variables, and any set of things that those variables can be replaced by creates a binding of the rule. I have a rule about parent(X, Y), and I could bind that to parent(Anna, Bob). So I can take a particular family, and the rule applies to that family with its relationships. That's what a binding is, okay? So I no longer have a notion of an example; I have the notion of the bindings of the rule. You could say that in some sense all I have is one big interconnected mega-example, but that's too much of a headache. Let's just look at the part of that big graph that matches the rule. That's a binding. In general, if the rule is good, it'll match many different subgraphs: it'll match this family and that family and that family. And that's the number of bindings that I want to count. This is what now plays the role of my number of examples. Now, of course, some of those bindings will be positive and some of those bindings will be negative. Think of a binding of the rule as a set of people and their relations. In some of those sets, A really will be an ancestor of B, and in some of them it won't. The first are the positive examples and the second are the negative examples, right? This is a very important notion. Everybody on the same page? Okay, very good. So I'm going to count the number of positive and negative bindings of the rule before I add the antecedent. Those are called P0 and N0, okay?
And now the surprise of the rule before adding the literal is minus log of P0 over P0 plus N0. That hasn't changed: once we have defined what the rule actually applies to, we can apply our existing notion of surprise. And the same thing after we add the antecedent. Now we've added the new literal to the rule, so there's a new set of bindings, and P1 is the number of those bindings that are positive and N1 is the number of those bindings that are negative. Remember, let's say I'm building a rule for ancestor. In the beginning that rule is only about two people, X and Y. But once I bring in that third person, the Z that X is a parent of and who is then an ancestor of Y, now this rule binds triples of people. The rule used to apply to pairs of people, and now it applies to triples of people. Big difference. Notice we're in a whole different world from our good old propositional rule induction. There's much more going on. It's much more powerful. It's also more complicated. But again, some of those bindings will be positive, in that the ancestor relation does hold between X and Y with that Z or some of those Zs, and in others it doesn't. So now I have P1 and N1, and I have my new surprise. But now, first of all, I have the same problem that I had before: I don't want a very specialized rule that has very low surprise just because it doesn't cover anything. So I'm going to need a coverage factor over here, as I had before. But there's an even bigger problem here, which is that, if you think about it, we're really comparing apples and oranges when we compare these two surprises, because we may be counting bindings over different sets of things. In the propositional case this could not happen. Right?
But in one case we're talking about pairs of people and in the other case about triples of people. Right? And the nasty little effect that this can have is the following. Remember, the basic heuristic that we're using here to be confident of our rule is that it covers a lot of examples. Once it covers only a few examples, we get suspicious. Because remember what happened in propositional rule induction: as I added more literals, the number of examples always went down. At best it stayed the same, but adding more antecedents meant covering fewer examples. What happens in this case when I add more antecedents? Does the number of bindings always go down? No, in fact it could go up by leaps and bounds. Suppose that I had ten people in my data set, the first rule matched, say, a hundred bindings, and the new one matches a thousand. Wow! I just got an order-of-magnitude increase in the number of bindings that match my rule, for free. Right? Clearly, if we don't do something about this, we're going to get all sorts of garbage out of our induction system. So what is the simple thing that we can do here that at least heuristically fixes this? It's a very simple idea. One of the things that Quinlan is very good at is coming up with simple heuristics that do the job, and this is a very nice example of that. He just makes a very minimal change to what the coverage part of the heuristic means. The change is the following: the factor t here is going to be the number of positive bindings of the original rule that are also covered by the new rule. This disallows cheating by multiplying the number of bindings through new variables. If you think about it, I really only care about the bindings of the old rule.
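To make the heuristic concrete, here is a minimal sketch of FOIL's gain computation as just described. The representation is an assumption made for this sketch: bindings are sets of tuples, where a binding of the extended rule is a tuple that extends a binding of the old rule with values for the new variables.

```python
import math

def foil_gain(pos_bindings_old, neg_bindings_old, pos_bindings_new, neg_bindings_new):
    """Sketch of FOIL's gain for adding a literal to a rule.

    Bindings are sets of tuples of constants; a new-rule binding extends
    an old-rule binding with values for any new variables.  Assumes the
    old rule has at least one binding of each sign.
    """
    p0, n0 = len(pos_bindings_old), len(neg_bindings_old)
    p1, n1 = len(pos_bindings_new), len(neg_bindings_new)
    if p1 == 0:
        return 0.0
    # t = positive bindings of the OLD rule still covered by the new rule.
    # Compare on the old rule's variables only (the tuple prefix), which
    # blocks "cheating" by multiplying bindings with new variables.
    k = len(next(iter(pos_bindings_old)))
    still_covered = {b[:k] for b in pos_bindings_new}
    t = len(pos_bindings_old & still_covered)
    surprise_old = -math.log2(p0 / (p0 + n0))   # -log p0/(p0+n0)
    surprise_new = -math.log2(p1 / (p1 + n1))   # -log p1/(p1+n1)
    return t * (surprise_old - surprise_new)
```

The gain is positive exactly when the new rule is less "surprising" (more precise) than the old one, scaled by how many of the old positive bindings survive.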
The new variables that I brought in are auxiliaries to help me predict what I really care about, which is whether this is a course page, or whether these two people have the ancestor relation. So by doing this, we focus on what we really want. I might have multiplied the bindings by a lot; I don't care. I care about which of the old rule's bindings are still bound by the new rule. This is actually not obvious, but after you see the idea, you can tell it should at least work. And heuristically it works very well, and FOIL is very widely used. Any questions? Okay, so that's FOIL for you, or at least the basics of FOIL. Of course, the full FOIL system has many bells and whistles, and there are many issues in this type of relational learning that we're not going to deal with. Here's an example for you to think about during the break or at home tonight. It's a very simple example, just to make things concrete, because we've talked about a lot of very abstract things so far. Here's a graph: the simplest interesting example of a problem where relational learning gives you power that propositional learning didn't. These are nodes in my graph that represent whatever entities you want, and arrows represent the relation linksTo. So let's just say this is a very small segment of the web. And if I were to represent this as a database, it would be a database of facts like linksTo(0, 2), linksTo(0, 1), and so on. So this is my data. And now let's say that what I want to learn is the concept canReach: canReach(X, Y) means that I can reach Y from X by following the arrows. So, for example, can I reach 3 from 0? Yes, by way of 1 and 2. Can I reach 0 from 3? No, because I would have to go against the arrows. So I have a database of links and I want to learn the concept canReach. Simple example. So my data is, first of all, my pairs of nodes.
And then, for each pair, whether they link to each other or whether they don't. Sometimes we leave the negatives implicit; sometimes we make them explicit. This is what's called the closed-world assumption in databases: everything that is not in my database is assumed false. So with the closed-world assumption I wouldn't need to state the negatives, but let's suppose for clarity that I do. The downside of stating them, of course, is that a lot of the time it's wasteful: imagine a database of flights that listed all the pairs of cities that don't have a flight between them. That would be really inefficient. But in this simple example, let's suppose that you actually have them, along with the corresponding adjustments to the evaluation criteria. So now, what are the hypotheses that I'm going to make? They're going to be rules that have canReach as the consequent. This is what I'm trying to predict. And what can they have in the antecedents? What literals, or what predicates, can appear? So I'm trying to predict canReach using linksTo. For example, I can reach B from A if A links to something and that something links to B. So I'm going to try to build rules that predict canReach in terms of linksTo. Anything else? Think back to the ancestor example; let me put it back up. I was predicting ancestor in terms of what? Parent, and what else? Ancestor itself. That's the fun part of this: the recursive rule. As always, there's a lot of power in recursion. Again, in the propositional case this made no sense whatsoever; I can't predict the class in terms of itself. But now, because it's the same predicate with different arguments, I can predict ancestor(X, Y) in terms of ancestor(Z, Y). Right? So in our canReach example, I want to predict canReach in terms of what? linksTo and canReach.
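The closed-world assumption just described can be sketched in a couple of lines. The particular `links_to` facts and node set here are hypothetical illustration data, not the graph on the slide.

```python
from itertools import product

# A tiny linksTo database: facts stored as a set of (source, target) pairs.
links_to = {(0, 1), (1, 2), (2, 3), (0, 2)}
nodes = {0, 1, 2, 3}

# Under the closed-world assumption, every pair NOT in the database is
# taken to be false.  Materializing those negatives explicitly:
negatives = {(x, y) for x, y in product(nodes, repeat=2) if (x, y) not in links_to}
```

With 4 nodes there are 16 ordered pairs; the 4 stated facts leave 12 explicit negatives, which shows why you normally leave the negatives implicit.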
In fact, you've probably already noticed that there's a nice homomorphism between this problem and the family problem. What is the analog of linksTo here? Parent. And the analog of canReach is ancestor. Right? And in fact, you can squint at a lot of different network data sets, social networks, right? There's the friends relation, the famous social graph, friends(X, Y), friends of friends and all that good stuff. You can represent it all this way and learn the corresponding rules. I don't know if Facebook is using this technology yet, but maybe they will if they're not. So the hypothesis space is rules, also known as Horn clauses, using the predicates linksTo and canReach. And so the exercise for you, which at this point is probably not very difficult at a high level, is to figure out what would be learned by FOIL, and also how. You can probably figure out very quickly at this point what needs to be learned here, but the question for you is: how would FOIL arrive at it? And also, what might be some of the difficulties, and where do FOIL's heuristics come in? But in the meantime, we should have a break. It's 8:06, so let's reconvene at 8:16, when we will look at the second major type of first-order rule induction. Let's get going again. Right, so far we've seen how to do propositional rule induction and we've seen one way of doing first-order rule induction, which is very powerful, and yet we got there by easy steps: first we figured out how to learn decision trees, then from that we figured out how to learn propositional rules, and then we figured out how to lift that, as it's called, to first order. But now what we're going to see is a completely different way of learning first-order rules. And this is actually a very interesting, and in some ways a very natural, way to do this, and it's the following.
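For the exercise, the target concept can be checked directly. The two clauses written as comments below are the kind of definition the exercise is driving at (one base clause and one recursive clause); the recursive Python check and the small `links_to` graph are a hypothetical sketch, not FOIL's output.

```python
# Target definition, as two Horn clauses:
#   canReach(X, Y) :- linksTo(X, Y).
#   canReach(X, Y) :- linksTo(X, Z), canReach(Z, Y).
links_to = {(0, 1), (1, 2), (2, 3)}   # hypothetical chain 0 -> 1 -> 2 -> 3

def can_reach(x, y, visited=None):
    visited = visited or set()         # guard against cycles in the graph
    if (x, y) in links_to:             # base clause
        return True
    for (a, z) in links_to:            # recursive clause: linksTo(x, z), canReach(z, y)
        if a == x and z not in visited:
            if can_reach(z, y, visited | {x}):
                return True
    return False
```

Following the arrows, 3 is reachable from 0 but not vice versa, matching the example in the lecture.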
You know how, in calculus, we define integration as the inverse of differentiation, and then we figure out how to do integration? This actually happens in mathematics: we define an operation as the inverse of another operation because we want to be able to invert it. Even square roots: why did people invent complex numbers? So they could always take the square root of a number. And this turns out to be very powerful. Well, how about we take inspiration from that and do something similar here? So we're interested in the problem of induction. This is what machine learning is all about: going from the specific to the general. What is induction the inverse of? Deduction. Deduction, precisely. Induction is the inverse of deduction. So how about this: how about we formulate induction as the inverse of deduction, in the same way that you can formulate integration as the inverse of differentiation? I would say this is a brilliant idea. It occurred to, I believe, Stephen Muggleton, a well-known researcher, back in the nineties or so. Right now this is just a very big idea: let's formulate induction as the inverse of deduction and take it from there. Let's make it just slightly more concrete. The classical example of deduction is: Socrates is a man; all men are mortal; therefore, Socrates is mortal. Yay, right? Sorry for boring you again with this, but it's our starting point. This is deduction: I went from the general, all men are mortal, and one particular thing that I knew about Socrates, which is that he's a man, and from that I deduced that he's mortal too. Deduction is nice. Most of computer science is really doing deduction in various ways. Deduction is something that you can really trust, as opposed to this weird thing called induction that I'm trying to sell you and you're not sure yet if you're buying. So what would be the induction version of this problem? It would be something like this.
Socrates is a man and Socrates is mortal. Plato is a man and Plato is mortal. Aristotle is a man and Aristotle is mortal. Maybe what? Maybe all men are mortal. Once you see enough of those specific cases, your brain, which is really good at this, immediately goes: aha, there's a general rule here. All men are mortal. And then, once you've acquired that rule, along comes some completely different philosopher, take your favorite philosopher, or in fact your favorite person, but let's stay with philosophers for the sake of fun, and you go: aha, Leibniz is a man, therefore Leibniz is mortal. So from the specific examples you generalized by induction, and from that general rule you then did the deduction. Notice how they're inverses: induction went from specific facts about specific people to a general rule, and deduction does the opposite, taking the rule back down to some other specific people. So this is the way we're going to try to do induction here. A brilliant idea, if you ask me. But now let's see if we can make this work. And fair warning: this is going to be probably the most abstract part of the whole class, but bear with me and I think it will be highly worthwhile. So let us try to make this a little bit more formal. Here's a formula defining what we want to do, and we're going to be playing with this formula a lot, so let's take it one step at a time. Induction is finding a hypothesis H from a hypothesis space. What do we want that hypothesis to do? Think back to the analog of integration: what do we want the integral to do? If I differentiate a function and then integrate, I should get the same function back again. That's what the definition is doing. What is induction trying to do? You can think of induction as providing the missing link.
I know that Leibniz is a man and I want to conclude that he's mortal. What is missing to go from one to the other? That is what I'm trying to induce. So this gobbledygook here is really just a more formal statement of the same thing. Let's say that x_i is my i-th training instance, my i-th philosopher or my i-th person in the family; f(x_i) is what I'm trying to predict, like whether they're mortal or whether they're somebody's ancestor or whatnot. And B is a new thing. B is all the knowledge that I already have: I already have a knowledge base with other formulas, first-order logical statements of things that I know about people and about mortality and whatnot, right? Remember, we said early on, and we saw this very clearly, that knowledge is essential to induction. The more knowledge you have, the better induction you can do. If you have no knowledge, you're pretty much hosed. Here, for the first time, we're going to make that knowledge explicit. We're going to call it B, my background knowledge. Let's say, for example, I already know that mothers are parents and fathers are parents. And now these x_i's and these f(x_i)'s are basically my examples: I give you x_i and I want you to tell me whether f(x_i) is true or not, okay? So now here's what I want my hypothesis H to do. I want that for every example and its class in my data, where D represents my data set, in general a database, the background knowledge, together with my hypothesis and the example, allow me to derive, this little turnstile symbol means derives, allow me to logically prove, f(x_i). You see what I'm saying here? The reason I need the hypothesis is that the background knowledge and the information about the example are not enough to get to my conclusion.
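Written out symbolically, the condition being described is (using the turnstile ⊢ for "derives"):

```latex
\forall \langle x_i, f(x_i) \rangle \in D:\quad
(B \wedge H \wedge x_i) \;\vdash\; f(x_i)
```

That is, for every example in the data set D, the background knowledge B, the hypothesis H, and the example description x_i together must logically entail the example's classification f(x_i).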
Yes, I know that Leibniz is a man, and I know all sorts of other things, but I'm not able to show that he is mortal. I want a rule that says that. This is what my H is going to be. If, by a fortunate accident, just using my background knowledge and the information in the database, I could already correctly predict that this person was mortal, well, then I wouldn't need a hypothesis; the hypothesis could be the empty set. So this is the general agenda of what we're going to do here. Notice that what I have here is a kind of deduction: it says if I know this and this and this, then I can infer that. In deduction, I go from B and H and x_i to f(x_i). The sense in which we're going to be inverting deduction here is that I'm going to go from f(x_i) and x_i and B to H. This really is, in a mathematical sense, an inverse operation. And it encapsulates very neatly this whole idea that we've been developing: in machine learning, the program, instead of being an input, is now the output. That's what our H is here: the program that we're trying to induce. Okay? Any questions? Okay, very good. So we've defined formally what we want to do. Here's a concrete example; nothing like a concrete example to keep things clear. Let's suppose that what we want to do is induce the concept of child: for example, that the child of Bob is Sharon. So what is our concept? It's over pairs of people: pairs of people U and V such that the child of U is V. And one such example could be child(Bob, Sharon). So this is my f(x_i): I want to learn to predict that somebody is the child of somebody else. Obviously this is a very simple example, but it's enough to illustrate the general idea. What is my x_i? Well, my example is a pair of people:
Bob and Sharon. So my x_i is everything that I know about them that might be relevant to predicting this. Let's say I know that Bob is male, I know that Sharon is female, and I know that the father of Sharon is Bob. And let's say that in addition we have some background knowledge, like: the parent of U is V if the father of U is V. This is true: if somebody is your father, they are also your parent. Okay? So first of all, let's do this in our heads. I have x_i and B. Are you able to infer child(Bob, Sharon) from these things, or not? Actually, first, just you as a person: if I gave you this information, even without the stated background knowledge, just using the background knowledge that you have about family relationships, I tell you that Bob is male, Sharon is female, and the father of Sharon is Bob. Is it the case that the child of Bob is Sharon? Of course, right? Common sense. We know that. But the question is whether the program knows that. Does the computer know that? Think about what you used to arrive at that conclusion: you used the fact that if X is Y's father, then Y is X's child. You knew that, but the computer doesn't. And we know what the computer knows: the computer actually knows something meaningful and potentially useful, which is that parent(U, V) if father(U, V). But that's not enough. From male(Bob), female(Sharon), father(Sharon, Bob), and parent(U, V) if father(U, V), you cannot deduce child(Bob, Sharon). And that is the problem: I'm missing some information to make that deduction possible.
So our goal in induction is going to be to figure out what we need to add to the knowledge base to make the deduction go through. Any questions so far? So, what did you know that the computer didn't? What could you add to this knowledge base that would actually let you infer that the child of Bob is Sharon? We've already seen one candidate: if father(U, V), then child(V, U). Right? There's also another one, and it's sitting right here: the same thing, but with parent. Notice that both of these are true. And do both of them allow you to infer child(Bob, Sharon)? Yes or no? Yes. If I know that father implies child, then since I have the father fact here, I can get the child fact here. This is the easiest route, okay? But, and this again is going to be a general theme, that is not the only way. There's going to be more than one way. If I know that parent implies child, then, because I also know that father implies parent, I can say: father, therefore parent, therefore child. Okay? So there's more than one way, which means that my deduction has more than one inverse. That shouldn't surprise anybody, because the same thing is true of integration: when you integrate something, the answer is only determined up to a constant, and that constant could be anything. So there's more than one answer. Similarly, there's more than one square root of minus one: there's plus i and there's minus i. So again here, there's more than one thing that you could add to your knowledge base to make the deduction go through. And we're either going to choose between them, say by preferring the simplest, the one that makes the knowledge base smallest, or we add all of them. Typically, we use the former.
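Both candidate hypotheses can be checked mechanically with a naive forward chainer. This is a hypothetical sketch (not any real ILP system): facts are tuples, and each rule is a (body, head) pair over variables U and V, where the head may swap the arguments, as in child(V, U) ← father(U, V).

```python
# Facts from the lecture's example; father(Sharon, Bob) reads
# "the father of Sharon is Bob".
facts = {("male", "Bob"), ("female", "Sharon"), ("father", "Sharon", "Bob")}

background = [(("father", "U", "V"), ("parent", "U", "V"))]  # parent(U,V) <- father(U,V)
h1 = [(("father", "U", "V"), ("child", "V", "U"))]           # child(V,U)  <- father(U,V)
h2 = [(("parent", "U", "V"), ("child", "V", "U"))]           # child(V,U)  <- parent(U,V)

def derive(facts, rules):
    """Apply every ground instance of every rule until nothing new appears."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for (bpred, *_), (hpred, a1, a2) in rules:
            for fact in list(known):
                if fact[0] == bpred and len(fact) == 3:
                    u, v = fact[1], fact[2]
                    head = (hpred, v, u) if (a1, a2) == ("V", "U") else (hpred, u, v)
                    if head not in known:
                        known.add(head)
                        changed = True
    return known

# Both hypotheses make the deduction go through:
assert ("child", "Bob", "Sharon") in derive(facts, background + h1)
assert ("child", "Bob", "Sharon") in derive(facts, background + h2)
```

With h2 the chainer takes the longer route, first deriving parent(Sharon, Bob) from the background rule and then child(Bob, Sharon) from it, exactly the "father, therefore parent, therefore child" path above.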
Notice also the following important thing. Did I use, in this deduction, the fact that Bob is male? Not really; I never used it. Did I use the fact that Sharon is female? No, not either. And as always, what this means is that my database is going to have a lot of information that is irrelevant to the problem at hand, and a big part of the job of the machine learning system is to get rid of it: figure out what's relevant and what's not. It would be a rare day on which all the information that we have is relevant to the problem at hand. Questions? All right, so this is what we're going to try to do. Here's the statement of the problem. What we have in computer science and logic and whatnot is deductive operators, and we can express them in the following format: if I have A and B, I have a function F that produces C. Think of A as the fact that Socrates is human, B as the fact that all humans are mortal, and C as the fact that Socrates is mortal. That is deduction: F is the operation that transforms A and B into C, and in logic we write that as A, B derives C, okay? These two things are really just different ways of saying the same thing. And our goal in the rest of this class is going to be to figure out inductive operators that do the opposite: an operator, by analogy with this one, such that you give me the B and the D, and I get out an H such that B and D and H now allow me to derive the f(x_i)'s. Okay? So again, we're looking for a function, except that now it's an inductive step being done as opposed to a deductive step.
All right, so before we go into this, let us see what the pros and cons of doing it are, because there are some very important pros and some very important cons, and you need to be aware of both of them. The first obvious pro is that the setting we've been talking about up until now, including decision tree induction, propositional rule induction, FOIL and whatnot, is really what you get in this framework when your initial background knowledge is nothing, or at least only very weak background knowledge that is not made explicit in declarative form. So if we take what we're doing here with no background knowledge, it just becomes the case that we had before. This is nice because it generalizes what we've already been doing. But it also brings in something very important and new, which is that now I can make my knowledge explicit. It's my background knowledge B, what I really know: I actually have a list of formulas stating what I know, like that mother implies parent, and so on and so forth. So I can make that knowledge explicit, and therefore I can do things with it that I couldn't when it was implicit, as in the case of learning a decision tree. And, of course, I have a lot more flexibility: I can put in knowledge that I have, which I couldn't do in a decision tree, and use it to help my induction, having the induction complete the knowledge and build on it. Moreover, this actually suggests a new type of learning algorithm. What I'm going to do is look for H not blindly, not just by some greedy search that's trying to hit the right things in the dark; I have all that background knowledge to guide my process. When you learn, you do not learn from scratch. God forbid that you always be learning from scratch. Imagine being in this class as a six-month-old baby. That wouldn't work.
You'd cry a lot, and that would be it. You're always learning by building on top of the knowledge that you already have. The methods that we saw before are basically the analog of the newborn baby: they have something in their brain, something very important, but they really don't know much yet. Now we can actually be guided by our knowledge. So those are the advantages. There are also some important disadvantages, which it is important to be aware of. The first, and a very important one, is that this doesn't allow us to deal with noise. Noise is very important in the real world: some of my examples are misclassified, somebody mistyped something, my sensor reading was noisy, this email was classified as spam but it shouldn't have been; you just pressed the wrong key or something, okay? Deduction does not like noise. Deduction is all about complete precision: everything is either true or false and you say exactly what you mean. So these methods, at least as they stand, cannot deal with noise. There are, of course, generalizations and variations that can deal with noise. We're not going to touch on them here, but they add complexity. So this is one minus. The other minus, which has probably already occurred to you, is the following. Remember when we said that large hypothesis spaces are good, because they mean you can learn lots of different things? You want the truth to be in your hypothesis space. On the other hand, large hypothesis spaces are bad because they're expensive to search and they really increase your chances of overfitting. And the hypothesis space that we're looking at now is huge. First of all, not only is it first-order instead of propositional, which is already a huge jump, but it's also going to involve playing around with all this knowledge.
So I have a much bigger combinatorial explosion than I had before. The bottom line is that these methods are much less scalable than plain propositional decision tree and rule induction. And that's probably the biggest reason why they are not as widely used, okay? At this point, in the current state of the art, these things do not scale to, say, Facebook. I don't think Facebook is using this, because they want to mine a network of a billion friends and this stuff doesn't scale to a billion friends. But it might scale to millions. So for some problems, you can use this. Moreover, there's a trade-off between knowledge and data. If all you have is a lot of data, but not a lot of knowledge, then this is probably not the right choice for you; use one of those fast propositional methods. On the other hand, if what you have is a lot of knowledge, but not a lot of data, those other methods can't make use of the knowledge and the data alone is not enough, so this stuff might be exactly what you need. And to take an example of an area where this has been applied with great success: drug design. Molecules, you can think of them as graphs. I want to know which molecules, for example, cause cancer, or which molecules actually stop cancer; I can represent them in this form. I often don't have a lot of data, because each data point, this molecule is carcinogenic, or this molecule binds to this virus, is very hard to come by: it requires a lab and experiments in the lab. On the other hand, there are reams of biology papers and textbooks out there full of biology knowledge, and I can write that knowledge down: this is what makes a molecule carcinogenic; this is what I know about organic chemistry. The knowledge doesn't have to be directly about the task. So in those cases where I don't have a lot of data, but I have a lot of knowledge, this might be what you want to use.
All right, so let us then look at how to do this. And we're going to use exactly the same strategy that we used before: we're going to start with the propositional case, because that's the simpler case. If we don't know how to do induction by inverting deduction for propositional rules, we certainly don't know how to do it for the first-order case; it's too hard. So we're first going to solve the propositional case, and then we're going to generalize it to the case where we have predicates and arguments and all that stuff, just like we did before. Okay? And the place to start, of course, is with what deduction is, exactly. Now, deduction can be done in many different ways, but usually it's done using what's called the resolution rule. How many people here have heard of this stuff? Resolution? Just so I can calibrate: if you've heard of this, raise your hand. Okay, very good, nobody has heard of it. That is the second-best case. The best case is everybody has heard of it and we just breeze through. The second-best case is when nobody has, because that way nobody's bored. So hopefully nobody will be bored here; or at least, if you're bored, it's because I'm a bad teacher. So what is resolution? Resolution is how most theorem provers work. When you hear about computers proving new theorems automatically, without help from people, or about proofs being checked automatically, well, there are lots of things that use theorem proving. For example, software and hardware verification are often done using theorem proving: you prove that the circuit does what you want. What are those guys using when they do that? I mean, there are many things that you can use, but resolution is probably the single most widely used. And here's resolution for you, in one very simple little table.
And this table you can read like you read an addition: 3 plus 4 equals 7. So what I have here is that from this and this, I can deduce this. So what are these things? What I have here is just the simplest case. First of all, each of these things is what is called a clause. A clause is just a disjunction of literals, and a literal is just a statement or its negation. P, for example, could be the statement "it's sunny today", and L could be the statement "we're in Seattle". You could also have the negation of that: we're not in Seattle. So you could have something like not P, or not L. And a clause is just a disjunction of those things. Clauses are useful to us because whatever you have in first-order logic can always be converted to CNF: clausal normal form, or conjunctive normal form. Okay? So let's assume that everything that we have is in the form of clauses; it makes life much simpler, more uniform. And here's what the resolution rule says. This is not the general resolution rule, but a simple case for us to get a handle on it. It says that if I have, in my knowledge base, the clause P or L, where P and L are just Boolean variables, and I also have the clause not L or R, then I can infer the clause P or R. Right? So these two are what I had before, and this is what I have after I apply resolution. Another way of saying this is that P or R is a deductive consequence of P-or-L and not-L-or-R: if I know those two things, it follows with complete certainty that P or R. Now, why is that the case? Can anybody tell me? What is the basic insight behind resolution? Let me give you a hint. The basic thing that logic commits to is that every statement is either true or false. It can't be both, and there's nothing else it could be. So every one of these literals here is either true or false. Okay? So, in particular, here's the hint.
Let us focus on this literal L. This is an interesting literal, because it's the one that appears in both of the clauses, and we know that L is either true or false. So what can we tell about P if L is false? P must be true: one of P and L is true, L is false, so P must be true. And what can I tell about R if L is true? R must be true, because not L is false, therefore R must be true. Okay? So, to recap: if L is false, P must be true; on the other hand, if L is true, then R must be true. And what do we know about L, by logic? That it's either true or false. Therefore, what do I know about P and R? One of them must be true, because either L is false, in which case P is true, or L is true, in which case R is true. This is what the resolution rule is all about. Very simple but very powerful: any inference that you can do in logic, you can do using just resolution and nothing else. Question? [A student asks whether you can write this as "not L implies P".] Yes, absolutely. This is a very good point. So often, when you're looking at these clauses and trying to figure out exactly what they mean, it's good to remember that the clause not-A-or-B is another way of writing A implies B. In my knowledge base, what I usually have is things of the form A implies B: if you have this symptom, then you have this disease. But when I convert that to clausal form, it becomes not A or B. Because remember, an implication is true if the consequent is true or if the antecedent is false. If the precondition is false, then the implication is true regardless: if you don't have the symptom, then whether or not you have the disease, the rule is valid, the rule is true. Okay?
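The resolution step just argued for can be sketched in a few lines. Representing a clause as a frozenset of literal strings, and marking negation with a leading `~`, are conventions chosen for this sketch.

```python
# Propositional resolution on clauses represented as frozensets of literals.
def negate(lit):
    """~P for P, and P for ~P."""
    return lit[1:] if lit.startswith("~") else "~" + lit

def resolve(c1, c2):
    """All resolvents of c1 and c2: a complementary pair of literals
    cancels, and everything that's left is unioned into one clause."""
    out = []
    for lit in c1:
        if negate(lit) in c2:
            out.append((c1 - {lit}) | (c2 - {negate(lit)}))
    return out

# (P or L) and (~L or R) resolve to (P or R):
assert resolve(frozenset({"P", "L"}), frozenset({"~L", "R"})) == [frozenset({"P", "R"})]
```

Note that `resolve` returns a list because two clauses can in general share more than one complementary pair, giving more than one resolvent.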
So you can look at this, for example, and say, oh, what I'm saying here when I say not L or R is L implies R, and so on. Okay? In fact, we're going to see an example of this in a little bit where the implication form is very intuitive. But to do resolution, we need to convert things into clausal form. Okay? So did everybody get what the resolution rule is? This is absolutely key to what comes next. Okay? So what is the general rule? The general rule goes like this. So how do I apply resolution? Right? I have a big knowledge base of stuff that I know. And now I have a question, I have a query. Right? Is it the case that Socrates is mortal? Is it the case that Anna is the mother of Bob? I want to know. Can you figure that out for me? And the claim is that you can figure that out using resolution, if you can figure it out at all. Right? So how would you do that? Well, you have to find a sequence of applications of the resolution rule that will take you from what you know to the question that you're asking, either yes or no. Okay? So how do we do that? Well, here's the idea. Notice that the key thing the resolution rule depends on is that I have a literal in one clause that appears negated in the other. Right? This is the thing that's going to bind them together into a result. The result, by the way, is called a resolvent. This last clause here is the resolvent. So here's the recipe. Given two clauses, C1 and C2, I find the literal in one clause that appears negated in the other. Right? Look through your clauses and try to find something that appears unnegated in one and negated in the other, and those two are a candidate for resolution. Okay? And then how do I form the resolvent? Right? If I have found that, what is the new clause that I can now form? Well, look at what happened in this example. Right? The two L's disappeared. They kind of cancelled each other out, colloquially putting it. Okay?
And then what's left is the rest. Right? The P became this P and the R became this R. Okay? So what is the recipe here? The negated literal disappears, and all the other ones get thrown together into one new clause. Okay? That's what we do. Here's the same thing just written as a formula. The resolvent C is going to be C1 minus the singleton set formed by L, union C2 minus the singleton set formed by not L. Okay? So we're thinking of each clause as a set of literals. Okay? Everybody see this? Why did we take the trouble to write this down formally? Because, hey, we're trying to invert this. Now that we've written it down like this, we can figure out how to do the inverse. Right? What is the inverse of this? So, deduction was going from C1 and C2 to C. Right? What is induction going to be? In deduction, we went from the two clauses to the resolvent. What is the inverse of this? Going from C and C1 to C2. Exactly. Now I have the resolvent and I have one of the clauses, and I want to figure out what the other one is. Right? Remember, I know that Socrates is mortal. And I know that Socrates is human. What I want to figure out is the other, missing thing that, combined with Socrates is human, would produce Socrates is mortal. Right? So to see this diagrammatically, actually, let me jump to the next page. But actually, here it is diagrammatically. Well, let's just do this actual example, and we're going to use it as a running example. The font is a little small, unfortunately. Hopefully you can read this; if you can't, let me read it to you. Right? So let's look first at a very simple example of deduction by resolution and then see how we invert it. Right? So here are the two clauses that I start with. C1 is pass exam or not know material. These examples are carefully designed to appeal to the student in you. We are very proud of this. This one, this one says know material or not study.
So let me ask you first, what do these clauses mean in English? Remember, going back to this notion that a clause is really just another way of saying an implication. What am I saying here? "If you know the material, you'll pass the exam. If you study, you'll know the material." Yeah. Right? Does anyone disagree with this? Very good. Very natural things to say. Right? And now, don't think about resolution for just a second. Just use your common sense. What can you infer from this? Is there another rule that you can infer from these two rules? A very important rule in your life as students? Right? You know that, sorry, it goes this way, oops: if you study, you will know the material. And if you know the material, you will pass the exam. Therefore, what can you say? If you study, you will pass the exam. Right? Aren't you learning amazing new things in this class? Yeah. So here it is. "If you study, you will pass the exam," written in clausal form, is pass exam or not study. Okay? So, hey, resolution and common sense agree. Okay? So this is our little example of deduction. Everybody got this? Okay, very good. But now what we want to do is induction. We want to do the inverse of this. What is the inverse of this? Well, deduction was going from this and this to this. Right? Induction is the inverse. In induction, I have the result, and I have one of my two clauses. I want to figure out what's missing. I want to figure out the second clause. Right? That's the goal here. So, over to you. How do we do that? Let's suppose for just a moment that I hadn't seen this yet. How do I figure out what needs to be there? This is really the key; once you've got this, everything else kind of follows. Right? So let's look at these two. Right? This clause was pass exam or not know material. Right?
And this clause was pass exam or not study. So what does this clause have to be? What has to be in it? "The not pass exam." Why is that? Very good, why is that? "From the rule before, you know that the other two, to combine to make the resolvent, take away pass exam. On the first slide." Right? Absolutely. So notice the following, right? Pass exam. Right? No, sorry, what did you say? You said this rule has to contain what? Oh, I see, the reverse. "It needs to contain know material, in the one that we're looking for, on the right. It definitely needs to contain know material, because the literal that's in both of them, the one negated, is the one that you remove, and you know that it's not in the final one, so it must be in the other one." Exactly. I actually jumped the gun and thought you were already saying what you're saying now. Right? So, very good. Notice the following important thing. Know material has disappeared. Right? Not know material is in C1, but it's not in C. Right? So how do we make a literal disappear? There's only one way. It's got to be negated in the other clause, because that's the only way a literal disappears: it appears in one clause, and in the other clause negated. Right? So rule number one is that the stuff that was in this clause but not in this one has to appear negated in the other one, otherwise this doesn't work; the resolution wouldn't do the trick. Right? Everybody agree with that? Okay, point one. Point two. Point two is actually the easier one in some ways. Right? What is the other thing that has to happen? What else needs to be in this clause? "Not study." Not study. Why? Hey, it just came out of nowhere. Right? Not study was not in C1, so it has to come from C2. Right? So there we have it. I now reveal the truth. You know, this is like scratching one of those lottery cards with a coin to see if you've won. Right?
This is high-tech at its best. Right? So here's our clause. Know material. Right? The negation of the not know material is here. So this is the literal that is actually going to allow the resolution to go through. Right? It appears unnegated in C2 and negated in C1. Or not study. This not study goes from this clause to this clause. Okay? So let us generalize from this. I have one of the clauses and I have the resolvent. The other clause needs to contain all the literals that appear in the resolvent but not in the first clause; they have to come from there. It also needs to contain the negation of the literal that disappeared on the way from the first clause to the resolvent. What happens if more than one literal that was in the first clause is not in the resolvent? The answer is very simple: nothing happens. That resolvent could not have been obtained by resolution from the first clause, at least not in one step. With multiple steps, maybe. Okay? All right. So, any questions about this example? This example really encapsulates the whole idea here. So here's our general and more formal statement. Here's my recipe for doing inverted resolution in the propositional case. Given my initial clauses C1 and C, where C is the resolvent (remember, C is the thing that I'm trying to come up with the rules to be able to infer), find the literal that occurs in the first clause C1 but not in the resolvent. Right? Look for a literal that has disappeared: not know material, in our case. Okay? And now, supposing I find it, I can form the second clause in the following way. The second clause is going to be C, the resolvent, minus C1 minus that literal, union the negation of the literal. Right? So let us parse this. This was a big mouthful. So let's see what happens here. C2 is going to contain everything that C contains that couldn't have come from C1. Right? This is the not study literal.
Okay, in our example. Right? So everything that is in C that wasn't in C1 has to have come from C2, so we need to put those things in there. Of course, excluding the L, because that one is going to actually disappear, and for a different reason. Right? And in addition, of course, C2 has to contain the negation of L to make sure that it cancels with L, so the resolution goes through and L no longer appears in the resolvent. Okay? Everybody agree? So here, in one nice little formula, is a very interesting idea for doing machine learning: doing induction by inverting deduction. In particular, by inverting resolution, which is the most widely used approach to deduction. Any questions? Very good. But of course, this was the easy part. First of all, if this were all we were going to do, we probably wouldn't have gained that much compared to what we already had. In fact, I don't think I've ever seen this type of induction used for propositional rule sets. For first-order rule sets, it's probably at least as widely used as the FOIL-style approaches, probably more widely used; it's definitely widely used there. So how do we take this propositional idea and go to first order? Okay, that's the last thing we're going to look at today. So first of all, let's see how we do first-order resolution. Right? We saw how to do resolution in the propositional case. The difference is that now, in first order, we have predicates and arguments, not just Boolean symbols. Okay? So how do we do resolution in this case? This is actually where things get more interesting. This, for example, is what gives Prolog the power of a full programming language: that it can do these argument combinations and replacements and come up with answers to non-trivial questions. So the big difference, as we saw, between propositional and first order is that now we have arguments. Right?
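Before things go first order, the propositional inversion recipe, C2 = (C minus (C1 minus {L})) union {not L}, can be pinned down in a short sketch on the exam example. The set-of-strings representation with "~" for negation is, again, just an illustrative choice:

```python
def negate(lit):
    # "~P" -> "P", "P" -> "~P"
    return lit[1:] if lit.startswith("~") else "~" + lit

def inverse_resolve(c, c1):
    """Given resolvent c and one parent c1, reconstruct the other
    parent as (c - (c1 - {L})) | {~L}, where L is the literal of c1
    that disappeared, i.e. is absent from c."""
    gone = c1 - c                  # literals of c1 missing from c
    if len(gone) != 1:             # must be exactly one for a one-step parent
        return None
    (L,) = gone
    return (c - (c1 - {L})) | {negate(L)}

c1 = {"PassExam", "~KnowMaterial"}   # know material -> pass exam
c  = {"PassExam", "~Study"}          # the resolvent: study -> pass exam
# Reconstructs {"KnowMaterial", "~Study"}, i.e. study -> know material.
print(inverse_resolve(c, c1))
```

Exactly as in the lecture: "not study" is carried over from the resolvent, and the vanished "not know material" comes back negated as "know material."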
And the difficulty, or the interesting twist, that arguments introduce is that now finding a literal and its negation in order to do resolution is no longer just a matter of having something like R and not R. Now we have something more like, you know, mortal of, sorry, human of Socrates. And on the other hand, we have something like human of x implies mortal of x. Right? By the way, I omitted the universal quantifier here because that's what people do in Prolog, but this just means that all humans are mortal. Okay? Now, notice what it is that you want to happen here in order for your inference to go through. What I would like to do is to infer from this that m of s. Right? But the problem is that this rule here doesn't even talk about Socrates, about s; it talks about x. So how do I make the connection? Well, the connection comes from the fact that this rule applies for every human, every x. So in particular, it applies to Socrates; it applies to s. So what I need to do in order for this to go through is to unify h of s with h of x. This is the technical term: unify. What we're doing here is called unification. I want to make these guys be the same. Why do I want to make them be the same? Because, remember, h of x implies m of x is the same thing as not h of x or m of x. Once I've turned these two guys into the same thing, now they resolve: one is the negation of the other. So what do I need to do in order to unify them? Well, gee, in one of them I have s, and in the other one I have x. So what I need to unify them is to replace x by s. This is called substitution. What I literally do is I substitute x by s. In general, I might need to substitute a whole bunch of things, so I'm going to have a set of these things. This is called a substitution. Right?
In the simplest cases, what it does is replace variables by constants, like the variable x by the constant Socrates, so that universal rules can apply to concrete people, or to specific instances in general. This is really where deduction does its thing. So I do the substitution in order to have two literals that unify, and then, once they unify, I can apply resolution just as in the propositional case. The power of this rule, of course, is that it's saying something in one short expression for all seven billion people. In propositional logic, if I wanted to say that people are mortal, I would have had to say this for every single person separately. Anna is human, therefore she's mortal. Bob is human, therefore he's mortal. This is ridiculous. In first-order logic, instead of seven billion statements, I just make one. But then I need my substitution in order for the variable to actually apply to the constant. Everybody follow this? Very good. So what does first-order resolution mean? I no longer just look for two literals where the one is the negation of the other. I have to look for two literals that I can somehow unify by finding the right substitution; for example, in this case, replacing x by s did the trick. And I'm going to represent that set of substitutions by theta. And L1 theta is L1 with that substitution applied to it. So L1 theta, for example, is L1 with every occurrence of x replaced by s. So, for example, this h of x implies m of x, after applying that substitution, becomes h of s implies m of s. And then at that point, if I have found two literals such that, with some substitution, one becomes the negation of the other, I can apply the resolution rule to those literals, subject to those substitutions. So that is going to go through only when x is Socrates. It obviously will not go through when x is Aristotle. Unless I also happen to have Aristotle in my database, but let's ignore that case.
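Here is a sketch of substitution and the simple variable-versus-constant case of unification, under an illustrative representation where a literal is a (predicate, arguments) pair and lowercase names are variables. A full unifier would also handle variable-to-variable bindings, function terms, and the occurs check, all of which this deliberately skips:

```python
def is_var(term):
    # Convention assumed here: lowercase names are variables.
    return term[0].islower()

def substitute(literal, theta):
    """Apply a substitution to a literal, e.g. ('Human', ('x',)) with
    {'x': 'Socrates'} becomes ('Human', ('Socrates',))."""
    pred, args = literal
    return (pred, tuple(theta.get(a, a) if is_var(a) else a for a in args))

def unify_args(args1, args2):
    """Find a substitution making two argument lists equal, or None.
    Only handles the variable-vs-constant case from the lecture."""
    theta = {}
    for a, b in zip(args1, args2):
        if a == b:
            continue
        if is_var(a):
            theta[a] = b
        elif is_var(b):
            theta[b] = a
        else:
            return None            # two different constants: no unifier
    return theta

theta = unify_args(("x",), ("Socrates",))
print(theta)                                   # {'x': 'Socrates'}
print(substitute(("Human", ("x",)), theta))    # ('Human', ('Socrates',))
```

With theta in hand, h(x) becomes h(s), the two literals match, and the resolution step goes through exactly as in the propositional case.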
And then, once I have that, everything works as before. So a quick way to say it would be: resolution in first order is just the same as in propositional logic, except with substitutions. So what is my rule now? I form the resolvent by including all the literals from C1 with the substitution theta and all the literals from C2 with the substitution theta, except for L1 with the substitution theta and not L2 with the substitution theta. Notice that, apart from the phrase "with the substitution theta" that I just repeated four times, this is exactly the same sentence as before. So here's my recipe. What is my resolvent going to be? It's C1 minus L1, with the substitution theta, union C2 minus L2, with the substitution theta. And again, this formula is the same as before, except with these new interesting little thetas added. Questions? All right, so finally, right, we're scaling the mountain and we're almost at the summit. Here comes the summit. The base of the mountain is wide, but the summit is thin, and indeed this summit is very thin; it's just this one little point. So what am I saying here? Right now you have the pieces; all you have to do is put them together. You know how to do propositional inverse resolution. You know how to do first-order resolution. And now we just have to figure out how to do first-order inverse resolution. So what's going to happen? Remember, I'm looking for C2, given C and C1. Right? So what's going to appear in C2? What literals must C2 contain? Think back to the propositional case; again, it's going to be very similar. Right? What is this whole part here doing? Let's start by just revisiting the propositional case, and let's pretend that the thetas aren't even here, okay? I have C minus, C1 minus L1. Right? This is what we had before: all the literals in C that didn't come from C1 have to come from C2. You agree?
Very good. But now comes the interesting part. So, by the way, I forgot to say: theta1 is the substitution that I applied to C1, and theta2 is the substitution that I applied to C2. Right? This is a new complication. I'm looking for C2, but when I finally have C2 and I apply it to get my resolvent, it's C2 with some substitution. So let's call that theta2. Right? So can anybody tell me what this theta1, theta2 to the minus 1 is doing here? Theta2 to the minus 1, of course, means the inverse substitution. Right? So first of all, what is theta1 doing here? "The normal substitution you applied." Exactly, right? This is what I applied to C1 to get to C. So that goes through in both cases, whether I'm doing deduction or induction. Right? So I took all these things in C1, and I put them through theta1 to get to C. And now it's compared to that that I'm going to see what needs to come from C2. Right? Very good. Everybody got this, or does everybody have questions? But now comes the most interesting part: what is theta2 to the minus 1 doing there? This is where you really see that this is an inverse operation that we're doing here. Right? I have some stuff that appears in the final clause that came from C2. But in deduction, I applied theta2 to C2 to get to C. So if I want to get from C to C2, obviously I have to apply the inverse substitutions. If I substituted x by Socrates to get here, now, to go the opposite way, I have to replace Socrates by x. Right? And notice, this is not some little detail. This is where the induction happens. Induction happens when I go from something that was about Socrates to something that's about any x, any human. Okay? So this theta2 to the minus 1 here is absolutely crucial. We're going to see an example of this shortly, you know, our same study, know material, exam setting. Okay?
All right, so can anybody tell me what the second part is here? What is this not L1 here? "It's the literal that's going to cancel with C1." Right, this is the negation of the literal that disappears from clause C1, just as before. Just as in the propositional case. Right? Everybody agree? And now, what is theta1, theta2 to the minus 1? Well, it's the same thing that happened on the other side. L1 got theta1 applied to it to get to C; but now, to get to C2, I need to apply the opposite of theta2, and then I get my C2. Okay? Everybody agree with this, at this very abstract level? Of course, I recommend that you go over this a little bit more carefully, you know, in your copious free time. Also, Mitchell has a very good explanation of this, so I encourage you to look at that. But in the meantime, just to make things clear and concrete, let's look at an actual example; one system that does this is called Cigol. So let's look at an example of what Cigol would do here. Right? So here's our family example. Actually, this is not the know material example; it's a much more interesting one. This is our original father-child example, and what we're going to try to do here is actually something more ambitious: we're going to try to learn the concept of grandchild. This example is going to let us actually show the full process, or something more like the full process, as opposed to just one step. Right? What we've been seeing so far is really only one step of resolution, and only one step of inverse resolution, but a lot of the power of this comes when you can go through a whole proof tree and have all these inverse steps. Right? Basically, what's going to happen in a real system like Cigol is that I have this proof that I would like to make. Right?
I want to prove this result, and now I'm trying to form the proof tree, but there are holes in it; there are things that I need to have to make the proof go through, and those are the things that I'm going to induce. Okay? So what we have here is a very simple proof tree, and we're going to see how we fill in the holes. So let's start on the left side here. So here are some of the facts in my knowledge base, okay? This is my data, some of the facts in my database. I know Father(Tom, Bob). Okay? This is a fact. And I know Father(Shannon, Tom). And I also know GrandChild(Bob, Shannon). Okay? And presumably I have a whole bunch of examples like this, but let's just focus on one here. What I want to do is infer from this the concept of grandchild. Okay? And let's just focus on the father case; there's also the mother case, but let's focus on father to keep things simple. Right? What is the rule there? Let us preview what needs to happen here. What is the rule for somebody being somebody's grandchild, as a function of father? A is B's grandchild if what? If A is C's child and C is B's child. Right? So, you know, Shannon is a grandchild of Bob. Why is that the case? Because Bob is the father of Tom, and then Tom is the father of Shannon. Right? Everybody agree? We know this. The question now is, how does the machine learning system figure it out? Well, let's see. Let's apply our first-order inverse resolution ideas. So this is my C1 and this is my C. Right? If this is my C1 and this is my C, what do we need to have in C2? Well, first of all, what has shown up in C that was not in C1? Grandchild. Right? Grandchild appears in C and did not appear in C1, so it must have come from C2. Everybody agree? Right? So here's grandchild, and let us not worry about the arguments just yet. So what is the other thing that has happened from C1 to C? Right?
Grandchild appeared, but something disappeared. What was it that disappeared? Father. Right? So what do we know about C2? What does it have to contain? It needs to contain not father, you agree? Okay, very good, so here's the clause: grandchild or not father. Right? But this is only half the story, and maybe not even the most interesting half of the story. Let us look at the arguments. Well, first of all, let us look at the people. Who appeared and disappeared between C1 and C? Well, let's see. Bob appears in grandchild. Bob appears in C, but Bob did not appear in C1, so where must Bob have come from? It must have come from C2. Right? And Bob appears as the first argument of grandchild, so it must have been the first argument of grandchild in C2. There are no two ways about that, okay? Everybody agree? Very good. Well, likewise, Tom has disappeared from C1. So we need something in C2 that negated that: we need something that negated father with Tom as the second argument. Well, here it is, father with Tom as the second argument. Okay? Now, what remains? What remains is the most important part. What remains is Shannon and x. Why couldn't I just let this clause here be grandchild Bob Shannon? Would that work? Right? This literal, grandchild Bob Shannon, appeared in C, so it must have come from C2. Well, if in C2 I just put grandchild Bob Shannon, does that do the job? Well, let us not worry about induction for just a second. The problem we are trying to solve here is: I need the literal grandchild Bob Shannon to be in clause C2 so that it appears in clause C. And let's say that I just put that literal in clause C2. That works, right? So there's nothing wrong with clause C2 being grandchild Bob Shannon or not father Shannon Tom, because, remember, x is the same on both sides. Right? So why did I do this?
Why did I replace, that is, why does my theta2, or rather my theta2 to the minus 1, actually replace Shannon by x? Why did I do that? I didn't have to. Why did I do that? At the same time, I'm free to do that, right? Because if this is the substitution, if I have grandchild Bob x and then I replace x by Shannon, I get grandchild Bob Shannon, and everything goes through. Right? So this also works. But why did I go to the trouble of replacing Shannon by x? "Grandchild Bob Shannon is not really interesting." Precisely. Remember, I'm trying to induce general knowledge here. I'm really not interested in just understanding who in this family is whose parent. I want to know the general rule for what makes somebody somebody's grandchild. Therefore, I want to go from specific people to statements about all people, and that's why I deliberately replace Shannon by x. This is what's giving us the generalization. Remember, inversion often has a lot of options. We're seeing a place where we have many options, and so what we do is pick the option that we like best. Not all inductions are equally good. I want to do more here than just make this go through; I want to learn the most general and accurate knowledge that I can. Any questions? You can satisfy yourselves that this all works by just now taking this as an ordinary problem of deduction and saying: if I have this C1 and this C2, do I get C? And clearly you do. But you can go through this as an exercise at home. So far so good? Okay, but now there's one more step. By now, though, you've basically got the hang of it; it's just a more complicated clause. So now let's see. Now we're going to go to this part of the proof tree, and now what happens? Well, grandchild appears here, and it does not appear here, therefore it must have come from here. Very good. And again, look at this thing. Right?
I had grandchild Bob x, but now I have grandchild y x, because I replaced Bob by y, because, again, I'm trying to generalize. And now notice that the literal father actually does appear here and here, but now I have to look at the arguments and the substitutions very carefully and see: look, what I have here is father Tom Bob, and what I have here is not father x Tom. There is no way to unify these two. You know, why do I have two fathers here, right? Why couldn't I just use the father from this side? Well, notice that the father here has Bob as the second argument, and this father has Tom. There's no way to unify these two. This one is talking about Bob being the father of Tom, and this one is talking about Tom being the father of somebody. There's no way to equate them. Okay, so again, what has to happen is that this one has disappeared, so I need to have its negation over there to make it disappear; and this one here has to have appeared, so I need it here. At this point things are getting pretty complicated, and again I suggest that you go through this at home, but hopefully the general idea should be clear. I have put into this clause all the stuff that is in this one but couldn't have come from here; and the stuff in this clause that disappeared, I negated it and put it in here; plus I took all my constants and generalized them to variables, because I want the general rule. And lo and behold, I have induced the definition of grandchild. Right? Again, if you turn this into implication form, it says that if father x z and father z y, then grandchild y x. Bingo, victory! Okay? And of course, the beauty of computers is that they will do this thing for you millions of times in seconds without complaining.
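As a sanity check, the induced rule really does reproduce the training fact. A tiny sketch, again with an illustrative (predicate, arguments) representation, that grounds the variables of "GrandChild(y, x) if Father(x, z) and Father(z, y)" over the constants in the two facts:

```python
# The two facts from the lecture's knowledge base.
FACTS = {("Father", ("Tom", "Bob")), ("Father", ("Shannon", "Tom"))}

def apply_rule(facts):
    """Ground x, z, y over the constants appearing in the facts and
    collect every GrandChild(y, x) licensed by the induced rule."""
    consts = {c for _, args in facts for c in args}
    derived = set()
    for x in consts:
        for z in consts:
            for y in consts:
                if ("Father", (x, z)) in facts and ("Father", (z, y)) in facts:
                    derived.add(("GrandChild", (y, x)))
    return derived

# Derives exactly the training example GrandChild(Bob, Shannon),
# i.e. Shannon is Bob's grandchild.
print(apply_rule(FACTS))
```

The only binding that satisfies both body literals is x = Shannon, z = Tom, y = Bob, which is precisely the chain Bob is the father of Tom, Tom is the father of Shannon.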
So, you know, we have this tiny little proof tree here just to understand the concepts, but then you run Cigol and it chugs for a day and spits out all these great rules about, say, what makes a molecule carcinogenic. Good deal. Now, truth in advertising: it's not that simple, more things have to happen, but at a high level this is what's going on. Okay, questions? Okay, very good. It's late, we're almost done; let me just mention one little thing: Progol. When you apply these systems like Cigol in practice, you get a huge combinatorial explosion, because there are all these ways of taking literals and combining them and of finding resolvents and whatnot, so you need something else to control the search. One very popular thing, and this is what Progol does, is, actually, kind of the opposite of what I just showed you here: it first tries to find a specific rule, not the most general possible rule, but a specific one that accounts for the data, and then it does the search based on that rule, only within what that rule permits. Okay? And this is the usual trade-off with overfitting. In some ways we're less powerful, we might have fewer things that we can induce; on the other hand, I have really helped to control my search, and now it's much more efficient and I'm much less likely to overfit. Okay? Again, I'm not going to go into the details of that here, but it's important to realize that doing this in practice is not just applying something like Cigol out of the box; there are things like Progol and systems like Aleph and whatnot; there's a lot of stuff in this area. Okay, all right, any questions about any of this? Thank you for your persistence and alertness. The good news is, this was the hardest lecture in the whole class. We will see other things that are hard in other ways, but in some ways this was the hardest. And also, we've seen how to take rules to the first-order level. You can take anything that we're going to
see in the following weeks to the first-order level. We're not going to talk about how to do that here, because there's no time for everything, but at least now you have a sense of how machine learning can be powerful enough to induce programs from data, not just, you know, a simple spam classifier based on words and whatnot. Okay, so remember that we have no class next Wednesday because I'm traveling. What we're going to do, you must be gluttons for punishment, and we are certainly gluttons for meting out punishment, is record next week's class here tomorrow. Probably most of you can't come, and that's fine; what that means is that you can just watch the class anytime you want, on your own time, including, you know, next Thursday at 6:30, or you can come to the class here, or attend remotely if you want. Okay? And after that the class is going to be on the web. Right? No, no, sorry, yes, sorry, I forgot to say: tomorrow it's at four. I sent an email to the list about this. Did everybody get it? Yeah? So unfortunately we couldn't do it at 6:30 because the room was not available. "Did it go to the mailing list?" No, this went to the course list. So if you have not subscribed to the course list, please do, because we send these announcements there. Anyway, assuming you missed that message, just to recap: we're going to record next week's class tomorrow, here, at 4 p.m., and you can come here, or you can watch it remotely, or, you know, at Microsoft, you can just watch it anytime next week. Okay? All right, and in the meantime, have fun with your projects.

All right, welcome back. Long time no see. Today we're going to talk about instance-based learning. Instance-based learning is just about the simplest kind of machine learning there is. It comes in a number of different flavors. The first one we're going to look at is probably what you could
consider the oldest machine learning algorithm and certainly the simplest it's the famous canyers neighbor algorithm and then we're going to look at other forms of instance-based learning and we're going to conclude with look at collaborative filtering collaborative filtering is these days probably the most well known application of instance-based learning and it's what your second assignment's going to be about you're going to develop a collaborative filtering system and apply to the famous Netflix data it's too late to enable in dollars but you know you might win the next month as i mentioned there was someone in this class who didn't know any any machine learning before and you know very interested in it then went on to be you know a member of one of the two top scoring teams so first of all what is instance-based learning and why is it so simple and and potentially so fast the idea in instance-based learning is to do as little work as you can at learning time it's also known as lazy learning lazy learning in this space line got not exactly the same thing but at this point we can think of them as the same thing the idea in instance-based learning is that what i'm going to do when i see the data is nothing right if you think about it that's the fastest algorithm you can possibly imagine right it's always zero you do nothing you've never seen an old zero algorithm before you're never going to see one again this is the only old zero algorithm that you'll ever see so what i do with my database of you know training examples you know x size and their corresponding classes is nothing i just leave them there on the disk and hope for the best right this is like not studying for your class and waiting till the you know final comes and then when the final comes of course at that point you scramble right you didn't study you don't know anything so what do you do how do you quickly figure out what the answer should be right well the idea in this neighbor as the name implies 
is that when a new query instance comes along, say I have a new patient and I want to decide what this patient has, tuberculosis or whatever, what do I do? At that point I just quickly scan through my database and try to find the closest patient, in terms of their symptoms and tests, to the one I'm seeing now, and then whatever the prediction for that item was, I make for the new item. If you think about it, this is not a bad heuristic: if you find two patients whose records are pretty similar, there's a good chance that the diagnosis will be the same. This can actually work surprisingly well. So if my nearest neighbor is x_n, all I do is estimate f of x_q (notice the hat here, which says it's an estimate) as f of x_n. A very, very simple thing.

This algorithm as it is doesn't always work that well, but there's one very simple change that makes it work a lot better; in fact, it makes it competitive with a lot of much more sophisticated algorithms. And that is simply to use the k nearest neighbors instead of the single nearest one. So what do you do? Again, at learning time you don't do anything; it's still O(0). What happens now when you see a query is that you find the k nearest neighbors, say the three or five nearest neighbors. Each one of those has a class, and then you just vote: if one class is the class of three of the five nearest neighbors and another is the class of two, well, the first one wins. If what we have is a regression problem, where we're trying to predict a continuous variable instead of a discrete one, then what might we do? I find my five nearest neighbors, I get five numbers, and I want to predict a new number for this example. What do I do? You could fit some formula based on regression? Well, you're getting way too fancy for the simpletons here. What's an even simpler thing you can do? I want to predict how much you'll buy from my catalog, and I know that your four closest neighbors each bought a certain amount of dollars, so what can I predict for you? Weight by the distance? Even that is still too fancy; you've almost covered the whole lecture. Yes, we could weight them by distance; the closest ones probably mean more. But if I just tell you your five nearest neighbors have these values, what would I predict for you? The mean? Yeah, just the mean, just the average. That's a perfectly sensible thing to do. Now, the average could be weighted, and we'll talk a little bit about that later, but the basic idea is that for discrete problems k-nearest neighbor just votes, and for continuous problems it just averages. Fairly simple.

Okay, so as with all methods, it's good to bear in mind what the advantages and disadvantages of this are. So what are the advantages of nearest neighbor methods, and instance-based learning in general? Well, obviously one key advantage is that training is very fast. Truth in advertising: once we get to some of the fancier versions of this, that speed at training time can disappear, but at least in the basic version, training takes no time. Another important advantage, which is a little more subtle, is the following: learning a very complex target function can be very hard, but with these methods you can actually learn a very complex target function very easily, because when you memorize those examples and then find the nearest ones at test time, implicitly you could be learning a very complex function. We will illustrate how this happens and look at it in a little more detail in a while, but for the moment keep that in mind. Another thing that is sometimes an advantage, compared to something like, say, a decision tree or a set of rules, is that with these methods, or at least with the most
basic version of them, you don't lose any information. With a decision tree, once I form the tree, I don't have the training data anymore; maybe I'll regret that, maybe at some point I'll wish I knew what some data point was. Here, years later, I still have my whole database, so I haven't lost that information.

So those are the advantages. Now, what are the disadvantages? Well, one prominent and obvious disadvantage is that this can be very slow at query time, and in fact the more data you have, the slower it gets. Usually having more data is better, because I can learn a better model, but here having a lot of data could really kill it. Say you're trying to control a robot in real time: you have frames of its past behavior, and you don't have time to run through all of them. Compared to, say, a decision tree, which is blindingly fast, this is a problem. Or you're trying to play the stock market: you need a response in a fraction of a second or your competitors will beat you. Instance-based learning could be a problem for that. Another obvious disadvantage, which is actually not such a big one these days, is that this takes a lot of storage. A set of rules you can probably store quite efficiently; with this, you have to store all the data you've seen. In fact, in the early days, when nearest neighbor was first invented, circa 1956, storage was a huge problem; for a long time nearest neighbor was not considered practical, because who could possibly have the storage for even a thousand examples? This was back when memory was ferrite cores, these little magnets, basically. Over time, sometimes memory is the bottleneck and sometimes it's not; these days memory is not the bottleneck. Disk is cheap; you can store all that stuff, no problem. But finally, here's the least obvious and actually the most important of these problems: nearest neighbor methods are easily fooled by irrelevant attributes. Let's say I have a thousand attributes for each person, most of which are totally irrelevant, and there are a few that are really relevant. If I measured my similarity using only those, I'd be in good shape, but the signal from those is swamped by the noise from all the irrelevant ones, and now I have a serious problem. Notice that decision trees and rule sets, for example, are very good at dealing with this problem; this is what they do: they find the relevant features and throw out the others. Here we're not doing that. Now, of course, all these problems are well known and have been well appreciated for a long time, so there are solutions to all of them. Nevertheless, even after we incorporate those solutions, these methods are still not the best ones along these dimensions, compared to some of the others that we've seen. So it's good to bear these things in mind. Any questions so far?

All right. Well, there's one very big thing here that I didn't talk about yet. Sure, we store examples, and then at query time we find the closest one, but what exactly does "closest" mean? How do I measure that two examples are similar? Clearly the behavior of this algorithm is going to depend a lot on your definition of similarity. Now, your definition of similarity could be anything from extremely simple to very, very complicated; in fact, there's this whole area called case-based reasoning where your similarity measure is a huge program with subroutine calls and who knows what. Fortunately, for a lot of applications, just about the simplest measures you can imagine actually work quite well. In particular, let's say that all your features are numeric, they're all numbers. What might you use as a distance measure? What's the first thing that comes to mind? The sum of the distances along each feature? Yeah, so you
know, that is what's called Manhattan distance: just the sum of the distances along each of the features. Or you can use Euclidean distance, the straight-line distance between the two points in hyperspace. Both simple, both widely used. More generally, you can use what's called an L_n norm. The L_n norm means that, to compute the difference between two examples, you go to each dimension, compute the difference between the examples along that dimension, take its absolute value, raise it to the nth power, sum over all the dimensions, and then take the nth root. So Euclidean distance is the special case of this when n is, what? Two, right. And Manhattan distance is the special case when n is one, exactly, because you're not raising to anything; you're just taking the sum of the absolute values. But I could use other n's. The most popular ones are indeed one and two. Do you know what the third most popular one is? Anybody care to guess? This is a trick question: the third most popular one is infinity. What is the L-infinity norm? It sounds like a weird thing at first. What happens when you raise the distance along each dimension to the infinite power, sum them all, and then take the infinite root? You just get the maximum, because as you raise things to higher powers, the larger ones dominate more and more; if you raise them to infinity, the largest one swamps all the others. So the L-infinity norm is just a fancy way of saying: ignore all dimensions except the one with the biggest difference, and that's the one that's going to be your distance. For some applications that's actually what makes sense. But the main point here is simple: simple measures often work quite well.

Now, so this is for numeric features. What about symbolic features? What might you use as a distance measure between examples if you have symbolic features? What's the simplest thing that comes to mind? How do you measure the difference between two bit vectors? Let's say my variables are just boolean; what's a way to measure the difference between two boolean vectors? The Hamming distance? The number of bits that are different? Yeah, exactly, which is the Hamming distance. The Hamming distance is just the number of bits along which they differ, also known as overlap. So Hamming distance, or overlap, just means: for each feature, check whether the values are the same. And this applies to features with multiple values: if both of these things are red, the distance is zero; if one is red and the other is another color, then the distance is one. Now, for a long time this is all that people used with nearest neighbor, and it works okay, but it does not work as well for symbolic domains as the kinds of things that we saw before. So for a while there was this perception that nearest neighbor algorithms are good if what you have is continuous variables, but for discrete variables they're not really competitive. Then much later, in the late 80s or early 90s, people figured out a very clever distance measure, and with that distance measure nearest neighbor becomes competitive with many of these other methods for things like, say, predicting whether something is a promoter region in DNA, or where the splice junctions are, and a lot of things like that. So what is this measure? It's a fiendishly clever thing. It's called the value difference measure, or VDM for short. We call it a measure, not a metric, because it's not a metric: it doesn't obey the triangle inequality and things like that. And the idea of this measure is the following. Let's suppose that my objects can come in multiple colors: red, green, and blue. The
problem that we have is: what could possibly make red more similar to green than to blue? With numbers we don't have that problem; two is more similar to three than to four, and that's why things like Euclidean distance make sense. But when what you have are just symbolic values, they can be the same or they can be different, but there are no degrees of difference: if your object is red, any color that's not red looks equally different. And this is where people were stuck for a long time. The idea in the value difference measure is the following. Let's remember one important thing: I'm not measuring this distance for its own sake; I'm measuring the distance for the sake of predicting the class. The whole point of deciding that two things are similar is that, at the end of the day, because they're similar, they will have the same class. So what we want to do is consider how similar, say, these colors are with respect to the class, with respect to whether you want to buy the car or not. So this is what we're going to do. The value difference measure is still going to have this general form, except that instead of this absolute difference I'm going to have this delta here. Let's think of two values of the same variable, val_i and val_j. Delta is going to be the following thing: the sum over all my classes (I do this for each class) of the difference between the probability of the class given the first value and the probability of the class given the second value. Why is this a good thing to use? Now you see why red and green might be more similar than red and blue: if people tend to buy a car when it's either red or green, then for purposes of predicting whether you'll buy the car, red and green are similar; if I switch red for green, the result will probably not be different. On the other hand, if people buy red cars and don't buy blue cars, then those are very different: if now, instead of red, in the example that I'm comparing with I have blue, this is a big red flag; this might make a difference in the class. So all I do is compute these matrices of the probability of the class given each value of each attribute, and then I just plug this into the usual distance measure. You can do fancier things with this, and the full-blown VDM has more stuff in it, but this is the basic idea. If you just implement this, you will usually get much better results than with Hamming distance, and on a good day you'll beat some of these other methods that were actually designed for symbolic problems to start with. And of course, if in your problem you have a mixture of numeric and symbolic features, you just use the corresponding component in each dimension: for the symbolic features you use something like the VDM, for the numeric features something like Manhattan distance, and so forth. Of course, this is just the starting point. In general, when you're using instance-based learning, designing your distance measure is probably one of the things you focus on, and this is where you want to put your knowledge of the domain: the kind of knowledge that you use here is what makes things similar versus what makes them dissimilar, and that's what you want to put into your distance measure. Any questions?

All right, so let us try to answer the following very important question. When I do instance-based learning, I have a database of examples, and then when I find the nearest example with a distance measure and apply it, implicitly I'm representing some concept. Whenever we look at a new learning method, the first thing we try to understand is: what is the representation? What can it do, and what can it not do? And we haven't
really done that here. In fact, it's less obvious in this case than in, say, the case of a decision tree. What does a big database of examples represent? What concept am I really learning? Think of this in terms of instance space: there are positive examples, there are negative examples, and we're trying to find the frontier between the two. We saw that in the case of, for example, a decision tree, the frontier between them was basically a bunch of axis-parallel hyperplanes, pointing in one direction or another. What is that frontier going to be in our case? This is also going to help us understand why these methods, despite being so simple, can be so powerful.

So here's a simple example. Let's suppose that my instance space is the plane. Here it is; think of it as the surface of a pond. It's flat, nothing is happening. And then my query x_q is a little pebble that I drop into the pond. Ploop, the pebble falls into the pond, and you get this ripple, a widening circle. And already in the pond (let's say this is a zen garden or something like that) there are other pebbles sticking out; some of the pebbles are labeled positive and some of them are labeled negative. So in this pond scenario, what is the class prediction for the query going to be? It's the class of the first other pebble that my ripple hits. Like, for example, in this case here: the ripple starts from x_q, and the first thing it reaches is this plus here, and at that point we go, oh, you're the nearest example. I'm assuming, by the way, that we're using Euclidean distance here. The same ideas would apply with other distances, but Euclidean is the most visual, intuitive one, so we're going to focus on that. So the first pebble that the ripple hits is the one that predicts the class.

But now let's turn this around: which queries is this positive example going to be the winner for? I can ask, for each of these training instances: what region of instance space is it closest to, and therefore where is the region where it's going to win and be the one that makes the prediction? We call that the Voronoi cell of that point. The Voronoi cell of a point in space is the set of all points that are closer to that point than to any other. So these points are my training examples; I sprinkled them around space, and there they are. And the Voronoi cell of a particular point is the region of space that's closer to that point than to any other. And here's a quick question: suppose that I have a positive example here and a negative example there. What is the shape of the boundary between them going to be? Is it going to be a curve? Is it going to be a straight line? What is it going to be, assuming we're using Euclidean distance? A straight line, equidistant from the two points? Exactly: it's a straight line that's equidistant from the two points, because what is the frontier? The frontier is the set of points that are at the same distance from the two instances, and that forms a straight line. So the first thing this tells us, as a preview, is that my frontier is again going to be a bunch of straight lines, or hyperplanes in the general case. But these hyperplanes, unlike in decision trees, are not going to be axis-parallel.

So let's pursue this thought. Let's say that this is my training set, and here I'm not saying what the classes are, because it doesn't matter. What I've drawn here is the Voronoi cell of each one of these. So, for example, the Voronoi cell of this one in the middle is this shape here: here is its boundary with respect to this guy, here is its boundary with respect to that guy, and so forth. So any query point that falls in here is
this. I'm going to get these very jagged regions, all sorts of noise; this could be a disaster. Absolutely right. So one of the questions we're going to ask ourselves is: how do we combat overfitting? As in every machine learning method, combating overfitting is going to be one of our key preoccupations, and we'll talk about that in a little bit, but in the meantime maybe you can think of what some ways of fighting overfitting would be. If we don't do anything, indeed we run a big risk of overfitting.

Okay, historical footnote. Nearest neighbor was invented in 1955 or 1956, and for the next 10 or 12 years it was an obscure technical report from an Air Force research lab; nobody cared about it. The main reason nobody cared is that, first of all, computers back then were small and slow, and you couldn't really run it on a lot of real things, so it seemed to be of purely theoretical interest. But the other reason was that even in theory, nobody could prove that nearest neighbor did anything useful. These were the days when there was no machine learning; there was statistics, and what people did in statistics was: you know that your distribution is Gaussian, you estimate the mean and the variance this way, and you can prove that if you get enough data, eventually you get the right mean and the right variance, and you've learned the right Gaussian. This is the kind of result that people wanted to have before they would believe in some modeling approach, and people didn't have that for nearest neighbor, so they just thought, well, this is a crazy idea. And then a couple of people came along and actually figured out how to prove that nearest neighbor converges to something meaningful as you give it more data, and then nearest neighbor really took off. In fact, the people who proved that, unfairly or maybe fairly, are the ones who are often credited with inventing nearest neighbor. They didn't; they actually just proved that it works. And the proof that it works, if you pare it down to its bare bones, is very simple, and it is well worth seeing, because it gives us some understanding of why these methods are good. Historically, people used to do what is called parametric estimation: estimating a Gaussian is parametric estimation, because you have two parameters, the mean and the variance, and all you have to do is estimate those parameters. Nearest neighbor was the first non-parametric method. Notice that, compared to those things with a fixed form, nearest neighbor is very powerful: it doesn't tell you in advance what your model is going to be; the model arises from the data in its full freedom. So in some ways you really can think of nearest neighbor as the first machine learning algorithm. But now we would also like to know that these algorithms do something meaningful and give you the right answer. So here it is, in one slide. This was a six-page, dense paper in the proceedings of something, but we can actually summarize it very briefly. So let's see what it is.

First of all, let us introduce a very important quantity that I'm going to call epsilon star of x, where x (and this x is bold, by the way) is my vector of attribute values; let's just think of it as a point in Euclidean space of some dimension. Epsilon is usually the letter used to represent error, and we're going to do that here. Epsilon star is the error of the optimal prediction: if you could see into God's mind and figure out what the best answer was, how often would it be wrong? And you might say, well, that's always going to be zero, I can always find the right answer, but I say no, sometimes it can be wrong. Why is that? It actually has to do with the
very important issue that somebody brought up in the last lecture. Why might I not always be able to make the right prediction? Any ideas? Overlapping data points? Yeah, exactly. Suppose that I've seen 10 patients, all with exactly the same symptoms; seven of them have lung cancer and three of them don't. Then what should I predict, lung cancer or not lung cancer? Seven have it and three don't, so I predict: yes, you have lung cancer. And then what is my epsilon? My epsilon is not seven, because seven is the number that I get right; my epsilon is the number that I get wrong, which is three. So my epsilon in that case is going to be 30 percent. So the moral of the story is this. Suppose that I've seen infinite data; I'm in the ideal situation. This is what is sometimes jokingly called asymptopia, because it's the utopian case where you've asymptoted at infinite data, and it's actually where a lot of academic statisticians live. Of course, in the real world, as we saw, even with a lot of data you're seldom in asymptopia, but nevertheless it's good to understand what happens as you get more and more data. If you think about it, once you've seen infinite data, what happens is that for every possible combination of symptoms you've seen infinite points, and now of course you can just predict the majority; that's the best thing you can do, but the minority you will still get wrong. And that is what your epsilon star is going to be. What I would like to have is an algorithm that will do at least that well. Sure, if you don't give it a lot of data it can't do miracles, and it will make more errors, but at least it would be reassuring if I knew that if I give a lot of data to this algorithm, it will converge to the right answer, as
opposed to doing something crazy. And this is what people didn't know about nearest neighbor in the beginning: is it going to do something meaningful, or is it going to do something crazy? So let's see. Let us call epsilon_NN of x the error that the nearest neighbor algorithm makes on that point x: if that point x were the query, and I saw that query again and again, what fraction of the time would nearest neighbor be wrong? So I'm really interested in the relationship between these two things; obviously I would like epsilon_NN to be close to epsilon star. And here's what these guys were able to prove. They proved that as the number of points goes to infinity (as the size of my training data goes to infinity, as I converge to asymptopia), the error of nearest neighbor converges to something that's at most twice the optimal. So with nearest neighbor, you're never going to be worse than twice the optimal. If the optimal is 20 percent, this means you're not going to be worse than 40 percent, which is not so exciting; but if the optimal is one percent, this means you will get at most two percent, and that's pretty good for such a simple algorithm. So how were they able to show this? This really was a game changer; this was the beginning of non-parametric estimation. Here's the basic idea of the proof. We're just going to do it for two classes; the generalization to multiple classes is straightforward. Let's think of when nearest neighbor makes an error. This is what I'm interested in: the probability that nearest neighbor will make an error; that's what epsilon_NN is. Well, let's see, there are two ways in which nearest neighbor can make an error. The first one is when the true class of the example is positive but the nearest neighbor is negative, and the second one is, of course,
the inverse, where the true class is negative but the nearest neighbor happens to be positive. So my total probability of error is the sum of the probabilities that these two things happen; it's the sum because they're disjoint, so I can just add them. So what's the probability that the first one will happen? Well, it's the probability that the example will be positive (let's call that p plus) times the probability that the nearest neighbor will be negative. By assumption, my data points are independent and identically distributed, so the probability of the two things happening is just the product of the two. So the probability of the first type of error is just the probability of plus times the probability that the nearest neighbor is minus. Everybody agree? And for the other case it's the inverse: the probability that the example is negative times the probability that the nearest neighbor is positive. So far so good? Very good. And now, if you think about it, the probability that the nearest neighbor is negative is the same thing as one minus the probability that the nearest neighbor is positive, right? So I can replace one by the other, and same thing over here. So now I have this, you know, not too complicated expression, right? So far this was really just algebra, right, and definitions of probability and whatnot. But now let us take that limit as n goes to infinity, right? This is the key step. What happens to the nearest neighbor as n goes to infinity? Right, here's my point, right, and here's my nearest neighbor, right? What happens to the nearest neighbor as I get more and more points? It gets closer, right? It never gets farther, right? Because if my new points are farther away, this one remains. But every now and then a new closer point will show up, right?
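For reference, here is the algebra up to this point written out. The shorthand is mine (the lecture uses it only informally): p_+ stands for P(+|x), the probability that the query's true class is positive, and x_NN is the query's nearest neighbor.

```latex
% The two disjoint ways 1-NN can err on a query x:
\epsilon_{NN}(x) \;=\; P(+\mid x)\,P(-\mid x_{NN}) \;+\; P(-\mid x)\,P(+\mid x_{NN})
% Writing p_+ = P(+\mid x) and using P(-\mid\cdot) = 1 - P(+\mid\cdot):
\epsilon_{NN}(x) \;=\; p_+\bigl(1 - P(+\mid x_{NN})\bigr) \;+\; (1-p_+)\,P(+\mid x_{NN})
```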
And so what happens as I get more and more points is that the two are going to get closer and closer until they're the same. What this means is that when n tends to infinity, the probability of the nearest neighbor being positive converges to the probability of the point itself being positive; they become the same. This is the key: if I have infinite data, those two things basically become the same. And the same holds for the probability of the point being negative and the nearest neighbor being negative. So now let us replace one by the other in our expression. What do we get? Well, now I get p plus times 1 minus p plus: I used to have the probability of the nearest neighbor being positive, but in the infinite data limit this has become p plus. And over here what I had was the probability of the nearest neighbor being negative, which I had replaced by 1 minus the probability of it being positive, and which has now also become 1 minus p plus. So here's what I have now. Notice that this is actually the same expression on both sides: p plus times 1 minus p plus, plus 1 minus p plus times p plus. So it's just 2 p plus times 1 minus p plus. We're almost there now; there's just one more little thing that we need to do. What is the relationship between p plus and the error of the optimal classifier, epsilon star? Are they the same? Are they different? Let's suppose that the majority class is negative, so I should predict negative. Then what is the relationship between p plus and epsilon star? 1 minus the majority? Exactly: p plus and p minus are each 1 minus the other. So if I'm predicting minus and the true answer was plus, how often do I make an error? p plus, right? It's the fraction of times that I get it wrong.
So in that case, p plus and epsilon star are the same. In the reverse case, suppose the majority class really was plus. Now p plus is the probability of me getting things right, so the probability that I get things wrong is 1 minus p plus. Now let's look at both of these cases: in one case I replace p plus by epsilon star, and in the other case I replace 1 minus p plus by epsilon star. But since the expression has both of those terms, at the end of the day, in both cases what I get is epsilon star times 1 minus epsilon star. So in the end this expression is just equal to 2 epsilon star times 1 minus epsilon star, and since 1 minus epsilon star is always less than 1, this is always less than 2 epsilon star, and that's our proof. It's actually pretty straightforward. Again, I glossed over the technical details, but the basic idea is right here. And in fact, nearest neighbor as n tends to infinity is what is called the Gibbs classifier for this problem. When we talk about statistical learning, we will see what that means. There's this notion of the optimal classifier, but often the optimal classifier is unreachable; sometimes you can reach something called the Gibbs classifier, which is kind of like the second best thing, and nearest neighbor is that second best thing. Questions? Okay. Now, of course, this is very nice, but it's not ideal. Why can't we get to epsilon star? What we want to do is get to epsilon star; we don't want to be satisfied with 2 epsilon star. Can we get there? Well, it turns out that yes, we can, if instead of using just the nearest neighbor, we use the k nearest neighbors. The analog of this theorem for k nearest neighbor is that if I do things right, I will get an error of epsilon star from k nearest neighbor.
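For reference, here is the 2-epsilon-star argument just sketched, written out in one place (my own LaTeX transcription of the lecture's derivation, with p_+ short for P(+ | x) at the query point and P_NN(+) for the probability that the nearest neighbor is positive):

```latex
% Two disjoint ways for nearest neighbor (NN) to err:
\epsilon_{\mathrm{NN}}
  = p_+ \, P_{\mathrm{NN}}(-) + p_- \, P_{\mathrm{NN}}(+)
  = p_+ \bigl(1 - P_{\mathrm{NN}}(+)\bigr) + (1 - p_+)\, P_{\mathrm{NN}}(+)

% As n \to \infty, the nearest neighbor converges to the query point,
% so P_{\mathrm{NN}}(+) \to p_+ :
\epsilon_{\mathrm{NN}} \;\to\; p_+ (1 - p_+) + (1 - p_+)\, p_+
  \;=\; 2\, p_+ (1 - p_+)

% With \epsilon^* = \min(p_+,\; 1 - p_+), both cases give the same product:
2\, p_+ (1 - p_+) \;=\; 2\, \epsilon^* (1 - \epsilon^*) \;\le\; 2\, \epsilon^*
```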
So k nearest neighbor can be my optimal classifier in the infinite data limit. But what does it mean to do things right? We can't just do this any old way. Here's what doing things right means. I have to let my data go to infinity; that's natural. I need an infinite number of data points to make sure that the nearest neighbors all converge to my query point. But k also has to go to infinity, so I have to do an infinite nearest neighbor algorithm, because I need infinitely many neighbors to estimate my probability of the example being plus with no error. As long as I estimate it with some error, if p plus and p minus are very close, I could still be getting it wrong. So k itself has to go to infinity. But k can't just go to infinity in any old way, because if, for example, I let all my points be nearest neighbors, then I'm just going to get the default prediction. So what has to happen is that even as n goes to infinity and k also goes to infinity, k over n has to go to zero. What this means is that I'm getting more and more data points and I'm considering more and more neighbors, but the neighbors that I'm considering are a dwindling fraction of the total examples. Which means that I do get a smaller and smaller ball around my point, but within that ball the probability is very, very densely estimated. And then, to give a very hand-waving version of the proof, what happens at that point is that I have all those points right there, so from them I can estimate the true probability of plus and minus, and then I make that prediction and I get the optimal error. So for something so simple, k nearest neighbor is shockingly powerful. Any questions? Okay, very good. So let's talk about this issue, or at least something that it's related to. The second biggest problem in machine learning after overfitting is what is called the curse of dimensionality.
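Before moving on, the k-nearest-neighbor convergence just described is easy to watch in a small simulation. This is a toy setup of my own, not from the lecture: a 1-D problem where P(+ | x) is 0.8 on one half of the interval and 0.2 on the other, so the optimal error epsilon star is exactly 0.2. We grow n while taking k around the square root of n, so k goes to infinity while k/n goes to zero:

```python
import heapq
import math
import random

def knn_error(n, k, trials=1000, seed=0):
    """Estimate the test error of k-NN on a 1-D toy problem where
    P(class = + | x) is 0.8 for x > 0.5 and 0.2 otherwise, so the
    Bayes-optimal error rate (epsilon star) is 0.2."""
    rng = random.Random(seed)

    def draw_label(x):
        return 1 if rng.random() < (0.8 if x > 0.5 else 0.2) else 0

    xs = [rng.random() for _ in range(n)]
    ys = [draw_label(x) for x in xs]
    errors = 0
    for _ in range(trials):
        q = rng.random()
        true_y = draw_label(q)
        # indices of the k nearest training points to the query q
        nearest = heapq.nsmallest(k, range(n), key=lambda i: abs(xs[i] - q))
        # majority vote among the k neighbors
        pred = 1 if 2 * sum(ys[i] for i in nearest) > k else 0
        errors += (pred != true_y)
    return errors / trials

# Let n grow with k ~ sqrt(n), so k -> infinity while k/n -> 0.
for n in (100, 400, 1600):
    print(n, knn_error(n, int(math.sqrt(n))))
```

With 1-NN the asymptote is 2(0.2)(0.8) = 0.32, while the growing-k run drifts down toward the optimal 0.2, modulo sampling noise.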
And again, the curse of dimensionality is present with every method and, as usual, affects the more powerful ones more. But it's a particularly severe problem in instance-based methods, which is why we're going to talk about it here. Just as we introduced overfitting with decision trees but it applies everywhere, the curse of dimensionality is also going to apply everywhere. So this is a general issue, not just within instance-based methods. You remember when I said that if somebody asks you what the problem with some machine learning algorithm is and you don't have an answer, a good thing to say is overfitting? If overfitting fails, your second best choice is the curse of dimensionality. Between overfitting and the curse of dimensionality, you'll probably get a lot of things right. Good thing to remember. So what is the curse of dimensionality? By the way, the curse of dimensionality affects other things besides machine learning. The term was coined by a famous control theorist called Richard Bellman, because he had these methods for optimal control that worked fine as long as you were in low dimensions; when the dimensions got high, everything fell apart, and he called it the curse of dimensionality. So what is the curse of dimensionality? Here's the first instance of it, something that we've already alluded to. Let's suppose that your instances are described by 20 attributes, but only two of those are relevant to predicting the class. Like our parity example, remember: your class is the parity of two of the bits, but then there are 18 irrelevant ones. For nearest neighbor this is a really serious problem. Nearest neighbor isn't going to pick out the signal from those two attributes out of the noise from all those other 18.
Sure, those two are pointing in the right direction, but superimposed on that is a noise signal that's an order of magnitude larger. This, as I mentioned, is the single biggest problem with instance-based methods, and is one of the biggest reasons that people use other things like decision trees and whatnot, at least from the point of view of accuracy. And this is an example of the curse of dimensionality, because if the dimensionality was low you wouldn't have this problem. If you just had the two relevant ones, well, given enough data, you'd probably be able to pick out the right ones. But if you have a lot of irrelevant ones, then you are hosed. However, it gets worse, much, much worse. Here's the shocking thing. Let us suppose that instead of two relevant attributes and 18 irrelevant ones, I had 20 relevant attributes. Does this mean that I am in better shape than with two relevant attributes? What does your intuition say? Would I rather have more information about my customer or less? More, right? Don't be afraid to state the obvious. Of course that's the natural answer, but there's a catch, and it works on two levels. On the one hand, surely knowing more about your customer is better. But here's the thing. As I go into higher and higher dimensions, my space gets bigger and bigger and my examples get farther and farther apart. And if you think about it, what I need for instance-based learning to make good predictions, as we just saw in the theory part, is for my nearest neighbor to be close to my point. As long as the class probability does not vary too suddenly, if my nearest neighbor is close to the query point, there's a good chance it's going to be right. And the problem with being in very high dimensions is that in high dimensions everything is very far from everything else.
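The irrelevant-attribute problem described a moment ago is easy to see in a small simulation. This is a toy setup of my own, not from the lecture: the class depends only on the first two attributes, and we watch 1-NN accuracy fall as we pad the examples with uniform noise attributes:

```python
import random

def nn_accuracy(n_irrelevant, n_train=200, n_test=200, seed=1):
    """1-NN accuracy on a toy problem where the class depends only on
    the first two of 2 + n_irrelevant uniform attributes."""
    rng = random.Random(seed)
    d = 2 + n_irrelevant

    def point():
        return [rng.random() for _ in range(d)]

    def label(p):
        return 1 if p[0] + p[1] > 1.0 else 0  # only dims 0 and 1 matter

    train = [(p, label(p)) for p in (point() for _ in range(n_train))]
    correct = 0
    for _ in range(n_test):
        q = point()
        # Euclidean distance over ALL attributes, noise dims included
        _, pred = min(train,
                      key=lambda t: sum((a - b) ** 2 for a, b in zip(t[0], q)))
        correct += (pred == label(q))
    return correct / n_test

for extra in (0, 18, 100):
    print(extra, nn_accuracy(extra))
```

With no irrelevant attributes, 1-NN does very well on this easy boundary; with 18 or 100 noise dimensions, the distances are dominated by the noise and accuracy degrades toward chance.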
And so as you add more dimensions, yes, on the one hand you have more information, which is good. But on the other hand, the whole notion of who's closest to you becomes very fuzzy. And the thing that's perverse is that the curse of dimensionality hits you even if all the attributes are relevant. Now, if the attributes are highly relevant, if each one of them is a very strong predictor of the class, then this won't happen. But then you don't need a lot of them anyway; many of them are redundant or irrelevant, and what you probably want to use is something like a decision tree that will build a tree over a few of those attributes and ignore the others. But if what you have is a situation with a lot of signals that are all weakly predictive of the class, then they're all relevant, and now you're in a situation where you might do well, except for the fact that the curse of dimensionality might kill you. The signal is there, but it's very diffuse. This is when things become difficult. So in high dimensions, nearest neighbor is very easily led astray; its ability to discriminate basically dies down as we go to high dimensions. We're going to see some examples that will hopefully make clear why this happens. More generally, this is true not just of nearest neighbor but of lots of different techniques that we use in machine learning, including optimization techniques; in machine learning, the final step is always to optimize something. Optimizers often work well in low dimensions; in high dimensions, things are much, much harder. And finally, here's the biggest issue of all. The biggest issue is in our brain. We build all these machine learning algorithms and all these techniques based on our intuitive understanding of a three-dimensional world. The real world is 3D. Your brain works in 3D; your visual system, with all of its power, works in 3D. As long as we're in 3D, we're happy.
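The claim that "everything is very far from everything else" can be made concrete: in high dimensions, the nearest and the farthest point from a query end up at almost the same distance, so "closest" stops discriminating. A quick check of my own (not from the lecture), measuring the ratio of nearest to farthest distance from the origin to random points in the unit cube:

```python
import math
import random

def distance_spread(d, n=500, seed=2):
    """Ratio of the nearest to the farthest Euclidean distance from
    the origin to n points drawn uniformly from the cube [0,1]^d.
    A ratio near 1 means all points are at nearly the same distance."""
    rng = random.Random(seed)
    dists = [math.sqrt(sum(rng.random() ** 2 for _ in range(d)))
             for _ in range(n)]
    return min(dists) / max(dists)

for d in (2, 10, 100, 1000):
    print(d, round(distance_spread(d), 3))
```

In 2D the nearest point is far closer than the farthest; by d = 1000 the ratio is close to 1, and the nearest neighbor is barely nearer than anything else.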
We have a really powerful machine inside our heads for dealing with three-dimensional space. But the problem is that in most machine learning situations you're dealing with a hundred, a thousand, a ten thousand dimensional space. And in thousand-dimensional spaces, things are very, very different from what they are in 3D. The intuitions that you have from 2D and 3D do not carry over. In fact, very weird things happen in high-dimensional spaces. And so with all the intuitive things that we might do with nearest neighbor, we always have to be on our guard, because your intuition could be leading you astray. This is the biggest problem; this is what makes it hard to design these algorithms. As I mentioned before, there's a famous data mining researcher who says that if people could see in high dimensions, we wouldn't need machine learning. But we can't see in high dimensions, and that's why we need it. So let me just give you a few examples, which hopefully you'll find entertaining as well as instructive, of why low-dimensional intuitions do not apply in high dimensions. The first example is our good old friend the normal distribution; nothing is more basic than the normal distribution. Let's look first of all at a normal distribution in one dimension. It looks something like this: here's my x and here's my p of x, our famous bell curve, and here's my mean. And if I draw a band of one standard deviation around the mean, something like this, most of the probability mass is there: from minus sigma to plus sigma, one standard deviation either way. The whole idea of a normal distribution is that most of the probability mass is concentrated around the mean, and then there's some fuzz around it. But roughly speaking, I know that I'm going to be landing near that point. This is the very meaning of a normal distribution. Very good. Now let's see what happens in 2D.
Let's look at this from the top. What I'm going to do is put x here and y here, and now I'm going to represent the distribution using iso-probability curves: all the points on a given curve have the same probability density. The maximum, of course, is still here; let's say I'm centering this at the max, at zero for example. These are bands of equal probability. In 2D, the one-standard-deviation band is going to look like a ring, like a belt. And now let's look again at my band of one standard deviation around the mean. That is still where most of the probability mass is. But notice the following thing that has happened: a smaller fraction of the total probability mass is now within a radius sigma of the mean. In 1D it really was a big chunk of the whole distribution; here it's a smaller chunk. Just looking at it, hopefully you can see that this now accounts for less of my probability mass than before, even accounting for the fact that the density is higher there. Because there's all this space around it where the low-probability stuff can happen, and when you add up the low probability over that entire big section of the plane, it adds up to a lot. And now, of course, you can see where this is going. Let's do this in 3D. Now what I have is spheres; there's a sphere here within one standard deviation of the mean, and now this sphere is starting to be a pretty small fraction of the whole. So even though the probability density is higher there, more and more of the probability is outside of that region. And guess what happens if you go to sufficiently high dimensions, and that doesn't even have to be very high, for reasonable variances. You know what happens? Most of the mass of the probability distribution is far away from the mean. This is a really disturbing thing. You have a normal distribution, and most of the probability mass is far from the mean.
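This concentration of mass away from the mean is easy to check numerically. A Monte Carlo sketch of my own (not from the lecture), estimating how much of a standard d-dimensional Gaussian lies within one per-coordinate sigma of the mean:

```python
import random

def mass_within_one_sigma(d, n=50000, seed=3):
    """Monte Carlo estimate of the probability that a standard
    d-dimensional Gaussian sample lands within distance 1 (= one
    per-coordinate sigma) of the mean."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n):
        r2 = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d))
        inside += (r2 <= 1.0)
    return inside / n

for d in (1, 2, 3, 10):
    print(d, mass_within_one_sigma(d))
```

In 1D, about 68 percent of the mass is within one sigma of the mean; in 3D, it's already down to about 20 percent; and by d = 10, essentially none of the mass is there, exactly as the lecture describes.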
This is exactly the opposite of how we think of a normal distribution, and yet it's what really happens. This is the curse of dimensionality: your picture of the normal distribution means one thing because of your intuition in low dimensions, but in high dimensions something very, very different is going on. Any questions or thoughts on this? Okay. But it doesn't end here. After all, this could just be an odd thing that happens with the normal distribution. Unfortunately, it's a thing that happens with just about everything in high dimensions. In high dimensions there's this Alice in Wonderland quality to everything that goes on. It's like the Twilight Zone: weird, paranormal things start to happen. So here's another example: the uniform distribution on a hypercube. Very natural thing, right? Let's just think of a cube: I have a cube here, and my points are uniformly distributed inside it. So you'd think most points are going to be in the middle of the hypercube, and very, very few points are going to be on the walls. Now let's say that we define a band of width epsilon near the walls, where epsilon is very small.
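The transcript cuts off here, but the standard punchline of this example (my own completion, consistent with the setup above) can be computed directly: the fraction of the hypercube's volume within epsilon of some wall is 1 minus the volume of the interior cube of side 1 - 2*epsilon, and that fraction goes to 1 as the dimension grows, so in high dimensions almost all uniformly drawn points are near a wall:

```python
def fraction_near_wall(d, eps=0.01):
    """Fraction of the unit hypercube [0,1]^d lying within eps of at
    least one wall: the interior is a cube of side 1 - 2*eps, so the
    boundary shell has volume 1 - (1 - 2*eps)**d."""
    return 1.0 - (1.0 - 2.0 * eps) ** d

for d in (3, 100, 1000):
    print(d, fraction_near_wall(d))
```

In 3D a band of width 0.01 holds under 6 percent of the volume, matching the low-dimensional intuition; by d = 1000 it holds essentially all of it.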