Today I want to talk about machine learning, a very cool topic that's getting used more and more these days. Part of the reason machine learning seems weird is that we have this idea of machines as rigid, unadaptable things, just blindly doing whatever they do. But the whole idea of learning is that you take in some data, some observations about the world, and try to find patterns in them that you can then exploit to do better next time. So for a machine to learn means it has to be able to, number one, change its behavior, and number two, observe and find out things about the world, which is sort of the opposite of blindly doing the same thing.

There are a lot of different ways we can look at learning problems. Here are four of them, and you already know all of them. Supervised learning: you get some input, and a teacher, a supervisor, tells you what it is. Like, you know, this is a fleab and this is a gloob, and okay, now here's a quiz: what is this? It's a fleab. That's supervised learning. Unsupervised learning: there's no teacher. You're just seeing stuff and saying, oh, this thing and this thing are pretty similar, I'm going to put them together; oh, this other thing is very different, I'm going to put it in a separate category. So in unsupervised learning you just observe some data and sort it into groups based on your sense of similarity among the things you see. Reinforcement learning: you're actually out doing things, not just looking but acting in the world. I wonder what happens if I do this... oh, got a shock, negative reinforcement. Or, ooh, tasty, positive reinforcement. And then you're trying to learn to do things that get positive reinforcement. Temporal reinforcement learning: same as reinforcement learning, except there's a big time delay. It's like, oh, I shouldn't have eaten that mushroom yesterday.
Games, chess, checkers, whatever, are typical examples of temporal reinforcement learning tasks. You make a whole bunch of moves, then you win or lose, and it's not clear which moves in the combination were responsible for the outcome.

Within these problem formulations there are lots of different learning algorithms, specific things to have the machine do to change its behavior in response to its observations and try to do better. Nearest neighbor and k-means clustering, those are unsupervised learning algorithms; the other ones are mostly supervised. This is a current area of research. There are lots of different algorithms, and lots of others that could be put on the list, which is good, there's a lot of activity happening. But it also suggests that this is definitely not a solved problem, and there's a sense in which the whole idea of learning is not really a solvable problem: it depends on making assumptions about the world, that some pattern that used to hold will continue to hold.

So regardless of which learning algorithm or which problem formulation you adopt, there are some things that come up over and over again. I want to talk about two of them here and try to do demos. The first one is this idea of exploration versus exploitation. If you need to explore out in the world to find out what your options are, to find out what's good and what's bad, then a question comes up: okay, this thing seems to be pretty good, should I just cash out and keep doing it? Or should I try some other stuff in hopes of finding something better, even though that may cost me, because I give up payoff, reward, good things I could have had by doing what I already knew? Trying new stuff is exploration; cashing in on the best thing you know so far is exploitation; and there's an inherent trade-off between them.

All right, let's try an example. This is the chili monster, an alien robot chili monster. I think maybe it's a zombie too.
And the way it works is you poke this guy in the eye and he gives you a cookie. So there, plus one, that's a good cookie, that's a payoff. And so the question is: should we poke him in the red eye again, or should we go for green? I'm going to go for green. Green... okay, that was bad, minus one. So here we are, we're now the learning creature. We've got a good one from red and a bad one from green, so figure we'll do the red one again. Oh, great, so two. All right, so two and one. Do this guy maybe a little more... now we're two and one either way. Okay, so which choice do we want to go with next? Who knows. Pick a red one. Okay, three and one. All right, that's worse, okay.

So here we are with a learning problem: do we want to just commit to red and do that all the time, or do we want to try green some more and see if it catches up? As humans we have a real tendency to make up immediate explanations for everything. All right, we'll try red again. Okay, good, so red is just better. We'll try green... well, green was better too, so maybe we're just being extra lucky today and we should alternate back and forth. Well, I guess not.
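The red-versus-green game is a little learning problem we can simulate in code. Here's a minimal Python sketch; the payoff rates come from the 48 and 52 percent figures revealed later in the demo, but the epsilon-greedy strategy shown here is my own stand-in for illustration, not the demo's actual mechanism:

```python
import random

# Stand-in payoff rates: the demo reveals one eye pays off 48% of the
# time and the other 52%; which is which is arbitrary here.
PAYOFF = {"red": 0.52, "green": 0.48}

def poke(eye):
    """Poke an eye: +1 for a good cookie, -1 for a bad one."""
    return 1 if random.random() < PAYOFF[eye] else -1

def epsilon_greedy(n_pokes=10000, epsilon=0.1):
    """Explore with probability epsilon; otherwise exploit the best eye so far."""
    wins = {"red": 0, "green": 0}
    tries = {"red": 0, "green": 0}
    total = 0
    for _ in range(n_pokes):
        if random.random() < epsilon or min(tries.values()) == 0:
            eye = random.choice(["red", "green"])               # explore
        else:
            eye = max(tries, key=lambda e: wins[e] / tries[e])  # exploit
        reward = poke(eye)
        tries[eye] += 1
        if reward > 0:
            wins[eye] += 1
        total += reward
    return total, tries

total, tries = epsilon_greedy()
print(total, tries)
```

Epsilon-greedy is the bluntest possible answer to the trade-off: a fixed fraction of pokes are pure exploration, and everything else exploits the best average seen so far.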
So one way we can avoid saying, well, the reason red is better is because I'm wearing my lucky red socks, is that instead of making up explanations immediately, we lay back and collect data. We need to keep count of how many things went well and how many went badly. One of the mistakes we can make as people is to only count one side: if we're optimists we count the good side, and if we're pessimists we count the bad side. So we've got a little collector cup here that we can use. The way it works, we just put it here and it vacuums up whatever comes out. So far from the red side we've got four good ones and two bad ones. We've got a cup for the green side too; it's paying off at three and two. They're both pretty good. Okay, so which one should we pick?

Well, we've got a learning creature that is connected to our cups, so he's aware of the data he gets here, and he doesn't have a mouse, so he throws water balloons, like that. So he threw one at green and got a negative payoff. He goes again... well, he got a negative payoff on that side too. So the learning algorithm for this guy has to decide what to do, which way he's going to go. Let's just let him run for a while. Goes red, gets a payoff, so he's focusing on red. Why is he focusing on red? Well, red is paying off 60-something percent of the time. Green... well, now green's up there too. He's going back and forth between the two.

So how does this guy work inside? It's pretty cool, actually. It's very simple, and it's an algorithm that is being used out in the world; we'll talk about that in a sec. Let's turn them off. Okay, all right. So here's this graph. I'm not going to explain it in complete detail because it's a little bit involved. It's called a beta distribution, and the way it works is: how much is that red choice worth? We've tried it 31 times; we got 16 wins and 15 losses. How good is that? Well, we
know the average: we divide it out, 51 percent, whatever it is. But is that really what it's worth? Could it actually be paying off 50-50 and we just haven't seen it yet, or could it even be paying off like 55 percent and we're just getting particularly unlucky? Well, this thing called a beta distribution comes out of statistics. You feed in the two numbers, the number of wins and the number of losses (I'm fluffing over some details here), and it gives you back this curve. The curve doesn't say what the worth of the choice is; it says the distribution of it. So right now this green curve corresponds to the fact that we've got 17 wins and 13 losses, which is actually a little better than the red curve at plus 16 and minus 15. But both curves are very broad, meaning we're not really sure how much either choice is worth at all. I mean, the green guy has some height all the way out at like 0.7 and all the way down to, I don't know, maybe 0.3 or something like that.

The white curve is a little demo. If we've never tried an option at all, the curve is completely flat, but as we try it and start to get wins, the curve gets sharper and sharper toward the wins. So if we had 11 wins and zero losses, we'd start to think, wow, it's pretty likely this thing pays off all the time. But then the instant we get a loss, well, we know it can't be paying off all the time, and so on. So with the beta distribution, you feed in the wins and losses and it gives you an estimate of the distribution of possible values of the choice. And the nice thing is that this is better than just the average number of times it won, because if we had three wins and three losses, that's 50-50, but if we had 300 wins and 300 losses, that's also 50-50, and we know more about it when we have 300 wins and losses than when we have three. So you see, as we go from zero-zero, it's completely
flat. As we increase the number of wins and losses, if I keep them even, it stays at sort of 50-50, but the distribution gets narrower and narrower, because we've got more and more data saying it's really likely that it's 50-50.

So what this guy is doing is he's got those two curves. Let's get rid of the white one. He's got the two curves for the data that he's seen, and every time he makes a choice, the curve that gets more data gets updated. And what he does is he draws a random number weighted by the green curve, draws a random number weighted by the red curve, and whichever one comes out higher, he picks that. So let's let it run. Okay, at the moment the red and green curves are overlapping quite a bit, which means he's going to pick one, then pick the other, and so on. But as we get more data, the distributions start to get narrower, and as they overlap less, he's going to pick one over the other.

Okay, so green is paying off 54 percent, something like that, and red's paying off 51 percent. I can tell you now that I programmed this so that one of red and green pays off 48 percent of the time and the other pays off 52 percent of the time, but I don't even know which is which; the program scrambles them randomly when it starts. I was going to say it was probably green, but now they're very similar, and in fact they're both paying off more than their actual rate, just because of random variation, just because of luck. Let's actually speed this up a little more. I just figured this out the other day: we can get two balloons going at once. See if I can get it going... there we go. So now he's throwing the next one before he's even gotten the cookie from the previous one, which means he's making his decision about what to do next based on the throw from two ago. But we're trying to get lots of data, so who cares. And now it's looking like green is significantly better, so
the learning algorithm is choosing to exploit, to dump it into the green guy, while still trying the red one every so often, because it might be wrong: still exploring a little bit, but shifting more and more toward exploitation. This number in the middle here is his total payoff over all choices, red and green put together; he's up 33, 31, whatever it is.

This kind of problem, where you've got to do something out in the world and you have two goals, one to get knowledge about what's happening in the world (that's why you explore) and the other to actually make something good for yourself (that's the exploit), is a fundamental problem for learning algorithms. And this kind of learning problem is not really called the alien robot chili monster problem; it's more often called the two-armed bandit problem, or the multi-armed bandit, where each action you've got is like a different slot machine, and you're trying to figure out which is the best slot machine, if they're any different. This problem comes up all over the internet. It comes up with advertising, where the actions are which ads to put on a web page, say, and the chili monster is you: the system is trying to get you to click on one of those ads. That's a plus one, and if you don't click on an ad, that's a minus one for all of them. Or on a video website, when you get to the end of a video, you get this little grid of other ones to watch. Same thing: the choices of which ones to put up there are based on this kind of algorithm. It's trying to pick ones that, on the one hand, you'll click and watch, but on the other hand it's also trying to get knowledge about whether this particular video is one that people like.

All right, it's still not completely clear, so we can let this go a little bit more and come back after the next demo. So, exploration versus exploitation: even when we're in other problems,
other versions of learning, that's still behind the scenes. In supervised learning, the teacher is just telling you, do this, do this, do this, so it's not so obvious where the explore-exploit trade-off is. But once you learn something, you're still going to have to go out and act in the world, and then it comes back again.

Okay, so if the video site is showing me a set of videos I might want to watch, I'm going to make one choice and then I'm gone. It's not like I'm the chili monster sitting there getting treats over and over and over again. In fact, really, I'm just a teeny little piece of the chili monster: all the other people who are looking at videos and getting presented with choice screens and so forth are all more parts of the chili monster. Which raises the question: I don't like the same videos that somebody else does, and if they show me something someone else likes, it's not necessarily going to work that well for me. They might be able to pick the videos that everybody likes, Gangnam, Azalea, or whatever it is, but you could do better if you could recognize that I am similar to other geeky, computer-y people, whatever I am, and somebody else is similar to some other group.

So what we want to do is a categorization task: based on some knowledge we have, we want to make a distinction between one thing and another. And this ultimately leads into the question of knowledge representation: how do we divide up the things we see in the world into categories such that we can make predictions about them? The things in this category will tend to like these videos; the things in that category will tend to like those videos.

Okay, before we do an example of that, let's see how the learning guy is doing. They're still both extremely close together, and as a result (and this is good) he's still trying both of them. Looks like red's got a little bit of
the edge here, 52 percent, and green is falling away a little bit. Although, did we lose our doubling? Yeah, we lost our doubled-up throws there, so we weren't going as fast. Well, I guess we're not going to know for sure exactly which one ends up better; maybe if there's time we'll come back to it at the end, I don't know. But we're going to have to stop him for now.

All right, let's look at this: one of the earliest learning algorithms for making distinctions. This is supervised learning now. We're going to get input, the teacher's going to tell us this is this, this is that, and we have to learn it. In this case we're going to learn pictures. Okay, I went out and found on the web a bunch of news photographs that were used for face recognition research, and I took a dozen pictures of President Bush, and I looked in my camera and got a dozen pictures of cats. So we're going to try to learn to distinguish between former president and cat.

All right, and the way it works is this. Up here we've got our input: this is 250 by 250 individual pixels that make up a black and white picture. A pixel can be zero, which is black, or one, which is white, or anywhere in between, which is gray. Down here at the bottom we have another whole set of 250 by 250, 62,500, numbers: weights that represent what we've learned. And if we start this thing up, the first thing we do is clear out the weights. This level of gray is a weight value of zero; if a weight goes negative, it's drawn darker than that, and if a weight goes positive, it's drawn brighter than that.

So the way this machine works is we put in a picture. All right, there's a picture, former President Bush. And in order to evaluate it, we take each pixel in turn, 250 by 250, multiply it by the corresponding weight, and add them all up. If the sum is greater than zero, we say the category is president, and if it's less than or equal to zero, we say the category is cat. Okay, now at the
moment, all the weights are zero, so no matter what picture it is, we're going to get something times zero plus something times zero plus something times zero; the whole sum is going to be zero. So for the first prediction, the sum is less than or equal to zero (in this case it's equal to zero), so the perceptron says cat, and the supervisor says wrong.

Okay, now the perceptron learning algorithm. What it does is say: oh, okay, you're telling me this thing should have a sum greater than zero. So it goes through each of its weights, and for all the inputs that are high, it makes the corresponding weight more positive, and for all the inputs that are low, it makes the weight more negative. As a result, the whole sum goes up. And, I didn't actually expect this when I first implemented it, but modifying our brain to make this input produce a more positive sum actually makes a kind of shadow of the input, because the brighter spots get more positive weights and the darker spots get more negative weights.

All right, now here's a cat. But since we just changed the weights to make everything positive, the sum comes out greater than zero, so again it makes the wrong guess. Now we do the same learning, except now we want to make the sum smaller, less than or equal to zero. So it's like we memorize a negative of the picture in order to kind of cancel it out, like that. So now we've got half of one picture in positive and half of the other picture in negative, and so on.

So we can go through: we've got a dozen of one category and a dozen of the other, and we go through them in a random order. Whenever we get the answer right, okay, cat, we don't change the weights; we don't have to learn, we didn't make a mistake. But when we do make a mistake, we update the weights again. Okay, we go all the way through all 24 cases, we
see how well we did. If we got them all right, we're done; otherwise we shuffle them again and go back to the beginning. So let's speed this up. It's getting some right, getting some wrong. All right, now we're at the end of the pass, and we got half right, the same thing we would have gotten by flipping a coin, or by always saying cat. But let's see what happens in the second pass. All right, yes... well, no, still making some mistakes... nope, yes, yes, yeah, uh-huh, yes, yes. It's starting to learn.

Okay, so we let this go a few passes. The actual pattern of the weights depends on the order of presentation, because we only learn when we make a mistake, and if the order happens to give us several right in a row, we don't do any learning. So it can take a different number of passes before we actually categorize all the inputs correctly. And there, we got it. Okay, so isn't that cool? We learned all these things.

Now, one of the things that people got excited about with the perceptron way back in the 50s, when it was first studied, was that there was a proof that if there was any way for it to learn something, it would learn it. So in this case it learned it, no problem. But we would like more than having it just memorize a bunch of pictures that we showed it. We want it to learn the concept of former president versus cat, so that we could feed in new pictures, maybe pictures that the supervisor hadn't even categorized, and get useful answers out of this thing. That's called how well the algorithm generalizes: not just learning, not just getting it right, you know, teaching to the test, but being able to apply knowledge to other cases. So I've got a couple of other inputs here that the perceptron never saw, a generalization test. Here's a different picture of the cats that the perceptron never saw. What do you think? Is it going to get it right?
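The training loop just described, evaluate, update the weights only on mistakes, shuffle, and repeat until a clean pass, can be sketched in a few lines. This is a toy reconstruction on four-pixel "pictures", not the demo's actual program; pixels are centered around middle gray so that dark pixels push their weights the opposite way from bright ones, matching the update described above:

```python
import random

def predict(weights, pixels):
    """Weighted sum of (centered) pixels: > 0 means president, else cat."""
    s = sum(w * (p - 0.5) for w, p in zip(weights, pixels))
    return "president" if s > 0 else "cat"

def train(examples, max_passes=100):
    """Perceptron rule: shuffle each pass, update weights only on mistakes."""
    weights = [0.0] * len(examples[0][0])  # all weights start at zero
    for _ in range(max_passes):
        random.shuffle(examples)
        mistakes = 0
        for pixels, label in examples:
            if predict(weights, pixels) != label:
                mistakes += 1
                # Push the sum up for president, down for cat: bright
                # pixels move their weight one way, dark pixels the other.
                sign = 1 if label == "president" else -1
                weights = [w + sign * (p - 0.5)
                           for w, p in zip(weights, pixels)]
        if mistakes == 0:  # a full pass with no errors: done
            break
    return weights

# Toy 'pictures': presidents bright on the left, cats bright on the right.
examples = [([1.0, 0.0, 1.0, 0.0], "president"),
            ([0.9, 0.1, 0.8, 0.2], "president"),
            ([0.0, 1.0, 0.0, 1.0], "cat"),
            ([0.1, 0.9, 0.2, 0.8], "cat")]
weights = train(examples)
print(all(predict(weights, px) == label for px, label in examples))  # True
```

With all weights at zero, the sum is zero and the first guess is always cat, just like in the demo; and because this toy data is linearly separable, the convergence proof mentioned below guarantees the loop terminates.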
No, got it wrong. Another picture of a cat: got it right. 50-50. Another picture of the former president that it never saw: no. Sorry, not so good. Okay, the perceptron as we've used it here is not very good at generalization, because really, if we look at these weights, it's learning extremely specific little details about the particular pictures we happened to show it, and there's no reason to expect that knowledge to apply to other pictures. Even if we took one of the same pictures we trained on and shifted it a little bit, it would line up with different weights and probably get a different answer.

So the knowledge representation problem is: how do you build algorithms that are like the perceptron, in that they're making categorizations, but that make better categorizations, more like the way we do it? One of the algorithms mentioned at the beginning, deep belief networks, has got a whole bunch of perceptron-like units, but instead of going straight from the input to the output, it's got layers of them, trying to extract more and more general, abstract things. And a lot of this stuff is working pretty well now.

Going forward, one of the things that is in fact driving machine learning getting used now is the combination of two things: number one, computers are much more powerful than they used to be, and number two, because everything is connected to the internet, there are tons and tons of data to learn from. There are lots of pictures, there are people typing stuff, there are people speaking for speech recognition; there's tons of data. And just as if we had many, many more pictures to learn our distinction from, we'd have a better chance that our learning algorithm would sort out what's really important and what isn't, assuming our knowledge representation and the way we built the learning algorithm is even capable of
representing the concepts that we wanted to learn.

In the general case, the explore-exploit trade-off does not go away. The theoretical framework it's studied in is called minimizing regret; in the learning algorithm world, the idea is that "no regret" isn't possible. Regret means: if you could predict everything perfectly, you'd get some amount of payoff, and whatever you get less than that, because you had to explore, is your regret. And finally, knowledge representation goes to the core of what learning needs to produce in order to be successful, and also even of how we see things, how we understand the world as a result of our experiences.

And you know, this perceptron, this stupid little perceptron, one thing it's really bad at is recognizing what it doesn't know. It's totally happy to make a judgment about any possible input you give it. So for example, if we give it something that's not a cat or a president, like this picture of my dad at Sky City years ago, what does it say that is? It said my dad's a cat. And here's a picture of a dentist's light that looks like a robot, so I took a picture of it. There you go.
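By the way, the choice rule from the chili monster demo, drawing a random number weighted by each arm's beta curve and picking the larger, is known as Thompson sampling. Here's a minimal reconstruction in Python; the 48 and 52 percent payoffs and the startup scrambling are from the talk, but the code itself is a sketch, not the demo's actual program:

```python
import random

# One eye pays off 48% of the time, the other 52%, scrambled at startup,
# just like the demo: even the author doesn't know which is which.
rates = [0.48, 0.52]
random.shuffle(rates)
payoff = dict(zip(["red", "green"], rates))

wins = {"red": 0, "green": 0}
losses = {"red": 0, "green": 0}
total = 0

for _ in range(5000):
    # Thompson sampling: draw from each arm's Beta(wins+1, losses+1)
    # curve and poke whichever eye drew the higher number.
    draws = {eye: random.betavariate(wins[eye] + 1, losses[eye] + 1)
             for eye in payoff}
    eye = max(draws, key=draws.get)
    if random.random() < payoff[eye]:   # got a good cookie
        wins[eye] += 1
        total += 1
    else:                               # got a bad cookie
        losses[eye] += 1
        total -= 1

print(wins, losses, total)
```

While the two beta curves overlap, the draws flip back and forth (exploration); as one arm's curve sharpens above the other, the better arm wins most draws (exploitation), which is exactly the behavior the balloon-throwing creature showed.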