So I guess it's now time, so I think I'm going to begin. I'm sure you all know what machine learning is. If you've ever read Christopher Bishop's book, Pattern Recognition and Machine Learning, he gives only a tiny section to reinforcement learning, so I'm going to go into it in a little more depth.

Reinforcement learning is machine learning for policies. That means, ultimately, we're trying to learn a strategy for dealing with a particular problem, and we define this policy as a way of responding to particular environmental conditions. Given a state, which is the state of the world you're in right now, and given the actions that are available to you, what's the optimal way to act? We define optimal relative to some cost function or reward function: basically, we want to minimize cost or maximize reward.

This ultimately relies on an idea called Markov decision processes. A Markov decision process basically says that what you do changes the environment you're in. So if I move to the right, now I'm back here, now you can hear what I'm saying, right? My state of the world has changed by moving to the right. These transitions have probabilities: when you take an action, you hope things will go in your favor, but that doesn't necessarily mean they will. And these state transitions, driven by your actions, can be solved for optimality using dynamic programming, if you remember dynamic programming from your CS classes. Dynamic programming ultimately means that if you know exactly what's going to happen tomorrow, you can accurately plan today. If I look at the weather forecast, I know whether to take an umbrella with me; if I don't, my computer gets wet. So we're extending that idea: if you can work backwards in time, you can accurately plan.

I'm going to give you an example from the real world, from my previous work. I'm sure you know where this is. Columbia University was approached by a building management company to work on a sustainability problem. If you're familiar with New York City, the buildings in this neighborhood rely on steam energy, and the neighborhood is mostly commercial buildings. Commercial buildings have contractual obligations for what the temperature needs to be, which basically means that during the day everybody is trying to heat up their building at around the same time, and they're all turning it off at around the same time. This adds up to about 24 billion pounds of steam a year, and since the steam system was built about a hundred years ago, that's a huge load on the system.

So ultimately we were deciding when the optimal time to start heating is, because the utility bills we were given had certain fee structures we had to optimize for, including one-time charges tied to your maximum usage for a given month. In this case, the state was the weather forecast, the current maximum steam level, and where we are in the billing period, because this is a finite-horizon problem. At the end of the month you just do whatever is cheapest, because you don't have to worry anymore: your fees probably aren't going to get much worse than they already are.
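Since the whole approach hinges on this idea of planning backwards over a finite horizon, here is a minimal sketch of what backward induction looks like in code. To be clear, this is not the steam-heating model from the talk: the states, actions, transition probabilities, and costs below are made up purely for illustration.

```python
# A minimal sketch of finite-horizon backward induction (dynamic programming)
# on a tiny made-up MDP. None of this is the actual steam-heating model; the
# states, actions, and numbers are invented purely for illustration.
import numpy as np

n_states, n_actions, horizon = 3, 2, 4
gamma = 1.0  # no discounting needed on a short, finite horizon

# P[a, s, s'] = probability of landing in s' after taking action a in state s
P = np.array([
    [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.1, 0.9]],   # action 0
    [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]],   # action 1
])
# R[a, s] = immediate cost of taking action a in state s
R = np.array([[1.0, 2.0, 0.0],
              [3.0, 1.0, 0.5]])

V = np.zeros(n_states)               # cost-to-go at the end of the horizon
for t in reversed(range(horizon)):   # plan backwards in time
    Q = R + gamma * P @ V            # Q[a, s]: cost now plus expected cost-to-go
    V = Q.min(axis=0)                # act optimally, i.e. minimize expected cost
print(V)                             # expected total cost from each state at time 0
```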
On earlier days, you need to minimize your current cost plus your expected future cost, and you can compute that expected future cost because you've been working backwards in time from the end of the billing period.

So here's where the math comes in. You have states, you have actions that are available in each state, and you have a reward. The reward depends on where you are, what you do, and where you end up, and there's a probability distribution over where you end up, given what you've done and where you are. There's also a discount factor, which lets us prefer rewards now over rewards later, and also helps with convergence, because you may be optimizing over a time horizon that's effectively infinite. Ultimately you're going to optimize one of these things called a value function; the equations that define them are known as the Bellman equations. There are two ways of writing them: one relative to the state alone, which is the V function, and one relative to the state and the action, which is the Q function. The Q function gives the expected return of taking a given action from a given state and acting optimally afterwards, so Q(s, a) is the expected value of the immediate reward plus the discounted value of wherever you land; and the V function says the value of a state is the value of the best action available there, V(s) = max over a of Q(s, a).

So we're going to cut to a code example. Let me bring this up and zoom in a bit so everyone can see it. Is this better now? Okay, cool. As I mentioned, in Markov decision processes your actions affect the state of the world, and we're going to use something called value iteration, sometimes called backward induction or dynamic programming. As I said earlier, the optimal strategy for a given state is the action that maximizes your expected reward or minimizes your expected cost; this example is in terms of reward. And because of the discount factor, our estimates of what will happen converge, simply because the contribution of far-off rewards gets smaller and smaller.

We're going to look at this grid world example, a typical toy problem in reinforcement learning. In grid world you just have to move a little agent around a board, and it's really nice because with Seaborn's heatmap you can actually look at what the values are. So I'm just going to create this grid and show it right now. Here's what it looks like. You have a pitfall and you have a reward: the reward is in blue, the pitfall is in red. You don't want to get into the red; you want to get to the blue. And the strategy question is: how do we get to the goal from any place on the board? We know that if we're at the goal, we're done, and if we're at the pitfall, we're stuck; but the best way of getting to the goal from every other place on the board is unknown right now.

In this world, the agent can go up, down, right, or left. We assume that 80% of the time it ends up going where it wants to go, and 20% of the time it veers off perpendicularly. If it ends up in the pitfall or the goal, the game is over. So I'm just going to build this transition probability matrix and visualize it for you so you can see what it looks like. Here we go.
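Here is roughly what that setup might look like. This is my own reconstruction rather than the notebook's exact code, so the board layout, variable names, and helper function are assumptions; the 80/10/10 transition model is the one described in the talk.

```python
# A rough sketch of setting up a grid world like the one in the talk.
# The layout and names here are my own guesses, not the notebook's code.
import numpy as np

# 3x4 board: +1 goal, -1 pitfall, one blocked cell (NaN), 0 everywhere else
grid = np.zeros((3, 4))
grid[0, 3] = 1.0          # goal (blue)
grid[1, 3] = -1.0         # pitfall (red)
grid[1, 1] = np.nan       # the blank box

ACTIONS = ["up", "down", "right", "left"]

def transition_probs(action):
    """80% chance of the intended move, 10% for each perpendicular move."""
    perpendicular = {"up": ("left", "right"), "down": ("left", "right"),
                     "left": ("up", "down"), "right": ("up", "down")}
    probs = {action: 0.8}
    for a in perpendicular[action]:
        probs[a] = 0.1
    return probs

print(transition_probs("up"))   # {'up': 0.8, 'left': 0.1, 'right': 0.1}
# import seaborn as sns; sns.heatmap(grid) would show the board as a heatmap
```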
So, as I said, 80% of the time it goes in the intended direction, and 10% of the time it goes perpendicularly to the right or to the left. Using pandas we can also create a really nice visualization of the policy. For most of the squares we simply don't know what to do yet; all we know is that the game exits if we end up in the pitfall or the goal.

We're going to use a discount factor of 0.75, and we're going to set up a function to update these Q values. So you have your discount factor, we create the grid, we have the set of actions, and we make a table of Q values and a table of V values. The V values have the shape of the possible states, so they're in the same shape as the grid I showed you; and the Q values are a tensor, the same shape as the grid but blown out by the possible actions, because you're keeping a value for every possible action in every state. I also have a little helper here that does the indexing for going up, going down, going right, and going left.

Now, this here is our Q function update, so let me talk you through it. If we're in the goal state or the pitfall state, we're done. If we're in the middle box, that blank box I showed you earlier that has no value, we just don't consider it. Otherwise we loop over the possible moves twice, because we're looking at going from one state to another: given that we're in one state and want to take this action, where might we end up? So there's a double loop here; this is technically O(S²·A), so there are a lot of loops involved. Looping over those possible outcomes, we look up the transition probability for each one, defaulting to zero, and then we index into the new state that move would take us to. Then we use this little update equation here: the value of landing in s′ is the grid value at s′, so we index into the grid, plus gamma times the V value at s′, which is our estimate so far for that state. And to take the expected value over all possible outcomes of our action, we just dot the probabilities with those values, which is basically a matrix multiply, so we use NumPy's dot function. That gives you the expected value of each possible action.

So now that we've run this, we get this giant array of Q values; these are our initial Q values. As you see here, there's a big row of NaNs, which is the blank box; a big row of negative ones, which is the pitfall, where you're stuck; and a big row of ones, which is the goal. And if you look at the third row from the top, on the right there's a 0.74; that's because that state is right next to the goal, so all you need to do is go right and you're there. Similarly, if you look at the row right after all the NaNs, there's a −0.76, because if you go right from there, you end up in the pitfall. So this is what I mean by the Q function.
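For concreteness, here is a compact, self-contained sketch of that update. The helper names, the wall-bouncing behaviour, and the starting values are my own assumptions rather than the notebook's code, but the update itself is the one just described: Q(s, a) is the expected value, over where you might land, of the grid reward there plus gamma times the current V estimate.

```python
# A sketch of one sweep of Q-value updates for the grid world above.
# Helper names and the exact indexing are my own; the notebook's code
# may differ, but the update it implements is the same idea.
import numpy as np

GAMMA = 0.75
grid = np.zeros((3, 4))
grid[0, 3], grid[1, 3], grid[1, 1] = 1.0, -1.0, np.nan   # goal, pitfall, blank box
ACTIONS = ["up", "down", "right", "left"]
MOVES = {"up": (-1, 0), "down": (1, 0), "right": (0, 1), "left": (0, -1)}
PERP = {"up": ("left", "right"), "down": ("left", "right"),
        "left": ("up", "down"), "right": ("up", "down")}

def next_state(s, move):
    """Index the neighbouring cell, bouncing off walls and the blank box."""
    r = min(max(s[0] + MOVES[move][0], 0), grid.shape[0] - 1)
    c = min(max(s[1] + MOVES[move][1], 0), grid.shape[1] - 1)
    return (r, c) if not np.isnan(grid[r, c]) else s

def q_sweep(V):
    """One sweep of Q(s, a) = sum over s' of P(s'|s, a) * (grid[s'] + GAMMA * V[s'])."""
    Q = np.zeros(grid.shape + (len(ACTIONS),))
    for s in np.ndindex(grid.shape):
        if np.isnan(grid[s]):            # the blank box: not a real state
            Q[s] = np.nan
            continue
        if grid[s] != 0:                 # goal or pitfall: the game is over
            Q[s] = grid[s]
            continue
        for i, a in enumerate(ACTIONS):
            probs = {a: 0.8, PERP[a][0]: 0.1, PERP[a][1]: 0.1}   # 80/10/10 model
            values = [grid[next_state(s, m)] + GAMMA * V[next_state(s, m)]
                      for m in probs]
            Q[s][i] = np.dot(list(probs.values()), values)        # expected value
    return Q

Q1 = q_sweep(np.zeros(grid.shape))   # first iteration, starting from V = 0
print(np.round(Q1[0, 2], 2))         # Q values for the cell just left of the goal
```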
For each slot in that tensor, you have what happens if you take that action from that state, accounting for where you might end up. Once we have these Q values, as I mentioned, it's just a max and an argmax to get what you want. The best value for a given state is the value of the action that maximizes your reward, so that's a max of that giant tensor over the actions for each state; and the policy is the argmax. That's pretty easy: we just use NumPy's max and argmax over the action dimension of the tensor, which is axis 2, and then write the results back into the pandas DataFrame.

And here we are again. I can now run this and see things get brighter. As I said, the top-right corner is now turning blue, because from there you want to go right. The states near the blank box still have low values, which is why they barely show up yet. Let's look at what the policy looks like so far. Right now most cells just say go up or go down, which mostly bumps you into a wall and leaves you where you are; there's a chance you end up going left when you're in the bottom corner; and when you're right next to the goal, you obviously go right.

Now we're going to do it one more time and see what happens. As you'd expect, these values change. Looking at the Q values, if you go two rows up from that row of ones, a higher number is now coming through, because that state is two steps from the goal: you're now effectively looking two steps ahead, so the agent starts saying, I'm two steps away from the goal, I should start going right. And if we plot it, you can see that, looking two steps ahead, you start to want to go up and you start to want to go to the right.

Now we just combine these steps and run to convergence; there's a rough sketch of this loop below. Once the values have converged relative to some epsilon, we know the final policy. These are the final values of what you can expect based on where you are: the farther you are from the goal on the board, the more work it takes to get there. And the policy is now laid out on this table. If you're in the bottom-right corner, you now want to go left: originally the policy there just had you bumping into the wall, but now it sends you left, away from the pitfall. So this is a really simple toy example; things get a lot more complicated, but I hope this gives you some intuition of what's going on.

Going back to what I was talking about: this is all nice and dandy, and it looks really easy when you have a simple problem like this. But I'm sure you're all familiar with the paradox of choice. When you have too many options in too many scenarios, optimization gets really painful; as I said, this thing is O(S²·A). So when you have an overabundance of choices, a simple way to cope is to restrict yourself to a reasonable set of options. In the case of the steam heating problem we were working on, we only considered start times every 15 minutes, because that's a reasonable bucketing of start times, and for temperature we looked at ranges of five degrees, because, again, it's a simple way of keeping the problem tractable.
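Here is a rough sketch of that convergence loop, continuing from the snippet above (it assumes the grid, ACTIONS, and q_sweep defined there). The epsilon test, the handling of the blank box, and the exact numbers it prints are assumptions; they won't match the slides exactly, since the reward convention above is a guess, but the max/argmax extraction of V and the policy is the idea being described.

```python
# Continuing the sketch above: run sweeps until the values stop changing by
# more than some epsilon, then read V off as a max over actions and the
# policy as an argmax. Assumes `grid`, `ACTIONS`, and `q_sweep` from before.
import numpy as np

def value_iteration(eps=1e-4, max_iters=1000):
    """Repeat q_sweep until V stops changing by more than eps."""
    V = np.zeros(grid.shape)
    for _ in range(max_iters):
        Q = q_sweep(V)
        Q_safe = np.where(np.isnan(Q), -np.inf, Q)   # blank box: never picked
        V_new = Q_safe.max(axis=2)                   # best value per state
        V_new[np.isnan(grid)] = 0.0                  # keep the blank box at zero
        if np.abs(V_new - V).max() < eps:            # converged within epsilon
            return V_new, Q_safe
        V = V_new
    return V, Q_safe

V, Q_final = value_iteration()
policy = Q_final.argmax(axis=2)          # index of the best action per cell
print(np.round(V, 2))                    # final values: brighter near the goal
print(np.array(ACTIONS)[policy])         # the policy as a table of moves
```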
And as I said, once you've computed this policy, you have a nice lookup table, that pandas DataFrame, which is really easy to understand. It's easy to store, it's easy to interpret, you can put it in a database, and people can actually look at it and query what's going on. In the project we were working on, we used machine learning, regression models, to estimate the immediate rewards, the equivalent of those little grid values. And there are also algorithms that cut down on the number of iterations, and algorithms for learning the optimal policy when you don't know these Q values up front.

Reinforcement learning obviously has some very natural analogies. If you look at the human brain, the sensory system lets creatures react to stimuli automatically: put out food and they all go to the food; put out a bear and they all run away. So there's clearly a quick reaction and a strong relationship between the brain and sensory information. Google DeepMind took this comparison to the next level by marrying recent developments in convolutional neural networks with Atari games. As I mentioned, you can use machine learning to build function approximations of different things; in this case, they built a function approximation of the Q values, those value functions we were talking about. Ultimately, based on the raw pixels of the Atari screen, they were able to come up with movements that made sense and were optimized to improve the score of various Atari games. This is the paper that got people really interested in reinforcement learning today, because it's a really hard problem: you have a big state space, so coming up with meaningful approximations makes a lot of sense.

I'm sure many of you have seen the Go match between Lee Sedol and AlphaGo. This is the paper behind that, and it takes the things we were talking about, values and policies, and uses self-play. They were able to deal with the state space of Go, which is enormous; I can't remember the exact numbers, but it's something really hard. By simulating play forward, they were able to come up with estimates of what was going to happen. The full system ended up running on a serious amount of hardware. It's a really interesting paper, and I suggest you all read it; there's a lot of good stuff in there.

If you're interested in doing your own work with deep reinforcement learning, there's now a new library from OpenAI, another machine learning lab, which provides a number of environments for deep reinforcement learning. I'm just going to exit and show you what the home page looks like. They have Pac-Man, they have various Atari games, I think they have Go, and they also have a Kaggle-like setup that lets people compare their performance. They also have simpler reinforcement learning games: for example, there's a game called CartPole, where you have to balance a pole that's sitting on a cart, and there's Pong. So you can compare how you're doing against people all around the world. And if you'd like to learn more, the Sutton and Barto book is really great; it's online, it's the reinforcement learning textbook, and it introduces the classical concepts. And again, as I mentioned, there's OpenAI's Gym.
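As a taste of what that library looks like, here is a minimal random-agent loop against CartPole. The interface shown is the classic Gym API; exact signatures vary by version (newer releases return an info dict from reset and a five-element tuple from step), so treat this as a sketch rather than canonical usage.

```python
# A minimal random-agent loop against Gym's CartPole, just to show the shape
# of the interface. Signatures vary by Gym version, so this is a sketch.
import gym

env = gym.make("CartPole-v0")
obs = env.reset()                               # initial observation of cart and pole
total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()          # a random policy, no learning
    obs, reward, done, info = env.step(action)  # one MDP step: s, a -> s', r
    total_reward += reward
print("episode return:", total_reward)          # a real agent would maximize this
```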
I also mentioned the deep Q-learning paper from 2015, and the Go paper as well. And if you're looking for online resources, I happen to really like UC Berkeley's AI course. They have a lot of videos online, they have some really nice Python code for Pac-Man if you're interested in using that, and some nice lectures on Markov decision processes. It's really neat; I'd strongly suggest you look it up.

So, as I said, I think reinforcement learning is a really interesting area. There are a lot of opportunities to use it for organizational problems, to optimize processes, and of course there's really interesting work happening in robotics as well. I think it's going to make a huge impact in the future. I hope you've learned a little more about it, and I hope you take some time to use Python to build your own agents, since there are a lot of really interesting tools out there. I'll take some questions. Thanks.

Yeah, sorry, I can go a little slower, give me a minute. Did I get rid of it? No, there it is. Okay, which matrix are you asking about? So this is the value matrix: the expected value of being where you are, assuming you then follow the optimal policy. For example, in the bottom-left corner, the best score you can get is 0.21, because you're really far away from the goal and it's going to take a number of moves to get there. If you're really close to the goal, your score is higher, because you're expected to get there sooner. So, given that you're going to follow the optimal policy, this is the score you can expect based on the rewards available to you.

Oh, this one? Yes, this is Q; this is where those numbers are populated from. It's a three-dimensional tensor, and taking the max along the third dimension, which is axis 2 in NumPy, gives you those values. So for example, this row is for cell (0, 0) in the grid: you have four possible actions, these are the possible scores for each option, and you take the max. Given that they're all equal, it just picks one. And this row is for cell (0, 1), I think, if I'm indexing correctly: given the possible actions up, down, left, right, it wants to go right, because the last entry in that row, the one for going right, is the highest of the options, so that's the number that gets plugged in.

In a corner, shouldn't two of them be zero? No, and the reason is that this is only after two iterations, so this is just what we know so far. In this notebook I've hidden the later updates to the Q values, because I decided not to print them all out, but let me pull up the final array. So here it is: these are the possible values for each action, and you still want to go to the right; even when you're in the top-left corner, you still want to go right. So this is basically the score you end up with for each of the options of going up, down, left, or right. And, if you don't all want to run out to lunch...