Okay, good afternoon everybody. Hopefully we can bring reinforcement learning to a stage where we know at least one reinforcement scheme we can actually use, and that has to be Q-learning, the simplest type of reinforcement agent there is. Then we will see what we can do next week.

We said we have basically two potential approaches: we can go after the value of the state, or, second method, the value of the action. That brings us to very different types of technology. So either I look at the value of the, well, we cannot really call it feature space here, it is really the state, because you are dealing with a dynamic environment; fundamentally the features change all the time, so we cannot call them features. We look at the state, or we look at the action, and depending on that you get two families of reinforcement learning schemes.

One is policy iteration: you define a policy and you iterate over the policy to get to an optimal policy. This is what Monte Carlo type approaches do, and also TD, temporal differencing, the TD(λ) methods. If you decide to look at the state, you basically go with the policy, because I have to develop the policy and iterate over it to find a good one. Either we put the emphasis on sampling, and we sample as much as we can to get a picture, or we look at what is improving and what is not improving as I go from step n to n+1: temporal differencing, the difference between what is now and what comes in the next step.

Or, option B, we do value iteration, which brings you to Q-learning, one of the simplest reinforcement learning techniques we can possibly come up with. Sorry, I smashed these together on the board: for policy iteration we evaluate the policy to improve it, and for value iteration we maximize the accumulated reward for the pair (s, a), state and action. Very different types of approaches. From an AI perspective, some of us are more comfortable with value iteration, because we know how to accumulate things: we have been accumulating error for neural networks and then minimizing it. Now I accumulate reward and maximize it; many things should feel similar to what we have been doing for neural networks.

Okay, so Q-learning was one of the first really simple techniques that enabled us to design and program reinforcement agents. Now the question is: how do you deal with rewards? Do you want to keep all of them? Do you want to somehow come up with a summary of them? How do you do that?

If we are talking about an MDP, a Markov decision process, and we do assume it is a Markov decision process, we don't know what order it is. Is it order one or order n? What is a Markov decision process of order one? To understand a Markov decision process of order one, I just need to know what happened at n−1; then I know what is happening at n. At order two, I have to know n−1 and n−2. At order m: n−1, n−2, n−3, n−4, and so on; I have to keep the entire history. My classical example: I come to class and I'm just in a bad mood, the first time you see me. If my behavior is a Markov decision process of order one, that means I just didn't get my coffee, so I'm grumpy.
If it is a Markov decision process of order m, that means I had a bad childhood, basically. Very different: you need to know the history. If it is a Markov decision process of order n, you need to know the entire history, so you have to keep everything.

What is everything? The rewards. The driving force of reinforcement learning is reward and punishment. Keeping everything means your value, let's call it Q, is the average of the rewards: Q_k = (r_1 + r_2 + ... + r_k) / k. I want the average of the rewards, right? That's the easiest way to get an idea whether my agent is doing well or not. My robot does something, it gets +1. Something else, +1; something else, −1; something else, +5. I add them, average them, and I know whether, in general, my robot is doing well. But then I have to keep everything, and that's not a good thing. People write it in a paper and it's a nice equation, but if you want to implement it, it's a pain in the neck. I don't want to keep everything; that requires a lot of memory, and I don't want to deal with memory allocation.

But you can also do it incrementally. We go back to statistics and grab the equation that gives us a running average: the average at k+1 is the average at k, plus 1/(k+1) times the difference between the reward at k+1 and the average before:

Q_{k+1} = Q_k + 1/(k+1) · (r_{k+1} − Q_k)

I'm keeping the average, not everything. I just need to know my average so far, and I keep updating it; I don't need to keep everything to recalculate the average again and again. Okay, I like it, that's a formula we can work with. And it makes this, for me, a Markov decision process of order one: I need to know what I had at n−1, and I can calculate n. This assumption is very important for people who are new to reinforcement learning. At any moment you have to ask yourself: what type of Markov decision process is my agent? Is everything I need to know in order to navigate contained in n−1? If it is, we have an easy life. Sometimes it is not, but we make the assumption that it is. Okay, good, everybody's happy. So keep this running average in mind. It comes from statistics; it's just a smart way of not keeping the old individual rewards but keeping only the average and updating it. That's it.

Okay, great. Now, there are many things in AI that nobody tells you; I don't know why, people treat them like secrets. There is a common update rule that you see in different forms again and again, but because you see it in formulas with strange notations, indices, superscripts and subscripts, it's difficult to recognize that it is the same stuff. The common update rule is this:

new estimate = old estimate + step size × (target − old estimate)

The step size is something we define arbitrarily, for the sake of convenience and to have some control; the target is what the estimate should be. This is a pattern you will see again and again.
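To see that the running average is just this update rule with step size 1/(k+1), here is a minimal sketch in plain Python; the reward sequence is the +1, +1, −1, +5 example from above:

```python
class RunningAverage:
    """Incremental mean: Q_{k+1} = Q_k + 1/(k+1) * (r_{k+1} - Q_k).

    Gives the same result as storing every reward and averaging,
    but needs O(1) memory: an order-one update.
    """
    def __init__(self):
        self.q = 0.0   # current estimate (average of rewards seen so far)
        self.k = 0     # number of rewards seen

    def update(self, reward):
        self.k += 1
        # new estimate = old estimate + step size * (target - old estimate)
        self.q += (1.0 / self.k) * (reward - self.q)
        return self.q

avg = RunningAverage()
for r in [+1, +1, -1, +5]:
    avg.update(r)
print(avg.q)  # 1.5, the same as (1 + 1 - 1 + 5) / 4
```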
Go back and think about the delta rule and the generalized delta rule: the same thing. The magic equations that we still use in backpropagation to update the error in the network: the new estimate is always a modified version of the old estimate, and the modification is fundamentally the difference between what I'm calculating and what it should be, times a convenient factor so I stay in control. Again and again and again. But then we dress it up with notation, a gamma here, an alpha there, a subscript ij, and you don't see it anymore. Just keep this in mind: every time we update something, neural network, reinforcement learning, it doesn't matter, we work with some form of this rule for updating. And this is the learning part. Some sort of weight has to come in, because if there is nothing adjustable, there is no learning. Something has to give.

Okay. Now, before we go on, a word of caution: we have what we call the state-space problem in reinforcement learning. Let me use an example. Say you have four floors and three elevators, three columns where the elevators move. Say the elevators are at floors two, two and four, counting from the bottom; if you ask me what the state of this elevator system is, I say (2, 2, 4). I just made that encoding up for my own convenience; if you can describe the state in any other way, do it. My elevators move vertically, I look at every column, note where each one is, and I count from bottom to top. Establish a convention, count, and you get the state.

How many different states do I have here? The first elevator can be at floor one, two, three or four; the second one as well; the third one as well. So I have 4 × 4 × 4 = 64 different states. All of them could be at one, all at four, they could be at one, two, three, whatever. Of course, this is a ridiculously simple example; we don't work on this sort of stuff. If I go to the Empire State Building, say 120 floors and six elevators, then I have 120^6 states. That's a problem, because we wanted to create a table, and that would be a gigantically long table. Tables will be a headache for us in reinforcement learning, and that's where neural networks come in. Just a word of caution: as big a problem as the feature space was for clustering and neural networks, for reinforcement learning the state space is a much bigger headache, because these systems are dynamic and I have no topology; I'm just dealing with the system as I interact with it.

Okay, so we want to get to something we can work with. How do we design reinforcement learning agents? It's not straightforward. First, you have to discretize your states. If I say (2, 2, 4) is a state, I cannot put that in a table directly; I have to come up with something and say: you know what, (2, 2, 4), by definition, means state number 112. Well, I only have 64 states, why should I use 112? Make it 62: state number 62 is (2, 2, 4). I have discretized my states.
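As a sketch of that discretization, for the four-floor, three-elevator example; the base-4 encoding here is my own arbitrary convention, exactly as arbitrary as picking 62:

```python
FLOORS, ELEVATORS = 4, 3

def state_to_index(floors):
    """Map a state like (2, 2, 4) to a unique index in 0..63.

    Treats the tuple as a base-4 number; any bijection would do.
    """
    idx = 0
    for f in floors:          # f is a floor in 1..4
        idx = idx * FLOORS + (f - 1)
    return idx

def index_to_state(idx):
    """Inverse mapping: table index back to (f1, f2, f3)."""
    floors = []
    for _ in range(ELEVATORS):
        floors.append(idx % FLOORS + 1)
        idx //= FLOORS
    return tuple(reversed(floors))

assert state_to_index((2, 2, 4)) == 23   # one of 4**3 = 64 states
assert index_to_state(23) == (2, 2, 4)
```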
So you have to recognize the states, get all combinations, and discretize them so I can put them in a table. Basically, I put a number on each state: what is state number 25? State 25 is (1, 3, 2). Now extend this to playing chess, playing Go, a robot moving. How many states do I have if I'm moving in this room and my discrete unit is one foot by one foot? It can be a challenge.

Second, of course, you have to define the actions. What are the actions? For the elevators, an action is a package, because I have three elevators: send up, send up, do nothing. Another would be: send down, send up, send up. So how many actions do you have? For every elevator I have send up, send down, do nothing, but then I have the combinations between them: 3 × 3 × 3 = 27 joint actions, as in the sketch at the end of this passage. So: define the actions.

Third, determine the reward and punishment. That can be very difficult. Elevator control again; I keep going back to the elevator because we are engineers, we should understand this. What is the reward and punishment for controlling an elevator system? What is a good elevator system? Minimal passenger waiting time. So the total passenger waiting time, at any time, across all floors, should be minimal: if it is minimal, you get reward; if it is more than that, you get punishment. Or what about this: I give you the total passenger waiting time directly as punishment, and you want to minimize the punishment, drive that number to zero. Total waiting time zero? Good luck with that on a Monday morning, but okay, we can try. So determining the reward and punishment is a design factor; you have to design it again and again. We didn't have to do this for neural networks. With neural networks it's clear: you have the error, always, no matter what the problem is: face recognition, text recognition, voice recognition, signature recognition. But here: what is the reward and punishment for playing chess? Do you want to give it for every move, or for the entire game? It's a design question; we have to think about it.

Fourth, establish the action policy. How do you want to take actions? That's very important. I just said one set of actions is: elevator one goes up, elevator two goes up, elevator three doesn't move. How do you want to make that decision? Selecting an action is one thing; taking the action is another. Taking the action is based on the action policy. Do you want to take actions randomly? Do you want to be greedy and take the actions that were useful so far? You need to set an action policy.

So, sticking with how to take actions: of course you can take random actions. You might say, what kind of elevator controller, or chess or Go or backgammon player, takes random actions? At the beginning: exploration. I have no idea; I have to get a picture of the system. We don't do this online; we don't send the Mars rover to Mars and then take random actions, nobody does that. In the lab, we do it. Most of the exploration is done in the lab: we simulate the environment, we run, and then you can go nuts, take as many random actions as you want. And usually the agent plays with itself: you create two instances of the agent, and if it is a board game, you play against your own copy.
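And a sketch of the action combinatorics mentioned above; the command names are made up, any encoding works:

```python
from itertools import product

COMMANDS = ("up", "down", "stay")   # primitive commands per elevator

# One joint action is a command for each of the three elevators,
# e.g. ("down", "up", "up"); there are 3**3 = 27 of them.
JOINT_ACTIONS = list(product(COMMANDS, repeat=3))
assert len(JOINT_ACTIONS) == 27

# A fixed numbering, so joint actions can index columns of the Q table.
action_index = {a: i for i, a in enumerate(JOINT_ACTIONS)}
print(action_index[("down", "up", "up")])
```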
Take random actions, why not? A second approach is a greedy action policy: take the action with the maximum reward. You can do that. At any moment, even at the beginning when the entries in my reinforcement table are all random, I can still go with the maximum, because those numbers in that table of states and actions are accumulated reward: the bigger the number, the higher the value of that action for that state. Just go with the maximum; I want to take the best action. That's the greedy approach. If you do that, you are not exploring anymore, you are exploiting. And it has a price: you cannot stick with pure exploitation, because the point you are exploiting was valid for Wednesday 2 p.m., and now it is 3 p.m. and suddenly everything collapses, because three lectures end at the same time and your equations don't mean anything anymore. Suddenly things move. Non-stationary: the location of the solution changes.

Third, the epsilon-greedy action policy: you take the max action, the action with the maximum reward, with probability p = 1 − ε, and a random action with probability ε, where ε is of course a small positive number. Set ε to, whatever, 0.01, one percent: 99% of the time I take the best action according to the accumulated reward, and in 1% of the cases I take a random action. I'm just putting some numbers out there. Perhaps ε should be large at the beginning; at the beginning, randomness should dominate everything. And then ε, like the many factors we had for self-organizing maps, gets an exponential decay and comes down: over time you should stop exploring and start exploiting.

And the last one: softmax action selection. With softmax, we calculate a number between 0 and 1 for every action using an exponential function, because all the others, even epsilon-greedy, are rather abrupt: you switch from greedy to random like a relay, and that's not good. We want a soft transition between what is random and what is not; in other words, we want a trade-off between exploration and exploitation. For softmax we use something like the Gibbs or Boltzmann distribution. Again, these come from physics; I mentioned Gibbs sampling for the RBMs, the restricted Boltzmann machines, but didn't go into detail because we didn't have time. You can grab one vector, just one, leave everything else the same; there are really neat tricks there.

So with the Gibbs/Boltzmann distribution, the probability of an action a is the exponential of the entry of your reinforcement table, Q(s, a), divided by a temperature value τ, and then normalized by dividing by the sum of exp(Q(s, b)/τ) over all actions:

P(a) = exp(Q(s, a)/τ) / Σ_{b=1..n} exp(Q(s, b)/τ)

where n is the number of actions. The Q here is, for me, the table; I don't know any other way to implement it at the moment. It has to be a table: my states, my actions, and the accumulated reward in the cells. It doesn't matter whether the action was taken randomly, greedily, or epsilon-greedily: for any action I get a reward or punishment and I put it into the table, adding it in with the running average. So here you are.
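A minimal sketch of these selection rules over one row of the Q table; numpy is assumed, and the Q values, ε and τ are made up:

```python
import numpy as np

rng = np.random.default_rng()

def epsilon_greedy(q_row, epsilon):
    """With probability 1 - epsilon take the max-reward action, else random."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))     # explore
    return int(np.argmax(q_row))                 # exploit

def softmax_action(q_row, tau):
    """Gibbs/Boltzmann: P(a) = exp(Q(s,a)/tau) / sum_b exp(Q(s,b)/tau)."""
    prefs = np.array(q_row, dtype=float) / tau
    prefs -= prefs.max()                         # for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(rng.choice(len(q_row), p=probs))

q_row = [0.5, 1.2, -0.3]     # accumulated rewards for one state, three actions
print(epsilon_greedy(q_row, epsilon=0.01))
print(softmax_action(q_row, tau=0.5))
```

Decaying ε or τ over time, like the factors in the self-organizing maps, is what moves the agent from exploration toward exploitation.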
So on the board, this axis of your Q table is the actions and this axis is the states. And actually I'm exaggerating: the real table is so narrow and so high that you wouldn't even recognize it as a table in memory, because you have millions of states but only a handful of actions. What we are doing in the denominator is summing over all actions, to normalize the probability for state s and action a; this cell is your (s, a). That gives me a probability: a smooth function, a distribution, fine. Basically you could take any distribution, but empirically we know something like Gibbs/Boltzmann gives you a nice smooth exponential decay over time; the transition is okay, there is no disruption, so I can use it. So at any moment, depending on what is in that table, which at the beginning is completely random, I can calculate a probability for any action. Based on that probability I make the decision: should I take that action or not? Very simple. Or I do epsilon-greedy: take the maximum with some probability, and that probability could itself come from the softmax. The softmax just gives us a smoother transition when we are dealing with probabilities. And again, people call τ the temperature, that comes from physics, and it gives you control: do you want the exponential decay to drop steeply or gently? We need control over learning, because we don't have infinite resources. I want to run experiments: what happens if I go very fast from exploration to exploitation? I try it and I see my agent sucks. Okay, so I cannot go that fast; let's go a little smoother and see what happens. We need that type of control.

All of that means the reinforcement learning agent, the RL agent, learns a policy π. What does that mean? The classical example in the textbooks is called grid world. You have a world which is discrete, and this world could be this room: there is a target, and there is a robot that is moving. Search for "grid world, reinforcement learning" and you can find Java applets with nice demos online; just watch what happens. Now, the task here is very different from a neural network: reinforcement learning handles non-stationary problems. What does that mean? The target is here, then it could be there, and there, and there; it's not fixed. If it were fixed, what do we call a problem that is fixed? Stationary; but what do we call it from the perspective of the algorithm? A deterministic problem. And if the problem is deterministic, you don't need AI: you just enumerate with two loops, three loops, four loops, and you find the answer. You need AI when the problem is stochastic, not deterministic. So now the problem is this: given the grid world, a discrete world, and a robot that can move in discrete steps, forward, backward, left, right and so on, how can I learn a policy to navigate in this world? Now it becomes interesting. Here is the best example, and you can see it next time you fly: the airport, and a robot that is cleaning.
And of course you cannot assume everything is fixed. Yes, there is a column here that is fixed, it doesn't move. But passengers move, people move, so you cannot assume your obstacle is always here; it could be there, it could be here. It's stochastic. So what is the policy? The policy is this: no matter where the robot is, no matter where the target is, you find the shortest path to the target. At any position, you know the right action to take to move toward your target as quickly as possible. If you have that for the entire grid world, that's a policy. A policy means: it doesn't matter where you deploy the cleaning robot in the airport, you just turn it on and it goes. It will start, of course, with what it has learned, because there is a Q matrix it has learned from the previous days. But today is a new day; the column in the airport doesn't change its position, but the people do, or the target does, whatever the target is. So you use the policy, exploiting it most of the time, but you should stay ready to explore, because today things are different in the airport. There is no neural network that can deal with this. Neural networks are trained and then deployed as a static solution, whereas reinforcement agents are dynamic solutions: they adjust. In that sense one could say, and I'm sure Richard Sutton loves to hear this, that reinforcement agents are smarter than neural networks, because they keep adjusting themselves. But they have their own challenges; it's not as if they can deal with every type of problem.

OK, so formally: at step t, π_t(s, a), the policy that describes the relationship between action a and state s at iteration time t, is the probability that a_t = a when s_t = s:

π_t(s, a) = P(a_t = a | s_t = s)

So the policy at any given time tells you how likely it is that this action is the right action to take. And like any other probability, it can be wrong. The probability that I'm sitting in an airplane that crashes and I die is about 1 in 45 million: we have around 45 million passenger flights per year, and one or two of them fatally crash. Great, that makes air travel very safe. But if I happen to be sitting in that one airplane that is going down, I'm sure I will not like probability theory on the way down. It can happen; things can go wrong. That is the characteristic of any dynamic environment. So reinforcement learning methods, RL methods, enable a change of policy, of course, because if there is no change, there is no learning: a change of policy based on experience. Labeled data, unlabeled data, supervised learning, unsupervised learning: none of those terminologies apply to reinforcement agents. Reinforcement agents are a completely different animal in AI, very different, and you have to love them, otherwise you cannot work with them. There are not many people who can work with reinforcement agents, because you have to be patient: reinforcement agents are gamblers to a certain extent, and you are making them work for you in totally unknown environments. AlphaGo is not a CNN. The Mars rover is not a CNN.
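To make π_t(s, a) = P(a_t = a | s_t = s) concrete: a stochastic policy is nothing more than one probability distribution over actions per state. A toy sketch, with invented states and probabilities:

```python
import random

# pi[s][a] = probability of taking action a in state s; each row sums to 1.
pi = {
    "s0": {"up": 0.7, "down": 0.2, "stay": 0.1},
    "s1": {"up": 0.1, "down": 0.1, "stay": 0.8},
}

def sample_action(pi, state):
    """Draw a_t from pi(. | s_t = state)."""
    actions, probs = zip(*pi[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(sample_action(pi, "s0"))  # "up" about 70% of the time
```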
By the way, reinforcement learning is not one algorithm; it's a class of problems. It's not one technique where you say: what was it, give it to me, page 25. No. It's a concept, an idea: interact with the environment, receive reward and punishment, adjust yourself, proceed, and hope for the best.

So how do we maximize rewards? We have two types of reward-maximizing setups. Either you have episodic tasks: an episodic task means whatever you are doing ends at some point. You are playing Go; after one hour somebody wins, it's done, that's an episode. Playing chess, the same thing. If we turn off the elevators at midnight, elevator control is an episodic task; but if it is ongoing, and any time you come in the elevators are available, that's not episodic, that's a continuing problem. In episodic tasks, the return at any given time t is

R_t = r_{t+1} + r_{t+2} + ... + r_T

where T is the terminal step, when things end. You just add them up, either literally or with the running average, but then you know how well you did. How many moves do we have in chess? You can come up with an average, whatever, say 220 moves; after an average of 220 moves the game ends. That's one episode, that's the end; you can calculate who won and how you did.

Second, for maximizing rewards, we have continuing tasks. Robot navigation: there is no end, you are just navigating, unless somebody comes and turns the robot off. If you turn the robot off, fine, nothing is happening; but as long as the robot is on, it's continuous learning, it just keeps going. Here the return at any given time is the reward you get at t+1, plus a factor gamma times r_{t+2}, plus gamma squared times r_{t+3}, and so on to infinity, because it goes and goes and goes. So, sorry if I cheat a little, I can write the continuing one compactly as a sum:

R_t = Σ_{k=0..∞} γ^k · r_{t+k+1}

We call γ the discount rate, because we need, again, to control something.

Sorry, I got a little messy; we will bring order. Now the question is: what is the value of γ, and how do you control it? The fundamental question for reinforcement agents is this: do you want to be short-sighted or farsighted? Do you want to put value on the early rewards or the later rewards? Do you want to go out today and make a few bucks on the stock market, or do you want to invest your money and wait 25 years until you retire? What is the thought process here? You have to answer that question: if you let the robot go, you should have answered it for the robot. Is the robot short-sighted or farsighted? Among other things, we bring that in with γ. So what is γ and how do we handle the discount? γ goes between 0 and 1. Anything we introduce in machine learning, we usually don't want to leave the interval [0, 1], because we accumulate stuff and things would explode otherwise; I cannot define it between 0 and 10,000, it breaks my memory. I want to keep it in a small range. If I set γ toward 0, you are short-sighted; if γ goes toward 1, you are farsighted. You can play with that.
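A sketch of both returns; the invented reward sequence shows how γ near 0 makes the agent short-sighted and γ near 1 farsighted:

```python
def episodic_return(rewards):
    """R_t = r_{t+1} + r_{t+2} + ... + r_T for an episode that ends."""
    return sum(rewards)

def discounted_return(rewards, gamma):
    """R_t = sum_k gamma**k * r_{t+k+1}, truncated at the rewards we have."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

rewards = [1, 1, 1, 10]                        # a big reward four steps away
print(episodic_return(rewards))                # 13
print(discounted_return(rewards, gamma=0.1))   # 1.12: short-sighted, ignores the 10
print(discounted_return(rewards, gamma=0.9))   # ~10.0: farsighted, values the 10
```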
That's one of the ways we exercise control, because, as we will see, we still have to plug this in somewhere, and it has to have an effect. The reward and punishment is the only driver for learning in reinforcement learning. So how do you accumulate it, and how do you let the agent know whether it did OK or not?

OK: "I didn't get any of this. Tell me an engineering example, in some detail, that I can understand." No problem. Let's look at the inverted pendulum. The setup is this: a cart that can move back and forth on a track, but it doesn't have much space, it is restricted in how far it can move. And there is a pendulum on top. Classical control example for engineering, Control 101. With no motion the pole falls, so in order to keep it balanced and upright you have to move the cart back and forth really fast; that's the control. If you hit the end of the track, it falls. If you stop, it falls. If it is falling in one direction, you move in that same direction to keep it upright, for example. It has to be really precise, and if you do a good job, from far away you cannot even see that it's moving, because you are controlling it minimally. If you have a very smooth controller, a fuzzy controller is one of them, it's extremely smooth.

OK, I want to do this with reinforcement learning. That should not be difficult: the pole can tip this way or that way, so this angle alpha is the critical quantity for us, and you also have an omega, an angular speed, plus an acceleration of the cart along the track. So you can measure two things, and you can influence two things, directly and indirectly. What is the task? Avoid failure. Oh, that should be easy: just keep it upright. If you say that's easy, listen to this: I went to a conference in Japan when I was a student, and they said there was a demo of an inverted pendulum. I thought, I came to Japan to see technology and they want to show me an inverted pendulum? They had a triple inverted pendulum: something on top of the pole, and something on top of that. When they showed it, everybody went quiet. Oh my god. This is a chaotic system, not even stochastic, chaotic; if you keep that up for two seconds you're lucky. OK, we don't want to do that; we want the simple inverted pendulum.

So, avoid failure. First failure: the pole falls. Second failure: the cart hits the wall. So don't hit the wall and don't let the pole fall. Is that simple enough? You are after the contract for programming the Mars rover: solve this first, and then we will talk about the Mars rover.

We can understand this as an episodic task. When does an episode end? When the pole falls, or when I hit the wall; that's one episode. Or do you think you can keep it up forever? Well, maybe, and then our episode is ideally infinite, without failure. If we treat it as episodic, you can say: my reward is +1 for each step prior to the fall. As long as the pole is not falling, you get +1. Just keep it up: +1, +1, +1, +1, +1. You are being rewarded, and reinforcement agents are addicted to rewards; they will do anything for reward.
And then you have your return, and your return is simply the number of steps before the fall. You see, the design here is really different: the return is how many steps you kept it up. There is no built-in end to this, but the end will come; one episode runs until the pole falls. And this will happen a lot at the beginning with a reinforcement agent. The first 10, 20, 50, 100 times you will be frustrated: you start and it falls, you start and it falls, you start and it falls. You go crazy. But the reinforcement agent is filling those spots in that magic table, getting smarter with every iteration.

Or we can understand it as a continuing task. You see, sometimes it's up to us: is this an episodic task or a continuing task? Then I say the reward is −1 for falling and 0 otherwise. It's really subtle: now, if you fall, it's −1, and as long as you are keeping it up, it's 0. And 0 is good; 0 is not a punishment and not a reward, it's less punishment. You don't want the −1. So I keep it up and say: 0, 0, 0, I'm not getting punished, I'm such a good agent. And then the return is −γ^K, for K steps before the fall.

I just want to make the point that the design of the agent is up to us. You can do positive reinforcement, you can do negative reinforcement; you can understand it as an episodic task or as a continuing task. In many cases it's our choice. Sometimes the nature of the problem is so dominant that we have no choice, it simply is episodic and you cannot do anything about it; but most of the time we have the choice to change things.

OK, I'm not going into the full state definition here because it can get messy; maybe I will post something on it. But roughly: what are the actions? Move in this direction with a certain speed, or move in the other direction with a certain speed. When you are balanced, you move back and forth really fast; if things get out of control and you move too fast in one direction, it's gone, it's falling. And what is the state? I have to look at the angle: one degree to the left, one degree to the right, one and a half degrees to the left, one and a half degrees to the right. How many states do I have? Up to what angle can I still save it: 15 degrees, 20 degrees, where I can still accelerate in the opposite direction and bring it back? We have to do our homework and think about those conditions.

OK. In reinforcement learning we generally talk about temporal differencing methods, and they are really powerful. We usually just call it TD; reinforcement people say TD all the time. Temporal differencing is the most powerful scheme in reinforcement learning. First, we can do TD prediction. TD prediction is basically policy evaluation. If I can evaluate a policy correctly, I can tell you what happens tomorrow in the stock market, because I have developed a policy; if I evaluate the policy correctly, I can tell you: tomorrow, this will happen. I cannot make the same kind of prediction about cancer, because the stock market is a dynamic, stochastic environment and a cancer diagnosis is not: in the market, I can take an action, buy stocks, and that action changes the environment.
But if I say "this is cancer," maybe the patient gets surgery, and I get no feedback: the patient has the surgery and leaves, the problem is done. There is no stochastic environment that my action keeps changing. How many times do you see people apply reinforcement agents to problems that are not reinforcement problems? It has to be a dynamic environment, and the action you take has to change the environment. Does it? Ask yourself: does the action change anything? Say I define a reinforcement learner that changes the colors of an image. OK, and then what happens? Nothing, it just looks beautiful. Or: I recognize a face with reinforcement learning. Then what happens? Somebody gets access to a space. Then what happens? Maybe that somebody turns off the light, and then, OK, now that is a dynamic environment. The action you take should change the environment.

So for prediction we have to compute the state-value function V^π for a given policy π. Prediction in reinforcement learning is basically evaluating a policy. And what is the policy? That table, the lookup table. It's a magic lookup table, an ever-changing lookup table, not a static one; there is a lot of knowledge in that table. We compute the state-value function: what is the value of every state if I apply the policy I have learned so far, if I use that lookup table? People don't like to call it a lookup table, and there is a reason: most of the time we don't even use a literal lookup table. What else do we use? We put the lookup table into a neural network. OK.

So what is policy evaluation? First, simple every-visit Monte Carlo. The value of state s_t is updated from the value it had before; this is pseudocode, so I won't play with the indices, I'm just reusing the same symbol:

V(s_t) ← V(s_t) + α · (R_t − V(s_t))

where R_t is the return. Isn't that the same update rule we talked about? But I'm applying it to the state, not to the action: policy iteration. I take the value of the state as I knew it before, I have my step size, learning rate, discount factor, call it whatever you want, and I look at the difference between the return, the desired value, and the value I have right now. The same old pattern: the new estimate equals the old estimate plus the step size times (target minus actual value). Nothing magical. The hard part is that we don't know how to calculate the value of a state. What is the value of the stock market right now? How big is that state space, billions of states? What is the value of a given company? You would have to give me all the parameters. It is very difficult to know the value of a state.

Second, the simplest TD, which is TD(0), temporal differencing. Think back to the Markov decision process: is it an MDP of order 1 or order n? We like order 1, right? If I'm grumpy today, you don't want it to be because I had a bad childhood, because then you cannot analyze me: to figure out that I'm a grumpy instructor because of a bad childhood, you would have to read my entire life, and you would be bored. It's better to assume I'm a Markov decision process of order 1: he's grumpy because he didn't get his coffee five minutes ago. That's easier to understand, and we cut him some slack; maybe tomorrow he changes, who knows?
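A minimal sketch of that every-visit Monte Carlo evaluation, assuming trajectories have already been sampled as (state, reward) pairs, where the reward in each pair is r_{t+1}, the one received after leaving s_t:

```python
from collections import defaultdict

def mc_policy_evaluation(episodes, alpha=0.1, gamma=1.0):
    """Every-visit Monte Carlo: V(s_t) <- V(s_t) + alpha * (R_t - V(s_t)).

    `episodes` is a list of trajectories; each trajectory is a list of
    (s_t, r_{t+1}) pairs for one complete episode.
    """
    V = defaultdict(float)
    for episode in episodes:
        G = 0.0
        # Walk backwards so G accumulates the discounted return from t onward.
        for state, reward in reversed(episode):
            G = reward + gamma * G
            V[state] += alpha * (G - V[state])   # the common update rule again
    return V

episodes = [[("s0", 0), ("s1", 0), ("s2", 1)]]   # one toy trajectory
print(dict(mc_policy_evaluation(episodes)))
```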
So the simplest TD method is TD(0). Again, the value of state s_t is updated from its previous value; since this is pseudocode, I'm not playing with the indices:

V(s_t) ← V(s_t) + α · (r_{t+1} + γ · V(s_{t+1}) − V(s_t))

So what's the difference? In Monte Carlo we had the return; here we have an estimate of the return, r_{t+1} + γ·V(s_{t+1}). I'm still doing target minus actual; it's the same thing, I just plugged in a more sophisticated notion of the desired value. That's it, nothing else happened. And I bring in the concept of discounting because I want the freedom to add or take away value from a state depending on where I am in the game. Moving my rook from here to there at the beginning of the game may have one value, and toward the end of the game another value. I want to play with that; I want control over it.

Now, there is another way to compute these values, but if we did that, we would not be doing AI reinforcement learning; we would be doing dynamic programming. What is bad about that? Nothing, I don't want to insult anybody. But for dynamic programming you need the transition probabilities, which is a model. In AI, we build models out of data; we do not start from a model, and we do not use models to create models. In dynamic programming, you take a model, refine it, and make predictions based on the model you have. Observing a system, the stock market, robot navigation, elevator control, chess, playing games, and extracting the transition probabilities is constructing a model; that's dynamic programming, not AI. What is the difference in practice? For us engineers, nothing: whichever works, works. Academically they are different; people do that in operations research, not so much in AI.

OK, how much time do we have? Can we get to Q-learning? Yes, we can; we are pushing it. So: instead of the state-value function, we want to learn an action-value function. Is the difference clear? I don't want to wrestle with the question "what is the value of my stock holdings in the market?"; I don't know the value of my stocks. Instead I want to know: if I sell my stocks, is that a good action or not? You see the difference? The outcome may be the same, the outcome will be the same: I enter this room by that door, you enter by the other door, we are in the same room. You used the state door, I'm using the action door. Whatever makes you happy.

So my Q matrix, the matrix that contains the accumulated reward for state s_t and action a_t, is updated from the values I have so far: Q(s_t, a_t), plus my magic learning rate, or step size, or whatever you want to call it, times the reward I get at t+1. Because if you take an action at time t, the reward is for whatever happens next: you take an action, you change the environment, and the reinforcement signal comes from the change. This is another typical mistake for people who are new to reinforcement: you go into the code, things don't work, things are messy, and you find you are taking the reward at the present time, which belongs to the step before. Markov decision process, order 1: if you take an action right now, the response comes in the next iteration. You change the system, and then you learn whether it was good or bad.
So:

Q(s_t, a_t) ← Q(s_t, a_t) + α · (r_{t+1} + γ · max_a Q(s_{t+1}, a) − Q(s_t, a_t))

The same thing again, I keep repeating it: r_{t+1} is the reward I receive for having changed the environment, and γ · max_a Q(s_{t+1}, a) is my estimate of the return from there on. Together they are the target; I want to maximize my reward, and this whole term together is the target. Target minus the actual value, whatever I have in the matrix right now. Don't get confused: the target here is built from the reward; I'm just bringing in a bit of discounting, discounted value.

So: you start at s_t, you take an action a_t, you go to s_{t+1}, and on the way you get the reward r_{t+1}. I know this indexing confuses people, so once more: I'm here, I take an action that brings me to the state at t+1, and for that action I get the reward labeled t+1. Then I take another action, a_{t+1}, which brings me to state s_{t+2}, and for that I get r_{t+2}. And I continue; episodic or continuing, I continue. So where is the learning happening? These numbers are being put into a table, and I keep adding to them: how much reward am I accumulating for the action a_t which is, say, "send the elevator up"?

OK. The simplest method here is Q-learning, and Q-learning is so simple that you don't even need a Python package for it; you can implement it in 30 minutes if you are good at Python. The learning algorithm, that is, not the design of the states: designing the states may take you a week, or two, or three, depending on the problem. The design of the agent is difficult; the learning algorithm is simple. OK, do we still have time? I set the alarm. OK, we still have time, good.

So, Q-learning. Initialize Q(s, a) randomly. I'm not talking about the design: you have designed your states, you know what your actions are. You create a matrix of size |S| × |A|, the cardinality of S times the cardinality of A, you put random numbers in that table, and you start working. Again, nobody will do that homework for you: designing the states and actions is the problem analysis, and you have to do it. Then, for each episode, we go inside the loop: initialize the state s. Then, for each step, the second loop. So I may have multiple episodes, and within every episode multiple steps. How many iterations do you run a neural network for? Same here: a million times, two million, ten thousand; how difficult is the problem?

Inside the second loop: choose a from s using the action policy. What is your action policy? Random, greedy, epsilon-greedy, softmax? Whatever the policy is, choose an action. Then, having chosen the action, take the action: take action a. Choosing an action is not the same as taking an action. You determine that the best action is "send the elevator up"; now you have to send the signal to the controller and actually send it up. Then you observe the reward, and you observe s′, the next state, what we called s_{t+1}: s′ is s_{t+1}. So you took an action, and the action brings you from the state s to the state s′: you are changing the environment. Which gives you a very simple test of whether you are using reinforcement properly:
if the actions you take have no effect on the environment, you are doing something wrong; you are just making fun of the agent. The agent has the illusion that it is helping you: take this action! It thinks you are sending that action to the controller, but you are not; somebody else is doing that. So you have an insane reinforcement learning agent. Ridiculous.

Then comes the update, with that equation: Q(s, a) ← Q(s, a) + α · (r + γ · max_{a′} Q(s′, a′) − Q(s, a)). Now update the matrix. And after you update the matrix, you update the state: s ← s′, because you changed the system; your elevator is not here anymore, it's there, a different state. And outside of the inner loop, I just write: repeat until s is terminal. Sorry for writing so long. Everything up to "until s is terminal" is one episode.

Is that it? Well, yes. Really, do you need more than 30 minutes to implement this, once you know the states and actions? This is ten lines of Python, it's nothing. But a lot of work has already gone into the design; we have one more minute: designing the states and actions and defining the reward and punishment. So just one more fact: convergence. How do I know I have converged? Every state-action pair has to be visited. This is in the PhD thesis of Watkins, the man who came up with the idea of Q-learning, and there is a mathematical theorem for it, with a proof, that gives a guarantee, provided every state-action pair is visited multiple times. Well, that makes it difficult. What happens if I have a gigantic table, say 10^12 states and, I don't know, 25 actions? We usually don't implement tables like that; you run into programming problems, you cannot even fit it into memory, and you would need every possible trick to update it. So we usually don't. Instead, we grab a multi-layer perceptron or a deep neural network, and the Q-learning lives inside that: we train a neural network to hold the accumulated rewards.

And this is the, sorry, not very smart reason that people talk about "deep reinforcement learning." There is no such thing: there is no depth in reinforcement agents. We call it deep reinforcement learning when we use a deep network to model, or hold, or memorize the Q matrix, but that doesn't give us the terminology or the foundation to call the reinforcement learning itself deep. Reinforcement learning doesn't have a topology that could be deep or shallow, so it's the wrong terminology. Nobody protests, because as long as you are providing results, everybody's happy. Nobody complains, which is OK.

So I hope that was enough for today. In the tutorial we will talk about TensorFlow, so stick around, and we will start a new topic next week.
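To close, a minimal sketch of the whole loop exactly as walked through above: tabular Q-learning with an epsilon-greedy action policy. The `env` object with reset() and step() is a hypothetical interface, and n_states and n_actions come from your own state and action design:

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=10_000,
               alpha=0.1, gamma=0.9, epsilon=0.05):
    """Tabular Q-learning with an epsilon-greedy action policy.

    `env` is a hypothetical interface: env.reset() -> state index,
    env.step(action) -> (next_state, reward, done).
    """
    rng = np.random.default_rng()
    Q = rng.random((n_states, n_actions))     # initialize Q(s, a) randomly
    for _ in range(episodes):                 # for each episode
        s = env.reset()                       # initialize the state s
        done = False
        while not done:                       # for each step of the episode
            # choose a from s using the action policy (epsilon-greedy here)
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)     # take the action, observe r and s'
            # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a));
            # no bootstrap term past a terminal state
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next                        # s <- s'; repeat until s is terminal
    return Q
```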