Have you ever wished you had a crystal ball to predict the outcome of your decisions? In the world of reinforcement learning, Monte Carlo methods are the closest thing we have to that crystal ball: a glimpse into the future of AI decision making. If you could apply Monte Carlo simulations to foresee the consequences of any real-life scenario, what would it be? Let us know in the comments down below, and let's dive into this enchanting world of Monte Carlo methods for reinforcement learning. I'm your host, Jay Hawthorne, and in this video we are going to talk about just that. So let's get to it.

What is Monte Carlo? The name comes from the famous Monte Carlo Casino in Monaco, known for its association with gambling and games of chance. Monte Carlo methods use repeated random simulations to solve problems that may be difficult or impossible to solve analytically.

Where do we use this? Well, how do we calculate the value of pi? There are analytical ways to do it, but the Monte Carlo way goes like this. Say we have a square tray of side 1 and a circular tray of radius 1, sitting side by side on a larger platform. Now we randomly drop a ball onto the platform, so it lands either in the square tray, in the circular tray, or nowhere at all, and we simulate this random dropping over and over. As the number of balls grows, the count landing in each tray becomes proportional to that tray's area. The circle's area is pi times the radius squared, which is pi, and the square's area is 1. So dividing the number of balls in the circular tray by the number of balls in the square tray gives us an estimate of pi. (I'll leave a short code sketch of this at the end of this segment.)

Now here's a completely different application. Let's say you want to host a game of bingo. All right, number 53. Yes. Someone's excited. And number 45. Maybe a little too excited. So you know you're going to have 500 players, and you need to call out, say, 20 numbers. How many people are going to be winners at the end? The Monte Carlo solution, like before, is to run a large number of simulated bingo games and see how many winners come out on average. Executing the simulation, you get this wonderful graph over here: as we run more and more simulated games, the estimate settles at around 1% of all players winning, which comes out to about five to six winners per game. If you want more intuition on how exactly this code is constructed and how you can code out this Monte Carlo simulation, do check out another video right here. (There's also a rough sketch of this one right after the quiz below.)

So now let's talk about the pros and cons of Monte Carlo methods. The main pro is versatility: the same basic recipe applies to wildly different problems, as we just saw. The main con is that simulations can be computationally expensive, so practically speaking, Monte Carlo can't be applied in every single situation.

Quiz time. Have you been paying attention? Let's quiz you to find out. Say we have a square of side 1 and a diamond inside that square, and we run the marble-dropping simulation with a thousand marbles: 500 of them fall within the diamond, and the other 500 fall outside the diamond but still within the square. What is the area of the diamond? Is it 0.25 square units, 0.5 square units, 1 square unit, or 2 square units? Comment your answer down in the comment section below, and bonus points if you can give your reasoning. That's going to do it for quiz time for now, but I'll be back, so pay attention.
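As promised, here's a minimal sketch of the ball-drop pi experiment in Python. The platform size and tray positions are my own assumptions, since the video doesn't pin down the layout; any arrangement where the two trays don't overlap works, because the counts just track the areas.

```python
import random

def estimate_pi(n_drops=1_000_000):
    """Drop points uniformly on a platform holding a unit square tray
    and a circular tray of radius 1, then compare the counts.
    Assumed layout: square covers [0,1]x[0,1], circle is centered at
    (3,1), and the platform is [0,4]x[0,2]."""
    in_square = in_circle = 0
    for _ in range(n_drops):
        x, y = random.uniform(0, 4), random.uniform(0, 2)
        if 0 <= x <= 1 and 0 <= y <= 1:
            in_square += 1
        elif (x - 3) ** 2 + (y - 1) ** 2 <= 1:
            in_circle += 1
    # Counts are proportional to areas: circle has area pi, square area 1.
    return in_circle / in_square

print(estimate_pi())  # -> roughly 3.14
```

With a million drops this typically lands within a percent or so of pi, and the estimate tightens only slowly as you add more drops, which is exactly the computational expense the pros-and-cons discussion warns about.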
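And here's a rough sketch of the bingo experiment. The video doesn't spell out the rules, so this assumes standard 75-ball bingo cards and a simplified win condition (any fully called row, with a free center square); the average it prints will be in the same ballpark as, but not necessarily identical to, the five-to-six winners quoted above.

```python
import random

RANGES = [(1, 15), (16, 30), (31, 45), (46, 60), (61, 75)]  # B-I-N-G-O columns

def make_card():
    # 5x5 card, card[row][col]; each column draws 5 distinct numbers
    # from its own 15-number range, as on a standard bingo card.
    cols = [random.sample(range(lo, hi + 1), 5) for lo, hi in RANGES]
    return [[cols[c][r] for c in range(5)] for r in range(5)]

def wins(card, called):
    # Simplified (assumed) win rule: any fully called row, with the
    # center square (row 2, col 2) treated as free.
    return any(
        all((r == 2 and c == 2) or n in called for c, n in enumerate(row))
        for r, row in enumerate(card)
    )

def average_winners(n_games=500, players=500, calls=20):
    # Bump n_games for a smoother estimate; it converges slowly.
    total = 0
    for _ in range(n_games):
        called = set(random.sample(range(1, 76), calls))
        total += sum(wins(make_card(), called) for _ in range(players))
    return total / n_games

print(f"average winners per game: {average_winners():.1f}")
```

Plotted against the number of simulated games, the running average flattens out toward a fixed value, which is the convergence behavior shown in the video's graph.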
Now on to part two, where we talk about Monte Carlo methods specifically for reinforcement learning. Monte Carlo methods are used in reinforcement learning for two purposes: one is to evaluate a policy, and the other is to improve a given policy. Let's talk about each in detail with our fun, trusty robot, Frank.

A policy can be thought of as the decision-making brain of an agent: based on the situation, the policy dictates what action the agent should take. So this is Frank. Frank, say hi. Hi. What a cutie. Frank is going to take actions based on its policy. Now, it's good that Frank can make these decisions, but how is Frank making them, and how good are they?

An intuitive way to look at this is to put Frank in this wonderful grid world, where the rewards are written in the squares. Wherever Frank starts, the goal is to reach this plus-10 spot without racking up too many of the minus-1 penalties along the way. With Frank's help, we are going to walk through the Monte Carlo method intuitively. So, Frank, do your thing. Three, two, one. Nice. Frank got to the goal, but how good were those actions, actually?

Well, for every state-action pair, we can calculate the future reward and record it in a Q-table. This Q-table is an insight into how Frank's brain works, and thus into how Frank makes decisions. For example, in the very last step, Frank was in state S6 and decided to go down, and the future reward from there was 10. So Q(S6, down) equals 10. One step earlier, Frank was in state S3 and also chose to go down, and from there the total future reward was minus 1 plus 10, which is 9. So Q(S3, down) equals 9. Effectively, we can backtrack like this all the way to the start to fill in the Q-values for this episode.

Now, this is great, but these Q-values are just estimates based on one episode. So why not let Frank do its thing again? Ready, Frank? Three, two, one. No, Frank! You ended up in the poison spot! Looks like Frank is not as good as we thought and can still make mistakes. But like before, for every state-action pair taken, we can calculate the future reward. In the last step, Frank was in state S2 and chose to go right, straight into the poison, so the total future reward is minus 10, and Q(S2, right) becomes minus 10. One step earlier, Frank was in state S3 and chose to go down, so the total future reward was minus 1 plus minus 10, which is minus 11. But from the previous episode, we saw Q(S3, down) come out to plus 9. So we can just take the average of the two: 9 and minus 11 average out to minus 1. And then we keep backtracking to the start.

We can repeat these simulations over and over and estimate the Q-values by averaging the returns over time. As we perform more simulations, you'll notice the Q-values converge to specific values, thanks to the law of large numbers. Once we have these Q-values, you can understand Frank's methodology for taking actions. So when Frank is in state S1, what action will it take? Well, the highest value in that row is 1.5, which corresponds to down.
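To make this backtrack-and-average idea concrete, here's a minimal sketch of Monte Carlo evaluation, fed with the final couple of steps from Frank's two episodes above. One assumption baked in: no discounting, since the video adds future rewards up as-is.

```python
from collections import defaultdict

def mc_evaluate(episodes):
    # Every-visit Monte Carlo evaluation: walk each episode backwards,
    # accumulating the future reward G, then average the returns seen
    # for each (state, action) pair across all episodes.
    returns = defaultdict(list)
    for episode in episodes:           # episode = [(state, action, reward), ...]
        g = 0.0                        # total future reward from this step on
        for state, action, reward in reversed(episode):
            g += reward
            returns[(state, action)].append(g)
    return {sa: sum(gs) / len(gs) for sa, gs in returns.items()}

ep1 = [("S3", "down", -1), ("S6", "down", 10)]    # reached the +10 goal
ep2 = [("S3", "down", -1), ("S2", "right", -10)]  # hit the poison spot
q = mc_evaluate([ep1, ep2])
print(q[("S3", "down")])   # (9 + -11) / 2 = -1.0
print(q[("S6", "down")])   # 10.0

# Greedy action from a Q-table row (these S1 values are hypothetical):
q_s1 = {"up": -0.5, "down": 1.5, "left": 0.2, "right": -1.0}
print(max(q_s1, key=q_s1.get))  # -> "down"
```

This is the every-visit flavor of Monte Carlo; a first-visit variant would record only the first occurrence of each state-action pair per episode, but both converge to the same values here.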
And so, chances are that if Frank is in state S1, the greedy action Frank will take is to go down. Now that we know Frank's Q-table, we know how Frank behaves, and thus we know Frank's current policy.

Now that we know how to determine and evaluate Frank's policy, we also want to see if we can help Frank out a little in improving that policy. So we have Frank in our grid world again. Hey, Frank. Hi. We know this Q-table dictates how Frank will behave, so for the first episode, let Frank do its thing. Three, two, one. No, poisoned again!

All right, Frank was poisoned, but now let's actually help Frank's decision making by tweaking the Q-table. In the last step Frank just took, it was in some state S1 and went down, and the total future reward is just the minus 10 from the poison. So the fresh estimate for Q(S1, down) is minus 10. But Frank, we want you to learn from your mistakes, buddy. So let's not be too harsh on this downward direction; we'll update the Q-value so that it takes into account your previous knowledge, but also this new information. Say the weight on the new information is 0.1: the new Q-value is the old Q-value plus 0.1 times the difference between the new return and the old Q-value. Then we keep backtracking to update the other Q-values in this episode the same way. (I've left a short code sketch of this update at the very end for reference.)

Based on this new Q-table, Frank can traverse the terrain again, and then we update the Q-values again. So it's a continuous cycle: first policy evaluation, like we did before, then a phase of improvement, then evaluation, improvement, evaluation, improvement, episode after episode. Each update nudges Frank's Q-table toward better and better decisions over time. And so, overall, we evaluated Frank's policy and improved Frank's policy with Monte Carlo methods.

Quiz time. Welcome back to another edition of quiz time. Have you been paying attention? Let's quiz you to find out. During policy improvement, how was Frank using a Monte Carlo method? Frank was directly calculating the optimal policy without simulations. Frank was updating Q-values based on observed rewards from simulated episodes. Frank was updating Q-values with dynamic programming. Or Frank was updating Q-values based on Q-learning. Chime in with your answer in the comments down below, along with any additional reasoning you might have.

That's going to do it for quiz time for this video, but before we go, let's summarize the key points. Monte Carlo methods use simulations to solve problems that may be difficult or impossible to solve analytically. They are versatile but can be computationally expensive. And in reinforcement learning, Monte Carlo methods are used for policy evaluation, that is, figuring out how good the actions Frank takes really are, and for policy improvement, that is, helping Frank take better actions. That's all I have for this video. Thank you all so much for watching. If you want to see more content on reinforcement learning, do check out the playlist on the screen right now for all the videos. And I will see you in another one. Bye-bye.
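As promised, here's a minimal sketch of that weighted update, sometimes called constant-alpha Monte Carlo. The 0.1 weight comes from the video; the episode data is a stand-in for Frank's poisoned run, and the zero-initialized Q-table is an assumption.

```python
from collections import defaultdict

ALPHA = 0.1  # weight on new information, as in the video

def mc_update(q, episode):
    # Backtrack through one episode, nudging each Q-value 10% of the
    # way from its old estimate toward the observed return G.
    g = 0.0
    for state, action, reward in reversed(episode):
        g += reward
        q[(state, action)] += ALPHA * (g - q[(state, action)])
    return q

q = defaultdict(float)  # Q-values start at 0 here (an assumption)
# Frank's poisoned run, last step only: down from S1 into the -10 spot.
q = mc_update(q, [("S1", "down", -10)])
print(q[("S1", "down")])  # 0 + 0.1 * (-10 - 0) = -1.0
```

Repeating this loop, acting mostly greedily on the current table and then updating from the episode you observe, is the evaluation-improvement cycle described above; in practice you'd keep a little exploration (for example, epsilon-greedy action selection) so Frank doesn't get stuck repeating the same path forever.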