Consider the following problem. You are faced repeatedly with a choice among K different options or actions, and after each choice you receive a numerical reward drawn from a stationary probability distribution that depends on the action you selected. Your objective is to maximize the expected total reward over some period of time, for example over 1,000 action selections or time steps. This is the original formulation of the K-armed bandit problem, also called the multi-armed bandit problem.

You can see that this is a learning problem, so where do K-armed bandits fit into the big picture of machine learning? Let's take a step back. Methods for solving machine learning problems are called machine learning paradigms, and there are three core paradigms, along with many others. The first, and probably the most popular, is supervised learning, where we have data and a label, and we feed both to an untrained model so it learns to map that data to that specific label. Classification and regression problems fall under this paradigm. Another machine learning paradigm is unsupervised learning, where we have data but no labels, and the goal is to understand and learn patterns within the data itself; this is typically used for clustering and dimensionality reduction, among other tasks. The third paradigm is reinforcement learning. Reinforcement learning is learning what to do, that is, how to map situations to actions, so as to maximize a numerical reward signal. This is the generic definition of a reinforcement learning solution to a problem.

Coming back to the definition of the K-armed bandit problem, we can see that it is actually a specific instance of this broader reinforcement learning problem. At the end of the day, it is a learning problem that maps a situation, the situation of having K options in front of you, to an action, the choice of one of those options, where the choice of action should optimize some reward over time. That is exactly what we see here in maximizing the expected total reward over many time steps, like 1,000 time steps.

To better understand multi-armed bandits, let's consider walking into a casino with four slot machines, where our goal is to walk out with the most money. To simplify this case, let's say that every time we play a slot machine, we either win or we lose. Our goal, redefined, is now to walk out of the casino after playing a thousand times with the most wins we can possibly garner. In this situation, the first slot machine gives us a 70% chance of winning each time we play it, the second a 30% chance, the third 55%, and the fourth 40%. If we knew all of this information, then to get the most wins we would go directly to that first slot machine and just keep playing it 1,000 times. We would win approximately 700 times, which on average is the maximum we could possibly win. However, walking into a casino, we know none of this information; we don't know these probabilities of success. So how do we learn? The first idea is exploration, that is, trying to figure out which slot machine is good for us by selecting machines at random.
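To make this concrete, here is a minimal Python sketch of the casino setup just described. This is not the notebook from the video; the `pull` helper and the pure-exploration loop are my own illustration, with only the four win probabilities taken from the example above.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Win probabilities of the four slot machines (unknown to the player).
probs = np.array([0.70, 0.30, 0.55, 0.40])

def pull(arm):
    """Play one slot machine once; return 1 for a win, 0 for a loss."""
    return int(rng.random() < probs[arm])

# Pure exploration: pick a machine uniformly at random, 1,000 times.
wins_explore = sum(pull(int(rng.integers(len(probs)))) for _ in range(1_000))

# Perfect knowledge: always play the best machine (the first one, p = 0.70).
wins_best = sum(pull(0) for _ in range(1_000))

print(wins_explore)  # about 1000 * mean(probs) ≈ 490 wins on average
print(wins_best)     # about 1000 * 0.70 = 700 wins on average
```

The gap between those two numbers is exactly what a bandit algorithm has to close without ever being told `probs`.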
The second step is exploitation, where, now that we have a good idea of which slot machines work well, we start going to the ones that have worked well historically. The multi-armed bandit problem requires us to balance exploration and exploitation, because both are important.

Why is exploration important? Let's say we start playing slot machine 2 and we win. This might encourage us: we won here, so why not just keep playing slot machine 2? Say we win two, three, four, even ten times in a row on slot machine 2, which is possible. This might make us think, okay, we've seen ten wins, we're good to go all in on slot machine 2. But in reality, if we go all in on slot machine 2 for 1,000 plays, in the long run we would only win about 300 games, since its true win probability is 30%. So a purely exploitative strategy is not good, and this is why exploration is important.

But the flip side is also true. Let's say we explore all the time. That means we never really take advantage of what we know; we just keep choosing slot machines at random. In this case we would play each of the four machines about 250 times, and if you add all of our wins up, we might get a reasonable total, maybe around 400 to 500 wins, but it's definitely not the best we could possibly have done. So the balance between exploration and exploitation is absolutely critical when dealing with the multi-armed bandit problem.

Let's now take a look at the code for multi-armed bandits. The logic I just explained is reproduced almost exactly in this code. Note that this code is not mine; I put a link to the original author in the Colab notebook, but I'll also upload the notebook to GitHub myself, since I've added a few more notes, such as how large each of these arrays is, so it might be a good reference too.

In this case we are considering 10 arms instead of four, where each entry is a probability of success: playing the first slot machine gives you a 10% chance of winning, the second a 50% chance, the third a 60% chance, and so on. Note that these are probabilities we do not know going in; they exist only for the purpose of the simulation, because we need some ground truth to judge whether we're doing well or not. The number of steps is the number of times we play the slot machines over and over; I said a thousand before, but in this case we take 500. The number of experiments is the number of times we execute these sets of 500 steps, again just for the sake of the simulation. `eps` is the exploration-exploitation balance we want: we're saying that 10% of the time we want to explore, and 90% of the time we want to exploit our own knowledge. `R` is going to hold the total rewards for each time step, and `A` is going to hold the number of times each action was performed at each time step.

Now, for every single simulation we run, we perform the experiment. This experiment involves choosing a slot machine, playing it, and realizing either a win or a loss. In the code, that's exactly what's happening: we perform an action, that is, we choose a slot machine, either by exploration or by exploitation.
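As a stand-in for the original notebook (which is linked in the description), here is a sketch of the same epsilon-greedy logic under the settings just described. Only the first three arm probabilities (0.1, 0.5, 0.6), the best value of 0.8, the 500 steps, and eps = 0.1 come from the video; the remaining probability entries, the experiment count, and the helper names `q` and `n` are assumptions of mine.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Arm win probabilities. The 0.1, 0.5, 0.6 for the first three arms and the
# best value of 0.8 are stated in the video; the rest are placeholders, with
# arms 4 and 9 made the strongest to match the plots discussed later.
probs = np.array([0.10, 0.50, 0.60, 0.45, 0.80, 0.25, 0.30, 0.55, 0.40, 0.75])

num_arms = len(probs)
num_steps = 500         # plays per experiment (the video uses 500)
num_experiments = 2000  # how many times to repeat the 500-step run (assumed)
eps = 0.1               # explore 10% of the time, exploit 90% of the time

R = np.zeros(num_steps)              # summed reward at each time step
A = np.zeros((num_steps, num_arms))  # how often each arm was chosen per step

for _ in range(num_experiments):
    q = np.zeros(num_arms)  # running estimate of each arm's win rate
    n = np.zeros(num_arms)  # how many times each arm has been pulled
    for t in range(num_steps):
        if rng.random() < eps:
            a = int(rng.integers(num_arms))  # explore: pick a random arm
        else:
            a = int(np.argmax(q))            # exploit: best arm so far
        r = int(rng.random() < probs[a])     # realize a win (1) or loss (0)
        n[a] += 1
        q[a] += (r - q[a]) / n[a]            # incremental average update
        R[t] += r
        A[t, a] += 1

avg_reward = R / num_experiments  # average reward at each time step
```

Plotting `avg_reward` against the time step, and the columns of `A / num_experiments` against the time step, gives the two graphs discussed next.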
We then realize a reward, which is either zero or one depending on the probability of success of that slot machine. We perform these actions and collect rewards 500 times, giving a list of 500 actions and rewards. That would be just one simulation; we add up the numbers across all simulations, and from the summed rewards we eventually determine the average reward for every single time step.

If we plot the average reward for every time step, we get this graph. The x-axis is the time step, from zero to 500, and the y-axis is the average reward across all simulations. Each point thus represents how well we did, on average, at a specific time step. For example, at around the 50th time step, the average reward per play was around 0.64. But as time increased, you can see that our average reward did in fact increase; that means we were making better decisions, taking advantage of the slot machines that truly gave us the best chance to win. It plateaus at about 0.75, which makes sense: the best bandit arm, if we chose it every single time, would give 0.8, the highest possible average reward we could get, and we stay slightly below that because 10% of the time we are still exploring at random.

In a similar vein, we can also plot which slot machines we chose at every time step. You can see that in the beginning we were very random in choosing slot machines, but as we gained more and more knowledge, even early on, we were clearly choosing this red line and this gold line. The red line is arm four and the gold line is arm nine, both of which make sense because they truly had the highest probabilities of success. And though we didn't know this going in, we figured it out through rounds of exploration, and eventually we could start exploiting those arms to actually collect those rewards. That's why we choose these arms more and more over time, and that's also why the rewards get better and better over time.

That's going to do it for this video, and I hope you all learned something fun. I'm going to link this notebook and the original code down in the description below, so please do check that out. If you like what you're seeing and you love content like this, please give this video a thumbs up. Thank you all so much for 100,000 subscribers, and we want to reach 150,000 subscribers real soon. So thank you, thank you, thank you for your support, and I'll see you in another one. Bye-bye.