Greetings, fellow learners. Now, before we embark on our journey towards understanding proximal policy optimization, I have a question for you. How do you modify your decision making for a certain activity that you do? Or what does it take to modify your decision making for a certain activity? Can you express this in words at all? This could be your decision making on how you invest money, your decision making on your exercise routine, or just about anything else. Share your thoughts in the comments down below; I would love to hear them. In this video, we're going to delve into a popular method that an AI uses to modify its decision making, called proximal policy optimization (PPO). This video is divided into three passes: in the first pass, we'll introduce some concepts and definitions, and as we move into the later passes, we'll dive into further details about the algorithm itself. Pay attention, because I'm going to quiz you along the way. So let's get to it.

The PPO algorithm makes use of two main architectures: a policy network and a value function network. Both are neural networks that take an input and return an output. The policy network takes a state as input and produces an action as output. From an architecture standpoint, the output layer of this network has a number of neurons equal to the number of possible actions, and each neuron gives the probability that its action is taken when the agent is in the input state. The value function network takes a state and an action as input and outputs a real number that quantifies how good that decision was. This number is known as a Q value. From an architecture standpoint, as with the policy network, the output layer of this network has a number of neurons equal to the number of possible actions. So, technically, this value function network takes a state as input and, for every action, determines a Q value that quantifies how good that specific action is for that specific state.

Quiz time! Have you been paying attention? Let's quiz you to find out. What is the primary function of the output layer in the policy network of the PPO algorithm? A. It outputs Q values for the given state and action. B. It produces a probability distribution over possible actions given the input state. C. It calculates the advantage estimates for policy updates. Or D. It generates random actions for exploration. Comment your answer down below and let's have a discussion. And if you think I deserve it and you love learning, please consider hitting that like button. That's going to do it for quiz time and pass one of our explanation, but I'll be back, so pay attention.
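If you'd like to see what these two networks might look like in code, here is a minimal sketch using PyTorch. The layer sizes, hidden dimensions, and class names are my own illustrative choices rather than anything fixed by PPO, and the value function network follows the description above, producing one Q value per action.

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Maps a state to a probability distribution over actions."""
    def __init__(self, state_dim, num_actions, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_actions),
        )

    def forward(self, state):
        # One output neuron per action; softmax turns the raw scores
        # into probabilities that sum to 1.
        return torch.softmax(self.net(state), dim=-1)

class ValueFunctionNetwork(nn.Module):
    """Maps a state to one Q value per action, as described in pass one."""
    def __init__(self, state_dim, num_actions, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_actions),
        )

    def forward(self, state):
        # A real-valued Q value for every possible action in this state.
        return self.net(state)
```

Sampling an action then just means drawing from the policy network's output distribution, which is exactly what Frank is about to do in pass two.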
To help us out, let's bring out Frank, our lovely friend. Say hi, Frank. Hello. What a cutie. Now, Frank is in this grid world and he needs to know how to get to that plus 10 reward spot. To do so, he can take the actions of left, right, up, or down. In this pass, we're going to give an overview of how Frank can learn to navigate this world using PPO.

So let's talk about the overview of the training process. Frank starts at some random state. The state is passed into the policy network. The policy network determines the probability of taking each action, which gives us a probability distribution, and we sample from it to determine the actual next action. Frank then takes that action and receives some reward. We then store the quadruple of the state, action, reward, and action probability in a data store. This information will be useful when training the policy network and the value function network, and we'll talk about that later. We now repeat this sequence of steps for the episode, or for some fixed number of time steps within the episode. The data stored for this episode forms a batch, and we can collect multiple batches if we choose. But let's just stop here. Thanks for the help, Frank. We'll take it from here and we're going to help you learn. Okay. Nice.

Now let's talk about training the two networks. Take the batch of states, actions, rewards, and action probabilities. The state and action are used with the value function network to give us a Q value, and like we said before, this quantifies how good we expect this action to be. We then determine the total future reward for every time step using the data that we stored, and this quantifies how well we actually performed. We take the difference between these two values, and this difference is known as the advantage. We use this advantage to compute a loss, which is then backpropagated through the value function network, and so it learns. We then use the same advantage, along with the probabilities that we stored previously, to determine the loss for the policy network. That loss is backpropagated through the policy network, so its parameters are updated. We then repeat this process for all batches of data. Effectively, the policy network and the value function network get better over time as they learn together, and Frank learns to make better and better decisions so that he can get to that plus 10 reward square. And that is an overview of one iteration of proximal policy optimization.

Quiz time. I'm back. Have you been paying attention? Let's quiz you to find out. What is the main purpose of the Q values produced by the value function network in PPO? A. To represent the probabilities of different actions given a state. B. To calculate the advantage estimates used for policy updates. C. To generate a probability distribution over possible actions. Or D. To produce random actions for exploration. Comment your answer down below and let's have a discussion. That'll do it for quiz time for now, and also for pass two, but keep paying attention, because I will be back to quiz you.

So in pass two we saw that the value function network and the policy network are trained together, and the overview of the steps is basically this: we compute the loss for the value function network, we compute the loss for the policy network, we update both networks together, and we repeat. And then Frank becomes better and better as a decision maker.
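Here is a rough sketch of what collecting Frank's data and computing the actual future rewards could look like, building on the networks above. This is just an illustration under my own assumptions: `env` is a stand-in for the grid world with a `reset()` that returns a state and a `step(action)` that returns the next state, a reward, and a done flag, and the discount factor `gamma` is a common default rather than anything specified in the video.

```python
import torch

def collect_batch(env, policy_net, max_steps=128):
    """Roll out one episode, storing (state, action, reward, action probability)."""
    batch = []
    state = env.reset()                                   # assumed interface: returns a state
    for _ in range(max_steps):
        state_t = torch.as_tensor(state, dtype=torch.float32)
        with torch.no_grad():                             # no gradients needed while collecting data
            probs = policy_net(state_t)                   # probability of each action in this state
        action = torch.multinomial(probs, num_samples=1).item()  # sample the next action
        next_state, reward, done = env.step(action)       # assumed interface: (state, reward, done)
        batch.append((state, action, reward, probs[action].item()))
        state = next_state
        if done:
            break
    return batch

def total_future_rewards(batch, gamma=0.99):
    """Discounted sum of future rewards at every time step: how well we actually performed."""
    returns, running = [], 0.0
    for (_, _, reward, _) in reversed(batch):
        running = reward + gamma * running
        returns.insert(0, running)
    return returns
```

The advantage at each time step is then the actual return minus the Q value the value function network predicted for the stored action, which is exactly what pass three turns into the two loss functions.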
Let's now take the overview of the process that we discussed in pass two, but add a few more details, specifically around the loss function calculations.

Let's start with the value function network and generating its loss. We get the batch of data for the episode that we stored in pass two. For each time step, we compute the actual future reward from the data we gathered, and this is done by taking the sum of discounted future rewards. Then we compute the expected future reward by passing the state into the value function neural network; it produces a Q value for every action, and we look at the Q value for the specific action stored in our tuple. So for every time step we have two numbers, and we take the difference between them to get the advantage. We square the advantage for every time step and take the average across the batch. This final number is the loss that we backpropagate through the value function network, and so the neural network learns. And while all of this happens, we are also training the policy network. So let's see how that goes.

Now, if we write it in mathematical form, the policy network loss looks like this. Very cumbersome, but let's explain what's going on. First, we get the batch of data that we stored. Next, we pass the batch of states to the policy network to get the probabilities of the actions. In each case, we only consider the probability of the action that was taken when we gathered the data. We then divide two numbers: the probability for that specific action that we have now, divided by the probability we collected previously. This is the probability ratio. We multiply this ratio by the advantage computed at that time step, and so for every time step in the episode we have a number. Let's keep this number aside. Hold on to this for me, Frank. Okay. Okay. Nice. Next, we take the probability ratio and clip it to ensure that we're not changing the network too much, and we multiply this clipped ratio by the advantage. So now, for every time step, we have two values. We take the minimum of these two values, and when we take the average of those minimums across the batch, we get a single number. This single number is our loss, which is backpropagated through the policy network. That's quite a few steps, but overall the loss function strikes a balance between making effective policy updates to improve performance and making cautious policy updates to improve stability. And effectively, the value function network and the policy network are trained together. And that is PPO.
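Written out, the quantity just described is min(ratio × advantage, clip(ratio, 1 − epsilon, 1 + epsilon) × advantage), averaged over the batch; since we typically minimize a loss, the value we backpropagate is the negative of that average. Here is a minimal sketch of both loss calculations, with the same caveats as before: the inputs are plain tensors gathered from the batch, and epsilon = 0.2 is a common default rather than something fixed by the video.

```python
import torch

def value_function_loss(q_values_taken, actual_returns):
    """Square the advantage at every time step and average it across the batch."""
    advantage = actual_returns - q_values_taken           # actual minus expected future reward
    return (advantage ** 2).mean()

def policy_network_loss(new_probs, old_probs, advantages, epsilon=0.2):
    """Clipped surrogate loss: effective updates that don't change the policy too much."""
    ratio = new_probs / old_probs                         # probability ratio, new over old
    clipped_ratio = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon)
    per_step = torch.min(ratio * advantages, clipped_ratio * advantages)
    # The surrogate objective is maximized, so the loss we backpropagate is its negative.
    return -per_step.mean()
```

In a full training step you would compute both losses from a batch, call backward() on each, and let an optimizer update the two networks together, repeating for every batch as described in pass two.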
Quiz time. This is going to be a fun one. Have you been paying attention? Let's quiz you to find out. What is the primary purpose of computing the advantage in the loss function for the value function network? A. To determine the ratio of probabilities between the old and new policies. B. To calculate the expected future rewards and guide updates to the value function. C. To clip the probability ratio and ensure stable policy updates. Or D. To compute the loss for the policy network during training. Comment your answer down below and let's have a discussion. And once again, if you do think I deserve it and you haven't done so yet, please consider hitting that like button. Thank you so much. That will do it for quiz time for this video, but before we go, let's get a summary.

The proximal policy optimization algorithm is used to learn a policy directly. The PPO algorithm makes use of two architectures: a policy network and a value function network. The policy network predicts a probability distribution over actions, whereas the value function network predicts Q values for every action taken from a state. The PPO algorithm involves training the policy network and the value function network iteratively and together. And just as a tidbit here, this algorithm is actually used by ChatGPT and other large language models today to help ensure that the responses they give are safe, factual, and non-toxic.

That's all we have for today. If you liked this video, please do consider giving it a like, and if you want to look at some similar videos, check out this video on deep Q networks; just a caution, I'm going to be quizzing you a lot in that video too. So if you like all that stuff, please do check it out, and I will see you in the next one. Bye bye!