Greetings fellow learners! In this video, we are going to talk about Q-Learning, but let's start with an inquisitive question. Can you think of a scenario in your daily life where a computer or an AI could benefit from learning on its own, just like humans do? Now, this could be anything from optimizing your morning routine to tackling problems at work. In this video, we're going to explore how Q-Learning aligns with these real-world scenarios, so don't be shy to share your thoughts down below. We're going to divide it into three passes, starting with pass one, where we just start with high-level definitions, and then get into more details in later passes. Pay attention because I'm going to be quizzing you later on. Let's get to it.

What is Q-Learning in simple terms? Well, to help me explain this, we have our trusty robot Frank. Frank, say hi. "Hello." What a cutie. Frank here is going to navigate this environment. In order to navigate this environment, he uses a Q-Table. You can think of this Q-Table as Frank's conscience that tells Frank what to do in every situation. All right, I see where you are, Frank. So what do you want to do here? "My Q-Table is telling me to go right." Gotcha, gotcha. All right, now that you're there, what about now? "My Q-Table is telling me to go down." You go for it, buddy.

Now, this is a simple world of just nine grid squares where there's a finite number of states, but what if there were a lot of states? Well, that would mean that this Q-Table becomes really big, too big to fit in Frank's tiny little head. He would literally run out of memory. So instead of having this table as a conscience, it's probably better to just have a function as Frank's conscience, and this function would take in a state and an action, and it would output some value that indicates how good it is to take that action in that state. This value is known as a Q-value, by the way.

Quiz time. Have you been paying attention? Let's quiz you to find out.
What role does the Q-Table play in Frank's decision process in the simplified world of nine squares? A, it directly tells Frank what action to take in each state. B, it serves as a backup memory for Frank in case the Q function fails. C, it's used for aesthetic purposes to decorate Frank's environment. D, it has no impact on Frank's decision making, and he relies solely on the Q function. Comment your answer down below and let's have a discussion. Now, at this point, if you think I deserve it and you love Frank, please do consider hitting that like button. That'll do it for quiz time for now, but keep paying attention because I'll be back again.

So we replaced this giant Q-Table with a function. In the context of AI, this function is typically a neural network. So Q-Learning learns the values in a Q-Table, while Deep Q-Learning learns the parameters of a Q-Network, which is a neural network. Now this Q-Network is randomly initialized and represents Frank's conscience. We also have a target network that represents the ideal conscience, and it has the same architecture as the Q-Network. The goal now is to compare Frank's current conscience with Frank's ideal conscience. From this comparison, we get a loss, and we use this loss to update the parameters in the Q-Network with backpropagation. And then the cycle repeats. In the next pass, we're going to get into more details.

But for now, in order to train this network, we actually first need to take a step back and collect the data. So let's just talk about that very briefly. So this data collection, first of all, is done using this Q-Network. It's first randomly initialized, and some state is chosen. The Q-Network will generate the action to take. This action is taken, a reward is gained, and the agent goes to the next state. So the quadruple of the state, the action, the reward, and the next state is then stored in some memory called the experience replay buffer.
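
To make the experience replay buffer a little more concrete, here is a minimal sketch in Python. It is only an illustration, not the implementation used in the video: the `ReplayBuffer` class, the `Transition` name, the fixed capacity, and the toy states and actions are all assumptions made for this example.

```python
import random
from collections import deque, namedtuple

# One experience: the (state, action, reward, next_state) quadruple.
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state"])


class ReplayBuffer:
    """Hypothetical fixed-capacity memory; the oldest experiences are dropped when full."""

    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.memory.append(Transition(state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform random sampling without replacement.
        return random.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


# Toy usage: store five made-up experiences in a buffer that holds three.
buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.push(state=step, action="right", reward=0.0, next_state=step + 1)
batch = buffer.sample(2)
```

Because the buffer is sampled at random rather than read back in order, a training batch mixes experiences from different moments of data collection.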
And we repeat this process of just choosing some state, producing these quadruples, and storing them. So this data is now in our experience replay buffer, and it can later be used to train our Q-Network, as we just discussed.

Quiz time! It's that time of video again. Have you been paying attention? Let's quiz you to find out. Why is experience replay considered beneficial in Deep Q-Learning? A. It helps the Q-Network generate actions more quickly. B. It stores the Q-values for future reference. C. It breaks the temporal correlation and improves learning stability. Or D. It ensures that the Q-Network is always initialized with the ideal parameters. Comment your answer down below, and let's have a discussion. That'll do it for quiz time for now, but once again, I will be back, so pay attention.

Now there are two phases that we discussed in pass two. That is the data collection phase and the training phase. So let's talk about both in detail, starting with the data collection phase. So first we randomly initialize the Q-Network. We choose an initial state and pass it through the Q-Network. And the number of neurons in the output layer of this network, by the way, is the number of possible actions. So each neuron outputs a Q-value for the input state and one specific action. So for example, if we could only go left, right, up, or down, there would be four neurons in the output layer, and each of these neurons would correspond to a Q-value for one of those actions. So we then choose an action in an epsilon-greedy fashion, that is, mostly picking the action with the highest Q-value but occasionally picking a random one. We take the action in the environment or in some simulated environment. And once the action is taken, Frank will get a reward, and Frank will also be in a new state. So now we take that quadruple that we just described, that is, the state, the action, the reward, and the new state, and store it in the experience replay buffer. Now we can randomly choose a new state and repeat the process of collecting the data in the experience replay buffer.
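
The data collection loop described above can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the video's code: the 3x3 grid, the goal square, the reward of 1.0 for reaching it, the epsilon value, and the `q_values` stand-in (which returns all zeros instead of calling a real neural network) are all made up for the example.

```python
import random

ACTIONS = ["left", "right", "up", "down"]  # one output neuron per action
MOVES = {"left": (0, -1), "right": (0, 1), "up": (-1, 0), "down": (1, 0)}
GOAL = (2, 2)  # assumed goal square in a 3x3 grid


def q_values(state):
    """Stand-in for the Q-Network: one (untrained, all-zero) Q-value per action."""
    return [0.0 for _ in ACTIONS]


def epsilon_greedy(state, epsilon):
    """With probability epsilon explore randomly, otherwise take the best-valued action."""
    if random.random() < epsilon:
        return random.randrange(len(ACTIONS))
    values = q_values(state)
    return max(range(len(ACTIONS)), key=lambda a: values[a])


def step(state, action_index):
    """Move inside the grid (walls block movement); assumed reward 1.0 at the goal."""
    d_row, d_col = MOVES[ACTIONS[action_index]]
    row = min(max(state[0] + d_row, 0), 2)
    col = min(max(state[1] + d_col, 0), 2)
    next_state = (row, col)
    reward = 1.0 if next_state == GOAL else 0.0
    return reward, next_state


def collect(num_steps, epsilon=0.1):
    """Fill a simple list-based replay buffer with (s, a, r, s') quadruples."""
    replay_buffer = []
    for _ in range(num_steps):
        state = (random.randrange(3), random.randrange(3))  # randomly chosen state
        action = epsilon_greedy(state, epsilon)
        reward, next_state = step(state, action)
        replay_buffer.append((state, action, reward, next_state))
    return replay_buffer
```

Calling `collect(50)` would produce fifty quadruples. With a real Q-Network, `q_values` would run a forward pass and return the four output-layer values instead of zeros.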
And this is the data collection piece.

Now on to training. Now we have this experience replay buffer, which has a lot of data, and we can use this data to train the Q-Network. So let's take a batch of this data. Now in this case, just for explanatory purposes, let's say the batch size is just one example so that we can see what's going on throughout. We feed the current state of the quadruple into the Q-Network, and we get the Q-value corresponding to the action in the quadruple. Then we take the next state from the quadruple, pass it through the target network, and get the highest Q-value there. Now we take the reward from the quadruple and add it to this target Q-value, which in practice is first scaled by a discount factor. So now we have two main values: one from the target network, which is an ideal Q-value, and the current Q-value from the Q-Network. Now we're going to compute the mean squared error loss between the Q-Network's Q-value, which represents Frank's conscience, and the final target Q-value, which represents the optimal decision. This loss is then backpropagated through the Q-Network, and the parameters are updated. So Frank learns, but the target network remains the same. Now we repeat the process for a few batches, where Frank's conscience continues to learn but the target network still remains the same. And then after a few batches, we update the parameters of the target network to match the Q-Network, and then continue. And eventually, Frank's conscience and decision making improve, and he's a champ.

Quiz time. I'm back. Have you been paying attention? Let's quiz you to find out. Why is the target network used in the training process of Deep Q-Learning? A, to randomly initialize the Q-Network. B, to represent Frank's current conscience. C, to provide an ideal reference for parameter updates and improve learning stability. Or D, to store experiences in the experience replay buffer. Comment your answer down below and let's have a discussion. Now that'll do it for quiz time for now. And unfortunately, I'm not going to be back again.
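
As a rough sketch of the training phase with a batch size of one, the snippet below uses a tiny linear model in plain Python as a stand-in for the neural network. Everything named here is an assumption for illustration: the `predict`, `train_step`, and `sync_target` helpers, the discount factor `GAMMA`, the learning rate, and the two-number state features. It follows the standard DQN formulation, in which the target is the reward plus the discounted highest target-network Q-value of the next state, and it ignores terminal states for brevity.

```python
import copy

GAMMA = 0.9          # assumed discount factor
LEARNING_RATE = 0.1  # assumed step size
NUM_ACTIONS = 4


def predict(weights, features):
    """Linear stand-in for a Q-Network: returns one Q-value per action."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def train_step(q_weights, target_weights, transition):
    """One gradient step on a single (state, action, reward, next_state) quadruple."""
    state, action, reward, next_state = transition
    # Current estimate from the Q-Network for the action that was taken.
    q_value = predict(q_weights, state)[action]
    # Target: reward plus discounted best Q-value of the next state,
    # computed with the frozen target network.
    target = reward + GAMMA * max(predict(target_weights, next_state))
    error = q_value - target
    loss = error ** 2
    # Gradient of the squared error with respect to the taken action's weights.
    for i, x in enumerate(state):
        q_weights[action][i] -= LEARNING_RATE * 2 * error * x
    return loss


def sync_target(q_weights):
    """Copy the Q-Network parameters into a fresh target network."""
    return copy.deepcopy(q_weights)


# Toy usage: train on one made-up quadruple, then refresh the target network.
q_weights = [[0.0, 0.0] for _ in range(NUM_ACTIONS)]
target_weights = sync_target(q_weights)
transition = ([1.0, 0.0], 1, 1.0, [0.0, 1.0])
losses = [train_step(q_weights, target_weights, transition) for _ in range(20)]
target_weights = sync_target(q_weights)
```

Only `q_weights` changes inside `train_step`; the target network stays fixed until `sync_target` is called, which mirrors the "update the target after a few batches" step.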
So before we go, let's get a summary of the video. A Q-Table represents an agent's conscience. Deep Q-Networks, or DQNs, combine Q-Learning with neural networks. DQN construction has two phases, a data collection phase and a training phase. The data collection phase collects independent quadruples of experience in a data store called the experience replay buffer. And the training phase trains the Q-Network using the data in this experience replay buffer.

And that's all we have for today. But if you want to know more about the raw Q-Learning algorithm, do check out my video right on screen here. And as always, thank you all so much for watching. Please consider liking the video if you do think I deserve it. And I will see you in the next one. Bye bye.