Greetings, fellow learners. Before we embark on this journey into reinforcement learning from human feedback, I've got a thought-provoking question for you: when learning something new, when has feedback from others made a noticeable impact on your decision-making or learning? This could be any experience from your life, so please share your thoughts down in the comments below and let's have a discussion.

We will divide this video into three passes, where we start by introducing the concept of reinforcement learning from human feedback and then provide some engaging examples along the way. Also, pay attention, because I'm going to quiz you along the way. Now let's get to it.

For this first pass, let's have Frank help us explain. Frank, say hi. Hello. What a cutie. Now, this here is the grid world, where there are nine squares and each square has a reward inside of it. The goal for Frank is to get to this +10 reward spot, and to do so, Frank makes decisions: to go either left, right, up, or down. But Frank doesn't know how to make any decisions to begin with, so Frank learns by interacting with the environment, and he does so using a reinforcement learning algorithm. We've discussed the details of a few of these algorithms in previous videos, so you can check them out for specifics. But effectively, once Frank learns how to make decisions with any of these algorithms, he will be able to get to that +10 reward spot efficiently.

But wait, can we help Frank out even more? It turns out that we can. While Frank is learning with a reinforcement learning algorithm, we humans can also provide feedback to Frank as a mentor. This allows Frank to learn faster, and it also allows Frank to give responses that are more human-favored.

Quiz time! Have you been paying attention? Let's quiz you to find out. Which algorithms can be used along with human feedback? A, Q-learning. B, Deep Q-learning. C, Proximal Policy Optimization. Or D,
all of the above. Comment your answer down below and let's have a discussion. And if you think at this point that I deserve it, please consider hitting that like button, because it helps me a lot. That's going to do it for quiz time and for pass one, but keep paying attention, because I will be back.

Alright, Frank, get over here. Yay! In this pass, we're going to show you how Frank learns without human feedback, then add in human feedback and see how that affects things. So, want to get started, Frank? Sure. Hmm. Let me go down. Let me go down. Let me go right. Oh, bad spot. Learning now. Starting over. Let me go down. Let me go right. Let me go up. Let me go right. Let me go down. Let me go down. Good spot. Learning now. Great. So Frank will keep doing this as he learns, but now let's see how a human can help. Starting over. Let me go right. Alright, Frank, so that's fine. Let me go down. Okay. Let me go left. Hmm, I would prefer you actually go right here. Okay, going right. Let me go down. Good spot. Learning now. So in this situation, Frank was still following an algorithm, but I was nudging him in the direction that I thought was correct.

Quiz time! It's that time of the video again. Have you been paying attention? Let's quiz you to find out. How does human feedback contribute to reinforcement learning, as illustrated with Frank's grid world adventure? A, it acts as a randomizing factor. B, it accelerates the learning process. C, it slows down the learning process. Or D, it has no impact on decision-making. Comment your answer down below and let's have a discussion. That'll do it for quiz time for now, but I will be back, so pay attention.

In this pass, let's talk about how ChatGPT makes use of reinforcement learning from human feedback in a practical application. It's split into two parts. The first is to train a reward model to act as a human advisor to ChatGPT.
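Before moving on to ChatGPT, Frank's grid world from the passes above can be sketched in code. This is a minimal toy version, assuming a 3×3 grid with a −10 "bad spot" in the middle, a +10 goal in a corner, and a human mentor who sometimes overrides Frank's move in one square; the layout, the advice, and all hyperparameters are illustrative choices, not exact values from the video.

```python
import numpy as np

# Tabular Q-learning on a 3x3 grid world like Frank's, with an optional
# human "nudge". Grid layout, advice, and hyperparameters are assumptions.
GOAL, BAD = (2, 2), (1, 1)
rewards = np.zeros((3, 3))
rewards[GOAL], rewards[BAD] = 10, -10
moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
names = list(moves)
q = np.zeros((3, 3, 4))                  # one Q-value per (row, col, action)
human_advice = {(2, 0): "right"}         # hypothetical mentor: "go right here"
alpha, gamma, eps, nudge_prob = 0.5, 0.9, 0.2, 0.5
rng = np.random.default_rng(0)

def train(use_human=False, episodes=300):
    for _ in range(episodes):
        r, c = 0, 0                      # Frank starts in the top-left square
        while (r, c) not in (GOAL, BAD):  # "good spot" or "bad spot" ends a run
            a = rng.integers(4) if rng.random() < eps else int(np.argmax(q[r, c]))
            if use_human and (r, c) in human_advice and rng.random() < nudge_prob:
                a = names.index(human_advice[r, c])   # the human nudge
            dr, dc = moves[names[a]]
            nr = min(max(r + dr, 0), 2)  # clamp so Frank stays on the grid
            nc = min(max(c + dc, 0), 2)
            # Standard Q-learning update toward reward + discounted best next value
            q[r, c, a] += alpha * (rewards[nr, nc] + gamma * np.max(q[nr, nc]) - q[r, c, a])
            r, c = nr, nc

train(use_human=True)
print(names[int(np.argmax(q[0, 0]))])    # Frank's learned first move from the start
```

After training, the greedy policy routes Frank around the bad spot to the +10 square; the nudge simply steers exploration toward moves the human prefers, so good paths are discovered in fewer episodes.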
The second is to use this reward model along with an algorithm called Proximal Policy Optimization (PPO) to fine-tune ChatGPT. Let's talk about each part now, starting with the reward model. This model is a GPT architecture that takes a question and an answer as input, and its output is a number: a score that says how good this answer was for this input question. The higher the score, the better the response. Our goal is to first train this model. We can do this by putting a question to a pre-trained ChatGPT multiple times, and each time we'll get a unique answer. We as humans then take these responses and rank them from best to worst, and we use these rankings to train the reward model. Once the reward model is trained, it should be able to assess how good a given answer is to a given question.

So that was the first part, which dealt with training the reward model. Now onto the second part, where we use this reward model along with Proximal Policy Optimization to fine-tune ChatGPT. ChatGPT is given a question, and it generates a response using the reinforcement learning algorithm Proximal Policy Optimization. For more information on how PPO works, I recommend you check this video out. It's a good one; you won't regret it. The response, along with the question, is passed to the reward model to generate a number: a score that tells us how good this response was for this question. We then use this reward in the loss function for ChatGPT's network and perform backpropagation so that ChatGPT learns. That is just one iteration, but we keep doing this for many iterations, and once trained, ChatGPT becomes the public-facing app that it is today. For a deeper dive into the entire process behind ChatGPT, you can check out my playlist of videos right here.
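The ranking step described above is commonly turned into a pairwise training signal: for each pair of ranked answers, the reward model is nudged to score the human-preferred answer higher than the rejected one. Here is a minimal sketch of that loss, with toy scalar scores standing in for the GPT network's outputs; the function name and numbers are illustrative, and in practice the gradient of this loss flows back through the reward model's weights.

```python
import math

def pairwise_loss(score_preferred, score_rejected):
    """Pairwise ranking loss: -log(sigmoid(preferred - rejected)).
    Small when the preferred answer already scores well above the rejected one,
    large when the ranking is violated."""
    margin = score_preferred - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Toy scores standing in for reward-model outputs on two ranked answers:
print(round(pairwise_loss(2.0, -1.0), 4))   # low loss: ranking already correct
print(round(pairwise_loss(-1.0, 2.0), 4))   # high loss: ranking violated
```

Training on many such pairs is what lets the reward model hand out a single meaningful score for any new question-answer pair during the PPO fine-tuning stage.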
But overall, I hope you understand this real-world use case of reinforcement learning from human feedback. Oh, this is going to be a good one. Have you been paying attention? Let's quiz you to find out. In ChatGPT, what is the primary purpose of the reward model? A, to generate unique answers to questions. B, to serve as a pre-trained ChatGPT. C, to assess and score the quality of answers generated by ChatGPT. Or D, to perform backpropagation in ChatGPT's network. Comment your answer down below and let's have a discussion. And as I mentioned before, if you do think I deserve it, please give this video a like. That would mean a lot to me.

That's going to do it for quiz time, but before we go, let's write out a summary. Reinforcement learning from human feedback is a framework that integrates human feedback into the training process of a reinforcement learning algorithm. The algorithm in question could be Deep Q-learning, Proximal Policy Optimization, or any other algorithm. Human feedback is used to guide and accelerate the learning process, allowing the algorithm to make more informed decisions. In ChatGPT, human feedback is given via the reward model, and this iterative training process enhances ChatGPT's capabilities, making it a powerful tool for generating high-quality responses.

And that's going to do it for today. I hope this video helped you get a good sense of what reinforcement learning from human feedback is and where it is practically used, in the guise of ChatGPT. There was a mention here of an algorithm called Proximal Policy Optimization, so to understand more about it, do check out that video right over here. Thank you all so much for watching. If you think I deserve it, please do give this video a like once again, and I will see you in the next one. Bye-bye.