Reinforcement learning is learning what to do, that is, how to map situations to actions, so as to maximize a numerical reward signal. This is a very high-level definition of reinforcement learning, given in the book Reinforcement Learning: An Introduction by Richard Sutton and Andrew Barto. You can see that reinforcement learning is a method of learning, and a method of learning in machine learning is known as a paradigm. There are three core paradigms of machine learning, along with many subcategories. The first is supervised learning, where we have some data and a label associated with that data, and we give our model hundreds of thousands of examples so that it learns to associate the data with the label. This is typically seen in classification and regression problems. In unsupervised learning, we have the data itself but no labels associated with it. It is typically used for discovering patterns within the data, as we would with clustering, or for dimensionality reduction, among other tasks. Reinforcement learning is the third paradigm of how machines learn, and let me repeat the definition: reinforcement learning is learning what to do, that is, how to map situations to actions, so as to maximize a numerical reward signal. Until fairly recently, reinforcement learning was predominantly showcased on Atari games. For example, in the game Pong we have a paddle at the bottom, and we are trying to teach that paddle to play the game, that is, to maximize the reward signal by playing well. Initially it starts out badly, but as time goes on you can see that the paddle learns to play the game much more effectively.
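To make the contrast between the three paradigms concrete, here is a toy sketch of what the "learning material" looks like in each one (the data values are hypothetical, purely for illustration):

```python
# Supervised learning: inputs paired with labels.
supervised_data = [
    ([5.1, 3.5, 1.4], "setosa"),      # features -> class label
    ([6.7, 3.1, 4.7], "versicolor"),
]

# Unsupervised learning: inputs only, no labels; the algorithm
# must find structure (clusters, lower dimensions) on its own.
unsupervised_data = [
    [5.1, 3.5, 1.4],
    [6.7, 3.1, 4.7],
]

# Reinforcement learning: no fixed dataset at all. Experience is
# generated by interacting with an environment, as
# (state, action, reward, next_state) tuples.
rl_experience = [
    ("ball_left", "move_left", +1, "ball_center"),
    ("ball_right", "move_left", -1, "ball_right"),
]
```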
Another typical application of reinforcement learning is a robot learning how to walk in an environment. It will interact with the environment by trial and error, and it will keep failing, but eventually it gets better. More recently, reinforcement learning has also played a very important role in today's large language models like ChatGPT. On the model card for ChatGPT, we can see that they use a method called reinforcement learning from human feedback (RLHF), whose goal is to ensure that the responses ChatGPT produces are safe and useful.

This is the agent-environment interaction loop that is at the core of many reinforcement learning problems. The agent performs some action in the environment, and in response the environment emits a state and a reward, which the agent uses to choose its next action, and the loop goes on. Let's look at this from the perspective of an agent that is a self-driving car. The car can perform certain actions. These can be discrete actions, like turn left, turn right, or go straight, or they can come from a continuous set, like turn the steering wheel by 38.6 degrees, 38.7 degrees, 38.8 degrees, and so on. The car performs these actions in an environment, which here is traffic, and each instantaneous snapshot of the environment is known as a state. A single state might be represented by which cars are in the neighboring lanes, how far away the stoplight is, whether there is a car directly in front of you, along with much other information we can capture about traffic. Note, however, that we cannot truly reconstruct the entire traffic layout from a single car's perspective.
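As a sketch, the state and the two kinds of action spaces for the self-driving car might look like this in code (the field names and the steering-angle range are my own assumptions, not from the video):

```python
from dataclasses import dataclass

# A hypothetical snapshot (state) of the traffic environment,
# as observed from the self-driving car.
@dataclass
class TrafficState:
    cars_in_left_lane: int
    cars_in_right_lane: int
    distance_to_stoplight_m: float
    car_ahead: bool

# A discrete action space: a small, finite set of choices.
DISCRETE_ACTIONS = ["turn_left", "turn_right", "go_straight"]

# A continuous action space: any steering angle within a range
# (the +/-45 degree limit is an assumed value).
def is_valid_steering_angle(angle_deg: float) -> bool:
    return -45.0 <= angle_deg <= 45.0

state = TrafficState(cars_in_left_lane=2, cars_in_right_lane=0,
                     distance_to_stoplight_m=80.0, car_ahead=True)
```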
This means the environment is partially observable, because we cannot construct a perfect, complete state of the traffic scenario from the car's vantage point. Environments can also be fully observable, as in the case of an agent that plays chess. In chess, the agent has a bird's-eye view of where every single piece, black and white, sits, and it can use all of that information to decide what action, that is, what move, to make next. There is no hidden information beyond this 8x8 board, so the agent at every stage has access to all information about the environment, and hence this is a fully observable environment.

Next up, we have the reward. A reward is the immediate gratification or punishment the agent receives for performing some action. Say the agent is driving and decides to turn left. The agent gets a positive reward if it turns left onto another road without hitting any pedestrians or driving onto the sidewalk, whereas it gets a negative reward if it hits a pedestrian, runs a stop sign or a red light, or commits other traffic violations. This feedback influences how the agent behaves next.

Now let's talk about a few more complex elements, starting with the policy. A policy determines how an agent behaves in a given situation. More formally, it is a function that maps a state to an action. Say our self-driving car sees that about 50 feet in front of it there is a car that has completely stopped; that is part of the state. The car takes this state in, and its policy returns the action to perform in that state, which could be pressing the brakes so as to slow down and eventually stop. Hence the policy dictates how an agent should behave given a state.
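The interaction loop, the reward, and the policy described above can all be sketched in a few lines of code. The toy one-dimensional environment below is assumed purely for illustration: the agent starts at position 0 and is rewarded for reaching position 10.

```python
def step(state, action):
    """Environment: given a state and an action, emit (next_state, reward)."""
    next_state = max(0, min(10, state + action))
    reward = 1.0 if next_state == 10 else -0.1  # goal: reach position 10
    return next_state, reward

def policy(state):
    """Agent: a policy is a function that maps a state to an action."""
    return 1 if state < 10 else 0  # move toward the goal, then stop

state, total_reward = 0, 0.0
for _ in range(20):                      # the agent-environment loop
    action = policy(state)               # agent acts...
    state, reward = step(state, action)  # ...environment responds with
    total_reward += reward               # a new state and a reward
```

A learning agent would additionally use the rewards it receives to improve `policy` over time; this sketch only shows the loop itself.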
Reinforcement learning strategies can be either on-policy or off-policy, and I'm going to describe this using a simpler example: you trying to ride a bicycle for the first time. One way to learn is to get on the bicycle and just roll with it. You struggle a little. There are times when, say, you lean a bit too far to the right and the bike feels like it's going to fall; at that point your mind says, okay, I want to stay up, so I will lean towards the left, the bike rights itself, and you continue to struggle. Eventually, by practicing through trial and error, you learn how to ride. More formally, the state would be that the bicycle is tipping to one side, and your policy, given that state, chooses the action of leaning your body towards the left. That action is what your policy returns, and you are using your own previous actions to guide what you do next. Because you are learning by following and refining your own policy, this is an on-policy method of reinforcement learning. Now for the off-policy method. Instead of just taking the bike and riding it, you might watch somebody else around you and observe how they ride, including how they struggle. You are basically learning vicariously through someone else's riding. When you then pick up the bike and start riding, even though you make a few mistakes, you are not adjusting yourself based on your own past actions; you are adjusting based on what you have learned from someone else's actions, from some other person's policy.
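The classic embodiment of this distinction is the pair of update rules below. Note the video names no specific algorithms here, so SARSA (on-policy) and Q-learning (off-policy) are my illustrative picks: `Q` maps a `(state, action)` pair to an estimated value, `alpha` is the learning rate, and `gamma` is the discount factor.

```python
def sarsa_update(Q, s, a, r, s2, a2, alpha=0.1, gamma=0.9):
    # On-policy: the update uses a2, the action the agent's own
    # policy actually took in the next state s2.
    Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] - Q[(s, a)])

def q_learning_update(Q, s, a, r, s2, actions, alpha=0.1, gamma=0.9):
    # Off-policy: the update uses the best available next action,
    # regardless of what the behavior policy actually did.
    best = max(Q[(s2, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])
```

The only difference is the term inside the parentheses: SARSA bootstraps from the agent's own next action, while Q-learning bootstraps from the greedy one.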
Because you learned how to ride using someone else's policy rather than your own, this method of learning is off-policy reinforcement learning. Each approach has its own advantages and disadvantages. Learning on-policy in the self-driving car case might be a little too risky: directly interacting with the environment and learning from nothing by pure trial and error means your car would hit obstacles and pedestrians before it starts getting better, which might be a non-starter. With off-policy learning, on the other hand, you need to take the time to collect some form of training data that is actually useful to your reinforcement learning agent in your environment. Depending on how you weigh these pros and cons, one or the other of these strategies will be right for your case.

Next, let's talk about returns. A return is the cumulative reward over time, and it's a good way to quantify the overarching goal of your agent. A high return would indicate a sequence of actions that got the car to its destination without any accidents or traffic violations, and in a short amount of time. The goal of our agent is to maximize this return. We also have the value function, which quantifies the expected return; it is the quantity that certain methods, such as Q-learning, try to optimize directly.

Reinforcement learning algorithms can also be split into model-free and model-based approaches. In model-based reinforcement learning we use, surprise, surprise, a model, and this model is supposed to simulate the environment. In our self-driving car case, for example, a model-based approach would build a model that simulates traffic.
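A minimal sketch of that idea: the agent keeps an internal model of the environment and "imagines" the outcome of each candidate action before choosing one. The speed-based traffic model and the 50 km/h target below are my own assumptions, purely for illustration.

```python
def traffic_model(speed, action):
    """The agent's internal model: predicts (next_speed, reward)."""
    next_speed = speed + {"accelerate": 5, "brake": -5, "hold": 0}[action]
    # Hypothetical objective: stay near a 50 km/h target speed.
    reward = -abs(next_speed - 50)
    return next_speed, reward

def plan(speed, actions=("accelerate", "brake", "hold")):
    """Pick the action whose simulated outcome the model scores highest."""
    return max(actions, key=lambda a: traffic_model(speed, a)[1])
```

A real model-based agent would learn this model from data and plan over longer horizons; the one-step lookahead here just shows the shape of the idea.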
Such a model can help our car make decisions and better plan how to handle real traffic when it faces it. Model-free reinforcement learning, surprise, surprise, has no model: there is no simulation of the environment for our agent to interact with. Instead, the agent interacts directly with the real environment and learns from that experience. Which strategy to use again depends on your situation. With model-free reinforcement learning we interact with the real environment, so we don't need to deal with the hassle of building a simulated environment as we would in model-based reinforcement learning, which may not even be possible in some cases. However, with model-free reinforcement learning we need to be careful about how we interact with the environment, so that we can learn appropriately and, above all, safely. Policy optimization and Q-learning are by far some of the most important model-free reinforcement learning strategies. In fact, if we look back at ChatGPT's model card, they use proximal policy optimization (PPO) to help ensure that the responses coming from ChatGPT are non-toxic and safe. If you want more information on how exactly ChatGPT works, I have a video right over here that explains this entire model card in great detail. But that's going to do it for today. Thank you all so much for watching, and please do subscribe if you haven't already. We are at 100,000 subscribers and we want to hit 150,000 very soon. So thank you so much, and I will see you in another one. Bye bye.