So how many of you have heard the term reinforcement learning? Show of hands, please. OK, that's a sizable number. Good. So the topic is indeed popular. Now, changing the question a little bit: how many of you have actually built any sort of machine learning model yourselves? OK, thank you. That's good.

Now, typically what happens when you build a machine learning model, loosely speaking a supervised model, is that you need a bunch of data with labels associated with it. Take a very simple binary example of boy versus girl. You need a bunch of data labeled as boys, a bunch of data labeled as girls, and then you let the machine do its job of learning how to discriminate between the two. I'll give 30 more seconds for people to settle in. I would like to believe it's the popularity of the speakers, but I know it's the topic that is more popular.

OK, so let's start. For folks who just joined: when you train a typical supervised machine learning model, in the simple case of a binary model, you need some data labeled as positive and some data labeled as negative. A simple example is boy versus girl. But in real life, such problems are not very interesting. A real-life problem where you want a machine to learn something is, say, how to ride a bicycle. If you look at it from the paradigm of supervised learning, you would need many examples defining what success at riding a bicycle means. Unlike the boy-versus-girl example, it is very difficult to come up with examples of what "successful" means in riding a bicycle. It is even more difficult to come up with examples of what "unsuccessful" means. Does it mean the person fell down? Does it mean the person arrived late at a destination? You can think of multiple ways in which an unsuccessful or negative class could be formulated in such real-life examples. That is the limitation of a lot of these traditional models.

To overcome that, a new paradigm has come in, called reinforcement learning. A lot of the time, we as humans have also learned things through this paradigm. One example is riding a bicycle. Another example is a kid, and I'm sure we've all seen this: if there is a hot beverage in a cup, nobody has to tell you it is hot. You touch it once, because you don't know, and you feel it is hot. The next time, even if it is tempting, something has registered in your mind: you got a negative reinforcement the first time you touched it, so the next time you are cautious. The third time, even if there is something cold or icy in that same-looking vessel, you will still be cautious, because you remember that negative reinforcement. Gradually you learn as you get reinforcement from the environment. So there is an environment, and you learn from the feedback that it gives you. And magically, and Samiran will demystify that magic, we learn how to operate in that environment, where the feedback is dynamic but there is a goal that has to be met. Yes, yes, absolutely. I'll not steal the thunder from Samiran.
He will talk a lot about how this dynamic nature arises: how the feedback itself is dynamic, how the environment itself is dynamic, how the environment gets affected by the moves you make. In the chess example, when you make a move, the environment, which is the chess game, has also changed, and you have to react to the changed environment, and the environment will react in turn. All of this dynamism is what makes reinforcement learning very successful and very appropriate for these kinds of problems. So with that, Samiran, please.

Hello, everyone. So I have 47 slides to cover in 30 minutes, so I'm going to skip some of them; you could say this talk is itself an example of an agent trying to uncover something in a partially observable environment, which is my case. If you want to understand any point in any of these slides, feel free to message me anytime; I'll give my contact. I'm going to skip over my background. At a high level, I work in reinforcement learning, deep learning and machine learning. I've worked on quite a few domains in reinforcement learning, like robot soccer and the Atari domain, which is basically the old TV video games, if anyone remembers playing them, or your children might play them.

OK, so I'm going to introduce reinforcement learning, go over a little bit of theory, and then tell you some specific applications where it can be used in industry. Finally, I'll end with some of my personal experiences working in this field.

In order to understand reinforcement learning, first you have to understand what challenges there are in artificial intelligence. So, show of hands, how many people recognize this screen? OK, a few people. This is from a very popular game called Dota 2, an online multiplayer real-time strategy game. It has something like a World Cup, called The International, with a prize pool of over $20 million. It is a very exciting domain for AI because you play in a team of five people against five opponents, trying to outwit them and destroy their base. You have to select from some 100 heroes, and each hero has its own pros and cons, different play styles, different items. So it's a complex game.

Why is it an exciting challenge for AI? I'll highlight some of the concepts involved. Dota has a large state and action space: at any point in time, the decisions you have to make are innumerable, and those decisions can lead the game into many different states; one computer cannot possibly process all of these states. Dynamic environment: you're playing against opponents whose strategy is always changing, so you have to adapt in order to beat them. Multi-agent cooperation: you're playing as a team; you have to communicate, coordinate and synergize with your team members if you want to beat the opponent. Partial observability: at any one point in time, you cannot observe the entire game, only a part of it. Mastery of low-level as well as high-level skills: imagine you're playing a game of football. First you need to know how to run, how to walk, how to kick the ball, how to pass the ball to a player. Only then can you combine all of these low-level skills into the high-level game of football.
Temporal credit assignment is a very significant problem in artificial intelligence. Think of playing any kind of game where you play very well in the beginning, taking all the right actions for a long period of time, but in the end you make one critical mistake and you lose. For an AI algorithm trying to process that sequence of actions, it is very hard to figure out which of those actions actually led to losing the game.

So OpenAI, which is a nonprofit organization, built a team of five neural networks, trained using reinforcement learning, to compete against humans in Dota 2. I'll show a small demo from their blog of how the reinforcement learning agents are faring. [Question from the audience: which ones are the bots, the green health bars?] The agents with the green health bars are the humans, and the ones with the red health bars are the bots. I'm pretty sure very few of you caught what happened there. Basically, the bots baited the human team into a location unfavorable to them, and then the entire bot team came from behind and ganged up on the humans. One interesting thing to note: there was one human hiding in the forest, unseen by the bots, but the bots predicted he would be somewhere there, blindly threw a spell, and managed to catch him. So reinforcement learning is a framework that can address all of these AI problems simultaneously.

Now I'll take you through the history of reinforcement learning and what problems it has managed to solve so far. In 1992, a landmark computer program for playing backgammon was released; it used reinforcement learning with neural networks. One cool thing about it was that it learned strategies no human had ever found in backgammon, so it actually advanced the theory of the game. Then, you must be familiar with Andrew Ng: his research work included using reinforcement learning to fly a helicopter and make it do aerial maneuvers. Reinforcement learning was used in targeted marketing, where you not only recommend a product to a customer but also incorporate their feedback and improve the recommendation system based on it. Peter Stone's group in Austin, Texas used RL to train a robot dog to run faster than all the other robot dogs, and to beat the other teams in a game of robot dog soccer. Then RL was used in robotics to make a robot walk on two legs, and to control a robotic arm to swing a ball up and catch it in a cup.

In the early 2010s we saw a lot of applications of reinforcement learning in the medical domain. Drug trials: you have a few drug therapies and a few patients, and you don't know which therapy will actually work, so you need to try each therapy, incorporate the feedback, and optimize which therapies are working and which are not, while minimizing harm to the patients. RL was used for that. RL was also used to treat epilepsy. For those not familiar, epilepsy is a neurological disorder where you have seizures when there is an electrical imbalance in the brain. You have to give a series of electrical stimulations at specific locations, and this therapy is individual to every patient, so you can use reinforcement learning to figure out the optimal pattern of stimulation for each patient. Then around 2010 we saw the advent of deep neural networks.
Reinforcement learning algorithms then started to be combined with deep neural networks, giving rise to a new field called deep reinforcement learning, which has managed to perform some remarkable tasks. In 2013, DeepMind released an algorithm called DQN: a single neural network, with a fixed set of hyperparameters and a fixed architecture, managed to learn to play games just like a human being, by just watching the screen, and it surpassed human-level performance on most of the games it was trained on. Then we had 3D robots in simulated environments trained to do motion tasks like jumping, running, crawling and cycling. And of course you must all be familiar with AlphaGo: DeepMind used deep reinforcement learning to master Go, a very difficult ancient board game, and it plays at a level better than the best human.

Now I'll cover a little bit of the theory of reinforcement learning. We'll start with this game, called Breakout; I'm sure some of you might have played it. You control a paddle and make it go left or right. Your goal is to not let the ball fall, and whenever the ball hits a brick, you get a positive score. So how would you train a machine learning agent to play this game? Basically, you would have a human expert play the game and collect supervised samples: the game screen at each point in time, and the action the human took. Then you would train a machine learning classifier, such as a neural network, to predict what to do in a given situation. That is the supervised machine learning approach.

There are some problems with this approach. The main one is that the machine learning agent can never be better than the human, because by definition it is trained on the human's play. Another is that a human can finish a game in five minutes, but a computer can simulate millions of games in the same period of time, and in supervised learning we are entirely dependent on our training data. What would be the reinforcement learning way to train this agent? You tell the agent: I'll give you a positive reward whenever you break a brick, and a negative reward whenever you die. Then you tell it: play a million games and figure out, through trial and error, which actions actually lead to better rewards.

Now I'll briefly introduce the reinforcement learning setting. This is one slide you'll keep seeing in reinforcement learning talks: we have an agent which is dynamically interacting with an environment. By an agent, I mean an entity with agency, the capacity to show intelligent behavior; it interacts with the environment and also influences it. Every time it interacts with the environment, the environment gives it feedback on how it is performing, and the goal of the agent is to take those actions that maximize its long-term reward.
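To make this agent-environment loop concrete, here is a minimal sketch using the classic OpenAI Gym-style interface; the environment name, the random placeholder policy, and the four-value step API are illustrative assumptions, not what the speaker's own systems used.

```python
import gym

# Minimal agent-environment interaction loop (classic Gym-style API).
# "Breakout-v0" and the random policy are illustrative placeholders.
env = gym.make("Breakout-v0")

for episode in range(3):
    state = env.reset()                 # environment hands back the initial state
    done, total_reward = False, 0.0
    while not done:
        action = env.action_space.sample()            # placeholder: act randomly
        state, reward, done, info = env.step(action)  # environment reacts, gives feedback
        total_reward += reward
    print(f"episode {episode}: return = {total_reward}")
```

A real agent replaces the random sampling with a policy that is improved from the observed rewards.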
Formally, reinforcement learning is modeled as something called a Markov decision process, a mathematical framework for sequential decision making. It consists of: a set of states, which is what the agent observes from the environment (think of a robot sensing its surroundings at each point in time with sensors or a mounted camera); a set of actions the agent can take in the environment; a reward function, which tells how well the agent is doing; and a transition probability function, which models the underlying stochasticity of the environment, because taking an action does not guarantee you end up in one particular state. For example, imagine you jump a traffic light: with some probability nothing happens and you get home safely, but with some probability the traffic police catch you. It is to model this kind of stochastic process that we use a transition probability function. Finally, a discount factor tells us how much weight to give immediate rewards versus long-term rewards. If you are building a reinforcement learning agent for something like trading on the stock market, you would want the agent to look at long-term rewards.

Some quick examples of reinforcement learning environments. Imagine a robot navigating a maze on a grid: it needs to get to the diamond and avoid the fire trap. The state can be either the raw position of the robot or whatever the robot observes at that point in time; the actions are up, down, left, right; the reward is plus one if it reaches the diamond and minus one if it falls into the fire trap. Another example, again hard to see on the slide, is a game of 2v2 soccer in a simulated environment. What could the states, actions and rewards be for an agent trying to play soccer? The states could be hand-crafted features: the positions of all the other players on the field, the angles to them, their velocities. The actions could be: pass to this player, go to this position, shoot towards the goal. The reward is plus one if you score a goal, and nothing otherwise.

Once we have formulated the problem as a Markov decision process, what do we do? Our goal is to find a policy. A policy is a mapping from states to actions, basically a prescription that says: if you are in this state, do this. The goal of reinforcement learning is to maximize the long-term discounted future reward: at any time step t, you want to maximize all of your future rewards. For this, we compute something called the action value function, which says: if you are in this state, take this action, and then follow this policy, what is the expected long-term reward you will get? You can think of it as a utility function telling you how good it is to take this action in this state. The final goal is an optimal policy, one that cannot be improved upon. There are many reinforcement learning algorithms that will compute this optimal policy for you; I'm not going to cover them, but some will be covered in the next tutorial session. (The standard forms of these quantities are written out just below.)
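The equations on the slides were not captured in the transcript, so as a reconstruction from the standard textbook definitions, the long-term discounted return, the action value function, and the optimal policy just described are usually written as:

```latex
% Long-term discounted future reward (return) from time step t, with discount 0 <= \gamma <= 1:
G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k \, r_{t+k+1}

% Action value function: expected return from taking action a in state s and
% then following policy \pi:
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[\, G_t \mid s_t = s,\; a_t = a \,\right]

% Optimal policy: act greedily with respect to the optimal action values:
\pi^{*}(s) = \arg\max_{a} Q^{*}(s, a)
```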
Now, the motivation behind deep reinforcement learning. After 2010, we started combining deep neural networks with reinforcement learning. Traditionally, reinforcement learning algorithms needed an engineer to hand-code all the features they could use, but in the real world they face the problem of huge state spaces. Imagine a robot with a mounted camera: the image it sees can be any of an enormous number of pixel combinations, so it is very hard for a person to manually design features for it. And remember that in reinforcement learning we are calculating the action value function, which we would have to compute and store for every state; that becomes intractable when the states are innumerable or the state space is continuous. So we need some way to generalize: if we encounter a new state that resembles something we have seen before, can we use that knowledge to act in the new state? Deep neural networks automatically learn efficient feature representations for the task they perform. So the core idea of deep reinforcement learning is that the agent not only learns how to act in the environment, but also learns the most important feature representations that allow it to act; we are coupling the two problems, and the agent now processes the state through a deep neural network.

At a high level, how do you train deep reinforcement learning models? Imagine a neural network whose input is the state and whose outputs correspond to each of the agent's actions; what the network predicts are the action values. First the agent acts randomly, accumulating some experience in the environment, and then it uses this experience to update its Q estimates so that it can act more optimally in the future. (A minimal sketch of such a network appears below, after the applications.)

Some applications of deep reinforcement learning and reinforcement learning in general. The obvious one is computer games. It has a lot of applications in robotics: a robot has enough degrees of freedom to perform any task you want, but we lack the intelligence to embed in the robot to perform that task, and that is where reinforcement learning comes in. Dialogue systems: dialogue systems built on standard machine learning models tend to give generic responses like "I don't know" or "see you later"; we can combine that approach with reinforcement learning, giving a negative reward to these generic replies and a positive reward to replies that encourage the user to interact more and extend the conversation. Optimization problems: you want to control an elevator system so that people wait the minimum amount of time; similarly, you want to optimize traffic at an intersection, or allocate a portion of your bandwidth to protect your network against attacks. It has many applications in energy systems: Google uses reinforcement learning to optimize the cooling in its data centers, dynamically finding out where more cooling is required and allocating more resources there. And trading systems: some of this work has been done, and now companies like pit.ai are coming up that use deep reinforcement learning systems to make decisions in the stock market instead of humans, because it involves processing an amount of data that is very difficult for a human being.
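As a rough illustration of the Q-network idea above (state in, one predicted action value per action out, updated from experience), here is a minimal PyTorch-style sketch. The layer sizes and the one-step Q-learning update are standard textbook choices, not the speaker's code; a real DQN additionally uses a replay buffer and a separate target network for stability.

```python
import torch
import torch.nn as nn

# Minimal Q-network: the input is the state, and there is one output per
# action; each output is the predicted action value Q(s, a).
class QNetwork(nn.Module):
    def __init__(self, state_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

q_net = QNetwork(state_dim=4, n_actions=2)   # sizes are illustrative
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99                                 # discount factor

def update(state, action, reward, next_state, done):
    """One-step Q-learning update from a single piece of experience."""
    q_pred = q_net(state)[action]            # current estimate of Q(s, a)
    with torch.no_grad():                    # bootstrapped target
        target = reward + gamma * q_net(next_state).max() * (1.0 - done)
    loss = (q_pred - target) ** 2            # squared temporal-difference error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```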
I'll give a small demo of a trading agent that was developed in our company. The agent learns how to trade by practicing on five years of data, and then we tested it on one year of data; it uses a very simple algorithm but performs very well. The red parts are where the agent made the wrong decision and the green parts are where it made the right decision. This is one stock, Microsoft, that I have plotted, and the agent is making these decisions in real time as I plot them. We used an algorithm called Q-learning, which we will cover in the next session, with an LSTM as the function approximator.

Now I'll briefly introduce multi-armed bandits. [Answering a question:] At any point in time, the input is only one state, which describes what the agent is sensing from the environment. No, not multiple states, just one, but that one state can be an n-dimensional feature vector, so it can carry multiple inputs.

Imagine you are in a casino with n slot-machine arms in front of you, and whenever you pull an arm you get some money; call that money the reward. These rewards come from probability distributions, but the problem is that you don't know those distributions; you have to figure them out by pulling the arms. This is known as the multi-armed bandit problem, and it is used extensively in reinforcement learning. What we are trying to solve is the exploration-versus-exploitation dilemma. Imagine you have two arms and both have been pulled a thousand times, but one arm gives you an average reward of 100 and the other an average reward of 1000: should you really allocate more pulls to the arm with the lower reward? Another case: suppose both arms are giving the same average reward, but you have pulled one arm only 10 or 100 times and the other 10,000 times. You are extremely sure about one arm; should you really allocate some pulls to explore the other? There is a principled way to resolve this. In this setting, the agent again interacts with an environment: the environment returns a state to the agent, called the context; the agent takes an action, which is to pull one of the arms; and it then receives a reward from that arm. The quantity we look at is the expected regret: the expected difference between the reward you would have collected by always choosing the optimal arm and the reward from the choices you actually made. Lai and Robbins analyzed this problem asymptotically in 1985 and showed that, under some assumptions, you have to make at least on the order of log T suboptimal pulls to identify the best arm. There are many variants of multi-armed bandits that you can select for a given problem. (A small sketch of a simple bandit strategy follows at the end of this part.)

Multi-armed bandits are used very extensively in recommendation systems. Netflix uses them to select what kind of artwork to show you. Towards the left of the slide is the history of what kind of TV shows or movies a user likes. Take the first row: from this user's history we can infer an interest in romantic TV shows and movies, so if you want to suggest an artwork for Good Will Hunting, it is better to show one with a romantic scene. Similarly, from the second user's history we know he likes comedy, so it is better to show him a photo of Robin Williams, which makes him more likely to click on that particular title.
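To make the exploration-versus-exploitation trade-off concrete, here is a minimal epsilon-greedy bandit sketch. The arm probabilities are invented purely for the demo, and epsilon-greedy is only the simplest strategy; UCB-style algorithms, which follow the same skeleton, are what attain the Lai-Robbins log T behavior.

```python
import random

# Minimal epsilon-greedy multi-armed bandit. The true arm means are hidden
# from the agent; they are made up here purely so the demo can run.
true_means = [0.3, 0.5, 0.7]
n_arms = len(true_means)
counts = [0] * n_arms        # number of pulls per arm
values = [0.0] * n_arms      # running average reward per arm
epsilon = 0.1                # exploration rate

def pull(arm: int) -> float:
    """Bernoulli reward with the arm's hidden success probability."""
    return 1.0 if random.random() < true_means[arm] else 0.0

collected, best_possible = 0.0, 0.0
for t in range(10_000):
    if random.random() < epsilon:
        arm = random.randrange(n_arms)                     # explore
    else:
        arm = max(range(n_arms), key=lambda a: values[a])  # exploit
    reward = pull(arm)
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]    # incremental mean
    collected += reward
    best_possible += max(true_means)                       # always-optimal baseline

print("empirical regret ≈", best_possible - collected)
```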
More use cases for recommendation engines: imagine you have a product, a movie, or some kind of ad that you want to recommend to users, and from this pool you don't know which one will work and which won't. You want to minimize the regret, that is, make as few suboptimal recommendations as possible, and you can use multi-armed bandits to optimize this entire process. A similar thing can be done for layout and web optimization: you are experimenting with several layouts and need to arrive at the one most relevant for the user, and you can use multi-armed bandits to optimize that. Other real-world scenarios: intelligent tutoring, where you have to prepare for some entrance exam by studying a fixed number of topics, each divided into subtopics; we can model each subtopic as a bandit arm and dynamically figure out where to put your focus. Hyperparameter optimization: usually we use something like grid search, but using multi-armed bandits we can quickly stop evaluating the less promising hyperparameters (a small sketch of this idea follows after the talk). And clinical trials, which I covered, plus inventory management and portfolio management.

OK. Some of my experiences of working in this field. There is a huge engineering aspect to RL: even once you have the theory down, engineering RL algorithms is hard. Combining RL with deep learning is tough because the distribution you are training on is always changing. Suppose you enter a room with the kitchen to your left and the dining room to your right, and you initially take the action of exploring the kitchen: your neural network will only see samples from the kitchen. That makes the network unstable and slow to train, and more likely to converge to a local minimum. Fortunately, there has been a lot of research, and there are a lot of tricks for training deep neural networks with reinforcement learning. Simulation environments: imagine training a self-driving car using reinforcement learning; you can't really give a negative reward when a real car crashes, so it is better to train the algorithm in a simulated environment and, once the agent is doing well enough, transfer that knowledge to the real world. Deep RL suffers from the same interpretability problem as deep learning: since it uses deep learning, it is a black box, so you know it is working well, but you don't know why. And reward shaping is very important: how you set the reward determines what policy the agent is going to learn, so it has to be done very carefully. One example: I had the task of training 3D robots in a simulated environment to keep the ball away from two other robots, and I set the reward to be the amount of time they kept the ball away. I thought they would pass the ball among themselves and navigate away from the two robots, but the final policy they learned was to go and shove the other two robots so that they couldn't get up. So what you think in theory doesn't always apply in practice. I'd like to end this talk with a small video of robots trying to play soccer. Thank you, everyone.
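Picking up the hyperparameter-optimization point from above: here is a minimal successive-halving sketch, where each configuration is treated like a bandit arm, survivors get more budget each round, and the worse half is dropped early. The scoring function is a made-up stand-in for a real training-and-validation run.

```python
import random

# Minimal successive halving for hyperparameter search: evaluate all
# candidates on a small budget, keep the better half, double the budget.
def evaluate(config: float, budget: int) -> float:
    """Stand-in for training with `budget` steps and returning a validation
    score; here, a noisy made-up function where 0.3 is the best setting."""
    noise = random.gauss(0.0, 1.0 / budget)   # more budget, less noisy estimate
    return -(config - 0.3) ** 2 + noise

configs = [random.random() for _ in range(16)]   # 16 candidate settings
budget = 1
while len(configs) > 1:
    scored = sorted(configs, key=lambda c: evaluate(c, budget), reverse=True)
    configs = scored[: len(scored) // 2]         # drop the worse half early
    budget *= 2                                  # survivors get more budget

print("selected config:", configs[0])
```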
[Audience Q&A] On transfer learning: there has been work done there. The deep Q-network example I gave used one network with fixed hyperparameters, but it was trained separately for each game. There have been advances in multitask learning with neural networks, where one network is trained simultaneously on multiple environments, so there are applications of transfer learning in reinforcement learning, but they are limited by what deep learning can do. The goal in that case is for the network to learn general features that apply across games; the state space is the same, and the rewards are brought into the same scale, so they are not different for different games.

On choosing the discount factor: the discount factor says how much weight you give to immediate rewards, and that varies from domain to domain; it takes domain knowledge. A small example: if the episodes of the game end, you can set almost any discount factor and the RL algorithms will converge, though the policy learned will be slightly affected. But imagine an environment which never ends, like stock prices. If you set the discount factor to one and maximize the long-term cumulative reward, that reward becomes infinite because the process continues forever. So you need to use knowledge of the environment (a short worked bound follows at the end).

On modeling the environment: that depends on the simulator and on the environment you are working with. You don't model it yourself; the simulator you are using models it, and you try to learn from it rather than model it.

On partial observability: POMDPs can be used. In a partially observable environment, instead of receiving one state at each point in time, you have to maintain a probability distribution over states, and the framework for processing that efficiently is a POMDP. So why don't we use POMDPs instead of regular MDPs? We can, but they are harder to train and have weaker theoretical guarantees; if you just make the Markovian assumption, it often works better in practice and gets better results.

On bandits versus a static model: the point of using multi-armed bandits is that a static machine learning model can come up with, say, ten recommendations to show you, but you still need to figure out which of those predictions will actually work in real life, and that is what the bandit framework does. Yes, that can be done in other ways; this is just a framework that does it in the most efficient way possible. And you have to model your business use case as a Markov decision process, so I don't know if that can be done in all use cases, but usually, where you are trying to optimize something in real time using some kind of feedback, it can be done.
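To spell out the discount-factor answer with a short worked bound: with bounded per-step rewards, the discounted return stays finite exactly when the discount factor is below one, which is why a never-ending environment like the stock market cannot use a discount factor of one.

```latex
% If every reward satisfies |r_t| \le R_{\max} and 0 \le \gamma < 1,
% the geometric series bounds the return:
\left| \sum_{k=0}^{\infty} \gamma^k \, r_{t+k+1} \right|
  \;\le\; R_{\max} \sum_{k=0}^{\infty} \gamma^k
  \;=\; \frac{R_{\max}}{1 - \gamma}

% With \gamma = 1 in a continuing task, \sum_k r_{t+k+1} can grow without bound.
```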